[C88] Continual Test-Time Fine-Tuning of Frame-Based Style Transfer Network for Video Stream Data

Abstract

Neural network-based style transfer has long attracted research interest and remains an active area today. While image style transfer (IST) methods have achieved impressive advances in arbitrary style transfer under low per-frame latency constraints, video style transfer (VST) continues to struggle with the additional challenge of maintaining temporal consistency between frames. Despite recent rapid progress, existing VST methods still face critical limitations. Some approaches rely on inefficient components, such as optical flow, to capture temporal consistency, which undermines the goal of fast style transfer. In contrast, frame-wise stylization methods that use no inter-frame information meet fast-transfer requirements but struggle to mitigate flickering artifacts. To address these challenges, we propose a novel Continual Test-Time Fine-Tuning (C-TTFT) framework that dynamically adapts a pre-trained frame-based style transfer network during inference. Our method leverages lightweight inter-frame information by using the previously stylized output as guidance for the current frame, optimizing simple loss objectives in real time. Furthermore, we incorporate a parameter-efficient adaptation technique, Low-Rank Adaptation (LoRA), to reduce memory and computational overhead. We validate C-TTFT on MicroAST, a state-of-the-art frame-based style transfer backbone, and demonstrate that our method reduces temporal consistency loss by up to 17% and improves SSIM, CFSD, and ArtFID by up to 22%, 33%, and 8%, respectively, all while preserving real-time performance (24 FPS) with minimal added cost (10% more parameters, +48K; 4.9% more memory usage, +93KB).
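To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the two ingredients the abstract combines: a LoRA-style low-rank update B @ A on top of a frozen weight W, adapted at test time by a few gradient steps that pull the current output toward the previous stylized output (a toy stand-in for the temporal-consistency objective). All dimensions, the learning rate, and the linear model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                      # feature dimension and LoRA rank (r << d)
W = rng.standard_normal((d, d))  # frozen pre-trained weight (not updated)
A = rng.standard_normal((r, d))  # low-rank factor A (fixed here for simplicity)
B = np.zeros((d, r))             # B starts at zero, so adaptation begins from W

x = rng.standard_normal(d)       # current frame's feature (toy stand-in)
y_prev = rng.standard_normal(d)  # previous stylized output used as guidance


def forward(B):
    # LoRA-adapted linear layer: only the low-rank term B @ A changes
    return (W + B @ A) @ x


lr = 0.01
loss_before = np.sum((forward(B) - y_prev) ** 2)
for _ in range(100):             # a short test-time fine-tuning loop
    err = forward(B) - y_prev            # residual vs. previous stylized output
    grad_B = 2.0 * np.outer(err, A @ x)  # closed-form d(loss)/dB for this model
    B -= lr * grad_B                     # update only the low-rank factor B
loss_after = np.sum((forward(B) - y_prev) ** 2)
```

The parameter-efficiency argument is visible in the shapes: only B (and optionally A) is trained, d*r + r*d = 32 values versus the d*d = 64 values of the frozen W, and the gap widens rapidly at realistic layer sizes.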

Publication
Visual Communications and Image Processing Conference (VCIP) 2025
Un Ki Park (박운기)
SSIT, PhD student