LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing

12.66 FPS: real-time streaming inference

4 Steps: distilled diffusion generation

Causal DiT: chunk-wise frame processing

Mask Cache: AR-oriented token reuse

Abstract

Streaming video editing has made rapid progress, yet practical deployment is still limited by two core issues: maintaining stable backgrounds and non-edited regions over time, and achieving the low latency required for real-time interactive scenarios. Recent streaming video generation methods are mostly developed for synthesis and cannot be directly applied to editing due to strict preservation requirements and region-specific control. We present LiveEdit, a streaming video editing framework that performs causal, frame-by-frame editing with strong content preservation and real-time responsiveness. LiveEdit uses a three-stage distillation pipeline to transfer editing capability from a powerful bidirectional foundation model to an efficient unidirectional streaming editor. To further support real-time deployment, an AR-oriented mask cache reuses region-related computation across frames, reducing redundant processing while preserving visual quality. Extensive evaluations show state-of-the-art quality among streaming baselines and 12.66 FPS inference speed.

Method

Three-stage distillation pipeline for streaming video editing.

Stage 1: Foundation Tuning

Equips a bidirectional diffusion transformer with robust editing ability.
Uses full temporal attention to learn high-fidelity editing priors.

Stage 2: Causal Adaptation

Transitions from bidirectional processing to chunk-wise causal attention.
Uses teacher forcing to preserve editing quality under streaming constraints.
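
The chunk-wise causal attention pattern described above can be sketched as a boolean mask. This is a minimal illustration, not the authors' implementation: the function name, the parameters (`num_frames`, `tokens_per_frame`, `chunk_size`), and the choice to allow full attention inside a chunk are assumptions made for the example.

```python
from typing import List


def chunk_causal_mask(num_frames: int, tokens_per_frame: int, chunk_size: int) -> List[List[bool]]:
    """Build a mask where mask[q][k] is True if query token q may attend to key token k.

    Hypothetical helper for illustration; the layout (frame-major tokens,
    fixed-size chunks) is an assumption, not taken from the paper.
    """
    num_tokens = num_frames * tokens_per_frame
    # Map every token to its frame, then every frame to its chunk.
    chunk_of = [(t // tokens_per_frame) // chunk_size for t in range(num_tokens)]
    # A query sees keys in its own chunk (bidirectional inside the chunk)
    # and in all earlier chunks, never in later ones.
    return [[chunk_of[k] <= chunk_of[q] for k in range(num_tokens)] for q in range(num_tokens)]


# 4 frames, 2 tokens per frame, 2 frames per chunk: 8 tokens in 2 chunks.
mask = chunk_causal_mask(num_frames=4, tokens_per_frame=2, chunk_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In this toy setting, tokens of the first chunk attend only to the first four tokens, while tokens of the second chunk attend to all eight. A bidirectional model would correspond to an all-True mask, which is what the adaptation stage moves away from.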

Stage 3: DMD Distillation + Cache

Compresses generation to 4 diffusion steps for low latency.
Reuses masked background tokens to avoid redundant computation.
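
The token reuse idea can be illustrated with a toy sketch: tokens inside the edit region are recomputed, and tokens outside it are served from a cache. Everything here is assumed for illustration (the function `step_with_mask_cache`, the per-token `compute` callable, and caching final outputs rather than intermediate states); the page does not specify which quantities the actual cache stores.

```python
from typing import Callable, List, Optional, Tuple


def step_with_mask_cache(
    tokens: List[float],
    edit_mask: List[bool],
    cache: Optional[List[float]],
    compute: Callable[[float], float],
) -> Tuple[List[float], int]:
    """Return (outputs, number_of_recomputed_tokens).

    Hypothetical sketch: edit_mask[i] is True when token i lies in the edited
    region. Background tokens reuse the cached output from the previous step.
    """
    if cache is None:
        # First chunk: nothing to reuse, so every token is computed.
        return [compute(t) for t in tokens], len(tokens)

    outputs: List[float] = []
    recomputed = 0
    for token, edited, cached in zip(tokens, edit_mask, cache):
        if edited:
            outputs.append(compute(token))  # edited region: recompute
            recomputed += 1
        else:
            outputs.append(cached)  # background: reuse cached result
    return outputs, recomputed


def expensive(x: float) -> float:
    """Stand-in for an expensive per-token network evaluation."""
    return x * 2


edit_mask = [False, True, True, False]
first_out, first_cost = step_with_mask_cache([1, 2, 3, 4], edit_mask, None, expensive)
second_out, second_cost = step_with_mask_cache([10, 20, 30, 40], edit_mask, first_out, expensive)
print(first_cost, second_cost)
```

With half of the tokens marked as background, the second call evaluates `expensive` on two tokens instead of four. The saving in a real model would depend on how large the edited region is and on how stale cached background tokens are allowed to become.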

Results Gallery

Representative real-time editing cases with source video, target video, and editing instruction.

Comparison

Cat fur edit: baseline comparison

Instruction: Change the cat’s fur from dark and light patches to solid snowy white with blue eyes.

Panels: Source, InsV2V, LucyEdit, VideoCoF, StreamDiffusion, StreamV2V, Ours

Forest lighting edit: baseline comparison

Instruction: Replace the dappled sunlight with beams of pale lavender light piercing through the canopy.

Panels: Source, InsV2V, LucyEdit, VideoCoF, StreamDiffusion, StreamV2V, Ours

Ablation

AR-oriented mask cache for selective token reuse.

Citation

@inproceedings{wang2026liveedit,
  title={LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing},
  author={Wang, Xinyu and Zhao, Chongbo and Zhan, Fangneng and Ma, Yue},
  booktitle={European Conference on Computer Vision},
  year={2026},
  organization={Springer}
}