ECCV 2026 · Project Page

LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing

Xinyu Wang¹, Chongbo Zhao¹, Fangneng Zhan², Yue Ma²,†
¹THU, ²HKUST
Streaming Video Editing · Causal DiT · 4-Step Distillation · 12.66 FPS
12.66 FPS: real-time streaming inference
4 Steps: distilled diffusion generation
Causal DiT: chunk-wise frame processing
Mask Cache: AR-oriented token reuse

Abstract

Streaming video editing has made rapid progress, yet practical deployment is still limited by two core issues: maintaining stable backgrounds and non-edited regions over time, and achieving the low latency required for real-time interactive scenarios. Recent streaming video generation methods are mostly developed for synthesis and cannot be directly applied to editing due to strict preservation requirements and region-specific control. We present LiveEdit, a streaming video editing framework that performs causal, frame-by-frame editing with strong content preservation and real-time responsiveness. LiveEdit uses a three-stage distillation pipeline to transfer editing capability from a powerful bidirectional foundation model to an efficient unidirectional streaming editor. To further support real-time deployment, an AR-oriented mask cache reuses region-related computation across frames, reducing redundant processing while preserving visual quality. Extensive evaluations show state-of-the-art quality among streaming baselines and 12.66 FPS inference speed.

Method

Figure: LiveEdit framework. Three-stage distillation pipeline for streaming video editing.
Stage 1: Foundation Tuning
  • Equips a bidirectional diffusion transformer with robust editing ability.
  • Uses full temporal attention to learn high-fidelity editing priors.
Stage 2: Causal Adaptation
  • Transitions from bidirectional processing to chunk-wise causal attention.
  • Uses teacher forcing to preserve editing quality under streaming constraints.
Stage 3: DMD Distillation + Cache
  • Compresses generation to 4 diffusion steps for low latency.
  • Reuses masked background tokens to avoid redundant computation.
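The chunk-wise causal attention in Stage 2 can be pictured as a block-causal mask: tokens attend to every token in their own chunk and in earlier chunks, but never to later chunks. The sketch below is only an illustration under stated assumptions. The function name `chunk_causal_mask` and the parameters `tokens_per_frame` and `frames_per_chunk` are hypothetical, and the page does not specify the chunk size or the exact masking rule LiveEdit uses.

```python
from typing import List


def chunk_causal_mask(
    num_frames: int, tokens_per_frame: int, frames_per_chunk: int
) -> List[List[bool]]:
    """Build a block-causal attention mask (illustrative, not the authors' code).

    Entry [q][k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same chunk as q or to an earlier chunk.
    """
    if min(num_frames, tokens_per_frame, frames_per_chunk) <= 0:
        raise ValueError("all sizes must be positive")

    total_tokens = num_frames * tokens_per_frame
    tokens_per_chunk = tokens_per_frame * frames_per_chunk

    # Chunk index of every token, in temporal order.
    chunk_of = [i // tokens_per_chunk for i in range(total_tokens)]

    return [
        [chunk_of[k] <= chunk_of[q] for k in range(total_tokens)]
        for q in range(total_tokens)
    ]


if __name__ == "__main__":
    # 4 frames, 2 tokens per frame, 2 frames per chunk: two chunks of 4 tokens.
    mask = chunk_causal_mask(num_frames=4, tokens_per_frame=2, frames_per_chunk=2)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

Within a chunk the mask is fully bidirectional, which is what distinguishes chunk-wise causality from strict token-level causality. A mask of this shape can be converted to a tensor and passed to a standard attention implementation.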

Comparison

Cat fur edit: baseline comparison

Instruction: Change the cat’s fur from dark and light patches to solid snowy white with blue eyes.
Panels: Source, InsV2V, LucyEdit, VideoCoF, StreamDiffusion, StreamV2V, Ours

Forest lighting edit: baseline comparison

Instruction: Replace the dappled sunlight with beams of pale lavender light piercing through the canopy.
Panels: Source, InsV2V, LucyEdit, VideoCoF, StreamDiffusion, StreamV2V, Ours

Ablation

Figure: LiveEdit mask cache layer. AR-oriented mask cache for selective token reuse.
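The idea of selective token reuse can be sketched with a toy cache: tokens inside the edit mask are recomputed for each incoming chunk, while tokens outside it reuse the result stored from the previous chunk. This is a simplified illustration, not the LiveEdit implementation. The class `MaskTokenCache`, the `compute_fn` callback, and the per-token granularity are all assumptions; the page does not say which tensors the real cache stores or when it is refreshed.

```python
from typing import Callable, List, Optional, Sequence


class MaskTokenCache:
    """Toy per-token cache keyed by position (hypothetical, for illustration).

    Assumption: the output for a background token can be reused across
    chunks, so only tokens flagged by the edit mask need fresh computation.
    """

    def __init__(self, compute_fn: Callable[[float], float]) -> None:
        self.compute_fn = compute_fn
        self.cached: Optional[List[float]] = None
        self.num_computed = 0  # counts calls to compute_fn, to show the savings

    def step(self, tokens: Sequence[float], edit_mask: Sequence[bool]) -> List[float]:
        if len(tokens) != len(edit_mask):
            raise ValueError("tokens and edit_mask must have the same length")

        # Recompute everything on the first chunk or if the layout changed.
        cache_valid = self.cached is not None and len(self.cached) == len(tokens)

        outputs: List[float] = []
        for i, (token, is_edited) in enumerate(zip(tokens, edit_mask)):
            if is_edited or not cache_valid:
                outputs.append(self.compute_fn(token))
                self.num_computed += 1
            else:
                outputs.append(self.cached[i])  # reuse background result

        self.cached = outputs
        return outputs


if __name__ == "__main__":
    cache = MaskTokenCache(compute_fn=lambda x: x * 2)
    mask = [False, True, True, False]  # True marks tokens in the edited region
    print(cache.step([1, 2, 3, 4], mask))      # first chunk: all tokens computed
    print(cache.step([10, 20, 30, 40], mask))  # later chunk: only edited tokens computed
    print(cache.num_computed)
```

In this toy version the saving is proportional to the fraction of tokens outside the edit mask. A real system would also need a policy for invalidating cached entries when the background changes, which this sketch deliberately leaves out.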

Citation

@inproceedings{wang2026liveedit,
  title={LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing},
  author={Wang, Xinyu and Zhao, Chongbo and Zhan, Fangneng and Ma, Yue},
  booktitle={European Conference on Computer Vision},
  year={2026},
  organization={Springer}
}