Coming soon — under review

OmniForcing: Unleashing Real-time
Joint Audio-Visual Generation

1JD Explore Academy, 2Fudan University, 3Peking University, 4The University of Hong Kong
*Equal Contribution

▶ Real-time streaming demos — click to play with audio

"A man standing by the ocean with birds flying overhead, the sound of waves crashing against the shore blends with distant seagull calls and a gentle breeze."
"A grey tabby cat sitting on a patterned rug, suddenly letting out a precisely timed meow while gazing curiously at the camera."
"An elderly woman sitting at a sewing machine, talking about making a special quilt for her granddaughter, with the rhythmic sound of the machine blending with her warm narration."

Abstract

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability due to the extreme temporal asymmetry between modalities and the resulting token sparsity.

OmniForcing teaser figure

We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts.

Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher.




OmniForcing Method Overview

OmniForcing three-stage distillation pipeline

Figure 2: The three-stage OmniForcing distillation pipeline. Stage I employs Distribution Matching Distillation (DMD) for few-step denoising. Stage II uses Causal ODE Regression with our Audio Sink Token stabilizer. Stage III implements Joint Self-Forcing training to mitigate exposure bias.

OmniForcing restructures a pretrained bidirectional dual-stream transformer (LTX-2, 14B video + 5B audio) into a block-causal autoregressive framework through a carefully designed three-stage distillation pipeline. To bridge the extreme frequency asymmetry (3 FPS video vs. 25 FPS audio), we establish a physical-time-based Macro-block Alignment at one-second boundaries with a globally visible Global Prefix anchor. An Audio Sink Token mechanism with Identity RoPE prevents the Softmax collapse and gradient explosions caused by sparse causal audio attention. Finally, Joint Self-Forcing Distillation forces both modality streams to dynamically adapt to each other's prediction drifts, maintaining strict cross-modal synchrony during long rollouts.
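The macro-block alignment and global-visibility rules above can be pictured as a boolean attention mask. The following is a minimal sketch, not the released implementation: token counts are scaled down for readability (the real model uses 3 × 384 video tokens and 25 audio tokens per one-second block), and all names are ours.

```python
import numpy as np

# Toy sizes standing in for the real per-second token counts.
SECONDS = 3
VID_PER_BLOCK = 12   # stand-in for 3 frames x 384 video tokens per second
AUD_PER_BLOCK = 5    # stand-in for 25 audio latents per second
N_PREFIX = 2         # Global Prefix tokens (globally visible)
N_SINK = 1           # Audio Sink token (globally visible)

def block_ids():
    """Macro-block index per token; -1 marks globally visible tokens."""
    ids = [-1] * (N_PREFIX + N_SINK)
    for s in range(SECONDS):
        ids += [s] * VID_PER_BLOCK   # video tokens of second s
        ids += [s] * AUD_PER_BLOCK   # audio tokens of second s
    return np.array(ids)

ids = block_ids()
q, k = ids[:, None], ids[None, :]
# A query may attend to any key in the same or an earlier macro-block,
# and every query may additionally attend to the global prefix/sink keys.
mask = (k == -1) | (k <= q)
```

Note that within a macro-block the two modalities attend to each other bidirectionally; across blocks, attention is strictly causal at one-second granularity.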

Asymmetric Block-Causal Masking

Figure 3: Asymmetric Block-Causal Masking. Modalities are synchronized via 1s macro-blocks. Each audio block contains 25 latent frames (one token each), whereas each video block contains 3 × 384 tokens. The Global Prefix (orange) and Audio Sink tokens (red) remain globally visible.
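Our reading of the Identity RoPE constraint is that the Audio Sink token is exempted from rotary position encoding, i.e. always encoded at position 0, where the rotation reduces to the identity, so its key never drifts as the stream grows. A sketch with a standard rotary-embedding implementation (the function, dimension, and constant are our assumptions, not the paper's):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Standard rotary position embedding on an even-dimensional vector."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(angles) - x2 * np.sin(angles),
         x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

# At position 0 every angle is zero, so the rotation is the identity:
# pinning the Audio Sink key there keeps it constant for all queries.
sink_key = np.random.default_rng(0).normal(size=16)
```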



Side-by-Side with Bidirectional Teacher

OmniForcing achieves a ~35× reduction in end-to-end generation time while maintaining perceptual fidelity on par with the LTX-2 teacher.

OmniForcing (Ours) — ~25 FPS, TTFC ~0.7s

LTX-2 (Teacher) — Offline, TTFC ~197s

Distillation Fidelity (VBench)

Model        Aesthetic Quality ↑  Imaging Quality ↑  Motion Smoothness ↑  Subject Consistency ↑  Temporal Flickering ↑  TTFC ↓   FPS ↑
LTX-2        0.569                0.574              0.993                0.945                  0.988                  197.0s   —
OmniForcing  0.595                0.594              0.995                0.955                  0.989                  0.7s     25

Main Results on JavisBench

Model        Size  | AV-Quality   | Text-Consistency              | AV-Consistency     | AV-Synchrony         | Runtime↓
                   | FVD↓   FAD↓  | TV-IB↑ TA-IB↑  CLIP↑  CLAP↑  | AV-IB↑  AVHScore↑  | JavisScore↑  DeSync↓ |
MMAudio      0.1B  | —      6.1   | —      0.160   —      0.407  | 0.198   0.182      | 0.150        0.849   | 15s
JavisDiT++   2.1B  | 141.5  5.5   | 0.282  0.164   0.316  0.424  | 0.198   0.184      | 0.159        0.832   | 10s
UniVerse-1   6.4B  | 194.2  8.7   | 0.272  0.111   0.309  0.245  | 0.104   0.098      | 0.077        0.929   | 13s
LTX-2        19B   | 125.4  4.6   | 0.290  0.173   0.318  0.442  | 0.318   0.298      | 0.253        0.384   | 197s
OmniForcing  19B   | 137.2  5.7   | 0.287  0.162   0.322  0.401  | 0.269   0.254      | 0.208        0.392   | 5.7s

— indicates a metric not reported for that model.


Conclusion

We presented OmniForcing, the first framework to distill a bidirectional joint audio-visual diffusion model into a real-time streaming autoregressive generator. To overcome the temporal asymmetry and gradient instability inherent in causal multi-modal distillation, we introduced Asymmetric Block-Causal Alignment and Audio Sink Tokens with Identity RoPE, coupled with Joint Self-Forcing Distillation and a modality-independent rolling KV-cache. OmniForcing achieves ~25 FPS streaming on a single GPU. We hope this work opens up new possibilities for deploying multi-modal foundation models in interactive and latency-sensitive scenarios.
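The modality-independent rolling KV-cache mentioned above can be pictured as one bounded window per stream, with the globally visible entries pinned. The following is a toy sketch under our own assumptions: window sizes are hypothetical, and strings stand in for the real cached key/value tensors.

```python
from collections import deque

class RollingKVCache:
    """Per-modality rolling window of cached K/V blocks.

    Globally visible entries (Global Prefix, Audio Sink) are pinned and
    never evicted; ordinary macro-block entries roll off oldest-first.
    """
    def __init__(self, window_blocks):
        self.pinned = []
        self.blocks = deque(maxlen=window_blocks)  # auto-evicts the oldest

    def pin(self, kv):
        self.pinned.append(kv)

    def append_block(self, kv):
        self.blocks.append(kv)

    def context(self):
        return self.pinned + list(self.blocks)

# Each modality keeps its own independently sized window.
video_cache = RollingKVCache(window_blocks=2)   # window sizes hypothetical
audio_cache = RollingKVCache(window_blocks=4)
video_cache.pin("prefix_kv")
for t in range(5):
    video_cache.append_block(f"vid_block_{t}")
    audio_cache.append_block(f"aud_block_{t}")
# video_cache.context() -> ['prefix_kv', 'vid_block_3', 'vid_block_4']
```

Because each stream evicts on its own schedule, the short video window and the longer audio window advance without forcing a shared context length.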



BibTeX

@article{su2026omniforcing,
  title   = {OmniForcing: Unleashing Real-time Joint Audio-Visual Generation},
  author  = {Su, Yaofeng and Li, Yuming and Xue, Zeyue and Huang, Jie and Fu, Siming
             and Li, Haoran and Li, Ying and Qian, Zezhong and Huang, Haoyang and Duan, Nan},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}