Coming soon — under review

OmniForcing: Unleashing Real-time
Joint Audio-Visual Generation

1JD Explore Academy, 2Fudan University, 3Peking University, 4The University of Hong Kong
*Equal Contribution

▶ Real-time streaming demos — click to play with audio

"A man standing by the ocean with birds flying overhead, the sound of waves crashing against the shore blends with distant seagull calls and a gentle breeze."
"A grey tabby cat sitting on a patterned rug, suddenly letting out a precisely timed meow while gazing curiously at the camera."
"An elderly woman sitting at a sewing machine, talking about making a special quilt for her granddaughter, with the rhythmic sound of the machine blending with her warm narration."

Abstract

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability due to the extreme temporal asymmetry between modalities and the resulting token sparsity.

OmniForcing teaser figure

We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts.

Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher.




OmniForcing Method Overview

OmniForcing three-stage distillation pipeline

Figure 2: The three-stage OmniForcing distillation pipeline. Stage I employs Distribution Matching Distillation (DMD) for few-step denoising. Stage II uses Causal ODE Regression with our Audio Sink Token stabilizer. Stage III implements Joint Self-Forcing training to mitigate exposure bias.

OmniForcing restructures a pretrained bidirectional dual-stream transformer (LTX-2, 14B video + 5B audio) into a block-causal autoregressive framework through a carefully designed three-stage distillation pipeline. To bridge the extreme frequency asymmetry (3 FPS video vs. 25 FPS audio), we establish a physical-time-based Macro-block Alignment at one-second boundaries with a globally visible Global Prefix anchor. An Audio Sink Token mechanism with Identity RoPE prevents the Softmax collapse and gradient explosions caused by sparse causal audio attention. Finally, Joint Self-Forcing Distillation forces both modality streams to dynamically adapt to each other's prediction drifts, maintaining strict cross-modal synchrony during long rollouts.
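The macro-block alignment and global-visibility rules above can be pictured as a boolean attention mask. The following is a minimal sketch, not the released implementation: token counts are scaled down for readability (the real model uses 3 × 384 video tokens and 25 audio tokens per one-second block), and all names are ours.

```python
import numpy as np

# Toy sizes standing in for the real per-second token counts.
SECONDS = 3
VID_PER_BLOCK = 12   # stand-in for 3 frames x 384 video tokens per second
AUD_PER_BLOCK = 5    # stand-in for 25 audio latents per second
N_PREFIX = 2         # Global Prefix tokens (globally visible)
N_SINK = 1           # Audio Sink token (globally visible)

def block_ids():
    """Macro-block index per token; -1 marks globally visible tokens."""
    ids = [-1] * (N_PREFIX + N_SINK)
    for s in range(SECONDS):
        ids += [s] * VID_PER_BLOCK   # video tokens of second s
        ids += [s] * AUD_PER_BLOCK   # audio tokens of second s
    return np.array(ids)

ids = block_ids()
q, k = ids[:, None], ids[None, :]
# A query may attend to any key in the same or an earlier macro-block,
# and every query may additionally attend to the global prefix/sink keys.
mask = (k == -1) | (k <= q)
```

Note that within a macro-block the two modalities attend to each other bidirectionally; across blocks, attention is strictly causal at one-second granularity.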

Asymmetric Block-Causal Masking

Figure 3: Asymmetric Block-Causal Masking. Modalities are synchronized via 1s macro-blocks. Each audio block contains 25 latent frames (one token each), whereas each video block contains 3 × 384 tokens. The Global Prefix (orange) and Audio Sink tokens (red) remain globally visible.
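Our reading of the Identity RoPE constraint is that the Audio Sink token is exempted from rotary position encoding, i.e. always encoded at position 0, where the rotation reduces to the identity, so its key never drifts as the stream grows. A sketch with a standard rotary-embedding implementation (the function, dimension, and constant are our assumptions, not the paper's):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Standard rotary position embedding on an even-dimensional vector."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(angles) - x2 * np.sin(angles),
         x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

# At position 0 every angle is zero, so the rotation is the identity:
# pinning the Audio Sink key there keeps it constant for all queries.
sink_key = np.random.default_rng(0).normal(size=16)
```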



Side-by-Side with Bidirectional Teacher

OmniForcing achieves a ~35× reduction in end-to-end generation time while maintaining perceptual fidelity on par with the LTX-2 teacher.

OmniForcing (Ours) — ~25 FPS, TTFC ~0.7s

LTX-2 (Teacher) — Offline, TTFC ~197s

Distillation Fidelity (VBench)

Model        Aesthetic Quality ↑  Imaging Quality ↑  Motion Smoothness ↑  Subject Consistency ↑  Temporal Flickering ↑  TTFC ↓   FPS ↑
LTX-2        0.569                0.574              0.993                0.945                  0.988                  197.0s   —
OmniForcing  0.595                0.594              0.995                0.955                  0.989                  0.7s     25

Main Results on JavisBench

Model        Size  | AV-Quality   | Text-Consistency              | AV-Consistency     | AV-Synchrony         | Runtime↓
                   | FVD↓   FAD↓  | TV-IB↑ TA-IB↑  CLIP↑  CLAP↑  | AV-IB↑  AVHScore↑  | JavisScore↑  DeSync↓ |
MMAudio      0.1B  | —      6.1   | —      0.160   —      0.407  | 0.198   0.182      | 0.150        0.849   | 15s
JavisDiT++   2.1B  | 141.5  5.5   | 0.282  0.164   0.316  0.424  | 0.198   0.184      | 0.159        0.832   | 10s
UniVerse-1   6.4B  | 194.2  8.7   | 0.272  0.111   0.309  0.245  | 0.104   0.098      | 0.077        0.929   | 13s
LTX-2        19B   | 125.4  4.6   | 0.290  0.173   0.318  0.442  | 0.318   0.298      | 0.253        0.384   | 197s
OmniForcing  19B   | 137.2  5.7   | 0.287  0.162   0.322  0.401  | 0.269   0.254      | 0.208        0.392   | 5.7s

— indicates a metric not reported for that model.


Conclusion

We presented OmniForcing, the first framework to distill a bidirectional joint audio-visual diffusion model into a real-time streaming autoregressive generator. To overcome the temporal asymmetry and gradient instability inherent in causal multi-modal distillation, we introduced Asymmetric Block-Causal Alignment and Audio Sink Tokens with Identity RoPE, coupled with Joint Self-Forcing Distillation and a modality-independent rolling KV-cache. OmniForcing achieves ~25 FPS streaming on a single GPU. We hope this work opens up new possibilities for deploying multi-modal foundation models in interactive and latency-sensitive scenarios.
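The modality-independent rolling KV-cache mentioned above can be pictured as one bounded window per stream, with the globally visible entries pinned. The following is a toy sketch under our own assumptions: window sizes are hypothetical, and strings stand in for the real cached key/value tensors.

```python
from collections import deque

class RollingKVCache:
    """Per-modality rolling window of cached K/V blocks.

    Globally visible entries (Global Prefix, Audio Sink) are pinned and
    never evicted; ordinary macro-block entries roll off oldest-first.
    """
    def __init__(self, window_blocks):
        self.pinned = []
        self.blocks = deque(maxlen=window_blocks)  # auto-evicts the oldest

    def pin(self, kv):
        self.pinned.append(kv)

    def append_block(self, kv):
        self.blocks.append(kv)

    def context(self):
        return self.pinned + list(self.blocks)

# Each modality keeps its own independently sized window.
video_cache = RollingKVCache(window_blocks=2)   # window sizes hypothetical
audio_cache = RollingKVCache(window_blocks=4)
video_cache.pin("prefix_kv")
for t in range(5):
    video_cache.append_block(f"vid_block_{t}")
    audio_cache.append_block(f"aud_block_{t}")
# video_cache.context() -> ['prefix_kv', 'vid_block_3', 'vid_block_4']
```

Because each stream evicts on its own schedule, the short video window and the longer audio window advance without forcing a shared context length.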



BibTeX

@article{su2026omniforcing,
  title   = {OmniForcing: Unleashing Real-time Joint Audio-Visual Generation},
  author  = {Su, Yaofeng and Li, Yuming and Xue, Zeyue and Huang, Jie and Fu, Siming
             and Li, Haoran and Li, Ying and Qian, Zezhong and Huang, Haoyang and Duan, Nan},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}