Avatar Forcing: real-time multimodal head avatar generation with diffusion forcing

Avatar Forcing tackles a gap in talking head avatar models: how to create interactive avatars that respond immediately to both what you say and how you move. Unlike one-way talking head models that only lip-sync or replay prerecorded motions, this framework processes audio and motion inputs causally and in real time, producing expressive avatar reactions with around 500ms latency. Combining convincing expressiveness with low lag is hard, and Avatar Forcing approaches it through diffusion forcing.

what avatar forcing does and how it works

Avatar Forcing is a framework designed for real-time interactive head avatar generation, introduced as part of a CVPR 2026 paper. Its main goal is to enable avatars that respond not just to user speech but also to nonverbal cues like nods and laughter in an expressive way, all with low latency suitable for live interaction.

At its core, the framework uses a technique called diffusion forcing. Traditional diffusion models generate outputs by iterative denoising over a full input sequence, which is non-causal and too slow for real-time applications. Diffusion forcing adapts this paradigm to causal, real-time processing, allowing the model to integrate multimodal inputs (audio and motion) as they arrive and produce avatar motion with minimal delay.

The architecture centers on a motion latent diffusion model conditioned on user inputs. The system processes audio and motion cues in a way that respects causality, avoiding future input leakage. This approach results in approximately 500ms latency, which is a practical threshold for interactive applications.
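To make "avoiding future input leakage" concrete, here is a minimal sketch of a causal visibility rule over input frames. It is an illustration only: the function names, the frame-level granularity, and the optional lookback window are assumptions made for this example, not details taken from the paper or repo.

```python
from typing import List, Optional, Sequence


def causal_mask(num_frames: int, lookback: Optional[int] = None) -> List[List[bool]]:
    """mask[t][s] is True when output frame t may read input frame s.

    Frames from the future (s > t) are never visible. If `lookback` is set,
    only the most recent `lookback` frames (including t) stay visible.
    """
    mask = []
    for t in range(num_frames):
        row = []
        for s in range(num_frames):
            visible = s <= t
            if lookback is not None:
                visible = visible and (t - s) < lookback
            row.append(visible)
        mask.append(row)
    return mask


def causal_context(stream: Sequence, t: int, lookback: int) -> list:
    """Return the input frames that have arrived and are usable at frame t."""
    start = max(0, t - lookback + 1)
    return list(stream[start:t + 1])


if __name__ == "__main__":
    for row in causal_mask(4):
        print("".join("x" if v else "." for v in row))
    print(causal_context(list(range(10)), t=5, lookback=3))
```

The lower-triangular pattern is the point: whatever the real model's attention or conditioning layout looks like, frame t can only depend on audio and motion observed up to frame t.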

Another notable feature is the use of direct preference optimization. Instead of relying solely on labeled data, the framework constructs synthetic “losing samples” by deliberately dropping user conditions. This setup trains the model in a label-free manner to prefer more expressive and interactive avatar motions, effectively pushing the system to generate responses that users find more natural and engaging.
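The idea can be sketched with the standard DPO objective applied to a pair of samples. This is a hedged illustration: the conditioning keys (`user_audio`, `user_motion`, `avatar_audio`) and the helper `drop_user_conditions` are hypothetical names, and the log-likelihood inputs are placeholders, since the article does not describe how the paper scores motion samples.

```python
import math


def drop_user_conditions(cond: dict) -> dict:
    """Return a copy of the conditioning with the user's signals removed.

    A motion generated from this reduced conditioning would play the role of
    the synthetic "losing sample". Key names are assumptions for illustration.
    """
    dropped = dict(cond)
    dropped["user_audio"] = None
    dropped["user_motion"] = None
    return dropped


def dpo_loss(logp_win: float, logp_lose: float,
             ref_logp_win: float, ref_logp_lose: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log(sigmoid(beta * (win_gain - lose_gain))).

    Each gain is the policy log-likelihood minus the frozen reference
    model's log-likelihood for the same sample.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    cond = {"avatar_audio": [0.1, 0.2], "user_audio": [0.3], "user_motion": [0.4]}
    print(drop_user_conditions(cond))
    # The loss falls below log(2) once the policy favors the winning sample
    # more strongly than the reference model does.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0, beta=1.0))
```

No human ranking is needed to form the pair: the sample generated with full user conditions is treated as the winner, and the one generated without them as the loser.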

Benchmarking results show a 6.8X speedup over the baseline, which uses standard diffusion processing. In user studies, the avatars generated by this system were preferred to the baseline in over 80% of cases, which supports the effectiveness of the diffusion forcing and preference optimization techniques.

diffusion forcing: enabling real-time multimodal avatar interaction

What sets Avatar Forcing apart is this diffusion forcing mechanism. Diffusion models, by design, tend to be iterative and non-causal: they rely on multiple denoising steps that require the full input upfront, which is incompatible with real-time interactive systems.

Diffusion forcing rethinks this by structuring the diffusion process to be causal, allowing the model to update avatar motion incrementally as new audio and motion data stream in. The system essentially “forces” the diffusion model to operate with partial, current inputs rather than waiting for the full sequence.
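A toy sketch of that streaming loop follows, under stated assumptions. Each new block of motion latents starts as noise and is denoised in a few steps, conditioned only on already-finished blocks and the input that has arrived so far. The scalar latents, the `toy_denoiser` stand-in, and the simple update rule are all hypothetical; the article does not describe the paper's actual sampler or network.

```python
import random
from typing import Callable, List, Sequence

# (noisy latent, noise level, finished history, current condition) -> clean estimate
Denoiser = Callable[[float, float, Sequence[float], float], float]


def toy_denoiser(x_noisy: float, noise_level: float,
                 history: Sequence[float], cond: float) -> float:
    """Stand-in for a learned network that predicts the clean latent.

    Here the "clean" value is just a blend of the previous latent and the
    current input, which is enough to show the data flow.
    """
    prev = history[-1] if history else 0.0
    return 0.5 * prev + 0.5 * cond


def stream_generate(conditions: Sequence[float], denoiser: Denoiser,
                    num_steps: int = 4, seed: int = 0) -> List[float]:
    """Generate one latent per incoming condition, strictly in arrival order."""
    rng = random.Random(seed)
    history: List[float] = []
    for cond in conditions:            # inputs arrive one block at a time
        x = rng.gauss(0.0, 1.0)        # the new block starts as pure noise
        for step in range(num_steps, 0, -1):
            level = step / num_steps
            next_level = (step - 1) / num_steps
            x0_hat = denoiser(x, level, history, cond)
            # Shrink the remaining noise in proportion to the next noise level.
            x = x0_hat + (next_level / level) * (x - x0_hat)
        history.append(x)              # the finished block becomes clean context
    return history


if __name__ == "__main__":
    print(stream_generate([1.0, 1.0, 1.0], toy_denoiser))
```

Past blocks sit at zero noise while the newest block is still being denoised, so different positions in the sequence carry different noise levels. Because nothing in the loop reads a condition that has not arrived yet, changing a later input cannot alter earlier outputs.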

This design introduces some tradeoffs. The model must balance speed with fidelity; operating in a causal manner may limit the ability to look ahead and smooth results over longer time horizons. However, the reported 6.8X speedup with only 500ms latency suggests the tradeoff is well-managed here.

The direct preference optimization adds another layer of complexity but greatly improves expressiveness. By generating synthetic losing samples (e.g., by dropping some user condition inputs), the model learns to distinguish and prefer more natural, expressive interactions without needing costly labeled datasets. This is a clever workaround for the challenge of collecting large-scale labeled data for expressive avatar motions.

While the code is not yet publicly available, the framework’s approach is well-documented in the accompanying paper and repo description. The architecture and training methodology reflect thoughtful engineering geared towards practical real-time deployment.

explore the project

Since the repo does not provide installation or quickstart commands, the best way to get familiar with Avatar Forcing is to dive into the documentation and research paper linked in the repository.

The repo contains detailed explanations of the diffusion forcing algorithm, the direct preference optimization method, and benchmark results. For developers interested in real-time avatar generation, these documents provide valuable insights into the model design and training strategy.

Keep an eye on the repo for future code releases, which are expected to include the motion latent diffusion model implementation and training pipeline.

verdict

Avatar Forcing targets a real challenge in avatar generation: achieving low-latency, expressive, and truly interactive head avatars that respond to multimodal inputs in real time. The diffusion forcing technique is a solid engineering solution to the latency and causality problem inherent in diffusion models.

The approach to label-free preference optimization is particularly interesting and likely to influence future work in expressive avatar systems where labeled data is scarce.

Limitations include the current unavailability of code and the inherent tradeoffs in causal diffusion processing that may affect output smoothness or fidelity in edge cases.

This repo is worth following for researchers and practitioners working on real-time avatar generation, interactive communication systems, or diffusion model adaptations for low-latency applications. It shows a promising direction beyond simple one-way talking head models toward truly interactive avatars with multimodal understanding.


→ GitHub Repo: TaekyungKi/AvatarForcing ⭐ 292

Noureddine RAMDI / Avatar Forcing: real-time multimodal head avatar generation with diffusion forcing
