Noureddine RAMDI / Omni-Diffusion: unified any-to-any multimodal generation with masked discrete diffusion

Created Mon, 04 May 2026 10:23:02 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

VITA-MLLM/Omni-Diffusion

Omni-Diffusion tackles a fundamental challenge in multimodal AI: how to build a single model that can understand and generate across text, images, and speech without relying on separate encoders and decoders for each modality pair. It does this by treating all modalities as discrete tokens within a shared vocabulary space and modeling their joint distribution using masked discrete diffusion. This yields a unified any-to-any multimodal language model capable of speech-to-image, text-to-image, spoken VQA, TTS, and ASR tasks within one framework.

what Omni-Diffusion is and how it works

At its core, Omni-Diffusion is a Python-based deep learning system implementing a masked discrete diffusion model over token sequences that represent text, images, and speech. Unlike conventional multimodal models that use distinct encoder-decoder pipelines for each pair (e.g., text-to-image or speech-to-text), Omni-Diffusion unifies these by modeling a single joint distribution over all discrete tokens from these modalities.

The architecture relies on specialized tokenizers:

  • For audio, it employs the GLM-4-Voice tokenizer and decoder, which discretize speech into token sequences suitable for diffusion modeling.
  • For images, it uses MAGVITv2, an image tokenizer producing discrete tokens representing visual content.
  • Text tokens are handled within the same framework, making the entire input a single sequence of discrete tokens.
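One way to picture the "single sequence of discrete tokens" idea is to give each modality its own contiguous ID range inside one shared vocabulary. The sketch below is only an illustration of that bookkeeping: the codebook sizes, the offset scheme, and the helper names (`build_offsets`, `to_shared`, `from_shared`) are assumptions made for this example, not values or functions taken from the Omni-Diffusion repo.

```python
# Hypothetical codebook sizes, chosen only for illustration.
VOCAB_SIZES = {"text": 32000, "image": 8192, "speech": 16384}


def build_offsets(sizes):
    """Assign each modality a contiguous, non-overlapping ID range."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # start now equals the total shared-vocabulary size


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)


def to_shared(modality, local_ids):
    """Map tokenizer-local IDs into the shared vocabulary."""
    size = VOCAB_SIZES[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"{i} is outside the {modality} codebook")
    return [OFFSETS[modality] + i for i in local_ids]


def from_shared(shared_id):
    """Recover (modality, local ID) from a shared-vocabulary ID."""
    for name, size in VOCAB_SIZES.items():
        start = OFFSETS[name]
        if start <= shared_id < start + size:
            return name, shared_id - start
    raise ValueError(f"{shared_id} is outside the shared vocabulary")


# A mixed sequence: two text tokens, two image tokens, one speech token.
sequence = (
    to_shared("text", [5, 17])
    + to_shared("image", [0, 42])
    + to_shared("speech", [7])
)
print(sequence)
print([from_shared(t) for t in sequence])
```

With a layout like this, a single embedding table and a single output head can cover all three modalities, which is what lets one model treat the whole input as one token sequence.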

The backbone is a sequence model trained with masked discrete diffusion. It learns to predict missing or corrupted tokens conditioned on the observed context, and at generation time it refines its token predictions over several iterations. Because every modality lives in the same token sequence, the same procedure can generate or transform data between modalities.
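To make the masking and iterative refinement concrete, here is a toy sketch in plain Python. It is not the repo's sampler: the `MASK` sentinel, the confidence-ordered reveal schedule, and the stand-in denoiser (which simply looks up the right answer and attaches a random confidence) are all assumptions made so the loop can run without a trained network.

```python
import random

MASK = -1  # hypothetical sentinel standing in for a [MASK] token


def corrupt(tokens, mask_ratio, rng):
    """Forward process: replace a random subset of tokens with MASK."""
    n_mask = round(mask_ratio * len(tokens))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_positions else t for i, t in enumerate(tokens)]


def make_toy_denoiser(target, rng):
    """Stand-in for a trained network (assumption: it 'knows' the target).

    Returns, for every masked position, a (predicted token, confidence) pair.
    """
    def denoiser(tokens):
        return {
            i: (target[i], rng.random())
            for i, t in enumerate(tokens)
            if t == MASK
        }
    return denoiser


def generate(tokens, denoiser, steps):
    """Reverse process: reveal the most confident predictions step by step."""
    tokens = list(tokens)
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = denoiser(tokens)
        # Spread the remaining masked positions evenly over the remaining steps.
        remaining_steps = steps - step
        n_reveal = -(-len(masked) // remaining_steps)  # ceiling division
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_reveal]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


rng = random.Random(0)
target = [3, 1, 4, 1, 5, 9, 2, 6]
noisy = corrupt(target, mask_ratio=0.5, rng=rng)
restored = generate(noisy, make_toy_denoiser(target, rng), steps=3)
print(noisy)
print(restored)
```

In an any-to-any setting the same loop would apply with the conditioning tokens (say, text) left unmasked and the target span (say, speech tokens) fully masked at the start.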

The repo includes scripts for supervised fine-tuning with DeepSpeed, enabling efficient large-scale training, and inference examples spanning multiple tasks: speech-to-image generation, text-to-image creation, spoken visual question answering (VQA), text-to-speech (TTS), and automatic speech recognition (ASR). Evaluation scripts for benchmark datasets like LibriSpeech, LibriTTS, and MME are also provided.

technical strengths and tradeoffs

The key technical strength of Omni-Diffusion lies in its masked discrete diffusion formulation that treats all modalities as sequences of discrete tokens in a joint space. This eliminates the complexity of engineering separate encoder-decoder pairs for each modality conversion, reducing architectural fragmentation.

Using MAGVITv2 and GLM-4-Voice tokenizers allows the model to handle high-dimensional data like images and speech efficiently by converting them into manageable discrete token sequences. The diffusion backbone provides a principled generative approach, iteratively denoising masked tokens to produce coherent outputs.

However, this approach comes with tradeoffs:

  • The reliance on multiple pretrained tokenizers and large pretrained weights increases the overall model size and complexity.
  • Masked discrete diffusion models tend to be computationally intensive, especially for the long token sequences typical of images and speech, since generation refines the sequence over multiple iterations.
  • Training and inference require careful environment setup, including a custom PyTorch Docker image and DeepSpeed for fine-tuning.
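A back-of-the-envelope estimate shows why sequence length matters so much here. The sketch assumes that every denoising step runs one full forward pass over the whole sequence and that the backbone uses full self-attention; both are assumptions for illustration, and the sequence lengths and step count are made-up numbers rather than Omni-Diffusion settings.

```python
def denoising_cost(seq_len, steps):
    """Rough work estimate for iterative denoising.

    Assumes one full forward pass over the sequence per step and
    full (quadratic) self-attention in the backbone.
    """
    token_passes = seq_len * steps          # token positions processed in total
    attention_pairs = seq_len ** 2 * steps  # pairwise attention scores in total
    return token_passes, attention_pairs


# Hypothetical lengths: a short sequence versus one four times longer.
short = denoising_cost(seq_len=256, steps=32)
long = denoising_cost(seq_len=1024, steps=32)

print("token passes grow by", long[0] // short[0], "x")
print("attention pairs grow by", long[1] // short[1], "x")
```

Under these assumptions, a sequence four times longer costs four times as many token passes and sixteen times as many attention pairs, before accounting for the number of denoising steps.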

The codebase is surprisingly clean given the complexity, with clear separation of components for tokenization, diffusion modeling, and evaluation. Providing scripts alongside pretrained weights hosted on Hugging Face improves reproducibility, but downloading those weights demands significant storage and bandwidth.

quick start: preparing the environment and pretrained weights

Getting Omni-Diffusion running requires several setup steps tied to its dependencies and model weights. The README specifies precise commands:

docker pull shenyunhang/pytorch:24.11-py3_2024-1224
git clone https://github.com/VITA-MLLM/Omni-Diffusion.git
cd Omni-Diffusion
git submodule update --init --recursive
pip install -r requirements_ds_gpu.txt
pip install -e .
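After running the commands above, a quick sanity check can confirm that the main pieces are visible from Python. This is a hypothetical helper, not something shipped with the repo; it assumes the relevant import names are `torch` and `deepspeed` and that `git` and `docker` should be on the PATH, and it only checks presence, not versions or GPU access.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "deepspeed"), tools=("git", "docker")):
    """Report which Python modules and command-line tools can be found."""
    report = {}
    for name in modules:
        # find_spec locates a top-level module without importing it.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[f"tool:{name}"] = shutil.which(name) is not None
    return report


report = check_environment()
for item, found in report.items():
    print(f"{item:<20} {'found' if found else 'MISSING'}")
```

Inside the Docker container, `docker` itself will normally show as missing, which is expected; pass a different `tools` tuple to skip it.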

Pretrained weights must also be downloaded and placed in the specific directories the README lists. These weights are essential for the model’s tokenization and generation capabilities.

verdict

Omni-Diffusion is a technically interesting project for researchers and practitioners working on unified multimodal generative models. Its masked discrete diffusion approach to modeling a joint distribution over text, image, and speech tokens is a thoughtful alternative to the common practice of building a separate encoder-decoder pipeline for each modality pair.

That said, the complexity and resource demands (custom Docker image, DeepSpeed, large pretrained weights) mean it’s not a lightweight tool for casual experimentation. You’ll want a solid GPU environment and familiarity with diffusion models and tokenization to make the most of it.

The repo offers a comprehensive starting point for anyone exploring any-to-any multimodal generation, especially those comfortable navigating diffusion architectures and multimodal tokenizers. The code quality and modularity suggest it can serve as a foundation for further research and adaptation.

If you’re looking for a single model to handle text, image, and speech generation tasks in a unified framework, Omni-Diffusion is worth your attention, provided you’re ready for the associated setup and compute demands.


→ GitHub Repo: VITA-MLLM/Omni-Diffusion ⭐ 134 · Python