Noureddine RAMDI / OmniVoice Studio: a local-first multi-engine voice cloning and dubbing platform with MCP server integration

Created Sat, 23 May 2026 20:41:14 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

debpalash/OmniVoice-Studio

OmniVoice Studio is an open-source, fully local voice AI platform that combines zero-shot voice cloning, multi-engine text-to-speech (TTS), and an end-to-end video dubbing pipeline. Unlike cloud-dependent voice tools, it runs entirely on your machine without external API keys or cloud services. Its most distinctive feature is a built-in MCP server that lets AI agents such as Claude Code call voice generation, dubbing, and dictation programmatically, connecting local voice AI to agentic coding workflows.

What OmniVoice Studio is and how it is architected

At its core, OmniVoice Studio is a desktop application with a React frontend and a FastAPI backend. The backend exposes 97 API endpoints backed by an SQLite database, handling everything from voice cloning to transcription and video dubbing.

The multi-engine TTS backend supports six interchangeable engines: its own OmniVoice, plus CosyVoice 3, MLX-Audio, VoxCPM2, MOSS-TTS-Nano, and KittenTTS. This architecture enables broad language coverage (more than 600 languages) and zero-shot voice cloning from audio clips as short as 3 seconds.

The video dubbing pipeline is a complete stack that performs transcription, translation, speech synthesis, and muxing, all locally. The backend builds on established open-source tools: WhisperX for automatic speech recognition (ASR), Demucs for vocal isolation, Pyannote for speaker diarization, and AudioSeal for invisible AI watermarking.
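The four stages can be pictured as a simple chain of functions. The sketch below is illustrative only: `Segment`, `dub`, and the `fake_*` stages are hypothetical names, and the stubs stand in for the real tools (WhisperX, a translator, a TTS engine, a muxer) so that the example runs without any models installed. It is not the project's actual code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str     # transcript text (or its translation)


def dub(
    video_path: str,
    transcribe: Callable[[str], List[Segment]],
    translate: Callable[[Segment], Segment],
    synthesize: Callable[[Segment], bytes],
    mux: Callable[[str, List[bytes]], str],
) -> str:
    """Run the four stages in order and return the output path."""
    segments = transcribe(video_path)               # 1. speech recognition
    translated = [translate(s) for s in segments]   # 2. translation
    clips = [synthesize(s) for s in translated]     # 3. speech synthesis per segment
    return mux(video_path, clips)                   # 4. mux new audio into the video


# Stub stages (assumptions for illustration, not real model calls).
def fake_transcribe(path: str) -> List[Segment]:
    return [Segment(0.0, 1.5, "hello"), Segment(1.5, 3.0, "world")]


def fake_translate(seg: Segment) -> Segment:
    return Segment(seg.start, seg.end, seg.text.upper())


def fake_synthesize(seg: Segment) -> bytes:
    return seg.text.encode("utf-8")


def fake_mux(path: str, clips: List[bytes]) -> str:
    return f"{path}.dubbed[{len(clips)} clips]"


result = dub("talk.mp4", fake_transcribe, fake_translate, fake_synthesize, fake_mux)
```

Passing each stage in as a callable mirrors the idea that every step is a swappable local component; the real pipeline also involves vocal isolation, diarization, and watermarking, which this sketch leaves out.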

One of the more subtle architectural highlights is the GPU auto-detection system, which supports CUDA (NVIDIA), MPS (Apple Silicon), and ROCm (AMD), with a fallback to CPU. It features VRAM-aware offloading that automatically shifts TTS workloads to the CPU when the GPU has 8 GB of VRAM or less, which keeps the app usable on modest machines.
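As a rough illustration of that decision logic, here is a minimal sketch in plain Python. The function names, the detection order (CUDA, then ROCm, then MPS), and the treatment of an unknown VRAM size are all assumptions made for the example; only the 8 GB threshold comes from the description above. A real implementation would obtain the hardware facts from a framework such as PyTorch rather than take them as arguments.

```python
from typing import Optional

# From the article: GPUs with 8 GB of VRAM or less offload TTS to the CPU.
OFFLOAD_THRESHOLD_GB = 8.0


def pick_backend(has_cuda: bool, has_rocm: bool, has_mps: bool) -> str:
    """Pick a compute backend; the preference order here is an assumption."""
    if has_cuda:
        return "cuda"
    if has_rocm:
        return "rocm"
    if has_mps:
        return "mps"
    return "cpu"  # fallback when no supported GPU is present


def tts_device(backend: str, vram_gb: Optional[float]) -> str:
    """Decide where TTS runs, given the backend and its VRAM (None if unknown)."""
    if backend == "cpu":
        return "cpu"
    # Assumption: when VRAM cannot be measured, keep TTS on the GPU.
    if vram_gb is not None and vram_gb <= OFFLOAD_THRESHOLD_GB:
        return "cpu"
    return backend


backend = pick_backend(has_cuda=True, has_rocm=False, has_mps=False)
device_small_gpu = tts_device(backend, vram_gb=6.0)   # small GPU: offload
device_large_gpu = tts_device(backend, vram_gb=12.0)  # large GPU: stay on GPU
```

Keeping detection separate from the offload decision makes the threshold easy to test without any GPU present.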

On top of this, OmniVoice Studio ships with an MCP (Model Context Protocol) server. This server exposes the voice cloning, dubbing, and dictation pipelines programmatically to MCP clients such as Claude Code and Cursor, so agentic AI workflows can incorporate voice interaction directly.
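To make "programmatic access" concrete: MCP messages are JSON-RPC 2.0, and a client invokes a server tool with the `tools/call` method. The sketch below builds such a request. The tool name `generate_speech` and its arguments are hypothetical placeholders, since the tool names that OmniVoice Studio actually registers are not given here.

```python
import json
from typing import Any, Dict


def tools_call_request(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and arguments, for illustration only.
payload = tools_call_request(
    1,
    "generate_speech",
    {"text": "Hello from an agent", "voice": "narrator"},
)
decoded = json.loads(payload)
```

In practice an MCP client such as Claude Code builds and sends these messages itself; the point is that each voice pipeline becomes a named tool with structured arguments.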

Technical strengths and design tradeoffs

The multi-engine TTS backend is a major strength. By supporting six distinct engines, OmniVoice caters to a broad range of voices, languages, and synthesis needs. The default OmniVoice TTS engine covers 600+ languages with cloning and instruction capabilities, while engines like CosyVoice 3 and VoxCPM2 add dialects and alternative voice qualities. This modular engine architecture allows users to pick the best fit per use case or hardware platform.

The tradeoff is increased complexity in managing multiple engines with varying platform support and licensing terms. For example, MLX-Audio engines have limited Linux support, and some engines lack cloning or instruction capabilities. Users need to understand which engines fit their needs and environment.

GPU auto-detection and VRAM-aware offloading together address a common problem in local AI workloads: hardware variability. Many TTS models require significant GPU memory, so OmniVoice Studio detects the available VRAM and automatically offloads TTS to the CPU on GPUs with 8 GB of VRAM or less. This allows the app to run on a wide range of machines, from the recommended NVIDIA RTX 3060-class and better GPUs to Apple Silicon Macs and even CPU-only setups, albeit with slower performance.

The backend integration with open-source AI components like WhisperX and Pyannote shows careful engineering to chain complex audio pipelines locally without cloud dependencies. Including AudioSeal watermarking also demonstrates attention to real-world needs like content protection.

The MCP server integration is the most distinctive aspect. By exposing the voice pipelines as an MCP server, OmniVoice Studio enables agentic AI clients to programmatically access voice generation, dubbing, and transcription. This positions OmniVoice Studio not just as a desktop app but as a local voice AI infrastructure provider for multi-agent systems. Few other open-source TTS tools offer this level of programmatic integration.

Quick start

Per-OS install guides — pick yours and follow it end-to-end:

  • macOS — docs/install/macos.md
  • Windows — docs/install/windows.md
  • Linux — docs/install/linux.md
  • Docker — docs/install/docker.md

Stuck? See docs/install/troubleshooting.md for the top 10 install errors. The in-app error UI deeplinks to those entries when something breaks at runtime.

For Hugging Face token setup, see docs/setup/huggingface-token.md. For diarization-specific gating, see docs/features/diarization.md.

System requirements

| | Minimum | Recommended |
|---|---|---|
| OS | Windows 10, macOS 12+, Ubuntu 20.04+ | Any modern 64-bit OS |
| RAM | 8 GB | 16 GB+ |
| VRAM (GPU) | 4 GB (auto-offloads TTS to CPU) | 8 GB+ (NVIDIA RTX 3060+) |
| Disk | 10 GB free (models + cache) | 20 GB+ SSD |
| Python | 3.10+ (managed by uv) | 3.11–3.12 |
| GPU | Optional (CPU works) | NVIDIA CUDA · Apple Silicon MPS · AMD ROCm |

Tip: On GPUs with ≤8 GB VRAM, OmniVoice automatically offloads TTS to CPU during transcription, with no config needed. A dedicated GPU is not required; the entire pipeline runs on CPU (just slower).
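A quick preflight check against the minimums can be sketched with the standard library alone. This is an illustrative helper, not part of OmniVoice Studio: `check_minimums` is a hypothetical name, it covers only the Python version and free disk space (RAM and VRAM need platform-specific or third-party tooling), and since the project manages Python through uv, the interpreter check may be unnecessary in practice.

```python
import shutil
import sys
from typing import List, Tuple

MIN_PYTHON = (3, 10)  # minimum Python version from the requirements
MIN_DISK_GB = 10.0    # minimum free disk space from the requirements


def check_minimums(python_version: Tuple[int, int], disk_free_gb: float) -> List[str]:
    """Return a list of human-readable problems; empty means the minimums are met."""
    problems: List[str] = []
    if python_version < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}"
        )
    if disk_free_gb < MIN_DISK_GB:
        problems.append(
            f"Only {disk_free_gb:.1f} GB free; {MIN_DISK_GB:.0f} GB is the minimum"
        )
    return problems


# Check the current interpreter and the disk holding the working directory.
free_gb = shutil.disk_usage(".").free / 1e9
report = check_minimums(sys.version_info[:2], free_gb)
```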

TTS engines overview

OmniVoice ships a multi-engine TTS backend. The default engine (OmniVoice) is always available; additional engines are opt-in and auto-detected. Switch engines in Settings → TTS Engine or via the OMNIVOICE_TTS_BACKEND env var.
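The selection rule described here (honor the env var when the requested engine is available, otherwise use the default) might look like the sketch below. Only the variable name `OMNIVOICE_TTS_BACKEND` comes from the text; the engine identifiers such as `"omnivoice"` and `"cosyvoice"`, the case-insensitive matching, and the silent fallback are assumptions for illustration.

```python
import os
from typing import Iterable, Mapping

DEFAULT_BACKEND = "omnivoice"  # assumed identifier for the default engine


def resolve_backend(env: Mapping[str, str], available: Iterable[str]) -> str:
    """Return the requested engine if it is available, else the default."""
    requested = env.get("OMNIVOICE_TTS_BACKEND", "").strip().lower()
    if requested and requested in set(available):
        return requested
    return DEFAULT_BACKEND


# Hypothetical engine identifiers, for illustration only.
chosen = resolve_backend(
    {"OMNIVOICE_TTS_BACKEND": "CosyVoice"}, ["omnivoice", "cosyvoice"]
)
fallback = resolve_backend({"OMNIVOICE_TTS_BACKEND": "voxcpm2"}, ["omnivoice"])

# Reading the real process environment works the same way.
from_process = resolve_backend(os.environ, ["omnivoice"])
```

Taking the environment as a mapping argument keeps the function easy to test; whether the real application falls back silently or reports an error for an unavailable engine is not stated here.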

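| Engine | Languages | Clone | Instruct | Linux | macOS ARM | Windows | License |
|---|---|---|---|---|---|---|---|
| OmniVoice (default) | 600+ | | | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Built-in |
| CosyVoice 3 | 9 + 18 dialects | | | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Apache-2.0 |
| MLX-Audio (Kokoro, Qwen3-TTS, CSM, Dia, …) | Multi | Varies | Varies | | ✅ Native | | Varies |
| VoxCPM2 | 30 | | | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Apache-2.0 |
| MOSS-TTS-Nano | 20 | | | ✅ CUDA/CPU | ✅ CPU | | |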

Verdict

OmniVoice Studio is a solid choice for developers and researchers who need a fully local, multi-language, multi-engine voice AI platform. Its zero-shot cloning from just a 3-second clip, comprehensive video dubbing pipeline, and GPU-aware offloading make it versatile across hardware setups.

The standout is the MCP server exposing the entire voice and dubbing pipeline programmatically, enabling integration with agentic AI systems like Claude Code. This is rare and valuable for those working at the intersection of voice AI and autonomous coding agents.

The tradeoffs include the complexity of managing multiple TTS engines and hardware compatibility nuances. CPU-only operation is fully supported but slower, and some engines have limited platform support.

If you want a local-first voice AI platform with programmatic access to advanced voice cloning and dubbing, OmniVoice Studio is worth exploring. It’s not a plug-and-play consumer app but a powerful toolkit for hands-on developers and researchers.


→ GitHub Repo: debpalash/OmniVoice-Studio ⭐ 3,824 · Python