OmniVoice Studio: a local-first multi-engine voice cloning and dubbing platform with MCP server integration

OmniVoice Studio is an open-source, fully local voice AI platform that combines zero-shot voice cloning, multi-engine text-to-speech (TTS), and an end-to-end video dubbing pipeline. Unlike many cloud-dependent voice tools, it runs entirely on your machine, with no external API keys or cloud services. Its most distinctive feature is an MCP server that lets AI agents such as Claude Code call voice generation, dubbing, and dictation programmatically, connecting local voice AI to agentic coding in a way few projects do.

What OmniVoice Studio is and how it is architected

At its core, OmniVoice Studio is a desktop application with a React frontend and a FastAPI backend. The backend exposes 97 API endpoints backed by an SQLite database, handling everything from voice cloning to transcription and video dubbing.

The multi-engine TTS backend supports six interchangeable engines: its own OmniVoice, CosyVoice 3, MLX-Audio, VoxCPM2, MOSS-TTS-Nano, and KittenTTS. This architecture gives broad language coverage (more than 600 languages with the default engine) and zero-shot voice cloning from reference clips as short as 3 seconds.

The video dubbing pipeline is a complete stack that performs transcription, translation, synthetic speech generation, and muxing — all locally. The backend intelligently integrates state-of-the-art open-source tools: WhisperX for automatic speech recognition (ASR), Demucs for vocal isolation, Pyannote for speaker diarization, and AudioSeal for invisible AI watermarking.
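The stages above can be pictured as a chain of functions over timed segments. The sketch below is illustrative only: the `Segment` type and every stage function are hypothetical stand-ins, not OmniVoice Studio's actual code, and the stubs return placeholder values where an ASR model, a translation model, and a TTS engine would do the real work.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Segment:
    start: float        # seconds into the source video
    end: float
    text: str           # transcript, later replaced by translated text
    audio: bytes = b""  # synthesized speech, filled in by the TTS stage


Stage = Callable[[List[Segment]], List[Segment]]


def transcribe_stub(_: List[Segment]) -> List[Segment]:
    # Stand-in for ASR (the article names WhisperX for this step).
    return [Segment(0.0, 1.5, "hello"), Segment(1.5, 3.0, "world")]


def translate_stub(segments: List[Segment]) -> List[Segment]:
    # Stand-in for translation: tags the text instead of translating it.
    return [replace(s, text=f"[es] {s.text}") for s in segments]


def synthesize_stub(segments: List[Segment]) -> List[Segment]:
    # Stand-in for TTS: encodes the text as bytes instead of generating audio.
    return [replace(s, audio=s.text.encode("utf-8")) for s in segments]


def run_pipeline(stages: List[Stage]) -> List[Segment]:
    # Feed the output of each stage into the next one.
    segments: List[Segment] = []
    for stage in stages:
        segments = stage(segments)
    return segments


dubbed = run_pipeline([transcribe_stub, translate_stub, synthesize_stub])
for seg in dubbed:
    print(f"{seg.start:.1f}-{seg.end:.1f}s {seg.text!r} ({len(seg.audio)} bytes)")
```

Keeping the timestamps on every segment is what would let a final muxing stage place each synthesized clip back at the right point in the video; that stage is left out here.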

One of the more subtle architectural highlights is the GPU auto-detection system, which supports CUDA (NVIDIA), MPS (Apple Silicon), and ROCm (AMD), with a fallback to CPU. It includes VRAM-aware offloading that automatically shifts TTS workloads to the CPU when the GPU has 8 GB of VRAM or less, which keeps the app usable on modest machines.
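As a rough illustration of the offloading rule, the decision can be reduced to a small function. This is a minimal sketch under stated assumptions: the function name, its inputs, and the exact comparison are hypothetical, and only the 8 GB threshold and the list of backends come from the description above. How the app actually measures memory (for example on Apple Silicon's unified memory) is not covered by the source.

```python
OFFLOAD_THRESHOLD_GB = 8.0  # from the article: 8 GB of VRAM or less triggers offload
KNOWN_BACKENDS = {"cuda", "mps", "rocm", "cpu"}


def choose_tts_device(backend: str, vram_gb: float) -> str:
    """Pick where the TTS model should run (hypothetical policy, not the app's code).

    backend: the accelerator that was detected, or "cpu" if none was found.
    vram_gb: memory available to that accelerator; ignored for "cpu".
    """
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    if backend == "cpu" or vram_gb <= OFFLOAD_THRESHOLD_GB:
        return "cpu"
    return backend


print(choose_tts_device("cuda", 12.0))  # cuda
print(choose_tts_device("cuda", 8.0))   # cpu
print(choose_tts_device("mps", 16.0))   # mps
print(choose_tts_device("cpu", 0.0))    # cpu
```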

On top of this, OmniVoice Studio ships with an MCP (Model Context Protocol) server. This server exposes the voice cloning, dubbing, and dictation pipelines programmatically to MCP clients such as Claude Code and Cursor, so agentic AI workflows can incorporate voice features directly.
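To make "programmatically" concrete: MCP clients talk to servers with JSON-RPC 2.0 messages, and invoking a server-side tool uses the `tools/call` method. The sketch below only builds such a request. The tool name `generate_speech` and its arguments are hypothetical; the real names are whatever the OmniVoice Studio server advertises in its `tools/list` response.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and arguments, used for illustration only.
request = build_tool_call(
    1,
    "generate_speech",
    {"text": "Hello from an agent", "language": "en"},
)
print(request)
```

In practice a client such as Claude Code or Cursor builds and sends these messages itself; the user only registers the server in the client's MCP configuration.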

Technical strengths and design tradeoffs

The multi-engine TTS backend is a major strength. By supporting six distinct engines, OmniVoice caters to a broad range of voices, languages, and synthesis needs. The default OmniVoice TTS engine covers 600+ languages with cloning and instruction capabilities, while engines like CosyVoice 3 and VoxCPM2 add dialects and alternative voice qualities. This modular engine architecture allows users to pick the best fit per use case or hardware platform.

The tradeoff is increased complexity in managing multiple engines with varying platform support and licensing terms. For example, the MLX-Audio engines run natively on Apple Silicon Macs only, with no Linux or Windows support, and some engines lack cloning or instruction capabilities. Users need to understand which engines fit their needs and environment.

GPU auto-detection and VRAM-aware offloading are an elegant answer to a common problem in local AI workloads: hardware variability. Many TTS models require significant GPU memory, so OmniVoice Studio detects the available VRAM and automatically offloads TTS to the CPU on GPUs with 8 GB of VRAM or less. This lets the app run on a wide range of machines, from NVIDIA RTX 3060-class and better GPUs to Apple Silicon Macs and even CPU-only setups, albeit with slower performance.

The backend's integration of open-source AI components like WhisperX and Pyannote shows careful engineering: complex audio pipelines are chained locally without cloud dependencies. Including AudioSeal watermarking also shows attention to real-world needs such as invisibly marking generated audio as AI-produced.

The MCP server integration is the most distinctive aspect. By exposing the voice pipelines as an MCP server, OmniVoice Studio enables agentic AI clients to programmatically access voice generation, dubbing, and transcription. This positions OmniVoice Studio not just as a desktop app but as a local voice AI infrastructure provider for multi-agent systems. Few other open-source TTS tools offer this level of programmatic integration.

Quick start

Per-OS install guides — pick yours and follow it end-to-end:

macOS — docs/install/macos.md
Windows — docs/install/windows.md
Linux — docs/install/linux.md
Docker — docs/install/docker.md

Stuck? See docs/install/troubleshooting.md for the top 10 install errors. The in-app error UI deeplinks to those entries when something breaks at runtime.

For Hugging Face token setup, see docs/setup/huggingface-token.md. For diarization-specific gating, see docs/features/diarization.md.

System requirements

	Minimum	Recommended
OS	Windows 10, macOS 12+, Ubuntu 20.04+	Any modern 64-bit OS
RAM	8 GB	16 GB+
VRAM (GPU)	4 GB (auto-offloads TTS to CPU)	8 GB+ (NVIDIA RTX 3060+)
Disk	10 GB free (models + cache)	20 GB+ SSD
Python	3.10+ (managed by `uv`)	3.11–3.12
GPU	Optional — CPU works	NVIDIA CUDA · Apple Silicon MPS · AMD ROCm

Tip: On GPUs with ≤8 GB VRAM, OmniVoice automatically offloads TTS to CPU during transcription, with no configuration needed. A dedicated GPU is not required; the entire pipeline runs on CPU, just more slowly.
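Before installing, the minimum Python and disk figures from the table can be checked with a few lines of standard-library Python. This is a hypothetical preflight helper, not something the project ships; it skips the RAM and VRAM rows because reading those portably needs third-party packages.

```python
import shutil
import sys
from typing import List, Tuple

MIN_PYTHON = (3, 10)   # "Python 3.10+" in the table
MIN_FREE_DISK_GB = 10  # "10 GB free (models + cache)" in the table


def check_requirements(python_version: Tuple[int, int], free_disk_gb: float) -> List[str]:
    """Return a list of human-readable problems; an empty list means both checks pass."""
    problems = []
    if python_version < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than 3.10"
        )
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"Only {free_disk_gb:.1f} GB of free disk space; 10 GB needed")
    return problems


if __name__ == "__main__":
    # Measure free space on the drive that holds the current directory.
    free_gb = shutil.disk_usage(".").free / 1024**3
    issues = check_requirements(sys.version_info[:2], free_gb)
    print("\n".join(issues) if issues else "Minimum Python and disk requirements met")
```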

TTS engines overview

OmniVoice ships a multi-engine TTS backend. The default engine (OmniVoice) is always available; additional engines are opt-in and auto-detected. Switch engines in Settings → TTS Engine or via the OMNIVOICE_TTS_BACKEND env var.

Engine	Languages	Clone	Instruct	Linux	macOS ARM	Windows	License
OmniVoice (default)	600+	✅	✅	✅ CUDA/CPU	✅ MPS	✅ CUDA/CPU	Built-in
CosyVoice 3	9 + 18 dialects	✅	✅	✅ CUDA/CPU	✅ MPS	✅ CUDA/CPU	Apache-2.0
MLX-Audio (Kokoro, Qwen3-TTS, CSM, Dia, …)	Multi	Varies	Varies	❌	✅ Native	❌	Varies
VoxCPM2	30	✅	✅	✅ CUDA/CPU	✅ MPS	✅ CUDA/CPU	Apache-2.0
MOSS-TTS-Nano	20	✅	❌	✅ CUDA/CPU	✅ CPU	…
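One way to work with a matrix like this is to encode it as data and filter by platform and required capability. The sketch below is an illustration, not project code: the `Engine` type and `candidates` function are hypothetical, and it includes only the four rows that are listed in full above. Entries marked "Varies" are stored as `None` and treated as unconfirmed.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Engine:
    name: str
    clone: Optional[bool]      # None means "Varies" in the table
    instruct: Optional[bool]   # None means "Varies" in the table
    platforms: FrozenSet[str]  # subset of {"linux", "macos-arm", "windows"}


ALL_PLATFORMS = frozenset({"linux", "macos-arm", "windows"})

# Only the rows that appear in full in the table above.
ENGINES = [
    Engine("OmniVoice", True, True, ALL_PLATFORMS),
    Engine("CosyVoice 3", True, True, ALL_PLATFORMS),
    Engine("MLX-Audio", None, None, frozenset({"macos-arm"})),
    Engine("VoxCPM2", True, True, ALL_PLATFORMS),
]


def candidates(platform: str, need_clone: bool = False, need_instruct: bool = False) -> List[str]:
    """Names of engines that run on `platform` and are confirmed to have the needed features."""
    result = []
    for engine in ENGINES:
        if platform not in engine.platforms:
            continue
        if need_clone and engine.clone is not True:
            continue
        if need_instruct and engine.instruct is not True:
            continue
        result.append(engine.name)
    return result


print(candidates("linux", need_clone=True))  # ['OmniVoice', 'CosyVoice 3', 'VoxCPM2']
print(candidates("macos-arm"))  # ['OmniVoice', 'CosyVoice 3', 'MLX-Audio', 'VoxCPM2']
```

The chosen engine is then set in Settings → TTS Engine or through `OMNIVOICE_TTS_BACKEND`. The source does not list the values that variable accepts, so none are shown here.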

Verdict

OmniVoice Studio is a solid choice for developers and researchers who need a fully local, multi-language, multi-engine voice AI platform. Its zero-shot cloning from a 3-second clip, comprehensive video dubbing pipeline, and GPU-aware offloading make it versatile across hardware setups.

The standout is the MCP server exposing the entire voice and dubbing pipeline programmatically, enabling integration with agentic AI systems like Claude Code. This is rare and valuable for those working at the intersection of voice AI and autonomous coding agents.

The tradeoffs are the complexity of managing multiple TTS engines and uneven hardware compatibility: some engines have limited platform support, and CPU-only operation is fully supported but slower.

If you want a local-first voice AI platform with programmatic access to advanced voice cloning and dubbing, OmniVoice Studio is worth exploring. It’s not a plug-and-play consumer app but a powerful toolkit for hands-on developers and researchers.

→ GitHub Repo: debpalash/OmniVoice-Studio ⭐ 3,824 · Python

Noureddine RAMDI / OmniVoice Studio: a local-first multi-engine voice cloning and dubbing platform with MCP server integration