Voice Clone Studio: unified modular web UI for multi-engine voice cloning and TTS

Voice AI development often means juggling multiple repositories and tools, each with its own interface and audio formats. Voice Clone Studio tackles this head-on by consolidating several open-source and proprietary voice AI engines into a single modular Gradio web UI. Voice cloning, multi-speaker conversations, speech-to-speech conversion, and fine-tuning pipelines can then run from one interface instead of several separate tools.

What Voice Clone Studio does and how it’s built

Voice Clone Studio is a Python-based web application built on top of Gradio, designed to aggregate multiple TTS and voice cloning engines under one roof. It currently supports engines such as Qwen3-TTS, VibeVoice, LuxTTS, Chatterbox, Fish Speech, and MMAudio. These engines cover a broad spectrum of capabilities: from text-to-speech synthesis with premium preset voices to speech-to-speech voice conversion and multi-speaker dialogue generation.

The architecture is modular, with the UI split into dynamically loaded tabs. Each engine runs in its own self-contained tab, which can be toggled on or off based on user needs. This dynamic loading shortens startup and reduces resource usage, since only the enabled tabs load their heavy models.
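To make the toggleable-tab idea concrete, here is a minimal sketch of lazy, per-tab loading. It is not the project's actual code: the `TabSpec` class, the `build_enabled_tabs` function, and the string placeholders standing in for Gradio tabs are all hypothetical names chosen for illustration. The point is only that a disabled tab's builder is never called, so its model would never be loaded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TabSpec:
    """Hypothetical description of one engine tab."""
    name: str
    build: Callable[[], object]  # expensive: would load the engine's model
    enabled: bool = True


def build_enabled_tabs(specs: List[TabSpec]) -> Dict[str, object]:
    """Call the builder only for enabled tabs; disabled ones cost nothing."""
    built: Dict[str, object] = {}
    for spec in specs:
        if spec.enabled:
            built[spec.name] = spec.build()
    return built


# Track which builders actually ran, to show that disabled tabs are skipped.
loaded: List[str] = []


def make_builder(name: str) -> Callable[[], object]:
    def build() -> object:
        loaded.append(name)       # stands in for loading a heavy model
        return f"<{name} tab>"    # stands in for a real UI tab object
    return build


specs = [
    TabSpec("Qwen3-TTS", make_builder("Qwen3-TTS")),
    TabSpec("VibeVoice", make_builder("VibeVoice"), enabled=False),
    TabSpec("Chatterbox", make_builder("Chatterbox")),
]
tabs = build_enabled_tabs(specs)
print(sorted(tabs))
```

In a real Gradio app the builder would construct the tab's components and load the engine, but the on/off decision can stay this simple.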

Under the hood, the system shares voice samples across engines, caches voice prompts for faster repeated synthesis, and uses a unified script format to handle multi-speaker dialogues consistently across engines. This unified dialogue format is a strong design choice that eases switching between engines without rewriting scripts.
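The two ideas in this paragraph, a shared dialogue script and a voice prompt cache, can be sketched in a few lines. The article does not document the real script syntax or cache layout, so everything here is an assumption: the `[Speaker]: text` line format, the `parse_script` function, and the `PromptCache` class are hypothetical and only illustrate the general technique.

```python
import hashlib
import re
from typing import Callable, Dict, List, Tuple

# Assumed line format for illustration only: "[Speaker]: text"
LINE_RE = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.+)$")


def parse_script(script: str) -> List[Tuple[str, str]]:
    """Turn a multi-speaker script into (speaker, text) turns."""
    turns: List[Tuple[str, str]] = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"Unrecognized script line: {line!r}")
        turns.append((match.group("speaker").strip(), match.group("text").strip()))
    return turns


class PromptCache:
    """Hypothetical in-memory cache keyed by engine, sample audio, and transcript."""

    def __init__(self) -> None:
        self._store: Dict[str, object] = {}
        self.misses = 0

    @staticmethod
    def key(engine: str, sample_bytes: bytes, transcript: str) -> str:
        digest = hashlib.sha256()
        for part in (engine.encode("utf-8"), sample_bytes, transcript.encode("utf-8")):
            digest.update(part)
            digest.update(b"\0")  # separator so adjacent fields cannot blur together
        return digest.hexdigest()

    def get_or_compute(self, engine: str, sample_bytes: bytes, transcript: str,
                       compute: Callable[[], object]) -> object:
        k = self.key(engine, sample_bytes, transcript)
        if k not in self._store:
            self.misses += 1
            self._store[k] = compute()  # the expensive prompt extraction
        return self._store[k]


turns = parse_script("[Alice]: Hi there.\n\n[Bob]: Hello, Alice.")
cache = PromptCache()
first = cache.get_or_compute("engine-a", b"fake-wav-bytes", "Hi there.", lambda: {"prompt": 1})
second = cache.get_or_compute("engine-a", b"fake-wav-bytes", "Hi there.", lambda: {"prompt": 2})
```

Because the parsed turns are plain tuples, each engine adapter can consume the same list, which is what makes switching engines cheap. The second cache lookup returns the stored prompt without calling its `compute` function.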

The backend relies heavily on PyTorch with CUDA for GPU acceleration on Windows and Linux. For macOS users with Apple Silicon (M1/M2/M3/M4), it leverages MPS acceleration, though model training is disabled on macOS due to platform constraints. The project also integrates multiple ASR backends (Whisper, VibeVoice-ASR, Qwen3-ASR) for transcription and optionally connects to LLM backends like llama.cpp and Ollama for prompt generation.
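The platform split described above maps onto a simple device-selection rule. The sketch below is an assumption about how such a rule could look, not the project's own logic: `pick_device`, `detect_device`, and `training_supported` are hypothetical helpers. The PyTorch calls they rely on, `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, are standard.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available accelerators; report CPU if torch is absent."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not ship the MPS backend at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


def training_supported(device: str) -> bool:
    """Mirror the article's constraint: training is offered only with CUDA."""
    return device == "cuda"


if __name__ == "__main__":
    device = detect_device()
    print(f"device={device}, training_supported={training_supported(device)}")
```

Keeping `pick_device` free of any PyTorch import makes the rule easy to test on machines without a GPU.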

Modular multi-engine integration and voice cloning pipeline

What sets Voice Clone Studio apart is a modular design that brings engines with distinct capabilities and models into one cohesive UI. The tradeoff is complexity: each engine has its own dependencies and quirks, which the repo handles through toggleable tabs and platform-specific setup scripts.

The LoRA fine-tuning pipeline integration is another highlight. It supports training new voice models in roughly 10 to 30 minutes, depending on dataset size, which is reasonable for practical experimentation. However, this training feature is only available on Windows and Linux with CUDA GPUs; macOS users can't train models yet.

The codebase maintains consistent voice prompt caching and script formatting, which improves developer experience when switching between engines or running multi-speaker dialogues. However, the complexity of supporting multiple TTS engines means some tradeoffs in UI consistency and resource overhead.

The project requires a CUDA-compatible GPU with at least 8GB VRAM or Apple Silicon hardware. This is a practical limitation that follows from the model sizes (0.5B to 4B parameters) and their memory needs.
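A quick back-of-the-envelope calculation shows why the parameter counts translate into this VRAM requirement. The sketch assumes weights stored at 2 bytes per parameter (fp16 or bf16) and counts weights only. Activations, caches, and framework overhead come on top, and the article does not state which precision each engine actually uses.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "fp16") -> float:
    """Memory for model weights alone, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for params in (0.5e9, 4e9):
    print(f"{params / 1e9:.1f}B params: {weight_memory_gib(params):.2f} GiB of weights in fp16")
```

Under these assumptions the smallest models need about 1 GiB for weights, while a 4B-parameter model needs roughly 7.5 GiB, which may leave little headroom on an 8GB card.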

Installation and quick start

Prerequisites

Python 3.10-3.12 (3.11 recommended)
CUDA-compatible GPU (Windows/Linux) or Apple Silicon (macOS)
SOX for audio processing
FFMPEG for audio format conversion
Optional: llama.cpp or Ollama for LLM prompt generation
Optional: Flash Attention 2 (CUDA only)

Note: On macOS, model training is disabled and Whisper ASR is skipped due to compatibility issues.
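Before running a setup script, it can help to confirm that the prerequisites listed above are present. The following is a small, hypothetical preflight check, not something shipped with the repo. It assumes the SoX and FFmpeg executables are on `PATH` under their usual names, `sox` and `ffmpeg`, and it does not check the GPU or the optional components.

```python
import shutil
import sys
from typing import Dict, Tuple


def python_version_ok(version: Tuple[int, int]) -> bool:
    """True for Python 3.10 through 3.12, the range the article lists."""
    return (3, 10) <= version <= (3, 12)


def check_prerequisites() -> Dict[str, bool]:
    """Report which required tools are available on this machine."""
    return {
        "python": python_version_ok(sys.version_info[:2]),
        "sox": shutil.which("sox") is not None,
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```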

Quick setup commands

Windows

git clone https://github.com/FranckyB/Voice-Clone-Studio.git
cd Voice-Clone-Studio
setup-windows.bat

Linux

git clone https://github.com/FranckyB/Voice-Clone-Studio.git
cd Voice-Clone-Studio
chmod +x setup-linux.sh
./setup-linux.sh

macOS

Clone the repo and follow the manual setup instructions. Note that model training is not supported on macOS.

Verdict

Voice Clone Studio is a pragmatic platform that solves a real pain point for voice AI practitioners: juggling diverse TTS and cloning engines. Its modular Gradio UI design and unified dialogue scripting make it a solid choice for experimentation and development of multi-engine voice workflows.

The tradeoffs are clear: it demands non-trivial hardware (a CUDA GPU or Apple Silicon) and does not support model training on macOS. The complexity of supporting multiple engines also means some overhead in setup and maintenance.

It’s best suited for researchers, developers, and hobbyists who want to work across multiple voice AI models without switching between separate tools. If you have the hardware and patience for setup, this repo provides a versatile and extensible base for voice cloning and multi-speaker synthesis projects.

→ GitHub Repo: FranckyB/Voice-Clone-Studio ⭐ 479 · Python

Noureddine RAMDI / Voice Clone Studio: unified modular web UI for multi-engine voice cloning and TTS