Noureddine RAMDI / Voice Clone Studio: unified modular web UI for multi-engine voice cloning and TTS

Created Mon, 04 May 2026 10:23:01 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

FranckyB/Voice-Clone-Studio

Voice AI development often means juggling multiple repositories and tools, each with its own interface and audio formats. Voice Clone Studio tackles this head-on by consolidating several open-source and proprietary voice AI engines into a single modular Gradio web UI. Voice cloning, multi-speaker conversations, speech-to-speech conversion, and fine-tuning pipelines can then all be run from one interface instead of several.

What Voice Clone Studio does and how it’s built

Voice Clone Studio is a Python-based web application built on top of Gradio, designed to aggregate multiple TTS and voice cloning engines under one roof. It currently supports engines such as Qwen3-TTS, VibeVoice, LuxTTS, Chatterbox, Fish Speech, and MMAudio. These engines cover a broad spectrum of capabilities: from text-to-speech synthesis with premium preset voices to speech-to-speech voice conversion and multi-speaker dialogue generation.

The architecture is modular, with the UI split into dynamically loaded tabs. Each engine runs in its own self-contained tab, which can be toggled on or off based on user needs. This dynamic loading reduces startup time and memory use, since only the enabled tabs load their heavy models.
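To make the toggleable-tab idea concrete, here is a minimal sketch in plain Python. It is not the project's actual code: the `TabRegistry` class, the tab names, and the builder functions are all hypothetical, and a real tab would build Gradio components and load a model where this sketch returns a string.

```python
from typing import Callable, Dict, List


class TabRegistry:
    """Maps tab names to builder callables and builds only the enabled tabs."""

    def __init__(self) -> None:
        self._builders: Dict[str, Callable[[], object]] = {}
        self._built: Dict[str, object] = {}

    def register(self, name: str, builder: Callable[[], object]) -> None:
        # Registering is cheap: the builder is stored, not called.
        self._builders[name] = builder

    def build_enabled(self, enabled: Dict[str, bool]) -> List[str]:
        """Build each enabled tab once; return the names of all built tabs."""
        for name, builder in self._builders.items():
            if enabled.get(name, False) and name not in self._built:
                # A heavy model load would happen inside builder().
                self._built[name] = builder()
        return list(self._built)


# Hypothetical builders standing in for real engine tabs.
calls: List[str] = []
registry = TabRegistry()
registry.register("qwen3_tts", lambda: calls.append("qwen3_tts") or "qwen3 tab")
registry.register("fish_speech", lambda: calls.append("fish_speech") or "fish tab")

built = registry.build_enabled({"qwen3_tts": True, "fish_speech": False})
print(built)
```

The point of the design is that a disabled tab costs nothing beyond a dictionary entry: its builder is never called, so its model never reaches memory.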

Under the hood, the system shares voice samples across engines, caches voice prompts for faster repeated synthesis, and uses a unified script format to handle multi-speaker dialogues consistently across engines. This unified dialogue format is a strong design choice that eases switching between engines without rewriting scripts.
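The article does not show the script format or the cache layout, so the following is only a sketch of how the two ideas could look. The `[Speaker]: text` line syntax, the `parse_script` function, and the `PromptCache` class are assumptions made for illustration, not the repo's real format or API.

```python
import hashlib
import re
from typing import Callable, Dict, List, Tuple

# Assumed line syntax: "[Speaker]: text". The real format may differ.
LINE_RE = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.+)$")


def parse_script(script: str) -> List[Tuple[str, str]]:
    """Turn a multi-speaker script into (speaker, text) turns."""
    turns: List[Tuple[str, str]] = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines separate turns and carry no content
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"Unrecognised script line: {line!r}")
        turns.append((match.group("speaker").strip(), match.group("text").strip()))
    return turns


class PromptCache:
    """Caches per-engine voice prompts, keyed by a hash of the sample audio."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], object] = {}
        self.misses = 0

    def get(self, engine: str, sample: bytes, build: Callable[[bytes], object]) -> object:
        key = (engine, hashlib.sha256(sample).hexdigest())
        if key not in self._store:
            self.misses += 1  # the expensive prompt extraction runs only here
            self._store[key] = build(sample)
        return self._store[key]


turns = parse_script("[Alice]: Hi there.\n\n[Bob]: Hello!")
cache = PromptCache()
first = cache.get("engine_a", b"fake-audio", lambda s: len(s))
second = cache.get("engine_a", b"fake-audio", lambda s: len(s))
print(turns, cache.misses)
```

Because the parsed turns are engine-agnostic, the same script can be handed to any engine, and including the engine name in the cache key keeps prompts from different engines apart even when they share a voice sample.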

The backend relies heavily on PyTorch with CUDA for GPU acceleration on Windows and Linux. For macOS users with Apple Silicon (M1/M2/M3/M4), it leverages MPS acceleration, though model training is disabled on macOS due to platform constraints. The project also integrates multiple ASR backends (Whisper, VibeVoice-ASR, Qwen3-ASR) for transcription and optionally connects to LLM backends like llama.cpp and Ollama for prompt generation.
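The platform split described above usually comes down to a device check at startup. The sketch below uses the standard PyTorch calls `torch.cuda.is_available()` and `torch.backends.mps.is_available()`; the function names, the CPU fallback, and the `training_supported` rule are assumptions that mirror the article's description rather than the project's code.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"  # assumed fallback; the article lists a GPU as required


def detect_device() -> str:
    """Query PyTorch for the available accelerator, if PyTorch is installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


def training_supported(device: str) -> bool:
    """Mirror the article's rule: training only runs on CUDA."""
    return device == "cuda"


device = detect_device()
print(device, training_supported(device))
```

Separating the pure decision (`pick_device`) from the hardware query (`detect_device`) keeps the selection logic testable on machines without a GPU.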

Modular multi-engine integration and voice cloning pipeline

What sets Voice Clone Studio apart is its modular design that consolidates various voice AI engines with distinct capabilities and models into one cohesive UI. The tradeoff here is complexity: each engine has its own dependencies and quirks, and the repo handles this via toggleable tabs and platform-specific setup scripts.

The LoRA fine-tuning pipeline integration is another highlight. It supports training new voice models in roughly 10 to 30 minutes, depending on dataset size, which is reasonable for practical experimentation. However, training is only available on Windows and Linux with CUDA GPUs; macOS users cannot train models yet.

The codebase maintains consistent voice prompt caching and script formatting, which improves developer experience when switching between engines or running multi-speaker dialogues. However, the complexity of supporting multiple TTS engines means some tradeoffs in UI consistency and resource overhead.

The project requires a CUDA-compatible GPU with at least 8 GB of VRAM, or Apple Silicon hardware. This is a practical limitation given the model sizes (ranging from 0.5B to 4B parameters) and their memory needs.
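A quick back-of-the-envelope calculation shows why the 8 GB figure is plausible. The sketch assumes weights are stored in 16-bit precision (2 bytes per parameter), which the article does not state, and it counts weights only: activations, caches, and framework overhead come on top.

```python
def weight_memory_gib(params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for model weights alone, in GiB (1 GiB = 1024**3 bytes)."""
    return params * bytes_per_param / 1024**3


# Model sizes quoted in the article; 2 bytes per parameter is an assumption.
smallest = weight_memory_gib(0.5e9)
largest = weight_memory_gib(4e9)
print(f"0.5B params: {smallest:.2f} GiB, 4B params: {largest:.2f} GiB")
# prints "0.5B params: 0.93 GiB, 4B params: 7.45 GiB"
```

Under that assumption the largest model's weights alone come close to filling an 8 GB card, so the stated minimum leaves little headroom at the top of the range.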

Installation and quick start

Prerequisites

  • Python 3.10-3.12 (3.11 recommended)
  • CUDA-compatible GPU (Windows/Linux) or Apple Silicon (macOS)
  • SOX for audio processing
  • FFMPEG for audio format conversion
  • Optional: llama.cpp or Ollama for LLM prompt generation
  • Optional: Flash Attention 2 (CUDA only)

Note: On macOS, model training is disabled and Whisper ASR is skipped due to compatibility issues.
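Before running a setup script, the prerequisites above can be checked with a few lines of standard-library Python. This is a convenience sketch, not part of the repo: the function names are hypothetical, and it assumes the tools are installed under their usual executable names `sox` and `ffmpeg`.

```python
import shutil
import sys
from typing import Callable, List, Optional, Sequence


def python_ok(version: Sequence[int] = sys.version_info) -> bool:
    """True if the interpreter is within the supported 3.10 to 3.12 range."""
    return (3, 10) <= (version[0], version[1]) <= (3, 12)


def missing_tools(
    names: Sequence[str] = ("sox", "ffmpeg"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required command-line tools that are not on PATH."""
    return [name for name in names if which(name) is None]


if not python_ok():
    print("Unsupported Python version:", sys.version.split()[0])
missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` searches the current PATH.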

Quick setup commands

Windows

    git clone https://github.com/FranckyB/Voice-Clone-Studio.git
    cd Voice-Clone-Studio
    setup-windows.bat

Linux

    git clone https://github.com/FranckyB/Voice-Clone-Studio.git
    cd Voice-Clone-Studio
    chmod +x setup-linux.sh
    ./setup-linux.sh

macOS

Clone the repo and follow the manual setup instructions. Model training is not supported on macOS.

Verdict

Voice Clone Studio is a pragmatic platform that solves a real pain point for voice AI practitioners: juggling diverse TTS and cloning engines. Its modular Gradio UI design and unified dialogue scripting make it a solid choice for experimentation and development of multi-engine voice workflows.

The tradeoffs are clear: it demands non-trivial hardware (a CUDA GPU or Apple Silicon) and offers no model training on macOS. Supporting multiple engines also adds overhead in setup and maintenance.

It’s best suited for researchers, developers, and hobbyists who want to work across multiple voice AI models without switching between separate tools. If you have the hardware and patience for setup, this repo provides a versatile and extensible base for voice cloning and multi-speaker synthesis projects.


→ GitHub Repo: FranckyB/Voice-Clone-Studio ⭐ 479 · Python