Voice-Pro: chaining Whisper, translation, and voice cloning in a portable Gradio app

Voice-Pro is a rare example of a speech AI pipeline that combines multiple complex models into a single, portable application with a web UI. It chains speech-to-text (STT), translation, and text-to-speech (TTS) with zero-shot voice cloning, powered by Whisper variants, Deep-Translator, and TTS engines such as CosyVoice and Edge-TTS. The engineering tradeoff is clear: bundling models like CosyVoice2-0.5B (9GB) and WhisperX into one Gradio app is convenient, but it demands a large download and a consumer NVIDIA GPU with CUDA 12.4.

What Voice-Pro does and its architecture

Voice-Pro consolidates speech recognition, multilingual translation, and voice cloning-based TTS into a single Python 3.10 application, served by a Gradio 5.14.0 web interface. The core pipeline is a sequence of AI models chained to process audio input through multiple stages:

- Speech-to-text (STT) using multiple Whisper variants: Whisper, Faster-Whisper, Whisper-Timestamped, and WhisperX. These provide options balancing speed, accuracy, and timestamping.
- Translation with Deep-Translator, which supports 100+ languages and enables multilingual dubbing.
- Zero-shot voice cloning TTS via models like F5-TTS, E2-TTS, and CosyVoice, which can mimic voices without retraining.
- Additional TTS options, including Edge-TTS and kokoro (ranked #2 in the Hugging Face TTS Arena).
- Audio processing utilities like yt-dlp for YouTube downloads and Demucs for vocal isolation.
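
The chain described above can be sketched as three pluggable stages. This is a minimal illustration, not Voice-Pro's actual code: the `DubbingPipeline` class, the stage signatures, and the stub functions are all hypothetical names chosen for this example, and the stubs stand in for real STT, translation, and TTS backends.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions made for this sketch).
Transcriber = Callable[[str], str]       # audio path -> source-language text
Translator = Callable[[str, str], str]   # (text, target language) -> translated text
Synthesizer = Callable[[str, str], str]  # (text, reference voice path) -> output description


@dataclass
class DubbingPipeline:
    """Runs STT, translation, and voice-cloning TTS in sequence."""
    transcribe: Transcriber
    translate: Translator
    synthesize: Synthesizer

    def run(self, audio_path: str, target_lang: str, voice_ref: str) -> str:
        text = self.transcribe(audio_path)
        translated = self.translate(text, target_lang)
        return self.synthesize(translated, voice_ref)


# Stub stages so the sketch runs without any models installed.
def fake_stt(audio_path: str) -> str:
    return "hello world"


def fake_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def fake_tts(text: str, voice_ref: str) -> str:
    return f"dubbed({text!r}, voice={voice_ref})"


pipeline = DubbingPipeline(fake_stt, fake_translate, fake_tts)
result = pipeline.run("input.wav", "ko", "speaker.wav")
print(result)
```

Because each stage is only a callable, replacing a stub with a wrapper around a real backend would not change the `run` method.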

Under the hood, Voice-Pro targets NVIDIA GPUs with CUDA 12.4, requiring at least 4GB of VRAM and recommending 8GB or more. It uses PyTorch 2.5.1+cu124 for model inference. Storage requirements exceed 20GB, mainly because of large TTS models like CosyVoice2-0.5B (9GB). The app primarily targets 64-bit Windows 10/11, with Linux and Mac support via shell scripts.
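
Before committing to the long download, these requirements can be checked with a short preflight script. The sketch below is not part of Voice-Pro; the function names and thresholds are assumptions taken from the figures quoted above (4GB minimum VRAM, 8GB preferred, 20GB of free storage). The PyTorch import is optional, so the script still runs where PyTorch is not installed.

```python
import shutil

# Thresholds taken from the requirements quoted in this article.
MIN_VRAM_GIB = 4
PREFERRED_VRAM_GIB = 8
MIN_FREE_DISK_GIB = 20


def vram_tier(total_gib: float) -> str:
    """Classify a VRAM amount against the stated requirements."""
    if total_gib < MIN_VRAM_GIB:
        return "below-minimum"
    if total_gib < PREFERRED_VRAM_GIB:
        return "minimum"
    return "recommended"


def free_disk_gib(path: str = ".") -> float:
    """Free space on the volume containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def detect_vram_gib():
    """Return total VRAM of GPU 0 in GiB, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    vram = detect_vram_gib()
    if vram is None:
        print("No CUDA GPU detected (or PyTorch is not installed)")
    else:
        print(f"VRAM: {vram:.1f} GiB ({vram_tier(vram)})")
    disk = free_disk_gib()
    status = "ok" if disk >= MIN_FREE_DISK_GIB else "too little"
    print(f"Free disk: {disk:.1f} GiB ({status})")
```

Note that a card sold as 8GB usually reports slightly less than 8 GiB, so it may land in the "minimum" tier here; a real check would likely allow some tolerance.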

The design centers on a Gradio WebUI to provide an accessible interface without complex setup, suitable for users who want a local speech pipeline without juggling multiple services or APIs.

What makes Voice-Pro's approach interesting

The standout aspect of Voice-Pro is its multi-model orchestration in a single pipeline, which is not trivial given the size and complexity of the models involved. Combining multiple Whisper variants allows users to select a tradeoff between speed and detail (e.g., timestamping). This flexibility is valuable for different use cases, from quick transcription to detailed dubbing.

Zero-shot voice cloning is another highlight. Models like CosyVoice and F5-TTS can clone a voice from an audio sample without fine-tuning, a capability that usually requires extensive training. Integrating these into a user-friendly app lowers the barrier to experimenting with voice synthesis.

However, this power comes with tradeoffs:

- Large download and storage footprint: CosyVoice2-0.5B alone is 9GB, and the initial setup can take over an hour to download all dependencies.
- Hardware requirements: While 4GB of VRAM is the minimum, 8GB or more is recommended to run the models smoothly. This excludes lower-end GPUs and integrated graphics.
- Performance: Running these heavy models on consumer hardware adds noticeable latency, especially on first runs, when models are loaded and PyTorch goes through its one-time warm-up.
- Portability vs. complexity: The app is portable and Windows-first, which is great for ease of use, but bundling multiple frameworks and dependencies (CUDA, ffmpeg, yt-dlp, Demucs) increases the complexity behind the scenes.

The code quality reflects a pragmatic engineering approach. Batch scripts (configure.bat, start.bat) handle setup and launching, which simplifies the developer experience on Windows. The app is modular enough to swap STT or TTS backends, showing some architectural foresight.
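
One common way to make backends swappable is a registry of lazily loaded factories. The sketch below illustrates that general pattern only; it is an assumption, not a description of how Voice-Pro is organized. `STT_BACKENDS`, `register_stt`, and `get_stt` are hypothetical names, and the factories return placeholder functions instead of loading real models.

```python
from functools import lru_cache
from typing import Callable, Dict

# Hypothetical registry: backend name -> factory that builds a transcribe function.
STT_BACKENDS: Dict[str, Callable[[], Callable[[str], str]]] = {}


def register_stt(name: str):
    """Decorator that records a backend factory under `name`."""
    def decorator(factory):
        STT_BACKENDS[name] = factory
        return factory
    return decorator


@register_stt("whisper")
def _load_whisper():
    # Placeholder: a real factory would load model weights here.
    return lambda audio_path: f"whisper transcript of {audio_path}"


@register_stt("faster-whisper")
def _load_faster_whisper():
    # Placeholder, as above.
    return lambda audio_path: f"faster-whisper transcript of {audio_path}"


@lru_cache(maxsize=None)
def get_stt(name: str) -> Callable[[str], str]:
    """Build a backend on first use and reuse it afterwards."""
    if name not in STT_BACKENDS:
        raise ValueError(f"unknown STT backend {name!r}; choose from {sorted(STT_BACKENDS)}")
    return STT_BACKENDS[name]()


transcribe = get_stt("faster-whisper")
print(transcribe("clip.wav"))
```

Caching the loaded backend means the expensive model load would happen only on the first request for each name, which is where most of the first-run latency comes from.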

Quick start

The README provides a straightforward installation and startup procedure focused on Windows, with shell script equivalents for Linux and Mac.

## System Requirements
- OS: Windows 10/11 (64-bit), Linux, Mac
- GPU: NVIDIA with CUDA 12.4 (recommended)
- VRAM: 4GB+ (8GB+ preferred)
- RAM: 4GB+
- Storage: 20GB+ free space
- Internet: Required

## Installation

Install Voice-Pro with ease using configure.bat and start.bat (use configure.sh and start.sh on Mac/Linux).

### 1. Get the Package

  + Clone the repository, or download the latest release (Source code (zip)):  
    git clone https://github.com/abus-aikorea/voice-pro.git

### 2. Install & Run
1. configure.bat
   - Sets up git, ffmpeg, and CUDA (if NVIDIA GPU)
   - Run once; takes 1+ hour with internet
   - Don't close the command window
2. start.bat
   - Launches Voice-Pro WebUI
   - First run installs dependencies (1+ hour)
   - Retry after deleting installer_files if issues arise

### 3. Update
- update.bat: Refreshes Python environment (faster than reinstall)

### 4. Uninstall
- Run uninstall.bat or delete the folder (portable install)

This setup emphasizes a portable installation without global system changes. The batch scripts handle environment setup, dependency downloads, and CUDA configuration, which can be a pain point in many ML projects.

Verdict

Voice-Pro is a solid choice if you want a local, all-in-one speech pipeline combining Whisper-based STT, multilingual translation, and advanced zero-shot voice cloning TTS. It’s especially relevant for developers and AI enthusiasts with NVIDIA GPUs who want to experiment with full dubbing pipelines without relying on cloud services.

The tradeoffs around hardware requirements, storage, and initial setup time are real and should be factored into your decision. If you have limited VRAM or want a lightweight setup, this might be overkill. But if your use case demands flexibility in STT variants, voice cloning, and translation in a single package with a simple UI, Voice-Pro delivers a unique bundle.

Development is currently paused, so the project might not evolve soon, but the open-source release captures a mature snapshot worth exploring. The code quality and modular design make it a good reference for anyone building multi-model AI pipelines with Gradio.

Tags: python, speech-to-text, tts, voice-cloning, whisper, gradio, translation

→ GitHub Repo: abus-aikorea/voice-pro ⭐ 7,623 · Python

Noureddine RAMDI / Voice-Pro: chaining Whisper, translation, and voice cloning in a portable Gradio app
