Running text-to-speech (TTS) locally remains a challenge for many developers who want to reduce cloud latency, cut costs, or ensure privacy. Supertonic-3 offers a Python-based TTS library that tackles these issues with a solid balance: it runs fully on-device using ONNX runtime, supports a broad set of languages and voices, and exposes an OpenAI-compatible HTTP API endpoint. This means you can replace cloud TTS services in your existing OpenAI clients by simply swapping the API URL.
What supertonic-py does and how it is built
Supertonic-3 is a Python TTS library designed for on-device inference. It uses ONNX Runtime to run neural TTS models locally, which allows for fast synthesis without relying on cloud services. The project supports 31 languages and ships with 10 built-in voices. A standout feature is zero-shot voice cloning through Voice Builder JSON exports, which lets you create custom voice profiles without retraining.
The library is delivered both as a Python SDK and as a local HTTP server application. The SDK can be installed via pip (pip install supertonic), while the server mode adds FastAPI and Uvicorn (pip install 'supertonic[serve]') and runs a local service exposing several endpoints.
Under the hood, Supertonic relies on just four core Python dependencies:
- onnxruntime for efficient ONNX model inference
- numpy for numerical operations
- soundfile for audio file input/output
- huggingface-hub to download the large (~400MB) TTS model on first use
The model is automatically fetched on demand to keep the initial install lightweight.
The HTTP server exposes two key endpoints:
- /v1/tts: Native TTS API with full parameter control, batch synthesis (up to 64 items), and voice style import/management APIs.
- /v1/audio/speech: An OpenAI-compatible alias endpoint, meaning any client designed to work with OpenAI’s TTS API can switch to Supertonic simply by changing the base URL.
This API compatibility is particularly useful for developers who want to deploy local TTS without rewriting or adapting existing client code.
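To make the base-URL swap concrete, here is a minimal sketch that builds a request against the OpenAI-compatible endpoint using only the standard library. The JSON fields (`model`, `input`, `voice`) follow the shape of OpenAI's speech API. The model name `"supertonic"` and the voice value passed by the caller are placeholders assumed for illustration, not values confirmed by the project, so check the interactive docs on your server for what it actually accepts.

```python
import json
import urllib.request

# Host and port match the `supertonic serve` command shown in this article.
BASE_URL = "http://127.0.0.1:7788/v1"


def build_speech_request(text, voice, model="supertonic", base_url=BASE_URL):
    """Build (but do not send) a POST to the OpenAI-compatible speech endpoint.

    `model` and `voice` are placeholder values: the names the server accepts
    are an assumption here, not taken from the project documentation.
    """
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, voice, path):
    """Send the request and write the returned audio bytes to `path`."""
    request = build_speech_request(text, voice)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

With the server running, `synthesize_to_file("Hello there", "<voice name>", "hello.wav")` would write whatever audio the server returns. Pointing `base_url` back at a cloud provider is the only change needed to switch targets, which is the point of the alias endpoint.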
What makes supertonic-py technically interesting
The key technical strength of Supertonic lies in its combination of on-device neural TTS with broad language and voice support, plus its API compatibility with OpenAI TTS endpoints.
Using ONNX Runtime is a pragmatic choice: it balances performance and portability across hardware without locking users into a specific deep learning framework such as PyTorch or TensorFlow. This keeps dependencies small and improves the chance of running in diverse environments, from desktops to robotics or home automation setups.
The zero-shot voice cloning is another notable feature. Instead of requiring fine-tuning or retraining models for custom voices, Supertonic accepts JSON exports from a Voice Builder tool, enabling rapid voice style imports. This is a clear DX win for teams needing flexible voice options without heavy ML expertise.
The server’s batch synthesis capability (up to 64 items per request) helps optimize throughput for bulk TTS tasks.
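Because a single request is capped at 64 items, a client with a longer list of texts has to split it first. The helper below is a sketch of that client-side step only. It does not assume anything about the batch request format itself, and the function name is hypothetical.

```python
MAX_BATCH_ITEMS = 64  # per-request limit stated in this article


def chunk_texts(texts, max_items=MAX_BATCH_ITEMS):
    """Split `texts` into consecutive lists of at most `max_items` entries."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    return [texts[i:i + max_items] for i in range(0, len(texts), max_items)]
```

Each returned chunk can then be sent as one batch request, preserving the original order of the texts across chunks.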
However, there are tradeoffs. The model size (~400MB) is relatively large for an on-device TTS engine, which could be a limitation on very resource-constrained devices. The initial model download adds latency to the first run. Also, the library targets Python environments, which may not be ideal for all embedded systems.
The codebase intentionally keeps dependencies minimal, which simplifies maintenance and reduces conflicts.
Quick start with supertonic-py
Installing and running Supertonic is straightforward. From the README, these are the exact commands to get started:
pip install supertonic
For using the local HTTP server with OpenAI-compatible endpoints:
pip install 'supertonic[serve]'
supertonic serve --host 127.0.0.1 --port 7788
By default, the server binds to 127.0.0.1, so it is reachable only from the local machine. Binding to any other interface triggers a warning, which is meant to steer you toward deploying behind a reverse proxy instead of exposing the server directly.
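If you wrap the server in your own launcher script, you can apply the same caution before starting it. The check below is an illustrative sketch built on the standard `ipaddress` module. It is not the project's own implementation, and the function name is hypothetical.

```python
import ipaddress


def is_loopback_host(host):
    """Return True if `host` refers to the local machine only."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name): treat as non-local to be safe.
        return False
```

A launcher could call `is_loopback_host(host)` and print its own warning, or refuse to start, whenever the result is False.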
Once running, you can access:
- Synthesis endpoint: http://127.0.0.1:7788/v1/tts
- OpenAI-compatible endpoint: http://127.0.0.1:7788/v1/audio/speech
- Interactive API docs: http://127.0.0.1:7788/docs
The Python SDK usage aligns with the API parameters, making it easy to integrate TTS generation in scripts or applications.
Verdict
Supertonic-3 is relevant for developers and teams who need a local, on-device TTS engine that supports multiple languages and voices, including zero-shot voice cloning, without relying on cloud services. The OpenAI-compatible API endpoint is a smart design choice that lowers the barrier for adoption in existing projects.
The tradeoffs are clear: the model size and initial download may be heavy for some edge devices, and the Python dependency requires suitable runtime environments. But for desktop applications, robotics, home automation, or privacy-conscious deployments, Supertonic offers a solid balance of functionality, performance, and developer experience.
If you want to replace cloud TTS with a local engine while retaining API compatibility, or need multi-language support with flexible voice cloning, Supertonic is worth exploring. Just plan for the model download and resource footprint upfront.
Overall, the project is a practical, well-structured approach to on-device neural TTS in Python, with a clean API and minimal dependencies.
→ GitHub Repo: supertone-inc/supertonic-py ⭐ 52 · Python