Flexible chunk-size Whisper inference with optimized on-device engines in TheWhisper

The original Whisper model from OpenAI processes audio in fixed 30-second chunks, which sets a hard lower bound on latency for streaming transcription: a partial result cannot arrive before its chunk has been filled. TheWhisper from TheStageAI tackles this problem by fine-tuning Whisper variants to support flexible chunk sizes (10, 15, 20, or 30 seconds), enabling lower-latency streaming speech-to-text inference.

Flexible chunk-size Whisper inference with platform-specific optimized engines

At its core, TheWhisper provides fine-tuned Whisper models adapted to smaller chunk sizes. This is more than just cutting audio segments shorter; the model is optimized to maintain transcription accuracy despite reduced context. Naively shrinking chunks on the original Whisper often degrades performance, so these fine-tuned variants help balance latency and accuracy.
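To make the chunk-size idea concrete, here is a minimal sketch of the audio side only: cutting a 16 kHz mono waveform into fixed-length chunks and zero-padding the last one. The function name `split_into_chunks` is hypothetical and is not part of TheWhisper; the sketch only illustrates what a smaller chunk size means for the input, not how the fine-tuned models handle it.

```python
SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio
SUPPORTED_CHUNK_SECONDS = (10, 15, 20, 30)  # chunk sizes named in this article


def split_into_chunks(audio: list[float], chunk_seconds: int) -> list[list[float]]:
    """Split a mono waveform into fixed-length chunks, zero-padding the last one.

    Hypothetical helper for illustration; not TheWhisper's own preprocessing.
    """
    if chunk_seconds not in SUPPORTED_CHUNK_SECONDS:
        raise ValueError(f"chunk_seconds must be one of {SUPPORTED_CHUNK_SECONDS}")
    chunk_samples = chunk_seconds * SAMPLE_RATE
    chunks = []
    for start in range(0, len(audio), chunk_samples):
        chunk = list(audio[start:start + chunk_samples])
        # Pad the trailing chunk with silence so every chunk has the same length.
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        chunks.append(chunk)
    return chunks


# 25 seconds of silence as a stand-in for real audio.
audio = [0.0] * (25 * SAMPLE_RATE)
chunks = split_into_chunks(audio, chunk_seconds=10)
print(len(chunks), [len(c) / SAMPLE_RATE for c in chunks])  # 3 [10.0, 10.0, 10.0]
```

With 10-second chunks the 25-second clip yields three model inputs instead of one padded 30-second window, which is where the earlier partial results come from.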

The repo ships optimized inference engines targeting two major platforms:

CoreML engine for Apple Silicon (macOS): This engine runs efficiently on Apple Silicon, consuming about 2 watts of power and roughly 2 GB of RAM for inference. It’s designed for on-device speech-to-text with a low footprint, making it practical for local transcription without needing cloud resources.
CUDA engine for NVIDIA GPUs: For users with NVIDIA hardware, TheWhisper offers CUDA-optimized engines that prioritize throughput. Benchmarks show it can reach up to 220 tokens per second on an L40S GPU using the whisper-large-v3 model. The minimum RAM required on NVIDIA is about 2.5 GB, with 5 GB recommended for the large-v3 model.

The project also supports word-level timestamps and multilingual transcription, which are important features for real-time applications and detailed analysis.
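Word-level timestamps interact with chunking: if an engine reports word times relative to the start of each chunk, they have to be shifted by the chunk's offset to line up with the full recording. The sketch below assumes back-to-back, non-overlapping chunks and a hypothetical `Word` record; the article does not describe TheWhisper's actual output format, so treat the data shape as an assumption.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word record; not TheWhisper's real output type."""
    text: str
    start: float  # seconds
    end: float    # seconds


def to_absolute(words: list[Word], chunk_index: int, chunk_seconds: float) -> list[Word]:
    """Shift chunk-relative word times onto the timeline of the full recording.

    Assumes chunks are contiguous and do not overlap.
    """
    offset = chunk_index * chunk_seconds
    return [Word(w.text, w.start + offset, w.end + offset) for w in words]


# Words from the second 10-second chunk (index 1), timed relative to that chunk.
second_chunk = [Word("hello", 0.5, 0.9), Word("world", 1.0, 1.5)]
absolute = to_absolute(second_chunk, chunk_index=1, chunk_seconds=10)
for word in absolute:
    print(f"{word.text}: {word.start:.1f}s to {word.end:.1f}s")
```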

On the API side, TheWhisper exposes a Python interface that supports streaming pipelines, making it easier to integrate into real-time systems or local desktop applications.
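The shape of such a streaming pipeline can be sketched without depending on TheWhisper's actual API, which this article does not show. In the sketch below, `stream_transcripts` and `fake_transcribe` are hypothetical names; in a real integration the `transcribe_chunk` callable would wrap whatever call the library provides.

```python
from collections.abc import Callable, Iterable, Iterator

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio


def stream_transcripts(
    frames: Iterable[list[float]],
    transcribe_chunk: Callable[[list[float]], str],
    chunk_seconds: int = 10,
) -> Iterator[str]:
    """Buffer incoming audio frames and yield one transcript per full chunk.

    `transcribe_chunk` is a placeholder for the real engine call.
    """
    chunk_samples = chunk_seconds * SAMPLE_RATE
    buffer: list[float] = []
    for frame in frames:
        buffer.extend(frame)
        # Emit a result as soon as a full chunk of audio is available.
        while len(buffer) >= chunk_samples:
            chunk, buffer = buffer[:chunk_samples], buffer[chunk_samples:]
            yield transcribe_chunk(chunk)
    if buffer:
        # Flush whatever audio is left when the stream ends.
        yield transcribe_chunk(buffer)


def fake_transcribe(chunk: list[float]) -> str:
    """Stand-in transcriber that only reports how much audio it received."""
    return f"<{len(chunk) / SAMPLE_RATE:.1f}s of audio>"


# Simulate a microphone delivering 25 one-second frames.
frames = ([0.0] * SAMPLE_RATE for _ in range(25))
results = list(stream_transcripts(frames, fake_transcribe, chunk_seconds=10))
print(results)  # ['<10.0s of audio>', '<10.0s of audio>', '<5.0s of audio>']
```

Writing the loop as a generator lets the caller handle each partial result as soon as it is ready, rather than waiting for the whole recording.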

Balancing latency, resource use, and accuracy: tradeoffs and technical highlights

The most obvious advantage of TheWhisper is breaking Whisper’s fixed 30-second chunk size, which lowers the latency floor in proportion to the chunk length. Smaller chunks mean partial transcription results arrive more frequently. However, this introduces tradeoffs:

Context window: Smaller chunks provide less audio context to the model, which can affect transcription quality. The fine-tuned models mitigate this but cannot fully replicate the context of longer chunks.
Latency: Smaller chunks lower latency, but the chunk length still sets the floor. With 10-second chunks, the first result cannot arrive until roughly 10 seconds of audio have been captured and processed.
Resource requirements: The CoreML engine is optimized for low power and memory use on Apple Silicon, which is impressive given the model size. The CUDA engine achieves high throughput but requires an NVIDIA GPU with enough memory (about 2.5 GB minimum, 5 GB recommended for large-v3).
Platform-specific optimizations: Shipping different inference engines for Apple Silicon and NVIDIA GPUs means maintaining two optimization paths, but it enables better performance on each platform’s characteristics.
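The latency tradeoff can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a benchmark: it treats time to first result as the chunk duration plus decode time, uses the 220 tokens per second L40S figure quoted above, and assumes an illustrative 3 tokens per second of speech. It ignores encoder time and any pipeline overhead.

```python
def first_result_latency(
    chunk_seconds: float, tokens_per_chunk: int, tokens_per_second: float
) -> float:
    """Estimate time to first partial transcript: fill one chunk, then decode it."""
    return chunk_seconds + tokens_per_chunk / tokens_per_second


TOKENS_PER_SECOND = 220.0     # L40S throughput figure quoted in this article
TOKENS_PER_AUDIO_SECOND = 3   # illustrative assumption, not a measured value

for chunk_seconds in (10, 15, 20, 30):
    tokens = chunk_seconds * TOKENS_PER_AUDIO_SECOND
    latency = first_result_latency(chunk_seconds, tokens, TOKENS_PER_SECOND)
    print(f"{chunk_seconds:>2}s chunk -> about {latency:.2f}s to first result")
```

Under these assumptions the wait for a full chunk dominates the decode time, which is why chunk size rather than raw throughput sets the latency floor.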

The codebase is Python-centric, exposing an API designed for streaming transcription workflows. This is important for developer experience (DX), as building real-time transcription apps requires smooth data streaming, partial result handling, and timestamping.

The project also offers free access to optimized NVIDIA engines for small organizations (up to 4 GPUs per year) via TheStage AI ElasticModels, which can help lower the barrier to entry.

Quick start with TheWhisper

The repository includes clear instructions for installation targeting different platforms and optimizations. Here is the quick start based on the README:

# Clone the repository

git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper

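For Apple Silicon (the extra is quoted because zsh, the default macOS shell, treats unquoted square brackets as a glob pattern):

pip install '.[apple]'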

For NVIDIA GPUs:

pip install '.[nvidia]'

For NVIDIA with TheStage AI optimized engines:

pip install 'thestage-elastic-models[nvidia]==0.1.7' --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple --extra-index-url https://pypi.nvidia.com --extra-index-url https://pypi.org/simple
pip install '.[nvidia]'
pip install thestage

For Jetson Thor with TheStage AI optimized engines (make sure you have tensorrt==10.13.3.9 installed):

pip install 'thestage-elastic-models[thor]==0.1.7' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-jetson-thor/simple -i https://pypi.jetson-ai-lab.io/sbsa/cu130/+simple/ --extra-index-url https://pypi.org
pip install .
pip install thestage

After installation, generate an access token from TheStage AI Platform and configure it:

thestage config set -t <YOUR_API_TOKEN>

These commands provide a straightforward way to get started with TheWhisper on supported platforms.
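If you script the setup, choosing between the `apple` and `nvidia` extras can be automated. The helper below is a hypothetical sketch using only the Python standard library; it is not part of TheWhisper. It does not cover the Jetson Thor path or the TheStage AI optimized engines, which need the extra index URLs shown above.

```python
import platform
import shutil
from typing import Optional


def suggest_extra(
    system: Optional[str] = None,
    machine: Optional[str] = None,
    has_nvidia_smi: Optional[bool] = None,
) -> Optional[str]:
    """Return 'apple', 'nvidia', or None based on the host platform.

    Arguments default to values detected from the current machine; passing
    them explicitly makes the function easy to test.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if has_nvidia_smi is None:
        # Presence of nvidia-smi on PATH is used as a rough proxy for an NVIDIA driver.
        has_nvidia_smi = shutil.which("nvidia-smi") is not None
    if system == "Darwin" and machine == "arm64":
        return "apple"
    if has_nvidia_smi:
        return "nvidia"
    return None


extra = suggest_extra()
if extra is None:
    print("No supported accelerator detected.")
else:
    print(f"Suggested install: pip install '.[{extra}]'")
```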

Verdict: who should consider TheWhisper?

TheWhisper is relevant for developers needing streaming speech-to-text transcription with more flexible latency options than the original Whisper model allows. If you want to build local or edge transcription apps on Apple Silicon or NVIDIA GPUs and need lower latency than 30-second chunks provide, this repo offers a practical solution.

The tradeoffs around chunk size, context, and resource requirements are worth understanding before adopting. Smaller chunks reduce latency but do not eliminate it, because the chunk duration remains the lower bound. Accuracy can suffer from the reduced context, although the fine-tuned models recover part of that loss.

The dual-engine approach is sensible: CoreML for low-power Apple devices and CUDA for high-throughput NVIDIA GPUs. The optimized engines and free access tiers for small orgs are helpful to get started without expensive resources.

Overall, TheWhisper is a solid choice if you want to experiment with or deploy Whisper-based streaming ASR on-device with flexible chunk sizes. It’s less suited if you require latency below the chunk duration or have constrained hardware other than Apple Silicon or supported NVIDIA GPUs.

→ GitHub Repo: TheStageAI/TheWhisper ⭐ 878 · Python

Noureddine RAMDI / Flexible chunk-size Whisper inference with optimized on-device engines in TheWhisper
