TurboOCR: a GPU-accelerated OCR server optimized for raw pixel input and high throughput

TurboOCR tackles a common OCR bottleneck by skipping image decoding when clients already have pixel data in memory. This zero-decode path, combined with TensorRT inference on the GPU and a shared pool of GPU pipelines, is what gives the server its throughput and its flexibility in accepted inputs for document recognition workloads.

What TurboOCR does and its architecture

TurboOCR is a high-performance OCR server implemented in C++ with CUDA acceleration. It wraps the PP-OCRv5 model, optimized with TensorRT at FP16 precision, and focuses on recognizing form-style documents. The server reaches 270 images per second on FUNSD A4 forms, with a p50 latency of 11 ms for single requests, which is substantially faster than PaddleOCR’s Python implementation.

Architecturally, TurboOCR exposes both HTTP and gRPC interfaces from a single binary. HTTP is served by the Drogon C++ framework, which sits behind an nginx reverse proxy for connection buffering. The server requires an NVIDIA GPU with the Turing architecture or newer. A shared pool of GPU pipelines handles concurrent OCR requests from both protocols.
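To make the shared-pool idea concrete, here is a minimal Python sketch of the general pattern: a fixed set of pipelines handed out through a blocking queue, with two front-ends drawing from the same pool. It is not TurboOCR's C++ implementation; `PipelinePool`, `make_fake_pipeline`, `http_handler`, and `grpc_handler` are hypothetical names, and the "pipelines" are plain functions standing in for GPU work.

```python
import queue
import threading


class PipelinePool:
    """Fixed-size pool: callers block until one of the pipelines is free."""

    def __init__(self, pipelines):
        self._free = queue.Queue()
        for pipeline in pipelines:
            self._free.put(pipeline)

    def run(self, image):
        pipeline = self._free.get()       # blocks while every pipeline is busy
        try:
            return pipeline(image)
        finally:
            self._free.put(pipeline)      # always hand the pipeline back


def make_fake_pipeline(name):
    # Stand-in for a GPU pipeline (hypothetical); a real one would run OCR.
    return lambda image: f"{name} processed {image}"


pool = PipelinePool([make_fake_pipeline("pipe-0"), make_fake_pipeline("pipe-1")])


# Two hypothetical front-ends that share the same pool.
def http_handler(image):
    return pool.run(image)


def grpc_handler(image):
    return pool.run(image)


results = []
results_lock = threading.Lock()


def worker(handler, image):
    output = handler(image)
    with results_lock:
        results.append(output)


handlers = [http_handler, grpc_handler, http_handler, grpc_handler]
threads = [
    threading.Thread(target=worker, args=(handler, f"img{i}"))
    for i, handler in enumerate(handlers)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(results), "requests served by 2 pipelines")
```

Because both handlers call the same `pool.run`, the number of concurrent inferences is capped by the pool size regardless of which protocol the request arrived on.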

A standout feature is the support for multiple input formats: raw image bytes (e.g., PNG), base64-encoded JSON payloads, and notably, a zero-decode path accepting raw BGR pixel buffers with width and height headers. This last mode eliminates the image decoding step entirely when the client already holds an OpenCV Mat or NumPy ndarray in memory, reducing overhead in the hot path. Parallel processing of PDF pages is also supported, with four extraction modes to tune accuracy and speed.
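A client-side sketch of the zero-decode path might look like the following. The `/ocr/pixels` path comes from the article, but the header names `X-Width` and `X-Height`, the content type, and the helper `build_pixels_request` are assumptions made for illustration; check the project's README for the actual names. The request is only built here, not sent.

```python
import urllib.request


def build_pixels_request(pixels, width, height, base_url="http://localhost:8000"):
    """Build (but do not send) a POST carrying a raw BGR buffer.

    `pixels` is a bytes-like object of height * width * 3 bytes, for example
    the result of `frame.tobytes()` on a contiguous HxWx3 uint8 NumPy array.
    """
    expected = width * height * 3
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} bytes of BGR data, got {len(pixels)}")
    headers = {
        "Content-Type": "application/octet-stream",  # assumed content type
        "X-Width": str(width),                       # hypothetical header name
        "X-Height": str(height),                     # hypothetical header name
    }
    return urllib.request.Request(
        f"{base_url}/ocr/pixels", data=bytes(pixels), headers=headers, method="POST"
    )


# A blank 640x480 frame stands in for pixels already held in memory.
width, height = 640, 480
frame_bytes = bytes(width * height * 3)
req = build_pixels_request(frame_bytes, width, height)
print(req.get_method(), req.full_url, len(req.data))

# To send it against a running server:
#     with urllib.request.urlopen(req) as resp:
#         print(resp.read().decode())
```

The point of the pattern is that no PNG or JPEG encoding happens on the client and no decoding happens on the server; the dimensions travel out of band because a raw buffer carries no shape information of its own.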

Additional capabilities include layout detection powered by PP-DocLayoutV3 and class-aware reading order, which can be enabled via query parameters for structured document analysis. The server also exposes Prometheus metrics for production monitoring and supports Docker deployment with automatic caching of TensorRT engine builds.

Technical strengths and tradeoffs

TurboOCR’s core strength lies in its optimized GPU inference pipeline using TensorRT FP16 precision. This provides a substantial throughput boost while keeping latency low. The integration of PP-OCRv5, a robust open-source OCR model, ensures competitive recognition accuracy (F1 score of 90.2% on FUNSD).

The zero-decode /ocr/pixels endpoint is a practical optimization that avoids a redundant encode/decode cycle, a common source of overhead in OCR pipelines. It is especially beneficial in pipelines that already preprocess images (e.g., with OpenCV or NumPy) before sending them for OCR.

Supporting both HTTP and gRPC in a single binary with a shared GPU pipeline pool is a thoughtful architectural decision that simplifies deployment and resource management. It allows clients to choose their preferred RPC style without running multiple services.

The tradeoff is visible in the startup latency: the first run takes around 90 seconds to build TensorRT engines from ONNX models, which may be a limitation for use cases requiring instant availability. However, this is mitigated by caching the compiled engines in a Docker volume, enabling near-instant restarts.
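The build-once, reuse-afterwards behaviour can be illustrated with a generic cache-or-build pattern. This is a sketch of the idea only, not TurboOCR's code: `get_or_build_engine`, the hash-based file name, and `fake_build` are all hypothetical, and a temporary directory plays the role of the Docker volume.

```python
import hashlib
import tempfile
from pathlib import Path


def get_or_build_engine(onnx_bytes, cache_dir, build_fn):
    """Return (engine_bytes, was_cached); run the slow build only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on the model contents so a changed model triggers a rebuild.
    key = hashlib.sha256(onnx_bytes).hexdigest()[:16]
    engine_path = cache_dir / f"{key}.engine"
    if engine_path.exists():
        return engine_path.read_bytes(), True
    engine = build_fn(onnx_bytes)         # the slow step on a first start
    engine_path.write_bytes(engine)
    return engine, False


build_calls = []


def fake_build(onnx_bytes):
    # Stand-in for the real engine build (hypothetical).
    build_calls.append(1)
    return b"ENGINE:" + onnx_bytes


with tempfile.TemporaryDirectory() as volume:
    first = get_or_build_engine(b"fake-onnx-model", volume, fake_build)
    second = get_or_build_engine(b"fake-onnx-model", volume, fake_build)

print("first start cached:", first[1])
print("second start cached:", second[1])
print("builds performed:", len(build_calls))
```

If the cache directory is not persisted between container runs, every start behaves like the first call and pays the full build cost again, which is why the quick-start command mounts a volume.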

Enabling layout detection incurs a roughly 20% throughput penalty, a reasonable cost given the richer document understanding it provides. The codebase appears well-organized, with clear separation between the inference engine, server logic, and input handling, making it approachable for contributions or extensions.
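For capacity planning, the layout penalty can be turned into a rough estimate. The sketch below assumes that the roughly 20% penalty applies to the 270 images per second figure quoted earlier; the article does not say which configuration that benchmark used, and the 10,000-image batch is an arbitrary example. Treat the output as an estimate, not a measurement.

```python
baseline_ips = 270.0      # images per second reported on FUNSD A4 forms
layout_penalty = 0.20     # approximate throughput cost of layout detection

# Assumption: the penalty scales the baseline throughput directly.
with_layout_ips = baseline_ips * (1.0 - layout_penalty)

batch_size = 10_000       # arbitrary example batch
seconds_plain = batch_size / baseline_ips
seconds_layout = batch_size / with_layout_ips

print(f"estimated throughput with layout: {with_layout_ips:.0f} images/s")
print(f"{batch_size} images without layout: {seconds_plain:.1f} s")
print(f"{batch_size} images with layout:    {seconds_layout:.1f} s")
```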

The server targets NVIDIA GPUs with recent drivers, which limits deployment to compatible hardware. That constraint is typical for CUDA-accelerated systems.

Quick start

Requirements: Linux, NVIDIA driver 595+, Turing or newer GPU (RTX 20-series / GTX 16-series+).

docker run --gpus all -p 8000:8000 -p 50051:50051 \
  -v trt-cache:/home/ocr/.cache/turbo-ocr \
  ghcr.io/aiptimizer/turboocr:v2.2.2

The first startup builds TensorRT engines from ONNX (~90s). The volume caches them for instant restarts. nginx (port 8000) reverse-proxies to Drogon (port 8080) for connection buffering — both start automatically.

Example OCR request with raw PNG image:

curl -X POST http://localhost:8000/ocr/raw \
  --data-binary @document.png -H "Content-Type: image/png"

Sample JSON response snippet:

{
  "results": [
    {"text": "Invoice Total", "confidence": 0.97, "bounding_box": [[42,10],[210,10],[210,38],[42,38]]}
  ]
}
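A response of this shape is straightforward to post-process. The sketch below parses the sample shown above, drops low-confidence entries, and converts each four-point `bounding_box` into an axis-aligned rectangle. The helper `extract_lines` and the 0.5 threshold are illustrative choices, and the field names are taken from the sample snippet rather than from a published schema.

```python
import json

# The sample response from the article, embedded so the example runs offline.
sample_response = """
{
  "results": [
    {"text": "Invoice Total", "confidence": 0.97,
     "bounding_box": [[42,10],[210,10],[210,38],[42,38]]}
  ]
}
"""


def extract_lines(payload, min_confidence=0.5):
    """Return text lines with axis-aligned boxes, skipping weak detections."""
    lines = []
    for item in json.loads(payload)["results"]:
        if item["confidence"] < min_confidence:
            continue
        xs = [x for x, _ in item["bounding_box"]]
        ys = [y for _, y in item["bounding_box"]]
        lines.append({
            "text": item["text"],
            "x": min(xs),
            "y": min(ys),
            "width": max(xs) - min(xs),
            "height": max(ys) - min(ys),
        })
    return lines


lines = extract_lines(sample_response)
for line in lines:
    print(line)
```

Taking the min and max over all four corners keeps the conversion correct even if a box comes back slightly rotated, at the cost of a slightly larger rectangle.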

Verdict

TurboOCR is a practical, GPU-accelerated OCR server aimed at production use cases that need high throughput and low latency when recognizing form-style documents. Its zero-decode pixel path and dual-protocol server design provide flexibility and efficiency for varied client workflows.

The main limitations are the first-start latency caused by TensorRT engine compilation and the need for compatible NVIDIA hardware, which confines the server to environments with suitable GPU infrastructure.

It’s a solid choice for teams building OCR backends that process large volumes of document images and want to squeeze out GPU performance while maintaining API versatility. The code quality and architecture reflect a mature production system rather than a research prototype. If your workload fits these criteria and you have the hardware, TurboOCR is worth exploring.


→ GitHub Repo: aiptimizer/TurboOCR ⭐ 258 · C++

Noureddine RAMDI / TurboOCR: a GPU-accelerated OCR server optimized for raw pixel input and high throughput
