pdf-document-layout-analysis: a dual-model PDF layout analysis microservice with Docker deployment

Intelligent PDF layout analysis remains a challenging task, especially when you need to extract structured elements like titles, tables, pictures, and formulas while preserving reading order. pdf-document-layout-analysis tackles this by combining two distinct machine learning backends in one Docker-powered microservice, letting you trade off accuracy for speed with a simple flag. It also includes OCR, document translation, and multiple export formats, all accessible via a web UI or REST API.

What pdf-document-layout-analysis does and how it works

This project, maintained by HURIDOCS, is a Python-based microservice designed for intelligent segmentation and layout analysis of PDF documents. It identifies different page elements — text blocks, titles, tables, images, and even mathematical formulas — while reconstructing the correct reading order. This is not just plain text extraction but a structured understanding of the page layout.

Under the hood, the service offers two analysis backends:

Vision Grid Transformer (VGT): a high-accuracy deep learning model based on transformers tailored for visual grid analysis of PDF pages. It tends to deliver the most precise layout segmentation results.
LightGBM classifiers: a set of gradient boosting decision tree models optimized for fast inference, trading some accuracy for speed.

You switch between these backends with a single flag: set fast=true in the API call or the UI to use the LightGBM classifiers, and leave it out to use VGT.

The service is built following Clean Architecture principles, ensuring separation of concerns and modularity. It is packaged entirely in Docker. GPU acceleration is optional and needs a compatible GPU with 5 GB of memory available; without one, the service falls back to the CPU.

The system supports over 150 OCR languages via Tesseract integration, enabling it to process scanned documents and images within PDFs. It can extract tables as HTML and formulas as LaTeX markup, making downstream consumption easier.

Outputs include Markdown, HTML, and JSON formats, and the service also features automatic translation powered by Ollama, enabling multilingual document workflows.
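
To make the idea of structured output concrete, here is a minimal sketch of post-processing a JSON result. The segment shape used here (a list of dicts with "type" and "text" keys, already in reading order) is an assumption for illustration, not the service's documented schema, so check a real response before relying on it.

```python
from collections import defaultdict

# Hypothetical sample data: the "type" and "text" keys and the type names
# are assumptions, not the service's documented output format.
segments = [
    {"type": "Title", "text": "Annual report"},
    {"type": "Text", "text": "Revenue grew in 2023."},
    {"type": "Table", "text": "<table>...</table>"},
    {"type": "Text", "text": "Costs were flat."},
]


def group_by_type(items):
    """Group segment texts by type, keeping the incoming order within each group."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["type"]].append(item["text"])
    return dict(grouped)


if __name__ == "__main__":
    for kind, texts in group_by_type(segments).items():
        print(kind, texts)
```

Because the input order is preserved inside each group, the reading order reconstructed by the service would carry over to whatever you build from the grouped texts.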

Two main interfaces are exposed:

A Gradio web UI running on port 7860, which provides a user-friendly dashboard for uploading PDFs, running analysis, visualizing segmentation overlays, converting formats, and translating documents.
A REST API on port 5060, which offers programmatic access through more than 10 endpoints covering analysis, OCR, table of contents extraction, and format conversion.

Technical strengths and design tradeoffs

The most striking architectural choice here is the dual-model approach for layout analysis. The Vision Grid Transformer (VGT) provides high accuracy thanks to its transformer-based architecture, which has shown strong results in vision tasks. However, transformer models are typically resource-intensive and slower to run, especially on CPUs.

To address this, the repository includes LightGBM classifiers as a lightweight alternative optimized for speed. This design lets users pick the backend that suits their use case — high precision for offline batch processing or fast inference for real-time applications.

This tradeoff is exposed cleanly through a single flag (fast=true), which improves developer experience and integration flexibility.

The Docker-first deployment is a practical choice, encapsulating all dependencies and models in a single container. This makes the service portable and easy to deploy on various environments, including homelab setups or cloud servers. The fallback to CPU if no GPU is detected is also a plus for wider usability.

On the downside, the resource requirements are non-trivial:

Minimum 2 GB RAM
5 GB GPU memory recommended for acceleration
10 GB disk space for models and dependencies

This footprint might be heavy for lightweight or edge deployments.
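
A quick preflight check can tell you whether a host is likely to cope before you pull the image. The sketch below only encodes the three figures listed above; the function names are illustrative, and the RAM and GPU values are passed in by the caller because reading them portably needs platform-specific tooling.

```python
import shutil

# Thresholds quoted in this article; treat them as rough guidance.
MIN_RAM_GB = 2
MIN_DISK_GB = 10
GPU_MEMORY_GB = 5


def check_resources(ram_gb, free_disk_gb, gpu_memory_gb=0.0):
    """Compare measured resources with the quoted requirements."""
    return {
        "ram_ok": ram_gb >= MIN_RAM_GB,
        "disk_ok": free_disk_gb >= MIN_DISK_GB,
        # Below this threshold, expect the service to run on the CPU instead.
        "gpu_acceleration": gpu_memory_gb >= GPU_MEMORY_GB,
    }


def free_disk_gb(path="."):
    """Free space on the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


if __name__ == "__main__":
    # ram_gb=8 is a placeholder; substitute the value for your machine.
    print(check_resources(ram_gb=8, free_disk_gb=free_disk_gb()))
```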

The Tesseract OCR integration covers more than 150 languages, which is critical for global document processing but also adds complexity and resource load.

The inclusion of automatic translation via Ollama is a nice bonus, though it depends on that external service and may not suit all privacy or latency requirements.

Overall, the codebase appears modular and clean, with clear separation between analysis backends, OCR, translation, and API layers. The use of Gradio for the web UI is a sensible choice for rapid prototyping and visualization.

Quick start

The README provides straightforward commands to get the Docker-based service running through a Makefile:

make start # or `just start` (https://github.com/casey/just)

This launches both the Gradio web UI on http://localhost:7860 and the REST API on http://localhost:5060.
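
In scripts it can help to wait until both ports accept connections before sending requests. The following is a minimal sketch using only the standard library; the helper names and the timing defaults are assumptions, and an open port only shows that something is listening, not that the models have finished loading.

```python
import socket
import time


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(host, ports, deadline_s=120.0, interval_s=2.0):
    """Poll until every port accepts connections or the deadline passes."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if all(port_open(host, p) for p in ports):
            return True
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    # 7860 is the Gradio UI and 5060 the REST API, as described above.
    print(wait_for_ports("localhost", [7860, 5060]))
```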

You can test the REST API with curl:

curl -X POST -F 'file=@/path/to/your/document.pdf' http://localhost:5060

For faster, LightGBM-based analysis:

curl -X POST -F 'file=@/path/to/your/document.pdf' -F "fast=true" http://localhost:5060
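
The same two requests can be issued from Python. This is a rough standard-library equivalent of the curl commands above, not an official client: the helper names are made up for illustration, and the only things taken from the commands are the URL, the file form field, and the fast=true field. The response is returned as raw bytes because its format is not described here.

```python
import io
import urllib.request
import uuid


def build_multipart(file_name, file_bytes, fields=None):
    """Encode one file plus optional text fields as multipart/form-data."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    for name, value in (fields or {}).items():
        body.write(f"--{boundary}\r\n".encode())
        body.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        body.write(f"{value}\r\n".encode())
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'.encode()
    )
    body.write(b"Content-Type: application/pdf\r\n\r\n")
    body.write(file_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"


def analyze_pdf(path, fast=False, url="http://localhost:5060"):
    """POST a PDF to the service and return the raw response body as bytes."""
    with open(path, "rb") as handle:
        pdf_bytes = handle.read()
    # Mirror the curl example: send fast=true only when the fast backend is wanted.
    fields = {"fast": "true"} if fast else {}
    body, content_type = build_multipart("document.pdf", pdf_bytes, fields)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling analyze_pdf("/path/to/your/document.pdf", fast=True) corresponds to the second curl command; dropping fast=True corresponds to the first.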

To stop the service:

make stop

The web UI supports uploading PDFs, visualizing layout segmentation, converting to Markdown or HTML, running OCR, extracting table of contents, and translating documents.

Verdict

pdf-document-layout-analysis is a practical, well-structured microservice for anyone needing detailed, structured PDF layout analysis with the flexibility to choose between accuracy and speed. Its Docker-based deployment and multi-interface access make it suitable for both experimentation and integration into larger pipelines.

The project shines especially if you work with varied document types containing tables, formulas, and mixed content, and need outputs in multiple formats plus translation.

The main limitation is the resource footprint: at least 2 GB of RAM and 10 GB of disk space, plus a GPU with 5 GB of memory if you want acceleration. This makes it less suitable for very lightweight or embedded environments.

If your workflow involves batch processing or real-time document analysis with the option to toggle precision vs. speed, and you can meet the resource demands, this repo is worth a close look. The clean architecture and modular design also make it a good base for extending or customizing document layout analysis pipelines.

→ GitHub Repo: huridocs/pdf-document-layout-analysis ⭐ 1,144 · Python

Noureddine RAMDI / pdf-document-layout-analysis: a dual-model PDF layout analysis microservice with Docker deployment