Noureddine RAMDI / Dedoc: Python library for structured document content extraction with a virtual stack machine PDF engine

Created Sat, 23 May 2026 20:41:14 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

ispras/dedoc

Document processing pipelines often break down on the messiness of real-world files: PDFs with broken text layers, scanned images, nested archives, and complex tables. Dedoc addresses this by combining a modular extraction pipeline with low-level PDF graphics parsing, OCR preprocessing, and table recognition.

What Dedoc does and how it works

Dedoc is an open-source Python library and REST service designed to extract the content and logical structure from a wide range of document formats. It supports Office formats such as DOCX, XLSX, and PPTX; PDF documents, both those with a text layer and scanned ones without; HTML; plain text; images; and nested archives containing multiple documents.

The extraction covers more than plain text. The logical document structure (headings, lists, and tables) is represented as a tree, and formatting details and metadata are captured alongside it, so downstream processing can work with structure rather than a flat string.
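To make the tree idea concrete, here is a minimal sketch of how a hierarchy of headings, list items, and text could be modeled and walked. The `Node` class, its field names, and the `kind` values are hypothetical illustrations and are not Dedoc's actual output schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical tree node; NOT Dedoc's real schema."""
    kind: str                      # e.g. "root", "header", "list_item", "raw_text"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


doc = Node("root", children=[
    Node("header", "1. Introduction", children=[
        Node("raw_text", "Dedoc extracts structure."),
        Node("list_item", "a) first item"),
    ]),
    Node("header", "2. Methods"),
])

# Flatten the tree into an outline that keeps the nesting depth
outline = [(depth, node.kind, node.text) for depth, node in walk(doc)]
for depth, kind, text in outline:
    print("  " * depth + f"[{kind}] {text}")
```

A consumer such as a search indexer could use the depth to attach each paragraph to its nearest enclosing heading.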

The architecture is built as a pluggable pipeline, so it can be extended to support new document types or output schemas. This modularity makes it a flexible tool for different document understanding workflows.

Under the hood, Dedoc employs a few key technical components:

  • A virtual stack machine interpreter for PDF graphics extraction. This is a distinctive choice compared to most PDF parsers that rely heavily on heuristics or text layer extraction alone. The interpreter directly processes PDF graphics operators according to the PDF specification, enabling more precise content extraction, especially in complex cases.

  • Tesseract OCR integrated with machine learning-based preprocessing for scanned documents. This preprocessing includes orientation detection, column detection, and bold text detection, improving OCR accuracy and layout reconstruction.

  • Contour analysis for complex multipage table recognition, which helps parse tables that span multiple pages or have irregular layouts.

  • Handling of nested documents and archives, allowing Dedoc to process container files holding multiple heterogeneous documents seamlessly.
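As an illustration of the last point, the sketch below walks a ZIP archive recursively and yields every contained document, descending into inner archives. It uses only Python's standard `zipfile` module; the function names are hypothetical and this is not how Dedoc itself is implemented.

```python
import io
import zipfile
from typing import Dict, Iterator, Tuple


def iter_documents(data: bytes, prefix: str = "") -> Iterator[Tuple[str, bytes]]:
    """Recursively yield (path, content) for every non-archive file in a ZIP."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            content = archive.read(info)
            path = f"{prefix}{info.filename}"
            # Check the extension rather than the file signature: DOCX, XLSX
            # and PPTX are ZIP containers too and must not be unpacked here.
            if info.filename.lower().endswith(".zip"):
                yield from iter_documents(content, prefix=path + "/")
            else:
                yield path, content


def make_zip(files: Dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP archive from a {name: content} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"report.txt": b"inner text"})
outer = make_zip({"readme.txt": b"outer text", "nested/inner.zip": inner})
found = dict(iter_documents(outer))
```

A production version would also want a recursion depth limit and a size cap to guard against maliciously nested archives.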

The system ships with a Docker-based REST API that exposes the extraction functionality as a web service, suitable for integration with other applications, such as NLP pipelines or information retrieval systems.

Technical strengths and design tradeoffs

The technical standout of Dedoc is its virtual stack machine interpreter for PDFs. Unlike typical PDF text extraction libraries that rely on text layers or heuristic-based layout analysis, Dedoc processes PDF graphics operators at a low level. This approach can yield higher fidelity extraction by interpreting the actual drawing commands in the PDF file.
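The idea of a stack machine over PDF operators can be shown with a toy example. PDF content streams are written in postfix form: operands come first and are pushed onto a stack, then an operator consumes them. The sketch below handles only three text operators (`BT`, `Td`, `Tj`) and ignores fonts, transformation matrices, glyph widths, and escaped strings. It is an illustration of the technique under those simplifying assumptions, not Dedoc's interpreter.

```python
import re
from typing import List, Tuple

# One alternative per token type: (string), /name, number, operator
TOKEN_RE = re.compile(
    r"\((?P<string>[^()]*)\)"
    r"|(?P<name>/\S+)"
    r"|(?P<number>-?\d+(?:\.\d+)?)"
    r"|(?P<op>[A-Za-z*']+)"
)


def interpret(stream: str) -> List[Tuple[float, float, str]]:
    """Run a tiny subset of PDF text operators; return (x, y, text) triples."""
    stack: list = []          # operand stack
    x = y = 0.0               # start of the current text line
    shown: List[Tuple[float, float, str]] = []

    for match in TOKEN_RE.finditer(stream):
        kind = match.lastgroup
        if kind == "number":
            stack.append(float(match.group("number")))
        elif kind in ("string", "name"):
            stack.append(match.group(kind))
        else:
            op = match.group("op")
            if op == "BT":            # begin text object: reset position
                x = y = 0.0
            elif op == "Td":          # move line start by (tx, ty)
                ty = stack.pop()
                tx = stack.pop()
                x, y = x + tx, y + ty
            elif op == "Tj":          # show a string at the current position
                shown.append((x, y, stack.pop()))
            # Any other operator (Tf, ET, ...) is ignored in this sketch;
            # its operands are discarded so the stack stays clean.
            stack.clear()
    return shown


content = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World) Tj ET"
result = interpret(content)
```

Because each string is emitted together with the coordinates in effect when it was drawn, a later stage can reconstruct lines and columns from geometry instead of trusting the order of the text in the file.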

This design comes with tradeoffs. It requires a deep understanding of the PDF format, which can be complex and has many edge cases. The implementation is likely more involved and potentially less performant than heuristic approaches, but it gains precision and robustness, especially for PDFs with non-standard or damaged text layers.

The integration of Tesseract OCR with ML preprocessing is another practical strength. Scanned documents are hard to extract text from, and detecting page orientation and column layout before recognition gives the OCR engine cleaner input to work with.

The contour analysis method for table detection indicates a more sophisticated approach than simple bounding box heuristics, which is important for multipage and irregular tables common in real-world documents.

The modular pipeline design is a plus for maintainability and extensibility. Users can plug in new formats or customize output schemas without rewriting the core logic.
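One common way to get this kind of pluggability is a registry that maps file types to handlers, so supporting a new format means registering one more function. The sketch below shows that pattern in isolation. The names `register_reader`, `READERS`, and `extract` are hypothetical and do not correspond to Dedoc's own classes.

```python
import os
from typing import Callable, Dict

Reader = Callable[[bytes], str]
READERS: Dict[str, Reader] = {}   # file extension -> reader function


def register_reader(extension: str) -> Callable[[Reader], Reader]:
    """Decorator that plugs a reader into the registry for one extension."""
    def decorator(func: Reader) -> Reader:
        READERS[extension.lower()] = func
        return func
    return decorator


@register_reader(".txt")
def read_txt(data: bytes) -> str:
    return data.decode("utf-8")


@register_reader(".csv")
def read_csv(data: bytes) -> str:
    # Flatten comma-separated rows into tab-separated text
    rows = data.decode("utf-8").splitlines()
    return "\n".join("\t".join(row.split(",")) for row in rows)


def extract(filename: str, data: bytes) -> str:
    """Dispatch to the reader registered for the file's extension."""
    extension = os.path.splitext(filename)[1].lower()
    reader = READERS.get(extension)
    if reader is None:
        raise ValueError(f"no reader registered for {extension!r}")
    return reader(data)


print(extract("notes.TXT", b"hello"))
```

The core `extract` function never changes when a format is added, which is the maintainability benefit the paragraph above describes.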

On the downside, running Dedoc outside Docker requires a Linux environment (Ubuntu 20+ recommended) and specific Python versions (3.9 or 3.10), which might limit its out-of-the-box usability on some platforms. Also, the resource requirements can be non-trivial, especially when running OCR and complex PDF parsing.

The codebase is in Python, which aids accessibility and integration with data science and NLP stacks but might constrain performance for very high throughput scenarios.

Quick start with Dedoc

Dedoc can be installed and run in two main ways: as a Docker container or as a Python library installed with pip.

The Docker method is recommended for flexibility and avoiding OS-level dependency issues. Here’s how to get started:

# Pull the official dedoc Docker image
docker pull dedocproject/dedoc

# Run the container exposing the REST API on port 1231
docker run -p 1231:1231 --rm dedocproject/dedoc python3 /dedoc_root/dedoc/main.py

If you want to customize the application settings, you can clone the repository, modify the config.py file, and build the Docker image yourself:

# Clone the repo
git clone https://github.com/ispras/dedoc

# Change into the dedoc directory
cd dedoc

# Build and run the Docker image using docker-compose
docker compose up --build

# Optionally, run the container with tests enabled
test="true" docker compose up --build

For users who prefer not to use Docker, Dedoc can be installed as a Python library via pip. This approach is less flexible: it requires a suitable Linux environment (Ubuntu 20+ with Python 3.9 or 3.10, as noted above), and the OCR and PDF dependencies then have to be managed on the host rather than inside the image.

The REST API exposed by the Docker container or local install can then be used to submit documents for extraction and receive structured JSON output describing the content, structure, tables, and metadata.
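A client for the running service could look roughly like the sketch below, which uses only the Python standard library. The port 1231 comes from the commands above. The `/upload` path, the `file` form field, and the `language` parameter are assumptions based on the project's documentation and should be checked against the version you run.

```python
import json
import os
import urllib.request
import uuid
from typing import Dict, Optional, Tuple


def build_multipart(filename: str, content: bytes,
                    fields: Dict[str, str]) -> Tuple[bytes, str]:
    """Encode one file plus text fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(header.encode("utf-8") + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def upload(path: str,
           url: str = "http://localhost:1231/upload",   # assumed endpoint
           parameters: Optional[Dict[str, str]] = None) -> dict:
    """POST a document to the service and return the parsed JSON response."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(
            os.path.basename(path), handle.read(), parameters or {}
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the container from the steps above running locally:
#     result = upload("contract.pdf", parameters={"language": "eng"})
#     print(list(result.keys()))
```

Building the multipart body by hand avoids a third-party dependency; with the `requests` package installed, the same upload is a single `requests.post(url, files=..., data=...)` call.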

Verdict

Dedoc is a technically solid choice for teams needing precise and extensible document content extraction across diverse formats. Its unique virtual stack machine interpreter for PDF graphics sets it apart from many extraction tools that rely mostly on heuristics or simple text layer parsing.

This low-level approach can yield better results on complex or non-standard PDFs, which is valuable when accuracy matters, such as in legal, scientific, or enterprise document workflows.

The added OCR preprocessing and contour-based table recognition make it a practical tool for scanned documents and complex layouts, which are common pain points in document processing pipelines.

That said, Dedoc is not a lightweight or plug-and-play library. It requires a Linux environment, some familiarity with Docker or Python packaging, and can be resource-intensive. The Python implementation favors ease of integration but may not suit extremely high-volume real-time processing without additional scaling.

Overall, if you work with heterogeneous document collections and need structured, reliable content extraction as a preprocessing step before NLP or search, Dedoc is worth exploring. Its modular design means you can adapt it as your document types or output needs evolve. Just be prepared for the operational complexity and resource footprint that come with its thorough approach to PDF and document parsing.


→ GitHub Repo: ispras/dedoc ⭐ 702 · Python