Document processing pipelines often hit a wall when dealing with the messiness of real-world files: PDFs with broken text layers, scanned images, nested archives, and complex tables. Dedoc tackles this problem by combining a modular extraction pipeline with low-level PDF graphics parsing, ML-assisted OCR preprocessing, and contour-based table recognition.
What Dedoc does and how it works
Dedoc is an open-source Python library and REST service designed to extract the content and logical structure from a wide range of document formats. It supports Office formats such as DOCX, XLSX, and PPTX, PDFs (both those with a text layer and scanned ones), HTML, plain text, images, and even nested archives containing multiple documents.
The extraction process covers not just plain text but also the logical document structure (headings, lists, and tables), represented as a tree. It also captures formatting details and metadata, making the output richer and more useful for downstream processing.
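To make the idea of a tree-shaped result concrete, here is a minimal sketch of such a structure and a traversal over it. The `Node` class, its `kind` labels, and the `outline` helper are hypothetical names chosen for illustration; they are not Dedoc's actual output schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One node of a hypothetical document tree (not Dedoc's real schema)."""
    kind: str                      # e.g. "root", "header", "list_item", "raw_text"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def outline(root: Node) -> List[str]:
    """Keep only the headings and indent them by their nesting level."""
    return ["  " * (d - 1) + n.text for d, n in walk(root) if n.kind == "header"]


doc = Node("root", children=[
    Node("header", "1. Introduction", [
        Node("raw_text", "Some opening paragraph."),
        Node("header", "1.1 Scope", [Node("list_item", "- first item")]),
    ]),
    Node("header", "2. Methods"),
])

print("\n".join(outline(doc)))
```

A downstream consumer could use the same traversal to pull out only list items or only body text, which is the practical benefit of a tree over a flat string.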
The architecture is built as a pluggable pipeline, so it can be extended to support new document types or output schemas. This modularity makes it a flexible tool for different document understanding workflows.
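One common way to build a pluggable pipeline is a registry that maps file types to reader functions. The sketch below shows that general pattern only; `READERS`, `register`, and `extract` are hypothetical names, and Dedoc's real extension points may look quite different.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: file extension -> reader function.
READERS: Dict[str, Callable[[bytes], str]] = {}


def register(*extensions: str):
    """Decorator that plugs a reader into the registry for the given extensions."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        for ext in extensions:
            READERS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def read_text(data: bytes) -> str:
    return data.decode("utf-8")


@register(".csv")
def read_csv(data: bytes) -> str:
    # Toy reader: present comma-separated values as tab-separated text.
    return data.decode("utf-8").replace(",", "\t")


def extract(filename: str, data: bytes) -> str:
    """Dispatch to whichever reader is registered for the file's extension."""
    ext = Path(filename).suffix.lower()
    try:
        reader = READERS[ext]
    except KeyError:
        raise ValueError(f"no reader registered for {ext!r}") from None
    return reader(data)


print(extract("notes.TXT", b"hello"))
```

Supporting a new format then means adding one decorated function, with no change to `extract` itself.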
Under the hood, Dedoc employs a few key technical components:
- A virtual stack machine interpreter for PDF graphics extraction. This is a distinctive choice compared to most PDF parsers that rely heavily on heuristics or text layer extraction alone. The interpreter directly processes PDF graphics operators according to the PDF specification, enabling more precise content extraction, especially in complex cases.
- Tesseract OCR integrated with machine learning-based preprocessing for scanned documents. This preprocessing includes orientation detection, column detection, and bold text detection, improving OCR accuracy and layout reconstruction.
- Contour analysis for complex multipage table recognition, which helps parse tables that span multiple pages or have irregular layouts.
- Handling of nested documents and archives, allowing Dedoc to process container files holding multiple heterogeneous documents.
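The last component, nested archive handling, boils down to recursion: open a container, and for each member either descend again or hand the file to a reader. Below is a minimal sketch using only Python's standard `zipfile` module. The function names are hypothetical, only ZIP is covered, and there is no protection against oversized or maliciously nested archives.

```python
import io
import zipfile
from typing import Dict, Iterator, Tuple


def iter_documents(name: str, data: bytes, prefix: str = "") -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) for every non-archive file, descending into ZIPs."""
    path = f"{prefix}{name}"
    # Check the extension too: DOCX, XLSX, and PPTX are also ZIP containers,
    # but they should be treated as single documents, not unpacked.
    if name.lower().endswith(".zip") and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                yield from iter_documents(info.filename, archive.read(info), prefix=f"{path}/")
    else:
        yield path, data


def make_zip(files: Dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP so the example needs no files on disk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member, content in files.items():
            archive.writestr(member, content)
    return buffer.getvalue()


inner = make_zip({"notes.txt": b"inner"})
outer = make_zip({"report.txt": b"outer", "more.zip": inner})
found = dict(iter_documents("bundle.zip", outer))
print(sorted(found))
```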
The system ships with a Docker-based REST API that exposes the extraction functionality as a web service, suitable for integration with other applications, such as NLP pipelines or information retrieval systems.
Technical strengths and design tradeoffs
The technical standout of Dedoc is its virtual stack machine interpreter for PDFs. Unlike typical PDF text extraction libraries that rely on text layers or heuristic-based layout analysis, Dedoc processes PDF graphics operators at a low level. This approach can yield higher fidelity extraction by interpreting the actual drawing commands in the PDF file.
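PDF content streams are written in postfix form: operands come first, then the operator that consumes them. That is why a stack machine is a natural fit. The toy interpreter below handles only a handful of text operators (`BT`, `Tf`, `Td`, `Tj`) and ignores glyph widths, fonts, encodings, and transformation matrices. It is a sketch of the general technique, not Dedoc's implementation, and `interpret` is a hypothetical name.

```python
import re
from typing import List, Tuple

# Tokens: (literal string) | /Name | number | operator
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|/[^\s()/]+|[-+]?\d*\.?\d+|[A-Za-z]+\*?")


def interpret(stream: str) -> List[Tuple[float, float, float, str]]:
    """Return (x, y, font_size, text) for every string shown with Tj."""
    stack: list = []
    x = y = 0.0          # start of the current text line
    font_size = 0.0
    shown = []
    for tok in TOKEN.findall(stream):
        if tok.startswith("("):
            stack.append(tok[1:-1])            # string operand
        elif tok.startswith("/"):
            stack.append(tok)                  # name operand, e.g. /F1
        elif tok[0].isdigit() or tok[0] in "+-.":
            stack.append(float(tok))           # numeric operand
        elif tok == "BT":
            x = y = 0.0                        # a text object starts at the origin
        elif tok == "Tf":
            font_size = stack.pop()            # operands: /FontName size
            stack.pop()
        elif tok == "Td":
            ty = stack.pop()                   # operands: tx ty
            tx = stack.pop()
            x, y = x + tx, y + ty
        elif tok == "Tj":
            shown.append((x, y, font_size, stack.pop()))
        else:
            stack.clear()                      # unhandled operator: drop its operands
    return shown


STREAM = """
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET
"""

for item in interpret(STREAM):
    print(item)
```

Because every string is emitted together with the position and font size in effect when it was drawn, layout can be reconstructed from the drawing commands themselves rather than guessed afterwards.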
This design comes with tradeoffs. Implementing it requires a deep understanding of the PDF format, which is complex and full of edge cases. The implementation is likely more involved and potentially slower than heuristic approaches, but it gains precision and robustness, especially for PDFs with non-standard or damaged text layers.
The integration of Tesseract OCR with ML preprocessing is another practical strength. Scanned documents often pose significant challenges for text extraction, and the machine learning models for orientation and column detection are aimed at exactly the layout problems that degrade OCR output.
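To show why column detection matters before OCR, here is a deliberately simple stand-in: a vertical projection profile that splits a binarized page wherever a wide enough run of empty pixel columns appears. This is a classic heuristic, not the machine learning model Dedoc uses, and `find_columns` and `min_gap` are hypothetical names.

```python
from typing import List, Tuple


def find_columns(page: List[str], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns, with end exclusive.

    `page` is a list of equal-length rows where '#' is ink and '.' is background.
    A run of at least `min_gap` ink-free pixel columns separates two text columns.
    """
    width = len(page[0])
    # Projection profile: amount of ink in each pixel column.
    ink = [sum(row[x] == "#" for row in page) for x in range(width)]

    columns = []
    start = None   # x where the current text column began
    gap = 0        # length of the current run of empty pixel columns
    for x, count in enumerate(ink):
        if count:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        columns.append((start, width - gap))
    return columns


page = [
    "###..###",
    "#.#..###",
]
print(find_columns(page))
```

Running OCR on each detected range separately keeps text from two columns from being interleaved line by line.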
The contour analysis method for table detection indicates a more sophisticated approach than simple bounding box heuristics, which is important for multipage and irregular tables common in real-world documents.
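As a rough illustration of how ruled tables can be recovered from their drawn lines, the sketch below finds cells as background regions fully enclosed by line pixels, using a breadth-first flood fill. This is a simplified stand-in for contour analysis on a tiny character grid, not Dedoc's algorithm, and `find_cells` is a hypothetical name.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def find_cells(grid: List[str]) -> List[Box]:
    """Return bounding boxes of background regions enclosed by '#' line pixels."""
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    cells = []
    for r0 in range(height):
        for c0 in range(width):
            if grid[r0][c0] == "#" or seen[r0][c0]:
                continue
            # Flood-fill one connected background region.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            top, left, bottom, right = r0, c0, r0, c0
            touches_border = False
            while queue:
                r, c = queue.popleft()
                top, bottom = min(top, r), max(bottom, r)
                left, right = min(left, c), max(right, c)
                if r in (0, height - 1) or c in (0, width - 1):
                    touches_border = True
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and not seen[nr][nc] and grid[nr][nc] != "#"):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Regions that reach the image border are page background, not cells.
            if not touches_border:
                cells.append((top, left, bottom, right))
    return cells


table = [
    "#######",
    "#..#..#",
    "#######",
    "#..#..#",
    "#######",
]
print(find_cells(table))
```

Grouping the resulting boxes by shared row and column edges would give the table grid. Merging tables across pages is a further step that this sketch does not attempt.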
The modular pipeline design is a plus for maintainability and extensibility. Users can plug in new formats or customize output schemas without rewriting the core logic.
On the downside, running Dedoc outside Docker requires a Linux environment (Ubuntu 20+ recommended) and specific Python versions (3.9 or 3.10), which might limit its out-of-the-box usability on some platforms. Also, the resource requirements can be non-trivial, especially when running OCR and complex PDF parsing.
The codebase is in Python, which aids accessibility and integration with data science and NLP stacks but might constrain performance for very high throughput scenarios.
Quick start with Dedoc
Dedoc provides two main ways to install and run it: via Docker container or as a Python library installed with pip.
The Docker method is recommended for flexibility and avoiding OS-level dependency issues. Here’s how to get started:
# Pull the official dedoc Docker image
docker pull dedocproject/dedoc
# Run the container exposing the REST API on port 1231
docker run -p 1231:1231 --rm dedocproject/dedoc python3 /dedoc_root/dedoc/main.py
If you want to customize the application settings, you can clone the repository, modify the config.py file, and build the Docker image yourself:
# Clone the repo
git clone https://github.com/ispras/dedoc
# Change into the dedoc directory
cd dedoc
# Build and run the Docker image using docker-compose
docker compose up --build
# Optionally, run the container with tests enabled
test="true" docker compose up --build
For users who prefer not to use Docker, Dedoc can be installed as a Python library via pip. However, this approach is less flexible and requires a suitable Linux environment (Ubuntu 20+), and the host machine needs enough resources to run OCR and PDF parsing.
The REST API exposed by the Docker container or local install can then be used to submit documents for extraction and receive structured JSON output describing the content, structure, tables, and metadata.
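A minimal client sketch using only the standard library follows. Port 1231 comes from the commands above. The `/upload` path and the `file` form field are assumptions based on the project's documentation as I recall it, so verify them against the version you run. `build_multipart` and `upload` are hypothetical helper names.

```python
import json
import os
import urllib.request
import uuid
from typing import Dict, Tuple


def build_multipart(filename: str, content: bytes, fields: Dict[str, str]) -> Tuple[bytes, str]:
    """Encode one file plus plain form fields as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for key, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{key}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + content
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def upload(path: str, url: str = "http://localhost:1231/upload", **fields: str) -> dict:
    """POST a document to a running service and return the parsed JSON reply.

    The default URL is an assumption; adjust it to your deployment.
    """
    with open(path, "rb") as handle:
        body, content_type = build_multipart(os.path.basename(path), handle.read(), fields)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example call (requires the container from the quick start to be running):
# result = upload("example.pdf")
# print(list(result))
```

In practice the `requests` package makes the multipart step a one-liner; the manual encoding here only keeps the sketch free of third-party dependencies.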
Verdict
Dedoc is a technically solid choice for teams needing precise and extensible document content extraction across diverse formats. Its virtual stack machine interpreter for PDF graphics sets it apart from many extraction tools that rely mostly on heuristics or simple text layer parsing.
This low-level approach can yield better results on complex or non-standard PDFs, which is valuable when accuracy matters, such as in legal, scientific, or enterprise document workflows.
The added OCR preprocessing and contour-based table recognition make it a practical tool for scanned documents and complex layouts, which are common pain points in document processing pipelines.
That said, Dedoc is not a lightweight or plug-and-play library. It requires a Linux environment, some familiarity with Docker or Python packaging, and can be resource-intensive. The Python implementation favors ease of integration but may not suit extremely high-volume real-time processing without additional scaling.
Overall, if you work with heterogeneous document collections and need structured, reliable content extraction as a preprocessing step before NLP or search, Dedoc is worth exploring. Its modular design means you can adapt it as your document types or output needs evolve. Just be prepared for the operational complexity and resource footprint that come with its thorough approach to PDF and document parsing.
→ GitHub Repo: ispras/dedoc ⭐ 702 · Python