Comic Translate translates comics across languages with a multi-step pipeline that combines speech bubble detection, OCR, and LLMs given full-page context.
Dedoc is a Python library and REST API that extracts structured content from diverse documents, including PDFs, Office files, and images, using a virtual stack machine PDF interpreter and OCR preprocessing.
Papermerge is a Python-based open-source document management system for scanned files with OCR and full-text search, using a meta-repo pattern to scale its codebase.
Nougat is Meta’s neural OCR system for academic PDFs, extracting LaTeX math and tables into structured Markdown using a Vision Transformer encoder-decoder. It offers CLI, API, and training tools.
OCRFlux is a Python OCR tool optimized for NVIDIA GPUs for fast, high-quality document OCR. It runs in a conda environment and relies on poppler-utils for PDF rendering.
Monopoly-core is a Python library and CLI for converting bank statement PDFs to CSV using per-bank parser classes. It supports 20+ banks, OCR, and safety checks.
pdf-document-layout-analysis is a Dockerized microservice for PDF layout analysis that offers a high-accuracy Vision Grid Transformer model or a faster LightGBM model, along with OCR, translation, and multi-format export.
TurboOCR is a C++/CUDA OCR server leveraging TensorRT FP16 for high throughput and low latency, featuring a zero-decode pixel pipeline and multi-protocol API.
Falcon-Perception is a PyTorch engine for multimodal autoregressive Transformers handling detection, segmentation, and OCR with FlexAttention and efficient caching.
Alibaba’s Logics-Parsing-v2 converts complex document images into structured HTML, handling formulas, tables, flowcharts, music sheets, and pseudocode with a single model.
Second Brain is a Python framework that indexes local files with embeddings, runs background subagents, and lets AI agents build and hot-load their own plugins at runtime.
Explore how a hybrid pipeline using YOLOv8 layout detection, OCR, and LLMs automates the parsing of messy bank statement PDFs for personal finance analysis with RAG and AI agents.
deepseek_ocr_app combines a React frontend with a FastAPI backend to run OCR on images and multipage PDFs, with exports to Markdown, HTML, DOCX, and JSON. It features real-time progress tracking and bounding box visualization.
DocStrange converts PDFs, DOCX, PPTX, XLSX, images, and URLs into LLM-ready Markdown, JSON, HTML, and CSV. It offers free cloud and private local GPU modes for flexible, privacy-compliant document parsing.
Windrecorder captures screen activity on Windows, indexes it with multiple OCR engines locally, and offers a searchable rewind UI, all without cloud dependencies.
Tesseract OCR evolved from a legacy character pattern engine to a modern LSTM-based line recognition system supporting 100+ languages and multiple output formats. Here’s a technical dive.