Bank statements come in countless shapes and formats, each bank with its own quirks and layout. Extracting structured transaction data from these PDFs is a tedious, error-prone chore that still trips up many finance automation efforts. Monopoly-core takes a pragmatic, code-driven approach to this problem: it uses per-bank Python classes to parse statement PDFs from over 20 financial institutions, converting them into CSVs ready for downstream processing.
What monopoly-core does and how it’s built
Monopoly-core is a Python library and CLI tool that extracts transaction data from bank statement PDFs and outputs CSV files. It supports a variety of banks primarily from North America and Asia, covering both credit and debit statements. Each supported bank is represented by a dedicated Python class implementing configuration details like column mappings, regex patterns for date and amount parsing, and validation rules.
Under the hood, the tool relies on the pdftotext utility to convert PDF pages into raw text. For image-based PDFs (scanned statements), it optionally leverages OCR via ocrmypdf. This two-step extraction process helps handle the wide range of PDF encodings and layouts.
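The two-step extraction described above can be sketched roughly as follows. This is a minimal illustration, not monopoly-core's actual code: it assumes the `pdftotext` and `ocrmypdf` command-line programs are on the PATH, and the function names (`pdftotext_command`, `extract_text`, `extract_with_ocr_fallback`) are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def pdftotext_command(pdf: Path, password: Optional[str] = None) -> list[str]:
    """Build a pdftotext call that writes layout-preserved text to stdout."""
    cmd = ["pdftotext", "-layout"]
    if password:
        cmd += ["-upw", password]  # -upw supplies the user password
    cmd += [str(pdf), "-"]  # "-" as the output file means stdout
    return cmd


def extract_text(pdf: Path, password: Optional[str] = None) -> str:
    """Return the text layer of a PDF (whitespace only if there is none)."""
    result = subprocess.run(
        pdftotext_command(pdf, password),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_with_ocr_fallback(pdf: Path, workdir: Path) -> str:
    """Try the text layer first; run OCR only when nothing was extracted."""
    text = extract_text(pdf)
    if text.strip():
        return text
    # No usable text layer, so the statement is probably a scan.
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("scanned PDF detected but ocrmypdf is not installed")
    ocr_pdf = workdir / f"{pdf.stem}.ocr.pdf"
    subprocess.run(["ocrmypdf", str(pdf), str(ocr_pdf)], check=True)
    return extract_text(ocr_pdf)
```

Keeping OCR as a fallback rather than a default matters in practice: OCR is slow and can misread digits, so it is worth running only when the PDF has no text layer at all.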
The architecture centers on a clear separation: the core parsing logic is bank-agnostic, working on text extracted from PDFs, while bank-specific quirks are encapsulated in subclasses that define how to interpret those texts. This design enables adding new banks by implementing new config classes without touching the core logic.
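One way to get this "add a bank without touching the core" property is a self-registering base class plus a detection step that matches identifying strings in the extracted text. The sketch below is an assumption about how such a design could look; `BankBase`, `ExampleBank`, `OtherBank`, `identifiers`, and `detect_bank` are hypothetical names, not monopoly-core's API.

```python
from typing import ClassVar, Optional


class BankBase:
    """Hypothetical base class; subclasses register themselves when defined."""

    registry: ClassVar[list] = []
    name: ClassVar[str] = ""
    # Strings that must all appear in the statement text for this bank.
    identifiers: ClassVar[tuple] = ()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        BankBase.registry.append(cls)


class ExampleBank(BankBase):
    name = "example_bank"
    identifiers = ("Example Bank Ltd", "Statement of Account")


class OtherBank(BankBase):
    name = "other_bank"
    identifiers = ("Other Bank Berhad",)


def detect_bank(text: str) -> Optional[type[BankBase]]:
    """Return the first registered bank whose identifiers all occur in text."""
    for bank in BankBase.registry:
        if all(marker in text for marker in bank.identifiers):
            return bank
    return None
```

With this shape, supporting a new institution means defining one more subclass; `detect_bank` and the rest of the core never change.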
Monopoly-core also includes safety checks that compare parsed transaction totals against statement totals to catch parsing errors or mismatches. It supports password-protected PDFs by allowing users to set the password via environment variables.
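A safety check of this kind can be as simple as summing the parsed amounts and comparing the result with the total printed on the statement. The following is a minimal sketch under stated assumptions: the function names and the `STATEMENT_PDF_PASSWORD` variable name are hypothetical and are not taken from monopoly-core.

```python
import os
from decimal import Decimal
from typing import Optional


def check_totals(amounts: list[Decimal], statement_total: Decimal) -> None:
    """Raise if the parsed transactions do not add up to the printed total."""
    parsed_total = sum(amounts, Decimal("0"))
    if parsed_total != statement_total:
        raise ValueError(
            f"parsed total {parsed_total} != statement total {statement_total}"
        )


def pdf_password(var: str = "STATEMENT_PDF_PASSWORD") -> Optional[str]:
    """Read the PDF password from the environment (variable name is assumed)."""
    return os.environ.get(var) or None
```

Using `Decimal` rather than `float` keeps the comparison exact, so a mismatch points to a missed or misread transaction rather than to rounding noise.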
The stack is straightforward: Python 3, pdftotext from the Poppler utils, and optional OCR tooling. The library is pip-installable as monopoly-core and exposes both a CLI and a Python API for integration into automation pipelines.
How monopoly-core’s per-bank parser classes stand out
What distinguishes monopoly-core is its per-bank configuration class design pattern. Each bank is modeled as a subclass that defines specific parsing instructions:
- Column mappings to identify transaction date, description, amounts, and balances.
- Regex patterns tailored to the bank’s date and currency formats.
- Rules to handle credit vs debit statement formats.
- Total validation logic specific to the bank’s statement layout.
This approach reflects a clean separation of concerns: the core engine handles extraction and generic parsing steps, while each bank subclass manages its idiosyncrasies. This makes the codebase more maintainable and extensible.
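To make the split concrete, here is a small sketch of a bank-specific config feeding a bank-agnostic parser. It is an illustration only: `StatementConfig`, `Transaction`, `EXAMPLE_BANK`, `parse_statement`, the regex, and the sample statement text are all invented for this example and do not mirror monopoly-core's real classes.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal


@dataclass(frozen=True)
class StatementConfig:
    """Bank-specific knowledge: how a transaction line looks and how dates read."""
    transaction_pattern: re.Pattern
    date_format: str


@dataclass(frozen=True)
class Transaction:
    posted: date
    description: str
    amount: Decimal  # negative for debits, positive for credits


# Hypothetical bank whose lines look like "03/01/2024  COFFEE SHOP   4.50"
EXAMPLE_BANK = StatementConfig(
    transaction_pattern=re.compile(
        r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<description>.+?)\s+"
        r"(?P<amount>[\d,]+\.\d{2})(?P<credit>\s*CR)?$"
    ),
    date_format="%d/%m/%Y",
)


def parse_statement(text: str, config: StatementConfig) -> list[Transaction]:
    """Bank-agnostic core: apply whatever patterns the config supplies."""
    transactions = []
    for line in text.splitlines():
        match = config.transaction_pattern.search(line.strip())
        if match is None:
            continue  # headers, totals, and blank lines are skipped
        amount = Decimal(match["amount"].replace(",", ""))
        if match["credit"] is None:
            amount = -amount  # no CR suffix: treat as money going out
        posted = datetime.strptime(match["date"], config.date_format).date()
        transactions.append(Transaction(posted, match["description"], amount))
    return transactions


SAMPLE_TEXT = """
EXAMPLE BANK STATEMENT
03/01/2024  COFFEE SHOP        4.50
05/01/2024  SALARY         1,200.00 CR
TOTAL                      1,195.50
"""
```

Calling `parse_statement(SAMPLE_TEXT, EXAMPLE_BANK)` yields two transactions; the header and total lines fall through because they do not match the bank's pattern.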
The tradeoff is that adding support for a new bank requires detailed reverse engineering of its statement format to create a new config class. This is labor-intensive but necessary given the wildly inconsistent PDF layouts across banks.
From a code quality perspective, the repo is clean and modular. The core parsing is in parser.py with bank subclasses in banks/. Reading passwords from environment variables keeps them out of command-line arguments and shell history, which shows attention to security practices.
The reliance on external command-line tools (pdftotext, ocrmypdf) is a practical choice that keeps the code lightweight but introduces dependencies that users must manage. This is common in PDF processing, as native Python PDF text extraction is often brittle.
Quick start with monopoly-core
Installation takes two steps: system dependencies first, then the Python package. On Debian or Ubuntu, the README specifies:
apt-get install build-essential libpoppler-cpp-dev pkg-config ocrmypdf
or on macOS:
brew install gcc@11 pkg-config poppler ocrmypdf
Then install the Python package with pipx:
pipx install monopoly-core
For OCR support (required for scanned PDFs):
pipx install 'monopoly-core[ocr]'
Once installed, you can use the CLI to convert PDFs to CSV. The project README and source code provide examples for usage and environment variable setup for passwords.
This setup process reflects the typical tradeoff in PDF tooling: you get robust text extraction and OCR by relying on battle-tested native libraries, but you must install and maintain these external dependencies.
Verdict: who should consider monopoly-core?
Monopoly-core is a useful tool for developers or finance automation specialists who need to extract transaction data from bank statement PDFs in a repeatable, extensible way. Its per-bank parser architecture is a strong point, making it easier to support multiple institutions with different layouts.
The tool is well-suited for small businesses or personal finance applications where automating bank statement ingestion can save hours of manual data entry. It can also be integrated into larger workflows that require CSV exports from bank PDFs.
Limitations include the dependency on external tools (pdftotext, ocrmypdf), which can complicate deployment, and the current bank coverage: unsupported banks require writing new parser classes, which demands some reverse-engineering effort.
Overall, monopoly-core strikes a pragmatic balance: it doesn’t try to do magic with AI or heuristics but offers a robust, maintainable codebase for a notoriously messy real-world problem. For those willing to invest in adding or tuning bank configurations, it delivers a solid developer experience and reliable output.
→ GitHub Repo: benjamin-awd/monopoly ⭐ 176 · Python