news-please: a Python crawler for structured news extraction with Common Crawl support

news-please is a Python news crawler and extractor for gathering structured news data at scale. It combines established libraries such as Scrapy, Newspaper, and Readability into a pipeline for recursive crawling of websites and batch processing of URLs. Its standout feature is integration with Common Crawl archives, which gives access to historical news data for longitudinal studies or RAG datasets.

what news-please does and how it works

At its core, news-please is a web crawler and information extractor optimized for news content. It supports two main modes: a command-line interface (CLI) for full-site recursive crawling and continuous crawling via RSS feeds, and a library API for batch processing of URLs.

The architecture leverages Scrapy’s pipeline system to modularize data extraction and storage. The extraction pipeline pulls out structured metadata such as headlines, authors, publication dates, and the main text of articles. This is achieved by combining Scrapy spiders with parsing libraries like Newspaper and Readability to handle the variety of news website layouts.
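
To make the idea of combining several extractors concrete, here is a minimal sketch of one possible merge strategy: run each extractor and keep the first non-empty value per field. This is an illustration only, not news-please's actual selection logic. The `Extractor` type, the `FIELDS` tuple, and `merge_extractions` are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical extractor type: takes raw HTML and returns a dict of fields.
Extractor = Callable[[str], Dict[str, Optional[str]]]

# Illustrative field names; a real pipeline may track more fields.
FIELDS = ("title", "authors", "date_publish", "maintext")


def merge_extractions(html: str, extractors: Iterable[Extractor]) -> Dict[str, Optional[str]]:
    """Run each extractor in order and keep the first non-empty value per field."""
    merged: Dict[str, Optional[str]] = {field: None for field in FIELDS}
    for extract in extractors:
        candidate = extract(html)
        for field in FIELDS:
            # Only fill fields that no earlier extractor has filled.
            if merged[field] is None and candidate.get(field):
                merged[field] = candidate[field]
    return merged
```

Ordering the extractors from most to least trusted turns this into a simple fallback chain. Other strategies, such as voting across extractors, would fit behind the same interface.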

Storage is highly configurable through Scrapy item pipelines, with built-in options for writing JSON files, inserting into PostgreSQL databases, indexing in Elasticsearch, and storing in Redis. This flexibility makes news-please suitable for different production environments and scale requirements.
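
For readers unfamiliar with Scrapy item pipelines, the sketch below shows the general shape of a custom storage stage using Scrapy's standard `open_spider`, `close_spider`, and `process_item` hooks. `JsonLinesPipeline` and its default path are hypothetical and are not one of news-please's built-in pipelines. How a custom pipeline is registered in a news-please configuration is not covered here.

```python
import json


class JsonLinesPipeline:
    """Hypothetical Scrapy-style item pipeline that appends items to a .jsonl file."""

    def __init__(self, path="articles.jsonl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # default=str converts values such as datetimes into strings.
        line = json.dumps(dict(item), default=str, ensure_ascii=False)
        self.file.write(line + "\n")
        # Returning the item passes it on to the next pipeline stage.
        return item
```

In a plain Scrapy project, a class like this is enabled through the `ITEM_PIPELINES` setting, where the integer priority decides the order in which stages run.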

A notable component is the workflow dedicated to filtering and downloading historical news articles from Common Crawl’s archive. This opens up possibilities for time-series analysis of news content or building datasets that span years, rather than just recent data.

technical strengths and tradeoffs

news-please’s strength lies in its pragmatic combination of proven tools and thoughtful extensions around news-specific workflows. Scrapy is a natural choice for the crawling backbone in Python, and the integration with Newspaper and Readability improves extraction accuracy across diverse sites.

The modular pipeline design offers a clean separation of concerns: crawling, extraction, filtering, and storage are distinct stages. This makes the codebase easier to maintain and extend. The configurable storage backends demonstrate an awareness of real-world deployments, where data sinks vary widely.

The Common Crawl integration is where news-please stands out compared to typical news scrapers. It includes utilities to filter Common Crawl WARC files for specific news outlets and date ranges, which is non-trivial given the scale and format of Common Crawl datasets. This is particularly valuable for researchers or engineers building datasets for retrieval-augmented generation (RAG) or longitudinal NLP analysis.
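
The filtering step described above can be pictured as a predicate applied to each extracted article. The sketch below is a simplified stand-in, not news-please's own filter. `keep_article` and its parameters are hypothetical, and it assumes the publication date has already been extracted.

```python
from datetime import datetime
from typing import Optional, Set
from urllib.parse import urlsplit


def keep_article(
    url: str,
    published: Optional[datetime],
    allowed_hosts: Set[str],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> bool:
    """Return True if the article is from an allowed outlet and inside [start, end)."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in allowed_hosts:
        return False
    # Strict mode: articles without a known publication date are dropped.
    if published is None:
        return False
    if start is not None and published < start:
        return False
    if end is not None and published >= end:
        return False
    return True
```

Whether to keep or drop articles with no detectable date is a design decision. Dropping them gives a cleaner time series at the cost of recall.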

However, the blocking nature of the library API calls (e.g., from_url, from_urls) means it’s not designed for high-throughput asynchronous processing out of the box. Users needing massive concurrency will need to build additional orchestration layers or extend the framework.
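
One lightweight orchestration layer is a thread pool around the blocking call. The sketch below assumes `NewsPlease.from_url` is safe to call from several threads, which is not verified here. `extract_concurrently` and `_default_extract` are hypothetical helpers, and the `extract` parameter exists so a different callable can be substituted.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def _default_extract(url):
    # Imported lazily so the helper can be used without news-please installed.
    from newsplease import NewsPlease
    return NewsPlease.from_url(url)


def extract_concurrently(urls, extract=_default_extract, max_workers=8):
    """Run a blocking extractor over many URLs; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failing URL should not abort the whole batch.
                errors[url] = exc
    return results, errors
```

Threads help most while requests wait on the network. Because HTML parsing is CPU-bound, very large jobs may still call for multiple processes or a task queue.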

The CLI mode covers typical use cases but requires familiarity with Scrapy conventions and pipeline configurations, which can impose a learning curve. Also, while multiple storage backends are supported, setting up and tuning those systems (like Elasticsearch or PostgreSQL) is outside the scope of the tool.

The codebase reads as pragmatic and maintainable, though extraction settings may need customization for particular target sites and data volumes. The project is actively developed and has a sizable community (2,400+ GitHub stars), which helps with ongoing improvements.

getting started with news-please

news-please supports Python 3.8 and newer. Installation is straightforward via pip:

$ pip install news-please

Using news-please as a library is simple for single or batch article extraction:

from newsplease import NewsPlease

article = NewsPlease.from_url('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html?hp')
print(article.title)

You can process multiple URLs at once with optional HTTP request parameters:

NewsPlease.from_urls([url1, url2], request_args={"timeout": 6})

For file-based URL lists:

NewsPlease.from_file(path)

Raw HTML content extraction is supported, optionally with the original URL to improve date extraction:

NewsPlease.from_html(html, url=None)
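
Here is why a URL can help with dates: many news sites embed the publication date in the path, as in the nytimes.com example earlier. The sketch below only illustrates that idea and is not claimed to be how news-please extracts dates. `date_from_url` is a hypothetical helper.

```python
import re
from datetime import date

# Matches path segments such as /2017/02/23/ (year/month/day).
_URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url):
    """Return a date parsed from a /YYYY/MM/DD/ path segment, or None."""
    match = _URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits were present but do not form a valid calendar date.
        return None
```

A URL-derived date is best treated as one signal among several, since page metadata can be missing or wrong and not every site uses dated paths.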

For WARC files, the format used by Common Crawl, there is a dedicated method:

NewsPlease.from_warc(warc_record)

Extracted articles are returned as objects containing structured metadata. You can serialize them to JSON for downstream use:

import json

with open("article.json", "w") as f:
    json.dump(article.get_serializable_dict(), f, indent=4)

This makes it easy to integrate news-please into data pipelines or research workflows.
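
For batches, one JSON object per line (JSON Lines) is often easier to stream than a single large JSON file. The sketch below assumes each record is already a plain dictionary, for example one serialized article. `write_jsonl` is a hypothetical helper, not part of news-please.

```python
import json


def write_jsonl(records, path):
    """Write an iterable of dict records to a JSON Lines file; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # default=str is a fallback for values json cannot encode, such as datetimes.
            f.write(json.dumps(record, default=str, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Because each line is an independent JSON document, downstream jobs can process the file incrementally or split it across workers.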

verdict: who should use news-please

news-please is a solid choice for developers and researchers needing reliable extraction of structured news data at scale, especially when historical data access via Common Crawl is important. It fits projects that require a blend of recursive crawling and batch URL processing with flexible storage options.

Its main limitation is the blocking API design, which may not suit ultra-high-throughput scraping without additional orchestration. Also, configuring the backend storage and tuning extraction for specific news outlets requires some hands-on work.

Overall, news-please is a practical, well-architected tool that balances capability with maintainability. If you’re working on news aggregation, longitudinal NLP datasets, or retrieval-augmented generation pipelines, it’s worth exploring.

For casual or one-off scraping tasks, simpler tools might suffice, but for anything approaching production-grade news ingestion, news-please offers a robust, extensible foundation.


GitHub repository: fhamborg/news-please (2,444 stars, Python)

Noureddine RAMDI / news-please: a Python crawler for structured news extraction with Common Crawl support
