WebMagic: a flexible Java web crawler framework with dual extraction modes

Web scraping is often more art than science, especially when you need to crawl complex sites or scale beyond basic scripts. WebMagic stands out by offering a Java-based framework that covers the full crawling lifecycle, from downloading pages and managing URLs to extracting content and persisting data. What’s compelling is its dual approach to content extraction: you can either write a programmatic PageProcessor for fine-grained control or use an annotation-driven OOSpider that maps HTML directly to POJOs. This flexibility makes WebMagic a practical choice for a broad range of crawling needs.

What WebMagic does and how it works

WebMagic is a scalable web crawler framework implemented in Java. It abstracts away many common crawling concerns under a simple core, while providing flexible extension points. The framework handles downloading, URL scheduling, page processing, and data storage. Under the hood, it supports multi-threading and even distributed crawling setups to scale across machines.

The architecture centers around a few key components:

Spider: the main orchestrator that manages the crawling lifecycle.
Downloader: fetches web pages using HTTP.
Scheduler: manages the queue of URLs to crawl.
PageProcessor: the interface where extraction logic lives.
Pipeline: processes and persists extracted results.
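
How these components cooperate can be sketched in plain Java. The following is a minimal, dependency-free model of the loop and not WebMagic's actual implementation: the type names mirror the list above, but every signature, the in-memory FAKE_WEB map, and the CrawlLoopSketch class itself are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative model only: these types are simplified stand-ins, not WebMagic classes.
class CrawlLoopSketch {

    // Downloader: turns a URL into page content (here, a lookup in a fake in-memory web).
    interface Downloader { String download(String url); }

    // PageProcessor: extracts fields and discovers new links from one page.
    interface PageProcessor {
        void process(String url, String html, Map<String, String> fields, List<String> newLinks);
    }

    // Pipeline: receives the extracted fields for each page.
    interface Pipeline { void persist(String url, Map<String, String> fields); }

    // Scheduler: a FIFO queue that drops URLs it has already seen.
    static final class Scheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        void push(String url) { if (seen.add(url)) queue.add(url); }
        String poll() { return queue.poll(); }
    }

    // Spider: drives download -> process -> persist until the scheduler is empty.
    static void run(String startUrl, Downloader downloader, PageProcessor processor, Pipeline pipeline) {
        Scheduler scheduler = new Scheduler();
        scheduler.push(startUrl);
        for (String url = scheduler.poll(); url != null; url = scheduler.poll()) {
            String html = downloader.download(url);
            if (html == null) continue; // treat unknown URLs as failed downloads
            Map<String, String> fields = new LinkedHashMap<>();
            List<String> links = new ArrayList<>();
            processor.process(url, html, fields, links);
            links.forEach(scheduler::push);
            pipeline.persist(url, fields);
        }
    }

    // Hypothetical two-page site whose pages link to each other in a cycle.
    static final Map<String, String> FAKE_WEB = Map.of(
        "/a", "<title>A</title><a href=\"/b\">b</a>",
        "/b", "<title>B</title><a href=\"/a\">a</a>");

    // Wires the pieces together: collects each page's title, keyed by URL.
    static Map<String, String> crawlTitles(String startUrl) {
        Pattern title = Pattern.compile("<title>(.*?)</title>");
        Pattern link = Pattern.compile("href=\"(.*?)\"");
        Map<String, String> results = new LinkedHashMap<>();
        run(startUrl,
            FAKE_WEB::get,
            (url, html, fields, links) -> {
                Matcher t = title.matcher(html);
                if (t.find()) fields.put("title", t.group(1));
                Matcher l = link.matcher(html);
                while (l.find()) links.add(l.group(1));
            },
            (url, fields) -> results.put(url, fields.get("title")));
        return results;
    }

    public static void main(String[] args) {
        System.out.println(crawlTitles("/a"));
    }
}
```

The point of the sketch is the separation of concerns: because the cycle between the two pages is broken inside the scheduler, the processor never needs to know which URLs have been visited, and any of the four collaborators can be swapped without touching the loop.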

Java’s strong typing and mature ecosystem, combined with WebMagic’s modular design, make it a solid choice for enterprise-grade scraping tasks. The framework also supports annotation-based extraction using POJOs, allowing declarative mappings of HTML content to Java objects.

WebMagic’s stack is pure Java, relying on standard APIs and common libraries like slf4j for logging. It is designed to be embedded in existing Java projects and integrates well with build tools like Maven.

What makes WebMagic interesting technically

The standout feature is the dual extraction strategy. If you want full control over crawling logic, you implement the PageProcessor interface. This lets you imperatively define how to extract data, add new URLs, and handle complex scenarios. For simpler use cases, WebMagic offers the OOSpider, which uses annotations on Java POJOs to declare extraction rules. This declarative style reduces boilerplate and improves the developer experience for straightforward scraping.

Under the hood, the HTML extraction uses XPath, CSS selectors, and regex, giving you flexibility in how you target page elements. This is crucial because real-world websites often require a mix of strategies.
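
To make the idea of mixing strategies concrete, here is a small sketch that uses only the JDK rather than WebMagic's selector API. It first narrows the document with XPath and then filters and captures with a regex. The SAMPLE markup, the /item/ URL scheme, and the SelectorMixSketch class are hypothetical; CSS selectors are left out because the JDK has no built-in CSS selector engine, and the input must be well-formed XML, which real pages often are not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: JDK XPath and regex, not WebMagic's own selectors.
class SelectorMixSketch {

    // Hypothetical, well-formed XHTML sample; real pages usually need a lenient HTML parser first.
    static final String SAMPLE =
        "<html><head><title>Catalog</title></head><body>"
        + "<a href=\"/item/42\">Widget</a>"
        + "<a href=\"/about\">About</a>"
        + "<a href=\"/item/7\">Gadget</a>"
        + "</body></html>";

    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // XPath alone: select a single text value by its position in the tree.
    static String title(Document doc) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate("//title/text()", doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // XPath then regex: select every href, keep only item links, and capture their numeric ids.
    static List<String> itemIds(Document doc) {
        NodeList hrefs;
        try {
            hrefs = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//a/@href", doc, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
        Pattern itemLink = Pattern.compile("^/item/(\\d+)$");
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            Matcher m = itemLink.matcher(hrefs.item(i).getNodeValue());
            if (m.matches()) ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(title(doc) + " " + itemIds(doc));
    }
}
```

XPath is a good fit for structural questions ("every link in the body"), while the regex handles what the tree cannot express, such as the shape of a URL.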

The framework is built with concurrency in mind. It supports multi-threaded crawling out of the box, which matters for throughput at scale. For even larger workloads, WebMagic can distribute crawling tasks across multiple nodes, although this setup requires additional configuration and infrastructure.
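
The core difficulty in multi-threaded crawling is that several workers discover the same URL at the same time, and the crawl must still know when it is finished. The sketch below shows one way to handle both with standard java.util.concurrent classes. It is not how WebMagic implements its thread pool: the LINKS graph, the ConcurrentCrawlSketch class, and the choice of a Phaser for completion tracking are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

// Illustrative only: a generic worker-pool crawl, not WebMagic internals.
class ConcurrentCrawlSketch {

    // Hypothetical link graph standing in for real HTTP fetches: URL -> outgoing links.
    static final Map<String, List<String>> LINKS = Map.of(
        "/", List.of("/a", "/b"),
        "/a", List.of("/b", "/c"),
        "/b", List.of("/"),
        "/c", List.of());

    // Crawls everything reachable from start using a fixed pool of worker threads.
    static Set<String> crawl(String start, int threads) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Phaser pending = new Phaser(1); // the caller is one party; each scheduled URL adds another
        try {
            submit(start, visited, pool, pending);
            pending.arriveAndAwaitAdvance(); // returns once every scheduled task has finished
        } finally {
            pool.shutdown();
        }
        return visited;
    }

    private static void submit(String url, Set<String> visited, ExecutorService pool, Phaser pending) {
        if (!visited.add(url)) return; // atomic check-and-mark: each URL is scheduled at most once
        pending.register();            // registered before the parent task arrives, so no early finish
        pool.execute(() -> {
            try {
                for (String next : LINKS.getOrDefault(url, List.of())) {
                    submit(next, visited, pool, pending);
                }
            } finally {
                pending.arriveAndDeregister();
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(crawl("/", 5).size() + " pages visited");
    }
}
```

The result is the same set of pages whether the pool has one thread or five; only the wall-clock time changes once each task performs real network I/O.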

From a code quality standpoint, the project is mature: more than 11,000 stars on GitHub point to a broad user base and sustained interest. The codebase is modular, making it relatively straightforward to extend or replace components like the downloader or scheduler.

The tradeoff is that, as a Java framework, WebMagic carries more verbosity and configuration overhead than lightweight scripting tools in Python. Annotation-driven extraction can also be limiting if your scraping logic needs to make dynamic decisions based on page content. However, the coexistence of both approaches lets you pick what fits your project.

Install and quick start

To get started with WebMagic, you add dependencies to your Maven pom.xml:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>${webmagic.version}</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
</dependency>

The project logs through slf4j and ships with the slf4j-log4j12 binding by default. If you use a different slf4j binding, exclude the default one by nesting the following element inside the WebMagic dependency declaration:

<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>

Once dependencies are set, you can create a simple crawler by implementing PageProcessor:

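import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class MyPageProcessor implements PageProcessor {

    // Retry a failed download up to 3 times and wait 1000 ms between requests
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Extract the page title and store it as a result field
        page.putField("title", page.getHtml().xpath("//title/text()").toString());
        // Queue every link that matches the pattern (dots escaped so they match literally)
        page.addTargetRequests(page.getHtml().links().regex("https://example\\.com/page/.*").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    // Start the crawler with 5 threads from a single seed URL
    public static void main(String[] args) {
        Spider.create(new MyPageProcessor())
            .addUrl("https://example.com")
            .thread(5)
            .run();
    }
}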

Alternatively, with OOSpider, you define a POJO annotated with extraction rules, which simplifies code for straightforward scraping.
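
The idea behind annotation-driven extraction can be shown without the framework. In the sketch below, a POJO declares what to extract and a generic mapper fills it in through reflection. This is a simplified illustration of the technique, not WebMagic's API: the @ExtractRegex annotation, the Article model, and the extract method are all hypothetical. WebMagic's own annotations, such as @ExtractBy and @TargetUrl, come from the webmagic-extension module.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hand-rolled annotation mapper, not WebMagic's OOSpider.
class AnnotationExtractionSketch {

    // Hypothetical annotation: a regex whose first capture group becomes the field value.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface ExtractRegex { String value(); }

    // The POJO declares what to extract; it contains no extraction logic of its own.
    static class Article {
        @ExtractRegex("<h1>(.*?)</h1>")
        String headline;

        @ExtractRegex("<span class=\"author\">(.*?)</span>")
        String author;
    }

    // Generic mapper: instantiate the model, then fill each annotated String field via reflection.
    static <T> T extract(String html, Class<T> model) {
        try {
            T instance = model.getDeclaredConstructor().newInstance();
            for (Field field : model.getDeclaredFields()) {
                ExtractRegex rule = field.getAnnotation(ExtractRegex.class);
                if (rule == null) continue; // unannotated fields are left untouched
                Matcher m = Pattern.compile(rule.value(), Pattern.DOTALL).matcher(html);
                if (m.find()) {
                    field.setAccessible(true);
                    field.set(instance, m.group(1));
                }
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Model needs an accessible no-argument constructor", e);
        }
    }

    public static void main(String[] args) {
        Article a = extract("<h1>Hello</h1><span class=\"author\">Ada</span>", Article.class);
        System.out.println(a.headline + " by " + a.author);
    }
}
```

The sketch also shows the limitation mentioned earlier: the rules are fixed at compile time, so a page that needs a different rule depending on its content is awkward to express this way.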

Who should consider WebMagic

WebMagic fits Java developers who need a comprehensive yet flexible web crawling framework. Its modularity and dual extraction approach make it suitable for tasks ranging from simple to complex.

If you are comfortable with Java and your crawl needs custom logic, implementing PageProcessor gives you precise control over every step. If you want less boilerplate and your data fits cleanly into POJOs, the annotation-driven OOSpider is a convenient abstraction.

The framework’s multi-threading and distributed crawling capabilities make it viable for larger scale projects, though setting up distributed crawling requires more effort.

Limitations include the verbosity inherent to Java and the learning curve of annotation-based extraction. Also, it’s less suited for quick one-off scrapes compared to lightweight Python tools.

Overall, WebMagic delivers a practical balance of flexibility, scalability, and developer experience for Java-centric crawling projects.


→ GitHub Repo: code4craft/webmagic ⭐ 11,689 · Java

Noureddine RAMDI / WebMagic: a flexible Java web crawler framework with dual extraction modes
