Web scraping at scale runs into a fundamental challenge: how to mimic human browser behavior convincingly enough to evade bot detection. Crawlee addresses this by combining HTTP crawling and headless browser automation under a unified TypeScript API, with built-in mechanisms to generate realistic headers and replicate browser fingerprints and TLS handshakes. It is worth a closer look if you build scrapers that must avoid blocks and bans while managing large request queues and persisted data.
What Crawlee does and how it works
Crawlee is a TypeScript library designed primarily for web scraping and browser automation. It supports both HTTP-based crawling and headless browser automation through Playwright and Puppeteer, enabling developers to choose the right tool for their target sites. The architecture is modular and extensible, with separate crawler classes implementing different strategies but exposing a consistent API.
Under the hood, Crawlee manages browser-like headers, TLS fingerprints, and user-agent strings to simulate real users. This reduces the chance of bot detection systems flagging requests. It also offers persistent URL queues, allowing the crawler to maintain state across runs and scale reliably.
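To make the header side of this concrete, here is a minimal TypeScript sketch of one idea behind realistic headers: every header in a request should come from the same browser profile. This is not Crawlee’s implementation. `BrowserProfile`, `profiles`, and `headersForSession` are hypothetical names, and the header values are illustrative placeholders rather than captures from real browser releases.

```typescript
// Hypothetical sketch, not Crawlee's header generator.
// Header values are illustrative placeholders only.
type HeaderSet = Record<string, string>;

interface BrowserProfile {
    name: 'chrome' | 'firefox';
    headers: HeaderSet;
}

const profiles: BrowserProfile[] = [
    {
        name: 'chrome',
        headers: {
            'user-agent':
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
                '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'sec-ch-ua-platform': '"Windows"',
            'accept-language': 'en-US,en;q=0.9',
        },
    },
    {
        name: 'firefox',
        headers: {
            'user-agent':
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) ' +
                'Gecko/20100101 Firefox/125.0',
            'accept-language': 'en-US,en;q=0.5',
        },
    },
];

/** Picks one whole profile per session so headers never mix browsers. */
function headersForSession(sessionIndex: number): HeaderSet {
    const profile = profiles[sessionIndex % profiles.length];
    // Copy the headers so callers cannot mutate the shared profile.
    return { ...profile.headers };
}

const chromeLike = headersForSession(0);
const firefoxLike = headersForSession(1);
```

The design choice here is to copy a whole profile instead of randomizing headers one by one: a Firefox user agent sent together with Chrome-style client-hint headers is the kind of inconsistency a detector could check for.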
The library supports proxy rotation, which is essential when scraping sites that implement IP-based blocking or rate limiting. Storage for crawl results and request queues is pluggable: the local filesystem is the default, and other backends can be substituted.
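Proxy rotation itself is a simple idea, and a small sketch may help show what the library is doing on your behalf. The code below is a hypothetical, dependency-free illustration, not Crawlee code: `ProxyRotator` and the proxy URLs are made up for the example. It rotates round-robin and pins a session to one proxy so that related requests keep the same IP.

```typescript
// Hypothetical helper, not part of Crawlee.
class ProxyRotator {
    private readonly proxyUrls: string[];
    private readonly sessionToProxy = new Map<string, string>();
    private nextIndex = 0;

    constructor(proxyUrls: string[]) {
        if (proxyUrls.length === 0) {
            throw new Error('At least one proxy URL is required');
        }
        this.proxyUrls = proxyUrls.slice();
    }

    /** Returns the pinned proxy for a known session, else the next in rotation. */
    next(sessionId?: string): string {
        if (sessionId !== undefined) {
            const pinned = this.sessionToProxy.get(sessionId);
            if (pinned !== undefined) return pinned;
        }
        const proxy = this.proxyUrls[this.nextIndex];
        this.nextIndex = (this.nextIndex + 1) % this.proxyUrls.length;
        // Remember the choice so this session keeps using the same proxy.
        if (sessionId !== undefined) this.sessionToProxy.set(sessionId, proxy);
        return proxy;
    }
}

// Placeholder proxy URLs for illustration only.
const rotator = new ProxyRotator([
    'http://proxy-a.example.com:8000',
    'http://proxy-b.example.com:8000',
]);
const firstProxy = rotator.next();
const secondProxy = rotator.next();
const thirdProxy = rotator.next(); // wraps back to the first proxy
```

In a real Crawlee project you would configure proxies through the library instead of writing a rotator by hand. The sketch only shows the behavior to expect from one.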
The codebase is written in TypeScript, making it suitable for Node.js environments with modern language features and type safety. Crawlee is developed by Apify and integrates well with their cloud platform, offering deployment options beyond local usage.
How Crawlee’s approach to stealth and flexibility stands out
The standout feature is Crawlee’s approach to evading bot detection: it makes crawler traffic resemble that of a real browser. Most scraping libraries focus on either HTTP requests or browser automation. Crawlee puts both behind crawler classes with a consistent API, so switching between them is mostly a matter of swapping the crawler class and adapting the request handler to what that crawler provides.
This abstraction is more than convenience: it lets developers start with lightweight HTTP crawling and escalate to full browser automation only when necessary, saving resources. The dual support for Playwright and Puppeteer also gives flexibility in browser choice and ecosystem compatibility.
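The escalation strategy described above can be sketched in a few lines. This is a hypothetical illustration, not Crawlee code: `Fetcher`, `looksBlocked`, and `fetchWithEscalation` are invented names, the block heuristic is an assumption you would tune per target site, and the stub fetchers stand in for a real HTTP client and a real browser.

```typescript
// All names here are hypothetical; this is not Crawlee's API.
interface FetchResult {
    status: number;
    body: string;
}
type Fetcher = (url: string) => Promise<FetchResult>;

/** Rough guess that a response is a block page. Tune for your targets. */
function looksBlocked(result: FetchResult): boolean {
    const blockedStatus = result.status === 403 || result.status === 429;
    const emptyBody = result.body.trim().length === 0;
    return blockedStatus || emptyBody;
}

/** Tries the cheap HTTP path first and uses the browser only when needed. */
async function fetchWithEscalation(
    url: string,
    httpFetcher: Fetcher,
    browserFetcher: Fetcher,
): Promise<{ result: FetchResult; via: 'http' | 'browser' }> {
    const httpResult = await httpFetcher(url);
    if (!looksBlocked(httpResult)) {
        return { result: httpResult, via: 'http' };
    }
    const browserResult = await browserFetcher(url);
    return { result: browserResult, via: 'browser' };
}

// Stub fetchers so the sketch runs without network access.
const stubHttp: Fetcher = async (url) =>
    url.indexOf('protected') !== -1
        ? { status: 403, body: 'Access denied' }
        : { status: 200, body: '<title>ok</title>' };
const stubBrowser: Fetcher = async () => ({
    status: 200,
    body: '<title>rendered</title>',
});
```

With the stubs above, a URL containing `protected` takes the browser path and any other URL stays on the HTTP path, so the expensive browser is only paid for when the cheap request fails.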
Managing TLS fingerprints and headers automatically is a non-trivial problem. Crawlee handles it by generating realistic headers and mimicking browser TLS handshakes, a step many scrapers skip. This lowers the chance that anti-bot services flag its requests as automated.
The persistent request queue is another strong point. It enables robust, fault-tolerant crawling workflows that can pause and resume without losing state. Proxy rotation is built in, which is essential for scraping at scale without running into IP bans.
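To illustrate what pause and resume means for a request queue, here is a minimal sketch of a queue whose state is a plain serializable object. It is a hypothetical illustration and not Crawlee’s request queue: `QueueState` and `ResumableQueue` are invented names, and the snapshot is kept in memory as JSON where a real crawler would write it to storage.

```typescript
// Hypothetical sketch, not Crawlee's request queue.
interface QueueState {
    pending: string[];
    handled: string[];
}

class ResumableQueue {
    private readonly pending: string[];
    private readonly handled: Set<string>;

    constructor(saved: QueueState = { pending: [], handled: [] }) {
        this.pending = saved.pending.slice();
        this.handled = new Set(saved.handled);
    }

    /** Adds a URL unless it is already pending or handled (deduplication). */
    add(url: string): boolean {
        if (this.handled.has(url) || this.pending.indexOf(url) !== -1) return false;
        this.pending.push(url);
        return true;
    }

    /** Returns the next URL without removing it, so a crash does not lose it. */
    peek(): string | undefined {
        return this.pending[0];
    }

    /** Marks a URL as done; only now does it leave the pending list. */
    markHandled(url: string): void {
        const index = this.pending.indexOf(url);
        if (index !== -1) this.pending.splice(index, 1);
        this.handled.add(url);
    }

    /** Serializable copy of the state, to be persisted after each change. */
    snapshot(): QueueState {
        return { pending: this.pending.slice(), handled: Array.from(this.handled) };
    }
}

// First "run": handle one URL, then stop midway.
const firstRun = new ResumableQueue();
firstRun.add('https://example.com/a');
firstRun.add('https://example.com/b');
firstRun.markHandled('https://example.com/a');
const savedJson = JSON.stringify(firstRun.snapshot());

// Second "run": restore from the snapshot and continue where it left off.
const secondRun = new ResumableQueue(JSON.parse(savedJson) as QueueState);
const resumedUrl = secondRun.peek();
const readded = secondRun.add('https://example.com/a'); // already handled
```

Keeping handled URLs in the snapshot is what makes resuming safe: links discovered again in the second run are recognized as already processed and are not enqueued a second time.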
The codebase demonstrates good modularity and separation of concerns. Each crawler type encapsulates its logic, and the storage abstractions keep the core decoupled from specific databases or filesystems.
The tradeoff is complexity: developers need to understand the crawler lifecycle and configuration to tune stealth features effectively. Also, when you use the browser-based crawlers, the resource footprint is higher than that of a pure HTTP scraper.
Quick start with Crawlee CLI and PlaywrightCrawler
The easiest way to experiment with Crawlee is using its CLI, which bootstraps a project with example code and installs dependencies:
npx crawlee create my-crawler
cd my-crawler
npm start
Alternatively, you can install Crawlee and Playwright manually in your own Node.js project:
npm install crawlee playwright
Here’s a minimal example using PlaywrightCrawler to crawl pages and extract titles:
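import { PlaywrightCrawler, Dataset } from 'crawlee';

// PlaywrightCrawler opens each URL in a headless browser page.
const crawler = new PlaywrightCrawler({
    // Called once for every page the crawler loads.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save the result to the default dataset.
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page and add them to the queue.
        await enqueueLinks();
    },
    // headless: false, // uncomment to see the browser window
});

// Top-level await requires the project to use ES modules.
await crawler.run(['https://crawlee.dev']);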
By default, data and queues are stored under ./storage in the current working directory, but this location can be configured. The CLI and documentation cover further customization options.
Verdict: who benefits from Crawlee and its limitations
Crawlee is a practical choice if you need a versatile, stealthy scraping library capable of scaling from lightweight HTTP requests to full browser automation. Its unified API across Playwright and Puppeteer stands out in this space.
The built-in bot evasion techniques (TLS fingerprint replication, header management, proxy rotation) cover many common blockers but won’t guarantee invisibility against highly sophisticated anti-bot systems. Fine-tuning is often necessary.
With the browser-based crawlers, the resource cost is higher than that of pure HTTP scrapers because of browser automation overhead. If your scraping needs are simple or purely API-based, Crawlee might be overkill.
Overall, Crawlee is well suited for projects that require a blend of stealth, scalability, and developer experience in a TypeScript/Node.js environment, particularly when integrated with Apify’s platform for cloud deployment.
→ GitHub Repo: apify/crawlee ⭐ 22,961 · TypeScript