Skip to content

potterdigital/crawl4ai-mcp

crawl4ai-mcp

License Python CI crawl4ai

An MCP server that gives AI assistants web crawling superpowers. Wraps crawl4ai (Playwright/Chromium) and exposes it as MCP tools — crawl pages, extract structured data, batch-crawl sitemaps, and manage browser sessions, all through the Model Context Protocol.

Works with any MCP-compatible client: Claude Code, Claude Desktop, Cursor, Windsurf, OpenAI Agents SDK, and others.

Why

AI coding assistants can't browse the web natively. This MCP server gives them a full Chromium-based crawler — handling JS-rendered pages, authenticated sessions, batch crawls, structured extraction, and sitemap ingestion. No API key required for basic crawling.

Available Tools

Tool Description
ping Health check — confirms server and browser are running
crawl_url Crawl a URL and return clean markdown. Supports JS rendering, custom headers/cookies, CSS scoping, and cache control
crawl_many Crawl multiple URLs concurrently with configurable parallelism, politeness delays, and optional disk persistence
deep_crawl BFS site crawl — follows links with configurable depth and page limits, politeness delays, and optional disk storage
crawl_sitemap Crawl all URLs from an XML sitemap (supports gzip and sitemap indexes, politeness delays, optional disk persistence)
extract_structured LLM-powered structured JSON extraction with a user-defined schema
extract_css CSS-selector-based structured extraction — deterministic, no LLM required
create_session Create a persistent browser session (preserves cookies and state)
list_sessions List all active browser sessions
destroy_session Destroy a named browser session
list_profiles List available crawl profiles and their settings
check_update Check if a newer version of crawl4ai is available on PyPI

Prerequisites

  • Python 3.12+
  • uv — Python package manager

Installation

# Clone the repository
git clone https://github.com/potterdigital/crawl4ai-mcp.git
cd crawl4ai-mcp

# Install dependencies (uv manages the virtualenv automatically)
uv sync

# Install Playwright browser (required by crawl4ai — downloads Chromium)
uv run crawl4ai-setup

# Verify the installation
uv run crawl4ai-doctor

Register with Your MCP Client

Claude Code

# Replace /path/to/crawl4ai-mcp with your actual clone path
claude mcp add-json --scope user crawl4ai '{
  "type": "stdio",
  "command": "uv",
  "args": [
    "run",
    "--directory",
    "/path/to/crawl4ai-mcp",
    "python",
    "-m",
    "crawl4ai_mcp.server"
  ]
}'

Verify with claude mcp listcrawl4ai should appear.

Other MCP Clients

Add the following to your client's MCP server configuration:

{
  "crawl4ai": {
    "type": "stdio",
    "command": "uv",
    "args": [
      "run",
      "--directory",
      "/path/to/crawl4ai-mcp",
      "python",
      "-m",
      "crawl4ai_mcp.server"
    ]
  }
}

Replace /path/to/crawl4ai-mcp with the absolute path to your clone. The --directory flag is required — without it, uv run looks for the virtualenv in the client's working directory.

Usage Examples

Basic Crawling

"Crawl https://docs.python.org/3/library/asyncio.html and summarize the key concepts"

"Fetch https://example.com with a custom Authorization header"

JS-Heavy Pages

"Crawl this page using the js_heavy profile and wait for #content to appear"

Structured Extraction

"Extract all product names and prices from this page as JSON using CSS selectors"

"Use the LLM extractor to pull a list of API endpoints from this documentation page"

Batch Crawling

"Crawl all pages in this sitemap and summarize each one"

"Deep crawl this docs site to depth 2 and find all pages mentioning authentication"

"Crawl this list of URLs with a 1-second delay between requests to be respectful"

"Batch crawl these URLs and save all results to /tmp/crawl_results as markdown files"

Sessions

"Create a browser session, log into this site, then crawl the dashboard page"

Batch Crawling Options

All batch tools (crawl_many, deep_crawl, crawl_sitemap) support two optional parameters:

  • delay (default: 0): Politeness delay in seconds between requests. Use this to respect target servers and avoid overwhelming them. For deep_crawl, this becomes delay_before_return_html in crawl4ai. For crawl_many and crawl_sitemap, this wires a RateLimiter into request dispatch.

  • output_dir (default: None): Directory to write per-page .md files and a manifest.json instead of returning content inline. Useful for large batch crawls. When set, the tool returns a metadata summary with file paths instead of full page content.

Example:

# Crawl with politeness delay
crawl_many(urls=[...], delay=1.5)

# Crawl and save to disk
crawl_sitemap(sitemap_url="...", output_dir="/tmp/results")

# Both
deep_crawl(url="...", delay=0.5, output_dir="/tmp/crawl")

Profiles

Four built-in profiles control crawler behavior. Use the profile parameter on any crawl tool, or call list_profiles to see all options.

Profile Use Case Key Settings
default General-purpose crawling domcontentloaded wait, 60s timeout
fast Static pages, quick fetches domcontentloaded wait, 15s timeout, low word threshold
js_heavy SPAs, lazy-loaded content networkidle wait, 90s timeout, full-page scroll, overlay removal
stealth Anti-bot protected sites networkidle wait, 90s timeout, simulated user behavior, navigator masking

Profiles are YAML files in src/crawl4ai_mcp/profiles/. You can add custom profiles there.

Merge order: defaultnamed profileper-call overrides

Development

# Run the server directly (for debugging)
uv run python -m crawl4ai_mcp.server

# See server logs (stderr) while discarding MCP frames (stdout)
uv run python -m crawl4ai_mcp.server 2>&1 1>/dev/null

# Lint (catches print() calls that would corrupt stdio transport)
uv run ruff check src/

# Run tests
uv run pytest

# Diagnose crawl4ai / Playwright health
uv run crawl4ai-doctor

Troubleshooting

Tools don't appear in your MCP client Check that the --directory path in the registration command matches the actual project location. uv run without --directory looks for the virtualenv in the client's working directory, not this project.

Server disconnects immediately Any output to stdout (from a print() call or verbose=True in crawl4ai config) corrupts the MCP stdio transport. Check stderr for the actual error:

uv run python -m crawl4ai_mcp.server 2>&1 1>/dev/null

Chromium fails to start / "Playwright Chromium binary is missing or stale" Most often happens after uv sync upgrades Playwright to a new version — the cached Chromium under ~/Library/Caches/ms-playwright/ (macOS) or ~/.cache/ms-playwright/ (Linux) goes stale. The server runs a preflight check at startup and exits with a clear message in your MCP client's log; the fix is:

uv run crawl4ai-setup

Run uv run crawl4ai-doctor for a deeper diagnostic.

extract_structured returns an error about missing API key The LLM extraction tool requires a provider and corresponding API key (e.g., OPENAI_API_KEY). The extract_css tool is a free alternative that doesn't require an LLM.

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

This project is licensed under the Apache License 2.0 — see LICENSE for details.

This project uses crawl4ai, which is also licensed under Apache 2.0.

Acknowledgments


Created by Potter Digital

About

MCP server that gives AI assistants web crawling superpowers via crawl4ai

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors