Progressive Web Scraping: Four-Tier Fallback System

Summary

A Claude Code–native web scraping architecture that escalates automatically from the simplest free tool to paid professional infrastructure and stops as soon as a tier succeeds. Tiers 1–3 (WebFetch, cURL with Chrome headers, Playwright) are free and handle ~90–95% of sites. Tier 4 (Bright Data MCP) handles the remaining cases at ~$0.001–0.01 per request. Reported real-world cost: $0.31 over 3 weeks of heavy use.

Details

The Four Tiers

Tier	Tool	Cost	Speed	Coverage
1	WebFetch (built-in)	Free	2–5s	~60–70% of sites
2	cURL + Chrome headers	Free	3–7s	+20–30%
3	Playwright browser automation	Free	10–20s	+10–15%
4	Bright Data MCP	~$0.001–0.01/req	5–15s	Remainder (95%+ success)

Worst case (all 4 tiers): ~40 seconds. Best case (Tier 1): ~3 seconds. Average: ~10 seconds.
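
The worst-case figure can be sanity-checked by summing the per-tier latency ranges from the table. The sketch below is illustrative only: `tierLatencies` and `cumulativeLatency` are hypothetical names, and it assumes a failing tier takes as long as a succeeding one, which the source does not state.

```typescript
// Per-tier latency ranges in seconds, copied from the table above.
interface TierLatency {
  tier: number;
  minSeconds: number;
  maxSeconds: number;
}

const tierLatencies: TierLatency[] = [
  { tier: 1, minSeconds: 2, maxSeconds: 5 },
  { tier: 2, minSeconds: 3, maxSeconds: 7 },
  { tier: 3, minSeconds: 10, maxSeconds: 20 },
  { tier: 4, minSeconds: 5, maxSeconds: 15 },
];

// Total time when every tier up to and including `lastTier` runs in sequence.
function cumulativeLatency(lastTier: number): { min: number; max: number } {
  return tierLatencies
    .filter((t) => t.tier <= lastTier)
    .reduce(
      (acc, t) => ({ min: acc.min + t.minSeconds, max: acc.max + t.maxSeconds }),
      { min: 0, max: 0 },
    );
}

const allTiers = cumulativeLatency(4); // { min: 20, max: 47 }
```

Under that assumption a full escalation takes between 20 and 47 seconds, so the quoted ~40 second worst case sits inside the envelope implied by the table.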

Tier 1: WebFetch

Claude Code's built-in WebFetch tool handles the majority of sites:

WebFetch({
  url: "https://example.com",
  prompt: "Extract all content from this page and convert to markdown"
})

Features:
- Automatic HTML → Markdown conversion with AI-powered content extraction
- Built-in retry logic
- 15-minute cache for repeated requests to the same URL

Fails when: The site needs proper browser headers or JavaScript execution.

Tier 2: cURL with Chrome Headers

When WebFetch returns empty or blocked content, send full Chrome browser headers:

# Accept-Encoding is left to --compressed, which advertises only the encodings
# this curl build can decode. A hand-written "br" can make the server return
# Brotli data that a curl build without Brotli support cannot decompress.
curl -L -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8" \
  -H "Accept-Language: en-US,en;q=0.9" \
  -H "DNT: 1" \
  -H "Connection: keep-alive" \
  -H "Upgrade-Insecure-Requests: 1" \
  -H "Sec-Fetch-Dest: document" \
  -H "Sec-Fetch-Mode: navigate" \
  -H "Sec-Fetch-Site: none" \
  -H "Sec-Fetch-User: ?1" \
  -H "Cache-Control: max-age=0" \
  --compressed \
  "https://target-site.com"

Key headers and flags that matter:
- Sec-Fetch-* headers: Chrome's fetch-metadata headers, which signal a top-level, user-initiated navigation
- User-Agent: identifies the client as Chrome 120 on macOS
- --compressed: asks for a compressed response and transparently decompresses it

Fails when: The site requires actual JavaScript execution.
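
When the curl command above is driven from a script rather than typed by hand, it is safer to build it as an argument array than as one shell string, so the URL is never interpreted by a shell. The sketch below is one way to do that; `buildCurlArgs`, `CHROME_UA`, and `CHROME_HEADERS` are hypothetical names, and the header list is an abbreviated subset of the command above.

```typescript
const CHROME_UA =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

// Abbreviated subset of the headers used in the curl command above.
const CHROME_HEADERS: Array<[string, string]> = [
  ["Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"],
  ["Accept-Language", "en-US,en;q=0.9"],
  ["Upgrade-Insecure-Requests", "1"],
  ["Sec-Fetch-Dest", "document"],
  ["Sec-Fetch-Mode", "navigate"],
  ["Sec-Fetch-Site", "none"],
  ["Sec-Fetch-User", "?1"],
];

// Returns the argv for curl (without the leading "curl" itself).
function buildCurlArgs(url: string): string[] {
  // Reject anything that is not an http(s) URL, so a value such as
  // "-o somefile" can never be parsed by curl as an option.
  if (!/^https?:\/\//i.test(url)) {
    throw new Error("Only http(s) URLs are supported: " + url);
  }
  const args: string[] = ["-L", "-A", CHROME_UA];
  for (const [name, value] of CHROME_HEADERS) {
    args.push("-H", name + ": " + value);
  }
  args.push("--compressed", url);
  return args;
}
```

The resulting array can be handed to any process-spawning API that takes a program name plus an argument list.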

Tier 3: Playwright Browser Automation

Full headless Chromium execution handles SPAs and JavaScript-rendered content:

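import { chromium } from 'playwright';

// Top-level await requires an ES module (.mjs or "type": "module").
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://dynamic-site.com');
// Wait for network activity to settle so late-loading content is rendered.
await page.waitForLoadState('networkidle');
const content = await page.content(); // serialized HTML after JavaScript has run
await browser.close(); // release the browser process so the script can exit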

Handles: React/Vue/Angular apps, dynamic content loading, cookie/session state, network interception.

Cost: Still free. Slowdown: 10–20 seconds (running actual Chrome).

Fails when: The site requires specialized infrastructure (residential IPs, CAPTCHA solving, anti-bot bypass).

Tier 4: Bright Data MCP

Professional scraping infrastructure accessed via MCP tool calls:

Available tools:

// Single URL
mcp__Brightdata__scrape_as_markdown({ url: "https://complex-site.com" })

// Batch (up to 10 URLs)
mcp__Brightdata__scrape_batch({ urls: ["https://site1.com", "https://site2.com"] })

// Search engine results
mcp__Brightdata__search_engine({ query: "AI tools", engine: "google" })

// Batch search
mcp__Brightdata__search_engine_batch({ queries: [{ query: "...", engine: "google" }] })
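
Since the batch tool is described as taking up to 10 URLs per call, a longer URL list has to be split before it is sent. A minimal sketch of that split follows; `chunk` is a hypothetical helper, and the limit of 10 is taken from the comment above rather than verified against Bright Data's documentation.

```typescript
// Splits `items` into consecutive groups of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Example: 23 placeholder URLs become batches of 10, 10, and 3.
const urls: string[] = [];
for (let i = 1; i <= 23; i++) {
  urls.push("https://example.com/page/" + i);
}
const urlBatches = chunk(urls, 10);
```

Each element of `urlBatches` could then be passed as the `urls` argument of one batch call.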

Infrastructure features:
- 150M+ residential IPs from 195 countries (appears as regular consumer traffic)
- Automatic CAPTCHA solving (reCAPTCHA, hCaptcha)
- Geolocation targeting (150+ locations)
- Built-in fingerprinting and anti-bot bypass
- 95%+ success rate on publicly available data

Pricing: ~$0.001–0.01 per request; pay-per-successful-result. Real-world example: $0.31 total over 3 weeks of heavy use.
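
The two pricing figures together imply a rough request volume. The arithmetic below is a sketch under the assumption that the whole $0.31 was billed at per-request prices inside the quoted range; `impliedRequestRange` is a hypothetical helper name.

```typescript
// Given a total spend and a per-request price range, returns how many
// successful requests that spend could correspond to.
function impliedRequestRange(
  totalUsd: number,
  minPriceUsd: number,
  maxPriceUsd: number,
): { fewest: number; most: number } {
  return {
    fewest: Math.round(totalUsd / maxPriceUsd), // every request at the highest price
    most: Math.round(totalUsd / minPriceUsd), // every request at the lowest price
  };
}

const tier4Requests = impliedRequestRange(0.31, 0.001, 0.01); // { fewest: 31, most: 310 }
```

Over 21 days that works out to roughly 1.5 to 15 paid requests per day, which fits the claim that Tier 4 only fires for the residual hard cases.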

Setup: Add to .claude/mcp.json:

{
  "mcpServers": {
    "brightdata": {
      "command": "bunx",
      "args": ["-y", "@brightdata/mcp"],
      "env": { "API_TOKEN": "your_token_here" }
    }
  }
}

Escalation Flow

Input URL
  ↓
Tier 1 (WebFetch) ──→ Success? → Return content
  ↓ fail
Tier 2 (cURL + headers) ──→ Success? → Return content
  ↓ fail
Tier 3 (Playwright) ──→ Success? → Return content
  ↓ fail
Tier 4 (Bright Data) ──→ Return content (or site is down)
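
The flow above can be sketched as a loop over tier functions. This is a minimal sketch, not the skill's actual implementation: `Tier`, `TierResult`, and `scrapeWithFallback` are hypothetical names, and each tier is injected as a plain async function so that the real WebFetch, curl, Playwright, or Bright Data calls can be wrapped behind it.

```typescript
// Outcome of one tier attempt; `content` exists only on success.
type TierResult =
  | { ok: true; content: string }
  | { ok: false; reason: string };

interface Tier {
  name: string;
  run: (url: string) => Promise<TierResult>;
}

// Tries each tier in order, starting at `startIndex`, and returns the first
// non-empty result. Throws only when every remaining tier has failed.
async function scrapeWithFallback(
  url: string,
  tiers: Tier[],
  startIndex: number = 0,
): Promise<{ tier: string; content: string }> {
  const failures: string[] = [];
  for (const tier of tiers.slice(startIndex)) {
    try {
      const result = await tier.run(url);
      if (result.ok && result.content.trim().length > 0) {
        return { tier: tier.name, content: result.content };
      }
      // A "successful" but blank response still counts as a failure.
      failures.push(tier.name + ": " + (result.ok ? "empty content" : result.reason));
    } catch (err) {
      failures.push(tier.name + ": " + (err instanceof Error ? err.message : String(err)));
    }
  }
  throw new Error("All tiers failed for " + url + " (" + failures.join("; ") + ")");
}
```

Treating blank content as a failure matters here: a tier that returns an empty page with a 200 status would otherwise stop the escalation early.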

Error signals for tier selection:
- 403 error → skip to Tier 2+ (browser context needed)
- Empty content → skip to Tier 3 (JavaScript execution needed)
- CAPTCHA / block → skip to Tier 4
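
These signals can be turned into a small decision function. The sketch below is one possible reading of the rules, not the skill's own logic: `FailedAttempt`, `BLOCK_MARKERS`, and `nextTier` are hypothetical names, and the marker strings are illustrative guesses at what a block page might contain.

```typescript
interface FailedAttempt {
  tier: number; // tier that produced this response (1-4)
  status: number; // HTTP status code, or 0 if unknown
  body: string; // extracted text or HTML, possibly empty
}

// Illustrative substrings only; real block pages vary widely.
const BLOCK_MARKERS = ["captcha", "verify you are human", "access denied"];

// Returns the tier to try next, or null when no tier is left.
function nextTier(attempt: FailedAttempt): number | null {
  const text = attempt.body.trim().toLowerCase();
  const blocked = BLOCK_MARKERS.some((marker) => text.indexOf(marker) !== -1);

  let suggested = attempt.tier + 1; // default: plain one-step escalation
  if (blocked) {
    suggested = 4; // CAPTCHA / block page
  } else if (attempt.status === 403) {
    suggested = 2; // browser headers needed
  } else if (text.length === 0) {
    suggested = 3; // JavaScript execution needed
  }

  // Never repeat or go below the tier that just failed.
  const next = Math.max(suggested, attempt.tier + 1);
  return next <= 4 ? next : null;
}
```

The `Math.max` step keeps escalation monotonic: a 403 returned by Tier 2 moves on to Tier 3 rather than retrying Tier 2.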

Smart skip rules:
- Known SPAs (*.vercel.app, *.netlify.app) → start at Tier 3
- User explicitly requests Bright Data → start at Tier 4
- Previous scrape of this domain needed higher tier → start there
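
The skip rules amount to choosing a starting tier before the first request. A minimal sketch follows, assuming the caller already has the hostname and keeps its own per-host history; `StartOptions`, `SPA_HOST_SUFFIXES`, and `startingTier` are hypothetical names.

```typescript
// Host suffixes the rules above treat as likely single-page apps.
const SPA_HOST_SUFFIXES = [".vercel.app", ".netlify.app"];

interface StartOptions {
  forceBrightData?: boolean; // user explicitly asked for Bright Data
  lastTierByHost?: { [host: string]: number }; // tier that worked last time
}

// Returns the tier (1-4) at which escalation should begin for this host.
function startingTier(hostname: string, options: StartOptions = {}): number {
  if (options.forceBrightData) {
    return 4;
  }
  const host = hostname.toLowerCase();
  const isKnownSpa = SPA_HOST_SUFFIXES.some(
    (suffix) => host.slice(-suffix.length) === suffix,
  );
  const history = options.lastTierByHost || {};
  const remembered = Object.prototype.hasOwnProperty.call(history, host)
    ? history[host]
    : 1;
  // When several rules apply, start at the highest tier any of them asks for.
  return Math.max(isKnownSpa ? 3 : 1, remembered);
}
```

Taking the maximum is a design choice made here, since the source does not say how the rules combine when more than one matches.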

Use Cases

Japanese eCommerce research — Tiers 1–3 may return wrong region data; Tier 4's Japanese residential IPs show authentic pricing and regional products.

Cybersecurity investigation — Analyzing attacker infrastructure without revealing your organization's IP; Tier 4's residential IPs prevent tipping off adversaries.

Bypassing over-aggressive reverse proxies — Cloudflare and similar services block datacenter IPs; Tier 4's residential traffic bypasses these.

Note: Designed for publicly available data only. Not for bypassing authentication or accessing restricted content.

Cost Reality Check

Tiers 1–3 handle ~90–95% of requests at zero cost
Tier 4 activates only for the residual hard cases
Typical personal/professional weekly costs: $0.10–$7.50 depending on volume and Tier 4 hit rate
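
The weekly range depends on two inputs: how many requests are made and what share of them reach Tier 4. The sketch below shows the arithmetic; `weeklyTier4Cost` is a hypothetical helper, and the example volumes and hit rates are invented inputs chosen to land on the two ends of the quoted range, not figures from the source.

```typescript
// Expected weekly spend: only requests that reach Tier 4 are billed.
function weeklyTier4Cost(
  requestsPerWeek: number,
  tier4Share: number, // fraction of requests that fall through to Tier 4 (0 to 1)
  pricePerRequestUsd: number,
): number {
  if (tier4Share < 0 || tier4Share > 1) {
    throw new RangeError("tier4Share must be between 0 and 1");
  }
  return requestsPerWeek * tier4Share * pricePerRequestUsd;
}

// Hypothetical light use: 200 requests, 5% reach Tier 4, $0.01 each.
const lightWeek = weeklyTier4Cost(200, 0.05, 0.01);
// Hypothetical heavy use: 7,500 requests, 10% reach Tier 4, $0.01 each.
const heavyWeek = weeklyTier4Cost(7500, 0.1, 0.01);
```

With these inputs the light week comes to about $0.10 and the heavy week to about $7.50, which shows how strongly the bill tracks the Tier 4 hit rate rather than total volume alone.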

PAI (Personal AI) Repository

This system is available as a public Claude Code skill in Daniel Miessler's PAI repository — an open-source collection of AI workflows and skills aimed at democratizing advanced AI capabilities.

Key Claims & Data Points

WebFetch handles ~60–70% of sites with no additional tooling — [source: progressive_web_scraping_four_tier_system.md]
Tiers 1–3 are completely free and handle ~90–95% of scraping needs — [source: progressive_web_scraping_four_tier_system.md]
Real-world Tier 4 cost: $0.31 over 3 weeks of heavy use, averaging about $0.015/day — [source: progressive_web_scraping_four_tier_system.md]
Bright Data 95%+ success rate on publicly available data — [source: progressive_web_scraping_four_tier_system.md]
Written by "Kai," Daniel Miessler's AI assistant (AIL Tier 5 — highest AI involvement) — [source: progressive_web_scraping_four_tier_system.md]

Open Questions

How does the four-tier system handle sites requiring login — is there a fifth tier for authenticated scraping? (raised by: guides/progressive-web-scraping, 2026-04-09)
How does Bright Data's pricing compare to alternatives like Apify, ScrapingBee, or Oxylabs at scale? (raised by: guides/progressive-web-scraping, 2026-04-09)
Is the PAI skill library actively maintained, and how does skill quality compare to official sources? (raised by: guides/progressive-web-scraping, 2026-04-09)

Sources

Progressive Web Scraping with a Four-Tier Fallback System — Daniel Miessler / Kai; four-tier escalating scraper with Bright Data MCP integration (Nov 2025)