Skill Details (on-site mirror, no comments)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Version: v1.0.0
Stats: ⭐ 0 · 700 · 3 current installs · 3 all-time installs
Package: 1kalin/afrexai-web-scraping-engine
Security Scan (ClawHub)
- VirusTotal: Suspicious
- OpenClaw: Suspicious
OpenClaw Assessment
The skill's content matches its stated purpose (web scraping methodology) but it explicitly documents anti-detection and bypass techniques that raise legal/ethical concerns and could be misused.
Purpose
Name and description align with the SKILL.md content: comprehensive guidance for building scrapers, architecture choices, data pipelines, and anti-detection strategies. No declared credentials or installs are required, which is consistent with an instruction-only methodology skill.
Instruction Scope
The instructions go beyond benign best practices and include explicit anti-detection tactics (proxy rotation, fingerprint diversity, stealth configs, Cloudflare bypass references, and managed "anti-bot bypass" providers). That content can facilitate evading site protections, which raises legal and ethical risk even though the skill puts compliance checks first.
Install Mechanism
Instruction-only skill with no install spec and no code files — nothing is written to disk or downloaded by the skill itself, which lowers technical supply-chain risk.
Credentials
The skill declares no required environment variables or credentials (proportionate for a methodology document). However, it repeatedly recommends third-party proxy and scraping services (Bright Data, Oxylabs, ScrapingBee, etc.) that in practice require credentials and billing; those are not declared in the skill metadata, so users must be mindful to supply and protect such secrets outside the skill.
Persistence
always:false and no special privileges requested. Autonomous invocation is allowed (platform default) — combined with the anti-detection guidance this increases potential misuse, but the skill itself does not request persistent or cross-skill configuration changes.
Install (copy to 龙虾 AI)
Copy the entire block below into a 龙虾中文库 conversation; 龙虾 will complete the installation per SKILL.md.
Hand this block to 龙虾中文库 (龙虾 AI) to execute: install the OpenClaw skill "Web Scraping & Data Extraction Engine" on this machine. Summary: Complete web scraping methodology — legal compliance, architecture design, anti….
Fetch the following URL, read SKILL.md, and complete the installation per the document: https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/1kalin/afrexai-web-scraping-engine/SKILL.md
(Source: yingzhi8.cn skill library)
SKILL.md
---
name: Web Scraping & Data Extraction Engine
description: Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrapers, extracting web data, monitoring competitors, or automating data collection at scale.
---
# Web Scraping & Data Extraction Engine
## Quick Health Check (Run First)
Score your scraping operation (2 points each):
| Signal | Healthy | Unhealthy |
|--------|---------|-----------|
| Legal compliance | robots.txt checked, ToS reviewed | Scraping blindly |
| Architecture | Tool matches site complexity | Using Puppeteer for static HTML |
| Anti-detection | Rotation, delays, fingerprint diversity | Single IP, no delays |
| Data quality | Validation + dedup pipeline | Raw dumps, no cleaning |
| Error handling | Retry logic, circuit breakers | Crashes on first 403 |
| Monitoring | Success rates tracked, alerts set | No visibility |
| Storage | Structured, deduplicated, versioned | Flat files, duplicates |
| Scheduling | Appropriate frequency, off-peak | Hammering during business hours |
**Score: /16** → 12+: Production-ready | 8-11: Needs work | <8: Stop and redesign
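The scoring arithmetic above can be sketched in a few lines; the signal names mirror the table, and the function itself is illustrative, not part of the skill:

```python
def health_score(signals: dict[str, bool]) -> tuple[int, str]:
    """Score a scraping operation: 2 points per healthy signal, 16 max."""
    score = 2 * sum(signals.values())
    if score >= 12:
        verdict = "Production-ready"
    elif score >= 8:
        verdict = "Needs work"
    else:
        verdict = "Stop and redesign"
    return score, verdict

# Example: 6 of 8 signals healthy
signals = {s: True for s in [
    "legal_compliance", "architecture", "anti_detection", "data_quality",
    "error_handling", "monitoring", "storage", "scheduling",
]}
signals["monitoring"] = False
signals["scheduling"] = False
print(health_score(signals))  # (12, 'Production-ready')
```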
---
## Phase 1: Legal & Ethical Foundation
### Pre-Scrape Compliance Checklist
```yaml
compliance_brief:
  target_domain: ""
  date_assessed: ""
  robots_txt:
    checked: false
    target_paths_allowed: false
    crawl_delay_specified: ""
    ai_bot_rules: ""  # Many sites now block AI crawlers specifically
  terms_of_service:
    reviewed: false
    scraping_mentioned: false
    scraping_prohibited: false
    api_available: false
    api_sufficient: false
  data_classification:
    type: ""  # public-factual | public-personal | behind-auth | copyrighted
    contains_pii: false
    pii_types: []  # name, email, phone, address, photo
    gdpr_applies: false  # EU residents' data
    ccpa_applies: false  # California residents' data
  legal_risk: ""  # low | medium | high | do-not-scrape
  decision: ""  # proceed | use-api | request-permission | abandon
  justification: ""
```
### Legal Landscape Quick Reference
| Scenario | Risk Level | Key Case Law |
|----------|-----------|--------------|
| Public data, no login, robots.txt allows | LOW | hiQ v. LinkedIn (2022) |
| Public data, robots.txt disallows | MEDIUM | Meta v. Bright Data (2024) |
| Behind authentication | HIGH | Van Buren v. US (2021), CFAA |
| Personal data without consent | HIGH | GDPR Art. 6, CCPA §1798.100 |
| Republishing copyrighted content | HIGH | Copyright Act §106 |
| Price/product comparison | LOW-MEDIUM | eBay v. Bidder's Edge (2000) — trespass to chattels found against the scraper |
| Academic/research use | LOW-MEDIUM | Varies by jurisdiction |
| Bypassing anti-bot measures | HIGH | CFAA "exceeds authorized access" |
### Decision Rules
1. **API exists and covers your needs?** → Use the API. Always.
2. **robots.txt disallows your target?** → Respect it unless you have written permission.
3. **Data behind login?** → Do not scrape without explicit authorization.
4. **Contains PII?** → GDPR/CCPA compliance required before collection.
5. **Copyrighted content?** → Extract facts/data points only, never full content.
6. **Site explicitly prohibits scraping?** → Request permission or find alternative source.
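The robots.txt portion of these rules can be automated with the standard library. A minimal sketch — in production you would first fetch `https://<domain>/robots.txt` yourself; the rules string and bot name below are placeholders:

```python
from urllib.robotparser import RobotFileParser

def check_robots(robots_txt: str, user_agent: str, target_path: str) -> dict:
    """Evaluate already-fetched robots.txt text for one bot and one path."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {
        "allowed": rp.can_fetch(user_agent, target_path),
        "crawl_delay": rp.crawl_delay(user_agent),  # None if unspecified
    }

rules = "User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n"
print(check_robots(rules, "MyScraper", "/products"))
# {'allowed': True, 'crawl_delay': 5}
```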
### AI Crawler Considerations (2025+)
Many sites now specifically block AI-related crawlers:
```
# Common AI bot blocks in robots.txt
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: CCBot
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: PerplexityBot
Disallow: /
```
**Rule**: If collecting data for AI training, check for these specific blocks.
---
## Phase 2: Architecture Decision
### Tool Selection Matrix
| Tool/Approach | Best For | Speed | JS Support | Complexity | Cost |
|---------------|----------|-------|------------|------------|------|
| HTTP client (requests/axios) | Static HTML, APIs | ⚡⚡⚡ | ❌ | Low | Free |
| Beautiful Soup / Cheerio | Static HTML parsing | ⚡⚡⚡ | ❌ | Low | Free |
| Scrapy | Large-scale structured crawling | ⚡⚡⚡ | Plugin | Medium | Free |
| Playwright / Puppeteer | JS-rendered, SPAs, interactions | ⚡ | ✅ | Medium | Free |
| Selenium | Legacy, browser automation | ⚡ | ✅ | High | Free |
| Crawlee | Hybrid (HTTP + browser fallback) | ⚡⚡ | ✅ | Medium | Free |
| Firecrawl / ScrapingBee | Managed, anti-bot bypass | ⚡⚡ | ✅ | Low | Paid |
| Bright Data / Oxylabs | Enterprise, proxy + browser | ⚡⚡ | ✅ | Low | Paid |
### Decision Tree
```
Is the content in the initial HTML source?
├── YES → Is the site structure consistent?
│   ├── YES → Static scraper (requests + BeautifulSoup/Cheerio)
│   └── NO → Scrapy with custom parsers
└── NO → Does the page require user interaction?
    ├── YES → Playwright/Puppeteer with interaction scripts
    └── NO → Playwright in non-interactive mode
        ├── At scale (>10K pages)? → Crawlee (hybrid mode)
        └── Heavy anti-bot? → Managed service (Firecrawl/ScrapingBee)
```
### Architecture Brief YAML
```yaml
scraping_project:
  name: ""
  objective: ""  # What data, why, how often
  targets:
    - domain: ""
      pages_estimated: 0
      rendering: "static" | "javascript" | "spa"
      anti_bot: "none" | "basic" | "cloudflare" | "advanced"
      rate_limit: ""  # requests per second safe limit
  tool_selected: ""
  justification: ""
  data_schema:
    fields: []
    output_format: ""  # json | csv | database
  schedule:
    frequency: ""  # once | hourly | daily | weekly
    preferred_time: ""  # off-peak for target timezone
  infrastructure:
    proxy_needed: false
    proxy_type: ""  # residential | datacenter | mobile
    storage: ""
    monitoring: ""
```
---
## Phase 3: Request Engineering
### HTTP Request Best Practices
```python
# Python example — production request pattern
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry strategy
retry = Retry(
    total=3,
    backoff_factor=1,  # 1s, 2s, 4s
    status_forcelist=[429, 500, 502, 503, 504],
    respect_retry_after_header=True,
)
session.mount("https://", HTTPAdapter(max_retries=retry))

# Realistic headers
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Cache-Control": "no-cache",
})
```
### Header Rotation Strategy
Rotate these to avoid fingerprinting:
| Header | Rotation Pool Size | Notes |
|--------|-------------------|-------|
| User-Agent | 20-50 real browser UAs | Match OS distribution |
| Accept-Language | 5-10 locale combos | Match proxy geo |
| Sec-Ch-Ua | Match User-Agent | Chrome/Edge/Brave |
| Referer | Vary per request | Previous page or search engine |
### Rate Limiting Rules
| Site Type | Safe Delay | Aggressive (risky) |
|-----------|-----------|-------------------|
| Small business site | 5-10 seconds | 2-3 seconds |
| Medium site | 2-5 seconds | 1-2 seconds |
| Large platform (Amazon, etc.) | 3-5 seconds | 1 second |
| API endpoint | Per API docs | Never exceed |
| robots.txt crawl-delay | Respect exactly | Never below |
**Rules:**
1. Always respect `Crawl-delay` in robots.txt
2. Add random jitter (±30%) to avoid pattern detection
3. Slow down during business hours for smaller sites
4. Respect `Retry-After` headers — they mean it
5. Watch for 429s — back off exponentially (2x each time)
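Rule 2's jitter takes only a few lines; a sketch, where the base delay comes from the "Safe Delay" column above:

```python
import random
import time

def polite_sleep(base_delay: float, jitter: float = 0.3) -> float:
    """Sleep base_delay ± jitter (default ±30%) to avoid a fixed cadence."""
    delay = base_delay * random.uniform(1 - jitter, 1 + jitter)
    time.sleep(delay)
    return delay
```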
---
## Phase 4: Parsing & Extraction
### CSS Selector Strategy (Priority Order)
1. **Data attributes** → `[data-product-id]`, `[data-price]` (most stable)
2. **Semantic IDs** → `#product-title`, `#price` (stable but can change)
3. **ARIA attributes** → `[aria-label="Price"]` (accessibility, fairly stable)
4. **Semantic HTML** → `article`, `main`, `nav` (structural, stable)
5. **Class names** → `.product-card` (can change with redesigns)
6. **XPath position** → `//div[3]/span[2]` (FRAGILE — last resort)
### Extraction Patterns
**Structured data first** — Check before writing CSS selectors:
```python
# 1. Check JSON-LD (best source — structured, clean)
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for script in soup.find_all('script', type='application/ld+json'):
    data = json.loads(script.string)
    # Often contains: Product, Article, Organization, etc.

# 2. Check Open Graph meta tags
og_title = soup.find('meta', property='og:title')
og_price = soup.find('meta', property='product:price:amount')

# 3. Check microdata
items = soup.find_all(itemtype=True)

# 4. Fall back to CSS selectors only if above are empty
```
**Table extraction pattern:**
```python
import pandas as pd

# Quick table extraction
tables = pd.read_html(html)  # Returns list of DataFrames

# For complex tables with merged cells
def extract_table(soup, selector):
    table = soup.select_one(selector)
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        rows.append(dict(zip(headers, cells)))
    return rows
```
**Pagination handling:**
```python
# Pattern 1: Next button
while True:
    # ... scrape current page ...
    next_link = soup.select_one('a.next-page, [rel="next"], .pagination .next a')
    if not next_link or not next_link.get('href'):
        break
    url = urljoin(base_url, next_link['href'])

# Pattern 2: API pagination (infinite scroll sites)
page = 1
while True:
    resp = session.get(f"{api_url}?page={page}&limit=50")
    data = resp.json()
    if not data.get('results'):
        break
    # ... process results ...
    page += 1

# Pattern 3: Cursor-based
cursor = None
while True:
    params = {"limit": 50}
    if cursor:
        params["cursor"] = cursor
    resp = session.get(api_url, params=params)
    data = resp.json()
    # ... process ...
    cursor = data.get('next_cursor')
    if not cursor:
        break
```
### JavaScript-Rendered Content
```python
# Playwright pattern for JS-rendered pages
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 ...",
    )
    page = context.new_page()
    # Block unnecessary resources (speed + stealth)
    page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}",
               lambda route: route.abort())
    page.goto(url, wait_until="networkidle")
    # Wait for specific content (better than arbitrary sleep)
    page.wait_for_selector('[data-product-id]', timeout=10000)
    # Extract after JS rendering
    content = page.content()
    # ... parse with BeautifulSoup/Cheerio ...
    browser.close()
```
---
## Phase 5: Anti-Detection & Stealth
### Detection Signals (What Sites Check)
| Signal | Detection Method | Mitigation |
|--------|-----------------|------------|
| IP reputation | IP blacklists, datacenter ranges | Residential proxies |
| Request rate | Requests/min from same IP | Rate limiting + jitter |
| TLS fingerprint | JA3/JA4 hash matching | Use real browser or curl-impersonate |
| Browser fingerprint | Canvas, WebGL, fonts | Playwright with stealth plugin |
| JavaScript challenges | Cloudflare Turnstile, hCaptcha | Managed browser services |
| Cookie/session behavior | Missing cookies, no history | Full session management |
| Navigation pattern | Direct URL hits, no referrer | Simulate natural browsing |
| Mouse/keyboard events | No interaction telemetry | Event simulation (Playwright) |
| Header consistency | Mismatched headers vs UA | Header sets that match |
### Proxy Strategy
```yaml
proxy_strategy:
  # Tier 1: Free/Datacenter (for non-protected sites)
  basic:
    type: "datacenter"
    cost: "$1-5/GB"
    success_rate: "60-80%"
    use_for: "APIs, small sites, no anti-bot"
  # Tier 2: Residential (for most protected sites)
  standard:
    type: "residential"
    cost: "$5-15/GB"
    success_rate: "90-95%"
    use_for: "Cloudflare, major platforms"
    rotation: "per-request or sticky 10min"
  # Tier 3: Mobile/ISP (for maximum stealth)
  premium:
    type: "mobile"
    cost: "$15-30/GB"
    success_rate: "95-99%"
    use_for: "Aggressive anti-bot, social media"
rules:
  - Start with cheapest tier, escalate only on blocks
  - Match proxy geo to target audience geo
  - Rotate on 403/429, not every request
  - Use sticky sessions for multi-page scrapes
  - Monitor proxy health — remove slow/blocked IPs
```
### Playwright Stealth Configuration
```python
# Essential stealth for Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-features=IsolateOrigins,site-per-process',
        ],
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
        permissions=["geolocation"],
    )
    # Remove automation indicators
    page = context.new_page()
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
    """)
```
### Cloudflare Bypass Decision
```
Cloudflare detected?
├── JS Challenge only → Playwright with stealth + residential proxy
├── Turnstile CAPTCHA → Managed service (ScrapingBee/Bright Data)
├── Under Attack Mode → Wait, try later, or managed service
└── WAF blocking → Different approach needed
    ├── Check for API endpoints (network tab)
    ├── Check for mobile app API
    └── Consider if data is available elsewhere
```
---
## Phase 6: Data Pipeline & Quality
### Data Validation Rules
```python
# Validation pattern — validate BEFORE storing
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime

@dataclass
class ScrapedProduct:
    url: str
    title: str
    price: Optional[float]
    currency: str = "USD"
    scraped_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())

    def validate(self) -> list[str]:
        errors = []
        if not self.url.startswith('http'):
            errors.append("Invalid URL")
        if not self.title or len(self.title) < 3:
            errors.append("Title too short or missing")
        if self.price is not None and self.price < 0:
            errors.append("Negative price")
        if self.price is not None and self.price > 1_000_000:
            errors.append("Price suspiciously high — verify")
        if self.currency not in ("USD", "EUR", "GBP", "BTC"):
            errors.append(f"Unknown currency: {self.currency}")
        return errors
```
### Deduplication Strategy
| Method | When to Use | Implementation |
|--------|------------|----------------|
| URL-based | Pages with unique URLs | Hash the canonical URL |
| Content hash | Same URL, changing content | MD5/SHA256 of key fields |
| Fuzzy matching | Near-duplicate detection | Jaccard similarity > 0.85 |
| Composite key | Multi-field uniqueness | Hash(domain + product_id + variant) |
```python
import hashlib

def dedup_key(item: dict, fields: list[str]) -> str:
    """Generate dedup key from selected fields."""
    values = "|".join(str(item.get(f, "")) for f in fields)
    return hashlib.sha256(values.encode()).hexdigest()

# Usage
seen = set()
clean_items = []
for item in scraped_items:
    key = dedup_key(item, ["url", "product_id"])
    if key not in seen:
        seen.add(key)
        clean_items.append(item)
```
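The table's fuzzy-matching row can be sketched with token-set Jaccard similarity; the 0.85 threshold comes from the table, and should be tuned per dataset:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    return jaccard(a, b) >= threshold

print(jaccard("Red Widget 2-Pack", "Red Widget 2-Pack (New)"))  # 0.75
```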
### Data Cleaning Pipeline
```
Raw HTML → Parse → Extract → Validate → Clean → Deduplicate → Store
↓
Quarantine (failed validation)
```
**Common cleaning operations:**
| Problem | Solution |
|---------|----------|
| HTML entities (`&amp;`) | `html.unescape()` |
| Extra whitespace | `" ".join(text.split())` |
| Unicode issues | `unicodedata.normalize('NFKD', text)` |
| Price in text ("$49.99") | Regex: `r'[$£€]?([\d,]+\.?\d*)'` |
| Date formats vary | `dateutil.parser.parse()` with `dayfirst` flag |
| Relative URLs | `urllib.parse.urljoin(base, relative)` |
| Encoding issues | `chardet.detect()` then decode |
---
## Phase 7: Storage & Export
### Storage Decision Guide
| Volume | Frequency | Query Needs | Recommendation |
|--------|-----------|-------------|----------------|
| <10K records | One-time | None | JSON/CSV files |
| <10K records | Recurring | Simple lookups | SQLite |
| 10K-1M records | Recurring | Complex queries | PostgreSQL |
| 1M+ records | Continuous | Analytics | PostgreSQL + partitioning |
| Append-only logs | Continuous | Time-series | ClickHouse / TimescaleDB |
### SQLite Pattern (Most Common)
```python
import sqlite3
import json

def init_db(path="scraper_data.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE,
            data JSON NOT NULL,
            scraped_at TEXT DEFAULT (datetime('now')),
            updated_at TEXT,
            checksum TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_url ON items(url)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_scraped ON items(scraped_at)")
    return conn

def upsert(conn, url, data, checksum):
    conn.execute("""
        INSERT INTO items (url, data, checksum) VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            data = excluded.data,
            updated_at = datetime('now'),
            checksum = excluded.checksum
        WHERE items.checksum != excluded.checksum
    """, (url, json.dumps(data), checksum))
    conn.commit()
```
### Export Formats
```python
import csv
import json

# CSV export
def to_csv(items, path, fields):
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(items)

# JSON Lines (best for large datasets — streaming)
def to_jsonl(items, path):
    with open(path, 'w') as f:
        for item in items:
            f.write(json.dumps(item) + '\n')

# Incremental export (only new/changed since last export)
def export_since(conn, last_export_time):
    cursor = conn.execute(
        "SELECT data FROM items WHERE scraped_at > ? OR updated_at > ?",
        (last_export_time, last_export_time)
    )
    return [json.loads(row[0]) for row in cursor]
```
---
## Phase 8: Error Handling & Resilience
### Error Classification
| HTTP Code | Meaning | Action |
|-----------|---------|--------|
| 200 | Success | Process normally |
| 301/302 | Redirect | Follow (max 5 hops) |
| 403 | Forbidden/blocked | Rotate proxy, slow down |
| 404 | Not found | Log, skip, mark URL dead |
| 429 | Rate limited | Respect Retry-After, back off 2x |
| 500-504 | Server error | Retry 3x with backoff |
| Connection timeout | Network issue | Retry with different proxy |
| SSL error | Certificate issue | Log, investigate, skip |
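The table's action column can be encoded as a small dispatcher; a sketch, with illustrative action names rather than a fixed API:

```python
def classify_response(status: int) -> str:
    """Map an HTTP status code to an action from the table above."""
    if status == 200:
        return "process"
    if status in (301, 302):
        return "follow_redirect"          # Cap at 5 hops
    if status == 403:
        return "rotate_proxy_and_slow_down"
    if status == 404:
        return "mark_dead"
    if status == 429:
        return "backoff"                  # Respect Retry-After
    if 500 <= status <= 504:
        return "retry_with_backoff"       # Up to 3 attempts
    return "log_and_skip"

print(classify_response(429))  # 'backoff'
```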
### Circuit Breaker Pattern
```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=300):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure = 0
        self.state = "closed"  # closed | open | half-open

    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()
        if self.failures >= self.threshold:
            self.state = "open"
            # Alert: "Circuit open — too many failures"

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def can_proceed(self):
        if self.state == "closed":
            return True
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = "half-open"
                return True  # Try one request
            return False
        return True  # half-open: allow attempt
```
### Checkpoint & Resume
```python
import json
from pathlib import Path

class Checkpointer:
    def __init__(self, path="checkpoint.json"):
        self.path = Path(path)
        self.state = self._load()

    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"completed_urls": [], "last_page": 0, "cursor": None}

    def save(self):
        self.path.write_text(json.dumps(self.state))

    def is_done(self, url):
        return url in self.state["completed_urls"]

    def mark_done(self, url):
        self.state["completed_urls"].append(url)
        if len(self.state["completed_urls"]) % 50 == 0:
            self.save()  # Periodic save
```
---
## Phase 9: Monitoring & Operations
### Scraper Health Dashboard
```yaml
dashboard:
  real_time:
    - metric: "requests_per_minute"
      alert_if: "> 60 for small sites"
    - metric: "success_rate"
      alert_if: "< 90%"
    - metric: "avg_response_time_ms"
      alert_if: "> 5000"
    - metric: "blocked_rate"
      alert_if: "> 10%"
  per_run:
    - metric: "pages_scraped"
    - metric: "items_extracted"
    - metric: "items_validated"
    - metric: "items_deduplicated"
    - metric: "new_items"
    - metric: "updated_items"
    - metric: "errors_by_type"
    - metric: "run_duration"
    - metric: "proxy_cost"
  weekly:
    - metric: "data_freshness"
      description: "% of records updated in last 7 days"
    - metric: "site_structure_changes"
      description: "Selectors that stopped matching"
    - metric: "total_cost"
      description: "Proxy + compute + storage"
```
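One way to make the real-time `alert_if` thresholds above executable — a sketch that hard-codes the comparisons rather than parsing the YAML strings; metric names and units follow the config:

```python
def evaluate_alerts(metrics: dict) -> list[str]:
    """Compare real-time metrics to the dashboard thresholds above."""
    alerts = []
    if metrics.get("success_rate", 1.0) < 0.90:
        alerts.append("success_rate below 90%")
    if metrics.get("avg_response_time_ms", 0) > 5000:
        alerts.append("avg response time above 5000ms")
    if metrics.get("blocked_rate", 0.0) > 0.10:
        alerts.append("blocked rate above 10%")
    return alerts

print(evaluate_alerts({"success_rate": 0.85,
                       "avg_response_time_ms": 1200,
                       "blocked_rate": 0.02}))
# ['success_rate below 90%']
```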
### Breakage Detection
Sites redesign. Selectors break. Detect it early:
```python
def health_check(results: list[dict], expected_fields: list[str]) -> dict:
    """Check if scraper is still extracting correctly."""
    total = len(results)
    if total == 0:
        return {"status": "CRITICAL", "message": "Zero results — likely broken"}
    field_coverage = {}
    for field in expected_fields:
        filled = sum(1 for r in results if r.get(field))
        field_coverage[field] = filled / total
    issues = []
    for field, coverage in field_coverage.items():
        if coverage < 0.5:
            issues.append(f"{field}: {coverage:.0%} fill rate (expected >50%)")
    if issues:
        return {"status": "WARNING", "issues": issues}
    return {"status": "OK", "field_coverage": field_coverage}
```
### Operational Runbook
**Daily:**
- Check success rate per target domain
- Review error logs for new patterns
- Verify data freshness
**Weekly:**
- Compare extraction counts vs baseline (>20% drop = investigate)
- Review proxy spend
- Spot-check 10 random records for accuracy
**Monthly:**
- Full selector validation against live pages
- Review legal compliance (robots.txt changes, ToS updates)
- Cost optimization review
- Prune dead URLs from queue
---
## Phase 10: Common Scraping Patterns
### Pattern 1: E-commerce Price Monitor
```yaml
use_case: "Track competitor prices daily"
tool: "requests + BeautifulSoup"
schedule: "Daily at 03:00 UTC (off-peak)"
targets: ["competitor-a.com/products", "competitor-b.com/api"]
data:
  - product_id
  - product_name
  - price
  - currency
  - in_stock
  - scraped_at
storage: "SQLite with price history"
alerts: "Price change > 10% → notify"
```
### Pattern 2: Job Board Aggregator
```yaml
use_case: "Aggregate job listings from multiple boards"
tool: "Scrapy with per-site spiders"
schedule: "Every 6 hours"
targets: ["board-a.com", "board-b.com", "board-c.com"]
data:
  - title
  - company
  - location
  - salary_range
  - posted_date
  - url
  - source
dedup: "Hash(title + company + location)"
storage: "PostgreSQL"
```
### Pattern 3: News & Content Monitor
```yaml
use_case: "Monitor industry news mentions"
tool: "requests + RSS feeds (preferred) + web fallback"
schedule: "Every 30 minutes"
approach:
  1: "RSS/Atom feeds (fastest, cleanest)"
  2: "Google News RSS for topic"
  3: "Direct scraping if no feed"
data:
  - headline
  - source
  - url
  - published_at
  - snippet
  - sentiment
alerts: "Keyword match → immediate notification"
```
### Pattern 4: Social Media Intelligence
```yaml
use_case: "Track brand mentions and sentiment"
tool: "Official APIs (always) + web search fallback"
rules:
  - "NEVER scrape social platforms directly — use APIs"
  - "Twitter/X: Official API ($100/mo basic)"
  - "Reddit: Official API (free tier available)"
  - "LinkedIn: No scraping (aggressive legal action)"
  - "Instagram: Official API only (Meta Business)"
fallback: "Brave/Google search for public mentions"
```
### Pattern 5: Real Estate Listings
```yaml
use_case: "Track property listings and prices"
tool: "Playwright (most listing sites are JS-heavy)"
schedule: "Daily"
challenges:
  - Heavy JavaScript rendering
  - Anti-bot measures (Cloudflare common)
  - Frequent layout changes
  - Map-based results
approach: "API endpoint discovery via network tab first"
```
---
## Phase 11: Scaling Strategies
### Concurrency Architecture
```
Single machine (small scale):
├── asyncio + aiohttp (Python) → 50-200 concurrent requests
├── Worker pool (ThreadPoolExecutor) → 10-50 threads
└── Scrapy reactor → Built-in concurrency
Multi-machine (large scale):
├── URL queue: Redis / RabbitMQ / SQS
├── Workers: Multiple Scrapy/custom workers
├── Results: Shared PostgreSQL / S3
└── Coordinator: Celery / custom scheduler
```
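The single-machine worker-pool branch above can be sketched with the standard library; `fetch_one` is a stand-in for your real request function:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(urls, fetch_one, max_workers=20):
    """Fan URLs out to a thread pool; collect results as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:
                results[url] = exc  # Keep the error for a retry queue
    return results

# Stub fetcher for illustration — swap in a real HTTP call
print(scrape_all(["a", "b"], lambda u: u.upper()))
```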
### Cost Optimization
| Lever | Impact | How |
|-------|--------|-----|
| Static > Browser | 10-50x cheaper | Always try HTTP first |
| Block images/CSS/fonts | 60-80% bandwidth saved | Route filtering |
| Cache DNS | Minor but cumulative | Local DNS cache |
| Compress responses | 50-70% bandwidth | Accept-Encoding: gzip, br |
| Smart scheduling | Avoid redundant scrapes | Change detection before full re-scrape |
| Proxy tier matching | 3-10x cost difference | Don't use residential for easy sites |
---
## Phase 12: Advanced Patterns
### API Discovery (Network Tab Mining)
Before building a scraper, check if the site has hidden API endpoints:
1. Open DevTools → Network tab
2. Filter by XHR/Fetch
3. Navigate the site, click load-more, filter/sort
4. Look for JSON responses — these are your goldmine
5. Most SPAs load data via REST/GraphQL APIs
**Common hidden API patterns:**
- `/api/v1/products?page=1&limit=20`
- `/graphql` with query parameters
- `/_next/data/...` (Next.js data routes)
- `/wp-json/wp/v2/posts` (WordPress)
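Once an endpoint like the first pattern above is found, draining it is simple. A sketch with the fetcher injected so it works with any HTTP client; the endpoint, `page`/`limit` parameters, and `results` key are hypothetical and must match what the network tab actually shows:

```python
def drain_api(fetch_json, base_url, limit=20, max_pages=1000):
    """Walk a page/limit API until it returns an empty result list."""
    items = []
    for page in range(1, max_pages + 1):
        data = fetch_json(f"{base_url}?page={page}&limit={limit}")
        batch = data.get("results", [])
        if not batch:
            break
        items.extend(batch)
    return items

# Stub backend: 3 pages of fake items, then empty — replace with session.get(...).json()
def fake_fetch(url):
    page = int(url.split("page=")[1].split("&")[0])
    return {"results": [f"item-{page}"] if page <= 3 else []}

print(drain_api(fake_fetch, "/api/v1/products"))  # ['item-1', 'item-2', 'item-3']
```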
### Headless Browser Optimization
```python
# Minimize browser resource usage
context = browser.new_context(
    viewport={"width": 1280, "height": 720},
    java_script_enabled=True,  # Only if needed
    has_touch=False,
    is_mobile=False,
)

# Block resource types you don't need
page.route("**/*", lambda route: (
    route.abort() if route.request.resource_type in
    ["image", "stylesheet", "font", "media"]
    else route.continue_()
))
```
### Scraping Behind Authentication
```python
# When authorized to scrape behind login
# ALWAYS use session-based auth, never store passwords in code
import os
import requests

# Pattern: Login once, reuse session
session = requests.Session()
login_resp = session.post("https://example.com/login", data={
    "username": os.environ["SCRAPE_USER"],
    "password": os.environ["SCRAPE_PASS"],
})
assert login_resp.ok, "Login failed"

# Session cookies are now stored — use for subsequent requests
data_resp = session.get("https://example.com/api/data")
```
### Change Detection (Avoid Redundant Scrapes)
```python
def has_changed(url, session, last_etag=None, last_modified=None):
    """Check if page changed without downloading full content."""
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = session.head(url, headers=headers)
    if resp.status_code == 304:
        return False, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    return True, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```
---
## Quality Scoring Rubric (0-100)
| Dimension | Weight | What to Assess |
|-----------|--------|---------------|
| Legal compliance | 20% | robots.txt, ToS, PII handling, audit trail |
| Data quality | 20% | Validation, accuracy, completeness, freshness |
| Resilience | 15% | Error handling, retries, circuit breakers, checkpointing |
| Anti-detection | 15% | Proxy rotation, fingerprint diversity, rate limiting |
| Architecture | 10% | Right tool selection, clean code, modularity |
| Monitoring | 10% | Success rates, breakage detection, alerting |
| Performance | 5% | Speed, cost efficiency, resource usage |
| Documentation | 5% | Runbook, schema docs, legal assessment |
**Grading:** 90+ Excellent | 75-89 Good | 60-74 Needs work | <60 Redesign
---
## 10 Common Mistakes
| # | Mistake | Fix |
|---|---------|-----|
| 1 | No robots.txt check | Always check first — it's your legal defense |
| 2 | Fixed delays (no jitter) | Add ±30% random jitter to all delays |
| 3 | No data validation | Validate every field before storing |
| 4 | Using browser for static HTML | HTTP client is 10-50x faster and cheaper |
| 5 | Single IP, no rotation | Proxy rotation for any serious scraping |
| 6 | No breakage detection | Monitor extraction counts and field fill rates |
| 7 | Storing raw HTML only | Extract + structure immediately |
| 8 | No checkpoint/resume | Long scrapes must be resumable |
| 9 | Ignoring structured data | JSON-LD/microdata is cleaner than CSS selectors |
| 10 | Scraping when API exists | Always check for API first |
---
## 5 Edge Cases
1. **Single-page apps (React/Vue/Angular)**: Must use browser rendering OR find the underlying API (network tab). Prefer API discovery — it's faster and more reliable.
2. **Infinite scroll**: Intercept the XHR/fetch calls that load more content. Simulate scrolling only as last resort. The API endpoint usually accepts `page` or `offset` params.
3. **CAPTCHAs**: If you're hitting CAPTCHAs, you're scraping too aggressively. Slow down first. If CAPTCHAs persist: managed services (2Captcha, Anti-Captcha) or rethink approach.
4. **Dynamic class names** (CSS modules, Tailwind): Use data attributes, ARIA labels, or text content selectors instead. `[data-testid="price"]` survives redesigns. `.sc-bdVTJa` does not.
5. **Multi-language sites**: Detect language via `html[lang]` attribute. Set `Accept-Language` header to get desired locale. Watch for different URL structures (`/en/`, `/de/`, subdomains).
---
## Natural Language Commands
1. **"Check if I can scrape [URL]"** → Run compliance checklist (robots.txt, ToS, data type)
2. **"What tool should I use for [site]?"** → Analyze site rendering, anti-bot, recommend tool
3. **"Build a scraper for [description]"** → Full architecture brief + code pattern
4. **"My scraper is getting blocked"** → Anti-detection diagnostic + proxy/stealth recommendations
5. **"Extract [data] from [URL]"** → Check structured data first, then CSS selectors
6. **"Monitor [site] for changes"** → Change detection + scheduling + alerting setup
7. **"How do I handle pagination on [site]?"** → Identify pagination type + code pattern
8. **"Scrape at scale ([N] pages)"** → Concurrency architecture + cost estimate
9. **"Clean and store this scraped data"** → Validation + dedup + storage recommendation
10. **"Is my scraper healthy?"** → Run health check + breakage detection
11. **"Find the API behind [site]"** → Network tab mining guide + common patterns
12. **"Set up price monitoring for [competitors]"** → Full e-commerce monitor pattern