Skrape | Skills | OpenClaw Chinese News Site

Skill details (on-site mirror, no comments)

Ethical web data extraction with robots exclusion protocol adherence, throttled scraping requests, and privacy-compliant handling ("Scrape responsibly!").

Development & DevOps

Author: X @10OSS

License: MIT-0

MIT-0 · Free to use, modify, and redistribute. No attribution required.

Version: v1.1.1

Stats: ⭐ 0 · 112 · 0 current installs · 0 all-time installs

Package: 10oss/skrape

Security scan (ClawHub)

VirusTotal: benign
OpenClaw: benign

OpenClaw assessment

The skill's instructions, example code, and requested surface area are consistent with an ethical web-scraping helper and do not ask for unrelated credentials, installs, or persistent privileges.

Purpose

Name/description match the contents: SKILL.md and code.md focus on robots.txt checks, throttling/backoff, and privacy guidance. There are no unrelated env vars, binaries, or opaque network endpoints requested.

Instruction scope

Runtime instructions stay within scraping responsibilities (check robots, prefer APIs, throttle, avoid PII). The example code is illustrative and consistent, but contains implementation simplifications: it treats a missing robots.txt and fetch errors as 'permitted', and its basic robots.txt evaluator may not fully implement precedence/longest-match rules. These are functional caveats rather than malicious behavior.
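
The precedence caveat can be made concrete. The following is a minimal Python sketch, not the evaluator from the skill's code.md, of longest-match evaluation: the most specific (longest) matching rule wins, and Allow wins a tie. It assumes plain path-prefix rules for a single user-agent group and ignores `*` and `$` wildcards; `is_allowed` and the sample rules are hypothetical names chosen for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical rule shape: ("allow" | "disallow", path prefix).
Rule = Tuple[str, str]


def is_allowed(path: str, rules: Iterable[Rule]) -> bool:
    """Longest-match evaluation for one user-agent group (no wildcards)."""
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty rule values match nothing
        length = len(prefix)
        # A longer prefix wins; on equal length, prefer "allow".
        if length > best_len or (length == best_len and kind == "allow"):
            best_len = length
            best_allow = kind == "allow"
    return best_allow


rules = [("disallow", "/private/"), ("allow", "/private/press/")]
print(is_allowed("/private/press/2024.html", rules))  # longer Allow wins
print(is_allowed("/private/notes.html", rules))       # only Disallow matches
print(is_allowed("/public/index.html", rules))        # nothing matches
```

A first-match evaluator would give a different answer for the first path if the Disallow line came first in the file, which is the kind of divergence the assessment warns about.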

Install mechanism

This is instruction-only with no install spec and no external downloads — lowest surface area. The code examples only use Node built-ins (http/https/url/console/process).

Credentials

No credentials, env vars, or config paths are requested. The sample uses a contact email in the User-Agent, which is appropriate for polite scraping but not a secret.

Persistence

The `always` setting is false, and the skill does not request persistent or cross-skill privileges. It does not modify system or other-skill configs.

Overall conclusion

This appears to be a coherent, instruction-only scraper helper. Before you use it in production: (1) Treat code.md as example patterns, not a drop-in library — the workflow references require('./scrape') which isn't provided and robots parsing is simplified. (2) Replace the example contact email with a real contact or remove it as appropriate. (3) Consider tightening robots handling (be conservative on errors instead of assuming permission) an…

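Install (copy to Lobster AI)

Copy the entire block below into a Lobster Chinese Library conversation; Lobster will complete the installation by following SKILL.md.

Hand this block to Lobster Chinese Library (Lobster AI) to execute: install the OpenClaw skill "Skrape" on this machine. Summary: Ethical web data extraction with robots exclusion protocol adherence, throttled…
Fetch the following address to read SKILL.md and complete the installation as documented: https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/10oss/skrape/SKILL.md
(Source: yingzhi8.cn skill library)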

SKILL.md

---
name: Skrape
description: Ethical web data extraction with robots exclusion protocol adherence, throttled scraping requests, and privacy-compliant handling ("Scrape responsibly!").
---

## Respect Creative Work

- **Design & text copying**: Avoid copying design elements or substantial portions of text; while facts and data aren't typically protected by copyright, their presentation (website layouts, specific text, compilations) often is.
- **Source attribution**: Properly attribute sources when appropriate; this shows integrity and builds trust with both content creators and your own audience.
- **Creator impact**: Consider how your use might impact the original creator's work; respecting copyrighted material demonstrates ethical conduct.

## Pre-Extraction Verification Steps

**I. Access Authorization** — Retrieve `{domain}/robots.txt` and review `/terms` or `/tos` endpoints. Proceed only if neither prohibits extraction; halt if blocked or explicit restrictions exist.

**II. Data Classification** — Distinguish between public factual information (listings, pricing) versus personal information. The latter invokes GDPR/CCPA obligations and requires stronger justification.

**III. Preferred Channels** — Check whether the platform offers an API. If available, use it instead of direct extraction. Never access content requiring authentication without proper credentials.
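
The access-authorization step can be sketched as a small gate, shown here in Python with the standard library's `urllib.robotparser`. This is an illustrative sketch under stated assumptions, not the skill's own implementation: `may_fetch`, the sample robots.txt, and the `ExampleBot` User-Agent are hypothetical, and the choice to refuse when robots.txt could not be retrieved is a deliberately conservative one. Note that the standard-library parser applies rules in file order (first match), so rule ordering matters.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent; replace the contact address with a real one.
USER_AGENT = "ExampleBot/1.0 (+mailto:ops@example.com)"

# Hypothetical robots.txt body, as it might be returned by {domain}/robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /account/
Allow: /
"""


def may_fetch(robots_txt: Optional[str], user_agent: str, url: str) -> bool:
    """Return True only if robots.txt was retrieved and permits the URL.

    robots_txt is None when retrieval failed; this sketch assumes the
    conservative choice of halting in that case.
    """
    if robots_txt is None:
        return False
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch(SAMPLE_ROBOTS, USER_AGENT, "https://example.com/listings"))
print(may_fetch(SAMPLE_ROBOTS, USER_AGENT, "https://example.com/account/settings"))
print(may_fetch(None, USER_AGENT, "https://example.com/listings"))
```

The gate covers only robots.txt; reviewing `/terms` or `/tos` and checking for an official API remain manual steps.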

## Operational Conduct & Compliance

- **Request discipline**: Wait at least 2-3 seconds between requests, honor 429 responses with progressive backoff, maintain connection pooling, and use an authentic User-Agent that includes a contact email.
- **Access boundaries**: Ignoring robots.txt carries uncertain legal standing (Meta v. Bright Data 2024); scraping publicly accessible content is typically permissible (hiQ v. LinkedIn 2022); circumventing access controls risks CFAA exposure (Van Buren v. US 2021).
- **Data & content restrictions**: Collecting personal information without permission can breach GDPR/CCPA; redistributing copyrighted material can constitute copyright infringement.

## Information Stewardship

- **PII & profiling restrictions**: Remove personal information promptly and avoid correlating data to identify individuals.
- **Limit retention**: Store only necessary data, purge the rest.
- **Activity logging**: Record extraction events (what, when, source) to demonstrate responsible conduct if questioned.
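
The stewardship points can be illustrated with a short Python sketch. It is an assumption-laden example rather than the skill's method: `redact_emails`, `log_entry`, and the record fields are hypothetical, and the regular expression catches only common email shapes, so it is not a complete PII filter.

```python
import json
import re
from datetime import datetime, timezone
from typing import Optional

# Simplified pattern for common email addresses; real PII removal needs more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address before storage."""
    return EMAIL_RE.sub("[redacted-email]", text)


def log_entry(source_url: str, what: str,
              when: Optional[datetime] = None) -> str:
    """Build one JSON audit record: what was extracted, when, and from where."""
    when = when or datetime.now(timezone.utc)
    record = {"when": when.isoformat(), "source": source_url, "what": what}
    return json.dumps(record, sort_keys=True)


print(redact_emails("Contact jane.doe@example.com for pricing"))
print(log_entry("https://example.com/listings", "listing titles and prices"))
```

Appending each record to a local log file gives a simple trail of what was collected, when, and from which source.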

Implementation patterns and robots.txt evaluation logic are in `code.md`.