openclaw 网盘下载
OpenClaw

技能详情(站内镜像,无评论)

首页 > 技能库 > skroller

Automate scraping and filtering of public social media posts with keyword search, engagement filters, deduplication, and export to JSON, CSV, or notes apps.

媒体与内容

作者:X @10OSS

许可证:MIT-0

MIT-0 ·免费使用、修改和重新分发。无需归因。

版本:v0.0.1

统计:⭐ 0 · 88 · 0 current installs · 0 all-time installs

0

安装量(当前) 0

🛡 VirusTotal :可疑 · OpenClaw :可疑

Package:10oss/skroller

安全扫描(ClawHub)

  • VirusTotal :可疑
  • OpenClaw :可疑

OpenClaw 评估

The package is a coherent social‑media scraping tool, but it contains undeclared credential usage, filesystem writes, and explicit anti‑bot/proxy/evasion guidance that increase risk and warrant careful review before use.

目的

The name/description match the code and docs: Playwright-based browser automation across many social platforms, filtering/deduplication, digest generation, and export to note apps. The declared dependencies (playwright in package.json) and included scripts align with the stated purpose.

说明范围

SKILL.md and the scripts instruct the agent to perform broad browser automation, login when needed, persist cookies, write a seen-posts file, and call external services (Notion API, grizzly CLI). The docs also include explicit anti-bot/evading techniques (user-agent rotation, proxy rotation, human mouse simulation, residential proxy vendors) that go beyond benign automation recommendations and may facilitate bypassing platform protections. The…

安装机制

The registry entry provides no install spec, but package.json lists Playwright (a large dependency) meaning users will need to run npm install to use it. There are no suspicious external download URLs; all code is included. The absence of an explicit install/install-time instructions in the registry is notable (user will need to install Node >=18 and run npm install), but not inherently malicious.

证书

The skill's metadata declares no required environment variables, yet the SKILL.md and scripts reference/expect secrets and external credentials (e.g., NOTION_API_KEY / NOTION_TOKEN, MS_TOKEN, and possibly others for note APIs). Exports use process.env.NOTION_API_KEY and command examples show $MS_TOKEN. Those credentials are reasonable for the export features, but they should have been declared in requires.env/primaryEnv. The exporter also runs…

持久

The skill does not request always: true and does not modify other skills. It writes local files (seen posts file, cookies.json, creates notes in user-specified vaults and paths) and may create notes via external services (Notion). Writing to the user's filesystem and storing cookies/seen-posts is expected for a scraper but is persistent behavior the user should be aware of.

scripts/export-to-notes.js:160

Shell command execution detected (child_process).

scripts/skroller.js:28

Dynamic code execution detected.

scripts/export-to-notes.js:232

Environment variable access combined with network send.

scripts/export-to-notes.js:50

File read combined with network send (possible exfiltration).

安装(复制给龙虾 AI)

将下方整段复制到龙虾中文库对话中,由龙虾按 SKILL.md 完成安装。

请把本段交给龙虾中文库(龙虾 AI)执行:为本机安装 OpenClaw 技能「skroller」。简介:Automate scraping and filtering of public social media posts with keyword searc…。
请 fetch 以下地址读取 SKILL.md 并按文档完成安装:https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/10oss/skroller/SKILL.md
(来源:yingzhi8.cn 技能库)

SKILL.md

打开原始 SKILL.md(GitHub raw)

---
name: skroller
description: Automated social media content collection and analysis across platforms (Twitter/X, Instagram, TikTok, Reddit, LinkedIn, YouTube, Product Hunt, Medium, GitHub, Pinterest). Use when you need to: (1) scrape public posts programmatically, (2) analyze content by keywords or filters, (3) monitor brand mentions or trends for research, (4) curate content for personal analysis, (5) archive publicly available information, or (6) generate digests from scraped feeds. Always comply with platform ToS and applicable privacy laws.
---

# Skroller - Social Media Content collection

Automate the collection and analysis of publicly available social media content. This skill handles content collection, filtering, and export with compliance safeguards.

## ⚖️ Legal & Compliance Requirements

**Before using this skill:**
1. **Review Platform ToS** - Each platform has different rules about automated access
2. **Check robots.txt** - Respect disallowed paths
3. **Rate Limiting** - Stay within platform rate limits to avoid service disruption
4. **Privacy Laws** - GDPR, CCPA, and other regulations apply to personal data
5. **Permitted Use** - Research, personal analysis, competitive intelligence (where allowed)
6. **Prohibited Use** - Spam, harassment, data resale, manipulation, bypassing auth

**Data Protection:**
- Anonymize personal data when storing
- Honor deletion requests (GDPR Art. 17)
- Limit retention to necessary periods
- Document lawful basis for processing
- Do not scrape sensitive personal data

## Core Capabilities

- **Content collection** - Gather publicly available posts from platforms
- **Data extraction** - Pull text, timestamps, engagement metrics, author info
- **Smart filtering** - Filter by keywords, hashtags, date ranges, engagement thresholds
- **Deduplication** - Track seen posts to avoid duplicates across sessions
- **Export formats** - JSON, CSV, Markdown, or direct to note apps (Bear/Obsidian)
- **Rate limiting** - Respect platform constraints and avoid service disruption

## Platform Support

| Platform | Approach | Notes |
|----------|----------|-------|
| Twitter/X | Browser automation | Use Playwright; handle login if needed |
| Instagram | Browser automation | Rate-limited; use sparingly |
| TikTok | Browser automation | Heavy JS; may need longer waits |
| Reddit | API + browser | Prefer API where possible |
| LinkedIn | Browser automation | Login required for most content |
| YouTube | API + browser | Comments via browser, videos via API |
| Product Hunt | Browser automation | Product discovery, launches |
| Medium | Browser automation | Articles, blog posts |
| GitHub | Browser automation | Issues, discussions, repos |
| Pinterest | Browser automation | Visual content, pins |

## Quick Start

### Basic Scroll and Extract

```bash
# Scroll Twitter feed, extract 50 posts about "AI"
node scripts/skroller.js --platform twitter --query "AI" --limit 50 --output posts.json
```

### Monitor Brand Mentions

```bash
# Monitor Reddit for brand mentions, export to Markdown
node scripts/skroller.js --platform reddit --query "mybrand" --format markdown --output mentions.md
```

### Competitive Research

```bash
# Scroll competitor Instagram, capture top posts by engagement
node scripts/skroller.js --platform instagram --profile @competitor --min-likes 1000 --output competitor-posts.json
```

## Scripts

### `scripts/skroller.js` - Main scrolling engine

JavaScript script using Playwright for browser automation.

```bash
# Using npm scripts (recommended)
npm run scroll -- --platform twitter --query "AI" --limit 50 --output posts.json

# Direct execution
node scripts/skroller.js --platform twitter --query "AI" --limit 50 --output posts.json
```

**Options:**
- `--platform` - Target platform (required): twitter, reddit, instagram, tiktok, linkedin, youtube, producthunt, medium, github, pinterest
- `--query` - Search keyword/hashtag
- `--profile` - Specific profile to scroll
- `--limit` - Max posts to scrape (default: 50)
- `--min-likes` - Filter by minimum engagement
- `--format` - Output format: json, csv, markdown (default: json)
- `--output` - Output file path
- `--screenshot` - Capture screenshots for debugging
- `--dedupe` - Skip previously seen posts

### `scripts/feed-digest.js` - Generate digests

Creates summary digests from exported post data.

```bash
npm run digest -- --input posts.json --output digest.md
# or: node scripts/feed-digest.js --input posts.json --output digest.md
```

### `scripts/export-to-notes.js` - Unified note app exporter

Exports scraped posts to multiple note applications with a single command.

**Supported apps:** Bear, Obsidian, Notion, Apple Notes, Evernote, OneNote, Google Keep, Roam Research, Logseq, Anytype

```bash
# Using npm scripts (recommended)
npm run export -- --input posts.json --app obsidian --vault ~/Documents/Obsidian

# Direct execution
node scripts/export-to-notes.js --input posts.json --app bear --tags "ai,research"
node scripts/export-to-notes.js --input posts.json --app notion --api-key $NOTION_TOKEN
node scripts/export-to-notes.js --input posts.json --app apple-notes
node scripts/export-to-notes.js --input posts.json --app evernote --output export.enex
node scripts/export-to-notes.js --input posts.json --app one-note --access-token $MS_TOKEN
node scripts/export-to-notes.js --input posts.json --app keep --output keep.html
node scripts/export-to-notes.js --input posts.json --app roam --output roam.md
node scripts/export-to-notes.js --input posts.json --app logseq --vault ~/Documents/Logseq
node scripts/export-to-notes.js --input posts.json --app anytype --output anytype.md
node scripts/export-to-notes.js --input posts.json --app obsidian --dry-run
```

**Configuration:** Set defaults in `.skroller-config.json`:
```json
{
  "export": {
    "defaultApp": "obsidian",
    "vault": "~/Documents/Obsidian",
    "folder": "Skroller",
    "notionDatabaseId": "<db-id>"
  }
}
```

**Requirements by app:**
- **Bear:** grizzly CLI (`go install github.com/tylerwince/grizzly/cmd/grizzly@latest`)
- **Obsidian:** Vault path
- **Notion:** API key (create at notion.so)
- **Apple Notes:** macOS with Notes app
- **Evernote:** Manual ENEX import
- **OneNote:** Microsoft Graph access token
- **Google Keep:** Manual HTML import
- **Roam Research:** Markdown import (drag MDL file into Roam)
- **Logseq:** Vault path (writes to pages/ folder)
- **Anytype:** Markdown import (use Anytype Import feature)

## Configuration

### `.skroller-config.json`

Store default settings:

```json
{
  "defaultLimit": 50,
  "scrollDelayMs": 1500,
  "userAgent": "Mozilla/5.0 ...",
  "platforms": {
    "twitter": {
      "loginRequired": false,
      "rateLimit": "100 requests/hour"
    },
    "instagram": {
      "loginRequired": true,
      "rateLimit": "50 requests/hour"
    }
  },
  "export": {
    "defaultFormat": "json",
    "includeImages": true,
    "includeMetrics": true
  }
}
```

### Authentication

Some platforms require login. Store credentials securely:

```bash
# For platforms requiring auth, set environment variables
export SKROLLR_TWITTER_COOKIE="<auth cookie>"
export SKROLLR_INSTAGRAM_USER="<username>"
export SKROLLR_INSTAGRAM_PASS="<password>"
```

**Security note:** Never commit auth files. Use `.env` with `.gitignore`.

## Output Structure

### JSON Output (default)

```json
{
  "platform": "twitter",
  "query": "AI",
  "scrapedAt": "2026-03-14T20:30:00Z",
  "posts": [
    {
      "id": "1234567890",
      "author": "@username",
      "text": "Post content here...",
      "timestamp": "2026-03-14T18:00:00Z",
      "likes": 150,
      "retweets": 42,
      "replies": 12,
      "url": "https://twitter.com/...",
      "media": ["image1.jpg"],
      "hashtags": ["#AI", "#ML"]
    }
  ]
}
```

### Markdown Output

```markdown
## Twitter Posts: "AI" (2026-03-14)

### @username - 150 likes
Post content here...

[View post](https://twitter.com/...)

---
```

## Filtering Strategies

### Keyword Matching
- Exact match: `"exact phrase"`
- Boolean: `AI AND (startup OR venture)`
- Exclude: `AI -crypto`

### Engagement Thresholds
- Filter low-quality: `--min-likes 100`
- Viral content: `--min-shares 500`

### Time Windows
- Recent only: `--date-from 2026-03-14`
- Historical: `--date-from 2026-01-01 --date-to 2026-01-31`

## Best Practices

### Rate Limiting
- Add delays between scrolls: `--delay 2000`
- Respect platform limits (see config)
- Use proxies for high-volume scraping

### Content Quality
- Filter by engagement to find signal
- Dedupe across sessions
- Export with timestamps for freshness tracking

### Ethics & Compliance
- Check platform ToS before scraping
- Don't scrape personal data at scale
- Use for research/curation, not spam

## Troubleshooting

### Scroll stops early
- Increase `--limit` or check for login requirements
- Some platforms detect automation; add random delays

### Missing content
- Some platforms lazy-load; increase scroll delay
- Try `--screenshot` to debug what's visible

### Rate limited
- Reduce frequency; use `--delay`
- Check platform-specific rate limits in config

## Integration with Other Skills

- **bear-notes**: Export via `export-to-notes.js --app bear`
- **obsidian**: Export via `export-to-notes.js --app obsidian`
- **notion**: Export via `export-to-notes.js --app notion`
- **github**: Create issues from scraped feedback/mentions

```bash
# Scroll and export to Obsidian in one command
node scripts/skroller.js --platform twitter --query "AI" --limit 20 --output ai.json && 
  node scripts/export-to-notes.js --input ai.json --app obsidian --vault ~/Documents/Obsidian

# Scroll and export to Notion
node scripts/skroller.js --platform reddit --query "startups" --limit 30 --output startups.json && 
  node scripts/export-to-notes.js --input startups.json --app notion --api-key $NOTION_TOKEN

# Use npm scripts for cleaner commands
npm run scroll -- --platform twitter --query "tech" --limit 25 --output tech.json
npm run export -- --input tech.json --app bear --tags "tech,research"
```

## See Also

- `references/platform-details.md` - Platform-specific selectors and quirks
- `references/rate-limits.md` - Rate limit guidelines per platform
- `assets/selector-cheatsheet.md` - CSS selectors for each platform