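doc-extract-filter | Skill | OpenClaw Chinese News Site

Skill details (on-site mirror, no comments)

Extracts text from PDF, Word, and Excel files and filters it by keyword, returning either the full text or only the matching content.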

Media & Content

License: MIT-0

MIT-0 · Free to use, modify, and redistribute. No attribution required.

Version: v1.1.1

Stats: ⭐ 1 · 53 · 0 current installs · 0 all-time installs

🛡 VirusTotal: benign · OpenClaw: benign

Package: bigclawd/doc-extract-filter

Security scan (ClawHub)

OpenClaw assessment

The skill's code, instructions, and requirements are consistent with a local document text-extraction and keyword-filtering tool; it reads and writes files but does not request credentials or reach out to external endpoints.

Purpose

Name/description (extract/filter text from documents) matches the included code: extractor, filter, converter, and utils implement extraction, keyword/regex filtering, batch processing and result export. Declared CLI/API parameters align with implementation.

Instruction scope

SKILL.md and entry script instruct the agent to read specified files or directories, extract text, filter matches, and optionally write JSON/text outputs. The instructions do not request unrelated data, secrets, or remote endpoints. Note: batch mode traverses directories and will process any supported files accessible to the running agent—this is expected but relevant for sensitive directories.
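
Because batch mode walks whatever directory it is given, it can help to preview the candidate files before pointing the skill at a folder. The sketch below is not part of the skill: `SUPPORTED_SUFFIXES` and `list_candidate_files` are hypothetical names, and the extension list is an assumption based on the PDF, Word, and Excel formats named in the description.

```python
from pathlib import Path

# Assumed extension list; the skill's real list may differ.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".xlsx"}


def list_candidate_files(root: str) -> list[Path]:
    """Return files under `root` whose suffix looks like a supported document type."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Printing the result of `list_candidate_files("some/folder")` before a batch run gives a quick view of what a recursive traversal could reach.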

Install mechanism

This is instruction/code-based (no install spec). A requirements.txt is provided but there is no automated installer; the runtime environment must have the listed Python packages. OCR functionality additionally requires system tesseract and pdf2image/Pillow; missing optional dependencies are handled in code (falls back to non-OCR extraction).
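
Since there is no automated installer, a quick environment check can show whether the optional OCR pieces are present. This is a minimal sketch rather than the skill's own detection logic: `ocr_available` is a hypothetical helper, and it only checks for the packages and binary named in the assessment above.

```python
import importlib.util
import shutil


def ocr_available() -> bool:
    """Report whether the optional OCR pieces appear to be installed."""
    # pdf2image and Pillow (imported as "PIL") are Python packages;
    # tesseract is a system binary that must be on PATH.
    has_packages = all(
        importlib.util.find_spec(name) is not None for name in ("pdf2image", "PIL")
    )
    has_tesseract = shutil.which("tesseract") is not None
    return has_packages and has_tesseract
```

If this returns `False`, the assessment suggests the skill should still work, falling back to non-OCR extraction.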

Credentials

The skill does not request environment variables, credentials, or config paths. All I/O is local-file-based as described. There are no requests for unrelated service keys or tokens.

Persistence

Skill is not marked always:true and does not modify other skills or system-wide agent settings. It performs file reads/writes within the paths provided by the caller, which is appropriate for its purpose.

Overall conclusion

This skill is internally consistent and appears to do what it claims, but it runs code on your agent and will read/write any file paths you pass to it. Before installing or invoking: (1) ensure the Python environment has required packages (requirements.txt) and tesseract if you need OCR; (2) avoid pointing it at sensitive system or credential directories—it will traverse directories you give it in batch mode; (3) run it on non-sensitive sample…

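Install (copy to Lobster AI)

Copy the entire block below into a 龙虾中文库 (Lobster AI) conversation, and Lobster will complete the installation by following SKILL.md.

Hand this block to 龙虾中文库 (Lobster AI) to execute: install the OpenClaw skill "doc-extract-filter" on this machine. Summary: extracts text from PDF, Word, and Excel files and filters it by keyword, returning either the full text or only the matching content.
Fetch the following URL to read SKILL.md and complete the installation as documented: https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/bigclawd/doc-extract-filter/SKILL.md
(Source: yingzhi8.cn skill library)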

SKILL.md

# doc-extract-filter

## Metadata

### Basic information
- **name**: doc-extract-filter
- **description**: File-processing skill that supports text extraction and keyword filtering for PDF, Word, and Excel files
- **version**: 1.0.0
- **author**: file-agent team
- **license**: MIT-0

### OpenClaw configuration
```json
{
  "name": "doc-extract-filter",
  "description": "File-processing skill that supports text extraction and keyword filtering for PDF, Word, and Excel files",
  "version": "1.0.0",
  "author": "file-agent team",
  "license": "MIT-0",
  "type": "tool",
  "entry_point": "scripts/doc-extract-filter.py",
  "parameters": {
    "file_path": {
      "type": "string",
      "description": "File path",
      "required": true
    },
    "action": {
      "type": "string",
      "description": "Operation type, either extract or filter",
      "required": true
    },
    "keywords": {
      "type": "array",
      "description": "Keyword list (only needed for the filter action)",
      "required": false
    }
  }
}
```

### CoPaw configuration
```yaml
name: doc-extract-filter
description: File-processing skill that supports text extraction and keyword filtering for PDF, Word, and Excel files
version: 1.0.0
author: file-agent team
license: MIT-0
type: tool
entry_point: scripts/doc-extract-filter.py
parameters:
  file_path:
    type: string
    description: File path
    required: true
  action:
    type: string
    description: Operation type, either extract or filter
    required: true
  keywords:
    type: array
    description: Keyword list (only needed for the filter action)
    required: false
```
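
The parameter declarations above imply a few simple rules a caller could check before invoking the skill. The following is a minimal sketch under that reading: `validate_params` and `VALID_ACTIONS` are hypothetical names, and whether the skill itself rejects a `filter` call without keywords is an assumption, not something the configuration states.

```python
from typing import Optional

VALID_ACTIONS = {"extract", "filter"}


def validate_params(file_path: str, action: str, keywords: Optional[list] = None) -> list:
    """Return a list of problems; an empty list means the call looks well formed."""
    problems = []
    if not file_path:
        problems.append("file_path is required")
    if action not in VALID_ACTIONS:
        problems.append(f"action must be one of {sorted(VALID_ACTIONS)}")
    # Assumption: filter is only meaningful when at least one keyword is given.
    if action == "filter" and not keywords:
        problems.append("keywords are needed for the filter action")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.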

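## Usage

### Features
- **extract**: extracts the text content of a file
- **filter**: extracts the text of a file and keeps only the content that contains the specified keywords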
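
To make the `filter` behaviour concrete, here is one way keyword filtering could work. It is an illustrative sketch, not the skill's implementation: `filter_lines` is a hypothetical function, and line-level granularity with case-sensitive substring matching is an assumption, since the matching rules are not specified here.

```python
def filter_lines(text: str, keywords: list[str]) -> str:
    """Keep only the lines that contain at least one keyword (plain substring match)."""
    kept = [
        line for line in text.splitlines()
        if any(keyword in line for keyword in keywords)
    ]
    return "\n".join(kept)
```

With the input `"Item A  unit price 10\nShipping note\nTotal amount 10"` and the keywords `["unit price", "Total amount"]`, the first and third lines are kept and the middle line is dropped.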

### Invocation

#### CLI invocation
```bash
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "extract"
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "filter" --keywords "keyword1,keyword2"
```

#### Python function call
```python
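# The entry point is scripts/doc-extract-filter.py, and a hyphenated file name
# cannot be imported with a plain import statement. This import assumes the
# class is also reachable through an importable module named doc_extract_filter.
from scripts.doc_extract_filter import DocExtractFilter

# Extract text
result = DocExtractFilter.process("path/to/file.pdf", "extract")

# Filter by keywords
result = DocExtractFilter.process("path/to/file.pdf", "filter", ["keyword1", "keyword2"])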
```

### Return format
The `filtered_text` field is returned only for the `filter` action.
```json
{
  "success": true,
  "data": {
    "text": "extracted text content",
    "filtered_text": "filtered text content"
  },
  "error": ""
}
```
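
A caller consuming this structure might unwrap it as sketched below. `pick_text` is a hypothetical helper, and it assumes the result arrives as a Python `dict` with the `success`, `data`, and `error` keys shown above.

```python
def pick_text(result: dict) -> str:
    """Return filtered_text when present, else text; raise if the call failed."""
    if not result.get("success"):
        raise RuntimeError(result.get("error") or "unknown error")
    data = result.get("data") or {}
    # filter results carry filtered_text; extract results only carry text.
    return data.get("filtered_text", data.get("text", ""))
```

Preferring `filtered_text` lets the same helper serve both actions without the caller tracking which one was requested.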

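### Error handling
- File not found: returns an error message
- Unsupported file type: returns an error message
- Operation failed: returns an error message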
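
The first two cases can be illustrated with a small pre-check that produces the same envelope as the return format. This is a sketch only: `precheck` and `SUPPORTED_EXTENSIONS` are hypothetical, the extension list is assumed from the formats named in the description, and the exact error strings and the empty `data` object on failure are assumptions.

```python
import os

# Assumed extension list; the skill's real list may differ.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx"}


def precheck(file_path: str) -> dict:
    """Return an error envelope for obvious problems, or a success envelope otherwise."""
    if not os.path.isfile(file_path):
        return {"success": False, "data": {}, "error": f"file not found: {file_path}"}
    extension = os.path.splitext(file_path)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return {"success": False, "data": {}, "error": f"unsupported file type: {extension}"}
    return {"success": True, "data": {}, "error": ""}
```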

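## Installation and testing

### Installation
1. Copy the `doc-extract-filter` directory into the skills directory of OpenClaw/CoPaw
2. Run `pip install -r requirements.txt` to install the dependencies

### Testing
Use the `docs/test.pdf` file to test the skill: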
```bash
# Test text extraction
python scripts/doc-extract-filter.py --file_path "docs/test.pdf" --action "extract"

# Test keyword filtering
python scripts/doc-extract-filter.py --file_path "docs/test.pdf" --action "filter" --keywords "单价,小计,总金额"
```

### Standalone operation
doc-extract-filter now bundles all the core code it needs and can run standalone, without depending on an external src directory.