Zhipu Asr — 技能 — openclaw中文资讯站

技能详情（站内镜像，无评论）

Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use when you need to transcribe audio files to text. Supports Chinese audio trans...

媒体与内容

作者：xiaofei @franklu0819-lang

许可证：MIT-0

MIT-0 ·免费使用、修改和重新分发。无需归因。

版本：v1.0.2

统计：⭐ 0 · 647 · 4 current installs · 5 all-time installs

⭐ 0

安装量（当前） 5

🛡 VirusTotal ：良性 · OpenClaw ：良性

Package：zhipu-asr

安全扫描（ClawHub）

VirusTotal ：良性
OpenClaw ：良性

OpenClaw 评估

The skill is internally consistent for an ASR integration: it converts audio locally and uploads it to Zhipu's transcription API using a single API key; nothing in the code or docs requests unrelated secrets or system access.

目的

The name/description, SKILL.md, and the included shell script all align with an ASR transcription skill. The only minor inconsistency is metadata in package.json/_meta.json (and package.json lists only jq in openclaw.requires while the script also requires curl and ffmpeg); this looks like sloppy packaging rather than malicious behavior.

说明范围

Runtime instructions and the script stay within scope: they read the provided audio file, optionally convert it with ffmpeg, check size/duration, and POST the file and optional context/hotwords to the declared Zhipu API endpoint. The script does not read other environment variables, system files, or contact unexpected endpoints.

安装机制

There is no install spec (instruction-only with a shipped shell script). That is low-risk; no external archives or binaries are downloaded by the skill itself. The script requires standard system tools (jq, curl, ffmpeg) already expected for this functionality.

证书

Only ZHIPU_API_KEY is requested, which is proportionate to uploading audio to Zhipu's API. No other secrets or unrelated credentials are required or referenced.

持久

always is false and the skill does not attempt to modify other skills or system settings. It does not request persistent elevated presence beyond normal user invocation.

综合结论

This skill appears to do exactly what it says: convert audio locally (ffmpeg) and upload it to Zhipu's transcription API using ZHIPU_API_KEY. Before installing, confirm you trust the destination (open.bigmodel.cn) and are comfortable sending audio (which may contain sensitive PII) to that external service. Ensure ffmpeg/curl/jq are installed from trusted package sources. Note the small metadata inconsistencies (package.json/_meta.json versions…

安装（复制给龙虾 AI）

将下方整段复制到龙虾中文库对话中，由龙虾按 SKILL.md 完成安装。

请把本段交给龙虾中文库（龙虾 AI）执行：为本机安装 OpenClaw 技能「Zhipu Asr」。简介：Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use…。
请 fetch 以下地址读取 SKILL.md 并按文档完成安装：https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/franklu0819-lang/zhipu-asr/SKILL.md
（来源：yingzhi8.cn 技能库）

SKILL.md

打开原始 SKILL.md（GitHub raw）

---
name: zhipu-asr
description: Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use when you need to transcribe audio files to text. Supports Chinese audio transcription with context prompts, custom hotwords, and multiple audio formats.
metadata:
  {
    "openclaw":
      {
        "requires": { "bins": ["jq", "curl", "ffmpeg"], "env": ["ZHIPU_API_KEY"] },
      },
  }
---

# Zhipu AI Automatic Speech Recognition (ASR)

Transcribe Chinese audio files to text using Zhipu AI's GLM-ASR model.

## Setup

**1. Get your API Key:**
Get a key from [Zhipu AI Console](https://bigmodel.cn/usercenter/proj-mgmt/apikeys)

**2. Set it in your environment:**
```bash
export ZHIPU_API_KEY="your-key-here"
```

## Supported Audio Formats

- **WAV** - Recommended, best quality
- **MP3** - Widely supported
- **OGG** - Auto-converted to MP3
- **M4A** - Auto-converted to MP3
- **AAC** - Auto-converted to MP3
- **FLAC** - Auto-converted to MP3
- **WMA** - Auto-converted to MP3

> **Note:** The script automatically converts unsupported formats to MP3 using ffmpeg. Only WAV and MP3 are accepted by the API, but you can use any format that ffmpeg supports.

## File Constraints

- **Maximum file size:** 25 MB
- **Maximum duration:** 30 seconds
- **Recommended sample rate:** 16000 Hz or higher
- **Audio channels:** Mono or stereo

## Usage

### Basic Transcription

Transcribe an audio file with default settings:

```bash
bash scripts/speech_to_text.sh recording.wav
```

### Transcription with Context

Provide previous transcription or context for better accuracy:

```bash
bash scripts/speech_to_text.sh recording.wav "这是之前的转录内容，有助于提高准确性"
```

### Transcription with Hotwords

Use custom vocabulary to improve recognition of specific terms:

```bash
bash scripts/speech_to_text.sh recording.mp3 "" "人名,地名,专业术语,公司名称"
```

### Full Options

Combine context and hotwords:

```bash
bash scripts/speech_to_text.sh recording.wav "会议记录片段" "张三,李四,项目名称"
```

**Parameters:**
- `audio_file` (required): Path to audio file (.wav or .mp3)
- `prompt` (optional): Previous transcription or context text (max 8000 chars)
- `hotwords` (optional): Comma-separated list of specific terms (max 100 words)

## Features

### Context Prompts

**Why use context prompts:**
- Improves accuracy in long conversations
- Helps with domain-specific terminology
- Maintains consistency across multiple segments

**When to use:**
- Multi-part conversations or meetings
- Technical or specialized content
- Continuing from previous transcriptions

**Example:**
```bash
bash scripts/speech_to_text.sh part2.wav "第一部分的转录内容：讨论了项目进展和下一步计划"
```

### Hotwords

**What are hotwords:**
Custom vocabulary list that boosts recognition accuracy for specific terms.

**Best use cases:**
- Proper names (people, places)
- Domain-specific terminology
- Company names and products
- Technical jargon
- Industry-specific terms

**Examples:**
```bash
# Medical transcription
bash scripts/speech_to_text.sh medical.wav "" "患者,症状,诊断,治疗方案"

# Business meeting
bash scripts/speech_to_text.sh meeting.wav "" "张经理,李总,项目代号,预算"

# Tech discussion
bash scripts/speech_to_text.sh tech.wav "" "API,数据库,算法,框架"
```

## Workflow Examples

### Transcribe a Meeting

```bash
# Part 1
bash scripts/speech_to_text.sh meeting_part1.wav

# Part 2 with context
bash scripts/speech_to_text.sh meeting_part2.wav "第一部分讨论了项目进度" "张总,李经理,项目名称"

# Part 3 with context
bash scripts/speech_to_text.sh meeting_part3.wav "前两部分讨论了项目进度和预算" "张总,李经理,项目名称"
```

### Transcribe a Lecture

```bash
bash scripts/speech_to_text.sh lecture.wav "" "教授,课程名称,专业术语1,专业术语2"
```

### Process Multiple Files

```bash
for file in recording_*.wav; do
    bash scripts/speech_to_text.sh "$file"
done
```

## Audio Quality Tips

**Best practices for accurate transcription:**

1. **Clear audio source**
   - Minimize background noise
   - Use good quality microphone
   - Speak clearly and at moderate pace

2. **Optimal audio settings**
   - Sample rate: 16000 Hz or higher
   - Bit depth: 16-bit or higher
   - Single channel (mono) is sufficient

3. **File preparation**
   - Remove silence from beginning/end
   - Normalize audio levels
   - Ensure consistent volume

## Output Format

The script outputs JSON with:
- `id`: Task ID
- `created`: Request timestamp (Unix timestamp)
- `request_id`: Unique request identifier
- `model`: Model name used
- `text`: Transcribed text

Example output:
```json
{
  "id": "task-12345",
  "created": 1234567890,
  "request_id": "req-abc123",
  "model": "glm-asr-2512",
  "text": "你好，这是转录的文本内容"
}
```

## Troubleshooting

**File Size Issues:**
- Split audio files larger than 25 MB
- Reduce sample rate or bit depth
- Use compression (MP3) for smaller files

**Duration Issues:**
- Split recordings longer than 30 seconds
- Process segments separately
- Use context prompts to maintain continuity

**Poor Accuracy:**
- Improve audio quality
- Use hotwords for specific terms
- Provide context prompts
- Ensure clear speech and minimal noise

**Format Issues:**
- Ensure file is .wav or .mp3
- Check file is not corrupted
- Verify audio can be played by standard players

## Limitations

- Maximum audio duration: 30 seconds per request
- File size limit: 25 MB
- Maximum hotwords: 100 terms
- Context prompt limit: 8000 characters
- Best performance with Chinese language audio

## Performance Notes

- Typical transcription time: 1-3 seconds
- Real-time or faster for most audio
- Processing time scales with audio quality and length