OpenClaw

Skill details (site mirror, comments disabled)


Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS

Category: Communication & Messaging

Author: Dmitry Aksenkin (@aksenkin)

License: MIT-0

MIT-0 · Free to use, modify, and redistribute. No attribution required.

Version: v1.0.3

Stats: ⭐ 0 · 304 · 1 current install · 1 all-time install

🛡 Security scan (ClawHub): VirusTotal: benign · OpenClaw: benign

Package: aksenkin/voice-stt-tts


OpenClaw Evaluation

The skill's instructions and requirements are consistent with a local STT+TTS setup: it creates a venv, installs faster-whisper, writes a transcription script, and shows OpenClaw config changes; it does not request unrelated credentials or hidden installs, but it will download large ML models and install native dependencies.

Purpose

The name/description (STT + TTS using faster-whisper and Edge TTS) match the actions in SKILL.md: creating a venv, installing faster-whisper, creating a transcribe.py, and adding OpenClaw config entries for media.audio and messages.tts. Nothing requested or shown is unrelated to providing local transcription and TTS.

Instruction Scope

The instructions direct the agent to create files under ~/.openclaw/workspace/voice-messages, install packages into that venv, and modify ~/.openclaw/openclaw.json. These actions are expected for this purpose, but they do write to the user's home directory and update the OpenClaw config — the user should review and back up that config before applying changes. The SKILL.md does not explicitly warn that model weights will be downloaded at runtime (faster-whisper/h…

Install Mechanism

No packaged install spec is present; the SKILL.md includes shell commands to create a Python venv and pip install faster-whisper. Using pip in an isolated venv is a reasonable install mechanism. The packages pulled (faster-whisper and its deps) come from PyPI/huggingface and are expected for transcription. There is no download from untrusted personal URLs or extract-from-URL steps in the manifest.

Credentials

The skill declares no environment variables or credentials, which is proportional. However, faster-whisper/huggingface-hub will perform network downloads of model artifacts (potentially large) and could prompt for HF auth if private models are used; the SKILL.md does not explicitly call this out. No unrelated secrets or config paths are requested.

Persistence

The skill is instruction-only and not always-enabled; it does not request elevated privileges or modify other skills. It proposes editing the agent's openclaw.json configuration (its own runtime configuration), which is appropriate for enabling STT/TTS.

Overall Conclusion

This skill appears to do what it claims, but before running: (1) review and back up ~/.openclaw/openclaw.json — the instructions modify it; (2) expect pip to install large/native packages (onnxruntime, ctranslate2, ffmpeg may be needed) and for faster-whisper to download model weights from the Hugging Face hub (large disk and network usage); (3) prefer running the install steps manually in a terminal so you can inspect outputs and resolve miss…

Install (copy to 龙虾 AI)

Copy the entire block below into a 龙虾中文库 conversation; the assistant will complete the installation per SKILL.md.

Have 龙虾中文库 (龙虾 AI) execute the following: install the OpenClaw skill "Voice messaging setup" on this machine. Summary: Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge….
Fetch the following URL, read SKILL.md, and complete the installation as documented: https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/aksenkin/voice-stt-tts/SKILL.md
(Source: yingzhi8.cn skill library)

SKILL.md

Open the original SKILL.md (GitHub raw)

---
name: voice-stt-tts
description: Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS
homepage: https://docs.openclaw.ai/nodes/audio
metadata:
  {
    "openclaw":
      {
        "emoji": "🎙️",
        "install": [
          {
            "id": "faster-whisper-venv",
            "kind": "bash",
            "label": "Install faster-whisper in venv",
            "command": "python3 -m venv ~/.openclaw/workspace/voice-messages && ~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper"
          },
          {
            "id": "transcribe-script",
            "kind": "bash",
            "label": "Create transcribe.py script",
            "command": "cat > ~/.openclaw/workspace/voice-messages/transcribe.py << 'EOF'\n#!/usr/bin/env python3\nimport argparse\nfrom faster_whisper import WhisperModel\n\ndef transcribe(audio_path: str, model_name: str = \"small\", lang: str = \"en\", device: str = \"cpu\") -> str:\n    model = WhisperModel(\n        model_name,\n        device=device,\n        compute_type=\"int8\" if device == \"cpu\" else \"float16\",\n    )\n    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)\n    text = \" \".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()\n    return text\n\ndef main():\n    p = argparse.ArgumentParser()\n    p.add_argument(\"--audio\", required=True)\n    p.add_argument(\"--model\", default=\"small\")\n    p.add_argument(\"--lang\", default=\"en\")\n    p.add_argument(\"--device\", default=\"cpu\", choices=[\"cpu\", \"cuda\"])\n    args = p.parse_args()\n    text = transcribe(args.audio, args.model, args.lang, args.device)\n    print(text if text else \"\")\nif __name__ == \"__main__\":\n    main()\nEOF"
          }
        ]
      }
  }
---

# Voice Messages (STT + TTS) for OpenClaw 🎙️

Complete voice message setup using **faster-whisper** for transcription and **Edge TTS** for voice replies.

## What we configure

- ✅ **STT** (Speech-to-Text) — transcribe voice messages via faster-whisper
- ✅ **TTS** (Text-to-Speech) — voice replies via Edge TTS
- 🎯 **Result:** voice → text → reply with voice

---

## Installation

### 1. Create virtual environment (venv)

For Ubuntu create an isolated venv:

```bash
python3 -m venv ~/.openclaw/workspace/voice-messages
```

### 2. Install faster-whisper

Install packages in venv:

```bash
~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper
```

**What gets installed:**
- `faster-whisper` — Python library for transcription
- Dependencies: `ctranslate2`, `onnxruntime`, `huggingface-hub`, `av`, `numpy`, and others.
- Size: ~250 MB
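Before pointing the config at `~/.openclaw/workspace/voice-messages/bin/python`, it can help to confirm a venv actually produces the `bin/python` layout the skill assumes. A throwaway check in a temp directory (so it does not touch `~/.openclaw`; `--without-pip` keeps it fast and offline):

```bash
# Create a disposable venv and verify the bin/ layout the config relies on.
VENV="$(mktemp -d)/venv"
python3 -m venv --without-pip "$VENV"
test -x "$VENV/bin/python" && echo "venv layout ok"
rm -rf "$(dirname "$VENV")"
```

The real install (step 1 above) creates the venv with pip included, so `bin/pip` is available for step 2.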

---

## Transcription Script

### Path and content

**File:** `~/.openclaw/workspace/voice-messages/transcribe.py`

```python
#!/usr/bin/env python3
import argparse
from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()

    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()
```

**What the script does:**
1. Accepts audio file path (`--audio`)
2. Loads Whisper model (`--model`): `small` by default
3. Sets language (`--lang`): `en` for English
4. Transcribes with VAD filter (Voice Activity Detection)
5. Outputs clean text to stdout
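The filtering-and-joining step in `transcribe()` (step 5) can be exercised without downloading any model by stubbing the segment objects faster-whisper yields; `SimpleNamespace` here is a stand-in for the library's segment type, used only for illustration:

```python
from types import SimpleNamespace

def join_segments(segments) -> str:
    # Same expression transcribe() uses: drop empty/whitespace-only
    # segments, strip each, join with single spaces.
    return " ".join(s.text.strip() for s in segments if s.text and s.text.strip()).strip()

# Stub segments, including a whitespace-only one that should be dropped
segs = [
    SimpleNamespace(text="  Hello."),
    SimpleNamespace(text="   "),
    SimpleNamespace(text="How are you?"),
]
print(join_segments(segs))  # → Hello. How are you?
```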

### Make file executable:

```bash
chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
```

---

## OpenClaw Configuration

### 1. Configure STT (`tools.media.audio`)

Add to `~/.openclaw/openclaw.json`:

```json5
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
```

**Parameters:**

| Parameter | Value | Description |
|-----------|----------|-----------|
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to python in venv |
| `args` | argument array | Arguments for script |
| `{{MediaPath}}` | placeholder | Replaced with audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |
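How the gateway resolves this entry internally is OpenClaw's business, but the effect of the `{{MediaPath}}` placeholder can be sketched: substitute the audio path into `args`, expand `~`, and you get the final command line. The function below is illustrative, not OpenClaw's actual code; it also checks the `maxBytes` arithmetic (20 × 1024 × 1024 = 20 MB):

```python
import os

MAX_BYTES = 20 * 1024 * 1024  # 20971520 bytes, the 20 MB limit from the config

def build_command(command: str, args: list[str], media_path: str) -> list[str]:
    # Replace the {{MediaPath}} placeholder and expand ~ in paths,
    # mirroring what a "cli"-type model entry resolves to.
    resolved = [a.replace("{{MediaPath}}", media_path) for a in args]
    return [os.path.expanduser(command)] + [os.path.expanduser(a) for a in resolved]

cmd = build_command(
    "~/.openclaw/workspace/voice-messages/bin/python",
    ["~/.openclaw/workspace/voice-messages/transcribe.py",
     "--audio", "{{MediaPath}}", "--lang", "en", "--model", "small"],
    "/tmp/voice.ogg",
)
print(cmd[3])  # → /tmp/voice.ogg
assert MAX_BYTES == 20971520
```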

### 2. Configure TTS (`messages.tts`)

Add to `~/.openclaw/openclaw.json`:

```json5
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}
```

**Parameters:**

| Parameter | Value | Description |
|-----------|----------|-----------|
| `auto` | `"inbound"` | **Key mode!** — reply with voice only on incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see available below) |
| `lang` | `"en-US"` | Locale (en-US for US english) |

### 3. Full configuration example

```json5
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
```
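If you script the config edit rather than pasting the full example by hand, the STT and TTS fragments need to be merged into your existing `openclaw.json` without clobbering sibling keys (such as `ackReactionScope`). A minimal recursive merge, offered as a sketch rather than OpenClaw tooling:

```python
def deep_merge(base: dict, extra: dict) -> dict:
    # Merge `extra` into a copy of `base`: dict values merge recursively,
    # everything else (lists, scalars) is replaced by `extra`'s value.
    out = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

existing = {"messages": {"ackReactionScope": "group-mentions"}}
tts_fragment = {"messages": {"tts": {"auto": "inbound", "provider": "edge"}}}
merged = deep_merge(existing, tts_fragment)
print(merged["messages"]["ackReactionScope"])  # → group-mentions (preserved)
print(merged["messages"]["tts"]["provider"])   # → edge
```

Note that the `models` array in the STT fragment is replaced wholesale, not appended to, under this merge rule.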

---

## Apply Changes

### Restart Gateway

```bash
# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)
```

---

## Testing

### Test STT (transcription)

**Action:** Send a voice message to your Telegram bot

**Expected result:**
```
[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>
```

**Example response:**
```
[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?
```
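If you want to pull the transcript back out of that log line for your own scripted checks, a regex over the format shown above works; note the format is just what the example displays, not a documented contract:

```python
import re

line = ("[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] "
        "<media:audio> Transcript: Hello. How are you?")

# Everything after "Transcript: " is the transcribed text.
m = re.search(r"Transcript:\s*(.+)$", line)
print(m.group(1))  # → Hello. How are you?
```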

### Test TTS (voice replies)

**Action:** After successful transcription, bot should send a voice reply

**Expected result:**
- Voice file arrives in Telegram
- Voice note (round bubble)

**Expected behavior:**
- Incoming voice → bot replies with voice
- Text messages → bot replies with text (this is normal!)

---

## Available Edge TTS Voices

### Female voices

| Voice | ID | Usage example |
|--------|-----|------------------|
| Jenny | `en-US-JennyNeural` | ← current |
| Ana | `en-US-AnaNeural` | Softer |

### Male voices

| Voice | ID | Usage example |
|--------|-----|------------------|
| Roger | `en-US-RogerNeural` | More bass |

**How to change voice:**
```bash
jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' ~/.openclaw/openclaw.json > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
```
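The jq edit above has no backup step, and the evaluation recommends backing up `openclaw.json` before modifying it. A Python equivalent that copies the file aside first (a sketch — demonstrated on a throwaway temp file rather than the real `~/.openclaw/openclaw.json`; the stock `json` module only handles plain JSON, not JSON5 comments or trailing commas):

```python
import json
import os
import shutil
import tempfile

def set_voice(config_path: str, voice: str) -> None:
    # Back up, then rewrite .messages.tts.edge.voice -- same effect as the jq edit.
    shutil.copy2(config_path, config_path + ".bak")
    with open(config_path) as f:
        cfg = json.load(f)
    cfg.setdefault("messages", {}).setdefault("tts", {}).setdefault("edge", {})["voice"] = voice
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)

# Demo on a throwaway file instead of ~/.openclaw/openclaw.json
path = os.path.join(tempfile.mkdtemp(), "openclaw.json")
with open(path, "w") as f:
    json.dump({"messages": {"tts": {"edge": {"voice": "en-US-JennyNeural"}}}}, f)

set_voice(path, "en-US-MichelleNeural")
with open(path) as f:
    print(json.load(f)["messages"]["tts"]["edge"]["voice"])  # → en-US-MichelleNeural
```

Restart the gateway afterwards, as shown above, for the change to take effect.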

---

## Additional Edge TTS Parameters

### Adjusting speed, pitch, volume

```json5
{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",      // Speed: -50% to +100%
        "pitch": "-5%",     // Pitch: -50% to +50%
        "volume": "+5%"     // Volume: -100% to +100%
      }
    }
  }
}
```
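These values are signed percentage strings. A small validator (illustrative; the ranges come from the comments above) catches typos such as a missing sign before you restart the gateway:

```python
import re

def valid_pct(value: str, lo: int, hi: int) -> bool:
    # Must look like "+10%" or "-5%": explicit sign, integer, trailing %,
    # with the number inside [lo, hi].
    m = re.fullmatch(r"([+-]\d+)%", value)
    return m is not None and lo <= int(m.group(1)) <= hi

print(valid_pct("+10%", -50, 100))  # rate: True
print(valid_pct("-5%", -50, 50))    # pitch: True
print(valid_pct("10%", -50, 100))   # missing sign: False
```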

---

## Troubleshooting

### Problem: Voice not transcribed

**Logs show:**
```
[ERROR] Transcription failed
```

**Possible causes:**
1. **File too large** — > 20 MB
   ```bash
   # Solution: Increase maxBytes in config
   maxBytes: 52428800  # 50 MB
   ```

2. **Timeout** — transcription took > 2 minutes
   ```bash
   # Solution: Increase timeoutSeconds
   timeoutSeconds: 180  # 3 minutes
   ```

3. **Model not downloaded** — first run
   ```bash
   # Solution: Wait while it downloads (1-2 minutes)
   # Models are cached in ~/.cache/huggingface/
   ```
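If you want to reproduce the timeout behavior outside the gateway (for example while tuning `timeoutSeconds`), `subprocess` gives a close approximation; this is illustrative, not how OpenClaw itself invokes the model:

```python
import subprocess
import sys

def run_with_timeout(cmd: list[str], timeout_seconds: int) -> str:
    # Returns stdout, or raises subprocess.TimeoutExpired, roughly as the
    # gateway would abort a transcription that exceeds timeoutSeconds.
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_seconds)
    return result.stdout.strip()

# Stand-in for the transcribe.py call; a real invocation would pass --audio etc.
out = run_with_timeout([sys.executable, "-c", "print('transcript')"], timeout_seconds=5)
print(out)  # → transcript

try:
    run_with_timeout([sys.executable, "-c", "import time; time.sleep(10)"], timeout_seconds=1)
except subprocess.TimeoutExpired:
    print("timed out")  # → timed out
```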

### Problem: No voice reply

**Possible causes:**
1. **Reply too short** (< 10 characters)
   - TTS skips very short replies
   - Solution: this is expected behavior

2. **auto: "inbound"** but text message
   - TTS in `inbound` mode replies with voice only on **voice messages**
   - Text messages get text replies — this is correct!

3. **Edge TTS unavailable**
   ```bash
   # Check
   curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
   # If error — temporarily unavailable
   ```

---

## Performance

### Transcription time (Raspberry Pi 4/ARM)

| Whisper Model | Est. time | Quality |
|---------------|--------------|---------|
| `tiny` | ~5-10 sec | Low |
| `base` | ~10-20 sec | Medium |
| `small` | ~20-40 sec | High ← current |
| `medium` | ~40-80 sec | Very high |
| `large` | ~80-160 sec | Maximum |

**Recommendation:** For Raspberry Pi use `small` or `base`. `medium`/`large` will be very slow.

### Where Whisper models are stored

```bash
~/.cache/huggingface/
```

Models download automatically on first run.

## Done! 🎉

After completing these steps:

1. ✅ faster-whisper installed in venv
2. ✅ `transcribe.py` script created
3. ✅ OpenClaw configured (STT + TTS)
4. ✅ Gateway restarted
5. ✅ Voice messages working

Now your Telegram bot:
- 🎙️ **Accepts voice** → transcribes via faster-whisper
- 🎤 **Replies with voice** → generates via Edge TTS
- 💬 **Accepts text** → replies with text (as usual)

---

**Useful links:**
- OpenClaw docs: https://docs.openclaw.ai
- TTS docs: https://docs.openclaw.ai/tts
- Audio docs: https://docs.openclaw.ai/nodes/audio
- Install skills: `npx clawhub search voice`

---

*Created: 2026-03-01 for OpenClaw 2026.2.26*