Qwen3 Audio — 技能 — openclaw中文资讯站

技能详情（站内镜像，无评论）

High-performance audio library for Apple Silicon with text-to-speech (TTS) and speech-to-text (STT).

媒体与内容

作者：noah @DarkNoah

许可证：MIT-0

MIT-0 ·免费使用、修改和重新分发。无需归因。

版本：v0.1.1

统计：⭐ 0 · 245 · 0 current installs · 0 all-time installs

⭐ 0

安装量（当前） 0

🛡 VirusTotal ：可疑 · OpenClaw ：良性

Package：darknoah/qwen3-audio

安全扫描（ClawHub）

VirusTotal ：可疑
OpenClaw ：良性

OpenClaw 评估

The skill's code and instructions are coherent with a local TTS/STT wrapper around the mlx-audio package, but it will install/run third-party code and contact model hubs at runtime — review dependency and network behavior before use.

综合结论

This skill appears to do what it claims (local TTS/STT using the mlx-audio package), but it will: (1) install a third‑party Python package at runtime via the 'uv' tool; (2) access network endpoints to check/download model data (Hugging Face or a mirror) and reference an external test audio URL; and (3) read/write audio/text files (including creating a voices/ folder and temp chunk files). Before installing/using: review the mlx-audio package a…

安装（复制给龙虾 AI）

将下方整段复制到龙虾中文库对话中，由龙虾按 SKILL.md 完成安装。

请把本段交给龙虾中文库（龙虾 AI）执行：为本机安装 OpenClaw 技能「Qwen3 Audio」。简介：High-performance audio library for Apple Silicon with text-to-speech (TTS) and …。
请 fetch 以下地址读取 SKILL.md 并按文档完成安装：https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/darknoah/qwen3-audio/SKILL.md
（来源：yingzhi8.cn 技能库）

SKILL.md

打开原始 SKILL.md（GitHub raw）

---
name: qwen3-audio
description: "High-performance audio library for Apple Silicon with text-to-speech (TTS) and speech-to-text (STT)."
version: "0.0.3"
---

# Qwen3-Audio

## Overview

Qwen3-Audio is a high-performance audio processing library optimized for Apple Silicon (M1/M2/M3/M4). It delivers fast, efficient TTS and STT with support for multiple models, languages, and audio formats.

## Prerequisites

- Python 3.10+
- Apple Silicon Mac (M1/M2/M3/M4)

### Environment checks

Before using any capability, verify that all items in `./references/env-check-list.md` are complete.

## Capabilities

### Text to Speech
```bash
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav"
```

**Returns (JSON):**
```json
{
  "audio_path": "/path_to_save.wav",
  "duration": 1.234,
  "sample_rate": 24000
}
```

### Voice Cloning
Clone any voice using a reference audio sample. Provide the wav file and its transcript:
```bash
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_audio "sample_audio.wav" --ref_text "This is what my voice sounds like."
```
ref_audio: reference audio to clone
ref_text: transcript of the reference audio

### Use Created Voice (Shortcut)
Use a voice created with `voice create` by its ID:
```bash
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_voice "my-voice-id"
```
This automatically loads `ref_audio` and `ref_text` from the voice profile.

### CustomVoice (Emotion Control)
Use predefined voices with emotion/style instructions:
```bash
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --speaker "Ryan" --language "English" --instruct "Very happy and excited."
```

### VoiceDesign (Create Any Voice)
Create any voice from a text description:
```bash
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --language "English" --instruct "A cheerful young female voice with high pitch and energetic tone."
```

### Automatic Speech Recognition (STT)
```bash
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" stt --audio "/sample_audio.wav" --output "/path_to_save.txt" --output-format srt
```
Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav
output-format: "txt" | "ass" | "srt" | "all"

**Returns (JSON):**
```json
{
  "text": "transcribed text content",
  "duration": 10.5,
  "sample_rate": 16000,
  "files": ["/path_to_save.txt", "/path_to_save.srt"]
}
```

### Voice Management

Voices are stored in the `voices/` directory at the skill root level. Each voice has its own folder containing:
- `ref_audio.wav` - Reference audio file
- `ref_text.txt` - Reference text transcript
- `ref_instruct.txt` - Voice style description

#### Create a Voice
Create a reusable voice profile using VoiceDesign model. The `--instruct` parameter is required to describe the voice style:
```bash
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice create --text "This is a sample voice reference text." --instruct "A warm, friendly female voice with a professional tone." --language "English"
```
Optional: `--id "my-voice-id"` to specify a custom voice ID.

**Returns (JSON):**
```json
{
  "id": "abc12345",
  "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
  "ref_text": "This is a sample voice reference text.",
  "instruct": "A warm, friendly female voice with a professional tone.",
  "duration": 3.456,
  "sample_rate": 24000
}
```

#### List Voices
List all created voice profiles:
```bash
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice list
```

**Returns (JSON):**
```json
[
  {
    "id": "abc12345",
    "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
    "ref_text": "This is a sample voice reference text.",
    "instruct": "A warm, friendly female voice with a professional tone.",
    "duration": 3.456,
    "sample_rate": 24000
  }
]
```

#### Use a Created Voice
After creating a voice, use it for TTS with the `--ref_voice` parameter. The instruct will be automatically loaded:
```bash
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "New text to speak" --output "/output.wav" --ref_voice "abc12345"
```

## Predefined Speakers (CustomVoice)

For `Qwen3-TTS-12Hz-1.7B/0.6B-CustomVoice` models, the supported speakers and their descriptions are listed below. We recommend using each speaker's native language for best quality. Each speaker can still speak any language supported by the model.

| Speaker | Voice Description | Native Language |
| --- | --- | --- |
| Vivian | Bright, slightly edgy young female voice. | Chinese |
| Serena | Warm, gentle young female voice. | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre. | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre. | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness. | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive. | English |
| Aiden | Sunny American male voice with a clear midrange. | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre. | Japanese |
| Sohee | Warm Korean female voice with rich emotion. | Korean |


### Released Models

| Model | Features | Language Support | Instruction Control |
|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Performs voice design based on user-provided descriptions. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ |
| Qwen3-TTS-12Hz-1.7B-Base | Base model capable of 3-second rapid voice clone from user audio input; can be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian |  |