Skill details (on-site mirror, no comments)
License: MIT-0
MIT-0 · Free to use, modify, and redistribute. No attribution required.
Version: v1.0.0
Stats: ⭐ 0 · 38 · 1 current install · 1 all-time install
Package: adisinghstudent/voicebox-voice-synthesis
Security scans (ClawHub)
- VirusTotal: benign
- OpenClaw: benign
OpenClaw assessment
The skill's instructions, requirements, and actions are consistent with a local-first Voicebox TTS/voice-cloning helper; it does not request unrelated credentials or attempt unexplained actions.
Purpose
The name/description (Voicebox voice synthesis) match the SKILL.md content: instructions to install/run Voicebox and calls against a local REST API (http://localhost:17493). There are no environment variables, binaries, or config paths requested that are unrelated to running or integrating with a local TTS service.
Instruction scope
Instructions stay within the stated purpose (installation, running the backend, and calling the local REST API). Two points to be aware of: (1) several examples show uploading local audio files (e.g., /path/to/voice-sample.wav) — those may be sensitive (personal audio) and you should only upload audio you control/are permitted to use; (2) the SKILL.md points you to external download URLs and to run build tasks (just setup / just dev) which exe…
Installation mechanism
The skill is instruction-only (no install spec). It links to pre-built binaries on voicebox.sh and to a GitHub repo for building from source. Because the skill only suggests those links (it does not itself download/install), risk is low, but users should verify the legitimacy and checksums of any downloaded binaries and the authenticity of the GitHub repo before running installers or build scripts.
Credentials
The skill declares no required environment variables or credentials. Example code references an optional VOICEBOX_API_URL env var (normal for a configurable client). There are no unrelated secret requests or excessive environment access.
Persistence
The skill does not request always:true or other elevated persistence. It is user-invocable and allows normal autonomous invocation (platform default) but does not claim permanent/system-wide presence or modify other skills.
Overall conclusion
This skill is coherent and appears to be an instruction helper for running a local Voicebox TTS service. Before you install or run anything: (1) verify the download domain (voicebox.sh) and the GitHub repo's authenticity and check release checksums where available; (2) prefer building/running inside a container or isolated VM if you are unsure about running third‑party binaries or build scripts; (3) be cautious about uploading audio samples — …
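The checksum advice above can be sketched in Python; the file path and expected digest below are placeholders for whatever the release page publishes, and this is only one reasonable way to do the comparison:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected_hex: str) -> bool:
    """Compare a downloaded file against a published SHA-256 checksum."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Run this against the downloaded installer before executing it, comparing against the checksum published alongside the release (when one is available).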
Installation (copy to 龙虾 AI)
Copy the entire block below into a 龙虾中文库 conversation, and 龙虾 will complete the installation per SKILL.md.
Hand this block to 龙虾中文库 (龙虾 AI) for execution: install the OpenClaw skill "voicebox-voice-synthesis" on this machine. Summary: Expert skill for Voicebox — the open-source local voice cloning and TTS studio …。
Fetch the following URL, read the SKILL.md, and complete the installation as documented: https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/adisinghstudent/voicebox-voice-synthesis/SKILL.md
(Source: yingzhi8.cn skill library)
SKILL.md
---
name: voicebox-voice-synthesis
description: Expert skill for Voicebox — the open-source local voice cloning and TTS studio built with Tauri, React, and FastAPI
triggers:
- "clone a voice with voicebox"
- "generate speech locally with voicebox"
- "set up voicebox voice synthesis"
- "use voicebox API to synthesize speech"
- "add TTS to my app with voicebox"
- "configure voicebox TTS engine"
- "apply voice effects in voicebox"
- "voicebox stories editor multi-voice"
---
# Voicebox Voice Synthesis Studio
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
Voicebox is a local-first, open-source voice cloning and TTS studio — a self-hosted alternative to ElevenLabs. It runs entirely on your machine (macOS MLX/Metal, Windows/Linux CUDA, CPU fallback), exposes a REST API on `localhost:17493`, and ships with 5 TTS engines, 23 languages, post-processing effects, and a multi-track Stories editor.
---
## Installation
### Pre-built Binaries (Recommended)
| Platform | Link |
|---|---|
| macOS Apple Silicon | https://voicebox.sh/download/mac-arm |
| macOS Intel | https://voicebox.sh/download/mac-intel |
| Windows | https://voicebox.sh/download/windows |
| Docker | `docker compose up` |
Linux requires building from source: https://voicebox.sh/linux-install
### Build from Source
**Prerequisites:** [Bun](https://bun.sh), [Rust](https://rustup.rs), [Python 3.11+](https://python.org), Tauri prerequisites
```bash
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
# Install just task runner
brew install just # macOS
cargo install just # any platform
# Set up Python venv + all dependencies
just setup
# Start backend + desktop app in dev mode
just dev
```
```bash
# List all available commands
just --list
```
---
## Architecture
| Layer | Technology |
|---|---|
| Desktop App | Tauri (Rust) |
| Frontend | React + TypeScript + Tailwind CSS |
| State | Zustand + React Query |
| Backend | FastAPI (Python) on port 17493 |
| TTS Engines | Qwen3-TTS, LuxTTS, Chatterbox, Chatterbox Turbo, TADA |
| Effects | Pedalboard (Spotify) |
| Transcription | Whisper / Whisper Turbo |
| Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
| Database | SQLite |
The Python FastAPI backend handles all ML inference. The Tauri Rust shell wraps the frontend and manages the backend process lifecycle. The API is accessible directly at `http://localhost:17493` even when using the desktop app.
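As a quick sanity check that the backend is reachable, a minimal Python probe can hit the `/health` endpoint (the same one used in the Troubleshooting section); this sketch only assumes a 200 status on success, not any particular response body:

```python
import urllib.request
import urllib.error

VOICEBOX_URL = "http://localhost:17493"


def backend_is_up(base_url: str = VOICEBOX_URL, timeout: float = 2.0) -> bool:
    """Return True if the Voicebox backend answers on its health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: backend not reachable
        return False
```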
---
## REST API Reference
Base URL: `http://localhost:17493`
Interactive docs: `http://localhost:17493/docs`
### Generate Speech
```bash
# Basic generation
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello world, this is a voice clone.",
    "profile_id": "abc123",
    "language": "en"
  }'

# With engine selection
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Speak slowly and with gravitas.",
    "profile_id": "abc123",
    "language": "en",
    "engine": "qwen3-tts"
  }'

# With paralinguistic tags (Chatterbox Turbo only)
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "That is absolutely hilarious! [laugh] I cannot believe it.",
    "profile_id": "abc123",
    "engine": "chatterbox-turbo",
    "language": "en"
  }'
```
### Voice Profiles
```bash
# List all profiles
curl http://localhost:17493/profiles

# Create a new profile
curl -X POST http://localhost:17493/profiles \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Narrator",
    "language": "en",
    "description": "Deep narrative voice"
  }'

# Upload an audio sample to a profile
curl -X POST http://localhost:17493/profiles/{profile_id}/samples \
  -F "file=@/path/to/voice-sample.wav"

# Export a profile
curl http://localhost:17493/profiles/{profile_id}/export \
  --output narrator-profile.zip

# Import a profile
curl -X POST http://localhost:17493/profiles/import \
  -F "file=@narrator-profile.zip"
```
### Generation Queue & Status
```bash
# Get generation status (SSE stream)
curl -N http://localhost:17493/generate/{generation_id}/status
# List recent generations
curl http://localhost:17493/generations
# Retry a failed generation
curl -X POST http://localhost:17493/generations/{generation_id}/retry
# Download generated audio
curl http://localhost:17493/generations/{generation_id}/audio \
  --output output.wav
```
### Models
```bash
# List available models and download status
curl http://localhost:17493/models
# Unload a model from GPU memory (without deleting)
curl -X POST http://localhost:17493/models/{model_id}/unload
```
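If VRAM is tight, the two model endpoints above can be combined into a small cleanup script. This sketch assumes `GET /models` returns a JSON list of objects with `id` and `loaded` fields, as the `jq` filter in the Troubleshooting section suggests; the field names are not otherwise verified against the API:

```python
import json
import urllib.request

VOICEBOX_URL = "http://localhost:17493"


def loaded_model_ids(models: list[dict]) -> list[str]:
    """Pick out the ids of models currently loaded in GPU memory."""
    return [m["id"] for m in models if m.get("loaded")]


def unload_all(base_url: str = VOICEBOX_URL) -> list[str]:
    """Unload every loaded model via POST /models/{model_id}/unload."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        models = json.load(resp)
    unloaded = []
    for model_id in loaded_model_ids(models):
        req = urllib.request.Request(
            f"{base_url}/models/{model_id}/unload", method="POST"
        )
        urllib.request.urlopen(req)
        unloaded.append(model_id)
    return unloaded
```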
---
## TypeScript/JavaScript Integration
### Basic TTS Client
```typescript
const VOICEBOX_URL = process.env.VOICEBOX_API_URL ?? "http://localhost:17493";
interface GenerateRequest {
text: string;
profile_id: string;
language?: string;
engine?: "qwen3-tts" | "luxtts" | "chatterbox" | "chatterbox-turbo" | "tada";
}
interface GenerateResponse {
generation_id: string;
status: "queued" | "processing" | "complete" | "failed";
audio_url?: string;
}
async function generateSpeech(req: GenerateRequest): Promise<GenerateResponse> {
const response = await fetch(`${VOICEBOX_URL}/generate`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(req),
});
if (!response.ok) {
throw new Error(`Voicebox API error: ${response.status} ${await response.text()}`);
}
return response.json();
}
// Usage
const result = await generateSpeech({
text: "Welcome to our application.",
profile_id: "abc123",
language: "en",
engine: "qwen3-tts",
});
console.log("Generation ID:", result.generation_id);
```
### Poll for Completion
```typescript
async function waitForGeneration(
generationId: string,
timeoutMs = 60_000
): Promise<string> {
const start = Date.now();
while (Date.now() - start < timeoutMs) {
const res = await fetch(`${VOICEBOX_URL}/generations/${generationId}`);
const data = await res.json();
if (data.status === "complete") {
return `${VOICEBOX_URL}/generations/${generationId}/audio`;
}
if (data.status === "failed") {
throw new Error(`Generation failed: ${data.error}`);
}
await new Promise((r) => setTimeout(r, 1000));
}
throw new Error("Generation timed out");
}
```
### Stream Status with SSE
```typescript
function streamGenerationStatus(
generationId: string,
onStatus: (status: string) => void
): () => void {
const eventSource = new EventSource(
`${VOICEBOX_URL}/generate/${generationId}/status`
);
eventSource.onmessage = (event) => {
const data = JSON.parse(event.data);
onStatus(data.status);
if (data.status === "complete" || data.status === "failed") {
eventSource.close();
}
};
eventSource.onerror = () => eventSource.close();
// Return cleanup function
return () => eventSource.close();
}
// Usage
const cleanup = streamGenerationStatus("gen_abc123", (status) => {
console.log("Status update:", status);
});
```
### Download Audio as Blob
```typescript
async function downloadAudio(generationId: string): Promise<Blob> {
const response = await fetch(
`${VOICEBOX_URL}/generations/${generationId}/audio`
);
if (!response.ok) {
throw new Error(`Failed to download audio: ${response.status}`);
}
return response.blob();
}
// Play in browser
async function playGeneratedAudio(generationId: string): Promise<void> {
const blob = await downloadAudio(generationId);
const url = URL.createObjectURL(blob);
const audio = new Audio(url);
audio.play();
audio.onended = () => URL.revokeObjectURL(url);
}
```
---
## Python Integration
```python
import httpx
import asyncio
VOICEBOX_URL = "http://localhost:17493"
async def generate_speech(
text: str,
profile_id: str,
language: str = "en",
engine: str = "qwen3-tts"
) -> bytes:
async with httpx.AsyncClient(timeout=120.0) as client:
# Submit generation
resp = await client.post(
f"{VOICEBOX_URL}/generate",
json={
"text": text,
"profile_id": profile_id,
"language": language,
"engine": engine,
}
)
resp.raise_for_status()
generation_id = resp.json()["generation_id"]
# Poll until complete
for _ in range(120):
status_resp = await client.get(
f"{VOICEBOX_URL}/generations/{generation_id}"
)
status_data = status_resp.json()
if status_data["status"] == "complete":
audio_resp = await client.get(
f"{VOICEBOX_URL}/generations/{generation_id}/audio"
)
return audio_resp.content
if status_data["status"] == "failed":
raise RuntimeError(f"Generation failed: {status_data.get('error')}")
await asyncio.sleep(1.0)
raise TimeoutError("Generation timed out after 120s")
# Usage
audio_bytes = asyncio.run(
generate_speech(
text="The quick brown fox jumps over the lazy dog.",
profile_id="your-profile-id",
language="en",
engine="chatterbox",
)
)
with open("output.wav", "wb") as f:
f.write(audio_bytes)
```
---
## TTS Engine Selection Guide
| Engine | Best For | Languages | VRAM | Notes |
|---|---|---|---|---|
| `qwen3-tts` (0.6B/1.7B) | Quality + instructions | 10 | Medium | Supports delivery instructions in text |
| `luxtts` | Fast CPU generation | English only | ~1GB | 150x realtime on CPU, 48kHz |
| `chatterbox` | Multilingual coverage | 23 | Medium | Arabic, Hindi, Swahili, CJK + more |
| `chatterbox-turbo` | Expressive/emotion | English only | Low (350M) | Use `[laugh]`, `[sigh]`, `[gasp]` tags |
| `tada` (1B/3B) | Long-form coherence | 10 | High | 700s+ audio, HumeAI model |
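The table above can be folded into a small helper that picks an engine from the stated trade-offs. The decision order here (language first, then length, expressiveness, hardware) is one reasonable choice for a client, not anything mandated by the Voicebox API:

```python
def choose_engine(
    language: str = "en",
    long_form: bool = False,
    expressive: bool = False,
    cpu_only: bool = False,
) -> str:
    """Map the trade-offs from the engine table to an engine id."""
    if language != "en":
        return "chatterbox"        # widest language coverage (23 languages)
    if long_form:
        return "tada"              # long-form coherence, 700s+ audio
    if expressive:
        return "chatterbox-turbo"  # [laugh]/[sigh] paralinguistic tags
    if cpu_only:
        return "luxtts"            # ~150x realtime on CPU, 48kHz output
    return "qwen3-tts"             # quality plus delivery instructions
```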
### Delivery Instructions (Qwen3-TTS)
Embed natural language instructions directly in the text:
```typescript
await generateSpeech({
text: "(whisper) I have a secret to tell you.",
profile_id: "abc123",
engine: "qwen3-tts",
});
await generateSpeech({
text: "(speak slowly and clearly) Step one: open the application.",
profile_id: "abc123",
engine: "qwen3-tts",
});
```
### Paralinguistic Tags (Chatterbox Turbo)
```typescript
const tags = [
"[laugh]", "[chuckle]", "[gasp]", "[cough]",
"[sigh]", "[groan]", "[sniff]", "[shush]", "[clear throat]"
];
await generateSpeech({
text: "Oh really? [gasp] I had no idea! [laugh] That's incredible.",
profile_id: "abc123",
engine: "chatterbox-turbo",
});
```
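Since the tags are documented for Chatterbox Turbo only, a defensive client might strip them before sending text to another engine. In this sketch the tag list comes from the array above, while the whitespace cleanup is an illustrative heuristic:

```python
import re

# Tags documented for the chatterbox-turbo engine
PARALINGUISTIC_TAGS = {
    "[laugh]", "[chuckle]", "[gasp]", "[cough]",
    "[sigh]", "[groan]", "[sniff]", "[shush]", "[clear throat]",
}


def strip_tags_if_unsupported(text: str, engine: str) -> str:
    """Remove paralinguistic tags unless the engine is chatterbox-turbo,
    the only engine documented to honour them."""
    if engine == "chatterbox-turbo":
        return text
    for tag in PARALINGUISTIC_TAGS:
        text = text.replace(tag, "")
    # Collapse the double spaces left behind by removed tags
    return re.sub(r"  +", " ", text).strip()
```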
---
## Environment & Configuration
```bash
# Custom models directory (set before launching)
export VOICEBOX_MODELS_DIR=/path/to/models
# For AMD ROCm GPU (auto-configured, but can override)
export HSA_OVERRIDE_GFX_VERSION=11.0.0
```
Docker configuration (`docker-compose.yml` override):
```yaml
services:
voicebox:
environment:
- VOICEBOX_MODELS_DIR=/models
volumes:
- /host/models:/models
ports:
- "17493:17493"
# For NVIDIA GPU passthrough:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
```
---
## Common Patterns
### Voice Profile Creation Flow
```typescript
// 1. Create profile
const profile = await fetch(`${VOICEBOX_URL}/profiles`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ name: "My Voice", language: "en" }),
}).then((r) => r.json());
// 2. Upload audio sample (WAV/MP3, ideally 5–30 seconds clean speech)
const formData = new FormData();
formData.append("file", audioBlob, "sample.wav");
await fetch(`${VOICEBOX_URL}/profiles/${profile.id}/samples`, {
method: "POST",
body: formData,
});
// 3. Generate with the new profile
const gen = await generateSpeech({
text: "Testing my cloned voice.",
profile_id: profile.id,
});
```
### Batch Generation with Queue
```typescript
async function batchGenerate(
items: Array<{ text: string; profileId: string }>,
engine = "qwen3-tts"
): Promise<string[]> {
// Submit all — Voicebox queues them serially to avoid GPU contention
const submissions = await Promise.all(
items.map((item) =>
generateSpeech({ text: item.text, profile_id: item.profileId, engine })
)
);
// Wait for all completions
const audioUrls = await Promise.all(
submissions.map((s) => waitForGeneration(s.generation_id))
);
return audioUrls;
}
```
### Long-Form Text (Auto-Chunking)
Voicebox auto-chunks at sentence boundaries — just send the full text:
```typescript
// Up to 50,000 characters are supported per request
const longScript = `
Chapter one. The morning fog rolled across the valley floor...
`;
await generateSpeech({
text: longScript,
profile_id: "narrator-profile-id",
engine: "tada", // Best for long-form coherence
language: "en",
});
```
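Voicebox chunks server-side, but a client-side guard against the 50,000-character per-request limit can be sketched as well. The limit constant comes from the note above; the sentence-splitting regex is an illustrative heuristic, not part of the API:

```python
import re

MAX_CHARS = 50_000  # documented per-request limit


def split_script(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be submitted as its own `/generate` request against the same profile.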
---
## Troubleshooting
### API not responding
```bash
# Check if backend is running
curl http://localhost:17493/health
# Restart backend only (dev mode)
just backend
# Check logs
just logs
```
### GPU not detected
```bash
# Check detected backend
curl http://localhost:17493/system/info
# Force CPU mode (set before launch)
export VOICEBOX_FORCE_CPU=1
```
### Model download fails / slow
```bash
# Set custom models directory with more space
export VOICEBOX_MODELS_DIR=/path/with/space
just dev
# Cancel stuck download via API
curl -X DELETE http://localhost:17493/models/{model_id}/download
```
### Out of VRAM — unload models
```bash
# List loaded models
curl http://localhost:17493/models | jq '.[] | select(.loaded == true)'
# Unload specific model
curl -X POST http://localhost:17493/models/{model_id}/unload
```
### Audio quality issues
- Use 5–30 seconds of clean, noise-free speech for voice samples
- Multiple samples improve clone quality — upload 3–5 different sentences
- For multilingual cloning, use `chatterbox` engine
- Ensure sample audio is 16kHz+ mono WAV for best results
- Use `luxtts` for highest output quality (48kHz) in English
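For WAV samples, the guidelines above can be checked programmatically with Python's standard `wave` module; the 5-30 second and 16 kHz mono thresholds come from the bullets above:

```python
import wave


def check_sample(path: str) -> list[str]:
    """Flag common voice-sample issues: not mono, sample rate under 16 kHz,
    or duration outside the recommended 5-30 second window."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("not mono")
        if wav.getframerate() < 16_000:
            problems.append(f"sample rate {wav.getframerate()} Hz < 16 kHz")
        duration = wav.getnframes() / wav.getframerate()
        if not 5 <= duration <= 30:
            problems.append(f"duration {duration:.1f}s outside 5-30s window")
    return problems
```

An empty list means the sample passes these basic checks; it does not guarantee clone quality, which also depends on noise level and speech clarity.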
### Generation stuck in queue after crash
Voicebox auto-recovers stale generations on startup. If the issue persists:
```bash
curl -X POST http://localhost:17493/generations/{generation_id}/retry
```
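To retry every failed generation in one pass, the list and retry endpoints can be combined. This sketch assumes `GET /generations` returns a JSON list of objects with `generation_id` and `status` fields, which matches the client types earlier in this document but is not verified against the API:

```python
import json
import urllib.request

VOICEBOX_URL = "http://localhost:17493"


def failed_ids(generations: list[dict]) -> list[str]:
    """Pick out generation ids whose status is 'failed'."""
    return [
        g["generation_id"] for g in generations if g.get("status") == "failed"
    ]


def retry_failed(base_url: str = VOICEBOX_URL) -> list[str]:
    """Retry each failed generation via POST /generations/{id}/retry."""
    with urllib.request.urlopen(f"{base_url}/generations") as resp:
        generations = json.load(resp)
    retried = []
    for gen_id in failed_ids(generations):
        req = urllib.request.Request(
            f"{base_url}/generations/{gen_id}/retry", method="POST"
        )
        urllib.request.urlopen(req)
        retried.append(gen_id)
    return retried
```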
---
## Frontend Integration (React Example)
```tsx
import { useState } from "react";
const VOICEBOX_URL = import.meta.env.VITE_VOICEBOX_URL ?? "http://localhost:17493";
export function VoiceGenerator({ profileId }: { profileId: string }) {
const [text, setText] = useState("");
const [audioUrl, setAudioUrl] = useState<string | null>(null);
const [loading, setLoading] = useState(false);
const handleGenerate = async () => {
setLoading(true);
try {
const res = await fetch(`${VOICEBOX_URL}/generate`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ text, profile_id: profileId, language: "en" }),
});
const { generation_id } = await res.json();
// Poll for completion
let done = false;
while (!done) {
await new Promise((r) => setTimeout(r, 1000));
const statusRes = await fetch(`${VOICEBOX_URL}/generations/${generation_id}`);
const { status } = await statusRes.json();
if (status === "complete") {
setAudioUrl(`${VOICEBOX_URL}/generations/${generation_id}/audio`);
done = true;
} else if (status === "failed") {
throw new Error("Generation failed");
}
}
} finally {
setLoading(false);
}
};
return (
<div>
<textarea value={text} onChange={(e) => setText(e.target.value)} />
<button onClick={handleGenerate} disabled={loading}>
{loading ? "Generating..." : "Generate Speech"}
</button>
{audioUrl && <audio controls src={audioUrl} />}
</div>
);
}
```