{"id":647,"date":"2026-03-21T22:52:58","date_gmt":"2026-03-21T14:52:58","guid":{"rendered":"https:\/\/pa.yingzhi8.cn\/index.php\/2026\/03\/21\/tools-tts\/"},"modified":"2026-03-21T23:08:54","modified_gmt":"2026-03-21T15:08:54","slug":"tools-tts","status":"publish","type":"post","link":"https:\/\/pa.yingzhi8.cn\/index.php\/2026\/03\/21\/tools-tts\/","title":{"rendered":"Text-to-Speech"},"content":{"rendered":"<h1>Text-to-Speech<\/h1>\n<h1>Text-to-speech (TTS)<\/h1>\n<p>OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, or OpenAI.<br \/>\nIt works anywhere OpenClaw can send audio; Telegram gets a round voice-note bubble.<\/p>\n<h2>Supported services<\/h2>\n<ul>\n<li><strong>ElevenLabs<\/strong> (primary or fallback provider)<\/li>\n<li><strong>Microsoft<\/strong> (primary or fallback provider; current bundled implementation uses <code>node-edge-tts<\/code>, default when no API keys)<\/li>\n<li><strong>OpenAI<\/strong> (primary or fallback provider; also used for summaries)<\/li>\n<\/ul>\n<h3>Microsoft speech notes<\/h3>\n<p>The bundled Microsoft speech provider currently uses Microsoft Edge&#8217;s online<br \/>\nneural TTS service via the <code>node-edge-tts<\/code> library. It&#8217;s a hosted service (not<br \/>\nlocal), uses Microsoft endpoints, and does not require an API key.<br \/>\n<code>node-edge-tts<\/code> exposes speech configuration options and output formats, but<br \/>\nnot all options are supported by the service. Legacy config and directive input<br \/>\nusing <code>edge<\/code> still works and is normalized to <code>microsoft<\/code>.<\/p>\n<p>Because this path is a public web service without a published SLA or quota,<br \/>\ntreat it as best-effort. If you need guaranteed limits and support, use OpenAI<br \/>\nor ElevenLabs.<\/p>\n<h2>Optional keys<\/h2>\n<p>If you want OpenAI or ElevenLabs:<\/p>\n<ul>\n<li><code>ELEVENLABS_API_KEY<\/code> (or <code>XI_API_KEY<\/code>)<\/li>\n<li><code>OPENAI_API_KEY<\/code><\/li>\n<\/ul>\n<p>Microsoft speech does <strong>not<\/strong> require an API key. If no API keys are found,<br \/>\nOpenClaw defaults to Microsoft (unless disabled via<br \/>\n<code>messages.tts.microsoft.enabled=false<\/code> or <code>messages.tts.edge.enabled=false<\/code>).<\/p>\n<p>If multiple providers are configured, the selected provider is used first and the others are fallback options.<br \/>\nAuto-summary uses the configured <code>summaryModel<\/code> (or <code>agents.defaults.model.primary<\/code>),<br \/>\nso that provider must also be authenticated if you enable summaries.<\/p>\n<h2>Service links<\/h2>\n<ul>\n<li><a href=\"https:\/\/platform.openai.com\/docs\/guides\/text-to-speech\">OpenAI Text-to-Speech guide<\/a><\/li>\n<li><a href=\"https:\/\/platform.openai.com\/docs\/api-reference\/audio\">OpenAI Audio API reference<\/a><\/li>\n<li><a href=\"https:\/\/elevenlabs.io\/docs\/api-reference\/text-to-speech\">ElevenLabs Text to Speech<\/a><\/li>\n<li><a href=\"https:\/\/elevenlabs.io\/docs\/api-reference\/authentication\">ElevenLabs Authentication<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/SchneeHertz\/node-edge-tts\">node-edge-tts<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/azure\/ai-services\/speech-service\/rest-text-to-speech#audio-outputs\">Microsoft Speech output formats<\/a><\/li>\n<\/ul>\n<h2>Is it enabled by default?<\/h2>\n<p>No. Auto\u2011TTS is <strong>off<\/strong> by default. 
Enable it in config with `messages.tts.auto` or per session with `/tts always` (alias: `/tts on`).

Microsoft speech **is** enabled by default once TTS is on, and is used automatically
when no OpenAI or ElevenLabs API keys are available.

## Config

TTS config lives under `messages.tts` in `openclaw.json`.
The full schema is in [Gateway configuration](/gateway/configuration).

### Minimal config (enable + provider)

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
    },
  },
}
```

### OpenAI primary with ElevenLabs fallback

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "openai",
      summaryModel: "openai/gpt-4.1-mini",
      modelOverrides: {
        enabled: true,
      },
      openai: {
        apiKey: "openai_api_key",
        baseUrl: "https://api.openai.com/v1",
        model: "gpt-4o-mini-tts",
        voice: "alloy",
      },
      elevenlabs: {
        apiKey: "elevenlabs_api_key",
        baseUrl: "https://api.elevenlabs.io",
        voiceId: "voice_id",
        modelId: "eleven_multilingual_v2",
        seed: 42,
        applyTextNormalization: "auto",
        languageCode: "en",
        voiceSettings: {
          stability: 0.5,
          similarityBoost: 0.75,
          style: 0.0,
          useSpeakerBoost: true,
          speed: 1.0,
        },
      },
    },
  },
}
```

### Microsoft primary (no API key)

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "microsoft",
      microsoft: {
        enabled: true,
        voice: "en-US-MichelleNeural",
        lang: "en-US",
        outputFormat: "audio-24khz-48kbitrate-mono-mp3",
        rate: "+10%",
        pitch: "-5%",
      },
    },
  },
}
```

### Disable Microsoft speech

```json5
{
  messages: {
    tts: {
      microsoft: {
        enabled: false,
      },
    },
  },
}
```

### Custom limits + prefs path

```json5
{
  messages: {
    tts: {
      auto: "always",
      maxTextLength: 4000,
      timeoutMs: 30000,
      prefsPath: "~/.openclaw/settings/tts.json",
    },
  },
}
```

### Only reply with audio after an inbound voice note

```json5
{
  messages: {
    tts: {
      auto: "inbound",
    },
  },
}
```
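### Only reply with audio when the reply is tagged

A minimal sketch mirroring the inbound example above: with `auto: "tagged"`, audio is only generated when the reply includes `[[tts]]` directives (see model-driven overrides below).

```json5
{
  messages: {
    tts: {
      auto: "tagged",
    },
  },
}
```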
### Disable auto-summary for long replies

```json5
{
  messages: {
    tts: {
      auto: "always",
    },
  },
}
```

Then run:

```
/tts summary off
```

### Notes on fields

* `auto`: auto-TTS mode (`off`, `always`, `inbound`, `tagged`).
  * `inbound` only sends audio after an inbound voice note.
  * `tagged` only sends audio when the reply includes `[[tts]]` tags.
* `enabled`: legacy toggle (doctor migrates this to `auto`).
* `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
* `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, or `"openai"` (fallback is automatic).
* If `provider` is **unset**, OpenClaw prefers `openai` (if a key is set), then `elevenlabs` (if a key is set),
  otherwise `microsoft`.
* Legacy `provider: "edge"` still works and is normalized to `microsoft`.
* `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
  * Accepts `provider/model` or a configured model alias.
* `modelOverrides`: allow the model to emit TTS directives (on by default).
  * `allowProvider` defaults to `false` (provider switching is opt-in).
* `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
* `timeoutMs`: request timeout (ms).
* `prefsPath`: override the local prefs JSON path (provider/limit/summary).
* `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `OPENAI_API_KEY`).
* `elevenlabs.baseUrl`: override the ElevenLabs API base URL.
* `openai.baseUrl`: override the OpenAI TTS endpoint (see the sketch after this list).
  * Resolution order: `messages.tts.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
  * Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted.
* `elevenlabs.voiceSettings`:
  * `stability`, `similarityBoost`, `style`: `0..1`
  * `useSpeakerBoost`: `true|false`
  * `speed`: `0.5..2.0` (1.0 = normal)
* `elevenlabs.applyTextNormalization`: `auto|on|off`
* `elevenlabs.languageCode`: 2-letter ISO 639-1 code (e.g. `en`, `de`)
* `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
* `microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
* `microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
* `microsoft.lang`: language code (e.g. `en-US`).
* `microsoft.outputFormat`: Microsoft output format (e.g. `audio-24khz-48kbitrate-mono-mp3`).
  * See Microsoft Speech output formats for valid values; not all formats are supported by the bundled Edge-backed transport.
* `microsoft.rate` / `microsoft.pitch` / `microsoft.volume`: percent strings (e.g. `+10%`, `-5%`).
* `microsoft.saveSubtitles`: write JSON subtitles alongside the audio file.
* `microsoft.proxy`: proxy URL for Microsoft speech requests.
* `microsoft.timeoutMs`: request timeout override (ms).
* `edge.*`: legacy alias for the same Microsoft settings.
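For `openai.baseUrl`, here is a rough sketch of pointing TTS at an OpenAI-compatible endpoint. The URL, model, and voice names below are hypothetical placeholders, not documented defaults:

```json5
{
  messages: {
    tts: {
      provider: "openai",
      openai: {
        // Hypothetical self-hosted, OpenAI-compatible TTS server.
        baseUrl: "http://localhost:8080/v1",
        // Non-default base URLs accept custom model and voice names.
        model: "my-local-tts-model",
        voice: "my-voice",
      },
    },
  },
}
```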
## Model-driven overrides (default on)

By default, the model **can** emit TTS directives for a single reply.
When `messages.tts.auto` is `tagged`, these directives are required to trigger audio.

When enabled, the model can emit `[[tts:...]]` directives to override the voice
for a single reply, plus an optional `[[tts:text]]...[[/tts:text]]` block to
provide expressive tags (laughter, singing cues, etc.) that should only appear in
the audio.

`provider=...` directives are ignored unless `modelOverrides.allowProvider: true`.

Example reply payload:

```
Here you go.

[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
[[tts:text]][laughs] Read the song once more.[[/tts:text]]
```

Available directive keys (when enabled):

* `provider` (registered speech provider id, for example `openai`, `elevenlabs`, or `microsoft`; requires `allowProvider: true`)
* `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
* `model` (OpenAI TTS model or ElevenLabs model id)
* `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
* `applyTextNormalization` (`auto|on|off`)
* `languageCode` (ISO 639-1)
* `seed`

Disable all model overrides:

```json5
{
  messages: {
    tts: {
      modelOverrides: {
        enabled: false,
      },
    },
  },
}
```

Optional allowlist (enable provider switching while keeping other knobs configurable):

```json5
{
  messages: {
    tts: {
      modelOverrides: {
        enabled: true,
        allowProvider: true,
        allowSeed: false,
      },
    },
  },
}
```

## Per-user preferences

Slash commands write local overrides to `prefsPath` (default:
`~/.openclaw/settings/tts.json`, override with `OPENCLAW_TTS_PREFS` or
`messages.tts.prefsPath`).

Stored fields:

* `enabled`
* `provider`
* `maxLength` (summary threshold; default 1500 chars)
* `summarize` (default `true`)

These override `messages.tts.*` for that host.
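As a rough illustration only, the prefs file might look like the sketch below. The field names follow the stored fields listed above; the exact file shape and the values shown are assumptions:

```json
{
  "enabled": true,
  "provider": "openai",
  "maxLength": 2000,
  "summarize": false
}
```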
## Output formats (fixed)

* **Telegram**: Opus voice note (`opus_48000_64` from ElevenLabs, `opus` from OpenAI).
  * 48kHz / 64kbps is a good voice-note tradeoff and is required for the round bubble.
* **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
  * 44.1kHz / 128kbps is the default balance for speech clarity.
* **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
  * The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
  * Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
  * Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need
    guaranteed Opus voice notes.
  * If the configured Microsoft output format fails, OpenClaw retries with MP3.

OpenAI/ElevenLabs formats are fixed; Telegram expects Opus for the voice-note UX.
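If you want Microsoft output closer to a Telegram voice note, here is a sketch using an Ogg Opus format name from the Microsoft Speech output-format list. Whether this particular format works over the bundled Edge-backed transport is not guaranteed; if it fails, OpenClaw retries with MP3 as noted above:

```json5
{
  messages: {
    tts: {
      provider: "microsoft",
      microsoft: {
        // Ogg Opus output; availability on the Edge-backed transport is best-effort.
        outputFormat: "ogg-48khz-16bit-mono-opus",
      },
    },
  },
}
```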
## Auto-TTS behavior

When enabled, OpenClaw:

* skips TTS if the reply already contains media or a `MEDIA:` directive.
* skips very short replies (< 10 chars).
* summarizes long replies when enabled, using `agents.defaults.model.primary` (or `summaryModel`).
* attaches the generated audio to the reply.

If the reply exceeds `maxLength` and summary is off (or there is no API key for the
summary model), audio is skipped and the normal text reply is sent.

## Flow diagram

```
Reply -> TTS enabled?
  no  -> send text
  yes -> has media / MEDIA: / short?
          yes -> send text
          no  -> length > limit?
                   no  -> TTS -> attach audio
                   yes -> summary enabled?
                            no  -> send text
                            yes -> summarize (summaryModel or agents.defaults.model.primary)
                                      -> TTS -> attach audio
```

## Slash command usage

There is a single command: `/tts`.
See [Slash commands](/tools/slash-commands) for enablement details.

Discord note: `/tts` is a built-in Discord command, so OpenClaw registers
`/voice` as the native command there. Text `/tts ...` still works.

```
/tts off
/tts always
/tts inbound
/tts tagged
/tts status
/tts provider openai
/tts limit 2000
/tts summary off
/tts audio Hello from OpenClaw
```

Notes:

* Commands require an authorized sender (allowlist/owner rules still apply).
* `commands.text` or native command registration must be enabled.
* `off|always|inbound|tagged` are per-session toggles (`/tts on` is an alias for `/tts always`).
* `limit` and `summary` are stored in local prefs, not the main config.
* `/tts audio` generates a one-off audio reply (it does not toggle TTS on).

## Agent tool

The `tts` tool converts text to speech and returns a `MEDIA:` path. When the
result is Telegram-compatible, the tool includes `[[audio_as_voice]]` so
Telegram sends a voice bubble.

## Gateway RPC

Gateway methods:

* `tts.status`
* `tts.enable`
* `tts.disable`
* `tts.convert`
* `tts.setProvider`
* `tts.providers`
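As a rough sketch of what a `tts.convert` call might look like over the gateway: the method name is from the list above, but the parameter and result field names here are assumptions for illustration only, not a documented contract.

```json5
// Hypothetical request shape for tts.convert (param names are assumptions).
{
  method: "tts.convert",
  params: {
    text: "Hello from OpenClaw",
  },
}
// The result is expected to reference the generated audio (e.g. a MEDIA: path),
// mirroring what the agent tool returns.
```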