Prompt: Build Your Voice System

Copy-paste this into Claude Code to set up Chatterbox TTS with voice cloning and integrate it with OpenClaw.


Prerequisites

  • Mac with Apple Silicon (M1/M2/M3/M4) for MPS acceleration (works on Intel too, just slower)
  • Python 3.11+ and uv installed
  • OpenClaw running at localhost:18789
  • A short audio sample (15-60 seconds) of the voice you want to clone
    • Must be clean audio, minimal background noise
    • WAV or MP3 format
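
To check a WAV sample against the 15-60 second guideline before going further, the header can be read with Python's standard wave module. This is a minimal sketch: the helper names are hypothetical, it only handles uncompressed PCM WAV files (not MP3), and it says nothing about background noise.

```python
import wave

MIN_SECONDS = 15.0
MAX_SECONDS = 60.0

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

def is_usable_sample(path):
    """True if the sample length falls inside the suggested 15-60 s window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

Calling is_usable_sample('/path/to/your/voice-sample.wav') before cloning catches samples that are too short or too long.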

Part 1: Install Chatterbox TTS

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a directory for TTS
mkdir -p ~/tts && cd ~/tts

# Create a virtual environment
uv venv
source .venv/bin/activate

# Install Chatterbox
uv pip install chatterbox-tts

# Test it works (no voice clone, just default voice)
python -c "
from chatterbox.tts import ChatterboxTTS
import torchaudio
model = ChatterboxTTS.from_pretrained(device='mps')  # 'cpu' if no Apple Silicon
wav = model.generate('Hello, this is a test.')
torchaudio.save('test.wav', wav, model.sr)
print('Success! Check test.wav')
"

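Rather than hard-coding 'mps' or 'cpu', the device can be chosen at runtime. The sketch below assumes PyTorch is available (it is normally pulled in as a dependency of chatterbox-tts) and falls back to 'cpu' otherwise. pick_device is a hypothetical helper name, not part of Chatterbox.

```python
def pick_device():
    """Return 'mps' on Apple Silicon, 'cuda' on an NVIDIA GPU, else 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without torch there is nothing to accelerate; report the CPU fallback.
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on very old torch builds
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

if __name__ == "__main__":
    print(pick_device())
```

The result can be passed straight through, for example ChatterboxTTS.from_pretrained(device=pick_device()).
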
Part 2: Clone a Voice

# Clone a voice from your audio sample
python -c "
from chatterbox.tts import ChatterboxTTS
import torchaudio

model = ChatterboxTTS.from_pretrained(device='mps')

# Path to your audio sample (15-60 seconds)
VOICE_SAMPLE = '/path/to/your/voice-sample.wav'
VOICE_NAME = 'myvoice'

wav = model.generate(
    'This is a test of my cloned voice. Testing one two three.',
    audio_prompt_path=VOICE_SAMPLE,
    exaggeration=0.5,  # expressiveness: lower = flatter delivery, higher = more dramatic
)
torchaudio.save(f'{VOICE_NAME}-test.wav', wav, model.sr)
print(f'Voice clone test saved to {VOICE_NAME}-test.wav')
"

# Listen to it, then adjust exaggeration (0.3-0.7 is usually a good range)
open myvoice-test.wav
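
To compare several exaggeration values side by side instead of editing and rerunning the script, a small sweep helper can render one file per value. This is only a sketch: sweep_exaggeration and the output file naming are hypothetical, and the actual TTS and save calls are passed in by the caller.

```python
def sweep_exaggeration(generate, save, values=(0.3, 0.4, 0.5, 0.6, 0.7),
                       voice_name="myvoice"):
    """Render one test clip per exaggeration value and return the file names."""
    paths = []
    for value in values:
        wav = generate(value)  # caller supplies the actual TTS call
        path = f"{voice_name}-exag-{value:.1f}.wav"
        save(path, wav)
        paths.append(path)
    return paths

# Example wiring with the objects from the script above (not executed here;
# TEXT is a placeholder for whatever sentence you want spoken):
#   sweep_exaggeration(
#       generate=lambda e: model.generate(
#           TEXT, audio_prompt_path=VOICE_SAMPLE, exaggeration=e),
#       save=lambda path, wav: torchaudio.save(path, wav, model.sr),
#   )
```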

Part 3: Build the TTS Server

Use this prompt with Claude Code:

I need a Chatterbox TTS server that acts as an ElevenLabs-compatible API,
so OpenClaw can use it for text-to-speech.

## Requirements

Build a Python FastAPI server that:

1. Accepts requests in ElevenLabs format:
   POST /v1/text-to-speech/{voice_id}
   Body: {"text": "...", "model_id": "..."}
   Returns: audio/mpeg (MP3 file)

2. Supports multiple voice clones (one per voice_id)

3. Uses Chatterbox TTS with MPS acceleration (Apple Silicon)

4. Responds fast (model loads once at startup, not per request)

5. Handles concurrent requests (async)
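
For requirements 4 and 5, one possible shape is sketched below. It assumes the model's generate() call is blocking, as in the earlier snippets, so it runs in a worker thread while a lock keeps a single generation on the device at a time. SpeechEngine is a hypothetical name and a starting point, not a required design.

```python
import asyncio

class SpeechEngine:
    """Wraps one already-loaded model so async handlers can share it."""

    def __init__(self, model):
        self._model = model          # loaded once at startup and reused
        self._lock = asyncio.Lock()  # serialize access to the single model

    async def synthesize(self, text, **options):
        async with self._lock:
            # generate() blocks, so run it in a worker thread to keep
            # the event loop free to accept other requests meanwhile.
            return await asyncio.to_thread(
                self._model.generate, text, **options)
```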

## Voice Configuration
Create a voices.json config file:
{
  "voices": {
    "myvoice": {
      "sample_path": "/path/to/sample.wav",
      "exaggeration": 0.5,
      "display_name": "My Voice"
    },
    "default": {
      "sample_path": null,
      "exaggeration": 0.3,
      "display_name": "Default"
    }
  }
}
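
A rough sketch of how server.py might load and sanity-check this file follows. The load_voices name, the default values, and the 0.0-1.0 range check are assumptions for illustration, not requirements.

```python
import json

def load_voices(path):
    """Load voices.json and return {voice_id: settings} with defaults filled in."""
    with open(path, encoding="utf-8") as config_file:
        voices = json.load(config_file)["voices"]
    for voice_id, settings in voices.items():
        settings.setdefault("exaggeration", 0.5)
        # A null sample_path becomes None, meaning "use the built-in voice".
        settings.setdefault("sample_path", None)
        settings.setdefault("display_name", voice_id)
        if not 0.0 <= settings["exaggeration"] <= 1.0:
            raise ValueError(
                f"{voice_id}: exaggeration must be between 0.0 and 1.0")
    return voices
```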

## Tech Stack
- Python with FastAPI
- uv for package management
- Chatterbox TTS (already installed in ~/tts/.venv)
- Run on port 4126
- Start with: python server.py

## File Structure
~/tts/
├── server.py          # FastAPI server
├── voices.json        # Voice configurations  
├── samples/           # Voice sample WAV files
│   └── myvoice.wav
├── cache/             # Generated audio cache (optional)
└── .venv/

## Extra Features
- Health check endpoint: GET /health
- List voices: GET /v1/voices
- Basic request caching by text+voice_id (optional, avoids regenerating identical audio)
- Logging with request duration
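
For the optional cache, a file name derived from a hash of the request is usually enough. The sketch below is one way to do it. cache_key is a hypothetical helper, and including the exaggeration value in the key is an added assumption so that changing it in voices.json does not serve stale audio.

```python
import hashlib

def cache_key(voice_id, text, exaggeration):
    """Stable file name for one (voice, exaggeration, text) combination."""
    # NUL separators keep ("ab", "c") and ("a", "bc") from colliding.
    payload = f"{voice_id}\x00{exaggeration}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest() + ".mp3"
```

The server can then look for cache/<key> before generating and write the MP3 there afterwards.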

## After building
I'll test with:
curl -X POST http://localhost:4126/v1/text-to-speech/myvoice \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from my cloned voice"}' \
  --output test.mp3 && open test.mp3
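
The same test can be run from Python with only the standard library. This is a sketch under the assumption that the finished server follows the endpoint shape described in the prompt. build_tts_request and save_speech are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

def build_tts_request(text, voice_id="myvoice",
                      base_url="http://localhost:4126"):
    """Build (but do not send) the POST request that the curl command makes."""
    voice = urllib.parse.quote(voice_id, safe="")
    return urllib.request.Request(
        url=f"{base_url}/v1/text-to-speech/{voice}",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def save_speech(text, out_path="test.mp3", **kwargs):
    """Send the request and write the returned MP3 bytes to out_path."""
    with urllib.request.urlopen(build_tts_request(text, **kwargs)) as response:
        audio = response.read()
    with open(out_path, "wb") as out_file:
        out_file.write(audio)
    return out_path
```

With the server running, save_speech("Hello from my cloned voice") should produce test.mp3 in the current directory.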

Part 4: Connect to OpenClaw

After the server is running, configure OpenClaw:

// In ~/.openclaw/config.json, add:
{
  "tts": {
    "provider": "elevenlabs",
    "baseUrl": "http://localhost:4126",
    "apiKey": "local",
    "defaultVoice": "myvoice"
  }
}

Then per-agent voice configuration:

# In your agent YAML:
id: chief
voice: myvoice   # matches the voice_id in your voices.json

Test it:

openclaw tts test "Hello, this is my AI assistant speaking."

Part 5: Auto-Start on Login

Help me set up a macOS LaunchAgent to auto-start my Chatterbox TTS server
when I log in.

The server runs as: cd ~/tts && .venv/bin/python server.py
Port: 4126
Log file: ~/tts/chatterbox.log

Create:
1. ~/Library/LaunchAgents/com.myai.chatterbox.plist (LaunchAgent config)
2. ~/tts/start.sh (start script with proper env)

The LaunchAgent should:
- Start on login
- Restart if it crashes
- Log stdout and stderr to ~/tts/chatterbox.log

Load it:

launchctl load ~/Library/LaunchAgents/com.myai.chatterbox.plist
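
For reference, the plist requested above typically comes down to a handful of standard launchd keys. The sketch below builds one with Python's plistlib. build_launch_agent is a hypothetical helper, and it launches the venv's Python directly rather than through start.sh, which is a simplification. launchd does not expand ~, so paths are made absolute first.

```python
import plistlib
from pathlib import Path

def build_launch_agent(tts_dir, label="com.myai.chatterbox"):
    """Return LaunchAgent plist bytes that run server.py from tts_dir."""
    tts_dir = Path(tts_dir).expanduser()
    log_path = str(tts_dir / "chatterbox.log")
    config = {
        "Label": label,
        "ProgramArguments": [str(tts_dir / ".venv/bin/python"), "server.py"],
        "WorkingDirectory": str(tts_dir),
        "RunAtLoad": True,   # start when the agent is loaded, i.e. at login
        "KeepAlive": True,   # relaunch the server whenever it exits
        "StandardOutPath": log_path,
        "StandardErrorPath": log_path,
    }
    return plistlib.dumps(config)
```

Writing the returned bytes to ~/Library/LaunchAgents/com.myai.chatterbox.plist gives a file that the launchctl command above can load.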

Voice Quality Tips

  • Sample quality > sample length. A clean 20s sample beats a noisy 2-minute one.
  • Exaggeration 0.3-0.5 works for most voices. Higher = more distinctive but potentially robotic.
  • Multiple samples for the same voice (different sentences, same speaker) improve consistency.
  • Public domain voices: Many historical speeches are in the public domain and make excellent samples.

Ethical Considerations

  • Only clone voices you have permission to use
  • Never use cloned voices to impersonate real people in deceptive contexts
  • Using cloned celebrity voices for personal productivity (your own ears only) is generally acceptable
  • For public presentations: use original samples or truly synthesized voices