Supported TTS Engines
- mascotbot (default): Built-in TTS engine
- elevenlabs: High-quality TTS with custom voices
- cartesia: Alternative TTS provider
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
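As a minimal sketch of the header format described above, the following Python helper builds the header dictionary. The function name `build_auth_headers` is hypothetical and only illustrates the `Bearer <token>` form.

```python
def build_auth_headers(token: str) -> dict[str, str]:
    """Return an Authorization header of the form 'Bearer <token>'."""
    token = token.strip()
    if not token:
        raise ValueError("auth token must not be empty")
    return {"Authorization": f"Bearer {token}"}
```

The resulting dictionary can be passed as the headers argument of whichever HTTP client you use.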
Body
Text to synthesize into speech
Maximum length: 5000
Example: "Hello world! How are you today?"
Voice ID to use for synthesis
Example: "N2lVS1w4EtoT3dr4eOWO"
TTS engine to use
Available options: mascotbot, elevenlabs, cartesia
Example: "elevenlabs"
API key for external TTS engines (required for elevenlabs/cartesia)
Example: "sk_your_elevenlabs_api_key_here"
Speech speed multiplier (e.g., 1.0 = normal, 1.2 = 20% faster)
Required range: 0.5 <= x <= 2
Example: 1.1
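The constraints on the body can be checked client-side before sending a request. The sketch below is illustrative only: the JSON key names (`text`, `voice`, `tts_engine`, `tts_api_key`, `speed`) are hypothetical because this page does not list them, and the limits are read from the values above as a maximum text length of 5000 and a speed range of 0.5 to 2. Treating the voice ID as optional is also an assumption.

```python
from typing import Optional

ENGINES = ("mascotbot", "elevenlabs", "cartesia")
EXTERNAL_ENGINES = ("elevenlabs", "cartesia")  # these need an API key
MAX_TEXT_LENGTH = 5000                         # assumed reading of the limit
SPEED_MIN, SPEED_MAX = 0.5, 2.0                # assumed reading of the range


def build_body(text: str, voice: Optional[str] = None,
               engine: str = "mascotbot", api_key: Optional[str] = None,
               speed: float = 1.0) -> dict:
    """Validate the parameters and return a request body dictionary."""
    if not text or len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text must be 1 to {MAX_TEXT_LENGTH} characters")
    if engine not in ENGINES:
        raise ValueError(f"engine must be one of {ENGINES}")
    if engine in EXTERNAL_ENGINES and not api_key:
        raise ValueError(f"an API key is required for {engine}")
    if not SPEED_MIN <= speed <= SPEED_MAX:
        raise ValueError(f"speed must be in [{SPEED_MIN}, {SPEED_MAX}]")

    # Key names below are hypothetical placeholders, not the documented schema.
    body = {"text": text, "tts_engine": engine, "speed": speed}
    if voice is not None:
        body["voice"] = voice
    if api_key is not None:
        body["tts_api_key"] = api_key
    return body
```

Failing early on an out-of-range speed or a missing key for elevenlabs or cartesia avoids opening a stream that the server would reject anyway.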
Response
A successful response streams synchronized audio and viseme data using Server-Sent Events.
The stream alternates between audio and visemes events and may also carry error events:
- Audio events contain base64-encoded PCM audio data for playback
- Visemes events contain facial animation data synchronized with audio
- Error events indicate processing errors
Events are delivered in real time, with a typical first-chunk latency of 200-500 ms.
The response body is a Server-Sent Events stream in which each event is a "data: " prefix followed by a JSON payload.
Event types:
- audio: Contains PCM audio data and timing info
- visemes: Contains viseme IDs and timing offsets
- error: Contains error information
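A client can consume the stream by stripping the "data: " prefix, parsing the JSON, and dispatching on the event type. The sketch below assumes one JSON object per "data: " line, a hypothetical `type` field that names the event, and a hypothetical `data` field holding the base64-encoded PCM audio. None of these field names are confirmed by this page.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the parsed JSON payload of every 'data: ' line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)


def collect_stream(lines: Iterable[str]) -> tuple[bytes, list[dict]]:
    """Gather decoded PCM bytes and viseme events; raise on an error event."""
    audio = bytearray()
    visemes: list[dict] = []
    for event in iter_events(lines):
        kind = event.get("type")  # hypothetical field name
        if kind == "audio":
            # "data" is a hypothetical field holding base64-encoded PCM
            audio.extend(base64.b64decode(event["data"]))
        elif kind == "visemes":
            visemes.append(event)
        elif kind == "error":
            raise RuntimeError(f"TTS stream error: {event}")
    return bytes(audio), visemes
```

`collect_stream` buffers the whole response for simplicity. For low-latency playback, hand each decoded audio chunk to the player as it arrives instead of accumulating it.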