Converts text to speech using various TTS engines and generates synchronized viseme data for facial animation. Returns real-time streaming data via Server-Sent Events (SSE) for low-latency playback.
Supports multiple TTS engines: mascotbot, elevenlabs, and cartesia.
The response streams both audio chunks (base64-encoded PCM) and viseme timing data.
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
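As a minimal sketch of the header format described above, the helper below builds request headers in Python. The name `build_headers` and the `Content-Type` and `Accept` values are assumptions for illustration; only the `Authorization: Bearer <token>` form comes from this reference.

```python
def build_headers(token: str) -> dict:
    """Build request headers with Bearer authentication (hypothetical helper)."""
    if not token:
        raise ValueError("an auth token is required")
    return {
        # Documented form: "Bearer <token>"
        "Authorization": f"Bearer {token}",
        # Assumed, not documented here: JSON request body, SSE response
        "Content-Type": "application/json",
        "Accept": "text/event-stream",
    }
```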
Text to synthesize into speech
Maximum length: 5000. Example: "Hello world! How are you today?"
Voice ID to use for synthesis
"N2lVS1w4EtoT3dr4eOWO"
TTS engine to use
Available options: mascotbot, elevenlabs, cartesia. Example: "elevenlabs"
API key for external TTS engines (required for elevenlabs/cartesia)
"sk_your_elevenlabs_api_key_here"
Speech speed multiplier (e.g., 1.0 = normal, 1.2 = 20% faster)
Required range: 0.5 <= x <= 2. Example: 1.1
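The constraints on the request parameters can be checked client-side before sending. The sketch below is one way to do that, treating 5000 as the maximum text length and reading the speed range as 0.5 to 2. The function name `build_payload` and the JSON field names (`text`, `voice`, `tts_engine`, `tts_api_key`, `speed`) are hypothetical, since this section does not list the actual field names.

```python
import json

ENGINES = {"mascotbot", "elevenlabs", "cartesia"}
EXTERNAL_ENGINES = {"elevenlabs", "cartesia"}  # documented as requiring an API key
MAX_TEXT_LENGTH = 5000          # assumed to be a character limit
MIN_SPEED, MAX_SPEED = 0.5, 2.0  # assumed reading of the documented range


def build_payload(text, voice, engine, api_key=None, speed=1.0):
    """Validate parameters and return a JSON request body as a string.

    The field names in the payload are assumptions, not documented names.
    """
    if not text or len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text must be 1-{MAX_TEXT_LENGTH} characters")
    if engine not in ENGINES:
        raise ValueError(f"engine must be one of {sorted(ENGINES)}")
    if engine in EXTERNAL_ENGINES and not api_key:
        raise ValueError(f"an API key is required for engine '{engine}'")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")

    payload = {"text": text, "voice": voice, "tts_engine": engine, "speed": speed}
    if api_key:
        payload["tts_api_key"] = api_key
    return json.dumps(payload)
```

Validating locally avoids a round trip for requests the server would reject anyway, for example a missing key when `engine` is `elevenlabs`.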
A successful response streams synchronized audio and viseme data using Server-Sent Events.
The stream contains alternating audio and visemes events.
Events are delivered in real time, with a typical first-chunk latency of 200-500 ms.
The response body is a Server-Sent Events stream: each event is a line with a "data: " prefix followed by a JSON payload.
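A client therefore needs to strip the "data: " prefix and parse the remainder as JSON. The following is a minimal sketch of that step, assuming each event fits on a single `data:` line. The helper names `iter_sse_events` and `decode_audio` are hypothetical, and the JSON field names that carry the event type and audio are not specified in this section, so the code leaves them to the caller.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the parsed JSON payload of each 'data:' line in an SSE stream."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments (": ..."), other fields
        body = line[len("data:"):].lstrip(" ")
        if body:
            yield json.loads(body)


def decode_audio(b64_chunk: str) -> bytes:
    """Decode one base64-encoded audio chunk into raw PCM bytes."""
    return base64.b64decode(b64_chunk)
```

`iter_sse_events` accepts any iterable of text lines, so it can be fed from a streaming HTTP response or from a recorded stream in tests. The sample rate and sample format of the decoded PCM are not stated here and must be taken from the rest of the API reference.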
Event types: