Converts text to speech using various TTS engines and generates synchronized viseme data for facial animation. Returns real-time streaming data via Server-Sent Events (SSE) for low-latency playback.
Supports multiple TTS engines: mascotbot-tts, elevenlabs, and cartesia.
The response streams both audio chunks (base64-encoded PCM) and viseme timing data.
Use the model parameter to select a viseme model optimized for your content's language:
- "default" — English (used when model is omitted)
- "indonesian" — Bahasa Indonesia

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Text to synthesize into speech
5000"Hello world! How are you today?"
Voice ID to use for synthesis
"N2lVS1w4EtoT3dr4eOWO"
TTS engine to use
Available options: mascotbot-tts, elevenlabs, cartesia. Example: "elevenlabs"
API key for external TTS engines (required for elevenlabs/cartesia)
"sk_your_elevenlabs_api_key_here"
Speech speed multiplier (e.g., 1.0 = normal, 1.2 = 20% faster)
Required range: 0.5 <= x <= 2. Example: 1.1
Viseme model to use for prediction. Different models are optimized for different languages.
Available models: default (English), indonesian (Bahasa Indonesia).
Available options: default, indonesian. Example: "default"
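Putting the parameters above together, a request body might look like the following sketch. The JSON field names (text, voice_id, engine, api_key, speed, model) are assumptions inferred from the parameter descriptions, not confirmed wire names.

```python
import json

# Hypothetical request body assembled from the documented parameters.
# Field names are assumed; check the actual endpoint schema.
payload = {
    "text": "Hello world! How are you today?",    # max length 5000
    "voice_id": "N2lVS1w4EtoT3dr4eOWO",           # voice ID for synthesis
    "engine": "elevenlabs",                       # mascotbot-tts | elevenlabs | cartesia
    "api_key": "sk_your_elevenlabs_api_key_here", # required for elevenlabs/cartesia
    "speed": 1.1,                                 # 0.5 <= x <= 2, 1.0 = normal
    "model": "default",                           # default | indonesian
}
body = json.dumps(payload)
```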
Successful response streams synchronized audio and viseme data using Server-Sent Events.
The stream contains alternating audio and visemes events:
Events are delivered in real-time with typical first chunk latency of 200-500ms.
Server-Sent Events stream where each event is a "data: " prefix followed by JSON.
Event types: audio (a base64-encoded PCM chunk) and visemes (viseme timing data).
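A minimal client-side sketch of consuming this stream: split off the "data: " prefix, parse the JSON, and base64-decode audio chunks. The sample lines and field names ("type", "data", "visemes") imitate the stream and are assumptions; the real payload keys may differ.

```python
import base64
import json

# Simulated SSE lines in the documented "data: <JSON>" shape.
# Field names inside the JSON are assumed for illustration.
sample_stream = [
    'data: {"type": "audio", "data": "' + base64.b64encode(b"\x00\x01").decode() + '"}',
    'data: {"type": "visemes", "visemes": [{"viseme": "A", "time": 0.12}]}',
]

def parse_sse_line(line: str):
    """Return the decoded JSON payload of a 'data: ' line, else None."""
    if not line.startswith("data: "):
        return None
    return json.loads(line[len("data: "):])

audio_chunks, viseme_events = [], []
for line in sample_stream:
    event = parse_sse_line(line)
    if event is None:
        continue
    if event["type"] == "audio":
        audio_chunks.append(base64.b64decode(event["data"]))  # raw PCM bytes
    elif event["type"] == "visemes":
        viseme_events.append(event["visemes"])
```

In a real client the same loop would run over the HTTP response as lines arrive, feeding PCM bytes to the audio player and viseme timings to the animation layer.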