Generate Speech and Visemes
Converts text to speech using various TTS engines and generates synchronized viseme data for facial animation. Returns real-time streaming data via Server-Sent Events (SSE) for low-latency playback.
Supported TTS Engines
- mascotbot (default): Built-in TTS engine
- elevenlabs: High-quality TTS with custom voices
- cartesia: Alternative TTS provider
The response streams both audio chunks (base64-encoded PCM) and viseme timing data.
Authorizations
Bearer authentication header of the form `Bearer <token>`, where `<token>` is your auth token.
Body
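The body parameters are not enumerated here, so the following is only a rough sketch: it assumes a JSON body with hypothetical `text`, `engine`, and `voice_id` fields and uses a placeholder endpoint URL.

```typescript
// Sketch of a request to this endpoint. The URL and the body fields
// (text, engine, voice_id) are assumptions for illustration, not the
// documented schema.
const ENDPOINT_URL = "https://api.example.com/v1/speech"; // placeholder URL

interface SpeechRequest {
  text: string;                                     // text to synthesize
  engine?: "mascotbot" | "elevenlabs" | "cartesia"; // defaults to mascotbot
  voice_id?: string;                                // hypothetical voice selector
}

async function requestSpeechStream(
  body: SpeechRequest,
  token: string
): Promise<Response> {
  const response = await fetch(ENDPOINT_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`, // Bearer auth as described above
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`Speech request failed with status ${response.status}`);
  }
  return response; // response.body carries the SSE stream described below
}
```

The returned `Response` is then consumed as an SSE stream, as described in the Response section below.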
Response
Successful response streams synchronized audio and viseme data using Server-Sent Events.
The stream interleaves audio and visemes events, with error events emitted if processing fails:
- Audio events contain base64-encoded PCM audio data for playback
- Visemes events contain facial animation data synchronized with audio
- Error events indicate processing errors
Events are delivered in real-time with typical first chunk latency of 200-500ms.
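As a rough TypeScript model of these events: only the event names and their general contents come from the description above; the concrete field names are assumptions.

```typescript
// Hypothetical shapes for the streamed events. The event names (audio,
// visemes, error) are documented; the field names are assumptions.
type StreamEvent =
  | { type: "audio"; data: string; start_ms?: number }       // base64-encoded PCM + timing info
  | { type: "visemes"; ids: number[]; offsets_ms: number[] } // viseme IDs + timing offsets
  | { type: "error"; message: string };
```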
Server-Sent Events stream; each message is a "data: " prefix followed by a JSON payload.
Event types:
- audio: Contains PCM audio data and timing info
- visemes: Contains viseme IDs and timing offsets
- error: Contains error information
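A minimal consumption sketch, assuming each SSE message is a single `data: ` line whose JSON payload matches the hypothetical `StreamEvent` shape above:

```typescript
// Reads the SSE stream, strips the "data: " prefix from each line,
// parses the JSON payload, and dispatches on the event type.
async function consumeSpeechStream(response: Response): Promise<void> {
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffered = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffered += decoder.decode(value, { stream: true });

    // Keep any partial line in the buffer until the rest arrives.
    const lines = buffered.split("\n");
    buffered = lines.pop() ?? "";

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue; // skip blank lines / comments
      const event = JSON.parse(line.slice("data: ".length)) as StreamEvent;

      switch (event.type) {
        case "audio": {
          // Decode the base64 PCM chunk and hand it to your playback pipeline.
          const pcm = Uint8Array.from(atob(event.data), (c) => c.charCodeAt(0));
          void pcm; // e.g. schedule via Web Audio
          break;
        }
        case "visemes":
          // Drive facial animation from the viseme IDs and timing offsets.
          break;
        case "error":
          console.error("Stream error:", event.message);
          break;
      }
    }
  }
}
```

Combined with the request sketch above, `consumeSpeechStream(await requestSpeechStream({ text: "Hello!" }, token))` would start playback as soon as the first audio chunk arrives.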