Converts text to speech using various TTS engines and generates synchronized viseme data for facial animation. Returns real-time streaming data via Server-Sent Events (SSE) for low-latency playback.
Supports multiple TTS engines:
The response streams both audio chunks (base64-encoded PCM) and viseme timing data.
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
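As a minimal sketch of how a client might build that header in Python: the helper name `bearer_headers` is hypothetical, and the `Accept: text/event-stream` header is an assumption based on common SSE practice rather than a documented requirement of this endpoint.

```python
def bearer_headers(token: str) -> dict[str, str]:
    """Build request headers carrying the bearer token.

    The Accept header is an assumption (typical for SSE clients);
    only the Authorization form comes from the documentation above.
    """
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "text/event-stream",
    }
```

The resulting dictionary can be passed as the headers argument of whichever HTTP client you use.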
A successful response streams synchronized audio and viseme data using Server-Sent Events.
The stream contains alternating audio and visemes events:
Events are delivered in real time, with a typical first-chunk latency of 200-500 ms.
The response is a Server-Sent Events stream in which each event is a "data: " prefix followed by a JSON payload.
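The framing described above can be parsed with a few lines of Python. This is a sketch under stated assumptions, not the service's client library: it assumes each event carries its whole JSON payload on a single data line, and the JSON keys used here ("type", "data") are hypothetical placeholders, since the actual event schema is not shown in this section.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of every 'data:' line in an SSE stream.

    Blank separator lines and SSE comment lines (starting with ':') are skipped.
    Assumes one complete JSON document per data line.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)


def decode_audio(event: dict, key: str = "data") -> bytes:
    """Decode a base64-encoded audio chunk; the key name is a hypothetical placeholder."""
    return base64.b64decode(event[key])


# Example with made-up events; real field names may differ.
sample = [
    'data: {"type": "audio", "data": "AAEC"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"type": "visemes", "visemes": []}\n',
]
events = list(iter_sse_json(sample))
pcm = decode_audio(events[0])
```

In a real client, `lines` would be the decoded line iterator of the streaming HTTP response, and each decoded PCM chunk would be queued for playback as it arrives.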
Event types: