Process Audio for Visemes
Takes base64-encoded audio and streams viseme (mouth shape) predictions using Server-Sent Events (SSE). Requires a valid API key.
POST
Process base64-encoded audio and receive streaming viseme predictions using Server-Sent Events (SSE). This endpoint is suited to real-time facial animation driven by audio input.
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
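As a minimal sketch of the header format described here, the helper below builds the request headers in Python. The function name `build_headers` is hypothetical; the `Content-Type` and `Accept` values are taken from the Body and Response content types listed on this page, and sending an `Accept` header at all is an assumption rather than a documented requirement.

```python
def build_headers(token: str) -> dict[str, str]:
    """Build headers for a request to this endpoint (hypothetical helper)."""
    if not token or token != token.strip():
        # Reject empty tokens and tokens with stray whitespace early.
        raise ValueError("token must be a non-empty string without surrounding whitespace")
    return {
        # Bearer authentication header of the form "Bearer <token>".
        "Authorization": f"Bearer {token}",
        # The request body is JSON.
        "Content-Type": "application/json",
        # Assumption: explicitly ask for the SSE response type.
        "Accept": "text/event-stream",
    }
```

The resulting dictionary can be passed to whichever HTTP client you use, provided it can read the response incrementally.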
Body
application/json
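The body schema is not reproduced on this page, so the sketch below only shows the base64 encoding step and leaves the JSON field name to the caller. Both `build_body` and its `field_name` parameter are hypothetical; take the real field name (and any additional required fields) from the endpoint's schema.

```python
import base64
import json


def build_body(audio: bytes, field_name: str) -> str:
    """Serialize raw audio bytes into a JSON request body (hypothetical helper).

    field_name is whatever key the endpoint's schema expects for the
    base64 audio; it is deliberately not hard-coded here.
    """
    # base64 output is pure ASCII, so decoding to str is safe.
    encoded = base64.b64encode(audio).decode("ascii")
    return json.dumps({field_name: encoded})
```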
Response
200
text/event-stream
Successful response streams viseme predictions using Server-Sent Events. Each event contains viseme data for a processed audio chunk.
Server-Sent Events stream with the following format:
- Each viseme event starts with the "data: " prefix, followed by a JSON payload
- Example normal event: data: {"visemes":[...],"chunk_progress":"1/10","chunk_duration_ms":100}
- Example error event: error: {"message":"Error message","chunk_id":5}
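The stream format above can be consumed line by line. The following is a minimal sketch, assuming the stream has already been split into text lines by your HTTP client; `parse_viseme_stream` and `parse_progress` are hypothetical helper names, and reading `chunk_progress` as "current/total" is an inference from the "1/10" example rather than a documented guarantee.

```python
import json
from typing import Iterable, Iterator


def parse_viseme_stream(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (kind, payload) pairs, where kind is "data" or "error"."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines separate SSE events
        for prefix, kind in (("data:", "data"), ("error:", "error")):
            if line.startswith(prefix):
                # Everything after the prefix is a JSON document.
                yield kind, json.loads(line[len(prefix):].strip())
                break
        # Lines with any other prefix are ignored in this sketch.


def parse_progress(chunk_progress: str) -> tuple[int, int]:
    """Split a value such as "1/10" into (1, 10). Assumed meaning: current/total."""
    current, total = chunk_progress.split("/")
    return int(current), int(total)
```

A caller would typically branch on the first element of each pair, feeding "data" payloads to the animation layer and surfacing "error" payloads along with their `chunk_id`.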