POST /v1/visemes-audio
curl --request POST \
  --url https://api.mascot.bot/v1/visemes-audio \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "text": "Hello! This uses ElevenLabs for high-quality speech.",
  "tts_engine": "elevenlabs",
  "tts_api_key": "sk_your_elevenlabs_api_key_here",
  "voice": "N2lVS1w4EtoT3dr4eOWO",
  "speed": 1.1
}'
data: {"type":"audio","data":"UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=","sample_rate":16000,"sequence":0,"chunk_duration_ms":250}

Convert text to speech using various TTS engines and generate synchronized viseme data for facial animation. This endpoint returns real-time streaming data via Server-Sent Events (SSE) for low-latency playback.

Supported TTS Engines

  • mascotbot (default): Built-in TTS engine
  • elevenlabs: High-quality TTS with custom voices
  • cartesia: Alternative TTS provider

The response streams both audio chunks (base64-encoded PCM) and viseme timing data.
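
As an illustration of how one streamed line could be handled, the following minimal sketch strips the "data: " prefix, parses the JSON, and base64-decodes the audio payload. It uses the sample audio event from the example response; the helper names parse_sse_line and decode_audio are hypothetical and not part of the API.

```python
import base64
import json
from typing import Optional


def parse_sse_line(line: str) -> Optional[dict]:
    """Return the JSON payload of one 'data: ' line, or None for any other line."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separators, comments, and other SSE fields are skipped
    return json.loads(line[len("data:"):].strip())


def decode_audio(event: dict) -> bytes:
    """Base64-decode the 'data' field of an audio event into raw bytes."""
    return base64.b64decode(event["data"])


# Sample audio event taken from the example response.
sample = (
    'data: {"type":"audio",'
    '"data":"UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=",'
    '"sample_rate":16000,"sequence":0,"chunk_duration_ms":250}\n'
)

event = parse_sse_line(sample)
audio = decode_audio(event)
print(event["type"], event["sample_rate"], len(audio))
```

Note that this particular sample payload decodes to a 44-byte RIFF/WAVE header with no samples, so a client may want to inspect the first bytes of a chunk before treating it as headerless PCM.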

Authorizations

Authorization (string, header, required)

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json
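
The same request as the curl example can be built from Python's standard library. This is a sketch only: the body fields are the ones shown in the curl example, the helper name build_request is hypothetical, and which fields are optional is an assumption.

```python
import json
import urllib.request

API_URL = "https://api.mascot.bot/v1/visemes-audio"


def build_request(token: str, text: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request with a Bearer token and JSON body."""
    body = {"text": text, **options}  # options: tts_engine, voice, speed, ... (assumed optional)
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(
    "<token>",
    "Hello! This uses ElevenLabs for high-quality speech.",
    tts_engine="elevenlabs",
    tts_api_key="sk_your_elevenlabs_api_key_here",
    voice="N2lVS1w4EtoT3dr4eOWO",
    speed=1.1,
)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen and iterating over the response object yields the stream line by line as bytes.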

Response

200 (text/event-stream)

A successful response streams synchronized audio and viseme data using Server-Sent Events.

The stream alternates audio and visemes events and may also include error events:

  1. Audio events contain base64-encoded PCM audio data for playback
  2. Visemes events contain facial animation data synchronized with audio
  3. Error events indicate processing errors

Events are delivered in real time, with a typical first-chunk latency of 200-500 ms.
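
For playback, a client would typically buffer the decoded audio chunks in order. The sketch below is one way to do that, under two assumptions that this page does not confirm: that sequence is an increasing chunk index, and that chunk payloads are raw PCM that can simply be concatenated. The helper name assemble_audio and the demo payloads are hypothetical.

```python
import base64
from typing import Iterable, Tuple


def assemble_audio(events: Iterable[dict]) -> Tuple[bytes, int]:
    """Concatenate audio chunks by 'sequence' and sum their durations in ms."""
    audio_events = sorted(
        (e for e in events if e.get("type") == "audio"),
        key=lambda e: e["sequence"],  # assumption: sequence is a chunk index
    )
    pcm = b"".join(base64.b64decode(e["data"]) for e in audio_events)
    duration_ms = sum(e["chunk_duration_ms"] for e in audio_events)
    return pcm, duration_ms


# Hypothetical events: two tiny audio chunks given out of order, plus a visemes event.
events = [
    {"type": "audio", "data": base64.b64encode(b"\x02\x03").decode(),
     "sequence": 1, "chunk_duration_ms": 250},
    {"type": "visemes"},
    {"type": "audio", "data": base64.b64encode(b"\x00\x01").decode(),
     "sequence": 0, "chunk_duration_ms": 250},
]

pcm, duration_ms = assemble_audio(events)
print(len(pcm), duration_ms)
```

Sorting is defensive here; a single SSE connection delivers lines in the order they were sent.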

Each event is sent as a line with the "data: " prefix followed by a JSON object.

Event types:

  • audio: Contains PCM audio data and timing info
  • visemes: Contains viseme IDs and timing offsets
  • error: Contains error information
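
The three event types can be routed with a small dispatcher. This is a minimal sketch, not a reference client: the function name dispatch_events and the handler mapping are hypothetical, and because the fields of visemes and error events are not listed here, the code passes each payload through untouched rather than reading specific keys.

```python
import json
from typing import Callable, Dict, Iterable


def dispatch_events(
    lines: Iterable[str],
    handlers: Dict[str, Callable[[dict], None]],
) -> Dict[str, int]:
    """Route each 'data: ' line to the handler for its type; raise on error events."""
    counts: Dict[str, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data line
        event = json.loads(line[len("data:"):].strip())
        kind = event.get("type")
        if kind == "error":
            # Error payload fields are not documented here, so surface the whole event.
            raise RuntimeError(f"stream error: {event}")
        handler = handlers.get(kind)
        if handler is not None:
            handler(event)
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Hypothetical stream; the visemes event carries only its type in this demo.
received = []
demo_lines = [
    'data: {"type":"audio","data":"","sample_rate":16000,"sequence":0,"chunk_duration_ms":250}',
    "",
    'data: {"type":"visemes"}',
]
counts = dispatch_events(
    demo_lines,
    {"audio": received.append, "visemes": received.append},
)
print(counts)
```

Raising on the first error event keeps the sketch short; a real client might instead log the error and stop playback cleanly.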