POST /v1/visemes-audio
curl --request POST \
  --url https://api.mascot.bot/v1/visemes-audio \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "text": "Hello! This uses ElevenLabs for high-quality speech.",
    "tts_engine": "elevenlabs",
    "tts_api_key": "sk_your_elevenlabs_api_key_here",
    "voice": "N2lVS1w4EtoT3dr4eOWO",
    "speed": 1.1
  }'
Example response event:

data: {"type":"audio","data":"UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=","sample_rate":16000,"sequence":0,"chunk_duration_ms":250}
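Each event arrives as an SSE line: a `data: ` prefix followed by JSON, with audio carried as base64-encoded PCM. A minimal Python sketch of decoding one such line (the helper name `parse_sse_event` is illustrative, not part of any SDK):

```python
import base64
import json

def parse_sse_event(line: str) -> dict:
    """Strip the SSE 'data: ' prefix and decode the JSON payload."""
    if not line.startswith("data: "):
        raise ValueError("not an SSE data line")
    event = json.loads(line[len("data: "):])
    if event.get("type") == "audio":
        # Replace the base64 string with raw PCM bytes ready for playback.
        event["pcm"] = base64.b64decode(event.pop("data"))
    return event

# The sample audio event shown above:
sample = ('data: {"type":"audio","data":'
          '"UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=",'
          '"sample_rate":16000,"sequence":0,"chunk_duration_ms":250}')
event = parse_sse_event(sample)
```

The decoded bytes in the sample happen to be an empty 44-byte RIFF/WAVE header, which makes it convenient for smoke-testing a player pipeline.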
Convert text to speech using various TTS engines and generate synchronized viseme data for facial animation. This endpoint returns real-time streaming data via Server-Sent Events (SSE) for low-latency playback.

Supported TTS Engines

  • mascotbot (default): Built-in TTS engine
  • elevenlabs: High-quality TTS with custom voices
  • cartesia: Alternative TTS provider
The response streams both audio chunks (base64-encoded PCM) and viseme timing data.

Authorizations

Authorization
string, header, required
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body (application/json)

text
string, required
Text to synthesize into speech.
Maximum length: 5000
Example: "Hello world! How are you today?"

voice
string, required
Voice ID to use for synthesis.
Example: "N2lVS1w4EtoT3dr4eOWO"

tts_engine
enum<string>, default: elevenlabs
TTS engine to use.
Available options: mascotbot, elevenlabs, cartesia
Example: "elevenlabs"

tts_api_key
string
API key for external TTS engines (required for elevenlabs and cartesia).
Example: "sk_your_elevenlabs_api_key_here"

speed
number, default: 1
Speech speed multiplier (e.g., 1.0 = normal, 1.2 = 20% faster).
Required range: 0.5 <= x <= 2
Example: 1.1
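The body constraints above can be validated client-side before sending the request; a hedged Python sketch (the `build_request_body` helper is illustrative, not part of any official SDK, and the default engine follows the field default documented above):

```python
def build_request_body(text, voice, tts_engine="elevenlabs",
                       tts_api_key=None, speed=1.0):
    """Assemble and validate a /v1/visemes-audio request body."""
    if not text or len(text) > 5000:
        raise ValueError("text is required, maximum length 5000")
    if tts_engine not in ("mascotbot", "elevenlabs", "cartesia"):
        raise ValueError("unknown tts_engine")
    if tts_engine in ("elevenlabs", "cartesia") and not tts_api_key:
        raise ValueError("tts_api_key is required for elevenlabs/cartesia")
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be between 0.5 and 2")
    body = {"text": text, "voice": voice,
            "tts_engine": tts_engine, "speed": speed}
    if tts_api_key:
        body["tts_api_key"] = tts_api_key
    return body
```

Failing fast on these limits avoids a round trip for requests the API would reject anyway.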

Response

Successful response streams synchronized audio and viseme data using Server-Sent Events.

The stream interleaves three event types:

  1. Audio events contain base64-encoded PCM audio data for playback
  2. Visemes events contain facial animation data synchronized with the audio
  3. Error events indicate processing errors

Events are delivered in real-time with typical first chunk latency of 200-500ms.

Server-Sent Events stream with "data: " prefix followed by JSON.

Event types:

  • audio: Contains PCM audio data and timing info
  • visemes: Contains viseme IDs and timing offsets
  • error: Contains error information
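A consumer typically branches on the type field of each decoded event. A minimal sketch of such a dispatch loop (the visemes payload field name and the error message field are assumptions, not confirmed by this reference):

```python
import base64
import json

def handle_stream(lines):
    """Route decoded SSE events by type; return a summary of what arrived."""
    audio_ms = 0
    visemes = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip SSE comments and keep-alive lines
        event = json.loads(line[len("data: "):])
        if event["type"] == "audio":
            base64.b64decode(event["data"])  # feed PCM to the player here
            audio_ms += event.get("chunk_duration_ms", 0)
        elif event["type"] == "visemes":
            # "visemes" as the payload key is an assumption for illustration.
            visemes.extend(event.get("visemes", []))
        elif event["type"] == "error":
            raise RuntimeError(event.get("message", "stream error"))
    return {"audio_ms": audio_ms, "visemes": visemes}
```

In a real client the loop would read lines from the streaming HTTP response as they arrive, pushing audio to the player and visemes to the animation rig chunk by chunk.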