POST /v1/visemes-audio
curl --request POST \
  --url https://api.mascot.bot/v1/visemes-audio \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "text": "Hello! This uses ElevenLabs for high-quality speech.",
    "tts_engine": "elevenlabs",
    "tts_api_key": "sk_your_elevenlabs_api_key_here",
    "voice": "N2lVS1w4EtoT3dr4eOWO",
    "speed": 1.1
  }'
"data: {\"type\":\"audio\",\"data\":\"UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=\",\"sample_rate\":16000,\"sequence\":0,\"chunk_duration_ms\":250}\n"


Convert text to speech using various TTS engines and generate synchronized viseme data for facial animation. This endpoint returns real-time streaming data via Server-Sent Events (SSE) for low-latency playback.

Supported TTS Engines

  • mascotbot-tts: Built-in TTS engine
  • elevenlabs (default): High-quality TTS with custom voices
  • cartesia: Alternative TTS provider

The response streams both audio chunks (base64-encoded PCM) and viseme timing data.
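Each SSE line carries a JSON payload after the `data: ` prefix. A minimal sketch of parsing one audio event and decoding its PCM payload, using the example event shown in the sample response above (field names are taken from this page):

```python
import base64
import json

def parse_sse_event(line: str) -> dict:
    """Parse one SSE line of the form 'data: {...}' into a dict."""
    if not line.startswith("data: "):
        raise ValueError("not an SSE data line")
    return json.loads(line[len("data: "):])

# Example audio event, as shown in this page's sample response.
sample = ('data: {"type":"audio",'
          '"data":"UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=",'
          '"sample_rate":16000,"sequence":0,"chunk_duration_ms":250}')

event = parse_sse_event(sample)
pcm_bytes = base64.b64decode(event["data"])  # raw PCM audio for playback
print(event["type"], event["sample_rate"], len(pcm_bytes))
```

The decoded bytes can be fed directly to a PCM-capable audio sink at the reported `sample_rate`.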

Language-Specific Models

Use the optional model parameter to select a viseme model optimized for your content’s language:
  • "default" — English (used when model is omitted)
  • "indonesian" — Bahasa Indonesia

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json
text
string
required

Text to synthesize into speech

Maximum string length: 5000
Example:

"Hello world! How are you today?"

voice
string
required

Voice ID to use for synthesis

Example:

"N2lVS1w4EtoT3dr4eOWO"

tts_engine
enum<string>
default:elevenlabs

TTS engine to use

Available options: mascotbot-tts, elevenlabs, cartesia
Example:

"elevenlabs"

tts_api_key
string

API key for external TTS engines (required for elevenlabs/cartesia)

Example:

"sk_your_elevenlabs_api_key_here"

speed
number
default:1

Speech speed multiplier (e.g., 1.0 = normal, 1.2 = 20% faster)

Required range: 0.5 <= x <= 2
Example:

1.1

model
enum<string>
default:default

Viseme model to use for prediction. Different models are optimized for different languages. Available models: default (English), indonesian (Bahasa Indonesia).

Available options: default, indonesian
Example:

"default"
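The body parameters above can be assembled and validated client-side before the request is sent. A sketch of such a helper, enforcing the documented constraints (5000-character text limit, speed range 0.5-2, API key required for external engines); the function name and validation strategy are illustrative, not part of the API:

```python
def build_request_body(text, voice, tts_engine="elevenlabs",
                       tts_api_key=None, speed=1.0, model="default"):
    """Assemble the JSON body for POST /v1/visemes-audio,
    enforcing the constraints documented above."""
    if not 0 < len(text) <= 5000:
        raise ValueError("text must be 1-5000 characters")
    if tts_engine not in ("mascotbot-tts", "elevenlabs", "cartesia"):
        raise ValueError("unknown tts_engine")
    if tts_engine in ("elevenlabs", "cartesia") and not tts_api_key:
        raise ValueError("tts_api_key is required for external TTS engines")
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be between 0.5 and 2")
    if model not in ("default", "indonesian"):
        raise ValueError("unknown model")
    body = {"text": text, "voice": voice, "tts_engine": tts_engine,
            "speed": speed, "model": model}
    if tts_api_key is not None:
        body["tts_api_key"] = tts_api_key
    return body
```

For example, `build_request_body("Hello", "N2lVS1w4EtoT3dr4eOWO", tts_api_key="sk_...")` yields a body matching the curl example at the top of this page.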

Response

Successful response streams synchronized audio and viseme data using Server-Sent Events.

The stream contains alternating audio and visemes events:

  1. Audio events contain base64-encoded PCM audio data for playback
  2. Visemes events contain facial animation data synchronized with audio
  3. Error events indicate processing errors

Events are delivered in real-time with typical first chunk latency of 200-500ms.

Server-Sent Events stream with "data: " prefix followed by JSON.

Event types:

  • audio: Contains PCM audio data and timing info
  • visemes: Contains viseme IDs and timing offsets
  • error: Contains error information
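A consumer can route the three event types above in a single loop. A minimal sketch, assuming only the `type` field and the audio event's `data` field documented here (the exact fields of visemes events are not specified on this page, so they are passed through untouched):

```python
import base64
import json

def handle_stream(lines):
    """Route SSE events by type: collect decoded audio PCM, collect
    visemes events, and raise on an error event. `lines` is any iterable
    of decoded SSE lines (e.g. from a streaming HTTP response)."""
    audio_chunks, viseme_events = [], []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank lines and SSE keep-alive comments
        event = json.loads(line[len("data: "):])
        if event["type"] == "audio":
            audio_chunks.append(base64.b64decode(event["data"]))
        elif event["type"] == "visemes":
            viseme_events.append(event)
        elif event["type"] == "error":
            raise RuntimeError(f"stream error: {event}")
    return b"".join(audio_chunks), viseme_events
```

Because events arrive in real time, a production consumer would play each audio chunk as it is decoded rather than accumulating the whole stream, but the dispatch logic is the same.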