Welcome to the Mascotbot Viseme Prediction API documentation. This API provides three main capabilities for creating synchronized facial animations:

Available Endpoints

/v1/visemes - Process Audio for Visemes

Process existing audio files to generate viseme predictions for facial animation. Ideal when you already have audio and need synchronized mouth movements.

/v1/visemes-audio - Generate Speech and Visemes

Convert text to speech while simultaneously generating viseme predictions. Supports multiple TTS engines including ElevenLabs and Cartesia for high-quality voice synthesis.

/v1/get-signed-url - Generate a signed URL for conversational AI with visemes

Get a temporary signed URL (expires in 10 minutes) for connecting to a conversational AI proxy. Supports ElevenLabs Conversational AI, Gemini Live API, and OpenAI Realtime API — the proxy enriches the audio stream with synchronized viseme data for avatar lip sync.
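A minimal sketch of building the signed-URL request on the client side. The base URL, the `provider` query parameter, and the `Authorization` header scheme are assumptions for illustration; only the endpoint path and the 10-minute expiry come from this page, so check the endpoint reference for the exact schema.

```python
from urllib.parse import urlencode

# Hypothetical base URL and provider names -- verify against the API reference.
BASE_URL = "https://api.mascot.bot"
PROVIDERS = {"elevenlabs", "gemini", "openai"}

def build_signed_url_request(api_key: str, provider: str) -> tuple[str, dict]:
    """Build the GET request for /v1/get-signed-url as (url, headers)."""
    if provider not in PROVIDERS:
        raise ValueError(f"unsupported provider: {provider}")
    query = urlencode({"provider": provider})
    url = f"{BASE_URL}/v1/get-signed-url?{query}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers
```

Because the returned URL expires after 10 minutes, request it just before opening the proxy connection rather than caching it.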

Language-Specific Viseme Models

All endpoints support an optional model parameter to select a language-optimized viseme prediction model. This improves lip sync accuracy for non-English content.
| Model      | Language         | Parameter Value         |
| ---------- | ---------------- | ----------------------- |
| Default    | English          | `"default"` (or omit)   |
| Indonesian | Bahasa Indonesia | `"indonesian"`          |
Pass "model": "indonesian" in the request body (for REST endpoints) or viseme_model=indonesian as a query parameter (for WebSocket endpoints). If omitted, the default English model is used.
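The model selection described above can be sketched as a small request-body builder. The `text` field name is illustrative; only the `model` parameter and its values come from this page, so consult the endpoint reference for the full schema.

```python
import json
from urllib.parse import urlencode

def build_visemes_body(text: str, model: str = "default") -> str:
    """Serialize a REST request body, omitting `model` for the default
    English model as the docs allow."""
    body = {"text": text}
    if model != "default":
        body["model"] = model  # e.g. "indonesian"
    return json.dumps(body)

def build_ws_query(model: str) -> str:
    """Build the equivalent WebSocket query string."""
    return urlencode({"viseme_model": model})
```

For example, `build_visemes_body("Halo dunia", "indonesian")` includes `"model": "indonesian"`, while the default call sends no `model` field at all.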

Real-time Streaming

The /v1/visemes and /v1/visemes-audio endpoints use Server-Sent Events (SSE) to stream responses in real time, enabling low-latency playback and immediate visual feedback.
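A minimal sketch of consuming such an SSE stream on the client. The payload shape (a viseme label with a start timestamp) is an assumption for illustration, not the documented event schema.

```python
import json

def parse_sse_events(raw: str) -> list[dict]:
    """Decode the JSON payload of each `data:` line in an SSE stream.

    SSE events are separated by blank lines; each event carries one or
    more `data:` fields.
    """
    events = []
    for block in raw.strip().split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data:"):
                events.append(json.loads(line[len("data:"):].strip()))
    return events

# Hypothetical stream excerpt for illustration only.
sample = (
    'data: {"viseme": "PP", "start_ms": 0}\n\n'
    'data: {"viseme": "AA", "start_ms": 120}\n\n'
)
```

In a real client you would read the HTTP response incrementally and feed each viseme to the avatar as it arrives, rather than buffering the whole stream.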

Process Audio for Visemes

Convert existing audio files to viseme predictions for facial animation

Generate Speech and Visemes

Convert text to speech with synchronized viseme generation using multiple TTS engines

ElevenLabs Conversational AI

Use your existing ElevenLabs Conversational AI Agent with a viseme stream added on top.

Gemini Live API Avatar

Build interactive AI avatars with Gemini Live API and real-time lip sync.