OpenAI Realtime API Avatar — Build an Interactive Lip-Synced AI Avatar
Add a lip-synced animated avatar to your OpenAI Realtime API application in minutes. Mascot Bot SDK works alongside the official OpenAI Agents Realtime SDK (@openai/agents-realtime) — your existing OpenAI code stays untouched. Connect through a signed WebSocket URL and you get automatic viseme injection, real-time lip sync, and production-ready React components.
Why Add an Avatar to Your OpenAI Realtime API App?
The OpenAI Realtime API delivers native speech-to-speech conversations with ultra-low latency — but voice alone can feel impersonal. By pairing it with an interactive AI avatar that speaks with perfectly synchronized lip movements, you transform a voice stream into an engaging visual experience. Think of it as building a ChatGPT avatar that actually talks back with natural mouth animation.

Mascot Bot SDK is designed to complement the OpenAI Agents Realtime SDK, not replace it. You continue to use @openai/agents-realtime for transport, audio streaming, and event handling exactly as you normally would. Mascot Bot simply intercepts the WebSocket stream, extracts audio timing data, and drives a Rive-powered avatar with frame-accurate lip sync — all without modifying a single line of your OpenAI SDK code.
The integration uses a proxy pattern through signed WebSocket URLs. Your server creates an OpenAI ephemeral token, passes it to the Mascot Bot proxy, and receives a signed WebSocket URL the client connects to:
```typescript
// Without Mascot Bot — direct to OpenAI
const transport = new OpenAIRealtimeWebSocket({
  model: "gpt-realtime",
  apiKey: "your-openai-api-key",
});

// With Mascot Bot — connect through signed URL for lip sync
const transport = new OpenAIRealtimeWebSocket({
  model: config.model,
  apiKey: "proxy",
  useInsecureApiKey: true,
  createWebSocket: async () => new WebSocket(config.signedUrl) as any,
});

// Everything else stays exactly the same!
await transport.connect({ apiKey: "proxy" });
```
The Mascot Bot proxy transparently forwards all OpenAI Realtime traffic while injecting viseme data into the response stream. Your OpenAI SDK calls — transport.sendAudio(), event listeners, session management — all work identically.
Frame-accurate viseme synchronization with OpenAI audio responses. Audio and visemes arrive in a single combined WebSocket message, ensuring zero drift between voice and mouth animation.
Smooth, natural voice-driven facial animation powered by WebGL2 and the Rive runtime. Sub-50ms audio-to-visual latency for responsive conversational AI avatars.
Works alongside @openai/agents-realtime with zero conflicts. Your existing OpenAI Realtime API code stays untouched — Mascot Bot integrates through a signed WebSocket URL that wraps the OpenAI connection.
Full support for OpenAI’s client secrets system. Lock system instructions, voice config, VAD settings, and model parameters server-side using the /v1/realtime/client_secrets endpoint. The client never sees your API key or prompt.
Automatic extraction of streaming avatar data from the OpenAI WebSocket connection. Handles all Realtime API event types, including response.audio.delta, response.audio.done, and speech interruptions.
Full support for OpenAI’s built-in server-side VAD. Automatic turn detection, silence handling, and interruption management. The avatar responds naturally when you pause and stops when you interrupt — zero client-side configuration needed.
Built-in support for token refresh and reconnection. Pre-fetch signed URLs for instant connection, auto-refresh before expiry, and clean up on disconnect.
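A minimal client-side sketch of that pre-fetch flow, assuming the /api/get-signed-url-openai route from Step 1 below (the useSignedUrl hook name and caching shape are illustrative, not SDK exports):

```typescript
import { useCallback, useEffect, useRef } from 'react';

// Pre-fetch a signed URL so the first connection is instant; call
// refreshConfig again after each call, since the wrapped token is single-use.
export function useSignedUrl() {
  const configRef = useRef<{ signedUrl: string; model: string } | null>(null);

  const refreshConfig = useCallback(async () => {
    const res = await fetch('/api/get-signed-url-openai', { cache: 'no-store' });
    if (res.ok) configRef.current = await res.json();
  }, []);

  useEffect(() => {
    refreshConfig(); // pre-fetch on mount
  }, [refreshConfig]);

  return { configRef, refreshConfig };
}
```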
You’ll receive the SDK .tgz file after subscribing to one of our plans. The SDK works alongside the official OpenAI Agents Realtime SDK (@openai/agents-realtime) without any modifications. Both packages are required for the integration.
Want to see a complete working example? Check out our open-source demo repository with full implementation, or deploy it directly to Vercel with one click.
Step 1: Set Up Ephemeral Token Generation (Server-Side)
OpenAI Realtime API supports client secrets (ephemeral tokens) that lock configuration server-side. This is the recommended approach for production — the client never sees your API key, system instructions, or voice settings.

Mascot Bot fully supports this pattern. You create an OpenAI ephemeral token, pass it to the Mascot Bot proxy, and receive a signed WebSocket URL the client can safely connect to.
The Mascot Bot proxy receives your OpenAI ephemeral token and creates a proxied WebSocket connection that injects real-time viseme data into the OpenAI Realtime stream. This is required for avatar lip sync to work.
```typescript
// app/api/get-signed-url-openai/route.ts
import { NextResponse } from 'next/server';

// Configuration locked server-side — client never sees this
const OPENAI_CONFIG = {
  model: 'gpt-realtime',
  voice: 'marin',
  systemMessage:
    'You are a friendly assistant. Keep responses brief and conversational. Start by greeting the user when conversation starts.',
};

export async function GET() {
  try {
    const openaiApiKey = process.env.OPENAI_API_KEY;
    if (!openaiApiKey) {
      return NextResponse.json(
        { error: 'OpenAI API key not configured' },
        { status: 500 }
      );
    }

    const mascotBotApiKey = process.env.MASCOT_BOT_API_KEY;
    if (!mascotBotApiKey) {
      return NextResponse.json(
        { error: 'Mascot Bot API key not configured' },
        { status: 500 }
      );
    }

    // 1. Create OpenAI ephemeral token with locked config
    //    This keeps system instructions server-side — client only gets the token
    const tokenResponse = await fetch(
      'https://api.openai.com/v1/realtime/client_secrets',
      {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${openaiApiKey}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          session: {
            type: 'realtime',
            model: OPENAI_CONFIG.model,
            instructions: OPENAI_CONFIG.systemMessage,
            output_modalities: ['audio'],
            audio: {
              input: {
                turn_detection: {
                  type: 'server_vad',
                  threshold: 0.5,
                  prefix_padding_ms: 300,
                  silence_duration_ms: 500,
                },
              },
              output: {
                voice: OPENAI_CONFIG.voice,
              },
            },
          },
        }),
      }
    );

    if (!tokenResponse.ok) {
      const errorText = await tokenResponse.text();
      console.error('Failed to create ephemeral token:', errorText);
      throw new Error(`Failed to create ephemeral token: ${errorText}`);
    }

    const clientSecret = await tokenResponse.json();

    // 2. Get Mascot Bot proxy signed URL (wraps the OpenAI ephemeral token)
    const response = await fetch('https://api.mascot.bot/v1/get-signed-url', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${mascotBotApiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        config: {
          provider: 'openai',
          provider_config: {
            ephemeral_token: clientSecret.value,
            model: OPENAI_CONFIG.model,
            voice: OPENAI_CONFIG.voice,
          },
        },
      }),
      cache: 'no-store',
    });

    if (!response.ok) {
      throw new Error('Failed to get signed URL');
    }

    const data = await response.json();

    // 3. Return connection info — config is NOT exposed to the client
    return NextResponse.json({
      signedUrl: data.signed_url,
      model: OPENAI_CONFIG.model,
    });
  } catch (error) {
    console.error('Error:', error);
    return NextResponse.json(
      { error: 'Failed to generate signed URL' },
      { status: 500 }
    );
  }
}

export const dynamic = 'force-dynamic';
```
OpenAI ephemeral tokens are single-use. After a call ends, the cached token is consumed. Always invalidate the cache on disconnect and fetch a fresh token for the next call.
```typescript
// In your connection_change handler:
transport.on('connection_change', (status) => {
  if (status === 'disconnected') {
    configRef.current = null; // Invalidate consumed token
    refreshConfig();          // Pre-fetch fresh token for next call
  }
});
```
Create more realistic mouth movements by adjusting natural lip sync parameters:
Start with the “conversation” preset for most use cases. Adjust parameters based on your specific needs — higher minVisemeInterval for smoother movements, lower for more articulation.
```typescript
// Different presets for various use cases
const lipSyncPresets = {
  // Natural conversation — best for most OpenAI Realtime API voice AI
  conversation: {
    minVisemeInterval: 40,
    mergeWindow: 60,
    keyVisemePreference: 0.6,
    preserveSilence: true,
    similarityThreshold: 0.4,
    preserveCriticalVisemes: true,
    criticalVisemeMinDuration: 80,
    desktopTransitionDuration: 18,
    mobileTransitionDuration: 22,
  },
  // Fast speech — for excited or rapid responses
  fastSpeech: {
    minVisemeInterval: 80,
    mergeWindow: 100,
    keyVisemePreference: 0.5,
    preserveSilence: true,
    similarityThreshold: 0.3,
    preserveCriticalVisemes: true,
  },
  // Clear articulation — for educational AI tutor avatars
  educational: {
    minVisemeInterval: 40,
    mergeWindow: 50,
    keyVisemePreference: 0.9,
    preserveSilence: true,
    similarityThreshold: 0.8,
    preserveCriticalVisemes: true,
  },
};
```
The lip sync config object must have a stable reference. Define it as a const outside the component or wrap it in useMemo. Creating a new object on every render destroys and recreates the MascotPlayback instance, killing active lip sync.
```typescript
// ❌ Don't do this — creates new object on every render
useMascotOpenAI({
  session,
  naturalLipSyncConfig: {
    minVisemeInterval: 40,
    mergeWindow: 60,
  },
});

// ✅ Do this — stable reference (const outside component)
const NATURAL_LIP_SYNC_CONFIG = {
  minVisemeInterval: 40,
  mergeWindow: 60,
  keyVisemePreference: 0.6,
  preserveSilence: true,
  similarityThreshold: 0.4,
  preserveCriticalVisemes: true,
} as const;

useMascotOpenAI({
  session,
  naturalLipSyncConfig: NATURAL_LIP_SYNC_CONFIG,
});
```
Server-side VAD means the avatar automatically handles turn-taking — it listens while you speak and responds when you pause. Mascot Bot handles interruption events automatically, resetting the lip sync when the user interrupts the avatar mid-sentence.
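For reference, this is the server-side VAD block locked into the ephemeral token in Step 1 (values copied from the route above; the comments summarize what each OpenAI parameter controls):

```typescript
// Server-side VAD settings from the Step 1 ephemeral token config
const turnDetection = {
  type: 'server_vad',
  threshold: 0.5,            // how confident the VAD must be before opening a turn
  prefix_padding_ms: 300,    // audio retained from just before speech onset
  silence_duration_ms: 500,  // silence required before the user's turn ends
};
```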
OpenAI Realtime API offers multiple voices. Configure the voice in the ephemeral token:
```typescript
const OPENAI_CONFIG = {
  model: 'gpt-realtime', // or 'gpt-realtime-mini' for lower cost
  voice: 'marin',        // OpenAI voice name
  systemMessage: 'You are a friendly assistant.',
};
```
Available voices include alloy, ash, ballad, coral, echo, fable, marin, sage, shimmer, and verse. Check OpenAI’s documentation for the latest voice options.
Use gpt-realtime-mini for lower per-minute costs with slightly reduced quality. The lip sync integration works identically with both models.
The core hook for OpenAI Realtime API avatar integration:
This hook automatically starts WebSocket interception when the session connects and handles all message processing internally. No manual setup required.
```typescript
interface UseMascotOpenAIOptions {
  /** Session object with connection status */
  session: OpenAIRealtimeSession;
  /** Log WebSocket data flow (default: false) */
  debug?: boolean;
  /** Enable natural lip sync processing (default: false) */
  naturalLipSync?: boolean;
  /** Natural lip sync tuning parameters */
  naturalLipSyncConfig?: {
    /** Min time between visemes in ms (default: 40) */
    minVisemeInterval?: number;
    /** Window for merging similar visemes in ms (default: 60) */
    mergeWindow?: number;
    /** Preference for distinctive mouth shapes, 0–1 (default: 0.6) */
    keyVisemePreference?: number;
    /** Preserve silence visemes (default: true) */
    preserveSilence?: boolean;
    /** Threshold for merging similar visemes, 0–1 (default: 0.4) */
    similarityThreshold?: number;
    /** Never skip critical viseme shapes (default: true) */
    preserveCriticalVisemes?: boolean;
    /** Min duration for critical visemes in ms (default: 80) */
    criticalVisemeMinDuration?: number;
    /** Desktop transition smoothing in ms (default: 18) */
    desktopTransitionDuration?: number;
    /** Mobile transition smoothing in ms (default: 22) */
    mobileTransitionDuration?: number;
  };
}

interface UseMascotOpenAIResult {
  /** Whether WebSocket interception is active */
  isIntercepting: boolean;
  /** Number of audio+viseme messages received */
  messageCount: number;
}
```
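A minimal usage sketch of the hook, assuming the types above; the import path and the <MascotAvatar /> element are illustrative stand-ins for your SDK build's actual exports:

```tsx
// Illustrative import path — use the package name from your SDK .tgz
import { MascotClient, MascotAvatar, useMascotOpenAI } from 'mascotbot-sdk-react';
import type { OpenAIRealtimeSession } from 'mascotbot-sdk-react';

// Stable reference, defined outside the component (see the warning above)
const NATURAL_LIP_SYNC_CONFIG = {
  minVisemeInterval: 40,
  mergeWindow: 60,
} as const;

function Avatar({ session }: { session: OpenAIRealtimeSession }) {
  // Interception starts automatically once the session connects
  const { isIntercepting, messageCount } = useMascotOpenAI({
    session,
    naturalLipSync: true,
    naturalLipSyncConfig: NATURAL_LIP_SYNC_CONFIG,
  });

  return (
    <div>
      <MascotAvatar />
      <small>
        {isIntercepting ? `lip sync active (${messageCount} messages)` : 'waiting for connection'}
      </small>
    </div>
  );
}

// useMascotOpenAI must run inside a MascotClient wrapper
export function App({ session }: { session: OpenAIRealtimeSession }) {
  return (
    <MascotClient>
      <Avatar session={session} />
    </MascotClient>
  );
}
```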
OpenAI Realtime API pricing is based on token usage for both audio input and output:
| Feature | gpt-realtime | gpt-realtime-mini |
| --- | --- | --- |
| Audio input | $0.06 / 1M tokens | $0.01 / 1M tokens |
| Audio output | $0.24 / 1M tokens | $0.04 / 1M tokens |
| Session limit | ~15 min per connection | ~15 min per connection |
| Latency | Lower | Similar |
| Quality | Higher | Good for most use cases |
Mascot Bot SDK adds lip sync avatar capabilities on top. Plans start at $149/month with 75 hours of lip sync included, then ~$2.48 per additional speaking hour (~$0.04/min). Check app.mascot.bot for current plans.
For budgeting, here’s the combined cost of running an OpenAI Realtime API avatar:
| Component | Cost per hour | Notes |
| --- | --- | --- |
| OpenAI Realtime API (gpt-realtime) | Varies by token usage | Native speech-to-speech |
| Mascot Bot lip sync | ~$2.48/hr (after 75 included hrs) | Client-side Rive animation |
| All-in estimate | ~$3.83–$4.21/hr beyond included | OpenAI API + Mascot Bot overage |
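As a rough worked example, here is how those numbers combine. The plan figures come from the tables above; the OpenAI per-hour rate is an assumption, since actual OpenAI cost varies with token usage:

```typescript
// Back-of-the-envelope monthly estimate. openaiPerHour is an assumed
// average — OpenAI bills by audio tokens, so measure your real usage.
function estimateMonthlyCost(speakingHours: number, openaiPerHour: number): number {
  const planBase = 149;      // $149/mo Mascot Bot plan, 75 lip sync hours included
  const includedHours = 75;
  const overageRate = 2.48;  // ~$2.48 per additional speaking hour (~$0.04/min)

  const overage = Math.max(0, speakingHours - includedHours) * overageRate;
  return planBase + overage + speakingHours * openaiPerHour;
}

// e.g. 100 speaking hours at an assumed $1.50/hr of OpenAI usage:
// 149 + 25 × 2.48 + 100 × 1.50 = $361/month
console.log(estimateMonthlyCost(100, 1.5)); // 361
```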
Unlike HeyGen, D-ID, or Synthesia, which charge for server-side video rendering ($0.10–$0.50/min), Mascot Bot uses client-side Rive animation — the lip sync cost is dramatically lower at ~$0.04/min, and the first 75 hours per month are included in your plan.
Mascot Bot vs HeyGen vs D-ID vs Synthesia for Interactive Avatars
Looking for an interactive avatar API? Here’s how Mascot Bot compares to popular avatar platforms:
| Feature | Mascot Bot SDK | HeyGen | D-ID | Synthesia |
| --- | --- | --- | --- | --- |
| Type | Open SDK (React) | SaaS platform | SaaS platform | SaaS platform |
| Rendering | Client-side Rive animation | Server-side video | Server-side video | Server-side video |
| Custom characters | Any Rive animation | Limited templates | Stock faces | Stock presenters |
| Real-time lip sync | Yes (< 50ms) | Yes | Limited | Pre-rendered only |
| OpenAI Realtime API | Native integration | No | No | No |
| Gemini Live API | Native integration | No | No | No |
| ElevenLabs | Native integration | Partial | Yes | Yes |
| WebSocket streaming | Yes | Limited | No | No |
| Code ownership | You own the code | SaaS dependency | SaaS dependency | SaaS dependency |
| Per-minute conversation cost | ~$0.04/min lip sync (75 hrs included/mo) | ~$0.10–$0.50/min | ~$0.05–$0.08/sec | Enterprise pricing |
| Open source examples | Yes (GitHub) | No | No | No |
Mascot Bot is designed for developers who want full control over the avatar experience. Unlike SaaS platforms where you configure avatars in a dashboard, you integrate the SDK into your own React application and customize every aspect — from the character model to the voice pipeline.
Ensure useMascotOpenAI is called inside a component wrapped by MascotClient. Check the browser console for WebSocket errors. Verify your Rive file has the correct input names (is_speaking, gesture).
Lip sync that cuts out mid-session typically happens when naturalLipSyncConfig is created inline, causing React to reinitialize the hook on every render:
```typescript
// ❌ Don't do this — creates new object on every render
useMascotOpenAI({
  session,
  naturalLipSyncConfig: {
    minVisemeInterval: 40,
    mergeWindow: 60,
  },
});

// ✅ Do this — define outside the component
const NATURAL_LIP_SYNC_CONFIG = {
  minVisemeInterval: 40,
  mergeWindow: 60,
  keyVisemePreference: 0.6,
  preserveSilence: true,
  similarityThreshold: 0.4,
  preserveCriticalVisemes: true,
} as const;
```
OpenAI ephemeral tokens are single-use. After a call ends, the cached config is consumed. Make sure you invalidate and fetch a fresh signed URL on disconnect:
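```typescript
// Same connection_change handler shown in the token-refresh section above
transport.on('connection_change', (status) => {
  if (status === 'disconnected') {
    configRef.current = null; // Invalidate consumed token
    refreshConfig();          // Fetch a fresh signed URL for the next call
  }
});
```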
This happens when the WavRecorder sends empty audio buffers (common in automated testing environments without a microphone). In production, this is typically not an issue. If needed, add a guard:
```typescript
await recorder.record((data) => {
  if (data.mono.length > 0 && transportRef.current?.status === 'connected') {
    transportRef.current.sendAudio(data.mono as any);
  }
});
```
How Does Mascot Bot Work with OpenAI Agents Realtime SDK?
Mascot Bot integrates through a signed WebSocket URL. Your code continues using @openai/agents-realtime for everything — connecting, streaming audio, handling events. The only change is pointing the transport at a Mascot Bot signed URL instead of connecting directly to OpenAI. The proxy transparently forwards all traffic while injecting viseme data for lip sync.
Each connection has a ~15-minute limit. After that, the connection closes. When the connection_change event fires with status disconnected, your app should clean up resources and allow the user to reconnect with a fresh token.
The Realtime API supports gpt-realtime (higher quality, higher cost) and gpt-realtime-mini (lower cost, slightly reduced quality). Both work identically with Mascot Bot’s lip sync integration. The model is specified in the ephemeral token configuration on your server.
The Mascot Bot proxy analyzes OpenAI’s response.audio.delta events in real time and injects viseme (mouth shape) data into the WebSocket stream. Each audio chunk contains both the original audio and timing-synchronized visemes. The useMascotOpenAI hook extracts this data and drives the Rive avatar’s mouth animation at 120fps.
Can I Connect Directly to OpenAI Without the Proxy?
Yes — for audio-only features, you can connect directly to OpenAI’s Realtime API using @openai/agents-realtime as normal. However, avatar lip sync will not work without the Mascot Bot proxy, since OpenAI does not provide viseme data natively. The avatar’s mouth will not move.
What is Voice Activity Detection (VAD) in OpenAI Realtime API?
OpenAI Realtime API includes built-in server-side voice activity detection that automatically detects when the user starts and stops speaking. This enables natural turn-taking — the avatar listens while you speak and responds when you pause. Mascot Bot handles interruption events automatically, resetting the lip sync when the user interrupts the avatar.
Is This an Open-Source Alternative to HeyGen Interactive Avatar?
Mascot Bot SDK is a developer-focused, interactive avatar SDK that you integrate into your own React app — unlike HeyGen’s SaaS platform where you configure avatars in their dashboard. With Mascot Bot, you own the code, choose your own voice AI backend (OpenAI Realtime, Gemini Live, or ElevenLabs), and customize the character with any Rive animation. No server-side video rendering costs.
Does Mascot Bot Support Other Voice AI Backends Besides OpenAI?
Yes. Mascot Bot SDK natively supports OpenAI Realtime API, Gemini Live API, and ElevenLabs Conversational AI as voice backends. This flexibility is unique: you can build the same avatar experience across different voice AI providers and switch between them without changing your frontend code.
Unlike pre-rendered video avatars from HeyGen, D-ID, or Synthesia, Mascot Bot provides real-time, interactive avatars that respond dynamically to OpenAI’s voice output. No server-side video rendering — just lightweight client-side Rive animation at a fraction of the cost, driven by the same GPT model powering your conversations.
Deploy Your OpenAI Realtime API Interactive AI Avatar
Build engaging voice AI experiences with the most developer-friendly lip sync SDK for OpenAI Realtime API. Your users will love talking to an animated character that actually talks back.