Realtime AI Voice Avatars — Lip Sync for OpenAI, Gemini & ElevenLabs

Adding a lip-synced avatar to a realtime voice assistant comes down to one idea:

Give the SDK a MediaStream of the assistant’s voice. It turns that audio into a talking avatar in real time.

useLipsyncStream({ client, playback, source: { kind: "mediaStream", stream } });

You connect to the provider with its own official SDK, and this SDK lip-syncs the resulting audio in real time. The only question is how you obtain that MediaStream, which depends on whether the provider plays the audio for you.

Pick the path by provider

Provider / transport	Plays audio itself?	How you get the stream	SDK piece
OpenAI Realtime — WebRTC	Yes (into an `<audio>` you supply)	`createElementTap()`	none
Gemini Live	No (raw base64 PCM16 @ 24 kHz)	`createPCMStreamPlayer().outputStream`	PCM player
OpenAI Realtime — WebSocket	No (raw PCM16 `ArrayBuffer`)	`createPCMStreamPlayer().outputStream`	PCM player
ElevenLabs Conversational AI	Yes (internal worklet → `<audio>`)	`createElementTap()`	none

Never route a self-playing provider (ElevenLabs, OpenAI-WebRTC) through createPCMStreamPlayer — the voice would play twice. The player is only for providers that hand you raw PCM and do not play it.
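The table and the rule above can be encoded as a small lookup so the choice is made in one place. This is only a sketch: the `Provider` keys, the `StreamPath` shape, and `streamPathFor` are hypothetical names for illustration and are not part of the SDK.

```typescript
// Hypothetical identifiers for the four provider/transport rows in the table.
type Provider = "openai-webrtc" | "openai-websocket" | "gemini-live" | "elevenlabs";

interface StreamPath {
  // Does the provider play the assistant audio on its own?
  playsAudioItself: boolean;
  // Which helper yields the MediaStream handed to useLipsyncStream.
  helper: "createElementTap" | "createPCMStreamPlayer";
}

const STREAM_PATHS: Record<Provider, StreamPath> = {
  "openai-webrtc": { playsAudioItself: true, helper: "createElementTap" },
  "gemini-live": { playsAudioItself: false, helper: "createPCMStreamPlayer" },
  "openai-websocket": { playsAudioItself: false, helper: "createPCMStreamPlayer" },
  "elevenlabs": { playsAudioItself: true, helper: "createElementTap" },
};

function streamPathFor(provider: Provider): StreamPath {
  return STREAM_PATHS[provider];
}
```

Keeping the mapping in data makes the "never double-play" rule checkable: every entry with `playsAudioItself: true` must use `createElementTap`.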

Tap a playing element

When a provider plays the audio itself (OpenAI WebRTC, ElevenLabs), tap the element it plays through with createElementTap(), an SDK helper that works in Chrome, Safari and Firefox. The SDK does not use HTMLMediaElement.captureStream() because Safari/WebKit does not implement it:

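import { createElementTap } from "@mascotbot/react";
import { useLipsyncStream } from "@mascotbot/react/rive";

// Create inside the click that starts the call (so its AudioContext isn't
// born suspended). `stream` is usable immediately and stays silent until attach().
const tap = createElementTap();

// The hook runs in the component body, not in the click handler. Hand it
// tap.stream through state, as the ElevenLabs path does with setStream().
useLipsyncStream({
  client,
  playback,
  source: { kind: "mediaStream", stream: tap.stream },
});

tap.attach(audioEl); // now, or later once the provider's <audio> exists
// teardown: tap.close();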

tap.attach(el) handles both element kinds: a file/URL <audio> (e.g. OpenAI WebRTC) is kept audible and tapped; an element whose srcObject is a MediaStream (e.g. ElevenLabs’ worklet output) is tapped only, so the provider’s own playback isn’t doubled. attach() is idempotent and may run after creation. createElementTap is also exported from @mascotbot/core.

Provider tokens stay on the server

Never ship a standing provider key to the browser. A server route mints a short-lived credential per session; the client connects with that. The demo ships reference route handlers for all three:

Provider	Server mints	Notes
OpenAI	`POST https://api.openai.com/v1/realtime/client_secrets` → `clientSecret`	model `gpt-realtime`
Gemini	`@google/genai` `ai.authTokens.create(...)` → ephemeral `token.name`	model `models/gemini-3.1-flash-live-preview`, `apiVersion: "v1alpha"`
ElevenLabs	`GET https://api.elevenlabs.io/v1/convai/conversation/get-signed-url?agent_id=…` → `signedUrl`	`xi-api-key` header, server-side only
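As an illustration of the ElevenLabs row, here is a minimal server-side sketch. It assumes the endpoint answers with a JSON body containing a `signed_url` field; the function names (`buildSignedUrlRequest`, `readSignedUrl`, `mintSignedUrl`) and the injected `fetchImpl` parameter are hypothetical and not taken from the demo's route handlers.

```typescript
const SIGNED_URL_ENDPOINT =
  "https://api.elevenlabs.io/v1/convai/conversation/get-signed-url";

interface SignedUrlRequest {
  url: string;
  init: { method: "GET"; headers: Record<string, string> };
}

// Build the request. The API key only ever appears in this server-side header.
function buildSignedUrlRequest(agentId: string, apiKey: string): SignedUrlRequest {
  return {
    url: `${SIGNED_URL_ENDPOINT}?agent_id=${encodeURIComponent(agentId)}`,
    init: { method: "GET", headers: { "xi-api-key": apiKey } },
  };
}

// Assumption: the response body looks like { "signed_url": "wss://..." }.
function readSignedUrl(body: unknown): string {
  const value = (body as { signed_url?: unknown } | null)?.signed_url;
  if (typeof value !== "string" || value.length === 0) {
    throw new Error("ElevenLabs response did not contain a signed URL");
  }
  return value;
}

// Minimal structural type so the sketch can be exercised without a network.
type FetchLike = (
  url: string,
  init: SignedUrlRequest["init"],
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Call from a server route; return only the short-lived URL to the browser.
async function mintSignedUrl(
  agentId: string,
  apiKey: string,
  fetchImpl: FetchLike,
): Promise<{ signedUrl: string }> {
  const { url, init } = buildSignedUrlRequest(agentId, apiKey);
  const res = await fetchImpl(url, init);
  if (!res.ok) {
    throw new Error(`ElevenLabs signed-url request failed with status ${res.status}`);
  }
  return { signedUrl: readSignedUrl(await res.json()) };
}
```

In a real route you would pass the global `fetch` as `fetchImpl` and read the key from server-only configuration, so the browser receives `signedUrl` and nothing else.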

Path 1 — OpenAI Realtime (WebRTC), cleanest

The provider auto-plays into an <audio> element. Supply your own so you can tap it; no SDK audio piece needed.

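import { OpenAIRealtimeWebRTC, RealtimeAgent, RealtimeSession } from "@openai/agents-realtime";
import { createElementTap } from "@mascotbot/react";

// In the click that starts the call: create the tap before any await.
const tap = createElementTap(); // see "Tap a playing element"
const audioEl = new Audio();

// Give the WebRTC transport your own element so its playback can be tapped.
const transport = new OpenAIRealtimeWebRTC({ audioElement: audioEl });
const session = new RealtimeSession(new RealtimeAgent({ name: "Assistant" }), { transport });
await session.connect({ apiKey: clientSecret }); // clientSecret from your server route
tap.attach(audioEl);

// In the component body:
useLipsyncStream({ client, playback, source: { kind: "mediaStream", stream: tap.stream } });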

Path 2 — Gemini Live / OpenAI Realtime (WebSocket)

The provider streams raw PCM and does not play it. createPCMStreamPlayer plays it gap-tolerantly and exposes the same audio as a tappable MediaStream.

import { createPCMStreamPlayer } from "@mascotbot/core";

const player = createPCMStreamPlayer({ sampleRate: 24000 }); // both emit 24 kHz
useLipsyncStream({ client, playback, source: { kind: "mediaStream", stream: player.outputStream } });

// Gemini Live (@google/genai): modelTurn audio part
session.onmessage = (m) => {
  const b64 = m?.serverContent?.modelTurn?.parts?.[0]?.inlineData?.data;
  if (typeof b64 === "string") player.pushBase64PCM16(b64);
  if (m?.serverContent?.interrupted) player.stop();
};

// OpenAI Realtime (WebSocket): PCM16 ArrayBuffer
session.on("audio", (e) => player.pushPCM16(new Uint8Array(e.data)));
session.on("audio_interrupted", () => player.stop());

The transport parsing above is provider glue and stays in your app — it must not enter the SDK. Only the play-and-tap primitive is shared.
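To make "raw base64 PCM16" concrete, the sketch below decodes such a payload into normalized float samples, which is the form the Web Audio API works with. It assumes little-endian, mono, signed 16-bit samples; `decodeBase64PCM16` is a hypothetical helper for illustration and is not how `pushBase64PCM16` is necessarily implemented.

```typescript
// Decode base64-encoded PCM16 (little-endian, mono) into floats in [-1, 1).
function decodeBase64PCM16(b64: string): Float32Array {
  // atob yields a "binary string": one char code per byte.
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }

  // Two bytes per sample; a trailing odd byte is ignored.
  const sampleCount = Math.floor(bytes.length / 2);
  const view = new DataView(bytes.buffer);
  const samples = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // getInt16(offset, true) reads a little-endian signed 16-bit integer.
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}

// Playback duration of a decoded chunk at a given sample rate, in seconds.
function chunkDurationSeconds(samples: Float32Array, sampleRate: number): number {
  return samples.length / sampleRate;
}
```

At 24 kHz, a chunk of 2,400 samples (4,800 bytes before base64 encoding) lasts 0.1 s, which is useful when reasoning about buffering and interruption latency.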

Path 3 — ElevenLabs Conversational AI

ElevenLabs plays its assistant audio through a hidden <audio> whose srcObject is a MediaStream (an internal worklet → MediaStreamDestination). Create createElementTap() in the click, patch window.Audio before Conversation.startSession to capture the element, then tap.attach(el) once its srcObject is set — the srcObject branch taps without re-outputting, so ElevenLabs’ own playback is not doubled:

const tap = createElementTap();   // in the click, before startSession
setStream(tap.stream);            // → useLipsyncStream source: { kind: "mediaStream", stream }

const w = window as unknown as { Audio: typeof Audio; __el?: HTMLAudioElement };
const Orig = w.Audio;
w.Audio = function (...a: unknown[]) {
  const el = new Orig(...(a as []));
  w.__el = el;
  return el;
} as unknown as typeof Audio;

const { Conversation } = await import("@elevenlabs/client");
try {
  await Conversation.startSession({ signedUrl });
} catch (err) {
  w.Audio = Orig;                 // never leave window.Audio patched on failure
  tap.close();
  throw err;
}

const iv = setInterval(() => {
  const el = w.__el;
  if (el?.srcObject instanceof MediaStream) {
    clearInterval(iv);
    w.Audio = Orig;
    tap.attach(el);               // srcObject branch → tap only; stays audible
  }
}, 100);
// teardown: clearInterval(iv); w.Audio = Orig; tap.close();

Server TTS

For plain TTS, the server route returns audio only (base64 PCM16). The client plays it through createPCMStreamPlayer and the tap drives the mouth. The server only synthesizes speech; it never computes or streams visemes.

End-of-utterance silence

The SDK’s internal −50 dBFS silence gate suppresses the phantom mouth shapes that appear when the assistant stops talking. You do not implement your own gate — this is handled for every realtime path.
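For intuition about what a −50 dBFS gate means, the sketch below measures the RMS level of an audio frame relative to full scale (amplitude 1.0) and compares it with a threshold. It is a generic illustration, not the SDK's internal gate; `rmsDbfs` and `isSilentFrame` are hypothetical names.

```typescript
// RMS level of a frame in dBFS, where a constant amplitude of 1.0 is 0 dBFS.
function rmsDbfs(samples: Float32Array): number {
  if (samples.length === 0) return -Infinity;
  let sumOfSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumOfSquares += samples[i] * samples[i];
  }
  const rms = Math.sqrt(sumOfSquares / samples.length);
  // log10(0) is -Infinity, which is the right answer for digital silence.
  return rms === 0 ? -Infinity : 20 * Math.log10(rms);
}

// A frame counts as silent when its level falls below the threshold.
function isSilentFrame(samples: Float32Array, thresholdDbfs = -50): boolean {
  return rmsDbfs(samples) < thresholdDbfs;
}
```

On this scale an amplitude of 0.01 sits at −40 dBFS (treated as speech) and 0.001 at −60 dBFS (treated as silence), so the gate ignores low-level residue such as a codec tail after the assistant stops.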

Stress emphasis and gestures

stress and gesture add body to a talking avatar. There is no flag to “enable” them — you drive them, and each needs the matching input declared on the .riv (the ready-made mascots include stress). They work the same for every realtime provider; the natural trigger is speech onset, which useLipsyncStream’s onFrame gives you.

`stress` — built-in emphasis

stress is one of the three input families the SDK drives (mouth, is_speaking, stress). useMascotPlayback() returns a stress() method: you push emphasis cues { offset, stress } and the SDK eases the Rive stress input toward each target. offset is ms on the playback clock; cues are applied in order, and a cue whose offset has already passed is applied on the next frame — so offset: 0 means “apply now”. That makes the realtime pattern trivial: raise stress while the assistant speaks, drop it when it stops.
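The cue semantics described above (cues consumed in order, past offsets applied on the next frame, the value eased toward the target) can be modeled with a small scheduler. This is a sketch of the behavior as described, with a hypothetical `StressScheduler` class and a simple exponential ease; the SDK's actual easing curve and internals are not documented here.

```typescript
interface StressCue {
  offset: number; // ms on the playback clock
  stress: number; // target value for the stress input
}

class StressScheduler {
  private queue: StressCue[] = [];
  private target = 0;
  value = 0;

  // Append cues; they are consumed in the order they were pushed.
  push(cues: StressCue[]): void {
    this.queue.push(...cues);
  }

  // Call once per frame with the current playback clock.
  // Any cue whose offset has already passed becomes the new target now.
  tick(clockMs: number, easing = 0.2): number {
    for (;;) {
      const next = this.queue[0];
      if (next === undefined || next.offset > clockMs) break;
      this.target = next.stress;
      this.queue.shift();
    }
    // Move a fixed fraction of the remaining distance toward the target.
    this.value += (this.target - this.value) * easing;
    return this.value;
  }

  // Drop pending cues and return to neutral.
  reset(): void {
    this.queue = [];
    this.target = 0;
    this.value = 0;
  }
}
```

Because a cue with `offset: 0` is always in the past, pushing it at any point makes it the target on the very next `tick`, which is why `offset: 0` reads as "apply now".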

`gesture` — your own one-shot trigger

gesture is a consumer-owned input — the SDK never touches it. If your .riv declares one, fire it yourself with useMascotInputs(). has("gesture") confirms the input exists; custom.gesture.fire?.() triggers it (the optional-call form also tolerates a numeric gesture input).

Wiring both for ElevenLabs (or any provider)

playback must be created with stream: true for realtime. The component below raises stress while speech is detected and fires gesture once per utterance. It is identical for OpenAI and Gemini; only the stream source differs:

import { useRef } from "react";
import { useMascot } from "@mascotbot/react";
import { useMascotPlayback, useMascotInputs, useLipsyncStream } from "@mascotbot/react/rive";

function AvatarReactions({ stream }: { stream: MediaStream | null }) {
  const { client } = useMascot();
  const playback = useMascotPlayback({ stream: true, enableNaturalLipSync: true });
  const { has, custom } = useMascotInputs();
  const speaking = useRef(false);

  useLipsyncStream({
    client,
    playback,
    source: { kind: "mediaStream", stream }, // createElementTap() for ElevenLabs, or player.outputStream
    onFrame: (f) => {
      const isSpeech = !f.silenceDetected;
      if (isSpeech && !speaking.current) {
        speaking.current = true;
        playback.stress([{ offset: 0, stress: 1 }]);          // emphasize while speaking
        if (has("gesture")) custom.gesture.fire?.();           // one-shot reaction
      } else if (!isSpeech && speaking.current) {
        speaking.current = false;
        playback.stress([{ offset: 0, stress: 0 }]);          // ease back to neutral
      }
    },
  });
  return null;
}

For a single emphasis bump instead of a sustained one, push stress: 1 then release after a hold: setTimeout(() => playback.stress([{ offset: 0, stress: 0 }]), 350). For offline playback the same playback.stress([...]) works with real timeline offsets (e.g. { offset: 0, stress: 1 }, { offset: 400, stress: 0 }). reset() clears scheduled stress with the rest of playback. The timeline JSON does not carry stress — you always schedule it separately. See Rive co-existence for why gesture is yours and stress is SDK-driven.
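For offline playback, the timeline-offset form can be generated from a list of emphasis onsets. The helper below is a sketch under assumptions: `buildStressCues` is a hypothetical function, the onset times are ones you supply yourself, and the 350 ms default simply mirrors the hold used above.

```typescript
interface StressCue {
  offset: number; // ms on the playback timeline
  stress: number; // target value for the stress input
}

// Turn emphasis onsets (ms) into raise/release cue pairs, in timeline order.
// If the next onset arrives before the current release, the release is
// skipped so the emphasis is held across both bumps.
function buildStressCues(onsetsMs: number[], holdMs = 350): StressCue[] {
  const onsets = [...onsetsMs].sort((a, b) => a - b);
  const cues: StressCue[] = [];
  for (let i = 0; i < onsets.length; i++) {
    const onset = onsets[i];
    const release = onset + holdMs;
    cues.push({ offset: onset, stress: 1 });
    const next = onsets[i + 1];
    if (next === undefined || next > release) {
      cues.push({ offset: release, stress: 0 });
    }
  }
  return cues;
}
```

The result can be passed as-is to `playback.stress([...])`, for example `playback.stress(buildStressCues([0, 1000]))`.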

Provider guides

ElevenLabs avatar

Conversational AI avatar.

Gemini Live avatar

Gemini Live API avatar.

OpenAI Realtime avatar

ChatGPT Realtime avatar.

Next

PCM stream player

Play + tap raw PCM.

Streaming & mic

useLipsyncStream in depth.
