Adding a lip-synced avatar to a realtime voice assistant comes down to one idea: give the SDK a MediaStream of the assistant’s voice, and it turns that audio into a talking avatar in real time.
useLipsyncStream({ client, playback, source: { kind: "mediaStream", stream } });
You connect to the provider with its own official SDK; this SDK only lip-syncs the resulting audio. The only question is how you obtain that MediaStream, which depends on whether the provider plays the audio for you.

Pick the path by provider

| Provider / transport | Plays audio itself? | How you get the stream | SDK piece |
| --- | --- | --- | --- |
| OpenAI Realtime (WebRTC) | Yes (into an <audio> you supply) | createElementTap() | none |
| Gemini Live | No (raw base64 PCM16 @ 24 kHz) | createPCMStreamPlayer().outputStream | PCM player |
| OpenAI Realtime (WebSocket) | No (raw PCM16 ArrayBuffer) | createPCMStreamPlayer().outputStream | PCM player |
| ElevenLabs Conversational AI | Yes (internal worklet → <audio>) | createElementTap() | none |
Never route a self-playing provider (ElevenLabs, OpenAI-WebRTC) through createPCMStreamPlayer — the voice would play twice. The player is only for providers that hand you raw PCM and do not play it.
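The rule in the table can be written down as a small lookup. This is a sketch only: the provider labels and the pickStreamSource name are hypothetical and not part of the SDK.

```typescript
// Hypothetical provider labels for illustration; these are not SDK identifiers.
type Provider =
  | "openai-webrtc"
  | "openai-websocket"
  | "gemini-live"
  | "elevenlabs";

type StreamSource = "elementTap" | "pcmStreamPlayer";

// true  = the provider plays the audio itself, so tap its element
// false = the provider hands you raw PCM, so play and tap it with the PCM player
const SELF_PLAYING: Record<Provider, boolean> = {
  "openai-webrtc": true,
  "elevenlabs": true,
  "gemini-live": false,
  "openai-websocket": false,
};

function pickStreamSource(provider: Provider): StreamSource {
  return SELF_PLAYING[provider] ? "elementTap" : "pcmStreamPlayer";
}
```

Keeping the decision in one place makes it harder to route a self-playing provider through the PCM player by accident.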

Tap a playing element

When a provider plays the audio itself (OpenAI WebRTC, ElevenLabs), tap the element it plays through with createElementTap(), an SDK helper that works in Chrome, Safari and Firefox. The SDK does not rely on HTMLMediaElement.captureStream(), which is not implemented in Safari/WebKit:
import { createElementTap } from "@mascotbot/react";

// Create inside the click that starts the call (so its AudioContext isn't
// born suspended). `stream` is usable immediately — silent until attach().
const tap = createElementTap();
useLipsyncStream({
  client,
  playback,
  source: { kind: "mediaStream", stream: tap.stream },
});

tap.attach(audioEl); // now, or later once the provider's <audio> exists
// teardown: tap.close();
tap.attach(el) handles both element kinds. A file/URL <audio> (e.g. OpenAI WebRTC) is kept audible and tapped. An element whose srcObject is a MediaStream (e.g. ElevenLabs’ worklet output) is tapped only, so the provider’s own playback isn’t doubled. attach() is idempotent and may run after creation. createElementTap is also exported from @mascotbot/core.

Provider tokens stay on the server

Never ship a standing provider key to the browser. A server route mints a short-lived credential per session; the client connects with that. The demo ships reference route handlers for all three:
| Provider | Server mints | Notes |
| --- | --- | --- |
| OpenAI | POST https://api.openai.com/v1/realtime/client_secrets → clientSecret | model gpt-realtime |
| Gemini | @google/genai ai.authTokens.create(...) → ephemeral token.name | model models/gemini-3.1-flash-live-preview, apiVersion: "v1alpha" |
| ElevenLabs | GET https://api.elevenlabs.io/v1/convai/conversation/get-signed-url?agent_id=… → signedUrl | xi-api-key header, server-side only |
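As an illustration of the server-side mint, here is a minimal sketch of the ElevenLabs row. The helper names and the injected fetchImpl parameter are hypothetical, and the `signed_url` response field is an assumption about the response body; check the provider's API reference before relying on it.

```typescript
// Minimal fetch-like signature so the helper can be exercised without a network.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds the GET request from the table: agent_id in the query, key in a header.
function buildSignedUrlRequest(agentId: string, apiKey: string) {
  const url =
    "https://api.elevenlabs.io/v1/convai/conversation/get-signed-url" +
    `?agent_id=${encodeURIComponent(agentId)}`;
  return { url, init: { method: "GET", headers: { "xi-api-key": apiKey } } };
}

// Server-side only: the standing key is used here and never sent to the browser.
async function mintSignedUrl(
  agentId: string,
  apiKey: string,
  fetchImpl: FetchLike,
): Promise<string> {
  const { url, init } = buildSignedUrlRequest(agentId, apiKey);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`signed-url request failed: ${res.status}`);
  // Assumption: the response JSON carries the URL in a `signed_url` field.
  const body = (await res.json()) as { signed_url?: unknown };
  if (typeof body.signed_url !== "string") {
    throw new Error("response did not contain a signed_url string");
  }
  return body.signed_url;
}
```

In a route handler you would pass the global fetch as fetchImpl and return only the signed URL to the client.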

Path 1 — OpenAI Realtime (WebRTC), cleanest

The provider auto-plays into an <audio> element. Supply your own so you can tap it; no SDK audio piece needed.
import { RealtimeAgent, RealtimeSession } from "@openai/agents-realtime";
import { createElementTap } from "@mascotbot/react";

// Create the tap first, in the click and before any await (see "Tap a playing element").
const tap = createElementTap();
useLipsyncStream({ client, playback, source: { kind: "mediaStream", stream: tap.stream } });

const audioEl = new Audio();
const session = new RealtimeSession(new RealtimeAgent({ name: "Assistant" }), { transport: "webrtc" });
await session.connect({ apiKey: clientSecret, audioElement: audioEl }); // clientSecret from your server route

tap.attach(audioEl); // attach() may run after creation
// teardown: tap.close();

Path 2 — Gemini Live / OpenAI Realtime (WebSocket)

The provider streams raw PCM and does not play it. createPCMStreamPlayer plays it gap-tolerantly and exposes the same audio as a tappable MediaStream.
import { createPCMStreamPlayer } from "@mascotbot/core";

const player = createPCMStreamPlayer({ sampleRate: 24000 }); // both emit 24 kHz
useLipsyncStream({ client, playback, source: { kind: "mediaStream", stream: player.outputStream } });

// Gemini Live (@google/genai): modelTurn audio part
session.onmessage = (m) => {
  const b64 = m?.serverContent?.modelTurn?.parts?.[0]?.inlineData?.data;
  if (typeof b64 === "string") player.pushBase64PCM16(b64);
  if (m?.serverContent?.interrupted) player.stop();
};

// OpenAI Realtime (WebSocket): PCM16 ArrayBuffer
session.on("audio", (e) => player.pushPCM16(new Uint8Array(e.data)));
session.on("audio_interrupted", () => player.stop());
The transport parsing above is provider glue and stays in your app — it must not enter the SDK. Only the play-and-tap primitive is shared.
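To make "raw base64 PCM16 @ 24 kHz" concrete, the sketch below decodes one chunk into float samples and computes its duration. It is for inspection and debugging only and is not how createPCMStreamPlayer is implemented; the function names are hypothetical, and it assumes little-endian, mono samples.

```typescript
// Decode base64-encoded PCM16 (assumed little-endian, mono) into floats in [-1, 1).
function base64Pcm16ToFloat32(b64: string): Float32Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);

  // DataView reads little-endian explicitly, regardless of host byte order.
  const view = new DataView(bytes.buffer);
  const sampleCount = Math.floor(bytes.length / 2); // ignore a trailing odd byte
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}

// Duration of a mono chunk in milliseconds at the given sample rate.
function chunkDurationMs(sampleCount: number, sampleRate = 24000): number {
  return (sampleCount * 1000) / sampleRate;
}
```

For example, a chunk of 2400 samples at 24 kHz covers 100 ms of speech, which is useful when checking whether a provider is delivering audio faster or slower than real time.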

Path 3 — ElevenLabs Conversational AI

ElevenLabs plays its assistant audio through a hidden <audio> whose srcObject is a MediaStream (an internal worklet → MediaStreamDestination). Call createElementTap() in the click, patch window.Audio before Conversation.startSession to capture that element, then call tap.attach(el) once its srcObject is set. The srcObject branch taps without re-outputting, so ElevenLabs’ own playback is not doubled:
const tap = createElementTap();   // in the click, before startSession
setStream(tap.stream);            // → useLipsyncStream source: { kind: "mediaStream", stream }

const w = window as unknown as { Audio: typeof Audio; __el?: HTMLAudioElement };
const Orig = w.Audio;
w.Audio = function (...a: unknown[]) {
  const el = new Orig(...(a as []));
  w.__el = el;
  return el;
} as unknown as typeof Audio;

const { Conversation } = await import("@elevenlabs/client");
await Conversation.startSession({ signedUrl });

const iv = setInterval(() => {
  const el = w.__el;
  if (el?.srcObject instanceof MediaStream) {
    clearInterval(iv);
    w.Audio = Orig;
    tap.attach(el);               // srcObject branch → tap only; stays audible
  }
}, 100);
// teardown: tap.close();

Server TTS

For plain TTS, the server route returns audio only (base64 PCM16). The client plays it through createPCMStreamPlayer and the tap drives the mouth. The server only synthesizes speech; it never computes or streams visemes.
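A minimal sketch of the server side of that contract, assuming a hypothetical JSON response shape ({ audio, sampleRate }) and a hypothetical helper name. The source does not specify the route's payload, so treat the field names as placeholders.

```typescript
// Hypothetical response shape for a TTS route; field names are illustrative.
interface TtsResponse {
  audio: string;      // base64 PCM16, little-endian, mono
  sampleRate: number; // Hz, as produced by your TTS engine
}

// Pack synthesized 16-bit samples into the audio-only payload.
function encodePcm16Response(samples: Int16Array, sampleRate: number): TtsResponse {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    view.setInt16(i * 2, samples[i], true); // true = little-endian
  }

  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return { audio: btoa(binary), sampleRate };
}
```

The payload deliberately contains nothing viseme-related: the client feeds `audio` to the PCM player and the tap drives the mouth from the sound itself.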

End-of-utterance silence

The SDK’s internal −50 dBFS silence gate suppresses the phantom mouth shapes that appear when the assistant stops talking. You do not implement your own gate — this is handled for every realtime path.
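You never need the following in app code. It only illustrates what a −50 dBFS threshold means, assuming an RMS measurement over float samples in [-1, 1]; how the SDK measures level internally is not described here, and the function names are hypothetical.

```typescript
// RMS level of float samples (range [-1, 1]) in dB relative to full scale.
function rmsDbfs(samples: Float32Array): number {
  if (samples.length === 0) return -Infinity;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) sumSquares += samples[i] * samples[i];
  const rms = Math.sqrt(sumSquares / samples.length);
  return rms === 0 ? -Infinity : 20 * Math.log10(rms);
}

// Linear amplitude that corresponds to a dBFS value: 10^(dB / 20).
function dbfsToAmplitude(db: number): number {
  return Math.pow(10, db / 20);
}

const SILENCE_THRESHOLD_DBFS = -50; // the gate level quoted in the text

// True when the block is quieter than the gate level.
function isBelowGate(samples: Float32Array): boolean {
  return rmsDbfs(samples) < SILENCE_THRESHOLD_DBFS;
}
```

At −50 dBFS the linear amplitude is roughly 0.003 of full scale, so residual noise after an utterance typically falls under the gate while normal speech sits well above it.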

Stress emphasis and gestures

stress and gesture add body to a talking avatar. There is no flag to “enable” them — you drive them, and each needs the matching input declared on the .riv (the ready-made mascots include stress). They work the same for every realtime provider; the natural trigger is speech onset, which useLipsyncStream’s onFrame gives you.

stress — built-in emphasis

stress is one of the three input families the SDK drives (mouth, is_speaking, stress). useMascotPlayback() returns a stress() method: you push emphasis cues { offset, stress } and the SDK eases the Rive stress input toward each target. offset is ms on the playback clock; cues are applied in order, and a cue whose offset has already passed is applied on the next frame — so offset: 0 means “apply now”. That makes the realtime pattern trivial: raise stress while the assistant speaks, drop it when it stops.
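The cue semantics above (in-order application, late cues applied on the next frame, easing toward the target) can be modelled in a few lines. This is a sketch of the described behaviour, not the SDK's implementation; the StressTimeline class and the easing factor are hypothetical.

```typescript
interface StressCue {
  offset: number; // ms on the playback clock
  stress: number; // target value for the stress input
}

class StressTimeline {
  private queue: StressCue[] = [];
  private target = 0;
  private value = 0;

  push(cues: StressCue[]): void {
    this.queue.push(...cues);
  }

  // Call once per frame with the current playback clock in ms.
  tick(nowMs: number, easing = 0.2): number {
    // Apply every queued cue that is due, in order; the last due cue wins.
    let next = this.queue[0];
    while (next !== undefined && next.offset <= nowMs) {
      this.target = next.stress;
      this.queue.shift();
      next = this.queue[0];
    }
    // Ease toward the target instead of jumping to it.
    this.value += (this.target - this.value) * easing;
    return this.value;
  }
}
```

Because a cue with offset 0 is always already due, pushing it takes effect on the very next tick, which is why offset: 0 reads as "apply now".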

gesture — your own one-shot trigger

gesture is a consumer-owned input: the SDK never touches it. If your .riv declares one, fire it yourself through useMascotInputs(): has("gesture") confirms the input exists, and custom.gesture.fire?.() triggers it (the optional-call form also tolerates a numeric gesture input).

Wiring both for ElevenLabs (or any provider)

playback must be created with stream: true for realtime. The component below drives stress from the speech envelope and fires gesture once per utterance. It is identical for OpenAI and Gemini; only the stream source differs:
import { useRef } from "react";
import { useMascot } from "@mascotbot/react";
import { useMascotPlayback, useMascotInputs, useLipsyncStream } from "@mascotbot/react/rive";

function AvatarReactions({ stream }: { stream: MediaStream | null }) {
  const { client } = useMascot();
  const playback = useMascotPlayback({ stream: true, enableNaturalLipSync: true });
  const { has, custom } = useMascotInputs();
  const speaking = useRef(false);

  useLipsyncStream({
    client,
    playback,
    source: { kind: "mediaStream", stream }, // createElementTap() for ElevenLabs, or player.outputStream
    onFrame: (f) => {
      const isSpeech = !f.silenceDetected;
      if (isSpeech && !speaking.current) {
        speaking.current = true;
        playback.stress([{ offset: 0, stress: 1 }]);          // emphasize while speaking
        if (has("gesture")) custom.gesture.fire?.();           // one-shot reaction
      } else if (!isSpeech && speaking.current) {
        speaking.current = false;
        playback.stress([{ offset: 0, stress: 0 }]);          // ease back to neutral
      }
    },
  });
  return null;
}
For a single emphasis bump instead of a sustained one, push stress: 1 then release after a hold: setTimeout(() => playback.stress([{ offset: 0, stress: 0 }]), 350). For offline playback the same playback.stress([...]) works with real timeline offsets (e.g. { offset: 0, stress: 1 }, { offset: 400, stress: 0 }). reset() clears scheduled stress with the rest of playback. The timeline JSON does not carry stress — you always schedule it separately. See Rive co-existence for why gesture is yours and stress is SDK-driven.
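The speaking/not-speaking bookkeeping inside onFrame can be pulled out into a small pure helper, which makes the onset logic testable without a stream. This is a sketch; createSpeechEdgeDetector is a hypothetical name, and it takes the same silenceDetected flag the frame exposes.

```typescript
type SpeechEdge = "onset" | "offset" | null;

// Returns a function that reports transitions between silence and speech.
function createSpeechEdgeDetector(): (silenceDetected: boolean) => SpeechEdge {
  let speaking = false;
  return (silenceDetected: boolean): SpeechEdge => {
    const isSpeech = !silenceDetected;
    if (isSpeech && !speaking) {
      speaking = true;
      return "onset"; // first speech frame after silence
    }
    if (!isSpeech && speaking) {
      speaking = false;
      return "offset"; // first silent frame after speech
    }
    return null; // no change
  };
}
```

Inside onFrame you would call the detector with f.silenceDetected, raise stress and fire the gesture on "onset", and release stress on "offset".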

Provider guides

ElevenLabs avatar

Conversational AI avatar.

Gemini Live avatar

Gemini Live API avatar.

OpenAI Realtime avatar

ChatGPT Realtime avatar.

Next

PCM stream player

Play + tap raw PCM.

Streaming & mic

useLipsyncStream in depth.