Adding a lip-synced avatar to a realtime voice assistant is one idea:
Give the SDK a MediaStream of the assistant’s voice. It turns that
audio into a talking avatar in real time.
The only thing that changes per provider is how you obtain that MediaStream,
which depends on whether the provider plays the audio for you.
Pick the path by provider
| Provider / transport | Plays audio itself? | How you get the stream | SDK piece |
|---|---|---|---|
| OpenAI Realtime — WebRTC | Yes (into an <audio> you supply) | createElementTap() | none |
| Gemini Live | No (raw base64 PCM16 @ 24 kHz) | createPCMStreamPlayer().outputStream | PCM player |
| OpenAI Realtime — WebSocket | No (raw PCM16 ArrayBuffer) | createPCMStreamPlayer().outputStream | PCM player |
| ElevenLabs Conversational AI | Yes (internal worklet → <audio>) | createElementTap() | none |
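
The table can be encoded as a small lookup so the choice is made in one place. This is a minimal sketch: the provider keys and the `StreamSource` labels are hypothetical names chosen for illustration and are not part of the SDK.

```typescript
// Hypothetical provider keys; one per row of the table above.
type Provider =
  | "openai-webrtc"
  | "openai-websocket"
  | "gemini-live"
  | "elevenlabs";

// Hypothetical labels for the two ways of obtaining the MediaStream.
type StreamSource = "element-tap" | "pcm-player";

const STREAM_SOURCE: Record<Provider, StreamSource> = {
  // The provider plays the audio itself, so tap the element it plays through.
  "openai-webrtc": "element-tap",
  elevenlabs: "element-tap",
  // The provider hands over raw PCM16, so play it and use the player's stream.
  "openai-websocket": "pcm-player",
  "gemini-live": "pcm-player",
};

function streamSourceFor(provider: Provider): StreamSource {
  return STREAM_SOURCE[provider];
}
```

Typing the map as `Record<Provider, StreamSource>` makes the compiler flag a provider that is added to the union without a row.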
Tap a playing element
When a provider plays the audio itself (OpenAI WebRTC, ElevenLabs), tap the element it plays through with createElementTap(), an SDK helper that
works in Chrome, Safari and Firefox. HTMLMediaElement.captureStream() is not
implemented in Safari/WebKit, so the SDK does not use it.
tap.attach(el) handles both element kinds: a file/URL <audio> (e.g. OpenAI
WebRTC) is kept audible and tapped; an element whose srcObject is a
MediaStream (e.g. ElevenLabs’ worklet output) is tapped only, so the
provider’s own playback isn’t doubled. attach() is idempotent and may be called
any time after the tap is created. createElementTap is also exported from @mascotbot/core.
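
The two branches described above can be pictured as a decision on whether the element has a srcObject. The sketch below only mirrors that prose; it is not the SDK's implementation, and `MediaElementLike`, `TapMode` and `tapModeFor` are hypothetical names.

```typescript
// Structural stand-in for HTMLMediaElement: only the fields this sketch reads.
interface MediaElementLike {
  src?: string;
  srcObject?: object | null;
}

// Hypothetical labels for the two behaviours described in the text.
type TapMode = "tap-and-keep-audible" | "tap-only";

function tapModeFor(el: MediaElementLike): TapMode {
  // A srcObject means the provider is already playing a MediaStream through
  // this element, so re-outputting it would double the audio.
  return el.srcObject ? "tap-only" : "tap-and-keep-audible";
}
```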
Provider tokens stay on the server
Never ship a standing provider key to the browser. A server route mints a short-lived credential per session; the client connects with that. The demo ships reference route handlers for all three:

| Provider | Server mints | Notes |
|---|---|---|
| OpenAI | POST https://api.openai.com/v1/realtime/client_secrets → clientSecret | model gpt-realtime |
| Gemini | @google/genai ai.authTokens.create(...) → ephemeral token.name | model models/gemini-3.1-flash-live-preview, apiVersion: "v1alpha" |
| ElevenLabs | GET https://api.elevenlabs.io/v1/convai/conversation/get-signed-url?agent_id=… → signedUrl | xi-api-key header, server-side only |
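
As one illustration of the server-side step, here is a minimal sketch of the ElevenLabs row. It is not the demo's route handler. The function name and the injected `fetchImpl` parameter are hypothetical, and the `signed_url` response field is an assumption to check against the ElevenLabs API reference.

```typescript
// Minimal fetch shape so the sketch type-checks without DOM or Node typings.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function mintElevenLabsSignedUrl(
  agentId: string,
  apiKey: string, // read from server-side configuration; never sent to the browser
  fetchImpl: FetchLike,
): Promise<{ signedUrl: string }> {
  const url =
    "https://api.elevenlabs.io/v1/convai/conversation/get-signed-url" +
    `?agent_id=${encodeURIComponent(agentId)}`;
  const res = await fetchImpl(url, {
    method: "GET",
    headers: { "xi-api-key": apiKey },
  });
  if (!res.ok) {
    throw new Error(`ElevenLabs signed-url request failed: ${res.status}`);
  }
  // Assumption: the response body carries the URL under `signed_url`.
  const body = (await res.json()) as { signed_url?: string };
  if (!body.signed_url) {
    throw new Error("ElevenLabs response did not include signed_url");
  }
  // Only the short-lived URL is returned to the client.
  return { signedUrl: body.signed_url };
}
```

In a route handler you would pass the runtime's global `fetch` as `fetchImpl` and respond with the returned object; injecting it keeps the function testable without network access.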
Path 1 — OpenAI Realtime (WebRTC), cleanest
The provider auto-plays into an <audio> element. Supply your own so you can
tap it; no SDK audio piece needed.
Path 2 — Gemini Live / OpenAI Realtime (WebSocket)
The provider streams raw PCM and does not play it. createPCMStreamPlayer plays it, tolerating gaps between chunks, and
exposes the same audio as a tappable MediaStream.
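
To make "raw PCM16" concrete, the sketch below converts a chunk of 16-bit samples to floats in the range [-1, 1). createPCMStreamPlayer handles playback for you, so this is illustrative only. It assumes little-endian, mono samples, and the function names are hypothetical.

```typescript
// Sample rate quoted for Gemini Live in the provider table.
const GEMINI_SAMPLE_RATE_HZ = 24000;

// Convert raw PCM16 bytes to normalized float samples.
// Assumption: samples are little-endian and mono.
function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const sampleCount = Math.floor(bytes.byteLength / 2); // 2 bytes per sample
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Int16 spans -32768..32767; dividing by 32768 maps it into [-1, 1).
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}

// Playback duration of a chunk, in milliseconds.
function chunkDurationMs(sampleCount: number, sampleRateHz: number): number {
  return (sampleCount / sampleRateHz) * 1000;
}
```

For Gemini's base64 payloads, decode the string to bytes first; OpenAI's WebSocket ArrayBuffer can be wrapped directly with `new Uint8Array(buffer)`.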
Path 3 — ElevenLabs Conversational AI
ElevenLabs plays its assistant audio through a hidden <audio> whose
srcObject is a MediaStream (an internal worklet → MediaStreamDestination).
Call createElementTap() inside the click handler, patch
window.Audio before Conversation.startSession to capture the element,
then call tap.attach(el) once its srcObject is set. The srcObject branch
taps without re-outputting, so ElevenLabs’ own playback is not doubled.
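
One way to do the window.Audio patch is to wrap the constructor in a Proxy that reports each instance it creates. This is a generic sketch under that assumption, not the demo's code; `captureConstructed` is a hypothetical helper name.

```typescript
type Ctor<T extends object> = new (...args: any[]) => T;

// Replace host[key] with a wrapper that reports every instance it constructs.
// Returns a function that restores the original constructor.
function captureConstructed<T extends object>(
  host: Record<string, unknown>,
  key: string,
  onCreate: (instance: T) => void,
): () => void {
  const original = host[key] as Ctor<T>;
  host[key] = new Proxy(original, {
    construct(target, args, newTarget) {
      // Build the instance exactly as the original constructor would.
      const instance = Reflect.construct(target, args, newTarget) as T;
      onCreate(instance);
      return instance;
    },
  });
  return () => {
    host[key] = original;
  };
}
```

In the browser you would pass `window` (cast to `Record<string, unknown>`) with the key `"Audio"`, keep the captured element, and call the returned restore function once the session has started. Waiting for srcObject to be set before calling tap.attach(el) is still up to the caller.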
Server TTS
For plain TTS, the server route returns audio only (base64 PCM16). The client plays it through createPCMStreamPlayer and the tap drives the mouth.
The server only synthesizes speech; it never computes or streams visemes.
End-of-utterance silence
The SDK’s internal −50 dBFS silence gate suppresses the phantom mouth shapes that appear when the assistant stops talking. You do not implement your own gate; this is handled for every realtime path.

Stress emphasis and gestures
stress and gesture add body to a talking avatar. There is no flag to
“enable” them — you drive them, and each needs the matching input declared on
the .riv (the ready-made mascots include
stress). They work the same for every realtime provider; the natural
trigger is speech onset, which useLipsyncStream’s onFrame gives you.
stress — built-in emphasis
stress is one of the three input families the SDK drives (mouth,
is_speaking, stress). useMascotPlayback() returns a stress()
method: you push emphasis cues { offset, stress } and the SDK eases the Rive
stress input toward each target. offset is ms on the playback clock; cues
are applied in order, and a cue whose offset has already passed is applied
on the next frame — so offset: 0 means “apply now”. That makes the
realtime pattern trivial: raise stress while the assistant speaks, drop it
when it stops.
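
The cue semantics can be modelled in a few lines. This is a sketch of the behaviour described above, not SDK code: `currentStressTarget` is a hypothetical function, and the 0 to 1 range for stress is an assumption taken from the examples on this page.

```typescript
interface StressCue {
  offset: number; // ms on the playback clock
  stress: number; // target value for the Rive stress input (assumed 0..1)
}

// The target the stress input should be easing toward at `nowMs`:
// the last cue, in array order, whose offset has already been reached.
function currentStressTarget(
  cues: StressCue[],
  nowMs: number,
  initial: number = 0,
): number {
  let target = initial;
  for (const cue of cues) {
    if (cue.offset <= nowMs) target = cue.stress;
  }
  return target;
}
```

With `offset: 0` every cue is already due on any running clock, which is why pushing `{ offset: 0, stress: 1 }` behaves as "apply now" in the realtime case.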
gesture — your own one-shot trigger
gesture is a consumer-owned input — the SDK never touches it. If your .riv
declares one, fire it yourself with useMascotInputs(). has("gesture")
confirms the input exists; custom.gesture.fire?.() triggers it (the
optional-call form also tolerates a numeric gesture input).
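
A guarded trigger might look like the sketch below. The `MascotInputsLike` interface is an assumption inferred from the prose above (a `has()` method and a `custom` map); check the real return type of useMascotInputs() before relying on it. `fireGesture` is a hypothetical helper.

```typescript
// Assumed shape, inferred from the description of useMascotInputs().
interface MascotInputsLike {
  has(name: string): boolean;
  custom: Record<string, { fire?: () => void } | undefined>;
}

// Fire the consumer-owned gesture trigger if the .riv declares one.
// Returns true only when a trigger was actually fired.
function fireGesture(inputs: MascotInputsLike): boolean {
  if (!inputs.has("gesture")) return false;
  const input = inputs.custom["gesture"];
  // A numeric input has no fire(), so it is tolerated and skipped.
  if (!input || typeof input.fire !== "function") return false;
  input.fire();
  return true;
}
```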
Wiring both for ElevenLabs (or any provider)
For realtime, playback must be created with stream: true. The wiring drives
stress on the speech envelope and fires gesture once per utterance. It is
identical for OpenAI and Gemini; only the stream source differs.
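
A provider-agnostic version of that wiring can be sketched as an edge detector over a "speaking" flag. The interfaces are assumptions modelled on the methods named on this page, `createSpeechReactor` is a hypothetical helper, and deriving the boolean from useLipsyncStream’s onFrame is left to the caller because the frame shape is not described here.

```typescript
interface StressCue { offset: number; stress: number }
// Assumed minimal shapes for useMascotPlayback() and useMascotInputs().
interface PlaybackLike { stress(cues: StressCue[]): void }
interface InputsLike {
  has(name: string): boolean;
  custom: Record<string, { fire?: () => void } | undefined>;
}

// Returns a callback to invoke with the current speaking state.
// It acts only on changes, so repeated frames do not re-trigger anything.
function createSpeechReactor(playback: PlaybackLike, inputs: InputsLike) {
  let speaking = false;
  return function onSpeakingChange(isSpeaking: boolean): void {
    if (isSpeaking === speaking) return;
    speaking = isSpeaking;
    if (isSpeaking) {
      // Speech onset: raise stress now and fire the gesture once.
      playback.stress([{ offset: 0, stress: 1 }]);
      if (inputs.has("gesture")) inputs.custom["gesture"]?.fire?.();
    } else {
      // Speech ended: let stress ease back down.
      playback.stress([{ offset: 0, stress: 0 }]);
    }
  };
}
```

Because the reactor only sees a boolean, the same function serves ElevenLabs, OpenAI and Gemini.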
Alternatively, push stress: 1, then
release after a hold: setTimeout(() => playback.stress([{ offset: 0, stress: 0 }]), 350).
For offline playback the same playback.stress([...]) works with real
timeline offsets (e.g. { offset: 0, stress: 1 }, { offset: 400, stress: 0 }).
reset() clears scheduled stress with the rest of playback. The timeline JSON
does not carry stress — you always schedule it separately. See
Rive co-existence for why gesture is yours and
stress is SDK-driven.
Provider guides
ElevenLabs avatar
Conversational AI avatar.
Gemini Live avatar
Gemini Live API avatar.
OpenAI Realtime avatar
ChatGPT Realtime avatar.
Next
PCM stream player
Play + tap raw PCM.
Streaming & mic
useLipsyncStream in depth.