
ElevenLabs Avatar Integration - Real-time Visual Avatars for Your Voice AI

Transform your ElevenLabs voice agents into engaging visual experiences with Mascot Bot SDK. Get real-time lip sync, seamless integration, and production-ready React components that work alongside the official ElevenLabs SDK, which stays unchanged. The SDK animates a real-time avatar from the audio ElevenLabs plays.

Quick Start

Add avatars in 5 minutes

Live Demo

See voice avatars in action

GitHub Repo

Complete example code

Features

Real-time lip sync & more

API Reference

Complete hook documentation

Deploy

One-click deployment

Why Add Avatars to Your ElevenLabs Conversational AI?

Voice-only AI can feel impersonal. By adding a conversational AI avatar with real-time lip sync, you create more engaging, human-like interactions. The Mascot Bot voice to avatar SDK works alongside your existing ElevenLabs setup — you keep using @elevenlabs/client exactly as you do today, and the SDK lip-syncs whatever the agent says in real time.

Features

  Real-time Lip Sync

The SDK turns ElevenLabs’ audio output into a real-time talking avatar with no perceptible lag. There is no server round-trip for visemes — only ElevenLabs’ own audio stream.

  120fps Animation Performance

Smooth, natural voice-driven facial animation powered by WebGL2 and the Rive runtime.

  Native ElevenLabs Support

Works alongside @elevenlabs/client with zero conflicts and zero modifications to your ElevenLabs code. The SDK never proxies or intercepts the ElevenLabs connection — it only taps the audio it already plays.

  Customizable Avatars

Choose from ready-made mascots or bring your own Rive file. The SDK only writes the mouth, is_speaking, and stress — every other input, outfit, gesture, and ViewModel stays yours (Rive co-existence).

  Streaming Avatar Audio

ElevenLabs plays the assistant’s audio itself; the SDK captures that exact playback as a MediaStream and lip-syncs it. The capture point is the playback point, so the mouth never drifts ahead of speech.

  Natural Lip Sync Processing

An optional post-processor merges rapid visemes and preserves the distinctive shapes for natural, non-robotic motion — Natural lip sync.

Quick Start

Installation

The SDK installs from the private registry npm.mascot.bot. Add an .npmrc, then install alongside the official ElevenLabs client:
.npmrc
@mascotbot:registry=https://npm.mascot.bot/
//npm.mascot.bot/:_authToken=mascot_xxx
pnpm add @mascotbot/react @rive-app/react-webgl2 @rive-app/webgl2 @elevenlabs/client
Get your Mascot Bot key at app.mascot.bot/api-keys (mascot_dev_… for localhost, mascot_pub_… for production). The SDK works alongside the official ElevenLabs SDK without any modifications. Full setup: Installation.
Want a complete working example? See the open-source demo repository, or deploy it to Vercel with one click.
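The two key prefixes above can be checked at startup so a development key does not ship to production by accident. This is a minimal sketch, not part of the SDK: keyKindForHost and assertKeyMatchesHost are hypothetical helpers, and the assumption that only loopback hosts count as "localhost" is mine.

```typescript
// Hypothetical helpers (not part of @mascotbot/react): match the key prefix
// described above (mascot_dev_ for localhost, mascot_pub_ for production)
// against the host the page is served from.
type MascotKeyKind = "dev" | "pub";

function keyKindForHost(hostname: string): MascotKeyKind {
  // Assumption: only loopback hosts are treated as local development.
  const isLocal =
    hostname === "localhost" || hostname === "127.0.0.1" || hostname === "[::1]";
  return isLocal ? "dev" : "pub";
}

function assertKeyMatchesHost(apiKey: string, hostname: string): void {
  const expectedPrefix = `mascot_${keyKindForHost(hostname)}_`;
  if (!apiKey.startsWith(expectedPrefix)) {
    throw new Error(
      `Expected a key starting with "${expectedPrefix}" on host "${hostname}"`
    );
  }
}
```

In a browser you would call assertKeyMatchesHost(apiKey, window.location.hostname) once before rendering the provider.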

Basic Integration

Three pieces: a server route that mints an ElevenLabs signed URL, the ElevenLabs Conversation, and the SDK tapping its audio.
"use client";
import { MascotProvider } from "@mascotbot/react";
import { Mascot, MascotRive } from "@mascotbot/react/rive";
// ElevenLabsAvatar is the component built in Step 2; this import path is an assumption.
import { ElevenLabsAvatar } from "./ElevenLabsAvatar";

function App() {
  return (
    <MascotProvider apiKey="mascot_pub_…">
      <MascotProvider>
        <Mascot src="/mascot.riv">
          <MascotRive />
          <ElevenLabsAvatar />
        </Mascot>
      </MascotProvider>
    </MascotProvider>
  );
}
The ElevenLabsAvatar component is built in Step 2. Once it is in place, the SDK keeps the avatar's mouth in sync with the ElevenLabs voice automatically.

Complete Implementation Guide

Step 1: Mint an ElevenLabs Signed URL (Server-Side)

ElevenLabs needs a signed URL to open the WebSocket. Mint it on the server so your long-lived API key (sent in the xi-api-key header) never reaches the browser. The route below calls the standard ElevenLabs signed-URL endpoint.
// app/api/elevenlabs/signed-url/route.ts
export const runtime = "nodejs";

export async function POST() {
  const key = process.env.ELEVENLABS_API_KEY;
  const agentId = process.env.ELEVENLABS_AGENT_ID;
  if (!key || !agentId) {
    // Missing env vars are a server misconfiguration, not a client error.
    return Response.json({ error: "ElevenLabs env not set" }, { status: 500 });
  }
  const url = new URL("https://api.elevenlabs.io/v1/convai/conversation/get-signed-url");
  url.searchParams.set("agent_id", agentId);
  const res = await fetch(url, { headers: { "xi-api-key": key }, cache: "no-store" });
  if (!res.ok) return Response.json({ error: `ElevenLabs ${res.status}` }, { status: 502 });
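  const json = (await res.json()) as { signed_url?: string };
  // Fail loudly instead of returning 200 with an undefined signedUrl.
  if (!json.signed_url) return Response.json({ error: "ElevenLabs returned no signed_url" }, { status: 502 });
  return Response.json({ signedUrl: json.signed_url });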
}
Required environment variables (server-side only):
  • ELEVENLABS_API_KEY — your ElevenLabs API key
  • ELEVENLABS_AGENT_ID — your ElevenLabs Conversational AI agent id
Your Mascot Bot key (mascot_pub_…) is a separate, browser-safe publishable key passed to <MascotProvider>.
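On the client, the route's JSON body is worth validating before it is handed to Conversation.startSession, because the route can answer with an error object instead of a URL. The following is a minimal sketch; parseSignedUrlResponse is a hypothetical helper, and the body shapes it checks are the ones the route above returns ({ signedUrl } on success, { error } on failure).

```typescript
// Hypothetical client-side guard for the /api/elevenlabs/signed-url response.
function parseSignedUrlResponse(body: unknown): string {
  if (typeof body !== "object" || body === null) {
    throw new Error("Signed-URL route returned a non-object body");
  }
  const { signedUrl, error } = body as { signedUrl?: unknown; error?: unknown };
  // The route reports failures as { error: "..." }.
  if (typeof error === "string") {
    throw new Error(`Signed-URL route failed: ${error}`);
  }
  if (typeof signedUrl !== "string" || signedUrl.length === 0) {
    throw new Error("Signed-URL route returned no signedUrl");
  }
  return signedUrl;
}
```

Usage would look like parseSignedUrlResponse(await res.json()) in place of destructuring the response directly.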

Step 2: Create Your Avatar Component

ElevenLabs plays the assistant audio internally through an <audio> element. Capture that element, expose it as a MediaStream, and feed it to useLipsyncStream. Leave playback with ElevenLabs.
"use client";
import { useEffect, useRef, useState } from "react";
import { useMascot, createElementTap } from "@mascotbot/react";
import { useMascotPlayback, useLipsyncStream } from "@mascotbot/react/rive";

// Stable module constant — see the natural-lip-sync warning below.
const LIP_SYNC = { minVisemeInterval: 60, mergeWindow: 80 } as const;

export function ElevenLabsAvatar() {
  const { client, status } = useMascot();
  const playback = useMascotPlayback({ stream: true, enableNaturalLipSync: true, naturalLipSyncConfig: LIP_SYNC });
  const [stream, setStream] = useState<MediaStream | null>(null);
  const teardownRef = useRef<null | (() => void)>(null);

  // The SDK lip-syncs whatever this MediaStream carries.
  const { error, attached } = useLipsyncStream({
    client,
    playback,
    source: { kind: "mediaStream", stream },
  });

  useEffect(() => () => teardownRef.current?.(), []);

  const start = async () => {
    if (status !== "ready") return;
    // createElementTap from "@mascotbot/react" — create in
    // this click gesture so its AudioContext isn't suspended (Safari).
    const tap = createElementTap();
    setStream(tap.stream);
    // Capture the hidden <audio> ElevenLabs creates (its srcObject is a
    // MediaStream of the worklet output we'll tap).
    const w = window as unknown as { Audio: typeof Audio; __el?: HTMLAudioElement | null };
    const OrigAudio = w.Audio;
    w.Audio = function (...args: unknown[]) {
      const el = new OrigAudio(...(args as []));
      w.__el = el;
      return el;
    } as unknown as typeof Audio;

    const { signedUrl } = await (await fetch("/api/elevenlabs/signed-url", { method: "POST" })).json();
    const { Conversation } = await import("@elevenlabs/client");
    const convo = await Conversation.startSession({ signedUrl });

    // Live-tracks guard: on a 2nd "end → start" cycle, `w.__el` will
    // briefly still point at the previous call's <audio> element (its
    // srcObject is a MediaStream whose tracks are 'ended'). Attaching
    // to it produces silence for the entire new call. Require at
    // least one 'live' track before attaching.
    const isLive = (el: HTMLMediaElement | null | undefined) =>
      !!el && el.srcObject instanceof MediaStream &&
      el.srcObject.getAudioTracks().some((t) => t.readyState === "live");

    let tries = 0;
    const iv = window.setInterval(() => {
      const el = w.__el;
      // tap.attach(el): cross-browser tap (Safari has no captureStream).
      // See /realtime/overview#tap-a-playing-element
      if (isLive(el)) {
        tap.attach(el as HTMLMediaElement);
        window.clearInterval(iv);
      } else if (++tries > 100) {
        window.clearInterval(iv);
      }
    }, 100);

    teardownRef.current = () => {
      window.clearInterval(iv);
      w.Audio = OrigAudio;
      // Null the stash so the *next* call's poll doesn't latch onto
      // this (now-dead) element before EL has constructed its new one.
      w.__el = null;
      // Close the tap's AudioContext + MediaStream — otherwise each
      // restart leaks a worklet graph.
      tap.close();
      void convo.endSession();
    };
  };

  return (
    <div>
      <button onClick={start} disabled={status !== "ready"}>Start conversation</button>
      <span>{stream ? (attached ? "lip-sync attached" : "attaching…") : "idle"}</span>
      {error ? <p>{error.message}</p> : null}
    </div>
  );
}
Do not route ElevenLabs through createPCMStreamPlayer — ElevenLabs plays the audio itself, so the player would make the voice play twice. Tap the audio it already renders, as above.
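The window.Audio patch in the component above can be factored into a small capture-and-restore helper, which makes the "install before startSession, restore on teardown" contract explicit. This is a sketch under assumptions: captureConstructed is a hypothetical helper, not an SDK export, and it takes the host object as a parameter so it can be exercised without a browser.

```typescript
// Hypothetical helper: wrap host.Audio so the most recently constructed
// instance can be read back, and restore the original constructor later.
type Ctor<T> = new (...args: any[]) => T;

interface CaptureHandle<T> {
  last(): T | null;   // most recently constructed instance, or null
  restore(): void;    // put the original constructor back and clear the stash
}

function captureConstructed<T extends object>(host: { Audio: Ctor<T> }): CaptureHandle<T> {
  const Original = host.Audio;
  let last: T | null = null;
  // When a function called with `new` returns an object, that object
  // becomes the result of the `new` expression.
  const Patched = function (...args: any[]) {
    last = new Original(...args);
    return last;
  } as unknown as Ctor<T>;
  host.Audio = Patched;
  return {
    last: () => last,
    restore: () => {
      host.Audio = Original;
      last = null;
    },
  };
}
```

In the component, you would pass window as the host, poll handle.last() for an element with a live track, and call handle.restore() in the teardown.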

Step 3: Advanced Features

Natural Lip Sync Configuration

Tune the post-processor by passing a stable naturalLipSyncConfig to useMascotPlayback:
// Module scope — a NEW object every render reinitializes playback and
// breaks lip sync after the first chunk.
const CONVERSATION = {
  minVisemeInterval: 60,
  mergeWindow: 80,
  keyVisemePreference: 0.7,
  preserveSilence: true,
  similarityThreshold: 0.6,
  preserveCriticalVisemes: true,
} as const;

const playback = useMascotPlayback({ enableNaturalLipSync: true, naturalLipSyncConfig: CONVERSATION });
See Natural lip sync for every field, the defaults, and conversation / fast-speech / educational presets.
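To build intuition for what a minimum viseme interval does, here is a toy thinning pass. It is not the SDK's post-processor: thinVisemes, the VisemeEvent shape, the millisecond unit, and the "critical" set are all assumptions made for illustration, loosely mirroring the minVisemeInterval and preserveCriticalVisemes fields above.

```typescript
// Illustrative only: drop visemes that start too soon after the previously
// kept one, unless their id is marked critical. Not the SDK's algorithm.
interface VisemeEvent {
  id: number;       // hypothetical viseme identifier
  startMs: number;  // start time in milliseconds (assumed unit)
}

function thinVisemes(
  events: readonly VisemeEvent[],
  minIntervalMs: number,
  critical: ReadonlySet<number> = new Set<number>()
): VisemeEvent[] {
  const kept: VisemeEvent[] = [];
  for (const ev of events) {
    const prev = kept.length > 0 ? kept[kept.length - 1] : undefined;
    const tooSoon = prev !== undefined && ev.startMs - prev.startMs < minIntervalMs;
    if (tooSoon && !critical.has(ev.id)) continue; // skip a rapid, non-critical shape
    kept.push(ev);
  }
  return kept;
}
```

A larger interval yields calmer motion at the cost of dropping short mouth shapes, which is why a preset for fast speech would plausibly use a smaller value.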

Embedded Avatar Widget

Mount the avatar small and fixed for an embeddable AI agent with a face. The SDK only animates the mouth, so your own widget chrome, click handlers, and Rive inputs are untouched:
<div className="fixed bottom-4 right-4 w-64 h-64">
  <Mascot src="/widget-mascot.riv">
    <MascotRive />
    <ElevenLabsAvatar />
  </Mascot>
</div>
Need a non-mouth input (a wave, a reveal)? Use useMascotInputs().has(name) then drive it yourself — Rive co-existence.

Gestures on Every Agent Turn

The legacy SDK auto-fired a gesture trigger at the start of every agent utterance (the old gesture: true flag on useMascotElevenlabs). 0.2.x removed the auto-fire — consumers wire it themselves. ElevenLabs makes this a one-liner: Conversation.startSession exposes an onModeChange callback that flips to "speaking" the moment the first audio chunk of a new turn lands.
import { useRef } from "react";
import { useMascotInputs } from "@mascotbot/react/rive";

function ElevenLabsAvatarWithGestures() {
  // useMascotInputs() returns a fresh object every render — capture in a
  // ref so the long-lived EL callback always reads the current handle.
  const { custom } = useMascotInputs();
  const customRef = useRef(custom);
  customRef.current = custom;

  const start = async () => {
    const { signedUrl } = await (await fetch("/api/elevenlabs/signed-url", { method: "POST" })).json();
    const { Conversation } = await import("@elevenlabs/client");
    await Conversation.startSession({
      signedUrl,
      onModeChange: ({ mode }) => {
        // Fires exactly once per agent turn start; stays silent on the
        // "listening" back-edge so the gesture only plays as the agent
        // begins speaking.
        if (mode !== "speaking") return;
        customRef.current?.gesture?.fire?.();
      },
      // ... your existing onConnect / onDisconnect / onError
    });
  };

  return <button onClick={start}>Start conversation</button>;
}
Declare gesture on the parent <Mascot inputs={["gesture", ...]}> so the SDK exposes a real trigger handle. On .riv files without a gesture input the SDK returns a no-op shim, so the optional-chain fire?.() stays safe. For a provider-agnostic approach driven by the speech envelope (works identically for OpenAI / Gemini), see Stress emphasis and gestures.
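If you want the gesture to be robust against repeated "speaking" notifications, a small edge detector can sit between onModeChange and the trigger. This is a defensive sketch, assuming the mode values are "speaking" and "listening"; createTurnStartDetector is a hypothetical helper, not part of either SDK.

```typescript
// Hypothetical helper: invoke onTurnStart only on the transition into
// "speaking", ignoring repeated "speaking" events within the same turn.
type Mode = "speaking" | "listening";

function createTurnStartDetector(onTurnStart: () => void): (mode: Mode) => void {
  let previous: Mode | null = null;
  return (mode: Mode) => {
    const risingEdge = mode === "speaking" && previous !== "speaking";
    previous = mode;
    if (risingEdge) onTurnStart();
  };
}
```

Wiring it up would look like creating the detector once with a callback that fires the gesture, then calling it from onModeChange with the reported mode.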

Step 4: Dynamic Variables

ElevenLabs dynamic variables personalize conversations at runtime. They are an ElevenLabs feature and are completely independent of the SDK — pass them straight to Conversation.startSession:
const dynamicVariables = { name: userName ?? "Guest", role: userRole ?? "user" };

const { signedUrl } = await (await fetch("/api/elevenlabs/signed-url", { method: "POST" })).json();
const { Conversation } = await import("@elevenlabs/client");
const convo = await Conversation.startSession({ signedUrl, dynamicVariables });
Configure your agent prompt to use {{name}} / {{role}} placeholders. The SDK does not see or touch these — it only lip-syncs the resulting audio.
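A small builder keeps the fallback values in one place, and a local preview helps check that the placeholder names in your prompt match the keys you send. Both functions are hypothetical sketches: ElevenLabs performs the real substitution on its side, and previewPrompt only imitates the {{name}} syntax for debugging.

```typescript
// Hypothetical helpers for preparing and previewing dynamic variables.
type DynamicVariables = Record<string, string | number | boolean>;

function buildDynamicVariables(user: {
  name?: string | null;
  role?: string | null;
}): DynamicVariables {
  // Blank or missing values fall back to the defaults used above.
  return {
    name: user.name?.trim() || "Guest",
    role: user.role?.trim() || "user",
  };
}

// Local preview only: replaces {{key}} with the value when the key exists,
// and leaves unknown placeholders visible so typos stand out.
function previewPrompt(template: string, vars: DynamicVariables): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(vars, key) ? String(vars[key]) : match
  );
}
```

An unresolved placeholder surviving the preview is a quick signal that the prompt and the dynamicVariables object disagree on a name.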

API Reference

This integration uses the standard SDK surface plus the official ElevenLabs client.
Surface | Role
<MascotProvider apiKey> | Initializes the licensed avatar client. See Config.
<MascotProvider> / <Mascot src> / <MascotRive> | Load and render the Rive avatar.
useMascotPlayback({ stream: true, enableNaturalLipSync, naturalLipSyncConfig }) | The mouth playback engine.
useLipsyncStream({ client, playback, source: { kind: "mediaStream", stream } }) | Lip-syncs the tapped ElevenLabs audio. See Reference.
@elevenlabs/client Conversation | The official ElevenLabs SDK, unchanged.
Server route: the plain ElevenLabs GET /v1/convai/conversation/get-signed-url with your xi-api-key. No Mascot Bot endpoint sits in the path.

Use Cases

AI Customer Service Avatar

A visible virtual assistant with a face for support: visual feedback during voice conversations, on-brand appearance, and expressions driven by your own Rive inputs.

Educational AI Tutor Avatar

Clear articulation for learning; pair with the educational natural-lip-sync preset for crisper mouth shapes.

Voice AI Virtual Receptionist

A welcoming visual presence with natural conversation flow and a brand-customizable mascot.

Technical Details

Voice-to-Animation Pipeline

  1. ElevenLabs streams and plays the assistant’s audio (its own WebSocket, untouched).
  2. The SDK captures that playback as a MediaStream.
  3. The SDK infers a viseme per 10 ms frame from the audio.
  4. The Rive runtime renders the mouth at up to 120fps.
  5. Optional natural lip sync smooths the motion.
No audio or viseme data is sent to or stored on Mascot Bot servers.
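The 10 ms framing in step 3 comes down to simple arithmetic on the sample rate. The sketch below shows only that framing plus a per-frame RMS energy value of the kind a silence gate could use; it does not attempt viseme inference, and the function names and the choice of Float32Array input are assumptions rather than SDK internals.

```typescript
// Illustrative framing arithmetic for 10 ms analysis frames.
const FRAME_MS = 10;

function samplesPerFrame(sampleRateHz: number): number {
  return Math.round((sampleRateHz * FRAME_MS) / 1000);
}

// Splits mono samples into whole 10 ms frames and returns each frame's RMS.
// A trailing partial frame is ignored.
function frameRms(samples: Float32Array, sampleRateHz: number): number[] {
  const size = samplesPerFrame(sampleRateHz);
  const energies: number[] = [];
  for (let start = 0; start + size <= samples.length; start += size) {
    let sumSquares = 0;
    for (let i = start; i < start + size; i++) {
      sumSquares += samples[i] * samples[i];
    }
    energies.push(Math.sqrt(sumSquares / size));
  }
  return energies;
}
```

At 48 kHz each frame covers 480 samples, so one second of audio yields 100 frames, which is the rate at which the pipeline above would produce visemes.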

Performance

  • Low audio-to-visual delay (the capture point is the playback point).
  • WebGL2-accelerated rendering.
  • End-of-utterance phantom mouth shapes are suppressed by the SDK’s internal silence gate — you do not implement one.

Troubleshooting

Avatar Not Moving?

Check three things. First, status === "ready" from useMascot(). Second, the tapped MediaStream is non-null: the window.Audio patch must be installed before Conversation.startSession, and el.srcObject must be a MediaStream with at least one live audio track. Third, the Rive file uses artboard Character and state machine mascotStateMachine with inputs 100118.
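The runtime checks above can be collected into one diagnostic function that you log when the mouth stays still. This is a hypothetical sketch: diagnoseStillAvatar and its input shape are not SDK APIs, and it covers only the status and audio checks, not the Rive file layout.

```typescript
// Hypothetical diagnostic: returns a human-readable list of likely causes.
interface TrackLike {
  readyState: "live" | "ended";
}

interface AvatarCheckInput {
  status: string;                       // value of useMascot().status
  audioPatchedBeforeStart: boolean;     // was window.Audio patched before startSession?
  tracks: readonly TrackLike[] | null;  // audio tracks of el.srcObject, or null if not a MediaStream
}

function diagnoseStillAvatar(input: AvatarCheckInput): string[] {
  const problems: string[] = [];
  if (input.status !== "ready") {
    problems.push(`useMascot() status is "${input.status}", expected "ready"`);
  }
  if (!input.audioPatchedBeforeStart) {
    problems.push("window.Audio patch was installed after Conversation.startSession");
  }
  if (input.tracks === null) {
    problems.push("el.srcObject is not a MediaStream");
  } else if (!input.tracks.some((t) => t.readyState === "live")) {
    problems.push("no live audio track on the tapped stream");
  }
  return problems;
}
```

An empty result means the audio side looks healthy, which points the investigation at the Rive file instead.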

Only the First Second of Speech Animates?

A new naturalLipSyncConfig object on every render reinitializes playback. Use a stable module constant or useState/useMemo — see the example above and Troubleshooting.
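The reason an inline object causes trouble is reference identity: two object literals with the same fields are still different objects. The sketch below models that with Object.is, the comparison React uses for hook dependencies; countReinitializations is a hypothetical stand-in for "how often playback would restart", assuming the SDK compares the config by reference as the behaviour above suggests.

```typescript
// Illustrative model: count how many times a consumer that compares configs
// by reference would treat the config as changed across successive renders.
function countReinitializations<T>(configsPerRender: readonly T[]): number {
  let count = 0;
  let previous: T | undefined;
  let first = true;
  for (const config of configsPerRender) {
    if (first || !Object.is(previous, config)) count += 1;
    previous = config;
    first = false;
  }
  return count;
}

// Module-scope constant: the same reference on every render.
const STABLE_CONFIG = { minVisemeInterval: 60, mergeWindow: 80 } as const;

// Inline literal: a fresh object for each simulated render.
const inlineConfigs = [0, 1, 2].map(() => ({ minVisemeInterval: 60, mergeWindow: 80 }));
const stableConfigs = [STABLE_CONFIG, STABLE_CONFIG, STABLE_CONFIG];
```

With three simulated renders, the inline literals count as three distinct configs while the module constant counts as one.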

Hearing the Voice Twice?

You routed ElevenLabs through createPCMStreamPlayer. ElevenLabs self-plays — tap its audio instead (Step 2), never the PCM player.

Dynamic Variables Not Applied?

They are an ElevenLabs concern. Ensure the agent prompt has the {{placeholders}} and that you pass dynamicVariables to Conversation.startSession. The SDK is not involved.

FAQ

Can You Add an Avatar to ElevenLabs?

Yes. The SDK works alongside the official @elevenlabs/client with no modifications. You connect ElevenLabs as usual; the SDK lip-syncs its audio in real time.

Does It Work With My Existing ElevenLabs Setup?

Yes. Keep your @elevenlabs/client code exactly as it is — the SDK lip-syncs the audio ElevenLabs plays, in real time.

Do I Modify My ElevenLabs Code?

No. Keep your Conversation setup. You only add a MediaStream tap of the audio it plays.

How Is the Lip Sync Synchronized?

The audio is tapped at its playback point with a Web Audio MediaStreamAudioDestinationNode, so visemes are derived from exactly what the user hears and the mouth cannot run ahead of the voice.

Is Audio Sent to Mascot Bot?

No. Audio is processed by the SDK in the user's browser and isn't sent to or stored on Mascot Bot servers.

What Is the Voice Avatar SDK?

A React/JavaScript library that adds a real-time, lip-synced avatar to any voice AI — including ElevenLabs Conversational AI.

Start with ElevenLabs Avatar Today

Ready to transform your voice AI? The open-source demo for ElevenLabs makes it simple to get started:

Try Voice Avatar Demo

Experience it yourself

Demo Repository

Complete working example
Unlike pre-rendered solutions, this is a real-time alternative — dynamic, responsive avatars that connect with users.

Next Steps

  1. Get a key at app.mascot.bot/api-keys and install from the private registry.
  2. Add <MascotProvider> + <MascotProvider>/<Mascot> and the avatar component above.
  3. Choose a ready-made mascot or your own Rive file.
  4. Tune motion with natural lip sync; review the realtime overview for the general pattern.
Transform your ElevenLabs implementation today with the most developer-friendly avatar SDK for voice AI.