Build a backend and client from scratch
This guide walks through building the full Conversational AI stack from scratch: a server that issues tokens and manages agent sessions, and a browser client that captures microphone audio and streams live transcripts. You will end up with the same structure as the official Next.js and Python starter repos, but with every step explained.
If you want to hear an agent speak in under five minutes, follow the Voice AI quickstart guide.
What you will build
A complete client-server app:
- Backend: Three HTTP endpoints:
  - POST /api/token: Issues RTC + RTM tokens for a given channel and UID
  - POST /api/invite-agent: Starts the agent using the Agent Server SDK
  - POST /api/stop-conversation: Stops the agent by ID
- Frontend: A Next.js page that:
  - Requests microphone permission
  - Joins the Agora channel using the RTC SDK
  - Fetches a token from your backend
  - Calls `invite-agent` to bring the agent into the channel
  - Renders live transcripts from RTM
  - Calls `stop-conversation` on page unload
The architecture keeps control and media separate: the browser exchanges audio with Agora over RTC and receives transcripts over RTM, while your backend only issues tokens and starts or stops the agent over HTTP.
The backend never touches audio, and the browser never embeds your App Certificate. This clean separation is the reason you need a backend.
Prerequisites
- An active Agora account.
- `git` and a terminal.
- One of the following language runtimes:
  - Node.js 20 LTS or later with `pnpm` (TypeScript)
  - Python 3.11 or later with `uv` or `pip`
  - Go 1.22 or later
- A modern browser with microphone access.
Set up your environment
This section walks you through installing the Agora CLI and scaffolding your project.
Install the Agora CLI
The Agora CLI is the recommended way to bootstrap a new Agora project. Use it to create
projects, enable features, write credentials to .env files, and run diagnostics.
To install the Agora CLI, log in, and create a project with Conversational AI enabled:
Confirm that the CLI can read your credentials:
Keep this terminal open. You will reuse these credentials in both the backend and frontend steps.
Scaffold the repo
Select the tab for your preferred language.
- TypeScript
- Python
- Go
For TypeScript, the backend and frontend live in the same Next.js app.
- Scaffold a Next.js app with TypeScript, the App Router, Tailwind CSS, and ESLint, then install the required Agora packages:
  - `agora-agent-server-sdk`: Starts and stops agent sessions from the backend
  - `agora-rtc-sdk-ng`: Handles mic capture and RTC channel joining in the browser
  - `agora-rtm-sdk`: Receives live transcript messages from the agent
  - `agora-token`: Generates RTC and RTM tokens
- Write your Agora credentials to `.env.local`. The `NEXT_PUBLIC_` prefix makes `AGORA_APP_ID` available in the browser, which is required to join the RTC channel. Never apply this prefix to `APP_CERTIFICATE`; it must remain server-side only.
For Python, the backend and frontend are separate apps in a single monorepo.
- Scaffold the monorepo, create a Python virtual environment, and install the required packages:
  - `agora-agent-server-sdk`: Starts and stops agent sessions from the backend
  - `agora-token-builder`: Generates RTC and RTM tokens
  - `agora-rtc-sdk-ng`: Handles mic capture and RTC channel joining in the browser
  - `agora-rtm-sdk`: Receives live transcript messages from the agent
- Write your Agora credentials to the backend and frontend environment files. The backend reads `APP_ID` and `APP_CERTIFICATE` directly. The frontend only receives `APP_ID` through `NEXT_PUBLIC_AGORA_APP_ID`. The certificate never leaves `server-python/`.
The Go backend walkthrough is not yet available. It will mirror the Python path: a standalone backend on port 8000, a Next.js web client on port 3000, and the same three HTTP endpoints.
In the meantime, use the Voice AI quickstart for a Go agent walkthrough that runs as a single process without a frontend.
Build the backend
The backend exposes three endpoints, one for each operation the frontend needs.
- TypeScript
- Python
- Go
Generate tokens
Endpoint: POST /api/token
This endpoint builds an RTC token and an RTM token for the browser client. It is the only place the App Certificate is used.
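Here is a minimal sketch of the route handler in `app/api/token/route.ts`. The `{ channel, uid }` request shape and the one-hour expiry are assumptions for illustration; verify the builder signatures against the `agora-token` version you installed.

```typescript
// app/api/token/route.ts — a minimal sketch of the token endpoint.
// Assumes the agora-token builder signatures shown here; check the package
// reference for the exact parameter order in the version you install.
import { NextResponse } from 'next/server';
import { RtcTokenBuilder, RtcRole, RtmTokenBuilder } from 'agora-token';

export async function POST(request: Request) {
  const { channel, uid } = await request.json(); // assumed request body shape

  const appId = process.env.NEXT_PUBLIC_AGORA_APP_ID!;
  const appCertificate = process.env.APP_CERTIFICATE!; // server-side only
  const expireSeconds = 3600; // 1 hour; regenerate on every start()

  // RTC token lets the browser join the channel and publish audio.
  const rtcToken = RtcTokenBuilder.buildTokenWithUid(
    appId,
    appCertificate,
    channel,
    uid,
    RtcRole.PUBLISHER,
    expireSeconds,
    expireSeconds
  );

  // RTM token lets the browser receive transcript messages.
  const rtmToken = RtmTokenBuilder.buildToken(appId, appCertificate, String(uid), expireSeconds);

  return NextResponse.json({ rtcToken, rtmToken });
}
```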
Start an agent session
Endpoint: POST /api/invite-agent
This endpoint uses the Agent Server SDK to configure an agent and start a session. The STT, LLM, and TTS configurations use Agora-managed presets and therefore do not require an apiKey.
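The sketch below shows the shape of `app/api/invite-agent/route.ts`. The `AgentServerClient`, `startAgent`, and config-object details are placeholders, not the SDK's confirmed API; substitute the constructor, method, and preset configuration from the Agent Server SDK reference. The `enable_rtm` and `data_channel` values follow the settings referenced in the troubleshooting table.

```typescript
// app/api/invite-agent/route.ts — a sketch of the agent start endpoint.
// `AgentServerClient` and `startAgent` are placeholder names; substitute the
// actual constructor and method exposed by agora-agent-server-sdk.
import { NextResponse } from 'next/server';
import { AgentServerClient } from 'agora-agent-server-sdk'; // hypothetical import name

const client = new AgentServerClient({
  appId: process.env.NEXT_PUBLIC_AGORA_APP_ID!,
  appCertificate: process.env.APP_CERTIFICATE!,
});

export async function POST(request: Request) {
  const { channel, agentUid } = await request.json(); // assumed request body shape

  // Agora-managed STT, LLM, and TTS presets, so no apiKey is needed here.
  const agent = await client.startAgent({
    channel,
    uid: agentUid,
    advancedFeatures: { enable_rtm: true },               // transcripts over RTM
    parameters: { data_channel: 'rtm', enable_aec: true },
    // STT / LLM / TTS preset configuration goes here (see the SDK reference).
  });

  return NextResponse.json({ agentId: agent.agentId });
}
```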
Stop an agent session
Endpoint: POST /api/stop-conversation
This endpoint stops a running agent session by ID.
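A matching sketch for `app/api/stop-conversation/route.ts`, again with `AgentServerClient` and `stopAgent` as placeholder names for the real SDK calls:

```typescript
// app/api/stop-conversation/route.ts — a sketch of the agent stop endpoint.
// `stopAgent` is a placeholder; use the actual stop method from the SDK.
import { NextResponse } from 'next/server';
import { AgentServerClient } from 'agora-agent-server-sdk'; // hypothetical import name

const client = new AgentServerClient({
  appId: process.env.NEXT_PUBLIC_AGORA_APP_ID!,
  appCertificate: process.env.APP_CERTIFICATE!,
});

export async function POST(request: Request) {
  const { agentId } = await request.json();
  await client.stopAgent(agentId);
  return NextResponse.json({ stopped: true });
}
```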
All backend code lives in server-python/main.py.
Generate tokens
Endpoint: POST /api/token
This endpoint generates an RTC token and an RTM token for the browser client. It is the only place the App Certificate is used.
Start an agent session
Endpoint: POST /api/invite-agent
This endpoint uses the Agent Server SDK to configure an agent and start a session in the
caller's channel. The STT, LLM, and TTS configurations use Agora-managed presets and
therefore do not require an api_key.
Add the following to main.py:
Stop an agent session
Endpoint: POST /api/stop-conversation
This endpoint stops a running agent session by ID.
Add the following to main.py:
To start the backend:
Swagger docs are available at http://localhost:8000/docs.
The Go backend walkthrough is not yet available. It will mirror the Python path: a standalone backend on port 8000, a Next.js web client on port 3000, and the same three HTTP endpoints. In the meantime, see the Voice AI quickstart guide to get started with Go.
Build the frontend
The frontend is the same Next.js app for all three backends. The only difference is whether it calls its own API routes (TypeScript) or a separate backend on port 8000 (Python and Go).
A basic API client
Create lib/api.ts to give the frontend a single place to manage the backend URL and
endpoint calls.
For TypeScript, NEXT_PUBLIC_BACKEND_URL is not set in .env.local, so calls go
to same-origin routes like /api/token. For Python and Go, it is set to
http://localhost:8000 in the scaffold step, so calls go to the external backend.
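A minimal sketch of `lib/api.ts` under these assumptions: the backend accepts JSON bodies and returns the fields shown in the backend section (`rtcToken`, `rtmToken`, `agentId`, `stopped`). Adjust the shapes to match your actual endpoints.

```typescript
// lib/api.ts — a minimal sketch of the frontend API client.
// The request/response shapes mirror the backend sketches above and are
// assumptions; adjust them to whatever your endpoints actually return.
const BASE = process.env.NEXT_PUBLIC_BACKEND_URL ?? ''; // '' = same-origin /api routes

async function post<T>(path: string, body: unknown): Promise<T> {
  const res = await fetch(`${BASE}${path}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${path} failed: ${res.status}`);
  return res.json();
}

export const fetchToken = (channel: string, uid: number) =>
  post<{ rtcToken: string; rtmToken: string }>('/api/token', { channel, uid });

export const inviteAgent = (channel: string, agentUid: number) =>
  post<{ agentId: string }>('/api/invite-agent', { channel, agentUid });

export const stopConversation = (agentId: string) =>
  post<{ stopped: boolean }>('/api/stop-conversation', { agentId });
```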
Create the RTC and RTM hook
Create hooks/useConvoAgent.ts. This hook joins the RTC channel for audio, connects to
RTM for transcripts, and exposes a start() and stop() function to the UI.
The hook follows the same structure as components/ConversationComponent.tsx in the
Next.js starter repo.
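Below is a condensed sketch of the hook. It assumes the RTM 2.x client API, a JSON transcript payload with `text` and `final` fields, and an agent UID of `uid + 1`; none of these are fixed by the SDKs, so check the starter repo's component for the exact message decoding and UID scheme.

```typescript
// hooks/useConvoAgent.ts — a condensed sketch of the conversation hook.
// The RTM message payload shape (`text`, `final`) is an assumption; decode it
// according to the transcript format the agent actually sends.
'use client';
import { useCallback, useRef, useState } from 'react';
import AgoraRTC, { IAgoraRTCClient, IMicrophoneAudioTrack } from 'agora-rtc-sdk-ng';
import AgoraRTM from 'agora-rtm-sdk';
import { fetchToken, inviteAgent, stopConversation } from '@/lib/api';

export type TranscriptLine = { text: string; final: boolean };

export function useConvoAgent(channel: string, uid: number) {
  const [transcripts, setTranscripts] = useState<TranscriptLine[]>([]);
  const rtcRef = useRef<IAgoraRTCClient | null>(null);
  const micRef = useRef<IMicrophoneAudioTrack | null>(null);
  const agentIdRef = useRef<string | null>(null);

  const start = useCallback(async () => {
    const appId = process.env.NEXT_PUBLIC_AGORA_APP_ID!;
    const { rtcToken, rtmToken } = await fetchToken(channel, uid);

    // 1. Join RTC and publish the microphone.
    const rtc = AgoraRTC.createClient({ mode: 'rtc', codec: 'vp8' });
    rtc.on('user-published', async (user, mediaType) => {
      await rtc.subscribe(user, mediaType);
      if (mediaType === 'audio') user.audioTrack?.play(); // hear the agent
    });
    await rtc.join(appId, channel, rtcToken, uid);
    const mic = await AgoraRTC.createMicrophoneAudioTrack();
    await rtc.publish([mic]);
    rtcRef.current = rtc;
    micRef.current = mic;

    // 2. Connect RTM and collect transcript messages.
    const rtm = new AgoraRTM.RTM(appId, String(uid));
    rtm.addEventListener('message', (event) => {
      const msg = JSON.parse(event.message as string); // assumed JSON transcript payload
      setTranscripts((prev) => [...prev, { text: msg.text, final: msg.final }]);
    });
    await rtm.login({ token: rtmToken });
    await rtm.subscribe(channel);

    // 3. Bring the agent into the channel. `uid + 1` is an arbitrary agent UID.
    const { agentId } = await inviteAgent(channel, uid + 1);
    agentIdRef.current = agentId;
  }, [channel, uid]);

  const stop = useCallback(async () => {
    if (agentIdRef.current) await stopConversation(agentIdRef.current);
    micRef.current?.close();
    await rtcRef.current?.leave();
    agentIdRef.current = null;
  }, []);

  return { transcripts, start, stop };
}
```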
Build the client UI
Create app/page.tsx as the main UI. It renders a start/stop button and a live
transcript list.
The page has no state library or design system. It provides just enough UI to verify that the backend is working.
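A minimal sketch of the page, with a hard-coded channel name and UID for illustration:

```typescript
// app/page.tsx — a minimal sketch of the client UI.
// The channel name and UID are hard-coded for illustration only.
'use client';
import { useState } from 'react';
import { useConvoAgent } from '@/hooks/useConvoAgent';

export default function Home() {
  const [running, setRunning] = useState(false);
  const { transcripts, start, stop } = useConvoAgent('demo-channel', 1234);

  const toggle = async () => {
    if (running) {
      await stop();
    } else {
      await start(); // also triggers the browser microphone prompt
    }
    setRunning(!running);
  };

  return (
    <main>
      <button onClick={toggle}>
        {running ? 'Stop conversation' : 'Start conversation'}
      </button>
      <ul>
        {transcripts.map((line, i) => (
          <li key={i}>
            {line.text} {line.final ? '' : '…'}
          </li>
        ))}
      </ul>
    </main>
  );
}
```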
Handle page unload (optional)
Add this effect inside app/page.tsx to stop the agent cleanly when the user closes
the tab, rather than waiting for the 30-second idle timeout.
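A sketch of the effect, assuming the `stop` function returned by the hook above (add `useEffect` to the React imports in `app/page.tsx`):

```typescript
// Inside the Home component in app/page.tsx — stop the agent when the tab closes.
// `stop` comes from the useConvoAgent hook sketched above.
useEffect(() => {
  const handleUnload = () => {
    // Fire-and-forget; the browser will not wait for the promise during unload.
    void stop();
  };
  window.addEventListener('beforeunload', handleUnload);
  return () => window.removeEventListener('beforeunload', handleUnload);
}, [stop]);
```

Browsers do not wait for asynchronous work during unload, so treat this as best-effort; the 30-second idle timeout remains the fallback.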
Test and validate
Start the app and verify that the agent joins, responds, and stops cleanly.
Run the app
- TypeScript
- Python
- Go
Open http://localhost:3000, click Start conversation, allow microphone access, and speak.
Start the backend and frontend in two separate terminals:
Open http://localhost:3000.
The Go walkthrough is not yet available. See the Voice AI quickstart guide in the meantime.
Verify the integration
A healthy run passes all three checks:
| Check | How to verify | Time budget |
|---|---|---|
| Agent joined the channel | The invite-agent response resolves with an agentId, and the agent emits a greeting in RTC within two seconds. | < 2 s |
| Transcripts stream | transcripts state updates as you speak; partial lines are marked final: false. | < 500 ms partial latency |
| Stop is clean | After Stop, the backend returns { stopped: true }, and the Convo AI engine logs STATE=STOPPED, reason=API. | Immediate |
If you run into problems, first run the CLI diagnostic:
This checks for credential errors, feature-enablement issues, and network reachability problems.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Error: /api/token failed: 500 in the browser | Backend cannot read the APP_CERTIFICATE environment variable. | Confirm .env.local (TypeScript) or server-python/.env (Python) contains the variable and that the server loaded the file on startup. |
| invalid token from the RTC join | Clock skew between token generation and channel join. | RTC tokens are time-sensitive. Regenerate a token on each start() call to avoid expiry issues. |
| Agent never speaks but agentId is returned | Conversational AI feature not enabled on the Agora project. | Run agora project feature list. If convoai is missing, rerun agora project create --feature convoai or enable it in the Agora Console. |
| No transcripts in RTM | enable_rtm not set, or data_channel is set to stream instead of rtm. | Confirm advancedFeatures.enable_rtm: true and parameters.data_channel: 'rtm' in the agent config. |
| CORS error in the browser (Python) | FastAPI CORS middleware does not include your frontend origin. | Add http://localhost:3000 to allow_origins in main.py. |
| Agent greets itself in a loop | No echo cancellation on the device. | Use headphones, or set parameters.enable_aec: true. |
| unauthorized error on agora login in CI | SSO browser flow cannot open on a headless machine. | Use agora login --device for the device-code flow. |
| Chrome blocks microphone access | getUserMedia is not available on non-localhost HTTP origins. | Test on http://localhost:3000 exactly, not http://127.0.0.1 or a LAN IP. |
Next steps
Now that you have a working agent, explore the following topics:
- Integrate an MLLM: Replace the cascading STT → LLM → TTS pipeline with a single realtime model.
- Transmit custom information: Guide the agent with user-specific context to personalize responses.
- Integrate short-term memory: Help the agent maintain context across a conversation.
- Receive webhook notifications: Receive agent event notifications in real time.
- Use filler words: Reduce perceived latency by filling silence during LLM processing.
- Optimize conversation latency: Tune LLM, ASR, and TTS components for lower end-to-end latency.