Optimize conversation latency

Latency is a key factor in the user experience of conversational AI agents. This guide helps you understand where conversation latency comes from and how to reduce it.

How latency works

Understanding latency composition helps you identify optimization opportunities.

Cascaded architecture latency

When using the cascaded architecture with ASR, LLM, and TTS components, end-to-end latency consists of the following:


End-to-end latency = RTC latency
                   + Algorithm preprocessing latency
                   + ASR latency
                   + LLM latency
                   + TTS latency
                   (+ Avatar latency)

Latency for each component

The following table shows typical latency ranges for each component based on actual test data:

| Component | Latency metric | Description | Typical latency range (ms) |
|-----------|----------------|-------------|----------------------------|
| RTC | Audio and video latency | Includes audio capture, encoding, network transmission, decoding, and playback | 150-300 |
| Algorithm preprocessing | Preprocessing latency | Includes VAD (Voice Activity Detection), intelligent interruption handling (AIVAD), and other algorithm processing time | 720-940* |
| ASR | asr_ttlw | Time To Last Word: the latency from when the user stops speaking to when ASR outputs the last word | 400-700 |
| LLM | llm_ttfb / llm_ttfs | TTFB (Time To First Byte): first-byte latency. TTFS (Time To First Sentence): first-sentence latency | 250-1000 |
| TTS | tts_ttfb | Time To First Byte: the latency from when the TTS request starts to when the first byte is received | 100-350 |
| Avatar rendering | Rendering latency | The latency from when the avatar module receives the first frame of TTS audio to when it generates and synchronizes the first frame of audio and video (if enabled) | 50-200 |

* Algorithm preprocessing latency is based on silence_duration_ms set to 640 ms. Adjusting this parameter affects preprocessing latency.
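
For example, taking rough mid-range values from the table above (illustrative figures, not measured targets), a single cascaded turn without an avatar adds up as follows:

End-to-end latency ≈ 200 (RTC) + 800 (preprocessing) + 500 (ASR) + 600 (LLM) + 200 (TTS)
                   ≈ 2300 ms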

info

Test data shows that LLM typically contributes the most to overall latency. Optimizing LLM selection and configuration is key to reducing end-to-end latency.

Real-world latency example

The following example shows latency data from three turns of an actual conversation. The data comes from the agent metrics event (eventType 111) in Agora's Message Notification Service:


{
  "metrics": [
    {
      "turn_id": 1,
      "tts_ttfb": 61
    },
    {
      "turn_id": 2,
      "asr_ttlw": 141,
      "llm_ttfb": 270,
      "llm_ttfs": 482,
      "tts_ttfb": 90
    },
    {
      "turn_id": 3,
      "asr_ttlw": 103,
      "llm_ttfb": 306,
      "llm_ttfs": 948,
      "tts_ttfb": 106
    }
  ]
}

Monitor latency metrics

The Conversational AI engine provides two ways to monitor latency metrics for each conversation turn:

Use client components

If you use client components (Android, iOS, or Web), you can listen to agent performance metrics in real time by registering the onAgentMetrics callback.

api.addHandler(object : IConversationalAIAPIEventHandler {
    override fun onAgentMetrics(agentUserId: String, metric: Metric) {
        when (metric.type) {
            ModuleType.ASR -> {
                Log.d("Metrics", "ASR TTLW: ${metric.value}ms")
            }
            ModuleType.LLM -> {
                // metric.name can be "ttfb" or "ttfs"
                Log.d("Metrics", "LLM ${metric.name}: ${metric.value}ms")
            }
            ModuleType.TTS -> {
                Log.d("Metrics", "TTS TTFB: ${metric.value}ms")
            }
            ModuleType.TOTAL -> {
                Log.d("Metrics", "Total Delay: ${metric.value}ms")
            }
            else -> {
                Log.d("Metrics", "${metric.type}: ${metric.name} = ${metric.value}ms")
            }
        }
    }
})

For detailed integration steps and API reference, see Receive webhook notifications.

Use Message Notification Service

If you have enabled Agora's Message Notification Service, you can obtain agent performance metrics by receiving the agent metrics event where eventType is 111.

Event callback example


{
  "noticeId": "2000001428:4330:107",
  "productId": 17,
  "eventType": 111,
  "notifyMs": 1611566412672,
  "payload": {
    "agent_id": "A42AC47Hxxxxxxxx4PK27ND25E",
    "start_ts": 1000,
    "stop_ts": 1672531200,
    "channel": "test-channel",
    "metrics": [
      {
        "turn_id": 1,
        "tts_ttfb": 61
      },
      {
        "turn_id": 2,
        "asr_ttlw": 141,
        "llm_ttfb": 270,
        "llm_ttfs": 482,
        "tts_ttfb": 90
      }
    ]
  }
}

For detailed event field descriptions, see Notification event 111 agent metrics.
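
If you process these notifications on your own server, a handler only needs to filter on eventType 111 and read the per-turn fields shown above. The following Kotlin sketch is a minimal illustration that assumes the org.json library and the payload structure from the example; signature verification and HTTP plumbing are omitted:

import org.json.JSONObject

// Minimal sketch: parse an NCS callback body and log per-turn latency metrics.
// Assumes the payload structure shown above; verification and transport are omitted.
fun handleAgentMetricsNotification(body: String) {
    val notice = JSONObject(body)
    if (notice.optInt("eventType") != 111) return // only agent metrics events

    val payload = notice.optJSONObject("payload") ?: return
    val metrics = payload.optJSONArray("metrics") ?: return
    for (i in 0 until metrics.length()) {
        val turn = metrics.getJSONObject(i)
        // A given turn may omit some metrics (for example, turn 1 above reports only tts_ttfb).
        val turnId = turn.optInt("turn_id")
        val asrTtlw = turn.optInt("asr_ttlw", -1)
        val llmTtfs = turn.optInt("llm_ttfs", -1)
        val ttsTtfb = turn.optInt("tts_ttfb", -1)
        println("turn=$turnId asr_ttlw=${asrTtlw}ms llm_ttfs=${llmTtfs}ms tts_ttfb=${ttsTtfb}ms")
    }
}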

Optimize cascaded architecture latency

To reduce latency in a cascaded architecture, focus on optimizing the individual components, geographic deployment, and RTC settings.

Optimize LLM, ASR, and TTS components

LLM is typically the component that contributes the most to latency. Optimizing LLM can significantly reduce overall latency.

Choose low-latency vendors

LLM, ASR, and TTS vendors and models vary significantly in response speed. Refer to the Conversational AI Performance Lab to compare performance metrics across vendors.
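
If you also want to sanity-check a vendor's latency from your own deployment region, you can time the first byte of a streaming response directly. The following Kotlin sketch uses OkHttp; the endpoint URL, API key, and request body are placeholders for whichever LLM you are evaluating, not Agora APIs:

import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.RequestBody.Companion.toRequestBody

// Rough sketch for measuring LLM first-byte latency (TTFB) against a candidate vendor.
// The endpoint, API key, and request body are placeholders supplied by the caller.
fun measureLlmTtfbMs(url: String, apiKey: String, requestJson: String): Long {
    val client = OkHttpClient()
    val request = Request.Builder()
        .url(url)
        .header("Authorization", "Bearer $apiKey")
        .post(requestJson.toRequestBody("application/json".toMediaType()))
        .build()

    val start = System.nanoTime()
    client.newCall(request).execute().use { response ->
        // Reading the first byte of the (streaming) body approximates time to first byte.
        response.body?.source()?.readByte()
        return (System.nanoTime() - start) / 1_000_000
    }
}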

Optimize parameter configuration

When creating an agent, read the vendor documentation for ASR, LLM, and TTS to understand available parameters and tune them for your use case. The following are general optimization approaches:

  • LLM
    • Choose smaller models: Models such as gpt-4o-mini and gemini-2.5-flash typically respond faster than larger models.
    • Limit max_tokens: Reducing the maximum number of generated tokens shortens the total response time, especially for long answers.
    • Enable streaming response: Ensure stream: true so the agent can start speaking as soon as possible.
  • ASR
    • Use vendor-recommended sampling rate: Use the sampling rate recommended by the vendor, such as 16 kHz, to avoid unnecessary resampling.
    • Constrain the recognition vocabulary: Use the phrases or context parameter (where the vendor supports it) to provide domain-specific vocabulary, which can improve recognition accuracy and speed.
    • Disable non-essential features: Some ASR vendors provide advanced parameters such as punctuation and tone output. You can disable these based on your use case to improve response speed.
  • TTS
    • Choose faster modes: Some TTS providers offer modes such as turbo or low-latency, which typically respond faster than the default mode.
    • Choose simpler voices: Some TTS providers offer voices with varying complexity. Choosing less complex voices can reduce generation time.
    • Disable non-essential features: Some TTS vendors provide advanced parameters such as profanity_filter, punctuation_filter, and diarization. You can disable these based on your use case to improve response speed.

The following example shows how to optimize LLM parameters:


{
  "properties": {
    "llm": {
      "url": "https://api.openai.com/v1/chat/completions",
      "api_key": "your_api_key",
      "params": {
        "model": "gpt-4o-mini",  // Select the faster-responding model
        "temperature": 0.7,
        "max_tokens": 150,       // Limit the generation length to reduce latency
        "stream": true           // Enable streaming response
      }
    }
  }
}

Optimize RTC latency

RTC latency includes audio capture, encoding, network transmission, decoding, and playback. To optimize RTC latency, configure audio settings on the client side.

Use AI conversation scenario

Agora RTC SDK 4.5.1 and later supports the AI conversation scenario (AUDIO_SCENARIO_AI_CLIENT), which is specifically optimized for AI conversations and includes:

  • Optimized audio 3A algorithms (echo cancellation, noise reduction, and gain control)
  • Lower audio capture and playback latency
  • Audio processing tailored to AI voice characteristics

Use the client-side component API (recommended)

val config = ConversationalAIAPIConfig(
    rtcEngine = rtcEngineInstance,
    rtmClient = rtmClientInstance,
    enableLog = true
)
val api = ConversationalAIAPIImpl(config)

// Load optimal audio settings
api.loadAudioSettings()

Configure the RTC SDK directly

val config = RtcEngineConfig()
config.mAudioScenario = Constants.AUDIO_SCENARIO_AI_CLIENT
rtcEngine = RtcEngine.create(config)

For detailed audio setting optimization, see Optimize audio.

Latency optimization checklist

Use the following checklist to systematically optimize your conversational AI agent latency:

Server-side optimization

  • Choose low-latency LLM models: Refer to Conversational AI Performance Lab to select models with excellent TTFT and throughput performance.
  • Enable streaming response: Ensure stream: true.
  • Choose low-latency ASR and TTS vendors: Enable low-latency or turbo modes when available. Refer to Conversational AI Performance Lab to select models with low latency and high throughput performance.
  • Optimize geographic deployment: Deploy ASR, LLM, and TTS in the same region.
  • Configure regional access restrictions: Use the geofence parameter to lock in the optimal region.

Client-side optimization

  • Use AI conversation scenario: Set AUDIO_SCENARIO_AI_CLIENT (RTC SDK 4.5.1 and later).
  • Load optimal audio settings: Call the loadAudioSettings() method of the client component.
  • Integrate required audio plugins: Ensure integration of AI noise reduction and AI echo cancellation plugins.
  • Optimize network conditions: Ensure stable user network connection and consider using SD-RTN™ to optimize network transmission.

Monitoring and analysis

  • Monitor latency metrics in real time: Obtain latency data for each conversation turn through client components or NCS.
  • Identify latency bottlenecks: Analyze which component contributes the most to latency (see the sketch after this checklist).
  • Continuously optimize: Adjust configuration based on actual data and conduct A/B testing.
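
As one way to analyze bottlenecks on the client, the following sketch accumulates the per-turn metrics delivered to onAgentMetrics and reports which module currently has the highest average latency. It assumes the Metric and ModuleType types from the client component example above and that metric.value is a numeric millisecond value:

// A minimal sketch for spotting the dominant latency contributor on the client.
// Assumes the Metric and ModuleType types from the client component example above.
class LatencyTracker {
    private val samples = mutableMapOf<ModuleType, MutableList<Double>>()

    // Call this from your onAgentMetrics handler for every metric received.
    fun record(metric: Metric) {
        samples.getOrPut(metric.type) { mutableListOf() }.add(metric.value.toDouble())
    }

    // Returns the module with the highest average latency observed so far, if any.
    fun bottleneck(): Pair<ModuleType, Double>? =
        samples.mapValues { (_, values) -> values.average() }
            .entries
            .maxByOrNull { it.value }
            ?.toPair()
}

Logging bottleneck() periodically during test sessions shows you which component to optimize first.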

Balance latency and quality

When optimizing latency, find a balance between response speed and conversation quality:

| Optimization strategy | Latency impact | Quality impact | Recommended scenario |
|-----------------------|----------------|----------------|----------------------|
| Use smaller LLM models | ✅ Significantly reduces | ⚠️ May reduce | Latency-sensitive scenarios with relatively simple conversations |
| Limit max_tokens | ✅ Moderately reduces | ⚠️ May affect completeness | Scenarios requiring short responses |
| Regional access restrictions | ✅ Moderately reduces | No impact | Users concentrated in a specific region |
| Optimize RTC settings | ✅ Moderately reduces | No impact | All scenarios |
tip

Monitor latency metrics to identify bottlenecks, then optimize the highest-contributing component rather than pursuing minimal latency across all components.

References

Refer to the following resources for further details.