Optimize conversation latency

Latency is a key factor in the user experience of conversational AI agents. This guide helps you understand where conversation latency comes from and how to reduce it.

How latency works

Understanding latency composition helps you identify optimization opportunities.

Cascaded architecture latency

When using the cascaded architecture with ASR, LLM, and TTS components, end-to-end latency consists of the following:


End-to-end latency = RTC latency
                   + Algorithm preprocessing latency
                   + ASR latency
                   + LLM latency
                   + TTS latency
                   (+ Avatar latency)

Latency for each component

The following table shows typical latency ranges for each component based on actual test data:

| Component | Latency metric | Description | Typical latency range (ms) |
|-----------|----------------|-------------|----------------------------|
| RTC | Audio and video latency | Includes audio capture, encoding, network transmission, decoding, and playback | 150-300 |
| Algorithm preprocessing | Preprocessing latency | Includes VAD (Voice Activity Detection), intelligent interruption handling (AIVAD), and other algorithm processing time | 720-940* |
| ASR | asr_ttlw | Time To Last Word: the latency from when the user stops speaking to when ASR outputs the last word | 400-700 |
| LLM | llm_ttfb / llm_ttfs | TTFB (Time To First Byte): first-byte latency. TTFS (Time To First Sentence): first-sentence latency | 250-1000 |
| TTS | tts_ttfb | Time To First Byte: the latency from when the TTS request starts to when the first byte is received | 100-350 |
| Avatar rendering | Rendering latency | The latency from when the avatar module receives the first frame of TTS audio to when it generates and synchronizes the first frame of audio and video (if enabled) | 50-200 |

* Algorithm preprocessing latency is based on silence_duration_ms set to 640 ms. Adjusting this parameter affects preprocessing latency.
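
For example, taking rough mid-range values from the table above (illustrative figures, not measured targets), a single cascaded turn without an avatar adds up as follows:

End-to-end latency ≈ 200 (RTC) + 800 (preprocessing) + 500 (ASR) + 600 (LLM) + 200 (TTS)
                   ≈ 2300 ms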

info

Test data shows that LLM typically contributes the most to overall latency. Optimizing LLM selection and configuration is key to reducing end-to-end latency.

Real-world latency example

The following example shows latency data from three turns of an actual conversation. The data comes from the agent metrics event (eventType 111) in Agora's Message Notification Service:


{
  "metrics": [
    {
      "turn_id": 1,
      "tts_ttfb": 61
    },
    {
      "turn_id": 2,
      "asr_ttlw": 141,
      "llm_ttfb": 270,
      "llm_ttfs": 482,
      "tts_ttfb": 90
    },
    {
      "turn_id": 3,
      "asr_ttlw": 103,
      "llm_ttfb": 306,
      "llm_ttfs": 948,
      "tts_ttfb": 106
    }
  ]
}

Monitor latency metrics

The Conversational AI engine provides two ways to monitor latency metrics for each conversation turn:

Use client components

If you use client components (Android, iOS, or Web), you can listen to agent performance metrics in real time by registering the onAgentMetrics callback.

api.addHandler(object : IConversationalAIAPIEventHandler {
    override fun onAgentMetrics(agentUserId: String, metric: Metric) {
        when (metric.type) {
            ModuleType.ASR -> {
                Log.d("Metrics", "ASR TTLW: ${metric.value}ms")
            }
            ModuleType.LLM -> {
                // metric.name can be "ttfb" or "ttfs"
                Log.d("Metrics", "LLM ${metric.name}: ${metric.value}ms")
            }
            ModuleType.TTS -> {
                Log.d("Metrics", "TTS TTFB: ${metric.value}ms")
            }
            ModuleType.TOTAL -> {
                Log.d("Metrics", "Total Delay: ${metric.value}ms")
            }
            else -> {
                Log.d("Metrics", "${metric.type}: ${metric.name} = ${metric.value}ms")
            }
        }
    }
})

For detailed integration steps and API reference, see Receive webhook notifications.

Use Message Notification Service

If you have enabled Agora's Message Notification Service, you can obtain agent performance metrics by receiving the agent metrics event where eventType is 111.

Event callback example


{
  "noticeId": "2000001428:4330:107",
  "productId": 17,
  "eventType": 111,
  "notifyMs": 1611566412672,
  "payload": {
    "agent_id": "A42AC47Hxxxxxxxx4PK27ND25E",
    "start_ts": 1000,
    "stop_ts": 1672531200,
    "channel": "test-channel",
    "metrics": [
      {
        "turn_id": 1,
        "tts_ttfb": 61
      },
      {
        "turn_id": 2,
        "asr_ttlw": 141,
        "llm_ttfb": 270,
        "llm_ttfs": 482,
        "tts_ttfb": 90
      }
    ]
  }
}

For detailed event field descriptions, see Notification event 111 agent metrics.
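
If you process these notifications on your own server, a handler only needs to filter on eventType 111 and read the per-turn fields shown above. The following Kotlin sketch is a minimal illustration that assumes the org.json library and the payload structure from the example; signature verification and HTTP plumbing are omitted:

import org.json.JSONObject

// Minimal sketch: parse an NCS callback body and log per-turn latency metrics.
// Assumes the payload structure shown above; verification and transport are omitted.
fun handleAgentMetricsNotification(body: String) {
    val notice = JSONObject(body)
    if (notice.optInt("eventType") != 111) return // only agent metrics events

    val payload = notice.optJSONObject("payload") ?: return
    val metrics = payload.optJSONArray("metrics") ?: return
    for (i in 0 until metrics.length()) {
        val turn = metrics.getJSONObject(i)
        // A given turn may omit some metrics (for example, turn 1 above reports only tts_ttfb).
        val turnId = turn.optInt("turn_id")
        val asrTtlw = turn.optInt("asr_ttlw", -1)
        val llmTtfs = turn.optInt("llm_ttfs", -1)
        val ttsTtfb = turn.optInt("tts_ttfb", -1)
        println("turn=$turnId asr_ttlw=${asrTtlw}ms llm_ttfs=${llmTtfs}ms tts_ttfb=${ttsTtfb}ms")
    }
}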

Optimize cascaded architecture latency

To reduce latency in a cascaded architecture, focus on optimizing the individual components, geographic deployment, and RTC settings.

Optimize LLM, ASR, and TTS components

LLM is typically the component that contributes the most to latency. Optimizing LLM can significantly reduce overall latency.

Choose low-latency vendors

LLM, ASR, and TTS vendors and models vary significantly in response speed. Refer to the Conversational AI Performance Lab to compare performance metrics across vendors.
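
If you also want to sanity-check a vendor's latency from your own deployment region, you can time the first byte of a streaming response directly. The following Kotlin sketch uses OkHttp; the endpoint URL, API key, and request body are placeholders for whichever LLM you are evaluating, not Agora APIs:

import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.RequestBody.Companion.toRequestBody

// Rough sketch for measuring LLM first-byte latency (TTFB) against a candidate vendor.
// The endpoint, API key, and request body are placeholders supplied by the caller.
fun measureLlmTtfbMs(url: String, apiKey: String, requestJson: String): Long {
    val client = OkHttpClient()
    val request = Request.Builder()
        .url(url)
        .header("Authorization", "Bearer $apiKey")
        .post(requestJson.toRequestBody("application/json".toMediaType()))
        .build()

    val start = System.nanoTime()
    client.newCall(request).execute().use { response ->
        // Reading the first byte of the (streaming) body approximates time to first byte.
        response.body?.source()?.readByte()
        return (System.nanoTime() - start) / 1_000_000
    }
}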

Optimize parameter configuration

When creating an agent, read the vendor documentation for ASR, LLM, and TTS to understand available parameters and tune them for your use case. The following are general optimization approaches:

  • LLM
    • Choose smaller models: Models such as gpt-4o-mini and gemini-2.5-flash typically respond faster than larger models.
    • Limit max_tokens: Reducing the maximum number of generated tokens shortens the total response time, especially for long answers.
    • Enable streaming response: Ensure stream: true so the agent can start speaking as soon as possible.
  • ASR
    • Use vendor-recommended sampling rate: Use the sampling rate recommended by the vendor, such as 16 kHz, to avoid unnecessary resampling.
    • Constrain the recognition vocabulary: Use the phrases or context parameter (where the vendor supports it) to provide domain-specific vocabulary, which can improve recognition accuracy and speed.
    • Disable non-essential features: Some ASR vendors provide advanced parameters such as punctuation and tone output. You can disable these based on your use case to improve response speed.
  • TTS
    • Choose faster modes: Some TTS providers offer modes such as turbo or low-latency, which typically respond faster than the default mode.
    • Choose simpler voices: Some TTS providers offer voices with varying complexity. Choosing less complex voices can reduce generation time.
    • Disable non-essential features: Some TTS vendors provide advanced parameters such as profanity_filter, punctuation_filter, and diarization. You can disable these based on your use case to improve response speed.

The following example shows how to optimize LLM parameters:


{
  "properties": {
    "llm": {
      "url": "https://api.openai.com/v1/chat/completions",
      "api_key": "your_api_key",
      "params": {
        "model": "gpt-4o-mini",  // Select the faster-responding model
        "temperature": 0.7,
        "max_tokens": 150,       // Limit the generation length to reduce latency
        "stream": true           // Enable streaming response
      }
    }
  }
}

Optimize RTC latency

RTC latency includes audio capture, encoding, network transmission, decoding, and playback. To optimize RTC latency, configure audio settings on the client side.

Use AI conversation scenario

Agora RTC SDK 4.5.1 and later supports the AI conversation scenario (AUDIO_SCENARIO_AI_CLIENT), which is specifically optimized for AI conversations and includes:

  • Optimized audio 3A algorithms (echo cancellation, noise reduction, and gain control)
  • Lower audio capture and playback latency
  • Audio processing tailored to AI voice characteristics

Use the client-side component API (recommended)

val config = ConversationalAIAPIConfig(
    rtcEngine = rtcEngineInstance,
    rtmClient = rtmClientInstance,
    enableLog = true
)
val api = ConversationalAIAPIImpl(config)

// Load optimal audio settings
api.loadAudioSettings()

Configure the RTC SDK directly

val config = RtcEngineConfig()
config.mAudioScenario = Constants.AUDIO_SCENARIO_AI_CLIENT
rtcEngine = RtcEngine.create(config)

For detailed audio setting optimization, see Optimize audio.

Latency optimization checklist

Use the following checklist to systematically optimize your conversational AI agent latency:

Server-side optimization

  • Choose low-latency LLM models: Refer to Conversational AI Performance Lab to select models with excellent TTFT and throughput performance.
  • Enable streaming response: Ensure stream: true.
  • Choose low-latency ASR and TTS vendors: Enable low-latency or turbo modes when available. Refer to Conversational AI Performance Lab to select models with low latency and high throughput performance.
  • Optimize geographic deployment: Deploy ASR, LLM, and TTS in the same region.
  • Configure regional access restrictions: Use the geofence parameter to lock in the optimal region.

Client-side optimization

  • Use AI conversation scenario: Set AUDIO_SCENARIO_AI_CLIENT (RTC SDK 4.5.1 and later).
  • Load optimal audio settings: Call the loadAudioSettings() method of the client component.
  • Integrate required audio plugins: Ensure integration of AI noise reduction and AI echo cancellation plugins.
  • Optimize network conditions: Ensure stable user network connection and consider using SD-RTN™ to optimize network transmission.

Monitoring and analysis

  • Monitor latency metrics in real time: Obtain latency data for each conversation turn through client components or NCS.
  • Identify latency bottlenecks: Analyze which component contributes the most to latency (see the sketch after this checklist).
  • Continuously optimize: Adjust configuration based on actual data and conduct A/B testing.
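
As one way to analyze bottlenecks on the client, the following sketch accumulates the per-turn metrics delivered to onAgentMetrics and reports which module currently has the highest average latency. It assumes the Metric and ModuleType types from the client component example above and that metric.value is a numeric millisecond value:

// A minimal sketch for spotting the dominant latency contributor on the client.
// Assumes the Metric and ModuleType types from the client component example above.
class LatencyTracker {
    private val samples = mutableMapOf<ModuleType, MutableList<Double>>()

    // Call this from your onAgentMetrics handler for every metric received.
    fun record(metric: Metric) {
        samples.getOrPut(metric.type) { mutableListOf() }.add(metric.value.toDouble())
    }

    // Returns the module with the highest average latency observed so far, if any.
    fun bottleneck(): Pair<ModuleType, Double>? =
        samples.mapValues { (_, values) -> values.average() }
            .entries
            .maxByOrNull { it.value }
            ?.toPair()
}

Logging bottleneck() periodically during test sessions shows you which component to optimize first.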

Balance latency and quality

When optimizing latency, find a balance between response speed and conversation quality:

| Optimization strategy | Latency impact | Quality impact | Recommended scenario |
|-----------------------|----------------|----------------|----------------------|
| Use smaller LLM models | ✅ Significantly reduces | ⚠️ May reduce | Latency-sensitive scenarios with relatively simple conversations |
| Limit max_tokens | ✅ Moderately reduces | ⚠️ May affect completeness | Scenarios requiring short responses |
| Regional access restrictions | ✅ Moderately reduces | No impact | Users concentrated in a specific region |
| Optimize RTC settings | ✅ Moderately reduces | No impact | All scenarios |
tip

Monitor latency metrics to identify bottlenecks, then optimize the highest-contributing component rather than pursuing minimal latency across all components.

References

Refer to the following resources for further details.