Optimize conversation latency
Latency is a key factor affecting user experience in conversational AI agent scenarios. This guide helps you understand where conversation latency comes from and how to reduce it.
How latency works
Understanding latency composition helps you identify optimization opportunities.
Cascaded architecture latency
When using the cascaded architecture with ASR, LLM, and TTS components, end-to-end latency is the sum of the latency introduced by each component in the pipeline.
Latency for each component
The following table shows typical latency ranges for each component based on actual test data:
| Component | Latency metric | Description | Typical latency range (ms) |
|---|---|---|---|
| RTC | Audio and video latency | Includes audio capture, encoding, network transmission, decoding, and playback | 150-300 |
| Algorithm preprocessing | Preprocessing latency | Includes VAD (Voice Activity Detection), intelligent interruption handling (AIVAD), and other algorithm processing time | 720-940* |
| ASR | asr_ttlw | Time To Last Word. The latency from when the user stops speaking to when ASR outputs the last word. | 400-700 |
| LLM | llm_ttfb / llm_ttfs | TTFB: Time To First Byte, the first byte latency. TTFS: Time To First Sentence, the first sentence latency. | 250-1000 |
| TTS | tts_ttfb | Time To First Byte. The response latency from when the TTS request starts to when the first byte is received. | 100-350 |
| Avatar rendering | Rendering latency | The latency from when the avatar module receives the first frame of TTS audio to when it generates and synchronizes the first frame of audio and video (if enabled) | 50-200 |
* Algorithm preprocessing latency is based on silence_duration_ms set to 640 ms. Adjusting this parameter affects preprocessing latency.
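As a rough illustration, summing the typical ranges above (excluding avatar rendering) gives an end-to-end envelope of about 150 + 720 + 400 + 250 + 100 ≈ 1,620 ms at the low end and 300 + 940 + 700 + 1,000 + 350 ≈ 3,290 ms at the high end. Actual values depend on your vendors, models, parameter settings, and network conditions.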
Test data shows that LLM typically contributes the most to overall latency. Optimizing LLM selection and configuration is key to reducing end-to-end latency.
Real-world latency example
The following example shows latency data from three conversation turns in an actual conversation. The data comes from the 111 agent metrics event in Agora's Message Notification Service.
Monitor latency metrics
The Conversational AI engine provides two ways to monitor latency metrics for each conversation turn:
Use client components
If you use client components (Android, iOS, or Web), you can listen to agent performance metrics in real time by registering the onAgentMetrics callback.
**Android**

```kotlin
api.addHandler(object : IConversationalAIAPIEventHandler {
    override fun onAgentMetrics(agentUserId: String, metric: Metric) {
        when (metric.type) {
            ModuleType.ASR -> {
                Log.d("Metrics", "ASR TTLW: ${metric.value}ms")
            }
            ModuleType.LLM -> {
                // metric.name can be "ttfb" or "ttfs"
                Log.d("Metrics", "LLM ${metric.name}: ${metric.value}ms")
            }
            ModuleType.TTS -> {
                Log.d("Metrics", "TTS TTFB: ${metric.value}ms")
            }
            ModuleType.TOTAL -> {
                Log.d("Metrics", "Total Delay: ${metric.value}ms")
            }
            else -> {
                Log.d("Metrics", "${metric.type}: ${metric.name} = ${metric.value}ms")
            }
        }
    }
})
```

**iOS**

```swift
func onAgentMetrics(agentUserId: String, metrics: Metric) {
    switch metrics.type {
    case .asr:
        print("ASR TTLW: \(metrics.value)ms")
    case .llm:
        print("LLM \(metrics.name): \(metrics.value)ms")
    case .tts:
        print("TTS TTFB: \(metrics.value)ms")
    case .total:
        print("Total Delay: \(metrics.value)ms")
    case .unknown:
        print("Unknown metric: \(metrics.name) = \(metrics.value)ms")
    }
}
```

**Web**

```typescript
conversationalAIAPI.on(
  EConversationalAIAPIEvents.AGENT_METRICS,
  (agentUserId: string, metrics: Metric) => {
    console.log(`[${metrics.type}] ${metrics.name}: ${metrics.value}ms`);
    if (metrics.type === 'TOTAL') {
      console.log(`Total delay for turn: ${metrics.value}ms`);
    }
  }
);
```

For detailed integration steps and API reference, see Receive webhook notifications.
Use Message Notification Service
If you have enabled Agora's Message Notification Service, you can obtain agent performance metrics by receiving the agent metrics event where eventType is 111.
Event callback example
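As a minimal sketch, the Node.js/Express handler below shows one way to receive and log the notification. Apart from the eventType value 111 described in this guide, the payload field names (`turnId`, `metrics`, and so on) are illustrative assumptions; consult the event reference linked below for the authoritative schema.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical webhook endpoint registered with Agora's Message Notification Service.
app.post("/agora/ncs", (req, res) => {
  const { eventType, payload } = req.body;

  // 111 is the agent metrics event described in this guide.
  if (eventType === 111) {
    // Field names below (turnId, metrics, type, name, value) are illustrative
    // assumptions; see the event 111 reference for the actual schema.
    const { turnId, metrics } = payload ?? {};
    for (const metric of metrics ?? []) {
      console.log(`turn ${turnId} [${metric.type}] ${metric.name}: ${metric.value}ms`);
    }
  }

  // Acknowledge promptly so the notification is not retried.
  res.sendStatus(200);
});

app.listen(3000);
```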
For detailed event field descriptions, see Notification event 111 agent metrics.
Optimize cascaded architecture latency
To reduce latency in cascaded architecture, focus on optimizing individual components, geographic deployment, and RTC settings.
Optimize LLM, ASR, and TTS components
LLM is typically the component that contributes the most to latency. Optimizing LLM can significantly reduce overall latency.
Choose low-latency vendors
LLM, ASR, and TTS vendors and models vary significantly in response speed. Refer to the Conversational AI Performance Lab to compare performance metrics across vendors.
Optimize parameter configuration
When creating an agent, read the vendor documentation for ASR, LLM, and TTS to understand available parameters and tune them for your use case. The following are general optimization approaches:
- LLM
  - Choose smaller models: Models such as `gpt-4o-mini` and `gemini-2.5-flash` typically respond faster than larger models.
  - Limit `max_tokens`: Reducing the maximum number of tokens generated can lower TTFS (first sentence latency).
  - Enable streaming response: Ensure `stream: true` so the agent can start speaking as soon as possible.
- ASR
  - Use vendor-recommended sampling rate: Use the sampling rate recommended by the vendor, such as 16 kHz, to avoid unnecessary resampling.
  - Limit language model: Use the `phrases` or `context` parameter to provide domain-specific vocabulary and improve recognition speed.
  - Disable non-essential features: Some ASR vendors provide advanced parameters such as punctuation and tone output. You can disable these based on your use case to improve response speed.
- TTS
  - Choose faster modes: Some TTS providers offer modes such as turbo or low-latency, which typically respond faster than the default mode.
  - Choose simpler voices: Some TTS providers offer voices with varying complexity. Choosing less complex voices can reduce generation time.
  - Disable non-essential features: Some TTS vendors provide advanced parameters such as `profanity_filter`, `punctuation_filter`, and diarization. You can disable these based on your use case to improve response speed.
The following example shows how to optimize LLM parameters:
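The sketch below uses OpenAI-style parameter names (`model`, `max_tokens`, `stream`) with a smaller model, a bounded response length, and streaming enabled. The surrounding structure (for example, an `llm.params` section in the agent creation request) is an assumption; map the fields to whatever your LLM vendor and agent configuration API actually expect.

```typescript
// Illustrative LLM configuration for a latency-sensitive agent.
// Field names follow common OpenAI-style chat completion parameters;
// adapt them to your LLM vendor and agent configuration schema.
const llmConfig = {
  model: "gpt-4o-mini",  // smaller model: typically lower TTFB/TTFS than larger models
  max_tokens: 256,       // cap response length to keep TTFS and turn length down
  stream: true,          // stream tokens so TTS can start speaking as soon as possible
  temperature: 0.7,
};

// Example: embed the configuration in the agent creation request body.
// The wrapper structure (llm.params, system_messages) is an assumption;
// check your agent REST API reference for the exact schema.
const agentRequestBody = {
  llm: {
    params: llmConfig,
    system_messages: [
      { role: "system", content: "Keep answers short and conversational." },
    ],
  },
};

console.log(JSON.stringify(agentRequestBody, null, 2));
```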
Optimize RTC latency
RTC latency includes audio capture, encoding, network transmission, decoding, and playback. To optimize RTC latency, configure audio settings on the client side.
Use AI conversation scenario
Agora RTC SDK 4.5.1 and later supports the AI conversation scenario (AUDIO_SCENARIO_AI_CLIENT), which is specifically optimized for AI conversations and includes:
- Optimized audio 3A algorithms (echo cancellation, noise reduction, and gain control)
- Lower audio capture and playback latency
- Audio processing tailored to AI voice characteristics
**Android**

Use the client-side component API (recommended):

```kotlin
val config = ConversationalAIAPIConfig(
    rtcEngine = rtcEngineInstance,
    rtmClient = rtmClientInstance,
    enableLog = true
)
val api = ConversationalAIAPIImpl(config)
// Load optimal audio settings
api.loadAudioSettings()
```

Configure the RTC SDK directly:

```kotlin
val config = RtcEngineConfig()
config.mAudioScenario = Constants.AUDIO_SCENARIO_AI_CLIENT
rtcEngine = RtcEngine.create(config)
```

**iOS**

Use the client-side component API (recommended):

```swift
let config = ConversationalAIAPIConfig(
    rtcEngine: rtcEngine,
    rtmEngine: rtmEngine,
    enableLog: true
)
convoAIAPI = ConversationalAIAPIImpl(config: config)
// Load optimal audio settings
convoAIAPI.loadAudioSettings()
```

Configure the RTC SDK directly:

```swift
let config = AgoraRtcEngineConfig()
config.audioScenario = .aiClient
rtcEngine = AgoraRtcEngineKit.sharedEngine(with: config, delegate: delegate)
```

For detailed audio setting optimization, see Optimize audio.
Latency optimization checklist
Use the following checklist to systematically optimize your conversational AI agent latency:
Server-side optimization
- Choose low-latency LLM models: Refer to Conversational AI Performance Lab to select models with excellent TTFT and throughput performance.
- Enable streaming response: Ensure `stream: true`.
- Choose low-latency ASR and TTS vendors: Enable low-latency or turbo modes when available. Refer to Conversational AI Performance Lab to select models with low latency and high throughput performance.
- Optimize geographic deployment: Deploy ASR, LLM, and TTS in the same region.
- Configure regional access restrictions: Use the `geofence` parameter to lock in the optimal region.
Client-side optimization
- Use AI conversation scenario: Set `AUDIO_SCENARIO_AI_CLIENT` (RTC SDK 4.5.1 and later).
- Load optimal audio settings: Call the `loadAudioSettings()` method of the client component.
- Integrate required audio plugins: Ensure integration of the AI noise reduction and AI echo cancellation plugins.
- Optimize network conditions: Ensure stable user network connection and consider using SD-RTN™ to optimize network transmission.
Monitoring and analysis
- Monitor latency metrics in real time: Obtain latency data for each conversation turn through client components or NCS.
- Identify latency bottlenecks: Analyze which component contributes the most to latency (see the sketch after this list).
- Continuously optimize: Adjust configuration based on actual data and conduct A/B testing.
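As one way to put the first two items into practice, the sketch below accumulates the per-module metrics reported by the Web client callback shown earlier and picks the slowest module for each turn. The `Metric` shape (`type`, `name`, `value` in milliseconds) mirrors that example; the turn-grouping logic is an illustrative assumption.

```typescript
// Illustrative bottleneck tracker built on the metrics callback shown earlier.
type Metric = { type: string; name: string; value: number };

const turnMetrics: Metric[] = [];

function onMetric(metric: Metric): void {
  turnMetrics.push(metric);

  // Treat the TOTAL metric as the end of a conversation turn.
  if (metric.type === "TOTAL") {
    const perModule = turnMetrics.filter((m) => m.type !== "TOTAL");
    if (perModule.length > 0) {
      const bottleneck = perModule.reduce((a, b) => (b.value > a.value ? b : a));
      console.log(
        `Turn total ${metric.value}ms; biggest contributor: ` +
          `${bottleneck.type} ${bottleneck.name} (${bottleneck.value}ms)`
      );
    }
    turnMetrics.length = 0; // reset for the next turn
  }
}
```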
Balance latency and quality
When optimizing latency, find a balance between response speed and conversation quality:
| Optimization strategy | Latency impact | Quality impact | Recommended scenario |
|---|---|---|---|
| Use smaller LLM models | ✅ Significantly reduces | ⚠️ May reduce | Latency-sensitive scenarios with relatively simple conversations |
| Limit `max_tokens` | ✅ Moderately reduces | ⚠️ May affect completeness | Scenarios requiring short responses |
| Regional access restrictions | ✅ Moderately reduces | No impact | Users concentrated in a specific region |
| Optimize RTC settings | ✅ Moderately reduces | No impact | All scenarios |
Monitor latency metrics to identify bottlenecks, then optimize the highest-contributing component rather than pursuing minimal latency across all components.
References
Refer to the following resources for further details.