Overview
Multimodal Large Language Models (MLLMs) enable real-time audio and text interactions without separate automatic speech recognition (ASR) or text-to-speech (TTS) components. They process voice input directly and generate audio responses, creating more natural conversations with lower latency. Agora supports multiple MLLM providers, allowing you to choose the real-time capabilities that best fit your requirements.
Integration steps
To integrate the MLLM provider of your choice, follow these steps:
- Choose your MLLM provider from the Supported MLLM providers table
- Obtain an API key from the provider's console
- Copy the sample configuration for your chosen provider
- Replace the API key placeholder with your actual API key
- Configure voice settings and system instructions
- Enable MLLM in the advanced features: `"enable_mllm": true`
- Specify the configuration in the request body under `properties` > `mllm` when starting a conversational AI agent, as shown in the example after this list
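For orientation, here is a minimal sketch of the relevant parts of a request body with MLLM enabled and OpenAI Realtime as the provider. The nesting of `enable_mllm` under `advanced_features`, the field names inside `mllm` (`url`, `api_key`, `messages`, `params`, and the modality keys), and the endpoint, model, and voice values are illustrative assumptions; use the sample configuration from your provider and the Start a conversational AI agent API reference as the source of truth.

```json
{
  "properties": {
    "advanced_features": {
      "enable_mllm": true
    },
    "mllm": {
      "url": "wss://api.openai.com/v1/realtime",
      "api_key": "<your-openai-api-key>",
      "messages": [
        {
          "role": "system",
          "content": "You are a concise, friendly voice assistant."
        }
      ],
      "params": {
        "model": "gpt-4o-realtime-preview",
        "voice": "alloy"
      },
      "input_modalities": ["audio"],
      "output_modalities": ["text", "audio"]
    }
  }
}
```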
Supported MLLM providers
Conversational AI Engine currently supports the following MLLM providers:
| Provider | Documentation |
|---|---|
| OpenAI Realtime | https://platform.openai.com/docs/guides/realtime |
Real-time capabilities
MLLMs offer advanced features for conversational AI:
- Direct audio processing: No separate ASR step required, reducing latency
- Natural speech synthesis: Built-in voice generation with emotional nuance
- Real-time streaming: WebSocket-based communication for immediate responses
- Multimodal understanding: Can process both audio and text inputs simultaneously
- Turn detection: Advanced semantic understanding of conversation flow
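Capabilities such as turn detection are typically tuned through the provider's own session parameters. As one illustration, OpenAI Realtime exposes a `turn_detection` session setting for server-side voice activity detection; assuming the `params` object shown earlier is passed through to the provider session, the relevant fragment might look like this:

```json
{
  "params": {
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy",
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    }
  }
}
```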
Modality configuration
MLLMs support flexible input and output combinations:
- Input modalities: `["audio"]` for voice-only, `["audio", "text"]` for mixed input
- Output modalities: `["text", "audio"]` for comprehensive responses
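For example, a voice-only agent and a mixed-input agent differ only in the input list (field names as assumed in the earlier sketch):

```json
{
  "input_modalities": ["audio"],
  "output_modalities": ["text", "audio"]
}
```

```json
{
  "input_modalities": ["audio", "text"],
  "output_modalities": ["text", "audio"]
}
```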
Choose modality combinations based on your application needs, user experience requirements, and integration complexity. Refer to each provider's documentation for specific capabilities and limitations.