Overview
Multimodal Large Language Models (MLLMs) enable real-time audio and text interactions without separate automatic speech recognition (ASR) or text-to-speech (TTS) components. They process voice input directly and generate audio responses, creating more natural conversations with lower latency. Agora supports multiple MLLM providers, allowing you to choose the real-time capabilities that best fit your requirements.
Integration steps
To integrate the MLLM provider of your choice, follow these steps:
- Choose your MLLM provider from the Supported MLLM providers table
- Obtain an API key from the provider's console
- Copy the sample configuration for your chosen provider
- Replace the API key placeholder with your actual API key
- Configure voice settings and system instructions
- Enable MLLM in the advanced features: `"enable_mllm": true`
- Specify the configuration in the request body under `properties` > `mllm` when starting a conversational AI agent, as shown in the example after this list
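For orientation, here is a minimal sketch of the relevant parts of a request body with MLLM enabled and OpenAI Realtime as the provider. The nesting of `enable_mllm` under `advanced_features`, the field names inside `mllm` (`url`, `api_key`, `messages`, `params`, and the modality keys), and the endpoint, model, and voice values are illustrative assumptions; use the sample configuration from your provider and the Start a conversational AI agent API reference as the source of truth.

```json
{
  "properties": {
    "advanced_features": {
      "enable_mllm": true
    },
    "mllm": {
      "url": "wss://api.openai.com/v1/realtime",
      "api_key": "<your-openai-api-key>",
      "messages": [
        {
          "role": "system",
          "content": "You are a concise, friendly voice assistant."
        }
      ],
      "params": {
        "model": "gpt-4o-realtime-preview",
        "voice": "alloy"
      },
      "input_modalities": ["audio"],
      "output_modalities": ["text", "audio"]
    }
  }
}
```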
Supported MLLM providers
Conversational AI Engine currently supports the following MLLM providers:
| Provider | Documentation |
|---|---|
| OpenAI Realtime | https://platform.openai.com/docs/guides/realtime |
Real-time capabilities
MLLMs offer advanced features for conversational AI:
- Direct audio processing: No separate ASR step required, reducing latency
- Natural speech synthesis: Built-in voice generation with emotional nuance
- Real-time streaming: WebSocket-based communication for immediate responses
- Multimodal understanding: Can process both audio and text inputs simultaneously
- Turn detection: Advanced semantic understanding of conversation flow
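Capabilities such as turn detection are typically tuned through the provider's own session parameters. As one illustration, OpenAI Realtime exposes a `turn_detection` session setting for server-side voice activity detection; assuming the `params` object shown earlier is passed through to the provider session, the relevant fragment might look like this:

```json
{
  "params": {
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy",
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    }
  }
}
```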
Modality configuration
MLLMs support flexible input and output combinations:
- Input modalities: `["audio"]` for voice-only, `["audio", "text"]` for mixed input
- Output modalities: `["text", "audio"]` for comprehensive responses
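For example, a voice-only agent and a mixed-input agent differ only in the input list (field names as assumed in the earlier sketch):

```json
{
  "input_modalities": ["audio"],
  "output_modalities": ["text", "audio"]
}
```

```json
{
  "input_modalities": ["audio", "text"],
  "output_modalities": ["text", "audio"]
}
```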
Choose modality combinations based on your application needs, user experience requirements, and integration complexity. Refer to each provider's documentation for specific capabilities and limitations.