Overview

Multimodal Large Language Models (MLLMs) enable real-time audio and text interactions without separate ASR/TTS components. They process voice input directly and generate audio responses, creating more natural conversations with lower latency. Agora supports multiple MLLM providers, allowing you to choose the best real-time capabilities for your specific requirements.

Integration steps

To integrate the MLLM provider of your choice, follow these steps:

  1. Choose your MLLM provider from the Supported MLLM providers table
  2. Obtain an API key from the provider's console
  3. Copy the sample configuration for your chosen provider
  4. Replace the API key placeholder with your actual API key
  5. Configure voice settings and system instructions
  6. Enable MLLM in advanced features: "enable_mllm": true
  7. Specify the configuration in the request body under properties > mllm when Starting a conversational AI agent
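The steps above can be sketched as a request body. The snippet below is an illustrative Python dictionary, not the authoritative schema: apart from `enable_mllm` and the `properties` > `mllm` path mentioned in the steps, the field names (`url`, `api_key`, `params`, `input_modalities`, `output_modalities`) and values are assumptions modeled on a typical OpenAI Realtime integration. Consult the Agora REST API reference for the exact schema.

```python
# Hypothetical request body for starting a conversational AI agent with an
# MLLM provider. Only "enable_mllm" and the properties > mllm path come from
# the steps above; the remaining field names are illustrative assumptions.
request_body = {
    "name": "my-agent",
    "properties": {
        "advanced_features": {
            "enable_mllm": True,  # step 6: enable MLLM
        },
        "mllm": {  # step 7: provider configuration
            "url": "wss://api.openai.com/v1/realtime",
            "api_key": "<your-openai-api-key>",  # step 4: replace the placeholder
            "params": {
                "model": "gpt-4o-realtime-preview",
                "voice": "alloy",  # step 5: voice settings
                "instructions": "You are a helpful voice assistant.",
            },
            "input_modalities": ["audio"],
            "output_modalities": ["text", "audio"],
        },
    },
}
```

Sending this body to the agent-start endpoint (with the placeholder replaced) would complete steps 4 through 7 in one request.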

Supported MLLM providers

Conversational AI Engine currently supports the following MLLM providers:

Provider          Documentation
OpenAI Realtime   https://platform.openai.com/docs/guides/realtime

Real-time capabilities

MLLMs offer advanced features for conversational AI:

  • Direct audio processing: No separate ASR step required, reducing latency
  • Natural speech synthesis: Built-in voice generation with emotional nuance
  • Real-time streaming: WebSocket-based communication for immediate responses
  • Multimodal understanding: Can process both audio and text inputs simultaneously
  • Turn detection: Advanced semantic understanding of conversation flow

Modality configuration

MLLMs support flexible input and output combinations:

  • Input modalities: ["audio"] for voice-only, ["audio", "text"] for mixed input
  • Output modalities: ["text", "audio"] for spoken responses accompanied by text

Choose modality combinations based on your application needs, user experience requirements, and integration complexity. Refer to each provider's documentation for specific capabilities and limitations.
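A small helper can make the supported combinations explicit at configuration time. The accepted sets below mirror the bullets above; the helper itself and its name are hypothetical, added only to illustrate validating a modality configuration before sending it.

```python
# Hypothetical validation of the modality combinations listed above.
VALID_INPUTS = [["audio"], ["audio", "text"]]
VALID_OUTPUTS = [["text", "audio"]]

def validate_modalities(inputs: list, outputs: list) -> bool:
    """Check a modality configuration against the combinations above."""
    return (sorted(inputs) in [sorted(v) for v in VALID_INPUTS]
            and sorted(outputs) in [sorted(v) for v in VALID_OUTPUTS])
```

Rejecting an unsupported combination locally gives a clearer error than a failed agent-start request.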