
OpenAI Realtime API

OpenAI Realtime provides multimodal large language model capabilities with real-time audio processing, enabling natural voice conversations without separate ASR/TTS components.

info

Enabling MLLM automatically disables ASR, LLM, and TTS since the MLLM handles end-to-end voice processing directly.

Sample configuration

The following example shows a starting mllm parameter configuration you can use when you Start a conversational AI agent.


```json
"mllm": {
  "enable": true,
  "url": "wss://api.openai.com/v1/realtime",
  "api_key": "<openai_api_key>",
  "params": {
    "model": "gpt-realtime",
    "voice": "coral",
    "instructions": "You are a Conversational AI Agent, developed by Agora.",
    "input_audio_transcription": {
      "language": "<language>",
      "model": "gpt-4o-mini-transcribe",
      "prompt": "expect words related to real-time engagement"
    }
  },
  "turn_detection": {
    // see details below
  },
  "greeting_message": "<greetings>",
  "output_modalities": ["text", "audio"],
  "vendor": "openai"
}
```
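If you assemble the request body in code, the sample above maps directly to a plain dictionary. The `build_mllm_config` helper below is an illustrative sketch only, not part of the Agora or OpenAI SDKs; the greeting text and API key are placeholders, and the `server_vad` turn-detection block is taken from the examples later in this page:

```python
import json

def build_mllm_config(api_key: str, language: str = "en") -> dict:
    """Assemble the sample mllm block from this page as a Python dict."""
    return {
        "enable": True,
        "url": "wss://api.openai.com/v1/realtime",
        "api_key": api_key,
        "params": {
            "model": "gpt-realtime",
            "voice": "coral",
            "instructions": "You are a Conversational AI Agent, developed by Agora.",
            "input_audio_transcription": {
                "language": language,
                "model": "gpt-4o-mini-transcribe",
                "prompt": "expect words related to real-time engagement",
            },
        },
        # One of the turn_detection variants documented below.
        "turn_detection": {
            "mode": "server_vad",
            "server_vad_config": {
                "prefix_padding_ms": 800,
                "silence_duration_ms": 640,
                "threshold": 0.5,
            },
        },
        "greeting_message": "Hello! How can I help you today?",  # placeholder
        "output_modalities": ["text", "audio"],
        "vendor": "openai",
    }

# Embed the block under the "mllm" key of your Start-agent request body.
payload = {"mllm": build_mllm_config("<openai_api_key>")}
body = json.dumps(payload, indent=2)
```

Serializing through `json.dumps` before sending is a cheap way to catch non-JSON values (such as accidentally passing a Python object) early.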

Turn detection

For a full list of turn_detection parameters, see mllm.turn_detection. The following examples show the supported turn_detection configurations for OpenAI Realtime API. To set up turn detection, add a turn_detection block inside the mllm object when you Start a conversational AI agent.

  • Server VAD


```json
"turn_detection": {
  "mode": "server_vad",
  "server_vad_config": {
    "prefix_padding_ms": 800,
    "silence_duration_ms": 640,
    "threshold": 0.5
  }
}
```

  • Semantic VAD


```json
"turn_detection": {
  "mode": "semantic_vad",
  "semantic_vad_config": {
    "eagerness": "auto"
  }
}
```

  • Agora VAD


```json
"turn_detection": {
  "mode": "agora_vad",
  "agora_vad_config": {
    "interrupt_duration_ms": 160,
    "prefix_padding_ms": 800,
    "silence_duration_ms": 640,
    "threshold": 0.5
  }
}
```
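Each `mode` value must be paired with its matching `<mode>_config` object. As a client-side sanity check before sending a request, you could validate that pairing yourself; `validate_turn_detection` below is a hypothetical helper for illustration, not part of any SDK:

```python
# Hypothetical client-side check: a turn_detection block must name a
# supported mode and carry the matching <mode>_config object.
SUPPORTED_MODES = {"server_vad", "semantic_vad", "agora_vad"}

def validate_turn_detection(td: dict) -> dict:
    mode = td.get("mode")
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported turn_detection mode: {mode!r}")
    config_key = f"{mode}_config"
    if config_key not in td:
        raise ValueError(f"mode {mode!r} requires a {config_key!r} object")
    return td

# The Agora VAD example from above passes the check.
validated = validate_turn_detection({
    "mode": "agora_vad",
    "agora_vad_config": {
        "interrupt_duration_ms": 160,
        "prefix_padding_ms": 800,
        "silence_duration_ms": 640,
        "threshold": 0.5,
    },
})
```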

Key parameters

mllm (required)
  • enable (boolean, nullable)

    Enables the MLLM module. Replaces the deprecated advanced_features.enable_mllm.

  • url (string, required)

    The WebSocket URL for OpenAI Realtime API.

  • api_key (string, required)

    The API key used for authentication. Get your API key from the OpenAI Console.

  • messages (array[object], nullable)

    Array of conversation items used for short-term memory management. Uses the same structure as item.content from the OpenAI Realtime API.

  • params (object, nullable)

    Additional MLLM configuration parameters. See MLLM Overview for details.

    • Modalities override: The modalities setting in params is overridden by input_modalities and output_modalities.
    • Turn detection override: The turn_detection setting in params is overridden by mllm.turn_detection.
    • model (string, nullable)

      The model identifier.

    • voice (string, nullable)

      The voice identifier for audio output.

    • instructions (string, nullable)

      System instructions that define the assistant's behavior and personality.

    • input_audio_transcription (object, nullable)

      Configuration for audio input transcription.

      • language (string, nullable)

        The language of the input audio. Supplying the input language in ISO-639-1 format (for example, en) improves accuracy and latency.

      • model (string, nullable)

        The model to use for transcription. Current options are gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1.

      • prompt (string, nullable)

        Optional text to guide the model's style or to continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models, it is a free-text string, for example "expect words related to technology".

  • turn_detection (object, nullable)

    Turn detection configuration for the MLLM module.

    info

    When mllm.turn_detection is defined, the top-level turn_detection object has no effect.

    • mode (string, nullable)

      Possible values: agora_vad, server_vad, semantic_vad

      • agora_vad: Agora VAD-based detection.
      • server_vad: Vendor-side VAD-based detection.
      • semantic_vad: Semantic-based detection.
    • agora_vad_config (object, nullable)

      Configuration for Agora VAD-based turn detection. Applicable when mode is agora_vad.

      • interrupt_duration_ms (integer, nullable)

        Minimum duration of speech in milliseconds required to trigger an interruption.

      • prefix_padding_ms (integer, nullable)

        Duration of audio in milliseconds to include before the detected speech start.

      • silence_duration_ms (integer, nullable)

        Duration of silence in milliseconds required to determine end of speech.

      • threshold (number, nullable)

        VAD sensitivity threshold. A higher value reduces false positives.

    • server_vad_config (object, nullable)

      Configuration for vendor-side VAD-based turn detection. Applicable when mode is server_vad. Parameters are passed through to the vendor.

      • prefix_padding_ms (integer, nullable)

        Duration of audio in milliseconds to include before the detected speech start.

      • silence_duration_ms (integer, nullable)

        Duration of silence in milliseconds required to determine end of speech.

      • threshold (number, nullable)

        VAD sensitivity threshold.

      • idle_timeout_ms (integer, nullable)

        Idle timeout in milliseconds. If no user speech is detected within this period, the vendor can trigger a model response automatically.

    • semantic_vad_config (object, nullable)

      Configuration for semantic-based turn detection. Applicable when mode is semantic_vad.

      • eagerness (string, nullable)

        Possible values: auto, low, medium, high

        Controls how eagerly the model ends the user's turn and starts responding: low waits longer for the user to continue speaking, high responds sooner, and auto uses a balanced default.

  • input_modalities (array[string], nullable)

    Default: ["audio"]

    MLLM input modalities:

    • ["audio"]: Audio only
    • ["audio", "text"]: Audio plus text
  • output_modalities (array[string], nullable)

    Default: ["text", "audio"]

    Output format options: ["text", "audio"] for both text and voice responses.

  • greeting_message (string, nullable)

    Initial message the agent speaks when a user joins the channel.

  • vendor (string, nullable)

    MLLM provider identifier. Set to openai for OpenAI Realtime API.
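The two precedence rules noted above (top-level input_modalities/output_modalities override any modalities entry inside params, and mllm.turn_detection overrides the top-level turn_detection object) can be sketched as follows. resolve_modalities and effective_turn_detection are hypothetical helpers for illustration only, with the defaults taken from the parameter descriptions above:

```python
# Hypothetical helpers illustrating the documented precedence rules;
# they are not part of the Agora API.

def resolve_modalities(mllm):
    # Top-level input_modalities / output_modalities win over any
    # "modalities" entry inside params; defaults per the docs above.
    inputs = mllm.get("input_modalities", ["audio"])
    outputs = mllm.get("output_modalities", ["text", "audio"])
    return inputs, outputs

def effective_turn_detection(agent):
    # When mllm.turn_detection is defined, the top-level block has no effect.
    mllm_td = agent.get("mllm", {}).get("turn_detection")
    return mllm_td if mllm_td is not None else agent.get("turn_detection")

agent = {
    "turn_detection": {"mode": "server_vad"},  # ignored: mllm.turn_detection is set
    "mllm": {
        "params": {"modalities": ["text"]},    # ignored: top-level modalities win
        "output_modalities": ["text", "audio"],
        "turn_detection": {
            "mode": "semantic_vad",
            "semantic_vad_config": {"eagerness": "auto"},
        },
    },
}

inputs, outputs = resolve_modalities(agent["mllm"])
td = effective_turn_detection(agent)
```

Resolving the effective values client-side like this can make debugging easier when a setting appears to be ignored by the service.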

For comprehensive API reference, real-time capabilities, and detailed parameter descriptions, see the OpenAI Realtime API documentation.