OpenAI Realtime API

OpenAI Realtime provides multimodal large language model capabilities with real-time audio processing, enabling natural voice conversations without separate ASR/TTS components.

Enable MLLM

To enable MLLM functionality, set enable_mllm to true under advanced_features.


"advanced_features": {
  "enable_mllm": true
}
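If you assemble the agent properties in code, a minimal Python sketch might look like the following. The structure mirrors the JSON fragments on this page; the helper function itself is illustrative, not part of any SDK.

```python
# Build an agent properties payload with MLLM enabled.
# Field names are taken from the JSON samples on this page; the
# function and its signature are illustrative only.

def build_agent_properties(openai_api_key: str, greeting: str) -> dict:
    """Return an agent properties dict with MLLM enabled (sketch only)."""
    return {
        "advanced_features": {
            "enable_mllm": True,
        },
        "mllm": {
            "url": "wss://api.openai.com/v1/realtime",
            "api_key": openai_api_key,
            "params": {
                "model": "gpt-4o-realtime-preview",
                "voice": "coral",
            },
            "greeting_message": greeting,
            "vendor": "openai",
            "style": "openai",
        },
    }

props = build_agent_properties("<openai_api_key>", "Hello!")
```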

Sample configuration

The following example shows a starting mllm parameter configuration you can use when you Start a conversational AI agent.


"mllm": {
  "url": "wss://api.openai.com/v1/realtime",
  "api_key": "<openai_api_key>",
  "params": {
    "model": "gpt-4o-realtime-preview",
    "voice": "coral",
    "instructions": "You are a Conversational AI Agent, developed by Agora.",
    "input_audio_transcription": {
      "language": "<language>",
      "model": "gpt-4o-mini-transcribe",
      "prompt": "expect words related to real-time engagement"
    }
  },
  "max_history": 50,
  "greeting_message": "<greetings>",
  "output_modalities": ["text", "audio"],
  "vendor": "openai",
  "style": "openai"
}
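Before sending a configuration, a quick sanity check can catch missing required fields. The following Python sketch validates only the requirements stated in this section (`url` and `api_key` are required; `max_history` defaults to 32); the function is illustrative, not part of any SDK.

```python
# Minimal validation of an mllm configuration block, based only on the
# requirements described in this section: url and api_key are required,
# and max_history (documented default: 32) must be a positive integer.

def validate_mllm_config(mllm: dict) -> list:
    """Return a list of problems found; an empty list means the block looks usable."""
    problems = []
    for field in ("url", "api_key"):
        if not mllm.get(field):
            problems.append("missing required field: " + field)
    max_history = mllm.get("max_history", 32)  # documented default
    if not isinstance(max_history, int) or max_history <= 0:
        problems.append("max_history must be a positive integer")
    return problems

config = {
    "url": "wss://api.openai.com/v1/realtime",
    "api_key": "<openai_api_key>",
    "max_history": 50,
}
issues = validate_mllm_config(config)
```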

Key parameters

mllm objectrequired
  • url stringrequired

    The WebSocket URL for OpenAI Realtime API.

  • api_key stringrequired

    The API key used for authentication. Get your API key from the OpenAI Console.

  • messages array[object]nullable

    Array of conversation items used for short-term memory management. Uses the same structure as item.content from the OpenAI Realtime API.

  • params objectnullable

    Additional MLLM configuration parameters.

    • Modalities override: The modalities setting in params is overridden by input_modalities and output_modalities.
    • Turn detection override: The turn_detection setting in params is overridden by the turn_detection section outside of mllm.
      See MLLM Overview for details.
    • model stringnullable

      The model identifier.

    • voice stringnullable

      The voice identifier for audio output.

    • instructions stringnullable

      System instructions that define the assistant's behavior and personality.

    • input_audio_transcription objectnullable

      Configuration for audio input transcription.

      • language stringnullable

        The language of the input audio. Supplying the input language in ISO-639-1 format (for example, en) improves accuracy and latency.

      • model stringnullable

        The model to use for transcription. Current options are gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1.

      • prompt stringnullable

        Optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models, the prompt is a free-text string, for example "expect words related to technology".

  • max_history integernullable

    Default: 32

    The number of conversation history messages to maintain. Cannot exceed the model's context window.

  • input_modalities array[string]nullable

    Default: ["audio"]

    MLLM input modalities:

    • ["audio"]: Audio only
    • ["audio", "text"]: Audio plus text
  • output_modalities array[string]nullable

    Default: ["text", "audio"]

    Output format options: ["text", "audio"] for both text and voice responses.

  • greeting_message stringnullable

    Initial message the agent speaks when a user joins the channel.

  • vendor stringnullable

    MLLM provider identifier. Set to openai for OpenAI Realtime API.

  • style stringnullable

    API request style. Set to openai for OpenAI Realtime API format.
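The override rules described under params can be illustrated with a small sketch. The merge logic below only mirrors the behavior documented above (input_modalities/output_modalities replace the modalities value in params; the top-level turn_detection section replaces the one in params); it is not the service's actual implementation.

```python
# Illustration of the documented override rules, not the service's
# actual implementation: modalities in params are replaced by
# input_modalities/output_modalities, and turn_detection in params is
# replaced by the top-level turn_detection section.

def effective_params(mllm, top_level_turn_detection=None):
    """Return the params dict after applying the documented overrides."""
    params = dict(mllm.get("params", {}))
    # input_modalities / output_modalities (with their documented defaults)
    # take precedence over params["modalities"].
    modalities = (mllm.get("input_modalities", ["audio"])
                  + mllm.get("output_modalities", ["text", "audio"]))
    params["modalities"] = sorted(set(modalities))
    # The turn_detection section outside mllm overrides the one in params.
    if top_level_turn_detection is not None:
        params["turn_detection"] = top_level_turn_detection
    return params

mllm = {
    "params": {
        "modalities": ["text"],                      # will be overridden
        "turn_detection": {"type": "server_vad"},    # will be overridden
    },
    "output_modalities": ["text", "audio"],
}
merged = effective_params(mllm, {"type": "server_vad", "silence_duration_ms": 500})
```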

For comprehensive API reference, real-time capabilities, and detailed parameter descriptions, see the OpenAI Realtime API documentation.