
OpenAI Realtime API

OpenAI Realtime provides multimodal large language model capabilities with real-time audio processing, enabling natural voice conversations without separate ASR/TTS components.

info

Enabling MLLM automatically disables ASR, LLM, and TTS since the MLLM handles end-to-end voice processing directly.

Sample configuration

The following example shows a starting mllm parameter configuration you can use when you Start a conversational AI agent.


```json
"mllm": {
  "enable": true,
  "url": "wss://api.openai.com/v1/realtime",
  "api_key": "<openai_api_key>",
  "params": {
    "model": "gpt-realtime",
    "voice": "coral",
    "instructions": "You are a Conversational AI Agent, developed by Agora.",
    "input_audio_transcription": {
      "language": "<language>",
      "model": "gpt-4o-mini-transcribe",
      "prompt": "expect words related to real-time engagement"
    }
  },
  "turn_detection": {
    // see details below
  },
  "greeting_message": "<greetings>",
  "output_modalities": ["text", "audio"],
  "vendor": "openai"
}
```
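If you assemble the request body in code, the sample above maps directly to a plain dictionary. The `build_mllm_config` helper below is an illustrative sketch only, not part of the Agora or OpenAI SDKs; the greeting text and API key are placeholders, and the `server_vad` turn-detection block is taken from the examples later in this page:

```python
import json

def build_mllm_config(api_key: str, language: str = "en") -> dict:
    """Assemble the sample mllm block from this page as a Python dict."""
    return {
        "enable": True,
        "url": "wss://api.openai.com/v1/realtime",
        "api_key": api_key,
        "params": {
            "model": "gpt-realtime",
            "voice": "coral",
            "instructions": "You are a Conversational AI Agent, developed by Agora.",
            "input_audio_transcription": {
                "language": language,
                "model": "gpt-4o-mini-transcribe",
                "prompt": "expect words related to real-time engagement",
            },
        },
        # One of the turn_detection variants documented below.
        "turn_detection": {
            "mode": "server_vad",
            "server_vad_config": {
                "prefix_padding_ms": 800,
                "silence_duration_ms": 640,
                "threshold": 0.5,
            },
        },
        "greeting_message": "Hello! How can I help you today?",  # placeholder
        "output_modalities": ["text", "audio"],
        "vendor": "openai",
    }

# Embed the block under the "mllm" key of your Start-agent request body.
payload = {"mllm": build_mllm_config("<openai_api_key>")}
body = json.dumps(payload, indent=2)
```

Serializing through `json.dumps` before sending is a cheap way to catch non-JSON values (such as accidentally passing a Python object) early.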

Turn detection

For a full list of turn_detection parameters, see mllm.turn_detection. The following examples show the supported turn_detection configurations for OpenAI Realtime API. To set up turn detection, add a turn_detection block inside the mllm object when you Start a conversational AI agent.

  • Server VAD


```json
"turn_detection": {
  "mode": "server_vad",
  "server_vad_config": {
    "prefix_padding_ms": 800,
    "silence_duration_ms": 640,
    "threshold": 0.5
  }
}
```

  • Semantic VAD


```json
"turn_detection": {
  "mode": "semantic_vad",
  "semantic_vad_config": {
    "eagerness": "auto"
  }
}
```

  • Agora VAD


```json
"turn_detection": {
  "mode": "agora_vad",
  "agora_vad_config": {
    "interrupt_duration_ms": 160,
    "prefix_padding_ms": 800,
    "silence_duration_ms": 640,
    "threshold": 0.5
  }
}
```
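Each `mode` value must be paired with its matching `<mode>_config` object. As a client-side sanity check before sending a request, you could validate that pairing yourself; `validate_turn_detection` below is a hypothetical helper for illustration, not part of any SDK:

```python
# Hypothetical client-side check: a turn_detection block must name a
# supported mode and carry the matching <mode>_config object.
SUPPORTED_MODES = {"server_vad", "semantic_vad", "agora_vad"}

def validate_turn_detection(td: dict) -> dict:
    mode = td.get("mode")
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported turn_detection mode: {mode!r}")
    config_key = f"{mode}_config"
    if config_key not in td:
        raise ValueError(f"mode {mode!r} requires a {config_key!r} object")
    return td

# The Agora VAD example from above passes the check.
validated = validate_turn_detection({
    "mode": "agora_vad",
    "agora_vad_config": {
        "interrupt_duration_ms": 160,
        "prefix_padding_ms": 800,
        "silence_duration_ms": 640,
        "threshold": 0.5,
    },
})
```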

Key parameters

mllm (required)
  • enable (boolean, nullable)

    Enables the MLLM module. Replaces the deprecated advanced_features.enable_mllm.

  • url (string, required)

    The WebSocket URL for OpenAI Realtime API.

  • api_key (string, required)

    The API key used for authentication. Get your API key from the OpenAI Console.

  • messages (array[object], nullable)

    Array of conversation items used for short-term memory management. Uses the same structure as item.content from the OpenAI Realtime API.

  • params (object, nullable)

    Additional MLLM configuration parameters. See MLLM Overview for details.

    • Modalities override: The modalities setting in params is overridden by input_modalities and output_modalities.
    • Turn detection override: The turn_detection setting in params is overridden by mllm.turn_detection.
    • model (string, nullable)

      The model identifier.

    • voice (string, nullable)

      The voice identifier for audio output.

    • instructions (string, nullable)

      System instructions that define the assistant's behavior and personality.

    • input_audio_transcription (object, nullable)

      Configuration for audio input transcription.

      • language (string, nullable)

        The language of the input audio. Supplying the input language in ISO-639-1 format (for example, en) improves accuracy and latency.

      • model (string, nullable)

        The model to use for transcription. Current options are gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1.

      • prompt (string, nullable)

        Optional text to guide the model's style or to continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models, it is a free-text string, for example "expect words related to technology".

  • turn_detection (object, nullable)

    Turn detection configuration for the MLLM module.

    info

    When mllm.turn_detection is defined, the top-level turn_detection object has no effect.

    • mode (string, nullable)

      Possible values: agora_vad, server_vad, semantic_vad

      • agora_vad: Agora VAD-based detection.
      • server_vad: Vendor-side VAD-based detection.
      • semantic_vad: Semantic-based detection.
    • agora_vad_config (object, nullable)

      Configuration for Agora VAD-based turn detection. Applicable when mode is agora_vad.

      • interrupt_duration_ms (integer, nullable)

        Minimum duration of speech in milliseconds required to trigger an interruption.

      • prefix_padding_ms (integer, nullable)

        Duration of audio in milliseconds to include before the detected speech start.

      • silence_duration_ms (integer, nullable)

        Duration of silence in milliseconds required to determine end of speech.

      • threshold (number, nullable)

        VAD sensitivity threshold. A higher value reduces false positives.

    • server_vad_config (object, nullable)

      Configuration for vendor-side VAD-based turn detection. Applicable when mode is server_vad. Parameters are passed through to the vendor.

      • prefix_padding_ms (integer, nullable)

        Duration of audio in milliseconds to include before the detected speech start.

      • silence_duration_ms (integer, nullable)

        Duration of silence in milliseconds required to determine end of speech.

      • threshold (number, nullable)

        VAD sensitivity threshold.

      • idle_timeout_ms (integer, nullable)

        Idle timeout in milliseconds. If no user speech is detected within this period, the vendor can trigger a model response automatically.

    • semantic_vad_config (object, nullable)

      Configuration for semantic-based turn detection. Applicable when mode is semantic_vad.

      • eagerness (string, nullable)

        Possible values: auto, low, medium, high

        Controls how eagerly the model ends the user's turn and starts responding: low waits longer for the user to continue speaking, high responds sooner, and auto uses a balanced default.

  • input_modalities (array[string], nullable)

    Default: ["audio"]

    MLLM input modalities:

    • ["audio"]: Audio only
    • ["audio", "text"]: Audio plus text
  • output_modalities (array[string], nullable)

    Default: ["text", "audio"]

    Output format options: ["text", "audio"] for both text and voice responses.

  • greeting_message (string, nullable)

    Initial message the agent speaks when a user joins the channel.

  • vendor (string, nullable)

    MLLM provider identifier. Set to openai for OpenAI Realtime API.
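The two precedence rules noted above (top-level input_modalities/output_modalities override any modalities entry inside params, and mllm.turn_detection overrides the top-level turn_detection object) can be sketched as follows. resolve_modalities and effective_turn_detection are hypothetical helpers for illustration only, with the defaults taken from the parameter descriptions above:

```python
# Hypothetical helpers illustrating the documented precedence rules;
# they are not part of the Agora API.

def resolve_modalities(mllm):
    # Top-level input_modalities / output_modalities win over any
    # "modalities" entry inside params; defaults per the docs above.
    inputs = mllm.get("input_modalities", ["audio"])
    outputs = mllm.get("output_modalities", ["text", "audio"])
    return inputs, outputs

def effective_turn_detection(agent):
    # When mllm.turn_detection is defined, the top-level block has no effect.
    mllm_td = agent.get("mllm", {}).get("turn_detection")
    return mllm_td if mllm_td is not None else agent.get("turn_detection")

agent = {
    "turn_detection": {"mode": "server_vad"},  # ignored: mllm.turn_detection is set
    "mllm": {
        "params": {"modalities": ["text"]},    # ignored: top-level modalities win
        "output_modalities": ["text", "audio"],
        "turn_detection": {
            "mode": "semantic_vad",
            "semantic_vad_config": {"eagerness": "auto"},
        },
    },
}

inputs, outputs = resolve_modalities(agent["mllm"])
td = effective_turn_detection(agent)
```

Resolving the effective values client-side like this can make debugging easier when a setting appears to be ignored by the service.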

For comprehensive API reference, real-time capabilities, and detailed parameter descriptions, see the OpenAI Realtime API documentation.