OpenAI Realtime API
OpenAI Realtime provides multimodal large language model capabilities with real-time audio processing, enabling natural voice conversations without separate ASR/TTS components.
Enabling MLLM automatically disables ASR, LLM, and TTS since the MLLM handles end-to-end voice processing directly.
Sample configuration
The following example shows a starting mllm parameter configuration you can use when you Start a conversational AI agent.
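As a hedged sketch, a starting `mllm` configuration might look like the following. The WebSocket URL, model name, and voice are illustrative values; substitute your own API key and the model you intend to use:

```json
{
  "mllm": {
    "enable": true,
    "url": "wss://api.openai.com/v1/realtime",
    "api_key": "<your_openai_api_key>",
    "params": {
      "model": "gpt-4o-realtime-preview",
      "voice": "alloy",
      "instructions": "You are a helpful assistant."
    },
    "input_modalities": ["audio"],
    "output_modalities": ["text", "audio"],
    "greeting_message": "Hello! How can I help you today?",
    "vendor": "openai"
  }
}
```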
Turn detection
For a full list of turn_detection parameters, see mllm.turn_detection.
The following examples show the supported turn_detection configurations for OpenAI Realtime API. To set up turn detection, add a turn_detection block inside the mllm object when you Start a conversational AI agent.
- Server VAD
- Semantic VAD
- Agora VAD
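The three modes can be sketched as follows. The threshold and duration values are illustrative placeholders, not recommendations; tune them for your scenario.

Server VAD:

```json
{
  "mllm": {
    "turn_detection": {
      "mode": "server_vad",
      "server_vad_config": {
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500
      }
    }
  }
}
```

Semantic VAD:

```json
{
  "mllm": {
    "turn_detection": {
      "mode": "semantic_vad",
      "semantic_vad_config": {
        "eagerness": "auto"
      }
    }
  }
}
```

Agora VAD:

```json
{
  "mllm": {
    "turn_detection": {
      "mode": "agora_vad",
      "agora_vad_config": {
        "interrupt_duration_ms": 160,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 640,
        "threshold": 0.5
      }
    }
  }
}
```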
Key parameters
mllm (required)
- enable (boolean, nullable)
Enables the MLLM module. Replaces the deprecated advanced_features.enable_mllm.
- url (string, required)
The WebSocket URL for OpenAI Realtime API.
- api_key (string, required)
The API key used for authentication. Get your API key from the OpenAI Console.
- messages (array[object], nullable)
Array of conversation items used for short-term memory management. Uses the same structure as item.content from the OpenAI Realtime API.
- params (object, nullable)
Additional MLLM configuration parameters. See MLLM Overview for details.
  - Modalities override: The modalities setting in params is overridden by input_modalities and output_modalities.
  - Turn detection override: The turn_detection setting in params is overridden by mllm.turn_detection.
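A hedged sketch of a messages array for seeding short-term memory, assuming the content entries follow the OpenAI Realtime item.content shape (the roles, content types, and text here are illustrative):

```json
{
  "mllm": {
    "messages": [
      {
        "role": "user",
        "content": [{ "type": "input_text", "text": "My name is Alex." }]
      },
      {
        "role": "assistant",
        "content": [{ "type": "text", "text": "Nice to meet you, Alex." }]
      }
    ]
  }
}
```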
- model (string, nullable)
The model identifier.
- voice (string, nullable)
The voice identifier for audio output.
- instructions (string, nullable)
System instructions that define the assistant's behavior and personality.
- input_audio_transcription (object, nullable)
Configuration for audio input transcription.
- language (string, nullable)
The language of the input audio. Supplying the input language in ISO-639-1 format (for example, en) improves accuracy and latency.
- model (string, nullable)
The model to use for transcription. Current options are gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1.
- prompt (string, nullable)
Optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models, the prompt is a free text string, for example "expect words related to technology".
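Putting the transcription options together, a hedged example of configuring input audio transcription inside params (the language, model, and prompt values are illustrative):

```json
{
  "mllm": {
    "params": {
      "input_audio_transcription": {
        "language": "en",
        "model": "gpt-4o-transcribe",
        "prompt": "expect words related to technology"
      }
    }
  }
}
```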
- turn_detection (object, nullable)
Turn detection configuration for the MLLM module.
Info: When mllm.turn_detection is defined, the top-level turn_detection object has no effect.
- mode (string, nullable)
Possible values: agora_vad, server_vad, semantic_vad.
agora_vad: Agora VAD-based detection.
server_vad: Vendor-side VAD-based detection.
semantic_vad: Semantic-based detection.
- agora_vad_config (object, nullable)
Configuration for Agora VAD-based turn detection. Applicable when mode is agora_vad.
- interrupt_duration_ms integernullable
Minimum duration of speech in milliseconds required to trigger an interruption.
- prefix_padding_ms integernullable
Duration of audio in milliseconds to include before the detected speech start.
- silence_duration_ms integernullable
Duration of silence in milliseconds required to determine end of speech.
- threshold numbernullable
VAD sensitivity threshold. A higher value reduces false positives.
- server_vad_config (object, nullable)
Configuration for vendor-side VAD-based turn detection. Applicable when mode is server_vad. Parameters are passed through to the vendor.
- prefix_padding_ms integernullable
Duration of audio in milliseconds to include before the detected speech start.
- silence_duration_ms integernullable
Duration of silence in milliseconds required to determine end of speech.
- threshold numbernullable
VAD sensitivity threshold.
- idle_timeout_ms integernullable
Idle timeout in milliseconds.
- semantic_vad_config (object, nullable)
Configuration for semantic-based turn detection. Applicable when mode is semantic_vad.
- eagerness (string, nullable)
Possible values: auto, low, medium, high. Controls how eagerly the model ends its turn.
- input_modalities (array[string], nullable)
Default: ["audio"]. MLLM input modalities:
["audio"]: Audio only
["audio", "text"]: Audio plus text
- output_modalities (array[string], nullable)
Default: ["text", "audio"]. Output format options: ["text", "audio"] for both text and voice responses.
- greeting_message (string, nullable)
Initial message the agent speaks when a user joins the channel.
- vendor (string, nullable)
MLLM provider identifier. Set to openai for OpenAI Realtime API.
For comprehensive API reference, real-time capabilities, and detailed parameter descriptions, see the OpenAI Realtime API documentation.