Google Gemini Live (Vertex AI)
Google Gemini Live provides multimodal large language model capabilities with real-time audio processing, enabling natural voice conversations without separate ASR/TTS components. This page covers integration using Vertex AI, authenticated with Google Cloud Application Default Credentials (ADC).
Enabling MLLM automatically disables ASR, LLM, and TTS since the MLLM handles end-to-end voice processing directly.
Sample configuration
The following example shows a starting mllm parameter configuration you can use when you Start a conversational AI agent.
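The sample itself was not preserved in this copy of the page, so the block below is an illustrative reconstruction assembled only from the parameters documented under Key parameters. All values (model identifier, region, credentials, prompts) are placeholders, and the exact nesting should be verified against the Agora API reference.

```json
{
  "mllm": {
    "enable": true,
    "vendor": "vertexai",
    "adc_credentials_string": "<base64-encoded ADC JSON>",
    "project_id": "<your-gcp-project-id>",
    "location": "us-central1",
    "params": {
      "model": "<gemini-live-model-id>",
      "instructions": "You are a friendly voice assistant. Keep answers short.",
      "voice": "Aoede",
      "transcribe_agent": true,
      "transcribe_user": true,
      "input_modalities": ["audio"],
      "output_modalities": ["audio"],
      "greeting_message": "Hi there! How can I help you today?"
    }
  }
}
```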
Turn detection
For a full list of turn_detection parameters, see mllm.turn_detection. The following examples show the supported configurations for Google Gemini Live (Vertex AI). To set up turn detection, add a turn_detection block inside the mllm object when you Start a conversational AI agent.
- Server VAD
- Agora VAD
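The tabbed code samples for these two configurations did not survive extraction. The fragments below are illustrative reconstructions built only from the fields documented under Key parameters on this page; the numeric values and sensitivity settings are placeholder suggestions, not recommended defaults.

Server VAD (detection performed by the vendor):

```json
{
  "turn_detection": {
    "mode": "server_vad",
    "server_vad_config": {
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500,
      "start_of_speech_sensitivity": "START_SENSITIVITY_HIGH",
      "end_of_speech_sensitivity": "END_SENSITIVITY_HIGH"
    }
  }
}
```

Agora VAD (detection performed by Agora):

```json
{
  "turn_detection": {
    "mode": "agora_vad",
    "agora_vad_config": {
      "interrupt_duration_ms": 160,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 640,
      "threshold": 0.5
    }
  }
}
```

In either case, place the turn_detection block inside the mllm object of your request body.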
Key parameters
mllm (object, required)
- enable (boolean, nullable)
  Enables the MLLM module. Replaces the deprecated advanced_features.enable_mllm.
- adc_credentials_string (string, required)
  Base64-encoded Google Cloud Application Default Credentials (ADC).
- project_id (string, required)
  Your Google Cloud project ID for Vertex AI access.
- location (string, required)
  The Google Cloud region hosting the Gemini Live model. See the Google Cloud documentation for the full list of available regions.
- messages (array[object], nullable)
  An array of conversation history items passed to the model as context. Each item represents a single message in the conversation history.
- params (object, required)
  Main configuration object for the Gemini Live model.
  - model (string, required)
    The Gemini Live model identifier.
  - instructions (string, nullable)
    System instructions that define the agent’s behavior or tone.
  - voice (string, nullable)
    The voice identifier for audio output. For example, Aoede, Puck, Charon, Kore, Fenrir, Leda, Orus, or Zephyr.
  - transcribe_agent (boolean, nullable)
    Whether to transcribe the agent’s speech in real time.
  - transcribe_user (boolean, nullable)
    Whether to transcribe the user’s speech in real time.
  - turn_detection (object, nullable)
    Turn detection configuration for the MLLM module.
    Info: When mllm.turn_detection is defined, the top-level turn_detection object has no effect.
    - mode (string, nullable)
      Possible values: agora_vad, server_vad, semantic_vad.
      agora_vad: Agora VAD-based detection. server_vad: vendor-side VAD-based detection.
    - agora_vad_config (object, nullable)
      Configuration for Agora VAD-based turn detection. Applicable when mode is agora_vad.
      - interrupt_duration_ms (integer, nullable)
        Minimum duration of speech, in milliseconds, required to trigger an interruption.
      - prefix_padding_ms (integer, nullable)
        Duration of audio, in milliseconds, to include before the detected start of speech.
      - silence_duration_ms (integer, nullable)
        Duration of silence, in milliseconds, required to determine the end of speech.
      - threshold (number, nullable)
        VAD sensitivity threshold. A higher value reduces false positives.
    - server_vad_config (object, nullable)
      Configuration for vendor-side VAD-based turn detection. Applicable when mode is server_vad. Parameters are passed through to the vendor.
      - prefix_padding_ms (integer, nullable)
        Duration of audio, in milliseconds, to include before the detected start of speech.
      - silence_duration_ms (integer, nullable)
        Duration of silence, in milliseconds, required to determine the end of speech.
      - start_of_speech_sensitivity (string, nullable)
        Sensitivity for start-of-speech detection. Possible values: START_SENSITIVITY_HIGH, START_SENSITIVITY_LOW.
      - end_of_speech_sensitivity (string, nullable)
        Sensitivity for end-of-speech detection. Possible values: END_SENSITIVITY_HIGH, END_SENSITIVITY_LOW.
  - input_modalities (array[string], nullable)
    Input modalities for the MLLM. Default: ["audio"].
    ["audio"]: audio-only input. ["audio", "text"]: accept both audio and text input.
  - output_modalities (array[string], nullable)
    Output modalities for the MLLM. Default: ["audio"].
    ["audio"]: audio-only response. ["text", "audio"]: combined text and audio output.
  - greeting_message (string, nullable)
    The message the agent speaks when a user joins the channel.
- vendor (string, required)
  The MLLM provider identifier. Set to "vertexai" to use Google Gemini Live with Vertex AI.
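The adc_credentials_string value is simply the raw ADC JSON file encoded as Base64. A minimal sketch of producing it (the default gcloud credentials path shown in the comment is an assumption for Linux/macOS; adjust for your environment):

```python
import base64
from pathlib import Path

def encode_adc(path: str) -> str:
    """Base64-encode a Google Cloud ADC JSON file for use as
    mllm.adc_credentials_string in the agent start request."""
    # `gcloud auth application-default login` typically writes ADC to
    # ~/.config/gcloud/application_default_credentials.json on Linux/macOS.
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")
```

Decoding the resulting string must yield the original JSON byte-for-byte; the service decodes it server-side to authenticate against Vertex AI.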
For comprehensive API reference, real-time capabilities, and detailed parameter descriptions, see the Google Gemini Live API.