
Google Gemini Live (Vertex AI)

Google Gemini Live provides multimodal large language model capabilities with real-time audio processing, enabling natural voice conversations without separate ASR/TTS components. This page covers integration using Vertex AI, authenticated with Google Cloud Application Default Credentials (ADC).

Note: Enabling MLLM automatically disables ASR, LLM, and TTS, since the MLLM handles end-to-end voice processing directly.

Sample configuration

The following example shows a starting mllm parameter configuration you can use when you Start a conversational AI agent.


"mllm": {
  "enable": true,
  "adc_credentials_string": "<GOOGLE_APPLICATION_CREDENTIALS_STRING>",
  "project_id": "<GOOGLE_CLOUD_PROJECT_ID>",
  "location": "<GOOGLE_CLOUD_REGION>",
  "messages": [
    {
      "role": "user",
      "content": "<HISTORY_CONTENT>"
    }
  ],
  "params": {
    "model": "gemini-3.1-flash-live-preview",
    "instructions": "<YOUR_SYSTEM_PROMPT>",
    "voice": "Aoede",
    "transcribe_agent": true,
    "transcribe_user": true
  },
  "turn_detection": {
    // see details below
  },
  "greeting_message": "Hi, how can I assist you today?",
  "input_modalities": [
    "audio"
  ],
  "output_modalities": [
    "audio"
  ],
  "vendor": "vertexai"
}
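
If the configuration is assembled in code rather than written by hand, the sample above can be produced by a small helper. The following is a minimal Python sketch, not an official client. `build_mllm_config` is a hypothetical name, the field names and default values are copied from the sample, and `messages` and `turn_detection` are left out for brevity. How the block is embedded in the request that starts an agent is not shown here.

```python
import json


def build_mllm_config(credentials_b64, project_id, location, system_prompt,
                      model="gemini-3.1-flash-live-preview", voice="Aoede"):
    """Return an mllm block shaped like the sample configuration.

    Hypothetical helper: only the field names and defaults come from the sample.
    """
    return {
        "enable": True,
        "adc_credentials_string": credentials_b64,
        "project_id": project_id,
        "location": location,
        "params": {
            "model": model,
            "instructions": system_prompt,
            "voice": voice,
            "transcribe_agent": True,
            "transcribe_user": True,
        },
        "greeting_message": "Hi, how can I assist you today?",
        "input_modalities": ["audio"],
        "output_modalities": ["audio"],
        "vendor": "vertexai",
    }


# Placeholders are kept as-is; substitute real values before use.
config = build_mllm_config(
    credentials_b64="<GOOGLE_APPLICATION_CREDENTIALS_STRING>",
    project_id="<GOOGLE_CLOUD_PROJECT_ID>",
    location="<GOOGLE_CLOUD_REGION>",
    system_prompt="<YOUR_SYSTEM_PROMPT>",
)

# Serialize as a fragment keyed by "mllm", mirroring the sample's outer key.
print(json.dumps({"mllm": config}, indent=2))
```

Building the block from a function keeps the required fields in one place and makes it harder to ship a configuration with a missing `vendor` or `params.model`.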

Turn detection

For a full list of turn_detection parameters, see mllm.turn_detection. The following examples show the supported configurations for Google Gemini Live (Vertex AI). To set up turn detection, add a turn_detection block inside the mllm object when you Start a conversational AI agent.

  • Server VAD


    "turn_detection": {
      "mode": "server_vad",
      "server_vad_config": {
        "prefix_padding_ms": 800,
        "silence_duration_ms": 640,
        "start_of_speech_sensitivity": "START_SENSITIVITY_HIGH",
        "end_of_speech_sensitivity": "END_SENSITIVITY_HIGH"
      }
    }

  • Agora VAD


    "turn_detection": {
      "mode": "agora_vad",
      "agora_vad_config": {
        "interrupt_duration_ms": 160,
        "prefix_padding_ms": 800,
        "silence_duration_ms": 640,
        "threshold": 0.5
      }
    }
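
The two configurations differ only in the `mode` value and in the name and contents of the matching config object, so both can be generated from one function. The sketch below is illustrative: `build_turn_detection` and `EXAMPLE_SETTINGS` are hypothetical names, and the values are the ones used in the examples above, not documented defaults. It covers only the two modes shown in this section.

```python
# Values copied from the Server VAD and Agora VAD examples.
EXAMPLE_SETTINGS = {
    "server_vad": {
        "prefix_padding_ms": 800,
        "silence_duration_ms": 640,
        "start_of_speech_sensitivity": "START_SENSITIVITY_HIGH",
        "end_of_speech_sensitivity": "END_SENSITIVITY_HIGH",
    },
    "agora_vad": {
        "interrupt_duration_ms": 160,
        "prefix_padding_ms": 800,
        "silence_duration_ms": 640,
        "threshold": 0.5,
    },
}


def build_turn_detection(mode, **overrides):
    """Return a turn_detection block for the given mode (hypothetical helper)."""
    if mode not in EXAMPLE_SETTINGS:
        raise ValueError(f"unsupported mode: {mode!r}")
    # Copy so that overrides never mutate the shared example values.
    settings = dict(EXAMPLE_SETTINGS[mode])
    unknown = set(overrides) - set(settings)
    if unknown:
        raise ValueError(f"unknown keys for {mode}: {sorted(unknown)}")
    settings.update(overrides)
    # The config key is the mode name plus "_config", as in both examples.
    return {"mode": mode, f"{mode}_config": settings}


# Example: Agora VAD with a longer end-of-speech silence window.
turn_detection = build_turn_detection("agora_vad", silence_duration_ms=800)
```

Rejecting unknown keys catches a common mistake, such as passing `threshold` while `mode` is `server_vad`.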

Key parameters

mllm required
  • enable boolean nullable

    Enables the MLLM module. Replaces the deprecated advanced_features.enable_mllm.

  • adc_credentials_string string required

    Base64-encoded Google Cloud Application Default Credentials (ADC).

  • project_id string required

    Your Google Cloud project ID for Vertex AI access.

  • location string required

    The Google Cloud region hosting the Gemini Live model. Check the Google Cloud documentation for the full list of available regions.

  • messages array[object] nullable

    An array of conversation history items passed to the model as context. Each item represents a single message in the conversation history.

    • role string required

      The role of the message author. For example, user.

    • content string required

      The content of the message.

  • params object required

    Main configuration object for the Gemini Live model.

    • model string required

      The Gemini Live model identifier.

    • instructions string nullable

      System instructions that define the agent’s behavior or tone.

    • voice string nullable

      The voice identifier for audio output. For example, Aoede, Puck, Charon, Kore, Fenrir, Leda, Orus, or Zephyr.

    • transcribe_agent boolean nullable

      Whether to transcribe the agent’s speech in real time.

    • transcribe_user boolean nullable

      Whether to transcribe the user’s speech in real time.

  • turn_detection object nullable

    Turn detection configuration for the MLLM module.

    Note: When mllm.turn_detection is defined, the top-level turn_detection object has no effect.

    • mode string nullable

      Possible values: agora_vad, server_vad, semantic_vad

      • agora_vad: Agora VAD-based detection.
      • server_vad: Vendor-side VAD-based detection.
    • agora_vad_config object nullable

      Configuration for Agora VAD-based turn detection. Applicable when mode is agora_vad.

      • interrupt_duration_ms integer nullable

        Minimum duration of speech in milliseconds required to trigger an interruption.

      • prefix_padding_ms integer nullable

        Duration of audio in milliseconds to include before the detected speech start.

      • silence_duration_ms integer nullable

        Duration of silence in milliseconds required to determine end of speech.

      • threshold number nullable

        VAD sensitivity threshold. A higher value reduces false positives.

    • server_vad_config object nullable

      Configuration for vendor-side VAD-based turn detection. Applicable when mode is server_vad. Parameters are passed through to the vendor.

      • prefix_padding_ms integer nullable

        Duration of audio in milliseconds to include before the detected speech start.

      • silence_duration_ms integer nullable

        Duration of silence in milliseconds required to determine end of speech.

      • start_of_speech_sensitivity string nullable

        Possible values: START_SENSITIVITY_HIGH, START_SENSITIVITY_LOW

        Sensitivity for start of speech detection.

      • end_of_speech_sensitivity string nullable

        Possible values: END_SENSITIVITY_HIGH, END_SENSITIVITY_LOW

        Sensitivity for end of speech detection.

  • input_modalities array[string] nullable

    Default: ["audio"]

    Input modalities for the MLLM.

    • ["audio"]: Audio-only input
    • ["audio", "text"]: Accept both audio and text input
  • output_modalities array[string] nullable

    Default: ["audio"]

    Output modalities for the MLLM.

    • ["audio"]: Audio-only response
    • ["text", "audio"]: Combined text and audio output
  • greeting_message string nullable

    The message the agent speaks when a user joins the channel.

  • vendor string required

    The MLLM provider identifier. Set to "vertexai" to use Google Gemini Live with Vertex AI.
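
Two details in the list above are easy to get wrong: `adc_credentials_string` must be Base64-encoded, and several fields are marked required. The Python sketch below shows one way to handle both before sending a request. It assumes the credentials are a JSON document and that the entire file content is encoded with the standard Base64 alphabet; the source only states that the value is Base64-encoded. The names `encode_adc_credentials`, `missing_required_fields`, and `REQUIRED_TOP_LEVEL` are hypothetical.

```python
import base64
import json

# Fields marked "required" in the parameter list (assumed to mean non-empty).
REQUIRED_TOP_LEVEL = ("adc_credentials_string", "project_id", "location", "params", "vendor")


def encode_adc_credentials(raw_json: bytes) -> str:
    """Base64-encode the raw bytes of a credentials JSON document."""
    json.loads(raw_json)  # fail early if the input is not valid JSON
    return base64.b64encode(raw_json).decode("ascii")


def missing_required_fields(mllm: dict) -> list:
    """Return the names of required fields that are absent or empty."""
    missing = [key for key in REQUIRED_TOP_LEVEL if not mllm.get(key)]
    if not (mllm.get("params") or {}).get("model"):
        missing.append("params.model")
    return missing


# Stand-in document for illustration only; real credentials come from Google Cloud.
sample = json.dumps({"type": "service_account", "project_id": "demo"}).encode("utf-8")
encoded = encode_adc_credentials(sample)

mllm = {
    "adc_credentials_string": encoded,
    "project_id": "demo",
    "location": "<GOOGLE_CLOUD_REGION>",
    "params": {"model": "gemini-3.1-flash-live-preview"},
    "vendor": "vertexai",
}
problems = missing_required_fields(mllm)
```

In practice the bytes would be read from the credentials file on disk, for example with `pathlib.Path(path).read_bytes()`, and the request would be sent only when `problems` is empty.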

For a comprehensive API reference, real-time capabilities, and detailed parameter descriptions, see the Google Gemini Live API documentation.