xAI Grok

xAI Grok provides multimodal large language model capabilities with real-time audio processing, enabling natural voice conversations without separate ASR/TTS components. This page covers integration using the xAI Realtime API, authenticated with an API key obtained from the xAI developer console.

Info: Enabling MLLM automatically disables ASR, LLM, and TTS, because the MLLM handles voice processing end to end.

Sample configuration

The following example shows a starting mllm parameter configuration you can use when you Start a conversational AI agent.


    "mllm": {
      "enable": true,
      "vendor": "xai",
      "url": "wss://api.x.ai/v1/realtime",
      "api_key": "<XAI_API_KEY>",
      "messages": [
        {
          "role": "user",
          "content": "<HISTORY_CONTENT>"
        }
      ],
      "output_modalities": [
        "audio",
        "text"
      ],
      "params": {
        "voice": "eve",
        "language": "en",
        "sample_rate": 24000
      },
      "turn_detection": {
        // see details below
      },
      "greeting_message": "Hello, how can I help?"
    }
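
As a minimal sketch, the same block can be assembled in Python before it is embedded in the request that starts the agent. The helper name build_mllm_config and the XAI_API_KEY environment variable are assumptions made for illustration; only the field names and values come from the sample configuration. The turn_detection block is omitted here because it is described in the Turn detection section.

```python
import json
import os
from typing import Dict, List, Optional


def build_mllm_config(api_key: str, history: Optional[List[Dict[str, str]]] = None) -> dict:
    """Return an mllm block that mirrors the sample configuration.

    build_mllm_config is a hypothetical helper, not part of any SDK.
    """
    if not api_key:
        raise ValueError("api_key must not be empty")
    return {
        "enable": True,
        "vendor": "xai",
        "url": "wss://api.x.ai/v1/realtime",
        "api_key": api_key,
        # Conversation history items, each with a role and content.
        "messages": history or [],
        "output_modalities": ["audio", "text"],
        "params": {"voice": "eve", "language": "en", "sample_rate": 24000},
        "greeting_message": "Hello, how can I help?",
    }


if __name__ == "__main__":
    # XAI_API_KEY is an assumed environment variable name; the placeholder
    # is used when it is not set.
    key = os.environ.get("XAI_API_KEY", "<XAI_API_KEY>")
    print(json.dumps({"mllm": build_mllm_config(key)}, indent=2))
```

Reading the key from the environment keeps it out of source control; how the resulting object is sent depends on the request you use to start the agent.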

Turn detection

To set up turn detection, add a turn_detection block inside the mllm object when you Start a conversational AI agent.

Info: When mllm.turn_detection is defined, the top-level turn_detection object has no effect.

The following examples show the supported configurations for xAI Grok.

  • Server VAD


    "turn_detection": {
      "mode": "server_vad",
      "server_vad_config": {
        "threshold": 0.5,
        "prefix_padding_ms": 640,
        "silence_duration_ms": 900
      }
    }

  • Agora VAD


    "turn_detection": {
      "mode": "agora_vad",
      "agora_vad_config": {
        "threshold": 0.5,
        "interrupt_duration_ms": 160,
        "prefix_padding_ms": 800,
        "silence_duration_ms": 640
      }
    }
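
The two configurations above share one shape: a mode plus a matching "<mode>_config" object. The sketch below builds either variant and rejects keys that this page does not list for the chosen mode. The helper name build_turn_detection and the strict key check are assumptions; because server_vad parameters are passed through to the vendor, the vendor may accept keys beyond the ones shown here.

```python
# Keys listed on this page for each mode. This is an assumption of the
# sketch, not an exhaustive schema.
ALLOWED_KEYS = {
    "server_vad": {"threshold", "prefix_padding_ms", "silence_duration_ms"},
    "agora_vad": {
        "threshold",
        "interrupt_duration_ms",
        "prefix_padding_ms",
        "silence_duration_ms",
    },
}


def build_turn_detection(mode: str, **settings) -> dict:
    """Return a turn_detection block for the given mode (hypothetical helper)."""
    if mode not in ALLOWED_KEYS:
        raise ValueError(f"unsupported mode: {mode!r}")
    unknown = set(settings) - ALLOWED_KEYS[mode]
    if unknown:
        raise ValueError(f"keys not listed for {mode}: {sorted(unknown)}")
    # The config object is named after the mode, e.g. server_vad_config.
    return {"mode": mode, f"{mode}_config": dict(settings)}


server_vad = build_turn_detection(
    "server_vad", threshold=0.5, prefix_padding_ms=640, silence_duration_ms=900
)
agora_vad = build_turn_detection(
    "agora_vad",
    threshold=0.5,
    interrupt_duration_ms=160,
    prefix_padding_ms=800,
    silence_duration_ms=640,
)
```

Either result can be placed under the turn_detection key of the mllm object.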

Key parameters

mllm (required)
  • enable (boolean, nullable)

    Enables the MLLM module. Replaces the deprecated advanced_features.enable_mllm.

  • vendor (string, required)

    The MLLM provider identifier. Set to "xai" to use xAI Grok.

  • url (string, required)

    The WebSocket endpoint for the xAI Realtime API. Set to "wss://api.x.ai/v1/realtime".

  • api_key (string, required)

    The xAI API key used to authenticate requests. Get your API key from the xAI Console.

  • messages (array[object], nullable)

    An array of conversation history items passed to the model as context. Each item represents a single message in the conversation history.

    • role (string, required)

      The role of the message author. For example, system or user.

    • content (string, required)

      The content of the message.

  • params (object, required)

    Configuration object for the xAI Grok model.

    • voice (string, nullable)

      The voice identifier for audio output. For example, eve or rex.

    • language (string, nullable)

      The language code for speech recognition and synthesis. For example, en.

    • sample_rate (integer, nullable)

      The audio sample rate in Hz. For example, 24000.

  • turn_detection (object, nullable)

    Turn detection configuration for the MLLM module. For a full list of turn_detection parameters, see mllm.turn_detection.

    • mode (string, nullable)

      Possible values: agora_vad, server_vad

      • agora_vad: Agora VAD-based detection.
      • server_vad: Vendor-side VAD-based detection.
    • agora_vad_config (object, nullable)

      Configuration for Agora VAD-based turn detection. Applicable when mode is agora_vad.

      • interrupt_duration_ms (integer, nullable)

        Minimum duration of speech in milliseconds required to trigger an interruption.

      • prefix_padding_ms (integer, nullable)

        Duration of audio in milliseconds to include before the detected speech start.

      • silence_duration_ms (integer, nullable)

        Duration of silence in milliseconds required to determine end of speech.

      • threshold (number, nullable)

        VAD sensitivity threshold. A higher value reduces false positives.

    • server_vad_config (object, nullable)

      Configuration for vendor-side VAD-based turn detection. Applicable when mode is server_vad. Parameters are passed through to the vendor.

      • threshold (number, nullable)

        VAD sensitivity threshold. A higher value reduces false positives.

      • prefix_padding_ms (integer, nullable)

        Duration of audio in milliseconds to include before the detected speech start.

      • silence_duration_ms (integer, nullable)

        Duration of silence in milliseconds required to determine end of speech.

  • output_modalities (array[string], nullable)

    Default: ["audio"]

    Output modalities for the MLLM.

    • ["audio"]: Audio-only output
    • ["text", "audio"]: Combined text and audio output
  • greeting_message (string, nullable)

    The message the agent speaks when a user joins the channel.

  • failure_message (string, nullable)

    The message the agent speaks when an error occurs.
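
The parameter list above can be turned into a small pre-flight check that runs before the configuration is sent. The following is a sketch under stated assumptions: validate_mllm is a hypothetical helper, it covers only the required fields, the output_modalities values, and the messages item shape described on this page, and it treats the order of output_modalities as irrelevant, which this page does not confirm.

```python
# Fields marked as required in the parameter list.
REQUIRED_FIELDS = ("vendor", "url", "api_key", "params")

# Documented combinations: audio only, or text plus audio. Order is ignored
# here, which is an assumption of this sketch.
ALLOWED_MODALITIES = {frozenset({"audio"}), frozenset({"audio", "text"})}


def validate_mllm(mllm: dict) -> list:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if mllm.get(field) in (None, ""):
            problems.append(f"missing required field: {field}")

    # Fall back to the documented default when the field is absent or null.
    modalities = mllm.get("output_modalities") or ["audio"]
    if frozenset(modalities) not in ALLOWED_MODALITIES:
        problems.append(f"unsupported output_modalities: {modalities}")

    # Each history item needs a string role and a string content.
    for index, message in enumerate(mllm.get("messages") or []):
        for key in ("role", "content"):
            if not isinstance(message.get(key), str):
                problems.append(f"messages[{index}].{key} must be a string")
    return problems
```

For example, validate_mllm({"vendor": "xai"}) reports the missing url, api_key, and params fields, while a configuration that sets all four required fields and leaves the rest unset returns an empty list.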

For a comprehensive API reference, real-time capabilities, and detailed parameter descriptions, see the xAI Voice Agent API documentation.