Start a conversational AI agent

POST
https://api.agora.io/api/conversational-ai-agent/v2/projects/{appid}/join

Use this endpoint to create and start a Conversational AI agent instance.

Request

Path parameters

appid stringrequired

The App ID of the project.

Request body

Content type: application/json. Body required.
  • name stringrequired

    The unique identifier of the agent. The same identifier cannot be used repeatedly.

  • properties objectrequired

    Configuration details of the agent.

    • channel stringrequired

      The name of the channel to join.

    • token stringrequired

      The authentication token used by the agent to join the channel.

    • agent_rtc_uid stringrequired

      The user ID of the agent in the channel. A value of 0 means that a random UID is generated and assigned. Set the token accordingly.

    • remote_rtc_uids array[string]required

      A list of user IDs that the agent subscribes to in the channel. Only subscribed users can interact with the agent.

      info

      Currently, only one user ID is supported.

    • enable_string_uid booleannullable

      Default: false

      Whether to enable string user IDs (String UID):

      • true: Both agent and subscriber user IDs use strings.
      • false: Both agent and subscriber user IDs must be integers.
    • idle_timeout integernullable

      Default: 30

      Sets the timeout (in seconds) that starts after all the users specified in remote_rtc_uids are detected to have left the channel. When the timeout value is exceeded, the agent automatically stops and exits the channel. A value of 0 means that the agent does not exit until it is stopped manually.

      Agent lifecycle best practice

      For precise and reliable control over the agent's lifecycle, use the leave API to terminate the agent as soon as its task is complete.

    • geofence objectnullable

      Regional access restriction configuration. Use this to limit which Agora servers the Conversational AI Engine can access based on geographic regions.

      • area stringrequired

        Possible values: GLOBAL, NORTH_AMERICA, EUROPE, ASIA, INDIA, JAPAN

        The allowed region for server access.

      • exclude_area stringnullable

        Possible values: NORTH_AMERICA, EUROPE, ASIA, INDIA, JAPAN

        The excluded region. Only available when area is set to GLOBAL.
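
      For example, a minimal sketch of a geofence that allows global access while excluding European servers; both values come from the possible values listed above:

        "geofence": {
          "area": "GLOBAL",
          "exclude_area": "EUROPE"
        }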

    • advanced_features objectnullable

      Advanced features configuration.

      • enable_aivad booleandeprecatednullable

        Default: false

        Whether to enable the intelligent interruption handling function (AIVAD). This feature is currently available only for English.

        caution

        This field is deprecated. Set turn_detection.config.end_of_speech.mode to semantic instead.

      • enable_mllm booleannullable

        Default: false

        Enable Multimodal Large Language Model for voice-to-voice processing. Enabling MLLM automatically disables ASR, LLM, and TTS since the MLLM handles end-to-end voice processing directly. When you set this parameter to true, enable_aivad is also disabled. See turn_detection.type for turn detection options available with MLLM.

      • enable_rtm booleannullable

        Default: false

        Whether to enable the Signaling (RTM) service. When enabled, the agent can combine the capabilities provided by Signaling to implement advanced functions, such as delivering custom information.

        info

        Before enabling the Signaling service, make sure the token includes both RTC and RTM privileges. When an agent joins an RTM channel, it reuses the token specified in the token field. For more information, see "How can I generate a token with both RTC and Signaling privileges?".

      • enable_sal booleannullable

        Default: false

        Enable Selective Attention Locking (SAL). When enabled, configure the sal field to set up speaker recognition or locking modes. See the sal parameter for configuration details.

      • enable_tools booleannullable

        Default: false

        Enable tool invocation. When enabled, the agent can invoke tools provided by the MCP server to implement advanced functionality.

    • asr objectnullable

      Automatic Speech Recognition (ASR) configuration.

    • tts objectrequired

      Text-to-speech (TTS) module configuration.

      • vendor stringrequired

        Possible values: microsoft, elevenlabs, minimax, cartesia, openai, humeai, rime, fishaudio, google, amazon, sarvam

        TTS provider.

      • params objectrequired

        The configuration parameters for the TTS vendor. See TTS Overview for details.

      • skip_patterns array[integer]nullable

        Controls whether the TTS module skips bracketed content when reading LLM response text. This prevents the agent from vocalizing structural prompt information like tone indicators, action descriptions, and system prompts, creating a more natural and immersive listening experience. Enable this feature by specifying one or more values:

        • 1: Skip content in Chinese parentheses （）
        • 2: Skip content in Chinese square brackets 【】
        • 3: Skip content in parentheses ( )
        • 4: Skip content in square brackets [ ]
        • 5: Skip content in curly braces { }
        info
        • Nested brackets: When input text contains nested brackets and multiple bracket types are configured to be skipped, the system processes only the outermost brackets. The system matches from the beginning of the text and skips the first outermost bracket pair that meets the skip rule, including all nested content.
        • Agent memory: The agent's short-term memory always contains the complete, unfiltered LLM text, regardless of live captioning settings.
        • Real-time transcript: When enabled, transcript excludes filtered content during TTS playback but restores the complete text after each sentence finishes.
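
        For example, a minimal sketch that skips parentheses and square brackets (pattern IDs 3 and 4 above), so LLM output such as "(whispers) Hello [smiles]" is vocalized simply as "Hello":

          "skip_patterns": [3, 4]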
    • llm objectrequired

      Large language model (LLM) configuration.

      • url stringrequired

        The LLM callback address.

      • api_key stringnullable

        The LLM verification API key. The default value is an empty string. Ensure that you enable the API key in a production environment.

      • system_messages array[object]nullable

        A set of predefined information used as input to the LLM, including prompt words and examples.

      • params objectnullable

        Additional LLM configuration parameters, such as the model used, and the maximum token limit. For details about each supported LLM, refer to Supported LLMs.

      • max_history integernullable

        Default: 32

        Range: [1, 1024]

        The number of conversation history messages cached in the LLM. History includes user and agent dialog messages, tool call information, and timestamps. Agent and user messages are recorded separately.

      • input_modalities array[string]nullable

        Default: ["text"]

        LLM input modalities:

        • ["text"]: Text only
        • ["text", "image"]: Text plus image; requires the selected LLM to support visual input
      • output_modalities array[string]nullable

        Default: ["text"]

        LLM output modalities:

        • ["text"]: The output text is converted to speech by the TTS module and then published to the RTC channel.
        • ["audio"]: Voice only. Voice is published directly to the RTC channel.
        • ["text", "audio"]: Text plus voice. Write your own logic to process the output of LLM as needed.
      • greeting_configs objectnullable

        Agent greeting broadcast configuration.

        • mode stringnullable

          Default: single_every

          Possible values: single_every, single_first

          Determines when the agent sends greeting messages to users joining the channel.

          • single_every: Broadcasts a greeting every time a user joins the channel.
          • single_first: Broadcasts a greeting only once to the first user who joins the channel.
      • greeting_message stringnullable

        Agent greeting. If provided, the first user in the channel is automatically greeted with the message upon joining.

      • failure_message stringnullable

        Prompt for agent activation failure. If provided, it is returned through TTS when the custom LLM call fails.

      • vendor stringnullable

        LLM provider, supports the following settings:

        • custom: Custom LLM. When you set this option, the agent includes the following fields, in addition to role and content when making requests to the custom LLM:
          • turn_id: A unique identifier for each conversation turn. It starts from 0 and increments with each turn. One user-agent interaction corresponds to one turn_id.
          • timestamp: The request timestamp, in milliseconds.
        • azure: Use this value for Azure OpenAI
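
        To illustrate the custom option, a hedged sketch of a request body the agent might send to a custom LLM. Only the turn_id and timestamp field names and meanings come from the description above; their top-level placement and the surrounding OpenAI-style messages structure are assumptions:

          {
            "messages": [
              { "role": "user", "content": "What is the weather like today?" }
            ],
            "turn_id": 3,
            "timestamp": 1737111452000
          }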
      • style stringnullable

        Default: openai

        Possible values: openai, gemini, anthropic, dify

        The request style for chat completion:

        • openai: For OpenAI and OpenAI-compatible APIs
        • gemini: For Google Gemini and Google Vertex API format
        • anthropic: For Anthropic Claude API format
        • dify: For Dify API format

        For details, refer to Supported LLMs.

      • template_variables objectnullable

        Template parameter configuration used to insert variables into the agent's system_messages, greeting_message, failure_message, and parameters.silence_config.content text. Uses key-value pairs, where the key is the variable name and the value is the variable's value. Template variables, combined with prompt customization and SIP outbound calling functionality, enable dynamic content injection, automating processes such as automatic hang-up, voicemail recognition, automatic message leaving, and call transfer.

        To insert defined variables in the prompt text, use the syntax {{variable_name}}. The system automatically replaces each variable with the corresponding value defined in template_variables.

        caution

        Variable values cannot reference other variables. For example, if you define "farewell": "Looking forward to seeing you again, {{name}}", the {{name}} variable will not be resolved.
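
        For example, a minimal sketch that defines two hypothetical variables and references them in the greeting:

          "template_variables": {
            "name": "Ann",
            "company": "Example Corp"
          },
          "greeting_message": "Hi {{name}}, welcome to {{company}} support."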

      • mcp_servers arraynullable

        MCP (Model Context Protocol) server configuration. By configuring MCP servers, agents can call tools provided by external services to implement advanced functionality.

        • name stringrequired

          A unique identifier for the MCP server. Maximum 48 characters. Accepts only English letters and numbers.

        • endpoint stringrequired

          The endpoint address of the MCP server. The agent uses this to communicate with the MCP server.

        • transport stringnullable

          Possible values: streamable_http

          Transport protocol type.

          • streamable_http: Streaming HTTP protocol
        • headers objectnullable

          HTTP header information to include when requesting the MCP server, such as authentication information.

        • allowed_tools arraynullable

          A list of tools that the agent is allowed to invoke. The agent can only use tools on this list.

          Behavior:

          • Omitted or null: All tools are enabled.
          • Empty array []: No tools are enabled.
          • ["*"]: All tools are enabled.
          • Specific tools ["aa", "bb", "cc"]: Only aa, bb, and cc are enabled.
          • Mix with wildcard ["aa", "bb", "*"]: All tools are enabled (wildcard takes precedence).
        • timeout_ms integernullable

          The MCP server request timeout in milliseconds. After timeout, the agent stops waiting for the MCP server's response and continues executing subsequent logic.
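
        Putting these fields together, a minimal sketch of an mcp_servers entry; the endpoint URL, header value, and tool names are placeholders:

          "mcp_servers": [
            {
              "name": "weatherTools",
              "endpoint": "https://mcp.example.com/mcp",
              "transport": "streamable_http",
              "headers": { "Authorization": "Bearer <your_mcp_token>" },
              "allowed_tools": ["get_weather", "get_forecast"],
              "timeout_ms": 5000
            }
          ]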

    • mllm objectnullable

      Multimodal Large Language Model (MLLM) configuration for real-time audio and text processing.

      • url stringrequired

        The MLLM WebSocket URL for real-time communication.

      • api_key stringrequired

        The API key used for MLLM authentication.

      • messages array[object]nullable

        Array of conversation items used for short-term memory management. Uses the same structure as item.content from the OpenAI Realtime API.

      • params objectnullable

        Additional MLLM configuration parameters.

        • Modalities override: The modalities setting in params is overridden by input_modalities and output_modalities.
        • Turn detection override: The turn_detection setting in params is overridden by the turn_detection section outside of mllm.

        See MLLM Overview for details.

      • input_modalities array[string]nullable

        Default: ["audio"]

        MLLM input modalities:

        • ["audio"]: Audio only
        • ["audio", "text"]: Audio plus text
      • output_modalities array[string]nullable

        Default: ["text", "audio"]

        MLLM output modalities:

        • ["text", "audio"]: Text plus audio
      • greeting_message stringnullable

        Agent greeting message. If provided, the first user in the channel is automatically greeted with this message upon joining.

      • vendor stringnullable

        Possible values: openai, vertexai

        MLLM provider. Currently supports openai and vertexai.

      • style stringnullable

        Default: openai

        Possible values: openai

        The request style for MLLM completion:

        • openai: For OpenAI Realtime API format
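
      As a sketch, a minimal mllm block in the OpenAI Realtime style; the URL, API key, and model name are placeholders. Remember to also set advanced_features.enable_mllm to true, as described above:

        "mllm": {
          "url": "wss://api.openai.com/v1/realtime",
          "api_key": "<your_mllm_key>",
          "params": { "model": "<your_realtime_model>" },
          "input_modalities": ["audio"],
          "output_modalities": ["text", "audio"],
          "vendor": "openai",
          "style": "openai"
        }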
    • avatar objectnullable

      Avatar configuration.

      • enable booleannullable

        Default: false

        Whether to enable the avatar function for the agent. To enable, set to true and configure the vendor and params fields.

      • vendor stringnullable

        Possible values: akool, heygen

        Avatar vendor. Supports akool and heygen.

      • params objectnullable

        The configuration parameters for the avatar vendor. See AI Avatar Overview for details.

    • turn_detection objectnullable

      Conversation turn detection settings. Controls the logic for voice activity detection and conversation turn determination. The previous version of turn_detection is deprecated. Refer to Deprecated parameters for details. Agora recommends switching to the latest parameters.

      This configuration supports multiple combinations of detection modes:

      • Start of Speech (SoS): Supports three modes: VAD, Keyword, and Disable.
      • End of Speech (EoS): Supports VAD and Semantic modes.
      • mode stringnullable

        Default: default

        Possible values: default

        Conversation turn detection mode:

        • default: Uses standard conversation turn detection configuration.
      • config objectnullable

        Detailed configuration for conversation turn detection.

        • speech_threshold numbernullable

          Default: 0.5

          Range: (0.0, 1.0)

          Voice activity detection sensitivity. Determines the sound level in the audio signal that is considered voice activity. Lower values make it easier for the agent to detect speech, and higher values ignore weak sounds.

        • start_of_speech objectnullable

          Start of Speech (SoS) detection configuration. Determines when a user begins speaking.

          • mode stringrequired

            Possible values: vad, keywords, disabled

            Start of speech detection mode:

            • vad: Based on VAD (Voice Activity Detection). Uses audio signal detection.
            • keywords: (Beta) Based on keyword trigger. Conversation begins when the agent detects a specified keyword.
            • disabled: Disables start of speech detection. Does not actively trigger new conversation turns.
          • {mode}_config objectnullable

            Start of speech detection configuration parameters. The structure and supported fields vary depending on the detection mode.

            info
            • The configuration type must match mode. For example, when mode is vad, you must provide vad_config.
            • You cannot provide multiple mode configurations simultaneously.

            Configuration examples:

            • vad_config


              "vad_config": {
                "interrupt_duration_ms": 160,
                "speaking_interrupt_duration_ms": 160,
                "prefix_padding_ms": 800
              }

            • keywords_config


              "keywords_config": {
                "interrupt_duration_ms": 160,
                "prefix_padding_ms": 800,
                "triggered_keywords": ["Are you there", "hello"]
              }

            • disabled_config


              "disabled_config": {
                "strategy": "append"
              }

            • strategy stringnullable

              Possible values: append, ignored

              Voice processing strategy when the agent is interacting (speaking or thinking):

              • append: Append mode. Human voice does not interrupt the agent. The agent processes the human voice input after the current interaction ends.
              • ignored: Ignore mode. The agent ignores human voice input. If the agent receives human voice while speaking or thinking, the agent discards the input without storing it in context.
        • end_of_speech objectnullable

          End of Speech (EoS) detection configuration. Determines when a user ends their speech.

          • mode stringnullable

            Possible values: vad, semantic

            End of speech detection mode. Possible values:

            • vad: Based on VAD (Voice Activity Detection). Detects silence duration.
            • semantic: Based on semantic triggering. Uses semantic understanding to determine when conversation ends.
          • {mode}_config objectnullable

            End of speech detection configuration parameters. The structure and supported fields vary depending on the detection mode.

            info
            • The configuration type must match mode. For example, when mode is vad, you must provide vad_config.
            • You cannot provide multiple mode configurations simultaneously.

            Configuration examples:

            • vad_config


              "vad_config": {
                "silence_duration_ms": 640
              }

            • semantic_config


              "semantic_config": {
                "silence_duration_ms": 320,
                "max_wait_ms": 3000
              }

            • silence_duration_ms integernullable

              Default: 320

              Range: [0, 2000]

              Silence duration threshold in milliseconds. The minimum silence duration at the end of a speech segment, to ensure that a brief pause does not prematurely end the speech segment.

            • max_wait_ms integernullable

              Default: 3000

              Range: [0, 10000]

              Maximum wait time in milliseconds. The maximum time to wait for semantic determination. After timeout, the conversation end is determined based on the current state.
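
      Combining the pieces above, a minimal sketch of a complete turn_detection object that uses VAD for start of speech and semantic detection for end of speech; the values mirror the defaults and examples shown above:

        "turn_detection": {
          "mode": "default",
          "config": {
            "speech_threshold": 0.5,
            "start_of_speech": {
              "mode": "vad",
              "vad_config": {
                "interrupt_duration_ms": 160,
                "speaking_interrupt_duration_ms": 160,
                "prefix_padding_ms": 800
              }
            },
            "end_of_speech": {
              "mode": "semantic",
              "semantic_config": {
                "silence_duration_ms": 320,
                "max_wait_ms": 3000
              }
            }
          }
        }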

    • sal objectnullable

      Selective Attention Locking (SAL) configuration. (Beta)

      • sal_mode stringnullable

        Default: locking

        Possible values: locking, recognition

        Selective attention lock mode. Supports the following options:

        • locking: Speaker Lock Mode. The agent locks onto the speaker, blocking 95% of ambient human voices and noise. You can enable this mode in two ways:

          • Seamless mode: When a user speaks loudly and clearly at the beginning of a conversation, the agent automatically recognizes that user as the locked speaker.
          • Personalized mode: When creating an agent, a speaker's voiceprint URL is pre-registered through the sample_urls field. The agent then locates the speaker based on the pre-registered voiceprint.
        • recognition: Voiceprint recognition mode. You can pre-register only one voiceprint URL using the sample_urls field. The agent identifies different speakers and suppresses other background voices and environmental noise. The target speaker is identified through the vpids field within metadata and passed to the LLM. Set llm.vendor to "custom" and refer to Custom LLM for instructions on how to make the LLM process speaker information.

      • sample_urls objectnullable

        The registered voiceprint URL as a key-value pair, where the key is the voiceprint name and the value is the download URL for the speaker's voiceprint. Only one voiceprint URL is supported.
        Example:


        {
          "speaker1": "https://example.com/speaker1.pcm"
        }

        info
        • Do not set the incoming voiceprint name to "unknown"; this is a reserved keyword used to identify unknown speakers.
        • For a registered voiceprint, ensure that:
          • Size: The voiceprint file does not exceed 2 MB.
          • Duration: Contains 10 to 15 seconds of audio, with at least 8 seconds of effective audio excluding silent segments.
          • Format: 16kHz sampling rate, 16-bit depth, mono PCM audio file. The file name extension must be ".pcm".
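
      For example, a minimal sketch of a sal block that locks onto a pre-registered speaker; the voiceprint URL is a placeholder, and advanced_features.enable_sal must also be set to true, as described above:

        "sal": {
          "sal_mode": "locking",
          "sample_urls": {
            "speaker1": "https://example.com/speaker1.pcm"
          }
        }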
    • labels objectnullable

      Custom labels in key-value pair format, where the key is the label name and the value is the label value. Enables agents to carry custom business information.

      These labels are bound to the agent and returned in the payload field of all message notification callbacks from the conversational AI engine. Use them to implement custom business logic, such as tagging activity IDs, customer groups, and business scenarios.
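
      For example, a minimal sketch of labels carrying hypothetical business metadata:

        "labels": {
          "activity_id": "promo-2025",
          "customer_group": "vip"
        }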

    • rtc objectnullable

      RTC media encryption configuration.

      • encryption_key stringnullable

        The encryption key for RTC media content. The key has no length limit. Agora recommends using a 32-byte key. If no encryption key is set or if the key is empty, built-in encryption is not used.

      • encryption_salt stringnullable

        The salt value used for encryption. This is a Base64-encoded string that is 32 bytes long after decoding. This parameter only takes effect when encryption_mode is set to 7 (AES_128_GCM2) or 8 (AES_256_GCM2). Ensure that the salt parameter is not empty for these encryption modes.

      • encryption_mode integernullable

        Possible values: 1, 2, 3, 4, 5, 6, 7, 8

        The built-in encryption mode.

        • 1: AES_128_XTS - 128-bit AES encryption, XTS mode.
        • 2: AES_128_ECB - 128-bit AES encryption, ECB mode.
        • 3: AES_256_XTS - 256-bit AES encryption, XTS mode.
        • 4: SM4_128_ECB - 128-bit SM4 encryption, ECB mode.
        • 5: AES_128_GCM - 128-bit AES encryption, GCM mode.
        • 6: AES_256_GCM - 256-bit AES encryption, GCM mode.
        • 7: AES_128_GCM2 - 128-bit AES encryption, GCM mode. Requires setting encryption_salt.
        • 8: AES_256_GCM2 - 256-bit AES encryption, GCM mode. Requires setting encryption_salt.

        Agora recommends using either 7 (AES_128_GCM2) or 8 (AES_256_GCM2) mode. Both modes support cryptographic salts to enhance security.
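
      To illustrate, a minimal sketch of an rtc block using the recommended AES_256_GCM2 mode; the key and salt are placeholders, and the salt must be a Base64 string that decodes to 32 bytes, as described above:

        "rtc": {
          "encryption_mode": 8,
          "encryption_key": "<your_encryption_key>",
          "encryption_salt": "<your_base64_encoded_salt>"
        }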

    • filler_words objectnullable

      Filler word configuration. Plays filler words while waiting for LLM responses to reduce user anxiety and improve conversation flow.

      Filler word playback follows these rules:

      • Playback order: When multiple filler words or LLM responses are waiting to be played, they are played in the order they arrive.
      • Interruption control: Inherits the interruption mode setting from global configuration in turn_detection.config.
      • enable booleannullable

        Default: false

        Whether to enable filler words:

        • true: Enable filler words.
        • false: Disable filler words.
      • trigger objectnullable

        Filler word trigger configuration. Defines when to trigger filler word playback.

        • mode stringnullable

          Possible values: fixed_time

          Filler word trigger mode:

          • fixed_time: Fixed time trigger. Triggers filler word playback when LLM response wait time exceeds the threshold.
        • {mode}_config objectnullable

          Filler word trigger configuration parameters. The parameter name and structure vary depending on the trigger mode.

          info
          • The configuration type must match mode. For example, when mode is fixed_time, you must provide fixed_time_config.
          • You cannot provide multiple mode configurations simultaneously.

          Configuration example:


          "fixed_time_config": {
            "response_wait_ms": 1500
          }

          • response_wait_ms integernullable

            Default: 1500

            Range: [100, 10000]

            LLM response wait threshold in milliseconds. Triggers filler word playback when the LLM waits this duration without generating a response, such as when waiting for RAG retrieval or tool call results.

      • content objectnullable

        Filler word content configuration. Defines the source and selection rules for filler words.

        • mode stringnullable

          Possible values: static

          Filler word content mode:

          • static: Static filler words. Uses a predefined list of filler words.
        • {mode}_config objectnullable

          Filler word content configuration parameters. The parameter name and structure vary depending on the content mode.

          info
          • The configuration type must match mode. For example, when mode is static, you must provide static_config.
          • You cannot provide multiple mode configurations simultaneously.

          Static filler word configuration example:


          "static_config": {
            "phrases": [
              "Please wait.",
              "Okay.",
              "Uh-huh."
            ],
            "selection_rule": "shuffle"
          }

          • phrases array[string]nullable

            List of filler word phrases.

            Limits:

            • Maximum 100 filler words.
            • Each filler word must not exceed 50 English words.
          • selection_rule stringnullable

            Possible values: shuffle, round_robin

            Filler word selection rule:

            • shuffle: Random shuffle. Already-used filler words are not repeated until all filler words have been used once. After all filler words are played, they are reshuffled randomly and a new round begins.
            • round_robin: Round-robin. Selects and plays filler words sequentially from the list. After all filler words are played once, a new cycle begins.
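
      Assembled, a minimal sketch of a filler_words configuration using the fixed-time trigger and static content modes described above:

        "filler_words": {
          "enable": true,
          "trigger": {
            "mode": "fixed_time",
            "fixed_time_config": { "response_wait_ms": 1500 }
          },
          "content": {
            "mode": "static",
            "static_config": {
              "phrases": ["Please wait.", "Okay.", "Uh-huh."],
              "selection_rule": "shuffle"
            }
          }
        }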
    • parameters objectnullable

      Agent configuration parameters.

      • silence_config objectnullable

        Settings related to agent silence behavior.

        info

        silence_config does not apply when you integrate an mllm.

        • timeout_ms integernullable

          Default: 0

          Possible values: 0 to 60000

          Specifies the maximum duration (in milliseconds) that the agent can remain silent. After the agent is successfully created and the user joins the channel, any time during which the agent is not listening, thinking, or speaking is considered silent time. When the silent time reaches the specified value, the agent broadcasts a silent reminder message. This feature is useful for prompting users when they become inactive.

          • 0: Disables the silent reminder feature.
          • (0, 60000]: Enables the silent reminder. You must also set content; otherwise, the configuration is invalid.
        • action stringnullable

          Default: speak

          Specifies how the agent behaves when the silent timeout is reached. Valid values:

          • speak: Uses the TTS module to announce the silent prompt (content).
          • think: Appends the silent prompt (content) to the context and passes it to the LLM.
        • content stringnullable

          Specifies the silent prompt message. How the message is used depends on the value of the action parameter.
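
        For example, a minimal sketch that has the agent speak a reminder after 10 seconds of silence; the prompt text is a placeholder:

          "silence_config": {
            "timeout_ms": 10000,
            "action": "speak",
            "content": "Are you still there?"
          }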

      • farewell_config objectnullable

        Graceful hang-up settings for the agent.

        • graceful_enabled booleannullable

          Default: false

          Enable graceful leave:

          • true: Enabled. When enabled, calling the leave API to stop the agent ensures that the agent is in an IDLE state before leaving the channel.
          • false: Disabled.
        • graceful_timeout_seconds integernullable

          Default: 30

          Range: [0, 120]

          Graceful exit timeout (in seconds). Represents the maximum time to wait for the agent to enter an IDLE state before exiting the channel. After this time, the agent will exit the channel immediately, even if it is not in an idle state. This field is only effective when graceful_enabled is true.

      • data_channel stringnullable

        Default: datastream

        Agent data transmission channel:

        • rtm: Use RTM transmission. This configuration takes effect only when advanced_features.enable_rtm is true.
        • datastream: Use RTC data stream transport.
      • enable_metrics booleannullable

        Default: false

        Whether to receive agent performance data:

        • true: Receive agent performance data.
        • false: Do not receive agent performance data.

        This setting only takes effect when advanced_features.enable_rtm is true. See Listen to agent events to learn how to use client components to receive agent performance data.

      • enable_error_message booleannullable

        Default: false

        Whether to receive agent error events:

        • true: Receive agent error events.
      • false: Do not receive agent error events.

        This setting only takes effect when advanced_features.enable_rtm is true. See Listen to agent events to learn how to use client components to receive agent error events.

Response

  • If the returned status code is 200, the request was successful. The response body contains the result of the request.

    OK
    • agent_id string

      The unique ID of the agent instance.

    • create_ts integer

      Unix timestamp (in seconds) of when the agent was created.

    • status string

      Possible values: IDLE, STARTING, RUNNING, STOPPING, STOPPED, RECOVERING, FAILED

      Current status.

      • IDLE (0): Agent is idle.
      • STARTING (1): The agent is being started.
      • RUNNING (2): The agent is running.
      • STOPPING (3): The agent is stopping.
      • STOPPED (4): The agent has exited.
      • RECOVERING (5): The agent is recovering.
      • FAILED (6): The agent failed to execute.
  • If the returned status code is not 200, the request failed. The response body includes the detail and reason for failure. Refer to status codes to understand the possible reasons for failure.

Reference

Deprecated parameters

The following turn detection configuration is deprecated. To create more natural conversations and reduce unintended interruptions, Agora recommends using the latest version of turn_detection above.

Turn detection
  • turn_detection objectnullable

    Conversation turn detection settings.

    • type stringnullable

      Default: agora_vad

      Possible values: agora_vad, server_vad, semantic_vad

      Turn detection mechanism.

      • agora_vad: Agora VAD. Compatible with both cascade (ASR/LLM/TTS) and MLLM modes.
      • server_vad: The model detects the start and end of speech based on audio volume and responds at the end of user speech. Only available when mllm is enabled and OpenAI Realtime or Gemini Live is selected. The detection behavior is controlled by the LLM provider.
      • semantic_vad: Uses a turn detection model in conjunction with VAD to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability for more natural conversations. Only available when mllm is enabled and OpenAI is selected.
    • interrupt_mode stringdeprecatednullable

      Default: interrupt

      Sets the agent's behavior when human voice interrupts the agent while it is interacting (speaking or thinking). Choose from the following values:

      • interrupt: The agent immediately stops the current interaction and processes the human voice input.
      • append: The agent completes the current interaction, then processes the human voice input.
      • ignore: The agent discards the human voice input without processing or storing it in the context.
      • keywords: The agent stops its current interaction after detecting any of the keywords specified in turn_detection.interrupt_keywords.
      • adaptive: The agent dynamically increases the voice continuity threshold while speaking to reduce accidental interruptions.
      info
      • Only the interrupt mode is supported when you integrate an mllm.
      • The keywords interruption mode and the intelligent interruption feature (advanced_features.enable_aivad) are mutually exclusive and cannot be enabled simultaneously.
    • interrupt_duration_ms numberdeprecatednullable

      Default: 160

      The amount of time in milliseconds that the user's voice must exceed the VAD threshold before an interruption is triggered.

    • interrupt_keywords array[string]deprecatednullable

      Specifies the list of keywords that trigger an interruption when turn_detection.interrupt_mode is set to keywords.

      When the agent detects any of these keywords in the user's speech, it immediately stops its current interaction and processes the new input.

      info
      • Keyword recognition capabilities, such as support for multiple languages or dialects, depend on the ASR provider you choose.
      • You can configure up to 128 keywords.
    • prefix_padding_ms integerdeprecatednullable

      Default: 800

      The extra forward padding time in milliseconds before the processing system starts to process the speech input. This padding helps capture the beginning of the speech.

    • silence_duration_ms integerdeprecatednullable

      Default: 640

      The duration of audio silence in milliseconds. If no voice activity is detected during this period, the agent assumes that the user has stopped speaking.

    • threshold numberdeprecatednullable

      Default: 0.5

      Range: (0.0, 1.0)

      Identification sensitivity determines the level of sound in the audio signal that is considered voice activity. Lower values make it easier for the agent to detect speech, and higher values ignore weak sounds.

    • create_response booleannullable

      Default: true

      Whether to automatically generate a response when a VAD stop event occurs. Only available in server_vad and semantic_vad modes when using OpenAI Realtime API.

    • interrupt_response booleannullable

      Default: true

      Whether to automatically interrupt any ongoing response when a VAD start event occurs. Only available in server_vad and semantic_vad modes when using OpenAI Realtime API.

    • eagerness stringnullable

      Default: auto

      Possible values: auto, low, high

      The eagerness of the model to respond:

      • auto: Equivalent to medium
      • low: Wait longer for the user to continue speaking
      • high: Respond more quickly

      Only available in semantic_vad mode when using OpenAI Realtime API.

Authorization

This endpoint requires Basic Auth.
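
As a sketch, assuming your credentials are an Agora customer key and customer secret, you can build the header value like this; the key and secret are placeholders:

  # Base64-encode "<customer_key>:<customer_secret>" to form the Basic Auth credential
  credentials=$(printf "%s:%s" "<your_customer_key>" "<your_customer_secret>" | base64)
  echo "Authorization: Basic $credentials"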

Request examples


curl --request POST \
  --url https://api.agora.io/api/conversational-ai-agent/v2/projects/:appid/join \
  --header 'Authorization: Basic <your_base64_encoded_credentials>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "name": "unique_name",
  "properties": {
    "channel": "channel_name",
    "token": "token",
    "agent_rtc_uid": "1001",
    "remote_rtc_uids": [
      "1002"
    ],
    "idle_timeout": 120,
    "llm": {
      "url": "https://api.openai.com/v1/chat/completions",
      "api_key": "<your_llm_key>",
      "system_messages": [
        {
          "role": "system",
          "content": "You are a helpful chatbot."
        }
      ],
      "max_history": 32,
      "greeting_message": "Hello, how can I assist you today?",
      "failure_message": "Please hold on a second.",
      "params": {
        "model": "gpt-4o-mini"
      }
    },
    "tts": {
      "vendor": "microsoft",
      "params": {
        "key": "<your_tts_api_key>",
        "region": "eastus",
        "voice_name": "en-US-AndrewMultilingualNeural"
      }
    },
    "asr": {
      "language": "en-US"
    }
  }
}'

Response example


{
  "agent_id": "1NT29X10YHxxxxxWJOXLYHNYB",
  "create_ts": 1737111452,
  "status": "RUNNING"
}