Start a conversational AI agent

POST
https://api.agora.io/api/conversational-ai-agent/v2/projects/{appid}/join

Use this endpoint to create and start a Conversational AI agent instance.

Request

Path parameters

appid stringrequired

The App ID of the project.

Request body

APPLICATION/JSON
BODYrequired
  • name stringrequired

    The unique identifier of the agent. The same identifier cannot be reused.

  • properties objectrequired

    Configuration details of the agent.

    • channel stringrequired

      The name of the channel to join.

    • token stringrequired

      The authentication token used by the agent to join the channel.

    • agent_rtc_uid stringrequired

      The user ID of the agent in the channel. A value of 0 means that a random UID is generated and assigned. Set the token accordingly.

    • remote_rtc_uids array[string]required

      A list of user IDs that the agent subscribes to in the channel. Only subscribed users can interact with the agent. Use "*" to subscribe to all users in the channel.

      info

      The "*" selector includes all UIDs present in the channel, which may include other AI agents. If you're running multiple agents in the same channel, review the best practices under the idle_timeout parameter to avoid unintended behavior and unnecessary usage costs.

    • enable_string_uid booleannullable

      Default: false

      Whether to enable string user IDs:

      • true: Both the agent and subscriber user IDs are strings.
      • false: Both the agent and subscriber user IDs must be integers.
    • idle_timeout integernullable

      Default: 30

      Sets the timeout, in seconds, after all of the users specified in remote_rtc_uids are detected to have left the channel. When the timeout is exceeded, the agent automatically stops and exits the channel. A value of 0 means that the agent does not exit until it is stopped manually.

      Multi-agent use cases

      If multiple AI agents are active in the same channel and are configured to subscribe to all users using remote_rtc_uids: ["*"], they detect each other's presence. As a result, the idle_timeout condition, when all other users have left, might never be triggered. This can cause agents to run indefinitely and lead to significant unintended usage.

      Agent lifecycle best practice

      For precise and reliable control over the agent's lifecycle, use the leave API to terminate the agent as soon as its task is complete.
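
      For example, a minimal sketch of a properties fragment that subscribes to one known user instead of "*" and stops the agent 60 seconds after that user leaves. The channel, token, and UID values are placeholders, and required fields such as llm and tts are omitted for brevity:

      "properties": {
        "channel": "demo_channel",
        "token": "<rtc_token>",
        "agent_rtc_uid": "1001",
        "remote_rtc_uids": ["1002"],
        "idle_timeout": 60
      }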

    • advanced_features objectnullable

      Advanced features configuration.

      • enable_aivad booleannullable

        Default: false

        Whether to enable the intelligent interruption handling function (AIVAD). This feature is currently available only for English.

      • enable_mllm booleannullable

        Default: false

        Whether to enable the Multimodal Large Language Model (MLLM). Enabling MLLM automatically disables the ASR, LLM, and TTS modules. When you set this parameter to true, enable_aivad is also disabled.

      • enable_rtm booleannullable

        Default: false

        Whether to enable the Signaling (RTM) service. When enabled, the agent can combine the capabilities provided by Signaling to implement advanced functions, such as delivering custom information.

        info

        Before enabling the Signaling service, make sure the token includes both RTC and RTM privileges. When an agent joins an RTM channel, it reuses the token specified in the token field. For more information, see "How can I generate a token with both RTC and Signaling privileges?".
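
        For example, a sketch of an advanced_features fragment with Signaling enabled, paired with the rtm data channel described under parameters below. The token supplied in the token field is assumed to carry both RTC and RTM privileges:

        "advanced_features": {
          "enable_rtm": true
        },
        "parameters": {
          "data_channel": "rtm"
        }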

    • asr objectnullable

      Automatic Speech Recognition (ASR) configuration.

      • language stringnullable

        Default: en-US

        The BCP-47 language tag identifying the primary language used for agent interaction. If params contains a vendor-specific language code, it takes precedence over this setting.

      • vendor stringnullable

        Default: ares

        Possible values: ares, microsoft, deepgram

        ASR provider.

      • params objectrequired

        The configuration parameters for the ASR vendor. See ASR Overview for details.
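
        For example, a minimal asr fragment that uses the default Agora provider for US English; any vendor-specific settings go in params (see ASR Overview):

        "asr": {
          "language": "en-US",
          "vendor": "ares"
        }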

    • tts objectrequired

      Text-to-speech (TTS) module configuration.

      • vendor stringrequired

        Possible values: microsoft, elevenlabs, cartesia, openai

        TTS provider.

      • params objectrequired

        The configuration parameters for the TTS vendor. See TTS Overview for details.

      • skip_patterns array[integer]nullable

        Controls whether the TTS module skips bracketed content when reading LLM response text. This prevents the agent from vocalizing structural prompt information like tone indicators, action descriptions, and system prompts, creating a more natural and immersive listening experience. Enable this feature by specifying one or more values:

        • 1: Skip content in Chinese full-width parentheses （）
        • 2: Skip content in Chinese square brackets 【】
        • 3: Skip content in parentheses ( )
        • 4: Skip content in square brackets [ ]
        • 5: Skip content in curly braces { }
        info
        • Nested brackets: When input text contains nested brackets and multiple bracket types are configured to be skipped, the system processes only the outermost brackets. The system matches from the beginning of the text and skips the first outermost bracket pair that meets the skip rule, including all nested content.
        • Agent memory: The agent's short-term memory always contains the complete, unfiltered LLM text, regardless of live captioning settings.
        • Real-time subtitles: When enabled, subtitles exclude filtered content during TTS playback but restore the complete text after each sentence finishes.
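
        For example, a tts fragment that skips half-width parentheses and square brackets, reusing the Microsoft vendor parameters from the request example below (key, region, and voice name are placeholders):

        "tts": {
          "vendor": "microsoft",
          "params": {
            "key": "<your_tts_api_key>",
            "region": "eastus",
            "voice_name": "en-US-AndrewMultilingualNeural"
          },
          "skip_patterns": [3, 4]
        }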
    • llm objectrequired

      Large language model (LLM) configuration.

      • url stringrequired

        The LLM callback address.

      • api_key stringnullable

        The LLM verification API key. The default value is an empty string. Ensure that you enable the API key in a production environment.

      • system_messages array[object]nullable

        A set of predefined information used as input to the LLM, including prompt words and examples.

      • params objectnullable

        Additional LLM configuration parameters, such as the model used, and the maximum token limit. For details about each supported LLM, refer to Supported LLMs.

      • max_history integernullable

        Default: 32

        The number of conversation history messages cached in the custom LLM. The minimum value is 1. History includes user and agent dialog messages, tool call information, and timestamps. Agent and user messages are recorded separately.

      • input_modalities array[string]nullable

        Default: ["text"]

        LLM input modalities:

        • ["text"]: Text only
        • ["text", "image"]: Text plus image; requires the selected LLM to support visual input
      • output_modalities array[string]nullable

        Default: ["text"]

        LLM output modalities:

        • ["text"]: The output text is converted to speech by the TTS module and then published to the RTC channel.
        • ["audio"]: Voice only. Voice is published directly to the RTC channel.
        • ["text", "audio"]: Text plus voice. Write your own logic to process the output of LLM as needed.
      • greeting_message stringnullable

        Agent greeting. If provided, the first user in the channel is automatically greeted with the message upon joining.

      • failure_message stringnullable

        Prompt for agent activation failure. If provided, it is returned through TTS when the custom LLM call fails.

      • vendor stringnullable

        The LLM provider. Supports the following setting:

        • custom: Custom LLM. When you set this option, the agent includes the following fields, in addition to role and content when making requests to the custom LLM:
          • turn_id: A unique identifier for each conversation turn. It starts from 0 and increments with each turn. One user-agent interaction corresponds to one turn_id.
          • timestamp: The request timestamp, in milliseconds.
      • style stringnullable

        Default: openai

        Possible values: openai, gemini, anthropic, dify

        The request style for chat completion:

        • openai: For OpenAI and OpenAI-compatible APIs
        • gemini: For Google Gemini and Google Vertex API format
        • anthropic: For Anthropic Claude API format
        • dify: For Dify API format

        For details, refer to Supported LLMs.
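
        For example, a sketch of an llm fragment that points the agent at a hypothetical custom, OpenAI-compatible endpoint; the URL, API key, and model name are placeholders:

        "llm": {
          "url": "https://your-llm-gateway.example.com/v1/chat/completions",
          "api_key": "<your_llm_key>",
          "vendor": "custom",
          "style": "openai",
          "system_messages": [
            { "role": "system", "content": "You are a helpful assistant." }
          ],
          "params": { "model": "<your_model>" },
          "max_history": 32,
          "greeting_message": "Hello, how can I help you today?",
          "failure_message": "Sorry, something went wrong. Please try again."
        }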

    • mllm objectnullable

      Multimodal Large Language Model (MLLM) configuration for real-time audio and text processing.

      • url stringrequired

        The MLLM WebSocket URL for real-time communication.

      • api_key stringrequired

        The API key used for MLLM authentication.

      • messages array[object]nullable

        Array of conversation items used for short-term memory management. Uses the same structure as item.content from the OpenAI Realtime API.

      • params objectnullable

        Additional MLLM configuration parameters.

        • Modalities override: The modalities setting in params is overridden by input_modalities and output_modalities.
        • Turn detection override: The turn_detection setting in params is overridden by the turn_detection section outside of mllm.

        See MLLM Overview for details.

      • max_history integernullable

        Default: 32

        The number of conversation history messages cached in the MLLM. The minimum value is 1. Cannot exceed the model's context window.

      • input_modalities array[string]nullable

        Default: ["audio"]

        MLLM input modalities:

        • ["audio"]: Audio only
        • ["audio", "text"]: Audio plus text
      • output_modalities array[string]nullable

        Default: ["text", "audio"]

        MLLM output modalities:

        • ["text", "audio"]: Text plus audio
      • greeting_message stringnullable

        Agent greeting message. If provided, the first user in the channel is automatically greeted with this message upon joining.

      • vendor stringnullable

        Possible values: openai

        MLLM provider. Currently supports:

        • openai: OpenAI Realtime API
      • style stringnullable

        Default: openai

        Possible values: openai

        The request style for MLLM completion:

        • openai: For OpenAI Realtime API format
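
        For example, a sketch of an mllm fragment for the OpenAI Realtime API, with enable_mllm switched on under advanced_features. The WebSocket URL, model, and API key are placeholders; see MLLM Overview for the supported params:

        "advanced_features": {
          "enable_mllm": true
        },
        "mllm": {
          "url": "wss://api.openai.com/v1/realtime?model=<your_realtime_model>",
          "api_key": "<your_openai_key>",
          "vendor": "openai",
          "style": "openai",
          "input_modalities": ["audio"],
          "output_modalities": ["text", "audio"],
          "max_history": 32
        }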
    • vad objectdeprecatednullable

      Voice Activity Detection (VAD) configuration.

      Deprecation Notice

      The vad configuration section is deprecated and will not be supported in future versions. When the same parameter exists in both vad and turn_detection sections, turn_detection parameters have higher priority. Agora recommends migrating to turn_detection configuration for voice activity detection settings.

      • interrupt_duration_ms numbernullable

        Default: 160

        The amount of time in milliseconds that the user's voice must exceed the VAD threshold before an interruption is triggered.

      • prefix_padding_ms integernullable

        Default: 800

        The extra forward padding time in milliseconds before the processing system starts to process the speech input. This padding helps capture the beginning of the speech.

      • silence_duration_ms integernullable

        Default: 640

        The duration of audio silence in milliseconds. If no voice activity is detected during this period, the agent assumes that the user has stopped speaking.

      • threshold numbernullable

        Default: 0.5

        Identification sensitivity determines the level of sound in the audio signal that is considered voice activity. The value range is (0.0, 1.0). Lower values make it easier for the agent to detect speech, and higher values ignore weak sounds.

    • turn_detection objectnullable

      Conversation turn detection settings.

      • type stringnullable

        Default: agora_vad

        Possible values: agora_vad, server_vad, semantic_vad

        Turn detection mechanism.

        • agora_vad: Agora VAD.
        • server_vad: The model detects the start and end of speech based on audio volume and responds at the end of user speech. Only available when mllm is enabled and OpenAI is selected.
        • semantic_vad: Uses a turn detection model in conjunction with VAD to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability for more natural conversations. Only available when mllm is enabled and OpenAI is selected.
      • interrupt_mode stringnullable

        Default: interrupt

        Sets the agent's behavior when human voice interrupts the agent while it is interacting (speaking or thinking). Choose from the following values:

        • interrupt: The agent immediately stops the current interaction and processes the human voice input.
        • append: The agent completes the current interaction, then processes the human voice input.
        • ignore: The agent discards the human voice input without processing or storing it in the context.
        info

        Only the interrupt mode is supported when you integrate an MLLM.

      • interrupt_duration_ms numbernullable

        Default: 160

        The amount of time in milliseconds that the user's voice must exceed the VAD threshold before an interruption is triggered.

      • prefix_padding_ms integernullable

        Default: 800

        The extra forward padding time in milliseconds before the processing system starts to process the speech input. This padding helps capture the beginning of the speech.

      • silence_duration_ms integernullable

        Default: 640

        The duration of audio silence in milliseconds. If no voice activity is detected during this period, the agent assumes that the user has stopped speaking.

      • threshold numbernullable

        Default: 0.5

        Identification sensitivity determines the level of sound in the audio signal that is considered voice activity. The value range is (0.0, 1.0). Lower values make it easier for the agent to detect speech, and higher values ignore weak sounds.

      • create_response booleannullable

        Default: true

        Whether to automatically generate a response when a VAD stop event occurs. Only available in server_vad and semantic_vad modes when using OpenAI Realtime API.

      • interrupt_response booleannullable

        Default: true

        Whether to automatically interrupt any ongoing response when a VAD start event occurs. Only available in server_vad and semantic_vad modes when using OpenAI Realtime API.

      • eagerness stringnullable

        Default: auto

        Possible values: auto, low, high

        The eagerness of the model to respond:

        • auto: Equivalent to medium
        • low: Wait longer for the user to continue speaking
        • high: Respond more quickly

        Only available in semantic_vad mode when using OpenAI Realtime API.
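
        For example, a turn_detection fragment that keeps Agora VAD but waits slightly longer than the default before deciding that the user has stopped speaking:

        "turn_detection": {
          "type": "agora_vad",
          "interrupt_mode": "interrupt",
          "interrupt_duration_ms": 160,
          "prefix_padding_ms": 800,
          "silence_duration_ms": 800,
          "threshold": 0.5
        }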

    • parameters objectnullable

      Agent configuration parameters.

      • silence_config objectnullable

        Settings related to agent silence behavior.

        info

        silence_config does not apply when you integrate an MLLM.

        • timeout_ms integernullable

          Default: 0

          Possible values: 0 to 60000

          Specifies the maximum duration (in milliseconds) that the agent can remain silent. After the agent is successfully created and the user joins the channel, any time during which the agent is not listening, thinking, or speaking is considered silent time. When the silent time reaches the specified value, the agent broadcasts a silent reminder message. This feature is useful for prompting users when they become inactive.

          • 0: Disables the silent reminder feature.
          • (0, 60000]: Enables the silent reminder. You must also set content; otherwise, the configuration is invalid.
        • action stringnullable

          Default: speak

          Specifies how the agent behaves when the silent timeout is reached. Valid values:

          • speak: Uses the TTS module to announce the silent prompt (content).
          • think: Appends the silent prompt (content) to the context and passes it to the LLM.
        • content stringnullable

          Specifies the silent prompt message. How the message is used depends on the value of the action parameter.

      • data_channel stringnullable

        Default: datastream

        Agent data transmission channel:

        • rtm: Use RTM transmission. This configuration takes effect only when advanced_features.enable_rtm is true.
        • datastream: Use RTC data stream transport.
      • enable_metrics booleannullable

        Default: false

        Whether to receive agent performance data:

        • true: Receive agent performance data.
        • false: Do not receive agent performance data.

        This setting only takes effect when advanced_features.enable_rtm is true. See Listen to agent events to learn how to use client components to receive agent performance data.

      • enable_error_message booleannullable

        Default: false

        Whether to receive agent error events:

        • true: Receive agent error events.
        • false: Do not receive agent error events.

        This setting only takes effect when advanced_features.enable_rtm is true. See Listen to agent events to learn how to use client components to receive agent error events.
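
        For example, a parameters fragment that prompts the user after 10 seconds of silence and reports metrics and errors over Signaling (effective only when advanced_features.enable_rtm is true; the prompt text is a placeholder):

        "parameters": {
          "silence_config": {
            "timeout_ms": 10000,
            "action": "speak",
            "content": "Are you still there?"
          },
          "data_channel": "rtm",
          "enable_metrics": true,
          "enable_error_message": true
        }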

Response

  • If the returned status code is 200, the request was successful. The response body contains the result of the request.

    OK
    • agent_id string

      The unique ID of the agent instance.

    • create_ts integer

      The Unix timestamp (in seconds) of when the agent was created.

    • status string

      Possible values: IDLE, STARTING, RUNNING, STOPPING, STOPPED, RECOVERING, FAILED

      Current status.

      • IDLE (0): Agent is idle.
      • STARTING (1): The agent is being started.
      • RUNNING (2): The agent is running.
      • STOPPING (3): The agent is stopping.
      • STOPPED (4): The agent has exited.
      • RECOVERING (5): The agent is recovering.
      • FAILED (6): The agent failed to execute.
  • If the returned status code is not 200, the request failed. The response body includes the detail and reason for failure. Refer to status codes to understand the possible reasons for failure.

Authorization

This endpoint requires Basic Auth.

Request example


curl --request POST \
  --url https://api.agora.io/api/conversational-ai-agent/v2/projects/:appid/join \
  --header 'Authorization: Basic <your_base64_encoded_credentials>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "name": "unique_name",
  "properties": {
    "channel": "channel_name",
    "token": "token",
    "agent_rtc_uid": "1001",
    "remote_rtc_uids": [
      "1002"
    ],
    "idle_timeout": 120,
    "advanced_features": {
      "enable_aivad": true
    },
    "llm": {
      "url": "https://api.openai.com/v1/chat/completions",
      "api_key": "<your_llm_key>",
      "system_messages": [
        {
          "role": "system",
          "content": "You are a helpful chatbot."
        }
      ],
      "max_history": 32,
      "greeting_message": "Hello, how can I assist you today?",
      "failure_message": "Please hold on a second.",
      "params": {
        "model": "gpt-4o-mini"
      }
    },
    "tts": {
      "vendor": "microsoft",
      "params": {
        "key": "<your_tts_api_key>",
        "region": "eastus",
        "voice_name": "en-US-AndrewMultilingualNeural"
      }
    },
    "asr": {
      "language": "en-US"
    }
  }
}'

Response example


{
  "agent_id": "1NT29X10YHxxxxxWJOXLYHNYB",
  "create_ts": 1737111452,
  "status": "RUNNING"
}