Start a conversational AI agent

POST
https://api.agora.io/api/conversational-ai-agent/v2/projects/{appid}/join

Use this endpoint to create and start a Conversational AI agent instance.

Request

Path parameters

appid stringrequired

The App ID of the project.

Request body

Content type: application/json. Body required.
  • name stringrequired

    The unique identifier of the agent. The same identifier cannot be used repeatedly.

  • properties objectrequired

    Configuration details of the agent.

    • channel stringrequired

      The name of the channel to join.

    • token stringrequired

      The authentication token used by the agent to join the channel.

    • agent_rtc_uid stringrequired

      The user ID of the agent in the channel. A value of 0 means that a random UID is generated and assigned. Set the token accordingly.

    • remote_rtc_uids array[string]required

      A list of user IDs that the agent subscribes to in the channel. Only subscribed users can interact with the agent.

      info

      Currently, only one user ID is supported.

    • enable_string_uid booleannullable

      Default: false

      Whether to enable string user IDs (String UID):

      • true: Both agent and subscriber user IDs use strings.
      • false: Both agent and subscriber user IDs must be integers.
    • idle_timeout integernullable

      Default: 30

      Sets the timeout (in seconds) that starts after all the users specified in remote_rtc_uids are detected to have left the channel. When the timeout value is exceeded, the agent automatically stops and exits the channel. A value of 0 means that the agent does not exit until it is stopped manually.

      Agent lifecycle best practice

      For precise and reliable control over the agent's lifecycle, use the leave API to terminate the agent as soon as its task is complete.

    • geofence objectnullable

      Regional access restriction configuration. Use this to limit which Agora servers the Conversational AI Engine can access based on geographic regions.

      • area stringrequired

        Possible values: GLOBAL, NORTH_AMERICA, EUROPE, ASIA, INDIA, JAPAN

        The allowed region for server access.

      • exclude_area stringnullable

        Possible values: NORTH_AMERICA, EUROPE, ASIA, INDIA, JAPAN

        The excluded region. Only available when area is set to GLOBAL.
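
      For example, a minimal sketch of a geofence that allows global access while excluding European servers; both values come from the possible values listed above:

        "geofence": {
          "area": "GLOBAL",
          "exclude_area": "EUROPE"
        }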

    • advanced_features objectnullable

      Advanced features configuration.

      • enable_aivad booleandeprecatednullable

        Default: false

        Whether to enable the intelligent interruption handling function (AIVAD). This feature is currently available only for English.

        caution

        This field is deprecated. Set turn_detection.config.end_of_speech.mode to semantic instead.

      • enable_mllm booleannullable

        Default: false

        Enable Multimodal Large Language Model for voice-to-voice processing. Enabling MLLM automatically disables ASR, LLM, and TTS since the MLLM handles end-to-end voice processing directly. When you set this parameter to true, enable_aivad is also disabled. See turn_detection.type for turn detection options available with MLLM.

      • enable_rtm booleannullable

        Default: false

        Whether to enable the Signaling (RTM) service. When enabled, the agent can combine the capabilities provided by Signaling to implement advanced functions, such as delivering custom information.

        info

        Before enabling the Signaling service, make sure the token includes both RTC and RTM privileges. When an agent joins an RTM channel, it reuses the token specified in the token field. For more information, see "How can I generate a token with both RTC and Signaling privileges?".

      • enable_sal booleannullable

        Default: false

        Enable Selective Attention Locking (SAL). When enabled, configure the sal field to set up speaker recognition or locking modes. See the sal parameter for configuration details.

      • enable_tools booleannullable

        Default: false

        Enable tool invocation. When enabled, the agent can invoke tools provided by the MCP server to implement advanced functionality.

    • asr objectnullable

      Automatic Speech Recognition (ASR) configuration.

    • tts objectrequired

      Text-to-speech (TTS) module configuration.

      • vendor stringrequired

        Possible values: microsoft, elevenlabs, minimax, cartesia, openai, humeai, rime, fishaudio, google, amazon, sarvam

        TTS provider.

      • params objectrequired

        The configuration parameters for the TTS vendor. See TTS Overview for details.

      • skip_patterns array[integer]nullable

        Controls whether the TTS module skips bracketed content when reading LLM response text. This prevents the agent from vocalizing structural prompt information like tone indicators, action descriptions, and system prompts, creating a more natural and immersive listening experience. Enable this feature by specifying one or more values:

        • 1: Skip content in Chinese parentheses （）
        • 2: Skip content in Chinese square brackets 【】
        • 3: Skip content in parentheses ( )
        • 4: Skip content in square brackets [ ]
        • 5: Skip content in curly braces { }
        info
        • Nested brackets: When input text contains nested brackets and multiple bracket types are configured to be skipped, the system processes only the outermost brackets. The system matches from the beginning of the text and skips the first outermost bracket pair that meets the skip rule, including all nested content.
        • Agent memory: The agent's short-term memory always contains the complete, unfiltered LLM text, regardless of live captioning settings.
        • Real-time transcript: When enabled, transcript excludes filtered content during TTS playback but restores the complete text after each sentence finishes.
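
        For example, a minimal sketch that skips parentheses and square brackets (pattern IDs 3 and 4 above), so LLM output such as "(whispers) Hello [smiles]" is vocalized simply as "Hello":

          "skip_patterns": [3, 4]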
    • llm objectrequired

      Large language model (LLM) configuration.

      • url stringrequired

        The LLM callback address.

      • api_key stringnullable

        The LLM verification API key. The default value is an empty string. Ensure that you enable the API key in a production environment.

      • system_messages array[object]nullable

        A set of predefined information used as input to the LLM, including prompt words and examples.

      • params objectnullable

        Additional LLM configuration parameters, such as the model used, and the maximum token limit. For details about each supported LLM, refer to Supported LLMs.

      • max_history integernullable

        Default: 32

        Range: [1, 1024]

        The number of conversation history messages cached in the LLM. History includes user and agent dialog messages, tool call information, and timestamps. Agent and user messages are recorded separately.

      • input_modalities array[string]nullable

        Default: ["text"]

        LLM input modalities:

        • ["text"]: Text only
        • ["text", "image"]: Text plus image; requires the selected LLM to support visual input
      • output_modalities array[string]nullable

        Default: ["text"]

        LLM output modalities:

        • ["text"]: The output text is converted to speech by the TTS module and then published to the RTC channel.
        • ["audio"]: Voice only. Voice is published directly to the RTC channel.
        • ["text", "audio"]: Text plus voice. Write your own logic to process the output of LLM as needed.
      • greeting_configs objectnullable

        Agent greeting broadcast configuration.

        • mode stringnullable

          Default: single_every

          Possible values: single_every, single_first

          Determines when the agent sends greeting messages to users joining the channel.

          • single_every: Broadcasts a greeting every time a user joins the channel.
          • single_first: Broadcasts a greeting only once to the first user who joins the channel.
      • greeting_message stringnullable

        Agent greeting. If provided, the first user in the channel is automatically greeted with the message upon joining.

      • failure_message stringnullable

        Prompt for agent activation failure. If provided, it is returned through TTS when the custom LLM call fails.

      • vendor stringnullable

        LLM provider, supports the following settings:

        • custom: Custom LLM. When you set this option, the agent includes the following fields, in addition to role and content when making requests to the custom LLM:
          • turn_id: A unique identifier for each conversation turn. It starts from 0 and increments with each turn. One user-agent interaction corresponds to one turn_id.
          • timestamp: The request timestamp, in milliseconds.
        • azure: Use this value for Azure OpenAI
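
        To illustrate the custom option, a hedged sketch of a request body the agent might send to a custom LLM. Only the turn_id and timestamp field names and meanings come from the description above; their top-level placement and the surrounding OpenAI-style messages structure are assumptions:

          {
            "messages": [
              { "role": "user", "content": "What is the weather like today?" }
            ],
            "turn_id": 3,
            "timestamp": 1737111452000
          }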
      • style stringnullable

        Default: openai

        Possible values: openai, gemini, anthropic, dify

        The request style for chat completion:

        • openai: For OpenAI and OpenAI-compatible APIs
        • gemini: For Google Gemini and Google Vertex API format
        • anthropic: For Anthropic Claude API format
        • dify: For Dify API format

        For details, refer to Supported LLMs.

      • template_variables objectnullable

        Template parameter configuration used to insert variables into the agent's system_messages, greeting_message, failure_message, and parameters.silence_config.content text. Uses key-value pairs, where the key is the variable name and the value is the variable's value. Template variables, combined with prompt customization and SIP outbound calling functionality, enable dynamic content injection, automating processes such as automatic hang-up, voicemail recognition, automatic message leaving, and call transfer.

        To insert defined variables in the prompt text, use the syntax {{variable_name}}. The system automatically replaces each variable with the corresponding value defined in template_variables.

        caution

        Variable values cannot reference other variables. For example, if you define "farewell": "Looking forward to seeing you again, {{name}}", the {{name}} variable will not be resolved.
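
        For example, a minimal sketch that defines two hypothetical variables and references them in the greeting:

          "template_variables": {
            "name": "Ann",
            "company": "Example Corp"
          },
          "greeting_message": "Hi {{name}}, welcome to {{company}} support."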

      • mcp_servers arraynullable

        MCP (Model Context Protocol) server configuration. By configuring MCP servers, agents can call tools provided by external services to implement advanced functionality.

        • name stringrequired

          A unique identifier for the MCP server. Maximum 48 characters. Accepts only English letters and numbers.

        • endpoint stringrequired

          The endpoint address of the MCP server. The agent uses this to communicate with the MCP server.

        • transport stringnullable

          Possible values: streamable_http

          Transport protocol type.

          • streamable_http: Streaming HTTP protocol
        • headers objectnullable

          HTTP header information to include when requesting the MCP server, such as authentication information.

        • allowed_tools arraynullable

          A list of tools that the agent is allowed to invoke. The agent can only use tools on this list.

          Behavior:

          • Omitted or null: All tools are enabled.
          • Empty array []: No tools are enabled.
          • ["*"]: All tools are enabled.
          • Specific tools ["aa", "bb", "cc"]: Only aa, bb, and cc are enabled.
          • Mix with wildcard ["aa", "bb", "*"]: All tools are enabled (wildcard takes precedence).
        • timeout_ms integernullable

          The MCP server request timeout in milliseconds. After timeout, the agent stops waiting for the MCP server's response and continues executing subsequent logic.
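
        Putting these fields together, a minimal sketch of an mcp_servers entry; the endpoint URL, header value, and tool names are placeholders:

          "mcp_servers": [
            {
              "name": "weatherTools",
              "endpoint": "https://mcp.example.com/mcp",
              "transport": "streamable_http",
              "headers": { "Authorization": "Bearer <your_mcp_token>" },
              "allowed_tools": ["get_weather", "get_forecast"],
              "timeout_ms": 5000
            }
          ]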

    • mllm objectnullable

      Multimodal Large Language Model (MLLM) configuration for real-time audio and text processing.

      • url stringrequired

        The MLLM WebSocket URL for real-time communication.

      • api_key stringrequired

        The API key used for MLLM authentication.

      • messages array[object]nullable

        Array of conversation items used for short-term memory management. Uses the same structure as item.content from the OpenAI Realtime API.

      • params objectnullable

        Additional MLLM configuration parameters.

        • Modalities override: The modalities setting in params is overridden by input_modalities and output_modalities.
        • Turn detection override: The turn_detection setting in params is overridden by the turn_detection section outside of mllm.

        See MLLM Overview for details.

      • input_modalities array[string]nullable

        Default: ["audio"]

        MLLM input modalities:

        • ["audio"]: Audio only
        • ["audio", "text"]: Audio plus text
      • output_modalities array[string]nullable

        Default: ["text", "audio"]

        MLLM output modalities:

        • ["text", "audio"]: Text plus audio
      • greeting_message stringnullable

        Agent greeting message. If provided, the first user in the channel is automatically greeted with this message upon joining.

      • vendor stringnullable

        Possible values: openai, vertexai

        MLLM provider. Currently supports openai and vertexai.

      • style stringnullable

        Default: openai

        Possible values: openai

        The request style for MLLM completion:

        • openai: For OpenAI Realtime API format
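
      As a sketch, a minimal mllm block in the OpenAI Realtime style; the URL, API key, and model name are placeholders. Remember to also set advanced_features.enable_mllm to true, as described above:

        "mllm": {
          "url": "wss://api.openai.com/v1/realtime",
          "api_key": "<your_mllm_key>",
          "params": { "model": "<your_realtime_model>" },
          "input_modalities": ["audio"],
          "output_modalities": ["text", "audio"],
          "vendor": "openai",
          "style": "openai"
        }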
    • avatar objectnullable

      Avatar configuration.

      • enable booleannullable

        Default: false

        Whether to enable the avatar function for the agent. To enable, set to true and configure the vendor and params fields.

      • vendor stringnullable

        Possible values: akool, heygen

        Avatar vendor. Supports akool and heygen.

      • params objectnullable

        The configuration parameters for the avatar vendor. See AI Avatar Overview for details.

    • turn_detection objectnullable

      Conversation turn detection settings. Controls the logic for voice activity detection and conversation turn determination. The previous version of turn_detection is deprecated. Refer to Deprecated parameters for details. Agora recommends switching to the latest parameters.

      This configuration supports multiple combinations of detection modes:

      • Start of Speech (SoS): Supports three modes: VAD, Keyword, and Disable.
      • End of Speech (EoS): Supports VAD and Semantic modes.
      • mode stringnullable

        Default: default

        Possible values: default

        Conversation turn detection mode:

        • default: Uses standard conversation turn detection configuration.
      • config objectnullable

        Detailed configuration for conversation turn detection.

        • speech_threshold numbernullable

          Default: 0.5

          Range: (0.0, 1.0)

          Voice activity detection sensitivity. Determines the sound level in the audio signal that is considered voice activity. Lower values make it easier for the agent to detect speech, and higher values ignore weak sounds.

        • start_of_speech objectnullable

          Start of Speech (SoS) detection configuration. Determines when a user begins speaking.

          • mode stringrequired

            Possible values: vad, keywords, disabled

            Start of speech detection mode:

            • vad: Based on VAD (Voice Activity Detection). Uses audio signal detection.
            • keywords: (Beta) Based on keyword trigger. Conversation begins when the agent detects a specified keyword.
            • disabled: Disables start of speech detection. Does not actively trigger new conversation turns.
          • {mode}_config objectnullable

            Start of speech detection configuration parameters. The structure and supported fields vary depending on the detection mode.

            info
            • The configuration type must match mode. For example, when mode is vad, you must provide vad_config.
            • You cannot provide multiple mode configurations simultaneously.

            Configuration examples:

            • vad_config


              "vad_config": {
                "interrupt_duration_ms": 160,
                "speaking_interrupt_duration_ms": 160,
                "prefix_padding_ms": 800
              }

            • keywords_config


              "keywords_config": {
                "interrupt_duration_ms": 160,
                "prefix_padding_ms": 800,
                "triggered_keywords": ["Are you there", "hello"]
              }

            • disabled_config


              "disabled_config": {
                "strategy": "append"
              }

            • strategy stringnullable

              Possible values: append, ignored

              Voice processing strategy when the agent is interacting (speaking or thinking):

              • append: Append mode. Human voice does not interrupt the agent. The agent processes the human voice input after the current interaction ends.
              • ignored: Ignore mode. The agent ignores human voice input. If the agent receives human voice while speaking or thinking, the agent discards the input without storing it in context.
        • end_of_speech objectnullable

          End of Speech (EoS) detection configuration. Determines when a user ends their speech.

          • mode stringnullable

            Possible values: vad, semantic

            End of speech detection mode. Possible values:

            • vad: Based on VAD (Voice Activity Detection). Detects silence duration.
            • semantic: Based on semantic triggering. Uses semantic understanding to determine when conversation ends.
          • {mode}_config objectnullable

            End of speech detection configuration parameters. The structure and supported fields vary depending on the detection mode.

            info
            • The configuration type must match mode. For example, when mode is vad, you must provide vad_config.
            • You cannot provide multiple mode configurations simultaneously.

            Configuration examples:

            • vad_config


              "vad_config": {
                "silence_duration_ms": 640
              }

            • semantic_config


              "semantic_config": {
                "silence_duration_ms": 320,
                "max_wait_ms": 3000
              }

            • silence_duration_ms integernullable

              Default: 320

              Range: [0, 2000]

              Silence duration threshold in milliseconds. The minimum silence duration at the end of a speech segment, to ensure that a brief pause does not prematurely end the speech segment.

            • max_wait_ms integernullable

              Default: 3000

              Range: [0, 10000]

              Maximum wait time in milliseconds. The maximum time to wait for semantic determination. After timeout, the conversation end is determined based on the current state.
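
      Combining the pieces above, a minimal sketch of a complete turn_detection object that uses VAD for start of speech and semantic detection for end of speech; the values mirror the defaults and examples shown above:

        "turn_detection": {
          "mode": "default",
          "config": {
            "speech_threshold": 0.5,
            "start_of_speech": {
              "mode": "vad",
              "vad_config": {
                "interrupt_duration_ms": 160,
                "speaking_interrupt_duration_ms": 160,
                "prefix_padding_ms": 800
              }
            },
            "end_of_speech": {
              "mode": "semantic",
              "semantic_config": {
                "silence_duration_ms": 320,
                "max_wait_ms": 3000
              }
            }
          }
        }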

    • sal objectnullable

      Selective Attention Locking (SAL) configuration. (Beta)

      • sal_mode stringnullable

        Default: locking

        Possible values: locking, recognition

        Selective attention lock mode. Supports the following options:

        • locking: Speaker Lock Mode. The agent locks onto the speaker, blocking 95% of ambient human voices and noise. You can enable this mode in two ways:

          • Seamless mode: When a user speaks loudly and clearly at the beginning of a conversation, the agent automatically recognizes that user as the locked speaker.
          • Personalized mode: When creating an agent, a speaker's voiceprint URL is pre-registered through the sample_urls field. The agent then locates the speaker based on the pre-registered voiceprint.
        • recognition: Voiceprint recognition mode. You can pre-register only one voiceprint URL using the sample_urls field. The agent identifies different speakers and suppresses other background voices and environmental noise. The target speaker is identified through the vpids field within metadata and passed to the LLM. Set llm.vendor to "custom" and refer to Custom LLM for instructions on how to make the LLM process speaker information.

      • sample_urls objectnullable

        The registered voiceprint URL as a key-value pair, where the key is the voiceprint name and the value is the download URL for the speaker's voiceprint. Only one voiceprint URL is supported.
        Example:


        {
          "speaker1": "https://example.com/speaker1.pcm"
        }

        info
        • Do not set the incoming voiceprint name to "unknown"; this is a reserved keyword used to identify unknown speakers.
        • For a registered voiceprint, ensure that:
          • Size: The voiceprint file does not exceed 2 MB.
          • Duration: Contains 10 to 15 seconds of audio, with at least 8 seconds of effective audio excluding silent segments.
          • Format: 16kHz sampling rate, 16-bit depth, mono PCM audio file. The file name extension must be ".pcm".
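
      For example, a minimal sketch of a sal block that locks onto a pre-registered speaker; the voiceprint URL is a placeholder, and advanced_features.enable_sal must also be set to true, as described above:

        "sal": {
          "sal_mode": "locking",
          "sample_urls": {
            "speaker1": "https://example.com/speaker1.pcm"
          }
        }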
    • labels objectnullable

      Custom labels in key-value pair format, where the key is the label name and the value is the label value. Enables agents to carry custom business information.

      These labels are bound to the agent and returned in the payload field of all message notification callbacks from the conversational AI engine. Use them to implement custom business logic, such as tagging activity IDs, customer groups, and business scenarios.
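
      For example, a minimal sketch of labels carrying hypothetical business metadata:

        "labels": {
          "activity_id": "promo-2025",
          "customer_group": "vip"
        }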

    • rtc objectnullable

      RTC media encryption configuration.

      • encryption_key stringnullable

        The encryption key for RTC media content. The key has no length limit. Agora recommends using a 32-byte key. If no encryption key is set or if the key is empty, built-in encryption is not used.

      • encryption_salt stringnullable

        The salt value used for encryption. This is a Base64-encoded string that is 32 bytes long after decoding. This parameter only takes effect when encryption_mode is set to 7 (AES_128_GCM2) or 8 (AES_256_GCM2). Ensure that the salt parameter is not empty for these encryption modes.

      • encryption_mode integernullable

        Possible values: 1, 2, 3, 4, 5, 6, 7, 8

        The built-in encryption mode.

        • 1: AES_128_XTS - 128-bit AES encryption, XTS mode.
        • 2: AES_128_ECB - 128-bit AES encryption, ECB mode.
        • 3: AES_256_XTS - 256-bit AES encryption, XTS mode.
        • 4: SM4_128_ECB - 128-bit SM4 encryption, ECB mode.
        • 5: AES_128_GCM - 128-bit AES encryption, GCM mode.
        • 6: AES_256_GCM - 256-bit AES encryption, GCM mode.
        • 7: AES_128_GCM2 - 128-bit AES encryption, GCM mode. Requires setting encryption_salt.
        • 8: AES_256_GCM2 - 256-bit AES encryption, GCM mode. Requires setting encryption_salt.

        Agora recommends using either 7 (AES_128_GCM2) or 8 (AES_256_GCM2) mode. Both modes support cryptographic salts to enhance security.
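
      To illustrate, a minimal sketch of an rtc block using the recommended AES_256_GCM2 mode; the key and salt are placeholders, and the salt must be a Base64 string that decodes to 32 bytes, as described above:

        "rtc": {
          "encryption_mode": 8,
          "encryption_key": "<your_encryption_key>",
          "encryption_salt": "<your_base64_encoded_salt>"
        }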

    • filler_words objectnullable

      Filler word configuration. Plays filler words while waiting for LLM responses to reduce user anxiety and improve conversation flow.

      Filler word playback follows these rules:

      • Playback order: When multiple filler words or LLM responses are waiting to be played, they are played in the order they arrive.
      • Interruption control: Inherits the interruption mode setting from global configuration in turn_detection.config.
      • enable booleannullable

        Default: false

        Whether to enable filler words:

        • true: Enable filler words.
        • false: Disable filler words.
      • trigger objectnullable

        Filler word trigger configuration. Defines when to trigger filler word playback.

        • mode stringnullable

          Possible values: fixed_time

          Filler word trigger mode:

          • fixed_time: Fixed time trigger. Triggers filler word playback when LLM response wait time exceeds the threshold.
        • {mode}_config objectnullable

          Filler word trigger configuration parameters. The parameter name and structure vary depending on the trigger mode.

          info
          • The configuration type must match mode. For example, when mode is fixed_time, you must provide fixed_time_config.
          • You cannot provide multiple mode configurations simultaneously.

          Configuration example:


          "fixed_time_config": {
            "response_wait_ms": 1500
          }

          • response_wait_ms integernullable

            Default: 1500

            Range: [100, 10000]

            LLM response wait threshold in milliseconds. Triggers filler word playback when the LLM waits this duration without generating a response, such as when waiting for RAG retrieval or tool call results.

      • content objectnullable

        Filler word content configuration. Defines the source and selection rules for filler words.

        • mode stringnullable

          Possible values: static

          Filler word content mode:

          • static: Static filler words. Uses a predefined list of filler words.
        • {mode}_config objectnullable

          Filler word content configuration parameters. The parameter name and structure vary depending on the content mode.

          info
          • The configuration type must match mode. For example, when mode is static, you must provide static_config.
          • You cannot provide multiple mode configurations simultaneously.

          Static filler word configuration example:


          "static_config": {
            "phrases": [
              "Please wait.",
              "Okay.",
              "Uh-huh."
            ],
            "selection_rule": "shuffle"
          }

          • phrases array[string]nullable

            List of filler word phrases.

            Limits:

            • Maximum 100 filler words.
            • Each filler word must not exceed 50 English words.
          • selection_rule stringnullable

            Possible values: shuffle, round_robin

            Filler word selection rule:

            • shuffle: Random shuffle. Already-used filler words are not repeated until all filler words have been used once. After all filler words are played, they are reshuffled randomly and a new round begins.
            • round_robin: Round-robin. Selects and plays filler words sequentially from the list. After all filler words are played once, a new cycle begins.
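
      Assembled, a minimal sketch of a filler_words configuration using the fixed-time trigger and static content modes described above:

        "filler_words": {
          "enable": true,
          "trigger": {
            "mode": "fixed_time",
            "fixed_time_config": { "response_wait_ms": 1500 }
          },
          "content": {
            "mode": "static",
            "static_config": {
              "phrases": ["Please wait.", "Okay.", "Uh-huh."],
              "selection_rule": "shuffle"
            }
          }
        }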
    • parameters objectnullable

      Agent configuration parameters.

      • silence_config objectnullable

        Settings related to agent silence behavior.

        info

        silence_config does not apply when you integrate an mllm.

        • timeout_ms integernullable

          Default: 0

          Possible values: 0 to 60000

          Specifies the maximum duration (in milliseconds) that the agent can remain silent. After the agent is successfully created and the user joins the channel, any time during which the agent is not listening, thinking, or speaking is considered silent time. When the silent time reaches the specified value, the agent broadcasts a silent reminder message. This feature is useful for prompting users when they become inactive.

          • 0: Disables the silent reminder feature.
          • (0, 60000]: Enables the silent reminder. You must also set content; otherwise, the configuration is invalid.
        • action stringnullable

          Default: speak

          Specifies how the agent behaves when the silent timeout is reached. Valid values:

          • speak: Uses the TTS module to announce the silent prompt (content).
          • think: Appends the silent prompt (content) to the context and passes it to the LLM.
        • content stringnullable

          Specifies the silent prompt message. How the message is used depends on the value of the action parameter.
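
        For example, a minimal sketch that has the agent speak a reminder after 10 seconds of silence; the prompt text is a placeholder:

          "silence_config": {
            "timeout_ms": 10000,
            "action": "speak",
            "content": "Are you still there?"
          }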

      • farewell_config objectnullable

        Graceful hang-up settings for the agent.

        • graceful_enabled booleannullable

          Default: false

          Enable graceful leave:

          • true: Enabled. When enabled, calling the leave API to stop the agent ensures that the agent is in an IDLE state before leaving the channel.
          • false: Disabled.
        • graceful_timeout_seconds integernullable

          Default: 30

          Range: [0, 120]

          Graceful exit timeout (in seconds). Represents the maximum time to wait for the agent to enter an IDLE state before exiting the channel. After this time, the agent will exit the channel immediately, even if it is not in an idle state. This field is only effective when graceful_enabled is true.

      • data_channel stringnullable

        Default: datastream

        Agent data transmission channel:

        • rtm: Use RTM transmission. This configuration takes effect only when advanced_features.enable_rtm is true.
        • datastream: Use RTC data stream transport.
      • enable_metrics booleannullable

        Default: false

        Whether to receive agent performance data:

        • true: Receive agent performance data.
        • false: Do not receive agent performance data.

        This setting only takes effect when advanced_features.enable_rtm is true. See Listen to agent events to learn how to use client components to receive agent performance data.

      • enable_error_message booleannullable

        Default: false

        Whether to receive agent error events:

        • true: Receive agent error events.
      • false: Do not receive agent error events.

        This setting only takes effect when advanced_features.enable_rtm is true. See Listen to agent events to learn how to use client components to receive agent error events.

Response

  • If the returned status code is 200, the request was successful. The response body contains the result of the request.

    OK
    • agent_id string

      The unique ID of the agent instance.

    • create_ts integer

      Unix timestamp (in seconds) of when the agent was created.

    • status string

      Possible values: IDLE, STARTING, RUNNING, STOPPING, STOPPED, RECOVERING, FAILED

      Current status.

      • IDLE (0): Agent is idle.
      • STARTING (1): The agent is being started.
      • RUNNING (2): The agent is running.
      • STOPPING (3): The agent is stopping.
      • STOPPED (4): The agent has exited.
      • RECOVERING (5): The agent is recovering.
      • FAILED (6): The agent failed to execute.
  • If the returned status code is not 200, the request failed. The response body includes the detail and reason for failure. Refer to status codes to understand the possible reasons for failure.

Reference

Deprecated parameters

The following turn detection configuration is deprecated. To create more natural conversations and reduce unintended interruptions, Agora recommends using the latest version of turn_detection above.

Turn detection
  • turn_detection objectnullable

    Conversation turn detection settings.

    • type stringnullable

      Default: agora_vad

      Possible values: agora_vad, server_vad, semantic_vad

      Turn detection mechanism.

      • agora_vad: Agora VAD. Compatible with both cascade (ASR/LLM/TTS) and MLLM modes.
      • server_vad: The model detects the start and end of speech based on audio volume and responds at the end of user speech. Only available when mllm is enabled and OpenAI Realtime or Gemini Live is selected. The detection behavior is controlled by the LLM provider.
      • semantic_vad: Uses a turn detection model in conjunction with VAD to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability for more natural conversations. Only available when mllm is enabled and OpenAI is selected.
    • interrupt_mode stringdeprecatednullable

      Default: interrupt

      Sets the agent's behavior when human voice interrupts the agent while it is interacting (speaking or thinking). Choose from the following values:

      • interrupt: The agent immediately stops the current interaction and processes the human voice input.
      • append: The agent completes the current interaction, then processes the human voice input.
      • ignore: The agent discards the human voice input without processing or storing it in the context.
      • keywords: The agent stops its current interaction after detecting any of the keywords specified in turn_detection.interrupt_keywords.
      • adaptive: The agent dynamically increases the voice continuity threshold while speaking to reduce accidental interruptions.
      info
      • Only the interrupt mode is supported when you integrate an mllm.
      • The keywords interruption mode and the intelligent interruption feature (advanced_features.enable_aivad) are mutually exclusive and cannot be enabled simultaneously.
    • interrupt_duration_ms numberdeprecatednullable

      Default: 160

      The amount of time in milliseconds that the user's voice must exceed the VAD threshold before an interruption is triggered.

    • interrupt_keywords array[string]deprecatednullable

      Specifies the list of keywords that trigger an interruption when turn_detection.interrupt_mode is set to keywords.

      When the agent detects any of these keywords in the user's speech, it immediately stops its current interaction and processes the new input.

      info
      • Keyword recognition capabilities, such as support for multiple languages or dialects, depend on the ASR provider you choose.
      • You can configure up to 128 keywords.
    • prefix_padding_ms integerdeprecatednullable

      Default: 800

      The extra forward padding time in milliseconds before the processing system starts to process the speech input. This padding helps capture the beginning of the speech.

    • silence_duration_ms integerdeprecatednullable

      Default: 640

      The duration of audio silence in milliseconds. If no voice activity is detected during this period, the agent assumes that the user has stopped speaking.

    • threshold numberdeprecatednullable

      Default: 0.5

      Range: (0.0, 1.0)

      Identification sensitivity determines the level of sound in the audio signal that is considered voice activity. Lower values make it easier for the agent to detect speech, and higher values ignore weak sounds.

    • create_response booleannullable

      Default: true

      Whether to automatically generate a response when a VAD stop event occurs. Only available in server_vad and semantic_vad modes when using OpenAI Realtime API.

    • interrupt_response booleannullable

      Default: true

      Whether to automatically interrupt any ongoing response when a VAD start event occurs. Only available in server_vad and semantic_vad modes when using OpenAI Realtime API.

    • eagerness stringnullable

      Default: auto

      Possible values: auto, low, high

      The eagerness of the model to respond:

      • auto: Equivalent to medium
      • low: Wait longer for the user to continue speaking
      • high: Respond more quickly

      Only available in semantic_vad mode when using OpenAI Realtime API.

Authorization

This endpoint requires Basic Auth.
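
As a sketch, assuming your credentials are an Agora customer key and customer secret, you can build the header value like this; the key and secret are placeholders:

  # Base64-encode "<customer_key>:<customer_secret>" to form the Basic Auth credential
  credentials=$(printf "%s:%s" "<your_customer_key>" "<your_customer_secret>" | base64)
  echo "Authorization: Basic $credentials"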

Request examples


curl --request POST \
  --url https://api.agora.io/api/conversational-ai-agent/v2/projects/:appid/join \
  --header 'Authorization: Basic <your_base64_encoded_credentials>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "name": "unique_name",
  "properties": {
    "channel": "channel_name",
    "token": "token",
    "agent_rtc_uid": "1001",
    "remote_rtc_uids": [
      "1002"
    ],
    "idle_timeout": 120,
    "llm": {
      "url": "https://api.openai.com/v1/chat/completions",
      "api_key": "<your_llm_key>",
      "system_messages": [
        {
          "role": "system",
          "content": "You are a helpful chatbot."
        }
      ],
      "max_history": 32,
      "greeting_message": "Hello, how can I assist you today?",
      "failure_message": "Please hold on a second.",
      "params": {
        "model": "gpt-4o-mini"
      }
    },
    "tts": {
      "vendor": "microsoft",
      "params": {
        "key": "<your_tts_api_key>",
        "region": "eastus",
        "voice_name": "en-US-AndrewMultilingualNeural"
      }
    },
    "asr": {
      "language": "en-US"
    }
  }
}'

Response example


{
  "agent_id": "1NT29X10YHxxxxxWJOXLYHNYB",
  "create_ts": 1737111452,
  "status": "RUNNING"
}