
Start a conversational AI agent

POST
https://api.agora.io/api/conversational-ai-agent/v2/projects/{appid}/join

Use this endpoint to create and start a Conversational AI agent instance.

Request

Path parameters

appid string (required)

The App ID of the project.

Request body

Content type: application/json. The request body is required.
  • name string (required)

    The unique identifier of the agent. The same identifier cannot be used repeatedly.

  • properties object (required)

    Configuration details of the agent.

    • channel string (required)

      The name of the channel to join.

    • token string (required)

      The authentication token used by the agent to join the channel.

    • agent_rtc_uid string (required)

      The user ID of the agent in the channel. A value of 0 means that a random UID is generated and assigned. Set the token accordingly.

    • remote_rtc_uids array[string] (required)

      A list of user IDs that the agent subscribes to in the channel. Only subscribed users can interact with the agent. Use "*" to subscribe to all users in the channel.

      info

      The "*" selector includes all UIDs present in the channel, which may include other AI agents. If you're running multiple agents in the same channel, review the best practices under the idle_timeout parameter to avoid unintended behavior and unnecessary usage costs.

    • enable_string_uid boolean (nullable)

      Default: false

      Whether to enable string user IDs:

      • true: Both agent and subscriber user IDs use strings.
      • false: Both agent and subscriber user IDs must be integers.
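
      For illustration, a minimal sketch of the related fields when string user IDs are enabled (the ID values are placeholders):

      {
        "agent_rtc_uid": "agent_01",
        "remote_rtc_uids": ["user_alice", "user_bob"],
        "enable_string_uid": true
      }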
    • idle_timeout integer (nullable)

      Default: 30

      Sets the timeout that starts after all of the users specified in remote_rtc_uids are detected to have left the channel. When the timeout is exceeded, the agent automatically stops and exits the channel. A value of 0 means that the agent does not exit until it is stopped manually.

      Multi-agent use cases

      If multiple AI agents are active in the same channel and are configured to subscribe to all users using remote_rtc_uids: ["*"], they detect each other's presence. As a result, the idle_timeout condition (all other users having left the channel) might never be triggered. This can cause agents to run indefinitely and lead to significant unintended usage.

      Agent lifecycle best practice

      For precise and reliable control over the agent's lifecycle, use the leave API to terminate the agent as soon as its task is complete.
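
      For example, instead of subscribing with "*", a properties fragment for a multi-agent channel can list only the human participants explicitly (the UID and timeout values below are placeholders):

      {
        "remote_rtc_uids": ["1002", "1003"],
        "idle_timeout": 120
      }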

    • advanced_features object (nullable)

      Advanced features configuration.

      • enable_aivad boolean (nullable)

        Default: false

        Whether to enable the intelligent interruption handling function (AIVAD). This feature is currently available only for English.

      • enable_rtm boolean (nullable)

        Default: false

        Whether to enable the Signaling (RTM) service. When enabled, the agent can combine the capabilities provided by Signaling to implement advanced functions, such as delivering custom information.

        info

        Before enabling the Signaling service, make sure the token includes both RTC and RTM privileges. When an agent joins an RTM channel, it reuses the token specified in the token field. For more information, see "How can I generate a token with both RTC and Signaling privileges?".

    • asr object (nullable)

      Automatic Speech Recognition (ASR) configuration.

      • language string (nullable)

        Default: en-US

        Possible values: en-US, es-ES, ja-JP, ko-KR, ar-AE, hi-IN

        The language used by users to interact with the agent. The following languages are in Beta:

        • es-ES: Spanish - Spain
        • ja-JP: Japanese
        • ko-KR: Korean
        • ar-AE: Arabic - UAE
        • hi-IN: Hindi - India

    • tts object (required)

      Text-to-speech (TTS) module configuration.

      • vendor string (required)

        Possible values: microsoft, elevenlabs

        TTS provider.

        • microsoft: Microsoft Azure
        • elevenlabs: ElevenLabs

      • params object (required)

        The configuration parameters for the TTS vendor. See TTS vendor configuration for details.

      • skipPatterns array[integer] (nullable)

        Controls whether the TTS module skips bracketed content when reading LLM response text. This prevents the agent from vocalizing structural prompt information like tone indicators, action descriptions, and system prompts, creating a more natural and immersive listening experience. Enable this feature by specifying one or more values:

        • 1: Skip content in Chinese full-width parentheses （）
        • 2: Skip content in Chinese square brackets 【】
        • 3: Skip content in parentheses ( )
        • 4: Skip content in square brackets [ ]
        • 5: Skip content in curly braces { }
        info
        • Nested brackets: When input text contains nested brackets and multiple bracket types are configured to be skipped, the system processes only the outermost brackets. The system matches from the beginning of the text and skips the first outermost bracket pair that meets the skip rule, including all nested content.
        • Agent memory: The agent's short-term memory always contains the complete, unfiltered LLM text, regardless of live captioning settings.
        • Real-time subtitles: When enabled, subtitles exclude filtered content during TTS playback but restore the complete text after each sentence finishes.
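
        As an illustration, the following tts sketch skips content in parentheses and square brackets (the Microsoft parameters are placeholders; see TTS vendor configuration below):

        {
          "vendor": "microsoft",
          "params": {
            "key": "<your_microsoft_key>",
            "region": "eastus",
            "voice_name": "en-US-AndrewMultilingualNeural"
          },
          "skipPatterns": [3, 4]
        }

        With this configuration, an LLM response such as "Sure! [cheerful tone] Let's get started (smiles)." is voiced roughly as "Sure! Let's get started."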
    • llm object (required)

      Large language model (LLM) configuration.

      • url string (required)

        The LLM callback address.

      • api_key string (nullable)

        The API key used to authenticate requests to the LLM. The default value is an empty string. Make sure you use a valid API key in a production environment.

      • system_messages array[object] (nullable)

        A set of predefined information used as input to the LLM, including prompt words and examples.

      • params object (nullable)

        Additional LLM information transmitted in the message body, such as the model used and the maximum token limit.

      • max_history integer (nullable)

        Default: 32

        The number of conversation history messages cached in the custom LLM. The minimum value is 1. History includes user and agent dialog messages, tool call information, and timestamps. Agent and user messages are recorded separately.

      • input_modalities array[string] (nullable)

        Default: ["text"]

        LLM input modalities:

        • ["text"]: Text only
        • ["text", "image"]: Text plus image; requires the selected LLM to support visual input
      • output_modalities array[string] (nullable)

        Default: ["text"]

        LLM output modalities:

        • ["text"]: The output text is converted to speech by the TTS module and then published to the RTC channel.
        • ["audio"]: Voice only. Voice is published directly to the RTC channel.
        • ["text", "audio"]: Text plus voice. Write your own logic to process the output of LLM as needed.
      • greeting_message string (nullable)

        Agent greeting. If provided, the first user in the channel is automatically greeted with the message upon joining.

      • failure_message string (nullable)

        Prompt for agent activation failure. If provided, it is returned through TTS when the custom LLM call fails.

      • vendor string (nullable)

        The LLM provider. Supports the following settings:

        • custom: Custom LLM. When you set this option, the agent includes the following fields, in addition to role and content when making requests to the custom LLM:
          • turn_id: A unique identifier for each conversation turn. It starts from 0 and increments with each turn. One user-agent interaction corresponds to one turn_id.
          • timestamp: The request timestamp, in milliseconds.
      • style string (nullable)

        Default: openai

        Possible values: openai, gemini, anthropic

        The request style for chat completion.
        Use openai for OpenAI-compatible APIs or a custom LLM.

    • vad object (nullable)

      Voice Activity Detection (VAD) configuration.

      • interrupt_duration_ms number (nullable)

        Default: 160

        The amount of time in milliseconds that the user's voice must exceed the VAD threshold before an interruption is triggered.

      • prefix_padding_ms integer (nullable)

        Default: 800

        The extra forward padding time in milliseconds before the processing system starts to process the speech input. This padding helps capture the beginning of the speech.

      • silence_duration_ms integer (nullable)

        Default: 640

        The duration of audio silence in milliseconds. If no voice activity is detected during this period, the agent assumes that the user has stopped speaking.

      • threshold number (nullable)

        Default: 0.5

        The recognition sensitivity, which determines the level of sound in the audio signal that is considered voice activity. The value range is (0.0, 1.0). Lower values make it easier for the agent to detect speech; higher values cause weaker sounds to be ignored.
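
      As a sketch, a vad block that waits a little longer before treating a pause as the end of speech and ignores quieter background sounds could look like this (all values are illustrative, not recommendations):

      {
        "interrupt_duration_ms": 200,
        "prefix_padding_ms": 800,
        "silence_duration_ms": 800,
        "threshold": 0.6
      }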

    • turn_detection object (nullable)

      Conversation turn detection settings.

      • interrupt_mode string (nullable)

        Default: interrupt

        Sets the agent's behavior when human voice interrupts the agent while it is interacting (speaking or thinking). Choose from the following values:

        • interrupt: The agent immediately stops the current interaction and processes the human voice input.
        • append: The agent completes the current interaction, then processes the human voice input.
        • ignore: The agent discards the human voice input without processing or storing it in the context.
    • parameters object (nullable)

      Agent configuration parameters.

      • silence_config object (nullable)

        Settings related to agent silence behavior.

        • timeout_ms integer (nullable)

          Default: 0

          Possible values: 0 to 60000

          Specifies the maximum duration (in milliseconds) that the agent can remain silent. After the agent is successfully created and the user joins the channel, any time during which the agent is not listening, thinking, or speaking is considered silent time. When the silent time reaches the specified value, the agent broadcasts a silent reminder message. This feature is useful for prompting users when they become inactive.

          • 0: Disables the silent reminder feature.
          • (0, 60000]: Enables the silent reminder. You must also set content; otherwise, the configuration is invalid.
        • action string (nullable)

          Default: speak

          Specifies how the agent behaves when the silent timeout is reached. Valid values:

          • speak: Uses the TTS module to announce the silent prompt (content).
          • think: Appends the silent prompt (content) to the context and passes it to the LLM.
        • content string (nullable)

          Specifies the silent prompt message. How the message is used depends on the value of the action parameter.
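
        For example, the following parameters fragment makes the agent speak a reminder after 10 seconds of silence (the reminder text is a placeholder):

        {
          "parameters": {
            "silence_config": {
              "timeout_ms": 10000,
              "action": "speak",
              "content": "Are you still there?"
            }
          }
        }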

Response

  • If the returned status code is 200, the request was successful. The response body contains the result of the request.

    OK
    • agent_id string

      The unique ID of the agent instance.

    • create_ts integer

      The timestamp of when the agent was created.

    • status string

      Possible values: IDLE, STARTING, RUNNING, STOPPING, STOPPED, RECOVERING, FAILED

      Current status.

      • IDLE (0): Agent is idle.
      • STARTING (1): The agent is being started.
      • RUNNING (2): The agent is running.
      • STOPPING (3): The agent is stopping.
      • STOPPED (4): The agent has exited.
      • RECOVERING (5): The agent is recovering.
      • FAILED (6): The agent failed to execute.
  • If the returned status code is not 200, the request failed. The response body includes the detail and reason for failure. Refer to status codes to understand the possible reasons for failure.

Reference

TTS vendor configuration

Conversational AI Engine supports the following TTS vendors:

Microsoft

params (required)
  • key string (required)

    The API key used for authentication.

  • region string (required)

    The Azure region where the speech service is hosted.

  • voice_name string

    The identifier for the selected voice for speech synthesis.

  • speed number

    Indicates the speaking rate of the text. The rate can be applied at the word or sentence level and should be between 0.5 and 2.0 times the original audio speed.

  • volume number

    Default: 100

    Specifies the audio volume as a number between 0.0 and 100.0, where 0.0 is the quietest and 100.0 is the loudest. For example, a value of 75 sets the volume to 75% of the maximum.

  • sample_rate integer

    Default: 24000

    Specifies the audio sampling rate in Hz.

For further details, refer to Microsoft TTS.

Sample configuration

{
  "vendor": "microsoft",
  "params": {
    "key": "<your_microsoft_key>",
    "region": "eastus",
    "voice_name": "en-US-AndrewMultilingualNeural",
    "speed": 1.0,
    "volume": 70
  }
}

ElevenLabs

params (required)
  • key string (required)

    The API key used for authentication.

  • model_id string (required)

    The identifier of the model to be used.

  • voice_id string (required)

    The identifier for the selected voice for speech synthesis.

  • sample_rate integer

    Default: 24000

    Specifies the audio sampling rate in Hz.

  • stability number

    The stability for voice settings.

  • similarity_boost number
  • style number
  • use_speaker_boost boolean

For further details, refer to ElevenLabs TTS.

Sample configuration

{
  "vendor": "elevenlabs",
  "params": {
    "key": "<your_elevenlabs_key>",
    "model_id": "eleven_flash_v2_5",
    "voice_id": "pNInz6obpgDQGcFmaJgB"
  }
}

Authorization

This endpoint requires Basic Auth.
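
The following shell sketch shows one way to build the base64-encoded credential string that follows "Basic" in the Authorization header, assuming you authenticate with an Agora customer key and customer secret (the placeholder values are illustrative):

# Join the customer key and secret with a colon, then base64-encode the result.
credentials=$(printf '%s:%s' "<customer_key>" "<customer_secret>" | base64)
# Use the encoded string in the Authorization header of your request.
echo "Authorization: Basic $credentials"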

Request example


curl --request POST \
  --url https://api.agora.io/api/conversational-ai-agent/v2/projects/:appid/join \
  --header 'Authorization: Basic <your_base64_encoded_credentials>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "name": "unique_name",
  "properties": {
    "channel": "channel_name",
    "token": "token",
    "agent_rtc_uid": "1001",
    "remote_rtc_uids": [
      "1002"
    ],
    "idle_timeout": 120,
    "advanced_features": {
      "enable_aivad": true
    },
    "llm": {
      "url": "https://api.openai.com/v1/chat/completions",
      "api_key": "<your_llm_key>",
      "system_messages": [
        {
          "role": "system",
          "content": "You are a helpful chatbot."
        }
      ],
      "max_history": 32,
      "greeting_message": "Hello, how can I assist you today?",
      "failure_message": "Please hold on a second.",
      "params": {
        "model": "gpt-4o-mini"
      }
    },
    "tts": {
      "vendor": "microsoft",
      "params": {
        "key": "<your_tts_api_key>",
        "region": "eastus",
        "voice_name": "en-US-AndrewMultilingualNeural"
      }
    },
    "asr": {
      "language": "en-US"
    }
  }
}'

Response example


{
  "agent_id": "1NT29X10YHxxxxxWJOXLYHNYB",
  "create_ts": 1737111452,
  "status": "RUNNING"
}