Start a conversational AI agent
Start a conversational AI agent
https://api.agora.io/api/conversational-ai-agent/v2/projects/{appid}/join
Use this endpoint to create and start a Conversational AI agent instance.
Request
Path parameters
The App ID of the project
Request body
BODYrequired
- name stringrequired
The unique identifier of the agent. The same identifier cannot be used repeatedly.
- properties objectrequired
Configuration details of the agent.
- channel stringrequired
The name of the channel to join.
- token stringrequired
The authentication token used by the agent to join the channel.
- agent_rtc_uid stringrequired
The user ID of the agent in the channel. A value of
0
means that a random UID is generated and assigned. Set thetoken
accordingly. - remote_rtc_uids array[string]required
The list of user IDs that the agent subscribes to in the channel. Only subscribed users can interact with the agent.
"*"
means that the agent subscribes to all users in the channel. - enable_string_uid booleannullable
Default:
false
Whether to enable String uid:
true
: Both agent and subscriber user IDs use strings.false
: Both agent and subscriber user IDs must be integers.
- idle_timeout integernullable
Default:
30
Sets the timeout after all the users specified in
remote_rtc_uids
are detected to have left the channel. When the timeout value is exceeded, the agent automatically stops and exits the channel. A value of0
means that the agent does not exit until it is stopped manually. - advanced_features objectnullable
Advanced features configuration.
- enable_aivad booleannullable
Default:
false
Whether to enable the intelligent interruption handling function (AIVAD). This feature is currently available only for English.
- asr objectnullable
Automatic Speech Recognition (ASR) configuration.
- language stringnullable
Default:
en-US
Possible values:
en-US
,es-ES
,ja-JP
,ko-KR
,ar-AE
,hi-IN
The language used by users to interact with the agent. The following languages are in Beta:
es-ES
: Spanish - Spain-
ja-JP
: Japanese -
ko-KR
: Korean -
ar-AE
: Arabic - UAE -
hi-IN
: Hindi - India
- tts objectrequired
Text-to-speech (TTS) module configuration.
- vendor stringrequired
Possible values:
microsoft
,elevenlabs
TTS provider.
microsoft
: Microsoft Azure-
elevenlabs
: ElevenLabs
- params objectrequired
The configuration parameters for the TTS vendor. See TTS vendor configuration for details.
- llm objectrequired
Large language model (LLM) configuration.
- url stringrequired
The LLM callback address.
- api_key stringnullable
The LLM verification API key. The default value is an empty string. Ensure that you enable the API key in a production environment.
- system_messages array[object]nullable
A set of predefined information used as input to the LLM, including prompt words and examples.
- params objectnullable
Additional LLM information transmitted in the message body, such as the
model
used, and the maximum token limit. - max_history integernullable
Default:
10
The number of short-term memory entries cached in the custom LLM.
0
means no short-term memory is cached. Users and agents log entries separately. - input_modalities array[string]nullable
Default:
["text"]
LLM input modalities. Supports
["text"]
,["text", "image"]
. - output_modalities array[string]nullable
Default:
["text"]
LLM output modalities. Supports
["audio"]
,["text"]
,["text", "audio"]
. - greeting_message stringnullable
Agent greeting. If provided, the first user in the channel is automatically greeted with the message upon joining.
- failure_message stringnullable
Prompt for agent activation failure. If provided, it is returned through TTS when the custom LLM call fails.
- style stringnullable
Default:
openai
Possible values:
openai
,gemini
The request style for chat completion.
openai
includes OpenAI compatible APIs.
- vad objectnullable
Voice Activity Detection (VAD) configuration.
- interrupt_duration_ms numbernullable
Default:
160
The amount of time in milliseconds that the user's voice must exceed the VAD threshold before an interruption is triggered.
- prefix_padding_ms integernullable
Default:
300
The extra forward padding time in milliseconds before the processing system starts to process the speech input. This padding helps capture the beginning of the speech.
- silence_duration_ms integernullable
Default:
640
The duration of audio silence in milliseconds. If no voice activity is detected during this period, the agent assumes that the user has stopped speaking.
- threshold numbernullable
Default:
0.5
Identification sensitivity determines the level of sound in the audio signal that is considered voice activity. The value range is
(0.0, 1.0)
. Lower values make it easier for the agent to detect speech, and higher values ignore weak sounds.
Response
-
If the returned status code is
200
, the request was successful. The response body contains the result of the request.OK
- agent_id string
Unique id of the agent instance
- create_ts integer
Timestamp of when the agent was created
- status string
Possible values:
IDLE
,STARTING
,RUNNING
,STOPPING
,STOPPED
,RECOVERING
,FAILED
Current status.
IDLE
(0): Agent is idle.STARTING
(1): The agent is being started.RUNNING
(2): The agent is running.STOPPING
(3): The agent is stopping.STOPPED
(4): The agent has exited.RECOVERING
(5): The agent is recovering.FAILED
(6): The agent failed to execute.
-
If the returned status code is not
200
, the request failed. The response body includes thedetail
andreason
for failure. Refer to status codes to understand the possible reasons for failure.
Reference
TTS vendor configuration
Conversational AI Engine supports the following TTS vendors:
Microsoft
paramsrequired
- key stringrequired
The API key used for authentication.
- region stringrequired
The Azure region where the speech service is hosted.
- voice_name string
The identifier for the selected voice for speech synthesis.
- rate number
Indicates the speaking rate of the text. The rate can be applied at the word or sentence level and should be between 0.5 and 2.0 times the original audio speed.
- volume number
Default:
100
Specifies the audio volume as a number between 0.0 and 100.0, where 0.0 is the quietest and 100.0 is the loudest. For example, a value of 75 sets the volume to 75% of the maximum.
- sample_rate integer
Default:
24000
Specifies the audio sampling rate in Hz.
For further details, refer to Microsoft TTS.
Sample configuration
Elevenlabs
paramsrequired
- key stringrequired
The API key used for authentication.
- model_id stringrequired
Identifier of the model to be used,
- voice_id stringrequired
The identifier for the selected voice for speech synthesis.
- sample_rate integer
Default:
24000
Specifies the audio sampling rate in Hz.
- stability number
The stability for voice settings.
- similarity_boost number
- style number
- use_speaker_boost boolean
For further details, refer to Elevenlabs TTS.
Sample configuration