Audio output mode
In addition to text output, the Agora Conversational AI Engine can deliver responses in audio format, allowing for more natural and immersive user interactions. This page describes how to configure the agent for audio output and modify the interface to support features such as context management, subtitle alignment, and agent broadcasting.
Prerequisites
Before you begin, make sure you have the following:
- A reference implementation of a Conversational AI agent that includes the basic logic for interacting with an AI agent.
- If you plan to implement subtitles for audio output, read the Display live subtitles documentation and complete the required configuration.
Implement audio output
Take the following steps to set up and configure the audio output mode.
Select the output mode
To configure the output mode, set the `llm.output_modalities` field when you Start a conversational AI agent, as follows:

- `["audio"]`: Sets the output to audio-only mode. In this configuration, you don't need to set up a text-to-speech (TTS) module. The agent directly plays the audio returned by the custom LLM. This page focuses on how to use this audio-only mode; a configuration sketch follows this list.
- `["text", "audio"]`: Sets the output to both text and audio modes. In this configuration, two types of audio are returned: one generated by the TTS module and one provided by the custom LLM.
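For example, an audio-only configuration might look like the following sketch. The `output_modalities` field is the setting described above; the `url`, `api_key`, and `params` values are placeholders that stand in for your existing custom LLM configuration:

```json
{
  "llm": {
    "url": "https://your-llm-service.example.com/chat/completions",
    "api_key": "<your_llm_api_key>",
    "output_modalities": ["audio"],
    "params": {
      "model": "your-audio-capable-model"
    }
  }
}
```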
Modify the LLM interface
To use the Agora Conversational AI Engine with the OpenAI Chat Completions API, ensure that your LLM service is compatible with the expected request and response formats. Agora uses an extended API that supports multiple response types, including text, audio, subtitles, and verbatim subtitle timestamps. It also introduces an additional `words` field for real-time subtitle alignment.
To adapt your LLM service, refer to Custom LLM for guidance on transforming your API to meet these requirements.
Request format
Compared with text requests, audio requests include two additional optional fields:

- `modalities`: Specifies the output mode.
- `audio`: Specifies the output timbre (`voice`) and `format`.

Include these fields in the `llm.params` object of the Start a conversational AI agent request.
The following example shows the format of an audio request:
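This is a minimal sketch of the `llm.params` object for an audio request; the `model`, `voice`, and `format` values are illustrative, so substitute the values your LLM service supports:

```json
{
  "llm": {
    "params": {
      "model": "your-audio-capable-model",
      "modalities": ["text", "audio"],
      "audio": {
        "voice": "alloy",
        "format": "pcm16"
      }
    }
  }
}
```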
Response format
The audio response includes three types of content. You can send each type to the agent independently for processing:
Audio type | Description | Source | Agent processing |
---|---|---|---|
Audio data (`data`) | Base64-encoded PCM byte stream array | LLM generation | Plays the audio directly |
Transcription content (`transcript`) | The complete text content corresponding to the audio | LLM generation | Stores the text in short-term memory (context) |
Verbatim subtitles (`words`) | Subtitle content with word-by-word timestamps | LLM generation with verbatim output support | Processes into verbatim real-time subtitles |
The specific data structure of a streaming response is as follows:
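This is a minimal sketch of one streaming chunk, assuming the standard Chat Completions `choices[].delta` envelope; the `audio` field carries the content types from the table above, and all sample values are illustrative:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "choices": [
    {
      "index": 0,
      "delta": {
        "audio": {
          "data": "<Base64-encoded PCM byte stream>",
          "transcript": "Hello, how can I help you today?",
          "words": [
            { "text": "Hello,", "start_ts": 0, "end_ts": 320, "duration": 320 }
          ]
        }
      },
      "finish_reason": null
    }
  ]
}
```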
The Conversational AI agent parses each type of content from the `audio` object, which contains the following fields:
- `data` (string): Audio data as a Base64-encoded PCM byte stream.
- `transcript` (string): Subtitle content corresponding to the audio.
- `words` (array): An array of word-level subtitle objects. The LLM must support word-level output. Each object contains the following fields:
  - `text` (string): The spoken word.
  - `start_ts` (number): Start time of the word, in milliseconds, relative to the beginning of the PCM audio data.
  - `end_ts` (number): End time of the word, in milliseconds, relative to the beginning of the PCM audio data.
  - `duration` (number): Duration, in milliseconds, for which the word is played.
Depending on your application use case, configure your custom LLM to selectively process and return the relevant fields:
- Audio-only output: Only the `data` field is required.
- Subtitle-only output: Only the `transcript` field is required. The agent displays the subtitle but does not play audio through the TTS module.
- Audio with verbatim subtitles: The `data` and `words` fields are required. Each item in the `words` array must include the `text`, `start_ts`, `end_ts`, and `duration` fields. See the sketch after this list.
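For the audio-with-verbatim-subtitles case, the `audio` object in a response might look like the following sketch; the words and timestamps are illustrative:

```json
{
  "audio": {
    "data": "<Base64-encoded PCM byte stream>",
    "words": [
      { "text": "Hello,", "start_ts": 0, "end_ts": 320, "duration": 320 },
      { "text": "world.", "start_ts": 320, "end_ts": 700, "duration": 380 }
    ]
  }
}
```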
Advanced features
Conversational AI Engine supports the following advanced features:
Context management
When the response contains the `audio.transcript` field, the agent automatically stores the subtitle content in its context manager for use in subsequent interactions. If the `audio.transcript` field is not included, the content is not stored.
To ensure the agent retains the audio modality output in short-term memory, include the `audio.transcript` field in the response.
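As an illustration, if an earlier audio response carried `transcript: "The weather is sunny today."`, a later turn's `messages` list passed to the LLM might include that text as an assistant message; the conversation content shown here is illustrative:

```json
{
  "messages": [
    { "role": "user", "content": "What's the weather like?" },
    { "role": "assistant", "content": "The weather is sunny today." },
    { "role": "user", "content": "Will it rain tomorrow?" }
  ]
}
```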
Subtitle alignment
When you Display live subtitles, the Conversational AI Engine can use the `audio.words` field to segment the audio. The engine aligns subtitles based on the `start_ts`, `end_ts`, and `duration` fields within `audio.words`.
To enable the agent to segment the audio based on subtitle content during playback, make sure that the LLM sends both the `audio.data` and `audio.words` fields in the response.
Agent message broadcasts
If the agent isn't configured with a TTS module but is set to use audio output (`output_modalities` set to `["audio"]`), you can enable it to broadcast custom messages by adapting your LLM as follows:
In the `messages` list received by the agent and passed to the LLM, the model should handle the last message based on its `role` (see the sketch after this list):
- `assistant`: The model treats the message as a directive that doesn't require reasoning. It converts the message to audio and returns it to the agent. The agent plays the audio directly.
- `user`: The model treats the message as a prompt that requires reasoning. The agent decides whether to broadcast the message based on whether the model's response includes audio. This corresponds to a typical user-agent conversation.
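This is a minimal sketch of a broadcast directive, where the last message carries the `assistant` role; the message text and the parameter values are illustrative:

```json
{
  "model": "your-audio-capable-model",
  "messages": [
    { "role": "user", "content": "What's the weather like?" },
    { "role": "assistant", "content": "The weather is sunny today." },
    { "role": "assistant", "content": "You have a meeting starting in 5 minutes." }
  ],
  "modalities": ["text", "audio"],
  "audio": { "voice": "alloy", "format": "pcm16" }
}
```

Because the last message has the `assistant` role, the model converts that text to audio and returns it without generating a new reply, and the agent plays it directly.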
The following agent broadcasts also adopt this protocol:
- When the agent announces a greeting (`greeting_message`), a processing failure prompt (`failure_message`), or a silent prompt (`silence_message`).
- When you Broadcast a custom message using the TTS module.
Reference
This section contains content that completes the information on this page, or points you to documentation that explains other aspects of this product.
Sample project
Agora provides an open-source Conversational AI sample server project for your reference. Download the project or view the source code for a complete example.