
Audio output mode

In addition to text output, the Agora Conversational AI Engine can deliver responses in audio format, allowing for more natural and immersive user interactions. This page describes how to configure the agent for audio output and modify the interface to support features such as context management, subtitle alignment, and agent broadcasting.

Prerequisites

Before you begin, make sure you have the following:

  • A reference implementation of a Conversational AI agent that includes the basic logic for interacting with an AI agent.
  • If you plan to implement subtitles for audio output, read the Display live subtitles documentation and complete the required configuration.

Implement audio output

Take the following steps to set up and configure the audio output mode.

Select the output mode

To configure the output mode, set the llm.output_modalities field when you Start a conversational AI agent, as follows:

  • ["audio"]: Sets the output to audio-only mode. In this configuration, you don't need to set up a text-to-speech (TTS) module. The agent directly plays the audio returned by the custom LLM. This page focuses on how to use this audio-only mode.

  • ["text", "audio"]: Sets the output to both text and audio modes. In this configuration, two types of audio are returned: one generated by the TTS module and one provided by the custom LLM.

Modify the LLM interface

To use the Agora Conversational AI Engine with the OpenAI Chat Completions API, ensure that your LLM service is compatible with the expected request and response formats. Agora uses an extended API that supports multiple response types, including text, audio, subtitles, and verbatim subtitle timestamps. It also introduces an additional words field for real-time subtitle alignment.

To adapt your LLM service, refer to Custom LLM for guidance on transforming your API to meet these requirements.

Request format

Compared with text requests, audio requests include two additional optional fields:

  • modalities: Specifies the output mode.
  • audio: Specifies the output timbre (voice) and format.

Include these fields in the llm.params object of the Start a conversational AI agent request.

The following example shows the format of an audio request:


{
    "model": "gpt-4o-audio-preview",
    "modalities": ["audio"],
    "audio": { "voice": "alloy", "format": "wav" },
    "messages": [
        {
            "role": "user",
            "content": "Is a golden retriever a good family dog?"
        }
    ]
}
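
On the service side, the following Python sketch shows how a custom LLM endpoint might read these fields from the incoming request body before deciding which modalities to produce. The variable names are illustrative.

# Sketch (Python): reading the audio-related fields from an incoming
# Chat Completions style request on a custom LLM service.
import json

raw_body = """
{
  "model": "gpt-4o-audio-preview",
  "modalities": ["audio"],
  "audio": { "voice": "alloy", "format": "wav" },
  "messages": [
    { "role": "user", "content": "Is a golden retriever a good family dog?" }
  ]
}
"""

request_body = json.loads(raw_body)
modalities = request_body.get("modalities", ["text"])  # defaults to text-only if absent
audio_options = request_body.get("audio", {})          # requested voice and format
wants_audio = "audio" in modalities                    # True for this request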

Response format

The audio response includes three types of content. You can send each type to the agent independently for processing:

• Audio data (data)
  Description: Base64-encoded PCM byte stream.
  Source: An LLM with audio generation capabilities, or a custom audio processing service.
  Agent processing: Plays the audio directly.

• Transcription content (transcript)
  Description: The complete text content corresponding to the audio.
  Source: LLM generation.
  Agent processing: Stores the text in short-term memory (context).

• Verbatim subtitles (words)
  Description: Subtitle content with word-by-word timestamps.
  Source: An LLM that supports verbatim (word-level) output.
  Agent processing: Processes the timestamps into verbatim real-time subtitles.

The specific data structure of a streaming response is as follows:

{"choices":[{"index":0,"delta":{"role":"assistant","audio":{"data": ""}},"logprobs":null,"finish_reason":null}]}

// Audio data
{"choices":[{"index":0,"delta":{"audio":{"data": "base64 encoded pcm data"}},"logprobs":null,"finish_reason":null}]}

// Transcription subtitles
{"choices":[{"index":0,"delta":{"audio":{"transcript": "Hello world!", "words":[{"text":"Hello", "start_ts":100, "end_ts":140, "duration":40}]}},"logprobs":null,"finish_reason":null}]}

// Word-by-word subtitles
{"choices":[{"index":0,"delta":{"audio":{"words":[{"text":"world", "start_ts":150, "end_ts":190, "duration":40}, {"text":"!", "start_ts":190, "end_ts":200, "duration":10}]}},"logprobs":null,"finish_reason":null}]}

....

{"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}

The Conversational AI agent aggregates the different response types into the following combined format:


{
    "choices": [
        {
            "delta": {
                "audio": {
                    "data": "base64 encoded pcm data",
                    "transcript": "Hello world!",
                    "words": [
                        {
                            "text": "Hello",
                            "start_ts": 100,
                            "end_ts": 140,
                            "duration": 40
                        },
                        {
                            "text": "world",
                            "start_ts": 150,
                            "end_ts": 190,
                            "duration": 40
                        },
                        {
                            "text": "!",
                            "start_ts": 190,
                            "end_ts": 200,
                            "duration": 10
                        }
                    ]
                }
            }
        }
    ]
}

The audio object contains the following fields:

• data string

  Audio data as a Base64-encoded PCM byte stream.

• transcript string

  Subtitle content corresponding to the audio.

• words array

  An array of word-level subtitle objects. The LLM must support word-level output.

    • text string

      The spoken word.

    • start_ts number

      Start time in milliseconds relative to the beginning of the PCM audio data.

    • end_ts number

      End time in milliseconds relative to the beginning of the PCM audio data.

    • duration number

      Duration in milliseconds that the word is played.

Depending on your application use case, configure your custom LLM to return only the relevant fields:

• Audio-only output: Only the data field is required.

• Subtitle-only output: Only the transcript field is required. The agent displays the subtitle but does not play it using the TTS module.

• Audio with verbatim subtitles: The data and words fields are required. Each item in the words array must include the text, start_ts, end_ts, and duration fields.
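
The following Python sketch (the function name is illustrative, not part of the Agora API) assembles the delta.audio object for one chunk, including only the fields the chosen use case requires.

# Sketch (Python): build the delta.audio object for one streaming chunk,
# including only the fields required by the chosen use case.
import base64

def build_audio_delta(pcm_bytes=None, transcript=None, words=None):
    audio = {}
    if pcm_bytes is not None:
        # Audio-only output: Base64-encode the raw PCM bytes.
        audio["data"] = base64.b64encode(pcm_bytes).decode("ascii")
    if transcript is not None:
        # Subtitle-only output, or context retention alongside audio.
        audio["transcript"] = transcript
    if words is not None:
        # Verbatim subtitles: each item needs text, start_ts, end_ts, duration.
        audio["words"] = words
    return {"audio": audio}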

Advanced features

Conversational AI Engine supports the following advanced features:

Context management

When the response contains the audio.transcript field, the agent automatically stores the subtitle content in its context manager for use in subsequent interactions. If the audio.transcript field is not included, the content is not stored.

To ensure the agent retains the audio modality output in short-term memory, include the audio.transcript field in the response.

Subtitle alignment

When you Display live subtitles, the Conversational AI Engine can use the audio.words field to segment the audio. The engine aligns subtitles based on the start_ts, end_ts, and duration fields within audio.words.

To enable the agent to segment the audio based on subtitle content during playback, make sure that the LLM sends both the audio.data and audio.words fields in the response.
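
To illustrate how these timestamps relate to the PCM stream, the following Python sketch derives start_ts, end_ts, and duration from the byte length of each word's PCM segment. The 16 kHz, 16-bit mono format is an assumption; substitute the parameters of your own audio.

# Sketch (Python): derive word timestamps, in milliseconds relative to the
# start of the PCM data, from the byte length of each word's PCM segment.
# The 16 kHz, 16-bit mono format below is an assumption.
def words_from_pcm_segments(word_segments, sample_rate=16000, bytes_per_sample=2):
    bytes_per_ms = sample_rate * bytes_per_sample / 1000  # mono PCM
    words, cursor_ms = [], 0
    for text, pcm in word_segments:
        duration_ms = round(len(pcm) / bytes_per_ms)
        words.append({
            "text": text,
            "start_ts": cursor_ms,
            "end_ts": cursor_ms + duration_ms,
            "duration": duration_ms,
        })
        cursor_ms += duration_ms
    return words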

Agent message broadcasts

If the agent isn't configured with a TTS module but is set to use audio output (output_modalities set to ["audio"]), you can enable it to broadcast custom messages by adapting your LLM as follows:

In the messages list received by the agent and passed to the LLM, the model should handle the last message based on its role:

• assistant: The model treats the message as a directive that doesn't require reasoning. It converts the message to audio and returns it to the agent. The agent plays the audio directly.

• user: The model treats the message as a prompt that requires reasoning. The agent decides whether to broadcast the message based on whether the model's response includes audio. This corresponds to a typical user–agent conversation.
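
A minimal Python sketch of this role-based routing in the custom LLM follows. The synthesize and generate_reply functions are placeholders for your own text-to-audio and inference logic.

# Sketch (Python): route the final message by role, as described above.
# synthesize and generate_reply are placeholders for your own implementations.
def handle_last_message(messages, synthesize, generate_reply):
    last = messages[-1]
    if last["role"] == "assistant":
        # Broadcast directive: skip reasoning and convert the text to audio as-is.
        return synthesize(last["content"])
    # Normal user turn: run inference and respond with audio.
    return generate_reply(messages)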

info

The following agent broadcasts also adopt this protocol:

Reference

This section contains content that completes the information on this page, or points you to documentation that explains other aspects of this product.

Sample project

Agora provides an open-source Conversational AI sample server project for your reference. Download the project or view the source code for a complete example.