
Audio output mode

In addition to text output, the Agora Conversational AI Engine can deliver responses in audio format, allowing for more natural and immersive user interactions. This page describes how to configure the agent for audio output and modify the interface to support features such as context management, subtitle alignment, and agent broadcasting.

Prerequisites

Before you begin, make sure you have the following:

  • A reference implementation of a Conversational AI agent that includes the basic logic for interacting with an AI agent.
  • If you plan to implement subtitles for audio output, read the Display live subtitles documentation and complete the required configuration.

Implement audio output

Take the following steps to set up and configure the audio output mode.

Select the output mode

To configure the output mode, set the llm.output_modalities field when you Start a conversational AI agent, as follows:

  • ["audio"]: Sets the output to audio-only mode. In this configuration, you don't need to set up a text-to-speech (TTS) module. The agent directly plays the audio returned by the custom LLM. This page focuses on how to use this audio-only mode.

  • ["text", "audio"]: Sets the output to both text and audio modes. In this configuration, two types of audio are returned: one generated by the TTS module and one provided by the custom LLM.

Modify the LLM interface

To use the Agora Conversational AI Engine with the OpenAI Chat Completions API, ensure that your LLM service is compatible with the expected request and response formats. Agora uses an extended API that supports multiple response types, including text, audio, subtitles, and verbatim subtitle timestamps. It also introduces an additional words field for real-time subtitle alignment.

To adapt your LLM service, refer to Custom LLM for guidance on transforming your API to meet these requirements.

Request format

Compared with text requests, audio requests include two additional optional fields:

  • modalities: Specifies the output mode.
  • audio: Specifies the output timbre (voice) and format.

Include these fields in the llm.params object of the Start a conversational AI agent request.

The following example shows the format of an audio request:


{
    "model": "gpt-4o-audio-preview",
    "modalities": ["audio"],
    "audio": { "voice": "alloy", "format": "wav" },
    "messages": [
        {
            "role": "user",
            "content": "Is a golden retriever a good family dog?"
        }
    ]
}
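
On the service side, the following Python sketch shows how a custom LLM endpoint might read these fields from the incoming request body before deciding which modalities to produce. The variable names are illustrative.

# Sketch (Python): reading the audio-related fields from an incoming
# Chat Completions style request on a custom LLM service.
import json

raw_body = """
{
  "model": "gpt-4o-audio-preview",
  "modalities": ["audio"],
  "audio": { "voice": "alloy", "format": "wav" },
  "messages": [
    { "role": "user", "content": "Is a golden retriever a good family dog?" }
  ]
}
"""

request_body = json.loads(raw_body)
modalities = request_body.get("modalities", ["text"])  # defaults to text-only if absent
audio_options = request_body.get("audio", {})          # requested voice and format
wants_audio = "audio" in modalities                    # True for this request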

Response format

The audio response includes three types of content. You can send each type to the agent independently for processing:

• Audio data (data)
  Description: Base64-encoded PCM byte stream.
  Source: An LLM with audio generation capabilities, or a custom audio processing service.
  Agent processing: Plays the audio directly.

• Transcription content (transcript)
  Description: The complete text content corresponding to the audio.
  Source: LLM generation.
  Agent processing: Stores the text in short-term memory (context).

• Verbatim subtitles (words)
  Description: Subtitle content with word-by-word timestamps.
  Source: An LLM that supports verbatim (word-level) output.
  Agent processing: Processes the timestamps into verbatim real-time subtitles.

The specific data structure of a streaming response is as follows:

{"choices":[{"index":0,"delta":{"role":"assistant","audio":{"data": ""}},"logprobs":null,"finish_reason":null}]}

// Audio data
{"choices":[{"index":0,"delta":{"audio":{"data": "base64 encoded pcm data"}},"logprobs":null,"finish_reason":null}]}

// Transcription subtitles
{"choices":[{"index":0,"delta":{"audio":{"transcript": "Hello world!", "words":[{"text":"Hello", "start_ts":100, "end_ts":140, "duration":40}]}},"logprobs":null,"finish_reason":null}]}

// Word-by-word subtitles
{"choices":[{"index":0,"delta":{"audio":{"words":[{"text":"world", "start_ts":150, "end_ts":190, "duration":40}, {"text":"!", "start_ts":190, "end_ts":200, "duration":10}]}},"logprobs":null,"finish_reason":null}]}

....

{"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}

The Conversational AI agent aggregates the different response types into the following combined format:


{
    "choices": [
        {
            "delta": {
                "audio": {
                    "data": "base64 encoded pcm data",
                    "transcript": "Hello world!",
                    "words": [
                        {
                            "text": "Hello",
                            "start_ts": 100,
                            "end_ts": 140,
                            "duration": 40
                        },
                        {
                            "text": "world",
                            "start_ts": 150,
                            "end_ts": 190,
                            "duration": 40
                        },
                        {
                            "text": "!",
                            "start_ts": 190,
                            "end_ts": 200,
                            "duration": 10
                        }
                    ]
                }
            }
        }
    ]
}

The audio object contains the following fields:

• data string

  Audio data as a Base64-encoded PCM byte stream.

• transcript string

  Subtitle content corresponding to the audio.

• words array

  An array of word-level subtitle objects. The LLM must support word-level output.

    • text string

      The spoken word.

    • start_ts number

      Start time in milliseconds relative to the beginning of the PCM audio data.

    • end_ts number

      End time in milliseconds relative to the beginning of the PCM audio data.

    • duration number

      Duration in milliseconds that the word is played.

Depending on your application use case, configure your custom LLM to return only the relevant fields:

• Audio-only output: Only the data field is required.

• Subtitle-only output: Only the transcript field is required. The agent displays the subtitle but does not play it using the TTS module.

• Audio with verbatim subtitles: The data and words fields are required. Each item in the words array must include the text, start_ts, end_ts, and duration fields.
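
The following Python sketch (the function name is illustrative, not part of the Agora API) assembles the delta.audio object for one chunk, including only the fields the chosen use case requires.

# Sketch (Python): build the delta.audio object for one streaming chunk,
# including only the fields required by the chosen use case.
import base64

def build_audio_delta(pcm_bytes=None, transcript=None, words=None):
    audio = {}
    if pcm_bytes is not None:
        # Audio-only output: Base64-encode the raw PCM bytes.
        audio["data"] = base64.b64encode(pcm_bytes).decode("ascii")
    if transcript is not None:
        # Subtitle-only output, or context retention alongside audio.
        audio["transcript"] = transcript
    if words is not None:
        # Verbatim subtitles: each item needs text, start_ts, end_ts, duration.
        audio["words"] = words
    return {"audio": audio}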

Advanced features

Conversational AI Engine supports the following advanced features:

Context management

When the response contains the audio.transcript field, the agent automatically stores the subtitle content in its context manager for use in subsequent interactions. If the audio.transcript field is not included, the content is not stored.

To ensure the agent retains the audio modality output in short-term memory, include the audio.transcript field in the response.

Subtitle alignment

When you Display live subtitles, the Conversational AI Engine can use the audio.words field to segment the audio. The engine aligns subtitles based on the start_ts, end_ts, and duration fields within audio.words.

To enable the agent to segment the audio based on subtitle content during playback, make sure that the LLM sends both the audio.data and audio.words fields in the response.
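
To illustrate how these timestamps relate to the PCM stream, the following Python sketch derives start_ts, end_ts, and duration from the byte length of each word's PCM segment. The 16 kHz, 16-bit mono format is an assumption; substitute the parameters of your own audio.

# Sketch (Python): derive word timestamps, in milliseconds relative to the
# start of the PCM data, from the byte length of each word's PCM segment.
# The 16 kHz, 16-bit mono format below is an assumption.
def words_from_pcm_segments(word_segments, sample_rate=16000, bytes_per_sample=2):
    bytes_per_ms = sample_rate * bytes_per_sample / 1000  # mono PCM
    words, cursor_ms = [], 0
    for text, pcm in word_segments:
        duration_ms = round(len(pcm) / bytes_per_ms)
        words.append({
            "text": text,
            "start_ts": cursor_ms,
            "end_ts": cursor_ms + duration_ms,
            "duration": duration_ms,
        })
        cursor_ms += duration_ms
    return words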

Agent message broadcasts

If the agent isn't configured with a TTS module but is set to use audio output (output_modalities set to ["audio"]), you can enable it to broadcast custom messages by adapting your LLM as follows:

In the messages list received by the agent and passed to the LLM, the model should handle the last message based on its role:

• assistant: The model treats the message as a directive that doesn't require reasoning. It converts the message to audio and returns it to the agent. The agent plays the audio directly.

• user: The model treats the message as a prompt that requires reasoning. The agent decides whether to broadcast the message based on whether the model's response includes audio. This corresponds to a typical user–agent conversation.
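
A minimal Python sketch of this role-based routing in the custom LLM follows. The synthesize and generate_reply functions are placeholders for your own text-to-audio and inference logic.

# Sketch (Python): route the final message by role, as described above.
# synthesize and generate_reply are placeholders for your own implementations.
def handle_last_message(messages, synthesize, generate_reply):
    last = messages[-1]
    if last["role"] == "assistant":
        # Broadcast directive: skip reasoning and convert the text to audio as-is.
        return synthesize(last["content"])
    # Normal user turn: run inference and respond with audio.
    return generate_reply(messages)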

info

The following agent broadcasts also adopt this protocol:

Reference

This section contains content that completes the information on this page, or points you to documentation that explains other aspects of this product.

Sample project

Agora provides an open-source Conversational AI sample server project for your reference. Download the project or view the source code for a complete example.