Architecture and specifications

The ConvoAI Device Kit solution is built on Agora's Media Stream Acceleration (RTSA) or IoT SDK and Conversational AI Engine. RTSA is Agora's RTC client product designed specifically for the IoT industry, providing the real-time communication foundation for voice AI interactions.

Solution architecture

The ConvoAI Device Kit solution consists of two main components connected through Agora's Software-Defined Real-Time Network (SDRTN®).

[Figure: Solution architecture]

  • Conversational AI Device Kit

    The hardware device handles local processing, including power management, Bluetooth pairing, voice wake-up, OTA updates, and peripheral driver support. It captures audio and video through dedicated modules, encodes the data, and transmits it through RTSA to Agora's SDRTN®. The device also receives and decodes audio and video responses for playback.

  • Conversational AI Engine

    The cloud service processes the incoming audio and video streams. It has two main components:

    • Convo AI Core: Handles AI-powered voice activity detection (VAD), noise reduction, echo cancellation, interruption detection, background voice filtering, and snapshot/frame capture

    • Cascade Model: Processes the audio through Speech-to-Text (STT), sends the text to a Large Language Model (LLM), and converts the response back to speech using Text-to-Speech (TTS)
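The cascade flow described above can be sketched as three chained stages. This is a minimal illustration only: `run_cascade`, `CascadeResult`, and the three stage functions are hypothetical names chosen for this example, not part of any Agora API, and each stage is stubbed so the code runs without a cloud service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    transcript: str     # text produced by the STT stage
    reply_text: str     # text produced by the LLM stage
    reply_audio: bytes  # audio produced by the TTS stage


def run_cascade(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> CascadeResult:
    """Chain STT -> LLM -> TTS; each stage consumes the previous stage's output."""
    transcript = speech_to_text(audio)
    reply_text = generate_reply(transcript)
    reply_audio = text_to_speech(reply_text)
    return CascadeResult(transcript, reply_text, reply_audio)


# Stub stages (hypothetical) standing in for real STT, LLM, and TTS services.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_cascade(b"hello", fake_stt, fake_llm, fake_tts)
print(result.reply_text)
```

Passing the stages in as callables keeps the sketch independent of any particular STT, LLM, or TTS provider; a real deployment would swap the stubs for service clients.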

The system uses G.711/G.722 audio codecs and MJPEG/H.264 video codecs for efficient real-time transmission between the device and cloud services, enabling low-latency conversational AI interactions.
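To give a sense of why these audio codecs suit constrained devices, the sketch below computes the encoded payload size of one audio frame. G.711 runs at 64 kbit/s, and G.722 is shown in its 64 kbit/s mode; the 20 ms frame duration and the uncompressed 16 kHz, 16-bit PCM baseline are assumptions made for comparison, not values stated for this kit.

```python
def payload_bytes(bitrate_bps: int, frame_ms: int) -> int:
    """Bytes of encoded audio in one frame at a constant bitrate."""
    return bitrate_bps * frame_ms // (8 * 1000)


G711_BPS = 64_000                 # 8 kHz sampling, 8 bits per sample
G722_BPS = 64_000                 # 16 kHz sampling, 64 kbit/s mode
PCM16_WIDEBAND_BPS = 16_000 * 16  # uncompressed 16 kHz, 16-bit mono (baseline)

FRAME_MS = 20  # assumed frame duration, not taken from the source

for name, bps in [
    ("G.711", G711_BPS),
    ("G.722", G722_BPS),
    ("PCM 16 kHz/16-bit", PCM16_WIDEBAND_BPS),
]:
    print(f"{name}: {payload_bytes(bps, FRAME_MS)} bytes per {FRAME_MS} ms frame")
```

Under these assumptions both codecs produce 160 bytes per frame, a quarter of the 640 bytes needed for the uncompressed wideband baseline.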

Hardware architecture

The R1 Kit integrates multiple hardware components to enable multimodal AI interactions:

[Figure: R1 Kit hardware]

Key components:

  • Dual-screen LCD or single touch LCD (optional)
  • Digital video port
  • Bi-color LED indicator
  • Dual-microphone array with solder pads and connectors
  • Speaker with solder pads and connector
  • Gyroscope for motion sensing
  • SD NAND storage
  • Battery connector for portable power
  • NFC support
  • USB-to-UART interface
  • Reset button and side buttons
  • Vibration motor (expandable)

Software architecture

The software package provides a complete development framework for building conversational AI applications:

[Figure: R1 Kit software]

Performance specifications

The ConvoAI Device Kit delivers ultra-low-latency performance and robust audio processing across a global network.

  • Latency

    • Conversation latency: As low as 650 ms
    • Interruption response: As low as 340 ms
    • Global network end-to-end latency: Median as low as 76 ms
  • Audio processing

    • Noise suppression: Filters 95% of environmental noise
    • Packet loss resistance: Up to 80% packet loss tolerance
  • Coverage

    • Network coverage: 200+ countries and regions
    • Language support: 35+ languages

Hardware specifications

The R1 Kit is based on the Beken BK7258 chipset and includes open-source hardware and software resources.

  • Audio capabilities

    • Dual-microphone array with local AEC (Acoustic Echo Cancellation) algorithm
    • Precise audio capture with echo interference elimination
  • Visual and sensor capabilities

    • Integrated camera for visual recognition
    • Gyroscope for motion sensing and gesture control
  • Power management

    • Battery power support
    • Deep sleep mode for mobile scenarios
  • Connectivity

    • Bluetooth provisioning
    • Wi-Fi 6 support for one-click cloud service connection
  • Display

    • Dual-screen collaborative display
  • Interaction modes

    • Multi-channel input: voice, touchscreen, and gyroscope (gesture/tilt control)
    • Custom wake word support
    • Real-time continuous conversation
    • LLM real-time visual reasoning

Platform compatibility

The ConvoAI Device Kit supports various mainstream communication standards and chipsets.

  • Supported chip manufacturers

    • Beken
    • Espressif
    • Unisoc
    • Ingenic
    • Rockchip (RK)
    • Sigmastar
  • Communication standards

    • Wi-Fi
    • LTE Category 1
  • Additional support

    • Image Signal Processor (ISP) chips

For specific supported chip models and compatibility details, contact technical support.

Advanced audio algorithms

The ConvoAI Device Kit employs specialized algorithms to ensure accurate voice recognition and natural conversation flow in challenging environments.

AI noise reduction

Filters 95% of environmental noise, enabling accurate recognition in noisy environments such as coffee shops and train stations and reducing interaction errors.
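The source does not say whether the 95% figure refers to noise amplitude or noise power. As a rough aid for comparing it with specifications quoted in decibels, the sketch below converts the remaining 5% into an attenuation under each interpretation. The helper name `attenuation_db` is hypothetical, and both readings are assumptions.

```python
import math


def attenuation_db(remaining_fraction: float, quantity: str) -> float:
    """Convert the fraction of noise that remains into decibels.

    quantity is "amplitude" (20 * log10) or "power" (10 * log10).
    """
    if not 0.0 < remaining_fraction <= 1.0:
        raise ValueError("remaining_fraction must be in (0, 1]")
    if quantity == "amplitude":
        factor = 20.0
    elif quantity == "power":
        factor = 10.0
    else:
        raise ValueError("quantity must be 'amplitude' or 'power'")
    return factor * math.log10(remaining_fraction)


remaining = 0.05  # 95% filtered leaves 5% of the noise
print(f"amplitude reading: {attenuation_db(remaining, 'amplitude'):.1f} dB")
print(f"power reading:     {attenuation_db(remaining, 'power'):.1f} dB")
```

The two readings differ by a factor of two in decibels (about -26 dB versus about -13 dB), so it is worth confirming the intended meaning before relying on the figure.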

BHVS and voiceprint algorithms

  • Background Human Voice Separation (BHVS): Filters background voices in multi-person conversation scenarios
  • Voiceprint recognition: Locks onto the primary speaker in multi-person conversations

Graceful interruption algorithm

Enables the AI to detect user interruption intent in real time and determine precisely when to speak and when to stop, restoring a natural conversation rhythm.
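The source does not describe how interruption intent is detected. For intuition only, the sketch below shows a much simpler threshold-based barge-in rule: stop playback once voice activity has persisted for a minimum number of consecutive frames. The function name, the per-frame boolean VAD input, and the threshold are all hypothetical, and this is not the kit's algorithm.

```python
from typing import Iterable, Optional


def first_interrupt_frame(
    vad_flags: Iterable[bool], min_speech_frames: int
) -> Optional[int]:
    """Return the index of the frame at which playback should stop, or None.

    vad_flags holds one voice-activity decision per audio frame, captured
    while the agent is speaking. Short bursts below the threshold are ignored.
    """
    run = 0  # length of the current run of consecutive speech frames
    for index, is_speech in enumerate(vad_flags):
        run = run + 1 if is_speech else 0
        if run >= min_speech_frames:
            return index
    return None


# One isolated speech frame (ignored), then sustained speech.
flags = [False, True, False, True, True, True, True]
print(first_interrupt_frame(flags, min_speech_frames=3))
```

A fixed threshold like this cannot tell a genuine interruption from a cough or a brief acknowledgment, which is the gap an intent-aware approach is meant to close.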

Weak network resistance

Maintains uninterrupted voice interaction in challenging network conditions such as subways and basements, tolerating up to 80% packet loss.
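To put an 80% loss rate in perspective, the sketch below estimates how many copies of a frame a naive repetition scheme would need to send. It assumes independent packet losses and plain retransmission of identical copies; the source does not state which loss-recovery techniques the kit uses, so this is a back-of-the-envelope model rather than a description of the product. The function names are hypothetical.

```python
def delivery_probability(loss_rate: float, copies: int) -> float:
    """Probability that at least one of `copies` independent sends arrives."""
    return 1.0 - loss_rate ** copies


def copies_needed(loss_rate: float, target: float) -> int:
    """Smallest number of copies whose delivery probability reaches `target`."""
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    copies = 1
    while delivery_probability(loss_rate, copies) < target:
        copies += 1
    return copies


LOSS_RATE = 0.80  # packet loss figure quoted in this document
print(f"single send: {delivery_probability(LOSS_RATE, 1):.2f}")
print(f"copies for 99% delivery: {copies_needed(LOSS_RATE, 0.99)}")
```

Under this model a single send arrives only 20% of the time, and reaching 99% delivery takes 21 copies of every frame, which shows why simple repetition alone is a costly way to ride out losses this severe.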