
Cartesia (Beta)

Cartesia provides ultra-fast, low-latency text-to-speech with real-time streaming capabilities, optimized for interactive conversational AI applications.

Sample configuration

The following example shows a starting tts parameter configuration you can use when you Start a conversational AI agent.


"tts": {
  "vendor": "cartesia",
  "params": {
    "api_key": "<your_cartesia_key>",
    "model_id": "sonic-2",
    "base_url": "wss://api.cartesia.ai",
    "voice": {
      "mode": "id",
      "id": "<voice_id>"
    },
    "output_format": {
      "container": "raw",
      "sample_rate": 16000
    },
    "language": "en"
  }
}
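If you assemble the request body in code, the same block can be built programmatically. The following is a minimal Python sketch, not an official client: the helper name build_cartesia_tts and the CARTESIA_API_KEY environment variable are assumptions made for illustration, and the field values simply mirror the sample configuration.

```python
import json
import os


def build_cartesia_tts(api_key: str, voice_id: str, sample_rate: int = 16000) -> dict:
    """Return a tts block shaped like the sample configuration (hypothetical helper)."""
    return {
        "vendor": "cartesia",
        "params": {
            "api_key": api_key,
            "model_id": "sonic-2",
            "base_url": "wss://api.cartesia.ai",
            "voice": {"mode": "id", "id": voice_id},
            "output_format": {"container": "raw", "sample_rate": sample_rate},
            "language": "en",
        },
    }


# CARTESIA_API_KEY is an assumed variable name; the placeholder is used when it is unset.
api_key = os.environ.get("CARTESIA_API_KEY", "<your_cartesia_key>")
tts = build_cartesia_tts(api_key, "<voice_id>")

# Serialize the block the way it would appear inside a larger request body.
print(json.dumps({"tts": tts}, indent=2))
```

Reading the key from the environment keeps it out of source control. How the tts block is embedded in the full request depends on the rest of your agent configuration.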

Caution

The parameters listed on this page are validated for use with Conversational AI Engine. Required parameters must be provided as documented. Any additional parameters are passed through directly to the underlying vendor without validation. For a full list of supported options, refer to the Cartesia TTS documentation.
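Because additional parameters are forwarded without validation, it can help to see which keys in a params object fall outside the documented set before sending a request. The following Python sketch is one possible way to do that. The function split_params and the key example_vendor_option are hypothetical, and the documented set is taken from the parameter names on this page.

```python
from typing import Dict, Tuple

# Parameter names documented on this page; anything else is passed through unvalidated.
DOCUMENTED_PARAMS = {"api_key", "model_id", "base_url", "voice", "output_format", "language"}


def split_params(params: Dict[str, object]) -> Tuple[Dict[str, object], Dict[str, object]]:
    """Separate documented keys from pass-through keys (hypothetical helper)."""
    documented = {k: v for k, v in params.items() if k in DOCUMENTED_PARAMS}
    passthrough = {k: v for k, v in params.items() if k not in DOCUMENTED_PARAMS}
    return documented, passthrough


params = {
    "api_key": "<your_cartesia_key>",
    "model_id": "sonic-2",
    "voice": {"mode": "id", "id": "<voice_id>"},
    "example_vendor_option": True,  # hypothetical key, used only to show the split
}

documented, passthrough = split_params(params)
print("pass-through keys:", sorted(passthrough))
```

A non-empty pass-through set is not an error. It only marks the keys whose behavior is defined by the Cartesia TTS documentation rather than by this page.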

Key parameters

params required
  • api_key string required

    The Cartesia API key used to authenticate requests. Get your API key from the Cartesia Console.

  • model_id string required

    The identifier of the TTS model to use. For example, sonic-2.

  • base_url string nullable

    The WebSocket URL for the Cartesia streaming API. For example, wss://api.cartesia.ai.

  • voice object required

    Voice configuration object.

    • mode string required

      Voice selection mode. Use id to select by voice identifier.

    • id string required

      The identifier of the selected voice for speech synthesis.

  • output_format object nullable

    Audio output format configuration.

    • container string nullable

      Audio container format for the output stream. For example, raw.

    • sample_rate number nullable

      Audio sampling rate in Hz.

      Default: 16000

      Possible values: 8000, 16000, 22050, 24000, 44100, 48000

  • language string nullable

    Target language for speech synthesis. For example, en.
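The required fields and the allowed sample_rate values listed above can be checked locally before a request is sent. The following Python sketch is an illustration only and does not reproduce the engine's actual validation. The function validate_cartesia_params is hypothetical, and it treats a missing sample_rate as the documented default of 16000.

```python
from typing import List

# Sample rates listed under output_format.sample_rate on this page.
ALLOWED_SAMPLE_RATES = (8000, 16000, 22050, 24000, 44100, 48000)


def validate_cartesia_params(params: dict) -> List[str]:
    """Return a list of problems found in a params object (hypothetical helper)."""
    errors = []

    # api_key and model_id are documented as required strings.
    for key in ("api_key", "model_id"):
        if not isinstance(params.get(key), str) or not params[key]:
            errors.append(f"{key} is required and must be a non-empty string")

    # voice is a required object with required mode and id strings.
    voice = params.get("voice")
    if not isinstance(voice, dict):
        errors.append("voice is required and must be an object")
    else:
        for key in ("mode", "id"):
            if not isinstance(voice.get(key), str) or not voice[key]:
                errors.append(f"voice.{key} is required and must be a non-empty string")

    # output_format is nullable; only check sample_rate when an object is given.
    output_format = params.get("output_format")
    if output_format is not None:
        if not isinstance(output_format, dict):
            errors.append("output_format must be an object or null")
        else:
            rate = output_format.get("sample_rate", 16000)
            if rate not in ALLOWED_SAMPLE_RATES:
                errors.append(f"output_format.sample_rate {rate} is not a documented value")

    return errors


good = {
    "api_key": "<your_cartesia_key>",
    "model_id": "sonic-2",
    "voice": {"mode": "id", "id": "<voice_id>"},
    "output_format": {"container": "raw", "sample_rate": 16000},
}
bad = {"model_id": "sonic-2", "voice": {"mode": "id"}, "output_format": {"sample_rate": 11025}}

print(validate_cartesia_params(good))
for problem in validate_cartesia_params(bad):
    print(problem)
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue at once.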