Streaming-based speech synthesis
In this tutorial, we'll walk through how to perform text-to-speech using the streaming API. You'll learn how to:
- Establish a WebSocket connection with the TTS service
- Stream text data through the WebSocket.
- Receive and process synthesized audio data in real time.
All examples are written in JavaScript, with Axios handling HTTP communication. A dedicated client library is also available to simplify integration and can serve as a starting point for your implementation. For details, see Truebar Client Libraries.
Authenticating
Before using the API, you need to authenticate with the Truebar service to obtain an access token. This token must be included in all subsequent API requests.
1. Acquiring an Access Token Using Username and Password
The following example demonstrates how to obtain an access token using axios in JavaScript:
const axios = require('axios');

async function getAccessToken(username, password) {
  const response = await axios.post(
    'https://auth.true-bar.si/realms/truebar/protocol/openid-connect/token',
    new URLSearchParams({
      grant_type: 'password',
      username,
      password,
      client_id: 'truebar-client',
    }),
    { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
  );

  return response.data.access_token;
}
2. Refreshing Access Tokens
Tokens are only valid for a limited time. To keep your session alive, you can refresh the token using the refresh_token from the initial response to obtain a new access token:
async function refreshAccessToken(refreshToken) {
  const response = await axios.post(
    'https://auth.true-bar.si/realms/truebar/protocol/openid-connect/token',
    new URLSearchParams({
      grant_type: 'refresh_token',
      refresh_token: refreshToken,
      client_id: 'truebar-client',
    }),
    { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
  );

  return response.data.access_token;
}
You can schedule this refresh operation to occur shortly before the access_token expires, using the expires_in value from the original authentication response.
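For example, a simple timer-based approach might look like the sketch below. It builds on the two helper functions above; the module-level token variables and the 30-second safety margin are illustrative assumptions, and a long-running session would also need to reschedule the refresh and update the stored refresh token:

// Sketch: authenticate once, then refresh shortly before the token expires.
let accessToken;
let refreshToken;

async function authenticate(username, password) {
  const response = await axios.post(
    'https://auth.true-bar.si/realms/truebar/protocol/openid-connect/token',
    new URLSearchParams({
      grant_type: 'password',
      username,
      password,
      client_id: 'truebar-client',
    }),
    { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
  );

  accessToken = response.data.access_token;
  refreshToken = response.data.refresh_token;

  // Refresh roughly 30 seconds before the access token expires (illustrative margin).
  setTimeout(async () => {
    accessToken = await refreshAccessToken(refreshToken);
  }, (response.data.expires_in - 30) * 1000);
}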
Establishing and managing a WebSocket session
To achieve the lowest possible latency for speech synthesis, we use the streaming API provided by the service.
The process begins by connecting to a WebSocket endpoint. Since WebSocket clients typically do not support adding custom headers, we pass the access_token
as a URL parameter:
let ws = new WebSocket("wss://true-bar.si/api/pipelines/stream?access_token=<...>");
ws.addEventListener("message", onMessageListener);
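Putting the pieces together, a self-contained variant of the snippet above could obtain the token first and register a few more listeners. The error and close handlers are illustrative; in Node.js the WebSocket class can come from the ws package:

// const WebSocket = require('ws'); // in Node.js, the ws package provides the WebSocket class

let ws;

async function openSession(username, password) {
  const accessToken = await getAccessToken(username, password);

  ws = new WebSocket(`wss://true-bar.si/api/pipelines/stream?access_token=${accessToken}`);

  ws.addEventListener("message", onMessageListener);
  ws.addEventListener("error", (err) => console.error("WebSocket error:", err));
  ws.addEventListener("close", (event) => console.log("Session closed, code:", event.code));
}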
The message listener contains the core logic for handling the communication protocol. As described in the API specification, the server may emit various message types, so the callback must be able to handle all of them.
The whole procedure is as follows:
1. Open a WebSocket connection.
2. Wait for an INITIALIZED status message before sending anything.
3. Once the INITIALIZED message is received, send a CONFIG message containing the pipeline definition to configure the session.
4. Wait for a CONFIGURED status message before sending any additional messages.
5. Once the session is CONFIGURED, begin streaming text segments.
6. While streaming text segments, handle incoming messages from the server. These may include:
   - Binary audio chunk: A new chunk of synthesized audio.
   - WARNING: An API warning.
   - ERROR: An API error.
7. When you want to end the session, send an EOS (End Of Stream) message.
8. Wait for the FINISHED status message.
9. Close the WebSocket connection.
Here's a skeleton implementation for handling received WebSocket messages:
const onMessageListener = (wsMsg) => {
  // Binary frames carry synthesized audio chunks; all other frames are JSON control messages.
  if (typeof wsMsg.data !== "string") {
    // TODO: Handle a binary audio chunk (e.g. buffer it for playback or write it to a file).
    return;
  }

  const msg = JSON.parse(wsMsg.data);
  switch (msg.type) {
    case "STATUS":
      switch (msg.status) {
        case "INITIALIZED":
          ws.send(JSON.stringify({
            type: "CONFIG",
            pipeline: [
              {
                "task": "TTS",
                "exceptionHandlingPolicy": "THROW",
                "config": {
                  "tag": "<insert one of the available TTS tags here>",
                  "parameters": {}
                }
              }
            ]
          }));
          break;
        case "CONFIGURED":
          // TODO: Start streaming text data. E.g.:
          ws.send(JSON.stringify({
            "type": "TEXT_SEGMENT",
            "textSegment": {
              "tokens": [
                { "text": "Testing" },
                { "text": "speech" },
                { "text": "synthesis" },
                { "text": "." }
              ]
            }
          }));
          break;
        case "FINISHED":
          // TODO: Clean up and close WebSocket connection.
          break;
        default:
          console.warn("Unexpected session status:", msg.status);
      }
      break;
    case "WARNING":
      // TODO: Handle warning message.
      break;
    case "ERROR":
      // TODO: Handle error and close WebSocket connection if needed.
      break;
    default:
      console.warn("Unexpected message type:", msg.type);
  }
};
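To complete the flow, the binary audio chunks have to be collected and the session has to be ended once all text segments have been sent. The following Node.js sketch shows one possible arrangement; the EOS payload shape mirrors the other messages in this tutorial but should be checked against the API specification, and the file name and chunk handling are illustrative assumptions:

const fs = require('fs');

const audioChunks = [];

// Called from the message listener for binary frames (raw PCM16LE, 16 kHz, mono).
function handleAudioChunk(chunk) {
  audioChunks.push(chunk);
}

// Called after the last TEXT_SEGMENT has been sent, asking the server to finish the session.
function endSession() {
  ws.send(JSON.stringify({ type: "EOS" }));
}

// Called from the FINISHED branch of the message listener: persist the audio and close.
function onFinished() {
  fs.writeFileSync('synthesis.raw', Buffer.concat(audioChunks));
  ws.close();
}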
Building a TTS pipeline
The example above demonstrates a simplified text-to-speech (TTS) pipeline. In practice, a complete pipeline must include all necessary preprocessing steps required for the specific synthesis model in use. The exact preprocessing requirements depend on the model's capabilities and the format of the input text.
Different TTS models accept different types of input. For example, some models expect normalized and phonemized text, while others perform those operations internally.
It is the caller's responsibility to determine the appropriate preprocessing steps based on:
- The model's requirements,
- The format of the input text provided by the caller.
General Guidelines
Use the following guidelines to determine which preprocessing stages to include:
- All models hosted on the RIVA or RIVA-STREAM frameworks (tag format: RIVA:*:*:* or RIVA-STREAM:*:*:*:*) require normalized and phonemized input that is properly segmented into sentences. This means the input must either be provided in that form or be passed through the NLP_tn, NLP_g2a and NLP_st stages before it is fed to the TTS stage.
- All other frameworks we support perform those steps internally, so the TTS stage can be used on its own without any preprocessing.
Although RIVA-based models require more setup, they offer greater flexibility and allow fine-tuned control over the input processing pipeline.
Audio sample rates
The API currently doesn't support setting a custom sample rate for the synthesized audio. All models (tags) produce 16 kHz, single-channel audio with 16-bit little-endian samples, otherwise known as PCM16LE.
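Because the stream is raw PCM without a container, many players will not open it directly. One common workaround is to prepend a standard 44-byte WAV header before saving or playback; the Node.js sketch below does this for 16 kHz mono PCM16LE data (the helper name is illustrative):

// Wrap raw PCM16LE data (16 kHz, mono) in a minimal 44-byte WAV header.
function pcmToWav(pcmBuffer, sampleRate = 16000, channels = 1, bitsPerSample = 16) {
  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const header = Buffer.alloc(44);

  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcmBuffer.length, 4);
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);              // fmt chunk size
  header.writeUInt16LE(1, 20);               // audio format: PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36);
  header.writeUInt32LE(pcmBuffer.length, 40);

  return Buffer.concat([header, pcmBuffer]);
}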
Examples
Assume we want to synthesize the following text:
Hello, and welcome to our speech synthesis API. This service converts text into natural-sounding speech using state-of-the-art machine learning models. You can use it to create voice assistants, automate announcements, or make your applications more accessible. Simply send your text, choose a voice, and get high-quality audio in return.
To ensure high-quality synthesis, the text should be segmented into sentences. Each sentence should be sent as a separate TextSegment over the WebSocket connection. The above example should be segmented into:
- Segment 1: Hello, and welcome to our speech synthesis API.
- Segment 2: This service converts text into natural-sounding speech using state-of-the-art machine learning models.
- Segment 3: You can use it to create voice assistants, automate announcements, or make your applications more accessible.
- Segment 4: Simply send your text, choose a voice, and get high-quality audio in return.
Proper punctuation is essential for achieving natural-sounding speech. If your input text lacks punctuation, include an additional preprocessing stage (NLP_tc) to insert it automatically before synthesis.
Although the text is sent in separate segments over the WebSocket connection, the synthesized audio is always returned as a single continuous stream. The audio is encoded in PCM16LE format, with one channel (mono), sampled at 16 kHz.
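As a rough sketch of the segmentation described above, the text can be split into sentences and each sentence sent as its own TEXT_SEGMENT. The regular expressions and whitespace tokenization below are naive illustrative assumptions; for production use, rely on a proper tokenizer or the NLP_st stage:

// Naive sentence segmentation and streaming sketch (illustrative only).
function sendText(ws, text) {
  // Split on sentence-ending punctuation; fall back to the whole text if nothing matches.
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];

  for (const sentence of sentences) {
    // Split on whitespace and separate trailing punctuation into its own token.
    const tokens = sentence
      .trim()
      .split(/\s+/)
      .flatMap((word) => {
        const match = word.match(/^(.*?)([.!?,;:]+)$/);
        return match ? [{ text: match[1] }, { text: match[2] }] : [{ text: word }];
      });

    ws.send(JSON.stringify({ type: "TEXT_SEGMENT", textSegment: { tokens } }));
  }
}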
When using a framework that handles preprocessing internally, the only required step is synthesis. In this case, the pipeline can be as simple as:
{ "task": "TTS", "exceptionHandlingPolicy": "THROW", "config": { "tag": "...", "parameters": { } } }
In contrast, when using RIVA-based models, all required preprocessing steps must be explicitly included in the pipeline. This typically includes text normalization, grapheme-to-phoneme conversion, and sentence tokenization:
[ { "task": "NLP_st", "exceptionHandlingPolicy": "THROW", "config": { "tag": "...", "parameters": {} } }, { "task": "NLP_tn", "exceptionHandlingPolicy": "THROW", "config": { "tag": "...", "parameters": {} } }, { "task": "NLP_g2a", "exceptionHandlingPolicy": "THROW", "config": { "tag": "...", "parameters": {} } }, { "task": "TTS", "exceptionHandlingPolicy": "THROW", "config": { "tag": "...", "parameters": { } } }]