Streaming STT Guide

In this tutorial, we'll walk through how to perform real-time speech-to-text (STT) using the streaming API. You'll learn how to:

  • Establish a WebSocket connection with the STT service
  • Stream live audio data to the server
  • Receive and display transcribed text in real time

Examples are provided for Node.js, Python, and Java using familiar HTTP/WebSocket clients (axios, aiohttp, HttpClient). A dedicated client library is also available to simplify integration and can serve as a starting point for your implementation. For details, see Truebar Client Libraries.

By the end of this guide, you'll have a working STT integration capable of transcribing live microphone input directly in your browser.

Stage tag

Before you start, set TRUEBAR_ASR_TAG to one of the online ASR stages returned by /api/pipelines/stages (or copy the tag from your project’s .env.truebar). The snippets below fall back to KALDI:en-US:*:*, which only works if that tag is available in your account.
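If you want to check which tags your account can use, list the stages first. A minimal sketch, assuming the REST endpoint is served from the same host as the streaming API:

const axios = require('axios');

// List the available pipeline stages; pick an online ASR stage and use its tag.
// The host below is an assumption; substitute your deployment's API base URL.
async function listStages(accessToken) {
  const response = await axios.get('https://true-bar.si/api/pipelines/stages', {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  return response.data;
}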

Authenticating

Before using the API, you need to authenticate with the Truebar service to obtain an access token. This token must be included in all subsequent API requests.

1. Acquiring an Access Token Using Username and Password

The following example demonstrates how to obtain an access token using axios in JavaScript:

const axios = require('axios');

async function getAccessToken(username, password) {
  const response = await axios.post(
    'https://auth.true-bar.si/realms/truebar/protocol/openid-connect/token',
    new URLSearchParams({
      grant_type: 'password',
      username,
      password,
      client_id: 'truebar-client',
    }),
    { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
  );

  return response.data.access_token;
}

2. Refreshing Access Tokens

Tokens are only valid for a limited time. To keep your session alive, you can refresh the token using the refresh_token from the initial response to obtain a new access token:

async function refreshAccessToken(refreshToken) {
  const response = await axios.post(
    'https://auth.true-bar.si/realms/truebar/protocol/openid-connect/token',
    new URLSearchParams({
      grant_type: 'refresh_token',
      refresh_token: refreshToken,
      client_id: 'truebar-client',
    }),
    { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
  );

  return response.data.access_token;
}

You can schedule this refresh operation to occur shortly before the access_token expires, using the expires_in value from the original authentication response.
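For example, a minimal scheduling sketch, assuming tokenResponse is the full JSON body returned by the token endpoint:

// Refresh 30 seconds before expiry; onNewToken receives the fresh access token.
function scheduleRefresh(tokenResponse, onNewToken) {
  const delayMs = Math.max((tokenResponse.expires_in - 30) * 1000, 0);
  setTimeout(async () => {
    const accessToken = await refreshAccessToken(tokenResponse.refresh_token);
    onNewToken(accessToken);
  }, delayMs);
}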

Server-side quickstart (Node.js, Python, Java)

If you are building a backend service, you can reuse the same message flow without the browser-specific pieces. The snippets below match the production-safe pattern from the tutorials: bearer tokens in headers, a CONFIG handshake, binary PCM frames, and an EOS message with lockSession: false.

stt-stream.mjs
import axios from "axios";
import WebSocket from "ws";
import { readFileSync } from "node:fs";

const tokensToText = (tokens = []) => {
  let output = "";
  let prevRight = false;
  tokens.forEach((token, index) => {
    const text = token?.text ?? "";
    if (!text) return;
    const left = Boolean(token?.isLeftHanded);
    if (index > 0 && !prevRight && !left) {
      output += " ";
    }
    output += text;
    prevRight = Boolean(token?.isRightHanded);
  });
  return output;
};

async function fetchToken() {
  const form = new URLSearchParams({
    grant_type: "password",
    username: process.env.TRUEBAR_USERNAME,
    password: process.env.TRUEBAR_PASSWORD,
    client_id: process.env.TRUEBAR_CLIENT_ID ?? "truebar-client",
  });

  const { data } = await axios.post(process.env.TRUEBAR_AUTH_URL, form, {
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
  });

  if (!data?.access_token) throw new Error("Missing access_token");
  return data.access_token;
}

const token = await fetchToken();
const ws = new WebSocket(process.env.TRUEBAR_STT_WS_URL, {
  headers: { Authorization: `Bearer ${token}` },
});

const pcm = readFileSync("sample.pcm");
const chunkSize = 3200 * 2; // 200 ms @ 16 kHz: 3200 samples of 2 bytes each
let streamed = false;

ws.on("message", (payload, isBinary) => {
  if (isBinary) return;
  const msg = JSON.parse(payload.toString());

  if (msg.type === "STATUS") {
    console.log("STATUS:", msg.status);
    if (msg.status === "CONFIGURED" && !streamed) {
      streamed = true;
      for (let offset = 0; offset < pcm.length; offset += chunkSize) {
        ws.send(pcm.subarray(offset, offset + chunkSize));
      }
      ws.send(JSON.stringify({ type: "EOS", lockSession: false }));
    }
    if (msg.status === "FINISHED") ws.close();
    return;
  }

  if (msg.type === "TEXT_SEGMENT") {
    const text = tokensToText(msg.textSegment.tokens);
    const label = msg.textSegment.isFinal ? "FINAL" : "INTERIM";
    console.log(`${label} - ${text}`);
  }

  if (msg.type === "ERROR") {
    console.error("Streaming error", msg);
    ws.close();
  }
});

ws.once("open", () => {
  ws.send(
    JSON.stringify({
      type: "CONFIG",
      pipeline: [
        {
          task: "ASR",
          exceptionHandlingPolicy: "THROW",
          config: {
            tag: process.env.TRUEBAR_ASR_TAG ?? "KALDI:en-US:*:*",
            parameters: { enableInterims: true },
          },
        },
      ],
    }),
  );
});

The snippet waits for STATUS: CONFIGURED, streams PCM frames, and finishes with EOS + lockSession: false. Wrap the send/receive loops in retry logic if you expect intermittent network errors.

Capturing audio

In this chapter, we'll walk through capturing microphone audio in a modern web browser using the Web Audio API. We'll show how to request microphone access, capture raw audio, and prepare it for transmission.

1. Check for Web Audio API Support

Before requesting microphone access, ensure the browser supports the necessary APIs:

if (!navigator || !navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error("Your browser does not support WebAudio!");
}

2. Initialize AudioContext with a Target Sample Rate

The Truebar service expects audio encoded at 16 kHz. While it’s best to capture audio directly at this rate, not all browsers support setting the sample rate via the AudioContext constructor. We recommend checking the browser and only specifying the sample rate if it's known to be supported:

let audioContext;
const targetSampleRate = 16000;

// Chromium-based browsers (Chrome, Edge) include "Chrome" in their user agent
// and support specifying a sample rate in the AudioContext constructor.
if (navigator.userAgent.includes("Chrome")) {
    audioContext = new AudioContext({ sampleRate: targetSampleRate });
} else {
    audioContext = new AudioContext(); // Will default to the device rate (often 44.1 kHz or 48 kHz)
}

⚠️ If the AudioContext sample rate does not match the target, we must resample the audio before sending it to the Truebar service.

3. Web Worker for Resampling and Format Conversion

To offload audio processing (resampling and encoding to S16LE PCM), we use a Web Worker. This keeps the main thread responsive.

// AudioChunkProcessor.js
// Assumes the wave-resampler library is available in the worker scope,
// e.g. loaded via importScripts.
addEventListener('message', event => {
    let sourceSampleRate = event.data.sourceSampleRate;
    let targetSampleRate = event.data.targetSampleRate;
    let audioChunk = event.data.audioChunk;

    // Resample if needed
    if (sourceSampleRate !== targetSampleRate) {
        audioChunk = waveResampler.resample(audioChunk, sourceSampleRate, targetSampleRate);
    }

    // Convert to 16-bit little-endian encoding
    let audioDataArray16b = floatTo16BitPCM(audioChunk);

    // Send back the processed audio buffer
    postMessage(audioDataArray16b);
});

function floatTo16BitPCM(input) {
    // Each 32-bit (4-byte) float from input is converted to one 16-bit (2-byte) integer,
    // so each output element needs 2 bytes.
    let buffer = new ArrayBuffer(input.length * 2);

    // Define a view over the raw buffer so we can set values as int16.
    let view = new DataView(buffer);

    for (let i = 0; i < input.length; i++) {
        // Clamp input to [-1, 1]
        const s = Math.max(-1, Math.min(1, input[i]));

        // Convert float32 to int16 and force little endian
        view.setInt16(2 * i, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }

    return buffer;
}

🔧 This example uses the wave-resampler library for resampling. Any other library or custom implementation will also work.
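If you would rather avoid the dependency, a naive linear-interpolation resampler is often good enough for speech. A sketch (not production-grade; proper resamplers also apply low-pass filtering to avoid aliasing):

// Map each output sample position back onto the source signal and
// interpolate between the two nearest input samples.
function resampleLinear(input, sourceRate, targetRate) {
    if (sourceRate === targetRate) return input;
    const ratio = sourceRate / targetRate;
    const outputLength = Math.floor(input.length / ratio);
    const output = new Float32Array(outputLength);
    for (let i = 0; i < outputLength; i++) {
        const pos = i * ratio;
        const left = Math.floor(pos);
        const right = Math.min(left + 1, input.length - 1);
        const frac = pos - left;
        output[i] = input[left] * (1 - frac) + input[right] * frac;
    }
    return output;
}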

4. Using the Worker from the Main Thread

Create and use the worker in your main script:

let worker = new Worker('AudioChunkProcessor.js');
worker.addEventListener("message", (e) => {
    // TODO: Send the audio chunk (e.data) to the service.
});

5. Capturing Audio Chunks with ScriptProcessorNode

Use the deprecated (but still functional) ScriptProcessorNode to capture chunks of audio from the microphone input:

const chunkSize = 4096; // Power of two; choose based on latency/throughput needs (recommended 4096, max 16384)

let scriptNode = audioContext.createScriptProcessor(chunkSize, 1, 1);
scriptNode.onaudioprocess = (event) => {
    worker.postMessage({
        "sourceSampleRate": audioContext.sampleRate,
        "targetSampleRate": targetSampleRate,
        "audioChunk": event.inputBuffer.getChannelData(0)
    });
};

⚠️ ScriptProcessorNode is deprecated. For production, consider using AudioWorkletNode for better performance and future compatibility.

In the example above, we create a ScriptProcessorNode and implement a callback that is invoked whenever an audio chunk of chunkSize samples becomes available. Within the callback, we pass the chunk to our Web Worker.
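If you want to move off the deprecated API today, the same capture step can be built on AudioWorkletNode. A minimal sketch (the file and processor names are illustrative; worker, audioContext, and targetSampleRate are as defined earlier):

// capture-processor.js (runs on the audio rendering thread)
class CaptureProcessor extends AudioWorkletProcessor {
    process(inputs) {
        const channel = inputs[0][0];
        if (channel) {
            // Copy the 128-sample frame; the underlying buffer is reused by the engine.
            this.port.postMessage(channel.slice(0));
        }
        return true; // Keep the processor alive
    }
}
registerProcessor('capture-processor', CaptureProcessor);

// Main thread: load the module and forward captured frames to the worker.
await audioContext.audioWorklet.addModule('capture-processor.js');
const workletNode = new AudioWorkletNode(audioContext, 'capture-processor');
workletNode.port.onmessage = (e) => {
    worker.postMessage({
        "sourceSampleRate": audioContext.sampleRate,
        "targetSampleRate": targetSampleRate,
        "audioChunk": e.data
    });
};

Note that AudioWorklet delivers fixed 128-sample frames, so you may want to accumulate several frames into a larger chunk before posting them to the worker.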

6. Connect the Audio Processing Chain

Finally, we have to connect the audio processing chain so that it includes the ScriptProcessorNode:

const sourceNode = audioContext.createMediaStreamSource(stream);
sourceNode.connect(scriptNode);
scriptNode.connect(audioContext.destination); // Required in some browsers
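Here, stream is the MediaStream obtained from the microphone, for example:

// Request microphone access; the returned MediaStream feeds the source node above.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });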

Establishing and managing a transcription session

To achieve real-time transcription of audio data, we use the streaming API provided by the service. The process begins by connecting to a WebSocket endpoint. When possible, send the bearer token via the Authorization header; fall back to a query parameter only if your environment does not allow custom headers.

const wsUrl = new URL("wss://true-bar.si/api/pipelines/stream");
wsUrl.searchParams.set("access_token", accessToken); // Browsers cannot set custom WebSocket headers

const ws = new WebSocket(wsUrl.toString());
ws.addEventListener("message", onMessageListener);

On backend services you can pass the bearer token in the Authorization header instead of using the query parameter.

The message listener contains the core logic for handling the communication protocol. As described in the API specification, the server may emit various message types, so the callback must be able to handle all of them.

The whole procedure is as follows:

  1. Open a WebSocket connection.
  2. Wait for an INITIALIZED status message before sending anything.
  3. Once the INITIALIZED message is received, send a CONFIG message containing the pipeline definition to configure the session.
  4. Wait for a CONFIGURED status message before sending any additional messages.
  5. Once the session is CONFIGURED, begin streaming binary audio data as described in the previous section.
  6. While streaming audio, handle incoming messages from the server. These may include:
     • TEXT_SEGMENT: A new transcription result.
     • WARNING: An API warning.
     • ERROR: An API error.
  7. When you want to end the session, send an EOS message ({"type":"EOS","lockSession":false}) so the server can finish processing the data sent so far.
  8. Wait for the FINISHED status message. This will arrive when all data has been processed.
  9. Close the WebSocket connection.

Sending lockSession: false in the EOS payload unlocks the session, which means anyone with access to it can open it in write mode. If you plan to reconnect and resume the session, or to edit it by any other means, set lockSession: true instead; you will then receive a temporary session lock key that grants you exclusive write access to the session.

If the server returns an ERROR message, log the payload (it includes a descriptive message field) and close the socket—then reconnect with a fresh session once you have addressed the root cause. For transient network failures, a simple exponential-backoff retry loop is usually sufficient.
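A sketch of such a loop (connectAndStream is a hypothetical wrapper around the session setup described in this chapter, resolving on FINISHED and rejecting on errors):

// Retry with exponential backoff, capped at 30 seconds between attempts.
async function streamWithRetry(maxAttempts = 5) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await connectAndStream(); // hypothetical session wrapper
        } catch (err) {
            const delayMs = Math.min(1000 * 2 ** attempt, 30000);
            console.warn(`Attempt ${attempt + 1} failed, retrying in ${delayMs} ms`, err);
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
    }
    throw new Error("Streaming failed after all retry attempts");
}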

Please note that the procedure described above covers the most common usage. There are also alternative methods for terminating a session:

  • Closing the WebSocket directly: This will immediately terminate the session on the API side. Any unprocessed data at the time of closure will be discarded. The session will be marked as CANCELED and cannot be resumed.
  • Canceling the session without closing the WebSocket: This method is useful when the client does not need to wait for all results but wants to keep the WebSocket connection open. Instead of sending an EOS message, the client can send a CANCEL message. After canceling, the client receives a message with status = CANCELED, and the WebSocket connection returns to the same state as when it was first INITIALIZED. To start a new session over the same WebSocket, send a CONFIG message and begin streaming data again. With this method, the session is likewise marked as CANCELED and cannot be resumed.
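Assuming CANCEL uses the same message envelope as the other control messages, canceling a session looks like this:

// Cancel the current session; the WebSocket stays open and returns to the
// INITIALIZED state, ready for a new CONFIG message.
ws.send(JSON.stringify({ type: "CANCEL" }));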

Here's a skeleton implementation for handling received WebSocket messages:

const onMessageListener = (wsMsg) => {
    const msg = JSON.parse(wsMsg.data);
    switch (msg.type) {
        case "STATUS":
            switch (msg.status) {
                case "INITIALIZED":
                    ws.send(JSON.stringify({
                        type: "CONFIG",
                        pipeline: [
                            {
                                "task": "ASR",
                                "exceptionHandlingPolicy": "THROW",
                                "config": {
                                    "tag": "<insert one of the available ASR tags here>",
                                    "parameters": {
                                        "enableInterims": true
                                    }
                                }
                            }
                        ]
                    }));
                    break;
                case "CONFIGURED":
                    // Start streaming PCM16LE audio frames once you reach CONFIGURED.
                    break;
                case "FINISHED":
                    // Clean up and close the WebSocket connection.
                    break;
                default:
                    console.warn("Unexpected session status:", msg.status);
            }
            break;
        case "TEXT_SEGMENT":
            console.log(tokensToText(msg.textSegment.tokens));
            break;
        case "WARNING":
            // TODO: Handle warning message.
            break;
        case "ERROR":
            console.error("Streaming error", msg);
            break;
        default:
            console.warn("Unexpected message type:", msg.type);
    }
};

The above example uses a simple pipeline containing only an ASR stage. For more information on building more complex pipelines, please refer to the main API reference.

Processing transcription responses

To ensure real-time transcription, transcripts are returned asynchronously to the client as soon as they become available. Each transcript is sent in the following format:

{    "type": "TEXT_SEGMENT",    "textSegment": {        "isFinal": "<boolean>",         // Indicates whether the transcript is final or interim        "startTimeMs": "<number>",      // Segment start time in milliseconds, relative to the start of the session        "endTimeMs": "<number>",        // Segment end time in milliseconds, relative to the start of the session        "tokens": [            {                "isLeftHanded" : "<boolean>",   // Indicates that the token is left handed (implies a space before a token)                "isRightHanded" : "<boolean>",  // Indicates that the token is right handed (implies a space after a token)                "startOffsetMs": "<number>",    // Token start time relative to the segment start                "endOffsetMs": "<number>",      // Token end time relative to the segment start                "text": "<string>"              // Token content            }        ]    }}

There are two types of transcripts, distinguished by the isFinal flag:

  • Interim: Partial and temporary results
  • Final: Stable and complete results

While the service is decoding audio, it continuously updates the transcript. Each update is sent to the client as an interim result. When either a sufficient length of text is reached or a pause in speech is detected, the segment is finalized and sent as a final transcript. Final segments will no longer be updated.

After a final segment is sent, the next portion of the audio will start producing new interim results, following the same process. The following example illustrates this flow:

  • Interim : To
  • Interim : Today
  • Interim : Today is
  • Interim : Today is a beauty
  • Interim : Today is a beautiful
  • Interim : Today is a beautiful day
  • Final : Today is a beautiful day.
  • Interim : Tomorrow
  • Interim : Tomorrow will
  • Interim : Tomorrow will rain
  • Final : Tomorrow will rain.

This incremental update approach allows real-time applications to show progressively refined results with minimal latency, while also receiving final, stable segments once the system has high confidence.
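In a UI this usually means overwriting a single pending line with each interim and appending it to the stable transcript once the final arrives. A minimal sketch, using the tokensToText helper described below (render is a hypothetical display function):

// Keep finalized text separate from the volatile interim tail.
let finalText = "";
let interimText = "";

function onTextSegment(textSegment) {
    const text = tokensToText(textSegment.tokens);
    if (textSegment.isFinal) {
        finalText += (finalText ? " " : "") + text;
        interimText = "";
    } else {
        interimText = text; // Each interim replaces the previous one
    }
    render(finalText, interimText); // hypothetical UI update
}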

Each transcription segment is returned as a list of tokens, where every token contains its own word and optional metadata. To correctly reconstruct the original sentence from these tokens, you need to be aware of spacing rules based on the isLeftHanded and isRightHanded flags.

A space should be inserted between two adjacent tokens if:

  • The first token is not right-handed, and
  • The second token is not left-handed
const tokensToText = (tokens) => {
    let output = "";
    let prevRight = false;
    tokens?.forEach((token, index) => {
        const text = token?.text ?? "";
        if (!text) return;
        const left = Boolean(token?.isLeftHanded);
        if (index > 0 && !prevRight && !left) {
            output += " ";
        }
        output += text;
        prevRight = Boolean(token?.isRightHanded);
    });
    return output;
};

Example

Token sequence:

[  { "text": "Hello", "isLeftHanded": false, "isRightHanded": false},  { "text": ",", "isLeftHanded": true, "isRightHanded": false},  { "text": "world", "isLeftHanded": false, "isRightHanded": false},  { "text": "!", "isLeftHanded": true, "isRightHanded": false}]

Reconstructed text:

Hello, world!
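
Feeding this sequence to the tokensToText helper reproduces the sentence:

console.log(tokensToText([
  { "text": "Hello", "isLeftHanded": false, "isRightHanded": false },
  { "text": ",",     "isLeftHanded": true,  "isRightHanded": false },
  { "text": "world", "isLeftHanded": false, "isRightHanded": false },
  { "text": "!",     "isLeftHanded": true,  "isRightHanded": false }
]));
// -> "Hello, world!"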