
Transcribing a live audio stream

In this tutorial, we'll walk through how to perform real-time speech-to-text (STT) using the streaming API. You'll learn how to:

  • Establish a WebSocket connection with the STT service
  • Stream live audio data to the server
  • Receive and display transcribed text in real time

All examples are written in JavaScript, with Axios handling HTTP communication. A dedicated client library is also available to simplify integration and can serve as a starting point for your implementation. For details, see Truebar Client Libraries.

By the end of this guide, you'll have a working STT integration capable of transcribing live microphone input directly in your browser.

Authenticating

Before using the API, you need to authenticate with the Truebar service to obtain an access token. This token must be included in all subsequent API requests.

1. Acquiring an Access Token Using Username and Password

The following example demonstrates how to obtain an access token using axios in JavaScript:

const axios = require('axios');
async function getAccessToken(username, password) {
  const response = await axios.post(
    'https://auth.true-bar.si/realms/truebar/protocol/openid-connect/token',
    new URLSearchParams({
      grant_type: 'password',
      username,
      password,
      client_id: 'truebar-client',
    }),
    { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
  );

  return response.data.access_token;
}

2. Refreshing Access Tokens

Tokens are only valid for a limited time. To keep your session alive, you can refresh the token using the refresh_token from the initial response to obtain a new access token:

async function refreshAccessToken(refreshToken) {
  const response = await axios.post(
    'https://auth.true-bar.si/realms/truebar/protocol/openid-connect/token',
    new URLSearchParams({
      grant_type: 'refresh_token',
      refresh_token: refreshToken,
      client_id: 'truebar-client',
    }),
    { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
  );

  return response.data.access_token;
}

You can schedule this refresh operation to occur shortly before the access_token expires, using the expires_in value from the original authentication response.
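
As a minimal sketch, assuming the token endpoint response contains access_token, refresh_token and expires_in (in seconds), as in a standard OpenID Connect response, the refresh can be scheduled with a simple setTimeout; the 30-second safety margin is an arbitrary choice:

let accessToken;

function scheduleTokenRefresh(tokenResponse) {
  accessToken = tokenResponse.access_token;

  // Refresh ~30 seconds before the access token expires.
  const refreshInMs = Math.max(0, (tokenResponse.expires_in - 30) * 1000);

  setTimeout(async () => {
    const response = await axios.post(
      'https://auth.true-bar.si/realms/truebar/protocol/openid-connect/token',
      new URLSearchParams({
        grant_type: 'refresh_token',
        refresh_token: tokenResponse.refresh_token,
        client_id: 'truebar-client',
      }),
      { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
    );

    scheduleTokenRefresh(response.data); // Repeat with the fresh token data.
  }, refreshInMs);
}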

Capturing audio

In this chapter, we'll walk through capturing microphone audio in a modern web browser using the Web Audio API. We'll show how to request microphone access, capture raw audio, and prepare it for transmission.

1. Check for Web Audio API Support

Before requesting microphone access, ensure the browser supports the necessary APIs:

if (!navigator || !navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
  throw new Error("Your browser does not support WebAudio!");
}

2. Initialize AudioContext with a Target Sample Rate

The Truebar service expects audio encoded at 16 kHz. While it’s best to capture audio directly at this rate, not all browsers support setting the sample rate via the AudioContext constructor. We recommend checking the browser and only specifying the sample rate if it's known to be supported:

let audioContext;
const targetSampleRate = 16000;

if (navigator.userAgent.includes("Chrome") || navigator.userAgent.includes("Microsoft Edge")) {
  audioContext = new AudioContext({ sampleRate: targetSampleRate });
} else {
  audioContext = new AudioContext(); // Will default to device rate (often 44.1 kHz or 48 kHz)
}

⚠️ If the AudioContext sample rate does not match the target, we must resample the audio before sending it to the Truebar service.

3. Web Worker for Resampling and Format Conversion

To offload audio processing (resampling and encoding to S16LE PCM), we use a Web Worker. This keeps the main thread responsive.

// AudioChunkProcessor.js — runs inside a Web Worker.
// The wave-resampler library must be available in the worker scope (e.g. via importScripts() or a bundler).
addEventListener('message', event => {
    let sourceSampleRate = event["data"]["sourceSampleRate"];
    let targetSampleRate = event["data"]["targetSampleRate"];
    let audioChunk = event["data"]["audioChunk"];

    // Resample if needed
    if (sourceSampleRate !== targetSampleRate) {
        audioChunk = waveResampler.resample(audioChunk, sourceSampleRate, targetSampleRate);
    }

    // Convert to 16bit little endian encoding
    let audioDataArray16b = floatTo16BitPCM(audioChunk);

    // Send back the processed audio buffer
    postMessage(audioDataArray16b);
});

function floatTo16BitPCM(input) {
    // Each 32bit (4byte) float from input is converted to one 16bit (2byte) integer.
    // Each element needs 2 bytes
    let buffer = new ArrayBuffer(input.length * 2);

    // Define a view onto the raw buffer so we can set values as int16.
    let view = new DataView(buffer);

    for (let i = 0; i < input.length; i++) {
        // Clamp input to [-1, 1]
        const s = Math.max(-1, Math.min(1, input[i]));

        // Convert float32 to int16 and force little endian
        view.setInt16(2 * i, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }

    return buffer;
}

🔧 This example uses the wave-resampler library for resampling. Any other library or custom implementation will also work.

4. Using the Worker from the Main Thread

Create and use the worker in your main script:

let worker = new Worker('AudioChunkProcessor.js');

worker.addEventListener("message", (e) => {
    // TODO: Send audio chunk (e.data) to the service.
});

5. Capturing Audio Chunks with ScriptProcessorNode

Use the deprecated (but still functional) ScriptProcessorNode to capture chunks of audio from the microphone input:

const chunkSize = 4096; // Choose based on latency/throughput needs (recommended 4096, max 65536)

let scriptNode = audioContext.createScriptProcessor(chunkSize, 1, 1);

scriptNode.onaudioprocess = (event) => {
    worker.postMessage({
        "sourceSampleRate": audioContext.sampleRate,
        "targetSampleRate": targetSampleRate,
        "audioChunk": event.inputBuffer.getChannelData(0)
    });
};

⚠️ ScriptProcessorNode is deprecated. For production, consider using AudioWorkletNode for better performance and future compatibility.

In the above example, we create a ScriptProcessorNode and implement a callback that is invoked whenever an audio chunk of chunkSize samples becomes available. Within the callback, we pass the chunk to our Web Worker.
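
As a rough sketch of the AudioWorkletNode alternative mentioned in the warning above (the module file name chunk-forwarder.js and the processor name are placeholders of ours), a worklet processor can forward raw samples to the main thread over its message port, from where they are passed to the same Web Worker. Note that a worklet delivers 128-sample render quanta, so in practice you would buffer several quanta before posting them:

// chunk-forwarder.js — AudioWorklet module that forwards raw Float32 samples to the main thread.
class ChunkForwarder extends AudioWorkletProcessor {
    process(inputs) {
        if (inputs[0] && inputs[0][0]) {
            this.port.postMessage(inputs[0][0]);
        }
        return true; // Keep the processor alive.
    }
}
registerProcessor('chunk-forwarder', ChunkForwarder);

// Main thread (inside an async function):
await audioContext.audioWorklet.addModule('chunk-forwarder.js');
const workletNode = new AudioWorkletNode(audioContext, 'chunk-forwarder');
workletNode.port.onmessage = (e) => {
    worker.postMessage({
        "sourceSampleRate": audioContext.sampleRate,
        "targetSampleRate": targetSampleRate,
        "audioChunk": e.data
    });
};
// `sourceNode` (created in step 6 below) would then connect to workletNode instead of scriptNode.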

6. Connect the Audio Processing Chain

Finally, we have to connect our audio processing chain so that the ScriptProcessorNode is included in it:

const sourceNode = audioContext.createMediaStreamSource(stream);

sourceNode.connect(scriptNode);
scriptNode.connect(audioContext.destination); // Required in some browsers
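
The stream variable used above is the microphone MediaStream. The earlier snippets never created it, so as a reminder, it can be obtained with the standard getUserMedia call (inside an async function):

// Request microphone access; this is the `stream` passed to createMediaStreamSource above.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });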

Establishing and managing transcription session

To achieve real-time transcription of audio data, we use the streaming API provided by the service. The process begins by connecting to a WebSocket endpoint. Since WebSocket clients typically do not support adding custom headers, we pass the access_token as a URL parameter:

let ws = new WebSocket("wss://true-bar.si/api/pipelines/stream?access_token=<...>");

ws.addEventListener("message", onMessageListener);

The message listener contains the core logic for handling the communication protocol. As described in the API specification, the server may emit various message types, so the callback must be able to handle all of them.

The whole procedure is as follows:

  1. Open a WebSocket connection.
  2. Wait for an INITIALIZED status message before sending anything.
  3. Once the INITIALIZED message is received, send a CONFIG message containing the pipeline definition to configure the session.
  4. Wait for a CONFIGURED status message before sending any additional messages.
  5. Once the session is CONFIGURED, begin streaming binary audio data as described in the previous section (a minimal sending sketch follows this list).
  6. While streaming audio, handle incoming messages from the server. These may include:
     • TEXT_SEGMENT: A new transcription result.
     • WARNING: An API warning.
     • ERROR: An API error.
  7. When you want to end the session, send an EOS (End Of Stream) message.
  8. Wait for the FINISHED status message.
  9. Close the WebSocket connection.
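
Steps 5 and 7 involve sending data to the server. The sketch below shows one possible sending side: the binary frames are the S16LE PCM ArrayBuffers produced by the Web Worker from the previous section, and the EOS payload shown here is only an assumption, so confirm its exact shape against the API reference.

// Stream processed audio chunks as binary WebSocket frames.
// Only start sending after the CONFIGURED status has been received (see the skeleton below).
worker.addEventListener("message", (e) => {
    if (ws.readyState === WebSocket.OPEN) {
        ws.send(e.data);
    }
});

// End the session. NOTE: the exact EOS message format is assumed here; check the API reference.
function endSession() {
    ws.send(JSON.stringify({ type: "EOS" }));
}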

Here's a skeleton implementation for handling received WebSocket messages:

const onMessageListener = (wsMsg) => {
    const msg = JSON.parse(wsMsg.data);
    switch (msg.type) {
        case "STATUS":
            switch (msg.status) {
                case "INITIALIZED":
                    ws.send(JSON.stringify({
                        type: "CONFIG",
                        pipeline: [
                            {
                                "task": "ASR",
                                "exceptionHandlingPolicy": "THROW",
                                "config": {
                                    "tag": "<insert one of the available asr tags here>",
                                    "parameters": {
                                        "enableInterims": true
                                    }
                                }
                            }
                        ]
                    }));
                    break;
                case "CONFIGURED":
                    // TODO: Start streaming audio data.
                    break;
                case "FINISHED":
                    // TODO: Clean up and close WebSocket connection.
                    break;
                default:
                    console.warn("Unexpected session status:", msg.status);
            }
            break;
        case "TEXT_SEGMENT":
            // TODO: Handle transcription segment.
            break;
        case "WARNING":
            // TODO: Handle warning message.
            break;
        case "ERROR":
            // TODO: Handle error and close WebSocket connection if needed.
            break;
        default:
            console.warn("Unexpected message type:", msg.type);
    }
};

The above example uses a simple pipeline containing only an ASR stage. For more information on building more complex pipelines, please refer to the main API reference.

Processing transcription responses

To ensure real-time transcription, transcripts are returned asynchronously to the client as soon as they become available. Each transcript is sent in the following format:

{    "type": "TEXT_SEGMENT",    "textSegment": {        "isFinal": "<boolean>",         // Indicates whether the transcript is final or interim        "startTimeMs": "<number>",      // Segment start time in milliseconds, relative to the start of the session        "endTimeMs": "<number>",        // Segment end time in milliseconds, relative to the start of the session        "tokens": [            {                "isLeftHanded" : "<boolean>",   // Indicates that the token is left handed (implies a space before a token)                "isRightHanded" : "<boolean>",  // Indicates that the token is right handed (implies a space after a token)                "startOffsetMs": "<number>",    // Token start time relative to the segment start                "endOffsetMs": "<number>",      // Token end time relative to the segment start                "text": "<string>"              // Token content            }        ]    }}

There are two types of transcripts, distinguished by the isFinal flag:

  • Interim: Partial and temporary results
  • Final: Stable and complete results

While the service is decoding audio, it continuously updates the transcript. Each update is sent to the client as an interim result. When either a sufficient length of text is reached or a pause in speech is detected, the segment is finalized and sent as a final transcript. Final segments will no longer be updated.

After a final segment is sent, the next portion of the audio will start producing new interim results, following the same process. The following example illustrates this flow:

  • Interim : To
  • Interim : Today
  • Interim : Today is
  • Interim : Today is a beauty
  • Interim : Today is a beautiful
  • Interim : Today is a beautiful day
  • Final : Today is a beautiful day.
  • Interim : Tomorrow
  • Interim : Tomorrow will
  • Interim : Tomorrow will rain
  • Final : Tomorrow will rain.

This incremental update approach allows real-time applications to show progressively refined results with minimal latency, while also receiving final, stable segments once the system has high confidence.
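
One common way to use this in a UI is to keep the accumulated final text plus a single "current interim" hypothesis that is replaced on every update. The sketch below is one possible handler for TEXT_SEGMENT messages; reconstructText is the token-joining helper shown at the end of this section, the element id is a placeholder, and joining final segments with a single space is a simplification.

let finalText = "";
let interimText = "";

function handleTextSegment(msg) {
    const segment = msg.textSegment;
    const text = reconstructText(segment.tokens);

    if (segment.isFinal) {
        // Final segments are stable: append them and clear the interim hypothesis.
        finalText += (finalText ? " " : "") + text;
        interimText = "";
    } else {
        // Each interim result replaces the previous one.
        interimText = text;
    }

    // Update the UI (placeholder element id).
    document.getElementById("transcript").textContent =
        finalText + (interimText ? " " + interimText : "");
}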

Each transcription segment is returned as a list of tokens, where every token contains its own word and optional metadata. To correctly reconstruct the original sentence from these tokens, you need to be aware of spacing rules based on the isLeftHanded and isRightHanded flags.

A space should be inserted between two adjacent tokens if:

  • The first token is not right-handed, and
  • The second token is not left-handed

Example

Token sequence:

[  { "text": "Hello", "isLeftHanded": false, "isRightHanded": false},  { "text": ",", "isLeftHanded": true, "isRightHanded": false},  { "text": "world", "isLeftHanded": false, "isRightHanded": false},  { "text": "!", "isLeftHanded": true, "isRightHanded": false}]

Reconstructed text:

Hello, world!
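
Putting the spacing rule into code, a possible reconstructText helper (the name is ours, not part of the API) could look like this; applied to the token sequence above, it produces Hello, world!:

// Join tokens, inserting a space only when the previous token is not right-handed
// and the current token is not left-handed.
function reconstructText(tokens) {
    let text = "";
    tokens.forEach((token, i) => {
        if (i > 0 && !tokens[i - 1].isRightHanded && !token.isLeftHanded) {
            text += " ";
        }
        text += token.text;
    });
    return text;
}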