Streaming STT Guide
In this tutorial, we'll walk through how to perform real-time speech-to-text (STT) using the streaming API. You'll learn how to:
- Establish a WebSocket connection with the STT service
- Stream live audio data to the server
- Receive and display transcribed text in real time
Examples are provided for Node.js, Python, and Java using familiar HTTP/WebSocket clients (axios, aiohttp, HttpClient).
A dedicated client library is also available to simplify integration and can serve as a starting point for your implementation.
For details, see Truebar Client Libraries.
By the end of this guide, you'll have a working STT integration capable of transcribing live microphone input directly in your browser.
Stage tag
Before you start, set TRUEBAR_ASR_TAG to one of the online ASR stages returned by /api/pipelines/stages (or copy the tag from your project’s .env.truebar). The snippets below fall back to KALDI:en-US:*:*, which only works if that tag is available in your account.
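If you are unsure which tags your account can use, you can query the stages endpoint directly (using the access token you obtain in the next section). The sketch below is a minimal example; the base URL and the `tag` field on each stage entry are assumptions, so check the API reference for the exact response shape.

```javascript
const axios = require('axios');

// Hypothetical helper: list the ASR stage tags available to your account.
// Assumptions: the REST API is reachable at https://true-bar.si and each
// returned stage entry exposes a `tag` field — verify both against the API reference.
async function listAsrStageTags(accessToken) {
  const { data } = await axios.get('https://true-bar.si/api/pipelines/stages', {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  const stages = Array.isArray(data) ? data : [];
  return stages.map((stage) => stage.tag).filter(Boolean);
}
```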
Authenticating
Before using the API, you need to authenticate with the Truebar service to obtain an access token. This token must be included in all subsequent API requests.
1. Acquiring an Access Token Using Username and Password
The following example demonstrates how to obtain an access token using axios in JavaScript:
const axios = require('axios');

async function getAccessToken(username, password) {
  const response = await axios.post(
    'https://auth.true-bar.si/realms/truebar/protocol/openid-connect/token',
    new URLSearchParams({
      grant_type: 'password',
      username,
      password,
      client_id: 'truebar-client',
    }),
    { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
  );

  return response.data.access_token;
}

2. Refreshing Access Tokens
Tokens are only valid for a limited time. To keep your session alive, you can refresh the token using the refresh_token from the initial response to obtain a new access token:
async function refreshAccessToken(refreshToken) {
  const response = await axios.post(
    'https://auth.true-bar.si/realms/truebar/protocol/openid-connect/token',
    new URLSearchParams({
      grant_type: 'refresh_token',
      refresh_token: refreshToken,
      client_id: 'truebar-client',
    }),
    { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
  );

  return response.data.access_token;
}

You can schedule this refresh operation to occur shortly before the access_token expires, using the expires_in value from the original authentication response.
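For example, a simple approach is to schedule the refresh a little before expiry. A minimal sketch, assuming expires_in is reported in seconds (standard OAuth2 behaviour) and leaving a 30-second safety margin:

```javascript
// Minimal sketch: refresh the access token ~30 s before it expires.
function scheduleTokenRefresh(expiresInSeconds, refreshToken) {
  const marginMs = 30_000;
  const delayMs = Math.max(expiresInSeconds * 1000 - marginMs, 0);
  setTimeout(async () => {
    const newAccessToken = await refreshAccessToken(refreshToken);
    // TODO: store the new token and re-schedule using the new expires_in value.
    console.log('Access token refreshed:', Boolean(newAccessToken));
  }, delayMs);
}
```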
Server-side quickstart (Node.js, Python, Java)
If you are building a backend service, you can reuse the same message flow without the browser-specific pieces. The snippets below match the production-safe pattern from the tutorials: bearer tokens in headers, a CONFIG handshake, binary PCM frames, and an EOS message with lockSession: false.
import axios from "axios";import WebSocket from "ws";import { readFileSync } from "node:fs";
const tokensToText = (tokens = []) => { let output = ""; let prevRight = false; tokens.forEach((token, index) => { const text = token?.text ?? ""; if (!text) return; const left = Boolean(token?.isLeftHanded); if (index > 0 && !prevRight && !left) { output += " "; } output += text; prevRight = Boolean(token?.isRightHanded); }); return output;};
async function fetchToken() { const form = new URLSearchParams({ grant_type: "password", username: process.env.TRUEBAR_USERNAME, password: process.env.TRUEBAR_PASSWORD, client_id: process.env.TRUEBAR_CLIENT_ID ?? "truebar-client", });
const { data } = await axios.post(process.env.TRUEBAR_AUTH_URL, form, { headers: { "Content-Type": "application/x-www-form-urlencoded" }, });
if (!data?.access_token) throw new Error("Missing access_token"); return data.access_token;}
const token = await fetchToken();const ws = new WebSocket(process.env.TRUEBAR_STT_WS_URL, { headers: { Authorization: `Bearer ${token}` },});
const pcm = readFileSync("sample.pcm");const chunkSize = 3200 * 2; // 100 ms @ 16 kHz (16-bit samples)let streamed = false;
ws.on("message", (payload, isBinary) => { if (isBinary) return; const msg = JSON.parse(payload.toString());
if (msg.type === "STATUS") { console.log("STATUS:", msg.status); if (msg.status === "CONFIGURED" && !streamed) { streamed = true; for (let offset = 0; offset < pcm.length; offset += chunkSize) { ws.send(pcm.subarray(offset, offset + chunkSize)); } ws.send(JSON.stringify({ type: "EOS", lockSession: false })); } if (msg.status === "FINISHED") ws.close(); return; }
if (msg.type === "TEXT_SEGMENT") { const text = tokensToText(msg.textSegment.tokens); const label = msg.textSegment.isFinal ? "FINAL" : "INTERIM"; console.log(`${label} - ${text}`); }
if (msg.type === "ERROR") { console.error("Streaming error", msg); ws.close(); }});
ws.once("open", () => { ws.send( JSON.stringify({ type: "CONFIG", pipeline: [ { task: "ASR", exceptionHandlingPolicy: "THROW", config: { tag: process.env.TRUEBAR_ASR_TAG ?? "KALDI:en-US:*:*", parameters: { enableInterims: true }, }, }, ], }), );});import asyncioimport jsonimport os
import aiohttpimport soundfile as sfimport websockets
async def fetch_token(session: aiohttp.ClientSession) -> str:
    payload = {
        "grant_type": "password",
        "username": os.environ["TRUEBAR_USERNAME"],
        "password": os.environ["TRUEBAR_PASSWORD"],
        "client_id": os.getenv("TRUEBAR_CLIENT_ID", "truebar-client"),
    }
    async with session.post(os.environ["TRUEBAR_AUTH_URL"], data=payload) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["access_token"]


def tokens_to_text(tokens: list[dict]) -> str:
    output, prev_right = [], False
    for token in tokens or []:
        text = token.get("text", "")
        if not text:
            continue
        left = bool(token.get("isLeftHanded"))
        if output and not prev_right and not left:
            output.append(" ")
        output.append(text)
        prev_right = bool(token.get("isRightHanded"))
    return "".join(output)


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        token = await fetch_token(session)

    headers = {"Authorization": f"Bearer {token}"}
    async with websockets.connect(
        os.environ["TRUEBAR_STT_WS_URL"],
        additional_headers=headers,
    ) as ws:
        initialized = asyncio.Event()
        configured = asyncio.Event()
        pcm, sample_rate = sf.read("sample.wav", dtype="int16")
        assert sample_rate == 16_000

        async def sender() -> None:
            # Wait for INITIALIZED before configuring the session.
            await initialized.wait()
            await ws.send(json.dumps({
                "type": "CONFIG",
                "pipeline": [
                    {
                        "task": "ASR",
                        "exceptionHandlingPolicy": "THROW",
                        "config": {
                            "tag": os.getenv("TRUEBAR_ASR_TAG", "KALDI:en-US:*:*"),
                            "parameters": {"enableInterims": True},
                        },
                    }
                ],
            }))

            # Wait for CONFIGURED before streaming any audio.
            await configured.wait()
            chunk_samples = 3200  # 200 ms of 16 kHz mono audio
            for start in range(0, len(pcm), chunk_samples):
                await ws.send(pcm[start:start + chunk_samples].tobytes())

            await ws.send(json.dumps({"type": "EOS", "lockSession": False}))

        async def receiver() -> None:
            async for message in ws:
                if isinstance(message, bytes):
                    continue

                msg = json.loads(message)
                if msg["type"] == "STATUS":
                    print("STATUS:", msg["status"])
                    if msg["status"] in {"INITIALIZED", "CONFIG_REQUIRED"}:
                        initialized.set()
                    if msg["status"] == "CONFIGURED":
                        configured.set()
                    if msg["status"] == "FINISHED":
                        break
                elif msg["type"] == "TEXT_SEGMENT":
                    print(tokens_to_text(msg["textSegment"]["tokens"]))
                elif msg["type"] == "ERROR":
                    raise RuntimeError(f"Pipeline error: {msg}")

        await asyncio.gather(sender(), receiver())


if __name__ == "__main__":
    asyncio.run(main())

Java:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.WebSocket;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletionStage;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class StreamingStt {
    private static final Pattern STATUS_PATTERN = Pattern.compile("\"status\"\\s*:\\s*\"([^\"]+)\"");
    private static final Pattern ACCESS_TOKEN = Pattern.compile("\"access_token\"\\s*:\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String token = fetchToken(client);

        byte[] pcm = Files.readAllBytes(Path.of("sample.pcm"));
        int chunkBytes = 3200 * 2; // 200 ms of 16 kHz, 16-bit mono PCM
        String asrTag = System.getenv().getOrDefault("TRUEBAR_ASR_TAG", "KALDI:en-US:*:*");
        String configMessage = String.format(
            "{\"type\":\"CONFIG\",\"pipeline\":[{\"task\":\"ASR\",\"exceptionHandlingPolicy\":\"THROW\",\"config\":{\"tag\":\"%s\",\"parameters\":{\"enableInterims\":true}}}]}",
            asrTag
        );

        WebSocket.Listener listener = new WebSocket.Listener() {
            private boolean streamedAudio = false;

            @Override
            public void onOpen(WebSocket webSocket) {
                webSocket.sendText(configMessage, true);
                webSocket.request(1);
            }

            @Override
            public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
                String message = data.toString();
                Matcher statusMatcher = STATUS_PATTERN.matcher(message);
                if (message.contains("\"type\":\"STATUS\"") && statusMatcher.find()) {
                    String status = statusMatcher.group(1);
                    System.out.println("STATUS: " + status);
                    if ("CONFIGURED".equals(status) && !streamedAudio) {
                        streamedAudio = true;
                        for (int offset = 0; offset < pcm.length; offset += chunkBytes) {
                            int end = Math.min(pcm.length, offset + chunkBytes);
                            webSocket.sendBinary(ByteBuffer.wrap(pcm, offset, end - offset), true);
                        }
                        webSocket.sendText("{\"type\":\"EOS\",\"lockSession\":false}", true);
                    }
                    if ("FINISHED".equals(status)) {
                        webSocket.sendClose(WebSocket.NORMAL_CLOSURE, "done");
                    }
                } else if (message.contains("\"type\":\"ERROR\"")) {
                    System.err.println("Streaming error: " + message);
                    webSocket.sendClose(WebSocket.NORMAL_CLOSURE, "error");
                } else {
                    System.out.println(message);
                }
                webSocket.request(1);
                return null;
            }

            @Override
            public CompletionStage<?> onBinary(WebSocket webSocket, ByteBuffer data, boolean last) {
                webSocket.request(1);
                return null;
            }
        };

        client.newWebSocketBuilder()
            .header("Authorization", "Bearer " + token)
            .buildAsync(URI.create(System.getenv("TRUEBAR_STT_WS_URL")), listener)
            .join();
    }

    private static String fetchToken(HttpClient client) throws Exception {
        String form = "grant_type=password&username=" + System.getenv("TRUEBAR_USERNAME")
            + "&password=" + System.getenv("TRUEBAR_PASSWORD")
            + "&client_id=" + System.getenv().getOrDefault("TRUEBAR_CLIENT_ID", "truebar-client");

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(System.getenv("TRUEBAR_AUTH_URL")))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(form))
            .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 400) {
            throw new IllegalStateException("Token request failed: " + response.body());
        }
        Matcher matcher = ACCESS_TOKEN.matcher(response.body());
        if (!matcher.find()) {
            throw new IllegalStateException("Missing access_token");
        }
        return matcher.group(1);
    }
}

All three snippets wait for STATUS: CONFIGURED before streaming PCM frames and finish with an EOS message with lockSession: false. Wrap the send/receive loops in retry logic if you expect intermittent network errors.
Capturing audio
In this chapter, we'll walk through capturing microphone audio in a modern web browser using the Web Audio API. We'll show how to request microphone access, capture raw audio, and prepare it for transmission.
1. Check for Web Audio API Support
Before requesting microphone access, ensure the browser supports the necessary APIs:
if (!navigator || !navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
  throw new Error("Your browser does not support WebAudio!");
}

2. Initialize AudioContext with a Target Sample Rate
The Truebar service expects audio encoded at 16 kHz. While it’s best to capture audio directly at this rate, not all browsers support setting the sample rate via the AudioContext constructor. We recommend checking the browser and only specifying the sample rate if it's known to be supported:
let audioContext;
const targetSampleRate = 16000;

if (navigator.userAgent.includes("Chrome") || navigator.userAgent.includes("Microsoft Edge")) {
  audioContext = new AudioContext({ sampleRate: targetSampleRate });
} else {
  audioContext = new AudioContext(); // Will default to the device rate (often 44.1 kHz or 48 kHz)
}

⚠️ If the AudioContext sample rate does not match the target, we must resample the audio before sending it to the Truebar service.
3. Web Worker for Resampling and Format Conversion
To offload audio processing (resampling and encoding to S16LE PCM), we use a Web Worker. This keeps the main thread responsive.
// AudioChunkProcessor.js — runs inside a Web Worker.
// Note: `waveResampler` must be made available to the worker (for example by
// bundling the wave-resampler package or loading a UMD build of it).
addEventListener('message', event => {
  let sourceSampleRate = event["data"]["sourceSampleRate"];
  let targetSampleRate = event["data"]["targetSampleRate"];
  let audioChunk = event["data"]["audioChunk"];

  // Resample if needed
  if (sourceSampleRate !== targetSampleRate) {
    audioChunk = waveResampler.resample(audioChunk, sourceSampleRate, targetSampleRate);
  }

  // Convert to 16-bit little-endian encoding
  let audioDataArray16b = floatTo16BitPCM(audioChunk);

  // Send back the processed audio buffer
  postMessage(audioDataArray16b);
});

function floatTo16BitPCM(input) {
  // Each 32-bit (4-byte) float from the input is converted to one 16-bit (2-byte) integer,
  // so each element needs 2 bytes.
  let buffer = new ArrayBuffer(input.length * 2);

  // Define a view on the raw buffer so we can set values as int16.
  let view = new DataView(buffer);

  for (let i = 0; i < input.length; i++) {
    // Clamp input to [-1, 1]
    const s = Math.max(-1, Math.min(1, input[i]));

    // Convert float32 to int16 and force little endian
    view.setInt16(2 * i, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }

  return buffer;
}

🔧 This example uses the wave-resampler library for resampling. Any other library or custom implementation will also work.
4. Using the Worker from the Main Thread
Create and use the worker in your main script:
let worker = new Worker('AudioChunkProcessor.js');
worker.addEventListener("message", (e) => {
  // TODO: Send the processed audio chunk (e["data"]) to the service.
});

5. Capturing Audio Chunks with ScriptProcessorNode
Use the deprecated (but still functional) ScriptProcessorNode to capture chunks of audio from the microphone input:
const chunkSize = 4096; // Choose based on latency/throughput needs (recommended 4096, max 16384)

let scriptNode = audioContext.createScriptProcessor(chunkSize, 1, 1);
scriptNode.onaudioprocess = (event) => {
  worker.postMessage({
    "sourceSampleRate": audioContext.sampleRate,
    "targetSampleRate": targetSampleRate,
    "audioChunk": event.inputBuffer.getChannelData(0)
  });
};

⚠️ ScriptProcessorNode is deprecated. For production, consider using AudioWorkletNode for better performance and future compatibility.
In the example above, we create a ScriptProcessorNode and implement a callback that is invoked whenever an audio chunk of chunkSize samples becomes available. Within the callback, we pass the chunk to our Web Worker for resampling and conversion.
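If you want to avoid the deprecated node, the same capture can be done with an AudioWorkletNode. The sketch below is a minimal outline, not part of the official samples: the processor and file name (pcm-capture-processor) are placeholders, and sourceNode refers to the MediaStreamAudioSourceNode created in the next step.

```javascript
// pcm-capture-processor.js — AudioWorklet module (names are placeholders).
class PcmCaptureProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0] && inputs[0][0];
    if (channel) {
      // Each render quantum is 128 samples; copy it before posting, because
      // the underlying buffer is reused by the audio engine.
      this.port.postMessage(new Float32Array(channel));
    }
    return true; // keep the processor alive
  }
}
registerProcessor('pcm-capture-processor', PcmCaptureProcessor);
```

```javascript
// Main thread: swap the ScriptProcessorNode for an AudioWorkletNode.
// `audioContext`, `worker`, `targetSampleRate` and `sourceNode` are the same
// objects used elsewhere in this guide; you may want to accumulate several
// 128-sample chunks before posting them to reduce per-message overhead.
await audioContext.audioWorklet.addModule('pcm-capture-processor.js');
const workletNode = new AudioWorkletNode(audioContext, 'pcm-capture-processor');
workletNode.port.onmessage = (e) => {
  worker.postMessage({
    sourceSampleRate: audioContext.sampleRate,
    targetSampleRate: targetSampleRate,
    audioChunk: e.data,
  });
};
sourceNode.connect(workletNode);
```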
6. Connect the Audio Processing Chain
Finally, we connect the audio processing chain so that the ScriptProcessorNode is included in it. The MediaStream comes from navigator.mediaDevices.getUserMedia:

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const sourceNode = audioContext.createMediaStreamSource(stream);
sourceNode.connect(scriptNode);
scriptNode.connect(audioContext.destination); // Required in some browsers

Establishing and managing a transcription session
To achieve real-time transcription of audio data, we use the streaming API provided by the service.
The process begins by connecting to a WebSocket endpoint. When possible, send the bearer token via the Authorization header; fall back to a query parameter only if your environment does not allow custom headers.
const wsUrl = new URL("wss://true-bar.si/api/pipelines/stream");
wsUrl.searchParams.set("access_token", accessToken); // Browsers cannot set custom headers

const ws = new WebSocket(wsUrl.toString());
ws.addEventListener("message", onMessageListener);

On backend services you can pass the bearer token in the Authorization header instead of using the query parameter.
The message listener contains the core logic for handling the communication protocol. As described in the API specification, the server may emit various message types, so the callback must be able to handle all of them.
The whole procedure is as follows:
- 1 Open a WebSocket connection.
- 2 Wait for an INITIALIZED status message before sending anything.
- 3 Once the INITIALIZED message is received, send a CONFIG message containing the pipeline definition to configure the session.
- 4 Wait for a CONFIGURED status message before sending any additional messages.
- 5 Once the session is CONFIGURED, begin streaming binary audio data as described in the previous section.
- 6 While streaming audio, handle incoming messages from the server. These may include:
- TEXT_SEGMENT: A new transcription result.
- WARNING: An API warning.
- ERROR: An API error.
- 7 When you want to end the session, send an EOS message ({"type":"EOS","lockSession":false}) so the server can finish processing the sent data.
- 8 Wait for the FINISHED status message. This will arrive once all data has been processed.
- 9 Close the WebSocket connection.
Sending lockSession: false in the EOS payload unlocks the session, which means anyone with access to it can open it in write mode. If you plan to reconnect and resume the session, or edit it by any other means, set lockSession: true; you will then receive a temporary session lock key that gives you exclusive write access to the session.
If the server returns an ERROR message, log the payload (it includes a descriptive message field) and close the socket—then reconnect with a fresh session once you have addressed the root cause. For transient network failures, a simple exponential-backoff retry loop is usually sufficient.
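As a rough illustration (not part of the Truebar API), such a retry loop might look like the sketch below; connectAndStream is a hypothetical function that wraps the open → CONFIG → stream → EOS sequence described above.

```javascript
// Hypothetical helper: retry the whole streaming session with exponential backoff.
// `connectAndStream` is assumed to perform the open → CONFIG → stream → EOS flow
// and to throw on network or pipeline errors.
async function streamWithRetry(connectAndStream, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await connectAndStream();
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      const delayMs = Math.min(1000 * 2 ** (attempt - 1), 30_000); // 1 s, 2 s, 4 s, ... capped at 30 s
      console.warn(`Attempt ${attempt} failed, retrying in ${delayMs} ms`, err);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```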
Please note that the procedure described above covers the most common usage. There are also alternative methods for terminating a session:
- Closing the WebSocket directly: This will immediately terminate the session on the API side. Any unprocessed data at the time of closure will be discarded. The session will be marked as CANCELED and cannot be resumed.
- Canceling the session without closing the WebSocket: This method is useful when the client does not need to wait for all results but wants to keep the WebSocket connection open. Instead of sending an EOS message, the client can send a CANCEL message (see the sketch below). After canceling the session, the client will receive a message with status = CANCELED. The WebSocket connection then returns to the same state as when it was first INITIALIZED. To start a new session over the same WebSocket, the client should send a CONFIG message and begin streaming data again. With this method the session is again marked as CANCELED and cannot be resumed.
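A minimal sketch of the second option, assuming the CANCEL message carries only a type field (check the API reference for the exact payload):

```javascript
// Abort the current session but keep the WebSocket open for reuse.
// Assumption: CANCEL takes no additional fields.
ws.send(JSON.stringify({ type: "CANCEL" }));

// Once a STATUS message with status === "CANCELED" arrives, the socket is back
// in its initial state and a new CONFIG message can start another session.
```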
Here's a skeleton implementation for handling received WebSocket messages:
const onMessageListener = (wsMsg) => {
  const msg = JSON.parse(wsMsg.data);
  switch (msg.type) {
    case "STATUS":
      switch (msg.status) {
        case "INITIALIZED":
          ws.send(JSON.stringify({
            type: "CONFIG",
            pipeline: [
              {
                "task": "ASR",
                "exceptionHandlingPolicy": "THROW",
                "config": {
                  "tag": "<insert one of the available ASR tags here>",
                  "parameters": {
                    "enableInterims": true
                  }
                }
              }
            ]
          }));
          break;
        case "CONFIGURED":
          // Start streaming PCM16LE audio frames once you reach CONFIGURED.
          break;
        case "FINISHED":
          // Clean up and close the WebSocket connection.
          break;
        default:
          console.warn("Unexpected session status:", msg.status);
      }
      break;
    case "TEXT_SEGMENT":
      console.log(tokensToText(msg.textSegment.tokens));
      break;
    case "WARNING":
      // TODO: Handle warning message.
      break;
    case "ERROR":
      console.error("Streaming error", msg);
      break;
    default:
      console.warn("Unexpected message type:", msg.type);
  }
};

The above example uses a simple pipeline containing only an ASR stage. For more information on building more complex pipelines, please refer to the main API reference.
Processing transcription responses
To ensure real-time transcription, transcripts are returned asynchronously to the client as soon as they become available. Each transcript is sent in the following format:
{ "type": "TEXT_SEGMENT", "textSegment": { "isFinal": "<boolean>", // Indicates whether the transcript is final or interim "startTimeMs": "<number>", // Segment start time in milliseconds, relative to the start of the session "endTimeMs": "<number>", // Segment end time in milliseconds, relative to the start of the session "tokens": [ { "isLeftHanded" : "<boolean>", // Indicates that the token is left handed (implies a space before a token) "isRightHanded" : "<boolean>", // Indicates that the token is right handed (implies a space after a token) "startOffsetMs": "<number>", // Token start time relative to the segment start "endOffsetMs": "<number>", // Token end time relative to the segment start "text": "<string>" // Token content } ] }}There are two types of transcripts, which is determined by the isFinal flag:
- Interim: Partial and temporary results
- Final: Stable and complete results
While the service is decoding audio, it continuously updates the transcript. Each update is sent to the client as an interim result. When either a sufficient length of text is reached or a pause in speech is detected, the segment is finalized and sent as a final transcript. Final segments will no longer be updated.
After a final segment is sent, the next portion of the audio will start producing new interim results, following the same process. The following example illustrates this flow:
- Interim : To
- Interim : Today
- Interim : Today is
- Interim : Today is a beauty
- Interim : Today is a beautiful
- Interim : Today is a beautiful day
- Final : Today is a beautiful day.
- Interim : Tomorrow
- Interim : Tomorrow will
- Interim : Tomorrow will rain
- Final : Tomorrow will rain.
This incremental update approach allows real-time applications to show progressively refined results with minimal latency, while also receiving final, stable segments once the system has high confidence.
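In practice this usually means keeping the already-finalized text separate from the latest interim hypothesis and re-rendering only the interim part. A minimal sketch; renderTranscript is a placeholder for your own display logic, and tokensToText is the helper defined in the next section:

```javascript
// Minimal sketch: keep finalized text and the current interim hypothesis separate.
let finalText = "";
let interimText = "";

function handleTextSegment(textSegment) {
  const text = tokensToText(textSegment.tokens);
  if (textSegment.isFinal) {
    finalText += (finalText ? " " : "") + text; // append the stable segment
    interimText = "";                           // and clear the interim buffer
  } else {
    interimText = text;                         // replace the previous interim hypothesis
  }
  renderTranscript(finalText, interimText);     // placeholder for your UI update
}
```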
Each transcription segment is returned as a list of tokens, where every token contains its text along with optional metadata.
To correctly reconstruct the original sentence from these tokens, you need to be aware of spacing rules based on the isLeftHanded and isRightHanded flags.
A space should be inserted between two adjacent tokens if:
- The first token is not right-handed, and
- The second token is not left-handed
const tokensToText = (tokens) => {
  let output = "";
  let prevRight = false;
  tokens?.forEach((token, index) => {
    const text = token?.text ?? "";
    if (!text) return;
    const left = Boolean(token?.isLeftHanded);
    if (index > 0 && !prevRight && !left) {
      output += " ";
    }
    output += text;
    prevRight = Boolean(token?.isRightHanded);
  });
  return output;
};

Example
Token sequence:
[ { "text": "Hello", "isLeftHanded": false, "isRightHanded": false}, { "text": ",", "isLeftHanded": true, "isRightHanded": false}, { "text": "world", "isLeftHanded": false, "isRightHanded": false}, { "text": "!", "isLeftHanded": true, "isRightHanded": false}]Reconstructed text:
Hello, world!