Real-time speech-to-text over WebSocket. Stream audio to the server and receive transcribed utterances as they are processed, with optional speaker diarization, emotion detection, accent detection, synthetic-voice (deepfake) scoring, and PII/PHI tagging.
Endpoint
wss://platform.modulate.ai/api/velma-2-stt-streaming
Authentication
Pass your API key as a query parameter when opening the connection.
wss://platform.modulate.ai/api/velma-2-stt-streaming?api_key=YOUR_API_KEY
Unlike the batch endpoints, the streaming API does not use an X-API-Key header. The key must be in the query string at connection time.
See Authentication and rate limits for how to obtain and manage API keys.
Audio formats
Self-describing container formats are auto-detected from headers (no audio_format query parameter needed). Raw / headerless formats require audio_format, sample_rate, and num_channels. For the authoritative list of accepted values, see the spec’s audio_format enum or Audio formats and preprocessing.
Opus is recommended when you control the encoder, because it delivers high quality at low bandwidth.
Query parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | string | n/a | Required. Your API key |
| speaker_diarization | boolean | true | Identify and label distinct speakers |
| emotion_signal | boolean | false | Detect emotional tone per utterance |
| accent_signal | boolean | false | Detect speaker accent per utterance |
| deepfake_signal | boolean | false | Per-utterance synthetic-voice (deepfake) score |
| pii_phi_tagging | boolean | false | Wrap PII/PHI in tags within utterance text |
| partial_results | boolean | false | Stream interim partial_utterance messages with in-progress text before each utterance is finalized |
For a full explanation of what each feature does and when to enable it, see STT enrichment features.
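As an illustration of how the query parameters combine into a connection URL, here is a minimal sketch. The helper name build_stream_url is hypothetical and not part of any official SDK; it assumes booleans are sent as lowercase true/false, as in the examples on this page.

```python
from urllib.parse import urlencode

BASE_URL = "wss://platform.modulate.ai/api/velma-2-stt-streaming"


def build_stream_url(api_key: str, **options) -> str:
    """Return the connection URL with api_key plus any optional parameters.

    Hypothetical helper for illustration only.
    """
    params = {"api_key": api_key}
    for name, value in options.items():
        if isinstance(value, bool):
            # Send lowercase "true"/"false" instead of Python's "True"/"False".
            params[name] = "true" if value else "false"
        else:
            params[name] = str(value)
    # urlencode percent-escapes reserved characters in names and values.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_stream_url("YOUR_API_KEY", emotion_signal=True, partial_results=True)
print(url)
```

Parameters that are left out fall back to the server-side defaults listed in the table, so only the features you want to change need to be passed.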
Connection flow
- Connect to the WebSocket endpoint with api_key and any optional feature parameters.
- Stream raw audio as binary WebSocket frames. Frames can be any size.
- Receive utterance JSON messages as speech is transcribed. If partial_results=true, also receive partial_utterance previews for the currently active utterance.
- Send an empty text frame ("") to signal end of audio.
- Receive a done message containing total audio duration.
- The connection closes automatically.
Server messages
utterance
Sent each time a speech segment is transcribed.
{
  "type": "utterance",
  "utterance": {
    "utterance_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "text": "Hello, how are you today?",
    "start_ms": 0,
    "duration_ms": 2500,
    "speaker": 1,
    "language": "en",
    "emotion": "Neutral",
    "accent": "American",
    "deepfake_score": null
  }
}
Utterance fields
| Field | Type | Description |
|---|---|---|
| utterance_uuid | string (UUID) | Unique identifier for this utterance |
| text | string | Transcribed text |
| start_ms | integer | Start time in milliseconds from the beginning of the stream |
| duration_ms | integer | Duration of the utterance in milliseconds |
| speaker | integer | Speaker number, 1-indexed |
| language | string | Detected language code (e.g. "en", "fr") |
| emotion | string \| null | Detected emotion. null when emotion_signal is disabled |
| accent | string \| null | Detected accent. null when accent_signal is disabled |
| deepfake_score | float \| null | Synthetic-voice score from 0.0 (likely natural) to 1.0 (likely synthetic). null when deepfake_signal is disabled or when the utterance is shorter than 0.5 seconds |
For all valid emotion and accent values, see STT enrichment features.
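To make the nullable fields concrete, the following sketch parses an utterance message into a typed object. The Utterance dataclass, its end_ms property, and parse_utterance are hypothetical names introduced for illustration; only the field names come from the table above.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Utterance:
    utterance_uuid: str
    text: str
    start_ms: int
    duration_ms: int
    speaker: int
    language: str
    # These three are null (None) when the matching feature is disabled.
    emotion: Optional[str] = None
    accent: Optional[str] = None
    deepfake_score: Optional[float] = None

    @property
    def end_ms(self) -> int:
        # Derived value, not a field sent by the server.
        return self.start_ms + self.duration_ms


def parse_utterance(raw: str) -> Utterance:
    message = json.loads(raw)
    if message.get("type") != "utterance":
        raise ValueError(f"expected an utterance message, got {message.get('type')!r}")
    body = message["utterance"]
    # Keep only the documented fields so unexpected extra keys are ignored.
    known = {f.name: body.get(f.name) for f in fields(Utterance)}
    return Utterance(**known)


sample = (
    '{"type": "utterance", "utterance": {'
    '"utterance_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", '
    '"text": "Hello, how are you today?", "start_ms": 0, "duration_ms": 2500, '
    '"speaker": 1, "language": "en", "emotion": "Neutral", '
    '"accent": "American", "deepfake_score": null}}'
)
u = parse_utterance(sample)
print(u.speaker, u.end_ms, u.deepfake_score)
```

Checking deepfake_score against None before comparing it to a threshold avoids a TypeError for short utterances or when deepfake_signal is off.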
partial_utterance
Sent only when partial_results=true. Delivers in-progress text for the currently active utterance as a low-latency preview. Each partial_utterance supersedes the previous one for the same utterance; the finalized utterance message supersedes all preceding partials. Partials are fast previews and do not carry emotion, accent, deepfake, or PII/PHI fields.
{
  "type": "partial_utterance",
  "partial_utterance": {
    "text": "Hello, how are",
    "start_ms": 0,
    "speaker": 1
  }
}
Partial utterance fields
| Field | Type | Description |
|---|---|---|
| text | string | In-progress transcribed text. May be empty |
| start_ms | integer \| null | Start time in milliseconds from the beginning of the stream. null if timing data is not yet available |
| speaker | integer \| null | Speaker number, 1-indexed. null if the speaker has not yet been identified |
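The supersede rule for partials can be sketched as a small state holder. TranscriptView is a hypothetical class for illustration; it assumes, as described above, that only one utterance is in progress at a time, so a single preview slot is enough.

```python
from typing import List, Optional


class TranscriptView:
    """Holds finalized utterance text plus the latest partial preview."""

    def __init__(self) -> None:
        self.final_texts: List[str] = []
        self.preview: Optional[str] = None

    def handle(self, message: dict) -> None:
        kind = message.get("type")
        if kind == "partial_utterance":
            # Each partial replaces the previous preview for the active utterance.
            self.preview = message["partial_utterance"]["text"]
        elif kind == "utterance":
            # The finalized utterance replaces all preceding partials.
            self.final_texts.append(message["utterance"]["text"])
            self.preview = None

    def render(self) -> str:
        # Finalized text first, then the in-progress preview if there is one.
        parts = self.final_texts + ([self.preview] if self.preview else [])
        return " ".join(parts)


view = TranscriptView()
view.handle({"type": "partial_utterance",
             "partial_utterance": {"text": "Hello, how", "start_ms": 0, "speaker": 1}})
view.handle({"type": "partial_utterance",
             "partial_utterance": {"text": "Hello, how are", "start_ms": 0, "speaker": 1}})
print(view.render())
view.handle({"type": "utterance", "utterance": {"text": "Hello, how are you today?"}})
print(view.render())
```

Because partials carry no enrichment fields, anything that depends on emotion, accent, deepfake score, or PII/PHI tags should wait for the finalized utterance message.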
done
Sent after all audio has been processed, in response to the end-of-stream signal.
{
  "type": "done",
  "duration_ms": 45000
}
error
Sent if transcription fails. The connection closes after this message.
{
  "type": "error",
  "error": "Internal server error"
}
WebSocket close codes
| Code | Meaning |
|---|---|
| 1000 | Normal closure after a successful done message |
| 1003 | Invalid connection parameters (unsupported audio_format, sample_rate, or num_channels; raw format missing sample_rate/num_channels) |
| 4003 | Request could not be validated, or is not permitted (auth failure, missing model access) |
| 4029 | Insufficient credits, or concurrent-connection limit exceeded |
An error JSON message is sent before the connection closes (except on 1000).
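A client may want to turn the close code into a readable log line. The sketch below simply mirrors the table above; CLOSE_CODES and describe_close are hypothetical names, and any code not listed here is reported as undocumented rather than guessed at.

```python
# Meanings copied from the close-code table; other codes are not covered here.
CLOSE_CODES = {
    1000: "Normal closure after a successful done message",
    1003: "Invalid connection parameters",
    4003: "Request could not be validated, or is not permitted",
    4029: "Insufficient credits, or concurrent-connection limit exceeded",
}


def describe_close(code: int) -> str:
    """Return a human-readable meaning for a WebSocket close code."""
    return CLOSE_CODES.get(code, f"Undocumented close code {code}")


def closed_normally(code: int) -> bool:
    # Only 1000 indicates that a done message was delivered successfully.
    return code == 1000


print(describe_close(4029))
print(closed_normally(1003))
```

With aiohttp, the code is available as ws.close_code once the connection has ended.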
Rate limits
- Concurrent connection limits apply per organization.
- Monthly usage limits (in audio hours) apply per organization.
- Connections that exceed limits are rejected during the WebSocket handshake with close code 4029.
See Authentication and rate limits for retry guidance.
Examples
Python (aiohttp)
import asyncio
import json

import aiohttp

API_KEY = "YOUR_API_KEY"
AUDIO_FILE = "recording.opus"
CHUNK_SIZE = 8192


async def transcribe_streaming():
    url = (
        "wss://platform.modulate.ai/api/velma-2-stt-streaming"
        f"?api_key={API_KEY}"
        "&speaker_diarization=true"
        "&emotion_signal=true"
        "&accent_signal=true"
    )
    utterances = []

    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(url) as ws:

            async def send_audio():
                # Upload the file as binary frames while results are being received.
                with open(AUDIO_FILE, "rb") as f:
                    while chunk := f.read(CHUNK_SIZE):
                        await ws.send_bytes(chunk)
                        # Pace uploads; this delay assumes about 4000 bytes of audio per second.
                        await asyncio.sleep(CHUNK_SIZE / 4000)
                # An empty text frame signals end of audio.
                await ws.send_str("")

            send_task = asyncio.create_task(send_audio())

            try:
                async for msg in ws:
                    if msg.type == aiohttp.WSMsgType.TEXT:
                        data = json.loads(msg.data)
                        if data["type"] == "utterance":
                            u = data["utterance"]
                            utterances.append(u)
                            print(f"[Speaker {u['speaker']}] {u['text']}")
                        elif data["type"] == "done":
                            print(f"Done. Duration: {data['duration_ms']}ms")
                            break
                        elif data["type"] == "error":
                            print(f"Error: {data['error']}")
                            break
                    elif msg.type in (
                        aiohttp.WSMsgType.ERROR,
                        aiohttp.WSMsgType.CLOSE,
                        aiohttp.WSMsgType.CLOSED,
                    ):
                        break
            finally:
                # Stop uploading if the receive loop ended before the file was fully sent.
                if not send_task.done():
                    send_task.cancel()

    full_text = " ".join(u["text"] for u in utterances)
    print(f"\nFull transcript:\n{full_text}")


asyncio.run(transcribe_streaming())
JavaScript (Node.js)
const WebSocket = require("ws");
const fs = require("fs");

const API_KEY = "YOUR_API_KEY";
const AUDIO_FILE = "recording.opus";
const CHUNK_SIZE = 8192;

const url = new URL("wss://platform.modulate.ai/api/velma-2-stt-streaming");
url.searchParams.set("api_key", API_KEY);
url.searchParams.set("speaker_diarization", "true");
url.searchParams.set("emotion_signal", "true");

const ws = new WebSocket(url.toString());
const utterances = [];

ws.on("open", () => {
  // Send the file as binary frames, then an empty text frame to signal end of audio.
  const stream = fs.createReadStream(AUDIO_FILE, { highWaterMark: CHUNK_SIZE });
  stream.on("data", (chunk) => ws.send(chunk));
  stream.on("end", () => ws.send(""));
});

ws.on("message", (data) => {
  // Every server message is a JSON text frame with a "type" field.
  const msg = JSON.parse(data.toString());
  if (msg.type === "utterance") {
    utterances.push(msg.utterance);
    console.log(`[Speaker ${msg.utterance.speaker}] ${msg.utterance.text}`);
  } else if (msg.type === "done") {
    console.log(`Done. Duration: ${msg.duration_ms}ms`);
    console.log("Transcript:", utterances.map((u) => u.text).join(" "));
    ws.close();
  } else if (msg.type === "error") {
    console.error("Error:", msg.error);
    ws.close();
  }
});

ws.on("error", (err) => console.error("WebSocket error:", err.message));
WebSocket APIs cannot be tested with cURL. For command-line testing, use websocat.