Low-latency English speech-to-text over WebSocket. Emits a rolling partial transcript every ~1.5 seconds while audio streams in, then delivers one complete final transcript at end-of-stream. Pure transcription — no speaker diarization, emotion detection, accent detection, or PII/PHI tagging.
Endpoint
wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2
Authentication
Pass your API key as a query parameter when opening the connection.
wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2?api_key=YOUR_API_KEY&audio_format=ogg
Unlike the batch endpoints, this API does not use an X-API-Key header. The key must be in the query string at connection time.
See Authentication and rate limits for how to obtain and manage API keys.
Supported audio formats
Container formats — sample_rate and num_channels are not required:
wav, mp3, ogg, flac, webm, aac, aiff
Opus audio is typically shipped as ogg (Opus-in-Ogg). Pass audio_format=ogg for Opus streams.
Raw PCM formats — sample_rate and num_channels are required:
s8, s16le, s16be, s24le, s24be, s32le, s32be, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw
Valid sample rates: 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000
For lowest end-to-end latency, send audio_format=s16le&sample_rate=16000&num_channels=1. This matches the model’s native input format and bypasses the server’s audio decoder entirely.
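To size frames on the raw PCM path, it helps to work out the byte rate first. The sketch below does that arithmetic for s16le at 16 kHz mono and slices a buffer into chunks that never split a sample frame. The helper names (`bytes_per_second`, `frame_aligned_chunks`) and the `BYTES_PER_SAMPLE` table are illustrative and not part of the API. Alignment is a client-side precaution here, not a documented requirement.

```python
from typing import Iterator

# Bytes per sample for a few of the raw PCM formats listed above (illustrative subset).
BYTES_PER_SAMPLE = {"s16le": 2, "s24le": 3, "s32le": 4, "f32le": 4, "mulaw": 1}


def bytes_per_second(audio_format: str, sample_rate: int, num_channels: int) -> int:
    """Byte rate of an uncompressed PCM stream."""
    return BYTES_PER_SAMPLE[audio_format] * sample_rate * num_channels


def frame_aligned_chunks(pcm: bytes, audio_format: str, num_channels: int,
                         target_size: int = 8192) -> Iterator[bytes]:
    """Yield chunks of at most target_size bytes that never split a sample frame."""
    frame = BYTES_PER_SAMPLE[audio_format] * num_channels
    # Round the chunk size down to a whole number of sample frames.
    step = max(frame, target_size - target_size % frame)
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


rate = bytes_per_second("s16le", 16000, 1)
print(rate)                # 32000 bytes per second
print(4096 * 1000 / rate)  # a 4 KB frame carries 128.0 ms of audio
```

At this byte rate, the typical 4–64 KB frame range corresponds to roughly 0.13 to 2 seconds of audio per frame, so smaller frames keep the audio arriving at the server in finer increments.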
Query parameters
| Parameter | Required | Description |
|---|---|---|
| api_key | Yes | Your API key |
| audio_format | Yes | Container or raw PCM format of the audio you will stream. See supported formats above. |
| sample_rate | Raw PCM only | Source sample rate in Hz. Required when audio_format is a raw PCM type; ignored for container formats. |
| num_channels | Raw PCM only | Number of audio channels (1–8). Required when audio_format is a raw PCM type; ignored for container formats. The server downmixes to mono internally. |
This endpoint accepts no feature toggles. Diarization, emotion, accent, and PII parameters are not recognized and have no effect. Use STT Streaming if you need those features.
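The parameter rules above can be checked on the client before opening a connection. The following is a minimal sketch: `build_stream_url` is a hypothetical helper, not part of any SDK, and it assumes the listed sample rates apply to raw PCM formats only. It URL-encodes the key and omits sample_rate and num_channels for container formats, where the server ignores them.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2"
CONTAINER_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "aac", "aiff"}
RAW_PCM_FORMATS = {
    "s8", "s16le", "s16be", "s24le", "s24be", "s32le", "s32be",
    "u8", "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
}
VALID_SAMPLE_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000}


def build_stream_url(api_key: str, audio_format: str,
                     sample_rate: Optional[int] = None,
                     num_channels: Optional[int] = None) -> str:
    """Build the connection URL, validating parameters before any network call."""
    params = {"api_key": api_key, "audio_format": audio_format}
    if audio_format in RAW_PCM_FORMATS:
        # Raw PCM carries no header, so both values are required.
        if sample_rate not in VALID_SAMPLE_RATES:
            raise ValueError(f"unsupported sample_rate: {sample_rate!r}")
        if num_channels is None or not 1 <= num_channels <= 8:
            raise ValueError(f"num_channels must be 1-8, got {num_channels!r}")
        params["sample_rate"] = str(sample_rate)
        params["num_channels"] = str(num_channels)
    elif audio_format not in CONTAINER_FORMATS:
        raise ValueError(f"unsupported audio_format: {audio_format!r}")
    return f"{BASE_URL}?{urlencode(params)}"


print(build_stream_url("YOUR_API_KEY", "s16le", sample_rate=16000, num_channels=1))
```

Failing locally with a `ValueError` avoids a round trip that would otherwise end in close code 1003.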
Connection flow
- Connect to the WebSocket endpoint with api_key, audio_format, and (for raw PCM) sample_rate and num_channels.
- Stream audio as binary WebSocket frames. Frames can be any size; 4–64 KB is typical.
- Receive partial_utterance JSON messages every ~1.5 seconds. Each contains the complete transcript so far: replace any previously displayed partial, do not append.
- Send an empty text frame ("") to signal end of audio.
- Receive one final utterance message with the complete transcript.
- Receive a done message with total audio duration.
- The connection closes automatically.
Connection limits:
- Maximum connection lifetime: 30 minutes per WebSocket.
- Idle timeout: 60 seconds without audio frames. The server finalizes the transcript and closes.
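A long-running client may want to track both limits so it can end the stream cleanly before the server does. The sketch below is one way to do that bookkeeping. The `ConnectionBudget` class and the 30-second rotation margin are assumptions made for illustration; only the two constants come from the limits above.

```python
import time

MAX_LIFETIME_S = 30 * 60  # documented maximum connection lifetime
IDLE_TIMEOUT_S = 60       # documented idle timeout without audio frames


class ConnectionBudget:
    """Tracks elapsed and idle time for one WebSocket connection."""

    def __init__(self, now: float) -> None:
        self.opened_at = now
        self.last_audio_at = now

    def mark_audio_sent(self, now: float) -> None:
        """Call after each binary audio frame is sent."""
        self.last_audio_at = now

    def lifetime_left(self, now: float) -> float:
        return max(0.0, MAX_LIFETIME_S - (now - self.opened_at))

    def idle_left(self, now: float) -> float:
        return max(0.0, IDLE_TIMEOUT_S - (now - self.last_audio_at))

    def should_rotate(self, now: float, margin_s: float = 30.0) -> bool:
        """True when it is time to end this stream and open a new connection."""
        return self.lifetime_left(now) <= margin_s


budget = ConnectionBudget(now=time.monotonic())
budget.mark_audio_sent(time.monotonic())
print(budget.should_rotate(time.monotonic()))
```

Passing the clock value in explicitly keeps the class easy to test; a caller would typically use `time.monotonic()` so wall-clock adjustments do not affect the countdown.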
Server messages
partial_utterance
Sent roughly every 1.5 seconds while audio is streaming. Each message contains the complete transcript built so far, not a delta from the previous message. Replace your displayed partial text with each new value — never append.
{
  "type": "partial_utterance",
  "partial_utterance": {
    "text": "Hello, how are you",
    "is_final": false
  }
}
| Field | Type | Description |
|---|---|---|
| type | string | Always "partial_utterance" |
| partial_utterance.text | string | Complete transcript so far. Replace, don’t append. |
| partial_utterance.is_final | boolean | Always false |
utterance
Sent exactly once, after the client signals end-of-stream. Contains the final transcript covering the entire audio stream. Supersedes all preceding partial_utterance messages.
{
  "type": "utterance",
  "utterance": {
    "text": "Hello, how are you doing today?",
    "is_final": true
  }
}
| Field | Type | Description |
|---|---|---|
| type | string | Always "utterance" |
| utterance.text | string | Final transcript. Will not be revised. |
| utterance.is_final | boolean | Always true |
done
Sent immediately after the final utterance. Signals stream completion. The connection closes shortly after.
{
  "type": "done",
  "duration_ms": 14253
}
| Field | Type | Description |
|---|---|---|
| type | string | Always "done" |
| duration_ms | integer | Total audio duration in milliseconds |
error
Sent if something goes wrong. The connection closes after this message. No further messages follow an error.
{
  "type": "error",
  "error": "Invalid audio_format='xyz'."
}
| Field | Type | Description |
|---|---|---|
| type | string | Always "error" |
| error | string | Human-readable description of the error |
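The four message types can be folded into a small piece of client state. The sketch below shows the replace-not-append rule for partials and how the final utterance supersedes them. `TranscriptState` and `apply_message` are hypothetical names chosen for this example, and ignoring unrecognized message types is an assumption, not documented behavior.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptState:
    partial: str = ""                  # latest rolling transcript
    final: Optional[str] = None        # set once by the "utterance" message
    duration_ms: Optional[int] = None  # set by the "done" message
    error: Optional[str] = None        # set by the "error" message
    finished: bool = False             # True once no more messages are expected


def apply_message(state: TranscriptState, raw: str) -> TranscriptState:
    """Fold one server message into the state; unknown types are ignored."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "partial_utterance":
        # Each partial is the full transcript so far: overwrite, never concatenate.
        state.partial = msg["partial_utterance"]["text"]
    elif kind == "utterance":
        state.final = msg["utterance"]["text"]
        state.partial = ""  # the final transcript supersedes all partials
    elif kind == "done":
        state.duration_ms = msg["duration_ms"]
        state.finished = True
    elif kind == "error":
        state.error = msg["error"]
        state.finished = True
    return state


state = TranscriptState()
for raw in (
    '{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how", "is_final": false}}',
    '{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you", "is_final": false}}',
    '{"type": "utterance", "utterance": {"text": "Hello, how are you doing today?", "is_final": true}}',
    '{"type": "done", "duration_ms": 14253}',
):
    apply_message(state, raw)
print(state.final, state.duration_ms)
```

Keeping parsing separate from the socket loop means the same function can be exercised with recorded messages, without a live connection.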
WebSocket close codes
| Code | Meaning |
|---|---|
| 1003 | Invalid query parameters: missing audio_format, unsupported value, invalid sample_rate, etc. |
| 4001 | Invalid API key |
| 4002 | Audio bytes did not match the declared raw PCM format |
| 4003 | Model access not enabled for your organization, or usage denied |
| 4029 | Rate limit exceeded: monthly audio-hours limit or concurrent connection limit reached |
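A client can turn the table above into a lookup that decides what to do after a close. The sketch below is one possible classification. The `retry` and `fix` categories are an assumption of this example (only 4029 is treated as retryable without changing the request), and `classify_close` is a hypothetical helper.

```python
from typing import Tuple

CLOSE_CODES = {
    1003: "Invalid query parameters",
    4001: "Invalid API key",
    4002: "Audio bytes did not match the declared raw PCM format",
    4003: "Model access not enabled for your organization, or usage denied",
    4029: "Rate limit exceeded",
}

# Assumption: only rate limiting can succeed on retry with an unchanged request.
RETRYABLE = {4029}


def classify_close(code: int) -> Tuple[str, str]:
    """Return (category, meaning); category is 'retry', 'fix', or 'unlisted'."""
    if code not in CLOSE_CODES:
        return ("unlisted", "Close code not listed in the table above")
    category = "retry" if code in RETRYABLE else "fix"
    return (category, CLOSE_CODES[code])


for code in (1003, 4029, 1000):
    print(code, classify_close(code))
```

Codes outside the table are reported as unlisted rather than guessed at, since this page does not say which code accompanies a normal close.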
Rate limits
- Concurrent connection limits apply per organization.
- Monthly usage limits (in audio hours) apply per organization.
- Connections that exceed limits are rejected during the WebSocket handshake with close code 4029.
See Authentication and rate limits for retry guidance.
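For the concurrent-connection case, a common client-side pattern is capped exponential backoff with jitter before reconnecting. The sketch below is a generic version of that technique and is not this service's official retry guidance. The `backoff_delays` helper and its default base and cap are assumptions. Waiting does not help when the monthly audio-hours limit is the cause, so a client should stop retrying after a few attempts.

```python
import random
from typing import Iterator, Optional


def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 60.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one capped exponential delay with full jitter per retry attempt."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The ceiling doubles each attempt until it reaches the cap.
        ceiling = min(cap_s, base_s * 2 ** attempt)
        # Full jitter: pick uniformly in [0, ceiling] to spread out reconnects.
        yield rng.uniform(0.0, ceiling)


for delay in backoff_delays(5, rng=random.Random(0)):
    print(f"wait {delay:.2f}s before reconnecting")
```

A caller would sleep for each delay in turn (for example with `asyncio.sleep`) and give up once the generator is exhausted.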
Examples
Python (websockets)
import asyncio
import json
import os

import websockets

API_KEY = os.environ["MODULATE_API_KEY"]
AUDIO_FILE = "recording.ogg"
CHUNK_SIZE = 8192


async def transcribe():
    url = (
        "wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2"
        f"?api_key={API_KEY}&audio_format=ogg"
    )
    async with websockets.connect(url, max_size=None) as ws:

        async def send_audio():
            with open(AUDIO_FILE, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
            await ws.send("")  # signal end-of-stream

        send_task = asyncio.create_task(send_audio())
        try:
            async for msg in ws:
                data = json.loads(msg)
                if data["type"] == "partial_utterance":
                    # Replace any previously displayed partial (not a delta)
                    print(f"\r[partial] {data['partial_utterance']['text']}", end="", flush=True)
                elif data["type"] == "utterance":
                    print(f"\n[final] {data['utterance']['text']}")
                elif data["type"] == "done":
                    print(f"\nDone. Duration: {data['duration_ms']}ms")
                    break
                elif data["type"] == "error":
                    print(f"\nError: {data['error']}")
                    break
        finally:
            if not send_task.done():
                send_task.cancel()


asyncio.run(transcribe())
JavaScript (Node.js)
const WebSocket = require("ws");
const fs = require("fs");

const API_KEY = process.env.MODULATE_API_KEY;
const AUDIO_FILE = "recording.ogg";
const CHUNK_SIZE = 8192;

const url = new URL("wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2");
url.searchParams.set("api_key", API_KEY);
url.searchParams.set("audio_format", "ogg");

const ws = new WebSocket(url.toString());

ws.on("open", () => {
  const stream = fs.createReadStream(AUDIO_FILE, { highWaterMark: CHUNK_SIZE });
  stream.on("data", (chunk) => ws.send(chunk));
  stream.on("end", () => ws.send("")); // signal end-of-stream
});

ws.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  if (msg.type === "partial_utterance") {
    // Each partial is the full transcript so far: replace, don't append
    process.stdout.write(`\r[partial] ${msg.partial_utterance.text}`);
  } else if (msg.type === "utterance") {
    console.log(`\n[final] ${msg.utterance.text}`);
  } else if (msg.type === "done") {
    console.log(`\nDone. Duration: ${msg.duration_ms}ms`);
    ws.close();
  } else if (msg.type === "error") {
    console.error("\nError:", msg.error);
    ws.close();
  }
});

ws.on("error", (err) => console.error("WebSocket error:", err.message));
ws.on("close", (code) => console.log(`Connection closed: ${code}`));
WebSocket APIs cannot be tested with cURL. For command-line testing, use websocat.