Speech-to-Text Streaming English

Low-latency English speech-to-text over WebSocket. Emits a rolling partial transcript every ~1.5 seconds while audio streams in, then delivers one complete final transcript at end-of-stream. Pure transcription — no speaker diarization, emotion detection, accent detection, or PII/PHI tagging.

Endpoint

wss://platform.modulate.ai/api/velma-2-stt-streaming-english-v2

Authentication

Pass your API key as a query parameter when opening the connection.

wss://platform.modulate.ai/api/velma-2-stt-streaming-english-v2?api_key=YOUR_API_KEY&audio_format=ogg

Unlike the batch endpoints, this API does not use an X-API-Key header. The key must be in the query string at connection time.

See Authentication and rate limits for how to obtain and manage API keys.

Supported audio formats

Container formats (sample_rate and num_channels are not required): wav, mp3, ogg, flac, webm, aac, aiff

Opus audio is typically shipped as ogg (Opus-in-Ogg). Pass audio_format=ogg for Opus streams.

Raw PCM formats (sample_rate and num_channels are required): s8, s16le, s16be, s24le, s24be, s32le, s32be, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw

Valid sample rates: 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000

For lowest end-to-end latency, send audio_format=s16le&sample_rate=16000&num_channels=1. This matches the model’s native input format and bypasses the server’s audio decoder entirely.
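As a rough illustration of the raw PCM arithmetic, the sketch below computes how many bytes a chunk of a given duration occupies. The helper name pcm_chunk_bytes and the bytes-per-sample table are assumptions made for this example (24-bit formats are assumed to be packed into 3 bytes); neither is part of the API.

```python
# Assumed bytes per sample for each raw PCM format name listed above.
# 24-bit formats are assumed to be packed (3 bytes per sample).
BYTES_PER_SAMPLE = {
    "s8": 1, "u8": 1, "mulaw": 1, "alaw": 1,
    "s16le": 2, "s16be": 2, "u16le": 2, "u16be": 2,
    "s24le": 3, "s24be": 3, "u24le": 3, "u24be": 3,
    "s32le": 4, "s32be": 4, "u32le": 4, "u32be": 4,
    "f32le": 4, "f32be": 4,
    "f64le": 8, "f64be": 8,
}


def pcm_chunk_bytes(audio_format, sample_rate, num_channels, chunk_ms):
    """Return the size in bytes of chunk_ms milliseconds of raw PCM audio."""
    # One frame holds one sample for every channel.
    frame_bytes = BYTES_PER_SAMPLE[audio_format] * num_channels
    frames = sample_rate * chunk_ms // 1000
    return frames * frame_bytes


# 250 ms of 16 kHz mono s16le audio
print(pcm_chunk_bytes("s16le", 16000, 1, 250))  # 8000
```

Sizing chunks in whole frames keeps every WebSocket frame aligned to a sample boundary, which is a reasonable precaution when streaming raw PCM.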

Query parameters

Parameter	Required	Description
`api_key`	Yes	Your API key
`audio_format`	Yes	Container or raw PCM format of the audio you will stream. See supported formats above.
`sample_rate`	Raw PCM only	Source sample rate in Hz. Required when `audio_format` is a raw PCM type; ignored for container formats.
`num_channels`	Raw PCM only	Number of audio channels (1–8). Required when `audio_format` is a raw PCM type; ignored for container formats. The server downmixes to mono internally.

This endpoint accepts no feature toggles. Diarization, emotion, accent, and PII parameters are not recognized and have no effect. Use STT Streaming if you need those features.
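The parameter rules above can be sketched as a small URL builder that checks the raw PCM requirements on the client before connecting. The function name build_stream_url and its client-side validation are illustrative assumptions; the server remains the authority on what it accepts.

```python
from urllib.parse import urlencode

BASE_URL = "wss://platform.modulate.ai/api/velma-2-stt-streaming-english-v2"

CONTAINER_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "aac", "aiff"}
RAW_PCM_FORMATS = {
    "s8", "s16le", "s16be", "s24le", "s24be", "s32le", "s32be",
    "u8", "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
}
VALID_SAMPLE_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000}


def build_stream_url(api_key, audio_format, sample_rate=None, num_channels=None):
    """Build the connection URL (hypothetical helper, not part of the API)."""
    params = {"api_key": api_key, "audio_format": audio_format}
    if audio_format in RAW_PCM_FORMATS:
        # Raw PCM carries no header, so both values must be supplied.
        if sample_rate not in VALID_SAMPLE_RATES:
            raise ValueError(f"invalid sample_rate: {sample_rate!r}")
        if num_channels is None or not 1 <= num_channels <= 8:
            raise ValueError(f"invalid num_channels: {num_channels!r}")
        params["sample_rate"] = sample_rate
        params["num_channels"] = num_channels
    elif audio_format not in CONTAINER_FORMATS:
        raise ValueError(f"unsupported audio_format: {audio_format!r}")
    # urlencode percent-escapes the key and keeps the insertion order.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_stream_url("YOUR_API_KEY", "s16le", sample_rate=16000, num_channels=1))
```

For container formats the sketch leaves out sample_rate and num_channels entirely, since the endpoint ignores them in that case.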

Connection flow

1. Connect to the WebSocket endpoint with api_key, audio_format, and (for raw PCM) sample_rate and num_channels.
2. Stream audio as binary WebSocket frames. Frames can be any size; 4–64 KB is typical.
3. Receive partial_utterance JSON messages every ~1.5 seconds. Each contains the complete transcript so far, so replace any previously displayed partial rather than appending to it.
4. Send an empty text frame ("") to signal end of audio.
5. Receive one final utterance message with the complete transcript.
6. Receive a done message with total audio duration.
7. The connection closes automatically.
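The sending side of this flow can be sketched as a generator that yields binary chunks followed by the empty text frame. The name iter_frames is hypothetical. The sketch assumes a client library that sends bytes objects as binary frames and str objects as text frames, as the Python websockets package does.

```python
def iter_frames(audio: bytes, chunk_size: int = 8192):
    """Yield binary audio chunks, then the empty string that ends the stream."""
    for offset in range(0, len(audio), chunk_size):
        # bytes objects are sent as binary WebSocket frames
        yield audio[offset:offset + chunk_size]
    # an empty str is sent as an empty text frame: the end-of-audio signal
    yield ""


frames = list(iter_frames(b"\x00" * 20000))
print([len(f) for f in frames])  # [8192, 8192, 3616, 0]
```

With an open connection ws, each yielded item would be passed to ws.send in order. Keeping the end-of-stream marker in the same generator makes it harder to forget.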

Server messages

`partial_utterance`

Sent roughly every 1.5 seconds while audio is streaming. Each message contains the complete transcript built so far, not a delta from the previous message. Replace your displayed partial text with each new value — never append.

{
  "type": "partial_utterance",
  "partial_utterance": {
    "text": "Hello, how are you",
    "is_final": false
  }
}

Field	Type	Description
`type`	string	Always `"partial_utterance"`
`partial_utterance.text`	string	Complete transcript so far. Replace, don’t append.
`partial_utterance.is_final`	boolean	Always `false`
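A minimal sketch of the replace-not-append rule, using hand-written sample messages rather than real server output. The function name latest_partial is an assumption for illustration.

```python
import json


def latest_partial(raw_messages):
    """Return the text a UI should show after a run of partial messages."""
    displayed = ""
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg.get("type") == "partial_utterance":
            # Overwrite: each partial already holds the whole transcript so far.
            displayed = msg["partial_utterance"]["text"]
    return displayed


stream = [
    '{"type": "partial_utterance", "partial_utterance": {"text": "Hello", "is_final": false}}',
    '{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you", "is_final": false}}',
]
print(latest_partial(stream))  # Hello, how are you
```

Appending instead of overwriting would have produced "HelloHello, how are you", because the second message repeats the text of the first.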

`utterance`

Sent exactly once, after the client signals end-of-stream. Contains the final transcript covering the entire audio stream. Supersedes all preceding partial_utterance messages.

{
  "type": "utterance",
  "utterance": {
    "text": "Hello, how are you doing today?",
    "is_final": true
  }
}

Field	Type	Description
`type`	string	Always `"utterance"`
`utterance.text`	string	Final transcript. Will not be revised.
`utterance.is_final`	boolean	Always `true`

`done`

Sent immediately after the final utterance. Signals stream completion. The connection closes shortly after.

{
  "type": "done",
  "duration_ms": 14253
}

Field	Type	Description
`type`	string	Always `"done"`
`duration_ms`	integer	Total audio duration in milliseconds

`error`

Sent if something goes wrong. The connection closes after this message. No further messages follow an error.

{
  "type": "error",
  "error": "Invalid audio_format='xyz'."
}

Field	Type	Description
`type`	string	Always `"error"`
`error`	string	Human-readable description of the error
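The four message types can be handled by one dispatcher. The sketch below folds a sequence of raw JSON messages into a result object; StreamResult, StreamError, and collect_result are hypothetical names, and the sample messages are the ones shown in this section rather than captured traffic.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


class StreamError(Exception):
    """Raised when the server sends an error message."""


@dataclass
class StreamResult:
    text: Optional[str] = None         # final transcript from "utterance"
    duration_ms: Optional[int] = None  # total audio duration from "done"
    partial_count: int = 0             # number of partials seen


def collect_result(raw_messages: Iterable[str]) -> StreamResult:
    result = StreamResult()
    for raw in raw_messages:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "partial_utterance":
            result.partial_count += 1
        elif kind == "utterance":
            result.text = msg["utterance"]["text"]
        elif kind == "done":
            result.duration_ms = msg["duration_ms"]
            break  # nothing follows "done"
        elif kind == "error":
            raise StreamError(msg["error"])
    return result


messages = [
    '{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you", "is_final": false}}',
    '{"type": "utterance", "utterance": {"text": "Hello, how are you doing today?", "is_final": true}}',
    '{"type": "done", "duration_ms": 14253}',
]
print(collect_result(messages))
```

Unknown message types are ignored here so that the sketch keeps working if new types appear. Whether that is the right policy for a production client is a design choice.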

WebSocket close codes

Code	Meaning
`1003`	Invalid query parameters — missing `audio_format`, unsupported value, invalid `sample_rate`, etc.
`4001`	Invalid API key
`4002`	Audio bytes did not match the declared raw PCM format
`4003`	Model access not enabled for your organization, or usage denied
`4029`	Rate limit exceeded — monthly audio-hours limit or concurrent connection limit reached
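For logging or user-facing errors, the table above can be kept as a lookup. This is only a sketch: describe_close is a hypothetical helper, and any code not listed in the table (for example a normal closure) falls through to a generic message.

```python
# Close codes documented for this endpoint, with shortened descriptions.
CLOSE_CODES = {
    1003: "Invalid query parameters",
    4001: "Invalid API key",
    4002: "Audio bytes did not match the declared raw PCM format",
    4003: "Model access not enabled for your organization, or usage denied",
    4029: "Rate limit exceeded",
}


def describe_close(code: int) -> str:
    """Map a WebSocket close code to a short description."""
    return CLOSE_CODES.get(code, f"Unlisted close code {code}")


print(describe_close(4002))
```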

Rate limits

Concurrent connection limits apply per organization.
Monthly usage limits (in audio hours) apply per organization.
Connections that exceed limits are rejected during the WebSocket handshake with close code 4029.

See Authentication and rate limits for retry guidance.
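One common way to space out reconnection attempts after a 4029 close is capped exponential backoff. The sketch below is a generic pattern, not the official retry guidance referenced above. The function name and default values are assumptions, and retrying helps only with the concurrent-connection limit, not with an exhausted monthly quota.

```python
def backoff_delays(attempts, base=1.0, factor=2.0, cap=30.0):
    """Return wait times in seconds for successive reconnection attempts."""
    # Delay grows geometrically and is clamped at cap.
    return [min(cap, base * factor ** i) for i in range(attempts)]


print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

In practice a small random jitter is often added to each delay so that many clients do not reconnect at the same moment.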

Examples

Python (websockets)

import asyncio
import json
import os
import websockets

API_KEY = os.environ["MODULATE_API_KEY"]
AUDIO_FILE = "recording.ogg"
CHUNK_SIZE = 8192

async def transcribe():
    url = (
        f"wss://platform.modulate.ai/api/velma-2-stt-streaming-english-v2"
        f"?api_key={API_KEY}&audio_format=ogg"
    )

    async with websockets.connect(url, max_size=None) as ws:
        async def send_audio():
            with open(AUDIO_FILE, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
            await ws.send("")  # signal end-of-stream

        send_task = asyncio.create_task(send_audio())

        try:
            async for msg in ws:
                data = json.loads(msg)
                if data["type"] == "partial_utterance":
                    # Replace any previously displayed partial (not a delta)
                    print(f"\r[partial] {data['partial_utterance']['text']}", end="", flush=True)
                elif data["type"] == "utterance":
                    print(f"\n[final]   {data['utterance']['text']}")
                elif data["type"] == "done":
                    print(f"\nDone. Duration: {data['duration_ms']}ms")
                    break
                elif data["type"] == "error":
                    print(f"\nError: {data['error']}")
                    break
        finally:
            if not send_task.done():
                send_task.cancel()

asyncio.run(transcribe())

JavaScript (Node.js)

const WebSocket = require("ws");
const fs = require("fs");

const API_KEY = process.env.MODULATE_API_KEY;
const AUDIO_FILE = "recording.ogg";
const CHUNK_SIZE = 8192;

const url = new URL("wss://platform.modulate.ai/api/velma-2-stt-streaming-english-v2");
url.searchParams.set("api_key", API_KEY);
url.searchParams.set("audio_format", "ogg");

const ws = new WebSocket(url.toString());

ws.on("open", () => {
  const stream = fs.createReadStream(AUDIO_FILE, { highWaterMark: CHUNK_SIZE });
  stream.on("data", (chunk) => ws.send(chunk));
  stream.on("end", () => ws.send(""));  // signal end-of-stream
});

ws.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  if (msg.type === "partial_utterance") {
    // Each partial is the full transcript so far — replace, don't append
    process.stdout.write(`\r[partial] ${msg.partial_utterance.text}`);
  } else if (msg.type === "utterance") {
    console.log(`\n[final]   ${msg.utterance.text}`);
  } else if (msg.type === "done") {
    console.log(`\nDone. Duration: ${msg.duration_ms}ms`);
    ws.close();
  } else if (msg.type === "error") {
    console.error("\nError:", msg.error);
    ws.close();
  }
});

ws.on("error", (err) => console.error("WebSocket error:", err.message));
ws.on("close", (code) => console.log(`Connection closed: ${code}`));

WebSocket APIs cannot be tested with cURL. For command-line testing, use websocat.

Related

Which API should I use? (when Streaming v2 is the right choice vs. STT Streaming or batch)
STT Streaming (multilingual streaming with speaker diarization and enrichments)
Authentication and rate limits
