Low-latency English speech-to-text over WebSocket. Emits a rolling partial transcript every ~1.5 seconds while audio streams in, then delivers one complete final transcript at end-of-stream. Pure transcription — no speaker diarization, emotion detection, accent detection, or PII/PHI tagging.

Endpoint

wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2

Authentication

Pass your API key as a query parameter when opening the connection.
wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2?api_key=YOUR_API_KEY&audio_format=ogg
Unlike the batch endpoints, this API does not use an X-API-Key header. The key must be in the query string at connection time.
See Authentication and rate limits for how to obtain and manage API keys.

Supported audio formats

Container formats (sample_rate and num_channels are not required): wav, mp3, ogg, flac, webm, aac, aiff
Opus audio is typically shipped as ogg (Opus-in-Ogg). Pass audio_format=ogg for Opus streams.
Raw PCM formats (sample_rate and num_channels are required): s8, s16le, s16be, s24le, s24be, s32le, s32be, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw
Valid sample rates: 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000
For lowest end-to-end latency, send audio_format=s16le&sample_rate=16000&num_channels=1. This matches the model’s native input format and bypasses the server’s audio decoder entirely.
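The format rules above can be checked on the client before opening a connection. The following is a minimal sketch, not part of the API: the function name check_audio_params and the returned problem strings are hypothetical, and the 1 to 8 channel range is taken from the query parameter table.

```python
# Format lists copied from the documentation above.
CONTAINER_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "aac", "aiff"}
RAW_PCM_FORMATS = {
    "s8", "s16le", "s16be", "s24le", "s24be", "s32le", "s32be",
    "u8", "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
}
VALID_SAMPLE_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000}


def check_audio_params(audio_format, sample_rate=None, num_channels=None):
    """Hypothetical helper: return a list of problems (empty means OK)."""
    if audio_format in CONTAINER_FORMATS:
        # Containers carry their own header; the extra parameters are not needed.
        return []
    if audio_format not in RAW_PCM_FORMATS:
        return [f"unsupported audio_format: {audio_format!r}"]
    problems = []
    if sample_rate not in VALID_SAMPLE_RATES:
        problems.append(f"invalid or missing sample_rate: {sample_rate!r}")
    if num_channels is None or not 1 <= num_channels <= 8:
        problems.append(f"invalid or missing num_channels: {num_channels!r}")
    return problems
```

For example, check_audio_params("s16le") reports two problems because raw PCM needs both sample_rate and num_channels, while check_audio_params("ogg") reports none.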

Query parameters

Parameter | Required | Description
api_key | Yes | Your API key
audio_format | Yes | Container or raw PCM format of the audio you will stream. See supported formats above.
sample_rate | Raw PCM only | Source sample rate in Hz. Required when audio_format is a raw PCM type; ignored for container formats.
num_channels | Raw PCM only | Number of audio channels (1–8). Required when audio_format is a raw PCM type; ignored for container formats. The server downmixes to mono internally.
This endpoint accepts no feature toggles. Diarization, emotion, accent, and PII parameters are not recognized and have no effect. Use STT Streaming if you need those features.
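One way to assemble the connection URL from these parameters is sketched below. The helper name build_url and the ENDPOINT constant are illustrative assumptions; only the endpoint address and the four parameter names come from this page. urlencode is used so that any special characters in the key are percent-encoded.

```python
from urllib.parse import urlencode

ENDPOINT = "wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2"


def build_url(api_key, audio_format, sample_rate=None, num_channels=None):
    """Hypothetical helper: build the WebSocket URL with its query string."""
    params = {"api_key": api_key, "audio_format": audio_format}
    # sample_rate and num_channels are only needed for raw PCM formats.
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    if num_channels is not None:
        params["num_channels"] = num_channels
    return f"{ENDPOINT}?{urlencode(params)}"


print(build_url("YOUR_API_KEY", "s16le", sample_rate=16000, num_channels=1))
```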

Connection flow

  1. Connect to the WebSocket endpoint with api_key, audio_format, and (for raw PCM) sample_rate and num_channels.
  2. Stream audio as binary WebSocket frames. Frames can be any size; 4–64 KB is typical.
  3. Receive partial_utterance JSON messages every ~1.5 seconds. Each contains the complete transcript so far — replace any previously displayed partial, do not append.
  4. Send an empty text frame ("") to signal end of audio.
  5. Receive one final utterance message with the complete transcript.
  6. Receive a done message with total audio duration.
  7. The connection closes automatically.
Connection limits:
  • Maximum connection lifetime: 30 minutes per WebSocket.
  • Idle timeout: 60 seconds without audio frames. The server finalizes the transcript and closes.
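Steps 2 and 4 of the flow can be sketched as a generator that yields binary frames followed by the empty text frame. This is an illustration under assumptions: the names bytes_per_sample, chunk_bytes, and iter_frames are hypothetical, and sizing frames by duration is a convenience of this sketch, not a requirement of the API (frames can be any size).

```python
def bytes_per_sample(audio_format):
    """Width of one raw PCM sample, derived from the format name."""
    if audio_format in ("mulaw", "alaw"):
        return 1  # 8-bit companded formats
    bits = int("".join(ch for ch in audio_format if ch.isdigit()))
    return bits // 8


def chunk_bytes(audio_format, sample_rate, num_channels, chunk_ms):
    """Number of bytes that hold chunk_ms milliseconds of raw PCM audio."""
    return bytes_per_sample(audio_format) * sample_rate * num_channels * chunk_ms // 1000


def iter_frames(audio, frame_size):
    """Yield binary frames, then the empty text frame that ends the stream."""
    for start in range(0, len(audio), frame_size):
        yield audio[start:start + frame_size]
    yield ""  # a str, so a WebSocket client sends it as a text frame


# 250 ms of s16le mono at 16 kHz is 8000 bytes, within the typical 4-64 KB range.
frame_size = chunk_bytes("s16le", 16000, 1, 250)
```

Each yielded item can be passed directly to a client's send call: bytes objects go out as binary frames and the final empty string as the end-of-audio signal.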

Server messages

partial_utterance

Sent roughly every 1.5 seconds while audio is streaming. Each message contains the complete transcript built so far, not a delta from the previous message. Replace your displayed partial text with each new value — never append.
{
  "type": "partial_utterance",
  "partial_utterance": {
    "text": "Hello, how are you",
    "is_final": false
  }
}
Field | Type | Description
type | string | Always "partial_utterance"
partial_utterance.text | string | Complete transcript so far. Replace, don’t append.
partial_utterance.is_final | boolean | Always false
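The replace-not-append rule is easy to get wrong, so here is a small sketch of a display buffer that overwrites its text on every partial. The class name PartialDisplay is hypothetical; the message shape is the one documented above.

```python
import json


class PartialDisplay:
    """Hypothetical holder for the text currently shown to the user."""

    def __init__(self):
        self.text = ""

    def on_message(self, raw):
        data = json.loads(raw)
        if data.get("type") == "partial_utterance":
            # Each partial is the full transcript so far: overwrite, never append.
            self.text = data["partial_utterance"]["text"]
        return self.text


display = PartialDisplay()
display.on_message('{"type": "partial_utterance", "partial_utterance": {"text": "Hello", "is_final": false}}')
display.on_message('{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you", "is_final": false}}')
print(display.text)
```

Appending instead would have produced "HelloHello, how are you", because the second message already includes the text of the first.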

utterance

Sent exactly once, after the client signals end-of-stream. Contains the final transcript covering the entire audio stream. Supersedes all preceding partial_utterance messages.
{
  "type": "utterance",
  "utterance": {
    "text": "Hello, how are you doing today?",
    "is_final": true
  }
}
Field | Type | Description
type | string | Always "utterance"
utterance.text | string | Final transcript. Will not be revised.
utterance.is_final | boolean | Always true

done

Sent immediately after the final utterance. Signals stream completion. The connection closes shortly after.
{
  "type": "done",
  "duration_ms": 14253
}
Field | Type | Description
type | string | Always "done"
duration_ms | integer | Total audio duration in milliseconds

error

Sent if something goes wrong. The connection closes after this message. No further messages follow an error.
{
  "type": "error",
  "error": "Invalid audio_format='xyz'."
}
Field | Type | Description
type | string | Always "error"
error | string | Human-readable description of the error
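The four message types can be handled by a single parsing function. The sketch below is one possible shape, with a hypothetical name parse_server_message and a hypothetical (kind, payload, is_terminal) return tuple; treating done and error as terminal follows the descriptions above, and the "unknown" fallback is an assumption added for robustness.

```python
import json


def parse_server_message(raw):
    """Hypothetical helper: return (kind, payload, is_terminal) for one message."""
    data = json.loads(raw)
    kind = data.get("type")
    if kind == "partial_utterance":
        return kind, data["partial_utterance"]["text"], False
    if kind == "utterance":
        return kind, data["utterance"]["text"], False
    if kind == "done":
        return kind, data["duration_ms"], True   # stream complete
    if kind == "error":
        return kind, data["error"], True         # no further messages follow
    return "unknown", data, False                # not documented; ignore or log


kind, payload, is_terminal = parse_server_message('{"type": "done", "duration_ms": 14253}')
```

A receive loop can call this on every incoming text frame and stop reading once is_terminal is true.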

WebSocket close codes

Code | Meaning
1003 | Invalid query parameters: missing audio_format, unsupported value, invalid sample_rate, etc.
4001 | Invalid API key
4002 | Audio bytes did not match the declared raw PCM format
4003 | Model access not enabled for your organization, or usage denied
4029 | Rate limit exceeded: monthly audio-hours limit or concurrent connection limit reached
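For logging or user-facing error messages, the close codes can be kept in a lookup table. This is a sketch only: the names CLOSE_CODES and describe_close_code are hypothetical, and how the close code is obtained depends on the WebSocket client in use.

```python
# Meanings copied from the close code table above.
CLOSE_CODES = {
    1003: "Invalid query parameters",
    4001: "Invalid API key",
    4002: "Audio bytes did not match the declared raw PCM format",
    4003: "Model access not enabled for your organization, or usage denied",
    4029: "Rate limit exceeded",
}


def describe_close_code(code):
    """Hypothetical helper: map a close code to its documented meaning."""
    return CLOSE_CODES.get(code, f"Unrecognized close code {code}")


print(describe_close_code(4029))
```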

Rate limits

  • Concurrent connection limits apply per organization.
  • Monthly usage limits (in audio hours) apply per organization.
  • Connections that exceed limits are rejected during the WebSocket handshake with close code 4029.
See Authentication and rate limits for retry guidance.

Examples

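import asyncio
import json
import os
from urllib.parse import urlencode

import websockets

API_KEY = os.environ["MODULATE_API_KEY"]
AUDIO_FILE = "recording.ogg"
CHUNK_SIZE = 8192  # bytes per binary frame

async def transcribe():
    # URL-encode the query string so special characters in the key survive
    query = urlencode({"api_key": API_KEY, "audio_format": "ogg"})
    url = f"wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2?{query}"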

    async with websockets.connect(url, max_size=None) as ws:
        async def send_audio():
            with open(AUDIO_FILE, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
            await ws.send("")  # signal end-of-stream

        send_task = asyncio.create_task(send_audio())

        try:
            async for msg in ws:
                data = json.loads(msg)
                if data["type"] == "partial_utterance":
                    # Replace any previously displayed partial — not a delta
                    print(f"\r[partial] {data['partial_utterance']['text']}", end="", flush=True)
                elif data["type"] == "utterance":
                    print(f"\n[final]   {data['utterance']['text']}")
                elif data["type"] == "done":
                    print(f"\nDone. Duration: {data['duration_ms']}ms")
                    break
                elif data["type"] == "error":
                    print(f"\nError: {data['error']}")
                    break
        finally:
            if not send_task.done():
                send_task.cancel()

asyncio.run(transcribe())
WebSocket APIs cannot be tested with cURL. For command-line testing, use websocat.