Real-time speech-to-text over WebSocket. Streams audio to the server and receives transcribed utterances as they are processed, with optional speaker diarization, emotion detection, accent detection, and PII/PHI tagging.

Endpoint

wss://platform.modulate.ai/api/velma-2-stt-streaming

Authentication

Pass your API key as a query parameter when opening the connection.
wss://platform.modulate.ai/api/velma-2-stt-streaming?api_key=YOUR_API_KEY
Unlike the batch endpoints, the streaming API does not use an X-API-Key header. The key must be in the query string at connection time.
See Authentication and rate limits for how to obtain and manage API keys.

Supported audio formats

Self-describing container formats are auto-detected from their headers, so no audio_format query parameter is needed. Raw (headerless) formats require audio_format, sample_rate, and num_channels. For the authoritative list of accepted values, see the spec’s audio_format enum or Audio formats and preprocessing. Opus is recommended when you control the encoder, because it delivers high quality at low bandwidth.

Query parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | string | | Required. Your API key |
| speaker_diarization | boolean | true | Identify and label distinct speakers |
| emotion_signal | boolean | false | Detect emotional tone per utterance |
| accent_signal | boolean | false | Detect speaker accent per utterance |
| deepfake_signal | boolean | false | Per-utterance synthetic-voice (deepfake) score |
| pii_phi_tagging | boolean | false | Wrap PII/PHI in tags within utterance text |
| partial_results | boolean | false | Stream interim partial_utterance messages with in-progress text before each utterance is finalized |
For a full explanation of what each feature does and when to enable it, see STT enrichment features.
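
The connection URL can be assembled from these parameters programmatically. The following is a minimal sketch: `build_streaming_url` is a hypothetical helper, not part of any official SDK, and it assumes boolean flags are sent as lowercase `true` / `false`, as in the example at the end of this page.

```python
from urllib.parse import urlencode

BASE_URL = "wss://platform.modulate.ai/api/velma-2-stt-streaming"

# Boolean feature flags listed in the query parameter table.
FEATURE_FLAGS = (
    "speaker_diarization",
    "emotion_signal",
    "accent_signal",
    "deepfake_signal",
    "pii_phi_tagging",
    "partial_results",
)


def build_streaming_url(api_key: str, **features: bool) -> str:
    """Hypothetical helper: build the WebSocket URL with optional feature flags."""
    unknown = set(features) - set(FEATURE_FLAGS)
    if unknown:
        raise ValueError(f"Unknown feature flags: {sorted(unknown)}")
    params = {"api_key": api_key}
    for name, enabled in features.items():
        # Assumption: booleans are serialized as lowercase "true" / "false".
        params[name] = "true" if enabled else "false"
    # urlencode percent-escapes any reserved characters in the API key.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_streaming_url("YOUR_API_KEY", emotion_signal=True, partial_results=True)
print(url)
```

Flags that are not passed are left out of the URL, so the server-side defaults from the table apply.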

Connection flow

  1. Connect to the WebSocket endpoint with api_key and any optional feature parameters.
  2. Stream raw audio as binary WebSocket frames. Frames can be any size.
  3. Receive utterance JSON messages as speech is transcribed. If partial_results=true, also receive partial_utterance previews for the currently active utterance.
  4. Send an empty text frame ("") to signal end of audio.
  5. Receive a done message containing total audio duration.
  6. The connection closes automatically.

Server messages

utterance

Sent each time a speech segment is transcribed.
{
  "type": "utterance",
  "utterance": {
    "utterance_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "text": "Hello, how are you today?",
    "start_ms": 0,
    "duration_ms": 2500,
    "speaker": 1,
    "language": "en",
    "emotion": "Neutral",
    "accent": "American",
    "deepfake_score": null
  }
}

Utterance fields

| Field | Type | Description |
| --- | --- | --- |
| utterance_uuid | string (UUID) | Unique identifier for this utterance |
| text | string | Transcribed text |
| start_ms | integer | Start time in milliseconds from the beginning of the stream |
| duration_ms | integer | Duration of the utterance in milliseconds |
| speaker | integer | Speaker number, 1-indexed |
| language | string | Detected language code (e.g. "en", "fr") |
| emotion | string or null | Detected emotion. null when emotion_signal is disabled |
| accent | string or null | Detected accent. null when accent_signal is disabled |
| deepfake_score | float or null | Synthetic-voice score from 0.0 (likely natural) to 1.0 (likely synthetic). null when deepfake_signal is disabled or when the utterance is shorter than 0.5 seconds |
For all valid emotion and accent values, see STT enrichment features.
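
On the client side, an utterance message can be mapped onto a typed structure. The sketch below mirrors the fields documented above; the `Utterance` class, its `end_ms` property, and `parse_utterance` are illustrative names, not part of the API. Ignoring unrecognized keys is a design assumption meant to tolerate fields added later.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Utterance:
    utterance_uuid: str
    text: str
    start_ms: int
    duration_ms: int
    speaker: int
    language: str
    emotion: Optional[str] = None          # null when emotion_signal is disabled
    accent: Optional[str] = None           # null when accent_signal is disabled
    deepfake_score: Optional[float] = None  # null when disabled or utterance < 0.5 s

    @property
    def end_ms(self) -> int:
        # Convenience value derived from the two documented timing fields.
        return self.start_ms + self.duration_ms


def parse_utterance(raw: str) -> Utterance:
    """Parse one server message of type "utterance" into an Utterance."""
    message = json.loads(raw)
    if message.get("type") != "utterance":
        raise ValueError(f"Not an utterance message: {message.get('type')!r}")
    known = {f.name for f in fields(Utterance)}
    # Drop keys this sketch does not know about instead of failing on them.
    payload = {k: v for k, v in message["utterance"].items() if k in known}
    return Utterance(**payload)


sample = json.dumps({
    "type": "utterance",
    "utterance": {
        "utterance_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "text": "Hello, how are you today?",
        "start_ms": 0,
        "duration_ms": 2500,
        "speaker": 1,
        "language": "en",
        "emotion": "Neutral",
        "accent": "American",
        "deepfake_score": None,
    },
})
utterance = parse_utterance(sample)
print(utterance.speaker, utterance.end_ms, utterance.text)
```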

partial_utterance

Sent only when partial_results=true. Delivers in-progress text for the currently active utterance as a low-latency preview. Each partial_utterance supersedes the previous one for the same utterance; the finalized utterance message supersedes all preceding partials. Partials are fast previews and do not carry emotion, accent, deepfake, or PII/PHI fields.
{
  "type": "partial_utterance",
  "partial_utterance": {
    "text": "Hello, how are",
    "start_ms": 0,
    "speaker": 1
  }
}

Partial utterance fields

| Field | Type | Description |
| --- | --- | --- |
| text | string | In-progress transcribed text. May be empty |
| start_ms | integer or null | Start time in milliseconds from the beginning of the stream. null if timing data is not yet available |
| speaker | integer or null | Speaker number, 1-indexed. null if the speaker has not yet been identified |
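
The superseding rule for partials can be sketched as a small client-side buffer. `TranscriptBuffer` is a hypothetical class for illustration. Because partials carry no utterance_uuid, the sketch assumes only one utterance is in progress at a time and that the next finalized utterance replaces whatever partial is pending.

```python
import json
from typing import List


class TranscriptBuffer:
    """Tracks finalized utterance text plus the latest in-progress partial."""

    def __init__(self) -> None:
        self.final_texts: List[str] = []
        self.pending: str = ""

    def handle(self, raw: str) -> None:
        message = json.loads(raw)
        kind = message.get("type")
        if kind == "partial_utterance":
            # Each partial supersedes the previous one for the active utterance.
            self.pending = message["partial_utterance"]["text"]
        elif kind == "utterance":
            # The finalized utterance supersedes all preceding partials.
            self.final_texts.append(message["utterance"]["text"])
            self.pending = ""

    def display(self) -> str:
        parts = self.final_texts + ([self.pending] if self.pending else [])
        return " ".join(parts)


buffer = TranscriptBuffer()
buffer.handle(json.dumps({
    "type": "partial_utterance",
    "partial_utterance": {"text": "Hello, how are", "start_ms": 0, "speaker": 1},
}))
print(buffer.display())
buffer.handle(json.dumps({
    "type": "utterance",
    "utterance": {"text": "Hello, how are you today?"},
}))
print(buffer.display())
```

The final utterance in this usage snippet is trimmed to the `text` field for brevity; real messages carry the full set of utterance fields.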

done

Sent after all audio has been processed, in response to the end-of-stream signal.
{
  "type": "done",
  "duration_ms": 45000
}

error

Sent if transcription fails. The connection closes after this message.
{
  "type": "error",
  "error": "Internal server error"
}

WebSocket close codes

| Code | Meaning |
| --- | --- |
| 1000 | Normal closure after a successful done message |
| 1003 | Invalid connection parameters (unsupported audio_format, sample_rate, or num_channels; raw format missing sample_rate/num_channels) |
| 4003 | Request could not be validated, or is not permitted (auth failure, missing model access) |
| 4029 | Insufficient credits, or concurrent-connection limit exceeded |
An error JSON message is sent before the connection closes (except on 1000).
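
A client can translate close codes into readable diagnostics with a simple lookup. This is a sketch only: `CLOSE_CODES`, `describe_close`, and `closed_normally` are illustrative names, and the descriptions are copied from the list of close codes above.

```python
CLOSE_CODES = {
    1000: "Normal closure after a successful done message",
    1003: "Invalid connection parameters",
    4003: "Request could not be validated, or is not permitted",
    4029: "Insufficient credits, or concurrent-connection limit exceeded",
}


def describe_close(code: int) -> str:
    """Return the documented meaning of a close code, or a fallback string."""
    return CLOSE_CODES.get(code, f"Unrecognized close code {code}")


def closed_normally(code: int) -> bool:
    # Only 1000 indicates that the stream finished with a done message.
    return code == 1000


for code in (1000, 4029, 1011):
    print(code, closed_normally(code), describe_close(code))
```

With aiohttp, the close code is available as `ws.close_code` once the connection has closed.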

Rate limits

  • Concurrent connection limits apply per organization.
  • Monthly usage limits (in audio hours) apply per organization.
  • Connections that exceed limits are rejected during the WebSocket handshake with close code 4029.
See Authentication and rate limits for retry guidance.

Examples

import asyncio
import json
import aiohttp

API_KEY = "YOUR_API_KEY"
AUDIO_FILE = "recording.opus"
CHUNK_SIZE = 8192

async def transcribe_streaming():
    # The API key and feature flags are passed in the query string.
    url = (
        "wss://platform.modulate.ai/api/velma-2-stt-streaming"
        f"?api_key={API_KEY}"
        "&speaker_diarization=true"
        "&emotion_signal=true"
        "&accent_signal=true"
    )

    utterances = []

    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(url) as ws:

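            async def send_audio():
                # Stream the file as binary frames, then signal end of audio.
                with open(AUDIO_FILE, "rb") as f:
                    while chunk := f.read(CHUNK_SIZE):
                        await ws.send_bytes(chunk)
                        # Throttle sending to 4000 bytes per second, presumably
                        # to approximate real-time playback of ~32 kbps audio.
                        await asyncio.sleep(CHUNK_SIZE / 4000)
                # An empty text frame marks the end of the stream.
                await ws.send_str("")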

            send_task = asyncio.create_task(send_audio())

            try:
                async for msg in ws:
                    if msg.type == aiohttp.WSMsgType.TEXT:
                        data = json.loads(msg.data)
                        if data["type"] == "utterance":
                            u = data["utterance"]
                            utterances.append(u)
                            print(f"[Speaker {u['speaker']}] {u['text']}")
                        elif data["type"] == "done":
                            print(f"Done. Duration: {data['duration_ms']}ms")
                            break
                        elif data["type"] == "error":
                            print(f"Error: {data['error']}")
                            break
                    elif msg.type in (
                        aiohttp.WSMsgType.ERROR,
                        aiohttp.WSMsgType.CLOSE,
                        aiohttp.WSMsgType.CLOSED,
                    ):
                        break
            finally:
                if not send_task.done():
                    send_task.cancel()

    full_text = " ".join(u["text"] for u in utterances)
    print(f"\nFull transcript:\n{full_text}")

asyncio.run(transcribe_streaming())
WebSocket APIs cannot be tested with cURL. For command-line testing, use websocat.