Speech-to-Text Transcription Streaming Multilingual

Real-time speech-to-text over WebSocket. Stream audio to the server and receive transcribed utterances as they are processed, with optional speaker diarization, emotion detection, accent detection, and PII/PHI tagging.

Endpoint

wss://modulate-developer-apis.com/api/velma-2-stt-streaming

Authentication

Pass your API key as a query parameter when opening the connection.

wss://modulate-developer-apis.com/api/velma-2-stt-streaming?api_key=YOUR_API_KEY

Unlike the batch endpoints, the streaming API does not use an X-API-Key header. The key must be in the query string at connection time.

See Authentication and rate limits for how to obtain and manage API keys.

Supported audio formats

AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM

Opus is recommended for optimal quality and bandwidth efficiency.

Query parameters

Parameter	Type	Default	Description
`api_key`	string	—	Required. Your API key
`speaker_diarization`	boolean	`true`	Identify and label distinct speakers
`emotion_signal`	boolean	`false`	Detect emotional tone per utterance
`accent_signal`	boolean	`false`	Detect speaker accent per utterance
`pii_phi_tagging`	boolean	`false`	Wrap PII/PHI in tags within utterance text

For a full explanation of what each feature does and when to enable it, see STT enrichment features.
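
As a minimal sketch of how the query string might be assembled, the helper below builds the connection URL from the API key and any feature flags in the table above. `build_stream_url` is a hypothetical name introduced here for illustration; it is not part of any SDK, and the lowercase `true`/`false` encoding is taken from the examples on this page.

```python
from urllib.parse import urlencode

BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-stt-streaming"


def build_stream_url(api_key: str, **features: bool) -> str:
    """Return the connection URL with the API key and feature flags in the query string.

    Hypothetical helper for illustration only.
    """
    params = {"api_key": api_key}
    for name, enabled in features.items():
        # Booleans are written as lowercase "true"/"false", as in the examples on this page.
        params[name] = "true" if enabled else "false"
    return f"{BASE_URL}?{urlencode(params)}"


url = build_stream_url("YOUR_API_KEY", emotion_signal=True, pii_phi_tagging=True)
print(url)
```

Parameters that are left out keep the defaults listed in the table, so `speaker_diarization` stays enabled unless it is passed explicitly as `False`.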

Connection flow

1. Connect to the WebSocket endpoint with api_key and any optional feature parameters.
2. Stream raw audio as binary WebSocket frames. Frames can be any size.
3. Receive utterance JSON messages as speech is transcribed.
4. Send an empty text frame ("") to signal end of audio.
5. Receive a done message containing total audio duration.
6. The connection closes automatically.
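
The sending side of this flow (binary audio frames, then an empty text frame) can be sketched without any network code. The generator below is an illustration only: `outgoing_frames` and `END_OF_AUDIO` are hypothetical names, and the 8192-byte chunk size is an arbitrary choice, since frames can be any size.

```python
from typing import Iterator, Union

END_OF_AUDIO = ""  # empty text frame that signals end of audio


def outgoing_frames(audio: bytes, chunk_size: int = 8192) -> Iterator[Union[bytes, str]]:
    """Yield binary audio frames, then the end-of-audio marker.

    Hypothetical helper: bytes items would be sent as binary frames,
    the final str item as a text frame.
    """
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]
    yield END_OF_AUDIO


frames = list(outgoing_frames(b"x" * 20000))
print([len(f) for f in frames])
```

A client would send each `bytes` item with its library's binary-send call and the final empty string with its text-send call, then keep reading until the done message arrives.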

Server messages

`utterance`

Sent each time a speech segment is transcribed.

{
  "type": "utterance",
  "utterance": {
    "utterance_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "text": "Hello, how are you today?",
    "start_ms": 0,
    "duration_ms": 2500,
    "speaker": 1,
    "language": "en",
    "emotion": "Neutral",
    "accent": "American"
  }
}

Utterance fields

Field	Type	Description
`utterance_uuid`	string (UUID)	Unique identifier for this utterance
`text`	string	Transcribed text
`start_ms`	integer	Start time in milliseconds from the beginning of the stream
`duration_ms`	integer	Duration of the utterance in milliseconds
`speaker`	integer	Speaker number, 1-indexed
`language`	string	Detected language code (e.g. `"en"`, `"fr"`)
`emotion`	string \| null	Detected emotion. `null` when `emotion_signal` is disabled
`accent`	string \| null	Detected accent. `null` when `accent_signal` is disabled

For all valid emotion and accent values, see STT enrichment features.
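
To show how these fields fit together, here is a small sketch that loads the sample payload above into a dataclass and derives the utterance end time as `start_ms + duration_ms`. The `Utterance` class and `format_line` function are hypothetical names for illustration; the sketch assumes the payload contains exactly the fields listed in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    """Illustrative container mirroring the utterance fields table."""
    utterance_uuid: str
    text: str
    start_ms: int
    duration_ms: int
    speaker: int
    language: str
    emotion: Optional[str] = None  # None when emotion_signal is disabled
    accent: Optional[str] = None   # None when accent_signal is disabled

    @property
    def end_ms(self) -> int:
        # End time relative to the beginning of the stream.
        return self.start_ms + self.duration_ms


def format_line(u: Utterance) -> str:
    """Render one utterance as a timestamped transcript line."""
    start_s = u.start_ms / 1000
    end_s = u.end_ms / 1000
    return f"[{start_s:.1f}s-{end_s:.1f}s] Speaker {u.speaker} ({u.language}): {u.text}"


payload = {
    "utterance_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "text": "Hello, how are you today?",
    "start_ms": 0,
    "duration_ms": 2500,
    "speaker": 1,
    "language": "en",
    "emotion": "Neutral",
    "accent": "American",
}
u = Utterance(**payload)
print(format_line(u))
```

Unpacking with `**payload` raises a `TypeError` if the server ever sends a field not declared on the dataclass, so production code may prefer to pick out known keys explicitly.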

`done`

Sent after all audio has been processed, in response to the end-of-stream signal.

{
  "type": "done",
  "duration_ms": 45000
}

`error`

Sent if transcription fails. The connection closes after this message.

{
  "type": "error",
  "error": "Internal server error"
}
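
The three message types can be handled with a single dispatch on the `type` field. The function below is a sketch under stated assumptions: `handle_message` is a hypothetical name, raising `RuntimeError` on an error message is one possible design, and silently ignoring unknown types is an assumption rather than documented behavior.

```python
import json


def handle_message(raw: str, transcript: list) -> bool:
    """Process one server text message. Return True once the stream is finished.

    Hypothetical helper for illustration only.
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "utterance":
        transcript.append(msg["utterance"]["text"])
        return False
    if kind == "done":
        return True
    if kind == "error":
        # The server closes the connection after an error message.
        raise RuntimeError(msg["error"])
    return False  # unknown type: ignored here (assumption)


transcript = []
finished = handle_message('{"type": "utterance", "utterance": {"text": "Hello"}}', transcript)
print(finished, transcript)
finished = handle_message('{"type": "done", "duration_ms": 45000}', transcript)
print(finished)
```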

WebSocket close codes

Code	Meaning
`4001`	Invalid API key
`4003`	Model access not enabled for your organization
`4029`	Rate limit exceeded — monthly usage or concurrent connections
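
A client can translate these codes into readable diagnostics after the socket closes. The lookup below is a minimal sketch: `CLOSE_CODE_MEANINGS` and `describe_close` are hypothetical names, and the fallback text for codes outside this table is an assumption.

```python
from typing import Optional

# Meanings copied from the close code table above.
CLOSE_CODE_MEANINGS = {
    4001: "Invalid API key",
    4003: "Model access not enabled for your organization",
    4029: "Rate limit exceeded (monthly usage or concurrent connections)",
}


def describe_close(code: Optional[int]) -> str:
    """Return a human-readable description for a WebSocket close code."""
    if code is None:
        return "Connection closed without a close code"
    # Codes not listed in the table are reported as-is (assumption).
    return CLOSE_CODE_MEANINGS.get(code, f"Closed with code {code}")


print(describe_close(4001))
print(describe_close(1000))
```

With aiohttp, the code is available as `ws.close_code` once the connection has closed; with the Node.js `ws` package, it is the first argument passed to the `close` event handler.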

Rate limits

Concurrent connection limits apply per organization.
Monthly usage limits (in audio hours) apply per organization.
Connections that exceed limits are rejected during the WebSocket handshake with close code 4029.

See Authentication and rate limits for retry guidance.

Examples

Python (aiohttp)

import asyncio
import json
import aiohttp

API_KEY = "YOUR_API_KEY"
AUDIO_FILE = "recording.opus"
CHUNK_SIZE = 8192

async def transcribe_streaming():
    url = (
        f"wss://modulate-developer-apis.com/api/velma-2-stt-streaming"
        f"?api_key={API_KEY}"
        f"&speaker_diarization=true"
        f"&emotion_signal=true"
        f"&accent_signal=true"
    )

    utterances = []

    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(url) as ws:

            async def send_audio():
                with open(AUDIO_FILE, "rb") as f:
                    while chunk := f.read(CHUNK_SIZE):
                        await ws.send_bytes(chunk)
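                        # Throttle the upload to about 4000 bytes per second, which
                        # approximates real time for ~32 kbps audio (an assumption;
                        # adjust the divisor to your file's bitrate).
                        await asyncio.sleep(CHUNK_SIZE / 4000)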
                await ws.send_str("")

            send_task = asyncio.create_task(send_audio())

            try:
                async for msg in ws:
                    if msg.type == aiohttp.WSMsgType.TEXT:
                        data = json.loads(msg.data)
                        if data["type"] == "utterance":
                            u = data["utterance"]
                            utterances.append(u)
                            print(f"[Speaker {u['speaker']}] {u['text']}")
                        elif data["type"] == "done":
                            print(f"Done. Duration: {data['duration_ms']}ms")
                            break
                        elif data["type"] == "error":
                            print(f"Error: {data['error']}")
                            break
                    elif msg.type in (
                        aiohttp.WSMsgType.ERROR,
                        aiohttp.WSMsgType.CLOSE,
                        aiohttp.WSMsgType.CLOSED,
                    ):
                        break
            finally:
                if not send_task.done():
                    send_task.cancel()

    full_text = " ".join(u["text"] for u in utterances)
    print(f"\nFull transcript:\n{full_text}")

asyncio.run(transcribe_streaming())

JavaScript (Node.js)

const WebSocket = require("ws");
const fs = require("fs");

const API_KEY = "YOUR_API_KEY";
const AUDIO_FILE = "recording.opus";
const CHUNK_SIZE = 8192;

const url = new URL("wss://modulate-developer-apis.com/api/velma-2-stt-streaming");
url.searchParams.set("api_key", API_KEY);
url.searchParams.set("speaker_diarization", "true");
url.searchParams.set("emotion_signal", "true");

const ws = new WebSocket(url.toString());
const utterances = [];

ws.on("open", () => {
  const stream = fs.createReadStream(AUDIO_FILE, { highWaterMark: CHUNK_SIZE });
  stream.on("data", (chunk) => ws.send(chunk));
  stream.on("end", () => ws.send(""));
});

ws.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  if (msg.type === "utterance") {
    utterances.push(msg.utterance);
    console.log(`[Speaker ${msg.utterance.speaker}] ${msg.utterance.text}`);
  } else if (msg.type === "done") {
    console.log(`Done. Duration: ${msg.duration_ms}ms`);
    console.log("Transcript:", utterances.map((u) => u.text).join(" "));
    ws.close();
  } else if (msg.type === "error") {
    console.error("Error:", msg.error);
    ws.close();
  }
});

ws.on("error", (err) => console.error("WebSocket error:", err.message));

WebSocket APIs cannot be tested with cURL. For command-line testing, use websocat.

Related

Which API should I use? — when streaming is the right choice vs batch
STT enrichment features — full reference for diarization, emotion, accent, and PII/PHI tagging
Authentication and rate limits
