
Real-time speech-to-text over WebSocket. Streams audio to the server and receives transcribed utterances as they are processed, with optional speaker diarization, emotion detection, accent detection, and PII/PHI tagging.

Endpoint

wss://modulate-developer-apis.com/api/velma-2-stt-streaming

Authentication

Pass your API key as a query parameter when opening the connection.
wss://modulate-developer-apis.com/api/velma-2-stt-streaming?api_key=YOUR_API_KEY
Unlike the batch endpoints, the streaming API does not use an X-API-Key header. The key must be in the query string at connection time.
See Authentication and rate limits for how to obtain and manage API keys.

Supported audio formats

AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, and WebM. Opus is recommended for optimal quality and bandwidth efficiency.

Query parameters

Parameter           | Type    | Default | Description
api_key             | string  | (none)  | Required. Your API key
speaker_diarization | boolean | true    | Identify and label distinct speakers
emotion_signal      | boolean | false   | Detect emotional tone per utterance
accent_signal       | boolean | false   | Detect speaker accent per utterance
pii_phi_tagging     | boolean | false   | Wrap PII/PHI in tags within utterance text
For a full explanation of what each feature does and when to enable it, see STT enrichment features.
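
Since every option rides on the query string, the connection URL can be assembled with the standard library. A minimal sketch, assuming booleans are serialized as the lowercase strings "true" and "false" (matching the defaults shown above):

from urllib.parse import urlencode

BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-stt-streaming"

params = {
    "api_key": "YOUR_API_KEY",
    "speaker_diarization": "true",
    "emotion_signal": "true",
    "accent_signal": "true",
    "pii_phi_tagging": "false",
}
url = f"{BASE_URL}?{urlencode(params)}"  # percent-encodes values as needed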

Connection flow

  1. Connect to the WebSocket endpoint with api_key and any optional feature parameters.
  2. Stream raw audio as binary WebSocket frames. Frames can be any size.
  3. Receive utterance JSON messages as speech is transcribed.
  4. Send an empty text frame ("") to signal end of audio.
  5. Receive a done message containing total audio duration.
  6. The connection closes automatically.
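
As a compressed illustration of these six steps (a sketch only, assuming the audio is small enough to send as a single binary frame; the aiohttp example under Examples adds pacing, enrichment flags, and fuller error handling):

import json
import aiohttp

async def quick_transcribe(url: str, audio: bytes) -> None:
    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(url) as ws:   # 1. connect
            await ws.send_bytes(audio)              # 2. stream audio
            await ws.send_str("")                   # 4. signal end of audio
            async for msg in ws:                    # 3./5. server messages
                if msg.type != aiohttp.WSMsgType.TEXT:
                    break
                data = json.loads(msg.data)
                if data["type"] == "utterance":
                    print(data["utterance"]["text"])
                elif data["type"] in ("done", "error"):
                    break                           # 6. connection closes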

Server messages

utterance

Sent each time a speech segment is transcribed.
{
  "type": "utterance",
  "utterance": {
    "utterance_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "text": "Hello, how are you today?",
    "start_ms": 0,
    "duration_ms": 2500,
    "speaker": 1,
    "language": "en",
    "emotion": "Neutral",
    "accent": "American"
  }
}

Utterance fields

Field          | Type           | Description
utterance_uuid | string (UUID)  | Unique identifier for this utterance
text           | string         | Transcribed text
start_ms       | integer        | Start time in milliseconds from the beginning of the stream
duration_ms    | integer        | Duration of the utterance in milliseconds
speaker        | integer        | Speaker number, 1-indexed
language       | string         | Detected language code (e.g. "en", "fr")
emotion        | string or null | Detected emotion; null when emotion_signal is disabled
accent         | string or null | Detected accent; null when accent_signal is disabled
For all valid emotion and accent values, see STT enrichment features.
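
For downstream processing it can help to bind these fields to a typed structure. A minimal sketch, assuming every documented key is always present (with null rather than a missing key when a signal is disabled, as the table above suggests):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Utterance:
    utterance_uuid: str
    text: str
    start_ms: int
    duration_ms: int
    speaker: int            # 1-indexed speaker number
    language: str           # e.g. "en", "fr"
    emotion: Optional[str]  # None when emotion_signal is disabled
    accent: Optional[str]   # None when accent_signal is disabled

def parse_utterance(message: dict) -> Utterance:
    # message is the decoded JSON of an "utterance" server message
    return Utterance(**message["utterance"])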

done

Sent after all audio has been processed, in response to the end-of-stream signal.
{
  "type": "done",
  "duration_ms": 45000
}

error

Sent if transcription fails. The connection closes after this message.
{
  "type": "error",
  "error": "Internal server error"
}

WebSocket close codes

Code | Meaning
4001 | Invalid API key
4003 | Model access not enabled for your organization
4029 | Rate limit exceeded (monthly usage or concurrent connections)

Rate limits

  • Concurrent connection limits apply per organization.
  • Monthly usage limits (in audio hours) apply per organization.
  • Connections that exceed limits are rejected during the WebSocket handshake with close code 4029.
See Authentication and rate limits for retry guidance.
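
As a rough illustration of one retry strategy (an assumption, not documented behavior: this sketch presumes the upgrade completes and the server then closes with 4029, whereas a rejection at the HTTP stage would surface as aiohttp's WSServerHandshakeError; stream_once is a hypothetical coroutine holding your send/receive loop, and the backoff constants are illustrative):

import asyncio
import aiohttp

async def transcribe_with_retries(url, stream_once, max_attempts=5):
    delay = 1.0  # seconds before the first retry (illustrative)
    async with aiohttp.ClientSession() as session:
        for attempt in range(max_attempts):
            async with session.ws_connect(url) as ws:
                result = await stream_once(ws)
            if ws.close_code != 4029:  # not rate limited: hand back the result
                return result
            await asyncio.sleep(delay)
            delay = min(delay * 2, 30.0)  # double the wait, capped at 30s
    raise RuntimeError("still rate limited after all retry attempts")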

Examples

import asyncio
import json
import aiohttp

API_KEY = "YOUR_API_KEY"
AUDIO_FILE = "recording.opus"
CHUNK_SIZE = 8192

async def transcribe_streaming():
    url = (
        f"wss://modulate-developer-apis.com/api/velma-2-stt-streaming"
        f"?api_key={API_KEY}"
        f"&speaker_diarization=true"
        f"&emotion_signal=true"
        f"&accent_signal=true"
    )

    utterances = []

    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(url) as ws:

            async def send_audio():
                # Stream the file as binary frames of CHUNK_SIZE bytes.
                with open(AUDIO_FILE, "rb") as f:
                    while chunk := f.read(CHUNK_SIZE):
                        await ws.send_bytes(chunk)
                        # Rough pacing so audio is not pushed much faster
                        # than real time; tune for your format's bitrate.
                        await asyncio.sleep(CHUNK_SIZE / 4000)
                # An empty text frame signals end of audio.
                await ws.send_str("")

            # Send audio in the background while the loop below receives.
            send_task = asyncio.create_task(send_audio())

            try:
                async for msg in ws:
                    if msg.type == aiohttp.WSMsgType.TEXT:
                        data = json.loads(msg.data)
                        if data["type"] == "utterance":
                            u = data["utterance"]
                            utterances.append(u)
                            print(f"[Speaker {u['speaker']}] {u['text']}")
                        elif data["type"] == "done":
                            print(f"Done. Duration: {data['duration_ms']}ms")
                            break
                        elif data["type"] == "error":
                            print(f"Error: {data['error']}")
                            break
                    elif msg.type in (
                        aiohttp.WSMsgType.ERROR,
                        aiohttp.WSMsgType.CLOSE,
                        aiohttp.WSMsgType.CLOSED,
                    ):
                        break
            finally:
                if not send_task.done():
                    send_task.cancel()

    full_text = " ".join(u["text"] for u in utterances)
    print(f"\nFull transcript:\n{full_text}")

asyncio.run(transcribe_streaming())
WebSocket APIs cannot be tested with cURL. For command-line testing, use websocat.
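
For instance, a rough invocation (assuming websocat's -b/--binary flag, which sends stdin as binary frames and prints received messages to stdout; sending the empty text frame that signals end of audio is awkward from the shell, so the Python example above is the easier way to exercise the full flow):

websocat -b "wss://modulate-developer-apis.com/api/velma-2-stt-streaming?api_key=YOUR_API_KEY" < recording.opus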