Real-time conversation analysis over WebSocket. Stream audio to the server and receive analysis events as they are produced: transcribed clips, the inferred conversation type, participant roles, behavior detections, topics, per-speaker topic sentiment, and a final summary — terminating with a done event at end-of-stream.

Endpoint

wss://modulate-developer-apis.com/api/velma-2-streaming

Authentication

Pass your API key as a query parameter when opening the connection.
wss://modulate-developer-apis.com/api/velma-2-streaming?api_key=YOUR_API_KEY
Unlike the batch endpoint, the streaming API does not use an X-API-Key header. The key must be in the query string at connection time.
See Authentication and rate limits for how to obtain and manage API keys.

Connection parameters

Connection parameters carry only the API key and audio-format hints. All analysis configuration is sent in the config frame, not the query string.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| api_key | string | Yes | Your API key |
| audio_format | string | Raw formats only | Audio encoding. Omit for self-describing formats; required for raw/headerless formats. May optionally be set to override auto-detection |
| sample_rate | integer | Raw formats only | Sample rate in Hz. Required for raw formats; must not be set otherwise |
| num_channels | integer | Raw formats only | Number of channels (1–8). Required for raw formats; must not be set otherwise |
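As a rough illustration of how these parameters combine, the sketch below assembles the connection URL with the standard library. The helper name build_stream_url is hypothetical and not part of any SDK; it only adds the parameters that were actually supplied and URL-encodes the values.

```python
from urllib.parse import urlencode

BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-streaming"


def build_stream_url(api_key, audio_format=None, sample_rate=None, num_channels=None):
    """Return the connection URL (hypothetical helper, not an official API).

    Only parameters that were provided are added to the query string.
    """
    params = {"api_key": api_key}
    if audio_format is not None:
        params["audio_format"] = audio_format
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    if num_channels is not None:
        params["num_channels"] = num_channels
    return f"{BASE_URL}?{urlencode(params)}"


# Self-describing file: only the key is needed.
print(build_stream_url("YOUR_API_KEY"))
# Raw 16 kHz mono PCM: all three audio hints are supplied.
print(build_stream_url("YOUR_API_KEY", "s16le", 16000, 1))
```

Encoding the values with urlencode keeps the URL valid if a key ever contains characters that are reserved in query strings.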

Supported audio formats

Self-describing formats (auto-detected from file headers — no extra parameters needed): AAC, AIFF, FLAC, MP3, OGG, WAV, WebM
OGG / Opus: OGG is a container that may carry Opus-encoded audio. Pass audio_format=ogg, not audio_format=opus.
Raw / headerless formats (require audio_format, sample_rate, and num_channels): s8, s16le, s16be, s24le, s24be, s32le, s32be, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw
Valid sample rates (Hz): 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000
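The rules above can be checked on the client before connecting. The following is a minimal sketch, not the server's actual validation: check_audio_params is a hypothetical helper, and the lowercase identifiers for the self-describing formats (other than ogg, which appears above) are an assumption. It also assumes the listed sample rates apply to raw formats.

```python
# Lowercase names for self-describing formats are assumed, based on "audio_format=ogg".
SELF_DESCRIBING = {"aac", "aiff", "flac", "mp3", "ogg", "wav", "webm"}
RAW_FORMATS = {
    "s8", "s16le", "s16be", "s24le", "s24be", "s32le", "s32be",
    "u8", "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
}
VALID_SAMPLE_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000}


def check_audio_params(audio_format=None, sample_rate=None, num_channels=None):
    """Return a list of problems; an empty list means the combination looks valid."""
    problems = []
    if audio_format in RAW_FORMATS:
        # Raw formats need both a listed sample rate and a channel count of 1 to 8.
        if sample_rate not in VALID_SAMPLE_RATES:
            problems.append("raw formats require a sample_rate from the valid list")
        if num_channels is None or not 1 <= num_channels <= 8:
            problems.append("raw formats require num_channels between 1 and 8")
    else:
        if audio_format is not None and audio_format not in SELF_DESCRIBING:
            problems.append(f"unknown audio_format: {audio_format}")
        if sample_rate is not None or num_channels is not None:
            problems.append("sample_rate and num_channels must not be set for non-raw formats")
    return problems


print(check_audio_params("s16le", 16000, 1))  # no problems expected
print(check_audio_params("opus"))             # flagged: use ogg for OGG/Opus
```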

Configuration

After the connection opens, send exactly one text frame before any audio:
  • The literal string default to use the built-in default configuration, or
  • A JSON-encoded BatchConfig describing the conversation types, participant roles, behaviors, STT options, and which aggregate outputs (topics, sentiments, summary) to produce.
The BatchConfig schema is identical to the batch endpoint — see the Batch reference for the full field list. Behaviors may be referenced from the preset catalog with the preset:<identifier> syntax; list available presets with List behavior presets.
{
  "behaviors": ["preset:empathy", "preset:complaints"],
  "stt": { "speaker_diarization": true, "emotion_signal": true },
  "produce_topics": true,
  "produce_topic_sentiments": true,
  "produce_summary": true
}
You must send the config frame before any audio. Sending a binary audio frame before the config frame is a protocol error and closes the connection with code 1003.
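To make the two config-frame options concrete, here is a small sketch that produces either the literal string default or a JSON-encoded config. The helper make_config_frame and its parameter names are hypothetical; the field names passed to it are the ones shown in the example config above, and the full BatchConfig schema lives in the Batch reference.

```python
import json


def make_config_frame(preset_behaviors=None, **options):
    """Build the text frame to send first (hypothetical helper).

    With no arguments it returns the literal string "default". Otherwise it
    returns a JSON-encoded config, expanding preset names to "preset:<identifier>".
    """
    if not preset_behaviors and not options:
        return "default"
    config = dict(options)
    if preset_behaviors:
        config["behaviors"] = [f"preset:{name}" for name in preset_behaviors]
    return json.dumps(config)


print(make_config_frame())
print(make_config_frame(["empathy", "complaints"], produce_summary=True))
```

The returned string is what would be sent as the single text frame before the first binary audio frame.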

Connection flow

  1. Connect to the WebSocket endpoint with api_key (and audio_format, sample_rate, num_channels for raw formats).
  2. Send one text frame: either the literal string default or a JSON-encoded BatchConfig.
  3. Stream audio as binary WebSocket frames. Frames can be any size.
  4. Receive analysis events as JSON text frames as results are produced.
  5. Send an empty text frame ("") to signal end of audio.
  6. Receive a final done event with the total audio duration.
  7. The connection closes automatically.
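The client-side half of the steps above can be sketched without any network code. The generator below is an illustration only: outbound_frames and the ("text" / "binary", payload) tuple shape are hypothetical conventions chosen for this example, independent of any particular WebSocket library.

```python
import io


def outbound_frames(config_frame, audio, chunk_size=8192):
    """Yield (kind, payload) pairs in the order the client sends them."""
    yield ("text", config_frame)        # step 2: exactly one config frame first
    while chunk := audio.read(chunk_size):
        yield ("binary", chunk)         # step 3: audio as binary frames of any size
    yield ("text", "")                  # step 5: empty text frame ends the audio


# Example with an in-memory stand-in for an audio file.
frames = list(outbound_frames("default", io.BytesIO(b"abcdef"), chunk_size=4))
print(frames)
# [('text', 'default'), ('binary', b'abcd'), ('binary', b'ef'), ('text', '')]
```

A real client would map "text" and "binary" onto its library's send calls while reading server events concurrently, since results arrive as the audio is processed.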

Server events

The server emits JSON text frames as results are produced. Every event carries a type discriminator. The payload objects (clip, conversation-type pick, participant-role pick, behavior detection, topic sentiment) use the same schemas the batch endpoint returns — see the Batch reference for full field definitions.
| type | Payload key | Description |
| --- | --- | --- |
| clip | clip | A transcribed clip with speaker label, timing, and optional emotion / accent / deepfake signals |
| conversation_type | pick | The inferred or default conversation-type classification for the session |
| participant_role | pick | The inferred or default role for a speaker |
| behavior_detection | detection | A per-behavior detection result |
| topics | topics | The aggregated list of conversation topics |
| topic_sentiment | topic_sentiment | Per-speaker sentiment for one aggregated topic |
| summary | text | A free-form summary of the conversation |
| done | duration_ms | Streaming completed; carries the total audio duration |
| error | error | An error occurred; the connection closes after this event |
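Because each event type stores its payload under a known key, a client can unpack events with a lookup table rather than a chain of conditionals. The sketch below assumes nothing beyond the type and payload-key pairs listed above; parse_event is a hypothetical helper, and returning None for unrecognized types is a design choice made for this example.

```python
import json

# Event type -> key that holds its payload, as listed in the table above.
PAYLOAD_KEYS = {
    "clip": "clip",
    "conversation_type": "pick",
    "participant_role": "pick",
    "behavior_detection": "detection",
    "topics": "topics",
    "topic_sentiment": "topic_sentiment",
    "summary": "text",
    "done": "duration_ms",
    "error": "error",
}


def parse_event(frame):
    """Decode one JSON text frame into (event_type, payload)."""
    event = json.loads(frame)
    etype = event["type"]
    key = PAYLOAD_KEYS.get(etype)
    if key is None:
        # Unknown type: return it without a payload so the caller can ignore or log it.
        return etype, None
    return etype, event[key]


print(parse_event('{"type": "done", "duration_ms": 45000}'))
# ('done', 45000)
```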

clip

A transcribed clip. Emitted progressively as speech is processed.
{
  "type": "clip",
  "clip": {
    "clip_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "text": "Thanks for calling support, how can I help?",
    "start_ms": 0,
    "duration_ms": 3200,
    "speaker_label": "speaker_1",
    "language": "en",
    "emotion": null,
    "accent": null,
    "deepfake_score": null
  }
}
The emotion, accent, and deepfake_score fields are null unless the corresponding STT options are enabled in the config frame.
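As a small worked example of the timing fields, a clip's end time is start_ms + duration_ms. The sketch below formats a clip as a transcript line and appends any optional signals that are not null. format_clip is a hypothetical helper, and treating start_ms as an offset from the start of the stream is an assumption.

```python
def format_clip(clip):
    """Render a clip payload as one transcript line (hypothetical helper)."""
    start_s = clip["start_ms"] / 1000
    end_s = (clip["start_ms"] + clip["duration_ms"]) / 1000
    line = f"[{start_s:.2f}s-{end_s:.2f}s] {clip['speaker_label']}: {clip['text']}"
    # Optional signals are null unless enabled in the config frame.
    signals = [
        f"{name}={clip[name]}"
        for name in ("emotion", "accent", "deepfake_score")
        if clip.get(name) is not None
    ]
    if signals:
        line += " (" + ", ".join(signals) + ")"
    return line


clip = {
    "text": "Thanks for calling support, how can I help?",
    "start_ms": 0,
    "duration_ms": 3200,
    "speaker_label": "speaker_1",
    "emotion": None,
    "accent": None,
    "deepfake_score": None,
}
print(format_clip(clip))
# [0.00s-3.20s] speaker_1: Thanks for calling support, how can I help?
```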

conversation_type

The conversation-type classification for the session.
{
  "type": "conversation_type",
  "pick": {
    "conversation_type_uuid": "8f1d2c3b-4a5e-6f70-8192-a3b4c5d6e7f8",
    "name": "Customer support call",
    "confidence": 0.92,
    "selection_source": "inferred",
    "detail": "Caller is seeking help resolving a billing issue.",
    "reasoning": "The agent greets the caller and offers assistance with an account problem."
  }
}
selection_source is one of inferred, auto_selected_single_option, or default.

participant_role

A role assignment for one speaker. Emitted once per identified speaker.
{
  "type": "participant_role",
  "pick": {
    "speaker_label": "speaker_1",
    "participant_role_uuid": "2b7c9d10-1e2f-3a4b-5c6d-7e8f90a1b2c3",
    "name": "Support agent",
    "confidence": 0.88,
    "selection_source": "inferred",
    "detail": "Speaker offers assistance and asks diagnostic questions.",
    "reasoning": "Greets on behalf of the company and drives the troubleshooting."
  }
}

behavior_detection

A per-behavior detection result. Emitted once per configured behavior.
{
  "type": "behavior_detection",
  "detection": {
    "behavior_uuid": "c4d5e6f7-8091-a2b3-c4d5-e6f708192a3b",
    "behavior_name": "Empathy",
    "speaker_label": "speaker_1",
    "detected": true,
    "confidence": 0.81,
    "evidence_clip_uuids": ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"],
    "definitive_clip_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "reasoning": "Agent acknowledges the customer's frustration before resolving the issue.",
    "skipped": false,
    "skip_reason": null,
    "error_reason": null
  }
}
When a behavior could not be evaluated, skipped is true and skip_reason (or error_reason) explains why.
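A client will usually want to separate three outcomes: detected, not detected, and not evaluated. The following sketch shows one way to do that with the fields above. detection_status is a hypothetical helper, and falling back from skip_reason to error_reason is an interpretation of the sentence above rather than documented behavior.

```python
def detection_status(detection):
    """Summarize a behavior_detection payload as a short status string."""
    if detection.get("skipped"):
        # Prefer skip_reason, then error_reason, when the behavior was not evaluated.
        reason = (
            detection.get("skip_reason")
            or detection.get("error_reason")
            or "no reason given"
        )
        return f"skipped: {reason}"
    return "detected" if detection.get("detected") else "not detected"


print(detection_status({"detected": True, "skipped": False}))
# detected
print(detection_status({"detected": False, "skipped": True, "skip_reason": None,
                        "error_reason": "timeout"}))
# skipped: timeout
```

The reason string "timeout" is made up for the example; actual reason values are not listed here.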

topics

The aggregated list of conversation topics.
{
  "type": "topics",
  "topics": ["billing", "refund policy", "account access"]
}

topic_sentiment

Per-speaker sentiment for one aggregated topic. Emitted once per topic/speaker pair.
{
  "type": "topic_sentiment",
  "topic_sentiment": {
    "topic": "billing",
    "speaker_label": "speaker_2",
    "sentiment_score": -0.4,
    "sentiment_label": "negative"
  }
}
sentiment_score ranges from -1.0 (most negative) to 1.0 (most positive).
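Since one event arrives per topic/speaker pair, a client that wants an overall figure per speaker has to aggregate them itself. The sketch below averages sentiment_score by speaker_label. mean_sentiment_by_speaker is a hypothetical helper, and an unweighted mean is just one possible aggregation.

```python
from collections import defaultdict


def mean_sentiment_by_speaker(sentiments):
    """Average sentiment_score per speaker over topic_sentiment payloads."""
    scores = defaultdict(list)
    for item in sentiments:
        scores[item["speaker_label"]].append(item["sentiment_score"])
    return {speaker: sum(vals) / len(vals) for speaker, vals in scores.items()}


collected = [
    {"topic": "billing", "speaker_label": "speaker_2", "sentiment_score": -0.5},
    {"topic": "refund policy", "speaker_label": "speaker_2", "sentiment_score": 0.25},
    {"topic": "billing", "speaker_label": "speaker_1", "sentiment_score": 0.5},
]
print(mean_sentiment_by_speaker(collected))
# {'speaker_2': -0.125, 'speaker_1': 0.5}
```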

summary

A free-form summary of the conversation.
{
  "type": "summary",
  "text": "The customer called about a duplicate charge. The agent confirmed the error, issued a refund, and explained the billing cycle."
}

done

Sent after all audio has been processed, in response to the end-of-stream signal.
{
  "type": "done",
  "duration_ms": 45000
}

error

Sent if processing fails. The connection closes after this event.
{
  "type": "error",
  "error": "Internal server error"
}

WebSocket close codes

| Code | Meaning |
| --- | --- |
| 1000 | Normal closure after a successful done message |
| 1003 | Protocol error: invalid config JSON, audio sent before the config frame, or an unsupported audio format / sample rate / channel count |
| 4003 | Request could not be validated, or is not permitted (auth failure, missing model access) |
| 4029 | Insufficient credits, or concurrent-connection limit exceeded |
An error JSON message is sent before the connection closes (except on 1000).
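For logging, it can help to translate the numeric close code into a readable message. The sketch below restates the table above as a lookup; describe_close and closed_normally are hypothetical helpers, and the wording of each message is a paraphrase rather than text returned by the server.

```python
# Paraphrased meanings for the close codes listed above.
CLOSE_CODE_MEANINGS = {
    1000: "normal closure after done",
    1003: "protocol error (bad config, audio before config, or unsupported audio parameters)",
    4003: "request could not be validated or is not permitted",
    4029: "insufficient credits or concurrent-connection limit exceeded",
}


def describe_close(code):
    """Return a readable description for a WebSocket close code."""
    meaning = CLOSE_CODE_MEANINGS.get(code)
    if meaning is None:
        return f"{code}: not listed in the close-code table"
    return f"{code}: {meaning}"


def closed_normally(code):
    """True only for the normal-closure code sent after a successful done."""
    return code == 1000


print(describe_close(4029))
print(closed_normally(1000))
```

With aiohttp, the code is available as the close_code attribute of the WebSocket response once the connection has closed.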

Rate limits

  • Concurrent connection limits apply per organization.
  • Monthly usage limits (in audio hours) apply per organization.
  • Connections that exceed limits are rejected during the WebSocket handshake with close code 4029.
See Authentication and rate limits for retry guidance.

Examples

The examples below send a JSON BatchConfig as the config frame. To use the built-in defaults instead, send the literal string "default" in place of the JSON.
import asyncio
import json
import aiohttp

API_KEY = "YOUR_API_KEY"
AUDIO_FILE = "conversation.ogg"
CHUNK_SIZE = 8192

# Either the string "default" or a JSON-encoded BatchConfig.
CONFIG = json.dumps({
    "behaviors": ["preset:empathy", "preset:complaints"],
    "stt": {"speaker_diarization": True, "emotion_signal": True},
    "produce_topics": True,
    "produce_topic_sentiments": True,
    "produce_summary": True,
})

async def analyze_streaming():
    url = f"wss://modulate-developer-apis.com/api/velma-2-streaming?api_key={API_KEY}"

    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(url) as ws:
            # 1. Send the config frame before any audio.
            await ws.send_str(CONFIG)

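            async def send_audio():
                # Stream the file in fixed-size chunks. The sleep paces the
                # upload at about 4000 bytes per second so the file is not
                # sent all at once; adjust it to match your audio bitrate.
                with open(AUDIO_FILE, "rb") as f:
                    while chunk := f.read(CHUNK_SIZE):
                        await ws.send_bytes(chunk)
                        await asyncio.sleep(CHUNK_SIZE / 4000)
                await ws.send_str("")  # 2. Signal end of audio.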

            send_task = asyncio.create_task(send_audio())

            try:
                async for msg in ws:
                    if msg.type == aiohttp.WSMsgType.TEXT:
                        event = json.loads(msg.data)
                        etype = event["type"]
                        if etype == "clip":
                            c = event["clip"]
                            print(f"[{c['speaker_label']}] {c['text']}")
                        elif etype == "conversation_type":
                            print(f"Conversation type: {event['pick']['name']}")
                        elif etype == "participant_role":
                            p = event["pick"]
                            print(f"Role for {p['speaker_label']}: {p['name']}")
                        elif etype == "behavior_detection":
                            d = event["detection"]
                            print(f"Behavior {d['behavior_name']}: detected={d['detected']}")
                        elif etype == "topics":
                            print(f"Topics: {', '.join(event['topics'])}")
                        elif etype == "topic_sentiment":
                            ts = event["topic_sentiment"]
                            print(f"Sentiment ({ts['topic']}, {ts['speaker_label']}): {ts['sentiment_label']}")
                        elif etype == "summary":
                            print(f"Summary: {event['text']}")
                        elif etype == "done":
                            print(f"Done. Duration: {event['duration_ms']}ms")
                            break
                        elif etype == "error":
                            print(f"Error: {event['error']}")
                            break
                    elif msg.type in (
                        aiohttp.WSMsgType.ERROR,
                        aiohttp.WSMsgType.CLOSE,
                        aiohttp.WSMsgType.CLOSED,
                    ):
                        break
            finally:
                if not send_task.done():
                    send_task.cancel()

asyncio.run(analyze_streaming())
WebSocket APIs cannot be tested with cURL. For command-line testing, use websocat.