Real-time conversation analysis over WebSocket. Stream audio to the server and receive analysis events as they are produced: transcribed clips, the inferred conversation type, participant roles, behavior detections, topics, per-speaker topic sentiment, and a final summary — terminating with a done event at end-of-stream.
Endpoint
wss://modulate-developer-apis.com/api/velma-2-streaming
Authentication
Pass your API key as a query parameter when opening the connection.
wss://modulate-developer-apis.com/api/velma-2-streaming?api_key=YOUR_API_KEY
Unlike the batch endpoint, the streaming API does not use an X-API-Key header. The key must be in the query string at connection time.
See Authentication and rate limits for how to obtain and manage API keys.
Connection parameters
Connection parameters carry only the API key and audio-format hints. All analysis configuration is sent in the config frame, not the query string.
| Parameter | Type | Required | Description |
|---|---|---|---|
| api_key | string | Yes | Your API key |
| audio_format | string | Raw formats only | Audio encoding. Required for raw/headerless formats. For self-describing formats, omit it unless you want to override auto-detection |
| sample_rate | integer | Raw formats only | Sample rate in Hz. Required for raw formats; must not be set otherwise |
| num_channels | integer | Raw formats only | Number of channels (1–8). Required for raw formats; must not be set otherwise |
Self-describing formats (auto-detected from file headers — no extra parameters needed):
AAC, AIFF, FLAC, MP3, OGG, WAV, WebM
OGG / Opus: OGG is a container that may carry Opus-encoded audio. Pass audio_format=ogg, not audio_format=opus.
Raw / headerless formats (require audio_format, sample_rate, and num_channels):
s8, s16le, s16be, s24le, s24be, s32le, s32be, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw
Valid sample rates: 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000
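The parameter rules above can be checked client-side before opening the connection. The following is a minimal sketch, not part of the API: build_stream_url is a hypothetical helper, and it assumes format identifiers are passed in lowercase (as in audio_format=ogg) and that the listed sample rates apply to raw formats.

```python
from urllib.parse import urlencode

BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-streaming"

# Assumption: identifiers are the lowercase forms of the formats listed above.
SELF_DESCRIBING = {"aac", "aiff", "flac", "mp3", "ogg", "wav", "webm"}
RAW_FORMATS = {
    "s8", "s16le", "s16be", "s24le", "s24be", "s32le", "s32be",
    "u8", "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
}
VALID_SAMPLE_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000}


def build_stream_url(api_key, audio_format=None, sample_rate=None, num_channels=None):
    """Hypothetical helper: validate connection parameters and build the URL."""
    params = {"api_key": api_key}
    if audio_format in RAW_FORMATS:
        # Raw formats need all three audio hints.
        if sample_rate not in VALID_SAMPLE_RATES:
            raise ValueError(f"unsupported or missing sample_rate: {sample_rate!r}")
        if num_channels is None or not 1 <= num_channels <= 8:
            raise ValueError(f"num_channels must be 1-8, got {num_channels!r}")
        params.update(
            audio_format=audio_format,
            sample_rate=sample_rate,
            num_channels=num_channels,
        )
    else:
        if audio_format is not None and audio_format not in SELF_DESCRIBING:
            raise ValueError(f"unknown audio_format: {audio_format!r}")
        # sample_rate and num_channels must not be set for self-describing formats.
        if sample_rate is not None or num_channels is not None:
            raise ValueError("sample_rate/num_channels are only valid for raw formats")
        if audio_format is not None:
            params["audio_format"] = audio_format
    return f"{BASE_URL}?{urlencode(params)}"
```

For example, build_stream_url("YOUR_API_KEY", "s16le", 16000, 1) produces a URL carrying all three audio hints, while passing sample_rate alongside "ogg" raises ValueError before any network traffic.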
Configuration
After the connection opens, send exactly one text frame before any audio:
- The literal string default to use the built-in default configuration, or
- A JSON-encoded BatchConfig describing the conversation types, participant roles, behaviors, STT options, and which aggregate outputs (topics, sentiments, summary) to produce.
The BatchConfig schema is identical to the batch endpoint — see the Batch reference for the full field list. Behaviors may be referenced from the preset catalog with the preset:<identifier> syntax; list available presets with List behavior presets.
{
  "behaviors": ["preset:empathy", "preset:complaints"],
  "stt": { "speaker_diarization": true, "emotion_signal": true },
  "produce_topics": true,
  "produce_topic_sentiments": true,
  "produce_summary": true
}
You must send the config frame before any audio. Sending a binary audio frame before the config frame is a protocol error and closes the connection with code 1003.
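A small helper can make the two config-frame forms explicit. This is a sketch under the assumption that the default form is the bare text default (not a JSON-quoted string); make_config_frame is a hypothetical name.

```python
import json


def make_config_frame(config=None):
    """Hypothetical helper: return the text payload for the config frame.

    With no config, returns the literal string default (assumed to be sent
    as bare text, not JSON-quoted). Otherwise JSON-encodes the given dict.
    """
    if config is None:
        return "default"
    return json.dumps(config)


frame = make_config_frame({
    "behaviors": ["preset:empathy", "preset:complaints"],
    "stt": {"speaker_diarization": True, "emotion_signal": True},
    "produce_summary": True,
})
```

Note that json.dumps("default") would produce a quoted JSON string, which is why the default case is handled separately here.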
Connection flow
- Connect to the WebSocket endpoint with api_key (and audio_format, sample_rate, num_channels for raw formats).
- Send one text frame: either the literal string default or a JSON-encoded BatchConfig.
- Stream audio as binary WebSocket frames. Frames can be any size.
- Receive analysis events as JSON text frames as results are produced.
- Send an empty text frame ("") to signal end of audio.
- Receive a final done event with the total audio duration.
- The connection closes automatically.
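The client-to-server half of this flow has a fixed order: one config text frame, then binary audio, then an empty text frame. One way to keep that order independent of the WebSocket library is a generator, sketched below. The name outgoing_frames and the (kind, payload) tuple shape are illustrative assumptions.

```python
def outgoing_frames(config_frame, audio_chunks):
    """Yield (kind, payload) pairs in the order the protocol expects."""
    # The config frame always goes first, as text.
    yield ("text", config_frame)
    for chunk in audio_chunks:
        # Skip empty reads; the end-of-audio signal is sent explicitly below.
        if chunk:
            yield ("binary", chunk)
    # An empty text frame signals end of audio.
    yield ("text", "")
```

A sender loop can then map "text" to the library's text-send call and "binary" to its bytes-send call without tracking protocol state itself.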
Server events
The server emits JSON text frames as results are produced. Every event carries a type discriminator. The payload objects (clip, conversation-type pick, participant-role pick, behavior detection, topic sentiment) use the same schemas the batch endpoint returns — see the Batch reference for full field definitions.
| type | Payload key | Description |
|---|---|---|
| clip | clip | A transcribed clip with speaker label, timing, and optional emotion / accent / deepfake signals |
| conversation_type | pick | The inferred or default conversation-type classification for the session |
| participant_role | pick | The inferred or default role for a speaker |
| behavior_detection | detection | A per-behavior detection result |
| topics | topics | The aggregated list of conversation topics |
| topic_sentiment | topic_sentiment | Per-speaker sentiment for one aggregated topic |
| summary | text | A free-form summary of the conversation |
| done | duration_ms | Streaming completed; carries the total audio duration |
| error | error | An error occurred; the connection closes after this event |
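Because each event type stores its payload under a different key, a lookup table keeps dispatch code short. The sketch below mirrors the table above; parse_event is a hypothetical helper, and returning None for unrecognized types is an assumption made to tolerate event types not listed here.

```python
import json

# Event type -> key that holds the payload, as listed in the table above.
PAYLOAD_KEYS = {
    "clip": "clip",
    "conversation_type": "pick",
    "participant_role": "pick",
    "behavior_detection": "detection",
    "topics": "topics",
    "topic_sentiment": "topic_sentiment",
    "summary": "text",
    "done": "duration_ms",
    "error": "error",
}

# After either of these, no further events are expected.
TERMINAL_TYPES = {"done", "error"}


def parse_event(raw):
    """Return (event_type, payload, is_terminal) for one JSON text frame."""
    event = json.loads(raw)
    etype = event["type"]
    key = PAYLOAD_KEYS.get(etype)
    payload = event.get(key) if key is not None else None
    return etype, payload, etype in TERMINAL_TYPES
```

A receive loop can call parse_event on each text frame and stop reading once is_terminal is true.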
clip
A transcribed clip. Emitted progressively as speech is processed.
{
  "type": "clip",
  "clip": {
    "clip_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "text": "Thanks for calling support, how can I help?",
    "start_ms": 0,
    "duration_ms": 3200,
    "speaker_label": "speaker_1",
    "language": "en",
    "emotion": null,
    "accent": null,
    "deepfake_score": null
  }
}
The emotion, accent, and deepfake_score fields are null unless the corresponding STT options are enabled in the config frame.
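Clip payloads carry enough timing information to assemble a running transcript. The sketch below uses only the start_ms, duration_ms, speaker_label, and text fields shown above; format_transcript is a hypothetical helper, and sorting by start_ms is a defensive assumption since the arrival order of clips is not specified here.

```python
def format_transcript(clips):
    """Render clip payloads as one line per clip, ordered by start time."""
    lines = []
    for clip in sorted(clips, key=lambda c: c["start_ms"]):
        start_s = clip["start_ms"] / 1000
        end_s = (clip["start_ms"] + clip["duration_ms"]) / 1000
        lines.append(
            f"[{start_s:.1f}s-{end_s:.1f}s] {clip['speaker_label']}: {clip['text']}"
        )
    return "\n".join(lines)
```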
conversation_type
The conversation-type classification for the session.
{
  "type": "conversation_type",
  "pick": {
    "conversation_type_uuid": "8f1d2c3b-4a5e-6f70-8192-a3b4c5d6e7f8",
    "name": "Customer support call",
    "confidence": 0.92,
    "selection_source": "inferred",
    "detail": "Caller is seeking help resolving a billing issue.",
    "reasoning": "The agent greets the caller and offers assistance with an account problem."
  }
}
selection_source is one of inferred, auto_selected_single_option, or default.
participant_role
A role assignment for one speaker. Emitted once per identified speaker.
{
  "type": "participant_role",
  "pick": {
    "speaker_label": "speaker_1",
    "participant_role_uuid": "2b7c9d10-1e2f-3a4b-5c6d-7e8f90a1b2c3",
    "name": "Support agent",
    "confidence": 0.88,
    "selection_source": "inferred",
    "detail": "Speaker offers assistance and asks diagnostic questions.",
    "reasoning": "Greets on behalf of the company and drives the troubleshooting."
  }
}
behavior_detection
A per-behavior detection result. Emitted once per configured behavior.
{
  "type": "behavior_detection",
  "detection": {
    "behavior_uuid": "c4d5e6f7-8091-a2b3-c4d5-e6f708192a3b",
    "behavior_name": "Empathy",
    "speaker_label": "speaker_1",
    "detected": true,
    "confidence": 0.81,
    "evidence_clip_uuids": ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"],
    "definitive_clip_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "reasoning": "Agent acknowledges the customer's frustration before resolving the issue.",
    "skipped": false,
    "skip_reason": null,
    "error_reason": null
  }
}
When a behavior could not be evaluated, skipped is true and skip_reason (or error_reason) explains why.
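Consumers usually need to distinguish three outcomes: not evaluated, detected, and not detected. A minimal sketch of that check follows. The function name and the status strings are illustrative, not part of the API.

```python
def detection_status(detection):
    """Return (status, explanation) for one behavior_detection payload."""
    if detection.get("skipped"):
        # skip_reason or error_reason explains why the behavior was not evaluated.
        reason = detection.get("skip_reason") or detection.get("error_reason")
        return "not_evaluated", reason
    if detection.get("detected"):
        return "detected", detection.get("reasoning")
    return "not_detected", detection.get("reasoning")
```

Checking skipped first avoids reading detected as a real negative when the behavior was never evaluated.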
topics
The aggregated list of conversation topics.
{
  "type": "topics",
  "topics": ["billing", "refund policy", "account access"]
}
topic_sentiment
Per-speaker sentiment for one aggregated topic. Emitted once per topic/speaker pair.
{
  "type": "topic_sentiment",
  "topic_sentiment": {
    "topic": "billing",
    "speaker_label": "speaker_2",
    "sentiment_score": -0.4,
    "sentiment_label": "negative"
  }
}
sentiment_score ranges from -1.0 (most negative) to 1.0 (most positive).
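Since one event arrives per topic/speaker pair, it can be convenient to collect them into a nested mapping. The sketch below assumes only the payload fields shown above; sentiment_by_topic is a hypothetical helper, and the range check simply enforces the documented bounds.

```python
from collections import defaultdict


def sentiment_by_topic(payloads):
    """Group topic_sentiment payloads as {topic: {speaker_label: score}}."""
    table = defaultdict(dict)
    for ts in payloads:
        score = ts["sentiment_score"]
        # Scores are documented to lie between -1.0 and 1.0 inclusive.
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"sentiment_score out of range: {score}")
        table[ts["topic"]][ts["speaker_label"]] = score
    return dict(table)
```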
summary
A free-form summary of the conversation.
{
  "type": "summary",
  "text": "The customer called about a duplicate charge. The agent confirmed the error, issued a refund, and explained the billing cycle."
}
done
Sent after all audio has been processed, in response to the end-of-stream signal.
{
  "type": "done",
  "duration_ms": 45000
}
error
Sent if processing fails. The connection closes after this event.
{
  "type": "error",
  "error": "Internal server error"
}
WebSocket close codes
| Code | Meaning |
|---|---|
| 1000 | Normal closure after a successful done message |
| 1003 | Protocol error: invalid config JSON, audio sent before the config frame, or an unsupported audio format / sample rate / channel count |
| 4003 | Request could not be validated, or is not permitted (auth failure, missing model access) |
| 4029 | Insufficient credits, or concurrent-connection limit exceeded |
An error JSON message is sent before the connection closes (except on 1000).
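Client code can turn the close code into an exception so that abnormal closures are not silently ignored. The following is a sketch: StreamClosedError and check_close are hypothetical names, and the meanings are copied from the table above. With aiohttp, the code would typically be read from the close_code attribute of the client WebSocket once the connection has closed.

```python
CLOSE_MEANINGS = {
    1003: "protocol error (bad config JSON, audio before config, or unsupported audio parameters)",
    4003: "request could not be validated or is not permitted",
    4029: "insufficient credits or concurrent-connection limit exceeded",
}


class StreamClosedError(Exception):
    """Raised when the stream closes with anything other than code 1000."""

    def __init__(self, code, meaning):
        super().__init__(f"WebSocket closed with code {code}: {meaning}")
        self.code = code
        self.meaning = meaning


def check_close(code):
    """Return None on normal closure (1000); raise StreamClosedError otherwise."""
    if code == 1000:
        return None
    raise StreamClosedError(code, CLOSE_MEANINGS.get(code, "unrecognized close code"))
```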
Rate limits
- Concurrent connection limits apply per organization.
- Monthly usage limits (in audio hours) apply per organization.
- Connections that exceed limits are rejected during the WebSocket handshake with close code 4029.
See Authentication and rate limits for retry guidance.
Examples
The examples below send a JSON BatchConfig as the config frame. To use the built-in defaults instead, send the literal string "default" in place of the JSON.
Python (aiohttp)
import asyncio
import json

import aiohttp

API_KEY = "YOUR_API_KEY"
AUDIO_FILE = "conversation.ogg"
CHUNK_SIZE = 8192

# Either the string "default" or a JSON-encoded BatchConfig.
CONFIG = json.dumps({
    "behaviors": ["preset:empathy", "preset:complaints"],
    "stt": {"speaker_diarization": True, "emotion_signal": True},
    "produce_topics": True,
    "produce_topic_sentiments": True,
    "produce_summary": True,
})


async def analyze_streaming():
    url = f"wss://modulate-developer-apis.com/api/velma-2-streaming?api_key={API_KEY}"
    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(url) as ws:
            # 1. Send the config frame before any audio.
            await ws.send_str(CONFIG)

            async def send_audio():
                with open(AUDIO_FILE, "rb") as f:
                    while chunk := f.read(CHUNK_SIZE):
                        await ws.send_bytes(chunk)
                        # Throttle to about 4000 bytes per second to mimic a live source.
                        await asyncio.sleep(CHUNK_SIZE / 4000)
                await ws.send_str("")  # 2. Signal end of audio.

            # Upload audio in the background while events are read below.
            send_task = asyncio.create_task(send_audio())
            try:
                async for msg in ws:
                    if msg.type == aiohttp.WSMsgType.TEXT:
                        event = json.loads(msg.data)
                        etype = event["type"]
                        if etype == "clip":
                            c = event["clip"]
                            print(f"[{c['speaker_label']}] {c['text']}")
                        elif etype == "conversation_type":
                            print(f"Conversation type: {event['pick']['name']}")
                        elif etype == "participant_role":
                            p = event["pick"]
                            print(f"Role for {p['speaker_label']}: {p['name']}")
                        elif etype == "behavior_detection":
                            d = event["detection"]
                            print(f"Behavior {d['behavior_name']}: detected={d['detected']}")
                        elif etype == "topics":
                            print(f"Topics: {', '.join(event['topics'])}")
                        elif etype == "topic_sentiment":
                            ts = event["topic_sentiment"]
                            print(f"Sentiment ({ts['topic']}, {ts['speaker_label']}): {ts['sentiment_label']}")
                        elif etype == "summary":
                            print(f"Summary: {event['text']}")
                        elif etype == "done":
                            print(f"Done. Duration: {event['duration_ms']}ms")
                            break
                        elif etype == "error":
                            print(f"Error: {event['error']}")
                            break
                    elif msg.type in (
                        aiohttp.WSMsgType.ERROR,
                        aiohttp.WSMsgType.CLOSE,
                        aiohttp.WSMsgType.CLOSED,
                    ):
                        break
            finally:
                # Stop uploading if the receive loop ended first.
                if not send_task.done():
                    send_task.cancel()


asyncio.run(analyze_streaming())
JavaScript (Node.js)
const WebSocket = require("ws");
const fs = require("fs");

const API_KEY = "YOUR_API_KEY";
const AUDIO_FILE = "conversation.ogg";
const CHUNK_SIZE = 8192;

// Either the string "default" or a JSON-encoded BatchConfig.
const CONFIG = JSON.stringify({
  behaviors: ["preset:empathy", "preset:complaints"],
  stt: { speaker_diarization: true, emotion_signal: true },
  produce_topics: true,
  produce_topic_sentiments: true,
  produce_summary: true,
});

const url = new URL("wss://modulate-developer-apis.com/api/velma-2-streaming");
url.searchParams.set("api_key", API_KEY);
const ws = new WebSocket(url.toString());

ws.on("open", () => {
  // 1. Send the config frame before any audio.
  ws.send(CONFIG);
  // 2. Stream audio as binary frames, then signal end of audio.
  const stream = fs.createReadStream(AUDIO_FILE, { highWaterMark: CHUNK_SIZE });
  stream.on("data", (chunk) => ws.send(chunk));
  stream.on("end", () => ws.send(""));
});

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  switch (event.type) {
    case "clip":
      console.log(`[${event.clip.speaker_label}] ${event.clip.text}`);
      break;
    case "conversation_type":
      console.log(`Conversation type: ${event.pick.name}`);
      break;
    case "participant_role":
      console.log(`Role for ${event.pick.speaker_label}: ${event.pick.name}`);
      break;
    case "behavior_detection":
      console.log(`Behavior ${event.detection.behavior_name}: detected=${event.detection.detected}`);
      break;
    case "topics":
      console.log(`Topics: ${event.topics.join(", ")}`);
      break;
    case "topic_sentiment": {
      const ts = event.topic_sentiment;
      console.log(`Sentiment (${ts.topic}, ${ts.speaker_label}): ${ts.sentiment_label}`);
      break;
    }
    case "summary":
      console.log(`Summary: ${event.text}`);
      break;
    case "done":
      console.log(`Done. Duration: ${event.duration_ms}ms`);
      ws.close();
      break;
    case "error":
      console.error(`Error: ${event.error}`);
      ws.close();
      break;
  }
});

ws.on("error", (err) => console.error("WebSocket error:", err.message));
WebSocket APIs cannot be tested with cURL. For command-line testing, use websocat.