Velma Streaming

Real-time conversation analysis over WebSocket. Stream audio to the server and receive analysis events as they are produced: transcribed clips, the inferred conversation type, participant roles, behavior detections, topics, per-speaker topic sentiment, and a summary — terminating with a final done event.

Endpoint

wss://platform.modulate.ai/api/velma-2-streaming

Authentication

Pass your API key as a query parameter when opening the connection.

wss://platform.modulate.ai/api/velma-2-streaming?api_key=YOUR_API_KEY

Unlike the batch endpoint, the streaming API does not use an X-API-Key header. The key must be in the query string at connection time.

See Authentication and rate limits for how to obtain and manage API keys.

Connection parameters

Connection parameters carry only the API key and audio-format hints. All analysis configuration is sent in the config frame, not the query string.

Parameter	Type	Required	Description
`api_key`	string	Yes	Your API key
`audio_format`	string	Raw formats only	Audio encoding. Omit for self-describing formats; required for raw/headerless formats. May optionally be set to override auto-detection
`sample_rate`	integer	Raw formats only	Sample rate in Hz. Required for raw formats; must not be set otherwise
`num_channels`	integer	Raw formats only	Number of channels (1–8). Required for raw formats; must not be set otherwise

Supported audio formats

Self-describing formats (auto-detected from file headers — no extra parameters needed): AAC, AIFF, FLAC, MP3, OGG, WAV, WebM

OGG / Opus: OGG is a container that may carry Opus-encoded audio. Pass audio_format=ogg, not audio_format=opus.

Raw / headerless formats (require audio_format, sample_rate, and num_channels): s8, s16le, s16be, s24le, s24be, s32le, s32be, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw

Valid sample rates: 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000
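The parameter rules above can be checked client-side before connecting. The following is a minimal sketch, not part of the API: build_stream_url is a hypothetical helper, and applying the sample-rate list only to raw formats is an assumption drawn from where the list appears.

```python
from urllib.parse import urlencode

BASE_URL = "wss://platform.modulate.ai/api/velma-2-streaming"

# Raw / headerless encodings and sample rates listed in this section.
RAW_FORMATS = {
    "s8", "s16le", "s16be", "s24le", "s24be", "s32le", "s32be",
    "u8", "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
}
VALID_SAMPLE_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000}


def build_stream_url(api_key, audio_format=None, sample_rate=None, num_channels=None):
    """Hypothetical helper: build the connection URL and enforce the raw-format rules."""
    params = {"api_key": api_key}
    if audio_format in RAW_FORMATS:
        # Raw audio needs all three hints.
        if sample_rate not in VALID_SAMPLE_RATES:
            raise ValueError(f"unsupported sample_rate for raw audio: {sample_rate}")
        if num_channels is None or not 1 <= num_channels <= 8:
            raise ValueError(f"num_channels must be 1-8 for raw audio: {num_channels}")
        params.update(audio_format=audio_format, sample_rate=sample_rate,
                      num_channels=num_channels)
    else:
        # Self-describing audio: sample_rate and num_channels must not be set.
        if sample_rate is not None or num_channels is not None:
            raise ValueError("sample_rate and num_channels are only valid with raw formats")
        if audio_format is not None:
            params["audio_format"] = audio_format  # optional override of auto-detection
    return f"{BASE_URL}?{urlencode(params)}"
```

For example, build_stream_url(key, "s16le", 16000, 1) yields a URL carrying all four parameters, while build_stream_url(key) carries only api_key.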

Configuration

After the connection opens, send exactly one text frame before any audio:

The literal string default to use the built-in default configuration, or
A JSON-encoded BatchConfig describing the conversation types, participant roles, behaviors, STT options, and which aggregate outputs (topics, sentiments, summary) to produce.

The BatchConfig schema is identical to the batch endpoint — see the Batch reference for the full field list. Behaviors may be referenced from the preset catalog with the preset:<identifier> syntax; list available presets with List behavior presets.

{
  "behaviors": ["preset:empathy", "preset:complaints"],
  "stt": { "speaker_diarization": true, "emotion_signal": true },
  "produce_topics": true,
  "produce_topic_sentiments": true,
  "produce_summary": true
}

You must send the config frame before any audio. Sending a binary audio frame before the config frame is a protocol error and closes the connection with code 1003.
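As a small illustration of the two accepted config frames, the sketch below returns either the literal string default or a serialized BatchConfig. make_config_frame is a hypothetical name, and the sketch does not validate the config against the BatchConfig schema.

```python
import json


def make_config_frame(config=None):
    """Return the text for the first frame: "default" or a JSON-encoded BatchConfig."""
    if config is None:
        return "default"
    return json.dumps(config)


# A config dict using fields shown in the example above.
frame = make_config_frame({
    "behaviors": ["preset:empathy"],
    "produce_summary": True,
})
```

The returned string is what would be sent as the single text frame before any binary audio.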

Connection flow

1. Connect to the WebSocket endpoint with api_key (and audio_format, sample_rate, num_channels for raw formats).
2. Send one text frame: either the literal string default or a JSON-encoded BatchConfig.
3. Stream audio as binary WebSocket frames. Frames can be any size.
4. Receive analysis events as JSON text frames as results are produced.
5. Send an empty text frame ("") to signal end of audio.
6. Receive a final done event with the total audio duration.
7. The connection closes automatically.
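The client-to-server frame order in the steps above can be enforced locally before anything reaches the socket. This is a sketch under stated assumptions: FrameSequencer is a hypothetical class, and it only produces (kind, payload) tuples rather than sending them.

```python
class FrameSequencer:
    """Hypothetical client-side guard: config first, then audio, then end of audio."""

    def __init__(self):
        self.state = "need_config"

    def config(self, text="default"):
        # Exactly one config text frame, before any audio.
        if self.state != "need_config":
            raise RuntimeError("config frame must be sent exactly once, first")
        self.state = "streaming"
        return ("text", text)

    def audio(self, chunk):
        # Binary frames of any size are allowed once the config has been sent.
        if self.state != "streaming":
            raise RuntimeError("audio is only valid after config and before end of audio")
        return ("binary", bytes(chunk))

    def end_of_audio(self):
        # An empty text frame signals end of audio.
        if self.state != "streaming":
            raise RuntimeError("end-of-audio is only valid while streaming")
        self.state = "awaiting_done"
        return ("text", "")


seq = FrameSequencer()
frames = [seq.config(), seq.audio(b"\x00\x01"), seq.end_of_audio()]
```

A real client would pass each tuple to its WebSocket library's text or binary send call. Catching an ordering mistake locally avoids a 1003 close from the server.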

Server events

The server emits JSON text frames as results are produced. Every event carries a type discriminator. The payload objects (clip, conversation-type pick, participant-role pick, behavior detection, topic sentiment) use the same schemas the batch endpoint returns — see the Batch reference for full field definitions. The partial_clip and clip_update payloads are streaming-only and documented below.

`type`	Payload key	Description
`clip`	`clip`	A transcribed clip with speaker label, timing, and optional emotion / accent / deepfake signals
`partial_clip`	`partial_clip`	An in-progress clip streamed while an utterance is still being spoken, before it finalizes
`clip_update`	`clip_update`	Refined values for a previously finalized clip
`conversation_type`	`pick`	The inferred or default conversation-type classification for the session
`participant_role`	`pick`	The inferred or default role for a speaker
`behavior_detection`	`detection`	A per-behavior detection result
`topics`	`topics`	The aggregated list of conversation topics; each event fully replaces the previous one
`topic_sentiment`	`topic_sentiment`	Per-speaker sentiment for one aggregated topic; a later event supersedes an earlier one for the same topic and speaker
`summary`	`text`	A free-form summary of the conversation; each event fully replaces the previous one
`done`	`duration_ms`	Streaming completed; carries the total audio duration
`error`	`error`	An error occurred; the connection closes after this event
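Because the payload key differs by event type, a lookup table keeps dispatch code short. The sketch below mirrors the table above. unpack_event is a hypothetical helper, and returning None for unrecognized types is an assumption about how a client might tolerate events added later.

```python
import json

# Payload key for each event type, copied from the table above.
PAYLOAD_KEYS = {
    "clip": "clip",
    "partial_clip": "partial_clip",
    "clip_update": "clip_update",
    "conversation_type": "pick",
    "participant_role": "pick",
    "behavior_detection": "detection",
    "topics": "topics",
    "topic_sentiment": "topic_sentiment",
    "summary": "text",
    "done": "duration_ms",
    "error": "error",
}


def unpack_event(raw):
    """Parse one JSON text frame into (type, payload); unknown types give (type, None)."""
    event = json.loads(raw)
    etype = event["type"]
    key = PAYLOAD_KEYS.get(etype)
    payload = event.get(key) if key is not None else None
    return etype, payload
```

A receive loop can then branch on the returned type and hand the payload to a per-type handler.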

`clip`

A transcribed clip. Emitted progressively as speech is processed.

{
  "type": "clip",
  "clip": {
    "clip_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "text": "Thanks for calling support, how can I help?",
    "start_ms": 0,
    "duration_ms": 3200,
    "speaker_label": "speaker_1",
    "language": "en",
    "emotion": null,
    "accent": null,
    "deepfake_score": null
  }
}

The emotion, accent, and deepfake_score fields are null unless the corresponding STT options are enabled in the config frame.

`partial_clip`

An in-progress clip streamed while an utterance is still being spoken, before it finalizes. Multiple partials may be emitted for the same clip_uuid as the utterance grows; the eventual clip event for that utterance reuses the same clip_uuid, so a run of partials can be correlated with its final clip.

{
  "type": "partial_clip",
  "partial_clip": {
    "clip_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "text": "Thanks for calling support,",
    "start_ms": 0,
    "end_ms": 1800,
    "speaker_label": "speaker_1",
    "emotion": null,
    "accent": null,
    "deepfake_score": null
  }
}

end_ms is the current end of the in-progress utterance and grows as the utterance extends; it is null until available. The finalized clip reports duration_ms instead of end_ms. speaker_label is null until diarization resolves the speaker. emotion, accent, and deepfake_score carry the latest in-progress values, or null when no value is available yet.
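One way to track in-progress utterances is to key them by clip_uuid and drop the provisional entry when the final clip arrives. This is a minimal sketch: LiveTranscript and clip_end_ms are hypothetical names. It assumes each partial carries the full text so far, and that a finalized clip ends at start_ms + duration_ms.

```python
class LiveTranscript:
    """Hypothetical view model: partials are provisional, the final clip replaces them."""

    def __init__(self):
        self.in_progress = {}  # clip_uuid -> latest partial_clip payload
        self.final = {}        # clip_uuid -> finalized clip payload

    def on_partial(self, partial):
        # Assumption: the newest partial for a clip_uuid replaces the previous one.
        self.in_progress[partial["clip_uuid"]] = partial

    def on_clip(self, clip):
        # The final clip reuses the partials' clip_uuid, so the provisional entry is dropped.
        self.in_progress.pop(clip["clip_uuid"], None)
        self.final[clip["clip_uuid"]] = clip


def clip_end_ms(clip):
    """End time of a finalized clip, which reports duration_ms rather than end_ms."""
    return clip["start_ms"] + clip["duration_ms"]
```

A UI could render self.final as committed text and self.in_progress as tentative text, keeping in mind that speaker_label and end_ms may still be null on a partial.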

`clip_update`

Refined values for a previously finalized clip; clip_update.clip_uuid matches the clip_uuid of an earlier clip event. A finalized clip may receive any number of clip_update events (including none), each emitted after that clip’s clip event — possibly interleaved with events for other clips — and always before the done event on clean completion. For each field present, the latest received value supersedes the value on the clip event and on any earlier clip_update for that clip.

{
  "type": "clip_update",
  "clip_update": {
    "clip_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "emotion": "Calm",
    "accent": "American"
  }
}
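The supersede rule can be sketched as a field-by-field merge into the stored clip. apply_clip_update is a hypothetical helper. Treating any key present in the update as an override (including one whose value is null) and ignoring updates for unknown clips are both assumptions of this sketch.

```python
def apply_clip_update(clips, update):
    """Merge a clip_update payload into the stored clip with the same clip_uuid.

    `clips` maps clip_uuid -> clip payload. Returns True if a clip was updated.
    """
    clip = clips.get(update["clip_uuid"])
    if clip is None:
        return False  # no matching clip stored; this sketch ignores the update
    for field, value in update.items():
        if field != "clip_uuid":
            clip[field] = value  # a present field supersedes the earlier value
    return True
```

Fields absent from the update keep whatever value the clip event or an earlier clip_update supplied.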

`conversation_type`

The conversation-type classification for the session.

{
  "type": "conversation_type",
  "pick": {
    "conversation_type_uuid": "8f1d2c3b-4a5e-6f70-8192-a3b4c5d6e7f8",
    "name": "Customer support call",
    "confidence": 0.92,
    "selection_source": "inferred",
    "detail": "Caller is seeking help resolving a billing issue.",
    "reasoning": "The agent greets the caller and offers assistance with an account problem."
  }
}

selection_source is one of inferred, auto_selected_single_option, or default.

`participant_role`

A role assignment for one speaker. Emitted once per identified speaker.

{
  "type": "participant_role",
  "pick": {
    "speaker_label": "speaker_1",
    "participant_role_uuid": "2b7c9d10-1e2f-3a4b-5c6d-7e8f90a1b2c3",
    "name": "Support agent",
    "confidence": 0.88,
    "selection_source": "inferred",
    "detail": "Speaker offers assistance and asks diagnostic questions.",
    "reasoning": "Greets on behalf of the company and drives the troubleshooting."
  }
}
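Role picks are often used to relabel transcript lines. The sketch below keeps a speaker-to-role map and falls back to the raw label when no role has arrived yet. The names roles, on_participant_role, and display_name are hypothetical.

```python
roles = {}  # speaker_label -> role name from participant_role events


def on_participant_role(pick):
    """Record the role name for the speaker named in a participant_role pick."""
    roles[pick["speaker_label"]] = pick["name"]


def display_name(speaker_label):
    """Role name if known, else the raw speaker label; "unknown" when the label is null."""
    if speaker_label is None:
        return "unknown"
    return roles.get(speaker_label, speaker_label)
```

Since role events may arrive after the clips they describe, lines already rendered with a raw label may need to be refreshed once the role is known.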

`behavior_detection`

A per-behavior detection result. Emitted once per configured behavior.

{
  "type": "behavior_detection",
  "detection": {
    "behavior_uuid": "c4d5e6f7-8091-a2b3-c4d5-e6f708192a3b",
    "behavior_name": "Empathy",
    "speaker_label": "speaker_1",
    "detected": true,
    "confidence": 0.81,
    "evidence_clip_uuids": ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"],
    "definitive_clip_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "reasoning": "Agent acknowledges the customer's frustration before resolving the issue."
  }
}
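The evidence_clip_uuids list refers back to clips received earlier on the same connection, so a client that stores clips by clip_uuid can show the supporting text. This is a sketch with a hypothetical helper name. Skipping UUIDs that have not been stored is an assumption of the sketch, not documented behavior.

```python
def evidence_texts(detection, clips_by_uuid):
    """Return the text of each evidence clip that is present in clips_by_uuid."""
    return [
        clips_by_uuid[uuid]["text"]
        for uuid in detection.get("evidence_clip_uuids") or []
        if uuid in clips_by_uuid
    ]
```

The same lookup works for definitive_clip_uuid when a single representative clip is wanted.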

`topics`

The aggregated list of conversation topics. May be emitted more than once as the stream progresses; each event fully replaces the previous topics event, so always treat the latest as authoritative and never merge with earlier ones.

{
  "type": "topics",
  "topics": ["billing", "refund policy", "account access"]
}

`topic_sentiment`

Per-speaker sentiment for one aggregated topic, keyed by topic and speaker. May be emitted more than once as the stream progresses; a later event supersedes an earlier one for the same topic and speaker.

{
  "type": "topic_sentiment",
  "topic_sentiment": {
    "topic": "billing",
    "speaker_label": "speaker_2",
    "sentiment_score": -0.4,
    "sentiment_label": "negative"
  }
}

sentiment_score ranges from -1.0 (most negative) to 1.0 (most positive).
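The replacement rules for topics and topic_sentiment events map naturally onto a list that is overwritten and a dictionary keyed by topic and speaker. TopicState is a hypothetical class. Whether sentiments for topics missing from a later topics event should be discarded is not specified here, so this sketch keeps them.

```python
class TopicState:
    """Hypothetical holder for the latest topics and per-speaker topic sentiment."""

    def __init__(self):
        self.topics = []
        self.sentiments = {}  # (topic, speaker_label) -> topic_sentiment payload

    def on_topics(self, topics):
        # Each topics event fully replaces the previous one; never merge.
        self.topics = list(topics)

    def on_topic_sentiment(self, ts):
        # A later event for the same topic and speaker supersedes the earlier one.
        self.sentiments[(ts["topic"], ts["speaker_label"])] = ts
```

The same overwrite-on-arrival handling applies to summary events, which also replace their predecessor.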

`summary`

A free-form summary of the conversation. May be emitted more than once as the stream progresses; each event fully replaces the previous summary event, so always treat the latest as authoritative.

{
  "type": "summary",
  "text": "The customer called about a duplicate charge. The agent confirmed the error, issued a refund, and explained the billing cycle."
}

`done`

Sent after all audio has been processed, in response to the end-of-stream signal.

{
  "type": "done",
  "duration_ms": 45000
}

`error`

Sent if processing fails. The connection closes after this event.

{
  "type": "error",
  "error": "Internal server error"
}

WebSocket close codes

Code	Meaning
`1000`	Normal closure after a successful `done` message
`1003`	Protocol error — invalid or incomplete config, audio sent before the config frame, or an unsupported audio format / sample rate / channel count
`4003`	Request could not be validated, or is not permitted (auth failure, missing model access)
`4029`	Insufficient credits, or concurrent-connection limit exceeded

An error JSON message is sent before the connection closes (except on 1000).
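For logging, the close codes above can be turned into readable messages with a small lookup. The names below are hypothetical, and the fallback text for codes not listed in the table is an assumption of this sketch.

```python
# Meanings paraphrased from the close-code table above.
CLOSE_CODES = {
    1000: "Normal closure after a successful done message",
    1003: "Protocol error (bad config, audio before config, or unsupported audio parameters)",
    4003: "Request could not be validated or is not permitted",
    4029: "Insufficient credits or concurrent-connection limit exceeded",
}


def describe_close(code):
    """Human-readable meaning for a close code, with a fallback for unlisted codes."""
    return CLOSE_CODES.get(code, f"Unrecognized close code {code}")


def is_clean_close(code):
    """True only for the normal closure that follows a done event."""
    return code == 1000
```

With aiohttp, the code is available as ws.close_code once the receive loop has ended.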

Rate limits

Concurrent connection limits apply per organization.
Monthly usage limits (in audio hours) apply per organization.
Connections that exceed limits are rejected during the WebSocket handshake with close code 4029.

See Authentication and rate limits for retry guidance.

Examples

The examples below send a JSON BatchConfig as the config frame. To use the built-in defaults instead, send the literal string "default" in place of the JSON.

Python (aiohttp)

import asyncio
import json
import aiohttp

API_KEY = "YOUR_API_KEY"
AUDIO_FILE = "conversation.ogg"
CHUNK_SIZE = 8192

# Either the string "default" or a JSON-encoded BatchConfig.
CONFIG = json.dumps({
    "behaviors": ["preset:empathy", "preset:complaints"],
    "stt": {"speaker_diarization": True, "emotion_signal": True},
    "produce_topics": True,
    "produce_topic_sentiments": True,
    "produce_summary": True,
})

async def analyze_streaming():
    url = f"wss://platform.modulate.ai/api/velma-2-streaming?api_key={API_KEY}"

    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(url) as ws:
            # 1. Send the config frame before any audio.
            await ws.send_str(CONFIG)

            async def send_audio():
                with open(AUDIO_FILE, "rb") as f:
                    while chunk := f.read(CHUNK_SIZE):
                        await ws.send_bytes(chunk)
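                        # Crude pacing: one 8192-byte chunk about every 2 s (~4,000 bytes/s),
                        # so the file is not sent in a single burst. Tune to your audio's byte rate.
                        await asyncio.sleep(CHUNK_SIZE / 4000)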
                await ws.send_str("")  # 2. Signal end of audio.

            send_task = asyncio.create_task(send_audio())

            try:
                async for msg in ws:
                    if msg.type == aiohttp.WSMsgType.TEXT:
                        event = json.loads(msg.data)
                        etype = event["type"]
                        if etype == "clip":
                            c = event["clip"]
                            print(f"[{c['speaker_label']}] {c['text']}")
                        elif etype == "partial_clip":
                            pc = event["partial_clip"]
                            print(f"[partial {pc['clip_uuid'][:8]}] {pc['text']}")
                        elif etype == "clip_update":
                            cu = event["clip_update"]
                            print(f"[update {cu['clip_uuid'][:8]}] emotion={cu.get('emotion')} accent={cu.get('accent')}")
                        elif etype == "conversation_type":
                            print(f"Conversation type: {event['pick']['name']}")
                        elif etype == "participant_role":
                            p = event["pick"]
                            print(f"Role for {p['speaker_label']}: {p['name']}")
                        elif etype == "behavior_detection":
                            d = event["detection"]
                            print(f"Behavior {d['behavior_name']}: detected={d['detected']}")
                        elif etype == "topics":
                            print(f"Topics: {', '.join(event['topics'])}")
                        elif etype == "topic_sentiment":
                            ts = event["topic_sentiment"]
                            print(f"Sentiment ({ts['topic']}, {ts['speaker_label']}): {ts['sentiment_label']}")
                        elif etype == "summary":
                            print(f"Summary: {event['text']}")
                        elif etype == "done":
                            print(f"Done. Duration: {event['duration_ms']}ms")
                            break
                        elif etype == "error":
                            print(f"Error: {event['error']}")
                            break
                    elif msg.type in (
                        aiohttp.WSMsgType.ERROR,
                        aiohttp.WSMsgType.CLOSE,
                        aiohttp.WSMsgType.CLOSED,
                    ):
                        break
            finally:
                if not send_task.done():
                    send_task.cancel()

asyncio.run(analyze_streaming())

JavaScript (Node.js)

const WebSocket = require("ws");
const fs = require("fs");

const API_KEY = "YOUR_API_KEY";
const AUDIO_FILE = "conversation.ogg";
const CHUNK_SIZE = 8192;

// Either the string "default" or a JSON-encoded BatchConfig.
const CONFIG = JSON.stringify({
  behaviors: ["preset:empathy", "preset:complaints"],
  stt: { speaker_diarization: true, emotion_signal: true },
  produce_topics: true,
  produce_topic_sentiments: true,
  produce_summary: true,
});

const url = new URL("wss://platform.modulate.ai/api/velma-2-streaming");
url.searchParams.set("api_key", API_KEY);

const ws = new WebSocket(url.toString());

ws.on("open", () => {
  // 1. Send the config frame before any audio.
  ws.send(CONFIG);

  // 2. Stream audio as binary frames, then signal end of audio.
  const stream = fs.createReadStream(AUDIO_FILE, { highWaterMark: CHUNK_SIZE });
  stream.on("data", (chunk) => ws.send(chunk));
  stream.on("end", () => ws.send(""));
});

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  switch (event.type) {
    case "clip":
      console.log(`[${event.clip.speaker_label}] ${event.clip.text}`);
      break;
    case "partial_clip":
      console.log(`[partial ${event.partial_clip.clip_uuid.slice(0, 8)}] ${event.partial_clip.text}`);
      break;
    case "clip_update": {
      const cu = event.clip_update;
      console.log(`[update ${cu.clip_uuid.slice(0, 8)}] emotion=${cu.emotion} accent=${cu.accent}`);
      break;
    }
    case "conversation_type":
      console.log(`Conversation type: ${event.pick.name}`);
      break;
    case "participant_role":
      console.log(`Role for ${event.pick.speaker_label}: ${event.pick.name}`);
      break;
    case "behavior_detection":
      console.log(`Behavior ${event.detection.behavior_name}: detected=${event.detection.detected}`);
      break;
    case "topics":
      console.log(`Topics: ${event.topics.join(", ")}`);
      break;
    case "topic_sentiment": {
      const ts = event.topic_sentiment;
      console.log(`Sentiment (${ts.topic}, ${ts.speaker_label}): ${ts.sentiment_label}`);
      break;
    }
    case "summary":
      console.log(`Summary: ${event.text}`);
      break;
    case "done":
      console.log(`Done. Duration: ${event.duration_ms}ms`);
      ws.close();
      break;
    case "error":
      console.error(`Error: ${event.error}`);
      ws.close();
      break;
  }
});

ws.on("error", (err) => console.error("WebSocket error:", err.message));

WebSocket APIs cannot be tested with cURL. For command-line testing, use websocat.

Related

Velma overview — what Velma analyzes and when to use batch vs streaming
Velma Batch — the BatchConfig and event payload schemas in full
List behavior presets — discover behavior preset identifiers
Authentication and rate limits
