BatchConfig schema and return the same set of outputs — the difference is protocol and response shape.
| | Batch | Streaming |
|---|---|---|
| Endpoint | POST /api/velma-2-batch | wss://modulate-developer-apis.com/api/velma-2-streaming |
| Auth | X-API-Key header | api_key query parameter |
| Config | config form field (JSON string or "default") | First text frame (JSON or "default") |
| Response | Single JSON object | Stream of typed events |
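The authentication difference in the table can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the function names are hypothetical, and only the header name, query parameter name, and streaming URL come from the table above.

```python
from urllib.parse import urlencode

# Streaming URL exactly as listed in the comparison table.
STREAMING_URL = "wss://modulate-developer-apis.com/api/velma-2-streaming"


def batch_auth_headers(api_key: str) -> dict[str, str]:
    """Batch: the key travels in the X-API-Key request header."""
    return {"X-API-Key": api_key}


def streaming_connect_url(api_key: str) -> str:
    """Streaming: the key travels as the api_key query parameter."""
    return f"{STREAMING_URL}?{urlencode({'api_key': api_key})}"
```

Because the streaming key is part of the URL, it is worth keeping connection URLs out of logs.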
Configuration
Both endpoints use the same BatchConfig schema. To use Velma’s built-in default behavior set, send the literal string "default" instead of a full config.
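Since the config is always transmitted as text (either "default" or a JSON string), one small helper can serve both endpoints. The sketch below is an assumption about how a client might normalize its input; `encode_config` is a hypothetical name, and the preset identifier in the usage line is made up.

```python
import json
from typing import Any, Mapping, Union


def encode_config(config: Union[str, Mapping[str, Any], None] = None) -> str:
    """Return the text to send as the config form field or first text frame."""
    if config is None or config == "default":
        return "default"
    if isinstance(config, str):
        # Already a JSON string: parse once so malformed JSON fails locally
        # (json.loads raises ValueError) instead of on the server.
        json.loads(config)
        return config
    # A dict-like BatchConfig: serialize it.
    return json.dumps(config)
```

For example, `encode_config({"behaviors": ["preset:example"]})` yields a JSON string, while `encode_config()` yields `"default"`.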
Conversation types
A conversation type tells Velma what kind of interaction it is analyzing. Velma uses this to contextualize behavior detection and role assignment. You can define multiple types; Velma infers which one best matches, or uses default_conversation_type as the fallback.
Participant roles
Roles describe the speakers Velma expects. Scope roles to specific conversation types via applies_to_conversation_type_uuids. If the field is omitted, the role applies to all types.
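The scoping rule can be illustrated with a client-side filter. This is only a sketch of the rule as described: `roles_for_type` and the `name` key are hypothetical, and treating an empty list as "applies to no type" is an assumption the text does not confirm.

```python
from typing import Any


def roles_for_type(roles: list[dict[str, Any]], conversation_type_uuid: str) -> list[dict[str, Any]]:
    """Return the roles that apply to one conversation type."""
    applicable = []
    for role in roles:
        scope = role.get("applies_to_conversation_type_uuids")
        # Field omitted (or null): the role applies to every conversation type.
        # Assumption: an empty list scopes the role to no type at all.
        if scope is None or conversation_type_uuid in scope:
            applicable.append(role)
    return applicable
```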
Behaviors
The behaviors array accepts two kinds of entries, full BehaviorDef objects and preset reference strings, and you can mix both in the same array.
Preset reference: a string in the form "preset:<identifier>". Velma expands it into the full behavior definition before processing. Use GET /api/velma-2-batch/list-presets or GET /api/velma-2-streaming/list-presets to discover available identifiers.
Full BehaviorDef: supply all four required fields yourself. A full definition takes precedence over any preset entry with the same UUID.
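A client can check a mixed behaviors array before sending it. The helper below is a hedged sketch: `split_behaviors` is a hypothetical name, it only checks the "preset:<identifier>" string form, and it does not validate BehaviorDef fields because their names are not listed here.

```python
from typing import Any

PRESET_PREFIX = "preset:"


def split_behaviors(entries: list[Any]) -> tuple[list[str], list[dict[str, Any]]]:
    """Separate preset identifiers from full BehaviorDef objects."""
    presets: list[str] = []
    full_defs: list[dict[str, Any]] = []
    for entry in entries:
        if isinstance(entry, str):
            identifier = entry[len(PRESET_PREFIX):]
            if not entry.startswith(PRESET_PREFIX) or not identifier:
                raise ValueError(f"not a preset reference: {entry!r}")
            presets.append(identifier)
        elif isinstance(entry, dict):
            full_defs.append(entry)
        else:
            raise TypeError(f"unsupported behaviors entry: {entry!r}")
    return presets, full_defs
```

The collected identifiers can then be compared against the output of one of the list-presets endpoints to catch unknown presets before a request is made.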
STT options
Control what transcription data appears in clip outputs:
| Option | Type | Default | What it adds |
|---|---|---|---|
| speaker_diarization | boolean | true | Per-speaker clip attribution |
| emotion_signal | boolean | false | Per-clip emotion label |
| accent_signal | boolean | false | Per-clip accent label |
| deepfake_signal | boolean | false | Per-clip deepfake_score (0–1) |
| pii_phi_tagging | boolean | false | Sensitive spans wrapped in entity tags |
| language | string | auto | Force a specific language code |
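The defaults in the table can be mirrored in a small dataclass so that a client only sends what it changes. This is a sketch under assumptions: `SttOptions` and `non_default_options` are hypothetical names, "auto" is treated as the literal default value, and where these keys sit inside BatchConfig is not specified here, so the helper returns a flat dict.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class SttOptions:
    # Defaults copied from the STT options table.
    speaker_diarization: bool = True
    emotion_signal: bool = False
    accent_signal: bool = False
    deepfake_signal: bool = False
    pii_phi_tagging: bool = False
    language: str = "auto"  # assumption: "auto" is the literal default value


def non_default_options(options: SttOptions) -> dict[str, Any]:
    """Return only the options that differ from the documented defaults."""
    defaults = asdict(SttOptions())
    return {key: value for key, value in asdict(options).items() if value != defaults[key]}
```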
Aggregate outputs
| Option | Type | Default |
|---|---|---|
| produce_topics | boolean | true |
| produce_topic_sentiments | boolean | true |
| produce_summary | boolean | true |
Set any of these to false to suppress the corresponding output.
Batch endpoint
POST /api/velma-2-batch — submit a complete audio file, receive a single JSON response.
Request — multipart/form-data:
| Field | Type | Required | Description |
|---|---|---|---|
| upload_file | binary | Yes | Audio file. Max 100 MB. Supported formats: AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM. |
| config | string | No | JSON-encoded BatchConfig, or the literal string "default". Defaults to "default" if omitted. |
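A request with these two form fields can be assembled with the Python standard library alone. The sketch below makes several assumptions that should be checked against a real account: the caller supplies the HTTPS base URL (the batch host is not stated here), 100 MB is read as 100 × 1024 × 1024 bytes, the file extensions stand in for the listed formats, and `build_batch_request` is a hypothetical helper. It builds the request without sending it.

```python
import uuid
from pathlib import Path
from urllib.request import Request

MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # "Max 100 MB"; binary megabytes assumed
# Assumed file extensions for the supported formats listed in the table.
SUPPORTED_EXTENSIONS = {".aac", ".aiff", ".flac", ".mp3", ".mp4", ".mov",
                        ".ogg", ".opus", ".wav", ".webm"}


def build_batch_request(base_url: str, api_key: str, filename: str,
                        audio: bytes, config: str = "default") -> Request:
    """Build (but do not send) a multipart POST to /api/velma-2-batch."""
    if not audio:
        raise ValueError("audio file is empty")
    if len(audio) > MAX_UPLOAD_BYTES:
        raise ValueError("audio file exceeds 100 MB")
    if Path(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported file extension: {filename}")

    boundary = uuid.uuid4().hex
    # Text part (config) followed by the header of the binary part (upload_file).
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="config"\r\n\r\n'
        f"{config}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="upload_file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")

    return Request(
        url=base_url.rstrip("/") + "/api/velma-2-batch",
        data=head + audio + tail,
        headers={
            "X-API-Key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen`, and the response body parsed with `json.load`. The local checks only save a round trip; the server remains the authority on what it accepts.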
BatchResponse:
| Field | Type | Description |
|---|---|---|
| duration_ms | integer | Total audio duration |
| clips | array | Transcribed segments (see Clip) |
| conversation_type_pick | object or null | Inferred conversation type |
| participant_role_picks | array | Per-speaker role assignments |
| behaviors | array | Per-behavior detection results (see BehaviorDetection) |
| topics | array | Extracted topic strings |
| topic_sentiments | array | Per-speaker sentiment per topic |
| summary | string or null | Narrative summary |
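When reading a BatchResponse, the two nullable fields deserve explicit handling. The following sketch reads only the top-level fields from the table; `summarize_response` is a hypothetical helper, and it does not look inside clips or behaviors because their shapes are documented elsewhere.

```python
import json
from typing import Any


def summarize_response(raw: str) -> dict[str, Any]:
    """Pull a few top-level facts out of a BatchResponse JSON string."""
    resp = json.loads(raw)
    return {
        "duration_s": resp["duration_ms"] / 1000,
        "clip_count": len(resp.get("clips", [])),
        # conversation_type_pick may be null.
        "has_conversation_type": resp.get("conversation_type_pick") is not None,
        "topics": list(resp.get("topics", [])),
        # summary may be null; normalize to an empty string.
        "summary": resp.get("summary") or "",
    }
```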
| Status | Meaning |
|---|---|
| 400 | Unsupported file format, empty file, or malformed request |
| 403 | Request not permitted |
| 422 | Invalid config value: malformed JSON or unknown preset identifier |
| 429 | Insufficient credits |
| 500 | Internal server error |
| 502 | Request could not be validated or completed; retry |
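A client can encode this table directly. In the sketch below, only 502 is treated as retryable because it is the only status the table marks for retry; whether 500 is also worth retrying is not stated, so it is left out. The names `STATUS_MEANINGS`, `describe_status`, and `should_retry` are hypothetical.

```python
# Meanings copied from the status table.
STATUS_MEANINGS = {
    400: "Unsupported file format, empty file, or malformed request",
    403: "Request not permitted",
    422: "Invalid config value",
    429: "Insufficient credits",
    500: "Internal server error",
    502: "Request could not be validated or completed",
}


def describe_status(status: int) -> str:
    """Return the documented meaning of an error status, if there is one."""
    return STATUS_MEANINGS.get(status, "undocumented status")


def should_retry(status: int) -> bool:
    """Only 502 is documented as worth retrying."""
    return status == 502
```

Note that 429 here signals insufficient credits, so retrying it after a delay is unlikely to help.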
Streaming events
Velma emits JSON events throughout a streaming session. Every event has a type field.
clip
A transcribed segment of speech. Emitted in near real time.
emotion, accent, and deepfake_score are non-null only when their corresponding STT options are enabled.
conversation_type
Velma’s pick for the conversation type, emitted once enough context is available.
selection_source is one of inferred, auto_selected_single_option, or default.
participant_role
A per-speaker role assignment. One event per speaker label.
behavior_detection
A per-behavior verdict. Emitted for each behavior once Velma has enough audio to decide.
skipped: true means Velma did not attempt detection — check skip_reason. error_reason is non-null if detection failed.
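The three fields named above suggest a simple order of checks when handling a behavior_detection event. This is a sketch, not the documented schema: `describe_detection` is hypothetical, it reads only skipped, skip_reason, and error_reason, and the order (skipped first, then error) is an assumption.

```python
from typing import Any


def describe_detection(event: dict[str, Any]) -> str:
    """Classify a behavior_detection event as skipped, failed, or completed."""
    if event.get("skipped"):
        # Velma did not attempt detection; skip_reason says why.
        return f"skipped: {event.get('skip_reason')}"
    if event.get("error_reason") is not None:
        # Detection was attempted but failed.
        return f"failed: {event['error_reason']}"
    return "completed"
```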
topics
Aggregated list of subjects discussed. Emitted at end of stream.
topic_sentiment
Per-speaker sentiment for each topic. One event per speaker per topic.
sentiment_score ranges from −1 (strongly negative) to +1 (strongly positive).
summary
A free-form narrative summary. Emitted at end of stream.
done
Signals streaming is complete. Always the final event.
error
Emitted if a processing error occurs. The connection closes after this event.
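The event types above can be consumed with a loop that dispatches on the type field and stops at done or error. The sketch below works on any iterable of JSON text frames, so it is independent of the WebSocket library used; `collect_events` and `StreamError` are hypothetical names, and the payload fields in the usage example are made up.

```python
import json
from collections import defaultdict
from typing import Any, Iterable


class StreamError(RuntimeError):
    """Raised when the stream delivers an error event."""


def collect_events(frames: Iterable[str]) -> dict[str, list[dict[str, Any]]]:
    """Group streamed events by type until done (or raise on error)."""
    events: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for frame in frames:
        event = json.loads(frame)
        kind = event["type"]  # every event carries a type field
        if kind == "done":
            break  # always the final event
        if kind == "error":
            # The connection closes after this event, so stop reading.
            raise StreamError(json.dumps(event))
        events[kind].append(event)
    return dict(events)
```

For example, `collect_events(['{"type": "clip", "text": "hi"}', '{"type": "done"}'])` returns a dict with a single "clip" entry.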
WebSocket close codes
| Code | Meaning |
|---|---|
| 1000 | Normal closure after the done event |
| 1003 | Protocol error: invalid config JSON, audio sent before config, unsupported audio format or sample rate |
| 4003 | Request could not be validated, or not permitted |
| 4029 | Insufficient credits |
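The close codes can be given names so that connection handlers read clearly. This is an illustrative sketch: the enum member names and `closed_cleanly` are hypothetical, while the numeric codes come from the table above.

```python
from enum import IntEnum


class CloseCode(IntEnum):
    NORMAL = 1000                # normal closure after the done event
    PROTOCOL_ERROR = 1003        # bad config JSON, audio before config, bad audio format
    NOT_PERMITTED = 4003         # request not validated or not permitted
    INSUFFICIENT_CREDITS = 4029  # account is out of credits


def closed_cleanly(code: int, saw_done: bool) -> bool:
    """A session finished cleanly only if done arrived and the code is 1000."""
    return code == CloseCode.NORMAL and saw_done
```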
Related
- Behaviors — define what Velma listens for
- Audio formats — supported formats and raw PCM parameters
- Authentication — API key setup