Capabilities - Modulate

Velma Triage offers two endpoints with identical analysis capabilities. Both accept the same BatchConfig schema and return the same set of outputs — the difference is protocol and response shape.

	Batch	Streaming
Endpoint	`POST /api/velma-2-batch`	`wss://platform.modulate.ai/api/velma-2-streaming`
Auth	`X-API-Key` header	`api_key` query parameter
Config	`config` form field (JSON string or `"default"`)	First text frame (JSON or `"default"`)
Response	Single JSON object	Stream of typed events
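
As a minimal sketch of the streaming column above, the helpers below build the connection URL and the first text frame. The function names are hypothetical, and sending the bare word default (without surrounding quotes) is an assumption, since the text does not say whether the quotes are part of the literal.

```python
import json
from urllib.parse import urlencode

STREAMING_URL = "wss://platform.modulate.ai/api/velma-2-streaming"


def build_streaming_url(api_key):
    """Return the streaming URL with the key in the `api_key` query parameter."""
    # urlencode escapes characters that are not safe inside a query string.
    return STREAMING_URL + "?" + urlencode({"api_key": api_key})


def first_text_frame(config="default"):
    """Return the payload for the first text frame of a streaming session.

    A dict is serialized to JSON; a string is passed through unchanged.
    """
    if isinstance(config, str):
        return config
    return json.dumps(config)
```

The same dict passed to first_text_frame can be serialized with json.dumps for the batch endpoint's config form field, since both endpoints share one schema.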

Configuration

Both endpoints use the same BatchConfig schema. To use Velma’s built-in default behavior set, send the literal string "default" in place of a full config.

Conversation types

A conversation type tells Velma what kind of interaction it is analyzing. Velma uses this to contextualize behavior detection and role assignment. You can define multiple types — Velma will infer which one best matches, or use default_conversation_type as the fallback.

{
  "conversation_types": [
    {
      "conversation_type_uuid": "11111111-1111-4111-8111-111111111008",
      "name": "Customer Support Call",
      "short_description": "A customer contacting a business for help with a product or service.",
      "detailed_description": "The conversation involves a customer with a service issue and a representative resolving it. Tone is professional. The representative is expected to follow company procedure."
    }
  ]
}

Participant roles

Roles describe the speakers Velma expects. Scope roles to specific conversation types via applies_to_conversation_type_uuids. If omitted, the role applies to all types.

{
  "participant_roles": [
    {
      "participant_role_uuid": "22222222-2222-4222-8222-222222222017",
      "name": "Customer",
      "short_description": "The recipient of a service or good.",
      "detailed_description": "A customer calling in with a service issue.",
      "applies_to_conversation_type_uuids": ["11111111-1111-4111-8111-111111111008"]
    }
  ]
}
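
Because roles point at conversation types by UUID, a typo in applies_to_conversation_type_uuids is easy to miss. The sketch below is a hypothetical client-side check (unknown_role_scopes is not part of the API) that lists scoped UUIDs with no matching conversation type in the same config. How the service itself treats a dangling reference is not stated here.

```python
def unknown_role_scopes(config):
    """Map each role name to scoped UUIDs that match no defined conversation type."""
    defined = {
        ct["conversation_type_uuid"]
        for ct in config.get("conversation_types", [])
    }
    problems = {}
    for role in config.get("participant_roles", []):
        # A role without the field applies to all types, so there is nothing to check.
        scoped = role.get("applies_to_conversation_type_uuids") or []
        missing = [uuid for uuid in scoped if uuid not in defined]
        if missing:
            problems[role["name"]] = missing
    return problems
```

An empty result means every scoped role refers to a conversation type defined in the same config.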

Behaviors

The behaviors array accepts two types of entries, full BehaviorDef objects and preset reference strings, and you can mix both in the same array.

Preset reference: a string in the form "preset:<identifier>". Velma expands it into the full behavior definition before processing. Use GET /api/velma-2-batch/list-presets or GET /api/velma-2-streaming/list-presets to discover available identifiers.

Full BehaviorDef: supply all four required fields yourself. A full definition takes precedence over any preset entry with the same UUID.

{
  "behaviors": [
    "preset:harassment",
    "preset:service-churn",
    {
      "behavior_uuid": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
      "name": "Agent Script Deviation",
      "short_description": "Agent departs from the required call opening script.",
      "detailed_description": "..."
    }
  ]
}

See Behaviors for the full guide.
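
A mixed behaviors array can be sanity-checked before it is sent. The sketch below is a hypothetical helper, not part of the API; it assumes the four required fields are the ones shown in the example object above (behavior_uuid, name, short_description, detailed_description). It does not verify that a preset identifier actually exists.

```python
REQUIRED_FIELDS = ("behavior_uuid", "name", "short_description", "detailed_description")


def check_behaviors(behaviors):
    """Return a list of problems; an empty list means the array looks well formed."""
    problems = []
    for index, entry in enumerate(behaviors):
        if isinstance(entry, str):
            # Preset references must carry a non-empty identifier after the prefix.
            if not entry.startswith("preset:") or entry == "preset:":
                problems.append(f"entry {index}: expected 'preset:<identifier>', got {entry!r}")
        elif isinstance(entry, dict):
            missing = [field for field in REQUIRED_FIELDS if field not in entry]
            if missing:
                problems.append(f"entry {index}: missing {', '.join(missing)}")
        else:
            problems.append(f"entry {index}: expected a string or an object")
    return problems
```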

Transcription options

Control what transcription data appears in clip outputs:

Option	Type	Default	What it adds
`speaker_diarization`	boolean	`true`	Per-speaker clip attribution
`emotion_signal`	boolean	`false`	Per-clip `emotion` label
`accent_signal`	boolean	`false`	Per-clip `accent` label
`deepfake_signal`	boolean	`false`	Per-clip `deepfake_score` (0–1)
`pii_phi_tagging`	boolean	`false`	Sensitive spans wrapped in entity tags
`language`	string	auto	Optional language hint (case-insensitive ISO 639-1 code, e.g. `en`); the language is detected automatically for each clip when omitted
`custom_terms`	array	none	Custom vocabulary to bias transcription toward domain terms and names — see Custom vocabulary
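
The sketch below builds a transcription options object that lists only values differing from the defaults in the table. stt_options is a hypothetical helper; placing the result under the "stt" key of the config follows the Python batch example on this page. custom_terms is left out because its element shape is described elsewhere.

```python
STT_DEFAULTS = {
    "speaker_diarization": True,
    "emotion_signal": False,
    "accent_signal": False,
    "deepfake_signal": False,
    "pii_phi_tagging": False,
}


def stt_options(language=None, **flags):
    """Return transcription options, keeping only flags that differ from the defaults."""
    unknown = set(flags) - set(STT_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown transcription option(s): {sorted(unknown)}")
    for name, value in flags.items():
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a boolean")
    options = {name: value for name, value in flags.items() if value != STT_DEFAULTS[name]}
    if language is not None:
        # The hint is a two-letter ISO 639-1 code; case does not matter to the API.
        if len(language) != 2 or not (language.isascii() and language.isalpha()):
            raise ValueError("language must be a two-letter ISO 639-1 code")
        options["language"] = language.lower()
    return options
```

For example, stt_options(emotion_signal=True, language="EN") yields an object that enables the emotion label and hints English, and leaves everything else at its default.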

Aggregate outputs

Option	Type	Default
`produce_topics`	boolean	`true`
`produce_topic_sentiments`	boolean	`true`
`produce_summary`	boolean	`true`

Set any of these to false to suppress the corresponding output.

Batch endpoint

POST /api/velma-2-batch: submit a complete audio file and receive a single JSON response.

Request (multipart/form-data):

Field	Type	Required	Description
`upload_file`	binary	Yes	Audio file. Max 100 MB. Supported formats: AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM.
`config`	string	No	JSON-encoded `BatchConfig`, or the literal string `"default"`. Defaults to `"default"` if omitted.

curl -X POST https://platform.modulate.ai/api/velma-2-batch \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@recording.mp3" \
  -F 'config={"behaviors":["preset:harassment","preset:service-churn"]}'

import json
import os

import requests

config = {
    "behaviors": ["preset:harassment", "preset:service-churn"],
    "stt": {"speaker_diarization": True},
    "produce_summary": True,
}

# Open the file in a context manager so the handle is closed after the upload.
with open("recording.mp3", "rb") as audio:
    response = requests.post(
        "https://platform.modulate.ai/api/velma-2-batch",
        headers={"X-API-Key": os.environ["MODULATE_API_KEY"]},
        data={"config": json.dumps(config)},  # config is sent as a JSON string form field
        files={"upload_file": audio},
    )
response.raise_for_status()  # raise on any 4xx/5xx status
result = response.json()

Response — BatchResponse:

Field	Type	Description
`duration_ms`	integer	Total audio duration
`clips`	array	Transcribed segments — see Clip
`conversation_type_pick`	object or null	Inferred conversation type
`participant_role_picks`	array	Per-speaker role assignments
`behaviors`	array	Per-behavior detection results — see BehaviorDetection
`topics`	array	Extracted topic strings
`topic_sentiments`	array	Per-speaker sentiment per topic
`summary`	string or null	Narrative summary
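
One way to read a BatchResponse is to pair each detected behavior with the text of its definitive clip. The sketch below is hypothetical and assumes that batch clips and behavior entries carry the same fields as the streaming clip and behavior_detection payloads documented under Streaming events (clip_uuid, text, detected, confidence, definitive_clip_uuid, and so on).

```python
def detected_behaviors(result, min_confidence=0.0):
    """List detected behaviors with the text of their definitive clip, if any."""
    clips = {clip["clip_uuid"]: clip for clip in result.get("clips", [])}
    hits = []
    for detection in result.get("behaviors", []):
        if not detection.get("detected"):
            continue
        if detection.get("confidence", 0.0) < min_confidence:
            continue
        # The definitive clip may be absent or null; fall back to no quote.
        clip = clips.get(detection.get("definitive_clip_uuid"))
        hits.append({
            "behavior": detection.get("behavior_name"),
            "speaker": detection.get("speaker_label"),
            "confidence": detection.get("confidence"),
            "quote": clip["text"] if clip else None,
        })
    return hits
```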

Error responses:

Status	Meaning
`400`	Unsupported file format, empty file, or malformed request
`403`	Request not permitted
`422`	Invalid `config` value (malformed JSON, unknown preset identifier, or a definition missing a required field), or a required request field missing or the wrong type (e.g. the `X-API-Key` header or `upload_file` part)
`429`	Insufficient credits
`500`	Internal server error
`502`	Request could not be validated or completed — retry
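
The table above suggests different client reactions per status. The sketch below is one hypothetical mapping: only 502 is retried because it is the only status the table marks as retryable, and the action names and backoff schedule are illustrative choices rather than documented guidance.

```python
def classify_status(status):
    """Map an HTTP status from the batch endpoint to a coarse client action."""
    if 200 <= status < 300:
        return "ok"
    if status == 502:
        return "retry"          # the table marks this status as retryable
    if status == 429:
        return "add_credits"    # retrying will not help until credits are added
    if status in (400, 403, 422):
        return "fix_request"    # the request itself must change
    return "fail"


def backoff_delays(attempts=3, base=1.0):
    """Return exponentially growing wait times, in seconds, for retry attempts."""
    return [base * 2 ** attempt for attempt in range(attempts)]
```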

Streaming events

Velma emits JSON events throughout a streaming session. Every event has a type field.

`clip`

A transcribed segment of speech. Emitted in near real time.

{
  "type": "clip",
  "clip": {
    "clip_uuid": "a1b2c3d4-...",
    "text": "I'd like to cancel my subscription.",
    "start_ms": 4800,
    "duration_ms": 2100,
    "speaker_label": "Speaker_1",
    "language": "en",
    "emotion": null,
    "accent": null,
    "deepfake_score": null
  }
}

emotion, accent, and deepfake_score are non-null only when their corresponding Transcription options are enabled.

`partial_clip`

An in-progress clip streamed while an utterance is still being spoken. Multiple partials may arrive for the same clip_uuid as the utterance grows; the eventual clip event reuses that clip_uuid and supersedes all of its partials.

{
  "type": "partial_clip",
  "partial_clip": {
    "clip_uuid": "a1b2c3d4-...",
    "text": "I'd like to cancel",
    "start_ms": 4800,
    "end_ms": 5900,
    "speaker_label": "Speaker_1",
    "emotion": null,
    "accent": null,
    "deepfake_score": null
  }
}
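
The relationship between partial_clip and clip events can be tracked with a small reducer keyed by clip_uuid. The class below is a hypothetical sketch; ignoring a partial that arrives after its final clip is a defensive assumption, since the text does not say whether that can happen.

```python
class LiveTranscript:
    """Keep the latest partial per clip until the final clip supersedes it."""

    def __init__(self):
        self.final = {}    # clip_uuid -> finalized clip payload
        self.partial = {}  # clip_uuid -> latest in-progress payload

    def apply(self, event):
        if event["type"] == "partial_clip":
            payload = event["partial_clip"]
            # A later partial for the same clip_uuid replaces the earlier one.
            if payload["clip_uuid"] not in self.final:
                self.partial[payload["clip_uuid"]] = payload
        elif event["type"] == "clip":
            payload = event["clip"]
            self.partial.pop(payload["clip_uuid"], None)
            self.final[payload["clip_uuid"]] = payload

    def lines(self):
        """Return clip texts, finalized and in progress, ordered by start time."""
        rows = list(self.final.values()) + list(self.partial.values())
        return [row["text"] for row in sorted(rows, key=lambda row: row["start_ms"])]
```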

`clip_update`

Refined emotion / accent values for a previously finalized clip; clip_uuid matches an earlier clip event. A clip may receive any number of updates (including none), always before the done event — for each field present, the latest received value wins.

{
  "type": "clip_update",
  "clip_update": {
    "clip_uuid": "a1b2c3d4-...",
    "emotion": "Frustrated",
    "accent": "American"
  }
}
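
The "latest received value wins" rule amounts to a per-field overwrite. A minimal sketch, assuming finalized clips are held in a dict keyed by clip_uuid (a client-side choice, not something the API prescribes):

```python
def apply_clip_update(clips, event):
    """Merge a clip_update event into the stored clip with the same clip_uuid."""
    update = event["clip_update"]
    clip = clips.get(update["clip_uuid"])
    if clip is None:
        return False  # no matching finalized clip is stored; nothing to merge
    for field, value in update.items():
        if field != "clip_uuid":
            clip[field] = value  # every field present in the update overwrites
    return True
```

Fields absent from an update are left untouched, so an update that carries only emotion does not clear a previously received accent.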

`conversation_type`

Velma’s pick for the conversation type, emitted once enough context is available.

{
  "type": "conversation_type",
  "pick": {
    "conversation_type_uuid": "11111111-1111-4111-8111-111111111008",
    "name": "Customer Support Call",
    "confidence": 0.94,
    "selection_source": "inferred",
    "detail": "...",
    "reasoning": "..."
  }
}

selection_source is one of inferred, auto_selected_single_option, or default.

`participant_role`

A per-speaker role assignment. One event per speaker label.

{
  "type": "participant_role",
  "pick": {
    "speaker_label": "Speaker_1",
    "participant_role_uuid": "22222222-2222-4222-8222-222222222017",
    "name": "Customer",
    "confidence": 0.88,
    "selection_source": "inferred",
    "detail": "...",
    "reasoning": "..."
  }
}

`behavior_detection`

A per-behavior verdict. Emitted for each behavior once Velma has enough audio to decide.

{
  "type": "behavior_detection",
  "detection": {
    "behavior_uuid": "33333333-3333-4333-8333-033333333006",
    "behavior_name": "Service Churn",
    "speaker_label": "Speaker_1",
    "detected": true,
    "confidence": 0.91,
    "evidence_clip_uuids": ["a1b2c3d4-...", "e5f6a7b8-..."],
    "definitive_clip_uuid": "a1b2c3d4-...",
    "reasoning": "Speaker explicitly stated they want to cancel their subscription."
  }
}

`topics`

Aggregated list of subjects discussed. May be emitted more than once as the conversation progresses; each event fully replaces the previous list, so always treat the latest as authoritative.

{
  "type": "topics",
  "topics": ["subscription cancellation", "billing dispute", "refund policy"]
}

`topic_sentiment`

Per-speaker sentiment for each topic. May be emitted more than once as the conversation progresses; a later event supersedes an earlier one for the same topic and speaker.

{
  "type": "topic_sentiment",
  "topic_sentiment": {
    "topic": "subscription cancellation",
    "speaker_label": "Speaker_1",
    "sentiment_score": -0.72,
    "sentiment_label": "negative"
  }
}

sentiment_score ranges from −1 (strongly negative) to +1 (strongly positive).

`summary`

A free-form narrative summary. May be emitted more than once as the conversation progresses; each event fully replaces the previous summary.

{
  "type": "summary",
  "text": "The customer called to cancel their subscription due to billing concerns. The representative offered a partial credit, which the customer declined. The call ended without resolution."
}
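
The replacement rules for topics, topic_sentiment, and summary events can be captured in one small state holder. The class below is a hypothetical sketch: topics and summary are overwritten wholesale, while sentiments are keyed by topic and speaker so a later event replaces only its own pair.

```python
class Aggregates:
    """Hold the latest aggregate outputs seen during a streaming session."""

    def __init__(self):
        self.topics = []
        self.sentiments = {}  # (topic, speaker_label) -> latest sentiment payload
        self.summary = None

    def apply(self, event):
        kind = event["type"]
        if kind == "topics":
            self.topics = list(event["topics"])  # full replacement
        elif kind == "topic_sentiment":
            payload = event["topic_sentiment"]
            self.sentiments[(payload["topic"], payload["speaker_label"])] = payload
        elif kind == "summary":
            self.summary = event["text"]  # full replacement
```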

`done`

Signals streaming is complete. Always the final event.

{
  "type": "done",
  "duration_ms": 183400
}

`error`

Emitted if a processing error occurs. The connection closes after this event.

{
  "type": "error",
  "error": "Invalid input audio"
}
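
Since done and error are the two terminal events, a receive loop can be written around them. The sketch below is hypothetical and transport-agnostic: frames is any iterable of JSON text frames (for example, messages read from a WebSocket client library), and StreamError and consume are illustrative names.

```python
import json


class StreamError(RuntimeError):
    """Raised when the stream reports an error or ends without a done event."""


def consume(frames, on_event):
    """Pass each event to on_event until done; return the reported duration_ms."""
    for frame in frames:
        event = json.loads(frame)
        if event["type"] == "done":
            return event["duration_ms"]
        if event["type"] == "error":
            raise StreamError(event["error"])
        on_event(event)
    # The iterable ran out before a terminal event arrived.
    raise StreamError("stream ended without a done event")
```

Any handler can be passed as on_event, such as a function that dispatches on the type field to update client-side state.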

WebSocket close codes

Code	Meaning
`1000`	Normal closure after the `done` event
`1003`	Protocol error — invalid or incomplete config, audio sent before config, unsupported audio format or sample rate
`4003`	Request could not be validated, or not permitted
`4029`	Insufficient credits
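
A client can turn the close code into a next step with a simple lookup. The action names below are illustrative labels chosen for this sketch, not values defined by the API.

```python
CLOSE_ACTIONS = {
    1000: "complete",      # normal closure after the done event
    1003: "fix_client",    # config, ordering, or audio format problem
    4003: "check_access",  # request could not be validated or is not permitted
    4029: "add_credits",   # insufficient credits
}


def close_action(code):
    """Return a coarse next step for a WebSocket close code."""
    return CLOSE_ACTIONS.get(code, "unexpected")
```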

Related

Behaviors: define what Velma listens for
Audio formats: supported formats and raw PCM parameters
Authentication: API key setup
