

The Velma-2 STT APIs support several optional features that add metadata to transcription output. This article explains what each feature does, when to enable it, and what values to expect. Feature availability varies by endpoint — see the summary table below.

Feature availability

| Feature | STT Batch | STT Streaming | STT English VFast |
| --- | --- | --- | --- |
| Speaker diarization | ✓ | ✓ |  |
| Emotion detection | ✓ | ✓ |  |
| Accent detection | ✓ | ✓ |  |
| PII/PHI tagging | ✓ | ✓ |  |
| Synthetic voice scoring | ✓ |  |  |

All features are disabled by default except speaker diarization, which defaults to true in both the batch and streaming endpoints.
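
As a quick sketch of how these flags fit into a request, the snippet below enables all five on a batch call. The endpoint URL, authentication scheme, and audio-delivery field are placeholders, not the documented API (check the API Reference for the real ones); only the five boolean field names come from this page.

```python
import requests  # pip install requests

# Everything about the transport here is hypothetical: the real endpoint
# URL, auth header, and audio upload mechanism are in the API Reference.
# Only the five boolean feature flags below are documented on this page.
ENDPOINT = "https://api.example.com/stt/batch"   # placeholder URL
API_KEY = "YOUR_API_KEY"

payload = {
    "audio_url": "https://example.com/call.wav",  # hypothetical field
    "speaker_diarization": True,   # default true (batch and streaming)
    "emotion_signal": True,        # default false
    "accent_signal": True,         # default false
    "pii_phi_tagging": True,       # default false
    "deepfake_signal": True,       # default false; STT Batch only
}

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
transcript = resp.json()
```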

Speaker diarization

Request field: speaker_diarization (boolean, default true)

Speaker diarization identifies distinct speakers in the audio and assigns each utterance a speaker integer, starting from 1. Speaker numbers are consistent within a single file: speaker 1 in one utterance is the same speaker as speaker 1 in any other utterance from the same request. Speaker numbers are not consistent across separate requests or files.

The feature works independently of language detection. In a multilingual conversation, each speaker still receives a consistent label even if they switch languages mid-call.

Disable diarization if you are processing single-speaker audio and want to reduce processing overhead, or if the speaker field is not useful to your application.
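
For instance, if the response exposes a list of utterance objects (the exact response envelope is in the API Reference; the speaker and text field names below follow this article), grouping a transcript by speaker takes a few lines:

```python
from collections import defaultdict

def group_by_speaker(utterances):
    """Collect utterance texts per speaker label.

    Assumes each utterance is a dict with an integer `speaker`
    field (starting at 1) and a string `text` field.
    """
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt["text"])
    return dict(by_speaker)

# Made-up two-speaker transcript for illustration:
utterances = [
    {"speaker": 1, "text": "Hi, thanks for calling."},
    {"speaker": 2, "text": "Hello, I have a billing question."},
    {"speaker": 1, "text": "Sure, I can help with that."},
]
print(group_by_speaker(utterances))
# {1: ['Hi, thanks for calling.', 'Sure, I can help with that.'],
#  2: ['Hello, I have a billing question.']}
```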

Emotion detection

Request field: emotion_signal (boolean, default false)

Emotion detection classifies the emotional tone of each utterance from the speaker's voice signal. The result appears in the emotion field on each utterance object. When disabled, emotion is null.

Detection is per-utterance and based on acoustic features, not on the words spoken. Two utterances with the same text may receive different emotion labels if the delivery differs.

Possible values

Neutral, Calm, Happy, Amused, Excited, Proud, Affectionate, Interested, Hopeful, Frustrated, Angry, Contemptuous, Concerned, Afraid, Sad, Ashamed, Bored, Tired, Surprised, Anxious, Stressed, Disgusted, Disappointed, Confused, Relieved, Confident
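
As one illustration of consuming these labels, the sketch below flags utterances whose acoustic tone suggests an escalating call. The utterance shape is assumed as elsewhere on this page, and the choice of labels to watch is purely an example:

```python
# Example watchlist only; pick labels that matter to your application.
ESCALATION_EMOTIONS = {"Frustrated", "Angry", "Contemptuous", "Stressed"}

def escalation_utterances(utterances):
    """Return utterances whose emotion label suggests escalation.

    `emotion` is None when emotion_signal was disabled, so the
    membership test safely skips unlabeled utterances.
    """
    return [
        utt for utt in utterances
        if utt.get("emotion") in ESCALATION_EMOTIONS
    ]
```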

Accent detection

Request field: accent_signal (boolean, default false)

Accent detection classifies the regional or national accent of each utterance's speaker. The result appears in the accent field on each utterance object. When disabled, accent is null.

Like emotion detection, accent classification is per-utterance. For a speaker with a consistent accent across the file, results will typically be consistent but may vary on short or acoustically challenging segments.

Possible values

American, British, Australian, Southern, Indian, Irish, Scottish, Eastern_European, African, Asian, Latin_American, Middle_Eastern, Unknown
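
Because labels are per-utterance and can vary on short segments, a common pattern is to smooth them into one label per speaker. A minimal sketch, assuming diarization and accent detection are both enabled so each utterance carries speaker and accent fields:

```python
from collections import Counter, defaultdict

def accent_per_speaker(utterances):
    """Majority-vote accent label per speaker.

    Assumes each utterance dict has an integer `speaker` field and a
    string `accent` field (None when accent_signal is disabled; those
    utterances are skipped). Voting across a speaker's utterances
    smooths over short or acoustically challenging segments.
    """
    votes = defaultdict(Counter)
    for utt in utterances:
        if utt.get("accent") is not None:
            votes[utt["speaker"]][utt["accent"]] += 1
    return {spk: counts.most_common(1)[0][0]
            for spk, counts in votes.items()}
```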

PII/PHI tagging

Request field: pii_phi_tagging (boolean, default false)

PII/PHI tagging identifies personally identifiable information (PII) and personal health information (PHI) in the transcribed text and wraps those spans in tags within the text field of each utterance. The transcript content is preserved; only the tag markup is added.

Enable PII/PHI tagging when your downstream systems need to identify or handle sensitive text spans but you still need the original content in the transcript.
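
This page does not show the tag markup itself, so the snippet below assumes an illustrative XML-style format (<pii type="...">...</pii>); check the API Reference for the actual markup before relying on it. The pattern, stripping tags for display while extracting tagged spans for handling, stays the same either way:

```python
import re

# Hypothetical tag format for illustration only; the real markup the
# service emits is documented in the API Reference.
TAG_RE = re.compile(r'<pii type="(?P<type>[^"]+)">(?P<span>.*?)</pii>')

def strip_tags(text):
    """Return the plain transcript with tag markup removed."""
    return TAG_RE.sub(lambda m: m.group("span"), text)

def extract_spans(text):
    """Return (entity_type, span_text) pairs for each tagged region."""
    return [(m.group("type"), m.group("span"))
            for m in TAG_RE.finditer(text)]

tagged = 'My number is <pii type="phone_number">555-0100</pii>.'
print(strip_tags(tagged))     # My number is 555-0100.
print(extract_spans(tagged))  # [('phone_number', '555-0100')]
```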
Need the audio silenced, not just the transcript tagged? If you need PII/PHI replaced with entity-type tags in the transcript and the corresponding audio ranges silenced — for example, for shareable recordings or compliance archiving — use the dedicated PII/PHI Redaction APIs instead. See the API Reference tab.

Synthetic voice scoring (batch only)

Request field: deepfake_signal (boolean, default false)

The synthetic voice signal scores each utterance for the likelihood that it contains AI-generated speech. The score appears in the deepfake_score field on each utterance object.
| Value | Meaning |
| --- | --- |
| 0.0 | Likely natural human speech |
| 1.0 | Likely synthetic speech |
| null | Feature is disabled, or the utterance is shorter than 0.5 seconds |

Utterances shorter than 0.5 seconds are not scored and return null regardless of whether the feature is enabled. This feature is only available on the STT Batch endpoint. For dedicated synthetic voice analysis across an entire file (including frame-level results and a no-content verdict for silence), use the Synthetic Voice Detection APIs.
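
A minimal consumer might look like the sketch below. The utterance shape is assumed as elsewhere on this page, and the 0.8 cut-off is an arbitrary example rather than a documented threshold:

```python
def flag_synthetic(utterances, threshold=0.8):
    """Partition utterances by deepfake score.

    `deepfake_score` is None when the feature is disabled or the
    utterance is shorter than 0.5 seconds; those are reported
    separately rather than treated as natural speech. The default
    threshold is an illustrative choice, not an API recommendation.
    """
    flagged, unscored = [], []
    for utt in utterances:
        score = utt.get("deepfake_score")
        if score is None:
            unscored.append(utt)
        elif score >= threshold:
            flagged.append(utt)
    return flagged, unscored
```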

How this relates to synthetic voice detection

The deepfake_signal in the STT Batch API and the Synthetic Voice Detection APIs both detect synthetic speech, but they serve different use cases. The STT Batch deepfake_signal is convenient when you are already transcribing and want a per-utterance score without a separate API call. The dedicated Synthetic Voice Detection APIs provide frame-level analysis across the full file, explicit no-content handling for silence, and a streaming mode, making them the better choice when detection is the primary goal rather than a supplementary signal.