Velma-2 offers several endpoints with overlapping capabilities. This guide helps you pick the right one based on your latency needs, language requirements, audio format constraints, and required features.

Quick reference

Speech-to-text (STT)

| | STT Batch | STT English VFast | STT Streaming |
| --- | --- | --- | --- |
| Use case | Transcription with rich metadata | Fast English-only transcription | Real-time transcription |
| Protocol | HTTP POST | HTTP POST | WebSocket |
| Languages | Multilingual | English only | Multilingual |
| Audio formats | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM | Opus only | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM |
| Max file size | 100 MB | 100 MB | — (streaming) |
| Speaker diarization | ✓ | ✗ | |
| Emotion detection | ✓ | ✗ | |
| Accent detection | ✓ | ✗ | |
| PII/PHI tagging | ✓ | ✗ | ✓ |
| Synthetic voice scoring | ✓ (per-utterance) | ✗ | ✗ |
| Utterance-level output | ✓ | ✗ | ✓ |

Synthetic voice detection (SVD)

| | SVD Batch | SVD Streaming |
| --- | --- | --- |
| Use case | Deepfake detection on a file | Real-time deepfake detection |
| Protocol | HTTP POST | WebSocket |
| Audio formats | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | Raw PCM and container formats |
| Max file size | 100 MB | — (streaming) |
| Synthetic voice scoring | ✓ (per-frame) | ✓ (per-frame) |

PII/PHI redaction

| | Redaction Batch | Redaction Streaming |
| --- | --- | --- |
| Use case | Transcription with audio redaction | Real-time transcription with audio redaction |
| Protocol | HTTP POST | WebSocket |
| Languages | Multilingual | Multilingual |
| Audio formats | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM, plus raw PCM |
| Max file size | 100 MB | — (streaming) |
| Speaker diarization | | |
| PII/PHI audio redaction | ✓ | ✓ |
| Utterance-level output | | |
PII/PHI tagging (STT APIs) — wraps sensitive spans in tags within the transcript text; the original content is preserved.

PII/PHI audio redaction (Redaction APIs) — replaces each detected PII/PHI span with an entity-type tag (e.g. [FIRSTNAME], [SSN], [PHI]) in the transcript text and silences the corresponding audio ranges in the returned MP3.
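
To make the difference concrete, here is a hypothetical before/after for the same sentence. The redaction entity-type tags ([FIRSTNAME], [SSN]) are documented above; the tag syntax shown for STT tagging is purely illustrative, since this guide does not specify it:

```text
Original speech:           "Hi, this is Maria. My SSN is 123-45-6789."
STT with pii_phi_tagging:  "Hi, this is [PII]Maria[/PII]. My SSN is [PII]123-45-6789[/PII]."  (illustrative syntax)
Redaction transcript:      "Hi, this is [FIRSTNAME]. My SSN is [SSN]."  (matching audio ranges silenced in the MP3)
```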

Decision tree

Do you need transcription?

  • Yes → continue below.
  • No, you need deepfake detection only → use the Synthetic Voice Detection batch endpoint for pre-recorded files, or the streaming variant for live audio (a request sketch follows below). See the API Reference tab.
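
For the deepfake-only path, a minimal sketch of a batch Synthetic Voice Detection request in Python. The endpoint URL, auth scheme, and response shape are assumptions for illustration; take the real values from the API Reference:

```python
import requests

SVD_BATCH_URL = "https://api.modulate.ai/v1/svd/batch"  # hypothetical URL
API_KEY = "YOUR_API_KEY"

# Upload a complete file (up to 100 MB) for per-frame synthetic voice scoring.
with open("clip.wav", "rb") as f:
    resp = requests.post(
        SVD_BATCH_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
        files={"audio": f},
    )
resp.raise_for_status()
print(resp.json())  # per-frame scores; exact field names are not documented here
```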

Do you need PII/PHI removed from the audio itself — not just tagged in the transcript?

  • Yes → use the PII/PHI Redaction Batch API for complete files, or the PII/PHI Redaction Streaming API for live audio. Both return a redacted MP3 with sensitive audio ranges silenced, alongside a transcript where each PII/PHI span is replaced with an entity-type tag. A sketch of the batch flow follows below.
  • No, you only need the transcript tagged → continue below, and enable pii_phi_tagging=true on whichever transcription endpoint you choose (note that STT Batch English VFast does not support tagging).
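
A sketch of the batch redaction flow. The endpoint URL, auth scheme, and response fields are assumptions; the guide specifies only that a redacted MP3 and a tagged transcript come back, not the wire format:

```python
import requests

REDACTION_BATCH_URL = "https://api.modulate.ai/v1/redaction/batch"  # hypothetical URL
API_KEY = "YOUR_API_KEY"

with open("support_call.wav", "rb") as f:
    resp = requests.post(
        REDACTION_BATCH_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
        files={"audio": f},
    )
resp.raise_for_status()
body = resp.json()  # response shape below is assumed, not documented

# Transcript with each PII/PHI span replaced by an entity-type tag.
print(body["transcript"])  # e.g. "Hi, this is [FIRSTNAME] ..."

# Redacted MP3 with the corresponding audio ranges silenced.
with open("redacted.mp3", "wb") as out:
    out.write(requests.get(body["redacted_audio_url"]).content)  # assumed field
```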

Do you need results in real time, while audio is still being captured?

  • Yes → use the STT Streaming API. It delivers utterances over WebSocket as they are transcribed (a sketch follows below).
  • No, you have a complete file → continue below.
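
A minimal streaming sketch using the websockets library. The URL and framing conventions are assumptions, and authentication is omitted; consult the API Reference for the real protocol:

```python
import asyncio

import websockets  # pip install websockets

STT_STREAM_URL = "wss://api.modulate.ai/v1/stt/stream"  # hypothetical URL

async def stream_file(path: str) -> None:
    async with websockets.connect(STT_STREAM_URL) as ws:
        # Send binary audio chunks as they are captured; a file stands in here.
        with open(path, "rb") as f:
            while chunk := f.read(32_000):  # chunk size is arbitrary
                await ws.send(chunk)
        # Utterances arrive as they are transcribed, until the server closes.
        async for message in ws:
            print(message)

asyncio.run(stream_file("live_audio.webm"))
```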

Is the audio English-only, and do you need maximum throughput without speaker or emotion metadata?

Yes → use the STT Batch English VFast API (request sketch below).
  • Fastest option for English audio.
  • Opus format only — convert other formats before uploading.
  • No speaker diarization, emotion, accent, PII/PHI tagging, or utterance-level data.
  • Returns a single text string and duration_ms.
  • Processing timeout: 60 seconds per request.
No → use the STT Batch API.
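
For the VFast branch, a sketch that converts an MP3 to Opus with ffmpeg and posts it. The endpoint URL, auth scheme, and the text field name are assumptions; duration_ms and the 60-second timeout come from the list above:

```python
import subprocess

import requests

VFAST_URL = "https://api.modulate.ai/v1/stt/english-vfast"  # hypothetical URL
API_KEY = "YOUR_API_KEY"

# VFast accepts Opus only, so convert other formats first (requires ffmpeg).
subprocess.run(
    ["ffmpeg", "-y", "-i", "call.mp3", "-c:a", "libopus", "call.opus"],
    check=True,
)

with open("call.opus", "rb") as f:
    resp = requests.post(
        VFAST_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
        files={"audio": f},
        timeout=60,  # server-side processing timeout is 60 seconds per request
    )
resp.raise_for_status()
body = resp.json()
print(body["text"], body["duration_ms"])  # "text" field name is assumed
```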

STT Batch vs STT English VFast — detailed tradeoffs

| Consideration | Choose STT Batch | Choose STT English VFast |
| --- | --- | --- |
| Audio language | Non-English, or unknown/multilingual | English only |
| Audio format | Any supported format (MP3, WAV, etc.) | Can convert to Opus, or already have Opus files |
| Need speaker IDs | Yes | No |
| Need emotion or accent signals | Yes | No |
| Need PII/PHI tagging | Yes | No |
| Need utterance timestamps | Yes | No |
| Need deepfake scoring alongside transcription | Yes | No |
| Processing speed is the top priority | Lower priority | Yes |
| Transcription quality is the top priority | Yes | Lower priority |
| Large-scale batch jobs in English | | Preferred |

Do you need deepfake detection alongside transcription?

The STT Batch API supports a deepfake_signal parameter that adds a per-utterance deepfake_score to transcription output. This is convenient when you already need transcription and want a synthetic voice signal without a second API call. Use the dedicated Synthetic Voice Detection APIs instead when:
  • Transcription is not needed — you only want to know if audio is synthetic.
  • You need frame-level results across the full audio (not just utterance-level scores).
  • You need explicit no-content verdicts for silent regions.
  • You need streaming deepfake verdicts in real time.
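
A sketch of the combined call, assuming a hypothetical endpoint URL and response shape; deepfake_signal is the documented parameter name:

```python
import requests

STT_BATCH_URL = "https://api.modulate.ai/v1/stt/batch"  # hypothetical URL
API_KEY = "YOUR_API_KEY"

with open("meeting.wav", "rb") as f:
    resp = requests.post(
        STT_BATCH_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
        files={"audio": f},
        data={"deepfake_signal": "true"},  # documented parameter
    )
resp.raise_for_status()

# With deepfake_signal enabled, each utterance carries a deepfake_score
# (the utterance list's exact shape is an assumption).
for utt in resp.json()["utterances"]:
    print(utt["deepfake_score"], utt["text"])
```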

Common scenarios

  • Call center QA on recorded English calls, high volume → STT Batch English VFast. Convert to Opus if needed. Use parallel requests with a semaphore to respect concurrency limits (see the sketch after this list).
  • Meeting transcription with speaker attribution and emotion → STT Batch with speaker_diarization=true and emotion_signal=true.
  • Live interview transcription with real-time captions → STT Streaming. Stream audio over WebSocket and display utterances as they arrive.
  • Detect AI-generated voice in a submitted audio clip → Synthetic Voice Detection batch endpoint.
  • Real-time anti-spoofing check during a voice authentication flow → Synthetic Voice Detection streaming endpoint. Stream audio and act on frame verdicts as they arrive.
  • Transcript of a support call with PII/PHI spans tagged for downstream review → STT Batch with pii_phi_tagging=true. Language is auto-detected per utterance. The transcript content is preserved; sensitive spans are wrapped in tags.
  • Compliance recording that must be shareable with PII/PHI silenced → PII/PHI Redaction Batch API. Returns a redacted MP3 with sensitive audio ranges silenced and a transcript where each PII/PHI span is replaced with an entity-type tag.
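
For the high-volume VFast scenario, a sketch of semaphore-bounded parallel requests with asyncio and aiohttp. The endpoint URL, auth scheme, and concurrency cap of 8 are assumptions; match the cap to your account's actual limit:

```python
import asyncio

import aiohttp  # pip install aiohttp

VFAST_URL = "https://api.modulate.ai/v1/stt/english-vfast"  # hypothetical URL
API_KEY = "YOUR_API_KEY"
MAX_CONCURRENCY = 8  # assumed; set to your account's concurrency limit

async def transcribe(session, sem, path):
    async with sem:  # the semaphore caps in-flight requests
        with open(path, "rb") as f:
            form = aiohttp.FormData()
            form.add_field("audio", f, filename=path)
            async with session.post(VFAST_URL, data=form) as resp:
                resp.raise_for_status()
                return path, await resp.json()

async def main(paths):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    headers = {"Authorization": f"Bearer {API_KEY}"}  # assumed auth scheme
    async with aiohttp.ClientSession(headers=headers) as session:
        return await asyncio.gather(*(transcribe(session, sem, p) for p in paths))

results = asyncio.run(main(["call1.opus", "call2.opus"]))
```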