Velma-2 offers several endpoints with overlapping capabilities. This guide helps you pick the right one based on your latency needs, language requirements, audio format constraints, and required features.
Quick reference
Speech-to-text (STT)
| | STT Batch | STT English VFast | STT Streaming |
|---|---|---|---|
| Use case | Transcription with rich metadata | Fast English-only transcription | Real-time transcription |
| Protocol | HTTP POST | HTTP POST | WebSocket |
| Languages | Multilingual | English only | Multilingual |
| Audio formats | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM | Opus only | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM |
| Max file size | 100 MB | 100 MB | — (streaming) |
| Speaker diarization | ✓ | — | ✓ |
| Emotion detection | ✓ | — | ✓ |
| Accent detection | ✓ | — | ✓ |
| PII/PHI tagging | ✓ | — | ✓ |
| Synthetic voice scoring | ✓ (per-utterance) | — | — |
| Utterance-level output | ✓ | — | ✓ |
Synthetic voice detection (SVD)
| | SVD Batch | SVD Streaming |
|---|---|---|
| Use case | Deepfake detection on a file | Real-time deepfake detection |
| Protocol | HTTP POST | WebSocket |
| Audio formats | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | Raw PCM and container formats |
| Max file size | 100 MB | — (streaming) |
| Synthetic voice scoring | ✓ (per-frame) | ✓ (per-frame) |
PII/PHI redaction
| | Redaction Batch | Redaction Streaming |
|---|---|---|
| Use case | Transcription with audio redaction | Real-time transcription with audio redaction |
| Protocol | HTTP POST | WebSocket |
| Languages | Multilingual | Multilingual |
| Audio formats | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM |
| Max file size | 100 MB | — (streaming) |
| Speaker diarization | ✓ | ✓ |
| PII/PHI audio redaction | ✓ | ✓ |
| Utterance-level output | ✓ | ✓ |
- PII/PHI tagging (STT APIs) — wraps sensitive spans in tags within the transcript text; the original content is preserved.
- PII/PHI audio redaction (Redaction APIs) — replaces each detected PII/PHI span with an entity-type tag (e.g. [FIRSTNAME], [SSN], [PHI]) in the transcript text and silences the corresponding audio ranges in the returned MP3.
Decision tree
Do you need transcription?
Yes → continue below.
No, you need deepfake detection only → use the Synthetic Voice Detection batch endpoint for pre-recorded files, or the streaming variant for live audio. See the API Reference tab.
Do you need PII/PHI removed from the audio itself — not just tagged in the transcript?
Yes → use the PII/PHI Redaction Batch API for complete files, or the PII/PHI Redaction Streaming API for live audio. Both return a redacted MP3 with sensitive audio ranges silenced, alongside a transcript where each PII/PHI span is replaced with an entity-type tag.
No, you only need the transcript tagged → continue below, and enable pii_phi_tagging=true on whichever transcription endpoint you choose.
Do you need results in real time, while audio is still being captured?
Yes → use the STT Streaming API. It delivers utterances over WebSocket as they are transcribed.
No, you have a complete file → continue below.
Is the audio English-only, and do you only need plain transcript text — no diarization, emotion, accent, or PII/PHI tags?
Yes → use the STT Batch English VFast API.
- Fastest option for English audio.
- Opus format only — convert other formats before uploading.
- No speaker diarization, emotion, accent, PII/PHI tagging, or utterance-level data.
- Returns a single text string and duration_ms.
- Processing timeout: 60 seconds per request.
No → use the STT Batch API.
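The VFast branch might be called as in the sketch below. The endpoint URL, Content-Type value, and response field names here are assumptions (check the API Reference for the real ones); ffmpeg handles the Opus conversion the endpoint requires.

```python
# Hedged sketch of an STT Batch English VFast call. API_URL and the response
# shape ({"text": ..., "duration_ms": ...}) are assumptions, not confirmed
# API details.
import json
import pathlib
import subprocess
import urllib.request

API_URL = "https://api.modulate.ai/v1/stt/batch-english-vfast"  # hypothetical

def ensure_opus(path: str) -> str:
    """Convert audio to Opus with ffmpeg unless it already is Opus."""
    p = pathlib.Path(path)
    if p.suffix.lower() == ".opus":
        return path
    out = str(p.with_suffix(".opus"))
    subprocess.run(["ffmpeg", "-y", "-i", path, "-c:a", "libopus", out],
                   check=True)
    return out

def transcribe_vfast(path: str, api_key: str) -> dict:
    """POST an Opus file and return the parsed JSON response."""
    with open(ensure_opus(path), "rb") as f:
        req = urllib.request.Request(
            API_URL,
            data=f.read(),
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "audio/opus",  # assumed; may be audio/ogg
            },
        )
    with urllib.request.urlopen(req) as resp:  # server enforces 60 s timeout
        return json.load(resp)

if __name__ == "__main__":
    result = transcribe_vfast("call.mp3", "YOUR_API_KEY")
    print(result["text"], result["duration_ms"])
```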
STT Batch vs STT English VFast — detailed tradeoffs
| Consideration | Choose STT Batch | Choose STT English VFast |
|---|---|---|
| Audio language | Non-English, or unknown/multilingual | English only |
| Audio format | Any supported format (MP3, WAV, etc.) | Opus files, or audio you can convert to Opus |
| Need speaker IDs | Yes | No |
| Need emotion or accent signals | Yes | No |
| Need PII/PHI tagging | Yes | No |
| Need utterance timestamps | Yes | No |
| Need deepfake scoring alongside transcription | Yes | No |
| Processing speed is the top priority | Lower priority | Yes |
| Transcription quality is the top priority | Yes | Lower priority |
| Large-scale batch jobs in English | — | Preferred |
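For scripting, the decision tree and tradeoff table can be condensed into a small chooser. The returned strings are just the endpoint labels used in this guide, not API identifiers.

```python
# Condensed form of the decision tree above. Answers map to the endpoint
# labels used in this guide.
def pick_endpoint(need_transcription: bool, need_audio_redaction: bool,
                  realtime: bool, english_only_fast: bool) -> str:
    if not need_transcription:
        # Deepfake detection only.
        return "SVD Streaming" if realtime else "SVD Batch"
    if need_audio_redaction:
        # PII/PHI must be silenced in the audio, not just tagged.
        return "Redaction Streaming" if realtime else "Redaction Batch"
    if realtime:
        return "STT Streaming"
    # Complete file: fast English-only path vs. full-featured batch.
    return "STT English VFast" if english_only_fast else "STT Batch"
```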
Do you need deepfake detection alongside transcription?
The STT Batch API supports a deepfake_signal parameter that adds a per-utterance deepfake_score to transcription output. This is convenient when you already need transcription and want a synthetic voice signal without a second API call.
Use the dedicated Synthetic Voice Detection APIs instead when:
- Transcription is not needed — you only want to know if audio is synthetic.
- You need frame-level results across the full audio (not just utterance-level scores).
- You need explicit no-content verdicts for silent regions.
- You need streaming deepfake verdicts in real time.
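If you do take the convenience path, filtering utterances by their deepfake_score might look like this. The response shape assumed here (an "utterances" list of dicts, each carrying a deepfake_score float) is an illustration, not a documented schema.

```python
# Hedged sketch: pull out utterances whose per-utterance deepfake_score
# crosses a threshold. The 0.8 default is illustrative only.
def flag_suspicious(response: dict, threshold: float = 0.8) -> list:
    """Return utterances whose deepfake_score meets the threshold."""
    return [
        u for u in response.get("utterances", [])
        if u.get("deepfake_score", 0.0) >= threshold
    ]
```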
Common scenarios
Call center QA on recorded English calls, high volume
→ STT Batch English VFast. Convert audio to Opus if needed. Use parallel requests with a semaphore to respect concurrency limits.
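The semaphore pattern for high-volume jobs can be sketched with asyncio. Here `submit` stands in for your actual per-file HTTP call, and the limit of 8 is illustrative rather than a documented quota.

```python
# Bound in-flight requests with a semaphore so large batch jobs stay within
# concurrency limits.
import asyncio

async def transcribe_all(paths, submit, max_concurrent: int = 8) -> list:
    """Run submit(path) for every path, at most max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(path):
        async with sem:  # blocks while max_concurrent calls are in flight
            return await submit(path)

    # gather preserves input order in its results.
    return await asyncio.gather(*(one(p) for p in paths))
```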
Meeting transcription with speaker attribution and emotion
→ STT Batch with speaker_diarization=true and emotion_signal=true.
Live interview transcription with real-time captions
→ STT Streaming. Stream audio over WebSocket and display utterances as they arrive.
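A sketch of that streaming flow is below. The WebSocket URL, auth header, end-of-stream control message, and utterance message schema ({"type": "utterance", "text": ...}) are all assumptions here; check the API Reference for the real protocol.

```python
# Hedged sketch: send audio over a WebSocket and print utterances as live
# captions. WS_URL and the message shapes are assumptions.
import asyncio
import json

WS_URL = "wss://api.modulate.ai/v1/stt/stream"  # hypothetical URL

def parse_utterance(message: str):
    """Return caption text if this JSON message is an utterance, else None."""
    event = json.loads(message)
    if event.get("type") == "utterance":
        return event.get("text")
    return None

async def stream_captions(audio_chunks, api_key: str) -> None:
    """Send audio chunks and display utterances as they arrive."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(
        WS_URL,
        # extra_headers= on older websockets versions
        additional_headers={"Authorization": f"Bearer {api_key}"},
    ) as ws:
        async def send():
            for chunk in audio_chunks:
                await ws.send(chunk)  # binary audio frames
            await ws.send(json.dumps({"type": "end"}))  # assumed end signal

        async def receive():
            async for message in ws:
                text = parse_utterance(message)
                if text:
                    print(text)  # live caption

        await asyncio.gather(send(), receive())
```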
Detect AI-generated voice in a submitted audio clip
→ Synthetic Voice Detection batch endpoint.
Real-time anti-spoofing check during a voice authentication flow
→ Synthetic Voice Detection streaming endpoint. Stream audio and act on frame verdicts as they arrive.
Transcript of a support call with PII/PHI spans tagged for downstream review
→ STT Batch with pii_phi_tagging=true. Language is auto-detected per utterance. The transcript content is preserved; sensitive spans are wrapped in tags.
Compliance recording that must be shareable with PII/PHI silenced
→ PII/PHI Redaction Batch API. Returns a redacted MP3 with sensitive audio ranges silenced and a transcript where each PII/PHI span is replaced with an entity-type tag.
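That redaction flow might look like the following sketch. The endpoint URL and the response shape (a JSON body with a base64 "redacted_audio" field and a "transcript" string) are assumptions; consult the API Reference for the real request and response formats.

```python
# Hedged sketch of a Redaction Batch call: upload a file, save the redacted
# MP3, and list the entity-type tags left in the transcript.
import base64
import json
import re
import urllib.request

API_URL = "https://api.modulate.ai/v1/redaction/batch"  # hypothetical

def entity_tags(transcript: str) -> list:
    """Return entity-type tags (e.g. [SSN]) found in a redacted transcript."""
    return re.findall(r"\[[A-Z]+\]", transcript)

def redact_file(audio_path: str, api_key: str):
    """POST a complete audio file; return (transcript, redacted MP3 bytes)."""
    with open(audio_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "audio/mpeg",  # MP3 input in this example
        },
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return payload["transcript"], base64.b64decode(payload["redacted_audio"])

if __name__ == "__main__":
    transcript, audio = redact_file("call.mp3", "YOUR_API_KEY")
    with open("call_redacted.mp3", "wb") as out:
        out.write(audio)
    print(entity_tags(transcript))
```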