Which API should I use?

Velma-2 offers several endpoints with overlapping capabilities. This guide helps you pick the right one based on your latency needs, language requirements, audio format constraints, and required features.

Quick reference

Speech-to-text (STT)

	STT Batch	STT English VFast	STT Streaming
Use case	Transcription with rich metadata	Fast English-only transcription	Real-time transcription
Protocol	HTTP POST	HTTP POST	WebSocket
Languages	Multilingual	English only	Multilingual
Audio formats	AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM	Opus only	AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM
Max file size	100 MB	100 MB	— (streaming)
Speaker diarization	✓	—	✓
Emotion detection	✓	—	✓
Accent detection	✓	—	✓
PII/PHI tagging	✓	—	✓
Synthetic voice scoring	✓ (per-utterance)	—	—
Utterance-level output	✓	—	✓
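The format and size limits in the table can be checked locally before uploading. The following is a minimal sketch, not part of the Velma-2 SDK: the function name `preflight`, the file-extension spellings, and the reading of "100 MB" as 100 MiB are all assumptions made for illustration.

```python
from pathlib import Path

# Limits copied from the STT table; extension spellings are assumptions.
MAX_BATCH_BYTES = 100 * 1024 * 1024  # assumes "100 MB" means 100 MiB
BATCH_EXTENSIONS = {".aac", ".aiff", ".flac", ".mp3", ".mp4", ".mov",
                    ".ogg", ".opus", ".wav", ".webm"}
VFAST_EXTENSIONS = {".opus"}


def preflight(path: str, size_bytes: int, endpoint: str = "batch") -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    allowed = VFAST_EXTENSIONS if endpoint == "vfast" else BATCH_EXTENSIONS
    problems = []
    ext = Path(path).suffix.lower()
    if ext not in allowed:
        problems.append(f"unsupported extension {ext!r} for {endpoint}")
    if size_bytes > MAX_BATCH_BYTES:
        problems.append("file exceeds the 100 MB limit")
    return problems
```

A check like this only inspects the file name and size; the service remains the authority on whether the audio itself is decodable.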

Synthetic voice detection (SVD)

	SVD Batch	SVD Streaming
Use case	Deepfake detection on a file	Real-time deepfake detection
Protocol	HTTP POST	WebSocket
Audio formats	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM	Raw PCM and container formats
Max file size	100 MB	— (streaming)
Synthetic voice scoring	✓ (per-frame)	✓ (per-frame)

PII/PHI redaction

	Redaction Batch	Redaction Streaming
Use case	Transcription with audio redaction	Real-time transcription with audio redaction
Protocol	HTTP POST	WebSocket
Languages	Multilingual	Multilingual
Audio formats	AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM	AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM
Max file size	100 MB	— (streaming)
Speaker diarization	✓	✓
PII/PHI audio redaction	✓	✓
Utterance-level output	✓	✓

PII/PHI tagging (STT APIs): wraps sensitive spans in tags within the transcript text; the original content is preserved.
PII/PHI audio redaction (Redaction APIs): replaces each detected PII/PHI span with an entity-type tag (e.g. [FIRSTNAME], [SSN], [PHI]) in the transcript text and silences the corresponding audio ranges in the returned MP3.

Decision tree

Do you need transcription?

Yes → continue below.
No, you need deepfake detection only → use the Synthetic Voice Detection batch endpoint for pre-recorded files, or the streaming variant for live audio. See the API Reference tab.

Do you need PII/PHI removed from the audio itself — not just tagged in the transcript?

Yes → use the PII/PHI Redaction Batch API for complete files, or the PII/PHI Redaction Streaming API for live audio. Both return a redacted MP3 with sensitive audio ranges silenced, alongside a transcript where each PII/PHI span is replaced with an entity-type tag.
No, you only need the transcript tagged → continue below, and enable pii_phi_tagging=true on whichever transcription endpoint you choose.

Do you need results in real time, while audio is still being captured?

Yes → use the STT Streaming API. It delivers utterances over WebSocket as they are transcribed.
No, you have a complete file → continue below.

Is the audio English-only, and do you need maximum throughput without speaker or emotion metadata?

Yes → use the STT Batch English VFast API.

Fastest option for English audio.
Opus format only — convert other formats before uploading.
No speaker diarization, emotion, accent, PII/PHI tagging, or utterance-level data.
Returns a single text string and duration_ms.
Processing timeout: 60 seconds per request.

No → use the STT Batch API.
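The questions above can be sketched as a single function. This is only an illustration of the decision order described in this guide: the function name `choose_api` and its parameter names are hypothetical, and the returned strings are the endpoint labels used in the quick reference tables, not API identifiers.

```python
def choose_api(needs_transcription: bool,
               needs_audio_redaction: bool = False,
               realtime: bool = False,
               english_only: bool = False,
               needs_metadata: bool = True) -> str:
    """Walk the decision tree top to bottom and return an endpoint label."""
    # 1. No transcription needed: deepfake detection only.
    if not needs_transcription:
        return "SVD Streaming" if realtime else "SVD Batch"
    # 2. PII/PHI must be removed from the audio itself.
    if needs_audio_redaction:
        return "Redaction Streaming" if realtime else "Redaction Batch"
    # 3. Results needed while audio is still being captured.
    if realtime:
        return "STT Streaming"
    # 4. English-only file with no speaker, emotion, or utterance metadata.
    if english_only and not needs_metadata:
        return "STT English VFast"
    return "STT Batch"
```

For example, `choose_api(True, english_only=True, needs_metadata=False)` returns `"STT English VFast"`, while the same call with the default `needs_metadata=True` falls through to `"STT Batch"`.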

STT Batch vs STT English VFast — detailed tradeoffs

Consideration	Choose STT Batch	Choose STT English VFast
Audio language	Non-English, or unknown/multilingual	English only
Audio format	Any supported format (MP3, WAV, etc.)	Can convert to Opus or already have Opus files
Need speaker IDs	Yes	No
Need emotion or accent signals	Yes	No
Need PII/PHI tagging	Yes	No
Need utterance timestamps	Yes	No
Need deepfake scoring alongside transcription	Yes	No
Processing speed is the top priority	Lower priority	Yes
Transcription quality is the top priority	Yes	Lower priority
Large-scale batch jobs in English	—	Preferred
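STT English VFast accepts Opus only, so audio in other formats has to be converted before upload. One way to do that is with ffmpeg, sketched below. This assumes ffmpeg is installed with the libopus encoder; the helper names and the 32k bitrate are illustrative choices, and this guide does not specify a required bitrate or container.

```python
import subprocess


def opus_command(src: str, dst: str, bitrate: str = "32k") -> list[str]:
    """Build an ffmpeg argument list that re-encodes src to Opus at dst."""
    return [
        "ffmpeg",
        "-y",               # overwrite dst if it already exists
        "-i", src,          # input file in any format ffmpeg can decode
        "-vn",              # drop any video stream (e.g. from MP4 or MOV)
        "-c:a", "libopus",  # encode audio with the Opus codec
        "-b:a", bitrate,    # target audio bitrate (assumed value)
        dst,
    ]


def convert_to_opus(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(opus_command(src, dst), check=True)
```

Calling `convert_to_opus("call.wav", "call.opus")` writes an Opus file next to the original, which can then be checked against the 100 MB limit before upload.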

Do you need deepfake detection alongside transcription?

The STT Batch API supports a deepfake_signal parameter that adds a per-utterance deepfake_score to transcription output. This is convenient when you already need transcription and want a synthetic voice signal without a second API call. Use the dedicated Synthetic Voice Detection APIs instead when:

Transcription is not needed — you only want to know if audio is synthetic.
You need frame-level results across the full audio (not just utterance-level scores).
You need explicit no-content verdicts for silent regions.
You need streaming deepfake verdicts in real time.
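The four conditions above reduce to a small routing rule. The sketch below is illustrative only: `deepfake_route` and its parameter names are hypothetical, and the return values are descriptive labels rather than endpoint paths.

```python
def deepfake_route(needs_transcription: bool,
                   needs_frame_level: bool = False,
                   needs_no_content_verdicts: bool = False,
                   realtime: bool = False) -> str:
    """Pick between deepfake_signal on STT Batch and the dedicated SVD APIs."""
    # Real-time verdicts are only available from the streaming SVD endpoint.
    if realtime:
        return "SVD Streaming"
    # Any of these needs goes beyond per-utterance scores on a transcript.
    if not needs_transcription or needs_frame_level or needs_no_content_verdicts:
        return "SVD Batch"
    # Transcription is already required and utterance-level scores suffice.
    return "STT Batch with deepfake_signal"
```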

Common scenarios

Call center QA on recorded English calls, high volume → STT Batch English VFast. Convert to Opus if needed. Use parallel requests with a semaphore to respect concurrent limits.
Meeting transcription with speaker attribution and emotion → STT Batch with speaker_diarization=true and emotion_signal=true.
Live interview transcription with real-time captions → STT Streaming. Stream audio over WebSocket and display utterances as they arrive.
Detect AI-generated voice in a submitted audio clip → Synthetic Voice Detection batch endpoint.
Real-time anti-spoofing check during a voice authentication flow → Synthetic Voice Detection streaming endpoint. Stream audio and act on frame verdicts as they arrive.
Transcript of a support call with PII/PHI spans tagged for downstream review → STT Batch with pii_phi_tagging=true. Language is auto-detected per utterance. The transcript content is preserved; sensitive spans are wrapped in tags.
Compliance recording that must be shareable with PII/PHI silenced → PII/PHI Redaction Batch API. Returns a redacted MP3 with sensitive audio ranges silenced and a transcript where each PII/PHI span is replaced with an entity-type tag.
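The high-volume call center scenario above suggests parallel requests bounded by a semaphore. A minimal sketch with asyncio follows. The `transcribe_file` coroutine is a placeholder standing in for a real HTTP upload, and the default of 4 concurrent requests is an assumed value; check your account's actual concurrency limit.

```python
import asyncio


async def transcribe_file(path: str) -> str:
    """Placeholder for the real upload; replace with your HTTP client call."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"transcript of {path}"


async def transcribe_all(paths: list[str], max_concurrent: int = 4) -> list[str]:
    """Transcribe every path, never running more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(path: str) -> str:
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await transcribe_file(path)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(bounded(p) for p in paths))
```

Run it with `asyncio.run(transcribe_all(["a.opus", "b.opus"]))`. The semaphore caps in-flight requests without serializing the whole job, so throughput stays high while the concurrency limit is respected.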
