Velma-2 offers several endpoints with overlapping capabilities. This guide helps you pick the right one based on your latency needs, language requirements, audio format constraints, and required features.

Quick reference

Speech-to-text (STT)

| | STT Batch | STT English VFast | STT Streaming | STT Streaming v2 |
|---|---|---|---|---|
| Use case | Transcription with rich metadata | Fast English-only transcription | Real-time transcription | Low-latency English real-time transcription |
| Protocol | HTTP POST | HTTP POST | WebSocket | WebSocket |
| Languages | Multilingual | English only | Multilingual | English only |
| Audio formats | See Audio formats | See Audio formats | See Audio formats | See Audio formats |
| Max file size | 100 MB | 100 MB | n/a (streaming) | n/a (streaming) |
| Speaker diarization | ✓ | | ✓ | |
| Emotion detection | ✓ | | ✓ | |
| Accent detection | ✓ | | ✓ | |
| PII/PHI tagging | ✓ | | ✓ | |
| Synthetic voice scoring | ✓ (per-utterance) | | ✓ (per-utterance) | |
| Utterance-level output | ✓ | | ✓ | |
| Partial transcripts during streaming | | | | ✓ (every ~1.5 s) |

Synthetic voice detection (SVD)

| | SVD Batch | SVD Streaming |
|---|---|---|
| Use case | Deepfake detection on a file | Real-time deepfake detection |
| Protocol | HTTP POST | WebSocket |
| Audio formats | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | Raw PCM and container formats |
| Max file size | 100 MB | n/a (streaming) |
| Synthetic voice scoring | ✓ (per-frame) | ✓ (per-frame) |

PII/PHI redaction

| | Redaction Batch | Redaction Streaming |
|---|---|---|
| Use case | Transcription with audio redaction | Real-time transcription with audio redaction |
| Protocol | HTTP POST | WebSocket |
| Languages | Multilingual | Multilingual |
| Audio formats | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM |
| Max file size | 100 MB | n/a (streaming) |
Speaker diarization
PII/PHI audio redaction
Utterance-level output
PII/PHI tagging (STT APIs): wraps sensitive spans in tags within the transcript text; original content is preserved.
PII/PHI audio redaction (Redaction APIs): replaces each detected PII/PHI span with an entity-type tag (e.g. [FIRSTNAME], [SSN], [PHI]) in the transcript text and silences the corresponding audio ranges in the returned MP3.
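To make the redaction output more concrete, a redacted transcript can be scanned for its entity-type tags, for example to report which kinds of sensitive data were removed. The following is a minimal sketch. It assumes tags are upper-case words in square brackets, as in the examples [FIRSTNAME], [SSN], and [PHI]; the function name and the sample transcript are made up for illustration and are not real API output.

```python
import re
from collections import Counter

# Assumption: entity-type tags are upper-case words in square brackets,
# matching the examples [FIRSTNAME], [SSN], and [PHI].
ENTITY_TAG = re.compile(r"\[([A-Z_]+)\]")


def count_entity_tags(transcript: str) -> Counter:
    """Count how often each entity-type tag appears in a redacted transcript."""
    return Counter(ENTITY_TAG.findall(transcript))


# Hypothetical redacted transcript, not real API output.
sample = "My name is [FIRSTNAME] and my social is [SSN]. Ask for [FIRSTNAME]."
print(count_entity_tags(sample))
```

If the real tag set includes digits or other characters, the pattern would need to be widened accordingly.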

Language detection

| | Language Detection Batch |
|---|---|
| Use case | Identify the spoken language of an audio file |
| Protocol | HTTP POST |
| Languages | 100 spoken languages |
| Audio formats | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM |
| Max file size | 100 MB |
| Audio analyzed | First 30 seconds only |
| Output | ISO 639-1 language code, display name, confidence score |
Language detection is a pure classification endpoint — it returns no transcript, diarization, or enrichment data.

Decision tree

Do you need transcription?

Yes → continue below.
No, you need deepfake detection only → use the Synthetic Voice Detection batch endpoint for pre-recorded files, or the streaming variant for live audio. See the API Reference tab.
No, you need language identification → use the Language Detection batch endpoint. It returns the detected language as an ISO 639-1 code and display name with a confidence score. No transcription, diarization, or enrichment data is returned. See Language detection above.

Do you need PII/PHI removed from the audio itself — not just tagged in the transcript?

Yes → use the PII/PHI Redaction Batch API for complete files, or the PII/PHI Redaction Streaming API for live audio. Both return a redacted MP3 with sensitive audio ranges silenced, alongside a transcript where each PII/PHI span is replaced with an entity-type tag.
No, you only need the transcript tagged → continue below, and enable pii_phi_tagging=true on whichever transcription endpoint you choose.

Do you need results in real time, while audio is still being captured?

Yes → choose based on your requirements:
  • Need enrichments or multilingual support → use the STT Streaming API. Delivers per-utterance results with speaker labels, emotion, accent, and PII/PHI tagging.
  • English only, pure transcription, lowest latency → use the STT Streaming v2 API. Emits a rolling partial transcript every ~1.5 seconds and a single final transcript at end-of-stream — no enrichments.
No, you have a complete file → continue below.
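
Because STT Streaming v2 sends rolling partials, a client has to replace its displayed text on each partial rather than append to it. The sketch below shows one way to hold that state. The message shape is an assumption: the keys "type" and "text" and the values "partial" and "utterance" are hypothetical stand-ins, so check the API Reference for the real field names.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionState:
    """Holds finalized text plus the current rolling partial."""

    finalized: list[str] = field(default_factory=list)
    partial: str = ""

    def apply(self, message: dict) -> None:
        # Assumption: each message is a dict with hypothetical keys
        # "type" ("partial" or "utterance") and "text".
        kind = message.get("type")
        text = message.get("text", "")
        if kind == "partial":
            # A rolling partial supersedes the previous one: replace, never append.
            self.partial = text
        elif kind == "utterance":
            # The final message commits the text and clears the partial.
            self.finalized.append(text)
            self.partial = ""

    def display(self) -> str:
        """Return the text a caption widget would show right now."""
        parts = self.finalized + ([self.partial] if self.partial else [])
        return " ".join(parts)


state = CaptionState()
for msg in [
    {"type": "partial", "text": "hello"},
    {"type": "partial", "text": "hello wor"},
    {"type": "utterance", "text": "hello world"},
]:
    state.apply(msg)
print(state.display())
```

Messages of any other type are ignored here; a real client would also handle errors and connection close events.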

Is the audio English-only, and do you need maximum throughput without speaker or emotion metadata?

Yes → use the STT Batch English VFast API.
  • Fastest option for English audio.
  • No speaker diarization, emotion, accent, PII/PHI tagging, or utterance-level data.
  • Returns a single text string and duration_ms.
  • Processing timeout: 60 seconds per request.
No → use the STT Batch API.
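
The questions above can be condensed into a small helper, which may be useful when endpoint selection is driven by configuration. This is a sketch only: the `Needs` fields and the `choose_endpoint` function are hypothetical names, and the returned strings are the column labels from the quick reference tables, not API paths.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    """Requirements that drive endpoint selection (hypothetical field names)."""

    transcription: bool = True
    language_id_only: bool = False  # only consulted when transcription is False
    audio_redaction: bool = False   # PII/PHI removed from the audio itself
    real_time: bool = False         # results needed while audio is still captured
    english_only: bool = False
    enrichments: bool = False       # speaker, emotion, accent, tagging, utterances


def choose_endpoint(n: Needs) -> str:
    """Walk the decision tree in the same order as the questions in this guide."""
    if not n.transcription:
        if n.language_id_only:
            return "Language Detection Batch"
        return "SVD Streaming" if n.real_time else "SVD Batch"
    if n.audio_redaction:
        return "Redaction Streaming" if n.real_time else "Redaction Batch"
    if n.real_time:
        if n.english_only and not n.enrichments:
            return "STT Streaming v2"
        return "STT Streaming"
    if n.english_only and not n.enrichments:
        return "STT English VFast"
    return "STT Batch"


print(choose_endpoint(Needs(real_time=True, english_only=True)))
```

Transcript-only PII/PHI tagging counts as an enrichment in this sketch, so it routes to STT Batch or STT Streaming rather than to the Redaction APIs.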

STT Batch vs STT English VFast — detailed tradeoffs

| Consideration | Choose STT Batch | Choose STT English VFast |
|---|---|---|
| Audio language | Non-English, or unknown/multilingual | English only |
| Audio format | Any supported format | Any supported format |
| Need speaker IDs | Yes | No |
| Need emotion or accent signals | Yes | No |
| Need PII/PHI tagging | Yes | No |
| Need utterance timestamps | Yes | No |
| Need deepfake scoring alongside transcription | Yes | No |
| Processing speed is the top priority | Lower priority | Yes |
| Transcription quality is the top priority | Yes | Lower priority |
| Large-scale batch jobs in English | | Preferred |
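
For large English batch jobs sent to STT English VFast, parallel requests are usually bounded so that the number in flight stays within the account's concurrent limit. The sketch below shows that pattern with `asyncio.Semaphore`. The limit of 2 is an arbitrary placeholder because this guide does not state the actual limit, and `fake_transcribe` is a hypothetical stand-in for the real HTTP call.

```python
import asyncio


async def transcribe_all(paths, transcribe, max_concurrent=4):
    """Run `transcribe(path)` for every path, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(path):
        # Each task waits here until one of the semaphore slots is free.
        async with semaphore:
            return await transcribe(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(bounded(p) for p in paths))


async def fake_transcribe(path):
    # Hypothetical stand-in for an HTTP POST to the VFast endpoint.
    await asyncio.sleep(0.01)
    return {"file": path, "text": f"transcript of {path}"}


if __name__ == "__main__":
    files = ["a.wav", "b.wav", "c.wav"]
    results = asyncio.run(transcribe_all(files, fake_transcribe, max_concurrent=2))
    print([r["file"] for r in results])
```

In a real client, each request would also carry its own timeout, since VFast processing is capped at 60 seconds per request.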

Do you need deepfake detection alongside transcription?

The STT Batch API supports a deepfake_signal parameter that adds a per-utterance deepfake_score to transcription output. This is convenient when you already need transcription and want a synthetic voice signal without a second API call. Use the dedicated Synthetic Voice Detection APIs instead when:
  • Transcription is not needed — you only want to know if audio is synthetic.
  • You need frame-level results across the full audio (not just utterance-level scores).
  • You need explicit no-content verdicts for silent regions.
  • You need streaming deepfake verdicts in real time.

Common scenarios

  • Call center QA on recorded English calls, high volume → STT Batch English VFast. Use parallel requests with a semaphore to respect concurrent limits.
  • Meeting transcription with speaker attribution and emotion → STT Batch with speaker_diarization=true and emotion_signal=true.
  • Live interview transcription with real-time captions, multilingual or with speaker labels → STT Streaming. Stream audio over WebSocket and display utterances as they arrive.
  • Live English captions or voice assistant input with minimal latency → STT Streaming v2. Partial transcripts update every ~1.5 seconds: replace your displayed text with each new partial, then finalize on the utterance message.
  • Detect AI-generated voice in a submitted audio clip → Synthetic Voice Detection batch endpoint.
  • Real-time anti-spoofing check during a voice authentication flow → Synthetic Voice Detection streaming endpoint. Stream audio and act on frame verdicts as they arrive.
  • Transcript of a support call with PII/PHI spans tagged for downstream review → STT Batch with pii_phi_tagging=true. Language is auto-detected per utterance. The transcript content is preserved; sensitive spans are wrapped in tags.
  • Compliance recording that must be shareable with PII/PHI silenced → PII/PHI Redaction Batch API. Returns a redacted MP3 with sensitive audio ranges silenced and a transcript where each PII/PHI span is replaced with an entity-type tag.
  • Route audio to the right transcription pipeline based on spoken language → Language Detection batch endpoint. Send the clip, read predicted_language_code from the response, and route accordingly. Use the confidence field to catch low-certainty results and handle them separately.
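
The language-routing scenario can be sketched as a small function over the detection response. The field names predicted_language_code and confidence come from the description above. Everything else is an assumption for illustration: the 0.8 threshold is arbitrary, and the pipeline names returned are hypothetical labels.

```python
def route_by_language(detection: dict, min_confidence: float = 0.8) -> str:
    """Pick a downstream pipeline from a language detection response.

    Assumption: `detection` is the parsed JSON body and contains the
    fields `predicted_language_code` and `confidence`.
    """
    code = detection["predicted_language_code"]
    confidence = detection["confidence"]

    # Low-certainty results are handled separately instead of being routed.
    if confidence < min_confidence:
        return "manual_review"
    # English audio can go to the fast English-only endpoint.
    if code == "en":
        return "stt_english_vfast"
    # Everything else goes to the multilingual batch endpoint.
    return "stt_batch"


# Hypothetical responses, not real API output.
print(route_by_language({"predicted_language_code": "en", "confidence": 0.97}))
print(route_by_language({"predicted_language_code": "de", "confidence": 0.55}))
```

Routing English to VFast only makes sense when speaker, emotion, and tagging metadata are not needed; otherwise all confident results would go to STT Batch.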