Quick reference
Transcription
| | Multilingual Transcription (batch) | English Fast Transcription (batch) | Multilingual Transcription (streaming) | English Fast Transcription (streaming) |
|---|---|---|---|---|
| Use case | Transcription with rich metadata | Fast English-only transcription | Real-time transcription | Low-latency English real-time transcription |
| Protocol | HTTP POST | HTTP POST | WebSocket | WebSocket |
| Languages | Multilingual | English only | Multilingual | English only |
| Audio formats | See Audio formats | See Audio formats | See Audio formats | See Audio formats |
| Max file size | 100 MB | 100 MB | — (streaming) | — (streaming) |
| Speaker diarization | ✓ | — | ✓ | — |
| Emotion detection | ✓ | — | ✓ | — |
| Accent detection | ✓ | — | ✓ | — |
| PII/PHI tagging | ✓ | — | ✓ | — |
| Deepfake scoring | ✓ (per-utterance) | — | ✓ (per-utterance) | — |
| Utterance-level output | ✓ | — | ✓ | — |
| Partial transcripts during streaming | — | — | — | ✓ (every ~1.5 s) |
Deepfake Detection
| | Deepfake Detection (batch) | Deepfake Detection (streaming) |
|---|---|---|
| Use case | Deepfake detection on a file | Real-time deepfake detection |
| Protocol | HTTP POST | WebSocket |
| Audio formats | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | Raw PCM and container formats |
| Max file size | 100 MB | — (streaming) |
| Deepfake scoring | ✓ (per-frame) | ✓ (per-frame) |
PII/PHI Redaction
| | PII/PHI Redaction (batch) | PII/PHI Redaction (streaming) |
|---|---|---|
| Use case | Transcription with audio redaction | Real-time transcription with audio redaction |
| Protocol | HTTP POST | WebSocket |
| Languages | Multilingual | Multilingual |
| Audio formats | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM |
| Max file size | 100 MB | — (streaming) |
| Speaker diarization | ✓ | ✓ |
| PII/PHI audio redaction | ✓ | ✓ |
| Utterance-level output | ✓ | ✓ |
Redaction replaces each PII/PHI span with an entity-type tag (for example <pii:name></pii:name>, <pii:ssn></pii:ssn>, <phi></phi>) in the transcript text and silences the corresponding audio ranges in the returned MP3.
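For downstream review it can help to tally which entity types were redacted. The following is a minimal sketch, assuming the tags always take the empty-element shape seen in the examples above (`<pii:TYPE></pii:TYPE>` or `<phi></phi>`); the function name `count_entity_tags` and the regular expression are illustrative and not part of the API.

```python
import re
from collections import Counter

# Assumed tag shape, inferred only from the examples in this section:
# an empty element such as <pii:name></pii:name> or <phi></phi>.
ENTITY_TAG = re.compile(r"<(pii:[a-z_]+|phi)></\1>")


def count_entity_tags(transcript: str) -> Counter:
    """Count how often each entity-type tag appears in a redacted transcript."""
    return Counter(match.group(1) for match in ENTITY_TAG.finditer(transcript))


sample = "My name is <pii:name></pii:name> and my SSN is <pii:ssn></pii:ssn>."
print(count_entity_tags(sample))
```

If the real tag set includes types with digits or other characters, the character class in the pattern would need to be widened.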
Language Detection
| | Language Detection (batch) |
|---|---|
| Use case | Identify the spoken language of an audio file |
| Protocol | HTTP POST |
| Languages | 100 spoken languages |
| Audio formats | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM |
| Max file size | 100 MB |
| Audio analyzed | First 30 seconds only |
| Output | ISO 639-1 language code, display name, confidence score |
Decision tree
Do you need transcription?
Yes → continue below.
No, you need deepfake detection only → use the Deepfake Detection batch endpoint for pre-recorded files, or the streaming variant for live audio. See the API Reference tab.
No, you need language identification → use the Language Detection batch endpoint. It returns the detected language as an ISO 639-1 code and display name with a confidence score. No transcription, diarization, or enrichment data is returned. See Language Detection above.
Do you need PII/PHI removed from the audio itself, not just tagged in the transcript?
Yes → use the PII/PHI Redaction Batch API for complete files, or the PII/PHI Redaction Streaming API for live audio. Both return a redacted MP3 with sensitive audio ranges silenced, alongside a transcript where each PII/PHI span is replaced with an entity-type tag.
No, you only need the transcript tagged → continue below, and enable pii_phi_tagging=true on whichever transcription endpoint you choose.
Do you need results in real time, while audio is still being captured?
Yes → choose based on your requirements:
- Need enrichments or multilingual support → use the Multilingual Transcription streaming API. Delivers per-utterance results with speaker labels, emotion, accent, and PII/PHI tagging.
- English only, pure transcription, lowest latency → use the English Fast Transcription streaming API. Emits a rolling partial transcript every ~1.5 seconds and a single final transcript at end-of-stream, with no enrichments.
Is the audio English-only, and do you need maximum throughput without speaker or emotion metadata?
Yes → use the English Fast Transcription batch API.
- Fastest option for English audio.
- No speaker diarization, emotion, accent, PII/PHI tagging, or utterance-level data.
- Returns a single text string and duration_ms.
- Processing timeout: 60 seconds per request.
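The decision tree can be condensed into a small helper. This is a sketch of one reading of the questions above, not an official selector: the `Needs` fields and the `choose_api` function are hypothetical names, and the fallback to Multilingual Transcription when the English Fast conditions are not met is inferred from the comparison tables rather than stated in the tree.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    """Hypothetical summary of a caller's requirements."""
    transcription: bool = True
    language_id_only: bool = False   # only relevant when transcription is False
    audio_redaction: bool = False    # PII/PHI silenced in the audio itself
    real_time: bool = False
    english_only: bool = False
    enrichments: bool = False        # speakers, emotion, accent, PII/PHI tagging


def choose_api(n: Needs) -> str:
    """Walk the questions in the same order as the decision tree."""
    mode = "streaming" if n.real_time else "batch"
    if not n.transcription:
        if n.language_id_only:
            return "Language Detection (batch)"
        return f"Deepfake Detection ({mode})"
    if n.audio_redaction:
        return f"PII/PHI Redaction ({mode})"
    # English Fast only fits English audio with no enrichment requirements.
    fast = n.english_only and not n.enrichments
    family = "English Fast Transcription" if fast else "Multilingual Transcription"
    return f"{family} ({mode})"


print(choose_api(Needs(english_only=True)))
print(choose_api(Needs(real_time=True, enrichments=True)))
```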
Multilingual Transcription vs English Fast Transcription — detailed tradeoffs
| Consideration | Choose Multilingual Transcription | Choose English Fast Transcription |
|---|---|---|
| Audio language | Non-English, or unknown/multilingual | English only |
| Audio format | Any supported format | Any supported format |
| Need speaker IDs | Yes | No |
| Need emotion or accent signals | Yes | No |
| Need PII/PHI tagging | Yes | No |
| Need utterance timestamps | Yes | No |
| Need deepfake scoring alongside transcription | Yes | No |
| Processing speed is the top priority | Lower priority | Yes |
| Transcription quality is the top priority | Yes | Lower priority |
| Large-scale batch jobs in English | — | Preferred |
Do you need deepfake detection alongside transcription?
The Multilingual Transcription batch API supports a deepfake_signal parameter that adds a per-utterance deepfake_score to transcription output. This is convenient when you already need transcription and want a deepfake signal without a second API call.
Use the dedicated Deepfake Detection APIs instead when:
- Transcription is not needed — you only want to know if audio is synthetic.
- You need frame-level results across the full audio (not just utterance-level scores).
- You need explicit no-content verdicts for silent regions.
- You need streaming deepfake verdicts in real time.
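When the per-utterance score from transcription is enough, a typical follow-up is to flag the utterances worth a closer look. The sketch below makes several assumptions that this page does not confirm: that the response carries an `utterances` list, that each entry holds `deepfake_score` as a number where higher means more likely synthetic, and that 0.5 is a sensible cut-off. Treat all three as placeholders and check the API Reference.

```python
from typing import Any


def flag_suspect_utterances(
    response: dict[str, Any], threshold: float = 0.5
) -> list[dict[str, Any]]:
    """Return utterances whose deepfake_score meets or exceeds the threshold.

    Assumptions (not confirmed by the docs): the response has an
    "utterances" list and a higher score means "more likely synthetic".
    Utterances without a score are skipped rather than flagged.
    """
    flagged = []
    for utterance in response.get("utterances", []):
        score = utterance.get("deepfake_score")
        if score is not None and score >= threshold:
            flagged.append(utterance)
    return flagged


example = {
    "utterances": [
        {"text": "Hello there.", "deepfake_score": 0.1},
        {"text": "Please wire the funds.", "deepfake_score": 0.9},
        {"text": "Thanks."},
    ]
}
print([u["text"] for u in flag_suspect_utterances(example)])
```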
Common scenarios
Call center QA on recorded English calls, high volume
→ English Fast Transcription (batch). Use parallel requests with a semaphore to respect concurrency limits.
Meeting transcription with speaker attribution and emotion
→ Multilingual Transcription (batch) with speaker_diarization=true and emotion_signal=true.
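For high-volume batch jobs such as the call center case, the semaphore pattern might look like the sketch below. `transcribe_file` is a stand-in that only simulates a request (the real call would be an HTTP POST to the batch endpoint), and `MAX_CONCURRENT = 4` is a made-up value; substitute the concurrency limit that applies to your account.

```python
import asyncio

MAX_CONCURRENT = 4  # hypothetical; use your account's actual concurrency limit


async def transcribe_file(path: str) -> dict:
    """Placeholder for the real HTTP POST to the batch endpoint."""
    await asyncio.sleep(0.01)  # simulates network and processing time
    return {"path": path, "text": "", "duration_ms": 0}


async def transcribe_all(paths: list[str], limit: int = MAX_CONCURRENT) -> list[dict]:
    """Transcribe every file while keeping at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(path: str) -> dict:
        async with semaphore:  # waits here once `limit` requests are running
            return await transcribe_file(path)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(p) for p in paths))


if __name__ == "__main__":
    files = [f"call_{i}.wav" for i in range(10)]
    results = asyncio.run(transcribe_all(files))
    print(len(results))
```

Because each request has a 60-second processing timeout per the section above, a production version would likely wrap the call in its own timeout and retry logic.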
Live interview transcription with real-time captions, multilingual or with speaker labels
→ Multilingual Transcription streaming. Stream audio over WebSocket and display utterances as they arrive.
Live English captions or voice assistant input with minimal latency
→ English Fast Transcription streaming. Partial transcripts update every ~1.5 seconds — replace your displayed text with each new partial, then finalize on the utterance message.
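The replace-then-finalize display logic can be sketched as a small buffer. The message shape here is an assumption: a dict with a `type` of either `"partial"` or `"utterance"` and a `text` field. The actual WebSocket message schema may differ, so check the API Reference before relying on these names.

```python
class CaptionBuffer:
    """Tracks finalized caption text plus the current rolling partial."""

    def __init__(self) -> None:
        self.finalized: list[str] = []
        self.partial = ""

    def handle(self, message: dict) -> str:
        """Apply one incoming message and return the text to display."""
        kind = message.get("type")   # assumed field name
        text = message.get("text", "")  # assumed field name
        if kind == "partial":
            # Each partial supersedes the previous one: replace, never append.
            self.partial = text
        elif kind == "utterance":
            # Final text arrives: commit it and clear the rolling partial.
            self.finalized.append(text)
            self.partial = ""
        return self.display()

    def display(self) -> str:
        parts = (*self.finalized, self.partial)
        return " ".join(part for part in parts if part)


captions = CaptionBuffer()
captions.handle({"type": "partial", "text": "hello"})
captions.handle({"type": "partial", "text": "hello wor"})
print(captions.handle({"type": "utterance", "text": "hello world"}))
```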
Detect AI-generated voice in a submitted audio clip
→ Deepfake Detection batch endpoint.
Real-time anti-spoofing check during a voice authentication flow
→ Deepfake Detection streaming endpoint. Stream audio and act on frame verdicts as they arrive.
Transcript of a support call with PII/PHI spans tagged for downstream review
→ Multilingual Transcription (batch) with pii_phi_tagging=true. Language is auto-detected per utterance. The transcript content is preserved; sensitive spans are wrapped in tags.
Compliance recording that must be shareable with PII/PHI silenced
→ PII/PHI Redaction Batch API. Returns a redacted MP3 with sensitive audio ranges silenced and a transcript where each PII/PHI span is replaced with an entity-type tag.
Route audio to the right transcription pipeline based on spoken language
→ Language Detection batch endpoint. Send the clip, read predicted_language_code from the response, and route accordingly. Use the confidence field to catch low-certainty results and handle them separately.
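A routing step built on that response could look like the following sketch. The field names predicted_language_code and confidence come from the scenario above; the pipeline labels, the `route_by_language` function, and the 0.8 threshold are hypothetical and should be tuned to your own data.

```python
MIN_CONFIDENCE = 0.8  # hypothetical threshold; tune against your own audio


def route_by_language(response: dict, min_confidence: float = MIN_CONFIDENCE) -> str:
    """Pick a pipeline label from a Language Detection response.

    Returns "needs-review" when the language is missing or the
    confidence falls below the threshold.
    """
    code = response.get("predicted_language_code")
    confidence = response.get("confidence", 0.0)
    if code is None or confidence < min_confidence:
        return "needs-review"
    # English can go to the fast path; everything else stays multilingual.
    return "english-fast" if code == "en" else "multilingual"


print(route_by_language({"predicted_language_code": "en", "confidence": 0.97}))
print(route_by_language({"predicted_language_code": "es", "confidence": 0.55}))
```

Since only the first 30 seconds of audio are analyzed, recordings that switch language later may still warrant the multilingual pipeline even when detection reports English.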