Which model should I use?
| | STT Batch | STT English VFast | STT Streaming | STT Streaming v2 |
|---|---|---|---|---|
| Protocol | HTTP POST | HTTP POST | WebSocket | WebSocket |
| Languages | Multilingual | English only | Multilingual | English only |
| Utterance-level output | ✓ | — | ✓ | — |
| Speaker diarization | ✓ | — | ✓ | — |
| Emotion / accent / PII signals | ✓ | — | ✓ | — |
| Partial transcripts during streaming | — | — | — | ✓ (every ~1.5 s) |
| Best for | Rich metadata on complete files | Maximum English throughput | Real-time captions and live audio | Low-latency English captions and voice assistants |
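The comparison above can be read as a small decision rule. The sketch below is one possible reading of the table, not an official selection algorithm; the function name and its three boolean inputs are hypothetical, and `needs_enrichment` stands for any of utterance-level output, speaker diarization, or emotion / accent / PII signals.

```python
def choose_model(*, live: bool, english_only: bool, needs_enrichment: bool) -> str:
    """Pick a model name from the comparison table (illustrative reading only)."""
    # The English-only models trade metadata for speed, so they only fit
    # when the audio is English and no enrichments are required.
    fast_path_ok = english_only and not needs_enrichment
    if live:
        # WebSocket models: v2 for low-latency English, otherwise multilingual.
        return "STT Streaming v2" if fast_path_ok else "STT Streaming"
    # HTTP POST models: VFast for throughput, otherwise the full batch model.
    return "STT English VFast" if fast_path_ok else "STT Batch"


if __name__ == "__main__":
    print(choose_model(live=True, english_only=True, needs_enrichment=False))
```

An English call-center stream that needs speaker labels would still land on STT Streaming under this rule, because STT Streaming v2 does not provide diarization.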
STT Batch (multilingual)
Send a complete audio file and get back a full transcript with per-utterance timing, speaker labels, and optional enrichments.
| Parameter | What it adds |
|---|---|
| emotion_signal=true | Per-utterance emotion label |
| accent_signal=true | Per-utterance accent label |
| deepfake_signal=true | Per-utterance deepfake_score (0–1) |
| pii_phi_tagging=true | Sensitive spans wrapped in entity tags in the transcript |
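As a rough illustration of how the enrichment flags in the table might be combined, the sketch below builds a parameter string from the four flag names. It assumes the flags are sent as URL query parameters, which the table does not confirm; the helper name `enrichment_query` is hypothetical. Check the STT Batch API reference for the actual request shape.

```python
from urllib.parse import urlencode

# Flag names taken from the enrichment table above.
ENRICHMENT_FLAGS = (
    "emotion_signal",
    "accent_signal",
    "deepfake_signal",
    "pii_phi_tagging",
)


def enrichment_query(*enabled: str) -> str:
    """Return e.g. 'emotion_signal=true&pii_phi_tagging=true' (assumed query form)."""
    unknown = set(enabled) - set(ENRICHMENT_FLAGS)
    if unknown:
        raise ValueError(f"unknown enrichment flag(s): {sorted(unknown)}")
    # Emit flags in table order so the output is stable.
    return urlencode({flag: "true" for flag in ENRICHMENT_FLAGS if flag in enabled})


if __name__ == "__main__":
    print(enrichment_query("emotion_signal", "pii_phi_tagging"))
```

Rejecting unknown names early catches typos such as `emotion_signals` before a request is sent.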
STT Batch — English VFast
Optimized for high-throughput English transcription. Returns a single transcript string with no per-utterance breakdown.
No speaker diarization, emotion, accent, PII/PHI tagging, or utterance-level data. Use STT Batch if you need any of those.
STT Streaming (WebSocket)
Connect over WebSocket and receive utterances as speech is recognized. This is ideal for live audio, phone calls, and real-time captions.
Send an empty string ("") to signal end of stream.
For self-describing formats (MP3, WAV, OGG, FLAC, WebM, AAC, AIFF) the format is auto-detected. Raw PCM formats require audio_format, sample_rate, and num_channels query parameters. See Audio formats.
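The format rule above can be sketched as a small helper that returns the query string for a streaming connection. This is a minimal sketch under stated assumptions: the helper name `audio_query` is hypothetical, the lowercase format names are assumed spellings, and any format outside the self-describing list is treated as raw PCM.

```python
from typing import Optional
from urllib.parse import urlencode

# Self-describing formats listed above; lowercase spellings are an assumption.
SELF_DESCRIBING = {"mp3", "wav", "ogg", "flac", "webm", "aac", "aiff"}


def audio_query(
    audio_format: Optional[str] = None,
    sample_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
) -> str:
    """Build the audio-related query string for a streaming connection."""
    # Auto-detected formats need no extra parameters.
    if audio_format is None or audio_format.lower() in SELF_DESCRIBING:
        return ""
    # Anything else is treated as raw PCM, which needs all three parameters.
    if sample_rate is None or num_channels is None:
        raise ValueError("raw PCM requires audio_format, sample_rate, and num_channels")
    return urlencode(
        {
            "audio_format": audio_format,
            "sample_rate": sample_rate,
            "num_channels": num_channels,
        }
    )


if __name__ == "__main__":
    print(audio_query("s16le", 16000, 1))
```

Failing fast on a missing `sample_rate` or `num_channels` avoids opening a connection that the server cannot decode.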
STT Streaming v2 (English, low-latency)
Connect over WebSocket for low-latency English transcription. Unlike STT Streaming, this model emits a rolling partial transcript roughly every 1.5 seconds while audio arrives, which is useful for captions, voice assistants, and single-speaker workflows where you want to display text before the speaker finishes.

English only. No speaker diarization, emotion, accent, or PII/PHI enrichments. For those features, use STT Streaming above.
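For a caption display, one way to consume rolling partials is to keep only the latest text and show its tail. The sketch below assumes each partial supersedes the previous one rather than appending to it, which this page does not state explicitly. The class name `RollingCaption` and the `max_chars` limit are hypothetical, and extracting the text from the received message is left out because the message schema is defined in the API reference.

```python
class RollingCaption:
    """Holds the most recent partial transcript and renders a caption line."""

    def __init__(self, max_chars: int = 80) -> None:
        self.max_chars = max_chars
        self.current = ""

    def on_partial(self, text: str) -> str:
        # Assumption: a new partial replaces the previous one entirely.
        self.current = text.strip()
        return self.visible()

    def visible(self) -> str:
        # Short text fits as-is.
        if len(self.current) <= self.max_chars:
            return self.current
        # Otherwise keep the tail so the newest words stay on screen.
        start = len(self.current) - self.max_chars
        tail = self.current[start:]
        # If the cut landed mid-word, drop the leading word fragment.
        if self.current[start - 1] != " ":
            cut = tail.find(" ")
            if cut != -1:
                tail = tail[cut + 1:]
        return tail.lstrip()


if __name__ == "__main__":
    caption = RollingCaption(max_chars=10)
    for partial in ["hello", "hello world", "hello world again"]:
        print(caption.on_partial(partial))
```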
Sending raw PCM (audio_format=s16le&sample_rate=16000&num_channels=1) bypasses the server’s audio decoder. Container formats (ogg, mp3, wav, etc.) are supported and require no additional parameters.
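When streaming raw PCM, audio is typically sent in small fixed-duration frames. The sketch below splits s16le audio (2 bytes per sample) into such frames; the helper name `pcm_frames` and the 100 ms frame duration are illustrative assumptions, not values recommended by this page.

```python
from typing import Iterator


def pcm_frames(
    pcm: bytes,
    sample_rate: int = 16000,
    num_channels: int = 1,
    chunk_ms: int = 100,  # hypothetical frame duration
) -> Iterator[bytes]:
    """Yield consecutive chunks of s16le PCM, each about chunk_ms long."""
    bytes_per_sample = 2  # s16le = signed 16-bit little-endian
    frame_size = bytes_per_sample * num_channels
    # Whole samples per chunk, then converted to bytes.
    chunk_bytes = (sample_rate * chunk_ms // 1000) * frame_size
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(pcm), chunk_bytes):
        # The final chunk may be shorter than the others.
        yield pcm[offset:offset + chunk_bytes]


if __name__ == "__main__":
    one_second_of_silence = bytes(16000 * 2)
    frames = list(pcm_frames(one_second_of_silence))
    print(len(frames), len(frames[0]))
```

At 16 kHz mono, one second of s16le audio is 32,000 bytes, so 100 ms frames come out to 3,200 bytes each.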
API reference
- STT Batch — full parameter and response schema
- STT Batch English VFast
- STT Streaming — WebSocket protocol, close codes, all parameters
- STT Streaming v2 — WebSocket protocol, close codes, all parameters