Speech-to-text Transcription

Modulate offers four speech-to-text endpoints. Pick the one that matches your latency, language, and feature needs.

	Batch (multilingual)	Batch English VFast	Streaming	Streaming English
Use case	Transcription with rich metadata	Fast English-only transcription	Real-time transcription	Low-latency English real-time transcription
Protocol	HTTP POST	HTTP POST	WebSocket	WebSocket
Languages	Multilingual	English only	Multilingual	English only
Speaker diarization	✓	—	✓	—
Emotion / accent detection	✓	—	✓	—
PII/PHI tagging	✓	—	✓	—
Partial transcripts during streaming	—	—	—	✓ (every ~1.5 s)
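
The table above can be read as a small decision rule. The sketch below encodes its rows as data and filters them by requirement. The `Endpoint` class, the `candidates` function, and the grouping of diarization, emotion/accent detection, and PII/PHI tagging into one `rich_metadata` flag are illustrative assumptions, not part of the Modulate API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    protocol: str        # "HTTP POST" or "WebSocket", as listed in the table
    multilingual: bool
    rich_metadata: bool  # diarization, emotion/accent detection, PII/PHI tagging
    partials: bool       # partial transcripts during streaming


# One entry per column of the comparison table.
ENDPOINTS = [
    Endpoint("Batch (multilingual)", "HTTP POST", True, True, False),
    Endpoint("Batch English VFast", "HTTP POST", False, False, False),
    Endpoint("Streaming", "WebSocket", True, True, False),
    Endpoint("Streaming English", "WebSocket", False, False, True),
]


def candidates(*, realtime, non_english=False, rich_metadata=False, partials=False):
    """Return the names of endpoints that satisfy every stated requirement."""
    wanted_protocol = "WebSocket" if realtime else "HTTP POST"
    return [
        e.name
        for e in ENDPOINTS
        if e.protocol == wanted_protocol
        and (e.multilingual or not non_english)
        and (e.rich_metadata or not rich_metadata)
        and (e.partials or not partials)
    ]
```

An empty result means no single endpoint in the table covers the combination, for example partial transcripts together with speaker diarization.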

For a side-by-side comparison with the other Modulate capabilities, see "Which API should I use?".

Authentication

Batch endpoints use the X-API-Key header. Streaming endpoints use an api_key query parameter at connection time. See Authentication and rate limits.
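
As a minimal sketch of the two credential placements, the helpers below build the request header for a batch call and the connection URL for a streaming call. Only the `X-API-Key` header name and the `api_key` query parameter come from this page. The function names and the `example.invalid` URL are placeholders, so substitute the real endpoint URLs from the endpoint reference.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def batch_headers(api_key: str) -> dict:
    """Headers for a batch (HTTP POST) request: the key travels in X-API-Key."""
    return {"X-API-Key": api_key}


def streaming_url(base_url: str, api_key: str) -> str:
    """Connection URL for a streaming (WebSocket) request.

    The key is appended as the api_key query parameter, preserving any
    query parameters already present on base_url.
    """
    parts = urlsplit(base_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("api_key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL for illustration only; not a real Modulate endpoint.
ws_url = streaming_url("wss://example.invalid/stream", "YOUR_API_KEY")
headers = batch_headers("YOUR_API_KEY")
```

Because the streaming key is part of the URL, it can end up in proxy or application logs, so redact the query string before logging connection URLs.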
