Velma-2 offers four speech-to-text models. Use this page to pick the right one and get working code for each.

Which model should I use?

| | STT Batch | STT English VFast | STT Streaming | STT Streaming v2 |
| --- | --- | --- | --- | --- |
| Protocol | HTTP POST | HTTP POST | WebSocket | WebSocket |
| Languages | Multilingual | English only | Multilingual | English only |
| Utterance-level output | ✓ | ✗ | ✓ | ✓ |
| Speaker diarization | ✓ | ✗ | ✓ | ✗ |
| Emotion / accent / PII signals | ✓ | ✗ | ✓ | ✗ |
| Partial transcripts during streaming | ✗ | ✗ | ✗ | ✓ (every ~1.5 s) |
| Best for | Rich metadata on complete files | Maximum English throughput | Real-time captions and live audio | Low-latency English captions and voice assistants |
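The comparison above can also be read as a small decision rule. The sketch below is one way to encode it; `pick_model` and its three flags are hypothetical names introduced for illustration, and only the endpoint paths come from the example commands on this page.

```python
# Endpoint paths as they appear in the example commands on this page.
ENDPOINTS = {
    "stt-batch": "/api/velma-2-stt-batch",
    "stt-batch-english-vfast": "/api/velma-2-stt-batch-english-vfast",
    "stt-streaming": "/api/velma-2-stt-streaming",
    "stt-streaming-english-v2": "/api/velma-2-stt-streaming-english-v2",
}


def pick_model(live_audio: bool, english_only: bool, needs_metadata: bool) -> str:
    """Hypothetical helper: map requirements to one of the four model keys.

    needs_metadata stands for speaker diarization or emotion/accent/PII signals.
    """
    if live_audio:
        # Streaming v2 is English only and has no enrichments.
        if english_only and not needs_metadata:
            return "stt-streaming-english-v2"
        return "stt-streaming"
    # Complete files: VFast trades metadata for English throughput.
    if english_only and not needs_metadata:
        return "stt-batch-english-vfast"
    return "stt-batch"


print(ENDPOINTS[pick_model(live_audio=True, english_only=True, needs_metadata=False)])
```

The rule prefers the English-only models whenever nothing rules them out; whether that default suits a given workload is a judgment call, not something this page prescribes.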

STT Batch (multilingual)

Send a complete audio file and get back a full transcript with per-utterance timing, speaker labels, and optional enrichments.
curl -X POST https://modulate-developer-apis.com/api/velma-2-stt-batch \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3" \
  -F "speaker_diarization=true"
{
  "text": "Hello everyone. Welcome to the meeting.",
  "duration_ms": 4200,
  "utterances": [
    {
      "start_ms": 0,
      "end_ms": 4200,
      "speaker": 1,
      "language": "en",
      "text": "Hello everyone. Welcome to the meeting."
    }
  ]
}
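To show how the fields in this response fit together, here is a minimal Python sketch that turns the `utterances` array into readable caption lines. `format_utterances` is a hypothetical helper, and the sample is the response shown above; real responses may carry extra fields when enrichments are enabled.

```python
import json


def format_utterances(response: dict) -> list[str]:
    """Render each utterance as '[start-end] speaker N: text'."""
    lines = []
    # .get() keeps this safe for responses that have no utterance breakdown.
    for u in response.get("utterances", []):
        start_s = u["start_ms"] / 1000
        end_s = u["end_ms"] / 1000
        lines.append(f"[{start_s:.1f}s-{end_s:.1f}s] speaker {u['speaker']}: {u['text']}")
    return lines


sample = json.loads("""
{
  "text": "Hello everyone. Welcome to the meeting.",
  "duration_ms": 4200,
  "utterances": [
    {"start_ms": 0, "end_ms": 4200, "speaker": 1, "language": "en",
     "text": "Hello everyone. Welcome to the meeting."}
  ]
}
""")

for line in format_utterances(sample):
    print(line)
# prints: [0.0s-4.2s] speaker 1: Hello everyone. Welcome to the meeting.
```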
Optional enrichments — add any of these query parameters to the request:
| Parameter | What it adds |
| --- | --- |
| emotion_signal=true | Per-utterance emotion label |
| accent_signal=true | Per-utterance accent label |
| deepfake_signal=true | Per-utterance deepfake_score (0–1) |
| pii_phi_tagging=true | Sensitive spans wrapped in entity tags in the transcript |
See STT enrichment features for the full field reference and example responses.
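As a rough illustration of adding these parameters, the sketch below builds the request URL with Python's standard `urllib.parse.urlencode`. `batch_url` is a hypothetical helper; the parameter names and base URL are the ones listed on this page, and the helper assumes they are passed in the query string as described above.

```python
from urllib.parse import urlencode

BATCH_URL = "https://modulate-developer-apis.com/api/velma-2-stt-batch"
ENRICHMENTS = ("emotion_signal", "accent_signal", "deepfake_signal", "pii_phi_tagging")


def batch_url(*enabled: str) -> str:
    """Return the batch endpoint URL with the chosen enrichments set to true."""
    unknown = set(enabled) - set(ENRICHMENTS)
    if unknown:
        # Fail early on typos instead of sending a parameter the page does not list.
        raise ValueError(f"unknown enrichment(s): {sorted(unknown)}")
    if not enabled:
        return BATCH_URL
    query = urlencode([(name, "true") for name in enabled])
    return f"{BATCH_URL}?{query}"


print(batch_url("emotion_signal", "deepfake_signal"))
```

The resulting URL can be used in place of the plain endpoint in the `curl` command above, with the file upload unchanged.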

STT Batch — English VFast

Optimized for high-throughput English transcription. Returns a single transcript string with no per-utterance breakdown.
curl -X POST https://modulate-developer-apis.com/api/velma-2-stt-batch-english-vfast \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3"
{
  "text": "Good morning, everyone. Today we're covering the quarterly results.",
  "duration_ms": 5200
}
No speaker diarization, emotion, accent, PII/PHI tagging, or utterance-level data. Use STT Batch if you need any of those.

STT Streaming (WebSocket)

Connect over WebSocket and receive utterances as speech is recognized — ideal for live audio, phone calls, and real-time captions.
websocat "wss://modulate-developer-apis.com/api/velma-2-stt-streaming?api_key=$MODULATE_API_KEY&speaker_diarization=true" \
  --binary - < audio.mp3
{ "type": "utterance", "utterance": { "start_ms": 0, "end_ms": 3100, "speaker": 1, "language": "en", "text": "Hello, how are you today?" } }
{ "type": "utterance", "utterance": { "start_ms": 3100, "end_ms": 6400, "speaker": 1, "language": "en", "text": "I wanted to go over the project timeline." } }
{ "type": "done", "duration_ms": 9800 }
Audio is sent as binary WebSocket frames in any chunk size. Send an empty string ("") to signal end of stream. For self-describing formats (MP3, WAV, OGG, FLAC, WebM, AAC, AIFF) the format is auto-detected. Raw PCM formats require audio_format, sample_rate, and num_channels query parameters. See Audio formats.
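The framing rule above (binary chunks of any size, then an empty string) can be sketched as a generator that is independent of any particular WebSocket client. `audio_frames` and the 4096-byte default are assumptions made for illustration; the page does not recommend a chunk size.

```python
from typing import Iterator, Union


def audio_frames(audio: bytes, chunk_size: int = 4096) -> Iterator[Union[bytes, str]]:
    """Yield binary chunks of the audio, then "" as the end-of-stream signal."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]  # send as a binary frame
    yield ""  # send as a text frame to signal end of stream


frames = list(audio_frames(b"abcdefghij", chunk_size=4))
print(frames)  # prints: [b'abcd', b'efgh', b'ij', '']
```

In a real client, each `bytes` item would go out as a binary WebSocket message and the final empty `str` as a text message; how that maps onto a specific library's send call depends on the library.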

STT Streaming v2 (English, low-latency)

Connect over WebSocket for low-latency English transcription. Unlike STT Streaming, this model emits a rolling partial transcript every ~1.5 seconds while audio arrives — useful for captions, voice assistants, and single-speaker workflows where you want to display text before the speaker finishes.
English only. No speaker diarization, emotion, accent, or PII/PHI enrichments. For those features use STT Streaming above.
websocat "wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2?api_key=$MODULATE_API_KEY&audio_format=ogg" \
  --binary - < audio.ogg
{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you", "is_final": false}}
{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you doing", "is_final": false}}
{"type": "utterance", "utterance": {"text": "Hello, how are you doing today?", "is_final": true}}
{"type": "done", "duration_ms": 14253}
Each partial_utterance contains the complete transcript so far, not a delta. Always replace your displayed text with the new value — never append.
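The replace-not-append rule is easy to get wrong, so here is a minimal sketch of a caption state holder driven by the message types shown above. `LiveCaption` is a hypothetical class. It tracks only the most recent utterance, because this page does not describe how consecutive utterances relate to each other.

```python
import json


class LiveCaption:
    """Holds the text to display for the utterance currently in progress."""

    def __init__(self) -> None:
        self.text = ""
        self.is_final = False
        self.done = False

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "partial_utterance":
            # Each partial is the full text so far: overwrite, never append.
            self.text = msg["partial_utterance"]["text"]
            self.is_final = False
        elif kind == "utterance":
            self.text = msg["utterance"]["text"]
            self.is_final = True
        elif kind == "done":
            self.done = True


caption = LiveCaption()
for raw in [
    '{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you", "is_final": false}}',
    '{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you doing", "is_final": false}}',
    '{"type": "utterance", "utterance": {"text": "Hello, how are you doing today?", "is_final": true}}',
    '{"type": "done", "duration_ms": 14253}',
]:
    caption.handle(raw)

print(caption.text)  # prints: Hello, how are you doing today?
```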
For lowest latency, stream raw 16 kHz mono PCM (audio_format=s16le&sample_rate=16000&num_channels=1) — this bypasses the server’s audio decoder. Container formats (ogg, mp3, wav, etc.) are supported and require no additional parameters.
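For the raw PCM path, it can help to see the query string and the chunk arithmetic side by side. In the sketch below, `pcm_stream_url` and `pcm_chunk_bytes` are hypothetical helpers and the 100 ms chunk duration is an arbitrary choice for the example; the parameter names and values are the ones given above, and the two-bytes-per-sample figure follows from the signed 16-bit `s16le` format.

```python
from urllib.parse import urlencode

V2_URL = "wss://modulate-developer-apis.com/api/velma-2-stt-streaming-english-v2"


def pcm_stream_url(api_key: str, sample_rate: int = 16000, num_channels: int = 1) -> str:
    """Build the streaming URL for raw signed 16-bit little-endian PCM."""
    query = urlencode({
        "api_key": api_key,
        "audio_format": "s16le",
        "sample_rate": sample_rate,
        "num_channels": num_channels,
    })
    return f"{V2_URL}?{query}"


def pcm_chunk_bytes(chunk_ms: int, sample_rate: int = 16000, num_channels: int = 1) -> int:
    """Bytes needed to hold chunk_ms milliseconds of s16le audio."""
    bytes_per_sample = 2  # 16 bits per sample
    return sample_rate * num_channels * bytes_per_sample * chunk_ms // 1000


print(pcm_chunk_bytes(100))  # prints: 3200
print(pcm_stream_url("YOUR_KEY"))
```

At 16 kHz mono, one second of audio is 32,000 bytes, so a 100 ms chunk is 3,200 bytes. Smaller chunks reach the server sooner at the cost of more frames.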

API reference