Transcription

Modulate offers three Transcription models: Multilingual Transcription and English Fast Transcription are available in batch and streaming modes, and Multilingual Fast Transcription is batch-only. Use this page to pick the right one and get working code for each.

Which model should I use?

	Multilingual Transcription (batch)	English Fast Transcription (batch)	Multilingual Fast Transcription (batch)	Multilingual Transcription (streaming)	English Fast Transcription (streaming)
Protocol	HTTP POST	HTTP POST	HTTP POST	WebSocket	WebSocket
Languages	Multilingual	English only	Multilingual	Multilingual	English only
Utterance-level output	✓	—	—	✓	—
Speaker diarization	✓	—	—	✓	—
Emotion / accent / PII signals	✓	—	—	✓	—
Partial transcripts during streaming	—	—	—	✓ (opt-in `partial_results`)	✓ (every ~1.5 s)
Best for	Rich metadata on complete files	Maximum English throughput	Fast turnaround on non-English or mixed-language files	Real-time captions and live audio	Low-latency English captions and voice assistants
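
The table above can be read as a small decision rule. The following is a minimal sketch of that rule, not an official client: `pick_endpoint` and its three boolean arguments are hypothetical names, and the endpoint paths are the ones used in the examples later on this page.

```python
BASE = "https://platform.modulate.ai/api"  # host taken from the examples on this page


def pick_endpoint(streaming: bool, english_only: bool, needs_rich_output: bool) -> str:
    """Return the endpoint URL that matches the comparison table.

    needs_rich_output means utterance-level output, speaker diarization,
    or emotion / accent / PII signals.
    """
    base = BASE.replace("https://", "wss://") if streaming else BASE
    if needs_rich_output or (streaming and not english_only):
        # Only Multilingual Transcription offers rich output, and it is the
        # only multilingual model with a streaming mode.
        name = "velma-2-stt-streaming" if streaming else "velma-2-stt-batch"
    elif english_only:
        name = (
            "velma-2-stt-streaming-english-v2"
            if streaming
            else "velma-2-stt-batch-english-vfast"
        )
    else:
        name = "velma-2-stt-batch-multilingual-vfast"
    return f"{base}/{name}"


print(pick_endpoint(streaming=False, english_only=False, needs_rich_output=False))
```

Asking for rich output always wins over speed in this sketch, because only Multilingual Transcription produces utterances, speaker labels, and signals.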

Multilingual Transcription — batch

Send a complete audio file, get back a full transcript with per-utterance timing, speaker labels, and optional enrichments.

curl -X POST https://platform.modulate.ai/api/velma-2-stt-batch \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3" \
  -F "speaker_diarization=true"

import os
import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("audio.mp3", "rb") as audio:
    response = requests.post(
        "https://platform.modulate.ai/api/velma-2-stt-batch",
        headers={"X-API-Key": os.environ["MODULATE_API_KEY"]},
        data={"speaker_diarization": "true"},
        files={"upload_file": audio},
    )
response.raise_for_status()
result = response.json()
print(result["text"])  # full transcript
for u in result["utterances"]:  # per-utterance timing and speaker labels
    print(f"[{u['start_ms']}ms] Speaker {u['speaker']}: {u['text']}")

Expected response

{
  "text": "Hello everyone. Welcome to the meeting.",
  "duration_ms": 4200,
  "utterances": [
    {
      "utterance_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "start_ms": 0,
      "duration_ms": 4200,
      "speaker": 1,
      "language": "en",
      "text": "Hello everyone. Welcome to the meeting.",
      "emotion": null,
      "accent": null,
      "deepfake_score": null
    }
  ]
}
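
Because each utterance carries `speaker` and `duration_ms`, the response can be summarized without further API calls. A minimal sketch, assuming `speaker_diarization=true` was sent so that every utterance has a `speaker` value; `talk_time_by_speaker` is a hypothetical helper, and the sample is a trimmed copy of the response above.

```python
from collections import defaultdict


def talk_time_by_speaker(result: dict) -> dict:
    """Sum utterance durations (in ms) per speaker label."""
    totals = defaultdict(int)
    for utterance in result.get("utterances", []):
        totals[utterance["speaker"]] += utterance["duration_ms"]
    return dict(totals)


# Trimmed copy of the expected response shown above.
sample = {
    "text": "Hello everyone. Welcome to the meeting.",
    "duration_ms": 4200,
    "utterances": [
        {
            "start_ms": 0,
            "duration_ms": 4200,
            "speaker": 1,
            "language": "en",
            "text": "Hello everyone. Welcome to the meeting.",
        }
    ],
}

print(talk_time_by_speaker(sample))  # {1: 4200}
```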

Optional enrichments: add any of these parameters to the request.

Parameter	What it adds
`emotion_signal=true`	Per-utterance `emotion` label
`accent_signal=true`	Per-utterance `accent` label
`deepfake_signal=true`	Per-utterance `deepfake_score` (0–1)
`pii_phi_tagging=true`	Sensitive spans wrapped in entity tags in the transcript

See Transcription enrichment features for the full field reference and example responses. All three batch endpoints accept common audio formats (MP3, WAV, FLAC, MP4, OGG, and more) — see Audio formats.
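
Building the enrichment flags in one place makes typos easier to catch before a request is sent. This is a minimal sketch: `enrichment_params` is a hypothetical helper, the flag names come from the table above, and the `"true"` string value mirrors how `speaker_diarization` is sent in the curl example. How the resulting dict is attached to the request (for instance as `data=` in `requests.post`, as in the Python example above) is an assumption to check against the API reference.

```python
ENRICHMENT_FLAGS = (
    "emotion_signal",
    "accent_signal",
    "deepfake_signal",
    "pii_phi_tagging",
)


def enrichment_params(*flags: str, speaker_diarization: bool = False) -> dict:
    """Return a dict of request parameters with each requested flag set to "true"."""
    unknown = set(flags) - set(ENRICHMENT_FLAGS)
    if unknown:
        raise ValueError(f"unknown enrichment flag(s): {sorted(unknown)}")
    params = {flag: "true" for flag in flags}
    if speaker_diarization:
        params["speaker_diarization"] = "true"
    return params


print(enrichment_params("emotion_signal", "deepfake_signal", speaker_diarization=True))
```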

English Fast Transcription — batch

Optimized for high-throughput English transcription. Returns a single transcript string with no per-utterance breakdown.

curl -X POST https://platform.modulate.ai/api/velma-2-stt-batch-english-vfast \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3"

import os
import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("audio.mp3", "rb") as audio:
    response = requests.post(
        "https://platform.modulate.ai/api/velma-2-stt-batch-english-vfast",
        headers={"X-API-Key": os.environ["MODULATE_API_KEY"]},
        files={"upload_file": audio},
    )
response.raise_for_status()
print(response.json()["text"])  # single transcript string

Expected response

{
  "text": "Good morning, everyone. Today we're covering the quarterly results.",
  "duration_ms": 5200
}

No speaker diarization, emotion, accent, PII/PHI tagging, or utterance-level data. Use Multilingual Transcription (batch) if you need any of those.

Multilingual Fast Transcription — batch

Optimized for fast turnaround on audio in any supported language. Returns a single transcript string with no per-utterance breakdown. Optionally declare the spoken language with the `language` form field for the fastest, most direct path; when omitted, the language is detected automatically.

curl -X POST https://platform.modulate.ai/api/velma-2-stt-batch-multilingual-vfast \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3" \
  -F "language=es"

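import os
import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("audio.mp3", "rb") as audio:
    response = requests.post(
        "https://platform.modulate.ai/api/velma-2-stt-batch-multilingual-vfast",
        headers={"X-API-Key": os.environ["MODULATE_API_KEY"]},
        data={"language": "es"},  # omit to auto-detect the language
        files={"upload_file": audio},
    )
response.raise_for_status()
print(response.json()["text"])  # single transcript string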

Expected response

{
  "text": "Hola a todos. Bienvenidos a la junta semanal.",
  "duration_ms": 4200
}

`language` is a short code such as `en`, `es`, `fr`, or `ja`. Omit it to have the language detected automatically.

No speaker diarization, emotion, accent, PII/PHI tagging, or utterance-level data. Use Multilingual Transcription (batch) if you need any of those.
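
When the language is only sometimes known, it helps to leave the field out entirely rather than send an empty value. A minimal sketch under that assumption; `fast_multilingual_form` is a hypothetical helper whose result would be passed as `data=` to `requests.post`.

```python
from typing import Optional


def fast_multilingual_form(language: Optional[str] = None) -> dict:
    """Return the form fields for the request, omitting `language` when unknown."""
    if not language:
        # No field at all, so the language is detected automatically.
        return {}
    return {"language": language}


print(fast_multilingual_form("es"))  # {'language': 'es'}
print(fast_multilingual_form())      # {}
```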

Multilingual Transcription — streaming (WebSocket)

Connect over WebSocket and receive utterances as speech is recognized — ideal for live audio, phone calls, and real-time captions.

websocat "wss://platform.modulate.ai/api/velma-2-stt-streaming?api_key=$MODULATE_API_KEY&speaker_diarization=true" \
  --binary - < audio.mp3

import os, asyncio, json, websockets

API_KEY = os.environ["MODULATE_API_KEY"]
AUDIO_FILE = "audio.mp3"
CHUNK_SIZE = 4096

async def stream():
    url = (
        f"wss://platform.modulate.ai/api/velma-2-stt-streaming"
        f"?api_key={API_KEY}&speaker_diarization=true"
    )
    async with websockets.connect(url) as ws:
        async def send():
            with open(AUDIO_FILE, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
            await ws.send("")

        async def receive():
            async for message in ws:
                msg = json.loads(message)
                if msg["type"] == "utterance":
                    u = msg["utterance"]
                    print(f"[{u['start_ms']}ms] Speaker {u['speaker']}: {u['text']}")
                elif msg["type"] == "done":
                    print(f"Done — {msg['duration_ms']}ms total")
                    break

        await asyncio.gather(send(), receive())

asyncio.run(stream())

Example messages received

{ "type": "utterance", "utterance": { "utterance_uuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "start_ms": 0, "duration_ms": 3100, "speaker": 1, "language": "en", "text": "Hello, how are you today?", "emotion": null, "accent": null, "deepfake_score": null } }
{ "type": "utterance", "utterance": { "utterance_uuid": "b2c3d4e5-f6a7-8901-bcde-f12345678901", "start_ms": 3100, "duration_ms": 3300, "speaker": 1, "language": "en", "text": "I wanted to go over the project timeline.", "emotion": null, "accent": null, "deepfake_score": null } }
{ "type": "done", "duration_ms": 9800 }
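
The message sequence above can be folded into a final transcript once the `done` message arrives. A minimal sketch that works on already-received text frames, so it runs without a connection; `collect_utterances` is a hypothetical helper, and ignoring message types other than `utterance` and `done` is an assumption.

```python
import json


def collect_utterances(raw_messages):
    """Return (utterances, total_duration_ms) from an iterable of JSON text frames."""
    utterances, total_ms = [], None
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg["type"] == "utterance":
            utterances.append(msg["utterance"])
        elif msg["type"] == "done":
            total_ms = msg["duration_ms"]
            break  # nothing follows "done"
    return utterances, total_ms


# Frames shaped like the example messages above (fields trimmed for brevity).
frames = [
    '{"type": "utterance", "utterance": {"start_ms": 0, "speaker": 1, "text": "Hello, how are you today?"}}',
    '{"type": "utterance", "utterance": {"start_ms": 3100, "speaker": 1, "text": "I wanted to go over the project timeline."}}',
    '{"type": "done", "duration_ms": 9800}',
]

utterances, total_ms = collect_utterances(frames)
transcript = " ".join(u["text"] for u in utterances)
print(transcript)
print(total_ms)  # 9800
```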

Audio is sent as binary WebSocket frames in any chunk size. Send an empty string (`""`) to signal end of stream. For self-describing formats (MP3, WAV, OGG, FLAC, WebM, AAC, AIFF) the format is auto-detected. Raw PCM formats require `audio_format`, `sample_rate`, and `num_channels` query parameters. See Audio formats.
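
The rule for raw PCM can be enforced when the connection URL is built. A minimal sketch: `streaming_url` is a hypothetical helper, the lowercase format names in `SELF_DESCRIBING` are assumed spellings of the formats listed above, and `s16le` is the raw PCM value used elsewhere on this page.

```python
from urllib.parse import urlencode

STREAMING_URL = "wss://platform.modulate.ai/api/velma-2-stt-streaming"
# Assumed lowercase spellings of the auto-detected formats listed above.
SELF_DESCRIBING = {"mp3", "wav", "ogg", "flac", "webm", "aac", "aiff"}


def streaming_url(api_key, audio_format=None, sample_rate=None, num_channels=None, **options):
    """Build the WebSocket URL, requiring all three PCM parameters for raw audio."""
    params = {"api_key": api_key}
    if audio_format is not None and audio_format.lower() not in SELF_DESCRIBING:
        # Raw PCM cannot be auto-detected, so all three parameters are needed.
        if sample_rate is None or num_channels is None:
            raise ValueError("raw PCM needs audio_format, sample_rate, and num_channels")
        params.update(
            audio_format=audio_format,
            sample_rate=sample_rate,
            num_channels=num_channels,
        )
    for name, value in options.items():
        # Booleans are sent as the lowercase strings used in the examples.
        params[name] = "true" if value is True else "false" if value is False else value
    return f"{STREAMING_URL}?{urlencode(params)}"


print(streaming_url("KEY", speaker_diarization=True))
print(streaming_url("KEY", "s16le", sample_rate=16000, num_channels=1))
```

Using `urlencode` also escapes any characters in the key or option values that are not safe in a query string.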

English Fast Transcription — streaming (English, low-latency)

Connect over WebSocket for low-latency English transcription. Unlike Multilingual Transcription streaming, where partial transcripts are opt-in (`partial_results`), this model emits a rolling partial transcript every ~1.5 seconds while audio arrives. This is useful for captions, voice assistants, and single-speaker workflows where you want to display text before the speaker finishes.

English only. No speaker diarization, emotion, accent, or PII/PHI enrichments. For those features use Multilingual Transcription streaming above.

websocat "wss://platform.modulate.ai/api/velma-2-stt-streaming-english-v2?api_key=$MODULATE_API_KEY&audio_format=ogg" \
  --binary - < audio.ogg

import asyncio, json, os, websockets

API_KEY = os.environ["MODULATE_API_KEY"]
AUDIO_FILE = "audio.ogg"
CHUNK_SIZE = 8192

async def transcribe():
    url = (
        f"wss://platform.modulate.ai/api/velma-2-stt-streaming-english-v2"
        f"?api_key={API_KEY}&audio_format=ogg"
    )
    async with websockets.connect(url, max_size=None) as ws:
        async def send():
            with open(AUDIO_FILE, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
            await ws.send("")  # signal end-of-stream

        send_task = asyncio.create_task(send())
        try:
            async for msg in ws:
                data = json.loads(msg)
                if data["type"] == "partial_utterance":
                    # Replace any previously displayed partial — not a delta
                    print(f"\r[partial] {data['partial_utterance']['text']}", end="", flush=True)
                elif data["type"] == "utterance":
                    print(f"\n[final]   {data['utterance']['text']}")
                elif data["type"] == "done":
                    print(f"\nDone — {data['duration_ms']}ms")
                    break
        finally:
            if not send_task.done():
                send_task.cancel()

asyncio.run(transcribe())

Example messages received

{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you", "is_final": false}}
{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you doing", "is_final": false}}
{"type": "utterance", "utterance": {"text": "Hello, how are you doing today?", "is_final": true}}
{"type": "done", "duration_ms": 14253}

Each `partial_utterance` contains the complete transcript so far, not a delta. Always replace your displayed text with the new value; never append.
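
The replace-not-append rule is easiest to get right by keeping finalized lines and the current partial in separate places. A minimal sketch under one assumption that should be checked against the API reference: each partial covers the utterance in progress and is superseded by the next `utterance` message. `LiveCaption` is a hypothetical class, and the frames are the example messages above.

```python
import json


class LiveCaption:
    """Track finalized lines plus the single in-progress partial."""

    def __init__(self):
        self.final_lines = []
        self.partial = ""

    def apply(self, raw: str) -> str:
        msg = json.loads(raw)
        if msg["type"] == "partial_utterance":
            # Overwrite the previous partial; it is a full restatement, not a delta.
            self.partial = msg["partial_utterance"]["text"]
        elif msg["type"] == "utterance":
            self.final_lines.append(msg["utterance"]["text"])
            self.partial = ""  # the final text supersedes the partial
        return self.render()

    def render(self) -> str:
        lines = list(self.final_lines)
        if self.partial:
            lines.append(self.partial)
        return "\n".join(lines)


frames = [
    '{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you", "is_final": false}}',
    '{"type": "partial_utterance", "partial_utterance": {"text": "Hello, how are you doing", "is_final": false}}',
    '{"type": "utterance", "utterance": {"text": "Hello, how are you doing today?", "is_final": true}}',
    '{"type": "done", "duration_ms": 14253}',
]

caption = LiveCaption()
screens = [caption.apply(frame) for frame in frames]
print(screens[-1])  # Hello, how are you doing today?
```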

For lowest latency, stream raw 16 kHz mono PCM (`audio_format=s16le&sample_rate=16000&num_channels=1`), which bypasses the server's audio decoder. Container formats (`ogg`, `mp3`, `wav`, etc.) are supported and require no additional parameters.
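
With raw PCM, chunk sizes map directly to audio duration: `s16le` is 2 bytes per sample, so 16 kHz mono is 32,000 bytes per second. A minimal sketch for slicing a PCM buffer into fixed-duration chunks; `pcm_chunk_bytes` and `iter_pcm_chunks` are hypothetical helpers, and the 100 ms default is an illustrative choice rather than a documented requirement.

```python
def pcm_chunk_bytes(chunk_ms, sample_rate=16000, num_channels=1, bytes_per_sample=2):
    """Bytes needed for chunk_ms of audio (s16le uses 2 bytes per sample)."""
    frames = sample_rate * chunk_ms // 1000
    return frames * num_channels * bytes_per_sample


def iter_pcm_chunks(pcm: bytes, chunk_ms=100, **fmt):
    """Yield consecutive slices of pcm, each covering chunk_ms of audio."""
    size = pcm_chunk_bytes(chunk_ms, **fmt)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


one_second_of_silence = bytes(pcm_chunk_bytes(1000))
chunks = list(iter_pcm_chunks(one_second_of_silence))
print(pcm_chunk_bytes(100))  # 3200
print(len(chunks))           # 10
```

Each yielded chunk can be passed to `ws.send(...)` as a binary frame in the streaming examples on this page.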

API reference

Multilingual Transcription Batch — full parameter and response schema
English Fast Transcription Batch
Multilingual Fast Transcription Batch
Multilingual Transcription Streaming — WebSocket protocol, close codes, all parameters
English Fast Transcription Streaming — WebSocket protocol, close codes, all parameters
