Modulate offers several models with overlapping capabilities. This guide helps you pick the right one based on your latency needs, language requirements, audio format constraints, and required features.

Quick reference

Transcription

| | Multilingual Transcription (batch) | English Fast Transcription (batch) | Multilingual Transcription (streaming) | English Fast Transcription (streaming) |
|---|---|---|---|---|
| Use case | Transcription with rich metadata | Fast English-only transcription | Real-time transcription | Low-latency English real-time transcription |
| Protocol | HTTP POST | HTTP POST | WebSocket | WebSocket |
| Languages | Multilingual | English only | Multilingual | English only |
| Audio formats | See Audio formats | See Audio formats | See Audio formats | See Audio formats |
| Max file size | 100 MB | 100 MB | N/A (streaming) | N/A (streaming) |
| Speaker diarization | ✓ | ✗ | ✓ | ✗ |
| Emotion detection | ✓ | ✗ | ✓ | ✗ |
| Accent detection | ✓ | ✗ | ✓ | ✗ |
| PII/PHI tagging | ✓ | ✗ | ✓ | ✗ |
| Deepfake scoring | ✓ (per-utterance) | ✗ | ✓ (per-utterance) | ✗ |
| Utterance-level output | ✓ | ✗ | ✓ | ✗ |
| Partial transcripts during streaming | ✗ | ✗ | ✗ | ✓ (every ~1.5 s) |

Deepfake Detection

| | Deepfake Detection (batch) | Deepfake Detection (streaming) |
|---|---|---|
| Use case | Deepfake detection on a file | Real-time deepfake detection |
| Protocol | HTTP POST | WebSocket |
| Audio formats | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | Raw PCM and container formats |
| Max file size | 100 MB | N/A (streaming) |
| Deepfake scoring | ✓ (per-frame) | ✓ (per-frame) |

PII/PHI Redaction

| | PII/PHI Redaction (batch) | PII/PHI Redaction (streaming) |
|---|---|---|
| Use case | Transcription with audio redaction | Real-time transcription with audio redaction |
| Protocol | HTTP POST | WebSocket |
| Languages | Multilingual | Multilingual |
| Audio formats | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM |
| Max file size | 100 MB | N/A (streaming) |
| Speaker diarization | | |
| PII/PHI audio redaction | ✓ | ✓ |
| Utterance-level output | | |

PII/PHI tagging (Transcription APIs): wraps sensitive spans in tags within the transcript text; the original content is preserved.
PII/PHI audio redaction (Redaction APIs): replaces each detected PII/PHI span with an empty marker tag (e.g. <pii:name></pii:name>, <pii:ssn></pii:ssn>, <phi></phi>) in the transcript text and silences the corresponding audio ranges in the returned MP3.
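
To make the redaction output concrete, here is a minimal Python sketch that counts the empty marker tags in a redacted transcript by entity type. The tag grammar is inferred only from the three examples above (<pii:name></pii:name>, <pii:ssn></pii:ssn>, <phi></phi>); the function name and the sample transcript are illustrative, and real responses may use other entity types.

```python
import re
from collections import Counter

# Assumption: marker tags look like <pii:TYPE></pii:TYPE> or <phi></phi>,
# as in the examples in this guide. This is not an official schema.
EMPTY_MARKER = re.compile(r"<(pii:[a-z_]+|phi)></\1>")


def count_redacted_entities(transcript: str) -> Counter:
    """Count empty redaction markers in a transcript, keyed by entity type."""
    return Counter(match.group(1) for match in EMPTY_MARKER.finditer(transcript))


# Made-up sample text for illustration only.
sample = (
    "My name is <pii:name></pii:name> and my SSN is <pii:ssn></pii:ssn>. "
    "The diagnosis was <phi></phi>."
)

if __name__ == "__main__":
    for entity, count in count_redacted_entities(sample).items():
        print(entity, count)
```

A count like this can serve as a quick audit that a recording marked for sharing actually had sensitive spans removed.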

Language Detection

| | Language Detection (batch) |
|---|---|
| Use case | Identify the spoken language of an audio file |
| Protocol | HTTP POST |
| Languages | 100 spoken languages |
| Audio formats | AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM |
| Max file size | 100 MB |
| Audio analyzed | First 30 seconds only |
| Output | ISO 639-1 language code, display name, confidence score |

Language Detection is a pure classification endpoint — it returns no transcript, diarization, or enrichment data.
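
As a rough illustration of how this output might drive routing, the sketch below maps a detection result to a pipeline name. It assumes the response has been parsed into a dict with predicted_language_code and confidence keys (the field names this guide mentions); the 0.8 threshold and the pipeline labels are hypothetical choices, not documented values.

```python
from typing import Any, Mapping


def route_by_language(result: Mapping[str, Any], min_confidence: float = 0.8) -> str:
    """Pick a downstream pipeline from a parsed Language Detection result.

    The threshold and the returned labels are illustrative assumptions.
    """
    code = result.get("predicted_language_code")
    confidence = float(result.get("confidence", 0.0))

    # Low-certainty or missing results are set aside for separate handling.
    if not code or confidence < min_confidence:
        return "needs_review"

    # "en" is the ISO 639-1 code for English.
    return "english_fast" if code == "en" else "multilingual"


if __name__ == "__main__":
    print(route_by_language({"predicted_language_code": "es", "confidence": 0.91}))
```

Because only the first 30 seconds are analyzed, a clip that switches language later may still be better served by the multilingual pipeline.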

Decision tree

Do you need transcription?

Yes → continue below.
No, you need deepfake detection only → use the Deepfake Detection batch endpoint for pre-recorded files, or the streaming variant for live audio. See the API Reference tab.
No, you need language identification → use the Language Detection batch endpoint. It returns the detected language as an ISO 639-1 code and display name with a confidence score. No transcription, diarization, or enrichment data is returned. See Language Detection above.

Do you need PII/PHI removed from the audio itself — not just tagged in the transcript?

Yes → use the PII/PHI Redaction Batch API for complete files, or the PII/PHI Redaction Streaming API for live audio. Both return a redacted MP3 with sensitive audio ranges silenced, alongside a transcript where each PII/PHI span is replaced with an entity-type tag.
No, you only need the transcript tagged → continue below, and enable pii_phi_tagging=true on whichever transcription endpoint you choose.

Do you need results in real time, while audio is still being captured?

Yes → choose based on your requirements:
  • Need enrichments or multilingual support → use the Multilingual Transcription streaming API. Delivers per-utterance results with speaker labels, emotion, accent, and PII/PHI tagging.
  • English only, pure transcription, lowest latency → use the English Fast Transcription streaming API. Emits a rolling partial transcript every ~1.5 seconds and a single final transcript at end-of-stream — no enrichments.
No, you have a complete file → continue below.
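
For the English Fast streaming case, the rolling partial is meant to replace what is on screen rather than be appended to it. The sketch below shows one way a client could hold that state. The message shape (a dict with "type" set to "partial" or "utterance", plus a "text" field) is an assumption made for illustration; check the API reference for the real message format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaptionBuffer:
    """Holds finalized text plus the latest rolling partial."""

    finalized: List[str] = field(default_factory=list)
    partial: str = ""

    def apply(self, message: dict) -> str:
        """Update state from one message and return the text to display."""
        kind = message.get("type")      # assumed field name
        text = message.get("text", "")  # assumed field name
        if kind == "partial":
            # Each partial supersedes the previous one: replace, never append.
            self.partial = text
        elif kind == "utterance":
            # A final message commits the text and clears the partial.
            self.finalized.append(text)
            self.partial = ""
        parts = self.finalized + ([self.partial] if self.partial else [])
        return " ".join(parts)


if __name__ == "__main__":
    captions = CaptionBuffer()
    for msg in (
        {"type": "partial", "text": "hello"},
        {"type": "partial", "text": "hello wor"},
        {"type": "utterance", "text": "hello world"},
    ):
        print(captions.apply(msg))
```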

Is the audio English-only, and do you need maximum throughput without speaker or emotion metadata?

Yes → use the English Fast Transcription batch API.
  • Fastest option for English audio.
  • No speaker diarization, emotion, accent, PII/PHI tagging, or utterance-level data.
  • Returns a single text string and duration_ms.
  • Processing timeout: 60 seconds per request.
No → use the Multilingual Transcription batch API.
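
The questions above can be condensed into a single function. This is only a sketch of the decision order in this guide: the Needs fields and the returned labels are hypothetical names chosen for readability, not API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """Requirements for one workload. All field names are illustrative."""

    task: str = "transcription"    # "transcription", "deepfake", or "language_id"
    audio_redaction: bool = False  # PII/PHI must be silenced in the audio itself
    real_time: bool = False        # results needed while audio is being captured
    english_only: bool = False     # audio is known to be English
    enrichments: bool = False      # speakers, emotion, accent, tagging, utterances


def choose_api(needs: Needs) -> str:
    """Walk the decision tree in the same order as the guide."""
    mode = "streaming" if needs.real_time else "batch"

    if needs.task == "deepfake":
        return f"Deepfake Detection ({mode})"
    if needs.task == "language_id":
        return "Language Detection (batch)"  # batch is the only variant listed
    if needs.audio_redaction:
        return f"PII/PHI Redaction ({mode})"
    if needs.english_only and not needs.enrichments:
        return f"English Fast Transcription ({mode})"
    return f"Multilingual Transcription ({mode})"


if __name__ == "__main__":
    print(choose_api(Needs(english_only=True)))
    print(choose_api(Needs(real_time=True, enrichments=True)))
```

Audio redaction is checked before language and latency because both Redaction APIs already return a transcript, so the later questions only matter when tagging alone is enough.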

Multilingual Transcription vs English Fast Transcription — detailed tradeoffs

| Consideration | Choose Multilingual Transcription | Choose English Fast Transcription |
|---|---|---|
| Audio language | Non-English, or unknown/multilingual | English only |
| Audio format | Any supported format | Any supported format |
| Need speaker IDs | Yes | No |
| Need emotion or accent signals | Yes | No |
| Need PII/PHI tagging | Yes | No |
| Need utterance timestamps | Yes | No |
| Need deepfake scoring alongside transcription | Yes | No |
| Processing speed is the top priority | Lower priority | Yes |
| Transcription quality is the top priority | Yes | Lower priority |
| Large-scale batch jobs in English | | Preferred |

Do you need deepfake detection alongside transcription?

The Multilingual Transcription batch API supports a deepfake_signal parameter that adds a per-utterance deepfake_score to transcription output. This is convenient when you already need transcription and want a deepfake signal without a second API call. Use the dedicated Deepfake Detection APIs instead when:
  • Transcription is not needed — you only want to know if audio is synthetic.
  • You need frame-level results across the full audio (not just utterance-level scores).
  • You need explicit no-content verdicts for silent regions.
  • You need streaming deepfake verdicts in real time.

Common scenarios

Call center QA on recorded English calls, high volume → English Fast Transcription (batch). Use parallel requests with a semaphore to respect concurrent limits.
Meeting transcription with speaker attribution and emotion → Multilingual Transcription (batch) with speaker_diarization=true and emotion_signal=true.
Live interview transcription with real-time captions, multilingual or with speaker labels → Multilingual Transcription streaming. Stream audio over WebSocket and display utterances as they arrive.
Live English captions or voice assistant input with minimal latency → English Fast Transcription streaming. Partial transcripts update every ~1.5 seconds; replace your displayed text with each new partial, then finalize on the utterance message.
Detect AI-generated voice in a submitted audio clip → Deepfake Detection batch endpoint.
Real-time anti-spoofing check during a voice authentication flow → Deepfake Detection streaming endpoint. Stream audio and act on frame verdicts as they arrive.
Transcript of a support call with PII/PHI spans tagged for downstream review → Multilingual Transcription (batch) with pii_phi_tagging=true. Language is auto-detected per utterance. The transcript content is preserved; sensitive spans are wrapped in tags.
Compliance recording that must be shareable with PII/PHI silenced → PII/PHI Redaction Batch API. Returns a redacted MP3 with sensitive audio ranges silenced and a transcript where each PII/PHI span is replaced with an entity-type tag.
Route audio to the right transcription pipeline based on spoken language → Language Detection batch endpoint. Send the clip, read predicted_language_code from the response, and route accordingly. Use the confidence field to catch low-certainty results and handle them separately.