Reference for supported audio formats across all Modulate model endpoints, with guidance on format selection, conversion, and the special requirements of the streaming Deepfake Detection endpoint.
Every endpoint advertises its accepted audio_format values in its spec. The spec is the authoritative list — check the API reference for each endpoint to confirm exact format support, since formats are added over time. Broadly:
- HTTP batch endpoints (Multilingual Transcription, English Fast Transcription, Deepfake Detection, PII/PHI Redaction) accept common container formats; the file extension determines decoding. See each spec for the current list.
- Multilingual Transcription streaming and PII/PHI Redaction streaming accept self-describing container formats (auto-detected from headers) and raw PCM / mu-law / A-law (which require audio_format, sample_rate, and num_channels).
- Deepfake Detection streaming has the broadest container support plus raw PCM. audio_format is always required; sample_rate and num_channels are required only for raw formats.
Recommended maximum file size for all HTTP batch endpoints: 100 MB.
Opus is the recommended format for streaming use cases. It provides excellent audio quality at low bitrates, reducing bandwidth consumption while preserving the acoustic detail the models need.
For batch endpoints, any supported container format works — use whatever format your audio pipeline already produces.
The Deepfake Detection streaming endpoint accepts two categories of audio format, declared via the audio_format query parameter.
Container formats include metadata — sample rate, channel count, codec — within the stream itself. When using a container format, sample_rate and num_channels are not required. See the Deepfake Detection streaming spec’s audio_format enum for the full list of accepted container values.
wss://...?api_key=YOUR_API_KEY&audio_format=webm
Raw formats are headerless audio samples. The server cannot infer sample rate or channel count from the data itself, so sample_rate and num_channels are required query parameters when using any raw format.
wss://...?api_key=YOUR_API_KEY&audio_format=s16le&sample_rate=16000&num_channels=1
Supported raw formats:
s8, s16le, s16be, s24le, s24be, s32le, s32be, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw
| Use case | audio_format | sample_rate | num_channels |
|---|---|---|---|
| Default / native app | s16le | 16000 | 1 |
| Web Audio API (AudioWorklet) | f32le | 48000 | 1 |
| Native stereo capture | s16le | 48000 | 2 |
| Telephony (mu-law) | mulaw | 8000 | 1 |
| Telephony (A-law) | alaw | 8000 | 1 |
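The rule that raw formats need sample_rate and num_channels while container formats do not can be sketched as a small URL builder. This is a minimal illustration, not an official client: the function name `build_stream_url` and the `wss://example.invalid/stream` base URL are hypothetical placeholders, and the real endpoint URL must come from the API reference.

```python
from urllib.parse import urlencode

# Raw (headerless) formats as listed in this document.
RAW_FORMATS = frozenset({
    "s8", "s16le", "s16be", "s24le", "s24be", "s32le", "s32be",
    "u8", "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
})


def build_stream_url(base_url, api_key, audio_format,
                     sample_rate=None, num_channels=None):
    """Build a streaming URL (hypothetical helper, not part of any SDK).

    Raw formats must carry sample_rate and num_channels; for any other
    value the two parameters are left out, since container formats
    describe themselves.
    """
    params = {"api_key": api_key, "audio_format": audio_format}
    if audio_format in RAW_FORMATS:
        if sample_rate is None or num_channels is None:
            raise ValueError(
                f"raw format {audio_format!r} requires sample_rate and num_channels"
            )
        params["sample_rate"] = sample_rate
        params["num_channels"] = num_channels
    return f"{base_url}?{urlencode(params)}"


# Telephony (mu-law) row from the table; the base URL is a placeholder.
url = build_stream_url("wss://example.invalid/stream", "YOUR_API_KEY",
                       "mulaw", sample_rate=8000, num_channels=1)
```

Raising on the client side before connecting avoids a round trip that would otherwise end in a rejected connection.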
The s16le passthrough optimization
When the input is s16le at 16 kHz mono, no format conversion is performed before analysis. This is the most efficient configuration for the Deepfake Detection streaming endpoint.
All other formats are decoded and resampled to 16 kHz mono before analysis. There is no functional difference in output, but the passthrough path avoids the conversion overhead.
If you control the audio capture pipeline and are integrating with the Deepfake Detection streaming endpoint, capture in s16le at 16 kHz mono to take advantage of zero-cost passthrough.
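If a capture pipeline produces floating-point samples, one way to reach the s16le sample format is to convert on the client. The sketch below assumes mono samples already normalized to the range [-1.0, 1.0] and already at 16 kHz; it changes only the sample format and does not resample or downmix. The helper name `f32_to_s16le` is illustrative.

```python
import struct


def f32_to_s16le(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes.

    Values outside the range are clipped. Scaling by 32767 keeps the
    output symmetric around zero.
    """
    ints = []
    for x in samples:
        clipped = max(-1.0, min(1.0, x))
        ints.append(int(round(clipped * 32767)))
    # "<" selects little-endian byte order, "h" is a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


pcm = f32_to_s16le([0.0, 1.0, -1.0])
```

Audio captured at another rate (for example 48 kHz from the Web Audio API) would still need a proper resampler before it qualifies for the passthrough path; sending it as f32le and letting the server convert is the simpler alternative.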
Supported sample_rate values
8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000
num_channels range
1–8. Multi-channel audio is downmixed to mono before analysis.
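These limits can be checked on the client before opening a connection. The following is a minimal sketch using the values listed above; the function name `validate_raw_params` is an assumption, and the endpoint spec remains the authoritative source if the accepted values change.

```python
# Values copied from the lists above.
SUPPORTED_SAMPLE_RATES = (8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000)
MIN_CHANNELS, MAX_CHANNELS = 1, 8


def validate_raw_params(sample_rate, num_channels):
    """Return a list of problems; an empty list means both values look valid."""
    problems = []
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        problems.append(f"unsupported sample_rate: {sample_rate}")
    if not (isinstance(num_channels, int)
            and MIN_CHANNELS <= num_channels <= MAX_CHANNELS):
        problems.append(
            f"num_channels must be an integer from {MIN_CHANNELS} to {MAX_CHANNELS}"
        )
    return problems


ok = validate_raw_params(16000, 1)
bad = validate_raw_params(12345, 9)
```

Returning a list rather than raising on the first failure lets a caller report every invalid parameter at once.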
| Endpoint | Status / code | Cause |
|---|---|---|
| English Fast Transcription (batch) | 400 | Unsupported file extension, empty file, or decode error |
| Multilingual Transcription (batch) | 400 | Unsupported format or empty file |
| Deepfake Detection (batch) | 400 | Empty file or unsupported format |
| Deepfake Detection (batch) | 422 | Audio shorter than 0.5 seconds |
| Deepfake Detection (streaming) | Close code 1003 | Invalid audio_format, sample_rate, or num_channels query parameter |
| Deepfake Detection (streaming) | Close code 4002 | Audio could not be decoded or does not match the declared format |
The Deepfake Detection streaming endpoint validates format parameters at connection time (close code 1003) and again when it attempts to decode audio (close code 4002). Close code 4002 covers both undecodable audio (e.g. a truncated or corrupt stream) and format mismatches (e.g. declaring audio_format=s16le but sending WebM data).
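A client can translate the two streaming close codes into actionable messages. This is a sketch only: the mapping repeats the causes from the table above, the function name `explain_close` is hypothetical, and how the close code is obtained depends on the WebSocket library in use.

```python
# Causes taken from the error table in this document.
CLOSE_CODE_HINTS = {
    1003: ("Invalid audio_format, sample_rate, or num_channels query "
           "parameter; fix the connection URL before reconnecting."),
    4002: ("Audio could not be decoded or does not match the declared "
           "format; check that the bytes sent match audio_format."),
}


def explain_close(code):
    """Map a WebSocket close code to a hint for the format-related cases."""
    return CLOSE_CODE_HINTS.get(
        code, f"Close code {code} is not a format error listed in this document."
    )


hint = explain_close(4002)
```

Because 1003 is raised at connection time and 4002 only once audio is being decoded, a 1003 points at the URL parameters while a 4002 points at the audio bytes themselves.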