Documentation Index
Fetch the complete documentation index at: https://docs.modulate.ai/llms.txt
Use this file to discover all available pages before exploring further.
Authentication and API keys
How do I get an API key? Create a free account and your API key will be available in the dashboard after sign-up. How do I authenticate my requests? Authentication works differently depending on the API type. For REST endpoints, pass your key in the X-API-Key header.
How should I store my API key? Keep it out of source control: put it in a .env file, load it at runtime via python-dotenv or your environment’s secret manager, and ensure .env is in your .gitignore. The Quick start guide covers this setup in full.
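The two pieces above can be combined in a few lines. This is a minimal sketch assuming the `requests` library; the `/v1/stt/batch` endpoint path and the `audio` form field are illustrative placeholders, not the official API shape — check the API reference for the real paths.

```python
# Minimal sketch: read the key from the environment (e.g. loaded from .env
# via python-dotenv) and send it in the X-API-Key header.
import os

import requests

API_KEY = os.getenv("MODULATE_API_KEY", "")


def transcribe(audio_path: str, base_url: str = "https://api.modulate.ai") -> dict:
    """POST an audio file to a hypothetical batch STT endpoint."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{base_url}/v1/stt/batch",       # illustrative path, not official
            headers={"X-API-Key": API_KEY},
            files={"audio": f},               # illustrative field name
        )
    resp.raise_for_status()
    return resp.json()
```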
Models and capabilities
What models are available?
| Model | API type | Best for |
|---|---|---|
| STT Batch (English Fast) | REST | High-throughput English transcription at lowest cost |
| STT Batch (Multilingual) | REST | 70+ language transcription with full feature set |
| STT Streaming | WebSocket | Real-time transcription with per-utterance results |
| Deepfake Detection Batch | REST | Analyzing recorded audio files for synthetic speech |
| Deepfake Detection Streaming | WebSocket | Live deepfake detection with results from 500ms onward |
| PII/PHI Redaction Batch | REST | Transcription with PII/PHI text redaction and audio silencing |
| PII/PHI Redaction Streaming | WebSocket | Real-time PII/PHI redaction — redacted transcript and MP3 clips per utterance |
What optional parameters can I enable on the STT APIs?
| Parameter | What it adds | Default |
|---|---|---|
| speaker_diarization | Distinct speaker labels per utterance | true |
| emotion_signal | Detected emotional tone (e.g. Neutral, Happy, Frustrated) | false |
| accent_signal | Detected accent (e.g. American, British, Indian) | false |
| deepfake_signal | Synthetic voice score per utterance (0.0 = human, 1.0 = synthetic) | false |
| pii_phi_tagging | PII/PHI wrapped with tags in transcript text | false |
What is the difference between PII/PHI tagging and PII/PHI redaction?
- PII/PHI tagging (pii_phi_tagging=true on STT Batch or STT Streaming) — identifies PII/PHI spans in the transcript and wraps them with tags. The original text content is preserved. Use this when downstream systems need to detect or handle sensitive spans while retaining the full transcript.
- PII/PHI redaction (PII/PHI Redaction APIs) — replaces each detected PII/PHI span with an entity-type tag (e.g. [FIRSTNAME], [SSN], [PHI]) in the transcript and silences the corresponding audio ranges in the returned MP3. Use this when the audio itself must be clean — for example, recordings that will be shared, archived, or reviewed by parties who should not hear sensitive information.
What does the deepfake detection confidence score mean?
Confidence represents how certain the model is in its verdict, on a scale of 0 to 1. A frame with verdict: "synthetic" and confidence: 0.97 means the model is highly confident that segment contains AI-generated speech. A no-content verdict indicates the frame is silent or contains no usable audio — these frames are not sent through the model and always return confidence: 1.0.
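Client code typically combines verdict and confidence when deciding which segments to flag. A small sketch, assuming per-frame objects with `verdict` and `confidence` fields as described above (the exact frame shape is illustrative):

```python
# Sketch: keep only frames the model judged synthetic with high confidence,
# implicitly skipping no-content frames (their verdict is never "synthetic").
def flag_synthetic(frames, threshold=0.9):
    """Return frames with verdict "synthetic" and confidence >= threshold."""
    return [
        f for f in frames
        if f["verdict"] == "synthetic" and f["confidence"] >= threshold
    ]


frames = [
    {"verdict": "synthetic", "confidence": 0.97},
    {"verdict": "human", "confidence": 0.88},
    {"verdict": "no-content", "confidence": 1.0},  # silent frame, never scored
]
```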
Audio formats and file requirements
What audio formats are supported?
Most models accept: AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM. There are two exceptions worth knowing.
Is there a file size limit?
Yes — 100 MB maximum for all batch REST endpoints.
Is there a minimum audio length?
For deepfake detection, audio must be at least 0.5 seconds. Files shorter than this are rejected with a 422. For transcription, very short clips may return empty or minimal results.
What is the recommended audio length for deepfake detection?
4–60 seconds is the recommended range. Files shorter than one full 4-second analysis window are padded before inference. Leading and trailing silence is trimmed automatically — frame timestamps reflect positions in the original file.
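A pre-flight duration check avoids 422 rejections and surprise padding. A sketch for WAV input using only Python's standard library (for other formats you would need a library such as ffprobe or mutagen):

```python
# Sketch: check a WAV file's duration against the deepfake detection
# limits stated above (0.5 s minimum, 4-60 s recommended).
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def check_for_deepfake_detection(path: str) -> str:
    d = wav_duration_seconds(path)
    if d < 0.5:
        return "too short: will be rejected with 422"
    if d < 4.0:
        return "accepted, but padded to one full 4-second analysis window"
    if d > 60.0:
        return "accepted, but longer than the recommended 4-60 s range"
    return "within the recommended range"
```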
Pricing and billing
How is usage billed?
Billing is credit-based and priced per hour of audio processed:
| Model | Price |
|---|---|
| STT Batch (English Fast) | $0.025 / hr |
| STT Batch (Multilingual) | $0.03 / hr |
| STT Streaming | $0.06 / hr |
| PII/PHI Redaction (Batch) | $0.05 / hr |
| PII/PHI Redaction (Streaming) | $0.08 / hr |
| Deepfake Detection (Batch) | $0.25 / hr |
| Deepfake Detection (Streaming) | $0.25 / hr |
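Since pricing is per hour of audio, estimating a job's cost is a one-line calculation. A sketch using the prices from the table above (the model keys here are informal labels, not official API identifiers):

```python
# Sketch: estimate cost in dollars from the per-hour prices listed above.
PRICE_PER_HOUR = {
    "stt_batch_english_fast": 0.025,
    "stt_batch_multilingual": 0.03,
    "stt_streaming": 0.06,
    "pii_phi_redaction_batch": 0.05,
    "pii_phi_redaction_streaming": 0.08,
    "deepfake_detection_batch": 0.25,
    "deepfake_detection_streaming": 0.25,
}


def estimated_cost(model: str, audio_seconds: float) -> float:
    """Dollar cost for processing `audio_seconds` of audio on `model`."""
    return PRICE_PER_HOUR[model] * audio_seconds / 3600
```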
Rate limits
What rate limits apply?
Two limits apply per organization, per model:
- Concurrency limit — maximum simultaneous in-flight requests or active WebSocket connections
- Monthly usage limit — maximum audio hours processed per calendar month
What happens when a limit is exceeded?
Requests are rejected with 403 (monthly quota exceeded) or 429 (too many concurrent requests). WebSocket connections are rejected at the handshake with close code 4029. Your Usage dashboard shows current usage against both limits.
Can limits be increased?
Contact support@modulate.ai to discuss increases.
Streaming (WebSocket)
How do I signal the end of my audio stream? Send an empty text frame ("") on the open connection. The server will drain any buffered audio, deliver outstanding results, send a done message, then close the connection cleanly.
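The send-empty-frame-then-drain pattern can be sketched with the third-party `websockets` library. The URL and message flow below follow the description above, but this is an illustrative client skeleton, not official sample code:

```python
# Sketch: stream audio, signal end-of-stream with an empty text frame,
# then collect remaining results until the server closes the connection.
import asyncio


async def stream_audio(url: str, chunks) -> list:
    """Send audio chunks, then "" to end the stream; return all messages."""
    import websockets  # third-party; imported here so the sketch reads standalone

    messages = []
    async with websockets.connect(url) as ws:
        for chunk in chunks:
            await ws.send(chunk)    # binary audio frames
        await ws.send("")           # empty TEXT frame signals end of stream
        async for message in ws:    # drain buffered results, then the done message
            messages.append(message)
    return messages                 # server then closes cleanly (code 1000)
```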
What does the done message look like?
STT streaming done
Deepfake streaming done
How do partial results work?
If you connect with partial_results=true, the server emits partial_utterance messages while speech is in progress, before the utterance is finalized. Each partial replaces the previous one — the final utterance message supersedes all preceding partials for that segment. Useful for live caption rendering where low perceived latency matters.
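Because each partial replaces the last and the final utterance supersedes all partials, "last write wins per segment" is enough for caption state. A sketch, assuming messages carry `type`, `segment_id`, and `text` fields (the exact message schema is an assumption for illustration):

```python
# Sketch: fold a partial/final message stream into current text per segment.
def latest_transcript(messages) -> list:
    """Return final caption text per segment: later messages for a segment
    overwrite earlier ones, so finals supersede all partials."""
    segments = {}
    for m in messages:
        if m["type"] in ("partial_utterance", "utterance"):
            segments[m["segment_id"]] = m["text"]
    return list(segments.values())
```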
What WebSocket close codes should I handle?
WebSocket close code reference
| Code | Meaning |
|---|---|
| 1000 | Normal closure — received after done, connection finished cleanly |
| 1003 | Invalid query parameters (bad audio_format, sample_rate, or num_channels) |
| 1011 | Internal server error during streaming |
| 4001 | Invalid API key (STT streaming) |
| 4002 | Audio data doesn’t match the declared format |
| 4003 | Authentication failed or model access not enabled for your organization |
| 4029 | Rate limit exceeded — monthly quota or concurrency limit hit |
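One way to act on this table is to split codes into retryable and fatal groups. A sketch of that client-side policy (the grouping is a suggested interpretation, not prescribed by the API):

```python
# Sketch: decide whether to reconnect based on the close codes above.
RETRYABLE = {1011, 4029}           # transient server error / rate limit
FATAL = {1003, 4001, 4002, 4003}   # fix parameters or credentials first


def should_reconnect(close_code: int) -> bool:
    """True if reconnecting after a backoff is reasonable. Note 4029 covers
    both the concurrency limit (retryable soon) and the monthly quota
    (not until it resets), so check the Usage dashboard if it persists."""
    if close_code == 1000:
        return False               # normal closure after done
    return close_code in RETRYABLE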
Errors
What does a 503 response mean?
The inference server is temporarily overloaded. Wait a moment and retry. For production workloads, implement exponential backoff with jitter rather than an immediate retry loop.
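"Exponential backoff with jitter" means each retry waits a random duration drawn from an exponentially growing window. A minimal sketch of the delay schedule (default base and cap values are arbitrary choices, not API requirements):

```python
# Sketch: "full jitter" backoff - each delay is uniform over [0, window],
# where the window doubles per attempt up to a cap.
import random


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0):
    """Yield sleep durations in seconds for successive retry attempts."""
    for attempt in range(retries):
        yield random.uniform(0.0, min(cap, base * 2 ** attempt))
```

Usage: on each 503, sleep for the next value from `backoff_delays(...)` before retrying, and give up after the generator is exhausted.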
What does a 504 response mean?
The request timed out — batch processing has a 60-second limit. This is uncommon for typical audio lengths. If you see it consistently on files within the recommended size range, contact support@modulate.ai.
My file was rejected with 422. Why?
The audio is too short for analysis. Deepfake detection requires a minimum of 0.5 seconds. Check the actual duration of your file — empty or near-silent files sometimes report a longer duration than their usable content.
Privacy and data
Does Modulate sell the audio I send through the API?
No. Modulate does not sell personal data, including audio submitted through the API. Audio processed via the platform is used solely to deliver the service and, in some cases, to improve Modulate’s models — see the retention and training questions below for details.
How long is my audio retained after I send it?
Audio submitted through the platform API is retained for 35 days from the date of upload, after which it is permanently deleted. Enterprise customers with annual commitment agreements can negotiate custom retention periods through their account representative. Self-service customers cannot configure retention periods. If you need to delete specific audio before the 35-day period expires, you can do so through the platform interface or API.
Is my audio used to train Modulate’s AI models?
It depends on your account type:
- Self-service (pay-as-you-go) customers — audio may be used by Modulate to train and improve its models. Self-service customers are automatically enrolled and cannot opt out.
- Enterprise customers (annual commitment) — participation in model training is optional. To opt out, contact legal@modulate.ai. Opting out does not affect the quality or functionality of your API results.
What responsibilities do I have toward my end users?
As the party collecting audio from end users, you are responsible for:
- Providing required privacy notices to your end users before collecting audio
- Obtaining any necessary consents for recording and analysis
- Establishing a lawful basis for processing under applicable data protection laws
- Complying with audio recording laws in your jurisdiction (wiretapping statutes, consent-to-record requirements, biometric data regulations)
- Responding to your end users’ data rights requests (access, deletion, correction)
This section covers the details most relevant to API users. For complete information — including cookie practices, third-party data sharing, and regional rights — see Modulate’s full Privacy Policy and the Velma Services Privacy Addendum.