Documentation Index
Fetch the complete documentation index at: https://docs.modulate.ai/llms.txt
Use this file to discover all available pages before exploring further.
Authentication and API keys
How do I get an API key? Create a free account and your API key will be available in the dashboard after sign-up. How do I authenticate my requests? Authentication works differently depending on the API type. For REST endpoints, pass your key in the X-API-Key header.
How should I store my API key? Keep it out of source control: put it in a .env file, load it at runtime via python-dotenv or your environment’s secret manager, and ensure .env is in your .gitignore. The Quick start guide covers this setup in full.
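The two pieces above can be combined in a few lines. This is a minimal sketch assuming the `requests` library; the `/v1/stt/batch` endpoint path and the `audio` form field are illustrative placeholders, not the official API shape — check the API reference for the real paths.

```python
# Minimal sketch: read the key from the environment (e.g. loaded from .env
# via python-dotenv) and send it in the X-API-Key header.
import os

import requests

API_KEY = os.getenv("MODULATE_API_KEY", "")


def transcribe(audio_path: str, base_url: str = "https://api.modulate.ai") -> dict:
    """POST an audio file to a hypothetical batch STT endpoint."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{base_url}/v1/stt/batch",       # illustrative path, not official
            headers={"X-API-Key": API_KEY},
            files={"audio": f},               # illustrative field name
        )
    resp.raise_for_status()
    return resp.json()
```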
Models and capabilities
What models are available?
| Model | API type | Best for |
|---|---|---|
| STT Batch (English Fast) | REST | High-throughput English transcription at lowest cost |
| STT Batch (Multilingual) | REST | 70+ language transcription with full feature set |
| STT Streaming | WebSocket | Real-time transcription with per-utterance results |
| Deepfake Detection Batch | REST | Analyzing recorded audio files for synthetic speech |
| Deepfake Detection Streaming | WebSocket | Live deepfake detection with results from 500ms onward |
| PII/PHI Redaction Batch | REST | Transcription with PII/PHI text redaction and audio silencing |
| PII/PHI Redaction Streaming | WebSocket | Real-time PII/PHI redaction — redacted transcript and MP3 clips per utterance |
What optional parameters can I enable on the STT APIs?
| Parameter | What it adds | Default |
|---|---|---|
| speaker_diarization | Distinct speaker labels per utterance | true |
| emotion_signal | Detected emotional tone (e.g. Neutral, Happy, Frustrated) | false |
| accent_signal | Detected accent (e.g. American, British, Indian) | false |
| deepfake_signal | Synthetic voice score per utterance (0.0 = human, 1.0 = synthetic) | false |
| pii_phi_tagging | PII/PHI wrapped with tags in transcript text | false |
What is the difference between PII/PHI tagging and PII/PHI redaction?
- PII/PHI tagging (pii_phi_tagging=true on STT Batch or STT Streaming) — identifies PII/PHI spans in the transcript and wraps them with tags. The original text content is preserved. Use this when downstream systems need to detect or handle sensitive spans while retaining the full transcript.
- PII/PHI redaction (PII/PHI Redaction APIs) — replaces each detected PII/PHI span with an entity-type tag (e.g. [FIRSTNAME], [SSN], [PHI]) in the transcript and silences the corresponding audio ranges in the returned MP3. Use this when the audio itself must be clean — for example, recordings that will be shared, archived, or reviewed by parties who should not hear sensitive information.
What does the deepfake detection confidence score mean?
Confidence represents how certain the model is in its verdict, on a scale of 0 to 1. A frame with verdict: "synthetic" and confidence: 0.97 means the model is highly confident that segment contains AI-generated speech. A no-content verdict indicates the frame is silent or contains no usable audio — these frames are not sent through the model and always return confidence: 1.0.
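Client code typically combines verdict and confidence when deciding which segments to flag. A small sketch, assuming per-frame objects with `verdict` and `confidence` fields as described above (the exact frame shape is illustrative):

```python
# Sketch: keep only frames the model judged synthetic with high confidence,
# implicitly skipping no-content frames (their verdict is never "synthetic").
def flag_synthetic(frames, threshold=0.9):
    """Return frames with verdict "synthetic" and confidence >= threshold."""
    return [
        f for f in frames
        if f["verdict"] == "synthetic" and f["confidence"] >= threshold
    ]


frames = [
    {"verdict": "synthetic", "confidence": 0.97},
    {"verdict": "human", "confidence": 0.88},
    {"verdict": "no-content", "confidence": 1.0},  # silent frame, never scored
]
```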
Audio formats and file requirements
What audio formats are supported?
Most models accept: AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM. There are two exceptions worth knowing.
Is there a file size limit?
Yes — 100 MB maximum for all batch REST endpoints.
Is there a minimum audio length?
For deepfake detection, audio must be at least 0.5 seconds. Files shorter than this are rejected with a 422. For transcription, very short clips may return empty or minimal results.
What is the recommended audio length for deepfake detection?
4–60 seconds is the recommended range. Files shorter than one full 4-second analysis window are padded before inference. Leading and trailing silence is trimmed automatically — frame timestamps reflect positions in the original file.
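A pre-flight duration check avoids 422 rejections and surprise padding. A sketch for WAV input using only Python's standard library (for other formats you would need a library such as ffprobe or mutagen):

```python
# Sketch: check a WAV file's duration against the deepfake detection
# limits stated above (0.5 s minimum, 4-60 s recommended).
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def check_for_deepfake_detection(path: str) -> str:
    d = wav_duration_seconds(path)
    if d < 0.5:
        return "too short: will be rejected with 422"
    if d < 4.0:
        return "accepted, but padded to one full 4-second analysis window"
    if d > 60.0:
        return "accepted, but longer than the recommended 4-60 s range"
    return "within the recommended range"
```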
Pricing and billing
How is usage billed?
Billing is credit-based and priced per hour of audio processed:
| Model | Price |
|---|---|
| STT Batch (English Fast) | $0.025 / hr |
| STT Batch (Multilingual) | $0.03 / hr |
| STT Streaming | $0.06 / hr |
| PII/PHI Redaction (Batch) | $0.05 / hr |
| PII/PHI Redaction (Streaming) | $0.08 / hr |
| Deepfake Detection (Batch) | $0.25 / hr |
| Deepfake Detection (Streaming) | $0.25 / hr |
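Since pricing is per hour of audio, estimating a job's cost is a one-line calculation. A sketch using the prices from the table above (the model keys here are informal labels, not official API identifiers):

```python
# Sketch: estimate cost in dollars from the per-hour prices listed above.
PRICE_PER_HOUR = {
    "stt_batch_english_fast": 0.025,
    "stt_batch_multilingual": 0.03,
    "stt_streaming": 0.06,
    "pii_phi_redaction_batch": 0.05,
    "pii_phi_redaction_streaming": 0.08,
    "deepfake_detection_batch": 0.25,
    "deepfake_detection_streaming": 0.25,
}


def estimated_cost(model: str, audio_seconds: float) -> float:
    """Dollar cost for processing `audio_seconds` of audio on `model`."""
    return PRICE_PER_HOUR[model] * audio_seconds / 3600
```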
Rate limits
What rate limits apply?
Two limits apply per organization, per model:
- Concurrency limit — maximum simultaneous in-flight requests or active WebSocket connections
- Monthly usage limit — maximum audio hours processed per calendar month
What happens when a limit is exceeded?
Requests are rejected with 403 (monthly quota exceeded) or 429 (too many concurrent requests). WebSocket connections are rejected at the handshake with close code 4029. Your Usage dashboard shows current usage against both limits.
Can limits be increased?
Contact support@modulate.ai to discuss increases.
Streaming (WebSocket)
How do I signal the end of my audio stream? Send an empty text frame ("") on the open connection. The server will drain any buffered audio, deliver outstanding results, send a done message, then close the connection cleanly.
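The send-empty-frame-then-drain pattern can be sketched with the third-party `websockets` library. The URL and message flow below follow the description above, but this is an illustrative client skeleton, not official sample code:

```python
# Sketch: stream audio, signal end-of-stream with an empty text frame,
# then collect remaining results until the server closes the connection.
import asyncio


async def stream_audio(url: str, chunks) -> list:
    """Send audio chunks, then "" to end the stream; return all messages."""
    import websockets  # third-party; imported here so the sketch reads standalone

    messages = []
    async with websockets.connect(url) as ws:
        for chunk in chunks:
            await ws.send(chunk)    # binary audio frames
        await ws.send("")           # empty TEXT frame signals end of stream
        async for message in ws:    # drain buffered results, then the done message
            messages.append(message)
    return messages                 # server then closes cleanly (code 1000)
```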
What does the done message look like?
STT streaming done
Deepfake streaming done
How do partial results work?
If you connect with partial_results=true, the server emits partial_utterance messages while speech is in progress, before the utterance is finalized. Each partial replaces the previous one — the final utterance message supersedes all preceding partials for that segment. Useful for live caption rendering where low perceived latency matters.
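Because each partial replaces the last and the final utterance supersedes all partials, "last write wins per segment" is enough for caption state. A sketch, assuming messages carry `type`, `segment_id`, and `text` fields (the exact message schema is an assumption for illustration):

```python
# Sketch: fold a partial/final message stream into current text per segment.
def latest_transcript(messages) -> list:
    """Return final caption text per segment: later messages for a segment
    overwrite earlier ones, so finals supersede all partials."""
    segments = {}
    for m in messages:
        if m["type"] in ("partial_utterance", "utterance"):
            segments[m["segment_id"]] = m["text"]
    return list(segments.values())
```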
What WebSocket close codes should I handle?
WebSocket close code reference
| Code | Meaning |
|---|---|
| 1000 | Normal closure — received after done, connection finished cleanly |
| 1003 | Invalid query parameters (bad audio_format, sample_rate, or num_channels) |
| 1011 | Internal server error during streaming |
| 4001 | Invalid API key (STT streaming) |
| 4002 | Audio data doesn’t match the declared format |
| 4003 | Authentication failed or model access not enabled for your organization |
| 4029 | Rate limit exceeded — monthly quota or concurrency limit hit |
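One way to act on this table is to split codes into retryable and fatal groups. A sketch of that client-side policy (the grouping is a suggested interpretation, not prescribed by the API):

```python
# Sketch: decide whether to reconnect based on the close codes above.
RETRYABLE = {1011, 4029}           # transient server error / rate limit
FATAL = {1003, 4001, 4002, 4003}   # fix parameters or credentials first


def should_reconnect(close_code: int) -> bool:
    """True if reconnecting after a backoff is reasonable. Note 4029 covers
    both the concurrency limit (retryable soon) and the monthly quota
    (not until it resets), so check the Usage dashboard if it persists."""
    if close_code == 1000:
        return False               # normal closure after done
    return close_code in RETRYABLE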
Errors
What does a 503 response mean?
The inference server is temporarily overloaded. Wait a moment and retry. For production workloads, implement exponential backoff with jitter rather than an immediate retry loop.
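"Exponential backoff with jitter" means each retry waits a random duration drawn from an exponentially growing window. A minimal sketch of the delay schedule (default base and cap values are arbitrary choices, not API requirements):

```python
# Sketch: "full jitter" backoff - each delay is uniform over [0, window],
# where the window doubles per attempt up to a cap.
import random


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0):
    """Yield sleep durations in seconds for successive retry attempts."""
    for attempt in range(retries):
        yield random.uniform(0.0, min(cap, base * 2 ** attempt))
```

Usage: on each 503, sleep for the next value from `backoff_delays(...)` before retrying, and give up after the generator is exhausted.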
What does a 504 response mean?
The request timed out — batch processing has a 60-second limit. This is uncommon for typical audio lengths. If you see it consistently on files within the recommended size range, contact support@modulate.ai.
My file was rejected with 422. Why?
The audio is too short for analysis. Deepfake detection requires a minimum of 0.5 seconds. Check the actual duration of your file — empty or near-silent files sometimes report a longer duration than their usable content.
Privacy and data
Does Modulate sell the audio I send through the API?
No. Modulate does not sell personal data, including audio submitted through the API. Audio processed via the platform is used solely to deliver the service and, in some cases, to improve Modulate’s models — see the retention and training questions below for details.
How long is my audio retained after I send it?
Audio submitted through the platform API is retained for 35 days from the date of upload, after which it is permanently deleted. Enterprise customers with annual commitment agreements can negotiate custom retention periods through their account representative. Self-service customers cannot configure retention periods. If you need to delete specific audio before the 35-day period expires, you can do so through the platform interface or API.
Is my audio used to train Modulate’s AI models?
It depends on your account type:
- Self-service (pay-as-you-go) customers — audio may be used by Modulate to train and improve its models. Self-service customers are automatically enrolled and cannot opt out.
- Enterprise customers (annual commitment) — participation in model training is optional. To opt out, contact legal@modulate.ai. Opting out does not affect the quality or functionality of your API results.
What responsibilities do I have toward my end users?
As the party collecting audio from end users, you are responsible for:
- Providing required privacy notices to your end users before collecting audio
- Obtaining any necessary consents for recording and analysis
- Establishing a lawful basis for processing under applicable data protection laws
- Complying with audio recording laws in your jurisdiction (wiretapping statutes, consent-to-record requirements, biometric data regulations)
- Responding to your end users’ data rights requests (access, deletion, correction)
This section covers the details most relevant to API users. For complete information — including cookie practices, third-party data sharing, and regional rights — see Modulate’s full Privacy Policy and the Velma Services Privacy Addendum.