Multilingual batch transcription with automatic language detection, speaker diarization, emotion and accent detection, and PII/PHI tagging.
API key for authentication. Your API key must be included in the X-API-Key header
for all requests. API keys are tied to your organization and
determine your access to models and usage limits.
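As an illustration of the header requirement, here is a minimal sketch that builds the authentication header. The helper name `auth_headers` and the environment variable `MODULATE_API_KEY` are assumptions made for this example; only the `X-API-Key` header name comes from the description above.

```python
import os


def auth_headers(api_key=None):
    """Return the headers dict carrying the API key.

    Falls back to the MODULATE_API_KEY environment variable, which is a
    hypothetical name chosen for this sketch, not a documented convention.
    """
    key = api_key or os.environ.get("MODULATE_API_KEY", "")
    if not key:
        raise ValueError("an API key is required for every request")
    # The header name X-API-Key is the one named in the description above.
    return {"X-API-Key": key}
```

The returned dictionary can be passed as the `headers` argument of whichever HTTP client is in use.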
Audio file to transcribe. Supported formats: AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM. Maximum file size: 100MB. Empty files are rejected.
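A client can check the size constraints locally before uploading. The sketch below assumes that "100MB" means 100,000,000 bytes, which is the more conservative reading; the source does not say whether the limit is decimal or binary. The function name is illustrative.

```python
# Assumption: "100MB" is read as decimal megabytes (the stricter bound).
MAX_UPLOAD_BYTES = 100 * 1000 * 1000


def check_upload_size(size_bytes):
    """Raise ValueError if the file is empty or over the size limit."""
    if size_bytes == 0:
        raise ValueError("empty files are rejected")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(
            f"file is {size_bytes} bytes; the limit is {MAX_UPLOAD_BYTES}"
        )
    return size_bytes
```

For a file on disk, `check_upload_size(os.path.getsize(path))` performs the check without reading the file into memory.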
Content-Type Requirement:
The MIME type for this part SHOULD match the audio format being uploaded.
Using application/octet-stream is strongly discouraged — the server uses
the content type to select the correct audio decoder, and generic binary
types may cause intermittent decoding failures or empty responses.
Correct MIME types by format:
AAC: audio/aac
AIFF: audio/aiff
FLAC: audio/flac
MP3: audio/mpeg
MP4: video/mp4
MOV: video/quicktime
OGG: audio/ogg
Opus: audio/opus
WAV: audio/wav
WebM: video/webm
Speaker diarization identifies different speakers in the audio. When enabled, each utterance includes a speaker identifier (e.g., 1, 2).
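Returning to the content-type requirement for the audio part: one way to avoid falling back to application/octet-stream is to pick the MIME type from the file extension. The sketch below uses the MIME types listed for each supported format; the file extensions used as keys are assumptions, since the source names formats rather than extensions.

```python
from pathlib import Path

# MIME types as listed per format; the extension keys are assumed.
MIME_BY_EXTENSION = {
    ".aac": "audio/aac",
    ".aiff": "audio/aiff",
    ".flac": "audio/flac",
    ".mp3": "audio/mpeg",
    ".mp4": "video/mp4",
    ".mov": "video/quicktime",
    ".ogg": "audio/ogg",
    ".opus": "audio/opus",
    ".wav": "audio/wav",
    ".webm": "video/webm",
}


def mime_type_for(filename):
    """Return the MIME type for a supported audio file name."""
    extension = Path(filename).suffix.lower()
    if extension not in MIME_BY_EXTENSION:
        # Fail early instead of sending a generic binary type.
        raise ValueError(f"unsupported audio format: {extension or filename!r}")
    return MIME_BY_EXTENSION[extension]
```

Raising on an unknown extension is a deliberate choice here: it surfaces the problem on the client rather than risking a decoding failure on the server.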
Emotion detection for each utterance. When enabled, each utterance includes an emotion signal detected from the speaker's voice.
Accent detection for each utterance. When enabled, each utterance includes an accent signal detected from the speaker's voice.
PII/PHI tagging in utterance text. When enabled, personally identifiable information (PII) and protected health information (PHI) are wrapped with appropriate tags in the transcription text.
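The four optional features above are sent alongside the audio file. The following is a rough sketch of encoding such a request body as multipart/form-data using only the standard library. The field names (`upload_file`, `speaker_diarization`, `emotion_signal`, `accent_signal`, `pii_phi_tagging`) and the `"true"`/`"false"` encoding are assumptions made for illustration; check the API reference for the actual parameter names.

```python
import uuid


def encode_multipart(flags, file_field, filename, file_bytes, mime_type):
    """Encode boolean flags and one file part as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, enabled in flags.items():
        # Assumption: booleans are sent as the strings "true" / "false".
        value = "true" if enabled else "false"
        chunks.append(f"--{boundary}\r\n".encode())
        chunks.append(
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
        )
        chunks.append(f"{value}\r\n".encode())
    # The file part carries its own Content-Type, matching the audio format.
    chunks.append(f"--{boundary}\r\n".encode())
    chunks.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    chunks.append(f"Content-Type: {mime_type}\r\n\r\n".encode())
    chunks.append(file_bytes)
    chunks.append(b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


# Hypothetical field names, used only to demonstrate the encoder.
body, content_type = encode_multipart(
    {
        "speaker_diarization": True,
        "emotion_signal": True,
        "accent_signal": False,
        "pii_phi_tagging": True,
    },
    file_field="upload_file",
    filename="meeting.wav",
    file_bytes=b"RIFF....WAVE",  # placeholder bytes, not real audio
    mime_type="audio/wav",
)
```

The returned `content_type` value goes in the request's `Content-Type` header, next to the `X-API-Key` header.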
Transcription completed successfully
The complete transcribed text from the audio file, containing all utterances concatenated together. This provides a full transcript of the audio content.
"Hello everyone. Welcome to the meeting."
The total duration of the processed audio in milliseconds. This value represents the actual audio duration and is used for usage tracking and billing purposes.
Required range: x >= 0
Example: 45000
Array of individual utterances detected in the audio, ordered by start time. Each utterance represents a continuous segment of speech, potentially from a specific speaker if diarization is enabled.
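To show how the three response fields fit together, here is a small sketch that reads a response and prints one line per utterance. The JSON key names (`text`, `duration_ms`, `utterances`, and the per-utterance `start_ms`, `speaker`, `text`) are assumptions for this example, and the sample payload is made up from the example values above rather than taken from a real response.

```python
import json

# Hypothetical response payload; key names are assumed, not confirmed.
sample_response = json.loads("""
{
  "text": "Hello everyone. Welcome to the meeting.",
  "duration_ms": 45000,
  "utterances": [
    {"start_ms": 0, "speaker": 1, "text": "Hello everyone."},
    {"start_ms": 2100, "speaker": 2, "text": "Welcome to the meeting."}
  ]
}
""")


def duration_seconds(response):
    """Convert the millisecond duration to seconds."""
    return response["duration_ms"] / 1000


def format_transcript(response):
    """Return one line per utterance, prefixed by speaker when present."""
    lines = []
    for utterance in response.get("utterances", []):
        speaker = utterance.get("speaker")  # absent without diarization
        prefix = f"Speaker {speaker}: " if speaker is not None else ""
        lines.append(prefix + utterance["text"])
    return lines


for line in format_transcript(sample_response):
    print(line)
```

Because utterances are described as ordered by start time, the sketch iterates over them as given and does not re-sort.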