Speech-to-Text Transcription Batch Multilingual

curl --request POST \
  --url https://modulate-developer-apis.com/api/velma-2-stt-batch \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form upload_file='@example-file' \
  --form speaker_diarization=true \
  --form emotion_signal=false \
  --form accent_signal=false \
  --form pii_phi_tagging=false

{
  "text": "Hello, how are you? Bonjour, ça va?",
  "duration_ms": 5000,
  "utterances": [
    {
      "utterance_uuid": "e5f6a7b8-c9d0-1234-efab-345678901234",
      "text": "Hello, how are you?",
      "start_ms": 0,
      "duration_ms": 2000,
      "speaker": 1,
      "language": "en",
      "emotion": "Neutral",
      "accent": "American"
    },
    {
      "utterance_uuid": "f6a7b8c9-d0e1-2345-fabc-456789012345",
      "text": "Bonjour, ça va?",
      "start_ms": 2500,
      "duration_ms": 2500,
      "speaker": 2,
      "language": "fr",
      "emotion": "Happy",
      "accent": "British"
    }
  ]
}

Authorizations

X-API-Key

string

header

required

API key for authentication. Your API key must be included in the X-API-Key header for all requests. API keys are tied to your organization and determine your access to models and usage limits.
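As a minimal sketch of the header requirement above, the helper below builds the `X-API-Key` header and fails early when the key is missing. The function name `auth_headers` and the environment variable `MODULATE_API_KEY` are hypothetical names chosen for illustration, not part of the API.

```python
import os

API_KEY_HEADER = "X-API-Key"  # header name taken from the reference above


def auth_headers(api_key: str) -> dict[str, str]:
    """Return the authentication header, rejecting a missing or blank key."""
    key = (api_key or "").strip()
    if not key:
        raise ValueError("API key is empty; set it before calling the API")
    return {API_KEY_HEADER: key}


if __name__ == "__main__":
    # MODULATE_API_KEY is a hypothetical environment variable name.
    print(auth_headers(os.environ.get("MODULATE_API_KEY", "demo-key")))
```

Reading the key from the environment keeps it out of source control; any other secret store would work equally well.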

Body

multipart/form-data

upload_file

file

required

Audio file to transcribe. Supported formats: AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM. Maximum file size: 100MB. Empty files are rejected.
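The size and emptiness limits above can be checked on the client before uploading. The sketch below assumes "100MB" means 100,000,000 bytes (the stricter, decimal reading); the server may instead count in binary megabytes. The names `MAX_UPLOAD_BYTES`, `check_upload_size`, and `check_upload_file` are hypothetical.

```python
import os

# Assumption: "100MB" is read as decimal megabytes, the more conservative limit.
MAX_UPLOAD_BYTES = 100 * 1000 * 1000


def check_upload_size(size_bytes: int) -> None:
    """Raise ValueError if a file of this size would be rejected."""
    if size_bytes <= 0:
        raise ValueError("file is empty; the API rejects empty files")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(
            f"file is {size_bytes} bytes; the limit is {MAX_UPLOAD_BYTES} bytes"
        )


def check_upload_file(path: str) -> None:
    """Apply the size check to a file on disk."""
    check_upload_size(os.path.getsize(path))
```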

Content-Type Requirement: The MIME type for this part SHOULD match the audio format being uploaded. Using application/octet-stream is strongly discouraged — the server uses the content type to select the correct audio decoder, and generic binary types may cause intermittent decoding failures or empty responses.

Correct MIME types by format:

AAC: audio/aac
AIFF: audio/aiff
FLAC: audio/flac
MP3: audio/mpeg
MP4: video/mp4
MOV: video/quicktime
OGG: audio/ogg
Opus: audio/opus
WAV: audio/wav
WebM: video/webm
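One way to avoid sending `application/octet-stream` is to look up the MIME type from the file extension. The sketch below encodes the table above; the extension spellings (for example `.aiff` rather than `.aif`) and the function name `mime_type_for` are assumptions made for illustration.

```python
import os

# MIME types copied from the list above; the extension keys are assumptions.
MIME_BY_EXTENSION = {
    ".aac": "audio/aac",
    ".aiff": "audio/aiff",
    ".flac": "audio/flac",
    ".mp3": "audio/mpeg",
    ".mp4": "video/mp4",
    ".mov": "video/quicktime",
    ".ogg": "audio/ogg",
    ".opus": "audio/opus",
    ".wav": "audio/wav",
    ".webm": "video/webm",
}


def mime_type_for(filename: str) -> str:
    """Return the documented MIME type for a file name, case-insensitively."""
    extension = os.path.splitext(filename)[1].lower()
    try:
        return MIME_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"unsupported audio extension: {extension!r}") from None
```

Raising on unknown extensions, rather than falling back to a generic binary type, surfaces the problem before the upload instead of as a decoding failure afterwards.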

speaker_diarization

boolean

default:true

Speaker diarization identifies different speakers in the audio. When enabled, each utterance includes a speaker identifier (e.g., 1, 2).

emotion_signal

boolean

default:false

Emotion detection for each utterance. When enabled, each utterance includes an emotion signal detected from the speaker's voice.

accent_signal

boolean

default:false

Accent detection for each utterance. When enabled, each utterance includes an accent signal detected from the speaker's voice.

pii_phi_tagging

boolean

default:false

PII/PHI tagging in utterance text. When enabled, personally identifiable information (PII) and protected health information (PHI) are wrapped with appropriate tags in the transcription text.
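The four boolean options and their documented defaults can be collected into form fields as sketched below. It assumes the server expects lowercase `true`/`false` strings, as the curl example suggests; `DEFAULT_OPTIONS` and `form_fields` are hypothetical names.

```python
# Defaults as documented above.
DEFAULT_OPTIONS = {
    "speaker_diarization": True,
    "emotion_signal": False,
    "accent_signal": False,
    "pii_phi_tagging": False,
}


def form_fields(**overrides: bool) -> dict[str, str]:
    """Merge overrides into the defaults and encode each flag as a string."""
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    merged = {**DEFAULT_OPTIONS, **overrides}
    # Assumption: lowercase "true"/"false", matching the curl example.
    return {name: "true" if value else "false" for name, value in merged.items()}
```

With an HTTP client such as `requests`, this dictionary could be passed as `data=` alongside `files={"upload_file": (filename, file_handle, mime_type)}`.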

Response

Transcription completed successfully

text

string

required

The complete transcript of the audio file: the text of all utterances concatenated together.

Example:

"Hello everyone. Welcome to the meeting."

duration_ms

integer<int32>

required

The total duration of the processed audio in milliseconds. This value represents the actual audio duration and is used for usage tracking and billing purposes.

Required range: x >= 0

Example:

45000
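For logging or usage reports, the millisecond value can be rendered as a timestamp. This is a small sketch of that arithmetic; `format_duration` is a hypothetical helper, not part of the API.

```python
def format_duration(duration_ms: int) -> str:
    """Render a non-negative millisecond count as HH:MM:SS.mmm."""
    if duration_ms < 0:
        raise ValueError("duration_ms must be >= 0")
    seconds, milliseconds = divmod(duration_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{milliseconds:03d}"


print(format_duration(45000))  # prints 00:00:45.000
```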

utterances

object[]

required

Array of individual utterances detected in the audio, ordered by start time. Each utterance represents a continuous segment of speech, potentially from a specific speaker if diarization is enabled.
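A short sketch of how the `utterances` array might be consumed: computing when an utterance ends and grouping text by speaker. Field names are taken from the sample response at the top of this page; treating a missing `speaker` field as `None` when diarization is disabled is an assumption, and the helper names are hypothetical.

```python
from collections import defaultdict

# Trimmed copy of the sample response's utterances.
SAMPLE_UTTERANCES = [
    {"text": "Hello, how are you?", "start_ms": 0, "duration_ms": 2000,
     "speaker": 1, "language": "en"},
    {"text": "Bonjour, ça va?", "start_ms": 2500, "duration_ms": 2500,
     "speaker": 2, "language": "fr"},
]


def utterance_end_ms(utterance: dict) -> int:
    """End time of an utterance, derived from its start and duration."""
    return utterance["start_ms"] + utterance["duration_ms"]


def text_by_speaker(utterances: list[dict]) -> dict:
    """Group utterance text by speaker, keeping start-time order."""
    grouped = defaultdict(list)
    for utterance in sorted(utterances, key=lambda u: u["start_ms"]):
        # Assumption: "speaker" may be absent when diarization is disabled.
        grouped[utterance.get("speaker")].append(utterance["text"])
    return dict(grouped)


if __name__ == "__main__":
    for speaker, lines in text_by_speaker(SAMPLE_UTTERANCES).items():
        print(speaker, lines)
```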
