POST /api/velma-2-stt-batch
Transcribe audio file with automatic language detection
curl --request POST \
  --url https://platform.modulate.ai/api/velma-2-stt-batch \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form upload_file='@example-file' \
  --form speaker_diarization=true \
  --form emotion_signal=false \
  --form accent_signal=false \
  --form deepfake_signal=false \
  --form pii_phi_tagging=false
{
  "text": "Hello, how are you? Bonjour, ça va?",
  "duration_ms": 5000,
  "utterances": [
    {
      "utterance_uuid": "e5f6a7b8-c9d0-1234-efab-345678901234",
      "text": "Hello, how are you?",
      "start_ms": 0,
      "duration_ms": 2000,
      "speaker": 1,
      "language": "en",
      "emotion": "Neutral",
      "accent": "American",
      "deepfake_score": null
    },
    {
      "utterance_uuid": "f6a7b8c9-d0e1-2345-fabc-456789012345",
      "text": "Bonjour, ça va?",
      "start_ms": 2500,
      "duration_ms": 2500,
      "speaker": 2,
      "language": "fr",
      "emotion": "Happy",
      "accent": "British",
      "deepfake_score": null
    }
  ]
}
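For callers who prefer Python to curl, the request above can be sketched with the standard library alone. This is a minimal illustration, not an official client: the helper names `encode_multipart` and `build_request` are hypothetical, and the `application/octet-stream` content type on the file part is an assumption, since the page does not say which part-level content type the service expects.

```python
import urllib.request
import uuid

API_URL = "https://platform.modulate.ai/api/velma-2-stt-batch"


def encode_multipart(fields, filename, file_bytes):
    """Encode text fields plus one file part named upload_file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # Assumption: a generic binary content type is acceptable for the audio part.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="upload_file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    parts.append(file_header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(api_key, filename, file_bytes, fields):
    """Build (but do not send) a POST request equivalent to the curl example."""
    body, content_type = encode_multipart(fields, filename, file_bytes)
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Content-Type": content_type, "X-API-Key": api_key},
    )
```

Sending the request is then a matter of passing the result to `urllib.request.urlopen` and decoding the JSON body. Keeping the build step separate from the send step makes the encoding easy to inspect without contacting the service.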

Authorizations

X-API-Key
string
header
required

API key used for authentication and usage tracking.

Body

multipart/form-data
upload_file
file
required

Audio file to transcribe. Supported formats: AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM. Maximum file size: 100MB. Empty files are rejected.
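The constraints above can be checked locally before uploading, which avoids a round trip for files that would be rejected. The sketch below is illustrative only: `check_upload` is a hypothetical helper, the extension list is an assumed mapping from the named formats to common file extensions, and "100MB" is read as 100 * 10**6 bytes because the page does not say whether decimal or binary megabytes are meant.

```python
import os

# Assumption: "100MB" is interpreted as decimal megabytes, the stricter reading.
MAX_UPLOAD_BYTES = 100 * 10**6

# Assumption: common extensions for the formats listed in the field description.
SUPPORTED_EXTENSIONS = {
    ".aac", ".aiff", ".aif", ".flac", ".mp3", ".mp4",
    ".mov", ".ogg", ".opus", ".wav", ".webm",
}


def check_upload(path):
    """Return a list of problems found; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size == 0:
        problems.append("file is empty")
    elif size > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_UPLOAD_BYTES}-byte limit")
    return problems
```

An extension check is only a heuristic; the service decides what it accepts based on the file contents, so a file that passes this check may still be rejected.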

speaker_diarization
boolean
default:true

Speaker diarization identifies different speakers in the audio. When enabled, each utterance includes a speaker identifier (e.g., 1, 2).

emotion_signal
boolean
default:false

Emotion detection for each utterance. When enabled, each utterance includes an emotion signal detected from the speaker's voice.

accent_signal
boolean
default:false

Accent detection for each utterance. When enabled, each utterance includes an accent signal detected from the speaker's voice.

deepfake_signal
boolean
default:false

Synthetic voice (deepfake) detection for each utterance. When enabled, each utterance includes a deepfake_score indicating the likelihood that the speech is AI-generated. The score ranges from 0.0 (likely natural) to 1.0 (likely synthetic).
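One way to act on the score is to flag utterances above a cutoff. The sketch below assumes a threshold of 0.5 purely for illustration; the page does not recommend any particular value, and `flag_synthetic` is a hypothetical helper. Utterances whose `deepfake_score` is null (as in the example response, where the signal was disabled) are skipped rather than treated as natural speech.

```python
def flag_synthetic(utterances, threshold=0.5):
    """Return the utterances whose deepfake_score is at or above the threshold.

    The default threshold is an illustrative assumption, not a documented value.
    Utterances with a missing or null score are skipped.
    """
    flagged = []
    for utt in utterances:
        score = utt.get("deepfake_score")
        if score is not None and score >= threshold:
            flagged.append(utt)
    return flagged
```

A suitable threshold depends on how costly false positives and false negatives are for the application, so it is best tuned against labeled audio.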

pii_phi_tagging
boolean
default:false

PII/PHI tagging in utterance text. When enabled, personally identifiable information (PII) and personal health information (PHI) in the transcription text are wrapped in tags that identify them.
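The five optional flags and their documented defaults can be collected in one place so that callers only name the options they want to change. This is a sketch under stated assumptions: `build_form_fields` is a hypothetical helper, and the lowercase `true`/`false` string encoding follows the curl example rather than a stated rule.

```python
# Defaults as documented for each optional flag.
DEFAULT_OPTIONS = {
    "speaker_diarization": True,
    "emotion_signal": False,
    "accent_signal": False,
    "deepfake_signal": False,
    "pii_phi_tagging": False,
}


def build_form_fields(**overrides):
    """Merge overrides into the documented defaults and encode them as form strings."""
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    options = {**DEFAULT_OPTIONS, **overrides}
    # Encode booleans as "true"/"false", matching the curl example.
    return {name: "true" if value else "false" for name, value in options.items()}
```

For example, `build_form_fields(deepfake_signal=True)` leaves diarization on and turns on only the deepfake score. Rejecting unknown names catches typos such as `emotion_signals` before a request is sent.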

Response

Transcription completed successfully

text
string
required

The complete transcript of the audio file, formed by concatenating the text of all utterances. Always present; may be an empty string when no speech was recognized.

Example:

"Hello everyone. Welcome to the meeting."

duration_ms
integer
required

The total duration of the processed audio in milliseconds. This value is the basis for usage tracking and billing.

Required range: x >= 0
Example:

45000

utterances
object[]
required

Array of individual utterances detected in the audio, ordered by start time. Each utterance is a continuous segment of speech, attributed to a speaker when diarization is enabled. Always present; may be an empty array when no speech was recognized.
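A decoded response can be turned into a readable, time-stamped transcript and a per-speaker talk-time summary. The sketch below relies only on fields shown in the example response (`start_ms`, `duration_ms`, `speaker`, `text`); the function names are hypothetical, and because the page does not say how `speaker` appears when diarization is disabled, it is read with `.get` and may come back as `None`.

```python
from collections import defaultdict


def format_transcript(response):
    """Return one line per utterance with its start and end time in milliseconds."""
    lines = []
    for utt in response["utterances"]:
        end_ms = utt["start_ms"] + utt["duration_ms"]
        lines.append(
            f"[{utt['start_ms']}-{end_ms} ms] speaker {utt.get('speaker')}: {utt['text']}"
        )
    return lines


def speaking_time_by_speaker(response):
    """Sum utterance durations (ms) per speaker identifier."""
    totals = defaultdict(int)
    for utt in response["utterances"]:
        totals[utt.get("speaker")] += utt["duration_ms"]
    return dict(totals)


# Trimmed copy of the example response from this page.
SAMPLE = {
    "text": "Hello, how are you? Bonjour, ça va?",
    "duration_ms": 5000,
    "utterances": [
        {"text": "Hello, how are you?", "start_ms": 0, "duration_ms": 2000, "speaker": 1},
        {"text": "Bonjour, ça va?", "start_ms": 2500, "duration_ms": 2500, "speaker": 2},
    ],
}

for line in format_transcript(SAMPLE):
    print(line)
# [0-2000 ms] speaker 1: Hello, how are you?
# [2500-5000 ms] speaker 2: Bonjour, ça va?
print(speaking_time_by_speaker(SAMPLE))
# {1: 2000, 2: 2500}
```

Both helpers return empty results for an empty `utterances` array, which the field description says can occur when no speech is recognized.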