This guide gets you from zero to a real transcription response. You’ll send an audio file to the STT Batch API and get a full transcript back.

Prerequisites

1. Get an API key

Create a free account, then generate an API key from the API Key tab.

2. Set up your Python environment

All examples use Python 3.8+. Create a virtual environment and install dependencies:
mkdir modulate-quickstart && cd modulate-quickstart

python3 -m venv .venv
source .venv/bin/activate          # macOS / Linux
# .venv\Scripts\activate           # Windows
Create requirements.txt in your project root:
requests>=2.31.0
requests-toolbelt>=1.0.0
websockets>=12.0
python-dotenv>=1.0.0
urllib3<2.0
Install requirements:
pip install -r requirements.txt

3. Store your API key

Set your key as an environment variable — never hard-code credentials.
export MODULATE_API_KEY=your_api_key_here
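Or store it in a .env file (a single line such as MODULATE_API_KEY=your_api_key_here) and load it with python-dotenv. Add .env to .gitignore so the key is never committed: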
echo ".env" >> .gitignore
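Loading the key in Python could look like the sketch below. The helper name load_api_key is hypothetical and not part of any Modulate SDK. It assumes the variable is called MODULATE_API_KEY, as in the export command above, and that python-dotenv from requirements.txt is installed. If it is not, the code falls back to the plain environment.

```python
import os


def load_api_key(env_var="MODULATE_API_KEY"):
    """Return the API key from the environment, reading .env first if possible.

    load_api_key is an illustrative helper, not part of an official client.
    """
    try:
        # load_dotenv() reads a .env file if one is found and, by default,
        # does not override variables that are already set in the environment.
        from dotenv import load_dotenv

        load_dotenv()
    except ImportError:
        pass  # python-dotenv not installed; rely on the exported variable

    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it or add it to .env")
    return key
```

Failing early with a clear message is a design choice here: it avoids sending a request with an empty X-API-Key header and then having to interpret the resulting error.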

4. Get a sample audio file

Any short clip (5–30 seconds) of speech works. Place it in your project directory and note the filename — the examples below assume audio.mp3.

Make your first call

curl -X POST https://modulate-developer-apis.com/api/velma-2-stt-batch \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3"
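The same request can be sketched in Python with the requests library from requirements.txt. The URL, the X-API-Key header, and the upload_file form field are taken from the curl command above. The function name transcribe, the 120-second timeout, and the use of raise_for_status() (which assumes the API reports failures through HTTP status codes) are assumptions made for illustration.

```python
from pathlib import Path

API_URL = "https://modulate-developer-apis.com/api/velma-2-stt-batch"


def transcribe(audio_path, api_key, timeout=120):
    """POST one audio file as multipart form data and return the parsed JSON.

    transcribe is a hypothetical helper; the timeout value is an assumption.
    """
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")

    # Imported lazily so a missing file is reported before a missing dependency.
    import requests

    with path.open("rb") as audio:
        response = requests.post(
            API_URL,
            headers={"X-API-Key": api_key},
            # Same form field name as the curl -F flag.
            files={"upload_file": (path.name, audio)},
            timeout=timeout,
        )
    # Raise an exception on 4xx/5xx responses instead of parsing an error body.
    response.raise_for_status()
    return response.json()
```

Calling transcribe("audio.mp3", os.environ["MODULATE_API_KEY"]) after importing os would send the sample clip with the key exported earlier.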
A successful request returns JSON like this:
{
  "text": "Hello everyone. Welcome to the meeting. We'll be discussing results today.",
  "duration_ms": 8400,
  "utterances": [
    {
      "start_ms": 0,
      "end_ms": 4200,
      "speaker": 1,
      "language": "en",
      "text": "Hello everyone. Welcome to the meeting."
    },
    {
      "start_ms": 4200,
      "end_ms": 8400,
      "speaker": 1,
      "language": "en",
      "text": "We'll be discussing results today."
    }
  ]
}
The text field holds the full transcript. The utterances array breaks it into segments, each tagged with a speaker, a language, and millisecond start and end timestamps.
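As a small illustration of working with that structure, the sketch below turns each utterance into a readable line. It relies only on the field names shown in the sample response (start_ms, end_ms, speaker, language, text). The helper names format_timestamp and format_utterances are hypothetical.

```python
def format_timestamp(ms):
    """Convert a millisecond offset to M:SS.mmm."""
    minutes, remainder = divmod(int(ms), 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{minutes}:{seconds:02d}.{millis:03d}"


def format_utterances(result):
    """Return one display line per utterance in a parsed response dict."""
    lines = []
    for utt in result.get("utterances", []):
        start = format_timestamp(utt["start_ms"])
        end = format_timestamp(utt["end_ms"])
        lines.append(
            f"[{start} - {end}] speaker {utt['speaker']} "
            f"({utt['language']}): {utt['text']}"
        )
    return lines


# First utterance copied from the sample response above.
sample = {
    "utterances": [
        {
            "start_ms": 0,
            "end_ms": 4200,
            "speaker": 1,
            "language": "en",
            "text": "Hello everyone. Welcome to the meeting.",
        }
    ]
}

for line in format_utterances(sample):
    print(line)
# prints: [0:00.000 - 0:04.200] speaker 1 (en): Hello everyone. Welcome to the meeting.
```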

Go deeper by capability

Speech to text

All three STT models — multilingual batch, English fast, and real-time streaming.

Deepfake detection

Detect synthetic voice in recorded files or live audio streams.

PII/PHI redaction

Remove sensitive content from transcripts and audio.

Music detection

Classify audio as music, speech, or neither — batch and streaming.