Quick start

This guide gets you to your first successful API call for each Velma-2 model — meaning you’ll send real audio and receive real results back. Pick the section for the model you want to start with, or work through all of them. Time to complete: ~10 minutes per section once your environment is set up.

Prerequisites

Get an API key

Create a free account and copy your API key from the dashboard.

Set up your Python environment

All examples use Python 3.8+. Create a project directory:

mkdir modulate-quickstart && cd modulate-quickstart

Create requirements.txt in the project root:

requirements.txt

requests>=2.31.0
requests-toolbelt>=1.0.0
websockets>=12.0
python-dotenv>=1.0.0
urllib3<2.0

Then create a virtual environment and install the dependencies:

python3 -m venv .venv
source .venv/bin/activate          # macOS / Linux
# .venv\Scripts\activate           # Windows

pip install -r requirements.txt

Store your API key

Never hard-code credentials. Create a .env file in your project root:

.env

MODULATE_API_KEY=your_api_key_here

Add .env to your .gitignore so it is never committed:

echo ".env" >> .gitignore

Get a sample audio file

Any short clip (5–30 seconds) of speech works. Place it in your project directory and note the filename — the examples below assume audio.mp3.
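Before running the examples, it can help to confirm that the key and the audio file are actually in place. The following is a minimal sketch, not part of the Modulate tooling: check_setup is a hypothetical helper name, and the placeholder string it looks for is simply the one from the .env example above.

```python
import os
from pathlib import Path


def check_setup(env: dict, audio_path: str) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    key = env.get("MODULATE_API_KEY", "")
    # The placeholder value comes from the .env example in this guide.
    if not key or key == "your_api_key_here":
        problems.append("MODULATE_API_KEY is missing or still the placeholder")
    path = Path(audio_path)
    if not path.is_file():
        problems.append(f"audio file not found: {audio_path}")
    elif path.stat().st_size == 0:
        problems.append(f"audio file is empty: {audio_path}")
    return problems


if __name__ == "__main__":
    # In a real script, call load_dotenv() first so .env is read into os.environ.
    for problem in check_setup(dict(os.environ), "audio.mp3"):
        print("Problem:", problem)
```

If the script prints nothing, both checks passed. It does not verify that the key is valid; only a real API call can confirm that.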

Transcription — batch (multilingual)

Endpoint: POST /api/velma-2-stt-batch
When to use: You have a complete audio file and want a full transcript with per-utterance timing, speaker labels, and optional enrichments (emotion, accent, deepfake score, PII redaction).
Supported formats: AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM
Pricing: $0.03 / hour of audio

stt_batch.py

import os
import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://modulate-developer-apis.com/api/velma-2-stt-batch"
AUDIO_FILE = "audio.mp3"

def transcribe(filepath: str) -> dict:
    headers = {"X-API-Key": API_KEY}
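    # Form fields are sent as strings; lowercase "true"/"false" matches
    # the other examples in this guide.
    params = {
        "speaker_diarization": "true",
        "emotion_signal": "false",
        "accent_signal": "false",
        "deepfake_signal": "false",
        "pii_phi_tagging": "false",
    }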

    with open(filepath, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers=headers,
            data=params,
            files={"upload_file": f},
        )

    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    result = transcribe(AUDIO_FILE)

    print("\n── Transcript ──")
    print(result["text"])
    print(f"\n── Duration: {result['duration_ms']} ms ──")

    print(f"\n── Utterances ({len(result['utterances'])} total) ──")
    for u in result["utterances"]:
        print(
            f"  [{u['start_ms']}ms] Speaker {u['speaker']} ({u['language']}): "
            f"{u['text']}"
        )

Run it:

python stt_batch.py

Expected output:

── Transcript ──
Hello everyone. Welcome to the meeting. We'll be discussing results today.

── Duration: 8400 ms ──

── Utterances (2 total) ──
  [0ms] Speaker 1 (en): Hello everyone. Welcome to the meeting.
  [4200ms] Speaker 1 (en): We'll be discussing results today.
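Once the utterances are available, they can be reshaped for display. The sketch below turns them into a speaker-labelled transcript with MM:SS timestamps. It uses only the fields printed by the script above (start_ms, speaker, text); format_timestamp and format_transcript are hypothetical helper names, and the integer speaker value in the sample data is an assumption based on the example output.

```python
def format_timestamp(ms: int) -> str:
    """Convert a millisecond offset to MM:SS (fractions of a second are dropped)."""
    total_seconds = ms // 1000
    return f"{total_seconds // 60:02d}:{total_seconds % 60:02d}"


def format_transcript(utterances: list) -> str:
    """Render one line per utterance: [MM:SS] Speaker N: text."""
    lines = []
    for u in utterances:
        stamp = format_timestamp(u["start_ms"])
        lines.append(f"[{stamp}] Speaker {u['speaker']}: {u['text']}")
    return "\n".join(lines)


# Sample data copied from the example output above (speaker type is assumed).
sample = [
    {"start_ms": 0, "speaker": 1, "language": "en",
     "text": "Hello everyone. Welcome to the meeting."},
    {"start_ms": 4200, "speaker": 1, "language": "en",
     "text": "We'll be discussing results today."},
]

print(format_transcript(sample))
# [00:00] Speaker 1: Hello everyone. Welcome to the meeting.
# [00:04] Speaker 1: We'll be discussing results today.
```

With a real response, pass result["utterances"] instead of the sample list.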

Transcription — batch (English fast)

Endpoint: POST /api/velma-2-stt-batch-english-vfast
When to use: English-only audio where you want maximum throughput at the lowest price. Returns a single transcript string with no per-utterance breakdown.
Supported formats: Opus (.opus) only
Pricing: $0.025 / hour of audio

This model only accepts .opus files. If your audio is in another format, convert it first:

ffmpeg -i audio.mp3 audio.opus
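If you prefer to run the conversion from Python, the same command can be wrapped with subprocess. This is a sketch under two assumptions: ffmpeg is installed and on your PATH, and the helper names (build_ffmpeg_command, convert_to_opus) are hypothetical. It adds ffmpeg's -y flag so that a rerun overwrites the existing output file instead of waiting for a prompt.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: str) -> list:
    """Build the conversion command: same input name, .opus extension."""
    dst = str(Path(src).with_suffix(".opus"))
    # -y overwrites dst without prompting, which matters for non-interactive runs.
    return ["ffmpeg", "-y", "-i", src, dst]


def convert_to_opus(src: str) -> str:
    """Run ffmpeg and return the output path; raises if ffmpeg is missing or fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    cmd = build_ffmpeg_command(src)
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Calling convert_to_opus("audio.mp3") would write audio.opus next to the source file.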

stt_batch_english.py

import os
import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://modulate-developer-apis.com/api/velma-2-stt-batch-english-vfast"
AUDIO_FILE = "audio.opus"

def transcribe_english(filepath: str) -> dict:
    headers = {"X-API-Key": API_KEY}
    with open(filepath, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers=headers,
            files={"upload_file": f},
        )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    result = transcribe_english(AUDIO_FILE)
    print("\n── Transcript ──")
    print(result["text"])
    print(f"\n── Duration: {result['duration_ms']} ms ──")

Expected output:

── Transcript ──
Good morning, everyone. Today we're covering the quarterly results.

── Duration: 5200 ms ──
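The duration_ms field makes it straightforward to estimate what a job costs at the per-hour prices quoted for the two batch transcription models. The sketch below assumes billing is strictly proportional to audio duration, with no rounding or minimum charge; this guide does not state how billing is rounded, so treat the result as an estimate. estimate_cost_usd is a hypothetical helper name.

```python
# Per-hour prices as quoted in this guide for the two batch transcription models.
PRICE_PER_HOUR_USD = {
    "velma-2-stt-batch": 0.03,
    "velma-2-stt-batch-english-vfast": 0.025,
}

MS_PER_HOUR = 3_600_000


def estimate_cost_usd(model: str, duration_ms: int) -> float:
    """Estimate cost assuming linear billing by duration (an assumption)."""
    return PRICE_PER_HOUR_USD[model] * duration_ms / MS_PER_HOUR


# Ten hours of audio on each model.
ten_hours_ms = 10 * MS_PER_HOUR
for model in PRICE_PER_HOUR_USD:
    print(f"{model}: ${estimate_cost_usd(model, ten_hours_ms):.2f}")
# velma-2-stt-batch: $0.30
# velma-2-stt-batch-english-vfast: $0.25
```

Other models can be added to the dictionary using the prices listed in their sections.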

Transcription — streaming (WebSocket)

Endpoint: wss://modulate-developer-apis.com/api/velma-2-stt-streaming
When to use: Live audio where you need results as speech happens, such as phone calls, live captions, and real-time meeting transcription.
Supported formats: Self-describing formats (WAV, MP3, OGG, FLAC, WebM, AAC, AIFF) are auto-detected. Raw PCM formats require audio_format, sample_rate, and num_channels.
Pricing: $0.06 / hour of audio

This example simulates streaming by reading a local file in chunks; the same pattern applies to a real microphone or live audio source.

stt_streaming.py

import os
import asyncio
import json
from urllib.parse import urlencode

import websockets
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-stt-streaming"
AUDIO_FILE = "audio.mp3"
CHUNK_SIZE = 4096  # bytes read from the file per WebSocket message


def build_url() -> str:
    params = {
        "api_key": API_KEY,
        "speaker_diarization": "true",
        "emotion_signal": "false",
        "accent_signal": "false",
        "deepfake_signal": "false",
        "pii_phi_tagging": "false",
        "partial_results": "false",
    }
    # urlencode percent-encodes values, so a key containing "&", "=" or "+"
    # cannot corrupt the query string.
    return f"{BASE_URL}?{urlencode(params)}"


async def stream_audio(filepath: str):
    url = build_url()

    async with websockets.connect(url) as ws:
        print("Connected. Streaming audio...\n")

        async def send_audio():
            with open(filepath, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
                    await asyncio.sleep(0)
            await ws.send("")
            print("[sent end-of-stream signal]\n")

        async def receive_results():
            async for message in ws:
                msg = json.loads(message)
                msg_type = msg.get("type")

                if msg_type == "utterance":
                    u = msg["utterance"]
                    print(
                        f"[{u['start_ms']}ms] Speaker {u['speaker']} ({u['language']}): "
                        f"{u['text']}"
                    )
                elif msg_type == "partial_utterance":
                    pu = msg["partial_utterance"]
                    print(f"  (partial) {pu['text']}", end="\r")
                elif msg_type == "done":
                    print(f"\n── Done. Total audio: {msg['duration_ms']} ms ──")
                    break
                elif msg_type == "error":
                    print(f"Error from server: {msg['error']}")
                    break

        await asyncio.gather(send_audio(), receive_results())


if __name__ == "__main__":
    asyncio.run(stream_audio(AUDIO_FILE))

Expected output:

Connected. Streaming audio...

[0ms] Speaker 1 (en): Hello, how are you today?
[3100ms] Speaker 1 (en): I wanted to go over the project timeline.
[sent end-of-stream signal]

── Done. Total audio: 9800 ms ──
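The script above relies on auto-detection because MP3 is self-describing. For raw PCM input, the three format parameters have to be supplied in the query string. The sketch below shows one way to build such a URL. build_pcm_url is a hypothetical helper name, and the default value s16le is borrowed from the deepfake streaming section of this guide; the values this endpoint accepts for audio_format are an assumption here.

```python
from urllib.parse import urlencode

BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-stt-streaming"


def build_pcm_url(api_key: str, audio_format: str = "s16le",
                  sample_rate: int = 16000, num_channels: int = 1) -> str:
    """Build a streaming URL that declares the raw PCM layout explicitly."""
    params = {
        "api_key": api_key,
        "audio_format": audio_format,      # assumed value name, see lead-in
        "sample_rate": str(sample_rate),
        "num_channels": str(num_channels),
    }
    # urlencode percent-encodes each value, so special characters are safe.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_pcm_url("demo-key"))
# wss://modulate-developer-apis.com/api/velma-2-stt-streaming?api_key=demo-key&audio_format=s16le&sample_rate=16000&num_channels=1
```

The declared sample rate and channel count must match the bytes you actually send; otherwise the server would be interpreting the stream with the wrong layout.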

PII/PHI redaction — batch

Endpoint: POST /api/velma-2-pii-phi-redaction-batch
When to use: You have a complete audio file and need PII/PHI removed from both the transcript and the audio, for example a recording that will be shared or archived.
Supported formats: AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM
Pricing: $0.05 / hour of audio

This endpoint returns multipart/form-data with two parts: metadata (JSON) and audio (MP3). The example below uses requests-toolbelt to decode the response. It is already listed in requirements.txt; if you skipped that step, install it with pip install requests-toolbelt.

pii_phi_redaction_batch.py

import os
import json
import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://modulate-developer-apis.com/api/velma-2-pii-phi-redaction-batch"
AUDIO_FILE = "audio.mp3"


def redact(filepath: str):
    headers = {"X-API-Key": API_KEY}
    with open(filepath, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers=headers,
            files={"upload_file": f},
            data={
                "speaker_diarization": "true",
                "start_redaction_padding_ms": "100",
                "end_redaction_padding_ms": "0",
            },
        )
    response.raise_for_status()

    decoder = MultipartDecoder.from_response(response)
    metadata = None
    audio_bytes = None
    for part in decoder.parts:
        disposition = part.headers.get(b"Content-Disposition", b"").decode()
        if 'name="metadata"' in disposition:
            metadata = json.loads(part.content)
        elif 'name="audio"' in disposition:
            audio_bytes = part.content
    return metadata, audio_bytes


if __name__ == "__main__":
    metadata, audio_bytes = redact(AUDIO_FILE)

    print("\n── Redacted transcript ──")
    print(metadata["text"])
    print(f"\n── Duration: {metadata['duration_ms']} ms ──")
    print(f"── Redaction ranges: {metadata['redaction_ranges']} ──\n")

    if audio_bytes:
        with open("redacted.mp3", "wb") as f:
            f.write(audio_bytes)
        print(f"Saved redacted audio → redacted.mp3 ({len(audio_bytes)} bytes)")

PII/PHI redaction — streaming (WebSocket)

Endpoint: wss://modulate-developer-apis.com/api/velma-2-pii-phi-redaction-streaming
When to use: Live audio where you need PII/PHI redacted in real time. Redacted transcript text and redacted MP3 clips are delivered as each utterance completes.
Supported formats: Self-describing formats (WAV, MP3, OGG, FLAC, WebM, AAC, AIFF) are auto-detected. Raw PCM formats require audio_format, sample_rate, and num_channels.
Pricing: $0.08 / hour of audio

pii_phi_redaction_streaming.py

import os
import asyncio
import json
from urllib.parse import urlencode

import websockets
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-pii-phi-redaction-streaming"
AUDIO_FILE = "audio.mp3"
CHUNK_SIZE = 4096  # bytes read from the file per WebSocket message


def build_url() -> str:
    params = {
        "api_key": API_KEY,
        "speaker_diarization": "true",
        "start_redaction_padding_ms": "100",
        "end_redaction_padding_ms": "0",
    }
    # urlencode percent-encodes values, so a key containing "&", "=" or "+"
    # cannot corrupt the query string.
    return f"{BASE_URL}?{urlencode(params)}"


async def redact_streaming(filepath: str):
    url = build_url()
    audio_clips = []

    async with websockets.connect(url) as ws:
        async def send_audio():
            with open(filepath, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
                    await asyncio.sleep(0)
            await ws.send("")

        async def receive_results():
            is_done = False
            async for message in ws:
                if isinstance(message, bytes):
                    audio_clips.append(message)
                    if is_done:
                        break
                    continue
                msg = json.loads(message)
                if msg.get("type") == "utterance":
                    u = msg["utterance"]
                    print(f"[{u['start_ms']}ms] Speaker {u['speaker']}: {u['text']}")
                elif msg.get("type") == "done":
                    is_done = True
                    if not msg.get("trailing_redacted_audio"):
                        break
                elif msg.get("type") == "error":
                    print(f"Error: {msg['error']}")
                    break

        await asyncio.gather(send_audio(), receive_results())

    if audio_clips:
        with open("redacted.mp3", "wb") as f:
            for clip in audio_clips:
                f.write(clip)


if __name__ == "__main__":
    asyncio.run(redact_streaming(AUDIO_FILE))

Deepfake detection — batch

Endpoint: POST /api/velma-2-synthetic-voice-detection-batch
When to use: You have a recorded audio file and want to know whether it contains AI-generated (synthetic) speech. Results cover the full file, broken into time-windowed frames.
Supported formats: AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM
Pricing: $0.25 / hour of audio

deepfake_batch.py

import os
import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-batch"
AUDIO_FILE = "audio.mp3"

VERDICT_LABELS = {
    "synthetic": "Synthetic (AI-generated)",
    "non-synthetic": "Non-synthetic (human)",
    "no-content": "No content (silence)",
}


def detect_deepfake(filepath: str) -> dict:
    headers = {"X-API-Key": API_KEY}
    with open(filepath, "rb") as f:
        response = requests.post(ENDPOINT, headers=headers, files={"upload_file": f})
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    result = detect_deepfake(AUDIO_FILE)

    print(f"\nFile:     {result['filename']}")
    print(f"Duration: {result['duration_ms']} ms")
    print(f"Frames:   {len(result['frames'])}\n")

    for frame in result["frames"]:
        label = VERDICT_LABELS.get(frame["verdict"], frame["verdict"])
        print(
            f"  {frame['start_time_ms']:>6}ms – {frame['end_time_ms']:>6}ms  "
            f"{label}  (confidence: {frame['confidence']:.2%})"
        )
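Per-frame verdicts are easier to act on once they are aggregated. The sketch below totals the time covered by each verdict and computes what share of the frames with content were flagged synthetic. It uses only the frame fields printed above (start_time_ms, end_time_ms, verdict). The helper names and the sample frames are hypothetical, and the aggregate ignores confidence, so it is a rough summary rather than a file-level verdict from the API.

```python
from collections import defaultdict


def summarize_frames(frames: list) -> dict:
    """Total milliseconds covered by each verdict."""
    totals = defaultdict(int)
    for frame in frames:
        totals[frame["verdict"]] += frame["end_time_ms"] - frame["start_time_ms"]
    return dict(totals)


def synthetic_share(frames: list):
    """Fraction of content time flagged synthetic, or None if there is no content."""
    totals = summarize_frames(frames)
    content_ms = totals.get("synthetic", 0) + totals.get("non-synthetic", 0)
    if content_ms == 0:
        return None
    return totals.get("synthetic", 0) / content_ms


# Hypothetical frames for illustration; real values come from result["frames"].
sample_frames = [
    {"start_time_ms": 0, "end_time_ms": 2000, "verdict": "non-synthetic"},
    {"start_time_ms": 2000, "end_time_ms": 4000, "verdict": "synthetic"},
    {"start_time_ms": 4000, "end_time_ms": 6000, "verdict": "no-content"},
    {"start_time_ms": 6000, "end_time_ms": 8000, "verdict": "synthetic"},
]

print(summarize_frames(sample_frames))
print(f"Synthetic share of content: {synthetic_share(sample_frames):.0%}")
```

Frames marked no-content are excluded from the share so that silence does not dilute the result.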

Deepfake detection — streaming (WebSocket)

Endpoint: wss://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-streaming
Required audio format: Raw PCM. The server does not auto-detect the format here, so you must specify audio_format, sample_rate, and num_channels. The most common setup is s16le at 16 kHz mono.
Pricing: $0.25 / hour of audio

This model requires raw PCM. If your source is MP3 or WAV, convert first:

ffmpeg -i audio.mp3 -ar 16000 -ac 1 -f s16le audio.raw

deepfake_streaming.py

import os
import asyncio
import json
from urllib.parse import urlencode

import websockets
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-streaming"
AUDIO_FILE = "audio.raw"
CHUNK_SIZE = 8192  # bytes read from the file per WebSocket message


def build_url() -> str:
    # These values must match the ffmpeg conversion: s16le, 16 kHz, mono.
    params = {
        "api_key": API_KEY,
        "audio_format": "s16le",
        "sample_rate": "16000",
        "num_channels": "1",
    }
    # urlencode percent-encodes values, so a key containing "&", "=" or "+"
    # cannot corrupt the query string.
    return f"{BASE_URL}?{urlencode(params)}"


async def stream_detection(filepath: str):
    url = build_url()

    async with websockets.connect(url) as ws:
        async def send_audio():
            with open(filepath, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
                    await asyncio.sleep(0)
            await ws.send("")

        async def receive_results():
            async for message in ws:
                msg = json.loads(message)
                if msg.get("type") == "frame":
                    frame = msg["frame"]
                    print(
                        f"  {frame['start_time_ms']:>6}ms – {frame['end_time_ms']:>6}ms  "
                        f"{frame['verdict']}  (confidence: {frame['confidence']:.2%})"
                    )
                elif msg.get("type") == "done":
                    print(f"\nDone. Total: {msg['duration_ms']} ms, frames: {msg['frame_count']}")
                    break
                elif msg.get("type") == "error":
                    print(f"Error: {msg['error']}")
                    break

        await asyncio.gather(send_audio(), receive_results())


if __name__ == "__main__":
    asyncio.run(stream_detection(AUDIO_FILE))
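
The script above uses asyncio.sleep(0), which sends the whole file as fast as the connection allows. To approximate a live source, you can sleep for the duration of audio each chunk represents. For raw PCM that duration follows directly from the format parameters. The sketch below shows the arithmetic; chunk_duration_seconds is a hypothetical helper, and whether the server needs or benefits from real-time pacing is not stated in this guide.

```python
# Bytes per sample for each PCM format; s16le is signed 16-bit, so 2 bytes.
BYTES_PER_SAMPLE = {"s16le": 2}


def chunk_duration_seconds(chunk_bytes: int, sample_rate: int = 16000,
                           num_channels: int = 1,
                           audio_format: str = "s16le") -> float:
    """Seconds of audio contained in a raw PCM chunk of the given size."""
    bytes_per_second = sample_rate * num_channels * BYTES_PER_SAMPLE[audio_format]
    return chunk_bytes / bytes_per_second


# With the settings used above: 8192 bytes / (16000 * 1 * 2) bytes per second.
print(chunk_duration_seconds(8192))
# 0.256
```

In send_audio, replacing asyncio.sleep(0) with asyncio.sleep(chunk_duration_seconds(len(chunk))) would send roughly one second of audio per second.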

Next steps

Enable optional enrichments on the batch STT models (emotion_signal, accent_signal, deepfake_signal, pii_phi_tagging) and inspect the per-utterance fields. See the STT enrichment features guide.
Try partial_results=true on the STT streaming model to see in-progress text before each utterance is finalized.
Review the FAQ for questions on rate limits, billing, error handling, and supported audio formats.
