This guide gets you to your first successful API call for each Velma-2 model — meaning you’ll send real audio and receive real results back. Pick the section for the model you want to start with, or work through all of them.
Time to complete: ~10 minutes per section once your environment is set up.
Prerequisites
Set up your Python environment
All examples use Python 3.8+. Create a project directory and a virtual environment:
mkdir modulate-quickstart && cd modulate-quickstart
python3 -m venv .venv
source .venv/bin/activate   # macOS / Linux
# .venv\Scripts\activate    # Windows
Create requirements.txt in your project root:
requests>=2.31.0
requests-toolbelt>=1.0.0
websockets>=12.0
python-dotenv>=1.0.0
urllib3<2.0
Then install the dependencies:
pip install -r requirements.txt
Store your API key
Never hard-code credentials. Create a .env file in your project root:
MODULATE_API_KEY=your_api_key_here
Add .env to your .gitignore so it is never committed:
echo ".env" >> .gitignore
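To confirm the variable loads before you make any calls, you can run a quick one-liner (this only checks that python-dotenv can see the key; it does not validate it against the API):
python -c "import os; from dotenv import load_dotenv; load_dotenv(); print('MODULATE_API_KEY found:', bool(os.getenv('MODULATE_API_KEY')))"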
Get a sample audio file
Any short clip (5–30 seconds) of speech works. Place it in your project directory and note the filename — the examples below assume audio.mp3.
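If your source clip is longer than that, ffmpeg can trim it down, for example to the first 30 seconds:
ffmpeg -i input.mp3 -t 30 audio.mp3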
Transcription — batch (multilingual)
Endpoint: POST /api/velma-2-stt-batch
When to use: You have a complete audio file and want a full transcript with per-utterance timing, speaker labels, and optional enrichments (emotion, accent, deepfake score, PII redaction).
Supported formats: AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM
Pricing: $0.03 / hour of audio
stt_batch.py
import os

import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://modulate-developer-apis.com/api/velma-2-stt-batch"
AUDIO_FILE = "audio.mp3"


def transcribe(filepath: str) -> dict:
    headers = {"X-API-Key": API_KEY}
    # Form values are sent as strings, matching the convention in the
    # streaming examples. Flip any enrichment to "true" to enable it.
    params = {
        "speaker_diarization": "true",
        "emotion_signal": "false",
        "accent_signal": "false",
        "deepfake_signal": "false",
        "pii_phi_tagging": "false",
    }
    with open(filepath, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers=headers,
            data=params,
            files={"upload_file": f},
        )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    result = transcribe(AUDIO_FILE)
    print("\n── Transcript ──")
    print(result["text"])
    print(f"\n── Duration: {result['duration_ms']} ms ──")
    print(f"\n── Utterances ({len(result['utterances'])} total) ──")
    for u in result["utterances"]:
        print(
            f"  [{u['start_ms']}ms] Speaker {u['speaker']} ({u['language']}): "
            f"{u['text']}"
        )
Run it:
python stt_batch.py
Expected output:
── Transcript ──
Hello everyone. Welcome to the meeting. We'll be discussing results today.
── Duration: 8400 ms ──
── Utterances (2 total) ──
[0ms] Speaker 1 (en): Hello everyone. Welcome to the meeting.
[4200ms] Speaker 1 (en): We'll be discussing results today.
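If a request fails, raise_for_status() raises requests.HTTPError. A minimal sketch of handling it, dropped into the script above (the status codes shown, 401 for a bad key and 429 for rate limiting, are common HTTP conventions rather than documented guarantees; see the FAQ for the authoritative list):

import requests

try:
    result = transcribe(AUDIO_FILE)
except requests.HTTPError as e:
    status = e.response.status_code
    if status == 401:
        print("Authentication failed. Check MODULATE_API_KEY in your .env file.")
    elif status == 429:
        print("Rate limited. Back off briefly and retry.")
    else:
        print(f"Request failed ({status}): {e.response.text}")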
Transcription — batch (English fast)
Endpoint: POST /api/velma-2-stt-batch-english-vfast
When to use: English-only audio where you want maximum throughput at the lowest price. Returns a single transcript string with no per-utterance breakdown.
Supported formats: Opus (.opus) only
Pricing: $0.025 / hour of audio
This model only accepts .opus files. If your audio is in another format, convert it first:
ffmpeg -i audio.mp3 audio.opus
stt_batch_english_vfast.py
import os

import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://modulate-developer-apis.com/api/velma-2-stt-batch-english-vfast"
AUDIO_FILE = "audio.opus"


def transcribe_english(filepath: str) -> dict:
    headers = {"X-API-Key": API_KEY}
    with open(filepath, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers=headers,
            files={"upload_file": f},
        )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    result = transcribe_english(AUDIO_FILE)
    print("\n── Transcript ──")
    print(result["text"])
    print(f"\n── Duration: {result['duration_ms']} ms ──")
Expected output:
── Transcript ──
Good morning, everyone. Today we're covering the quarterly results.
── Duration: 5200 ms ──
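Because each call is independent, throughput scales by running uploads in parallel. A minimal sketch reusing transcribe_english() from the script above to process every .opus file in the current directory (the worker count of 4 is an arbitrary illustration, not an API limit):

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

files = sorted(Path(".").glob("*.opus"))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(transcribe_english, (str(p) for p in files))
    for path, result in zip(files, results):
        print(f"{path.name}: {result['text']}")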
Transcription — streaming (WebSocket)
Endpoint: wss://modulate-developer-apis.com/api/velma-2-stt-streaming
When to use: Live audio where you need results as speech happens — phone calls, live captions, real-time meeting transcription.
Supported formats: Self-describing (WAV, MP3, OGG, FLAC, WebM, AAC, AIFF) are auto-detected. Raw PCM formats require audio_format, sample_rate, and num_channels.
Pricing: $0.06 / hour of audio
This example simulates streaming by reading a local file in chunks — the same pattern applies to a real microphone or live audio source.
stt_streaming.py
import asyncio
import json
import os

import websockets
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-stt-streaming"
AUDIO_FILE = "audio.mp3"
CHUNK_SIZE = 4096


def build_url() -> str:
    params = {
        "api_key": API_KEY,
        "speaker_diarization": "true",
        "emotion_signal": "false",
        "accent_signal": "false",
        "deepfake_signal": "false",
        "pii_phi_tagging": "false",
        "partial_results": "false",
    }
    query = "&".join(f"{k}={v}" for k, v in params.items())
    return f"{BASE_URL}?{query}"


async def stream_audio(filepath: str):
    url = build_url()
    async with websockets.connect(url) as ws:
        print("Connected. Streaming audio...\n")

        async def send_audio():
            # Read the file in chunks to simulate a live audio source.
            with open(filepath, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
                    await asyncio.sleep(0)  # yield so results can be received
            await ws.send("")  # empty message signals end-of-stream
            print("[sent end-of-stream signal]\n")

        async def receive_results():
            async for message in ws:
                msg = json.loads(message)
                msg_type = msg.get("type")
                if msg_type == "utterance":
                    u = msg["utterance"]
                    print(
                        f"[{u['start_ms']}ms] Speaker {u['speaker']} ({u['language']}): "
                        f"{u['text']}"
                    )
                elif msg_type == "partial_utterance":
                    pu = msg["partial_utterance"]
                    print(f"  (partial) {pu['text']}", end="\r")
                elif msg_type == "done":
                    print(f"\n── Done. Total audio: {msg['duration_ms']} ms ──")
                    break
                elif msg_type == "error":
                    print(f"Error from server: {msg['error']}")
                    break

        await asyncio.gather(send_audio(), receive_results())


if __name__ == "__main__":
    asyncio.run(stream_audio(AUDIO_FILE))
Expected output:
Connected. Streaming audio...
[0ms] Speaker 1 (en): Hello, how are you today?
[3100ms] Speaker 1 (en): I wanted to go over the project timeline.
[sent end-of-stream signal]
── Done. Total audio: 9800 ms ──
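To adapt the pattern to a real microphone, capture raw PCM and pass the format parameters in the URL, since raw PCM is not self-describing. A minimal sketch assuming the third-party sounddevice package (pip install sounddevice; it is not in requirements.txt) and a 16 kHz mono s16le capture; the 10-second capture window is an arbitrary choice for the demo:

import asyncio
import json
import os

import sounddevice as sd  # third-party; pip install sounddevice
import websockets
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
URL = (
    "wss://modulate-developer-apis.com/api/velma-2-stt-streaming"
    f"?api_key={API_KEY}&audio_format=s16le&sample_rate=16000&num_channels=1"
)


async def stream_microphone(seconds: float = 10.0):
    queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def on_audio(indata, frames, time_info, status):
        # PortAudio invokes this callback on its own thread, so hand the
        # bytes back to the event loop thread-safely.
        loop.call_soon_threadsafe(queue.put_nowait, bytes(indata))

    async with websockets.connect(URL) as ws:
        with sd.RawInputStream(
            samplerate=16000, channels=1, dtype="int16", callback=on_audio
        ):
            async def send_audio():
                deadline = loop.time() + seconds
                while loop.time() < deadline:
                    await ws.send(await queue.get())
                await ws.send("")  # end-of-stream signal

            async def receive_results():
                async for message in ws:
                    msg = json.loads(message)
                    if msg.get("type") == "utterance":
                        u = msg["utterance"]
                        print(f"[{u['start_ms']}ms] {u['text']}")
                    elif msg.get("type") in ("done", "error"):
                        print(msg)
                        break

            await asyncio.gather(send_audio(), receive_results())


if __name__ == "__main__":
    asyncio.run(stream_microphone())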
PII/PHI redaction — batch
Endpoint: POST /api/velma-2-pii-phi-redaction-batch
When to use: You have a complete audio file and need PII/PHI removed from both the transcript and the audio — for example, a recording that will be shared or archived.
Supported formats: AAC, AIFF, FLAC, MP3, MP4, MOV, OGG, Opus, WAV, WebM
Pricing: $0.05 / hour of audio
This endpoint returns multipart/form-data with two parts: metadata (JSON) and audio (MP3). The example below uses requests-toolbelt, already included in requirements.txt above, to decode the response.
pii_phi_redaction_batch.py
import json
import os

import requests
from dotenv import load_dotenv
from requests_toolbelt.multipart.decoder import MultipartDecoder

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://modulate-developer-apis.com/api/velma-2-pii-phi-redaction-batch"
AUDIO_FILE = "audio.mp3"


def redact(filepath: str):
    headers = {"X-API-Key": API_KEY}
    with open(filepath, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers=headers,
            files={"upload_file": f},
            data={
                "speaker_diarization": "true",
                "start_redaction_padding_ms": "100",
                "end_redaction_padding_ms": "0",
            },
        )
    response.raise_for_status()

    # The response is multipart: a JSON "metadata" part and an MP3 "audio" part.
    decoder = MultipartDecoder.from_response(response)
    metadata = None
    audio_bytes = None
    for part in decoder.parts:
        disposition = part.headers.get(b"Content-Disposition", b"").decode()
        if 'name="metadata"' in disposition:
            metadata = json.loads(part.content)
        elif 'name="audio"' in disposition:
            audio_bytes = part.content
    return metadata, audio_bytes


if __name__ == "__main__":
    metadata, audio_bytes = redact(AUDIO_FILE)

    print("\n── Redacted transcript ──")
    print(metadata["text"])
    print(f"\n── Duration: {metadata['duration_ms']} ms ──")
    print(f"── Redaction ranges: {metadata['redaction_ranges']} ──\n")

    if audio_bytes:
        with open("redacted.mp3", "wb") as f:
            f.write(audio_bytes)
        print(f"Saved redacted audio → redacted.mp3 ({len(audio_bytes)} bytes)")
PII/PHI redaction — streaming (WebSocket)
Endpoint: wss://modulate-developer-apis.com/api/velma-2-pii-phi-redaction-streaming
When to use: Live audio where you need PII/PHI redacted in real time — redacted transcript text and redacted MP3 clips are delivered as each utterance completes.
Supported formats: Self-describing (WAV, MP3, OGG, FLAC, WebM, AAC, AIFF) are auto-detected. Raw PCM formats require audio_format, sample_rate, and num_channels.
Pricing: $0.08 / hour of audio
pii_phi_redaction_streaming.py
import asyncio
import json
import os

import websockets
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-pii-phi-redaction-streaming"
AUDIO_FILE = "audio.mp3"
CHUNK_SIZE = 4096


def build_url() -> str:
    params = {
        "api_key": API_KEY,
        "speaker_diarization": "true",
        "start_redaction_padding_ms": "100",
        "end_redaction_padding_ms": "0",
    }
    query = "&".join(f"{k}={v}" for k, v in params.items())
    return f"{BASE_URL}?{query}"


async def redact_streaming(filepath: str):
    url = build_url()
    audio_clips = []

    async with websockets.connect(url) as ws:

        async def send_audio():
            with open(filepath, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
                    await asyncio.sleep(0)
            await ws.send("")  # end-of-stream signal

        async def receive_results():
            is_done = False
            async for message in ws:
                # Binary frames are redacted MP3 clips; text frames are JSON.
                if isinstance(message, bytes):
                    audio_clips.append(message)
                    if is_done:
                        break  # this was the trailing clip promised by "done"
                    continue
                msg = json.loads(message)
                if msg.get("type") == "utterance":
                    u = msg["utterance"]
                    print(f"[{u['start_ms']}ms] Speaker {u['speaker']}: {u['text']}")
                elif msg.get("type") == "done":
                    is_done = True
                    if not msg.get("trailing_redacted_audio"):
                        break
                elif msg.get("type") == "error":
                    print(f"Error: {msg['error']}")
                    break

        await asyncio.gather(send_audio(), receive_results())

    if audio_clips:
        with open("redacted.mp3", "wb") as f:
            for clip in audio_clips:
                f.write(clip)


if __name__ == "__main__":
    asyncio.run(redact_streaming(AUDIO_FILE))
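The script concatenates the per-utterance MP3 clips into one file. Concatenated MP3 frames usually play fine, but some players misreport the duration; if that happens, a re-encode with ffmpeg normalizes the file (a local cleanup step, nothing the API requires):
ffmpeg -i redacted.mp3 -c:a libmp3lame redacted_fixed.mp3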
Deepfake detection — batch
Endpoint: POST /api/velma-2-synthetic-voice-detection-batch
When to use: You have a recorded audio file and want to know whether it contains AI-generated (synthetic) speech. Results cover the full file, broken into time-windowed frames.
Supported formats: AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM
Pricing: $0.25 / hour of audio
synthetic_voice_detection_batch.py
import os

import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-batch"
AUDIO_FILE = "audio.mp3"

VERDICT_LABELS = {
    "synthetic": "Synthetic (AI-generated)",
    "non-synthetic": "Non-synthetic (human)",
    "no-content": "No content (silence)",
}


def detect_deepfake(filepath: str) -> dict:
    headers = {"X-API-Key": API_KEY}
    with open(filepath, "rb") as f:
        response = requests.post(ENDPOINT, headers=headers, files={"upload_file": f})
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    result = detect_deepfake(AUDIO_FILE)
    print(f"\nFile: {result['filename']}")
    print(f"Duration: {result['duration_ms']} ms")
    print(f"Frames: {len(result['frames'])}\n")
    for frame in result["frames"]:
        label = VERDICT_LABELS.get(frame["verdict"], frame["verdict"])
        print(
            f"  {frame['start_time_ms']:>6}ms – {frame['end_time_ms']:>6}ms  "
            f"{label} (confidence: {frame['confidence']:.2%})"
        )
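The response is per-frame rather than a single file-level verdict. One way to roll the frames up into a summary, appended to the script above (ignoring no-content frames and reporting a simple ratio; the aggregation rule is an illustrative choice, not an official threshold):

voiced = [f for f in result["frames"] if f["verdict"] != "no-content"]
flagged = sum(1 for f in voiced if f["verdict"] == "synthetic")
if voiced:
    print(f"{flagged}/{len(voiced)} voiced frames flagged synthetic ({flagged / len(voiced):.0%})")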
Deepfake detection — streaming (WebSocket)
Endpoint: wss://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-streaming
Required audio format: Raw PCM. The server does not auto-detect the format here; you must specify audio_format, sample_rate, and num_channels. The most common setup is s16le at 16 kHz mono.
Pricing: $0.25 / hour of audio
This model requires raw PCM. If your source is MP3 or WAV, convert first:
ffmpeg -i audio.mp3 -ar 16000 -ac 1 -f s16le audio.raw
synthetic_voice_detection_streaming.py
import asyncio
import json
import os

import websockets
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["MODULATE_API_KEY"]
BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-streaming"
AUDIO_FILE = "audio.raw"
CHUNK_SIZE = 8192


def build_url() -> str:
    # Raw PCM is not self-describing, so the format parameters are required.
    params = {
        "api_key": API_KEY,
        "audio_format": "s16le",
        "sample_rate": "16000",
        "num_channels": "1",
    }
    query = "&".join(f"{k}={v}" for k, v in params.items())
    return f"{BASE_URL}?{query}"


async def stream_detection(filepath: str):
    url = build_url()
    async with websockets.connect(url) as ws:

        async def send_audio():
            with open(filepath, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
                    await asyncio.sleep(0)
            await ws.send("")  # end-of-stream signal

        async def receive_results():
            async for message in ws:
                msg = json.loads(message)
                if msg.get("type") == "frame":
                    frame = msg["frame"]
                    print(
                        f"  {frame['start_time_ms']:>6}ms – {frame['end_time_ms']:>6}ms  "
                        f"{frame['verdict']} (confidence: {frame['confidence']:.2%})"
                    )
                elif msg.get("type") == "done":
                    print(f"\nDone. Total: {msg['duration_ms']} ms, frames: {msg['frame_count']}")
                    break
                elif msg.get("type") == "error":
                    print(f"Error: {msg['error']}")
                    break

        await asyncio.gather(send_audio(), receive_results())


if __name__ == "__main__":
    asyncio.run(stream_detection(AUDIO_FILE))
Next steps
- Enable optional enrichments on the batch STT models (emotion_signal, accent_signal, deepfake_signal, pii_phi_tagging) and inspect the per-utterance fields. See the STT enrichment features guide.
- Try partial_results=true on the STT streaming model to see in-progress text before each utterance is finalized.
- Review the FAQ for questions on rate limits, billing, error handling, and supported audio formats.