Velma-2 synthetic voice detection (SVD) classifies audio as synthetic, non-synthetic, or no-content across time-windowed frames. Use batch for complete files; use streaming for real-time anti-spoofing checks.

Batch

Send a complete audio file and receive frame-level verdicts for the full clip.
curl -X POST https://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-batch \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3"
Example response:
{
  "filename": "audio.mp3",
  "duration_ms": 8000,
  "frames": [
    { "start_time_ms": 0,    "end_time_ms": 2000, "verdict": "non-synthetic", "confidence": 0.94 },
    { "start_time_ms": 2000, "end_time_ms": 4000, "verdict": "non-synthetic", "confidence": 0.91 },
    { "start_time_ms": 4000, "end_time_ms": 6000, "verdict": "synthetic",     "confidence": 0.87 },
    { "start_time_ms": 6000, "end_time_ms": 8000, "verdict": "no-content",    "confidence": 1.00 }
  ]
}
Verdicts:
  • synthetic: AI-generated voice detected
  • non-synthetic: Human voice detected
  • no-content: Silence or non-voice content
confidence (0–1) reflects the model’s certainty for that frame’s verdict.
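As a rough illustration of working with the batch response, the sketch below merges consecutive frames that share a verdict into longer segments and then keeps only the synthetic segments above a confidence cutoff. The helper names (merge_frames, synthetic_segments) and the 0.8 cutoff are assumptions made for this example, not part of the API; the sample data is the response shown above.

```python
import json

# Sample response copied from the batch example above.
SAMPLE_RESPONSE = """
{
  "filename": "audio.mp3",
  "duration_ms": 8000,
  "frames": [
    { "start_time_ms": 0,    "end_time_ms": 2000, "verdict": "non-synthetic", "confidence": 0.94 },
    { "start_time_ms": 2000, "end_time_ms": 4000, "verdict": "non-synthetic", "confidence": 0.91 },
    { "start_time_ms": 4000, "end_time_ms": 6000, "verdict": "synthetic",     "confidence": 0.87 },
    { "start_time_ms": 6000, "end_time_ms": 8000, "verdict": "no-content",    "confidence": 1.00 }
  ]
}
"""


def merge_frames(frames):
    """Merge adjacent frames with the same verdict into segments.

    Each segment keeps the lowest confidence seen among its frames.
    (Hypothetical helper; the input frames are not modified.)
    """
    segments = []
    for frame in frames:
        last = segments[-1] if segments else None
        if (
            last is not None
            and last["verdict"] == frame["verdict"]
            and last["end_time_ms"] == frame["start_time_ms"]
        ):
            # Extend the current segment to cover this frame.
            last["end_time_ms"] = frame["end_time_ms"]
            last["min_confidence"] = min(last["min_confidence"], frame["confidence"])
        else:
            segments.append(
                {
                    "start_time_ms": frame["start_time_ms"],
                    "end_time_ms": frame["end_time_ms"],
                    "verdict": frame["verdict"],
                    "min_confidence": frame["confidence"],
                }
            )
    return segments


def synthetic_segments(response, min_confidence=0.8):
    """Return merged segments flagged synthetic at or above the cutoff.

    The 0.8 default is an arbitrary illustration, not a recommended value.
    """
    return [
        seg
        for seg in merge_frames(response["frames"])
        if seg["verdict"] == "synthetic" and seg["min_confidence"] >= min_confidence
    ]


if __name__ == "__main__":
    response = json.loads(SAMPLE_RESPONSE)
    for seg in synthetic_segments(response):
        print(seg["start_time_ms"], seg["end_time_ms"], seg["min_confidence"])
```

Tracking the minimum confidence per segment is a conservative choice: a merged segment is only as certain as its least certain frame. An average or a maximum may suit other use cases better.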

Streaming (WebSocket)

Connect over WebSocket and receive frame verdicts progressively as audio arrives. This is useful for real-time anti-spoofing in voice authentication flows.
SVD streaming requires raw PCM audio. Encoded and container formats (MP3, WAV, etc.) must be decoded before sending. The most common setup is 16 kHz mono, signed 16-bit little-endian (s16le):
ffmpeg -i audio.mp3 -ar 16000 -ac 1 -f s16le audio.raw
websocat "wss://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-streaming?api_key=$MODULATE_API_KEY&audio_format=s16le&sample_rate=16000&num_channels=1" \
  --binary - < audio.raw
Example messages:
{ "type": "frame", "frame": { "start_time_ms": 0, "end_time_ms": 2000, "verdict": "non-synthetic", "confidence": 0.94 } }
{ "type": "frame", "frame": { "start_time_ms": 2000, "end_time_ms": 4000, "verdict": "non-synthetic", "confidence": 0.91 } }
{ "type": "done", "duration_ms": 4000, "frame_count": 2 }
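The message flow above can be handled with a small amount of client-side bookkeeping. The sketch below is a minimal, offline illustration: it collects frame messages until a done message arrives, and it computes how many milliseconds of audio a raw PCM buffer represents so the result can be compared with duration_ms. The function names are hypothetical, the defaults assume the 16 kHz mono s16le setup described above, and only the two message types shown in the example are handled; the full protocol may define others.

```python
import json

# Sample messages copied from the streaming example above.
SAMPLE_MESSAGES = [
    '{ "type": "frame", "frame": { "start_time_ms": 0, "end_time_ms": 2000, "verdict": "non-synthetic", "confidence": 0.94 } }',
    '{ "type": "frame", "frame": { "start_time_ms": 2000, "end_time_ms": 4000, "verdict": "non-synthetic", "confidence": 0.91 } }',
    '{ "type": "done", "duration_ms": 4000, "frame_count": 2 }',
]


def pcm_duration_ms(num_bytes, sample_rate=16000, num_channels=1, bytes_per_sample=2):
    """Duration in milliseconds of a raw PCM buffer.

    s16le uses 2 bytes per sample, so 16 kHz mono is 32,000 bytes per second.
    """
    bytes_per_second = sample_rate * num_channels * bytes_per_sample
    return num_bytes * 1000 // bytes_per_second


def collect_frames(messages):
    """Collect frame payloads until a done message is seen.

    `messages` is any iterable of JSON text messages, for example the
    text frames read from a WebSocket connection. Returns (frames, done),
    where done is None if the stream ended without a done message.
    """
    frames = []
    done = None
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("type") == "frame":
            frames.append(msg["frame"])
        elif msg.get("type") == "done":
            done = msg
            break
        # Any other message type is ignored in this sketch.
    return frames, done


if __name__ == "__main__":
    frames, done = collect_frames(SAMPLE_MESSAGES)
    print(len(frames), "frames received")
    if done is not None and done["frame_count"] != len(frames):
        print("frame_count mismatch: some frames may have been missed")
    # 128,000 bytes of 16 kHz mono s16le audio is 4 seconds.
    print(pcm_duration_ms(128000), "ms of audio in a 128000-byte buffer")
```

Comparing frame_count with the number of frames actually received is a cheap sanity check before acting on the verdicts.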

Deepfake score in STT

If you need both transcription and a synthetic voice signal, the STT Batch API supports deepfake_signal=true, which adds a per-utterance deepfake_score without a second API call. Use the dedicated SVD APIs when you need frame-level results, explicit no-content verdicts, or streaming verdicts without transcription.

API reference

  • SVD Batch — full parameter and response schema
  • SVD Streaming — WebSocket protocol, PCM format requirements, close codes