Deepfake Detection

Deepfake Detection classifies audio as synthetic, non-synthetic, or no-content across time-windowed frames. Use batch for complete files; use streaming for real-time anti-spoofing checks.

Batch

Send a complete audio file and receive frame-level verdicts for the full clip.

curl -X POST https://platform.modulate.ai/api/velma-2-synthetic-voice-detection-batch \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3"

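import os
import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("audio.mp3", "rb") as audio_file:
    response = requests.post(
        "https://platform.modulate.ai/api/velma-2-synthetic-voice-detection-batch",
        headers={"X-API-Key": os.environ["MODULATE_API_KEY"]},
        files={"upload_file": audio_file},
    )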
response.raise_for_status()
result = response.json()
print(f"Duration: {result['duration_ms']}ms, Frames: {len(result['frames'])}")
for frame in result["frames"]:
    print(f"  {frame['start_time_ms']}ms – {frame['end_time_ms']}ms  {frame['verdict']}  ({frame['confidence']:.0%})")

Expected response

{
  "filename": "audio.mp3",
  "duration_ms": 8000,
  "frames": [
    { "start_time_ms": 0,    "end_time_ms": 2000, "verdict": "non-synthetic", "confidence": 0.94 },
    { "start_time_ms": 2000, "end_time_ms": 4000, "verdict": "non-synthetic", "confidence": 0.91 },
    { "start_time_ms": 4000, "end_time_ms": 6000, "verdict": "synthetic",     "confidence": 0.87 },
    { "start_time_ms": 6000, "end_time_ms": 8000, "verdict": "no-content",    "confidence": 1.00 }
  ]
}

Verdicts:

Verdict	Meaning
`synthetic`	AI-generated voice detected
`non-synthetic`	Human voice detected
`no-content`	Silence or non-voice content

confidence (0–1) reflects the model’s certainty for that frame’s verdict.
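Frame-level verdicts are often easier to act on once adjacent frames are merged into time spans. The following is a minimal sketch, not part of the API: the helper name `synthetic_segments` and the 0.8 confidence threshold are illustrative assumptions, and the input uses the frame fields shown in the expected response above.

```python
from typing import Dict, List


def synthetic_segments(frames: List[dict], min_confidence: float = 0.8) -> List[Dict[str, int]]:
    """Merge adjacent synthetic frames at or above min_confidence into time spans.

    min_confidence is a hypothetical threshold; tune it for your own use case.
    """
    segments: List[Dict[str, int]] = []
    for frame in frames:
        # Skip frames that are not synthetic or fall below the threshold.
        if frame["verdict"] != "synthetic" or frame["confidence"] < min_confidence:
            continue
        # Extend the previous span when this frame starts exactly where it ended.
        if segments and segments[-1]["end_time_ms"] == frame["start_time_ms"]:
            segments[-1]["end_time_ms"] = frame["end_time_ms"]
        else:
            segments.append(
                {"start_time_ms": frame["start_time_ms"], "end_time_ms": frame["end_time_ms"]}
            )
    return segments


# Frames copied from the expected response above.
frames = [
    {"start_time_ms": 0, "end_time_ms": 2000, "verdict": "non-synthetic", "confidence": 0.94},
    {"start_time_ms": 2000, "end_time_ms": 4000, "verdict": "non-synthetic", "confidence": 0.91},
    {"start_time_ms": 4000, "end_time_ms": 6000, "verdict": "synthetic", "confidence": 0.87},
    {"start_time_ms": 6000, "end_time_ms": 8000, "verdict": "no-content", "confidence": 1.00},
]
print(synthetic_segments(frames))  # [{'start_time_ms': 4000, 'end_time_ms': 6000}]
```

Raising `min_confidence` trades missed detections for fewer false alarms; with a threshold of 0.9 the sample clip above yields no flagged spans.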

Streaming (WebSocket)

Connect over WebSocket and receive frame verdicts progressively as audio arrives — useful for real-time anti-spoofing in voice authentication flows.

Deepfake Detection streaming always requires the audio_format query parameter. Raw PCM (16 kHz, mono, signed 16-bit little-endian, s16le) gives the lowest latency and is the most common setup. Container formats (mp3, wav, ogg, flac, webm, and more) are also supported: pass the container’s audio_format (e.g. audio_format=webm) and omit sample_rate and num_channels, since the container carries them.

To convert a file to raw PCM:

ffmpeg -i audio.mp3 -ar 16000 -ac 1 -f s16le audio.raw

websocat "wss://platform.modulate.ai/api/velma-2-synthetic-voice-detection-streaming?api_key=$MODULATE_API_KEY&audio_format=s16le&sample_rate=16000&num_channels=1" \
  --binary - < audio.raw
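The raw PCM and container cases differ only in which query parameters are sent. A small sketch of assembling the streaming URL for either case might look like this; `build_streaming_url` is a hypothetical helper, and only the parameter names (api_key, audio_format, sample_rate, num_channels) come from this guide.

```python
from typing import Optional
from urllib.parse import urlencode

STREAMING_URL = "wss://platform.modulate.ai/api/velma-2-synthetic-voice-detection-streaming"


def build_streaming_url(
    api_key: str,
    audio_format: str,
    sample_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
) -> str:
    """Build the WebSocket URL; leave sample_rate/num_channels unset for container formats."""
    params = {"api_key": api_key, "audio_format": audio_format}
    # Raw PCM has no header, so the stream layout is passed explicitly.
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    if num_channels is not None:
        params["num_channels"] = num_channels
    # urlencode percent-escapes any characters that are unsafe in a query string.
    return f"{STREAMING_URL}?{urlencode(params)}"


# Raw PCM: describe the stream layout in the query string.
print(build_streaming_url("KEY", "s16le", sample_rate=16000, num_channels=1))
# Container format: the file header carries sample rate and channel count.
print(build_streaming_url("KEY", "webm"))
```

Using `urlencode` rather than string formatting keeps the URL valid even if a key contains reserved characters.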

import os, asyncio, json, websockets

API_KEY = os.environ["MODULATE_API_KEY"]
AUDIO_FILE = "audio.raw"
CHUNK_SIZE = 8192

async def stream():
    url = (
        f"wss://platform.modulate.ai/api/velma-2-synthetic-voice-detection-streaming"
        f"?api_key={API_KEY}&audio_format=s16le&sample_rate=16000&num_channels=1"
    )
    async with websockets.connect(url) as ws:
        async def send():
            with open(AUDIO_FILE, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
            await ws.send("")

        async def receive():
            async for message in ws:
                msg = json.loads(message)
                if msg["type"] == "frame":
                    f = msg["frame"]
                    print(f"  {f['start_time_ms']}ms – {f['end_time_ms']}ms  {f['verdict']}  ({f['confidence']:.0%})")
                elif msg["type"] == "done":
                    print(f"Done — {msg['duration_ms']}ms, {msg['frame_count']} frames")
                    break

        await asyncio.gather(send(), receive())

asyncio.run(stream())
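The example above sends the file as fast as it can be read. When testing real-time behaviour from a recording, it can help to know how much audio each chunk represents. The arithmetic below is a sketch for the 16 kHz mono s16le setup used here; `chunk_duration_seconds` is a hypothetical helper, and this guide does not state whether the service expects paced input.

```python
SAMPLE_RATE = 16_000   # samples per second
NUM_CHANNELS = 1       # mono
BYTES_PER_SAMPLE = 2   # signed 16-bit (s16le)
CHUNK_SIZE = 8192      # bytes per WebSocket message, as in the example above


def chunk_duration_seconds(
    chunk_bytes: int,
    sample_rate: int = SAMPLE_RATE,
    num_channels: int = NUM_CHANNELS,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
) -> float:
    """Return how many seconds of raw PCM audio a chunk of chunk_bytes holds."""
    bytes_per_second = sample_rate * num_channels * bytes_per_sample  # 32,000 here
    return chunk_bytes / bytes_per_second


print(chunk_duration_seconds(CHUNK_SIZE))  # 0.256
```

To approximate a live microphone, one option is to `await asyncio.sleep(chunk_duration_seconds(len(chunk)))` after each `ws.send(chunk)` in the sender.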

Example messages received

{ "type": "frame", "frame": { "start_time_ms": 0, "end_time_ms": 2000, "verdict": "non-synthetic", "confidence": 0.94 } }
{ "type": "frame", "frame": { "start_time_ms": 2000, "end_time_ms": 4000, "verdict": "non-synthetic", "confidence": 0.91 } }
{ "type": "done", "duration_ms": 4000, "frame_count": 2 }
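A consumer of these messages typically keeps a running tally and stops at the done message. The sketch below works on already-received JSON strings so it can run without a connection; `tally_verdicts` is a hypothetical helper, and it assumes only the two message types shown above, ignoring any others.

```python
import json
from collections import Counter
from typing import Iterable, Optional, Tuple


def tally_verdicts(raw_messages: Iterable[str]) -> Tuple[Counter, Optional[dict]]:
    """Count verdicts from frame messages and return the done message if one arrives."""
    counts: Counter = Counter()
    summary: Optional[dict] = None
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg["type"] == "frame":
            counts[msg["frame"]["verdict"]] += 1
        elif msg["type"] == "done":
            summary = msg
            break  # nothing meaningful is expected after the done message
    return counts, summary


# The example messages shown above.
messages = [
    '{ "type": "frame", "frame": { "start_time_ms": 0, "end_time_ms": 2000, "verdict": "non-synthetic", "confidence": 0.94 } }',
    '{ "type": "frame", "frame": { "start_time_ms": 2000, "end_time_ms": 4000, "verdict": "non-synthetic", "confidence": 0.91 } }',
    '{ "type": "done", "duration_ms": 4000, "frame_count": 2 }',
]
counts, summary = tally_verdicts(messages)
print(dict(counts))   # {'non-synthetic': 2}
print(summary)        # {'type': 'done', 'duration_ms': 4000, 'frame_count': 2}
```

Comparing `sum(counts.values())` with `summary["frame_count"]` is a cheap check that no frame message was dropped on the client side.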

Deepfake score in transcription

If you need transcription and a deepfake signal, the Multilingual Transcription Batch API supports deepfake_signal=true — it adds a per-utterance deepfake_score without a second API call. Use the dedicated Deepfake Detection APIs when you need frame-level results, explicit no-content verdicts, or streaming verdicts without transcription.

API reference

Deepfake Detection Batch — full parameter and response schema
Deepfake Detection Streaming — WebSocket protocol, PCM format requirements, close codes
