This article explains the mechanics behind Velma-2’s synthetic voice detection: how audio is segmented, how it is scored, and what the results mean. It covers behavior shared between the batch and streaming detection endpoints.
What the model detects
The synthetic voice detection model classifies whether a given segment of audio contains naturally produced human speech or synthetic speech. “Synthetic speech” includes text-to-speech systems, voice cloning, and other AI voice generation methods.
Not all instances of synthetic speech are indicative of harm or deception. For example, Velma-2 is unable to distinguish between someone using accessibility tools to communicate and someone using AI voice cloning for nefarious purposes. Please review your use case and use your best judgment to decide how to interpret the results.
Windowing
Audio is not analyzed as a single unit. Instead, the system divides audio into overlapping windows and scores each independently. This produces frame-level results, where each frame covers a span of time in the original file.
- Batch endpoint: Each frame covers a 4-second window. The file is windowed from start to finish, and all frames are returned in the response once processing is complete.
- Streaming endpoint: The first prediction is emitted once the minimum audio duration has been received. Each subsequent prediction fires 1 second later, with the window growing from time 0. Once the window reaches its full length, the start of the window slides forward so the window size stays constant. This sliding window behavior means you receive incremental verdicts during streaming rather than waiting for the full file.
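As a rough sketch of the two schedules, the snippet below computes the spans each frame would cover. The 4-second batch window and the 1-second streaming step come from the description above; the batch hop size, the streaming minimum duration, and the maximum streaming window length are not restated here, so they are left as parameters.

```python
# Sketch of the windowing behavior described above (not SDK code).
# window_s=4.0 and step_s=1.0 come from the docs; hop_s, min_audio_s, and
# max_window_s are assumptions left as parameters.

def batch_windows(duration_s: float, window_s: float = 4.0, hop_s: float = 4.0):
    """Spans a batch request's frames would cover; hop_s < window_s gives overlap."""
    spans, start = [], 0.0
    while start < duration_s:
        spans.append((start, min(start + window_s, duration_s)))
        start += hop_s
    return spans

def streaming_windows(duration_s: float, min_audio_s: float, max_window_s: float,
                      step_s: float = 1.0):
    """Spans covered by each streaming prediction: grow from time 0, then slide."""
    spans, end = [], min_audio_s
    while end <= duration_s:
        spans.append((max(0.0, end - max_window_s), end))
        end += step_s
    return spans

print(batch_windows(10.0))               # [(0.0, 4.0), (4.0, 8.0), (8.0, 10.0)]
print(streaming_windows(6.0, 0.5, 4.0))  # window grows to 4 s, then slides forward
```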
Silence trimming (batch)
Before windowing, the batch endpoint trims leading and trailing silence from the audio. Frame timestamps in the response reflect positions in the original file, accounting for any trimmed silence. You can use start_time_ms and end_time_ms to match frames back to the source audio without offset calculations.
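Because the timestamps already line up with the original file, a frame's audio can be pulled straight out of the source without any offset math. The sketch below assumes a standard WAV file and a frame dictionary carrying the documented start_time_ms / end_time_ms fields; the exact response shape is not shown here.

```python
import wave

# Minimal sketch: slice the audio behind one frame result back out of the
# original WAV. start_time_ms / end_time_ms are the documented fields; the
# surrounding response structure and the file name are assumptions.

def frame_audio(path: str, frame: dict) -> bytes:
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        start = int(frame["start_time_ms"] * rate / 1000)
        end = int(frame["end_time_ms"] * rate / 1000)
        wav.setpos(start)                   # no offset correction needed
        return wav.readframes(end - start)

clip = frame_audio("call.wav", {"start_time_ms": 4000, "end_time_ms": 8000})
```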
The no-content verdict
Frames that contain only silence or otherwise unusable audio receive a verdict of "no-content" with a confidence of 1.0. These frames are not passed through the detection model — the "no-content" classification is determined before inference.
This means:
- Silent sections of audio are clearly identified in the output rather than producing arbitrary synthetic/non-synthetic guesses.
- You can filter "no-content" frames from your results when building aggregated verdicts (see the sketch after this list).
Confidence scores
Each frame includes a confidence value between 0.0 and 1.0 representing the model’s confidence in the stated verdict — not the probability of the audio being synthetic.
A frame with verdict: "non-synthetic" and confidence: 0.97 means the model is 97% confident the audio is non-synthetic, not that there is a 3% chance it is synthetic. A "no-content" verdict always carries confidence: 1.0.
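A small helper makes the distinction concrete. The "non-synthetic" and "no-content" verdict values appear above; treating the remaining case the same way is an assumption about the response.

```python
# Reads a frame result and phrases the confidence the way it is meant:
# confidence in the stated verdict, not a probability of synthetic speech.

def describe(frame: dict) -> str:
    verdict, conf = frame["verdict"], frame["confidence"]
    if verdict == "no-content":
        return "silence or unusable audio (confidence is always 1.0)"
    return f"{conf:.0%} confident the audio is {verdict}"

print(describe({"verdict": "non-synthetic", "confidence": 0.97}))
# -> 97% confident the audio is non-synthetic
```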
Minimum audio length
Files or streams shorter than 0.5 seconds of usable audio are rejected. Files shorter than one full 4-second window are padded before inference.
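A coarse client-side pre-check against the 0.5-second minimum might look like the following. It only measures total duration; whether the audio is actually usable (non-silent) is determined server-side, so a file that passes this check can still be rejected.

```python
import wave

MIN_SECONDS = 0.5  # documented minimum of usable audio

def long_enough(path: str) -> bool:
    """Rough duration check for a local WAV file before uploading."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate() >= MIN_SECONDS
```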
How this relates to synthetic voice scoring in STT Batch
The STT Batch API includes an optional deepfake_signal parameter that adds a deepfake_score to each transcribed utterance. This is a per-utterance score (not per-frame), and is only available when you are already transcribing.
Use deepfake_signal in the STT Batch API when synthetic voice detection is a supplementary signal alongside transcription. Use the dedicated Synthetic Voice Detection APIs when detection is the primary goal, when you need frame-level results across the full file, or when you need streaming verdicts.
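For illustration, consuming the per-utterance signal might look like the sketch below. Only the deepfake_signal parameter and the deepfake_score field come from the docs; the utterance shape, the assumption that a higher score means a stronger synthetic-voice signal, and the 0.8 threshold are all hypothetical.

```python
# Hypothetical consumer of STT Batch results with deepfake_signal enabled.
# deepfake_score is the documented field name; everything else here (the
# utterance dicts, the "text" key, the threshold, score direction) is assumed.

def flag_suspect_utterances(utterances: list[dict], threshold: float = 0.8) -> list[dict]:
    return [u for u in utterances if u.get("deepfake_score", 0.0) >= threshold]

utterances = [
    {"text": "hello there", "deepfake_score": 0.12},
    {"text": "please wire the funds", "deepfake_score": 0.91},
]
print(flag_suspect_utterances(utterances))  # keeps only the second utterance
```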
Related
- STT enrichment features — deepfake_signal parameter
- Which API should I use?