

This article explains the mechanics behind Velma-2’s synthetic voice detection: how audio is segmented and scored, and what the results mean. It covers behavior shared between the batch and streaming detection endpoints.

What the model detects

The synthetic voice detection model classifies whether a given segment of audio contains naturally produced human speech or synthetic speech. “Synthetic speech” includes text-to-speech systems, voice cloning, and other AI voice generation methods.
Not all instances of synthetic speech are indicative of harm or deception. For example, Velma-2 is unable to distinguish between someone using accessibility tools to communicate and someone using AI voice cloning for nefarious purposes. Please review your use case and use your best judgment to decide how to interpret the results.
The model operates on audio segments and analyzes the acoustic characteristics, not the content of the words spoken.

Windowing

Audio is not analyzed as a single unit. Instead, the system divides audio into overlapping windows and scores each independently. This produces frame-level results, where each frame covers a span of time in the original file.
  • Batch endpoint: Each frame covers a 4-second window. The file is windowed from start to finish, and all frames are returned in the response once processing is complete.
  • Streaming endpoint: The first prediction is emitted once the minimum audio duration is received, and each subsequent prediction fires 1 second later. The window initially grows from time 0; once it reaches its full length, its start slides forward so the window size stays constant. This sliding-window behavior means you receive incremental verdicts during streaming rather than waiting for the full file.
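The windowing rules above can be sketched in a few lines. This is an illustration, not the service's implementation: the batch hop size (how much consecutive windows overlap) is not specified here, so `hop_s` is a placeholder, and the streaming maximum window length is assumed to match the 4-second batch window.

```python
def batch_frames(duration_s, window_s=4.0, hop_s=4.0):
    """Sketch of batch windowing: fixed-length frames from start to finish.
    The doc says windows overlap, but the hop size is unspecified, so the
    hop_s default is a placeholder. The API's own start/end timestamps are
    authoritative."""
    frames = []
    start = 0.0
    while start < duration_s:
        frames.append((start, min(start + window_s, duration_s)))
        start += hop_s
    return frames

def streaming_emissions(received_s, min_s=0.5, step_s=1.0, max_window_s=4.0):
    """Sketch of the streaming schedule: first prediction once min_s of audio
    has arrived, then one every step_s. The window grows from time 0 until it
    reaches max_window_s, after which its start slides forward. max_window_s
    is an assumption for illustration."""
    emissions = []
    t = min_s
    while t <= received_s:
        emissions.append((max(0.0, t - max_window_s), t))
        t += step_s
    return emissions
```

For example, a 6-second stream yields predictions at 0.5 s, 1.5 s, …, 5.5 s, with the last window covering roughly 1.5 s to 5.5 s.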

Silence trimming (batch)

Before windowing, the batch endpoint trims leading and trailing silence from the audio. Frame timestamps in the response reflect positions in the original file, accounting for any trimmed silence. You can use start_time_ms and end_time_ms to match frames back to the source audio without offset calculations.

The no-content verdict

Frames that contain only silence or otherwise unusable audio receive a verdict of "no-content" with a confidence of 1.0. These frames are not passed through the detection model — the "no-content" classification is determined before inference. This means:
  • Silent sections of audio are clearly identified in the output rather than producing arbitrary synthetic/non-synthetic guesses.
  • You can filter "no-content" frames from your results when building aggregated verdicts.
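One way to build an aggregated verdict is to drop "no-content" frames and vote over the rest. The majority-vote rule and threshold below are illustrative choices, not part of the API:

```python
def aggregate_verdict(frames, threshold=0.5):
    """Filter out 'no-content' frames, then call the clip synthetic if the
    fraction of remaining frames with a 'synthetic' verdict exceeds
    `threshold`. Both the voting rule and the threshold are assumptions
    for illustration."""
    voiced = [f for f in frames if f["verdict"] != "no-content"]
    if not voiced:
        return "no-content"
    synthetic = sum(1 for f in voiced if f["verdict"] == "synthetic")
    return "synthetic" if synthetic / len(voiced) > threshold else "non-synthetic"
```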

Confidence scores

Each frame includes a confidence value between 0.0 and 1.0 representing the model’s confidence in the stated verdict — not the probability of the audio being synthetic. A frame with verdict: "non-synthetic" and confidence: 0.97 means the model is 97% confident the audio is non-synthetic, not that there is a 3% chance it is synthetic. A "no-content" verdict always carries confidence: 1.0.
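Because confidence is attached to the stated verdict, converting a frame into a single "probability of synthetic" estimate requires inverting the score for non-synthetic verdicts. A minimal sketch:

```python
def p_synthetic(frame):
    """Estimate P(synthetic) from a frame's (verdict, confidence) pair.
    Confidence describes the stated verdict, so a 'non-synthetic' verdict
    is inverted. 'no-content' frames carry no synthetic/non-synthetic
    information; returning None for them is an illustrative convention."""
    if frame["verdict"] == "synthetic":
        return frame["confidence"]
    if frame["verdict"] == "non-synthetic":
        return 1.0 - frame["confidence"]
    return None  # no-content
```

So the example from the text, verdict "non-synthetic" with confidence 0.97, maps to roughly a 0.03 estimated probability of synthetic speech.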

Minimum audio length

Files or streams shorter than 0.5 seconds of usable audio are rejected. Files shorter than one full 4-second window are padded before inference.
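A client-side check mirroring these length rules can avoid rejected requests. The padding here is zero-padding, which is an assumption; the doc does not specify how the server pads short files:

```python
def prepare_audio(samples, sample_rate, min_s=0.5, window_s=4.0):
    """Mirror the service's length rules client-side: reject audio shorter
    than 0.5 s, and pad anything shorter than one 4-second window.
    Zero-padding is an assumption; the server's padding method is not
    documented here."""
    duration = len(samples) / sample_rate
    if duration < min_s:
        raise ValueError("audio shorter than 0.5 s of usable audio is rejected")
    window_len = int(window_s * sample_rate)
    if len(samples) < window_len:
        samples = list(samples) + [0.0] * (window_len - len(samples))
    return samples
```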

How this relates to synthetic voice scoring in STT Batch

The STT Batch API includes an optional deepfake_signal parameter that adds a deepfake_score to each transcribed utterance. This is a per-utterance score (not per-frame), and is only available when you are already transcribing. Use deepfake_signal in the STT Batch API when synthetic voice detection is a supplementary signal alongside transcription. Use the dedicated Synthetic Voice Detection APIs when detection is the primary goal, when you need frame-level results across the full file, or when you need streaming verdicts.
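When using the supplementary signal, each transcribed utterance carries its own deepfake_score. The response layout below is hypothetical (only the `deepfake_score` field name comes from this doc; `utterances` and `text` are assumed for illustration):

```python
def utterance_scores(stt_response):
    """Extract (text, deepfake_score) pairs from an STT Batch response.
    Hypothetical shape: assumes a top-level 'utterances' list whose entries
    carry 'text' and, when deepfake_signal was requested, 'deepfake_score'.
    Only the deepfake_score field name is taken from the docs."""
    return [
        (u.get("text", ""), u.get("deepfake_score"))
        for u in stt_response.get("utterances", [])
    ]
```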