
Real-time synthetic voice detection over WebSocket. Streams audio to the server and receives per-frame verdicts (synthetic, non-synthetic, or no-content) with confidence scores as analysis windows complete. For a conceptual explanation of how detection works — including windowing, silence trimming, and the no-content verdict — see How synthetic voice detection works.

Endpoint

```
wss://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-streaming
```

Authentication

Pass your API key as a query parameter when opening the connection.
```
wss://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-streaming?api_key=YOUR_API_KEY&audio_format=s16le&sample_rate=16000&num_channels=1
```
Unlike the batch endpoint, the streaming API does not use an X-API-Key header. The key must be in the query string at connection time.

Query parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| api_key | string | Yes | Your API key |
| audio_format | string | Yes | Audio encoding format; see Audio formats and preprocessing |
| sample_rate | integer | Conditional | Required for raw (headerless) formats. One of: 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000 |
| num_channels | integer | Conditional | Required for raw formats. 1–8 |
For supported format values and format selection guidance, see Audio formats and preprocessing.
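As a sketch, these parameters can be assembled into the connection URL with Python's standard library (the build_stream_url helper name is illustrative, not part of an official SDK):

```python
from urllib.parse import urlencode

BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-synthetic-voice-detection-streaming"

def build_stream_url(api_key, audio_format, sample_rate=None, num_channels=None):
    """Build the WebSocket URL with the required query parameters.

    sample_rate and num_channels are only required for raw (headerless)
    formats, so they are appended only when provided.
    """
    params = {"api_key": api_key, "audio_format": audio_format}
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    if num_channels is not None:
        params["num_channels"] = num_channels
    return f"{BASE_URL}?{urlencode(params)}"
```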

Connection flow

  1. Connect with api_key, audio_format, and (for raw formats) sample_rate and num_channels.
  2. Stream audio as binary WebSocket frames. Frames can be any size.
  3. Receive frame JSON messages as analysis windows complete.
  4. Send an empty text frame ("") to signal end of audio.
  5. Receive a done message with total duration and frame count.
  6. The connection closes automatically.
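The six steps above can be sketched with the third-party websockets package (pip install websockets). This is a minimal client for short clips; for long streams you would read results concurrently with sending rather than only after the end-of-audio marker:

```python
import asyncio
import json

def parse_message(raw: str) -> dict:
    """Decode one server text frame; raise on an `error` message."""
    msg = json.loads(raw)
    if msg.get("type") == "error":
        raise RuntimeError(msg["error"])
    return msg

async def stream_file(url: str, path: str, chunk_size: int = 32_768) -> dict:
    import websockets  # third-party dependency, imported lazily

    async with websockets.connect(url) as ws:       # step 1: connect
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                await ws.send(chunk)                # step 2: binary audio frames
        await ws.send("")                           # step 4: end-of-audio marker
        async for raw in ws:                        # steps 3 and 5: results
            msg = parse_message(raw)
            if msg["type"] == "frame":
                print(msg["frame"])
            elif msg["type"] == "done":
                return msg                          # step 6: server then closes

# Usage (needs a valid key and a raw PCM file):
# asyncio.run(stream_file("wss://...?api_key=...", "audio.s16le"))
```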

Server messages

Frame result

Sent each time an analysis window is complete.
```json
{
  "type": "frame",
  "frame": {
    "start_time_ms": 0,
    "end_time_ms": 4000,
    "verdict": "synthetic",
    "confidence": 0.9732
  }
}
```
| Field | Type | Description |
| --- | --- | --- |
| start_time_ms | integer | Frame start time in the audio stream (ms) |
| end_time_ms | integer | Frame end time in the audio stream (ms) |
| verdict | string | "synthetic", "non-synthetic", or "no-content" |
| confidence | float | Confidence in the verdict, 0.0–1.0 |
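The API reports verdicts per analysis window only. If you need a single clip-level verdict, one possible convention (ours, not defined by the API) is to weight each frame's verdict by its duration and ignore no-content frames:

```python
def summarize(frames):
    """Roll per-frame verdicts up to one clip-level verdict.

    Duration-weighted majority over frames with speech content; this
    aggregation is an illustrative convention, not part of the API.
    """
    totals = {}
    for f in frames:
        if f["verdict"] == "no-content":
            continue  # silence / no speech: skip for the clip verdict
        dur = f["end_time_ms"] - f["start_time_ms"]
        totals[f["verdict"]] = totals.get(f["verdict"], 0) + dur
    if not totals:
        return "no-content"
    return max(totals, key=totals.get)
```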

Done

```json
{
  "type": "done",
  "duration_ms": 12500,
  "frame_count": 10
}
```
| Field | Type | Description |
| --- | --- | --- |
| duration_ms | integer | Total duration of the streamed audio in milliseconds |
| frame_count | integer | Total number of frames analyzed |

Error

```json
{
  "type": "error",
  "error": "Invalid audio_format='mp4'. Valid values: ['aac', 'aiff', ...]"
}
```

WebSocket close codes

| Code | Meaning |
| --- | --- |
| 1000 | Normal closure after a successful done message |
| 1003 | Invalid query parameters (bad format, sample rate, or channels) |
| 4002 | Audio data does not match the declared format |
| 4003 | Authentication failed or usage denied |
| 1011 | Server error during streaming |
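A client can use these codes to decide whether reconnecting makes sense. The mapping below is a suggested retry policy, not behavior mandated by the API:

```python
# Close codes from the table above; the retry policy itself is our suggestion.
RETRYABLE = {1011}          # transient server error: safe to retry
FATAL = {1003, 4002, 4003}  # client/config problem: fix before reconnecting

def should_retry(close_code: int) -> bool:
    if close_code == 1000:
        return False        # normal closure after `done`, nothing to retry
    if close_code in FATAL:
        return False        # bad params, mismatched audio, or auth/usage denial
    return close_code in RETRYABLE
```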

Rate limits

  • Concurrent connection limits apply per organization.
  • Monthly usage limits (in audio hours) apply per organization.
  • Connections that exceed limits are rejected during the WebSocket handshake with close code 4003.