AI Music Detection Streaming

Real-time AI music detection over WebSocket. The client streams audio and receives per-window vocal AI verdicts as they become available, followed by a final clip-level summary - including instrumental AI detection - on completion.

Endpoint

wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming

Authentication

Pass your API key as a query parameter on the connection URL:

wss://.../velma-2-ai-music-detection-streaming?api_key=YOUR_API_KEY&audio_format=mp3

Connection parameters

Parameter	Required	Description
`api_key`	Yes	Your API key
`audio_format`	Yes	Audio format - see supported formats below
`sample_rate`	Raw PCM only	Sample rate in Hz
`num_channels`	Raw PCM only	Number of channels (1-8)

Supported audio formats

Container formats - sample_rate and num_channels must not be specified (the headers already carry this metadata): wav, mp3, ogg, flac, webm, aac, aiff

Raw PCM formats - sample_rate and num_channels are required: s8, s16le, s16be, s24le, s24be, s32le, s32be, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw

Valid sample rates: 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000
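The parameter rules above can be checked on the client before connecting. The following is a minimal sketch, not an official SDK: the helper name build_url is hypothetical, and it assumes the listed sample rates apply to raw PCM formats, since sample_rate is only accepted there.

```python
from urllib.parse import urlencode

ENDPOINT = "wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming"

CONTAINER_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "aac", "aiff"}
RAW_PCM_FORMATS = {
    "s8", "s16le", "s16be", "s24le", "s24be", "s32le", "s32be",
    "u8", "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
}
VALID_SAMPLE_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000}


def build_url(api_key, audio_format, sample_rate=None, num_channels=None):
    """Build the connection URL (hypothetical helper), enforcing the documented rules."""
    params = {"api_key": api_key, "audio_format": audio_format}
    if audio_format in CONTAINER_FORMATS:
        # Container headers already carry this metadata, so it must be omitted.
        if sample_rate is not None or num_channels is not None:
            raise ValueError("sample_rate/num_channels must not be set for container formats")
    elif audio_format in RAW_PCM_FORMATS:
        if sample_rate not in VALID_SAMPLE_RATES:
            raise ValueError(f"unsupported sample_rate: {sample_rate}")
        if num_channels is None or not 1 <= num_channels <= 8:
            raise ValueError("num_channels must be between 1 and 8")
        params["sample_rate"] = sample_rate
        params["num_channels"] = num_channels
    else:
        raise ValueError(f"unknown audio_format: {audio_format}")
    return f"{ENDPOINT}?{urlencode(params)}"
```

Validating locally avoids opening a connection that the server would reject for invalid query parameters.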

Protocol

Client -> server

Message	Description
Binary frame	Chunk of audio bytes in the declared format (any size)
Empty text frame `""`	Signals end of stream
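Binary frames may be any size, but for raw PCM it can be convenient to send chunks that cover a fixed duration and contain only whole sample frames. The sketch below shows that arithmetic. The helper name pcm_chunk_bytes and the 100 ms default are illustrative choices, not API requirements; the byte widths follow from the format names.

```python
# Bytes per single-channel sample for each raw PCM format name.
BYTES_PER_SAMPLE = {
    "s8": 1, "u8": 1, "mulaw": 1, "alaw": 1,
    "s16le": 2, "s16be": 2, "u16le": 2, "u16be": 2,
    "s24le": 3, "s24be": 3, "u24le": 3, "u24be": 3,
    "s32le": 4, "s32be": 4, "u32le": 4, "u32be": 4,
    "f32le": 4, "f32be": 4,
    "f64le": 8, "f64be": 8,
}


def pcm_chunk_bytes(audio_format, sample_rate, num_channels, chunk_ms=100):
    """Size in bytes of a chunk holding chunk_ms of raw PCM audio.

    The result is always a whole number of sample frames, so no sample
    is split across two binary frames.
    """
    frame_bytes = BYTES_PER_SAMPLE[audio_format] * num_channels
    frames = sample_rate * chunk_ms // 1000  # whole frames only
    return frames * frame_bytes
```

For example, pcm_chunk_bytes("s16le", 16000, 1) gives 3200 bytes per 100 ms, and passing chunk_ms=4000 gives 128000 bytes for four seconds of the same audio.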

Server -> client

Message	Description
`{"type": "window", "window": {...}}`	Per-window result - emitted for each completed 4-second window, in order
`{"type": "done", ...}`	Stream complete - clip-level verdict plus instrumental AI detection
`{"type": "error", "error": "..."}`	An error occurred - connection will close

Vocal AI detection runs on each 4-second window as audio arrives and is reported in window messages. Instrumental AI detection runs on the accumulated audio at end-of-stream, so instrumental_ai_percentage and instrumental_ai_confidence appear only in the final done message.
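The exchange described above might look as follows in a client. This is a sketch under stated assumptions, not a reference implementation: it uses the third-party websockets package, the names read_chunks, handle_message, and detect are hypothetical, and it assumes the done fields sit at the top level of the done message. Audio is sent from a background task so that window messages can be read while the upload is still in progress.

```python
import asyncio
import json

CHUNK_SIZE = 32 * 1024  # arbitrary; the server accepts chunks of any size


def read_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield the file's bytes in fixed-size chunks."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk


def handle_message(raw, windows):
    """Process one server message.

    Appends window results to `windows`, returns the done message when the
    stream completes (None otherwise), and raises on an error message.
    """
    msg = json.loads(raw)
    if msg["type"] == "window":
        windows.append(msg["window"])
        return None
    if msg["type"] == "done":
        return msg
    if msg["type"] == "error":
        raise RuntimeError(msg["error"])
    raise ValueError(f"unexpected message type: {msg['type']!r}")


async def detect(url, path):
    """Stream a file to the endpoint and return (windows, done)."""
    import websockets  # third-party: pip install websockets

    windows = []
    async with websockets.connect(url) as ws:

        async def send_audio():
            for chunk in read_chunks(path):
                await ws.send(chunk)  # bytes are sent as a binary frame
            await ws.send("")  # empty text frame signals end of stream

        sender = asyncio.create_task(send_audio())
        try:
            async for raw in ws:
                done = handle_message(raw, windows)
                if done is not None:
                    return windows, done
        finally:
            sender.cancel()
    raise ConnectionError("connection closed before a done message arrived")
```

A call such as asyncio.run(detect(url, "song.mp3")) would then return the per-window results together with the final summary, where url is the connection URL including api_key and audio_format.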

Window object

Field	Type	Description
`start_time_ms`	integer	Window start time in milliseconds
`end_time_ms`	integer	Window end time in milliseconds
`vocal_percentage`	float	Percentage of the window containing vocal content (0-100)
`vocal_ai_percentage`	float	`100` if the window is classified as AI-generated vocals, `0` otherwise (always `0` without sufficient vocal content)
`vocal_ai_confidence`	float	Confidence the window contains AI-generated vocals (0-1); `0` without sufficient vocal content
`instrumental_percentage`	float	Percentage of the window containing instrumental music content (0-100)
`silence_percentage`	float	Percentage of the window containing neither vocal nor instrumental content (0-100)
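As an illustration of working with window objects, the sketch below loads them into a small dataclass and merges back-to-back flagged windows into time spans. The class and function names are hypothetical. A window is treated as flagged when vocal_ai_percentage is above 0, which relies on the documented behaviour that the field is either 100 or 0.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Window:
    start_time_ms: int
    end_time_ms: int
    vocal_percentage: float
    vocal_ai_percentage: float
    vocal_ai_confidence: float
    instrumental_percentage: float
    silence_percentage: float

    @classmethod
    def from_dict(cls, payload):
        # Ignore any keys this sketch does not know about.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in names})

    @property
    def is_ai_vocal(self):
        return self.vocal_ai_percentage > 0


def ai_vocal_spans(windows):
    """Merge contiguous AI-flagged windows into (start_ms, end_ms) spans."""
    spans = []
    for w in windows:
        if not w.is_ai_vocal:
            continue
        if spans and spans[-1][1] == w.start_time_ms:
            # Extends the previous span without a gap.
            spans[-1] = (spans[-1][0], w.end_time_ms)
        else:
            spans.append((w.start_time_ms, w.end_time_ms))
    return spans
```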

Done object

Field	Type	Description
`duration_ms`	integer	Total duration of the streamed audio in milliseconds
`window_count`	integer	Total number of windows analysed during the session
`primary_verdict`	string	Clip-level classification: `"ai-vocal-music"`, `"ai-instrumental"`, or `"not-ai-music"`
`vocal_percentage`	float	Average percentage of audio with vocal content, across all windows (0-100)
`vocal_ai_percentage`	float	Percentage of clip duration classified as AI-generated vocals (0-100). This is a clip-level measure, so it may differ from what the per-window messages suggest
`vocal_ai_confidence`	float	Average confidence that vocal windows contain AI-generated vocals (0-1)
`instrumental_percentage`	float	Average percentage of audio with instrumental content, across all windows (0-100)
`instrumental_ai_percentage`	float	AI detection score for the clip’s instrumental content (0-100)
`instrumental_ai_confidence`	float	Confidence in the instrumental AI assessment for the full clip (0-1)
`silence_percentage`	float	Average percentage of audio with neither vocal nor instrumental content (0-100)
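The done fields can be condensed into a one-line summary for logging. This is a minimal sketch: summarize_done and the human-readable labels are hypothetical, and it assumes the fields above are available as keys of a dict named done.

```python
VERDICT_LABELS = {
    "ai-vocal-music": "AI-generated vocals",
    "ai-instrumental": "AI-generated instrumental",
    "not-ai-music": "no AI music detected",
}


def summarize_done(done):
    """Format the clip-level result as a single line of text."""
    verdict = done["primary_verdict"]
    if verdict not in VERDICT_LABELS:
        raise ValueError(f"unexpected primary_verdict: {verdict!r}")
    seconds = done["duration_ms"] / 1000
    return (
        f"{VERDICT_LABELS[verdict]} "
        f"({seconds:.1f} s, {done['window_count']} windows): "
        f"vocal AI {done['vocal_ai_percentage']:.0f}% "
        f"(confidence {done['vocal_ai_confidence']:.2f}), "
        f"instrumental AI {done['instrumental_ai_percentage']:.0f}% "
        f"(confidence {done['instrumental_ai_confidence']:.2f})"
    )
```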

WebSocket close codes

Code	Meaning
`1000`	Normal closure after the `done` message, or after the connection completes
`1003`	Invalid or missing query parameters (unknown format, bad sample rate, missing `audio_format`)
`4002`	Audio could not be decoded, or does not match the declared raw `audio_format`
`4003`	The request could not be validated, or the request is not permitted
`4029`	The request could not be completed due to insufficient credits

An error message is sent before the connection closes for the 1003, 4002, 4003, and 4029 cases.
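A client can map these codes to readable errors when the connection closes. The sketch below is illustrative only: StreamClosedError and raise_for_close are hypothetical names, and the messages paraphrase the table above.

```python
CLOSE_CODES = {
    1000: "normal closure",
    1003: "invalid or missing query parameters",
    4002: "audio could not be decoded or does not match the declared format",
    4003: "request could not be validated or is not permitted",
    4029: "insufficient credits",
}


class StreamClosedError(Exception):
    """Raised when the server closes the stream with a non-normal code."""

    def __init__(self, code, reason):
        super().__init__(f"{code}: {reason}")
        self.code = code


def raise_for_close(code):
    """Return quietly on a normal closure, raise for anything else."""
    if code == 1000:
        return
    reason = CLOSE_CODES.get(code, "unrecognized close code")
    raise StreamClosedError(code, reason)
```

Since an error message precedes the close for the documented error codes, its text can be logged alongside the close code for more detail.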

Rate limits

Concurrent connection limits apply per organization
Monthly usage limits (in audio hours) apply per organization
