primary_verdict of ai-vocal-music, ai-instrumental, or not-ai-music.
This is distinct from music detection, which classifies audio as music, speech, or neither. AI music detection answers a different question: is this music AI-generated?
How windows are routed
Each window is routed to one of two detection paths based on its vocal content:
- Vocal windows (sufficient vocal content) are classified for AI-generated vocals, producing vocal_ai_percentage and vocal_ai_confidence. Their instrumental_ai_* fields are 0.
- Instrumental windows (insufficient vocal content) are classified for AI-generated instrumental content, producing instrumental_ai_percentage and instrumental_ai_confidence. Their vocal_ai_* fields are 0.
primary_verdict is ai-vocal-music when the vocal AI percentage exceeds the vocal threshold; otherwise ai-instrumental when enough non-vocal, non-silent windows exceed the instrumental threshold; otherwise not-ai-music.
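The decision order described above can be sketched in a few lines of Python. This is an illustration only: the threshold values, the minimum window count, and the `Window` fields `is_vocal` and `is_silent` are assumptions made for the example, since the actual values and internal representation are not documented here.

```python
from dataclasses import dataclass

# Placeholder values for illustration only; the real thresholds are not documented.
VOCAL_THRESHOLD = 50.0
INSTRUMENTAL_THRESHOLD = 50.0
MIN_INSTRUMENTAL_WINDOWS = 2  # hypothetical stand-in for "enough" windows


@dataclass
class Window:
    is_vocal: bool    # hypothetical flag: window was routed to the vocal path
    is_silent: bool   # hypothetical flag: window is silence
    vocal_ai_percentage: float = 0.0
    instrumental_ai_percentage: float = 0.0


def primary_verdict(vocal_ai_percentage: float, windows: list[Window]) -> str:
    """Apply the documented precedence: vocal first, then instrumental, else not AI."""
    if vocal_ai_percentage > VOCAL_THRESHOLD:
        return "ai-vocal-music"
    # Count only non-vocal, non-silent windows that exceed the instrumental threshold.
    flagged = sum(
        1
        for w in windows
        if not w.is_vocal
        and not w.is_silent
        and w.instrumental_ai_percentage > INSTRUMENTAL_THRESHOLD
    )
    if flagged >= MIN_INSTRUMENTAL_WINDOWS:
        return "ai-instrumental"
    return "not-ai-music"
```

Note that the vocal check takes precedence: a clip whose vocal AI percentage exceeds the vocal threshold is reported as ai-vocal-music regardless of its instrumental windows.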
Batch
Send a complete audio file and receive a clip-level verdict plus a per-window breakdown.
Expected response
vocal_percentage, instrumental_percentage, and silence_percentage are averages of the per-window values; these are not mutually exclusive, since a window can contain both vocals and instrumental music.
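To see why these three values need not sum to 100, here is a small sketch that averages per-window values. It assumes each per-window entry is a dict carrying the same three field names as the clip-level result; that layout is an assumption for the example, not a documented schema.

```python
from statistics import fmean

# Assumed per-window field names, mirroring the clip-level fields.
FIELDS = ("vocal_percentage", "instrumental_percentage", "silence_percentage")


def clip_percentages(windows: list[dict]) -> dict:
    """Average each per-window percentage across the clip."""
    return {field: fmean(w[field] for w in windows) for field in FIELDS}


# Made-up example values: the first window contains both vocals and instruments.
example_windows = [
    {"vocal_percentage": 80, "instrumental_percentage": 90, "silence_percentage": 0},
    {"vocal_percentage": 0, "instrumental_percentage": 100, "silence_percentage": 0},
]
averages = clip_percentages(example_windows)
# vocal 40.0 + instrumental 95.0 + silence 0.0 = 135.0, which exceeds 100
```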
Supported formats: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav. Maximum file size is 100 MB.
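A client can check these limits before uploading. The sketch below is one way to do so; the function name `check_upload` is hypothetical, and it assumes "100 MB" means 100,000,000 bytes (the stricter reading), which the text above does not specify.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".aac", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".opus", ".wav"}
# Assumption: decimal megabytes. The service may instead use 100 * 1024 * 1024.
MAX_BYTES = 100 * 1000 * 1000


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file passes both checks."""
    problems = []
    ext = Path(filename).suffix.lower()  # case-insensitive extension match
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(no extension)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes exceeds {MAX_BYTES}")
    return problems
```

For a file on disk, `Path(path).stat().st_size` gives the size in bytes to pass as `size_bytes`.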
Streaming (WebSocket)
Connect over WebSocket and receive per-window vocal AI verdicts progressively as audio arrives, followed by a final clip-level summary that includes instrumental AI detection.
Vocal AI detection runs on each 4-second window as audio arrives and is reported in window messages. Instrumental AI detection runs on the accumulated audio at end-of-stream, so instrumental_ai_percentage and instrumental_ai_confidence appear only in the final done message.
Example messages received
The server sends one window message per completed 4-second window. Streaming window messages carry vocal AI fields only; instrumental AI results arrive in the done message. Send an empty string ("") to signal end of stream.
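On the client side, handling this message flow might look like the sketch below. It assumes each message is a JSON object with a `type` field whose value is `"window"` or `"done"`; that discriminator field is a guess for illustration and should be checked against the actual message schema. The transport is left out, so the function accepts any iterable of raw message strings.

```python
import json
from typing import Iterable, Optional


def collect_results(raw_messages: Iterable[str]) -> tuple[list[dict], Optional[dict]]:
    """Gather per-window messages and the final summary from a message stream."""
    windows: list[dict] = []
    summary: Optional[dict] = None
    for raw in raw_messages:
        msg = json.loads(raw)
        kind = msg.get("type")  # assumed discriminator field, not confirmed by the docs
        if kind == "window":
            # Vocal AI fields only; instrumental results are not available yet.
            windows.append(msg)
        elif kind == "done":
            # The final summary is the only place instrumental AI fields appear.
            summary = msg
            break
    return windows, summary
```

If the stream ends without a done message, `summary` stays `None`, which a caller can treat as an incomplete result.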
For raw PCM formats, pass audio_format, sample_rate, and num_channels as query parameters. Container formats (wav, mp3, ogg, flac, webm, aac, aiff) only need audio_format. See Audio formats.
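The query-parameter rule can be sketched as a small URL builder. The parameter names come from the text above; the base URL and the raw format name `pcm_s16le` used in the usage note are placeholders, not documented values.

```python
from typing import Optional
from urllib.parse import urlencode

# Container formats listed above; anything else is treated as raw PCM here.
CONTAINER_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "aac", "aiff"}


def stream_url(
    base_url: str,
    audio_format: str,
    sample_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
) -> str:
    """Build the connection URL, adding PCM parameters only when they are required."""
    params = {"audio_format": audio_format}
    if audio_format not in CONTAINER_FORMATS:
        if sample_rate is None or num_channels is None:
            raise ValueError("raw PCM formats need sample_rate and num_channels")
        params["sample_rate"] = sample_rate
        params["num_channels"] = num_channels
    return f"{base_url}?{urlencode(params)}"
```

For example, `stream_url("wss://example.invalid/stream", "pcm_s16le", 16000, 1)` yields a URL ending in `?audio_format=pcm_s16le&sample_rate=16000&num_channels=1`, while a container format such as `"mp3"` produces only the `audio_format` parameter.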
Per-window results can be less accurate than the clip-level verdict, so rely on the clip-level result when judging a whole song or segment. Heavily processed or high-production tracks are sometimes mislabeled as AI-generated; this is a known gap targeted by future model updates.