music_prob and speech_prob values (0–1) across your audio. Both probabilities are independent — a frame with vocals can score high on both.
Batch
Send a complete audio file and receive frame-level probabilities, percentage breakdowns, and a primary_label for the full clip.
Expected response
primary_label summarises the clip as "music" (music covers at least as much of the clip as speech), "speech" (speech covers more of the clip than music), "neither" (neither crossed the classification threshold in any frame), or "unknown" (no frames could be produced from the audio).
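The labelling rules above can be sketched client-side, for example to recompute a label over a sub-range of frames. This is a minimal illustration, not the service's implementation: the `summarise_clip` helper is hypothetical, the 0.5 threshold is an assumption (the actual classification threshold is not stated here), and frames are assumed to be dictionaries carrying `music_prob` and `speech_prob`.

```python
from typing import Dict, List

# Assumed cutoff; the real classification threshold is not documented here.
THRESHOLD = 0.5


def summarise_clip(frames: List[Dict[str, float]], threshold: float = THRESHOLD) -> str:
    """Derive a clip-level label from per-frame probabilities (hypothetical helper)."""
    if not frames:
        # No frames could be produced from the audio.
        return "unknown"
    # Count frames where each class crosses the threshold. The two
    # probabilities are independent, so one frame may count for both.
    music = sum(1 for f in frames if f["music_prob"] >= threshold)
    speech = sum(1 for f in frames if f["speech_prob"] >= threshold)
    if music == 0 and speech == 0:
        return "neither"
    # Ties go to music: music covers "at least as much" of the clip as speech.
    return "music" if music >= speech else "speech"
```

Note that a tie between music and speech coverage resolves to "music", which follows from the "at least as much" wording.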
Streaming (WebSocket)
Connect over WebSocket and receive frame probabilities progressively as audio arrives, which is useful for live content moderation, broadcast monitoring, or real-time scene classification.
Example messages received
frame message per 192ms of audio processed. Send an empty string ("") to signal end of stream.
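As a rough sketch of the sending side, the generator below produces the sequence of messages a client would write to the socket: audio chunks followed by the empty-string end-of-stream signal. The `outgoing_messages` name and the 4096-byte chunk size are hypothetical, and sending audio as binary messages is an assumption; check the streaming reference for the actual requirements.

```python
from typing import Iterator, Union

# Assumed chunk size; the source does not prescribe one.
CHUNK_BYTES = 4096


def outgoing_messages(
    audio: bytes, chunk_bytes: int = CHUNK_BYTES
) -> Iterator[Union[bytes, str]]:
    """Yield audio chunks, then the empty string that signals end of stream."""
    for start in range(0, len(audio), chunk_bytes):
        # Each slice is assumed to be sent as one binary WebSocket message.
        yield audio[start:start + chunk_bytes]
    # An empty string ("") tells the server that no more audio will follow.
    yield ""
```

Each yielded item would be passed to the WebSocket client's send call in order. Keeping the chunking separate from the network code makes the end-of-stream behaviour easy to test without a live connection.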
For raw PCM formats, pass audio_format, sample_rate, and num_channels as query parameters. Container formats (mp3, wav, ogg, flac, webm, aac, aiff) only need audio_format. See Audio formats.
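The parameter rule above could be enforced with a small helper before connecting. This is a sketch under assumptions: `build_query` is a hypothetical name, and the raw PCM format identifier used in the usage note (`pcm_s16le`) is illustrative only; see Audio formats for the accepted values.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode

# Container formats listed above; these only need audio_format.
CONTAINER_FORMATS = {"mp3", "wav", "ogg", "flac", "webm", "aac", "aiff"}


def build_query(
    audio_format: str,
    sample_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
) -> str:
    """Build the query string for a streaming connection (hypothetical helper)."""
    params: Dict[str, Union[str, int]] = {"audio_format": audio_format}
    if audio_format not in CONTAINER_FORMATS:
        # Anything else is treated as raw PCM, which carries no header,
        # so the sample rate and channel count must be given explicitly.
        if sample_rate is None or num_channels is None:
            raise ValueError("raw PCM requires sample_rate and num_channels")
        params["sample_rate"] = sample_rate
        params["num_channels"] = num_channels
    return urlencode(params)
```

For example, `build_query("mp3")` returns `audio_format=mp3`, while `build_query("pcm_s16le", 16000, 1)` returns `audio_format=pcm_s16le&sample_rate=16000&num_channels=1`. The result is appended after `?` to the WebSocket URL from the streaming reference.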
API reference
- Music Detection Batch — full parameter and response schema
- Music Detection Streaming — WebSocket protocol, format requirements, close codes