Velma-2 music detection returns frame-level music_prob and speech_prob values (0–1) for your audio. The two probabilities are independent, so a frame with vocals can score high on both.
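Because the two scores are independent, a frame is best read as two separate yes/no questions rather than a single either/or choice. The sketch below illustrates that reading. The describe_frame helper and the 0.5 cut-off are assumptions made for illustration, and the values in the sung frame are invented; the service's actual classification threshold is not stated here.

```python
from typing import Dict

# Hypothetical cut-off for illustration only; not a documented service value.
THRESHOLD = 0.5


def describe_frame(frame: Dict[str, float], threshold: float = THRESHOLD) -> str:
    """Label one frame as 'music', 'speech', 'both', or 'neither'."""
    has_music = frame["music_prob"] >= threshold
    has_speech = frame["speech_prob"] >= threshold
    if has_music and has_speech:
        return "both"
    if has_music:
        return "music"
    if has_speech:
        return "speech"
    return "neither"


# First frame of the batch sample response: clearly speech.
spoken = {"start_time_ms": 0, "end_time_ms": 192, "music_prob": 0.0213, "speech_prob": 0.9888}
# Invented values showing a frame with vocals scoring high on both.
sung = {"start_time_ms": 192, "end_time_ms": 384, "music_prob": 0.93, "speech_prob": 0.88}

print(describe_frame(spoken))
print(describe_frame(sung))
```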

Batch

Send a complete audio file and receive frame-level probabilities, percentage breakdowns, and a primary_label for the full clip.
curl -X POST https://modulate-developer-apis.com/api/velma-2-music-detection-batch \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3"

Example response:

{
  "filename": "audio.mp3",
  "duration_s": 5.76,
  "primary_label": "speech",
  "music_pct": 0.0,
  "speech_pct": 86.7,
  "latency_ms": 1243.5,
  "frames": [
    { "start_time_ms": 0,   "end_time_ms": 192, "music_prob": 0.0213, "speech_prob": 0.9888 },
    { "start_time_ms": 192, "end_time_ms": 384, "music_prob": 0.0204, "speech_prob": 0.9931 }
  ]
}
Each frame covers 192 ms. primary_label summarises the clip with one of four values: "music" (music covers at least as much of the clip as speech), "speech" (speech covers more of the clip than music), "neither" (neither probability crossed the classification threshold in any frame), or "unknown" (no frames could be produced from the audio).
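The labelling rules can be sketched client-side, for example to re-summarise a sub-range of frames. This is an approximation under stated assumptions, not the service's implementation: the 0.5 threshold is hypothetical, and the percentages are assumed to be the share of frames at or above it. The sample is consistent with that reading (26 of 30 frames gives 86.7), but it does not confirm it. Note that "neither" has to be checked before the music/speech comparison, since a clip with no detections would otherwise tie at zero and be labelled "music".

```python
from typing import Dict, List, Union

# Hypothetical cut-off for illustration only; not a documented service value.
THRESHOLD = 0.5


def summarise(frames: List[Dict[str, float]],
              threshold: float = THRESHOLD) -> Dict[str, Union[str, float]]:
    """Approximate primary_label, music_pct and speech_pct from a list of frames."""
    if not frames:
        # No frames could be produced from the audio.
        return {"primary_label": "unknown", "music_pct": 0.0, "speech_pct": 0.0}

    music = sum(1 for f in frames if f["music_prob"] >= threshold)
    speech = sum(1 for f in frames if f["speech_prob"] >= threshold)

    if music == 0 and speech == 0:
        label = "neither"   # nothing crossed the threshold in any frame
    elif music >= speech:
        label = "music"     # ties go to music ("at least as much")
    else:
        label = "speech"

    return {
        "primary_label": label,
        "music_pct": round(100.0 * music / len(frames), 1),
        "speech_pct": round(100.0 * speech / len(frames), 1),
    }


# 30 synthetic frames: 26 with confident speech, 4 with neither.
frames = (
    [{"music_prob": 0.02, "speech_prob": 0.99}] * 26
    + [{"music_prob": 0.02, "speech_prob": 0.10}] * 4
)
print(summarise(frames))
```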

Streaming (WebSocket)

Connect over WebSocket and receive frame probabilities progressively as audio arrives. This is useful for live content moderation, broadcast monitoring, or real-time scene classification.
websocat "wss://modulate-developer-apis.com/api/velma-2-music-detection-streaming?api_key=$MODULATE_API_KEY&audio_format=mp3" \
  --binary - < audio.mp3

Example messages:

{ "type": "frame", "frame": { "start_time_ms": 0, "end_time_ms": 192, "music_prob": 0.0213, "speech_prob": 0.9888 } }
{ "type": "frame", "frame": { "start_time_ms": 192, "end_time_ms": 384, "music_prob": 0.0204, "speech_prob": 0.9931 } }
{ "type": "done", "duration_ms": 5760, "frame_count": 30, "primary_label": "speech", "music_pct": 0.0, "speech_pct": 86.7 }
The server emits one frame message per 192 ms of audio processed, and a final done message carries the clip-level summary. Send an empty string ("") to signal end of stream. For raw PCM formats, pass audio_format, sample_rate, and num_channels as query parameters. Container formats (mp3, wav, ogg, flac, webm, aac, aiff) only need audio_format. See Audio formats.
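The query parameters and message shapes above can be handled with a few lines of standard-library Python. The sketch below only builds the connection URL and parses messages that have already been received; it does not open a socket, so it works with whichever WebSocket client you prefer. build_streaming_url and collect_messages are hypothetical helper names, and the sketch assumes every message is a JSON object with a type field, as in the sample. Any type other than frame or done is ignored.

```python
import json
from typing import Iterable, List, Optional, Tuple
from urllib.parse import urlencode

BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-music-detection-streaming"


def build_streaming_url(api_key: str, audio_format: str,
                        sample_rate: Optional[int] = None,
                        num_channels: Optional[int] = None) -> str:
    """Build the streaming URL; sample_rate and num_channels are only needed for raw PCM."""
    params = {"api_key": api_key, "audio_format": audio_format}
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    if num_channels is not None:
        params["num_channels"] = num_channels
    return f"{BASE_URL}?{urlencode(params)}"


def collect_messages(messages: Iterable[str]) -> Tuple[List[dict], Optional[dict]]:
    """Gather frame payloads until the done message; return (frames, summary)."""
    frames: List[dict] = []
    summary: Optional[dict] = None
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("type") == "frame":
            frames.append(msg["frame"])
        elif msg.get("type") == "done":
            summary = msg
            break  # nothing further is expected after the summary
    return frames, summary


# Messages copied from the sample output above.
sample = [
    '{ "type": "frame", "frame": { "start_time_ms": 0, "end_time_ms": 192, "music_prob": 0.0213, "speech_prob": 0.9888 } }',
    '{ "type": "frame", "frame": { "start_time_ms": 192, "end_time_ms": 384, "music_prob": 0.0204, "speech_prob": 0.9931 } }',
    '{ "type": "done", "duration_ms": 5760, "frame_count": 30, "primary_label": "speech", "music_pct": 0.0, "speech_pct": 86.7 }',
]

url = build_streaming_url("YOUR_API_KEY", "mp3")
frames, summary = collect_messages(sample)
print(url)
print(len(frames), summary["primary_label"])
```

In a live client, the iterable passed to collect_messages would be the text messages yielded by the socket, and frames could be acted on as they arrive rather than collected into a list.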

API reference