Music & Speech Detection

Music & Speech Detection returns frame-level music_prob and speech_prob values (0–1) across your audio. Both probabilities are independent — a frame with vocals can score high on both.
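Because the two scores are independent, one frame can carry both labels. The sketch below shows one way to tag frames on that basis. The 0.5 cutoff, the `label_frame` helper, and the second sample frame are assumptions made for illustration; the service's own classification threshold is not stated here.

```python
# Assumed cutoff for illustration only; not a documented service value.
THRESHOLD = 0.5

def label_frame(frame, threshold=THRESHOLD):
    """Label one frame, checking music and speech independently."""
    has_music = frame["music_prob"] >= threshold
    has_speech = frame["speech_prob"] >= threshold
    if has_music and has_speech:
        return "music+speech"  # e.g. sung vocals
    if has_music:
        return "music"
    if has_speech:
        return "speech"
    return "neither"

frames = [
    {"start_time_ms": 0, "end_time_ms": 192, "music_prob": 0.0213, "speech_prob": 0.9888},
    # Made-up values standing in for a frame with vocals over music.
    {"start_time_ms": 192, "end_time_ms": 384, "music_prob": 0.91, "speech_prob": 0.88},
]
labels = [label_frame(f) for f in frames]
print(labels)
```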

Batch

Send a complete audio file and receive frame-level probabilities, percentage breakdowns, and a primary_label for the full clip.

curl -X POST https://platform.modulate.ai/api/velma-2-music-detection-batch \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3"

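import os
import requests

# Open the file in a context manager so the handle is closed once the upload finishes.
with open("audio.mp3", "rb") as audio_file:
    response = requests.post(
        "https://platform.modulate.ai/api/velma-2-music-detection-batch",
        headers={"X-API-Key": os.environ["MODULATE_API_KEY"]},
        files={"upload_file": audio_file},
    )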
response.raise_for_status()
result = response.json()
print(f"Primary label: {result['primary_label']}")
print(f"Music: {result['music_pct']}%  Speech: {result['speech_pct']}%")
for frame in result["frames"]:
    print(f"  {frame['start_time_ms']}ms – {frame['end_time_ms']}ms  music={frame['music_prob']:.4f}  speech={frame['speech_prob']:.4f}")

Expected response

{
  "filename": "audio.mp3",
  "duration_s": 5.76,
  "primary_label": "speech",
  "music_pct": 0.0,
  "speech_pct": 86.7,
  "latency_ms": 1243.5,
  "frames": [
    { "start_time_ms": 0,   "end_time_ms": 192, "music_prob": 0.0213, "speech_prob": 0.9888 },
    { "start_time_ms": 192, "end_time_ms": 384, "music_prob": 0.0204, "speech_prob": 0.9931 }
  ]
}

Each frame covers 192 ms of audio. primary_label summarises the whole clip as one of:

"music": music covers at least as much of the clip as speech
"speech": speech covers more of the clip than music
"neither": neither music nor speech crossed the classification threshold in any frame
"unknown": no frames could be produced from the audio

Batch accepts common audio formats (MP3, WAV, FLAC, MP4, OGG, and more); see Audio formats.
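The primary_label rules can be approximated client-side from the frames list, which may help when re-deriving a summary for a sub-range of a clip. This is only a sketch: the `summarise` helper is hypothetical, the 0.5 threshold is an assumption, and treating each percentage as the share of frames above that threshold is a guess at how the service computes it.

```python
def summarise(frames, threshold=0.5):
    """Approximate the clip-level summary from frame probabilities.

    threshold and the percentage formula are assumptions, not documented values.
    """
    if not frames:
        # No frames could be produced from the audio.
        return {"primary_label": "unknown", "music_pct": 0.0, "speech_pct": 0.0}

    music = sum(f["music_prob"] >= threshold for f in frames)
    speech = sum(f["speech_prob"] >= threshold for f in frames)

    if music == 0 and speech == 0:
        label = "neither"
    elif music >= speech:
        label = "music"   # ties go to music
    else:
        label = "speech"

    total = len(frames)
    return {
        "primary_label": label,
        "music_pct": round(100 * music / total, 1),
        "speech_pct": round(100 * speech / total, 1),
    }

example_frames = [
    {"start_time_ms": 0, "end_time_ms": 192, "music_prob": 0.0213, "speech_prob": 0.9888},
    {"start_time_ms": 192, "end_time_ms": 384, "music_prob": 0.0204, "speech_prob": 0.9931},
]
summary = summarise(example_frames)
print(summary)
```

Prefer the values returned by the API where they are available; a local calculation like this can differ from the server's result.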

Streaming (WebSocket)

Connect over WebSocket and receive frame probabilities progressively as audio arrives — useful for live content moderation, broadcast monitoring, or real-time scene classification.

websocat "wss://platform.modulate.ai/api/velma-2-music-detection-streaming?api_key=$MODULATE_API_KEY&audio_format=mp3" \
  --binary - < audio.mp3

import os, asyncio, json, websockets

API_KEY = os.environ["MODULATE_API_KEY"]
AUDIO_FILE = "audio.mp3"
CHUNK_SIZE = 65536

async def stream():
    url = (
        f"wss://platform.modulate.ai/api/velma-2-music-detection-streaming"
        f"?api_key={API_KEY}&audio_format=mp3"
    )
    async with websockets.connect(url) as ws:
        async def send():
            with open(AUDIO_FILE, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
            # An empty text message tells the server the stream has ended.
            await ws.send("")

        async def receive():
            async for message in ws:
                msg = json.loads(message)
                if msg["type"] == "frame":
                    f = msg["frame"]
                    print(f"  {f['start_time_ms']}ms – {f['end_time_ms']}ms  music={f['music_prob']:.4f}  speech={f['speech_prob']:.4f}")
                elif msg["type"] == "done":
                    print(f"Done — {msg['duration_ms']}ms, {msg['frame_count']} frames")
                    print(f"Primary label: {msg['primary_label']}  Music: {msg['music_pct']}%  Speech: {msg['speech_pct']}%")
                    break

        await asyncio.gather(send(), receive())

asyncio.run(stream())

Example messages received

{ "type": "frame", "frame": { "start_time_ms": 0, "end_time_ms": 192, "music_prob": 0.0213, "speech_prob": 0.9888 } }
{ "type": "frame", "frame": { "start_time_ms": 192, "end_time_ms": 384, "music_prob": 0.0204, "speech_prob": 0.9931 } }
{ "type": "done", "duration_ms": 5760, "frame_count": 30, "primary_label": "speech", "music_pct": 0.0, "speech_pct": 86.7 }
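The message handling can be separated from the socket so that it can be exercised offline. The sketch below feeds the example messages above through a hypothetical `handle_message` helper and checks the reported frame_count against the 192 ms frame length. Ignoring unrecognised message types and using floor division for the frame count are assumptions; the source does not describe other message types or how a trailing partial frame is handled.

```python
import json

FRAME_MS = 192  # frame length stated in this guide

def handle_message(raw, frames):
    """Collect frame messages; return the summary dict when "done" arrives."""
    msg = json.loads(raw)
    if msg["type"] == "frame":
        frames.append(msg["frame"])
        return None
    if msg["type"] == "done":
        return msg
    return None  # assumption: silently ignore any other message type

messages = [
    '{ "type": "frame", "frame": { "start_time_ms": 0, "end_time_ms": 192, "music_prob": 0.0213, "speech_prob": 0.9888 } }',
    '{ "type": "frame", "frame": { "start_time_ms": 192, "end_time_ms": 384, "music_prob": 0.0204, "speech_prob": 0.9931 } }',
    '{ "type": "done", "duration_ms": 5760, "frame_count": 30, "primary_label": "speech", "music_pct": 0.0, "speech_pct": 86.7 }',
]

frames = []
summary = None
for raw in messages:
    summary = handle_message(raw, frames) or summary

print(f"collected {len(frames)} of {summary['frame_count']} frames")
print("frame_count matches duration:", summary["duration_ms"] // FRAME_MS == summary["frame_count"])
```

Only two frame messages are shown in the example, so the collected count is lower than the frame_count reported for the full stream.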

The server emits one frame message per 192ms of audio processed. Send an empty string ("") to signal end of stream. For raw PCM formats, pass audio_format, sample_rate, and num_channels as query parameters. Container formats (mp3, wav, ogg, flac, webm, aac, aiff) only need audio_format. See Audio formats.
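The query-parameter rules above can be captured in a small URL builder. This is a sketch under assumptions: `build_stream_url` is a hypothetical helper, and `pcm_s16le` is a placeholder name for a raw PCM format, so check Audio formats for the values the service actually accepts.

```python
from urllib.parse import urlencode

BASE_URL = "wss://platform.modulate.ai/api/velma-2-music-detection-streaming"
# Container formats listed in this guide; these only need audio_format.
CONTAINER_FORMATS = {"mp3", "wav", "ogg", "flac", "webm", "aac", "aiff"}

def build_stream_url(api_key, audio_format, sample_rate=None, num_channels=None):
    """Build the streaming URL, adding PCM parameters only when required."""
    params = {"api_key": api_key, "audio_format": audio_format}
    if audio_format not in CONTAINER_FORMATS:
        # Treated as raw PCM: the server cannot read these from a header.
        if sample_rate is None or num_channels is None:
            raise ValueError("raw PCM formats need sample_rate and num_channels")
        params["sample_rate"] = sample_rate
        params["num_channels"] = num_channels
    # urlencode also escapes any special characters in the API key.
    return f"{BASE_URL}?{urlencode(params)}"

mp3_url = build_stream_url("KEY", "mp3")
# "pcm_s16le" is a placeholder format name, not a confirmed API value.
pcm_url = build_stream_url("KEY", "pcm_s16le", sample_rate=16000, num_channels=1)
print(mp3_url)
print(pcm_url)
```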

API reference

Music & Speech Detection Batch — full parameter and response schema
Music & Speech Detection Streaming — WebSocket protocol, format requirements, close codes
