Music Detection Streaming

Real-time frame-level music and speech classification over WebSocket. Frames are returned progressively as audio is streamed — no need to wait for the full file to upload before results begin arriving.

Endpoint

wss://platform.modulate.ai/api/velma-2-music-detection-streaming

Authentication

Pass your API key as a query parameter on the connection URL:

wss://.../velma-2-music-detection-streaming?api_key=YOUR_API_KEY&audio_format=s16le&...

Features

Real-time output — frames emitted progressively after each 192ms chunk of audio
Music detection — identifies frames containing music content
Speech detection — identifies frames containing speech content
Non-exclusive labels — music and speech are independent; both can be high simultaneously (e.g. music with vocals)
Any chunk size — send audio in whatever chunk size suits your pipeline
Container and raw PCM support — stream compressed files or raw PCM directly from a microphone
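
Because the two probabilities are independent, a client would usually threshold each one separately instead of picking the larger. The following is a minimal sketch of that idea. The helper name `frame_labels` and the 0.5 cut-off are assumptions made for illustration; the API does not prescribe a threshold.

```python
def frame_labels(frame: dict, threshold: float = 0.5) -> set[str]:
    """Return every label whose probability meets the threshold.

    The 0.5 default is an illustrative assumption, not a documented value.
    """
    labels = set()
    if frame["music_prob"] >= threshold:
        labels.add("music")
    if frame["speech_prob"] >= threshold:
        labels.add("speech")
    return labels


# A sung passage can carry both labels at once.
print(sorted(frame_labels({"music_prob": 0.91, "speech_prob": 0.78})))  # ['music', 'speech']
# Silence or noise may carry neither.
print(sorted(frame_labels({"music_prob": 0.05, "speech_prob": 0.12})))  # []
```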

Connection parameters

Parameter	Required	Description
`api_key`	Yes	Your API key
`audio_format`	Yes	Audio format — see supported formats below
`sample_rate`	Raw PCM only	Sample rate in Hz
`num_channels`	Raw PCM only	Number of channels (1–8)

Supported audio formats

Container formats (`sample_rate` and `num_channels` must not be specified, because the file headers already carry this metadata): wav, mp3, ogg, flac, webm, aac, aiff

Raw PCM formats (`sample_rate` and `num_channels` are required): s16le, s16be, s32le, s32be, s24le, s24be, s8, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw

Valid sample rates (Hz): 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000
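
The rules above can be checked on the client before connecting, which avoids a round trip that would end in a `1003` close. The sketch below is one way to do that; `build_url` is a hypothetical helper, not part of any SDK, and it URL-encodes the query string so that keys containing reserved characters survive intact.

```python
from urllib.parse import urlencode

WS_URL = "wss://platform.modulate.ai/api/velma-2-music-detection-streaming"

CONTAINER_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "aac", "aiff"}
RAW_PCM_FORMATS = {
    "s16le", "s16be", "s32le", "s32be", "s24le", "s24be", "s8", "u8",
    "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
}
VALID_SAMPLE_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000}


def build_url(api_key, audio_format, sample_rate=None, num_channels=None):
    """Build the connection URL, enforcing the documented parameter rules."""
    params = {"api_key": api_key, "audio_format": audio_format}
    if audio_format in CONTAINER_FORMATS:
        # Container headers already carry rate and channel count.
        if sample_rate is not None or num_channels is not None:
            raise ValueError("sample_rate/num_channels must not be set for container formats")
    elif audio_format in RAW_PCM_FORMATS:
        if sample_rate not in VALID_SAMPLE_RATES:
            raise ValueError(f"unsupported sample_rate: {sample_rate!r}")
        if num_channels is None or not 1 <= num_channels <= 8:
            raise ValueError(f"num_channels must be 1-8, got {num_channels!r}")
        params["sample_rate"] = sample_rate
        params["num_channels"] = num_channels
    else:
        raise ValueError(f"unknown audio_format: {audio_format!r}")
    return f"{WS_URL}?{urlencode(params)}"


print(build_url("YOUR_API_KEY", "s16le", sample_rate=16000, num_channels=1))
print(build_url("YOUR_API_KEY", "mp3"))
```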

Protocol

Client → server

Message	Description
Binary frame	Chunk of audio bytes in the declared format (any size)
Empty text frame `""`	Signals end of stream
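
Since results arrive while audio is still being sent, a client that wants them in real time can run the sender and the receiver concurrently instead of uploading everything first. The sketch below shows that pattern under a few assumptions: `stream_concurrently` is a hypothetical helper, and `ws` is any object with an async `send()` that can also be iterated with `async for` to yield text messages.

```python
import asyncio
import json


async def stream_concurrently(ws, chunks):
    """Send audio chunks and collect server messages at the same time."""

    async def sender():
        for chunk in chunks:
            await ws.send(chunk)  # binary frame: audio bytes, any size
        await ws.send("")         # empty text frame: end of stream

    async def receiver():
        messages = []
        async for raw in ws:
            msg = json.loads(raw)
            messages.append(msg)
            # "done" and "error" are both terminal message types.
            if msg["type"] in ("done", "error"):
                break
        return messages

    # gather() runs both coroutines concurrently and keeps their order.
    _, messages = await asyncio.gather(sender(), receiver())
    return messages
```

With the `websockets` package, the connection object yielded by `async with websockets.connect(url) as ws` fits this shape, so it can be passed in directly along with an iterable of byte chunks.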

Server → client

Message	Description
`{"type": "frame", "frame": {...}}`	Frame result — emitted after each 192ms chunk
`{"type": "done", "duration_ms": ..., ...}`	Stream complete — includes overall summary across all frames
`{"type": "error", "error": "..."}`	An error occurred — connection will close

Frame object

Field	Type	Description
`start_time_ms`	integer	Frame start time in milliseconds
`end_time_ms`	integer	Frame end time in milliseconds
`music_prob`	float	Music probability (0.0–1.0)
`speech_prob`	float	Speech probability (0.0–1.0)
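
Frame objects are often more useful once merged into continuous time spans. The following sketch collapses consecutive music frames into `(start_ms, end_ms)` tuples. It assumes frames arrive in time order and that adjacent frames share a boundary; the `music_segments` name and the 0.5 threshold are illustrative choices, not documented values.

```python
def music_segments(frames, threshold=0.5):
    """Merge consecutive frames with music_prob >= threshold into spans."""
    segments = []
    for frame in frames:
        if frame["music_prob"] < threshold:
            continue
        start, end = frame["start_time_ms"], frame["end_time_ms"]
        if segments and segments[-1][1] >= start:
            # This frame touches the previous span, so extend it.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


frames = [
    {"start_time_ms": 0,   "end_time_ms": 192, "music_prob": 0.9, "speech_prob": 0.1},
    {"start_time_ms": 192, "end_time_ms": 384, "music_prob": 0.8, "speech_prob": 0.2},
    {"start_time_ms": 384, "end_time_ms": 576, "music_prob": 0.1, "speech_prob": 0.9},
    {"start_time_ms": 576, "end_time_ms": 768, "music_prob": 0.7, "speech_prob": 0.6},
]
print(music_segments(frames))  # [(0, 384), (576, 768)]
```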

Done object

Field	Type	Description
`duration_ms`	integer	Total audio duration processed in milliseconds
`frame_count`	integer	Total number of frames returned
`music_pct`	float	Percentage of the analysed duration classified as music, rounded to one decimal place. `0.0` when no audio was analysed
`speech_pct`	float	Percentage of the analysed duration classified as speech, rounded to one decimal place. `0.0` when no audio was analysed
`primary_label`	string	Dominant classification: `"music"`, `"speech"`, `"neither"`, or `"unknown"` (no frames produced from the audio)
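
The three server message types can be turned into typed values with a small dispatcher. The sketch below is one possible shape, not an official client: `Frame`, `Done`, `StreamError`, and `parse_message` are names chosen here for illustration. Fields are read by name so that any additional keys the server may include are ignored.

```python
import json
from dataclasses import dataclass


@dataclass
class Frame:
    start_time_ms: int
    end_time_ms: int
    music_prob: float
    speech_prob: float


@dataclass
class Done:
    duration_ms: int
    frame_count: int
    music_pct: float
    speech_pct: float
    primary_label: str


class StreamError(Exception):
    """Raised when the server sends a message of type "error"."""


def parse_message(raw: str):
    """Convert one server text message into a Frame or Done, or raise."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "frame":
        f = msg["frame"]
        return Frame(f["start_time_ms"], f["end_time_ms"],
                     f["music_prob"], f["speech_prob"])
    if kind == "done":
        return Done(msg["duration_ms"], msg["frame_count"],
                    msg["music_pct"], msg["speech_pct"], msg["primary_label"])
    if kind == "error":
        raise StreamError(msg["error"])
    raise ValueError(f"unexpected message type: {kind!r}")
```

A receive loop can then branch with `isinstance(result, Frame)` or `isinstance(result, Done)` and let `StreamError` propagate to the caller.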

WebSocket close codes

Code	Meaning
`1000`	Normal closure after a successful `done` message
`1003`	Invalid query parameters (unknown format, bad sample rate, missing `audio_format`)
`4002`	Audio could not be decoded or does not match the declared format
`4003`	Access denied or server-side usage check failed
`4029`	Insufficient credits
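
For logging or user-facing errors, the close codes above can be kept in a lookup table. A minimal sketch follows; `CLOSE_CODES` and `describe_close` are hypothetical names, and the descriptions are copied from the table.

```python
CLOSE_CODES = {
    1000: "Normal closure after a successful done message",
    1003: "Invalid query parameters",
    4002: "Audio could not be decoded or does not match the declared format",
    4003: "Access denied or server-side usage check failed",
    4029: "Insufficient credits",
}


def describe_close(code: int) -> str:
    """Return a readable description for a WebSocket close code."""
    return CLOSE_CODES.get(code, f"Unrecognised close code {code}")


print(describe_close(4029))  # Insufficient credits
print(describe_close(1000))  # Normal closure after a successful done message
```

How the close code is obtained depends on the client library; with the `websockets` package it should be available as the connection's `close_code` attribute once the connection has closed.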

Chunking behaviour

Audio is buffered in 192ms chunks (one output frame each). Frames are emitted as soon as each chunk is ready, so results begin arriving within 192ms of the first audio being received. At end-of-stream, any remaining audio ≥ 192ms is processed and its frames are emitted before the done message.
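
For raw PCM input, the 192ms chunking translates into a byte count that depends on the declared format. The arithmetic below estimates how many full chunks, and therefore frames, a given number of bytes should yield. It is a client-side estimate only: `full_chunks` is a hypothetical helper, the bytes-per-sample value must be supplied by the caller (2 for s16le, for example), and how a trailing remainder shorter than 192ms is handled is not stated here, so only full chunks are counted.

```python
CHUNK_MS = 192  # one output frame per 192ms of audio


def full_chunks(num_bytes, sample_rate, num_channels, bytes_per_sample):
    """Number of complete 192ms chunks contained in num_bytes of raw PCM."""
    bytes_per_second = sample_rate * num_channels * bytes_per_sample
    # Integer arithmetic avoids floating-point rounding at chunk boundaries.
    return (num_bytes * 1000) // (bytes_per_second * CHUNK_MS)


# 16 kHz mono s16le: 32000 bytes per second, so 6144 bytes per 192ms chunk.
print(full_chunks(6144, 16000, 1, 2))     # 1
# A 16000-byte send is 500ms: two full chunks, with 116ms still buffered.
print(full_chunks(16000, 16000, 1, 2))    # 2
# Ten seconds of audio: 52 full chunks (9984ms), 16ms left over.
print(full_chunks(320000, 16000, 1, 2))   # 52
```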

Examples

Python: streaming raw PCM (s16le, 16 kHz, mono)

import asyncio
import websockets
import json

WS_URL = "wss://platform.modulate.ai/api/velma-2-music-detection-streaming"
API_KEY = "YOUR_API_KEY"

async def stream_audio(file_path: str) -> None:
    url = (
        f"{WS_URL}?api_key={API_KEY}"
        f"&audio_format=s16le&sample_rate=16000&num_channels=1"
    )
    async with websockets.connect(url) as ws:
        # Send audio in chunks
        with open(file_path, "rb") as f:
            while chunk := f.read(16000):  # 0.5s of s16le/16kHz mono
                await ws.send(chunk)

        # Signal end of stream
        await ws.send("")

        # Receive results
        async for message in ws:
            msg = json.loads(message)
            if msg["type"] == "frame":
                frame = msg["frame"]
                print(
                    f"{frame['start_time_ms']}ms – {frame['end_time_ms']}ms  "
                    f"music={frame['music_prob']:.4f}  speech={frame['speech_prob']:.4f}"
                )
            elif msg["type"] == "done":
                print(f"\nDone — {msg['duration_ms']}ms, {msg['frame_count']} frames")
                print(f"music={msg['music_pct']}%  speech={msg['speech_pct']}%  label={msg['primary_label']}")
                break
            elif msg["type"] == "error":
                raise RuntimeError(f"Server error: {msg['error']}")

asyncio.run(stream_audio("/path/to/audio.raw"))

Python: streaming a container file (mp3)

import asyncio
import websockets
import json

WS_URL = "wss://platform.modulate.ai/api/velma-2-music-detection-streaming"
API_KEY = "YOUR_API_KEY"

async def stream_audio_file(file_path: str, audio_format: str) -> None:
    url = f"{WS_URL}?api_key={API_KEY}&audio_format={audio_format}"
    async with websockets.connect(url) as ws:
        with open(file_path, "rb") as f:
            while chunk := f.read(65536):
                await ws.send(chunk)
        await ws.send("")

        async for message in ws:
            msg = json.loads(message)
            if msg["type"] == "frame":
                frame = msg["frame"]
                print(
                    f"{frame['start_time_ms']}ms – {frame['end_time_ms']}ms  "
                    f"music={frame['music_prob']:.4f}  speech={frame['speech_prob']:.4f}"
                )
            elif msg["type"] == "done":
                print(f"\nDone — {msg['duration_ms']}ms, {msg['frame_count']} frames")
                print(f"music={msg['music_pct']}%  speech={msg['speech_pct']}%  label={msg['primary_label']}")
                break
            elif msg["type"] == "error":
                raise RuntimeError(f"Server error: {msg['error']}")

asyncio.run(stream_audio_file("/path/to/audio.mp3", "mp3"))

JavaScript (Node.js with the `ws` package): streaming a container file (mp3)

import { WebSocket } from "ws";
import { createReadStream } from "fs";

const WS_URL = "wss://platform.modulate.ai/api/velma-2-music-detection-streaming";
const API_KEY = "YOUR_API_KEY";

async function streamAudio(filePath, audioFormat) {
  const url = `${WS_URL}?api_key=${API_KEY}&audio_format=${audioFormat}`;
  const ws = new WebSocket(url);

  await new Promise((resolve, reject) => {
    ws.on("open", () => {
      const stream = createReadStream(filePath, { highWaterMark: 65536 });
      stream.on("data", (chunk) => ws.send(chunk));
      stream.on("end", () => ws.send(""));
      stream.on("error", reject);
    });

    ws.on("message", (data) => {
      const msg = JSON.parse(data);
      if (msg.type === "frame") {
        const { start_time_ms, end_time_ms, music_prob, speech_prob } = msg.frame;
        console.log(
          `${start_time_ms}ms – ${end_time_ms}ms  ` +
          `music=${music_prob.toFixed(4)}  speech=${speech_prob.toFixed(4)}`
        );
      } else if (msg.type === "done") {
        console.log(`\nDone — ${msg.duration_ms}ms, ${msg.frame_count} frames`);
        console.log(`music=${msg.music_pct}%  speech=${msg.speech_pct}%  label=${msg.primary_label}`);
        ws.close();
        resolve();
      } else if (msg.type === "error") {
        reject(new Error(`Server error: ${msg.error}`));
      }
    });

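    ws.on("error", reject);
    // Reject if the socket closes before "done" arrives, so the promise cannot hang.
    // This is a no-op once the promise has already been resolved or rejected.
    ws.on("close", (code) => reject(new Error(`Connection closed with code ${code}`)));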
  });
}

await streamAudio("/path/to/audio.mp3", "mp3");

Rate limits

Concurrent connection limits apply per organization
Monthly usage limits (in audio hours) apply per organization
