Real-time frame-level music and speech classification over WebSocket. Frames are returned progressively as audio is streamed — no need to wait for the full file to upload before results begin arriving.

Endpoint

wss://modulate-developer-apis.com/api/velma-2-music-detection-streaming

Authentication

Pass your API key as a query parameter on the connection URL:
wss://.../velma-2-music-detection-streaming?api_key=YOUR_API_KEY&audio_format=s16le&...

Features

  • Real-time output — frames emitted progressively after each 192ms chunk of audio
  • Music detection — identifies frames containing music content
  • Speech detection — identifies frames containing speech content
  • Non-exclusive labels — music and speech are independent; both can be high simultaneously (e.g. music with vocals)
  • Any chunk size — send audio in whatever chunk size suits your pipeline
  • Container and raw PCM support — stream compressed files or raw PCM directly from a microphone

Connection parameters

| Parameter | Required | Description |
| --- | --- | --- |
| api_key | Yes | Your API key |
| audio_format | Yes | Audio format (see supported formats below) |
| sample_rate | Raw PCM only | Sample rate in Hz |
| num_channels | Raw PCM only | Number of channels (1–8) |

Supported audio formats

Container formats (sample_rate and num_channels must not be specified; the headers already carry this metadata): wav, mp3, ogg, flac, webm, aac, aiff

Raw PCM formats (sample_rate and num_channels are required): s16le, s16be, s32le, s32be, s24le, s24be, s8, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw

Valid sample rates: 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000
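The parameter rules above can be checked on the client before opening a connection. The following is a minimal sketch, not part of the API: `build_url` and the constant names are hypothetical helpers, and the validation simply mirrors the rules listed in this section.

```python
from urllib.parse import urlencode

BASE_URL = "wss://modulate-developer-apis.com/api/velma-2-music-detection-streaming"

CONTAINER_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "aac", "aiff"}
RAW_PCM_FORMATS = {
    "s16le", "s16be", "s32le", "s32be", "s24le", "s24be", "s8", "u8",
    "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
}
VALID_SAMPLE_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000}


def build_url(api_key, audio_format, sample_rate=None, num_channels=None):
    """Hypothetical helper: build the connection URL, enforcing the documented rules."""
    params = {"api_key": api_key, "audio_format": audio_format}
    if audio_format in CONTAINER_FORMATS:
        # Container headers already carry this metadata, so it must be omitted.
        if sample_rate is not None or num_channels is not None:
            raise ValueError("sample_rate/num_channels must not be set for container formats")
    elif audio_format in RAW_PCM_FORMATS:
        if sample_rate not in VALID_SAMPLE_RATES:
            raise ValueError(f"unsupported sample_rate: {sample_rate}")
        if num_channels is None or not 1 <= num_channels <= 8:
            raise ValueError(f"num_channels must be 1-8, got {num_channels}")
        params["sample_rate"] = sample_rate
        params["num_channels"] = num_channels
    else:
        raise ValueError(f"unknown audio_format: {audio_format}")
    # urlencode also escapes any special characters in the API key.
    return f"{BASE_URL}?{urlencode(params)}"
```

For example, `build_url("YOUR_API_KEY", "s16le", 16000, 1)` yields a raw PCM URL, while `build_url("YOUR_API_KEY", "mp3")` yields a container URL with no sample rate or channel count.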

Protocol

Client → server

| Message | Description |
| --- | --- |
| Binary frame | Chunk of audio bytes in the declared format (any size) |
| Empty text frame "" | Signals end of stream |

Server → client

| Message | Description |
| --- | --- |
| {"type": "frame", "frame": {...}} | Frame result, emitted after each 192ms chunk |
| {"type": "done", "duration_ms": ..., ...} | Stream complete; includes overall summary across all frames |
| {"type": "error", "error": "..."} | An error occurred; the connection will close |

Frame object

| Field | Type | Description |
| --- | --- | --- |
| start_time_ms | integer | Frame start time in milliseconds |
| end_time_ms | integer | Frame end time in milliseconds |
| music_prob | float | Music probability (0.0–1.0) |
| speech_prob | float | Speech probability (0.0–1.0) |
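Because music_prob and speech_prob are independent, a frame can carry both labels, one, or neither. A minimal sketch of turning the two probabilities into labels follows; `label_frame` is a hypothetical helper, and the 0.5 threshold is an assumption chosen for illustration, since the API does not state a decision threshold.

```python
from typing import Set


def label_frame(frame: dict, threshold: float = 0.5) -> Set[str]:
    """Return every label whose probability meets the (assumed) threshold."""
    labels = set()
    # Each probability is tested on its own: the labels are not mutually exclusive.
    if frame["music_prob"] >= threshold:
        labels.add("music")
    if frame["speech_prob"] >= threshold:
        labels.add("speech")
    return labels


# Music with vocals: both probabilities are high, so both labels apply.
print(label_frame({"music_prob": 0.92, "speech_prob": 0.81}))
# Silence or noise: neither label applies, so the set is empty.
print(label_frame({"music_prob": 0.03, "speech_prob": 0.07}))
```

A suitable threshold depends on your tolerance for false positives and is worth tuning against your own audio.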

Done object

| Field | Type | Description |
| --- | --- | --- |
| duration_ms | integer | Total audio duration processed in milliseconds |
| frame_count | integer | Total number of frames returned |
| music_pct | float | Percentage of the analysed duration classified as music, rounded to one decimal place. 0.0 when no audio was analysed |
| speech_pct | float | Percentage of the analysed duration classified as speech, rounded to one decimal place. 0.0 when no audio was analysed |
| primary_label | string | Dominant classification: "music", "speech", "neither", or "unknown" (no frames produced from the audio) |
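The three server message types can be decoded into typed objects, which makes downstream code easier to check. This is a sketch only: the `Frame`, `Done`, and `StreamError` classes and the `parse_message` function are hypothetical names, with fields taken from the tables above.

```python
import json
from dataclasses import dataclass, fields
from typing import Union


@dataclass(frozen=True)
class Frame:
    start_time_ms: int
    end_time_ms: int
    music_prob: float
    speech_prob: float


@dataclass(frozen=True)
class Done:
    duration_ms: int
    frame_count: int
    music_pct: float
    speech_pct: float
    primary_label: str


class StreamError(Exception):
    """Raised when the server sends an error message."""


def _pick(cls, payload: dict):
    # Copy only the documented fields, so extra keys do not break construction.
    return cls(**{f.name: payload[f.name] for f in fields(cls)})


def parse_message(raw: str) -> Union[Frame, Done]:
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "frame":
        return _pick(Frame, msg["frame"])  # frame fields are nested under "frame"
    if kind == "done":
        return _pick(Done, msg)            # summary fields sit at the top level
    if kind == "error":
        raise StreamError(msg["error"])
    raise ValueError(f"unexpected message type: {kind!r}")
```

A receive loop can then branch with `isinstance(result, Frame)` or `isinstance(result, Done)` instead of comparing strings.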

WebSocket close codes

| Code | Meaning |
| --- | --- |
| 1000 | Normal closure after a successful done message |
| 1003 | Invalid query parameters (unknown format, bad sample rate, missing audio_format) |
| 4002 | Audio could not be decoded or does not match the declared format |
| 4003 | Access denied or server-side usage check failed |
| 4029 | Insufficient credits |
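For logging or user-facing errors, the close codes can be kept in a lookup table. The sketch below uses hypothetical names (`CLOSE_CODES`, `describe_close`) and only restates the meanings documented above.

```python
CLOSE_CODES = {
    1000: "Normal closure after a successful done message",
    1003: "Invalid query parameters (unknown format, bad sample rate, missing audio_format)",
    4002: "Audio could not be decoded or does not match the declared format",
    4003: "Access denied or server-side usage check failed",
    4029: "Insufficient credits",
}


def describe_close(code: int) -> str:
    """Map a WebSocket close code to its documented meaning."""
    # Codes outside the documented set are reported rather than guessed at.
    return CLOSE_CODES.get(code, f"Unrecognised close code {code}")


print(describe_close(4029))  # Insufficient credits
```

How you obtain the close code depends on your WebSocket client. With the Python websockets library it is exposed on the connection-closed exception, though the exact attribute varies between library versions.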

Chunking behaviour

Audio is buffered in 192ms chunks (one output frame each). Frames are emitted as soon as each chunk is ready, so results begin arriving within 192ms of the first audio being received. At end-of-stream, any remaining audio ≥ 192ms is processed and its frames are emitted before the done message.
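For raw PCM, the relationship between bytes sent and frames returned is simple arithmetic. The following sketch uses hypothetical helpers (`audio_ms`, `complete_frames`) and assumes that every frame covers exactly 192ms and that a trailing remainder shorter than 192ms produces no frame, which is an inference from the description above rather than a documented guarantee.

```python
FRAME_MS = 192  # frame length stated in this section


def audio_ms(num_bytes: int, sample_rate: int, num_channels: int, bytes_per_sample: int) -> float:
    """Duration in milliseconds represented by num_bytes of raw PCM."""
    bytes_per_second = sample_rate * num_channels * bytes_per_sample
    return num_bytes * 1000 / bytes_per_second


def complete_frames(duration_ms: float) -> int:
    """Number of whole 192ms frames that fit in duration_ms (assumed behaviour)."""
    return int(duration_ms // FRAME_MS)


# A 16000-byte chunk of s16le (2 bytes per sample), 16kHz, mono:
chunk_ms = audio_ms(16000, sample_rate=16000, num_channels=1, bytes_per_sample=2)
print(chunk_ms)                   # 500.0
print(complete_frames(chunk_ms))  # 2
```

In this example a 500ms chunk yields two frames (384ms), and the remaining 116ms stays buffered until more audio arrives.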

Examples

import asyncio
import websockets
import json

WS_URL = "wss://modulate-developer-apis.com/api/velma-2-music-detection-streaming"
API_KEY = "YOUR_API_KEY"

async def stream_audio(file_path: str) -> None:
    url = (
        f"{WS_URL}?api_key={API_KEY}"
        f"&audio_format=s16le&sample_rate=16000&num_channels=1"
    )
    async with websockets.connect(url) as ws:
        # Send audio in chunks
        with open(file_path, "rb") as f:
            while chunk := f.read(16000):  # 0.5s of s16le/16kHz mono
                await ws.send(chunk)

        # Signal end of stream
        await ws.send("")

        # Receive results
        async for message in ws:
            msg = json.loads(message)
            if msg["type"] == "frame":
                frame = msg["frame"]
                print(
                    f"{frame['start_time_ms']}ms – {frame['end_time_ms']}ms  "
                    f"music={frame['music_prob']:.4f}  speech={frame['speech_prob']:.4f}"
                )
            elif msg["type"] == "done":
                print(f"\nDone — {msg['duration_ms']}ms, {msg['frame_count']} frames")
                print(f"music={msg['music_pct']}%  speech={msg['speech_pct']}%  label={msg['primary_label']}")
                break
            elif msg["type"] == "error":
                raise RuntimeError(f"Server error: {msg['error']}")

asyncio.run(stream_audio("/path/to/audio.raw"))
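The example above sends the whole file before it reads any results. For long files or live input, you may prefer to read frames while audio is still being sent. The sketch below shows one way to do that with asyncio. The function names are hypothetical, and `ws` is assumed to be any connection object with an async `send` method that can be iterated with `async for`, such as the one returned by `websockets.connect`.

```python
import asyncio
import json
from typing import Iterable, List


async def send_audio(ws, chunks: Iterable[bytes]) -> None:
    """Send every chunk, then the empty text frame that ends the stream."""
    for chunk in chunks:
        await ws.send(chunk)
    await ws.send("")


async def receive_results(ws) -> List[dict]:
    """Collect messages until the done message arrives; raise on error."""
    results = []
    async for message in ws:
        msg = json.loads(message)
        if msg["type"] == "error":
            raise RuntimeError(f"Server error: {msg['error']}")
        results.append(msg)
        if msg["type"] == "done":
            break
    return results


async def stream_concurrently(ws, chunks: Iterable[bytes]) -> List[dict]:
    """Run the sender in the background while this coroutine receives."""
    sender = asyncio.create_task(send_audio(ws, chunks))
    try:
        return await receive_results(ws)
    finally:
        # If receiving stopped early (for example on an error), stop sending too.
        if not sender.done():
            sender.cancel()
        # Wait for the sender to finish; its own exceptions are not re-raised here.
        await asyncio.gather(sender, return_exceptions=True)
```

Inside `async with websockets.connect(url) as ws:`, a file opened as `f` could be streamed with `await stream_concurrently(ws, iter(lambda: f.read(16000), b""))`.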

Rate limits

  • Concurrent connection limits apply per organization
  • Monthly usage limits (in audio hours) apply per organization