Endpoint
Authentication
Pass your API key as the api_key query parameter on the connection URL.
Features
- Real-time output — frames emitted progressively after each 192ms chunk of audio
- Music detection — identifies frames containing music content
- Speech detection — identifies frames containing speech content
- Non-exclusive labels — music and speech are independent; both can be high simultaneously (e.g. music with vocals)
- Any chunk size — send audio in whatever chunk size suits your pipeline
- Container and raw PCM support — stream compressed files or raw PCM directly from a microphone
Connection parameters
| Parameter | Required | Description |
|---|---|---|
| api_key | Yes | Your API key |
| audio_format | Yes | Audio format; see supported formats below |
| sample_rate | Raw PCM only | Sample rate in Hz |
| num_channels | Raw PCM only | Number of channels (1–8) |
Supported audio formats
Container formats (sample_rate and num_channels must not be specified, because the headers already carry this metadata):
wav, mp3, ogg, flac, webm, aac, aiff
Raw PCM formats (sample_rate and num_channels are required):
s16le, s16be, s32le, s32be, s24le, s24be, s8, u8, u16le, u16be, u24le, u24be, u32le, u32be, f32le, f32be, f64le, f64be, mulaw, alaw
Valid sample rates: 8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000
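The parameter rules above can be checked on the client before connecting. The following is a minimal sketch, not an official client: `build_query` is a hypothetical helper name, and it only produces the query string, since the endpoint URL itself is not reproduced here.

```python
from urllib.parse import urlencode

CONTAINER_FORMATS = {"wav", "mp3", "ogg", "flac", "webm", "aac", "aiff"}
RAW_PCM_FORMATS = {
    "s16le", "s16be", "s32le", "s32be", "s24le", "s24be", "s8", "u8",
    "u16le", "u16be", "u24le", "u24be", "u32le", "u32be",
    "f32le", "f32be", "f64le", "f64be", "mulaw", "alaw",
}
VALID_SAMPLE_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000, 96000}


def build_query(api_key, audio_format, sample_rate=None, num_channels=None):
    """Return a URL query string, validated against the documented rules.

    Hypothetical helper: the name and error messages are illustrative only.
    """
    params = {"api_key": api_key, "audio_format": audio_format}
    if audio_format in CONTAINER_FORMATS:
        # Container headers already carry rate and channel metadata.
        if sample_rate is not None or num_channels is not None:
            raise ValueError("container formats must not set sample_rate or num_channels")
    elif audio_format in RAW_PCM_FORMATS:
        # Raw PCM has no header, so both values are required.
        if sample_rate not in VALID_SAMPLE_RATES:
            raise ValueError(f"unsupported sample_rate: {sample_rate}")
        if num_channels is None or not 1 <= num_channels <= 8:
            raise ValueError(f"num_channels must be between 1 and 8, got {num_channels}")
        params["sample_rate"] = sample_rate
        params["num_channels"] = num_channels
    else:
        raise ValueError(f"unknown audio_format: {audio_format}")
    return urlencode(params)
```

Validating locally is a convenience; the server remains the authority and closes the connection with code 1003 when the query parameters are invalid.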
Protocol
Client → server
| Message | Description |
|---|---|
| Binary frame | Chunk of audio bytes in the declared format (any size) |
| Empty text frame "" | Signals end of stream |
Server → client
| Message | Description |
|---|---|
| {"type": "frame", "frame": {...}} | Frame result, emitted after each 192ms chunk |
| {"type": "done", "duration_ms": ..., ...} | Stream complete; includes an overall summary across all frames |
| {"type": "error", "error": "..."} | An error occurred; the connection will close |
Frame object
| Field | Type | Description |
|---|---|---|
| start_time_ms | integer | Frame start time in milliseconds |
| end_time_ms | integer | Frame end time in milliseconds |
| music_prob | float | Music probability (0.0–1.0) |
| speech_prob | float | Speech probability (0.0–1.0) |
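Because the two probabilities are independent, a frame can carry both labels, one, or neither. The sketch below shows one way to turn a frame message into labels. The `Frame` class, `parse_frame`, and the 0.5 threshold are illustrative assumptions; the API does not prescribe a threshold.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    start_time_ms: int
    end_time_ms: int
    music_prob: float
    speech_prob: float

    def labels(self, threshold=0.5):
        """Return every label whose probability meets the threshold.

        The 0.5 default is an assumption chosen for illustration.
        """
        found = []
        if self.music_prob >= threshold:
            found.append("music")
        if self.speech_prob >= threshold:
            found.append("speech")
        return found


def parse_frame(message):
    """Build a Frame from a decoded {"type": "frame", ...} message."""
    data = message["frame"]
    return Frame(
        start_time_ms=data["start_time_ms"],
        end_time_ms=data["end_time_ms"],
        music_prob=data["music_prob"],
        speech_prob=data["speech_prob"],
    )
```

A song with vocals would typically produce frames where `labels()` returns both entries, while silence would return an empty list.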
Done object
| Field | Type | Description |
|---|---|---|
| duration_ms | integer | Total audio duration processed in milliseconds |
| frame_count | integer | Total number of frames returned |
| music_pct | float | Percentage of the analysed duration classified as music, rounded to one decimal place. 0.0 when no audio was analysed |
| speech_pct | float | Percentage of the analysed duration classified as speech, rounded to one decimal place. 0.0 when no audio was analysed |
| primary_label | string | Dominant classification: "music", "speech", "neither", or "unknown" (no frames produced from the audio) |
WebSocket close codes
| Code | Meaning |
|---|---|
| 1000 | Normal closure after a successful done message |
| 1003 | Invalid query parameters (unknown format, bad sample rate, missing audio_format) |
| 4002 | Audio could not be decoded or does not match the declared format |
| 4003 | Access denied or server-side usage check failed |
| 4029 | Insufficient credits |
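A client may want to translate these codes into readable messages when a connection closes. The following is a small sketch; `describe_close` and `is_success` are hypothetical helper names, and the fallback text for unlisted codes is an assumption.

```python
# Close codes and meanings copied from the table above.
CLOSE_CODES = {
    1000: "Normal closure after a successful done message",
    1003: "Invalid query parameters",
    4002: "Audio could not be decoded or does not match the declared format",
    4003: "Access denied or server-side usage check failed",
    4029: "Insufficient credits",
}


def describe_close(code):
    """Return a readable meaning for a close code, with a generic fallback."""
    return CLOSE_CODES.get(code, f"Unrecognised close code {code}")


def is_success(code):
    """Only 1000 indicates that the stream completed normally."""
    return code == 1000
```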
Chunking behaviour
Audio is buffered in 192ms chunks, each of which produces one output frame. Frames are emitted as soon as each chunk is ready, so results begin arriving within 192ms of the first audio being received. At end-of-stream, any remaining audio ≥ 192ms is processed and its frames are emitted before the done message.
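The arithmetic behind this can be sketched as follows. The code assumes that frames are contiguous, that frame i spans i × 192ms to (i + 1) × 192ms, and that a trailing remainder shorter than 192ms produces no frame. The text above implies the last point but does not state it outright, so treat these helpers as an estimate rather than a guarantee.

```python
CHUNK_MS = 192  # documented chunk length, one output frame per chunk


def expected_frame_count(duration_ms):
    """Estimate the number of frames for a stream of the given duration.

    Assumes a trailing remainder shorter than CHUNK_MS yields no frame.
    """
    return duration_ms // CHUNK_MS


def frame_bounds(index):
    """Return the assumed (start_ms, end_ms) of the frame at a zero-based index."""
    return index * CHUNK_MS, (index + 1) * CHUNK_MS
```

For example, one second of audio contains five full 192ms chunks (960ms), leaving a 40ms remainder that falls below the chunk length.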
Examples
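The protocol described above can be exercised with a short client loop. This is a sketch under stated assumptions rather than an official client: `classify_stream` is a hypothetical name, and `ws` stands for any connection object exposing awaitable `send` and `recv` methods (as connections from the third-party `websockets` package do). Opening the connection is left out because the endpoint URL is not reproduced here.

```python
import asyncio
import json


async def classify_stream(ws, chunks):
    """Stream audio chunks over `ws` and collect results until the stream ends.

    `ws` is assumed to offer `await ws.send(data)` and `await ws.recv()`.
    Returns (frames, done), where frames is a list of frame objects and
    done is the final summary message.
    """

    async def sender():
        for chunk in chunks:
            await ws.send(chunk)  # bytes are sent as a binary frame
        await ws.send("")  # empty text frame signals end of stream

    frames, done = [], None
    # Send in the background so frames can be read as they arrive.
    send_task = asyncio.create_task(sender())
    try:
        while True:
            message = json.loads(await ws.recv())
            if message["type"] == "frame":
                frames.append(message["frame"])
            elif message["type"] == "done":
                done = message
                break
            elif message["type"] == "error":
                raise RuntimeError(message["error"])
    finally:
        # Stop sending if the server ended the stream early; this is a
        # no-op when the sender has already finished.
        send_task.cancel()
    return frames, done
```

Sending and receiving run concurrently so that a long stream does not stall while the server waits for its frames to be read. Chunks can be any size, so `chunks` may be file reads, microphone buffers, or any other iterable of bytes.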
Rate limits
- Concurrent connection limits apply per organization
- Monthly usage limits (in audio hours) apply per organization