POST /api/velma-2-music-detection-batch
Detect music and speech in an audio file
curl --request POST \
  --url https://modulate-developer-apis.com/api/velma-2-music-detection-batch \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form upload_file='@example-file'
{
  "filename": "my_audio.wav",
  "duration_s": 5.76,
  "primary_label": "speech",
  "music_pct": 0,
  "speech_pct": 86.7,
  "latency_ms": 1243.5,
  "frames": [
    {
      "start_time_ms": 0,
      "end_time_ms": 192,
      "music_prob": 0.0213,
      "speech_prob": 0.9888
    },
    {
      "start_time_ms": 192,
      "end_time_ms": 384,
      "music_prob": 0.0204,
      "speech_prob": 0.9931
    }
  ]
}
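
Client code may find it convenient to load the response shown above into typed objects. The following is a minimal Python sketch, not an official client: the names `Frame`, `DetectionResult`, and `parse_detection` are hypothetical, and the numeric types are assumptions based on the example values.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    # Assumption: times are milliseconds as numbers; the example shows integers.
    start_time_ms: float
    end_time_ms: float
    music_prob: float
    speech_prob: float


@dataclass
class DetectionResult:
    filename: str
    duration_s: float
    primary_label: str
    music_pct: float
    speech_pct: float
    latency_ms: float
    frames: List[Frame]


def parse_detection(body: str) -> DetectionResult:
    """Parse a JSON response body into a DetectionResult."""
    data = json.loads(body)
    frames = [
        Frame(
            start_time_ms=f["start_time_ms"],
            end_time_ms=f["end_time_ms"],
            music_prob=f["music_prob"],
            speech_prob=f["speech_prob"],
        )
        for f in data["frames"]
    ]
    return DetectionResult(
        filename=data["filename"],
        duration_s=data["duration_s"],
        primary_label=data["primary_label"],
        music_pct=data["music_pct"],
        speech_pct=data["speech_pct"],
        latency_ms=data["latency_ms"],
        frames=frames,
    )
```

Fields are picked by name rather than unpacked wholesale, so an extra key in the response would be ignored instead of raising an error.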

Authorizations

X-API-Key
string
header
required

API key used for authentication and usage tracking.

Body

multipart/form-data
upload_file
file
required

Audio file to analyse. Must be non-empty and of a supported format. Maximum file size: 100 MB.
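
The upload constraints above can be checked on the client before a request is sent. The sketch below uses only the Python standard library to validate the payload and build the multipart request. The helper names `check_upload` and `build_request` are hypothetical; reading "100 MB" as 100 * 1024 * 1024 bytes and sending the part as `application/octet-stream` are both assumptions, since the documentation specifies neither.

```python
import urllib.request
import uuid

API_URL = "https://modulate-developer-apis.com/api/velma-2-music-detection-batch"
# Assumption: the docs do not say whether "100 MB" is decimal or binary.
MAX_BYTES = 100 * 1024 * 1024


def check_upload(data: bytes) -> None:
    """Raise ValueError if the payload breaks the documented constraints."""
    if len(data) == 0:
        raise ValueError("audio file is empty")
    if len(data) > MAX_BYTES:
        raise ValueError("audio file exceeds the 100 MB limit")


def build_request(filename: str, data: bytes, api_key: str) -> urllib.request.Request:
    """Build a POST request carrying `data` in the `upload_file` form field."""
    check_upload(data)
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="upload_file"; filename="{filename}"\r\n'
        # Assumption: a generic part type; the server may expect an audio type.
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=head + data + tail,
        method="POST",
        headers={
            "X-API-Key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request. The sketch does not escape quotes in `filename`, so it is only suitable for simple file names.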

Response

Detection completed successfully.

filename
string
required

Name of the submitted audio file. Empty string if no filename was provided in the upload.

Example:

"my_audio.wav"

duration_s
number<double>
required

Total duration of the analysed audio in seconds.

Required range: x >= 0
Example:

5.76

primary_label
enum<string>
required

Overall classification of the clip:

  • music - music covers at least as much of the clip as speech, and more than zero.
  • speech - speech covers more of the clip than music, and more than zero.
  • neither - neither music nor speech reached the dominant threshold for any portion of the clip.
  • unknown - no frames could be produced from the audio.
Available options: music, speech, neither, unknown
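
One way to read the four rules above is as a small decision function. This is an illustrative sketch only: it assumes that "covers" corresponds to the `music_pct` and `speech_pct` values and that "no frames" means an empty `frames` list, neither of which the documentation states explicitly. The function name `expected_label` is hypothetical.

```python
def expected_label(music_pct: float, speech_pct: float, frame_count: int) -> str:
    """Return the label the documented rules would suggest for these inputs."""
    if frame_count == 0:
        return "unknown"  # no frames could be produced
    if music_pct > 0 and music_pct >= speech_pct:
        return "music"  # ties go to music ("at least as much")
    if speech_pct > 0 and speech_pct > music_pct:
        return "speech"
    return "neither"  # both percentages are zero
```

With the example response (`music_pct` 0, `speech_pct` 86.7, two frames) this returns `"speech"`, matching the documented `primary_label`.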
Example:

"speech"

music_pct
number<double>
required

Percentage of the clip classified as containing music.

Required range: 0 <= x <= 100
Example:

0

speech_pct
number<double>
required

Percentage of the clip classified as containing speech.

Required range: 0 <= x <= 100
Example:

86.7

latency_ms
number<double>
required

End-to-end inference time in milliseconds.

Required range: x >= 0
Example:

1243.5
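
The ranges and options documented for the scalar fields above can be checked defensively on the client side. The following sketch is one possible approach; `find_violations` is a hypothetical helper, and it checks only the constraints stated on this page.

```python
from typing import Any, Dict, List

SCALAR_FIELDS = (
    "filename", "duration_s", "primary_label",
    "music_pct", "speech_pct", "latency_ms",
)
LABELS = {"music", "speech", "neither", "unknown"}


def find_violations(result: Dict[str, Any]) -> List[str]:
    """List every documented constraint that `result` does not satisfy."""
    problems = [f"missing field: {name}" for name in SCALAR_FIELDS if name not in result]
    if problems:
        return problems  # range checks need all fields present
    if result["duration_s"] < 0:
        problems.append("duration_s must be >= 0")
    for name in ("music_pct", "speech_pct"):
        if not 0 <= result[name] <= 100:
            problems.append(f"{name} must be within 0..100")
    if result["latency_ms"] < 0:
        problems.append("latency_ms must be >= 0")
    if result["primary_label"] not in LABELS:
        problems.append("primary_label is not a documented option")
    return problems
```

An empty list means the response is consistent with the documented ranges. Note that `music_pct` and `speech_pct` are checked independently; in the example they sum to 86.7, so the sketch does not assume they add up to 100.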

frames
object[]
required

Ordered list of per-frame classification results covering the full duration of the clip.
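
Because the frames are ordered and contiguous in the example response, they can be merged into longer segments. The sketch below does this for frames whose probability meets a cut-off. The helper name `merge_segments` and the default `threshold` of 0.5 are assumptions for illustration; the documentation refers to a dominant threshold but does not give its value.

```python
from typing import Dict, List, Tuple


def merge_segments(
    frames: List[Dict[str, float]],
    key: str = "speech_prob",
    threshold: float = 0.5,  # hypothetical cut-off, not the API's own threshold
) -> List[Tuple[float, float]]:
    """Merge adjacent frames with frame[key] >= threshold into (start_ms, end_ms) spans."""
    segments: List[Tuple[float, float]] = []
    for frame in frames:
        if frame[key] < threshold:
            continue
        start, end = frame["start_time_ms"], frame["end_time_ms"]
        if segments and segments[-1][1] == start:
            # This frame starts where the previous span ends: extend the span.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments
```

Applied to the two frames in the example response, the default call yields a single span from 0 ms to 384 ms, and `key="music_prob"` yields no spans.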