POST /api/velma-2-ai-music-detection-batch
Detect AI-generated music in an audio file
curl --request POST \
  --url https://platform.modulate.ai/api/velma-2-ai-music-detection-batch \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form upload_file='@example-file'
{
  "filename": "my_audio.mp3",
  "duration_s": 89.28,
  "primary_verdict": "ai-vocal-music",
  "vocal_percentage": 87.5,
  "vocal_ai_percentage": 56.5,
  "vocal_ai_confidence": 0.96,
  "instrumental_percentage": 64.3,
  "instrumental_ai_percentage": 10.5,
  "instrumental_ai_confidence": 0.95,
  "silence_percentage": 3.51,
  "windows": [
    {
      "start_time_ms": 0,
      "end_time_ms": 4000,
      "vocal_percentage": 100,
      "vocal_ai_percentage": 100,
      "vocal_ai_confidence": 0.97,
      "instrumental_percentage": 79,
      "instrumental_ai_percentage": 0,
      "instrumental_ai_confidence": 0,
      "silence_percentage": 0
    },
    {
      "start_time_ms": 4000,
      "end_time_ms": 8000,
      "vocal_percentage": 0,
      "vocal_ai_percentage": 0,
      "vocal_ai_confidence": 0,
      "instrumental_percentage": 82,
      "instrumental_ai_percentage": 0,
      "instrumental_ai_confidence": 0.76,
      "silence_percentage": 18
    }
  ],
  "latency_ms": 1333
}

Authorizations

X-API-Key
string
header
required

API key used for authentication and usage tracking.

Body

multipart/form-data
upload_file
file
required

Audio file to analyse. Must be non-empty. Supported formats: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav. Maximum file size: 100 MB.
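
The constraints above can be checked locally before uploading. The following is a minimal sketch, not part of the API: the helper name `check_upload` is hypothetical, and it assumes "100 MB" means 100,000,000 bytes (the more conservative reading; the page does not say whether the limit is decimal or binary). The server remains the authority on what it accepts.

```python
from pathlib import Path

# Formats listed in the Body section of this page.
SUPPORTED_EXTENSIONS = {".aac", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".opus", ".wav"}

# Assumption: 1 MB = 1,000,000 bytes. The page only states "100 MB".
MAX_FILE_SIZE_BYTES = 100 * 1000 * 1000


def check_upload(path: str) -> list[str]:
    """Return a list of problems; an empty list means the local checks passed."""
    problems = []
    file = Path(path)

    # Compare extensions case-insensitively, so "clip.MP3" is treated like "clip.mp3".
    if file.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {file.suffix!r}")

    size = file.stat().st_size
    if size == 0:
        problems.append("file is empty")
    elif size > MAX_FILE_SIZE_BYTES:
        problems.append(f"file is {size} bytes, above the {MAX_FILE_SIZE_BYTES}-byte limit")

    return problems
```

A caller would typically skip the request when `check_upload(path)` returns a non-empty list, which avoids spending an upload on a file that fails these documented constraints.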

Response

Detection completed successfully.

filename
string
required

Name of the submitted audio file. Empty string if no filename was provided in the upload.

Example:

"my_audio.mp3"

duration_s
number<double>
required

Total duration of the analysed audio in seconds.

Required range: x >= 0
Example:

89.28

primary_verdict
enum<string>
required

Clip-level classification:

  • ai-vocal-music - AI-generated music with a detected synthetic voice (covers AI songs and AI synthetic vocal tracks).
  • ai-instrumental - AI-generated instrumental music with no detectable synthetic voice.
  • not-ai-music - the clip does not appear to contain AI-generated music.
Available options:
ai-vocal-music,
ai-instrumental,
not-ai-music
Example:

"ai-vocal-music"

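On the client side, the three documented verdict values could be modelled as an enum so that an unexpected value fails loudly instead of being silently ignored. This is a sketch only; the names `PrimaryVerdict`, `parse_verdict`, and `contains_ai_music` are hypothetical and not part of the API.

```python
from enum import Enum


class PrimaryVerdict(Enum):
    """The three primary_verdict values documented on this page."""
    AI_VOCAL_MUSIC = "ai-vocal-music"
    AI_INSTRUMENTAL = "ai-instrumental"
    NOT_AI_MUSIC = "not-ai-music"


def parse_verdict(response: dict) -> PrimaryVerdict:
    """Convert the response's primary_verdict string; raises ValueError if it is unknown."""
    return PrimaryVerdict(response["primary_verdict"])


def contains_ai_music(verdict: PrimaryVerdict) -> bool:
    """True for both AI verdicts, False for not-ai-music."""
    return verdict is not PrimaryVerdict.NOT_AI_MUSIC


verdict = parse_verdict({"primary_verdict": "ai-vocal-music"})
print(verdict, contains_ai_music(verdict))  # PrimaryVerdict.AI_VOCAL_MUSIC True
```
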
vocal_percentage
number<double>
required

Clip-level percentage of the audio that contains vocal content, averaged across all windows.

Required range: 0 <= x <= 100
Example:

87.5
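
To illustrate the averaging, the sketch below takes an unweighted mean of a per-window field. It assumes every window counts equally, which the page does not state explicitly, and the helper name `mean_over_windows` is hypothetical. The input is the two windows shown in the sample response, so the results do not reproduce that response's clip-level values, which cover the whole 89.28 s clip.

```python
def mean_over_windows(windows: list[dict], field: str) -> float:
    """Unweighted mean of one per-window field; 0.0 when there are no windows."""
    if not windows:
        return 0.0
    return sum(w[field] for w in windows) / len(windows)


# Relevant fields of the two windows from the sample response on this page.
windows = [
    {"vocal_percentage": 100, "instrumental_percentage": 79, "silence_percentage": 0},
    {"vocal_percentage": 0, "instrumental_percentage": 82, "silence_percentage": 18},
]

print(mean_over_windows(windows, "vocal_percentage"))         # 50.0
print(mean_over_windows(windows, "instrumental_percentage"))  # 80.5
print(mean_over_windows(windows, "silence_percentage"))       # 9.0
```

The same helper applies to `instrumental_percentage` and `silence_percentage`, which are described with the same "averaged across all windows" wording.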

vocal_ai_percentage
number<double>
required

Percentage of the clip duration classified as AI-generated vocals. Computed as (seconds of windows classified as AI vocals) / (total clip seconds) * 100. A window contributes its full duration when it is classified as AI-generated vocals; zero otherwise.

Required range: 0 <= x <= 100
Example:

56.5
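
The formula above can be worked through on a small case. The response does not expose a per-window "classified as AI vocals" flag, so the sketch below takes that classification as a hypothetical input list (`ai_vocal_flags`); how the service derives it is not documented here. The function name and the 8-second clip are also illustrative.

```python
def vocal_ai_percentage(windows: list[dict], ai_vocal_flags: list[bool], duration_s: float) -> float:
    """(seconds of windows classified as AI vocals) / (total clip seconds) * 100."""
    if duration_s <= 0:
        return 0.0
    # A flagged window contributes its full duration; an unflagged window contributes zero.
    ai_ms = sum(
        w["end_time_ms"] - w["start_time_ms"]
        for w, is_ai in zip(windows, ai_vocal_flags)
        if is_ai
    )
    return ai_ms / 1000 / duration_s * 100


# Hypothetical 8-second clip made of two 4-second windows, the first flagged as AI vocals.
windows = [
    {"start_time_ms": 0, "end_time_ms": 4000},
    {"start_time_ms": 4000, "end_time_ms": 8000},
]
print(vocal_ai_percentage(windows, [True, False], duration_s=8.0))  # 50.0
```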

vocal_ai_confidence
number<double>
required

Average confidence that the vocal windows contain AI-generated vocals, across all windows with vocal content. Not diluted by non-vocal windows.

Required range: 0 <= x <= 1
Example:

0.89
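
A sketch of the "not diluted by non-vocal windows" behaviour: only windows with vocal content enter the average. It assumes that "vocal content" means `vocal_percentage > 0` and that the mean is unweighted; neither is confirmed by this page, and the function name is hypothetical.

```python
def vocal_ai_confidence(windows: list[dict]) -> float:
    """Mean vocal_ai_confidence over windows with vocal content; 0.0 if there are none."""
    # Assumption: a window has vocal content when its vocal_percentage is above zero.
    vocal = [w["vocal_ai_confidence"] for w in windows if w["vocal_percentage"] > 0]
    if not vocal:
        return 0.0
    return sum(vocal) / len(vocal)


# Relevant fields of the two windows from the sample response on this page.
windows = [
    {"vocal_percentage": 100, "vocal_ai_confidence": 0.97},
    {"vocal_percentage": 0, "vocal_ai_confidence": 0},
]
print(vocal_ai_confidence(windows))  # 0.97
```

Averaging over both windows would give 0.485 instead; excluding the non-vocal window keeps the value at 0.97.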

instrumental_percentage
number<double>
required

Clip-level percentage of the audio that contains instrumental music content, averaged across all windows.

Required range: 0 <= x <= 100
Example:

64.3

instrumental_ai_percentage
number<double>
required

Percentage of non-vocal, non-silent windows classified as AI-generated instrumental content (0-100 scale).

Required range: 0 <= x <= 100
Example:

10.5

instrumental_ai_confidence
number<double>
required

Maximum confidence that a non-vocal window contains AI-generated instrumental content. Zero if no such window was found.

Required range: 0 <= x <= 1
Example:

0.95
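
Unlike the vocal confidence, this field is a maximum rather than an average. The sketch below assumes a "non-vocal window" is one whose `vocal_percentage` is 0, which this page does not spell out; the function name is hypothetical.

```python
def instrumental_ai_confidence(windows: list[dict]) -> float:
    """Maximum instrumental_ai_confidence over non-vocal windows; 0.0 if there are none."""
    # Assumption: a window is non-vocal when its vocal_percentage is exactly zero.
    non_vocal = [w["instrumental_ai_confidence"] for w in windows if w["vocal_percentage"] == 0]
    return max(non_vocal, default=0.0)


# Relevant fields of the two windows from the sample response on this page.
windows = [
    {"vocal_percentage": 100, "instrumental_ai_confidence": 0},
    {"vocal_percentage": 0, "instrumental_ai_confidence": 0.76},
]
print(instrumental_ai_confidence(windows))  # 0.76
```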

silence_percentage
number<double>
required

Clip-level percentage of the audio that contains neither vocal nor instrumental content, averaged across all windows.

Required range: 0 <= x <= 100
Example:

3.51

windows
object[]
required

Per-window breakdown of detection results.
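
The window schema is not expanded on this page, so the sketch below infers the fields and their types from the sample response. The `Window` class and `parse_windows` helper are hypothetical client-side names, not part of the API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """One entry of the windows array, with fields as seen in the sample response."""
    start_time_ms: int
    end_time_ms: int
    vocal_percentage: float
    vocal_ai_percentage: float
    vocal_ai_confidence: float
    instrumental_percentage: float
    instrumental_ai_percentage: float
    instrumental_ai_confidence: float
    silence_percentage: float

    @property
    def duration_ms(self) -> int:
        return self.end_time_ms - self.start_time_ms


def parse_windows(response: dict) -> list[Window]:
    # Window(**w) raises TypeError if a field is missing or an unexpected one appears.
    return [Window(**w) for w in response["windows"]]


# First window of the sample response on this page.
sample = json.loads("""
{"windows": [{"start_time_ms": 0, "end_time_ms": 4000,
              "vocal_percentage": 100, "vocal_ai_percentage": 100,
              "vocal_ai_confidence": 0.97, "instrumental_percentage": 79,
              "instrumental_ai_percentage": 0, "instrumental_ai_confidence": 0,
              "silence_percentage": 0}]}
""")
first = parse_windows(sample)[0]
print(first.duration_ms, first.vocal_ai_confidence)  # 4000 0.97
```

Strict construction is a deliberate choice here: if the service adds or renames a window field, the parse fails immediately rather than dropping data silently. A more tolerant client could filter the dictionary to known keys first.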

latency_ms
number<double>
required

End-to-end inference time in milliseconds.

Required range: x >= 0
Example:

1333