AI music detection determines whether a clip contains AI-generated music. Each window is classified by its vocal and instrumental content, then aggregated into a clip-level primary_verdict of ai-vocal-music, ai-instrumental, or not-ai-music.
This is distinct from music detection, which classifies audio as music, speech, or neither. AI music detection answers a different question: is this music AI-generated?

How windows are routed

Each window is routed to one of two detection paths based on its vocal content:
  • Vocal windows (sufficient vocal content) are classified for AI-generated vocals, producing vocal_ai_percentage and vocal_ai_confidence. Their instrumental_ai_* fields are 0.
  • Instrumental windows (insufficient vocal content) are classified for AI-generated instrumental content, producing instrumental_ai_percentage and instrumental_ai_confidence. Their vocal_ai_* fields are 0.
The clip-level primary_verdict is ai-vocal-music when the vocal AI percentage exceeds the vocal threshold; otherwise ai-instrumental when enough non-vocal, non-silent windows exceed the instrumental threshold; otherwise not-ai-music.
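The order of checks described above can be sketched as a small function. This is only an illustration of the cascade, not the service's implementation: the threshold values are not documented here, so the two constants are hypothetical, and the instrumental rule ("enough non-vocal, non-silent windows exceed the instrumental threshold") is simplified to a single comparison on instrumental_ai_percentage.

```python
from typing import Literal

Verdict = Literal["ai-vocal-music", "ai-instrumental", "not-ai-music"]

# Hypothetical values: the real thresholds are not given in this document.
VOCAL_THRESHOLD = 50.0
INSTRUMENTAL_THRESHOLD = 50.0


def primary_verdict(
    vocal_ai_percentage: float,
    instrumental_ai_percentage: float,
    vocal_threshold: float = VOCAL_THRESHOLD,
    instrumental_threshold: float = INSTRUMENTAL_THRESHOLD,
) -> Verdict:
    """Apply the checks in the documented order: vocal first, then instrumental."""
    if vocal_ai_percentage > vocal_threshold:
        return "ai-vocal-music"
    # Simplification (assumption): one clip-level comparison stands in for
    # counting non-vocal, non-silent windows above the instrumental threshold.
    if instrumental_ai_percentage > instrumental_threshold:
        return "ai-instrumental"
    return "not-ai-music"
```

Because the vocal check comes first, a clip whose vocals and instrumentals both look AI-generated is reported as ai-vocal-music.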

Batch

Send a complete audio file and receive a clip-level verdict plus a per-window breakdown.
curl -X POST https://platform.modulate.ai/api/velma-2-ai-music-detection-batch \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3"
{
  "filename": "audio.mp3",
  "duration_s": 89.28,
  "primary_verdict": "ai-vocal-music",
  "vocal_percentage": 87.5,
  "vocal_ai_percentage": 56.5,
  "vocal_ai_confidence": 0.96,
  "instrumental_percentage": 64.3,
  "instrumental_ai_percentage": 10.5,
  "instrumental_ai_confidence": 0.95,
  "silence_percentage": 3.51,
  "latency_ms": 1333.0,
  "windows": [
    {
      "start_time_ms": 0,
      "end_time_ms": 4000,
      "vocal_percentage": 100.0,
      "vocal_ai_percentage": 100.0,
      "vocal_ai_confidence": 0.97,
      "instrumental_percentage": 79.0,
      "instrumental_ai_percentage": 0.0,
      "instrumental_ai_confidence": 0.0,
      "silence_percentage": 0.0
    },
    {
      "start_time_ms": 4000,
      "end_time_ms": 8000,
      "vocal_percentage": 0.0,
      "vocal_ai_percentage": 0.0,
      "vocal_ai_confidence": 0.0,
      "instrumental_percentage": 82.0,
      "instrumental_ai_percentage": 0.0,
      "instrumental_ai_confidence": 0.76,
      "silence_percentage": 18.0
    }
  ]
}
Each window is 4 seconds long. Clip-level vocal_percentage, instrumental_percentage, and silence_percentage are averages of the per-window values. These percentages are not mutually exclusive, since a window can contain both vocals and instrumental music.
Supported formats: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav. Maximum file size is 100 MB.
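As a rough client-side sketch, the format and size limits can be checked before uploading, and a clip-level percentage can be recomputed from the windows array. The helper names upload_problem and clip_average are hypothetical, "100 MB" is assumed to mean 100 * 1024 * 1024 bytes, and the average is assumed to be an unweighted mean over all windows.

```python
import os
from statistics import fmean
from typing import List, Optional

# Extensions listed as supported by the batch endpoint.
SUPPORTED_EXTENSIONS = {".aac", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".opus", ".wav"}
# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 100 * 1024 * 1024


def upload_problem(filename: str, size_bytes: int) -> Optional[str]:
    """Return a reason the file would likely be rejected, or None if it looks fine."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return f"unsupported format: {ext or '(none)'}"
    if size_bytes > MAX_FILE_BYTES:
        return f"file too large: {size_bytes} bytes"
    return None


def clip_average(windows: List[dict], field: str) -> float:
    """Unweighted mean of one per-window field across all windows."""
    return fmean(w[field] for w in windows)
```

For a parsed batch response named response, clip_average(response["windows"], "vocal_percentage") can be compared against response["vocal_percentage"], provided the response contains every window of the clip.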

Streaming (WebSocket)

Connect over WebSocket and receive per-window vocal AI verdicts progressively as audio arrives, followed by a final clip-level summary that includes instrumental AI detection.
Vocal AI detection runs on each 4-second window as audio arrives and is reported in window messages. Instrumental AI detection runs on the accumulated audio at end-of-stream, so instrumental_ai_percentage and instrumental_ai_confidence appear only in the final done message.
websocat "wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming?api_key=$MODULATE_API_KEY&audio_format=mp3" \
  --binary - < audio.mp3
{ "type": "window", "window": { "start_time_ms": 0, "end_time_ms": 4000, "vocal_percentage": 87.5, "vocal_ai_percentage": 100.0, "vocal_ai_confidence": 0.97, "instrumental_percentage": 64.3, "silence_percentage": 3.5 } }
{ "type": "window", "window": { "start_time_ms": 4000, "end_time_ms": 8000, "vocal_percentage": 0.0, "vocal_ai_percentage": 0.0, "vocal_ai_confidence": 0.0, "instrumental_percentage": 82.0, "silence_percentage": 18.0 } }
{ "type": "done", "duration_ms": 89280, "window_count": 22, "primary_verdict": "ai-vocal-music", "vocal_percentage": 87.5, "vocal_ai_percentage": 56.5, "vocal_ai_confidence": 0.96, "instrumental_percentage": 64.3, "instrumental_ai_percentage": 42.0, "instrumental_ai_confidence": 0.95, "silence_percentage": 3.51 }
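A client has to separate the window messages from the final done message. The sketch below does this for an iterable of JSON text frames, however they were received. The function name collect_results is hypothetical, and it assumes done is the last message on the stream and that unrecognized message types can be ignored.

```python
import json
from typing import Iterable, List, Optional, Tuple


def collect_results(messages: Iterable[str]) -> Tuple[List[dict], Optional[dict]]:
    """Split raw JSON frames into per-window results and the final summary."""
    windows: List[dict] = []
    summary: Optional[dict] = None
    for raw in messages:
        msg = json.loads(raw)
        if msg["type"] == "window":
            # Per-window payload: vocal AI fields only.
            windows.append(msg["window"])
        elif msg["type"] == "done":
            # Clip-level summary, including the instrumental AI fields.
            summary = msg
            break  # assumed to be the last message on the stream
        # Any other message type is ignored in this sketch.
    return windows, summary
```

If the connection closes before done arrives, summary stays None, which signals that no clip-level verdict was received.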
The server emits one window message per completed 4-second window. Window messages carry vocal AI fields only; instrumental AI results arrive in the done message. Send an empty string ("") to signal end of stream. For raw PCM formats, pass audio_format, sample_rate, and num_channels as query parameters. Container formats (wav, mp3, ogg, flac, webm, aac, aiff) only need audio_format. See Audio formats.
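The query parameters can be assembled with the standard library, as in the following sketch. The endpoint and parameter names come from the example command above; the function name streaming_url is hypothetical, and the valid raw PCM format identifiers are not listed here, so any value passed for a raw format is a placeholder to be replaced from Audio formats.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming"


def streaming_url(
    api_key: str,
    audio_format: str,
    sample_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
) -> str:
    """Build the WebSocket URL; sample_rate and num_channels are for raw PCM only."""
    params = {"api_key": api_key, "audio_format": audio_format}
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    if num_channels is not None:
        params["num_channels"] = num_channels
    # urlencode percent-escapes values, so keys with special characters are safe.
    return f"{BASE_URL}?{urlencode(params)}"
```

For a container format, streaming_url(key, "mp3") is enough; for raw PCM, all three audio parameters would be supplied.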
Per-window results can be less accurate than the clip-level verdict, so rely on the clip-level result when judging a whole song or segment. Heavily processed or high-production tracks are sometimes mislabeled as AI-generated; this is a known gap targeted by future model updates.