AI music detection

AI music detection determines whether a clip contains AI-generated music. The clip is split into 4-second windows, each window is classified by its vocal and instrumental content, and the window results are aggregated into a clip-level primary_verdict of ai-vocal-music, ai-instrumental, or not-ai-music.

This is distinct from music detection, which classifies audio as music, speech, or neither. AI music detection answers a different question: is this music AI-generated?

How windows are routed

Each window is routed to one of two detection paths based on its vocal content:

Vocal windows (sufficient vocal content) are classified for AI-generated vocals, producing vocal_ai_percentage and vocal_ai_confidence. Their instrumental_ai_* fields are 0.
Instrumental windows (no sufficient vocal content) are classified for AI-generated instrumental content, producing instrumental_ai_percentage and instrumental_ai_confidence. Their vocal_ai_* fields are 0.
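
Because the unused path's fields are zeroed, the route a window took can be read back from a batch response. The following is a minimal sketch, assuming a window dict shaped like the batch response on this page. `window_route` is a hypothetical helper, not part of the API, and a window whose four AI fields are all 0 is reported as "unknown" rather than guessed.

```python
def window_route(window: dict) -> str:
    """Infer which detection path produced a batch window (hypothetical helper)."""
    vocal_zero = (
        window["vocal_ai_percentage"] == 0
        and window["vocal_ai_confidence"] == 0
    )
    instr_zero = (
        window["instrumental_ai_percentage"] == 0
        and window["instrumental_ai_confidence"] == 0
    )
    if instr_zero and not vocal_zero:
        return "vocal"          # only the vocal_ai_* fields are populated
    if vocal_zero and not instr_zero:
        return "instrumental"   # only the instrumental_ai_* fields are populated
    return "unknown"            # ambiguous: all zero, or both populated
```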

The clip-level primary_verdict is ai-vocal-music when the vocal AI percentage exceeds the vocal threshold; otherwise ai-instrumental when enough non-vocal, non-silent windows exceed the instrumental threshold; otherwise not-ai-music.
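
The precedence in this rule can be sketched as follows. The threshold values and the meaning of "enough" windows are not given on this page, so `VOCAL_THRESHOLD`, `INSTRUMENTAL_THRESHOLD`, and `MIN_INSTRUMENTAL_WINDOWS` are placeholders invented for illustration. The function shows only the order of the checks and is not the service's implementation.

```python
# Placeholder values: the real thresholds are not published on this page.
VOCAL_THRESHOLD = 50.0          # percent, hypothetical
INSTRUMENTAL_THRESHOLD = 50.0   # percent, hypothetical
MIN_INSTRUMENTAL_WINDOWS = 1    # hypothetical reading of "enough"

def sketch_primary_verdict(vocal_ai_percentage, instrumental_window_scores):
    """Apply the documented precedence: vocal first, then instrumental, else not AI.

    instrumental_window_scores holds instrumental_ai_percentage values for the
    non-vocal, non-silent windows only.
    """
    if vocal_ai_percentage > VOCAL_THRESHOLD:
        return "ai-vocal-music"
    over = sum(1 for s in instrumental_window_scores if s > INSTRUMENTAL_THRESHOLD)
    if over >= MIN_INSTRUMENTAL_WINDOWS:
        return "ai-instrumental"
    return "not-ai-music"
```

The vocal check runs first, so a clip that satisfies both conditions is reported as ai-vocal-music.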

Batch

Send a complete audio file and receive a clip-level verdict plus a per-window breakdown.

curl -X POST https://platform.modulate.ai/api/velma-2-ai-music-detection-batch \
  -H "X-API-Key: $MODULATE_API_KEY" \
  -F "upload_file=@audio.mp3"

import os
import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("audio.mp3", "rb") as f:
    response = requests.post(
        "https://platform.modulate.ai/api/velma-2-ai-music-detection-batch",
        headers={"X-API-Key": os.environ["MODULATE_API_KEY"]},
        files={"upload_file": f},
    )
response.raise_for_status()
result = response.json()
print(f"Verdict: {result['primary_verdict']}")
print(f"Vocal AI: {result['vocal_ai_percentage']}% ({result['vocal_ai_confidence']:.0%} conf)")
print(f"Instrumental AI: {result['instrumental_ai_percentage']}% ({result['instrumental_ai_confidence']:.0%} conf)")
for w in result["windows"]:
    print(f"  {w['start_time_ms']}ms – {w['end_time_ms']}ms  vocal_ai={w['vocal_ai_percentage']}  instr_ai={w['instrumental_ai_percentage']}")

Expected response

{
  "filename": "my_audio.mp3",
  "duration_s": 89.28,
  "primary_verdict": "ai-vocal-music",
  "vocal_percentage": 87.5,
  "vocal_ai_percentage": 56.5,
  "vocal_ai_confidence": 0.96,
  "instrumental_percentage": 64.3,
  "instrumental_ai_percentage": 10.5,
  "instrumental_ai_confidence": 0.95,
  "silence_percentage": 3.51,
  "latency_ms": 1333.0,
  "windows": [
    {
      "start_time_ms": 0,
      "end_time_ms": 4000,
      "vocal_percentage": 100.0,
      "vocal_ai_percentage": 100.0,
      "vocal_ai_confidence": 0.97,
      "instrumental_percentage": 79.0,
      "instrumental_ai_percentage": 0.0,
      "instrumental_ai_confidence": 0.0,
      "silence_percentage": 0.0
    },
    {
      "start_time_ms": 4000,
      "end_time_ms": 8000,
      "vocal_percentage": 0.0,
      "vocal_ai_percentage": 0.0,
      "vocal_ai_confidence": 0.0,
      "instrumental_percentage": 82.0,
      "instrumental_ai_percentage": 0.0,
      "instrumental_ai_confidence": 0.76,
      "silence_percentage": 18.0
    }
  ]
}

Each window is 4 seconds. Clip-level vocal_percentage, instrumental_percentage, and silence_percentage are averages of the per-window values; these are not mutually exclusive, since a window can contain both vocals and instrumental music. Supported formats: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav. Maximum file size is 100 MB.
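
A client can check a file against these limits before uploading and can recompute a clip-level average from the windows array. The following is a minimal sketch: `check_upload` and `clip_average` are hypothetical helper names, "100 MB" is assumed to mean 100 × 1024 × 1024 bytes, and the mean is assumed to be unweighted (the service may weight a short final window differently).

```python
import os

SUPPORTED_EXTENSIONS = {".aac", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".opus", ".wav"}
MAX_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" read as binary megabytes

def check_upload(path: str) -> None:
    """Raise ValueError if the file falls outside the documented limits."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds 100 MB")

def clip_average(windows: list, field: str) -> float:
    """Unweighted mean of one per-window field, e.g. 'vocal_percentage'."""
    if not windows:
        return 0.0
    return sum(w[field] for w in windows) / len(windows)
```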

Streaming (WebSocket)

Connect over WebSocket and receive per-window vocal AI verdicts progressively as audio arrives, followed by a final clip-level summary that includes instrumental AI detection.

Vocal AI detection runs on each 4-second window as audio arrives and is reported in window messages. Instrumental AI detection runs on the accumulated audio at end-of-stream, so instrumental_ai_percentage and instrumental_ai_confidence appear only in the final done message.
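
Since instrumental results exist only once the stream ends, a client can keep a provisional vocal-only view while windows arrive and switch to the done message when it appears. The following is a minimal sketch, assuming messages shaped like the examples on this page. `StreamState` is a hypothetical client-side class, and its running mean is a local convenience, not the server's aggregation.

```python
import json

class StreamState:
    """Tracks streaming results on the client side (hypothetical helper)."""

    def __init__(self):
        self.vocal_ai_scores = []  # vocal_ai_percentage of each window so far
        self.final = None          # the done message, once received

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        if msg["type"] == "window":
            self.vocal_ai_scores.append(msg["window"]["vocal_ai_percentage"])
        elif msg["type"] == "done":
            self.final = msg

    @property
    def provisional_vocal_ai(self):
        """Unweighted mean over the windows seen so far, or None if none yet."""
        if not self.vocal_ai_scores:
            return None
        return sum(self.vocal_ai_scores) / len(self.vocal_ai_scores)

    @property
    def instrumental_ai(self):
        """Available only after the done message; None before that."""
        if self.final is None:
            return None
        return self.final["instrumental_ai_percentage"]
```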

websocat "wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming?api_key=$MODULATE_API_KEY&audio_format=mp3" \
  --binary - < audio.mp3

import os, asyncio, json, websockets

API_KEY = os.environ["MODULATE_API_KEY"]
AUDIO_FILE = "audio.mp3"
CHUNK_SIZE = 65536

async def stream():
    url = (
        f"wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming"
        f"?api_key={API_KEY}&audio_format=mp3"
    )
    async with websockets.connect(url) as ws:
        async def send():
            with open(AUDIO_FILE, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)
            await ws.send("")

        async def receive():
            async for message in ws:
                msg = json.loads(message)
                if msg["type"] == "window":
                    w = msg["window"]
                    print(f"  {w['start_time_ms']}ms – {w['end_time_ms']}ms  vocal_ai={w['vocal_ai_percentage']} ({w['vocal_ai_confidence']:.0%})")
                elif msg["type"] == "done":
                    print(f"Done — {msg['duration_ms']}ms, {msg['window_count']} windows")
                    print(f"Verdict: {msg['primary_verdict']}  vocal_ai={msg['vocal_ai_percentage']}%  instr_ai={msg['instrumental_ai_percentage']}%")
                    break
                elif msg["type"] == "error":
                    raise RuntimeError(f"Server error: {msg['error']}")

        await asyncio.gather(send(), receive())

asyncio.run(stream())

Example messages received

{ "type": "window", "window": { "start_time_ms": 0, "end_time_ms": 4000, "vocal_percentage": 87.5, "vocal_ai_percentage": 100.0, "vocal_ai_confidence": 0.97, "instrumental_percentage": 64.3, "silence_percentage": 3.5 } }
{ "type": "window", "window": { "start_time_ms": 4000, "end_time_ms": 8000, "vocal_percentage": 0.0, "vocal_ai_percentage": 0.0, "vocal_ai_confidence": 0.0, "instrumental_percentage": 82.0, "silence_percentage": 18.0 } }
{ "type": "done", "duration_ms": 89280, "window_count": 22, "primary_verdict": "ai-vocal-music", "vocal_percentage": 87.5, "vocal_ai_percentage": 56.5, "vocal_ai_confidence": 0.96, "instrumental_percentage": 64.3, "instrumental_ai_percentage": 42.0, "instrumental_ai_confidence": 0.95, "silence_percentage": 3.51 }

The server emits one window message per completed 4-second window. Streaming window messages carry vocal AI fields only; instrumental AI results arrive in the done message. Send an empty string ("") to signal end of stream. For raw PCM formats, pass audio_format, sample_rate, and num_channels as query parameters. Container formats (wav, mp3, ogg, flac, webm, aac, aiff) need only audio_format. See Audio formats.
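
The query string for either case can be assembled with the standard library. This is a sketch only: `streaming_url` is a hypothetical helper, and the audio_format value to use for raw PCM is not listed on this page, so the caller must supply whatever Audio formats specifies.

```python
from urllib.parse import urlencode

BASE_URL = "wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming"

def streaming_url(api_key, audio_format, sample_rate=None, num_channels=None):
    """Build the streaming URL; sample_rate and num_channels are for raw PCM only."""
    params = {"api_key": api_key, "audio_format": audio_format}
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    if num_channels is not None:
        params["num_channels"] = num_channels
    # urlencode escapes any characters in the key that are unsafe in a URL.
    return f"{BASE_URL}?{urlencode(params)}"
```

For a container format, `streaming_url(key, "mp3")` yields the same URL shape as the examples in this section.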

Per-window results can be less accurate than the clip-level verdict, so rely on the clip-level result when judging a whole song or segment. Heavily processed or high-production tracks are sometimes mislabeled as AI-generated; this is a known gap targeted by future model updates.
