Every model is audio-native — analyzing acoustic signals, not just transcribed words, for more accurate results. Together, these models let you understand the true meaning of every conversation — and act on it in real time.

Velma

Velma: Conversation Understanding

Detect behaviors, classify conversations, identify participant roles, and extract topics with sentiment. Velma is the flagship model for developers building complete voice intelligence solutions — available as a single API call or a real-time stream.

Individual models

Speech to text

Multilingual transcription with speaker diarization, emotion, accent, and PII/PHI tagging — batch or real-time streaming.

Deepfake detection

Per-frame synthetic voice detection. Classify recorded files or stream live audio for real-time anti-spoofing.

PII/PHI redaction

Remove sensitive spans from transcripts and silence the matching audio ranges — batch and streaming.
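As a rough illustration of what redacting both transcript and audio involves, the sketch below masks character spans in a transcript and zeroes the matching sample ranges in raw PCM audio. It does not call the redaction API; the function names, the `[REDACTED]` mask, and the span and range formats are all hypothetical and chosen only for this example.

```python
from typing import List, Tuple


def redact_text(text: str, spans: List[Tuple[int, int]], mask: str = "[REDACTED]") -> str:
    """Replace each (start, end) character span with a mask.

    Spans are assumed not to overlap. They are processed right to left so
    earlier offsets stay valid while later text is replaced.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text


def silence_ranges(
    samples: List[int],
    sample_rate: int,
    ranges_sec: List[Tuple[float, float]],
) -> List[int]:
    """Return a copy of the PCM samples with each (start, end) range in seconds set to zero."""
    out = list(samples)
    for start_s, end_s in ranges_sec:
        lo = max(0, int(start_s * sample_rate))
        hi = min(len(out), int(end_s * sample_rate))
        for i in range(lo, hi):
            out[i] = 0
    return out


if __name__ == "__main__":
    # Hypothetical input: one sensitive span in the text, one matching time range in the audio.
    print(redact_text("Call me at 555-0100 today", [(11, 19)]))
    print(silence_ranges([1] * 8, sample_rate=8, ranges_sec=[(0.25, 0.5)]))
```

In practice the spans and time ranges would come from the model's output rather than being written by hand.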

Music detection

Frame-level music and speech probability scoring. Classify a file or stream audio for real-time content analysis.
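One way to turn frame-level probabilities into labels is a simple threshold rule, sketched below. The input shape (a list of `(music_prob, speech_prob)` pairs), the function names, and the `0.5` threshold are assumptions made for illustration, not the documented response format.

```python
from typing import List, Tuple


def label_frame(music_prob: float, speech_prob: float, threshold: float = 0.5) -> str:
    """Label one frame as 'music', 'speech', or 'neither' using a fixed threshold."""
    if max(music_prob, speech_prob) < threshold:
        return "neither"
    # Ties go to music; this is an arbitrary choice for the sketch.
    return "music" if music_prob >= speech_prob else "speech"


def label_frames(frames: List[Tuple[float, float]], threshold: float = 0.5) -> List[str]:
    """Apply label_frame to every (music_prob, speech_prob) pair."""
    return [label_frame(m, s, threshold) for m, s in frames]


if __name__ == "__main__":
    # Hypothetical per-frame scores.
    print(label_frames([(0.9, 0.1), (0.2, 0.8), (0.1, 0.2)]))
```

A real pipeline would likely smooth labels across neighboring frames before acting on them, since single-frame decisions can flicker.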

Language detection

Identify the spoken language of an audio file, with confidence-scored results across 100 languages in a single synchronous call.

Not sure which API fits your use case? See "Which API should I use?"

New here?

Quick start

Make your first API call in under five minutes — no SDK required.

What you can build

Meeting transcription

Multilingual transcripts with speaker labels, timestamps, and optional emotion or accent signals.

Live captions

Stream audio over WebSocket and receive utterances as they’re spoken.

Anti-spoofing

Real-time synthetic voice detection during voice authentication flows.

Compliance archives

Shareable recordings with PII/PHI silenced from both transcript and audio.

Content moderation

Classify audio as music, speech, or neither — frame by frame, at scale.

Deepfake screening

Batch-process uploaded audio to flag AI-generated voice content.