Question 1

What is SAM Audio?

Accepted Answer

SAM Audio (Segment Anything Model for Audio) is an open-source AI model that Meta released in December 2025. It is billed as the first unified multimodal model that can separate any sound from audio or video using text, visual, or temporal prompts, extending the 'Segment Anything' approach from vision to audio.

Question 2

How is SAM Audio different from traditional audio separation tools?

Accepted Answer

Traditional tools like Demucs or iZotope RX are limited to predefined categories (vocals, drums, bass). SAM Audio is a foundation model that can separate any sound you can describe, with no fixed category list. It also supports visual prompts (click on a speaker in a video) and temporal anchors (mark time segments), which these traditional tools do not offer.

Question 3

What are the three prompting modes?

Accepted Answer

1) Text Prompting: Describe sounds in natural language, like 'a dog barking'.
2) Visual Prompting: Click on objects in video frames to isolate their sounds.
3) Span/Temporal Prompting: Mark time segments as positive (+) or negative (-) examples.

The three modes can be combined for better accuracy.
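To make the three modes concrete, here is a minimal sketch of how a combined prompt could be represented in code. This is not the SAM Audio API: the names `Span`, `SeparationPrompt`, and `span_labels`, the `(frame_index, x, y)` click layout, and the 0.5-second step are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Span:
    """A marked time segment in seconds (hypothetical structure)."""
    start: float
    end: float
    positive: bool = True  # True = (+) example, False = (-) example


@dataclass
class SeparationPrompt:
    """Hypothetical container that combines the three prompt types."""
    text: Optional[str] = None                    # e.g. "a dog barking"
    click: Optional[Tuple[int, int, int]] = None  # assumed (frame_index, x, y)
    spans: List[Span] = field(default_factory=list)

    def modes(self) -> List[str]:
        """Return the names of the prompt modes that are set."""
        active = []
        if self.text:
            active.append("text")
        if self.click is not None:
            active.append("visual")
        if self.spans:
            active.append("span")
        return active


def span_labels(spans: List[Span], duration: float, step: float = 0.5) -> List[int]:
    """Label each time step +1 (positive span), -1 (negative span), or 0 (unmarked)."""
    n = int(round(duration / step))
    labels = [0] * n
    for s in spans:
        for i in range(n):
            t = i * step
            if s.start <= t < s.end:
                labels[i] = 1 if s.positive else -1
    return labels


# Combine a text prompt with one positive and one negative span.
prompt = SeparationPrompt(
    text="a dog barking",
    spans=[Span(1.0, 2.0, positive=True), Span(3.0, 4.0, positive=False)],
)
print(prompt.modes())
print(span_labels(prompt.spans, duration=4.0))
```

Keeping every field optional mirrors the idea that any one mode can be used alone or together with the others.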

Question 4

What technology powers SAM Audio?

Accepted Answer

The system uses a Flow-Matching Diffusion Transformer architecture operating in DAC-VAE latent space. It's powered by PE-AV (Perception Encoder Audiovisual), which is trained on 100M+ videos for cross-modal understanding. The reported real-time factor (RTF) is about 0.7, meaning processing takes less time than the duration of the audio itself.
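As a rough illustration of what the RTF figure means, the sketch below uses the standard definition RTF = processing time / audio duration. The function names and the 60-second clip are assumptions for this example. Actual speed depends on hardware and model size, which the figure above does not specify.

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated processing time, from RTF = processing time / audio duration."""
    return audio_seconds * rtf


def is_faster_than_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means processing finishes before the audio would finish playing."""
    return rtf < 1.0


clip_seconds = 60.0  # hypothetical one-minute clip
print(processing_seconds(clip_seconds, 0.7))  # roughly 42 seconds
print(is_faster_than_real_time(0.7))
```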

Question 5

Is SAM Audio free and open source?

Accepted Answer

Yes! The model is released on GitHub and Hugging Face under the SAM License. The code and model weights are available for research and non-commercial use. You can also try it free through Meta's Segment Anything Playground.

Question 6

What are some practical use cases?

Accepted Answer

Practical use cases include audio and video editing (removing unwanted sounds, isolating dialogue), music production (extracting instruments, creating stems), podcast production (removing background noise), accessibility (hearing aids that focus on specific speakers), and research (bioacoustics, environmental monitoring).

Question 7

How does Visual Prompting work?

Accepted Answer

When you click on an object in a video frame (like a person speaking), the engine uses its PE-AV encoder to understand the visual-audio correlation. It analyzes motion patterns (like lip movement or hand gestures) and temporal synchronization to isolate the sound produced by that specific object.
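The idea of temporal synchronization can be shown with a toy example. The sketch below is not how PE-AV works internally. It swaps in a plain Pearson correlation between a per-frame motion score and the audio loudness, to show why an object that moves in step with a sound is the likelier source. All names and numbers are invented for the illustration.

```python
from math import sqrt
from typing import Dict, List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def best_synced_object(motion: Dict[str, List[float]], loudness: List[float]) -> str:
    """Pick the object whose motion over time best tracks the audio loudness."""
    return max(motion, key=lambda name: pearson(motion[name], loudness))


# Made-up per-frame values: loudness of the audio and motion of two objects.
loudness = [0.1, 0.8, 0.9, 0.2, 0.1, 0.7]
motion = {
    "speaker": [0.0, 0.9, 1.0, 0.1, 0.0, 0.8],    # lips move when sound is loud
    "bystander": [0.5, 0.4, 0.5, 0.6, 0.5, 0.4],  # motion unrelated to the sound
}
print(best_synced_object(motion, loudness))
```

A learned audiovisual encoder captures far richer cues than a single correlation, but the underlying intuition is the same: sound and motion from one source tend to rise and fall together.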

Question 8

What are the current limitations?

Accepted Answer

The tool struggles with highly similar sound sources (like two people with identical voices arguing) and cannot use audio samples as prompts (only text, visual, or temporal prompts). It also requires a prompt: it cannot blindly separate all sources automatically. Meta acknowledges these limitations in its research.

SAM Audio: AI Audio Separation

SAM AUDIO CAPABILITIES

SAM Audio separates target and residual sounds from any audio or audiovisual source—across general sound, music, and speech.

Text prompts

Visual prompts

Span prompts

What is SAM Audio?

Why use it?

How it works (Zero-shot)

Who is SAM Audio for?

Empowering Creators Across Industries

For Musicians

For Video Editors

For Content Creators

Capabilities

What Makes These Features Revolutionary

Universal Audio Separation

Audio-Visual Grounding (PE-AV)

Target + Residual Output

State-of-the-Art Performance

How it works

How to Separate Audio Online with AI

Step 1: Upload Your File

Step 2: Choose Your Prompt

Step 3: AI Processing

Step 4: Download Stems

FAQ

What is SAM Audio?

How is SAM Audio different from traditional audio separation tools?

What are the three prompting modes?

What technology powers SAM Audio?

Is SAM Audio free and open source?

What are some practical use cases?

How does Visual Prompting work?

What are the current limitations?

Can I use SAM Audio as a vocal remover for karaoke?

How do I extract drum stems from a song?

Can SAM Audio clean up background noise from recordings?

Does this tool work directly with video files?

What audio and video formats are supported?

Is there a SAM Audio API for developers?

How fast is the audio separation process?

Can I use this vocal remover on my mobile phone?

Is my uploaded audio data private?

How does SAM Audio compare to Spleeter or Demucs?

Experience the Future of Audio AI
