SAM Audio

Creator
Seonglae Cho
Created
2025 Dec 17 18:45
Edited
2025 Dec 20 23:39
Refs
SAM Audio
With SAM Audio, you can use simple text prompts to accurately separate any sound from any audio or audio-visual source.
demo
Segment Anything Playground | Meta
A playground for interactive media

Paper

PE-AV (Perception Encoder Audiovisual) is an open-source foundation model that underpins SAM Audio's audio separation. It is a multimodal encoder that aligns audio, video, and text through contrastive learning. Training uses high-quality synthetic captions (~100M pairs) generated by an LLM, together with up to 10 pairwise contrastive objectives, to strengthen cross-modal alignment. It supports broad audio domains including speech, music, and general sounds, overcoming single-anchor limitations. It achieves zero-shot SoTA in audio and video classification and retrieval, and notably reaches practical-level speech retrieval for the first time. PEA-Frame enables frame-level sound–text alignment, which improves sound event detection (SED) performance.
Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning | Research - AI at Meta
We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning....
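The idea of combining several pairwise contrastive objectives can be illustrated with a small sketch. This is not the PE-AV implementation: the symmetric InfoNCE form, the temperature of 0.07, the three toy modality pairs, and the function names (`symmetric_info_nce`, `multi_pair_loss`) are all assumptions chosen for illustration. It only shows how one loss per modality pair can be averaged so that every pair is pulled into a shared embedding space.

```python
import math

def normalize(v):
    """Scale a vector to unit length (falls back to 1.0 for a zero vector)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings.

    The temperature value is an assumption, not taken from the paper.
    """
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    # Cosine similarity matrix scaled by temperature: rows index a, columns index b.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

def multi_pair_loss(embeddings, pairs):
    """Average the pairwise loss over every listed modality pair."""
    losses = [symmetric_info_nce(embeddings[m1], embeddings[m2]) for m1, m2 in pairs]
    return sum(losses) / len(losses)

# Toy batch of three items; row i of each modality describes the same clip.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
video = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
text = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]

embeddings = {"audio": audio, "video": video, "text": text}
# Hypothetical pair list; the paper's set of up to 10 pairs is not reproduced here.
pairs = [("audio", "video"), ("audio", "text"), ("video", "text")]

print(multi_pair_loss(embeddings, pairs))
```

Adding more entries to `pairs` (for example, different caption types paired with audio or video) extends the same average without changing the per-pair loss.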
 
 
 
