SAM Audio
With SAM Audio, you can use simple text prompts to accurately separate any sound from any audio or audio-visual source.
https://ai.meta.com/samaudio/

Demo
Segment Anything Playground | Meta
A playground for interactive media
https://aidemos.meta.com/segment-anything/editor/segment-audio

Paper
PE-AV is the open-source foundation model behind SAM Audio's audio separation. PE-AV (Perception Encoder Audiovisual) is a multimodal encoder that aligns audio, video, and text through contrastive learning. Training combines high-quality synthetic captions generated by an LLM (~100M pairs) with multiple pairwise contrastive objectives (up to 10) to strengthen cross-modal alignment. It covers broad audio domains, including speech, music, and general sounds, overcoming single-anchor limitations. It achieves zero-shot SoTA in audio and video classification and retrieval, and in particular reaches practical-level speech retrieval for the first time. PE-A-Frame adds frame-level sound-text alignment, which improves sound event detection (SED) performance.
Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning | Research - AI at Meta
We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning....
https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/
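
To make the "multiple pairwise contrastive objectives" idea in the summary concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes each modality has already been encoded into a batch of same-sized embeddings where row i of every modality describes the same clip, and it averages a symmetric InfoNCE loss over every pair of modalities. The function names (`symmetric_info_nce`, `multi_pair_loss`), the temperature of 0.07, and the choice of three modalities are illustrative assumptions; the paper's actual set of pairs and its training details may differ.

```python
import itertools
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    # Row i of `a` is the positive for row i of `b`; all other rows are negatives.
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

def multi_pair_loss(embeddings, temperature=0.07):
    # Hypothetical helper: average the pairwise loss over all modality pairs.
    pairs = list(itertools.combinations(sorted(embeddings), 2))
    losses = {
        pair: symmetric_info_nce(embeddings[pair[0]], embeddings[pair[1]], temperature)
        for pair in pairs
    }
    return sum(losses.values()) / len(losses), losses

# Toy data: three "modalities" that are noisy views of the same 8 clips.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
aligned = {
    name: shared + 0.01 * rng.normal(size=shared.shape)
    for name in ("audio", "video", "text")
}
# Reverse the text rows so captions no longer match their clips.
shuffled = dict(aligned, text=aligned["text"][::-1])

aligned_loss, pair_losses = multi_pair_loss(aligned)
shuffled_loss, _ = multi_pair_loss(shuffled)
print("aligned loss lower than shuffled:", aligned_loss < shuffled_loss)
```

In this toy setup the loss is lower when captions line up with their clips than when the text rows are reversed, which is the signal a contrastive encoder is trained on. Adding more embedding types to the dictionary adds more pairs to the average.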

Seonglae Cho
