SAM Audio

https://ai.meta.com/samaudio/

Demo

Segment Anything Playground | Meta

A playground for interactive media

https://aidemos.meta.com/segment-anything/editor/segment-audio

Paper

PE-AV is the open-source foundation model behind SAM Audio's audio separation. PE-AV (Perception Encoder Audiovisual) is a multimodal encoder that aligns audio, video, and text through contrastive learning. It is trained on high-quality synthetic captions generated by an LLM (~100M pairs) and uses up to 10 pairwise contrastive objectives to strengthen cross-modal alignment. It covers a broad range of audio domains, including speech, music, and general sounds, overcoming single-anchor limitations. It achieves zero-shot SoTA in audio and video classification and retrieval and, notably, reaches practical-level speech retrieval for the first time. PE-A-Frame adds frame-level sound-text alignment, which improves sound event detection (SED) performance.

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning | Research - AI at Meta

We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning....

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/
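
To make the "pairwise contrastive" idea in the summary above concrete, here is a minimal sketch of a symmetric InfoNCE-style loss averaged over every pair of embedding streams. It is an illustration under assumptions, not the PE-AV training code: the function names, the temperature value of 0.07, the toy 3-dimensional embeddings, and the choice of streams ("audio", "video", "text") are all hypothetical, and the exact ten objectives used in the paper are not reproduced here.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Average of the a-to-b and b-to-a InfoNCE terms for one pair of streams."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def multi_pair_loss(embeddings, temperature=0.07):
    """Average the symmetric loss over every pair of streams in `embeddings`.

    `embeddings` maps a stream name (hypothetical: "audio", "video", "text")
    to a list of per-item vectors; item i is the same clip in every stream.
    """
    pairs = list(combinations(sorted(embeddings), 2))
    losses = {
        (a, b): symmetric_contrastive_loss(embeddings[a], embeddings[b], temperature)
        for a, b in pairs
    }
    return sum(losses.values()) / len(losses), losses


# Toy batch of three clips; row i in each stream describes the same clip.
batch = {
    "audio": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "video": [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],
    "text": [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]],
}
total_loss, per_pair = multi_pair_loss(batch)
```

With three streams this yields three pairwise terms. Adding more streams (for example, several caption types per clip) grows the number of pairs quickly, which is the general mechanism by which a model can end up with many pairwise objectives.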

Recommendations