SAM Audio

Creator
Seonglae Cho
Created
2025 Dec 17 18:45
Edited
2025 Dec 20 23:39
Refs
demo

Paper

PE-AV (Perception Encoder Audiovisual) is the open-source foundation model that supports SAM Audio's audio separation. It is a multimodal encoder that aligns audio, video, and text through contrastive learning. Training uses high-quality synthetic captions generated by an LLM (~100M pairs) together with up to ten pairwise contrastive objectives to strengthen cross-modal alignment. The model covers broad audio domains, including speech, music, and general sounds, and overcomes single-anchor limitations. It achieves zero-shot SoTA in audio and video classification and retrieval, and in particular reaches practical-level speech retrieval for the first time. PE-A-Frame adds frame-level sound–text alignment, which improves sound event detection (SED) performance.
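To make the two alignment ideas above concrete, here is a minimal NumPy sketch, not PE-AV's actual training code. It assumes a symmetric InfoNCE loss averaged over every unordered pair of views, and per-frame cosine similarity against a text embedding for the frame-level case. The function names, the temperature of 0.07, and the view names are illustrative assumptions and do not come from the paper or the released model.

```python
import itertools

import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE for a batch where a[i] and b[i] are a matching pair."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a retrieves b
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b retrieves a
    return 0.5 * (loss_ab + loss_ba)


def pairwise_contrastive_loss(embeddings, temperature=0.07):
    """Average the symmetric loss over every unordered pair of views.

    `embeddings` maps a view name (hypothetical, e.g. "audio") to a
    (batch, dim) array; row i of every view describes the same clip.
    """
    pairs = list(itertools.combinations(sorted(embeddings), 2))
    losses = [
        symmetric_info_nce(embeddings[m], embeddings[n], temperature)
        for m, n in pairs
    ]
    return float(np.mean(losses)), pairs


def frame_text_scores(frame_emb, text_emb):
    """Cosine similarity of each audio frame (T, dim) with one text query (dim,)."""
    return l2_normalize(frame_emb) @ l2_normalize(text_emb)


# Usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
views = {name: rng.normal(size=(8, 16)) for name in ("audio", "text", "video")}
loss, pairs = pairwise_contrastive_loss(views)
scores = frame_text_scores(rng.normal(size=(20, 16)), rng.normal(size=16))
```

Under this enumeration three views give three pairs and five views give ten. The notes do not say how the paper's ten objectives are composed, so the count here is only an illustration. For sound event detection, a sketch like this would threshold `scores` over time to mark the frames where the queried sound is active.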
 
 
 
