demo
Paper
PE-AV (Perception Encoder Audiovisual) is an open-source foundation model that supports SAM Audio's audio separation. It is a multimodal encoder that aligns audio, video, and text through contrastive learning. Training uses high-quality synthetic captions (~100M pairs) generated by an LLM, together with multiple (up to 10) pairwise contrastive objectives, to strengthen cross-modal alignment. It supports broad audio domains, including speech, music, and general sounds, overcoming single-anchor limitations. It achieves zero-shot SoTA in audio/video classification and retrieval, and notably reaches practical-level speech retrieval for the first time. PEA-Frame enables frame-level sound–text alignment, which improves sound event detection (SED) performance.
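To make the idea of multiple pairwise contrastive objectives concrete, here is a minimal sketch that averages a symmetric InfoNCE loss over every pair of modalities. It is a generic illustration, not the PE-AV training code: the function names, the temperature of 0.07, and the one-hot toy embeddings are all assumptions. With three modalities it yields only 3 pairs, whereas the note above mentions up to 10.

```python
import itertools
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def pairwise_contrastive_loss(embeddings, temperature=0.07):
    """Average symmetric InfoNCE over every pair of modalities.

    `embeddings` maps a modality name to a batch of vectors; row i of each
    batch is assumed to describe the same underlying clip.
    """
    normed = {
        name: [l2_normalize(v) for v in batch]
        for name, batch in embeddings.items()
    }
    pair_losses = {}
    for a, b in itertools.combinations(sorted(normed), 2):
        loss_ab = info_nce(normed[a], normed[b], temperature)
        loss_ba = info_nce(normed[b], normed[a], temperature)
        pair_losses[(a, b)] = 0.5 * (loss_ab + loss_ba)
    mean_loss = sum(pair_losses.values()) / len(pair_losses)
    return mean_loss, pair_losses


# Toy data (hypothetical): four clips embedded as one-hot vectors.
basis = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
rotated = basis[1:] + basis[:1]  # rows no longer line up with `basis`

aligned = {"audio": basis, "video": basis, "text": basis}
misaligned = {"audio": basis, "video": rotated, "text": basis}

aligned_loss, aligned_pairs = pairwise_contrastive_loss(aligned)
misaligned_loss, misaligned_pairs = pairwise_contrastive_loss(misaligned)
```

In this toy setup the aligned batch gives a loss near zero, while rotating the video rows raises the loss on every pair that involves video and leaves the audio-text pair untouched. This shows how each pair contributes its own alignment signal instead of routing everything through a single anchor modality.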

Seonglae Cho


