MoME

Creator
Seonglae Cho
Created
2026 Mar 27 15:0
Editor
Edited
2026 Apr 9 13:51
Refs

Mixture of Multimodal Experts

When a single generalist model is trained on many tasks simultaneously, task interference can cause it to underperform task-specific specialist models. This paper starts from the observation that feature distributions differ significantly across tasks in both the vision and language modalities, and argues that prior work focused mainly on textual differences within the LLM while overlooking task-to-task differences in the visual information.
First, Mixture of Vision Experts (MoVE) integrates features from three vision encoders: CLIP-ViT, DINOv2, and Pix2Struct. The Adaptive Deformable Transformation (ADT) module converts each encoder's features, despite their differing resolutions and feature spaces, into a fixed-length sequence; its deformable cross-attention samples from the original feature map to selectively extract information. An instance-level soft router takes the instruction's sentence embedding, produces a weight for each vision encoder's features, and aggregates the final visual feature as their weighted sum.

Second, Mixture of Language Experts (MoLE) inserts lightweight adapter experts in parallel into each FFN layer of the LLM. Each adapter follows a bottleneck structure (down-projection, nonlinearity, up-projection), and a sparsely-activated top-1 router selects a single expert per instance based on the instruction embedding.
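The two routing schemes above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the router logits, feature vectors, and function names are hypothetical stand-ins, and the post-ADT features are assumed to already be fixed-length vectors.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def move_aggregate(router_logits, expert_features):
    # MoVE-style soft routing (sketch): router_logits is one score per
    # vision expert (hypothetically derived from the instruction's
    # sentence embedding); expert_features holds each expert's
    # fixed-length feature vector after ADT alignment.
    weights = softmax(router_logits)
    dim = len(expert_features[0])
    # Weighted sum of the per-expert features.
    return [
        sum(weights[i] * expert_features[i][d] for i in range(len(weights)))
        for d in range(dim)
    ]

def mole_top1(router_logits):
    # MoLE-style sparse routing (sketch): activate only the single
    # highest-scoring adapter expert for this instance.
    return max(range(len(router_logits)), key=lambda i: router_logits[i])

# Example: two vision experts, equal router scores -> equal blend.
visual = move_aggregate([0.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
expert_id = mole_top1([0.1, 2.0, -0.3])
```

The contrast between the two routers mirrors the paper's design: the vision side blends all experts softly, while the language side keeps compute low by activating exactly one adapter per FFN layer.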
MoME: Mixture of Multimodal Experts for Generalist Multimodal...
Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, a generalist MLLM typically underperforms compared with a...