The architecture of the proposed MoME model:
Mixture of Vision Experts (MoVE): MoVE adaptively aggregates features from various vision encoders.
To avoid feature mismatch among different vision encoders, we propose an adaptive deformable transformation (ADT) module (a) in MoVE that transforms the features of each vision encoder into a fixed-length sequence of feature vectors. The ADT module combines adaptive average pooling with deformable attention to produce compressed and self-enhanced visual features. After this transformation, MoVE uses an instance-level soft router (b) to modulate and aggregate the transformed visual features according to the instruction.
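Below is a minimal PyTorch sketch of this pipeline. It is illustrative rather than the paper's implementation: the module names, dimensions (hidden_dim, num_queries, text_dim), and the use of standard multi-head attention as a stand-in for deformable attention are assumptions made for compactness.

```python
# Illustrative sketch of MoVE: per-expert ADT modules followed by an
# instance-level soft router. Standard multi-head attention stands in for
# the deformable attention used in the paper; dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ADT(nn.Module):
    """Adaptive deformable transformation (simplified).

    Compresses a variable-length sequence of vision features to a fixed
    length with adaptive average pooling, then refines the pooled queries
    with attention over the original features."""
    def __init__(self, in_dim, hidden_dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)
        self.pool = nn.AdaptiveAvgPool1d(num_queries)   # unify sequence length
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, feats):                  # feats: (B, N_i, in_dim), N_i varies per encoder
        x = self.proj(feats)                   # (B, N_i, hidden_dim)
        q = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (B, num_queries, hidden_dim)
        out, _ = self.attn(q, x, x)            # pooled queries attend back to full features
        return q + out                         # compressed, self-enhanced features

class MoVE(nn.Module):
    """Aggregates ADT-transformed expert features with an instance-level soft router."""
    def __init__(self, expert_dims, hidden_dim=1024, text_dim=4096):
        super().__init__()
        self.adts = nn.ModuleList([ADT(d, hidden_dim) for d in expert_dims])
        self.router = nn.Linear(text_dim, len(expert_dims))  # routes on the instruction embedding

    def forward(self, expert_feats, instruction_emb):
        # expert_feats: list of (B, N_i, d_i); instruction_emb: (B, text_dim)
        transformed = torch.stack(
            [adt(f) for adt, f in zip(self.adts, expert_feats)], dim=1
        )                                                          # (B, E, Q, hidden_dim)
        weights = F.softmax(self.router(instruction_emb), dim=-1)  # (B, E), soft routing
        return (weights[:, :, None, None] * transformed).sum(dim=1)  # (B, Q, hidden_dim)
```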
Mixture of Language Experts (MoLE): MoLE introduces several parameter-efficient adapters as experts and integrates them with an instance-level sparsely-activated router (c). Because it relies on adapters, MoLE can be attached to each feed-forward network layer of an LLM, incurring only a small additional computational cost while providing consistent performance gains.
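The following sketch illustrates the MoLE idea under stated assumptions: bottleneck adapters serve as experts, and routing is top-1 on a pooled instruction embedding (the paper only specifies an instance-level sparsely-activated router, so the exact routing rule and all names and dimensions here are assumptions).

```python
# Illustrative sketch of MoLE: a frozen FFN block augmented with a
# sparsely-routed mixture of bottleneck adapters. Names, dimensions, and
# top-1 routing are assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Parameter-efficient bottleneck adapter used as one language expert."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.up(F.gelu(self.down(x)))

class MoLE(nn.Module):
    """Adds a sparsely-activated mixture of adapters on top of an FFN layer."""
    def __init__(self, ffn, dim, num_experts=3):
        super().__init__()
        self.ffn = ffn                                 # the LLM's original feed-forward block
        self.experts = nn.ModuleList([Adapter(dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x, instruction_emb):
        # x: (B, T, dim) hidden states; instruction_emb: (B, dim) pooled instruction feature
        logits = self.router(instruction_emb)          # (B, num_experts)
        expert_idx = logits.argmax(dim=-1)             # top-1 routing: one expert per instance
        out = self.ffn(x)
        # For clarity all experts are evaluated here; a real sparse implementation
        # would run only the selected expert for each instance.
        delta = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, T, dim)
        chosen = delta[torch.arange(x.size(0)), expert_idx]       # (B, T, dim)
        return out + chosen                            # residual adapter output
```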
MoVE (Table 1) achieves an average performance gain of 12.87 points across all VL tasks and improves by over 20 points on the "Document" group. MoME (Table 2) further enhances the multitasking capability of MLLMs, as shown in Experiments #7 and #8.
We summarize the evaluation results of MoME and other MLLMs with similar resource consumption on popular VL tasks in Table 3. The results show that MoME achieves promising results on most datasets compared with other generalist and MoE MLLMs, especially on TextCaps, Flickr30K, and IconQA.
@inproceedings{shen2024mome,
title={MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models},
author={Shen, Leyang and Chen, Gongwei and Shao, Rui and Guan, Weili and Nie, Liqiang},
booktitle={Advances in Neural Information Processing Systems},
year={2024}
}
We referred to the project page of AvatarCLIP when creating this project page.