MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

  • School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
*Equal contribution
†Corresponding author
TL;DR: In this work, we propose a mixture of multimodal experts (MoME) framework to mitigate task interference and obtain a generalist MLLM.

Abstract
Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language (VL) tasks. However, a generalist MLLM typically underperforms a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. Our MoME is composed of two key components: a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE adaptively modulates the features transformed from various vision encoders and is highly compatible with different transformation architectures. MoLE incorporates sparsely gated experts into LLMs to achieve painless improvements with roughly unchanged inference costs. In response to task interference, our MoME specializes in both the vision and language modalities to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks.

Method

The architecture of the proposed MoME model:



Mixture of Vision Experts (MoVE): MoVE adaptively aggregates features from various vision encoders.
To avoid feature mismatch across different vision encoders, we propose an adaptive deformable transformation (ADT) module (a) in MoVE, which transforms the features of each vision encoder into a unified-length sequence of feature vectors. The ADT module combines adaptive average pooling and deformable attention to obtain compressed and self-enhanced visual features. After this transformation, MoVE uses an instance-level soft router (b) to modulate and aggregate the transformed visual features according to the instruction, as sketched below.
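The following is a minimal PyTorch sketch of this idea, not the released implementation: all class and variable names are illustrative, the feature dimensions of the encoders are assumed to already be projected to a common size, and deformable attention is approximated here with standard multi-head attention over pooled query tokens.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ADT(nn.Module):
    """Sketch of an adaptive deformable transformation: compress a variable-length
    visual feature sequence into a fixed-length, self-enhanced one."""

    def __init__(self, dim: int, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.num_queries = num_queries

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) with an encoder-specific token count N
        # adaptive average pooling compresses to a unified token count
        queries = F.adaptive_avg_pool1d(
            feats.transpose(1, 2), self.num_queries
        ).transpose(1, 2)                              # (B, num_queries, D)
        # attention lets the pooled queries re-gather details from the full feature map
        enhanced, _ = self.attn(queries, feats, feats)
        return self.norm(queries + enhanced)           # (B, num_queries, D)

class MoVE(nn.Module):
    """Sketch of MoVE: transform each encoder's features with an ADT and aggregate
    them with an instance-level soft router conditioned on the instruction."""

    def __init__(self, num_experts: int, dim: int, instr_dim: int):
        super().__init__()
        self.adts = nn.ModuleList(ADT(dim) for _ in range(num_experts))
        self.router = nn.Linear(instr_dim, num_experts)

    def forward(self, expert_feats: list, instr_emb: torch.Tensor) -> torch.Tensor:
        # expert_feats: list of (B, N_i, D) tensors, one per vision encoder
        # instr_emb:    (B, instr_dim) pooled instruction embedding
        transformed = torch.stack(
            [adt(f) for adt, f in zip(self.adts, expert_feats)], dim=1
        )                                                  # (B, E, Q, D)
        weights = self.router(instr_emb).softmax(dim=-1)   # (B, E) soft routing weights
        return (weights[:, :, None, None] * transformed).sum(dim=1)  # (B, Q, D)

Because the router is soft, every transformed feature stream contributes to the aggregated output, with per-instance weights determined by the instruction.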

Mixture of Language Experts (MoLE): MoLE introduces several parameter-efficient adapters as experts and integrates them using an instance-level, sparsely activated router (c). Because the experts are lightweight adapters, MoLE can be inserted into each feed-forward network layer of an LLM and incurs only a small additional computational cost while delivering consistent performance gains. A simplified sketch follows.
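Below is a minimal PyTorch sketch of this component under stated assumptions: the experts are shown as bottleneck adapters, routing is top-1 per instance, and all names and hyperparameters (e.g. MoLEFFN, bottleneck size) are illustrative rather than taken from the released code.

import torch
import torch.nn as nn

class AdapterExpert(nn.Module):
    """Parameter-efficient bottleneck adapter used as a language expert."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))

class MoLEFFN(nn.Module):
    """Wraps a frozen LLM feed-forward block with sparsely routed adapter experts."""

    def __init__(self, ffn: nn.Module, dim: int, num_experts: int = 3):
        super().__init__()
        self.ffn = ffn  # original (frozen) feed-forward block
        self.experts = nn.ModuleList(AdapterExpert(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, D) token representations entering the FFN layer
        base = self.ffn(hidden)
        # instance-level routing: one gating decision per sequence
        gate_logits = self.router(hidden.mean(dim=1))   # (B, E)
        expert_idx = gate_logits.argmax(dim=-1)         # (B,) top-1, sparse
        out = torch.zeros_like(base)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # only the selected expert runs for each instance
                out[mask] = expert(hidden[mask])
        return base + out

Since only one adapter is activated per instance and each adapter is far smaller than the FFN it accompanies, inference cost stays roughly unchanged as experts are added.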


Results

MoVE (Table 1) achieves an average performance gain of 12.87 points across all VL tasks and improves by over 20 points on the "Document" group. MoME (Table 2) further enhances the multitasking capability of the MLLM, as shown in Experiments #7 and #8.



We summarize the evaluation results of MoME and other MLLMs with similar resource consumption on popular VL tasks in Table 3. The results show that MoME achieves promising performance on most datasets compared with other generalist and MoE-based MLLMs, especially on TextCaps, Flickr30K, and IconQA.

Qualitative Examples

The MoVE distributions on the left correspond to Pix2Struct, DINOv2, and CLIP-ViT from top to bottom. The MoLE distribution is shown at the bottom, with different colors indicating different experts.
In the REC case, DINOv2 accounts for nearly 50% of the vision-expert weight, providing fine-grained visual information, so the model can recognize the blue car in the background and output a precise bounding box. The Pix2Struct branch accounts for over 70% in the Document case, supporting structured text understanding. The REG case draws on both CLIP-ViT and DINOv2 to locate objects and generate captions. In contrast, the conventional captioning task in the General group only requires image-level perception, so CLIP-ViT is dominant. Remarkably, we also observe significant differences among the MoLE routing results. These examples show how MoME selects vision and language experts to adapt to various tasks.
Bibtex
@inproceedings{shen2024mome,
    title={MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models},
    author={Shen, Leyang and Chen, Gongwei and Shao, Rui and Guan, Weili and Nie, Liqiang},
    booktitle={Advances in Neural Information Processing Systems},
    year={2024}
}
Acknowledgement

We referred to the project page of AvatarCLIP when creating this project page.