Homepage | Dataset | Paper | arXiv | GitHub
This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context".
- [2024.7.10] MileBench is accepted by COLM 2024!
- [2024.4.15] MileBench is public!
We introduce MileBench, a pioneering benchmark designed to rigorously test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises a mix of text and images, long contexts, and multiple tasks requiring both comprehension and generation. To systematically assess the capabilities of MLLMs in multimodal long contexts, our benchmark consists of two distinct evaluation sets: diagnostic evaluation and realistic evaluation. The former explores the long-context recall abilities of MLLMs using needle-in-a-haystack and image-retrieval tasks, while the latter stress-tests the model in a manner akin to real-world conditions, using both temporal multi-image tasks and semantic multi-image tasks.
After evaluating 20 models, the closed-source Gemini 1.5 excelled in the realistic evaluation with an impressive score of 54.7%, though it still falls short of a perfect 100%, while GPT-4(Vision) reached a peak score of 99.4% in the diagnostic evaluation. In contrast, most open-source multimodal models struggled with long-context tasks: only VILA and Qwen-VL-7B managed average scores of 44.4% and 37.2% in the realistic and diagnostic evaluations, respectively. These results underscore that there are "miles to go" towards fully-realized long-context MLLMs, prompting a call for increased research focus on such tasks, especially those involving numerous images.
The MileBench dataset comprises 6,440 samples from 29 datasets, with each sample containing multiple images. The data has been archived on the cloud and can be downloaded from the HuggingFace link or the BaiduYun link. Save the dataset under the `data` folder.
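If you prefer to fetch the data programmatically, here is a minimal sketch using `huggingface_hub`; the repo id below is a placeholder, so substitute the id from the HuggingFace link above.

```python
from huggingface_hub import snapshot_download

# Download the dataset snapshot into the local `data` folder.
# NOTE: "ORG/MileBench" is a placeholder; use the repo id from the HuggingFace link above.
snapshot_download(
    repo_id="ORG/MileBench",
    repo_type="dataset",
    local_dir="data",
)
```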
Install required packages:

```bash
pip install -r requirements.txt
```
In `configs/model_configs.yaml`:
```yaml
# Add a new model "my_model"
my_model:
    model_name: "my_model"
    model_dir: "path/to/full/model"         # HuggingFace model weights
    cfg_path: "path/to/full/model_config"   # can be none
    gen_kwargs:
        max_new_tokens: 512
        min_new_tokens: 1
        do_sample: false
```
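As a quick sanity check, the new entry can be read back with PyYAML; this snippet is only an illustration and is not part of the repo's own loading code.

```python
import yaml

# Illustrative only: read the new entry back and inspect its generation settings.
with open("configs/model_configs.yaml") as f:
    configs = yaml.safe_load(f)

my_cfg = configs["my_model"]
print(my_cfg["model_dir"])    # HuggingFace model weights path
print(my_cfg["gen_kwargs"])   # {'max_new_tokens': 512, 'min_new_tokens': 1, 'do_sample': False}
```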
In `workers/model_workers.py`:
- Add a new model class (a fuller hypothetical sketch follows after this list):

```python
class MyModel(BaseWorker):

    def init_components(self, config) -> None:
        # Init the model components
        ...

    def forward(self, questions: list[str], image_paths: list[list], device, gen_kwargs) -> list[str]:
        # Prepare images and text for the generate function
        ...
```
- For the GitHub packages of different VLM models, we recommend saving them to the `./packages` directory. Then you don't need to install their pip packages in your environment.
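To make the two methods concrete, below is a hypothetical sketch of a worker wrapping a HuggingFace vision-language model, meant to live in `workers/model_workers.py` where `BaseWorker` is already available. Everything beyond the `BaseWorker` base class and the two method signatures (the processor/model classes, the `config.model_dir` attribute, the decoding calls) is an assumption and may differ for your model.

```python
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

class MyModel(BaseWorker):  # BaseWorker is provided by workers/model_workers.py

    def init_components(self, config) -> None:
        # Load the processor and weights from the directory given in model_configs.yaml.
        # `config.model_dir` is an assumption about how the config object exposes the path.
        self.processor = AutoProcessor.from_pretrained(config.model_dir)
        self.model = AutoModelForVision2Seq.from_pretrained(config.model_dir)

    def forward(self, questions: list[str], image_paths: list[list], device, gen_kwargs) -> list[str]:
        # Prepare images and text for the generate function, one sample at a time.
        self.model.to(device)
        answers = []
        for question, paths in zip(questions, image_paths):
            images = [Image.open(p).convert("RGB") for p in paths]
            inputs = self.processor(text=question, images=images, return_tensors="pt").to(device)
            output_ids = self.model.generate(**inputs, **gen_kwargs)
            answers.append(self.processor.batch_decode(output_ids, skip_special_tokens=True)[0])
        return answers
```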
In `utils.py`:
Import your model and register it:

```python
from workers.model_workers import MyModel   # modify here

name2worker = {
    "my_model": MyModel,   # modify here
}
```
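For orientation, here is a hypothetical sketch of how this mapping can be used to pick and build a worker; the actual dispatch code in the evaluation scripts may be organised differently, and `build_worker` is not a function from this repo.

```python
# Hypothetical helper: look up the worker class by name and instantiate it.
def build_worker(model_name: str, config):
    worker_cls = name2worker[model_name]   # e.g. MyModel for "my_model"
    return worker_cls(config)              # constructor signature assumed
```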
Set the GPU number in `/configs/accelerate_configs.yaml`:

```yaml
num_processes: GPU_NUM   # modify here
```
Modify `eval.sh`:
```bash
gpu_num=GPU_NUM                           # modify here
for model in my_model; do                 # modify here
    for dataset_name in dataset_name; do  # modify here
        ...
```
and run:

```bash
source eval.sh
```
Run:

```bash
python score.py \
    --result-dir outputs \
    --models my_model   # models to eval

# Result saved to outputs/result.csv
```
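If you want to post-process the aggregated scores, a minimal sketch for reading the CSV is below; the column layout of `outputs/result.csv` is not documented here, so the snippet makes no assumption about it.

```python
import csv

# Print every row of the aggregated results; columns depend on score.py's output.
with open("outputs/result.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)
```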
All software is licensed under the Apache License, Version 2.0 (Apache 2.0). All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY).
The dataset we're using is an aggregation of publicly accessible datasets licensed under the Creative Commons license (CC-BY) or other open-source licenses. We've carefully followed all required legal procedures to incorporate this data into our research, recognizing the importance of transparency in data licensing for proper attribution and suitable data use. Our dataset also encompasses images derived from publicly accessible datasets and language data created through the GPT-4V API. While measures have been put in place to secure suitable content, we acknowledge the potential existence of problematic content. Should you come across any such content, we urge you to inform us immediately so we can make the necessary adjustments and keep the dataset free from inappropriate material. We are committed to maintaining a high-quality, ethically responsible dataset and will uphold principles of privacy and transparency throughout our work.
- Dingjie Song: bbsngg@outlook.com
- Benyou Wang: wangbenyou@cuhk.edu.cn
If you find this repository helpful, please consider citing it:
```bibtex
@article{song2024milebench,
    title={MileBench: Benchmarking MLLMs in Long Context},
    author={Song, Dingjie and Chen, Shunian and Chen, Guiming Hardy and Yu, Fei and Wan, Xiang and Wang, Benyou},
    journal={arXiv preprint arXiv:2404.18532},
    year={2024}
}
```