LongVideoBench

A Benchmark for Long-context Interleaved Video-Language Understanding
Code | Dataset | Leaderboard | arXiv

LongVideoBench highlights referred reasoning questions, which depend on long frame inputs and cannot be answered well from a single frame or a few sparsely sampled frames. Here's an example:

Video:

Question:

At the beginning of the video (0:19 - 0:22), a woman with a headband tied to her head is wearing a red top and carrying a black backpack. When the woman comes down from a hill with tall rocks (3:34 - 3:40), what changes occur to her backpack?

Options:

  • A. There is a dark red jacket hanging on her black backpack
  • B. Nothing changed
  • C. There is a white jacket hanging on her black backpack
  • D. There is a dark blue jacket hanging on her black backpack
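For illustration, such a referred reasoning sample could be represented roughly as the structure below. The field names are hypothetical and do not reflect the official annotation schema; please consult the released dataset for the exact format.

```python
# Illustrative only: a hypothetical representation of one referred
# reasoning sample. Field names are NOT the official schema.
sample = {
    "video_id": "example_video",              # placeholder identifier
    "duration_group": "180s-600s",            # one of the four duration buckets
    "question": (
        "At the beginning of the video (0:19 - 0:22), a woman with a headband "
        "tied to her head is wearing a red top and carrying a black backpack. "
        "When the woman comes down from a hill with tall rocks (3:34 - 3:40), "
        "what changes occur to her backpack?"
    ),
    "referred_timestamps": ["0:19-0:22", "3:34-3:40"],  # moments the question refers to
    "options": [
        "A. There is a dark red jacket hanging on her black backpack",
        "B. Nothing changed",
        "C. There is a white jacket hanging on her black backpack",
        "D. There is a dark blue jacket hanging on her black backpack",
    ],
    "answer": None,  # ground-truth option letter, omitted here
}
```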

On LongVideoBench, proprietary models (solid lines) improve significantly with more input frames, while open-source models (dashed lines) fail to scale as well:

(Figure: accuracy vs. number of input frames on LongVideoBench.)
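The comparison above hinges on how many frames are fed to each model. Below is a minimal sketch of the kind of uniform frame sampling typically used to vary the number of input frames; it is not the official evaluation code, and the actual pipeline may sample and preprocess frames differently.

```python
import cv2  # OpenCV for video decoding


def sample_frames_uniformly(video_path: str, num_frames: int):
    """Return `num_frames` RGB frames sampled at evenly spaced positions."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole video.
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame_bgr = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


# e.g. compare a model's accuracy when given 8, 32, or 256 input frames
frames = sample_frames_uniformly("example.mp4", num_frames=32)
```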

Specifically, we include 17 categories of referred reasoning questions, exemplified in the figure below. Please see our released dataset for all samples.

(Figure: example questions for each of the 17 categories.)

Leaderboard

All numbers are multiple-choice accuracy (%); the middle four test columns break results down by source-video duration.

| Model | Test Total (5341) | Test 8s-15s | Test 15s-60s | Test 180s-600s | Test 900s-3600s | Val Total (1337) |
|---|---|---|---|---|---|---|
| GPT-4o (0513) | 66.7 | 71.6 | 76.8 | 66.7 | 61.6 | 66.7 |
| Gemini-1.5-Pro (0514) | 64.4 | 70.2 | 75.3 | 65.0 | 59.1 | 64.0 |
| LLaVA-OneVision-QWen2-72B-OV | 63.2 | 74.3 | 77.4 | 61.6 | 56.5 | 61.3 |
| Gemini-1.5-Flash (0514) | 62.4 | 66.1 | 73.1 | 63.1 | 57.3 | 61.6 |
| GPT-4-Turbo (0409) | 60.7 | 66.4 | 71.1 | 61.7 | 54.5 | 59.1 |
| Kangaroo | 54.8 | 65.6 | 65.7 | 52.7 | 49.1 | 54.2 |
| PLLaVA-34B | 53.5 | 60.1 | 66.8 | 50.8 | 49.1 | 53.2 |
| LLaVA-Next-Video-34B | 50.5 | 57.6 | 61.6 | 48.7 | 45.9 | 50.5 |
| Phi-3-Vision-Instruct | 49.9 | 58.3 | 59.6 | 48.4 | 45.1 | 49.6 |
| Idefics2 | 49.4 | 57.4 | 60.4 | 47.3 | 44.7 | 49.7 |
| Mantis-Idefics2 | 47.6 | 56.1 | 61.4 | 44.6 | 42.5 | 47.0 |
| LLaVA-Next-Mistral-7B | 47.1 | 53.4 | 57.2 | 46.9 | 42.1 | 49.1 |
| PLLaVA-13B | 45.1 | 52.9 | 54.3 | 42.9 | 41.2 | 45.6 |
| InstructBLIP-T5-XXL | 43.8 | 48.1 | 50.1 | 44.5 | 40.0 | 43.3 |
| Mantis-BakLLaVA | 43.7 | 51.3 | 52.7 | 41.1 | 40.1 | 43.7 |
| BLIP-2-T5-XXL | 43.5 | 46.7 | 47.4 | 44.2 | 40.9 | 42.7 |
| LLaVA-Next-Video-M7B | 43.5 | 50.9 | 53.1 | 42.6 | 38.9 | 43.5 |
| LLaVA-1.5-13B | 43.1 | 49.0 | 51.1 | 41.8 | 39.6 | 43.4 |
| ShareGPT4Video | 41.8 | 46.9 | 50.1 | 40.0 | 38.7 | 39.7 |
| VideoChat2 (Mistral-7B) | 41.2 | 49.3 | 49.3 | 39.0 | 37.5 | 39.3 |
| LLaVA-1.5-7B | 40.4 | 45.0 | 47.4 | 40.1 | 37.0 | 40.3 |
| mPLUG-Owl2 | 39.4 | 49.4 | 47.3 | 38.7 | 34.3 | 39.1 |
| PLLaVA-7B | 39.2 | 45.3 | 47.3 | 38.5 | 35.2 | 40.2 |
| VideoLLaVA | 37.6 | 43.1 | 44.6 | 36.4 | 34.4 | 39.1 |
| VideoChat2 (Vicuna 7B) | 35.1 | 38.1 | 40.5 | 33.5 | 33.6 | 36.0 |
Please refer to our Hugging Face leaderboard for more results!
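For reference, each duration column above is simply the accuracy over the subset of questions whose source videos fall in that duration bucket. A minimal sketch of that computation follows, assuming a hypothetical list of prediction records with `duration_group`, `prediction`, and `answer` fields; this is not the official scoring script.

```python
from collections import defaultdict


def accuracy_by_duration_group(records):
    """Compute overall and per-duration-bucket accuracy (in %).

    `records` is a hypothetical list of dicts with keys
    'duration_group', 'prediction', and 'answer' (option letters).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r["duration_group"]
        for key in (group, "all"):
            total[key] += 1
            if r["prediction"] == r["answer"]:
                correct[key] += 1
    return {g: 100.0 * correct[g] / total[g] for g in total}


# Example usage:
# accuracy_by_duration_group([
#     {"duration_group": "8s-15s", "prediction": "A", "answer": "A"},
#     {"duration_group": "900s-3600s", "prediction": "B", "answer": "C"},
# ])
# -> {"8s-15s": 100.0, "all": 50.0, "900s-3600s": 0.0}
```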

BibTeX

@misc{wu2024longvideobench,
  title={LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding},
  author={Haoning Wu and Dongxu Li and Bei Chen and Junnan Li},
  year={2024},
  eprint={2407.15754},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.15754}
}