LongVideoBench

A Benchmark for Long-context Interleaved Video-Language Understanding
NeurIPS 2024 Datasets & Benchmarks
Code Dataset Leaderboard arXiv

LongVideoBench highlights referred reasoning questions, which depend on long frame inputs and cannot be well addressed by a single frame or a few sparsely sampled frames. Here is an example:

Video:

Question:

At the beginning of the video (0:19 - 0:22), a woman with a headband tied to her head, wearing a red top, carrying a black backpack, when the woman comes down from a hill with tall rocks (3:34 - 3:40), what changes occur to her backpack?

Options:

  • A. There is a dark red jacket hanging on her black backpack
  • B. Nothing changed
  • C. There is a white jacket hanging on her black backpack
  • D. There is a dark blue jacket hanging on her black backpack

On LongVideoBench, the best open-source LMMs already improve significantly with up to 256 (Aria) or 128 (LLaVA-Video-72B) input frames, catching up with the proprietary GPT-4o and Gemini-1.5-Pro.
[Figure: LongVideoBench accuracy vs. number of input frames]
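
Since results depend heavily on how many frames a model ingests, here is a minimal sketch of uniform frame sampling with OpenCV. It only illustrates the general technique, not the benchmark's official loader; num_frames is whatever budget a given model supports.

# Minimal sketch: uniformly sample N frames from a long video with OpenCV.
# Illustration only; not the official LongVideoBench data loader.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 256):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole video.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames  # list of HxWx3 RGB arrays, ready to feed to an LMM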

Specifically, we include 17 categories of referred reasoning questions, exemplified as follows:
[Figure: example questions from each of the 17 categories]
Please see our released dataset for all samples.
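
To make the sample format above concrete, the sketch below turns one question and its options into a multiple-choice prompt. The field names (question, candidates, correct_choice) are assumptions for illustration and may differ from the released dataset's actual schema.

# Minimal sketch: format one referred reasoning sample as an A/B/C/D prompt.
# Field names below are assumed for illustration; check the released files
# for the actual schema.
import string

sample = {
    "question": "... what changes occur to her backpack?",
    "candidates": [
        "There is a dark red jacket hanging on her black backpack",
        "Nothing changed",
        "There is a white jacket hanging on her black backpack",
        "There is a dark blue jacket hanging on her black backpack",
    ],
    "correct_choice": 0,
}

def build_prompt(sample):
    lines = [sample["question"]]
    for letter, option in zip(string.ascii_uppercase, sample["candidates"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

print(build_prompt(sample))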

Leaderboard

Email teowu@rhymes.ai to submit to our benchmark!
| Rank | Model (max_frames) | Open-Source? | Test Total (5341) | Test 8s-15s | Test 15s-60s | Test 180s-600s | Test 900s-3600s | Val Total (1337) |
|------|--------------------|--------------|-------------------|-------------|--------------|----------------|-----------------|------------------|
| 1 | GPT-4o (0513) (256) | No | 66.7 | 71.6 | 76.8 | 66.7 | 61.6 | 66.7 |
| 2 | Aria (256) | Yes | 65.0 | 69.4 | 76.6 | 64.6 | 60.1 | 64.2 |
| 3 | LLaVA-Video-72B-Qwen2 (128) | Yes | 64.9 | 72.4 | 77.4 | 63.9 | 59.3 | 63.9 |
| 4 | Gemini-1.5-Pro (0514) (256) | No | 64.4 | 70.2 | 75.3 | 65.0 | 59.1 | 64.0 |
| 5 | LLaVA-OneVision-QWen2-72B-OV (32) | Yes | 63.2 | 74.3 | 77.4 | 61.6 | 56.5 | 61.3 |
| 6 | LLaVA-Video-7B-Qwen2 (128) | Yes | 62.7 | 69.7 | 76.5 | 62.1 | 56.6 | 61.1 |
| 7 | Gemini-1.5-Flash (0514) (256) | No | 62.4 | 66.1 | 73.1 | 63.1 | 57.3 | 61.6 |
| 8 | GPT-4-Turbo (0409) (256) | No | 60.7 | 66.4 | 71.1 | 61.7 | 54.5 | 59.1 |
| 9 | InternVL2-40B (16) | Yes | 60.6 | 71.4 | 76.6 | 57.5 | 54.4 | 59.3 |
| 10 | mPLUG-Owl3-7B (128) | Yes | 60.1 | 69.3 | 73.7 | 58.8 | 53.9 | 59.8 |
| 11 | TimeMarker (128) | Yes | 59.6 | 67.3 | 73.6 | 58.2 | 53.8 | 56.3 |
| 12 | GPT-4o-mini (250) | No | 58.8 | 66.6 | 73.4 | 56.9 | 53.4 | 56.5 |
| 13 | MiniCPM-V-2.6 (64) | Yes | 57.7 | 62.5 | 69.1 | 54.9 | 49.8 | 54.9 |
| 14 | Qwen2-VL-7B (256) | Yes | 56.8 | 60.1 | 67.6 | 56.7 | 52.5 | 55.6 |
| 15 | Kangaroo (64) | Yes | 54.8 | 65.6 | 65.7 | 52.7 | 49.1 | 54.2 |
| 16 | PLLaVA-34B (32) | Yes | 53.5 | 60.1 | 66.8 | 50.8 | 49.1 | 53.2 |
| 17 | InternVL-Chat-V1-5-26B (16) | Yes | 51.7 | 61.3 | 62.7 | 49.5 | 46.6 | 51.2 |
| 18 | LLaVA-Next-Video-34B (32) | Yes | 50.5 | 57.6 | 61.6 | 48.7 | 45.9 | 50.5 |
| 19 | Phi-3-Vision-Instruct (16) | Yes | 49.9 | 58.3 | 59.6 | 48.4 | 45.1 | 49.6 |
| 20 | Idefics2 (16) | Yes | 49.4 | 57.4 | 60.4 | 47.3 | 44.7 | 49.7 |
| 21 | Mantis-Idefics2 (16) | Yes | 47.6 | 56.1 | 61.4 | 44.6 | 42.5 | 47.0 |
| 22 | LLaVA-Next-Mistral-7B (8) | Yes | 47.1 | 53.4 | 57.2 | 46.9 | 42.1 | 49.1 |
| 23 | PLLaVA-13B (32) | Yes | 45.1 | 52.9 | 54.3 | 42.9 | 41.2 | 45.6 |
| 24 | InstructBLIP-T5-XXL (8) | Yes | 43.8 | 48.1 | 50.1 | 44.5 | 40.0 | 43.3 |
| 25 | Mantis-BakLLaVA (16) | Yes | 43.7 | 51.3 | 52.7 | 41.1 | 40.1 | 43.7 |
| 26 | BLIP-2-T5-XXL (8) | Yes | 43.5 | 46.7 | 47.4 | 44.2 | 40.9 | 42.7 |
| 27 | LLaVA-Next-Video-M7B (32) | Yes | 43.5 | 50.9 | 53.1 | 42.6 | 38.9 | 43.5 |
| 28 | LLaVA-1.5-13B (8) | Yes | 43.1 | 49.0 | 51.1 | 41.8 | 39.6 | 43.4 |
| 29 | ShareGPT4Video (16) | Yes | 41.8 | 46.9 | 50.1 | 40.0 | 38.7 | 39.7 |
| 30 | VideoChat2 (Mistral-7B) (16) | Yes | 41.2 | 49.3 | 49.3 | 39.0 | 37.5 | 39.3 |
| 31 | LLaVA-1.5-7B (8) | Yes | 40.4 | 45.0 | 47.4 | 40.1 | 37.0 | 40.3 |
| 32 | mPLUG-Owl2 (8) | Yes | 39.4 | 49.4 | 47.3 | 38.7 | 34.3 | 39.1 |
| 33 | PLLaVA-7B (32) | Yes | 39.2 | 45.3 | 47.3 | 38.5 | 35.2 | 40.2 |
| 34 | VideoLLaVA (8) | Yes | 37.6 | 43.1 | 44.6 | 36.4 | 34.4 | 39.1 |
| 35 | VideoChat2 (Vicuna 7B) (16) | Yes | 35.1 | 38.1 | 40.5 | 33.5 | 33.6 | 36.0 |
The parenthesized number after the model name, e.g. (256), denotes the number of input frames at which the model reaches its best performance (or as recommended/limited by the model provider).
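
For context, each duration column above is plain multiple-choice accuracy over the questions whose videos fall in that duration range. Below is a minimal sketch of that per-group aggregation; the record fields (duration_group, prediction, answer) are assumed for illustration and this is not the official evaluation script.

# Minimal sketch: accuracy per duration group from a list of prediction
# records. Field names are assumptions for illustration only.
from collections import defaultdict

def accuracy_by_duration(records):
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r["duration_group"]      # e.g. "8s-15s", "900s-3600s"
        total[group] += 1
        correct[group] += int(r["prediction"] == r["answer"])
    return {g: correct[g] / total[g] for g in total}

records = [
    {"duration_group": "8s-15s", "prediction": "A", "answer": "A"},
    {"duration_group": "900s-3600s", "prediction": "B", "answer": "C"},
]
print(accuracy_by_duration(records))  # {'8s-15s': 1.0, '900s-3600s': 0.0}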

BibTeX

@misc{wu2024longvideobench,
      title={LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding},
      author={Haoning Wu and Dongxu Li and Bei Chen and Junnan Li},
      year={2024},
      eprint={2407.15754},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.15754},
}