LongVideoBench

A Benchmark for Long-context Interleaved Video-Language Understanding

LongVideoBench highlights referred reasoning questions, which depend on long frame inputs and cannot be well addressed by a single frame or a few sparsely sampled frames. Here's an example:

Video: [example video]

Question:

At the beginning of the video (0:19 - 0:22), a woman appears with a headband tied around her head, wearing a red top and carrying a black backpack. When she comes down from a hill with tall rocks (3:34 - 3:40), what changes occur to her backpack?

Options:

  • A. There is a dark red jacket hanging on her black backpack
  • B. Nothing changed
  • C. There is a white jacket hanging on her black backpack
  • D. There is a dark blue jacket hanging on her black backpack

On LongVideoBench, proprietary models (solid lines) improve significantly with more input frames, while open-source models (dashed lines) do not scale nearly as well:

[Figure: benchmark accuracy vs. number of input frames, per model]
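
For context on what the frame-count axis means in practice: a fixed frame budget is typically drawn by uniform sampling over the whole video before the frames are passed to the model. Below is a minimal OpenCV sketch of that step; the function name and default budget are illustrative, not taken from the LongVideoBench codebase.

```python
import cv2

def sample_frames_uniform(video_path: str, num_frames: int = 256):
    """Uniformly sample `num_frames` RGB frames across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices spanning the full duration.
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek, then decode one frame
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```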

Specifically, we include 17 categories of referred reasoning questions, exemplified as follows; please see our released dataset for all samples.

[Figure: one example question per category]
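
For a quick look at the samples, here is a minimal loading sketch. It assumes the annotations are hosted on Hugging Face under longvideobench/LongVideoBench and that each sample exposes question, candidates, and correct_choice fields; treat the split and field names as assumptions and check the dataset card.

```python
from datasets import load_dataset

# Assumed hub location, split, and field names; verify against the dataset card.
ds = load_dataset("longvideobench/LongVideoBench", split="validation")

sample = ds[0]
print(sample["question"])                      # referred reasoning question text
for i, option in enumerate(sample["candidates"]):
    print(f"{chr(ord('A') + i)}. {option}")    # lettered answer options
print("answer:", sample["correct_choice"])
```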

Leaderboard

| Model | Test Total (5341) | Test 8s-15s | Test 15s-60s | Test 180s-600s | Test 900s-3600s | Val Total (1337) |
|---|---|---|---|---|---|---|
| GPT-4o (0513) (256) | 66.7 | 71.6 | 76.8 | 66.7 | 61.6 | 66.7 |
| Gemini-1.5-Pro (0514) (256) | 64.4 | 70.2 | 75.3 | 65.0 | 59.1 | 64.0 |
| LLaVA-NeXT-Video-72B-Qwen2 (64) | 63.5 | 72.7 | 76.5 | 61.5 | 56.8 | 61.9 |
| LLaVA-OneVision-QWen2-72B-OV (32) | 63.2 | 74.3 | 77.4 | 61.6 | 56.5 | 61.3 |
| Gemini-1.5-Flash (0514) (256) | 62.4 | 66.1 | 73.1 | 63.1 | 57.3 | 61.6 |
| GPT-4-Turbo (0409) (256) | 60.7 | 66.4 | 71.1 | 61.7 | 54.5 | 59.1 |
| InternVL2-40B (16) | 60.6 | 71.4 | 76.6 | 57.5 | 54.4 | 59.3 |
| GPT-4o-mini (250) | 58.8 | 66.6 | 73.4 | 56.9 | 53.4 | 56.5 |
| MiniCPM-V-2.6 (64) | 57.7 | 62.5 | 69.1 | 54.9 | 49.8 | 54.9 |
| Qwen2-VL-7B (256) | 56.8 | 60.1 | 67.6 | 56.7 | 52.5 | 55.6 |
| Kangaroo (64) | 54.8 | 65.6 | 65.7 | 52.7 | 49.1 | 54.2 |
| PLLaVA-34B (32) | 53.5 | 60.1 | 66.8 | 50.8 | 49.1 | 53.2 |
| InternVL-Chat-V1-5-26B (16) | 51.7 | 61.3 | 62.7 | 49.5 | 46.6 | 51.2 |
| LLaVA-Next-Video-34B (32) | 50.5 | 57.6 | 61.6 | 48.7 | 45.9 | 50.5 |
| Phi-3-Vision-Instruct (16) | 49.9 | 58.3 | 59.6 | 48.4 | 45.1 | 49.6 |
| Idefics2 (16) | 49.4 | 57.4 | 60.4 | 47.3 | 44.7 | 49.7 |
| Mantis-Idefics2 (16) | 47.6 | 56.1 | 61.4 | 44.6 | 42.5 | 47.0 |
| LLaVA-Next-Mistral-7B (8) | 47.1 | 53.4 | 57.2 | 46.9 | 42.1 | 49.1 |
| PLLaVA-13B (32) | 45.1 | 52.9 | 54.3 | 42.9 | 41.2 | 45.6 |
| InstructBLIP-T5-XXL (8) | 43.8 | 48.1 | 50.1 | 44.5 | 40.0 | 43.3 |
| Mantis-BakLLaVA (16) | 43.7 | 51.3 | 52.7 | 41.1 | 40.1 | 43.7 |
| BLIP-2-T5-XXL (8) | 43.5 | 46.7 | 47.4 | 44.2 | 40.9 | 42.7 |
| LLaVA-Next-Video-M7B (32) | 43.5 | 50.9 | 53.1 | 42.6 | 38.9 | 43.5 |
| LLaVA-1.5-13B (8) | 43.1 | 49.0 | 51.1 | 41.8 | 39.6 | 43.4 |
| ShareGPT4Video (16) | 41.8 | 46.9 | 50.1 | 40.0 | 38.7 | 39.7 |
| VideoChat2 (Mistral-7B) (16) | 41.2 | 49.3 | 49.3 | 39.0 | 37.5 | 39.3 |
| LLaVA-1.5-7B (8) | 40.4 | 45.0 | 47.4 | 40.1 | 37.0 | 40.3 |
| mPLUG-Owl2 (8) | 39.4 | 49.4 | 47.3 | 38.7 | 34.3 | 39.1 |
| PLLaVA-7B (32) | 39.2 | 45.3 | 47.3 | 38.5 | 35.2 | 40.2 |
| VideoLLaVA (8) | 37.6 | 43.1 | 44.6 | 36.4 | 34.4 | 39.1 |
| VideoChat2 (Vicuna 7B) (16) | 35.1 | 38.1 | 40.5 | 33.5 | 33.6 | 36.0 |
The parenthesized number after the model name, e.g. (256), denotes the number of input frames at which the model reaches its best performance (or the number recommended or limited by the model provider).
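
Each cell above is multiple-choice accuracy (%) on the corresponding duration group. Below is a minimal sketch of how such per-group scores can be aggregated from per-sample predictions; the record fields are illustrative, not the official evaluation code.

```python
from collections import defaultdict

def grouped_accuracy(records):
    """records: dicts with 'duration_group', 'prediction', 'answer' (illustrative fields).

    Returns {duration_group: accuracy in percent}, mirroring the
    per-duration columns of the leaderboard.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["duration_group"]] += 1
        correct[r["duration_group"]] += r["prediction"] == r["answer"]
    return {g: 100.0 * correct[g] / total[g] for g in total}

print(grouped_accuracy([
    {"duration_group": "15s-60s", "prediction": "A", "answer": "A"},
    {"duration_group": "15s-60s", "prediction": "B", "answer": "C"},
]))  # {'15s-60s': 50.0}
```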

BibTeX

@misc{wu2024longvideobench,
      title={LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding},
      author={Haoning Wu and Dongxu Li and Bei Chen and Junnan Li},
      year={2024},
      eprint={2407.15754},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.15754}
}