LongVideoBench

A Benchmark for Long-context Interleaved Video-Language Understanding
Code | Dataset | Leaderboard | arXiv

LongVideoBench highlights referred reasoning questions, which depend on long frame inputs and cannot be answered well from a single frame or a few sparsely sampled frames. Here's an example:

Video:

Question:

At the beginning of the video (0:19 - 0:22), a woman with a headband tied to her head is wearing a red top and carrying a black backpack. When the woman comes down from a hill with tall rocks (3:34 - 3:40), what changes occur to her backpack?

Options:

  • A. There is a dark red jacket hanging on her black backpack
  • B. Nothing changed
  • C. There is a white jacket hanging on her black backpack
  • D. There is a dark blue jacket hanging on her black backpack
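For illustration, such a referred reasoning sample could be represented roughly as the structure below. The field names are hypothetical and do not reflect the official annotation schema; please consult the released dataset for the exact format.

```python
# Illustrative only: a hypothetical representation of one referred
# reasoning sample. Field names are NOT the official schema.
sample = {
    "video_id": "example_video",              # placeholder identifier
    "duration_group": "180s-600s",            # one of the four duration buckets
    "question": (
        "At the beginning of the video (0:19 - 0:22), a woman with a headband "
        "tied to her head is wearing a red top and carrying a black backpack. "
        "When the woman comes down from a hill with tall rocks (3:34 - 3:40), "
        "what changes occur to her backpack?"
    ),
    "referred_timestamps": ["0:19-0:22", "3:34-3:40"],  # moments the question refers to
    "options": [
        "A. There is a dark red jacket hanging on her black backpack",
        "B. Nothing changed",
        "C. There is a white jacket hanging on her black backpack",
        "D. There is a dark blue jacket hanging on her black backpack",
    ],
    "answer": None,  # ground-truth option letter, omitted here
}
```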

On LongVideoBench, proprietary models (solid lines) improve significantly with more input frames, while open-source models (dashed lines) fail to scale as well:

(Figure: accuracy vs. number of input frames on LongVideoBench.)
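The comparison above hinges on how many frames are fed to each model. Below is a minimal sketch of the kind of uniform frame sampling typically used to vary the number of input frames; it is not the official evaluation code, and the actual pipeline may sample and preprocess frames differently.

```python
import cv2  # OpenCV for video decoding


def sample_frames_uniformly(video_path: str, num_frames: int):
    """Return `num_frames` RGB frames sampled at evenly spaced positions."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole video.
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame_bgr = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


# e.g. compare a model's accuracy when given 8, 32, or 256 input frames
frames = sample_frames_uniformly("example.mp4", num_frames=32)
```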

Specifically, we include 17 categories of referred reasoning questions, exemplified in the figure below. Please see our released dataset for all samples.

(Figure: example questions for each of the 17 categories.)

Leaderboard

All numbers are multiple-choice accuracy (%); the middle four test columns break results down by source-video duration.

| Model | Test Total (5341) | Test 8s-15s | Test 15s-60s | Test 180s-600s | Test 900s-3600s | Val Total (1337) |
|---|---|---|---|---|---|---|
| GPT-4o (0513) | 66.7 | 71.6 | 76.8 | 66.7 | 61.6 | 66.7 |
| Gemini-1.5-Pro (0514) | 64.4 | 70.2 | 75.3 | 65.0 | 59.1 | 64.0 |
| LLaVA-OneVision-QWen2-72B-OV | 63.2 | 74.3 | 77.4 | 61.6 | 56.5 | 61.3 |
| Gemini-1.5-Flash (0514) | 62.4 | 66.1 | 73.1 | 63.1 | 57.3 | 61.6 |
| GPT-4-Turbo (0409) | 60.7 | 66.4 | 71.1 | 61.7 | 54.5 | 59.1 |
| Kangaroo | 54.8 | 65.6 | 65.7 | 52.7 | 49.1 | 54.2 |
| PLLaVA-34B | 53.5 | 60.1 | 66.8 | 50.8 | 49.1 | 53.2 |
| LLaVA-Next-Video-34B | 50.5 | 57.6 | 61.6 | 48.7 | 45.9 | 50.5 |
| Phi-3-Vision-Instruct | 49.9 | 58.3 | 59.6 | 48.4 | 45.1 | 49.6 |
| Idefics2 | 49.4 | 57.4 | 60.4 | 47.3 | 44.7 | 49.7 |
| Mantis-Idefics2 | 47.6 | 56.1 | 61.4 | 44.6 | 42.5 | 47.0 |
| LLaVA-Next-Mistral-7B | 47.1 | 53.4 | 57.2 | 46.9 | 42.1 | 49.1 |
| PLLaVA-13B | 45.1 | 52.9 | 54.3 | 42.9 | 41.2 | 45.6 |
| InstructBLIP-T5-XXL | 43.8 | 48.1 | 50.1 | 44.5 | 40.0 | 43.3 |
| Mantis-BakLLaVA | 43.7 | 51.3 | 52.7 | 41.1 | 40.1 | 43.7 |
| BLIP-2-T5-XXL | 43.5 | 46.7 | 47.4 | 44.2 | 40.9 | 42.7 |
| LLaVA-Next-Video-M7B | 43.5 | 50.9 | 53.1 | 42.6 | 38.9 | 43.5 |
| LLaVA-1.5-13B | 43.1 | 49.0 | 51.1 | 41.8 | 39.6 | 43.4 |
| ShareGPT4Video | 41.8 | 46.9 | 50.1 | 40.0 | 38.7 | 39.7 |
| VideoChat2 (Mistral-7B) | 41.2 | 49.3 | 49.3 | 39.0 | 37.5 | 39.3 |
| LLaVA-1.5-7B | 40.4 | 45.0 | 47.4 | 40.1 | 37.0 | 40.3 |
| mPLUG-Owl2 | 39.4 | 49.4 | 47.3 | 38.7 | 34.3 | 39.1 |
| PLLaVA-7B | 39.2 | 45.3 | 47.3 | 38.5 | 35.2 | 40.2 |
| VideoLLaVA | 37.6 | 43.1 | 44.6 | 36.4 | 34.4 | 39.1 |
| VideoChat2 (Vicuna 7B) | 35.1 | 38.1 | 40.5 | 33.5 | 33.6 | 36.0 |
Please refer to our Hugging Face leaderboard for more results!
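For reference, each duration column above is simply the accuracy over the subset of questions whose source videos fall in that duration bucket. A minimal sketch of that computation follows, assuming a hypothetical list of prediction records with `duration_group`, `prediction`, and `answer` fields; this is not the official scoring script.

```python
from collections import defaultdict


def accuracy_by_duration_group(records):
    """Compute overall and per-duration-bucket accuracy (in %).

    `records` is a hypothetical list of dicts with keys
    'duration_group', 'prediction', and 'answer' (option letters).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r["duration_group"]
        for key in (group, "all"):
            total[key] += 1
            if r["prediction"] == r["answer"]:
                correct[key] += 1
    return {g: 100.0 * correct[g] / total[g] for g in total}


# Example usage:
# accuracy_by_duration_group([
#     {"duration_group": "8s-15s", "prediction": "A", "answer": "A"},
#     {"duration_group": "900s-3600s", "prediction": "B", "answer": "C"},
# ])
# -> {"8s-15s": 100.0, "all": 50.0, "900s-3600s": 0.0}
```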

BibTeX

@misc{wu2024longvideobench,
  title={LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding},
  author={Haoning Wu and Dongxu Li and Bei Chen and Junnan Li},
  year={2024},
  eprint={2407.15754},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.15754}
}