◆ Benchmark for fleeting visual evidence

Moment-Video

Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

* Equal contribution

1,000 Human-verified QA
7 Video domains
25 Subcategories
33 Models tested

Abstract

Moment-Video is a benchmark for diagnosing the temporal fidelity of video multimodal large language models (MLLMs) on momentary visual events: localized actions or state transitions that may last only a few frames, yet determine the correct answer.

Unlike benchmarks centered on persistent objects, global scene context, or long-form semantic aggregation, Moment-Video asks whether models can notice, count, describe, and reason over brief answer-critical evidence. The benchmark contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering both real-world and virtual scenarios.

Video Categories


Task Types

Temporal Occurrence (TO)

Whether a brief event or state transition happens in the video.

Temporal Counting (TC)

How many transient actions, object changes, or repeated events occur.

Action Description (AD)

How a momentary event unfolds, including direction, trajectory, target, interaction, or state change.

Temporal Reasoning (TR)

How the pre-event state, momentary event, and post-event state imply the final answer.
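As a rough illustration of how items tagged with these four task types could be stored and scored per type, here is a minimal sketch. The `QAItem` layout, its field names, the placeholder domain strings, and the use of plain accuracy as the per-type score are all assumptions made for this example; the page does not specify the data format or the metric.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    """The four task types, keyed by the abbreviations used above."""
    TO = "Temporal Occurrence"
    TC = "Temporal Counting"
    AD = "Action Description"
    TR = "Temporal Reasoning"


@dataclass(frozen=True)
class QAItem:
    # Hypothetical record layout; not the benchmark's actual schema.
    item_id: str
    domain: str
    task: TaskType
    question: str
    answer: str


def accuracy_by_task(items, predictions):
    """Return {TaskType: fraction correct}; predictions maps item_id -> answer."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.task] += 1
        if predictions.get(item.item_id) == item.answer:
            correct[item.task] += 1
    return {task: correct[task] / total[task] for task in total}


# Made-up items and predictions, for illustration only.
items = [
    QAItem("q1", "domain-a", TaskType.TC, "How many times does the ball bounce?", "3"),
    QAItem("q2", "domain-a", TaskType.TC, "How many times does the lamp flicker?", "2"),
    QAItem("q3", "domain-b", TaskType.TO, "Does the door open at any point?", "yes"),
]
predictions = {"q1": "3", "q2": "4", "q3": "yes"}
scores = accuracy_by_task(items, predictions)
```

Grouping on `item.domain` instead of `item.task` would give the corresponding per-domain breakdown under the same assumptions.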

Task Categories

Leaderboard

We evaluate 33 proprietary and open-source video MLLMs. Seed-2.0-Pro/Lite/Mini and MIMO-v2.5 use their default frame-sampling settings; all other models are evaluated with a 64-frame cap (50 frames for GPT-5.4). Results are reported at both 1 FPS and 8 FPS.
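To make the interaction between a sampling rate and a frame cap concrete, the sketch below samples timestamps at a target FPS and thins them uniformly when the cap is exceeded. The function name `sample_frame_times` and the uniform-thinning rule are assumptions for illustration; the page does not say how the cap is actually enforced.

```python
def sample_frame_times(duration_s, fps, max_frames=None):
    """Timestamps (in seconds) sampled at `fps`, thinned uniformly to `max_frames`.

    Assumed behavior, not the benchmark's documented procedure.
    """
    if duration_s <= 0 or fps <= 0:
        return []
    n = max(1, int(duration_s * fps))      # candidate frames at the target rate
    times = [i / fps for i in range(n)]
    if max_frames is not None and n > max_frames:
        # Keep max_frames evenly spaced candidates across the whole video.
        step = n / max_frames
        times = [times[int(i * step)] for i in range(max_frames)]
    return times


# A hypothetical 30-second clip under the settings mentioned above.
at_1fps = sample_frame_times(30, fps=1, max_frames=64)
at_8fps = sample_frame_times(30, fps=8, max_frames=64)
at_8fps_cap50 = sample_frame_times(30, fps=8, max_frames=50)
```

Under these assumptions, the 30-second clip yields 30 frames at 1 FPS, while at 8 FPS the 240 candidate frames are thinned to 64 (or 50), so the effective rate falls to roughly 2 FPS. For longer videos the cap, rather than the nominal FPS, decides whether an event lasting only a few frames is sampled at all.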

Legend: gold = best; underline = runner-up; bars = score strength.

By Task Type

By Video Domain