In today’s world of artificial intelligence, benchmarks are essential for measuring AI models’ performance, especially in visual tasks that test reasoning and perception. However, large multimodal models (LMMs) struggle with image interpretation and often have weak spatial understanding. Despite these limitations, LMMs score highly on many existing visual benchmarks, which quickly become outdated as models improve. To tackle this issue, a group of researchers has introduced ZeroBench, a new benchmark designed to remain challenging for LMMs over time.
Claimed to be “An Impossible* Visual Benchmark for Contemporary Large Multimodal Models,” ZeroBench aims to push the limits of model performance in visual reasoning.
What is ZeroBench?
The ZeroBench benchmark comprises 100 carefully curated main questions, each designed to test the limits of contemporary LMMs. To further refine the evaluation process, it includes 334 subquestions corresponding to the individual reasoning steps required to answer each main question. Every question tests a model’s ability to interpret and reason about images, demanding complex cognitive processing and multi-step reasoning. The benchmark is designed to remain relevant and challenging as LMMs improve.
Key Features of ZeroBench
1. Lightweight but Focused
With only 100 main questions, ZeroBench prioritizes quality over quantity. This focused approach allows for a deeper evaluation of model performance without unnecessary complexity.
2. Too Difficult for Current Models
ZeroBench is designed to be impossible for today’s most advanced models, with none expected to score above 0.0%. This sets a new standard, encouraging future improvements in model capabilities.
3. Diverse Question Types
ZeroBench covers a wide range of topics and visual contexts and includes questions based on both natural and synthetic images. This variety ensures a well-rounded test of how models process and understand visual information.
How ZeroBench Was Created
The creation of ZeroBench involved a rigorous, multi-step process. A pool of over 20 human question creators generated a diverse set of candidate questions, with instructions to focus on difficult visual components, multi-step reasoning, and maximum challenge. These questions were then subjected to a four-part curation pipeline, including feedback, initial evaluation, thorough review, and adversarial filtering to ensure the final 100 questions were truly impossible for current models.
Sample Questions
The questions in ZeroBench are varied and thought-provoking. For instance, one question involves interpreting a chess position and providing the Forsyth-Edwards Notation (FEN) for the configuration. Another question requires calculating various statistics based on contributions made over a specific period. Each question stretches the limits of current AI models, revealing their strengths and weaknesses in visual reasoning.
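To make the chess question concrete, the snippet below unpacks a FEN string into its six standard fields. The position shown is the well-known standard starting position, used here purely as an illustration; it is not taken from ZeroBench itself.

```python
# A FEN string encodes a chess position in six space-separated fields:
# piece placement, active color, castling rights, en passant square,
# halfmove clock, and fullmove number.
fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"

placement, active, castling, en_passant, halfmove, fullmove = fen.split()

print(placement.split("/")[0])  # "rnbqkbnr" -- black's back rank
print(active)                   # "w" -- white to move
print(castling)                 # "KQkq" -- all castling rights intact
```

A model answering this kind of ZeroBench question must produce the entire string correctly from the image alone, which requires precise perception of every square on the board.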
More example questions and subquestions are available in the ZeroBench GitHub repository.
Evaluating LMMs on ZeroBench
To assess ZeroBench’s difficulty, the researchers evaluated 20 leading models. The leaderboard results confirm that ZeroBench is genuinely hard for contemporary LMMs: every model scored 0% on the pass@1 metric, meaning none could answer a single main question correctly on its first attempt. Some models did achieve non-zero scores on the pass@5 metric, suggesting that a few questions are occasionally solvable but that reliability remains low.
On the subquestions, Claude Sonnet 3.5 v2 achieved the highest score, answering 81 of the 334 subquestions correctly for a 24.30% pass@1 rate. Although easier than the main questions, the subquestions still proved challenging for every model tested.
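The pass@k distinction above is central to reading these results. The sketch below shows the simplest interpretation of pass@k, namely a question counts as solved if any of the first k sampled responses is correct; the benchmark’s exact estimator may differ, and the toy data here is invented for illustration.

```python
def pass_at_k(samples_per_question, k):
    """Fraction of questions with at least one correct answer among
    the first k sampled responses (each inner list holds True/False
    correctness flags, one per sampled response)."""
    solved = sum(1 for samples in samples_per_question if any(samples[:k]))
    return solved / len(samples_per_question)

# Hypothetical results for two questions, five samples each.
results = [
    [False, False, True, False, False],   # solved only on the 3rd try
    [False, False, False, False, False],  # never solved
]

print(pass_at_k(results, 1))  # 0.0 -- nothing solved on the first attempt
print(pass_at_k(results, 5))  # 0.5 -- one of two solved within 5 attempts
```

This is why a model can score 0% pass@1 yet non-zero pass@5: occasional lucky or inconsistent successes surface only when multiple samples are drawn.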
Beyond measuring scores, ZeroBench focuses on analyzing mistakes. By studying common errors, researchers can identify patterns and weaknesses in how models interpret visual information. This helps guide future improvements, making it a valuable tool for advancing model development.
How to Access ZeroBench
ZeroBench is available on Hugging Face as the jonathan-roberts1/zerobench dataset. Researchers can easily load both the main questions and the subquestions for evaluation.
To access the dataset, use the following Python code:
from datasets import load_dataset
# Load main questions
zerobench_ds = load_dataset('jonathan-roberts1/zerobench', split='zerobench')
# Load subquestions
zerobench_subquestions_ds = load_dataset('jonathan-roberts1/zerobench', split='zerobench_subquestions')
The dataset includes key details such as question text, images, answers, and attribution. More evaluation details and implementation guidelines can be found on the ZeroBench GitHub repository.
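A simple exact-match scoring loop over records like those described above might look as follows. The field names and toy rows here are assumptions for illustration; the real dataset schema and the official evaluation logic are documented in the repository.

```python
# Toy records standing in for dataset rows; the real field names
# (question, answer, image, attribution) may differ on Hugging Face.
records = [
    {"question": "How many circles are in the image?", "answer": "12"},
    {"question": "What is the FEN of this position?",
     "answer": "8/8/8/8/8/8/8/8 w - - 0 1"},
]

def score(predictions, records):
    """Fraction of exact-match answers -- a deliberate simplification
    of the benchmark's pass@1 scoring for demonstration purposes."""
    correct = sum(p.strip() == r["answer"]
                  for p, r in zip(predictions, records))
    return correct / len(records)

preds = ["12", "an incorrect guess"]
print(score(preds, records))  # 0.5 -- one of two answered correctly
```

Consulting the official implementation is advisable before reporting numbers, since details such as answer normalization affect scores.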
Concluding Remarks
ZeroBench sets a new standard for evaluating model performance, pushing researchers to develop more advanced systems. Unlike traditional benchmarks that models quickly outgrow, ZeroBench remains challenging, ensuring long-term relevance. By demanding higher levels of reasoning and interpretation, it drives continuous progress in model development and visual understanding of LMMs.