Historically, benchmarks have played a crucial role in the development and evaluation of AI models, providing a standard for measuring performance and guiding research. However, the rapid advancement of LLMs has forced a reevaluation: modern models have simply outgrown many of these tests. The GitHub repository “Killed by LLM” by R0bk serves as a memorial to the benchmarks that have been overtaken by AI. Let’s get into the details.
What is Killed by LLM?
Killed by LLM is a project dedicated to cataloguing benchmarks that have been rendered obsolete by advancements in AI, specifically by LLMs. While these benchmarks retain historical value, they can no longer effectively measure AI capabilities or meaningfully answer the question “Can AI do X?” The initiative was inspired by similar projects, notably Cody Ogden’s Killed by Google, which documented tools and services discontinued amid the rapid evolution of technology.
Understanding Saturation in AI Benchmarks
Saturation in the context of AI benchmarks refers to the phenomenon where specific tests cease to challenge the capabilities of modern AI models. This concept is crucial for understanding the evolution of AI performance metrics. As LLMs become increasingly proficient, they have surpassed the benchmarks that once served as rigorous tests. For instance, benchmarks like the Turing Test, originally intended to evaluate a machine’s ability to exhibit intelligent behaviour indistinguishable from that of a human, no longer serve as effective measures. Modern LLMs consistently pass this test.
Key Benchmarks Killed by LLM in 2024
1. ARC-AGI: The Abstraction and Reasoning Corpus
The Abstraction and Reasoning Corpus (ARC) benchmark was designed to evaluate a model’s ability to infer and complete visual patterns. It has since become a marker of how far AI has come: OpenAI’s o3 achieved a score of 87.5%, surpassing the human baseline of approximately 80%. The benchmark stood for just over five years before advancements in LLMs defeated it.
2. MATH: Challenging Mathematical Problems
The MATH dataset comprises 12,500 competition-level math problems. This benchmark has illustrated the journey of AI in mathematical reasoning. Originally, an average computer science PhD student scored around 40% on this test, but OpenAI’s o1 achieved an impressive score of 94.8%, highlighting the evolving capabilities of AI in solving complex mathematical challenges.
3. BIG-Bench-Hard: Multi-Task Performance
The purpose of the BIG-Bench-Hard benchmark was to assess the performance of LLMs across a suite of challenging tasks. This benchmark reached saturation as models like Claude 3.5 Sonnet scored 93.1%, significantly higher than the average human score of 67.7%. Its trajectory emphasizes the growing proficiency of AI in multi-tasking and complex problem-solving.
4. HumanEval: Coding Capabilities
HumanEval consists of 164 Python programming problems designed to test the coding capabilities of language models. Initially, these problems posed significant challenges for AI, but the model GPT-4o achieved a score of 90.2%, showcasing the remarkable advancements in AI’s coding abilities. This benchmark has become a testament to the power of LLMs in programming tasks.
5. IFEval: Instruction Following
The IFEval benchmark evaluates AI’s ability to follow complex, multi-step instructions across various domains. Meta’s Llama 3.3 70B model has saturated this benchmark with a score of 92.1%. The ability of AI to understand and execute intricate instructions illustrates the remarkable strides made in AI’s instruction-following capabilities.
Conclusion: Reflecting on AI Progress
The Killed by LLM project serves as a poignant reminder of the remarkable progress made in AI. As benchmarks continue to be surpassed, it becomes increasingly important to document and understand the implications of these advancements. The memorialization of benchmarks through Killed by LLM highlights the capabilities of modern AI and encourages continuous exploration and innovation in the field. Recognizing and understanding these milestones will be key to shaping the future of artificial intelligence.