Humanity’s Last Exam

As AI models continue to dominate existing benchmarks, excelling in fields like coding and mathematics, it has become necessary to create new, more rigorous testing standards. Enter Humanity’s Last Exam (HLE), a benchmark developed by Scale AI and the Center for AI Safety to push AI systems beyond their current limits. For now, this exam remains a key barrier between expert-level human intelligence and machines that aspire to surpass it.

The premise is straightforward: design an academic-style exam that demands deep, expert-level reasoning across a vast range of subjects, from mathematics and physics to the humanities and engineering. HLE comprises roughly 3,000 questions crafted by nearly 1,000 subject-matter experts from over 500 institutions worldwide. These questions, often at or beyond graduate level, are not just difficult; they are designed to be nearly insurmountable. Even today’s most advanced AI models struggle, scoring below 10% on the test. HLE is not limited to text, either: it incorporates multi-modal challenges that require AI to interpret images, diagrams, and complex visual data. To encourage truly formidable questions, the organizers offered prizes, with top-rated questions earning their authors $500 or $5,000 each.

The reason AI struggles with HLE lies in its narrow intelligence. Most current models, like GPT-4o, Gemini, Claude, o1, and o3-mini, fall under the category of Artificial Narrow Intelligence (ANI): highly capable in specialized tasks but lacking the broad, deep intelligence of human experts. While these models impress in daily use, they still fall short when confronted with the kind of complex, multidisciplinary reasoning that seasoned researchers tackle.

Yet, while today’s AI scores remain low, the trajectory of AI development suggests that this reality won’t last long. Experts predict that AI could cross the 50% accuracy threshold as early as 2025 and surpass 90% within the next 3 to 5 years. If that timeline holds, we could be on the verge of an era where AI outperforms even the most elite human minds across an array of fields.

Even reaching 50% accuracy would be a seismic shift. It would mean that AI systems can correctly answer half of the hardest questions that the world’s top experts could devise. The implications of this advancement (scientific, economic, and philosophical) are staggering. And if AI continues its exponential ascent, what we call "intelligence" today may be redefined entirely in the coming decade.

To learn more about HLE, please visit https://agi.safe.ai

