ARC-AGI-2: Overview

The ARC-AGI-2 benchmark measures the "fluid intelligence" of AI models. It consists of novel visual grid puzzles that require abstract reasoning to solve. Much like puzzles in a puzzle magazine, each task presents a few example pairs of input and output grids, from which the underlying transformation rule must be inferred and then applied to a new test input. Compared with its predecessor ARC-AGI-1, which is now saturated, the tasks in ARC-AGI-2 were deliberately made harder. The focus is on multi-step rules and context-dependent interpretation of symbols. LLMs without reasoning capabilities score 0%, and even specialized reasoning systems scored only in the single-digit percent range at the time of release. In contrast to many other benchmarks, human participants reach a high average score of 66%. The ARC-AGI benchmarks are among the most demanding tests of general problem-solving ability and target the weaknesses of LLM reasoning.
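The idea of inferring a rule from a few example pairs and then applying it to a test input can be sketched in a few lines of Python. This is only an illustration: the helper names (`fits_all_pairs`, `flip_horizontal`) and the toy grids are invented for this sketch and are not taken from the benchmark.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]


def fits_all_pairs(rule: Callable[[Grid], Grid],
                   pairs: List[Tuple[Grid, Grid]]) -> bool:
    """Return True if the candidate rule reproduces every example output."""
    return all(rule(inp) == out for inp, out in pairs)


def flip_horizontal(grid: Grid) -> Grid:
    """Hypothetical candidate rule: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


# Invented toy example pairs (not real ARC-AGI-2 data).
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 4, 0]], [[0, 4, 4]]),
]

# Only a rule that explains all example pairs is applied to the test input.
if fits_all_pairs(flip_horizontal, train_pairs):
    prediction = flip_horizontal([[5, 0, 6]])
```

A solver would typically try many candidate rules this way; the hard part of the benchmark is coming up with the right candidate, not checking it.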

ARC-AGI-2 Leaderboard

Ranking of all tested models on the ARC-AGI-2 benchmark, sorted by score.

#	Model	Score
1	GPT-5.5	85%
2	Gemini 3.1 Pro	77.1%
3	Gemini 3.5 Flash	72.1%
4	Claude Opus 4.6	68.8%
5	GPT-5.6 Sol	67.1%
6	Claude Sonnet 4.6	58.3%
7	GPT-5.2 Pro	54.2%
8	GPT-5.2	52.9%
9	Gemini 3 Deep Think	45.1%
10	Muse Spark	42.5%
11	Claude Opus 4.5	37.6%
12	GPT-5.6 Terra	37.5%
13	Gemini 3 Pro	31.1%
14	Grok-4	15.9%
15	GPT-5.6 Luna	7.4%

Example Tasks from the ARC-AGI-2 Benchmark

The following example tasks illustrate typical problems found in the ARC-AGI-2 benchmark.

Given training pairs where a 2x2 input grid is transformed into a 6x6 output grid (e.g., Input: [[7, 9], [4, 3]] produces a 6x6 tiling pattern), determine the transformation rule and apply it to the test input: [[3, 2], [7, 8]]

[[3, 2, 3, 2, 3, 2], [7, 8, 7, 8, 7, 8], [2, 3, 2, 3, 2, 3], [8, 7, 8, 7, 8, 7], [3, 2, 3, 2, 3, 2], [7, 8, 7, 8, 7, 8]]
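One rule that is consistent with the answer above is: repeat the 2x2 grid three times horizontally, then stack three such bands, with the middle band built from the left-right mirrored grid. The following is a minimal sketch of that reading; the function name `tile_with_flipped_band` is made up for this example, and the rule itself is inferred from the answer shown here rather than taken from an official solution.

```python
from typing import List

Grid = List[List[int]]


def tile_with_flipped_band(grid: Grid) -> Grid:
    """Tile a grid 3x3, mirroring each row in the middle band of tiles."""
    mirrored = [row[::-1] for row in grid]

    def band(g: Grid) -> Grid:
        # Repeat every row three times side by side.
        return [row * 3 for row in g]

    return band(grid) + band(mirrored) + band(grid)


result = tile_with_flipped_band([[3, 2], [7, 8]])
```

For the test input `[[3, 2], [7, 8]]` this reproduces the 6x6 grid listed above.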

Given training pairs where a 3x7 input grid (two 3x3 halves separated by a column of 5s) produces a 3x3 output grid (e.g., Input: [[1,1,0,5,0,1,0],[0,0,1,5,1,1,1],[1,1,0,5,0,1,0]] → Output: [[0,2,0],[0,0,2],[0,2,0]]), apply the rule to test input: [[1,0,1,5,1,0,1],[0,1,0,5,1,0,1],[1,0,1,5,0,1,0]]

[[2, 0, 2], [0, 0, 0], [0, 0, 0]]
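Both the training pair and the answer above are consistent with a cell-wise overlap rule: a cell in the output is 2 where the left and right halves are both non-zero, and 0 otherwise. The sketch below assumes exactly that reading; `overlap_of_halves` and its parameters are hypothetical names chosen for this example.

```python
from typing import List

Grid = List[List[int]]


def overlap_of_halves(grid: Grid, separator: int = 5, mark: int = 2) -> Grid:
    """Mark cells that are non-zero in both halves around the separator column."""
    result = []
    for row in grid:
        mid = row.index(separator)           # position of the separator column
        left, right = row[:mid], row[mid + 1:]
        result.append([mark if a != 0 and b != 0 else 0
                       for a, b in zip(left, right)])
    return result


train_out = overlap_of_halves([[1, 1, 0, 5, 0, 1, 0],
                               [0, 0, 1, 5, 1, 1, 1],
                               [1, 1, 0, 5, 0, 1, 0]])
test_out = overlap_of_halves([[1, 0, 1, 5, 1, 0, 1],
                              [0, 1, 0, 5, 1, 0, 1],
                              [1, 0, 1, 5, 0, 1, 0]])
```

Applied to the training input, this yields the training output quoted in the task; applied to the test input, it yields the answer above.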

Given training pairs where parallelogram-shaped outlines on a grid are transformed (the top edge shifts right while the bottom edge stays fixed, making the shape more rectangular), apply the same transformation to a test input containing a parallelogram outline made of color 4 on a 10x10 grid.

A 10x10 grid where the parallelogram outline of color 4 has been regularized: the top edge is shifted one position to the right while the bottom edge remains fixed, producing a more symmetrical closed shape.

Given training pairs where rectangular regions enclosed by borders of color 3 have their interior cells (value 0) filled with color 4 while all other cells remain unchanged, apply this rule to a 20x20 test grid containing multiple rectangular regions bordered by 3s.

The 20x20 output grid where all interior cells enclosed by complete rectangular borders of 3s are replaced with 4, while border cells (3s) and exterior cells (0s) remain unchanged.
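One common way to find enclosed cells is a flood fill from the grid edge: every 0 that can be reached from the edge through other 0s is exterior, and every remaining 0 is interior. The sketch below assumes 4-connectivity (no diagonal moves) and treats any non-zero cell as a wall; the function name `fill_enclosed` and the small demo grid are invented for illustration and are not the actual 20x20 test grid.

```python
from collections import deque
from typing import List

Grid = List[List[int]]


def fill_enclosed(grid: Grid, fill: int = 4) -> Grid:
    """Replace every 0 that cannot reach the grid edge through 0s with `fill`."""
    h, w = len(grid), len(grid[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque()

    # Seed the search with all 0 cells on the outer edge of the grid.
    for r in range(h):
        for c in range(w):
            if (r in (0, h - 1) or c in (0, w - 1)) and grid[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))

    # Breadth-first flood fill over 4-connected 0 cells.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w
                    and not outside[nr][nc] and grid[nr][nc] == 0):
                outside[nr][nc] = True
                queue.append((nr, nc))

    return [[fill if grid[r][c] == 0 and not outside[r][c] else grid[r][c]
             for c in range(w)] for r in range(h)]


# Invented demo grid: one closed border of 3s around a single interior cell.
demo = [
    [0, 0, 0, 0, 0],
    [0, 3, 3, 3, 0],
    [0, 3, 0, 3, 0],
    [0, 3, 3, 3, 0],
    [0, 0, 0, 0, 0],
]
filled = fill_enclosed(demo)
```

A border with a gap leaves its interior connected to the edge, so nothing is filled in that case, which matches the requirement that only completely enclosed regions change.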

Given training pairs where a 3x3 input grid is tiled into a 9x9 output grid by placing the input pattern in each cell position where the original input has a non-zero value (e.g., Input: [[0,7,7],[7,7,7],[0,7,7]] produces a 9x9 grid with the input pattern in positions corresponding to non-zero cells), apply to test input: [[7,0,7],[7,0,7],[7,7,0]]

A 9x9 grid where the 3x3 input pattern [[7,0,7],[7,0,7],[7,7,0]] is placed in the block positions corresponding to non-zero cells of the input, and zero-filled 3x3 blocks are placed in positions corresponding to zero cells.
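The described rule can be sketched as a self-tiling operation: the output is divided into blocks the size of the input, and a block receives a copy of the input only where the corresponding input cell is non-zero. The function name `self_tile` is hypothetical, and the sketch assumes a square input grid.

```python
from typing import List

Grid = List[List[int]]


def self_tile(grid: Grid) -> Grid:
    """Place a copy of the grid in each block whose controlling cell is non-zero."""
    n = len(grid)                       # assumes an n x n input
    out = [[0] * (n * n) for _ in range(n * n)]
    for block_r in range(n):
        for block_c in range(n):
            if grid[block_r][block_c] != 0:
                # Copy the whole input pattern into this block.
                for r in range(n):
                    for c in range(n):
                        out[block_r * n + r][block_c * n + c] = grid[r][c]
    return out


tiled = self_tile([[7, 0, 7], [7, 0, 7], [7, 7, 0]])
```

For the test input, the blocks in the middle column of the first two block rows and the bottom-right block stay empty, because the input has zeros at those positions.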

Similar Benchmarks

AIME 2025

AIME 2026

ARC-AGI-3

Arena.ai

Artificial Analysis Intelligence Index

FrontierMath