Terminal-Bench 2.0: Overview

Terminal-Bench 2.0 is a benchmark for evaluating AI agents in terminal environments. It comprises 89 tasks drawn from real-world workflows, including software engineering, system administration, scientific computing, and cryptanalysis. Each task ships with its own Docker environment, a human-written reference solution, and deterministic pytest tests for verification. Terminal-Bench v2.1 is an updated version of the benchmark that fixes several bugs and improves the dataset.

Terminal-Bench 2.0 Leaderboard

Ranking of all tested models on the Terminal-Bench 2.0 benchmark, sorted by score in descending order.

#	Model	Score
1	GPT-5.5	82.7%
2	Claude Mythos Preview	82%
3	GPT-5.3-Codex	77.3%
4	GPT-5.4	75.1%
5	Claude Opus 4.7	69.4%
6	Gemini 3.1 Pro	68.5%
7	DeepSeek-V4-Pro (Preview)	67.9%
8	Kimi K2.6	66.7%
9	Claude Opus 4.6	65.4%
10	GLM-5.1	63.5%
11	Claude Opus 4.5	59.3%
12	Claude Sonnet 4.6	59.1%
13	Muse Spark	59%
14	MiniMax M2.7	57%
15	GLM-5	56.2%
16	Qwen3.5-397B-A17B	52.5%
17	MiniMax M2.5	51.7%
18	Kimi K2.5 (Thinking)	50.8%
19	MiniMax M2.1	47.9%
20	Kimi K2 Thinking	47.1%
21	DeepSeek-v3.2 Thinking	46.4%
22	GPT-5.4 nano	46.3%
23	MiniMax-M2	46.3%
24	GLM-4.7	41%
25	GPT-5.4 mini	38.2%

Example tasks from the Terminal-Bench 2.0 benchmark

The following example tasks illustrate the kinds of problems that appear in the Terminal-Bench 2.0 benchmark.

adaptive-rejection-sampler: Implement an adaptive-rejection sampler per Gilks et al. (1992) in R. Your solution should allow the user to provide reasonable inputs, including the number of points to sample, and should check the inputs for validity. The primary input should be a function or expression that calculates the (possibly unnormalized) density of the distribution of interest in a vectorized fashion. Save implementation to /app/ars.R with primary function ars().

A working R implementation saved to /app/ars.R containing the ars() function that correctly performs adaptive rejection sampling, validates inputs, checks for log-concavity during sampling, and generates sample output files such as /app/normal_samples.txt. Verified by passing the provided test suite.
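The log-concavity requirement in this task can be illustrated numerically. The following is a minimal sketch in Python rather than R, and it is not the benchmark's reference solution: the function name `is_log_concave`, the grid-based test, and the tolerance are all assumptions made for illustration. It relies on the fact that a concave function has non-increasing chord slopes on any increasing grid.

```python
import math

def is_log_concave(log_density, xs, tol=1e-9):
    """Check on an increasing grid that chord slopes of the log-density never increase.

    Hypothetical helper for illustration; a grid check can miss violations
    between grid points, so it is a necessary test, not a proof.
    """
    if len(xs) < 3:
        raise ValueError("need at least three grid points")
    if any(b <= a for a, b in zip(xs, xs[1:])):
        raise ValueError("grid points must be strictly increasing")
    hs = [log_density(x) for x in xs]
    slopes = [(h1 - h0) / (x1 - x0)
              for x0, x1, h0, h1 in zip(xs, xs[1:], hs, hs[1:])]
    return all(s1 <= s0 + tol for s0, s1 in zip(slopes, slopes[1:]))

grid = [i / 10 for i in range(-30, 31)]

# Unnormalized standard normal: log-density is -x^2 / 2, which is concave.
normal_ok = is_log_concave(lambda x: -0.5 * x * x, grid)

# Equal mixture of normals centered at -2 and +2: bimodal, hence not log-concave.
bimodal_ok = is_log_concave(
    lambda x: math.log(math.exp(-0.5 * (x - 2) ** 2) + math.exp(-0.5 * (x + 2) ** 2)),
    grid,
)

print(normal_ok, bimodal_ok)
```

An adaptive rejection sampler would typically run a check of this kind on the abscissae it accumulates during sampling, since the piecewise-linear envelope is only valid for log-concave densities.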

bayesian-network-recovery: Recover a DAG from data, learn a Bayesian Network, perform causal intervention, and sample results. Save the edges of the DAG to /app/learned_dag.csv in the format: to,from. Learn a Bayesian Network from the data in /app/bn_sample_10k.csv using the DAG you recovered. Perform an intervention setting Y=0.0 and generate 10k samples from the intervened network.

Three output files: /app/learned_dag.csv with recovered DAG edges, /app/intervened_dag.csv with the DAG after intervention, and /app/final_bn_sample.csv with 10,000 samples from the intervened Bayesian Network. Verified by deterministic pytest tests.
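The graph side of the intervention step can be sketched in a few lines. Under the standard do-operator semantics, setting Y to a fixed value removes every edge pointing into Y while leaving its outgoing edges intact. The sketch below uses Python; the three-node DAG, the helper names `intervene` and `edges_to_csv`, and the reading of `to,from` as a CSV header row are assumptions for illustration, not details confirmed by the task.

```python
import csv
import io

def intervene(edges, target):
    """Return the edge list after do(target): drop every edge that points into target.

    Each edge is a (to, from) pair, mirroring the column order named in the task.
    """
    return [(to, frm) for to, frm in edges if to != target]

def edges_to_csv(edges):
    """Serialize edges as CSV text with an assumed 'to,from' header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["to", "from"])
    writer.writerows(edges)
    return buf.getvalue()

# Hypothetical learned DAG: X -> Y, Y -> Z, X -> Z.
learned = [("Y", "X"), ("Z", "Y"), ("Z", "X")]

# After do(Y = 0.0), Y no longer depends on X, but Z still depends on Y and X.
intervened = intervene(learned, "Y")
print(edges_to_csv(intervened))
```

Sampling from the intervened network would then fix Y at 0.0 in every sample and draw the remaining variables from their learned conditional distributions in topological order.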

break-filter-js-from-html: Your task is to create a file called /app/out.html that, even after being processed by /app/filter.py, will still trigger a JavaScript alert() when opened in a browser. Your file should not require any interaction. Run /app/test_outputs.py to verify the solution works.

A crafted HTML file at /app/out.html that bypasses the JavaScript-removal filter in /app/filter.py and successfully triggers a JavaScript alert() in the browser without user interaction. Verified by /app/test_outputs.py.

feal-differential-cryptanalysis: Perform differential cryptanalysis of the FEAL cipher. Analyze the cipher implementation provided in the environment, identify differential characteristics, and recover the encryption key or demonstrate the vulnerability through a working attack.

A successful differential cryptanalysis attack against the FEAL cipher that demonstrates the vulnerability and recovers key material. Verified by the provided pytest test suite.
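To illustrate what a differential characteristic looks like, the sketch below examines the byte-level G function from the published FEAL design, G_x(a, b) = ROL2((a + b + x) mod 256). It exhaustively confirms that an input XOR difference of 0x80 in one operand always produces the output difference 0x02, which is the kind of probability-1 building block that differential attacks on FEAL chain together. The cipher implementation shipped in the task environment may differ in its details, and the helper names here are assumptions for illustration.

```python
def rol2(byte):
    """Rotate an 8-bit value left by two positions."""
    return ((byte << 2) | (byte >> 6)) & 0xFF

def g(a, b, x):
    """FEAL G function on bytes: rotate the sum a + b + x (mod 256) left by 2."""
    return rol2((a + b + x) & 0xFF)

def output_differences(delta, x):
    """Collect every output XOR difference seen when input a is changed by delta."""
    return {g(a, b, x) ^ g(a ^ delta, b, x)
            for a in range(256) for b in range(256)}

# Flipping only the top bit of a flips only the top bit of the sum (the carry
# out of bit 7 is discarded), and ROL2 moves that bit to position 1.
diffs_g0 = output_differences(0x80, 0)
diffs_g1 = output_differences(0x80, 1)
print(diffs_g0, diffs_g1)
```

Because each set contains a single value, the difference propagates deterministically through G, independent of the actual input bytes.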

constraint-scheduling: Find a 1-hour meeting slot for three people with complex availability constraints. Create /app/meeting_scheduled.ics with a 1-hour 'Team Planning Meeting' including all three attendees. Find the earliest valid time slot that satisfies all hard constraints: business hours 9 AM-6 PM, avoid conflicts, respect individual availability windows and preferences during January 15-19, 2024.

A valid ICS file at /app/meeting_scheduled.ics containing a 1-hour 'Team Planning Meeting' scheduled at the earliest possible slot satisfying all hard constraints (business hours, no conflicts, individual availability) for all three attendees. Verified by deterministic tests.
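The core of this task, finding the earliest conflict-free slot inside business hours, can be sketched as a simple scan. The following Python sketch is an illustration only: the busy intervals are invented, the 15-minute search step is an assumption, and it does not model per-person availability windows, preferences, or the ICS output the task requires.

```python
from datetime import datetime, timedelta

def earliest_slot(busy, days, duration=timedelta(hours=1),
                  step=timedelta(minutes=15), open_hour=9, close_hour=18):
    """Return the earliest (start, end) within business hours that overlaps no busy interval."""
    for day in days:
        start = day.replace(hour=open_hour, minute=0, second=0, microsecond=0)
        day_end = day.replace(hour=close_hour, minute=0, second=0, microsecond=0)
        while start + duration <= day_end:
            end = start + duration
            # Two intervals are disjoint if one ends before the other begins.
            if all(end <= b_start or start >= b_end for b_start, b_end in busy):
                return start, end
            start += step
    return None

# January 15-19, 2024 (Monday through Friday).
days = [datetime(2024, 1, 15) + timedelta(days=i) for i in range(5)]

# Hypothetical busy intervals, merged across all three attendees.
busy = [
    (datetime(2024, 1, 15, 9, 0), datetime(2024, 1, 15, 10, 30)),
    (datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 12, 0)),
    (datetime(2024, 1, 15, 12, 0), datetime(2024, 1, 15, 12, 45)),
]

slot = earliest_slot(busy, days)
print(slot)
```

With these invented intervals the scan returns the slot from 12:45 to 13:45 on January 15, the first hour not touched by any busy interval. Scanning forward in time guarantees that the first hit is the earliest one, at the granularity of the chosen step.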
