Terminal-Bench 2.0
Terminal-Bench 2.0: Overview
Terminal-Bench 2.0 is a benchmark for evaluating AI agents. It comprises 89 tasks drawn from real workflows, including software development, system administration, scientific computing, and cryptanalysis. Each task ships with its own Docker environment, a human-written reference solution, and deterministic pytest tests for verification.
Terminal-Bench 2.0 Leaderboard
Ranking of all tested models on the Terminal-Bench 2.0 benchmark, sorted by score.
Example tasks from the Terminal-Bench 2.0 benchmark
The following example tasks illustrate typical problems found in the Terminal-Bench 2.0 benchmark.
adaptive-rejection-sampler: Implement an adaptive-rejection sampler per Gilks et al. (1992) in R. Your solution should allow the user to provide reasonable inputs, including the number of points to sample, and should check the inputs for validity. The primary input should be a function or expression that calculates the (possibly unnormalized) density of the distribution of interest in a vectorized fashion. Save implementation to /app/ars.R with primary function ars().
A working R implementation saved to /app/ars.R containing the ars() function that correctly performs adaptive rejection sampling, validates inputs, checks for log-concavity during sampling, and generates sample output files such as /app/normal_samples.txt. Verified by passing the provided test suite.
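The log-concavity requirement in this task can be illustrated with a small numerical check. The sketch below is in Python rather than the R the task requires, and it is not the reference solution: the function names, the grid, and the tolerance are all assumptions chosen for illustration. It relies on the fact that a log-concave density has a log whose slopes between consecutive points never increase.

```python
import math

def is_log_concave(log_density, xs, tol=1e-9):
    """Return True if the secant slopes of log_density over xs never increase."""
    ys = [log_density(x) for x in xs]
    slopes = [
        (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    ]
    return all(slopes[i + 1] <= slopes[i] + tol for i in range(len(slopes) - 1))

def normal_logpdf(x):
    # Unnormalized log density of a standard normal (log-concave).
    return -0.5 * x * x

def bimodal_logpdf(x):
    # Equal mixture of two normals centred at -3 and +3 (not log-concave).
    return math.log(math.exp(-0.5 * (x - 3) ** 2) + math.exp(-0.5 * (x + 3) ** 2))

# Hypothetical evaluation grid from -5 to 5 in steps of 0.5.
grid = [i / 2 for i in range(-10, 11)]

print(is_log_concave(normal_logpdf, grid))
print(is_log_concave(bimodal_logpdf, grid))
```

A check on a fixed grid can only reject a density, never prove log-concavity everywhere, which is one reason the task asks for the check to run during sampling as new points are added.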
bayesian-network-recovery: Recover a DAG from data, learn a Bayesian Network, perform causal intervention, and sample results. Save the edges of the DAG to /app/learned_dag.csv in the format: to,from. Learn a Bayesian Network from the data in /app/bn_sample_10k.csv using the DAG you recovered. Perform an intervention setting Y=0.0 and generate 10k samples from the intervened network.
Three output files: /app/learned_dag.csv with recovered DAG edges, /app/intervened_dag.csv with the DAG after intervention, and /app/final_bn_sample.csv with 10,000 samples from the intervened Bayesian Network. Verified by deterministic pytest tests.
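The intervention step can be sketched at the graph level: fixing a variable by intervention removes every edge pointing into it, while its outgoing edges stay. The snippet below is a minimal illustration of that idea only. The edge list and the variable names X and Z are hypothetical, and structure learning and sampling are not shown.

```python
import csv
import io

def intervene(edges, target):
    """Drop every edge pointing into `target`; edges are (to, from) pairs."""
    return [(to, frm) for (to, frm) in edges if to != target]

# Hypothetical edge list in the "to,from" format named in the task.
sample_csv = "to,from\nY,X\nZ,Y\nZ,X\n"

rows = list(csv.reader(io.StringIO(sample_csv)))
header, edges = rows[0], [tuple(r) for r in rows[1:]]

# Intervening on Y removes X -> Y but keeps Y -> Z and X -> Z.
intervened = intervene(edges, "Y")
print(intervened)
```

When sampling from the intervened network, Y would then be held at the fixed value (0.0 in the task) instead of being drawn from its learned conditional distribution.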
break-filter-js-from-html: Your task is to create a file called /app/out.html that, even after being processed by /app/filter.py, will still trigger a JavaScript alert() when opened in a browser. Your file should not require any interaction. Run /app/test_outputs.py to verify the solution works.
A crafted HTML file at /app/out.html that bypasses the JavaScript-removal filter in /app/filter.py and successfully triggers a JavaScript alert() in the browser without user interaction. Verified by /app/test_outputs.py.
feal-differential-cryptanalysis: Perform differential cryptanalysis of the FEAL cipher. Analyze the cipher implementation provided in the environment, identify differential characteristics, and recover the encryption key or demonstrate the vulnerability through a working attack.
A successful differential cryptanalysis attack against the FEAL cipher that demonstrates the vulnerability and recovers key material. Verified by the provided pytest test suite.
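As an illustration of what a differential characteristic looks like, the sketch below checks a well-known property of the FEAL S-box as published, S_x(a, b) = ROL2((a + b + x) mod 256): an input difference of 0x80 in one operand always yields an output difference of 0x02. This assumes the cipher in the task environment uses the standard S-box, which the task description does not state. The helper names are hypothetical, and this is a single building block, not a full key-recovery attack.

```python
def rol2(x):
    """Rotate an 8-bit value left by two bits."""
    return ((x << 2) | (x >> 6)) & 0xFF

def s_box(a, b, x):
    """FEAL S-box as published: ROL2((a + b + x) mod 256), with x in {0, 1}."""
    return rol2((a + b + x) & 0xFF)

def holds_everywhere(delta_in=0x80, delta_out=0x02):
    """Exhaustively check that delta_in on `a` always produces delta_out."""
    for x in (0, 1):
        for a in range(256):
            for b in range(256):
                if s_box(a, b, x) ^ s_box(a ^ delta_in, b, x) != delta_out:
                    return False
    return True

print(holds_everywhere())
```

The property holds with probability 1 because flipping the top bit of a summand flips only the top bit of the sum modulo 256, and the rotation then moves that bit to position 1. Characteristics of this kind are what a differential attack chains across rounds.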
constraint-scheduling: Find a 1-hour meeting slot for three people with complex availability constraints. Create /app/meeting_scheduled.ics with a 1-hour 'Team Planning Meeting' including all three attendees. Find the earliest valid time slot that satisfies all hard constraints: business hours 9 AM-6 PM, avoid conflicts, respect individual availability windows and preferences during January 15-19, 2024.
A valid ICS file at /app/meeting_scheduled.ics containing a 1-hour 'Team Planning Meeting' scheduled at the earliest possible slot satisfying all hard constraints (business hours, no conflicts, individual availability) for all three attendees. Verified by deterministic tests.
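The core search in this task can be sketched as a scan over candidate start times within business hours, returning the first slot that overlaps no busy interval. The example below is a simplified sketch under several assumptions: the busy intervals are made up, all attendees' calendars are assumed to be merged into one list, candidates are tried on a hypothetical 15-minute grid, and individual preferences and ICS output are not handled.

```python
from datetime import datetime, timedelta

def earliest_slot(busy, day_start, day_end,
                  length=timedelta(hours=1), step=timedelta(minutes=15)):
    """Return the first start time whose slot overlaps no busy interval, or None."""
    start = day_start
    while start + length <= day_end:
        end = start + length
        # A slot is free if it ends before, or starts after, every busy interval.
        if all(end <= b_start or start >= b_end for b_start, b_end in busy):
            return start
        start += step
    return None

# Hypothetical merged busy intervals for January 15, 2024.
day = datetime(2024, 1, 15)
busy = [
    (day.replace(hour=9), day.replace(hour=10, minute=30)),
    (day.replace(hour=11), day.replace(hour=12)),
]

# Business hours from 9 AM to 6 PM, as stated in the task.
slot = earliest_slot(busy, day.replace(hour=9), day.replace(hour=18))
print(slot)
```

Covering the full January 15-19 range would mean repeating the scan for each day in order and stopping at the first day that yields a slot.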