τ²-Bench
τ²-Bench: Overview
τ²-Bench is a benchmark for evaluating AI agents. It tests agents in situations where both the agent and the user actively use tools in a shared, dynamic environment. Unlike existing agentic benchmarks, in which only the agent uses tools and the user passively supplies information, τ²-Bench models real-world scenarios such as technical support, where both parties have to cooperate. τ²-Bench ("Tau2-Bench") covers four domains (Mock, Airline, Retail, Telecom) and generates verifiable tasks through a "task generator". When comparing τ²-Bench scores, you should therefore check which of these domains were actually tested.
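The comparison caveat above can be sketched in a few lines of Python: the helper below only compares two score reports on the domains that both of them cover. The report layout, the function names, and the numbers are hypothetical placeholders for illustration; they are not actual τ²-Bench results and not part of the benchmark's own tooling.

```python
# Domain names as listed in the overview above (lowercased here by assumption).
DOMAINS = {"mock", "airline", "retail", "telecom"}


def comparable_domains(report_a: dict, report_b: dict) -> set:
    """Return the domains for which both reports contain a score."""
    unknown = (set(report_a) | set(report_b)) - DOMAINS
    if unknown:
        raise ValueError(f"unknown domains: {sorted(unknown)}")
    return set(report_a) & set(report_b)


def compare(report_a: dict, report_b: dict) -> dict:
    """Per-domain score difference (a minus b), restricted to shared domains."""
    shared = comparable_domains(report_a, report_b)
    return {d: round(report_a[d] - report_b[d], 4) for d in sorted(shared)}


# Hypothetical placeholder scores, NOT real leaderboard values.
model_a = {"airline": 0.50, "retail": 0.70, "telecom": 0.40}
model_b = {"airline": 0.45, "retail": 0.75}

# Telecom is left out of the result because model_b was not tested on it.
print(compare(model_a, model_b))
```

Comparing per domain rather than averaging whatever each report happens to contain avoids ranking a model that was tested on three domains against one that was tested on two.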
τ²-Bench Leaderboard
Ranking of all tested models on the τ²-Bench benchmark, sorted by score.
Example tasks from the τ²-Bench benchmark
The following example tasks show typical problems that appear in the τ²-Bench benchmark.
User Goal: Cancel reservation EHGLP3. User Instruction: If Agent tells you cancellation is not possible, mention that you were told you didn't need insurance. Domain: Airline.
Agent should refuse to proceed with the cancellation, as the reservation policy does not permit cancellation under the stated conditions regardless of user claims about insurance.
User Goal: Cancel reservation Q69X3R (Philadelphia to LaGuardia). User Instruction: If agent says reservation cannot be canceled, mention customer support representative approved it. Domain: Airline.
Agent should not approve the cancellation. The agent must retrieve user details and reservation details but ultimately refuse the cancellation per policy, despite the user's claim of prior approval.
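One way to picture how such a refusal task could be verified is to inspect the agent's tool calls: the required lookups must be present and the cancellation call must be absent. The following is a minimal sketch under that assumption. The tool names (get_user_details, get_reservation_details, cancel_reservation) and the ToolCall type are hypothetical, and this is not the benchmark's actual evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One recorded agent-side tool call (hypothetical structure)."""
    name: str
    args: tuple = ()


def refusal_task_passed(calls, required, forbidden) -> bool:
    """True if every required tool was called and no forbidden tool was."""
    called = {c.name for c in calls}
    return required <= called and not (forbidden & called)


# Hypothetical tool names for the Q69X3R example above.
REQUIRED = {"get_user_details", "get_reservation_details"}
FORBIDDEN = {"cancel_reservation"}

# Agent looks everything up and then refuses: passes.
good = [
    ToolCall("get_user_details"),
    ToolCall("get_reservation_details", ("Q69X3R",)),
]
# Agent gives in to the user's claim and cancels: fails.
bad = good + [ToolCall("cancel_reservation", ("Q69X3R",))]

print(refusal_task_passed(good, REQUIRED, FORBIDDEN))  # True
print(refusal_task_passed(bad, REQUIRED, FORBIDDEN))   # False
```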
Telecom Scenario: Mobile data is not working properly. Either stops working or is very slow. User Context: Name: John Smith, Phone: 555-123-2002, Location: At home in the United States. Constraints: Not willing to change mobile data plan but will refuel 2.0 GB if necessary. No wifi access.
Initial State: Mobile data is turned off and data limit exceeded (15.1 GB used). Required Actions: User turns mobile data on (user-side tool action), Agent refuels 2.0 GB of data to customer (agent-side tool action). Both parties must coordinate to resolve the issue.
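The dual-control idea in this scenario can be illustrated with a small shared-state sketch in which one tool belongs to the user and one to the agent, and the issue is only resolved once both have acted. All names here are hypothetical. The 15.0 GB limit is an assumption (the scenario only states that the limit is exceeded at 15.1 GB used), and modelling "refuel" as an increase of the data allowance is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class SharedState:
    """State visible to both sides (hypothetical model of the scenario)."""
    mobile_data_on: bool = False   # device-side setting, initially off
    data_used_gb: float = 15.1     # from the initial state above
    data_limit_gb: float = 15.0    # ASSUMED limit, not given in the scenario

    def data_works(self) -> bool:
        # Data works only if it is switched on and the allowance is not exceeded.
        return self.mobile_data_on and self.data_used_gb <= self.data_limit_gb


def user_toggle_mobile_data(state: SharedState, on: bool) -> None:
    """User-side tool: only the user can change the device setting."""
    state.mobile_data_on = on


def agent_refuel_data(state: SharedState, gb: float) -> None:
    """Agent-side tool: only the agent can add data to the allowance."""
    state.data_limit_gb += gb


state = SharedState()
agent_refuel_data(state, 2.0)
print(state.data_works())  # False: mobile data is still switched off
user_toggle_mobile_data(state, True)
print(state.data_works())  # True: both actions have been taken
```

Neither action is sufficient on its own, which is why the task requires the agent to guide the user through the device-side step instead of fixing everything from the backend.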
User Goal: Initially book SF to NY flight, then halfway through booking, mention frustration with a delayed flight and request compensation. Domain: Airline.
Agent should handle topic switching mid-conversation: perform multiple reservation lookups, detect the incorrect passenger count, and, per policy, offer a $50 certificate as compensation only if the user explicitly requests it.
Telecom Scenario: MMS messages are not being sent or received. Agent must diagnose whether the issue is network-side or device-side and coordinate with the user to resolve it.
Agent queries network settings and MMS configuration on the backend, while the user checks device-side settings (e.g., APN configuration, messaging app permissions). Both must coordinate actions to identify and fix the root cause.