OSWorld-Verified
OSWorld-Verified: Overview
OSWorld-Verified is a benchmark for evaluating multimodal AI agents. It comprises 369 tasks involving real web and desktop applications such as Chrome, LibreOffice, GIMP, VS Code, and Thunderbird, which are scored automatically by evaluation scripts. Human participants achieve a success rate of 72.36% on OSWorld-Verified, while the best AI agent systems can now also score above 70%.
OSWorld-Verified Leaderboard
Ranking of all tested models on the OSWorld-Verified benchmark, sorted by score.
Example tasks from the OSWorld-Verified benchmark
The following example tasks illustrate the kinds of tasks found in the OSWorld-Verified benchmark.
Please help me change all the places in this document that say "text" to "test".
Use VS Code's find-and-replace to change all instances of 'text' to 'test' in the file and save. Evaluated by comparing the modified file against a gold standard version using compare_text_file.
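A gold-standard file comparison of this kind can be sketched as follows. This is a minimal illustration only, not the benchmark's actual compare_text_file implementation: the function name texts_match and the ignore_trailing_whitespace option are hypothetical, and the real checker may normalize files differently.

```python
from pathlib import Path


def texts_match(result_path, gold_path, ignore_trailing_whitespace=True):
    """Return True if two text files have the same content.

    Hypothetical helper: the whitespace handling is an assumption,
    not something the benchmark description specifies.
    """
    result = Path(result_path).read_text(encoding="utf-8")
    gold = Path(gold_path).read_text(encoding="utf-8")
    if ignore_trailing_whitespace:
        # Compare line by line with trailing spaces and final newlines removed.
        result = "\n".join(line.rstrip() for line in result.splitlines())
        gold = "\n".join(line.rstrip() for line in gold.splitlines())
    return result == gold
```

A strict byte-for-byte check would pass ignore_trailing_whitespace=False; the lenient default avoids failing an otherwise correct edit over a stray trailing space.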
Could you fill the background layer with green color, leaving the object layer as is?
In GIMP, select the background layer and fill it with green color while keeping the object layer unchanged. Evaluated by perceptual image hash comparison against reference output.
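The idea behind a perceptual hash comparison can be illustrated with a small average-hash sketch. The source does not say which hash algorithm or distance threshold the evaluator uses, so both are assumptions here, as are the function names; the input is assumed to be an image already downscaled to a small grayscale grid (values 0 to 255).

```python
def average_hash(pixels):
    """Compute an average hash from a 2D grid of grayscale values.

    Each bit is 1 if the pixel is brighter than the grid's mean, else 0.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def images_similar(pixels_a, pixels_b, max_distance=2):
    """Treat two grids as matching if their hashes differ in few bits.

    max_distance is an illustrative threshold, not the benchmark's value.
    """
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance
```

Because the hash depends only on which pixels are brighter than the mean, small differences in exact pixel values leave it unchanged, which is why this style of check tolerates minor rendering variation better than a byte-level comparison.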
Fill all the blank cells in B1:E30 with the value in the cell above it.
In LibreOffice Calc, select range B1:E30 and fill each blank cell with the value from the cell directly above it. Evaluated by comparing the resulting spreadsheet against a gold standard file.
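The expected result of this fill-down operation can be expressed in a few lines. This is a sketch of the target transformation on an in-memory grid, not how the agent performs it in Calc or how the evaluator checks it; the function name fill_down and the choice of None and the empty string as blank markers are assumptions, and all rows are assumed to have the same length.

```python
def fill_down(rows, blank=(None, "")):
    """Return a copy of a rectangular grid with blanks filled from above.

    Blank cells in the first row stay blank because nothing is above them.
    """
    filled = [list(row) for row in rows]
    for r in range(1, len(filled)):
        for c in range(len(filled[r])):
            if filled[r][c] in blank:
                # The row above is already filled, so runs of consecutive
                # blanks inherit the nearest non-blank value.
                filled[r][c] = filled[r - 1][c]
    return filled
```

Processing rows from top to bottom is what makes a run of several blank cells pick up the same value, matching the task's "value in the cell above it" wording.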
Please help me set the current user's line length for code wrapping to 50 characters in VS Code.
Set editor.wordWrapColumn to 50 in VS Code user settings at /home/user/.config/Code/User/settings.json. Evaluated by JSON settings validation checking the exact parameter value.
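A settings check along these lines might look like the sketch below. The helper name setting_equals is hypothetical and the benchmark's real validator may differ. One known limitation: VS Code accepts comments in settings.json, which Python's json module rejects, so this sketch assumes a comment-free file.

```python
import json


def setting_equals(settings_text, key, expected):
    """Return True if the JSON settings text maps key to the expected value.

    Assumes plain JSON without comments or trailing commas.
    """
    settings = json.loads(settings_text)
    return key in settings and settings[key] == expected
```

For the task above, the text would be read from /home/user/.config/Code/User/settings.json and checked with setting_equals(text, "editor.wordWrapColumn", 50).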
On the current website, show me the cars available for pickup at Boston Logan Intl Airport from the 10th to the 11th of next month, sorted by the number of seats to find the largest capacity.
Navigate the car rental website in Chrome to search for vehicles at Boston Logan Intl Airport with the specified dates, then sort results by seat capacity. Evaluated by checking the final browser state and displayed results.