MCP-Atlas
MCP-Atlas: Overview
MCP-Atlas is a large-scale benchmark from Scale AI for evaluating the tool-use competence of language models with real MCP (Model Context Protocol) servers. The benchmark comprises 1,000 human-authored tasks that cover 36 real MCP servers and 220 tools and require realistic, multi-step workflows. The tasks use natural-language prompts that do not name specific tools or servers, so agents must independently identify and orchestrate 3 to 6 tool calls across multiple servers. Scoring uses a claims-based rubric that awards partial credit according to how many of the expected factual claims the model's response satisfies.
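The partial-credit idea behind a claims-based rubric can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Claim` structure, the `claims_score` function, the equal weighting of claims, and above all the `naive_judge` substring matcher, which only stands in for whatever judging procedure the benchmark actually uses. The example claims are taken from the AssaultCube sample task further down the page.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Claim:
    """One factual claim the reference answer expects (hypothetical structure)."""
    text: str


def naive_judge(claim: Claim, response: str) -> bool:
    # Stand-in judge (assumption): case-insensitive substring match.
    # A real evaluation would need a far more robust check.
    return claim.text.lower() in response.lower()


def claims_score(
    claims: List[Claim],
    response: str,
    judge: Callable[[Claim, str], bool] = naive_judge,
) -> float:
    """Return the fraction of claims the response satisfies (equal weights assumed)."""
    if not claims:
        raise ValueError("a task needs at least one claim")
    satisfied = sum(1 for claim in claims if judge(claim, response))
    return satisfied / len(claims)


# Claims derived from the AssaultCube example task on this page.
claims = [Claim("2013"), Claim("2006"), Claim("7 years")]
response = "The repository was created in 2013 and the domain was registered in 2006."
print(claims_score(claims, response))
```

In this sketch the response states both years but omits the difference, so it earns credit for two of the three claims rather than failing outright. Passing a different `judge` callable swaps the matching strategy without changing the scoring logic.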
MCP-Atlas Leaderboard
Ranking of all tested models on the MCP-Atlas benchmark, sorted by score.
Example tasks from the MCP-Atlas benchmark
The following example tasks show typical questions that appear in the MCP-Atlas benchmark. Each prompt is followed by the key claims of its expected answer.
I'm researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There's a 2024 paper by jane castleman that deals with ad control effectiveness, can you get me the abstract? I believe it mentions ad locality, for which I will also need to ask you for the date of our campaign with the biggest engagement rate, started during the 2015-2023 period, and its locality.
Paper title references 'Ad Controls and Explanations'; abstract begins with 'Recently, Meta has shifted...'; campaigns tied at 15% engagement rate; start dates: 2022-06-24, 2019-09-20, 2017-09-09; localities: National, International, International
I've been looking into contributions in open source videogames and I'm curious about the longevity of this one open source project. Can you get me the year difference between the creation of the repository of the shooter game assaultcube and the domain registration date of the game's official site
The AssaultCube GitHub repository was created in 2013, the domain registration year is 2006, year difference: 7 years
What is the first word of the file at /data/Barber Shop.csv?
Customer
For a report I have to prepare on oncology with machine learning, help me identify the last related study published on november 2024 containing the word cancer in its title and retrieve the abstract of the article.
Study: 'xCG: Explainable Cell Graphs for Survival Prediction in Non-Small Cell Lung Cancer'; type of cancer: Non-Small Cell Lung Cancer
In the same year and month that SpaceX launched the first car into space, which was the leading product type in commercial performance data?
SpaceX launched the first car into space in February 2018; leading product type: Electric vehicles; best-selling: Tesla Model 3