SWE-bench Multilingual
SWE-bench Multilingual: Overview
SWE-bench Multilingual is a benchmark of 300 curated software engineering tasks drawn from real GitHub pull requests, spread across 42 repositories and 9 programming languages (C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, Rust). It extends the originally Python-focused SWE-bench family with a multilingual evaluation of how well AI agents can resolve real software issues automatically. Each task consists of an issue description and a repository snapshot. The system must generate a patch that passes both the Fail-to-Pass tests (which confirm the issue is fixed) and the Pass-to-Pass tests (which guard against regressions).
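The pass criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the function names `is_resolved` and `resolve_rate`, the `"PASSED"` status string, and the shape of the `results` dictionary are all assumptions made for the example, as is the idea that the leaderboard score is the percentage of resolved tasks.

```python
from typing import Dict, Iterable, List

# Hypothetical status label; a real harness may use different values.
PASSED = "PASSED"


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Dict[str, str],
) -> bool:
    """Return True only if every required test passed after the patch.

    fail_to_pass: tests that fail before the fix and must pass afterwards.
    pass_to_pass: tests that already passed and must keep passing.
    results:      test name -> status observed after applying the patch.
    """
    required: List[str] = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results counts as not passed.
    return all(results.get(test) == PASSED for test in required)


def resolve_rate(outcomes: Iterable[bool]) -> float:
    """Percentage of tasks resolved (assumed scoring rule)."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)
```

Under these assumptions, a patch that fixes the issue but breaks one previously passing test is not counted as resolved, which is why the two test sets are checked together.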
SWE-bench Multilingual Leaderboard
Ranking of all models tested on the SWE-bench Multilingual benchmark, sorted by score.
Example tasks from the SWE-bench Multilingual benchmark
The following example tasks show typical problems that appear in the SWE-bench Multilingual benchmark.
apache__druid-13704 (Java): Support Post aggregation function pow(f1,f2) to cater for square, cube, square root. As of now the only supported arithmetic functions are +, -, *, /, and quotient. The request is to add a pow function which enables square (pow(f1,2)), cube (pow(f1,3)), and square root (pow(f1,0.5)) operations.
A patch adding pow() support to ArithmeticPostAggregator, verified by org.apache.druid.query.aggregation.post.ArithmeticPostAggregatorTest#testPow
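The three operations named in the request reduce to a single two-argument power function. The sketch below shows that arithmetic in Python purely for illustration; `pow_post_agg` is a hypothetical helper and does not reflect how Druid's ArithmeticPostAggregator is implemented.

```python
import math


def pow_post_agg(f1: float, f2: float) -> float:
    """Raise the field value f1 to the exponent f2 (hypothetical helper)."""
    return math.pow(f1, f2)


square = pow_post_agg(3.0, 2)         # pow(f1, 2)
cube = pow_post_agg(3.0, 3)           # pow(f1, 3)
square_root = pow_post_agg(9.0, 0.5)  # pow(f1, 0.5)
```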
tokio-rs__tokio-6603 (Rust): Every 32 messages, is_empty() on Receiver returns false despite len == 0. The issue occurs at the BLOCK_CAP boundary and appears to affect both bounded and unbounded channels.
A patch fixing the is_empty() check at BLOCK_CAP boundaries in Tokio's channel implementation, verified by test_is_empty_32_msgs
laravel__framework-52684 (PHP): Str::trim changed behavior in Laravel 11. Previously working code now produces different results when trimming strings with null bytes.
A patch restoring correct trim behavior for strings with null bytes, verified by Support Str > Trim, Support Str > Ltrim, Support Str > Rtrim
rubocop__rubocop-13503 (Ruby): Wrong autocorrect for Style/DigChain. Expected: hoge&.dig(:a, :b). Actual after autocorrect: hoge & dig(:a, :b). Issue occurs with safe navigation method chains using dig.
A patch correcting the autocorrect logic for Style/DigChain with safe navigation operators, verified by 'registers an offense and corrects with safe navigation method chain'
preactjs__preact-4152 (JavaScript): <div>{ new String('hi') }</div> renders blank. Rendering a constructed String object doesn't display text. This works in React. Normally people use string literals, but the constructed String case should also work.
A patch ensuring Preact correctly renders String objects (not just string primitives), verified by 'Components > should render a new String()'