SWE-bench Multilingual: Overview

SWE-bench Multilingual is a benchmark of 300 curated software engineering tasks drawn from real GitHub pull requests, spread across 42 repositories and 9 programming languages (C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, Rust). It extends the originally Python-focused SWE-bench family with a multilingual evaluation of how well AI agents can resolve real software issues automatically. Each task consists of an issue description and a repository snapshot; the system must generate a patch that passes both the fail-to-pass tests (which confirm the issue is fixed) and the pass-to-pass tests (which guard against regressions).
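The pass criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the `TaskResult` structure and the `is_resolved` and `resolve_rate` helpers are hypothetical names introduced here. The sketch also assumes that a task with no fail-to-pass tests does not count as resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes after applying a candidate patch (hypothetical structure)."""
    fail_to_pass: dict[str, bool]  # test name -> passes after the patch?
    pass_to_pass: dict[str, bool]  # test name -> still passes after the patch?


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved only if every test in both groups passes.

    Assumption: at least one fail-to-pass test must exist, otherwise
    there is no evidence that the issue was actually fixed.
    """
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of resolved tasks; 0.0 for an empty list."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

Under this sketch, a patch that fixes the issue but breaks a previously passing test is not counted as resolved, which is the point of keeping the two test groups separate.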

SWE-bench Multilingual Leaderboard

Ranking of all tested models on the SWE-bench Multilingual benchmark, sorted by score.

#	Model	Score
1	Claude Mythos Preview	87.3%
2	Claude Opus 4.8	84.4%
3	Claude Sonnet 5	78.3%
4	Grok 4.5	78%
5	Kimi K2.6	76.7%
6	MiniMax M2.7	76.5%
7	DeepSeek-V4-Pro (Preview)	76.2%

Example Tasks from the SWE-bench Multilingual Benchmark

The following example tasks illustrate the kinds of problems that appear in the SWE-bench Multilingual benchmark.

apache__druid-13704 (Java): Support Post aggregation function pow(f1,f2) to cater for square, cube, square root. As of now the only supported arithmetic functions are +, -, *, /, and quotient. The request is to add a pow function which enables square (pow(f1,2)), cube (pow(f1,3)), and square root (pow(f1,0.5)) operations.

A patch adding pow() support to ArithmeticPostAggregator, verified by org.apache.druid.query.aggregation.post.ArithmeticPostAggregatorTest#testPow

tokio-rs__tokio-6603 (Rust): Every 32 messages is_empty() on Receiver returns false despite len == 0. Issue occurs at BLOCK_CAP boundary. Appears to affect both bounded and unbounded channels.

A patch fixing the is_empty() check at BLOCK_CAP boundaries in Tokio's channel implementation, verified by test_is_empty_32_msgs

laravel__framework-52684 (PHP): Str::trim changed behavior in Laravel 11. Previously working code now produces different results when trimming strings with null bytes.

A patch restoring correct trim behavior for strings with null bytes, verified by Support Str > Trim, Support Str > Ltrim, Support Str > Rtrim

rubocop__rubocop-13503 (Ruby): Wrong autocorrect for Style/DigChain. Expected: hoge&.dig(:a, :b). Actual after autocorrect: hoge & dig(:a, :b). Issue occurs with safe navigation method chains using dig.

A patch correcting the autocorrect logic for Style/DigChain with safe navigation operators, verified by 'registers an offense and corrects with safe navigation method chain'

preactjs__preact-4152 (JavaScript): <div>{ new String('hi') }</div> renders blank. Rendering a constructed String object doesn't display text. This works in React. Normally people use string literals, but the constructed String case should also work.

A patch ensuring Preact correctly renders String objects (not just string primitives), verified by 'Components > should render a new String()'

Similar Benchmarks

Aider Polyglot

APEX-Agents

ARC-AGI-3

Arena.ai

Artificial Analysis Intelligence Index

BrowseComp