Not synthetic tests. Not isolated evals. The first leaderboard built from actual developer sessions — measuring how well AI models collaborate with real engineers under real pressure.
Powered by AlgoArena OA sessions. Every assessment contributes anonymized data.
Sample data shown for illustration. Actual rankings will be computed from anonymized metrics across live OA sessions.
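For the technically curious, here is a minimal sketch of how a leaderboard entry could be aggregated from anonymized session records. The record shape, field names, and scoring weights are illustrative assumptions, not the production AlgoArena schema or ranking formula.

```typescript
// Illustrative only: hypothetical anonymized session record and aggregation.
// Field names and weights are assumptions, not the production schema.
interface SessionMetrics {
  model: string;            // anonymized model identifier
  passRate: number;         // share of AI suggestions passing all tests unedited, 0..1
  acceptanceRate: number;   // share of AI output accepted verbatim, 0..1
  qualityScore: number;     // auto-graded code quality, 0..100
  minutesToPass: number;    // wall-clock minutes until a passing solution
}

interface LeaderboardEntry {
  model: string;
  sessions: number;
  score: number;            // composite score, higher is better
}

function aggregate(sessions: SessionMetrics[]): LeaderboardEntry[] {
  // Group session records by model.
  const byModel = new Map<string, SessionMetrics[]>();
  for (const s of sessions) {
    const bucket = byModel.get(s.model) ?? [];
    bucket.push(s);
    byModel.set(s.model, bucket);
  }

  const entries: LeaderboardEntry[] = [];
  for (const [model, group] of byModel) {
    const mean = (f: (s: SessionMetrics) => number) =>
      group.reduce((sum, s) => sum + f(s), 0) / group.length;

    // Hypothetical composite: reward correctness and quality, penalize slow solves.
    const score =
      50 * mean(s => s.passRate) +
      20 * mean(s => s.acceptanceRate) +
      0.2 * mean(s => s.qualityScore) -
      0.5 * mean(s => s.minutesToPass);

    entries.push({ model, sessions: group.length, score });
  }

  // Highest composite score first.
  return entries.sort((a, b) => b.score - a.score);
}
```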
Most AI benchmarks test models in isolation. We measure how models perform when paired with real humans on real tasks.
HumanEval · SWE-bench
LMSYS · Elo Rankings
Real-World · Live Data
Every OA session generates rich signal about AI–human collaboration quality across four core metrics.
Percentage of AI suggestions that pass all test cases without human edits.
How often candidates accept AI output verbatim vs. editing or rejecting it.
Automated grading of the architecture, readability, and maintainability of AI-assisted code.
How quickly candidates reach a passing solution when using each model.
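To make these four signals concrete, here is a hedged sketch of how they might be derived from a single session's event log. The event shapes, field names, and helper below are hypothetical; the live pipeline may compute these metrics differently.

```typescript
// Hypothetical per-session event log; shapes are illustrative, not the real telemetry schema.
type SuggestionOutcome = "accepted" | "edited" | "rejected";

interface Suggestion {
  outcome: SuggestionOutcome;
  passedAllTests: boolean;       // raw suggestion run against the task's test suite, no human edits
  qualityScore: number;          // auto-grader score for the resulting code, 0..100
}

interface Session {
  startedAt: number;             // epoch ms
  firstPassingAt: number | null; // epoch ms of first fully passing solution, if reached
  suggestions: Suggestion[];
}

interface SessionSignals {
  passRate: number;              // AI pass rate without human edits
  acceptanceRate: number;        // share of suggestions accepted verbatim
  qualityScore: number;          // mean auto-graded code quality
  minutesToPass: number | null;  // time to a passing solution
}

function signalsFor(session: Session): SessionSignals {
  const n = session.suggestions.length || 1; // avoid division by zero on empty sessions
  const passed = session.suggestions.filter(s => s.passedAllTests).length;
  const accepted = session.suggestions.filter(s => s.outcome === "accepted").length;
  const quality =
    session.suggestions.reduce((sum, s) => sum + s.qualityScore, 0) / n;

  return {
    passRate: passed / n,
    acceptanceRate: accepted / n,
    qualityScore: quality,
    minutesToPass:
      session.firstPassingAt === null
        ? null
        : (session.firstPassingAt - session.startedAt) / 60_000, // ms per minute
  };
}
```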
We're partnering with model providers to ensure every leading AI is represented with sufficient sample size. Interested in a credits partnership?
Partner With Us