Leaderboard

Medium reasoning

This leaderboard covers the medium reasoning level. Models are ranked by how well their generated game-playing programs perform after compilation and execution across the games in this benchmark.

Model leaderboard

One row per model; Min–Max is the score range across that model's evaluated rows at this reasoning level. Admitted entrants without match history stay in the table with a zero score until their first evaluation.
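The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual pipeline; the function name `build_leaderboard` and the input shapes are assumptions for the example.

```python
from collections import defaultdict

def build_leaderboard(rows, entrants):
    """Aggregate per-model stats from evaluated rows.

    rows: list of (model_name, score) pairs, one per evaluated row.
    entrants: all admitted model names; those without rows score 0.0.
    Returns rows of (model, avg, min, max, entries), ranked by avg.
    """
    scores = defaultdict(list)
    for model, score in rows:
        scores[model].append(score)

    table = []
    for model in entrants:
        s = scores.get(model, [])
        if s:
            table.append((model, sum(s) / len(s), min(s), max(s), len(s)))
        else:
            # Admitted but not yet evaluated: zero score, empty range.
            table.append((model, 0.0, 0.0, 0.0, 0))
    # Rank by average score, descending.
    table.sort(key=lambda r: r[1], reverse=True)
    return table
```

For example, a model with two evaluated rows gets their mean as its average and their spread as its Min–Max, while an admitted model with no rows appears last with a zero score.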

Reasoning level: Medium | Games: 8 | Build: Preview
Medium reasoning leaderboard for DuelLab Benchmark
| Model | Avg score | Min–Max | Entries |
| --- | ---: | :---: | ---: |
| GPT-5.2 | 76.6 | 62.5 – 91.0 | 8 |
| GPT-5.4 Mini | 74.6 | 43.0 – 100.0 | 5 |
| Qwen3.5 122B A10B | 73.5 | 61.7 – 85.3 | 2 |
| GLM-5 | 71.4 | 26.8 – 94.6 | 7 |
| Kimi K2.5 | 71.3 | 55.2 – 85.8 | 7 |
| GPT-5.4 | 71.0 | 37.1 – 100.0 | 7 |
| DeepSeek V3.2 | 69.1 | 38.3 – 89.4 | 5 |
| GPT-5.3 Codex | 67.9 | 43.8 – 93.3 | 8 |
| Step 3.5 Flash | 67.4 | 36.0 – 98.8 | 2 |
| Claude Sonnet 4.6 | 61.8 | 36.1 – 78.7 | 6 |
| Claude Opus 4.6 | 61.1 | 3.0 – 81.7 | 7 |
| GPT-5.2 Codex | 59.9 | 42.6 – 89.0 | 3 |
| GPT-5.4 Nano | 55.7 | 28.9 – 91.1 | 8 |
| MiMo-V2-Omni | 55.1 | 5.3 – 100.0 | 7 |
| GPT-5 Mini | 54.2 | 10.7 – 94.9 | 8 |
| Gemini 3.1 Pro Preview | 53.8 | 1.7 – 100.0 | 7 |
| Minimax M2.5 | 50.3 | 31.5 – 80.1 | 7 |
| Gemini 2.5 Flash | 47.3 | 2.3 – 100.0 | 8 |
| Gemini 3 Flash Preview | 47.1 | 6.7 – 100.0 | 7 |
| Trinity Large Preview | 46.8 | 0.2 – 93.4 | 2 |
| MiMo-V2-Pro | 46.7 | 2.5 – 100.0 | 15 |
| Minimax M2.7 | 44.6 | 5.3 – 95.7 | 8 |
| Qwen3 Max Thinking | 41.9 | 2.4 – 81.4 | 2 |
| Gemini 3.1 Flash Lite Preview | 40.7 | 0.0 – 96.5 | 7 |
| Nemotron 3 Super | 36.6 | 0.0 – 65.6 | 7 |
| Mistral Small 2603 | 36.4 | 0.0 – 92.2 | 7 |
| GPT-5 Nano | 29.3 | 0.0 – 68.9 | 8 |
| Seed 2.0 Mini | 6.5 | 0.0 – 10.3 | 3 |