Leaderboard

Highest reasoning

This leaderboard covers runs at the highest reasoning level. Models are ranked by the performance of their generated game-playing programs after compilation and execution across the games in this benchmark.

Model leaderboard

One row per model; Min–Max is the score range across that model's evaluated entries at this reasoning level. Admitted entrants without match history remain in the table with a zero score until their first evaluation.
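The aggregation described above can be sketched in a few lines. This is a hypothetical illustration, not the benchmark's actual code: the function name `leaderboard_row` and the assumption that each model's results arrive as a plain list of per-entry scores are mine.

```python
def leaderboard_row(model: str, scores: list[float]) -> dict:
    """Build one leaderboard row from a model's per-entry scores.

    An admitted entrant with no match history yet gets a zero score,
    matching the rule described in the text.
    """
    if not scores:
        return {"model": model, "avg": 0.0, "min": 0.0, "max": 0.0, "entries": 0}
    return {
        "model": model,
        "avg": round(sum(scores) / len(scores), 1),
        "min": min(scores),   # low end of the Min–Max column
        "max": max(scores),   # high end of the Min–Max column
        "entries": len(scores),
    }
```

For example, a model with two evaluated entries scoring 40.0 and 60.0 would show an average of 50.0 and a Min–Max range of 40.0 – 60.0.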

Reasoning level: Highest · Games: 8 · Build: Preview
Highest reasoning leaderboard for DuelLab Benchmark
| Model | Avg score | Min–Max | Entries |
|---|---|---|---|
| Gemini 3.1 Pro Preview | 76.3 | 45.3 – 100.0 | 7 |
| GPT-5.4 | 76.2 | 36.9 – 100.0 | 16 |
| Claude Opus 4.6 | 66.4 | 39.5 – 89.4 | 14 |
| GPT-5.2 | 66.2 | 57.7 – 74.6 | 2 |
| GPT-5.3 Codex | 62.9 | 23.4 – 81.8 | 8 |
| GPT-5.4 Mini | 61.9 | 42.1 – 95.0 | 3 |
| GPT-5.4 Nano | 60.5 | 18.7 – 99.1 | 13 |
| Claude Sonnet 4.6 | 55.2 | 14.7 – 100.0 | 6 |
| Minimax M2.7 | 50.3 | 11.3 – 70.2 | 9 |
| Qwen3 Max Thinking | 49.0 | 0.0 – 98.0 | 2 |
| GLM-5 | 48.3 | 10.6 – 83.0 | 7 |
| Step 3.5 Flash | 46.9 | 24.5 – 66.8 | 3 |
| Kimi K2.5 | 46.4 | 18.5 – 97.4 | 7 |
| DeepSeek V3.2 | 44.3 | 19.6 – 70.6 | 7 |
| GPT-5 Mini | 42.8 | 15.2 – 93.7 | 8 |
| MiMo-V2-Pro | 39.0 | 0.0 – 83.4 | 15 |
| Gemini 2.5 Flash | 38.9 | 0.0 – 77.9 | 8 |
| Minimax M2.5 | 38.5 | 8.2 – 90.1 | 7 |
| Mistral Small 2603 | 36.7 | 0.0 – 86.3 | 8 |
| Nemotron 3 Super | 36.3 | 0.0 – 84.4 | 6 |
| Gemini 3 Flash Preview | 34.7 | 13.3 – 81.6 | 7 |
| Gemini 3.1 Flash Lite Preview | 30.3 | 6.6 – 61.6 | 7 |
| GPT-5 Nano | 30.1 | 10.1 – 68.2 | 8 |
| Qwen3.5 122B A10B | 29.6 | 7.3 – 51.9 | 2 |
| MiMo-V2-Omni | 24.8 | 12.4 – 44.7 | 7 |
| Trinity Large Preview | 0.0 | 0.0 | 2 |