Leaderboard

No reasoning

This leaderboard covers the no-reasoning setting. Models are ranked by the performance of their generated game-playing programs after compilation and execution across the games in this benchmark.

Model leaderboard

One row per model; Min–Max is the score range across that model's evaluated entries at this reasoning level. Admitted entrants without match history remain in the table with a zero score until their first evaluation.
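The aggregation described above is simple to reproduce. Below is a minimal sketch in Python, assuming hypothetical inputs: `results` as (model, score) pairs on a 0–100 scale, one per evaluated entry, and `admitted_models` as the full list of entrants. The function name and record shapes are illustrative, not the benchmark's actual API; only the averaging, Min–Max range, and zero-score default follow the rules stated above.

```python
from collections import defaultdict

def build_leaderboard(results, admitted_models):
    """Aggregate per-entry scores into one leaderboard row per model.

    results: iterable of (model_name, score) pairs, one per evaluated
             entry, with scores on a 0-100 scale.
    admitted_models: all admitted entrants; those with no match history
             yet are kept in the table with a zero score.
    Returns rows of (model, avg, min, max, entries), best average first.
    """
    scores = defaultdict(list)
    for model, score in results:
        scores[model].append(score)

    rows = []
    for model in admitted_models:
        entry_scores = scores.get(model)
        if entry_scores:
            avg = sum(entry_scores) / len(entry_scores)
            rows.append((model, avg, min(entry_scores),
                         max(entry_scores), len(entry_scores)))
        else:
            # Admitted but not yet evaluated: zero score until first match.
            rows.append((model, 0.0, 0.0, 0.0, 0))

    # Rank by average score, highest first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

Fed the 12 evaluated entries for GPT-5.4 in the table below, for example, this would reproduce its 79.9 average and 41.7 – 100.0 range.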

Reasoning level: None · Games: 8 · Build: Preview
No reasoning leaderboard for DuelLab Benchmark
Model                           Avg score   Min–Max         Entries
GPT-5.4                         79.9        41.7 – 100.0    12
Claude Sonnet 4.6               76.5        47.0 – 92.8     15
Gemini 3.1 Pro Preview          75.8        62.5 – 84.8     4
Claude Opus 4.6                 68.9        0.0 – 100.0     19
MiMo-V2-Pro                     66.8        0.9 – 97.7      18
GLM-5                           66.1        28.1 – 88.9     16
GPT-5.3 Codex                   65.4        23.1 – 100.0    23
GPT-5.4 Nano                    64.6        13.6 – 94.0     8
Kimi K2.5                       63.3        19.1 – 89.8     15
GPT-5.2                         61.8        27.9 – 93.4     18
MiMo-V2-Omni                    59.8        26.0 – 93.3     11
Gemini 3 Flash Preview          59.0        0.0 – 96.2      12
Nemotron 3 Super                56.4        41.2 – 73.6     11
DeepSeek V3.2                   53.8        6.3 – 100.0     15
GPT-5 Mini                      52.1        30.4 – 100.0    19
Gemini 2.5 Flash                51.5        0.0 – 100.0     7
Seed 2.0 Mini                   47.7        19.7 – 87.8     5
Gemini 3.1 Flash Lite Preview   42.5        17.7 – 87.8     10
GPT-5.2 Codex                   41.6        23.7 – 54.6     5
Qwen3 Max Thinking              40.6        8.0 – 82.3      10
GPT-5 Nano                      38.0        0.0 – 92.2      22
GPT-5.4 Mini                    35.7        0.0 – 77.1      9
Mistral Small 2603              34.1        0.0 – 83.5      7
Step 3.5 Flash                  33.4        4.8 – 51.7      7
Qwen3.5 122B A10B               32.6        0.0 – 77.3      10
Trinity Large Preview           32.5        2.1 – 85.7      15
Minimax M2.5                    25.8        6.9 – 46.0      4