Leaderboard

Medium reasoning

This leaderboard covers the medium reasoning level. Models are ranked by how well their generated game-playing programs perform after compilation and execution across the games in this benchmark.

Model leaderboard

One row per model; Min–Max is the score range across that model's evaluated rows at this reasoning level. Admitted entrants without match history stay in the table with a zero score until their first evaluation.
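The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual pipeline; the function name `build_leaderboard` and the input shapes are assumptions for the example.

```python
from collections import defaultdict

def build_leaderboard(rows, entrants):
    """Aggregate per-model stats from evaluated rows.

    rows: list of (model_name, score) pairs, one per evaluated row.
    entrants: all admitted model names; those without rows score 0.0.
    Returns rows of (model, avg, min, max, entries), ranked by avg.
    """
    scores = defaultdict(list)
    for model, score in rows:
        scores[model].append(score)

    table = []
    for model in entrants:
        s = scores.get(model, [])
        if s:
            table.append((model, sum(s) / len(s), min(s), max(s), len(s)))
        else:
            # Admitted but not yet evaluated: zero score, empty range.
            table.append((model, 0.0, 0.0, 0.0, 0))
    # Rank by average score, descending.
    table.sort(key=lambda r: r[1], reverse=True)
    return table
```

For example, a model with two evaluated rows gets their mean as its average and their spread as its Min–Max, while an admitted model with no rows appears last with a zero score.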

Reasoning level: Medium | Games: 8 | Build: Preview
Medium reasoning leaderboard for DuelLab Benchmark
| Model | Avg score | Min–Max | Entries |
| --- | ---: | :---: | ---: |
| GPT-5.2 | 76.6 | 62.5 – 91.0 | 8 |
| GPT-5.4 Mini | 74.6 | 43.0 – 100.0 | 5 |
| Qwen3.5 122B A10B | 73.5 | 61.7 – 85.3 | 2 |
| GLM-5 | 71.4 | 26.8 – 94.6 | 7 |
| Kimi K2.5 | 71.3 | 55.2 – 85.8 | 7 |
| GPT-5.4 | 71.0 | 37.1 – 100.0 | 7 |
| DeepSeek V3.2 | 69.1 | 38.3 – 89.4 | 5 |
| GPT-5.3 Codex | 67.9 | 43.8 – 93.3 | 8 |
| Step 3.5 Flash | 67.4 | 36.0 – 98.8 | 2 |
| Claude Sonnet 4.6 | 61.8 | 36.1 – 78.7 | 6 |
| Claude Opus 4.6 | 61.1 | 3.0 – 81.7 | 7 |
| GPT-5.2 Codex | 59.9 | 42.6 – 89.0 | 3 |
| GPT-5.4 Nano | 55.7 | 28.9 – 91.1 | 8 |
| MiMo-V2-Omni | 55.1 | 5.3 – 100.0 | 7 |
| GPT-5 Mini | 54.2 | 10.7 – 94.9 | 8 |
| Gemini 3.1 Pro Preview | 53.8 | 1.7 – 100.0 | 7 |
| Minimax M2.5 | 50.3 | 31.5 – 80.1 | 7 |
| Gemini 2.5 Flash | 47.3 | 2.3 – 100.0 | 8 |
| Gemini 3 Flash Preview | 47.1 | 6.7 – 100.0 | 7 |
| Trinity Large Preview | 46.8 | 0.2 – 93.4 | 2 |
| MiMo-V2-Pro | 46.7 | 2.5 – 100.0 | 15 |
| Minimax M2.7 | 44.6 | 5.3 – 95.7 | 8 |
| Qwen3 Max Thinking | 41.9 | 2.4 – 81.4 | 2 |
| Gemini 3.1 Flash Lite Preview | 40.7 | 0.0 – 96.5 | 7 |
| Nemotron 3 Super | 36.6 | 0.0 – 65.6 | 7 |
| Mistral Small 2603 | 36.4 | 0.0 – 92.2 | 7 |
| GPT-5 Nano | 29.3 | 0.0 – 68.9 | 8 |
| Seed 2.0 Mini | 6.5 | 0.0 – 10.3 | 3 |