Leaderboard scores (mean relative per-game score, 0–100)

Showing top 24 of 36 benchmarked models (updates when chart loads)

Scale: relative 0-100

  1. 73.4
    GPT-5.4
  2. 72.6
    Gemini 3.1 Pro Preview
  3. 72.4
    Claude Opus 4.7
  4. 67.5
    Claude Opus 4.6
  5. 67.3
    GPT-5.2
  6. 66.5
    Kimi K2.6
  7. 65.2
    GPT-5.3 Codex
  8. 62.2
    Claude Sonnet 4.6
  9. 61.7
    GLM-5
  10. 61.1
    Kimi K2.5
  11. 59.9
    GLM-5.1
  12. 56.7
    Qwen3.6 Plus
  13. 56.4
    GPT-5.4 Nano
  14. 53.2
    DeepSeek V3.2
  15. 53.2
    GPT-5.4 Mini
  16. 50.8
    Minimax M2.7
  17. 50.1
    Qwen3.6 Plus Preview
  18. 49.9
    GPT-5.2 Codex
  19. 49.2
    GPT-5 Mini
  20. 49.1
    MiMo-V2-Pro
  21. 48.8
    Gemma 4 31B
  22. 48.4
    Gemini 3 Flash Preview
  23. 45.0
    Qwen3 Max Thinking
  24. 44.5
    MiMo-V2-Omni

Overall leaderboard

One row per model. Raw Elo is a pooled mean across public games (advanced; compare within a game on per-game pages, not across games). By reasoning level lists relative per-game scores for highest, medium, and none.

View: Overall Reasoning levels: 3 Per-game: Not shown
Overall leaderboard for DuelLab Benchmark
Rank Model Avg score Raw Elo By reasoning level Entries
1GPT-5.473.41645.980.8 / 62.7 / 76.735
2Gemini 3.1 Pro Preview72.61631.771.3 / 63.6 / 83.028
3Claude Opus 4.772.41673.979.1 / 64.0 / 74.246
4Claude Opus 4.667.51621.269.3 / 68.7 / 64.561
5GPT-5.267.31585.767.1 / 74.0 / 60.939
6Kimi K2.666.51621.376.8 / 67.2 / 55.719
7GPT-5.3 Codex65.21581.165.1 / 63.7 / 66.939
8Claude Sonnet 4.662.21587.856.9 / 57.4 / 72.327
9GLM-561.71573.147.5 / 70.1 / 67.530
10Kimi K2.561.11524.349.3 / 70.4 / 63.650
11GLM-5.159.91561.264.4 / 70.5 / 44.827
12Qwen3.6 Plus56.71516.958.1 / 69.1 / 43.123
13GPT-5.4 Nano56.41516.760.6 / 60.6 / 48.045
14DeepSeek V3.253.21487.547.9 / 64.5 / 47.130
15GPT-5.4 Mini53.21472.168.5 / 59.2 / 31.830
16Minimax M2.750.81440.255.0 / 46.6 / —17
17Qwen3.6 Plus Preview50.11430.861.2 / 39.1 / —14
18GPT-5.2 Codex49.91488.3— / 57.2 / 42.618
19GPT-5 Mini49.21468.141.4 / 55.0 / 51.135
20MiMo-V2-Pro49.11482.539.4 / 51.3 / 56.848
21Gemma 4 31B48.81461.454.7 / 44.6 / 47.054
22Gemini 3 Flash Preview48.41470.337.7 / 49.8 / 57.626
23Qwen3 Max Thinking45.01495.242.8 / 53.5 / 38.726
24MiMo-V2-Omni44.51438.524.2 / 55.0 / 54.425
25Gemini 2.5 Flash44.51413.342.1 / 45.0 / 46.223
26Grok 4.2042.71398.244.8 / 38.7 / 44.540
27Minimax M2.542.01418.745.7 / 54.0 / 26.318
28Step 3.5 Flash42.01451.947.2 / 44.8 / 34.022
29Qwen3.5 122B A10B41.81389.039.4 / 53.1 / 32.929
30Gemma 4 26B A4B40.81418.243.1 / 48.8 / 30.519
31Nemotron 3 Super40.81407.233.4 / 39.1 / 49.825
32Mistral Small 260339.61425.440.3 / 48.0 / 30.422
33Gemini 3.1 Flash Lite Preview39.21372.733.2 / 44.9 / 39.424
34Seed 2.0 Mini33.51418.0— / 21.7 / 45.318
35GPT-5 Nano32.81363.832.4 / 32.5 / 33.438
36Trinity Large Preview21.71184.20.8 / 36.1 / 28.119