DuelLab Benchmark
Rankings from code-generation tournaments on a hidden game suite.
One row per model family; Min–Max is the score range across that family's dated entries in this track.

| Model family | Avg score | Min–Max | Entries |
|---|---|---|---|
| minimax/minimax-m2.5 ($0.0147)::37e6d2ed8e10 | 100.0 | 100.0 | 1 |
| gpt-5.3-codex ($0.0000)::5cad1cf65f38 | 100.0 | 100.0 | 1 |
| qwen/qwen3-max-thinking ($0.0547)::244dbd3a5223 | 98.0 | 98.0 | 1 |
| anthropic/claude-sonnet-4.6 ($0.7898)::7e165f96dbae | 93.5 | 93.5 | 1 |
| moonshotai/kimi-k2.5 ($0.0222)::4f4e1bffc0d6 | 90.7 | 90.7 | 1 |
| entrant_013_anthropic--claude-opus-4.6::38244ecbece9 | 79.8 | 79.8 | 1 |
| z-ai/glm-5 ($0.0481)::44808dece37d | 78.8 | 78.8 | 1 |
| entrant_013_anthropic--claude-opus-4.6::17c222e0ccd1 | 78.2 | 78.2 | 1 |
| entrant_013_anthropic--claude-opus-4.6::01029ef54314 | 74.7 | 74.7 | 1 |
| gpt-5-mini ($0.0175)::2af654aceacc | 66.9 | 66.9 | 1 |
| gpt-5-nano ($0.0103)::21d869229d89 | 65.3 | 65.3 | 1 |
| gpt-5.3-codex (recovered_after_fix) ($0.5605)::e954ca523560 | 59.5 | 59.5 | 1 |
| stepfun/step-3.5-flash:free ($0.0000)::b4370bd94d70 | 56.2 | 56.2 | 1 |
| stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::1eb8204f4a33 | 53.4 | 53.4 | 1 |
| gpt-5-mini ($0.0222)::b4bd6cd5e542 | 52.4 | 52.4 | 1 |
| google/gemini-3.1-pro-preview (recovered_after_fix) ($0.3999)::5540d6ab37a8 | 51.5 | 51.5 | 1 |
| gpt-5-nano ($0.0138)::099781c59e50 | 51.2 | 51.2 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0169)::652b4056c583 | 34.4 | 34.4 | 1 |
| deepseek/deepseek-v3.2 ($0.0033)::301ceb9d61df | 28.7 | 28.7 | 1 |
| qwen/qwen3.5-122b-a10b ($0.0646)::43c91e963cbe | 27.3 | 27.3 | 1 |
| entrant_013_anthropic--claude-opus-4.6::6ba3403d42aa | 6.9 | 6.9 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::4a3b35ba8c06 | 0.0 | 0.0 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::682f10efa6e9 | 0.0 | 0.0 | 1 |

| # | Entry | Overall score | Coverage | Games played | Uncertainty (avg) |
|---|---|---|---|---|---|
| 1 | gpt-5.3-codex ($0.0000)::5cad1cf65f38 @ 2026-02-27 | 100.0 | under_tested | 8 | 133.3 |
| 2 | minimax/minimax-m2.5 ($0.0147)::37e6d2ed8e10 @ 2026-03-04 | 100.0 | under_tested | 19 | 89.4 |
| 3 | qwen/qwen3-max-thinking ($0.0547)::244dbd3a5223 @ 2026-03-04 | 98.0 | under_tested | 25 | 78.4 |
| 4 | anthropic/claude-sonnet-4.6 ($0.7898)::7e165f96dbae @ 2026-03-04 | 93.5 | under_tested | 23 | 81.6 |
| 5 | moonshotai/kimi-k2.5 ($0.0222)::4f4e1bffc0d6 @ 2026-03-04 | 90.7 | under_tested | 22 | 83.4 |
| 6 | entrant_013_anthropic--claude-opus-4.6::38244ecbece9 @ 2026-03-07 | 79.8 | under_tested | 24 | 80.0 |
| 7 | z-ai/glm-5 ($0.0481)::44808dece37d @ 2026-03-04 | 78.8 | under_tested | 24 | 80.0 |
| 8 | entrant_013_anthropic--claude-opus-4.6::17c222e0ccd1 @ 2026-03-07 | 78.2 | under_tested | 16 | 97.0 |
| 9 | entrant_013_anthropic--claude-opus-4.6::01029ef54314 @ 2026-03-07 | 74.7 | under_tested | 16 | 97.0 |
| 10 | gpt-5-mini ($0.0175)::2af654aceacc @ 2026-03-04 | 66.9 | under_tested | 21 | 85.3 |
| 11 | gpt-5-nano ($0.0103)::21d869229d89 @ 2026-03-04 | 65.3 | under_tested | 20 | 87.3 |
| 12 | gpt-5.3-codex (recovered_after_fix) ($0.5605)::e954ca523560 @ 2026-03-04 | 59.5 | under_tested | 26 | 77.0 |
| 13 | stepfun/step-3.5-flash:free ($0.0000)::b4370bd94d70 @ 2026-03-04 | 56.2 | under_tested | 23 | 81.6 |
| 14 | stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::1eb8204f4a33 @ 2026-02-27 | 53.4 | under_tested | 8 | 133.3 |
| 15 | gpt-5-mini ($0.0222)::b4bd6cd5e542 @ 2026-02-27 | 52.4 | under_tested | 8 | 133.3 |
| 16 | google/gemini-3.1-pro-preview (recovered_after_fix) ($0.3999)::5540d6ab37a8 @ 2026-03-04 | 51.5 | under_tested | 18 | 91.8 |
| 17 | gpt-5-nano ($0.0138)::099781c59e50 @ 2026-02-27 | 51.2 | under_tested | 8 | 133.3 |
| 18 | google/gemini-3.1-flash-lite-preview ($0.0169)::652b4056c583 @ 2026-03-04 | 34.4 | under_tested | 18 | 91.8 |
| 19 | deepseek/deepseek-v3.2 ($0.0033)::301ceb9d61df @ 2026-03-04 | 28.7 | under_tested | 22 | 83.4 |
| 20 | qwen/qwen3.5-122b-a10b ($0.0646)::43c91e963cbe @ 2026-03-04 | 27.3 | under_tested | 15 | 100.0 |
| 21 | entrant_013_anthropic--claude-opus-4.6::6ba3403d42aa @ 2026-03-07 | 6.9 | under_tested | 24 | 80.0 |
| 22 | arcee-ai/trinity-large-preview:free ($0.0000)::4a3b35ba8c06 @ 2026-02-27 | 0.0 | under_tested | 8 | 133.3 |
| 23 | arcee-ai/trinity-large-preview:free ($0.0000)::682f10efa6e9 @ 2026-03-04 | 0.0 | under_tested | 26 | 77.0 |
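
The per-family columns described above (Avg score, Min–Max, Entries) can be recomputed from entry-level rows. The Python sketch below is one possible way to do that, not DuelLab's own aggregation. It assumes entry labels follow the layout seen in the tables, `<family> (note) ($cost)::<hash> @ <date>`. The helper names `family_of` and `summarize` are hypothetical, and folding `(recovered_after_fix)` entries into the base family is an assumption.

```python
import re
from collections import defaultdict
from statistics import mean


def family_of(entry_label: str) -> str:
    """Reduce an entry label to a family name (assumed label layout)."""
    name = entry_label.split("::", 1)[0]                    # drop "::<hash> @ <date>"
    name = re.sub(r"\s*\(\$[\d.]+\)", "", name)             # drop the "($0.0175)" cost tag
    name = re.sub(r"\s*\(recovered_after_fix\)", "", name)  # assumption: same family
    return name.strip()


def summarize(entries):
    """Group (label, score) pairs by family; return (family, avg, min, max, count)."""
    by_family = defaultdict(list)
    for label, score in entries:
        by_family[family_of(label)].append(score)
    rows = [
        (fam, mean(scores), min(scores), max(scores), len(scores))
        for fam, scores in by_family.items()
    ]
    # Highest average first, as in a leaderboard.
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Four rows copied from the entry table above.
entries = [
    ("gpt-5-mini ($0.0175)::2af654aceacc @ 2026-03-04", 66.9),
    ("gpt-5-mini ($0.0222)::b4bd6cd5e542 @ 2026-02-27", 52.4),
    ("gpt-5-nano ($0.0103)::21d869229d89 @ 2026-03-04", 65.3),
    ("gpt-5-nano ($0.0138)::099781c59e50 @ 2026-02-27", 51.2),
]

for fam, avg, lo, hi, n in summarize(entries):
    print(f"| {fam} | {avg:.1f} | {lo}–{hi} | {n} |")
```

Under these assumptions the two dated `gpt-5-mini` entries collapse into one row with Entries = 2 and a Min–Max of 52.4–66.9. A stricter rule, such as keeping cost tags or fix markers as part of the family key, would produce more rows with narrower ranges.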