DuelLab → Benchmark
Rankings from code-generation tournaments on a hidden game suite.
One row per model family; Min–Max is the score range across that family's dated entries in this track, shown as a single value when the family has only one entry.
| Model family | Avg score | Min–Max | Entries |
|---|---|---|---|
| deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0064)::7530590d3a37 | 100.0 | 100.0 | 1 |
| gpt-5.2 ($0.0811)::2a48c6945db1 | 100.0 | 100.0 | 1 |
| gpt-5.3-codex ($0.0000)::3d8ddcce263a | 91.2 | 91.2 | 1 |
| qwen/qwen3-max-thinking ($0.0514)::0a63458392d1 | 89.6 | 89.6 | 1 |
| gpt-5.2-codex ($0.0446)::2252f948c0cf | 86.6 | 86.6 | 1 |
| stepfun/step-3.5-flash:free ($0.0000)::3dbf666dcbd0 | 86.6 | 86.6 | 1 |
| gpt-5-mini ($0.0076)::058b46859b5d | 84.7 | 84.7 | 1 |
| z-ai/glm-5 ($0.0371)::cb0020652f27 | 83.9 | 83.9 | 1 |
| gpt-5.2-codex ($0.0695)::00da108f1d3c | 80.9 | 80.9 | 1 |
| gpt-5.2 (recovered_after_fix) ($0.0915)::661c421e12a5 | 80.0 | 80.0 | 1 |
| google/gemini-3.1-pro-preview ($0.0708)::066d0848caff | 77.7 | 77.7 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::29c62944fbd3 | 74.8 | 74.8 | 1 |
| gpt-5.3-codex ($0.0617)::15ca78810d8f | 63.9 | 63.9 | 1 |
| moonshotai/kimi-k2.5 ($0.0325)::75c2cc06f5f9 | 61.5 | 61.5 | 1 |
| qwen/qwen3.5-122b-a10b ($0.0434)::71dca6c97f92 | 52.5 | 52.5 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0032)::b0ae954bb34a | 51.0 | 51.0 | 1 |
| stepfun/step-3.5-flash:free ($0.0000)::2aa14e16a463 | 42.9 | 42.9 | 1 |
| anthropic/claude-opus-4.6 ($0.7125)::01029ef54314 | 31.6 | 31.6 | 1 |
| anthropic/claude-sonnet-4.6 ($0.6293)::1c1d04ac560e | 30.6 | 30.6 | 1 |
| minimax/minimax-m2.5 ($0.0130)::33656ecfc86a | 25.5 | 25.5 | 1 |
| gpt-5-mini ($0.0097)::048e9bf281bb | 22.9 | 22.9 | 1 |
| gpt-5-nano ($0.0058)::edc6e99823b9 | 17.3 | 17.3 | 1 |
| bytedance-seed/seed-2.0-mini ($0.0062)::9c565cec5a53 | 2.9 | 2.9 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::545a42bbbd09 | 0.0 | 0.0 | 1 |
| gpt-5-nano (recovered_after_fix) ($0.0065)::7b7318670453 | 0.0 | 0.0 | 1 |
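
The summary columns above (Avg score, Min–Max, Entries) can be read as a simple group-by over per-entry scores. The following is a minimal sketch of that aggregation, not DuelLab's actual code: the function name `summarize` and the `key` parameter are hypothetical, and grouping on the bare model name is only an illustrative alternative to the full identifier that the table appears to use.

```python
from collections import defaultdict
from statistics import mean


def summarize(entries, key=lambda name: name):
    """Group (name, score) pairs; return {group: (avg, min, max, count)}."""
    groups = defaultdict(list)
    for name, score in entries:
        groups[key(name)].append(score)
    return {g: (mean(s), min(s), max(s), len(s)) for g, s in groups.items()}


# Two rows copied from the table above.
entries = [
    ("gpt-5.2-codex ($0.0446)::2252f948c0cf", 86.6),
    ("gpt-5.2-codex ($0.0695)::00da108f1d3c", 80.9),
]

# Keyed on the full identifier, every group holds exactly one entry,
# so min and max coincide.
by_id = summarize(entries)

# Hypothetical alternative: key on the model name before the cost annotation,
# which merges both gpt-5.2-codex rows into one group.
by_model = summarize(entries, key=lambda name: name.split(" (")[0])

for group, (avg, lo, hi, n) in by_model.items():
    print(f"{group}: avg={avg:.2f} range={lo}-{hi} entries={n}")
```

With the full identifier as the key, each row reports Entries = 1 and a one-value range, which matches what the table shows.
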

| # | Entry | Overall score | Coverage | Games played | Uncertainty (avg) |
|---|---|---|---|---|---|
| 1 | deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0064)::7530590d3a37 @ 2026-03-04 | 100.0 | under_tested | 27 | 75.6 |
| 2 | gpt-5.2 ($0.0811)::2a48c6945db1 @ 2026-02-27 | 100.0 | under_tested | 12 | 110.9 |
| 3 | gpt-5.3-codex ($0.0000)::3d8ddcce263a @ 2026-02-27 | 91.2 | under_tested | 12 | 110.9 |
| 4 | qwen/qwen3-max-thinking ($0.0514)::0a63458392d1 @ 2026-03-04 | 89.6 | under_tested | 26 | 77.0 |
| 5 | gpt-5.2-codex ($0.0446)::2252f948c0cf @ 2026-03-04 | 86.6 | provisional | 30 | 71.8 |
| 6 | stepfun/step-3.5-flash:free ($0.0000)::3dbf666dcbd0 @ 2026-03-04 | 86.6 | under_tested | 27 | 75.6 |
| 7 | gpt-5-mini ($0.0076)::058b46859b5d @ 2026-03-04 | 84.7 | provisional | 31 | 70.7 |
| 8 | z-ai/glm-5 ($0.0371)::cb0020652f27 @ 2026-03-04 | 83.9 | under_tested | 25 | 78.4 |
| 9 | gpt-5.2-codex ($0.0695)::00da108f1d3c @ 2026-02-27 | 80.9 | under_tested | 12 | 110.9 |
| 10 | gpt-5.2 (recovered_after_fix) ($0.0915)::661c421e12a5 @ 2026-03-04 | 80.0 | under_tested | 29 | 73.0 |
| 11 | google/gemini-3.1-pro-preview ($0.0708)::066d0848caff @ 2026-03-04 | 77.7 | under_tested | 25 | 78.4 |
| 12 | arcee-ai/trinity-large-preview:free ($0.0000)::29c62944fbd3 @ 2026-03-04 | 74.8 | under_tested | 24 | 80.0 |
| 13 | gpt-5.3-codex ($0.0617)::15ca78810d8f @ 2026-03-04 | 63.9 | under_tested | 28 | 74.3 |
| 14 | moonshotai/kimi-k2.5 ($0.0325)::75c2cc06f5f9 @ 2026-03-04 | 61.5 | under_tested | 24 | 80.0 |
| 15 | qwen/qwen3.5-122b-a10b ($0.0434)::71dca6c97f92 @ 2026-03-04 | 52.5 | under_tested | 27 | 75.6 |
| 16 | google/gemini-3.1-flash-lite-preview ($0.0032)::b0ae954bb34a @ 2026-03-04 | 51.0 | provisional | 30 | 71.8 |
| 17 | stepfun/step-3.5-flash:free ($0.0000)::2aa14e16a463 @ 2026-02-27 | 42.9 | under_tested | 12 | 110.9 |
| 18 | anthropic/claude-opus-4.6 ($0.7125)::01029ef54314 @ 2026-03-04 | 31.6 | under_tested | 26 | 77.0 |
| 19 | anthropic/claude-sonnet-4.6 ($0.6293)::1c1d04ac560e @ 2026-03-04 | 30.6 | under_tested | 24 | 80.0 |
| 20 | minimax/minimax-m2.5 ($0.0130)::33656ecfc86a @ 2026-03-04 | 25.5 | provisional | 33 | 68.6 |
| 21 | gpt-5-mini ($0.0097)::048e9bf281bb @ 2026-02-27 | 22.9 | under_tested | 12 | 110.9 |
| 22 | gpt-5-nano ($0.0058)::edc6e99823b9 @ 2026-02-27 | 17.3 | under_tested | 12 | 110.9 |
| 23 | bytedance-seed/seed-2.0-mini ($0.0062)::9c565cec5a53 @ 2026-03-04 | 2.9 | provisional | 33 | 68.6 |
| 24 | gpt-5-nano (recovered_after_fix) ($0.0065)::7b7318670453 @ 2026-03-04 | 0.0 | provisional | 31 | 70.7 |
| 25 | arcee-ai/trinity-large-preview:free ($0.0000)::545a42bbbd09 @ 2026-02-27 | 0.0 | under_tested | 12 | 110.9 |
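
The page does not say how the Uncertainty column is computed. As an observation only, the listed values agree with 400 / sqrt(games + 1) once rounded to one decimal place. The sketch below checks that fit against the table. The formula, the constant 400, and the name `fitted_uncertainty` are assumptions inferred from the numbers, not a documented DuelLab definition.

```python
import math


def fitted_uncertainty(games: int, scale: float = 400.0) -> float:
    """Assumed form fitted to the table; not a documented formula."""
    return scale / math.sqrt(games + 1)


# (games played -> listed uncertainty) pairs copied from the table above.
listed = {
    12: 110.9, 24: 80.0, 25: 78.4, 26: 77.0, 27: 75.6,
    28: 74.3, 29: 73.0, 30: 71.8, 31: 70.7, 33: 68.6,
}

# Largest gap between the listed value and the fitted value.
max_gap = max(abs(fitted_uncertainty(g) - shown) for g, shown in listed.items())

for games, shown in listed.items():
    print(f"games={games:2d} listed={shown:6.1f} fitted={fitted_uncertainty(games):7.2f}")
print(f"largest gap: {max_gap:.3f}")
```

If the fit holds, uncertainty depends only on the number of games played, which would explain why entries with the same game count show identical values regardless of score.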