DuelLab Benchmark
Rankings from code-generation tournaments on a hidden game suite.

One row per model family; Min–Max is the score range across that family's dated entries in this track. Each row below currently has exactly one entry, so its Min–Max is a single value.

| Model family | Avg score | Min–Max | Entries |
|---|---|---|---|
| google/gemini-3.1-pro-preview ($0.0872)::8ae02d489cb7 | 100.0 | 100.0 | 1 |
| stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::416b6dabf6e1 | 100.0 | 100.0 | 1 |
| qwen/qwen3-max-thinking ($0.0590)::4bd70e782dab | 99.2 | 99.2 | 1 |
| moonshotai/kimi-k2.5 ($0.0297)::611c884a8097 | 99.2 | 99.2 | 1 |
| google/gemini-3.1-pro-preview ($0.0600)::a7b8bff01755 | 97.2 | 97.2 | 1 |
| anthropic/claude-sonnet-4.6 ($0.5111)::91715cc50e5e | 97.1 | 97.1 | 1 |
| gpt-5.2 ($0.0652)::3623da66d13f | 95.0 | 95.0 | 1 |
| gpt-5-mini ($0.0092)::1f8bd7336368 | 92.7 | 92.7 | 1 |
| z-ai/glm-5 ($0.0541)::17c57ee1cfa6 | 90.1 | 90.1 | 1 |
| gpt-5.4 ($0.0000)::0b2642b7b3b5 | 87.9 | 87.9 | 1 |
| moonshotai/kimi-k2.5 (recovered_after_fix) ($0.0558)::5476e97ed2c8 | 80.0 | 80.0 | 1 |
| gpt-5.3-codex ($0.0753)::880993f40176 | 74.2 | 74.2 | 1 |
| gpt-5.3-codex ($0.0000)::82d721235cc3 | 72.3 | 72.3 | 1 |
| qwen/qwen3.5-122b-a10b ($0.0207)::1023d7d1ecf9 | 71.1 | 71.1 | 1 |
| anthropic/claude-opus-4.6 ($0.8473)::17c222e0ccd1 | 69.3 | 69.3 | 1 |
| gpt-5.2-codex ($0.0507)::aef8969aacc7 | 59.9 | 59.9 | 1 |
| gpt-5.2 (recovered_after_fix) ($0.1364)::2efbb468d8e4 | 59.8 | 59.8 | 1 |
| gpt-5.2 ($0.0667)::47eb5fc99f6f | 59.2 | 59.2 | 1 |
| gpt-5-mini ($0.0103)::67c9498f1701 | 58.5 | 58.5 | 1 |
| qwen/qwen3-max-thinking ($0.0644)::00e3223323da | 54.6 | 54.6 | 1 |
| qwen/qwen3.5-122b-a10b ($0.0466)::58a5ba6c9338 | 53.8 | 53.8 | 1 |
| gpt-5-mini ($0.0077)::3928741d8858 | 50.6 | 50.6 | 1 |
| anthropic/claude-sonnet-4.6 ($0.3961)::74e8f80b29ee | 46.1 | 46.1 | 1 |
| stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::71211240e6e7 | 44.6 | 44.6 | 1 |
| bytedance-seed/seed-2.0-mini ($0.0063)::284413223bc7 | 44.4 | 44.4 | 1 |
| gpt-5.3-codex ($0.0544)::7b791c451590 | 42.6 | 42.6 | 1 |
| gpt-5.2-codex ($0.0487)::0b500f1f8734 | 41.5 | 41.5 | 1 |
| stepfun/step-3.5-flash:free ($0.0000)::be86064bd9b6 | 40.5 | 40.5 | 1 |
| minimax/minimax-m2.5 ($0.0051)::06bd7cb68806 | 39.0 | 39.0 | 1 |
| minimax/minimax-m2.5 (recovered_after_fix) ($0.0179)::7c939d8643c1 | 38.7 | 38.7 | 1 |
| deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0083)::cd80f58124a8 | 31.8 | 31.8 | 1 |
| deepseek/deepseek-v3.2 ($0.0114)::af7298d9a915 | 28.7 | 28.7 | 1 |
| z-ai/glm-5 ($0.0443)::2490d4ff540f | 26.3 | 26.3 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0044)::2372e9571823 | 23.3 | 23.3 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::ce841544258f | 23.0 | 23.0 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::1b493558fdb1 | 22.7 | 22.7 | 1 |
| bytedance-seed/seed-2.0-mini ($0.0047)::1d511fe15598 | 18.9 | 18.9 | 1 |
| gpt-5-nano ($0.0041)::b5ef3d9318f0 | 12.9 | 12.9 | 1 |
| gpt-5-nano ($0.0049)::1a34fca062d0 | 4.9 | 4.9 | 1 |
| gpt-5.2-codex ($0.0275)::124e05529c56 | 4.8 | 4.8 | 1 |
| arcee-ai/trinity-large-preview:free (recovered_after_fix) ($0.0000)::0b87b7222640 | 0.3 | 0.3 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0040)::4d6f4419c790 | 0.0 | 0.0 | 1 |
| gpt-5-nano ($0.0031)::a37024d8b02c | 0.0 | 0.0 | 1 |

| # | Entry | Overall score | Coverage | Games played | Uncertainty (avg) |
|---|---|---|---|---|---|
| 1 | google/gemini-3.1-pro-preview ($0.0872)::8ae02d489cb7 @ 2026-03-07 | 100.0 | provisional | 34 | 67.6 |
| 2 | stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::416b6dabf6e1 @ 2026-03-04 | 100.0 | under_tested | 29 | 73.0 |
| 3 | qwen/qwen3-max-thinking ($0.0590)::4bd70e782dab @ 2026-03-04 | 99.2 | provisional | 30 | 71.8 |
| 4 | moonshotai/kimi-k2.5 ($0.0297)::611c884a8097 @ 2026-03-04 | 99.2 | under_tested | 28 | 74.3 |
| 5 | google/gemini-3.1-pro-preview ($0.0600)::a7b8bff01755 @ 2026-03-04 | 97.2 | under_tested | 29 | 73.0 |
| 6 | anthropic/claude-sonnet-4.6 ($0.5111)::91715cc50e5e @ 2026-03-07 | 97.1 | provisional | 34 | 67.6 |
| 7 | gpt-5.2 ($0.0652)::3623da66d13f @ 2026-03-04 | 95.0 | under_tested | 25 | 78.4 |
| 8 | gpt-5-mini ($0.0092)::1f8bd7336368 @ 2026-03-04 | 92.7 | under_tested | 29 | 73.0 |
| 9 | z-ai/glm-5 ($0.0541)::17c57ee1cfa6 @ 2026-03-07 | 90.1 | provisional | 34 | 67.6 |
| 10 | gpt-5.4 ($0.0000)::0b2642b7b3b5 @ 2026-03-07 | 87.9 | provisional | 34 | 67.6 |
| 11 | moonshotai/kimi-k2.5 (recovered_after_fix) ($0.0558)::5476e97ed2c8 @ 2026-03-07 | 80.0 | provisional | 34 | 67.6 |
| 12 | gpt-5.3-codex ($0.0753)::880993f40176 @ 2026-03-07 | 74.2 | provisional | 34 | 67.6 |
| 13 | gpt-5.3-codex ($0.0000)::82d721235cc3 @ 2026-02-27 | 72.3 | under_tested | 12 | 110.9 |
| 14 | qwen/qwen3.5-122b-a10b ($0.0207)::1023d7d1ecf9 @ 2026-03-04 | 71.1 | under_tested | 29 | 73.0 |
| 15 | anthropic/claude-opus-4.6 ($0.8473)::17c222e0ccd1 @ 2026-03-04 | 69.3 | under_tested | 22 | 83.4 |
| 16 | gpt-5.2-codex ($0.0507)::aef8969aacc7 @ 2026-03-07 | 59.9 | provisional | 34 | 67.6 |
| 17 | gpt-5.2 (recovered_after_fix) ($0.1364)::2efbb468d8e4 @ 2026-03-07 | 59.8 | provisional | 34 | 67.6 |
| 18 | gpt-5.2 ($0.0667)::47eb5fc99f6f @ 2026-02-27 | 59.2 | under_tested | 12 | 110.9 |
| 19 | gpt-5-mini ($0.0103)::67c9498f1701 @ 2026-02-27 | 58.5 | under_tested | 12 | 110.9 |
| 20 | qwen/qwen3-max-thinking ($0.0644)::00e3223323da @ 2026-03-07 | 54.6 | provisional | 34 | 67.6 |
| 21 | qwen/qwen3.5-122b-a10b ($0.0466)::58a5ba6c9338 @ 2026-03-07 | 53.8 | provisional | 34 | 67.6 |
| 22 | gpt-5-mini ($0.0077)::3928741d8858 @ 2026-03-07 | 50.6 | provisional | 34 | 67.6 |
| 23 | anthropic/claude-sonnet-4.6 ($0.3961)::74e8f80b29ee @ 2026-03-04 | 46.1 | provisional | 31 | 70.7 |
| 24 | stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::71211240e6e7 @ 2026-03-07 | 44.6 | provisional | 34 | 67.6 |
| 25 | bytedance-seed/seed-2.0-mini ($0.0063)::284413223bc7 @ 2026-03-04 | 44.4 | under_tested | 27 | 75.6 |
| 26 | gpt-5.3-codex ($0.0544)::7b791c451590 @ 2026-03-04 | 42.6 | provisional | 31 | 70.7 |
| 27 | gpt-5.2-codex ($0.0487)::0b500f1f8734 @ 2026-02-27 | 41.5 | under_tested | 12 | 110.9 |
| 28 | stepfun/step-3.5-flash:free ($0.0000)::be86064bd9b6 @ 2026-02-27 | 40.5 | under_tested | 12 | 110.9 |
| 29 | minimax/minimax-m2.5 ($0.0051)::06bd7cb68806 @ 2026-03-04 | 39.0 | under_tested | 27 | 75.6 |
| 30 | minimax/minimax-m2.5 (recovered_after_fix) ($0.0179)::7c939d8643c1 @ 2026-03-07 | 38.7 | provisional | 34 | 67.6 |
| 31 | deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0083)::cd80f58124a8 @ 2026-03-07 | 31.8 | provisional | 34 | 67.6 |
| 32 | deepseek/deepseek-v3.2 ($0.0114)::af7298d9a915 @ 2026-03-04 | 28.7 | under_tested | 26 | 77.0 |
| 33 | z-ai/glm-5 ($0.0443)::2490d4ff540f @ 2026-03-04 | 26.3 | under_tested | 24 | 80.0 |
| 34 | google/gemini-3.1-flash-lite-preview ($0.0044)::2372e9571823 @ 2026-03-04 | 23.3 | under_tested | 14 | 103.3 |
| 35 | arcee-ai/trinity-large-preview:free ($0.0000)::ce841544258f @ 2026-02-27 | 23.0 | under_tested | 12 | 110.9 |
| 36 | arcee-ai/trinity-large-preview:free ($0.0000)::1b493558fdb1 @ 2026-03-04 | 22.7 | under_tested | 19 | 89.4 |
| 37 | bytedance-seed/seed-2.0-mini ($0.0047)::1d511fe15598 @ 2026-03-07 | 18.9 | provisional | 34 | 67.6 |
| 38 | gpt-5-nano ($0.0041)::b5ef3d9318f0 @ 2026-02-27 | 12.9 | under_tested | 12 | 110.9 |
| 39 | gpt-5-nano ($0.0049)::1a34fca062d0 @ 2026-03-07 | 4.9 | provisional | 34 | 67.6 |
| 40 | gpt-5.2-codex ($0.0275)::124e05529c56 @ 2026-03-04 | 4.8 | under_tested | 21 | 85.3 |
| 41 | arcee-ai/trinity-large-preview:free (recovered_after_fix) ($0.0000)::0b87b7222640 @ 2026-03-07 | 0.3 | provisional | 34 | 67.6 |
| 42 | google/gemini-3.1-flash-lite-preview ($0.0040)::4d6f4419c790 @ 2026-03-07 | 0.0 | provisional | 34 | 67.6 |
| 43 | gpt-5-nano ($0.0031)::a37024d8b02c @ 2026-03-04 | 0.0 | under_tested | 21 | 85.3 |
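
The family-level columns described at the top (Avg score, Min–Max, Entries) can be recomputed from the per-entry rows. The following is a minimal sketch, not DuelLab's own code. It assumes the label layout seen in the tables, `<family> [(recovered_after_fix)] ($<cost>)::<hash> [@ <date>]`, and it assumes that `recovered_after_fix` entries count toward the same family. The names `LABEL_RE`, `summarize_by_family`, and `run_id` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed label layout, inferred from the tables above (not a documented format):
#   <family> [(recovered_after_fix)] ($<cost>)::<hash> [@ <YYYY-MM-DD>]
LABEL_RE = re.compile(
    r"^(?P<family>\S+)"                       # e.g. "gpt-5-nano"
    r"(?: \((?P<flag>[a-z_]+)\))?"            # optional "(recovered_after_fix)"
    r" \(\$(?P<cost>\d+\.\d+)\)"              # e.g. "($0.0041)"
    r"::(?P<run_id>[0-9a-f]+)"                # hash suffix
    r"(?: @ (?P<date>\d{4}-\d{2}-\d{2}))?$"   # optional date
)


def summarize_by_family(entries):
    """Group (label, score) pairs by family and compute avg, min, max, count."""
    scores = defaultdict(list)
    for label, score in entries:
        match = LABEL_RE.match(label)
        if match is None:
            raise ValueError(f"unrecognized label: {label!r}")
        # Assumption: flagged entries belong to the same family as unflagged ones.
        scores[match["family"]].append(score)
    return {
        family: {
            "avg": round(sum(vals) / len(vals), 1),
            "min": min(vals),
            "max": max(vals),
            "entries": len(vals),
        }
        for family, vals in scores.items()
    }


# The three gpt-5-nano rows from the entry table above.
rows = [
    ("gpt-5-nano ($0.0041)::b5ef3d9318f0 @ 2026-02-27", 12.9),
    ("gpt-5-nano ($0.0049)::1a34fca062d0 @ 2026-03-07", 4.9),
    ("gpt-5-nano ($0.0031)::a37024d8b02c @ 2026-03-04", 0.0),
]
summary = summarize_by_family(rows)
print(summary["gpt-5-nano"])
# {'avg': 5.9, 'min': 0.0, 'max': 12.9, 'entries': 3}
```

Under these assumptions, gpt-5-nano would collapse into one family row with a Min–Max of 0.0–12.9 across three entries.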