DuelLab Benchmark – minimal_v1 / medium

Rankings from code-generation tournaments on a hidden game suite.

Model families

One row per model family; Min–Max is the score range across that family's dated entries in this track.

| Model family | Avg score | Min–Max | Entries |
| --- | --- | --- | --- |
| google/gemini-3.1-pro-preview ($0.0872)::8ae02d489cb7 | 100.0 | 100.0 | 1 |
| stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::416b6dabf6e1 | 100.0 | 100.0 | 1 |
| qwen/qwen3-max-thinking ($0.0590)::4bd70e782dab | 99.2 | 99.2 | 1 |
| moonshotai/kimi-k2.5 ($0.0297)::611c884a8097 | 99.2 | 99.2 | 1 |
| google/gemini-3.1-pro-preview ($0.0600)::a7b8bff01755 | 97.2 | 97.2 | 1 |
| anthropic/claude-sonnet-4.6 ($0.5111)::91715cc50e5e | 97.1 | 97.1 | 1 |
| gpt-5.2 ($0.0652)::3623da66d13f | 95.0 | 95.0 | 1 |
| gpt-5-mini ($0.0092)::1f8bd7336368 | 92.7 | 92.7 | 1 |
| z-ai/glm-5 ($0.0541)::17c57ee1cfa6 | 90.1 | 90.1 | 1 |
| gpt-5.4 ($0.0000)::0b2642b7b3b5 | 87.9 | 87.9 | 1 |
| moonshotai/kimi-k2.5 (recovered_after_fix) ($0.0558)::5476e97ed2c8 | 80.0 | 80.0 | 1 |
| gpt-5.3-codex ($0.0753)::880993f40176 | 74.2 | 74.2 | 1 |
| gpt-5.3-codex ($0.0000)::82d721235cc3 | 72.3 | 72.3 | 1 |
| qwen/qwen3.5-122b-a10b ($0.0207)::1023d7d1ecf9 | 71.1 | 71.1 | 1 |
| anthropic/claude-opus-4.6 ($0.8473)::17c222e0ccd1 | 69.3 | 69.3 | 1 |
| gpt-5.2-codex ($0.0507)::aef8969aacc7 | 59.9 | 59.9 | 1 |
| gpt-5.2 (recovered_after_fix) ($0.1364)::2efbb468d8e4 | 59.8 | 59.8 | 1 |
| gpt-5.2 ($0.0667)::47eb5fc99f6f | 59.2 | 59.2 | 1 |
| gpt-5-mini ($0.0103)::67c9498f1701 | 58.5 | 58.5 | 1 |
| qwen/qwen3-max-thinking ($0.0644)::00e3223323da | 54.6 | 54.6 | 1 |
| qwen/qwen3.5-122b-a10b ($0.0466)::58a5ba6c9338 | 53.8 | 53.8 | 1 |
| gpt-5-mini ($0.0077)::3928741d8858 | 50.6 | 50.6 | 1 |
| anthropic/claude-sonnet-4.6 ($0.3961)::74e8f80b29ee | 46.1 | 46.1 | 1 |
| stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::71211240e6e7 | 44.6 | 44.6 | 1 |
| bytedance-seed/seed-2.0-mini ($0.0063)::284413223bc7 | 44.4 | 44.4 | 1 |
| gpt-5.3-codex ($0.0544)::7b791c451590 | 42.6 | 42.6 | 1 |
| gpt-5.2-codex ($0.0487)::0b500f1f8734 | 41.5 | 41.5 | 1 |
| stepfun/step-3.5-flash:free ($0.0000)::be86064bd9b6 | 40.5 | 40.5 | 1 |
| minimax/minimax-m2.5 ($0.0051)::06bd7cb68806 | 39.0 | 39.0 | 1 |
| minimax/minimax-m2.5 (recovered_after_fix) ($0.0179)::7c939d8643c1 | 38.7 | 38.7 | 1 |
| deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0083)::cd80f58124a8 | 31.8 | 31.8 | 1 |
| deepseek/deepseek-v3.2 ($0.0114)::af7298d9a915 | 28.7 | 28.7 | 1 |
| z-ai/glm-5 ($0.0443)::2490d4ff540f | 26.3 | 26.3 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0044)::2372e9571823 | 23.3 | 23.3 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::ce841544258f | 23.0 | 23.0 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::1b493558fdb1 | 22.7 | 22.7 | 1 |
| bytedance-seed/seed-2.0-mini ($0.0047)::1d511fe15598 | 18.9 | 18.9 | 1 |
| gpt-5-nano ($0.0041)::b5ef3d9318f0 | 12.9 | 12.9 | 1 |
| gpt-5-nano ($0.0049)::1a34fca062d0 | 4.9 | 4.9 | 1 |
| gpt-5.2-codex ($0.0275)::124e05529c56 | 4.8 | 4.8 | 1 |
| arcee-ai/trinity-large-preview:free (recovered_after_fix) ($0.0000)::0b87b7222640 | 0.3 | 0.3 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0040)::4d6f4419c790 | 0.0 | 0.0 | 1 |
| gpt-5-nano ($0.0031)::a37024d8b02c | 0.0 | 0.0 | 1 |
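
The family summary described above (average score, Min–Max, and entry count per family) could be derived from the dated entries roughly as follows. This is a minimal sketch, not DuelLab's actual implementation: the function name `summarize_families`, the dictionary field names, and the `family_key` parameter are all hypothetical. Because every family row on this page reports exactly one entry, the sketch assumes by default that the grouping key is the full label, including the run identifier; grouping by bare model name is shown only as an alternative.

```python
from collections import defaultdict
from statistics import mean


def summarize_families(entries, family_key=lambda e: e["label"]):
    """Group dated entries and compute avg, min, max, and count per group.

    `entries` is a list of dicts with hypothetical keys "label" and "score".
    `family_key` decides what counts as one family (assumption: full label).
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[family_key(entry)].append(entry["score"])

    rows = [
        {
            "family": family,
            "avg": round(mean(scores), 1),
            "min": min(scores),
            "max": max(scores),
            "entries": len(scores),
        }
        for family, scores in groups.items()
    ]
    # Highest average first, matching the order used on the leaderboard.
    return sorted(rows, key=lambda r: r["avg"], reverse=True)


# Scores taken from the three gpt-5-nano entries listed on this page.
entries = [
    {"label": "gpt-5-nano ($0.0041)::b5ef3d9318f0", "score": 12.9},
    {"label": "gpt-5-nano ($0.0049)::1a34fca062d0", "score": 4.9},
    {"label": "gpt-5-nano ($0.0031)::a37024d8b02c", "score": 0.0},
]

# Default grouping: one row per full label, as the table appears to do.
by_label = summarize_families(entries)

# Alternative grouping by bare model name (text before the first space).
by_model = summarize_families(
    entries, family_key=lambda e: e["label"].split(" ")[0]
)
```

With the default key each run stays in its own row, so Min–Max collapses to a single value. With the model-name key the three gpt-5-nano runs merge into one row whose range spans 0.0 to 12.9.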

Dated entries

| # | Entry | Overall score | Coverage | Games played | Uncertainty (avg) |
| --- | --- | --- | --- | --- | --- |
| 1 | google/gemini-3.1-pro-preview ($0.0872)::8ae02d489cb7 @ 2026-03-07 | 100.0 | provisional | 34 | 67.6 |
| 2 | stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::416b6dabf6e1 @ 2026-03-04 | 100.0 | under_tested | 29 | 73.0 |
| 3 | qwen/qwen3-max-thinking ($0.0590)::4bd70e782dab @ 2026-03-04 | 99.2 | provisional | 30 | 71.8 |
| 4 | moonshotai/kimi-k2.5 ($0.0297)::611c884a8097 @ 2026-03-04 | 99.2 | under_tested | 28 | 74.3 |
| 5 | google/gemini-3.1-pro-preview ($0.0600)::a7b8bff01755 @ 2026-03-04 | 97.2 | under_tested | 29 | 73.0 |
| 6 | anthropic/claude-sonnet-4.6 ($0.5111)::91715cc50e5e @ 2026-03-07 | 97.1 | provisional | 34 | 67.6 |
| 7 | gpt-5.2 ($0.0652)::3623da66d13f @ 2026-03-04 | 95.0 | under_tested | 25 | 78.4 |
| 8 | gpt-5-mini ($0.0092)::1f8bd7336368 @ 2026-03-04 | 92.7 | under_tested | 29 | 73.0 |
| 9 | z-ai/glm-5 ($0.0541)::17c57ee1cfa6 @ 2026-03-07 | 90.1 | provisional | 34 | 67.6 |
| 10 | gpt-5.4 ($0.0000)::0b2642b7b3b5 @ 2026-03-07 | 87.9 | provisional | 34 | 67.6 |
| 11 | moonshotai/kimi-k2.5 (recovered_after_fix) ($0.0558)::5476e97ed2c8 @ 2026-03-07 | 80.0 | provisional | 34 | 67.6 |
| 12 | gpt-5.3-codex ($0.0753)::880993f40176 @ 2026-03-07 | 74.2 | provisional | 34 | 67.6 |
| 13 | gpt-5.3-codex ($0.0000)::82d721235cc3 @ 2026-02-27 | 72.3 | under_tested | 12 | 110.9 |
| 14 | qwen/qwen3.5-122b-a10b ($0.0207)::1023d7d1ecf9 @ 2026-03-04 | 71.1 | under_tested | 29 | 73.0 |
| 15 | anthropic/claude-opus-4.6 ($0.8473)::17c222e0ccd1 @ 2026-03-04 | 69.3 | under_tested | 22 | 83.4 |
| 16 | gpt-5.2-codex ($0.0507)::aef8969aacc7 @ 2026-03-07 | 59.9 | provisional | 34 | 67.6 |
| 17 | gpt-5.2 (recovered_after_fix) ($0.1364)::2efbb468d8e4 @ 2026-03-07 | 59.8 | provisional | 34 | 67.6 |
| 18 | gpt-5.2 ($0.0667)::47eb5fc99f6f @ 2026-02-27 | 59.2 | under_tested | 12 | 110.9 |
| 19 | gpt-5-mini ($0.0103)::67c9498f1701 @ 2026-02-27 | 58.5 | under_tested | 12 | 110.9 |
| 20 | qwen/qwen3-max-thinking ($0.0644)::00e3223323da @ 2026-03-07 | 54.6 | provisional | 34 | 67.6 |
| 21 | qwen/qwen3.5-122b-a10b ($0.0466)::58a5ba6c9338 @ 2026-03-07 | 53.8 | provisional | 34 | 67.6 |
| 22 | gpt-5-mini ($0.0077)::3928741d8858 @ 2026-03-07 | 50.6 | provisional | 34 | 67.6 |
| 23 | anthropic/claude-sonnet-4.6 ($0.3961)::74e8f80b29ee @ 2026-03-04 | 46.1 | provisional | 31 | 70.7 |
| 24 | stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::71211240e6e7 @ 2026-03-07 | 44.6 | provisional | 34 | 67.6 |
| 25 | bytedance-seed/seed-2.0-mini ($0.0063)::284413223bc7 @ 2026-03-04 | 44.4 | under_tested | 27 | 75.6 |
| 26 | gpt-5.3-codex ($0.0544)::7b791c451590 @ 2026-03-04 | 42.6 | provisional | 31 | 70.7 |
| 27 | gpt-5.2-codex ($0.0487)::0b500f1f8734 @ 2026-02-27 | 41.5 | under_tested | 12 | 110.9 |
| 28 | stepfun/step-3.5-flash:free ($0.0000)::be86064bd9b6 @ 2026-02-27 | 40.5 | under_tested | 12 | 110.9 |
| 29 | minimax/minimax-m2.5 ($0.0051)::06bd7cb68806 @ 2026-03-04 | 39.0 | under_tested | 27 | 75.6 |
| 30 | minimax/minimax-m2.5 (recovered_after_fix) ($0.0179)::7c939d8643c1 @ 2026-03-07 | 38.7 | provisional | 34 | 67.6 |
| 31 | deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0083)::cd80f58124a8 @ 2026-03-07 | 31.8 | provisional | 34 | 67.6 |
| 32 | deepseek/deepseek-v3.2 ($0.0114)::af7298d9a915 @ 2026-03-04 | 28.7 | under_tested | 26 | 77.0 |
| 33 | z-ai/glm-5 ($0.0443)::2490d4ff540f @ 2026-03-04 | 26.3 | under_tested | 24 | 80.0 |
| 34 | google/gemini-3.1-flash-lite-preview ($0.0044)::2372e9571823 @ 2026-03-04 | 23.3 | under_tested | 14 | 103.3 |
| 35 | arcee-ai/trinity-large-preview:free ($0.0000)::ce841544258f @ 2026-02-27 | 23.0 | under_tested | 12 | 110.9 |
| 36 | arcee-ai/trinity-large-preview:free ($0.0000)::1b493558fdb1 @ 2026-03-04 | 22.7 | under_tested | 19 | 89.4 |
| 37 | bytedance-seed/seed-2.0-mini ($0.0047)::1d511fe15598 @ 2026-03-07 | 18.9 | provisional | 34 | 67.6 |
| 38 | gpt-5-nano ($0.0041)::b5ef3d9318f0 @ 2026-02-27 | 12.9 | under_tested | 12 | 110.9 |
| 39 | gpt-5-nano ($0.0049)::1a34fca062d0 @ 2026-03-07 | 4.9 | provisional | 34 | 67.6 |
| 40 | gpt-5.2-codex ($0.0275)::124e05529c56 @ 2026-03-04 | 4.8 | under_tested | 21 | 85.3 |
| 41 | arcee-ai/trinity-large-preview:free (recovered_after_fix) ($0.0000)::0b87b7222640 @ 2026-03-07 | 0.3 | provisional | 34 | 67.6 |
| 42 | google/gemini-3.1-flash-lite-preview ($0.0040)::4d6f4419c790 @ 2026-03-07 | 0.0 | provisional | 34 | 67.6 |
| 43 | gpt-5-nano ($0.0031)::a37024d8b02c @ 2026-03-04 | 0.0 | under_tested | 21 | 85.3 |
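
The entry labels above appear to follow the pattern `<model> [(<status>)] ($<amount>)::<12-hex id> @ <date>`. The sketch below splits a label into those parts for anyone who wants to process the table programmatically. The pattern is inferred from the rows on this page rather than from any published DuelLab specification, and the field names (`model`, `status`, `cost`, `run_id`, `date`) are hypothetical. In particular, the page does not say what the dollar figure measures; calling it `cost` is an assumption.

```python
import re
from datetime import date

# Pattern inferred from the labels on this page (an assumption, not a spec).
LABEL_RE = re.compile(
    r"^(?P<model>\S+)"                        # e.g. moonshotai/kimi-k2.5
    r"(?: \((?P<status>[a-z_]+)\))?"          # optional, e.g. recovered_after_fix
    r" \(\$(?P<cost>\d+\.\d+)\)"              # dollar figure, meaning not stated
    r"::(?P<run_id>[0-9a-f]{12})"             # 12 hex characters
    r"(?: @ (?P<date>\d{4}-\d{2}-\d{2}))?$"   # optional ISO date
)


def parse_entry_label(label):
    """Split one leaderboard label into its parts; raise on unknown shapes."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized entry label: {label!r}")
    fields = match.groupdict()
    fields["cost"] = float(fields["cost"])
    if fields["date"] is not None:
        fields["date"] = date.fromisoformat(fields["date"])
    return fields


# A dated entry with a status tag, and a family label with neither.
dated = parse_entry_label(
    "moonshotai/kimi-k2.5 (recovered_after_fix) ($0.0558)::5476e97ed2c8 @ 2026-03-07"
)
family = parse_entry_label("gpt-5.2 ($0.0652)::3623da66d13f")
```

Keeping the status and date groups optional lets the same function handle both the family table, which has no dates, and the dated entries.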