DuelLab Benchmark – minimal_v1 / highest

Rankings from code-generation tournaments on a hidden game suite.

Model families

One row per model family; Min–Max is the score range across that family's dated entries in this track.

| Model family | Avg score | Min–Max | Entries |
|---|---|---|---|
| gpt-5.3-codex ($0.0000)::861682ece0ae | 100.0 | 100.0 | 1 |
| qwen/qwen3-max-thinking ($0.0620)::17ebf1a0f415 | 100.0 | 100.0 | 1 |
| minimax/minimax-m2.5 ($0.0150)::17d86923861b | 99.2 | 99.2 | 1 |
| moonshotai/kimi-k2.5 ($0.0336)::0d7f59c95b3a | 95.9 | 95.9 | 1 |
| gpt-5-mini ($0.0200)::844f8cf45a4e | 95.7 | 95.7 | 1 |
| stepfun/step-3.5-flash:free ($0.0000)::84575e982123 | 94.6 | 94.6 | 1 |
| entrant_013_anthropic--claude-opus-4.6::38244ecbece9 | 88.1 | 88.1 | 1 |
| z-ai/glm-5 ($0.0437)::00201bb03a01 | 86.2 | 86.2 | 1 |
| entrant_013_anthropic--claude-opus-4.6::17c222e0ccd1 | 86.2 | 86.2 | 1 |
| entrant_013_anthropic--claude-opus-4.6::01029ef54314 | 81.9 | 81.9 | 1 |
| gpt-5-mini ($0.0232)::7ed20c1065d6 | 80.3 | 80.3 | 1 |
| gpt-5.2-codex ($0.4983)::ab71abbabbae | 74.4 | 74.4 | 1 |
| gpt-5.3-codex ($0.4748)::1399bc429a50 | 62.1 | 62.1 | 1 |
| google/gemini-3.1-pro-preview ($0.3446)::37db7ffea127 | 58.9 | 58.9 | 1 |
| qwen/qwen3.5-122b-a10b ($0.0250)::3a876f4663d4 | 58.2 | 58.2 | 1 |
| gpt-5-nano ($0.0104)::d41b2f44dda7 | 57.2 | 57.2 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0125)::c096dda29618 | 41.7 | 41.7 | 1 |
| anthropic/claude-sonnet-4.6 ($0.3750)::263e91e37c96 | 39.5 | 39.5 | 1 |
| gpt-5-nano ($0.0055)::62315ee296bc | 35.4 | 35.4 | 1 |
| stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::4ab1bcc3e4b7 | 31.4 | 31.4 | 1 |
| deepseek/deepseek-v3.2 ($0.0032)::7b6db8a35def | 31.1 | 31.1 | 1 |
| arcee-ai/trinity-large-preview:free (recovered_after_fix) ($0.0000)::1b9e3f0b2b30 | 14.2 | 14.2 | 1 |
| entrant_013_anthropic--claude-opus-4.6::6ba3403d42aa | 0.0 | 0.0 | 1 |
| arcee-ai/trinity-large-preview:free (recovered_after_fix) ($0.0000)::c0e35d0722f2 | 0.0 | 0.0 | 1 |
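
The per-family summary described above (one row per family, with the score range taken over that family's dated entries) can be sketched as follows. This is a minimal Python sketch, not DuelLab's actual implementation; the function name `summarize_families`, the row field names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def summarize_families(entries):
    """Group (family, score) pairs and return one summary row per family."""
    scores_by_family = defaultdict(list)
    for family, score in entries:
        scores_by_family[family].append(score)

    rows = []
    for family, scores in scores_by_family.items():
        rows.append({
            "family": family,
            "avg": round(mean(scores), 1),  # average over the family's entries
            "min": min(scores),             # lower end of the Min-Max range
            "max": max(scores),             # upper end of the Min-Max range
            "entries": len(scores),         # number of dated entries
        })

    # Highest average first, matching the ordering of the table.
    rows.sort(key=lambda row: row["avg"], reverse=True)
    return rows

# Hypothetical data: one family with two dated entries, one with a single entry.
sample = [("family-a", 90.0), ("family-a", 70.0), ("family-b", 85.0)]
summary = summarize_families(sample)
```

When a family has a single dated entry, as every row in this track does, the average, minimum, and maximum coincide, which is why the Min–Max column shows one value.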

Dated entries

| # | Entry | Overall score | Coverage | Games played | Uncertainty (avg) |
|---|---|---|---|---|---|
| 1 | qwen/qwen3-max-thinking ($0.0620)::17ebf1a0f415 @ 2026-03-04 | 100.0 | under_tested | 24 | 80.0 |
| 2 | gpt-5.3-codex ($0.0000)::861682ece0ae @ 2026-02-27 | 100.0 | under_tested | 8 | 133.3 |
| 3 | minimax/minimax-m2.5 ($0.0150)::17d86923861b @ 2026-03-04 | 99.2 | under_tested | 24 | 80.0 |
| 4 | moonshotai/kimi-k2.5 ($0.0336)::0d7f59c95b3a @ 2026-03-04 | 95.9 | under_tested | 24 | 80.0 |
| 5 | gpt-5-mini ($0.0200)::844f8cf45a4e @ 2026-03-04 | 95.7 | under_tested | 18 | 91.8 |
| 6 | stepfun/step-3.5-flash:free ($0.0000)::84575e982123 @ 2026-03-04 | 94.6 | under_tested | 24 | 80.0 |
| 7 | entrant_013_anthropic--claude-opus-4.6::38244ecbece9 @ 2026-03-07 | 88.1 | under_tested | 24 | 80.0 |
| 8 | z-ai/glm-5 ($0.0437)::00201bb03a01 @ 2026-03-04 | 86.2 | under_tested | 24 | 80.0 |
| 9 | entrant_013_anthropic--claude-opus-4.6::17c222e0ccd1 @ 2026-03-07 | 86.2 | under_tested | 16 | 97.0 |
| 10 | entrant_013_anthropic--claude-opus-4.6::01029ef54314 @ 2026-03-07 | 81.9 | under_tested | 16 | 97.0 |
| 11 | gpt-5-mini ($0.0232)::7ed20c1065d6 @ 2026-02-27 | 80.3 | under_tested | 8 | 133.3 |
| 12 | gpt-5.2-codex ($0.4983)::ab71abbabbae @ 2026-03-04 | 74.4 | under_tested | 18 | 91.8 |
| 13 | gpt-5.3-codex ($0.4748)::1399bc429a50 @ 2026-03-04 | 62.1 | under_tested | 17 | 94.3 |
| 14 | google/gemini-3.1-pro-preview ($0.3446)::37db7ffea127 @ 2026-03-04 | 58.9 | under_tested | 19 | 89.4 |
| 15 | qwen/qwen3.5-122b-a10b ($0.0250)::3a876f4663d4 @ 2026-03-04 | 58.2 | under_tested | 26 | 77.0 |
| 16 | gpt-5-nano ($0.0104)::d41b2f44dda7 @ 2026-02-27 | 57.2 | under_tested | 8 | 133.3 |
| 17 | google/gemini-3.1-flash-lite-preview ($0.0125)::c096dda29618 @ 2026-03-04 | 41.7 | under_tested | 27 | 75.6 |
| 18 | anthropic/claude-sonnet-4.6 ($0.3750)::263e91e37c96 @ 2026-03-04 | 39.5 | under_tested | 25 | 78.4 |
| 19 | gpt-5-nano ($0.0055)::62315ee296bc @ 2026-03-04 | 35.4 | under_tested | 23 | 81.6 |
| 20 | stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::4ab1bcc3e4b7 @ 2026-02-27 | 31.4 | under_tested | 8 | 133.3 |
| 21 | deepseek/deepseek-v3.2 ($0.0032)::7b6db8a35def @ 2026-03-04 | 31.1 | under_tested | 23 | 81.6 |
| 22 | arcee-ai/trinity-large-preview:free (recovered_after_fix) ($0.0000)::1b9e3f0b2b30 @ 2026-03-04 | 14.2 | under_tested | 26 | 77.0 |
| 23 | arcee-ai/trinity-large-preview:free (recovered_after_fix) ($0.0000)::c0e35d0722f2 @ 2026-02-27 | 0.0 | under_tested | 8 | 133.3 |
| 24 | entrant_013_anthropic--claude-opus-4.6::6ba3403d42aa @ 2026-03-07 | 0.0 | under_tested | 24 | 80.0 |
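
Each entry label above appears to follow the shape `name ($amount)::id @ date`, with the dollar amount omitted for some entrants. Below is a hedged Python sketch of splitting such a label into its parts, assuming that shape holds for every entry. The helper `parse_entry` and its field names are hypothetical, and the page does not state what the dollar figure represents.

```python
import re
from datetime import date

# Assumed label shape: "<name> ($<amount>)::<12 hex chars> @ <YYYY-MM-DD>",
# where the " ($<amount>)" part is optional.
ENTRY_RE = re.compile(
    r"^(?P<name>.+?)"
    r"(?: \(\$(?P<amount>\d+\.\d+)\))?"
    r"::(?P<id>[0-9a-f]{12})"
    r" @ (?P<date>\d{4}-\d{2}-\d{2})$"
)

def parse_entry(label):
    """Split an entry label into name, dollar amount, identifier, and date."""
    match = ENTRY_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized entry label: {label!r}")
    amount = match.group("amount")
    return {
        "name": match.group("name"),
        # None when the label carries no dollar figure.
        "amount": float(amount) if amount is not None else None,
        "id": match.group("id"),
        "date": date.fromisoformat(match.group("date")),
    }

priced = parse_entry("gpt-5-mini ($0.0200)::844f8cf45a4e @ 2026-03-04")
unpriced = parse_entry("entrant_013_anthropic--claude-opus-4.6::6ba3403d42aa @ 2026-03-07")
```

The name is matched lazily so that suffixes such as `:free` or `(recovered_after_fix)` stay part of the name rather than being mistaken for the dollar figure or the `::` separator.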