DuelLab Benchmark – full_freedom / highest

Rankings from code-generation tournaments on a hidden game suite.

Model families

One row per model family; Min–Max is the score range across that family's dated entries in this track.

| Model family | Avg score | Min–Max | Entries |
| --- | --- | --- | --- |
| minimax/minimax-m2.5 ($0.0147)::37e6d2ed8e10 | 100.0 | 100.0 | 1 |
| gpt-5.3-codex ($0.0000)::5cad1cf65f38 | 100.0 | 100.0 | 1 |
| qwen/qwen3-max-thinking ($0.0547)::244dbd3a5223 | 98.0 | 98.0 | 1 |
| anthropic/claude-sonnet-4.6 ($0.7898)::7e165f96dbae | 93.5 | 93.5 | 1 |
| moonshotai/kimi-k2.5 ($0.0222)::4f4e1bffc0d6 | 90.7 | 90.7 | 1 |
| entrant_013_anthropic--claude-opus-4.6::38244ecbece9 | 79.8 | 79.8 | 1 |
| z-ai/glm-5 ($0.0481)::44808dece37d | 78.8 | 78.8 | 1 |
| entrant_013_anthropic--claude-opus-4.6::17c222e0ccd1 | 78.2 | 78.2 | 1 |
| entrant_013_anthropic--claude-opus-4.6::01029ef54314 | 74.7 | 74.7 | 1 |
| gpt-5-mini ($0.0175)::2af654aceacc | 66.9 | 66.9 | 1 |
| gpt-5-nano ($0.0103)::21d869229d89 | 65.3 | 65.3 | 1 |
| gpt-5.3-codex (recovered_after_fix) ($0.5605)::e954ca523560 | 59.5 | 59.5 | 1 |
| stepfun/step-3.5-flash:free ($0.0000)::b4370bd94d70 | 56.2 | 56.2 | 1 |
| stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::1eb8204f4a33 | 53.4 | 53.4 | 1 |
| gpt-5-mini ($0.0222)::b4bd6cd5e542 | 52.4 | 52.4 | 1 |
| google/gemini-3.1-pro-preview (recovered_after_fix) ($0.3999)::5540d6ab37a8 | 51.5 | 51.5 | 1 |
| gpt-5-nano ($0.0138)::099781c59e50 | 51.2 | 51.2 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0169)::652b4056c583 | 34.4 | 34.4 | 1 |
| deepseek/deepseek-v3.2 ($0.0033)::301ceb9d61df | 28.7 | 28.7 | 1 |
| qwen/qwen3.5-122b-a10b ($0.0646)::43c91e963cbe | 27.3 | 27.3 | 1 |
| entrant_013_anthropic--claude-opus-4.6::6ba3403d42aa | 6.9 | 6.9 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::4a3b35ba8c06 | 0.0 | 0.0 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::682f10efa6e9 | 0.0 | 0.0 | 1 |

Dated entries

| # | Entry | Overall score | Coverage | Games played | Uncertainty (avg) |
| --- | --- | --- | --- | --- | --- |
| 1 | gpt-5.3-codex ($0.0000)::5cad1cf65f38 @ 2026-02-27 | 100.0 | under_tested | 8 | 133.3 |
| 2 | minimax/minimax-m2.5 ($0.0147)::37e6d2ed8e10 @ 2026-03-04 | 100.0 | under_tested | 19 | 89.4 |
| 3 | qwen/qwen3-max-thinking ($0.0547)::244dbd3a5223 @ 2026-03-04 | 98.0 | under_tested | 25 | 78.4 |
| 4 | anthropic/claude-sonnet-4.6 ($0.7898)::7e165f96dbae @ 2026-03-04 | 93.5 | under_tested | 23 | 81.6 |
| 5 | moonshotai/kimi-k2.5 ($0.0222)::4f4e1bffc0d6 @ 2026-03-04 | 90.7 | under_tested | 22 | 83.4 |
| 6 | entrant_013_anthropic--claude-opus-4.6::38244ecbece9 @ 2026-03-07 | 79.8 | under_tested | 24 | 80.0 |
| 7 | z-ai/glm-5 ($0.0481)::44808dece37d @ 2026-03-04 | 78.8 | under_tested | 24 | 80.0 |
| 8 | entrant_013_anthropic--claude-opus-4.6::17c222e0ccd1 @ 2026-03-07 | 78.2 | under_tested | 16 | 97.0 |
| 9 | entrant_013_anthropic--claude-opus-4.6::01029ef54314 @ 2026-03-07 | 74.7 | under_tested | 16 | 97.0 |
| 10 | gpt-5-mini ($0.0175)::2af654aceacc @ 2026-03-04 | 66.9 | under_tested | 21 | 85.3 |
| 11 | gpt-5-nano ($0.0103)::21d869229d89 @ 2026-03-04 | 65.3 | under_tested | 20 | 87.3 |
| 12 | gpt-5.3-codex (recovered_after_fix) ($0.5605)::e954ca523560 @ 2026-03-04 | 59.5 | under_tested | 26 | 77.0 |
| 13 | stepfun/step-3.5-flash:free ($0.0000)::b4370bd94d70 @ 2026-03-04 | 56.2 | under_tested | 23 | 81.6 |
| 14 | stepfun/step-3.5-flash:free (recovered_after_fix) ($0.0000)::1eb8204f4a33 @ 2026-02-27 | 53.4 | under_tested | 8 | 133.3 |
| 15 | gpt-5-mini ($0.0222)::b4bd6cd5e542 @ 2026-02-27 | 52.4 | under_tested | 8 | 133.3 |
| 16 | google/gemini-3.1-pro-preview (recovered_after_fix) ($0.3999)::5540d6ab37a8 @ 2026-03-04 | 51.5 | under_tested | 18 | 91.8 |
| 17 | gpt-5-nano ($0.0138)::099781c59e50 @ 2026-02-27 | 51.2 | under_tested | 8 | 133.3 |
| 18 | google/gemini-3.1-flash-lite-preview ($0.0169)::652b4056c583 @ 2026-03-04 | 34.4 | under_tested | 18 | 91.8 |
| 19 | deepseek/deepseek-v3.2 ($0.0033)::301ceb9d61df @ 2026-03-04 | 28.7 | under_tested | 22 | 83.4 |
| 20 | qwen/qwen3.5-122b-a10b ($0.0646)::43c91e963cbe @ 2026-03-04 | 27.3 | under_tested | 15 | 100.0 |
| 21 | entrant_013_anthropic--claude-opus-4.6::6ba3403d42aa @ 2026-03-07 | 6.9 | under_tested | 24 | 80.0 |
| 22 | arcee-ai/trinity-large-preview:free ($0.0000)::4a3b35ba8c06 @ 2026-02-27 | 0.0 | under_tested | 8 | 133.3 |
| 23 | arcee-ai/trinity-large-preview:free ($0.0000)::682f10efa6e9 @ 2026-03-04 | 0.0 | under_tested | 26 | 77.0 |
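
The family rows are described as an aggregate over each family's dated entries (average score, score range, entry count). The following is a minimal sketch of that roll-up, not DuelLab's actual code: the `DatedEntry` type, the `summarize_families` function, and the choice of the full entry label as the family key are all assumptions made for illustration. In the tables here every family has exactly one dated entry, so the average, minimum, and maximum coincide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatedEntry:
    """Hypothetical record for one row of the dated-entries table."""
    family: str   # family key (assumed: the label before " @ <date>")
    date: str     # ISO date of the entry
    score: float  # overall score of that dated entry


def summarize_families(entries):
    """Return (family, avg, min, max, count) rows, highest average first."""
    scores_by_family = defaultdict(list)
    for entry in entries:
        scores_by_family[entry.family].append(entry.score)

    rows = []
    for family, scores in scores_by_family.items():
        avg = round(sum(scores) / len(scores), 1)
        rows.append((family, avg, min(scores), max(scores), len(scores)))

    # sorted/sort is stable, so ties keep their first-seen order.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Three scores copied from the dated-entries table above.
entries = [
    DatedEntry("gpt-5-mini ($0.0175)::2af654aceacc", "2026-03-04", 66.9),
    DatedEntry("gpt-5-mini ($0.0222)::b4bd6cd5e542", "2026-02-27", 52.4),
    DatedEntry("deepseek/deepseek-v3.2 ($0.0033)::301ceb9d61df", "2026-03-04", 28.7),
]

for row in summarize_families(entries):
    print(row)
```

Because the family key in this sketch includes the `::<hash>` suffix, the two gpt-5-mini runs stay in separate rows, which matches how they appear in the family table. Grouping on the bare model name instead would merge them into one row with a real Min–Max spread.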