DuelLab Benchmark – full_freedom / medium

Rankings from code-generation tournaments on a hidden game suite.

Model families

One row per model family; Min–Max is the score range across that family's dated entries in this track.

| Model family | Avg score | Min–Max | Entries |
| --- | --- | --- | --- |
| deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0064)::7530590d3a37 | 100.0 | 100.0 | 1 |
| gpt-5.2 ($0.0811)::2a48c6945db1 | 100.0 | 100.0 | 1 |
| gpt-5.3-codex ($0.0000)::3d8ddcce263a | 91.2 | 91.2 | 1 |
| qwen/qwen3-max-thinking ($0.0514)::0a63458392d1 | 89.6 | 89.6 | 1 |
| gpt-5.2-codex ($0.0446)::2252f948c0cf | 86.6 | 86.6 | 1 |
| stepfun/step-3.5-flash:free ($0.0000)::3dbf666dcbd0 | 86.6 | 86.6 | 1 |
| gpt-5-mini ($0.0076)::058b46859b5d | 84.7 | 84.7 | 1 |
| z-ai/glm-5 ($0.0371)::cb0020652f27 | 83.9 | 83.9 | 1 |
| gpt-5.2-codex ($0.0695)::00da108f1d3c | 80.9 | 80.9 | 1 |
| gpt-5.2 (recovered_after_fix) ($0.0915)::661c421e12a5 | 80.0 | 80.0 | 1 |
| google/gemini-3.1-pro-preview ($0.0708)::066d0848caff | 77.7 | 77.7 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::29c62944fbd3 | 74.8 | 74.8 | 1 |
| gpt-5.3-codex ($0.0617)::15ca78810d8f | 63.9 | 63.9 | 1 |
| moonshotai/kimi-k2.5 ($0.0325)::75c2cc06f5f9 | 61.5 | 61.5 | 1 |
| qwen/qwen3.5-122b-a10b ($0.0434)::71dca6c97f92 | 52.5 | 52.5 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0032)::b0ae954bb34a | 51.0 | 51.0 | 1 |
| stepfun/step-3.5-flash:free ($0.0000)::2aa14e16a463 | 42.9 | 42.9 | 1 |
| anthropic/claude-opus-4.6 ($0.7125)::01029ef54314 | 31.6 | 31.6 | 1 |
| anthropic/claude-sonnet-4.6 ($0.6293)::1c1d04ac560e | 30.6 | 30.6 | 1 |
| minimax/minimax-m2.5 ($0.0130)::33656ecfc86a | 25.5 | 25.5 | 1 |
| gpt-5-mini ($0.0097)::048e9bf281bb | 22.9 | 22.9 | 1 |
| gpt-5-nano ($0.0058)::edc6e99823b9 | 17.3 | 17.3 | 1 |
| bytedance-seed/seed-2.0-mini ($0.0062)::9c565cec5a53 | 2.9 | 2.9 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::545a42bbbd09 | 0.0 | 0.0 | 1 |
| gpt-5-nano (recovered_after_fix) ($0.0065)::7b7318670453 | 0.0 | 0.0 | 1 |
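The family rows above are aggregates of the dated entries below: per family, the average, minimum, maximum, and count of overall scores. A minimal sketch of that aggregation (the data and field names here are illustrative assumptions, not DuelLab's actual code or values):

```python
from collections import defaultdict

# Hypothetical (family label, overall score) pairs in the shape of the
# dated-entries table; a family with two entries shows how Min-Max diverges.
entries = [
    ("example/model-a ($0.0100)::aaaa", 100.0),
    ("example/model-a ($0.0100)::aaaa", 80.0),
    ("example/model-b ($0.0200)::bbbb", 50.0),
]

by_family = defaultdict(list)
for family, score in entries:
    by_family[family].append(score)

# One aggregate row per family: avg, min, max, and entry count.
rows = {
    family: {
        "avg": sum(scores) / len(scores),
        "min": min(scores),
        "max": max(scores),
        "entries": len(scores),
    }
    for family, scores in by_family.items()
}
```

With the sample data, `example/model-a` would get avg 90.0, Min–Max 80.0–100.0, and 2 entries; every family in this track currently has a single entry, which is why Min and Max coincide above.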

Dated entries

| # | Entry | Overall score | Coverage | Games played | Uncertainty (avg) |
| --- | --- | --- | --- | --- | --- |
| 1 | deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0064)::7530590d3a37 @ 2026-03-04 | 100.0 | under_tested | 27 | 75.6 |
| 2 | gpt-5.2 ($0.0811)::2a48c6945db1 @ 2026-02-27 | 100.0 | under_tested | 12 | 110.9 |
| 3 | gpt-5.3-codex ($0.0000)::3d8ddcce263a @ 2026-02-27 | 91.2 | under_tested | 12 | 110.9 |
| 4 | qwen/qwen3-max-thinking ($0.0514)::0a63458392d1 @ 2026-03-04 | 89.6 | under_tested | 26 | 77.0 |
| 5 | gpt-5.2-codex ($0.0446)::2252f948c0cf @ 2026-03-04 | 86.6 | provisional | 30 | 71.8 |
| 6 | stepfun/step-3.5-flash:free ($0.0000)::3dbf666dcbd0 @ 2026-03-04 | 86.6 | under_tested | 27 | 75.6 |
| 7 | gpt-5-mini ($0.0076)::058b46859b5d @ 2026-03-04 | 84.7 | provisional | 31 | 70.7 |
| 8 | z-ai/glm-5 ($0.0371)::cb0020652f27 @ 2026-03-04 | 83.9 | under_tested | 25 | 78.4 |
| 9 | gpt-5.2-codex ($0.0695)::00da108f1d3c @ 2026-02-27 | 80.9 | under_tested | 12 | 110.9 |
| 10 | gpt-5.2 (recovered_after_fix) ($0.0915)::661c421e12a5 @ 2026-03-04 | 80.0 | under_tested | 29 | 73.0 |
| 11 | google/gemini-3.1-pro-preview ($0.0708)::066d0848caff @ 2026-03-04 | 77.7 | under_tested | 25 | 78.4 |
| 12 | arcee-ai/trinity-large-preview:free ($0.0000)::29c62944fbd3 @ 2026-03-04 | 74.8 | under_tested | 24 | 80.0 |
| 13 | gpt-5.3-codex ($0.0617)::15ca78810d8f @ 2026-03-04 | 63.9 | under_tested | 28 | 74.3 |
| 14 | moonshotai/kimi-k2.5 ($0.0325)::75c2cc06f5f9 @ 2026-03-04 | 61.5 | under_tested | 24 | 80.0 |
| 15 | qwen/qwen3.5-122b-a10b ($0.0434)::71dca6c97f92 @ 2026-03-04 | 52.5 | under_tested | 27 | 75.6 |
| 16 | google/gemini-3.1-flash-lite-preview ($0.0032)::b0ae954bb34a @ 2026-03-04 | 51.0 | provisional | 30 | 71.8 |
| 17 | stepfun/step-3.5-flash:free ($0.0000)::2aa14e16a463 @ 2026-02-27 | 42.9 | under_tested | 12 | 110.9 |
| 18 | anthropic/claude-opus-4.6 ($0.7125)::01029ef54314 @ 2026-03-04 | 31.6 | under_tested | 26 | 77.0 |
| 19 | anthropic/claude-sonnet-4.6 ($0.6293)::1c1d04ac560e @ 2026-03-04 | 30.6 | under_tested | 24 | 80.0 |
| 20 | minimax/minimax-m2.5 ($0.0130)::33656ecfc86a @ 2026-03-04 | 25.5 | provisional | 33 | 68.6 |
| 21 | gpt-5-mini ($0.0097)::048e9bf281bb @ 2026-02-27 | 22.9 | under_tested | 12 | 110.9 |
| 22 | gpt-5-nano ($0.0058)::edc6e99823b9 @ 2026-02-27 | 17.3 | under_tested | 12 | 110.9 |
| 23 | bytedance-seed/seed-2.0-mini ($0.0062)::9c565cec5a53 @ 2026-03-04 | 2.9 | provisional | 33 | 68.6 |
| 24 | gpt-5-nano (recovered_after_fix) ($0.0065)::7b7318670453 @ 2026-03-04 | 0.0 | provisional | 31 | 70.7 |
| 25 | arcee-ai/trinity-large-preview:free ($0.0000)::545a42bbbd09 @ 2026-02-27 | 0.0 | under_tested | 12 | 110.9 |
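Each entry label follows a consistent shape: model identifier (with an optional variant tag), a per-run cost in parentheses, a `::`-separated config hash, and a date after `@`. A hedged sketch of a parser for that shape (the regex and field names are assumptions inferred from the labels above, not an official DuelLab format):

```python
import re

# Assumed label shape: "model [(variant)] ($cost)::hash @ YYYY-MM-DD".
# The lazy ".+?" keeps variant tags like "(recovered_after_fix)" inside
# the model field, since the cost group anchors on the "($" that follows.
ENTRY_RE = re.compile(
    r"^(?P<model>.+?) "          # model id, possibly with "(variant)"
    r"\(\$(?P<cost>[0-9.]+)\)"   # per-run cost, e.g. "0.0514"
    r"::(?P<hash>[0-9a-f]+)"     # config hash
    r" @ (?P<date>\d{4}-\d{2}-\d{2})$"  # entry date
)

label = "qwen/qwen3-max-thinking ($0.0514)::0a63458392d1 @ 2026-03-04"
fields = ENTRY_RE.match(label).groupdict()
# fields -> {"model": "qwen/qwen3-max-thinking", "cost": "0.0514",
#            "hash": "0a63458392d1", "date": "2026-03-04"}
```

The same pattern also handles variant-tagged labels such as `deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0064)::7530590d3a37 @ 2026-03-04`, where the variant stays attached to the model field.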