DuelLab Benchmark
Rankings from code-generation tournaments on a hidden game suite.
One row per model family; Min–Max is the score range across that family's dated entries in this track.
| Model family | Avg score | Min–Max | Entries |
|---|---|---|---|
| gpt-5-mini ($0.0088)::628ebfd2c9b8 | 100.0 | 100.0 | 1 |
| entrant_009_google--gemini-3-flash-preview::8c114917b8e3 | 100.0 | 100.0 | 1 |
| entrant_009_google--gemini-3-flash-preview::6d1c5f498355 | 98.9 | 98.9 | 1 |
| qwen/qwen3-max-thinking (recovered_after_fix) ($0.0201)::4aeacca85750 | 96.9 | 96.9 | 1 |
| entrant_000_gpt-5.4::14ff6b6748de | 95.2 | 95.2 | 1 |
| entrant_000_gpt-5.2::2a48c6945db1 | 94.9 | 94.9 | 1 |
| entrant_004_gpt-5.3-codex::861682ece0ae | 94.4 | 94.4 | 1 |
| entrant_007_z-ai--glm-5::b0dd13061084 | 93.2 | 93.2 | 1 |
| entrant_005_google--gemini-3.1-pro-preview::2e4d06f52910 | 91.7 | 91.7 | 1 |
| entrant_009_google--gemini-3-flash-preview::625ea5d044cd | 91.3 | 91.3 | 1 |
| entrant_005_google--gemini-3.1-pro-preview::8ebe96e65980 | 90.9 | 90.9 | 1 |
| entrant_007_z-ai--glm-5::17c57ee1cfa6 | 90.9 | 90.9 | 1 |
| entrant_011_moonshotai--kimi-k2.5::25225f273b28 | 90.6 | 90.6 | 1 |
| entrant_004_gpt-5.3-codex::c811aa176b62 | 90.4 | 90.4 | 1 |
| entrant_004_gpt-5.3-codex::9154767998bf | 90.1 | 90.1 | 1 |
| entrant_006_z-ai--glm-5::a2b617f85cfd | 89.6 | 89.6 | 1 |
| entrant_004_gpt-5.3-codex::5980a9c19f87 | 89.2 | 89.2 | 1 |
| entrant_012_moonshotai--kimi-k2.5::04fe201a22c6 | 89.1 | 89.1 | 1 |
| entrant_014_anthropic--claude-sonnet-4.6::0f0c803f0b38 | 89.0 | 89.0 | 1 |
| gpt-5.3-codex ($0.0200)::2e94e75ca479 | 88.3 | 88.3 | 1 |
| entrant_000_gpt-5.4::0b2642b7b3b5 | 87.1 | 87.1 | 1 |
| entrant_004_gpt-5.3-codex::5cad1cf65f38 | 86.8 | 86.8 | 1 |
| entrant_006_google--gemini-3.1-pro-preview | 86.1 | 86.1 | 1 |
| entrant_006_z-ai--glm-5::4bae2d47b34c | 85.9 | 85.9 | 1 |
| entrant_014_anthropic--claude-sonnet-4.6::01fce6ceed12 | 85.4 | 85.4 | 1 |
| entrant_015_anthropic--claude-sonnet-4.6::91715cc50e5e | 85.3 | 85.3 | 1 |
| entrant_013_anthropic--claude-opus-4.6::5c868b25a52f | 84.8 | 84.8 | 1 |
| entrant_014_anthropic--claude-opus-4.6::b0f54ac64a0f | 84.5 | 84.5 | 1 |
| entrant_000_gpt-5.2::381b51bd0a04 | 84.3 | 84.3 | 1 |
| entrant_011_moonshotai--kimi-k2.5::97459f7b08ce | 83.0 | 83.0 | 1 |
| entrant_006_z-ai--glm-5::16b2342fb880 | 80.9 | 80.9 | 1 |
| entrant_013_anthropic--claude-opus-4.6::6035e544b9be | 79.3 | 79.3 | 1 |
| entrant_012_moonshotai--kimi-k2.5::5476e97ed2c8 | 78.8 | 78.8 | 1 |
| entrant_014_anthropic--claude-sonnet-4.6::71d6f6447cbc | 77.6 | 77.6 | 1 |
| entrant_014_anthropic--claude-opus-4.6::9a62dcbb8a3b | 77.1 | 77.1 | 1 |
| entrant_005_gpt-5.3-codex::880993f40176 | 76.6 | 76.6 | 1 |
| entrant_014_anthropic--claude-sonnet-4.6::b8daeba4a7cf | 74.9 | 74.9 | 1 |
| entrant_007_z-ai--glm-5::60475863ae30 | 73.5 | 73.5 | 1 |
| anthropic/claude-opus-4.6 ($0.0833)::6ba3403d42aa | 73.4 | 73.4 | 1 |
| entrant_004_gpt-5.3-codex::3d8ddcce263a | 73.3 | 73.3 | 1 |
| entrant_015_anthropic--claude-sonnet-4.6::09e756834eda | 73.2 | 73.2 | 1 |
| entrant_012_moonshotai--kimi-k2.5::0e4c4b3371ea | 72.4 | 72.4 | 1 |
| entrant_005_google--gemini-3.1-pro-preview::53822cf06dda | 71.8 | 71.8 | 1 |
| entrant_004_gpt-5.3-codex::26c58495c5e9 | 71.2 | 71.2 | 1 |
| entrant_000_gpt-5.2::34ff3d41d915 | 70.1 | 70.1 | 1 |
| entrant_006_z-ai--glm-5::24b8f171bca6 | 68.2 | 68.2 | 1 |
| entrant_007_qwen--qwen3-max-thinking::f438fd18faee | 67.4 | 67.4 | 1 |
| entrant_009_google--gemini-3-flash-preview::2db816dc429f | 63.8 | 63.8 | 1 |
| entrant_000_gpt-5.4::18fd032f43d1 | 63.0 | 63.0 | 1 |
| entrant_003_gpt-5.2-codex::00da108f1d3c | 62.8 | 62.8 | 1 |
| anthropic/claude-opus-4.6 ($0.0875)::b0f54ac64a0f | 60.9 | 60.9 | 1 |
| entrant_004_gpt-5.3-codex::60fc1c213ad8 | 60.5 | 60.5 | 1 |
| entrant_004_gpt-5.3-codex::82d721235cc3 | 60.3 | 60.3 | 1 |
| entrant_001_gpt-5.2::a02047939ea9 | 59.9 | 59.9 | 1 |
| entrant_003_gpt-5.2-codex::5f0a5b2529a5 | 59.7 | 59.7 | 1 |
| z-ai/glm-5 ($0.0116)::b0dd13061084 | 59.7 | 59.7 | 1 |
| entrant_019_bytedance-seed--seed-2.0-mini::513ee872c075 | 59.6 | 59.6 | 1 |
| entrant_000_gpt-5.2::7d23fd327111 | 59.6 | 59.6 | 1 |
| anthropic/claude-sonnet-4.6 ($0.0579)::09e756834eda | 59.5 | 59.5 | 1 |
| entrant_000_gpt-5.2::5e3db3fbd34c | 59.4 | 59.4 | 1 |
| entrant_016_stepfun--step-3.5-flash_free::1eb8204f4a33 | 59.1 | 59.1 | 1 |
| entrant_005_gpt-5.3-codex::057f7457c82d | 58.9 | 58.9 | 1 |
| z-ai/glm-5 ($0.0093)::cb5aa20bd106 | 58.9 | 58.9 | 1 |
| entrant_001_gpt-5.2::9e6a8333c618 | 58.8 | 58.8 | 1 |
| entrant_009_google--gemini-3-flash-preview::6ac5fff628cd | 58.2 | 58.2 | 1 |
| entrant_001_gpt-5.2::2efbb468d8e4 | 58.2 | 58.2 | 1 |
| entrant_004_gpt-5.3-codex::5ca5a945609f | 58.2 | 58.2 | 1 |
| entrant_017_stepfun--step-3.5-flash_free | 57.8 | 57.8 | 1 |
| entrant_004_gpt-5.2-codex | 57.6 | 57.6 | 1 |
| entrant_001_gpt-5-mini::7ed20c1065d6 | 57.5 | 57.5 | 1 |
| entrant_001_gpt-5-mini::b4bd6cd5e542 | 57.2 | 57.2 | 1 |
| entrant_009_qwen--qwen3.5-122b-a10b::58a5ba6c9338 | 57.0 | 57.0 | 1 |
| entrant_000_gpt-5.2::761693f14061 | 56.3 | 56.3 | 1 |
| entrant_008_qwen--qwen3-max-thinking::00e3223323da | 56.2 | 56.2 | 1 |
| anthropic/claude-sonnet-4.6 ($0.0420)::a0d3ca1ae9ad | 56.1 | 56.1 | 1 |
| entrant_004_gpt-5.3-codex::1cbbac7a4039 | 56.0 | 56.0 | 1 |
| entrant_005_gpt-5.3-codex::665c73d58b45 | 55.4 | 55.4 | 1 |
| moonshotai/kimi-k2.5 ($0.0046)::0e4c4b3371ea | 55.1 | 55.1 | 1 |
| entrant_008_qwen--qwen3.5-122b-a10b::11c15e22c8e8 | 55.0 | 55.0 | 1 |
| entrant_001_gpt-5-mini::87821c3c85b1 | 55.0 | 55.0 | 1 |
| qwen/qwen3.5-122b-a10b (recovered_after_fix) ($0.0099)::9237962e52ca | 54.9 | 54.9 | 1 |
| entrant_002_gpt-5-nano::4e47419a7589 | 54.5 | 54.5 | 1 |
| entrant_001_gpt-5-mini::ad1d783a4c70 | 54.4 | 54.4 | 1 |
| entrant_001_gpt-5-mini::2822279cbf1a | 53.7 | 53.7 | 1 |
| entrant_008_qwen--qwen3.5-122b-a10b::0241c2460b90 | 53.7 | 53.7 | 1 |
| entrant_002_gpt-5-mini | 53.7 | 53.7 | 1 |
| entrant_001_gpt-5-mini::8f38c4d9855c | 53.6 | 53.6 | 1 |
| entrant_001_gpt-5-mini::67c9498f1701 | 52.7 | 52.7 | 1 |
| entrant_001_gpt-5-mini::0c8fab332113 | 52.4 | 52.4 | 1 |
| entrant_011_moonshotai--kimi-k2.5::6b9555b535cf | 51.3 | 51.3 | 1 |
| gpt-5.4 ($0.0000)::18fd032f43d1 | 51.0 | 51.0 | 1 |
| entrant_002_gpt-5-nano::099781c59e50 | 50.9 | 50.9 | 1 |
| entrant_001_gpt-5-mini::5bf4759e7c02 | 50.4 | 50.4 | 1 |
| entrant_000_gpt-5.2::47eb5fc99f6f | 49.4 | 49.4 | 1 |
| entrant_016_stepfun--step-3.5-flash_free::2aa14e16a463 | 48.7 | 48.7 | 1 |
| entrant_006_z-ai--glm-5::432a47fdb873 | 48.4 | 48.4 | 1 |
| entrant_011_moonshotai--kimi-k2.5::86974ddeead2 | 48.0 | 48.0 | 1 |
| arcee-ai/trinity-large-preview:free ($0.0000)::e5c9c34f4cf9 | 47.8 | 47.8 | 1 |
| entrant_011_minimax--minimax-m2.5 | 47.4 | 47.4 | 1 |
| gpt-5.2 ($0.0430)::39826a082fb0 | 47.3 | 47.3 | 1 |
| gpt-5.3-codex ($0.0325)::057f7457c82d | 47.2 | 47.2 | 1 |
| gpt-5.2 ($0.0319)::9e6a8333c618 | 47.1 | 47.1 | 1 |
| gpt-5.3-codex ($0.0000)::1cbbac7a4039 | 45.6 | 45.6 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0027)::745448837948 | 45.4 | 45.4 | 1 |
| gpt-5.2 ($0.0300)::791483e95653 | 44.4 | 44.4 | 1 |
| entrant_010_google--gemini-3.1-flash-lite-preview::745448837948 | 43.8 | 43.8 | 1 |
| qwen/qwen3.5-122b-a10b (recovered_after_fix) ($0.0132)::b1f8ca87ed0a | 43.4 | 43.4 | 1 |
| gpt-5-mini ($0.0090)::5bf4759e7c02 | 43.0 | 43.0 | 1 |
| entrant_012_deepseek--deepseek-v3.2::babb7f633345 | 42.2 | 42.2 | 1 |
| entrant_016_arcee-ai--trinity-large-preview_free::b15c0f016557 | 42.1 | 42.1 | 1 |
| entrant_007_qwen--qwen3-max-thinking::45cf191da6b2 | 42.0 | 42.0 | 1 |
| entrant_000_gpt-5.2::39826a082fb0 | 42.0 | 42.0 | 1 |
| entrant_000_gpt-5.4::7e297e7b9118 | 41.8 | 41.8 | 1 |
| entrant_013_deepseek--deepseek-v3.2::708fac99e5dc | 41.8 | 41.8 | 1 |
| entrant_015_anthropic--claude-sonnet-4.6::8eeefac1ec17 | 41.6 | 41.6 | 1 |
| entrant_013_deepseek--deepseek-v3.2::cd80f58124a8 | 41.2 | 41.2 | 1 |
| deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0038)::71a3315ecc07 | 41.0 | 41.0 | 1 |
| entrant_001_gpt-5-mini::9643aa170276 | 40.6 | 40.6 | 1 |
| deepseek/deepseek-v3.2 ($0.0018)::708fac99e5dc | 38.7 | 38.7 | 1 |
| gpt-5-nano ($0.0076)::03769244b16e | 37.7 | 37.7 | 1 |
| entrant_008_qwen--qwen3.5-122b-a10b::af4bb1a03d77 | 37.6 | 37.6 | 1 |
| entrant_012_deepseek--deepseek-v3.2::b09f4a5411ae | 37.1 | 37.1 | 1 |
| entrant_001_gpt-5-mini::048e9bf281bb | 36.9 | 36.9 | 1 |
| entrant_019_bytedance-seed--seed-2.0-mini::1d511fe15598 | 36.7 | 36.7 | 1 |
| arcee-ai/trinity-large-preview:free (recovered_after_fix) ($0.0000)::42448db4449b | 36.7 | 36.7 | 1 |
| entrant_002_gpt-5-nano::9d755956a0f6 | 36.3 | 36.3 | 1 |
| entrant_010_minimax--minimax-m2.5::e7794d25f07b | 34.7 | 34.7 | 1 |
| entrant_015_arcee-ai--trinity-large-preview_free::09894b1bd9ea | 33.5 | 33.5 | 1 |
| entrant_002_gpt-5-nano::03769244b16e | 33.1 | 33.1 | 1 |
| qwen/qwen3-max-thinking ($0.0086)::99446e67ec0f | 31.9 | 31.9 | 1 |
| entrant_007_qwen--qwen3-max-thinking::fe1be3eb2268 | 31.6 | 31.6 | 1 |
| bytedance-seed/seed-2.0-mini ($0.0009)::10023bce516e | 30.7 | 30.7 | 1 |
| entrant_016_stepfun--step-3.5-flash_free::be86064bd9b6 | 30.6 | 30.6 | 1 |
| entrant_015_arcee-ai--trinity-large-preview_free::16bb68f624ee | 30.6 | 30.6 | 1 |
| entrant_007_qwen--qwen3-max-thinking::352e53cd1449 | 30.3 | 30.3 | 1 |
| entrant_002_gpt-5-nano::edc6e99823b9 | 30.0 | 30.0 | 1 |
| entrant_013_deepseek--deepseek-v3.2::0638cde804dc | 29.7 | 29.7 | 1 |
| entrant_007_qwen--qwen3-max-thinking::44e3d89d6410 | 29.4 | 29.4 | 1 |
| entrant_002_gpt-5-nano::d41b2f44dda7 | 28.3 | 28.3 | 1 |
| entrant_002_gpt-5-nano::168b4641c9d2 | 27.8 | 27.8 | 1 |
| moonshotai/kimi-k2.5 ($0.0088)::3417d570adb7 | 27.6 | 27.6 | 1 |
| entrant_016_stepfun--step-3.5-flash_free::c36f05dc9ad2 | 27.6 | 27.6 | 1 |
| entrant_015_arcee-ai--trinity-large-preview_free::4a3b35ba8c06 | 26.6 | 26.6 | 1 |
| entrant_003_gpt-5.2-codex::557237351b91 | 26.3 | 26.3 | 1 |
| entrant_016_arcee-ai--trinity-large-preview_free::42448db4449b | 26.3 | 26.3 | 1 |
| entrant_008_qwen--qwen3.5-122b-a10b::4dfac77a88dd | 25.5 | 25.5 | 1 |
| entrant_003_gpt-5.2-codex::0b500f1f8734 | 25.5 | 25.5 | 1 |
| gpt-5-nano ($0.0032)::b71e9163bf77 | 25.0 | 25.0 | 1 |
| entrant_016_stepfun--step-3.5-flash_free::4ab1bcc3e4b7 | 24.3 | 24.3 | 1 |
| entrant_015_arcee-ai--trinity-large-preview_free::0ace044aeb44 | 23.4 | 23.4 | 1 |
| entrant_009_qwen--qwen3.5-122b-a10b::b1f8ca87ed0a | 21.6 | 21.6 | 1 |
| entrant_016_arcee-ai--trinity-large-preview_free::0b87b7222640 | 21.2 | 21.2 | 1 |
| entrant_010_minimax--minimax-m2.5::856d0f4c9892 | 18.6 | 18.6 | 1 |
| entrant_015_arcee-ai--trinity-large-preview_free::ce841544258f | 17.2 | 17.2 | 1 |
| entrant_002_gpt-5-nano::b5ef3d9318f0 | 17.0 | 17.0 | 1 |
| entrant_015_arcee-ai--trinity-large-preview_free::545a42bbbd09 | 16.3 | 16.3 | 1 |
| entrant_010_minimax--minimax-m2.5::80374b7181ce | 12.8 | 12.8 | 1 |
| entrant_010_google--gemini-3.1-flash-lite-preview::5ed71f0ce79a | 12.7 | 12.7 | 1 |
| entrant_008_qwen--qwen3-max-thinking::99446e67ec0f | 12.4 | 12.4 | 1 |
| entrant_010_google--gemini-3.1-flash-lite-preview::4d6f4419c790 | 12.1 | 12.1 | 1 |
| entrant_003_gpt-5-nano | 11.3 | 11.3 | 1 |
| entrant_008_qwen--qwen3-max-thinking::83206da24217 | 11.3 | 11.3 | 1 |
| entrant_015_arcee-ai--trinity-large-preview_free::c0e35d0722f2 | 11.1 | 11.1 | 1 |
| entrant_015_arcee-ai--trinity-large-preview_free::0f8a48b690b6 | 9.8 | 9.8 | 1 |
| entrant_002_gpt-5-nano::3b80bc411288 | 9.8 | 9.8 | 1 |
| entrant_008_qwen--qwen3.5-122b-a10b::547f7c89c067 | 9.1 | 9.1 | 1 |
| entrant_002_gpt-5-nano::04639b45a655 | 8.9 | 8.9 | 1 |
| entrant_002_gpt-5-nano::681d5465556b | 6.3 | 6.3 | 1 |
| entrant_012_deepseek--deepseek-v3.2::1516bc091028 | 5.8 | 5.8 | 1 |
| entrant_009_qwen--qwen3.5-122b-a10b::2b25ee71d64d | 5.7 | 5.7 | 1 |
| entrant_016_stepfun--step-3.5-flash_free::57027fa97bfc | 3.4 | 3.4 | 1 |
| google/gemini-3.1-flash-lite-preview ($0.0023)::1be8da66db78 | 0.0 | 0.0 | 1 |
| entrant_015_arcee-ai--trinity-large-preview_free::9a5c7e5c7b07 | 0.0 | 0.0 | 1 |
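
The family-level columns described above (average, Min–Max, entry count) could be derived from dated entries roughly as follows. This is a minimal sketch, assuming each entry is a `(family, score)` pair; the function name `summarize_families` and the sample values are hypothetical, and the source does not show the benchmark's own aggregation code.

```python
from collections import defaultdict
from statistics import mean


def summarize_families(entries):
    """Group (family, score) pairs and return {family: (avg, min, max, count)}."""
    by_family = defaultdict(list)
    for family, score in entries:
        by_family[family].append(score)
    return {
        family: (round(mean(scores), 1), min(scores), max(scores), len(scores))
        for family, scores in by_family.items()
    }


# Hypothetical scores, for illustration only (not taken from the leaderboard).
sample = [("family-a", 90.0), ("family-a", 70.0), ("family-b", 55.5)]
summary = summarize_families(sample)
```

A family with a single entry has identical minimum and maximum, which matches the table above: every row lists one entry and a single value in the Min–Max column.
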

One row per dated entry, ranked by overall score.

| # | Entry | Overall score | Coverage | Games played | Uncertainty (avg) |
|---|---|---|---|---|---|
| 1 | gpt-5-mini ($0.0088)::628ebfd2c9b8 @ 2026-03-04 | 100.0 | under_tested | 21 | 85.3 |
| 2 | entrant_009_google--gemini-3-flash-preview::8c114917b8e3 @ 2026-03-07 | 100.0 | stable | 282 | 23.8 |
| 3 | entrant_009_google--gemini-3-flash-preview::6d1c5f498355 @ 2026-03-07 | 98.9 | stable | 282 | 23.8 |
| 4 | qwen/qwen3-max-thinking (recovered_after_fix) ($0.0201)::4aeacca85750 @ 2026-03-04 | 96.9 | under_tested | 21 | 85.3 |
| 5 | entrant_000_gpt-5.4::14ff6b6748de @ 2026-03-07 | 95.2 | stable | 282 | 23.8 |
| 6 | entrant_000_gpt-5.2::2a48c6945db1 @ 2026-03-07 | 94.9 | stable | 282 | 23.8 |
| 7 | entrant_004_gpt-5.3-codex::861682ece0ae @ 2026-03-07 | 94.4 | stable | 282 | 23.8 |
| 8 | entrant_007_z-ai--glm-5::b0dd13061084 @ 2026-03-07 | 93.2 | stable | 282 | 23.8 |
| 9 | entrant_005_google--gemini-3.1-pro-preview::2e4d06f52910 @ 2026-03-07 | 91.7 | stable | 281 | 23.8 |
| 10 | entrant_009_google--gemini-3-flash-preview::625ea5d044cd @ 2026-03-07 | 91.3 | stable | 282 | 23.8 |
| 11 | entrant_005_google--gemini-3.1-pro-preview::8ebe96e65980 @ 2026-03-07 | 90.9 | stable | 282 | 23.8 |
| 12 | entrant_007_z-ai--glm-5::17c57ee1cfa6 @ 2026-03-07 | 90.9 | stable | 282 | 23.8 |
| 13 | entrant_011_moonshotai--kimi-k2.5::25225f273b28 @ 2026-03-07 | 90.6 | stable | 282 | 23.8 |
| 14 | entrant_004_gpt-5.3-codex::c811aa176b62 @ 2026-03-07 | 90.4 | stable | 281 | 23.8 |
| 15 | entrant_004_gpt-5.3-codex::9154767998bf @ 2026-03-07 | 90.1 | stable | 282 | 23.8 |
| 16 | entrant_006_z-ai--glm-5::a2b617f85cfd @ 2026-03-07 | 89.6 | stable | 282 | 23.8 |
| 17 | entrant_004_gpt-5.3-codex::5980a9c19f87 @ 2026-03-07 | 89.2 | stable | 282 | 23.8 |
| 18 | entrant_012_moonshotai--kimi-k2.5::04fe201a22c6 @ 2026-03-07 | 89.1 | stable | 282 | 23.8 |
| 19 | entrant_014_anthropic--claude-sonnet-4.6::0f0c803f0b38 @ 2026-03-07 | 89.0 | stable | 281 | 23.8 |
| 20 | gpt-5.3-codex ($0.0200)::2e94e75ca479 @ 2026-03-04 | 88.3 | under_tested | 22 | 83.4 |
| 21 | entrant_000_gpt-5.4::0b2642b7b3b5 @ 2026-03-07 | 87.1 | stable | 282 | 23.8 |
| 22 | entrant_004_gpt-5.3-codex::5cad1cf65f38 @ 2026-03-07 | 86.8 | stable | 282 | 23.8 |
| 23 | entrant_006_google--gemini-3.1-pro-preview @ 2026-03-07 | 86.1 | stable | 281 | 23.8 |
| 24 | entrant_006_z-ai--glm-5::4bae2d47b34c @ 2026-03-07 | 85.9 | stable | 282 | 23.8 |
| 25 | entrant_014_anthropic--claude-sonnet-4.6::01fce6ceed12 @ 2026-03-07 | 85.4 | stable | 282 | 23.8 |
| 26 | entrant_015_anthropic--claude-sonnet-4.6::91715cc50e5e @ 2026-03-07 | 85.3 | stable | 282 | 23.8 |
| 27 | entrant_013_anthropic--claude-opus-4.6::5c868b25a52f @ 2026-03-07 | 84.8 | stable | 283 | 23.7 |
| 28 | entrant_014_anthropic--claude-opus-4.6::b0f54ac64a0f @ 2026-03-07 | 84.5 | stable | 282 | 23.8 |
| 29 | entrant_000_gpt-5.2::381b51bd0a04 @ 2026-03-07 | 84.3 | stable | 282 | 23.8 |
| 30 | entrant_011_moonshotai--kimi-k2.5::97459f7b08ce @ 2026-03-07 | 83.0 | stable | 283 | 23.7 |
| 31 | entrant_006_z-ai--glm-5::16b2342fb880 @ 2026-03-07 | 80.9 | stable | 282 | 23.8 |
| 32 | entrant_013_anthropic--claude-opus-4.6::6035e544b9be @ 2026-03-07 | 79.3 | stable | 282 | 23.8 |
| 33 | entrant_012_moonshotai--kimi-k2.5::5476e97ed2c8 @ 2026-03-07 | 78.8 | stable | 282 | 23.8 |
| 34 | entrant_014_anthropic--claude-sonnet-4.6::71d6f6447cbc @ 2026-03-07 | 77.6 | stable | 282 | 23.8 |
| 35 | entrant_014_anthropic--claude-opus-4.6::9a62dcbb8a3b @ 2026-03-07 | 77.1 | stable | 282 | 23.8 |
| 36 | entrant_005_gpt-5.3-codex::880993f40176 @ 2026-03-07 | 76.6 | stable | 282 | 23.8 |
| 37 | entrant_014_anthropic--claude-sonnet-4.6::b8daeba4a7cf @ 2026-03-07 | 74.9 | stable | 282 | 23.8 |
| 38 | entrant_007_z-ai--glm-5::60475863ae30 @ 2026-03-07 | 73.5 | stable | 282 | 23.8 |
| 39 | anthropic/claude-opus-4.6 ($0.0833)::6ba3403d42aa @ 2026-03-04 | 73.4 | under_tested | 21 | 85.3 |
| 40 | entrant_004_gpt-5.3-codex::3d8ddcce263a @ 2026-03-07 | 73.3 | stable | 281 | 23.8 |
| 41 | entrant_015_anthropic--claude-sonnet-4.6::09e756834eda @ 2026-03-07 | 73.2 | stable | 281 | 23.8 |
| 42 | entrant_012_moonshotai--kimi-k2.5::0e4c4b3371ea @ 2026-03-07 | 72.4 | stable | 281 | 23.8 |
| 43 | entrant_005_google--gemini-3.1-pro-preview::53822cf06dda @ 2026-03-07 | 71.8 | stable | 281 | 23.8 |
| 44 | entrant_004_gpt-5.3-codex::26c58495c5e9 @ 2026-03-07 | 71.2 | stable | 282 | 23.8 |
| 45 | entrant_000_gpt-5.2::34ff3d41d915 @ 2026-03-07 | 70.1 | provisional | 32 | 69.6 |
| 46 | entrant_006_z-ai--glm-5::24b8f171bca6 @ 2026-03-07 | 68.2 | stable | 282 | 23.8 |
| 47 | entrant_007_qwen--qwen3-max-thinking::f438fd18faee @ 2026-03-07 | 67.4 | stable | 281 | 23.8 |
| 48 | entrant_009_google--gemini-3-flash-preview::2db816dc429f @ 2026-03-07 | 63.8 | stable | 281 | 23.8 |
| 49 | entrant_000_gpt-5.4::18fd032f43d1 @ 2026-03-07 | 63.0 | stable | 281 | 23.8 |
| 50 | entrant_003_gpt-5.2-codex::00da108f1d3c @ 2026-03-07 | 62.8 | stable | 282 | 23.8 |
| 51 | anthropic/claude-opus-4.6 ($0.0875)::b0f54ac64a0f @ 2026-03-07 | 60.9 | under_tested | 22 | 83.4 |
| 52 | entrant_004_gpt-5.3-codex::60fc1c213ad8 @ 2026-03-07 | 60.5 | stable | 280 | 23.9 |
| 53 | entrant_004_gpt-5.3-codex::82d721235cc3 @ 2026-03-07 | 60.3 | stable | 282 | 23.8 |
| 54 | entrant_001_gpt-5.2::a02047939ea9 @ 2026-03-07 | 59.9 | stable | 281 | 23.8 |
| 55 | entrant_003_gpt-5.2-codex::5f0a5b2529a5 @ 2026-03-07 | 59.7 | stable | 282 | 23.8 |
| 56 | z-ai/glm-5 ($0.0116)::b0dd13061084 @ 2026-03-07 | 59.7 | under_tested | 22 | 83.4 |
| 57 | entrant_019_bytedance-seed--seed-2.0-mini::513ee872c075 @ 2026-03-07 | 59.6 | stable | 281 | 23.8 |
| 58 | entrant_000_gpt-5.2::7d23fd327111 @ 2026-03-07 | 59.6 | stable | 281 | 23.8 |
| 59 | anthropic/claude-sonnet-4.6 ($0.0579)::09e756834eda @ 2026-03-07 | 59.5 | under_tested | 22 | 83.4 |
| 60 | entrant_000_gpt-5.2::5e3db3fbd34c @ 2026-03-07 | 59.4 | stable | 282 | 23.8 |
| 61 | entrant_016_stepfun--step-3.5-flash_free::1eb8204f4a33 @ 2026-03-07 | 59.1 | stable | 281 | 23.8 |
| 62 | entrant_005_gpt-5.3-codex::057f7457c82d @ 2026-03-07 | 58.9 | stable | 281 | 23.8 |
| 63 | z-ai/glm-5 ($0.0093)::cb5aa20bd106 @ 2026-03-04 | 58.9 | under_tested | 14 | 103.3 |
| 64 | entrant_001_gpt-5.2::9e6a8333c618 @ 2026-03-07 | 58.8 | stable | 281 | 23.8 |
| 65 | entrant_009_google--gemini-3-flash-preview::6ac5fff628cd @ 2026-03-07 | 58.2 | stable | 281 | 23.8 |
| 66 | entrant_001_gpt-5.2::2efbb468d8e4 @ 2026-03-07 | 58.2 | stable | 282 | 23.8 |
| 67 | entrant_004_gpt-5.3-codex::5ca5a945609f @ 2026-03-07 | 58.2 | stable | 282 | 23.8 |
| 68 | entrant_017_stepfun--step-3.5-flash_free @ 2026-03-07 | 57.8 | stable | 282 | 23.8 |
| 69 | entrant_004_gpt-5.2-codex @ 2026-03-07 | 57.6 | stable | 281 | 23.8 |
| 70 | entrant_001_gpt-5-mini::7ed20c1065d6 @ 2026-03-07 | 57.5 | stable | 282 | 23.8 |
| 71 | entrant_001_gpt-5-mini::b4bd6cd5e542 @ 2026-03-07 | 57.2 | stable | 282 | 23.8 |
| 72 | entrant_009_qwen--qwen3.5-122b-a10b::58a5ba6c9338 @ 2026-03-07 | 57.0 | stable | 281 | 23.8 |
| 73 | entrant_000_gpt-5.2::761693f14061 @ 2026-03-07 | 56.3 | stable | 282 | 23.8 |
| 74 | entrant_008_qwen--qwen3-max-thinking::00e3223323da @ 2026-03-07 | 56.2 | stable | 281 | 23.8 |
| 75 | anthropic/claude-sonnet-4.6 ($0.0420)::a0d3ca1ae9ad @ 2026-03-04 | 56.1 | under_tested | 13 | 106.9 |
| 76 | entrant_004_gpt-5.3-codex::1cbbac7a4039 @ 2026-03-07 | 56.0 | stable | 282 | 23.8 |
| 77 | entrant_005_gpt-5.3-codex::665c73d58b45 @ 2026-03-07 | 55.4 | stable | 281 | 23.8 |
| 78 | moonshotai/kimi-k2.5 ($0.0046)::0e4c4b3371ea @ 2026-03-07 | 55.1 | under_tested | 22 | 83.4 |
| 79 | entrant_008_qwen--qwen3.5-122b-a10b::11c15e22c8e8 @ 2026-03-07 | 55.0 | stable | 282 | 23.8 |
| 80 | entrant_001_gpt-5-mini::87821c3c85b1 @ 2026-03-07 | 55.0 | stable | 281 | 23.8 |
| 81 | qwen/qwen3.5-122b-a10b (recovered_after_fix) ($0.0099)::9237962e52ca @ 2026-03-04 | 54.9 | under_tested | 23 | 81.6 |
| 82 | entrant_002_gpt-5-nano::4e47419a7589 @ 2026-03-07 | 54.5 | stable | 281 | 23.8 |
| 83 | entrant_001_gpt-5-mini::ad1d783a4c70 @ 2026-03-07 | 54.4 | stable | 281 | 23.8 |
| 84 | entrant_001_gpt-5-mini::2822279cbf1a @ 2026-03-07 | 53.7 | stable | 281 | 23.8 |
| 85 | entrant_008_qwen--qwen3.5-122b-a10b::0241c2460b90 @ 2026-03-07 | 53.7 | stable | 281 | 23.8 |
| 86 | entrant_002_gpt-5-mini @ 2026-03-07 | 53.7 | stable | 282 | 23.8 |
| 87 | entrant_001_gpt-5-mini::8f38c4d9855c @ 2026-03-07 | 53.6 | stable | 281 | 23.8 |
| 88 | entrant_001_gpt-5-mini::67c9498f1701 @ 2026-03-07 | 52.7 | stable | 281 | 23.8 |
| 89 | entrant_001_gpt-5-mini::0c8fab332113 @ 2026-03-07 | 52.4 | stable | 281 | 23.8 |
| 90 | entrant_011_moonshotai--kimi-k2.5::6b9555b535cf @ 2026-03-07 | 51.3 | stable | 282 | 23.8 |
| 91 | gpt-5.4 ($0.0000)::18fd032f43d1 @ 2026-03-07 | 51.0 | under_tested | 22 | 83.4 |
| 92 | entrant_002_gpt-5-nano::099781c59e50 @ 2026-03-07 | 50.9 | stable | 281 | 23.8 |
| 93 | entrant_001_gpt-5-mini::5bf4759e7c02 @ 2026-03-07 | 50.4 | stable | 282 | 23.8 |
| 94 | entrant_000_gpt-5.2::47eb5fc99f6f @ 2026-03-07 | 49.4 | stable | 282 | 23.8 |
| 95 | entrant_016_stepfun--step-3.5-flash_free::2aa14e16a463 @ 2026-03-07 | 48.7 | stable | 281 | 23.8 |
| 96 | entrant_006_z-ai--glm-5::432a47fdb873 @ 2026-03-07 | 48.4 | stable | 281 | 23.8 |
| 97 | entrant_011_moonshotai--kimi-k2.5::86974ddeead2 @ 2026-03-07 | 48.0 | stable | 282 | 23.8 |
| 98 | arcee-ai/trinity-large-preview:free ($0.0000)::e5c9c34f4cf9 @ 2026-03-04 | 47.8 | under_tested | 26 | 77.0 |
| 99 | entrant_011_minimax--minimax-m2.5 @ 2026-03-07 | 47.4 | stable | 281 | 23.8 |
| 100 | gpt-5.2 ($0.0430)::39826a082fb0 @ 2026-02-27 | 47.3 | under_tested | 6 | 151.2 |
| 101 | gpt-5.3-codex ($0.0325)::057f7457c82d @ 2026-03-07 | 47.2 | under_tested | 22 | 83.4 |
| 102 | gpt-5.2 ($0.0319)::9e6a8333c618 @ 2026-03-07 | 47.1 | under_tested | 22 | 83.4 |
| 103 | gpt-5.3-codex ($0.0000)::1cbbac7a4039 @ 2026-02-27 | 45.6 | under_tested | 6 | 151.2 |
| 104 | google/gemini-3.1-flash-lite-preview ($0.0027)::745448837948 @ 2026-03-07 | 45.4 | under_tested | 22 | 83.4 |
| 105 | gpt-5.2 ($0.0300)::791483e95653 @ 2026-03-04 | 44.4 | under_tested | 20 | 87.3 |
| 106 | entrant_010_google--gemini-3.1-flash-lite-preview::745448837948 @ 2026-03-07 | 43.8 | stable | 281 | 23.8 |
| 107 | qwen/qwen3.5-122b-a10b (recovered_after_fix) ($0.0132)::b1f8ca87ed0a @ 2026-03-07 | 43.4 | under_tested | 22 | 83.4 |
| 108 | gpt-5-mini ($0.0090)::5bf4759e7c02 @ 2026-02-27 | 43.0 | under_tested | 6 | 151.2 |
| 109 | entrant_012_deepseek--deepseek-v3.2::babb7f633345 @ 2026-03-07 | 42.2 | stable | 282 | 23.8 |
| 110 | entrant_016_arcee-ai--trinity-large-preview_free::b15c0f016557 @ 2026-03-07 | 42.1 | stable | 280 | 23.9 |
| 111 | entrant_007_qwen--qwen3-max-thinking::45cf191da6b2 @ 2026-03-07 | 42.0 | stable | 278 | 23.9 |
| 112 | entrant_000_gpt-5.2::39826a082fb0 @ 2026-03-07 | 42.0 | stable | 282 | 23.8 |
| 113 | entrant_000_gpt-5.4::7e297e7b9118 @ 2026-03-07 | 41.8 | stable | 282 | 23.8 |
| 114 | entrant_013_deepseek--deepseek-v3.2::708fac99e5dc @ 2026-03-07 | 41.8 | stable | 281 | 23.8 |
| 115 | entrant_015_anthropic--claude-sonnet-4.6::8eeefac1ec17 @ 2026-03-07 | 41.6 | stable | 282 | 23.8 |
| 116 | entrant_013_deepseek--deepseek-v3.2::cd80f58124a8 @ 2026-03-07 | 41.2 | stable | 280 | 23.9 |
| 117 | deepseek/deepseek-v3.2 (recovered_after_fix) ($0.0038)::71a3315ecc07 @ 2026-03-04 | 41.0 | under_tested | 20 | 87.3 |
| 118 | entrant_001_gpt-5-mini::9643aa170276 @ 2026-03-07 | 40.6 | stable | 278 | 23.9 |
| 119 | deepseek/deepseek-v3.2 ($0.0018)::708fac99e5dc @ 2026-03-07 | 38.7 | under_tested | 22 | 83.4 |
| 120 | gpt-5-nano ($0.0076)::03769244b16e @ 2026-02-27 | 37.7 | under_tested | 6 | 151.2 |
| 121 | entrant_008_qwen--qwen3.5-122b-a10b::af4bb1a03d77 @ 2026-03-07 | 37.6 | stable | 280 | 23.9 |
| 122 | entrant_012_deepseek--deepseek-v3.2::b09f4a5411ae @ 2026-03-07 | 37.1 | stable | 282 | 23.8 |
| 123 | entrant_001_gpt-5-mini::048e9bf281bb @ 2026-03-07 | 36.9 | stable | 280 | 23.9 |
| 124 | entrant_019_bytedance-seed--seed-2.0-mini::1d511fe15598 @ 2026-03-07 | 36.7 | stable | 279 | 23.9 |
| 125 | arcee-ai/trinity-large-preview:free (recovered_after_fix) ($0.0000)::42448db4449b @ 2026-03-07 | 36.7 | under_tested | 22 | 83.4 |
| 126 | entrant_002_gpt-5-nano::9d755956a0f6 @ 2026-03-07 | 36.3 | stable | 281 | 23.8 |
| 127 | entrant_010_minimax--minimax-m2.5::e7794d25f07b @ 2026-03-07 | 34.7 | stable | 282 | 23.8 |
| 128 | entrant_015_arcee-ai--trinity-large-preview_free::09894b1bd9ea @ 2026-03-07 | 33.5 | stable | 280 | 23.9 |
| 129 | entrant_002_gpt-5-nano::03769244b16e @ 2026-03-07 | 33.1 | stable | 281 | 23.8 |
| 130 | qwen/qwen3-max-thinking ($0.0086)::99446e67ec0f @ 2026-03-07 | 31.9 | under_tested | 22 | 83.4 |
| 131 | entrant_007_qwen--qwen3-max-thinking::fe1be3eb2268 @ 2026-03-07 | 31.6 | stable | 280 | 23.9 |
| 132 | bytedance-seed/seed-2.0-mini ($0.0009)::10023bce516e @ 2026-03-04 | 30.7 | under_tested | 19 | 89.4 |
| 133 | entrant_016_stepfun--step-3.5-flash_free::be86064bd9b6 @ 2026-03-07 | 30.6 | stable | 278 | 23.9 |
| 134 | entrant_015_arcee-ai--trinity-large-preview_free::16bb68f624ee @ 2026-03-07 | 30.6 | stable | 279 | 23.9 |
| 135 | entrant_007_qwen--qwen3-max-thinking::352e53cd1449 @ 2026-03-07 | 30.3 | stable | 280 | 23.9 |
| 136 | entrant_002_gpt-5-nano::edc6e99823b9 @ 2026-03-07 | 30.0 | stable | 279 | 23.9 |
| 137 | entrant_013_deepseek--deepseek-v3.2::0638cde804dc @ 2026-03-07 | 29.7 | stable | 282 | 23.8 |
| 138 | entrant_007_qwen--qwen3-max-thinking::44e3d89d6410 @ 2026-03-07 | 29.4 | stable | 280 | 23.9 |
| 139 | entrant_002_gpt-5-nano::d41b2f44dda7 @ 2026-03-07 | 28.3 | stable | 280 | 23.9 |
| 140 | entrant_002_gpt-5-nano::168b4641c9d2 @ 2026-03-07 | 27.8 | stable | 280 | 23.9 |
| 141 | moonshotai/kimi-k2.5 ($0.0088)::3417d570adb7 @ 2026-03-04 | 27.6 | under_tested | 14 | 103.3 |
| 142 | entrant_016_stepfun--step-3.5-flash_free::c36f05dc9ad2 @ 2026-03-07 | 27.6 | stable | 280 | 23.9 |
| 143 | entrant_015_arcee-ai--trinity-large-preview_free::4a3b35ba8c06 @ 2026-03-07 | 26.6 | stable | 278 | 23.9 |
| 144 | entrant_003_gpt-5.2-codex::557237351b91 @ 2026-03-07 | 26.3 | stable | 278 | 23.9 |
| 145 | entrant_016_arcee-ai--trinity-large-preview_free::42448db4449b @ 2026-03-07 | 26.3 | stable | 279 | 23.9 |
| 146 | entrant_008_qwen--qwen3.5-122b-a10b::4dfac77a88dd @ 2026-03-07 | 25.5 | stable | 279 | 23.9 |
| 147 | entrant_003_gpt-5.2-codex::0b500f1f8734 @ 2026-03-07 | 25.5 | stable | 279 | 23.9 |
| 148 | gpt-5-nano ($0.0032)::b71e9163bf77 @ 2026-03-04 | 25.0 | under_tested | 15 | 100.0 |
| 149 | entrant_016_stepfun--step-3.5-flash_free::4ab1bcc3e4b7 @ 2026-03-07 | 24.3 | stable | 278 | 23.9 |
| 150 | entrant_015_arcee-ai--trinity-large-preview_free::0ace044aeb44 @ 2026-03-07 | 23.4 | stable | 209 | 27.6 |
| 151 | entrant_009_qwen--qwen3.5-122b-a10b::b1f8ca87ed0a @ 2026-03-07 | 21.6 | stable | 279 | 23.9 |
| 152 | entrant_016_arcee-ai--trinity-large-preview_free::0b87b7222640 @ 2026-03-07 | 21.2 | stable | 278 | 23.9 |
| 153 | entrant_010_minimax--minimax-m2.5::856d0f4c9892 @ 2026-03-07 | 18.6 | stable | 279 | 23.9 |
| 154 | entrant_015_arcee-ai--trinity-large-preview_free::ce841544258f @ 2026-03-07 | 17.2 | stable | 279 | 23.9 |
| 155 | entrant_002_gpt-5-nano::b5ef3d9318f0 @ 2026-03-07 | 17.0 | stable | 280 | 23.9 |
| 156 | entrant_015_arcee-ai--trinity-large-preview_free::545a42bbbd09 @ 2026-03-07 | 16.3 | stable | 279 | 23.9 |
| 157 | entrant_010_minimax--minimax-m2.5::80374b7181ce @ 2026-03-07 | 12.8 | stable | 279 | 23.9 |
| 158 | entrant_010_google--gemini-3.1-flash-lite-preview::5ed71f0ce79a @ 2026-03-07 | 12.7 | stable | 281 | 23.8 |
| 159 | entrant_008_qwen--qwen3-max-thinking::99446e67ec0f @ 2026-03-07 | 12.4 | stable | 278 | 23.9 |
| 160 | entrant_010_google--gemini-3.1-flash-lite-preview::4d6f4419c790 @ 2026-03-07 | 12.1 | stable | 278 | 23.9 |
| 161 | entrant_003_gpt-5-nano @ 2026-03-07 | 11.3 | stable | 279 | 23.9 |
| 162 | entrant_008_qwen--qwen3-max-thinking::83206da24217 @ 2026-03-07 | 11.3 | stable | 279 | 23.9 |
| 163 | entrant_015_arcee-ai--trinity-large-preview_free::c0e35d0722f2 @ 2026-03-07 | 11.1 | stable | 279 | 23.9 |
| 164 | entrant_015_arcee-ai--trinity-large-preview_free::0f8a48b690b6 @ 2026-03-07 | 9.8 | stable | 162 | 31.3 |
| 165 | entrant_002_gpt-5-nano::3b80bc411288 @ 2026-03-07 | 9.8 | stable | 280 | 23.9 |
| 166 | entrant_008_qwen--qwen3.5-122b-a10b::547f7c89c067 @ 2026-03-07 | 9.1 | stable | 281 | 23.8 |
| 167 | entrant_002_gpt-5-nano::04639b45a655 @ 2026-03-07 | 8.9 | stable | 279 | 23.9 |
| 168 | entrant_002_gpt-5-nano::681d5465556b @ 2026-03-07 | 6.3 | stable | 281 | 23.8 |
| 169 | entrant_012_deepseek--deepseek-v3.2::1516bc091028 @ 2026-03-07 | 5.8 | stable | 280 | 23.9 |
| 170 | entrant_009_qwen--qwen3.5-122b-a10b::2b25ee71d64d @ 2026-03-07 | 5.7 | stable | 278 | 23.9 |
| 171 | entrant_016_stepfun--step-3.5-flash_free::57027fa97bfc @ 2026-03-07 | 3.4 | stable | 279 | 23.9 |
| 172 | google/gemini-3.1-flash-lite-preview ($0.0023)::1be8da66db78 @ 2026-03-04 | 0.0 | under_tested | 19 | 89.4 |
| 173 | entrant_015_arcee-ai--trinity-large-preview_free::9a5c7e5c7b07 @ 2026-03-07 | 0.0 | stable | 280 | 23.9 |
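
The Uncertainty column moves in step with Games played. The source does not state how it is computed, but the tabulated values are consistent with `400 / sqrt(games + 1)` rounded to one decimal. The sketch below checks that empirical fit against a few rows; the function name `approx_uncertainty` is hypothetical and the formula is an observation about this table, not the benchmark's documented definition.

```python
import math


def approx_uncertainty(games_played):
    """Empirical fit to the table: 400 / sqrt(games + 1), rounded to 0.1."""
    return round(400 / math.sqrt(games_played + 1), 1)


# (games played, uncertainty) pairs copied from rows of the table above.
observed = [(6, 151.2), (21, 85.3), (22, 83.4), (32, 69.6), (162, 31.3), (282, 23.8)]

# Rows where the fit disagrees with the published value (expected to be empty).
mismatches = [(g, u) for g, u in observed if approx_uncertainty(g) != u]
```

Under this fit, entries with about 20 games carry roughly 3.5 times the uncertainty of entries with about 280 games, so close scores between `under_tested` and `stable` rows should be compared with caution.
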