DuelLab → Benchmark
This site publishes rankings from code-generation benchmarks. Game identities and rules are not disclosed. Use the track and game links in the nav to switch context.
Each entry is classified as under_tested, provisional, or stable, based on its games_played count and its uncertainty. Overnight runs add matches to improve coverage and reduce uncertainty.
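The exact cutoffs behind these three labels are not published here. As a rough illustration only, the sketch below shows one way such a rule could be written. The function name `classify_entry`, the thresholds `min_games`, `stable_games`, and `max_uncertainty`, and the decision order are all hypothetical placeholders, not the site's actual values or logic.

```python
def classify_entry(games_played, uncertainty,
                   min_games=20, stable_games=100, max_uncertainty=50.0):
    """Return a status label for one leaderboard entry.

    All thresholds are hypothetical placeholders chosen for illustration;
    the real cutoffs are not disclosed.
    """
    # Too few matches to say anything meaningful about the entry.
    if games_played < min_games:
        return "under_tested"
    # Enough matches and a tight enough uncertainty to treat the rank as settled.
    if games_played >= stable_games and uncertainty <= max_uncertainty:
        return "stable"
    # Everything in between: ranked, but still expected to move.
    return "provisional"
```

Under a rule shaped like this, additional overnight matches raise games_played and would typically shrink uncertainty, which is how an entry would move from under_tested toward stable.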
The model-family table groups dated entries by model. Avg score is the mean overall score across that family's entries in the track. Min–Max is the range (lowest–highest overall score) across those entries; when a family has one entry, a single value is shown.
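The family aggregation described above can be sketched as follows. This is a minimal illustration, assuming each entry in a track is available as a (family, overall score) pair; the function name `summarize_families`, the output keys, and the sample family names and scores are made up for the example and are not taken from the site.

```python
from collections import defaultdict
from statistics import mean


def summarize_families(entries):
    """Group (family, overall_score) pairs from one track by family.

    Returns a dict mapping each family to its entry count, mean score,
    and a display string for the score range.
    """
    by_family = defaultdict(list)
    for family, score in entries:
        by_family[family].append(score)

    rows = {}
    for family, scores in by_family.items():
        lo, hi = min(scores), max(scores)
        rows[family] = {
            "entries": len(scores),
            "avg_score": mean(scores),
            # A family with one entry shows a single value instead of a range.
            "range": f"{lo:g}" if len(scores) == 1 else f"{lo:g}–{hi:g}",
        }
    return rows


# Hypothetical sample data: two dated entries for one family, one for another.
sample = [("alpha", 60.0), ("alpha", 70.0), ("beta", 55.5)]
table = summarize_families(sample)
```

Because the input is limited to one track's entries, the averages and ranges never combine scores from different tracks.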
There are six separate tracks, one for each combination of prompt mode and reasoning level. Each track has its own leaderboard, per-game pages, and coverage view. Rankings are never mixed across tracks.