Methodology

This site publishes rankings from code-generation benchmarks. Game identities and rules are not disclosed. Use the track and game links in the nav to switch context.

Scoring

Confidence

Entries are classified by games_played and score uncertainty into one of three confidence tiers: under_tested, provisional, or stable. Overnight runs add matches to improve coverage and reduce uncertainty.
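The tier assignment can be sketched as a simple threshold rule on coverage. This is a simplified sketch: the actual cutoffs are not disclosed, the numbers below are hypothetical, and the real classifier also weighs score uncertainty.

```python
def confidence_tier(games_played: int) -> str:
    """Classify an entry by match coverage (illustrative thresholds only)."""
    if games_played < 20:    # hypothetical cutoff
        return "under_tested"
    if games_played < 100:   # hypothetical cutoff
        return "provisional"
    return "stable"

print(confidence_tier(5))    # low coverage
print(confidence_tier(150))  # high coverage
```

As overnight runs add matches, an entry's games_played grows and it migrates toward the stable tier.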

Model families

The model-family table groups dated entries by model. Avg score is the mean overall score across that family's entries in the track. Min–Max is the range (lowest–highest overall score) across those entries; when a family has one entry, a single value is shown.
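The family aggregation described above can be sketched as follows; the entry structure and field name "overall" are assumptions, not the site's actual schema.

```python
from statistics import mean

def family_row(entries: list[dict]) -> dict:
    """Summarize one model family's dated entries for the track table.

    `entries` is assumed to look like [{"overall": float, ...}, ...].
    """
    scores = [e["overall"] for e in entries]
    lo, hi = min(scores), max(scores)
    return {
        "avg": mean(scores),
        # a single value is shown when the family has only one entry
        "range": f"{lo:.1f}" if lo == hi else f"{lo:.1f}–{hi:.1f}",
    }

print(family_row([{"overall": 61.0}, {"overall": 74.5}]))
print(family_row([{"overall": 70.0}]))
```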

Tracks

There are six separate tracks, one per combination of prompt mode and reasoning level. Each track has its own leaderboard, per-game pages, and coverage view; rankings are never mixed across tracks.
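The six-track split follows from crossing the two dimensions. A minimal sketch, assuming a 2×3 split with hypothetical mode and level names (the site does not list them here):

```python
from itertools import product

PROMPT_MODES = ["minimal", "detailed"]        # hypothetical names
REASONING_LEVELS = ["low", "medium", "high"]  # hypothetical names

# One track key per (mode, level) pair; each gets its own leaderboard.
TRACKS = [f"{m}/{r}" for m, r in product(PROMPT_MODES, REASONING_LEVELS)]
assert len(TRACKS) == 6  # rankings are never mixed across these
print(TRACKS)
```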