About this benchmark

Methodology

Models submit code to play games we keep private during evaluation, which limits memorization and keeps the task honest. This head-to-head evaluation harness uses novel, non-public games specifically to resist dataset leakage, overfitting, and vendor-style benchmark gaming. Below: how scores are computed, how rows are aggregated by model, why reasoning levels can look flat for some models while per-model settings are still being tuned, and why prompting and harness limits can still produce draws or high uncertainty.

Leaderboards not frozen yet

This deployment is not a published cut: leaderboards and scores are not frozen and may change. For a stable reference or citation, use an official published deployment where the site header shows Published.

What kinds of games

Games in active use today are mostly abstract strategy titles—structured, typically complete-information games that fit well with code-generation and match-play evaluation.

We plan to broaden the benchmark over time, adding a wider variety of game types as the pipeline matures, so future releases exercise models across more diverse rules and mechanics.

How scores are computed

  • Per-game: For each game, DuelLab computes Elo from match results, then derives a conservative score as rating − uncertainty. Scores are published as relative per-game scores (0–100) within that game's pool. The uncertainty column is a separate 0–100 index derived from each entrant's raw Elo uncertainty on a fixed scale (higher means less statistical certainty), so it does not depend on who else played that game. For code matches, the default policy is fault-aware: clearly symmetric move-limit stalls count as draws, and all other fault outcomes are excluded from Elo (see the sketch after this list).
  • Overall: Each leaderboard row's overall score is the mean of its relative per-game scores across games where it has at least one match. The Overall model table does not include an uncertainty column (it stays focused on rank, average score, raw Elo, per-reasoning breakdown, and entries); for the per-game Uncertainty index, use each game's leaderboard on a reasoning track.
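
To make that arithmetic concrete, here is a minimal Python sketch of the pipeline. The Elo fit itself is omitted: `elo` and `sigma` stand in for a fitted rating and its raw uncertainty, and both `UNCERTAINTY_SCALE` and the min-max normalization are illustrative assumptions, since the text above specifies only rating − uncertainty, a fixed scale, and a relative 0–100 score.

```python
# Minimal sketch of the per-game scoring described above; the Elo fit itself
# is omitted. UNCERTAINTY_SCALE and min-max normalization are assumptions.

UNCERTAINTY_SCALE = 400.0  # hypothetical fixed scale for the 0-100 index


def conservative(elo: float, sigma: float) -> float:
    """Conservative score: rating minus uncertainty."""
    return elo - sigma


def relative_scores(pool: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Map each entrant's conservative score to 0-100 within one game's pool
    (assumed min-max scaling, so it depends on who else played that game)."""
    cons = {name: conservative(e, s) for name, (e, s) in pool.items()}
    lo, hi = min(cons.values()), max(cons.values())
    span = (hi - lo) or 1.0
    return {name: 100.0 * (c - lo) / span for name, c in cons.items()}


def uncertainty_index(sigma: float) -> float:
    """Separate 0-100 index from raw Elo uncertainty alone (pool-independent);
    higher means less statistical certainty."""
    return min(100.0, 100.0 * sigma / UNCERTAINTY_SCALE)


def overall_score(per_game_scores: list[float]) -> float:
    """Overall row score: mean of relative per-game scores across games
    where the row has at least one match."""
    return sum(per_game_scores) / len(per_game_scores)
```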

Why this benchmark can move quickly

A main advantage of this benchmark format is speed. Once a model has produced code, evaluation only needs to play out matches between programs: there is no judge model and no human review loop deciding who performed better.

It can also produce a strong relative signal from a small number of prompts or generated entrants, because writing code that plays abstract games well is itself a demanding test of intelligence.

How model rows are grouped

The model table groups evaluated rows by model. Avg score is the mean overall score across that model's rows at a given reasoning level. Min–Max is the range from the lowest to the highest overall score across those rows. When a model has only one such row, a single value is shown instead of a range.
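
A small sketch of that grouping, under an assumed row shape (the field names and values here are hypothetical, not the site's schema):

```python
from statistics import mean

# Hypothetical row shape: each evaluated row carries a model name, a
# reasoning level, and an overall score (field names are assumptions).
rows = [
    {"model": "model-a", "reasoning": "highest", "overall": 71.4},
    {"model": "model-a", "reasoning": "highest", "overall": 64.9},
    {"model": "model-b", "reasoning": "highest", "overall": 58.2},
]


def model_cells(rows, model, level):
    """Avg score and Min-Max range for one model at one reasoning level."""
    scores = [r["overall"] for r in rows
              if r["model"] == model and r["reasoning"] == level]
    if not scores:
        return None
    avg = mean(scores)
    if len(scores) == 1:  # a single row shows one value, not a range
        return avg, f"{scores[0]:.1f}"
    return avg, f"{min(scores):.1f}-{max(scores):.1f}"
```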

Official per-game pages list individual entrants directly. The model table's Entries column is still hierarchical: on a reasoning track, it is the sum of contributing entrant rows for that model across the suite (on Mixed, each displayed row counts once per game it appears in). Mixed per-game pages also keep one entrant per row and add a Reasoning column for that entrant's preset. On Overall, Entries is the sum of the three official track totals (highest, medium, none), while the overall score remains the mean of up to three per-track overall scores, unchanged.
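
The Entries arithmetic in the same hypothetical terms, as a sketch rather than the site's actual bookkeeping:

```python
# Entries on one reasoning track: sum of contributing entrant rows for the
# model across the suite. On Mixed, each displayed row counts once per game
# it appears in, so the same sum applies over Mixed per-game rows.
def track_entries(entrant_rows_per_game: dict[str, int]) -> int:
    return sum(entrant_rows_per_game.values())


# Entries on Overall: sum of the three official track totals. The overall
# score is unaffected; it stays the mean of up to three per-track scores.
def overall_entries(highest: int, medium: int, none: int) -> int:
    return highest + medium + none
```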

A Mixed (cross-reasoning) view, when present, is a separate leaderboard: each row is one model at one reasoning-effort preset, matched against every other model-and-preset variant (all-vs-all cross-reasoning). Those matches use the same scoring rules, but they do not feed the single-effort boards or the Overall aggregate.
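
The all-vs-all schedule over model-and-preset variants could be generated like this (illustrative names; a sketch of the pairing only, not the actual matchmaker):

```python
from itertools import combinations

# Each Mixed row is one (model, preset) variant; every variant is matched
# against every other, including other presets of the same model.
models = ["model-a", "model-b"]
presets = ["highest", "medium", "none"]

variants = [(m, p) for m in models for p in presets]
pairings = list(combinations(variants, 2))  # all-vs-all cross-reasoning
```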

When levels look similar

For some models, scores or spreads across Highest, Medium, and None can be close together because the underlying reasoning-related controls do not differ much in practice for that provider or SKU, or because the task is already saturated at a given setting.

We are still adjusting per-model mappings and parameters tied to those levels. In a future release we plan to make the exact reasoning-related settings used for each evaluated model easier to find on the site.

Why draws and uncertainty can still be high

The benchmark's games are curated to favor titles that are typically less draw-prone under strong play, but what is measured is still model-generated programs run through a fixed prompt and execution harness. That stack is not perfect: implementation bugs, misread rules, overly safe heuristics, or repeated invalid moves can all produce long, symmetric play that ends in stalls counted as draws (see the scoring policy above), or simply many inconclusive outcomes.
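
In code, the fault-aware policy from the scoring section amounts to a small classifier over match outcomes. A minimal sketch, with hypothetical outcome labels:

```python
from enum import Enum


class Outcome(Enum):
    WIN_A = "win_a"
    WIN_B = "win_b"
    DRAW = "draw"
    SYMMETRIC_STALL = "symmetric_stall"  # both sides stall at the move limit alike
    FAULT = "fault"  # crash, invalid move, one-sided timeout, etc.


def elo_contribution(outcome: Outcome) -> float | None:
    """Score from player A's perspective, or None to exclude the match."""
    if outcome is Outcome.SYMMETRIC_STALL:
        return 0.5  # clearly symmetric move-limit stalls count as draws
    if outcome is Outcome.FAULT:
        return None  # all other fault outcomes are excluded from Elo
    return {Outcome.WIN_A: 1.0, Outcome.WIN_B: 0.0, Outcome.DRAW: 0.5}[outcome]
```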

When many matches tie or look similar from the rating system's perspective, uncertainty stays elevated even if the underlying rules are not especially drawish. Prompting and the execution harness are improved on an ongoing basis, with successive releases delivering clearer instructions, tighter validation, and fewer spurious outcomes. Residual draws and noisy uncertainty should therefore be read alongside that steady progress, not as a permanent ceiling.
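
A textbook Elo update makes the mechanism visible. This is standard Elo with an illustrative K-factor, not necessarily DuelLab's exact fitting procedure: when equally rated programs keep drawing, the update term is zero, so ratings never separate and the fit stays uncertain about their ordering.

```python
K = 32.0  # illustrative K-factor


def expected(r_a: float, r_b: float) -> float:
    """Standard logistic expected score for player A."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """Zero-sum rating update after one match (score_a: 1 win, 0.5 draw, 0 loss)."""
    delta = K * (score_a - expected(r_a, r_b))
    return r_a + delta, r_b - delta


# Two equal entrants drawing repeatedly: the update is zero every time.
ra, rb = 1500.0, 1500.0
for _ in range(50):
    ra, rb = update(ra, rb, 0.5)  # 0.5 = draw
print(ra, rb)  # still 1500.0 1500.0
```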