Benchmark mode runs the whole fixture suite across a roster and scores each reviewer against ground-truth labels. It answers the question that matters when choosing a roster: which reviewers actually find the planted bugs without crying wolf?
Precision, recall, F1
A finding matches ground truth when the file matches and the line is within tolerance
(|line diff| ≤ 3 by default). From the true positives (tp), false positives (fp), and false
negatives (fn):
- Precision = tp / (tp + fp) — of what it flagged, how much was real (low = noisy).
- Recall = tp / (tp + fn) — of the real bugs, how many it caught (low = misses bugs).
- F1 = 2·P·R / (P + R) — the harmonic mean; the headline ranking axis.
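The matching rule and the three formulas can be sketched in a few lines of Python. This is a minimal illustration, not the code in scripts/benchmark.py: the `Finding` tuple, the `match_findings` helper, the one-to-one greedy matching, and the example file paths are all assumptions made for the example.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    file: str
    line: int


def match_findings(reported, truth, tolerance=3):
    """Return (tp, fp, fn) using file equality and |line diff| <= tolerance.

    Assumption: each ground-truth label can be matched by at most one finding.
    """
    unmatched = list(truth)
    tp = 0
    for finding in reported:
        hit = next(
            (t for t in unmatched
             if t.file == finding.file and abs(t.line - finding.line) <= tolerance),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp = len(reported) - tp
    fn = len(unmatched)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1.

    Assumption: undefined ratios (0/0) fall back to 0.0 in this sketch; the
    clean-baseline fixture is a special case covered under Fixtures & scoring.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical data: one planted bug found two lines off, one missed, one false alarm.
truth = [Finding("app/db.py", 42), Finding("app/refund.py", 10)]
reported = [Finding("app/db.py", 44), Finding("app/views.py", 7)]

tp, fp, fn = match_findings(reported, truth)
print(tp, fp, fn)                       # 1 1 1
print(precision_recall_f1(tp, fp, fn))  # (0.5, 0.5, 0.5)
```

The finding at line 44 counts as a true positive because it is within three lines of the label at line 42; a finding four or more lines away would count as both a false positive and a false negative.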
Fixtures & scoring
The current suite has four diff-based fixtures. The clean-baseline fixture is a false-positive control: it contains no real bug, so the correct output is zero findings.
| Fixture | Expected |
|---|---|
| sql-injection | Planted SQL injection. |
| race-refund | Concurrency / race on a refund path. |
| secrets-leak | Leaked secret / credential. |
| clean-baseline | 0 findings (false-positive control). |
A clean-baseline run with no findings has tp = fp = fn = 0, which leaves all three formulas undefined (0/0), so it is scored P = R = F1 = 1.0. This applies only to successful, parseable runs: a reviewer that crashes or returns garbage does not get a free perfect score.
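A minimal sketch of this rule, assuming a hypothetical `score_run` function with an `ok` flag meaning "the call succeeded and its output parsed". Scoring a failed run as 0.0 is an assumption of the sketch; the text above only says such a run does not get a perfect score.

```python
def score_run(tp, fp, fn, ok=True):
    """Return (precision, recall, f1) for one reviewer call.

    `ok` is a hypothetical flag: True only if the call succeeded and its
    output could be parsed.
    """
    if not ok:
        # Assumption: a crashed or unparseable run scores zero everywhere.
        return 0.0, 0.0, 0.0
    if tp == fp == fn == 0:
        # Nothing to find and nothing reported: scored as perfect.
        return 1.0, 1.0, 1.0
    # Assumption: any remaining undefined ratio (0/0) falls back to 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


print(score_run(0, 0, 0))            # (1.0, 1.0, 1.0)  clean baseline, no findings
print(score_run(0, 2, 0))            # (0.0, 0.0, 0.0)  two false alarms on the clean baseline
print(score_run(0, 0, 0, ok=False))  # (0.0, 0.0, 0.0)  failed run gets no free score
```

Checking `ok` before the all-zero case is what keeps an empty result from a crashed call from being mistaken for a correct "no findings" answer.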
Reference leaderboard
F1 score by reviewer, from the reference benchmark run.
| Rank | Reviewer | F1 | Precision | Recall | Avg call (s) |
|---|---|---|---|---|---|
| 1 | opencode | 0.811 | 0.896 | 0.754 | 48 |
| 2 | qwen-3.6-plus | 0.761 | 1.000 | 0.650 | 76 |
| 3 | glm-5.x | 0.697 | 0.772 | 0.725 | 27 |
| 4 | gemini-or (Flash) | 0.681 | 0.736 | 0.639 | 2 |
| 5 | minimax | 0.674 | 0.875 | 0.588 | 29 |
| 6 | mimo-v2-pro | 0.652 | 0.736 | 0.600 | 49 |
| 7 | codex | 0.581 | 0.688 | 0.754 | 60 |
| 8 | deepseek | 0.572 | 0.778 | 0.494 | 6 |
| 9 | grok-4.20 | 0.557 | 0.592 | 0.533 | 2 |
| 10 | hermes-4.3 | 0.551 | 0.646 | 0.653 | 13 |
| 11 | kimi-k2.6 | 0.505 | 0.729 | 0.575 | 83 |
Footnote: This is a historical run (4 fixtures × 3 runs = 12 calls per reviewer, ~$0.42 total). Reviewer names have since been version-bumped (e.g. glm-5.x → glm-5.2, minimax → minimax-m3, deepseek → deepseek-v4-pro); the numbers are shown as recorded. With only four fixtures the leaderboard is noisy; treat it as directional.
Running a benchmark
# Smoke test one fixture first (catches config bugs in ~30s):
python scripts/benchmark.py --runs 1 --fixtures sql-injection --profile panel
# Full run across the suite:
python scripts/benchmark.py --runs 3 --profile panel
# Or name the reviewers explicitly instead of using a profile:
python scripts/benchmark.py --runs 3 --roster "glm-5.2,minimax-m3,opencode"
Average call time is a cost / quality trade-off column, not a ranking axis: a fast, cheap frontier reviewer wins over a slow, expensive one when F1 is comparable.
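One way to read this rule is as a tie-break: compare F1 first, then prefer the faster reviewer among those whose F1 is within a small margin of the best. The sketch below applies that idea to three rows of the leaderboard above; the `pick` helper and the 0.02 margin are assumptions for illustration, not part of the benchmark script.

```python
def pick(candidates, margin=0.02):
    """Pick the fastest reviewer whose F1 is within `margin` of the best F1.

    candidates: list of (name, f1, avg_call_seconds) tuples.
    `margin` is an assumed threshold for "comparable" F1.
    """
    best_f1 = max(f1 for _, f1, _ in candidates)
    comparable = [c for c in candidates if best_f1 - c[1] <= margin]
    return min(comparable, key=lambda c: c[2])


# Rows copied from the reference leaderboard: (reviewer, F1, avg call in seconds).
rows = [
    ("glm-5.x", 0.697, 27),
    ("gemini-or (Flash)", 0.681, 2),
    ("minimax", 0.674, 29),
]
print(pick(rows)[0])  # gemini-or (Flash)
```

Here gemini-or (Flash) trails glm-5.x by 0.016 F1 but answers in about 2 s instead of 27 s, so it wins under this margin; with `margin=0.0` the choice falls back to pure F1 ranking.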