Benchmarks · Argus

Benchmark mode runs the whole fixture suite across a roster and scores each reviewer against ground-truth labels. It answers the question that matters when choosing a roster: which reviewers actually find the planted bugs without crying wolf?

Precision, recall, F1

A finding matches ground truth when the file matches and the line is within tolerance (|line diff| ≤ 3 by default). From the true positives (tp), false positives (fp), and false negatives (fn):

Precision = tp / (tp + fp) — of what it flagged, how much was real (low = noisy).
Recall = tp / (tp + fn) — of the real bugs, how many it caught (low = misses bugs).
F1 = 2·P·R / (P + R) — the harmonic mean; the headline ranking axis.
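
The matching rule and the three formulas can be sketched in Python as follows. This is a minimal illustration, not Argus's actual implementation: the `Finding` type, the function names, the one-to-one greedy matching, and the 0.0 fallback for empty denominators are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape for both reviewer findings and ground-truth labels."""
    file: str
    line: int


def count_matches(findings, truth, tolerance=3):
    """Return (tp, fp, fn) under the file + line-tolerance rule.

    Assumption: each ground-truth label can be claimed by at most one finding.
    """
    unmatched = list(truth)
    tp = 0
    for f in findings:
        hit = next(
            (t for t in unmatched
             if t.file == f.file and abs(t.line - f.line) <= tolerance),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    return tp, len(findings) - tp, len(unmatched)


def precision_recall_f1(tp, fp, fn):
    """Apply the formulas; empty denominators fall back to 0.0 (an assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = [Finding("app/db.py", 42)]
findings = [
    Finding("app/db.py", 44),     # within 3 lines of the label: true positive
    Finding("app/db.py", 90),     # right file, too far away: false positive
    Finding("app/views.py", 42),  # wrong file: false positive
]
tp, fp, fn = count_matches(findings, truth)
print(tp, fp, fn)  # 1 2 0
print(precision_recall_f1(tp, fp, fn))
```

Here one of three findings is real and the single planted bug is caught, so precision is 1/3, recall is 1.0, and F1 works out to 0.5. The case where both the findings and the ground truth are empty is governed by the clean-baseline rule rather than by this fallback.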

Fixtures & scoring

The current suite has four diff-based fixtures. The `clean-baseline` fixture is a false-positive control: it contains no real bug, so the correct output is zero findings.

Fixture	Expected
`sql-injection`	Planted SQL injection.
`race-refund`	Concurrency / race on a refund path.
`secrets-leak`	Leaked secret / credential.
`clean-baseline`	0 findings — false-positive control.

A clean baseline with no findings scores P = R = F1 = 1.0. With tp = fp = fn = 0 the formulas are 0/0, so this score is assigned by convention, and only for successful, parseable runs: a reviewer that crashes or returns garbage does not get a free perfect score.
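
One way to encode this rule is to gate the perfect score on run status. The sketch below is illustrative only: `score_run`, the `run_ok` flag, and returning `None` for failed runs are hypothetical. The source says only that a failed run does not earn the perfect score, not what it scores instead.

```python
def score_run(run_ok, tp, fp, fn):
    """Return (precision, recall, f1), or None for a failed run.

    run_ok is a hypothetical flag: True only if the reviewer call succeeded
    and its output parsed.
    """
    if not run_ok:
        # Crashed or unparseable: no score here. How Argus actually records
        # such runs is not stated in the source.
        return None
    if tp == fp == fn == 0:
        # Nothing to find and nothing flagged: the formulas are 0/0,
        # so the perfect score is assigned by convention.
        return 1.0, 1.0, 1.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


print(score_run(True, 0, 0, 0))   # (1.0, 1.0, 1.0)
print(score_run(False, 0, 0, 0))  # None
print(score_run(True, 0, 2, 0))   # (0.0, 0.0, 0.0)
```

The last call models a reviewer that raises two false alarms on the clean fixture: the run succeeds, but both findings are false positives, so it scores zero.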

Reference leaderboard

F1 score by reviewer, from the reference benchmark run.

Rank	Reviewer	F1	Precision	Recall	Avg call (s)
1	opencode	0.811	0.896	0.754	48
2	qwen-3.6-plus	0.761	1.000	0.650	76
3	glm-5.x	0.697	0.772	0.725	27
4	gemini-or (Flash)	0.681	0.736	0.639	2
5	minimax	0.674	0.875	0.588	29
6	mimo-v2-pro	0.652	0.736	0.600	49
7	codex	0.581	0.688	0.754	60
8	deepseek	0.572	0.778	0.494	6
9	grok-4.20	0.557	0.592	0.533	2
10	hermes-4.3	0.551	0.646	0.653	13
11	kimi-k2.6	0.505	0.729	0.575	83

Footnote: This is a historical run (4 fixtures × 3 runs = 12 calls per reviewer, ~$0.42 total). Reviewer names have since been version-bumped (e.g. glm-5.x → glm-5.2, minimax → minimax-m3, deepseek → deepseek-v4-pro); the numbers are shown as-recorded. With only four fixtures the leaderboard is noisy — treat it as directional.
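
The F1 column is not the harmonic mean of the Precision and Recall columns in the same row: for codex, 2·0.688·0.754 / (0.688 + 0.754) ≈ 0.719, against a recorded 0.581. One possible explanation, which the source does not confirm, is that each metric is averaged over the 12 calls independently. The sketch below uses made-up per-run scores (not Argus data) to show how the mean of per-run F1 can fall below the F1 of the averaged precision and recall.

```python
from statistics import mean


def f1(p, r):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Made-up (precision, recall) pairs for two runs; not Argus data.
runs = [(1.0, 0.2), (0.2, 1.0)]

mean_p = mean(p for p, _ in runs)
mean_r = mean(r for _, r in runs)
mean_of_f1 = mean(f1(p, r) for p, r in runs)  # average the per-run F1 scores
f1_of_means = f1(mean_p, mean_r)              # F1 of the averaged P and R

print(round(mean_of_f1, 3))   # 0.333
print(round(f1_of_means, 3))  # 0.6
```

If the leaderboard is built this way, rows where precision and recall swing in opposite directions between runs would show the largest gap.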

Running a benchmark

# Smoke test one fixture first (catches config bugs in ~30s):
python scripts/benchmark.py --runs 1 --fixtures sql-injection --profile panel

# Full run across the suite, using a profile:
python scripts/benchmark.py --runs 3 --profile panel

# Full run across the suite, with an explicit roster:
python scripts/benchmark.py --runs 3 --roster "glm-5.2,minimax-m3,opencode"

Average call time (the "Avg call (s)" leaderboard column) is a cost / quality trade-off, not a ranking axis: when F1 is comparable, a fast, cheap frontier reviewer wins over a slow, expensive one.
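
As a rough sketch of that trade-off, the helper below picks the fastest reviewer among those whose F1 is close to the best. `pick_reviewer` and the 0.02 tolerance are hypothetical: the source does not define how close "comparable" is.

```python
def pick_reviewer(rows, f1_tolerance=0.02):
    """Among reviewers within f1_tolerance of the best F1, pick the fastest.

    rows: (name, f1, avg_call_seconds) tuples. The tolerance is a hypothetical
    definition of "comparable"; the source does not give a threshold.
    """
    best_f1 = max(f1 for _, f1, _ in rows)
    comparable = [row for row in rows if best_f1 - row[1] <= f1_tolerance]
    return min(comparable, key=lambda row: row[2])[0]


# Three adjacent rows from the reference leaderboard.
rows = [
    ("glm-5.x", 0.697, 27),
    ("gemini-or (Flash)", 0.681, 2),
    ("minimax", 0.674, 29),
]
print(pick_reviewer(rows))  # gemini-or (Flash)
```

With these rows, gemini-or (Flash) trails glm-5.x by 0.016 F1 but answers in 2 s instead of 27 s, so it wins under this tolerance. A tolerance of 0 falls back to ranking on F1 alone.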