Argus multi-model code review

Benchmark mode runs the whole fixture suite across a roster and scores each reviewer against ground-truth labels. It answers the question that matters when choosing a roster: which reviewers actually find the planted bugs without crying wolf?

Precision, recall, F1

A finding matches ground truth when the file matches and the reported line is within tolerance of the labeled line (|line diff| ≤ 3 by default). The counts of true positives (tp), false positives (fp), and false negatives (fn) then give:

  • Precision = tp / (tp + fp) — of what it flagged, how much was real (low = noisy).
  • Recall = tp / (tp + fn) — of the real bugs, how many it caught (low = misses bugs).
  • F1 = 2·P·R / (P + R) — the harmonic mean; the headline ranking axis.
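
The matching rule and the three formulas can be sketched in a few lines of Python. This is a minimal illustration, not Argus's actual implementation: the `Finding` type, the function names, the greedy one-to-one matching, and the 0.0 fallback for empty denominators are all assumptions. Only the file check, the ±3 line tolerance, and the formulas come from the text above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Finding:
    """Hypothetical record: a file path plus a 1-based line number."""
    file: str
    line: int


def matches(found: Finding, truth: Finding, tolerance: int = 3) -> bool:
    # Same file, and the reported line is within `tolerance` of the label.
    return found.file == truth.file and abs(found.line - truth.line) <= tolerance


def count_matches(findings: List[Finding], truths: List[Finding],
                  tolerance: int = 3) -> Tuple[int, int, int]:
    """Return (tp, fp, fn). Assumes each label can be matched at most once."""
    unmatched = list(truths)
    tp = fp = 0
    for f in findings:
        hit = next((t for t in unmatched if matches(f, t, tolerance)), None)
        if hit is None:
            fp += 1          # flagged something that is not a labeled bug
        else:
            unmatched.remove(hit)
            tp += 1          # flagged a labeled bug
    return tp, fp, len(unmatched)  # labels left over are misses (fn)


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    # Assumption: an empty denominator falls back to 0.0 in this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: two labeled bugs, three reported findings.
truths = [Finding("db.py", 42), Finding("pay.py", 10)]
findings = [Finding("db.py", 44), Finding("db.py", 90), Finding("util.py", 10)]

counts = count_matches(findings, truths)   # (1, 2, 1)
precision, recall, f1 = prf(*counts)       # 1/3, 1/2, and F1 of 0.4 up to float rounding
```

Only `db.py:44` lands within three lines of a label in the same file, so the reviewer gets one true positive, two false positives, and one miss.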

Fixtures & scoring

The current suite has four diff-based fixtures. The clean-baseline is a false-positive control: it contains no real bug, so the correct output is zero findings.

Fixture          Expected
sql-injection    Planted SQL injection.
race-refund      Concurrency / race on a refund path.
secrets-leak     Leaked secret / credential.
clean-baseline   0 findings; false-positive control.

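On the clean baseline a correct reviewer produces tp = fp = fn = 0, which makes all three formulas 0/0, so the score is set by convention: a clean baseline with no findings scores P = R = F1 = 1.0, but only for successful, parseable runs (a reviewer that crashes or returns garbage does not get a free perfect score).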
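
The clean-baseline rule amounts to special-casing the all-zero count, gated on the run having succeeded. Below is a sketch of that guard. The `run_ok` flag (true only when the reviewer call succeeded and its output parsed) and the function name are hypothetical. The source does not say what a failed run scores instead, so returning `None` here is an assumption, as is the 0.0 fallback for a single empty denominator.

```python
from typing import Optional, Tuple

Score = Tuple[float, float, float]  # (precision, recall, f1)


def score_counts(tp: int, fp: int, fn: int, run_ok: bool) -> Optional[Score]:
    if not run_ok:
        # Assumption: a crashed or unparseable run gets no score at all,
        # so it can never inherit the perfect clean-baseline result.
        return None
    if tp == fp == fn == 0:
        # Nothing to find and nothing flagged: perfect by convention.
        return (1.0, 1.0, 1.0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return (precision, recall, f1)


clean_and_quiet = score_counts(0, 0, 0, run_ok=True)    # (1.0, 1.0, 1.0)
clean_but_noisy = score_counts(0, 2, 0, run_ok=True)    # (0.0, 0.0, 0.0)
crashed = score_counts(0, 0, 0, run_ok=False)           # None
```

The second call shows why the fixture works as a control: any finding on the clean baseline is a false positive and drops the score from perfect to zero.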

Reference leaderboard

F1 score by reviewer, from the reference benchmark run.

Rank  Reviewer            F1     Precision  Recall  Avg call (s)
1     opencode            0.811  0.896      0.754   48
2     qwen-3.6-plus       0.761  1.000      0.650   76
3     glm-5.x             0.697  0.772      0.725   27
4     gemini-or (Flash)   0.681  0.736      0.639   2
5     minimax             0.674  0.875      0.588   29
6     mimo-v2-pro         0.652  0.736      0.600   49
7     codex               0.581  0.688      0.754   60
8     deepseek            0.572  0.778      0.494   6
9     grok-4.20           0.557  0.592      0.533   2
10    hermes-4.3          0.551  0.646      0.653   13
11    kimi-k2.6           0.505  0.729      0.575   83

Footnote: This is a historical run (4 fixtures × 3 runs = 12 calls per reviewer, ~$0.42 total). Reviewer names have since been version-bumped (e.g. glm-5.x → glm-5.2, minimax → minimax-m3, deepseek → deepseek-v4-pro); the numbers are shown as recorded. With only four fixtures the leaderboard is noisy, so treat it as directional.
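
The F1 column is not the harmonic mean of the Precision and Recall columns on the same row: for codex, 2·0.688·0.754 / (0.688 + 0.754) ≈ 0.72, while the table records 0.581. One explanation consistent with this, though not confirmed by the source, is that each column is averaged over the 12 calls independently. The sketch below uses made-up per-call scores (not data from the reference run) to show how a mean of per-call F1 values can fall below the F1 computed from mean precision and mean recall.

```python
from statistics import mean


def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical per-call (precision, recall) pairs for one reviewer.
calls = [(1.0, 0.5), (0.5, 1.0), (0.0, 0.0), (1.0, 1.0)]

mean_precision = mean(p for p, _ in calls)       # 0.625
mean_recall = mean(r for _, r in calls)          # 0.625

mean_of_f1 = mean(f1(p, r) for p, r in calls)    # about 0.583
f1_of_means = f1(mean_precision, mean_recall)    # 0.625
```

Under that assumption, a reviewer with uneven calls is penalised in the F1 column more than its averaged precision and recall would suggest.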

Running a benchmark

# Smoke test one fixture first (catches config bugs in ~30s):
python scripts/benchmark.py --runs 1 --fixtures sql-injection --profile panel

# Full run across the suite:
python scripts/benchmark.py --runs 3 --profile panel
python scripts/benchmark.py --runs 3 --roster "glm-5.2,minimax-m3,opencode"

Average call time is a cost / quality trade-off column, not a ranking axis: a fast, cheap frontier reviewer wins over a slow, expensive one when F1 is comparable.
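
One way to apply that rule when picking between two reviewers, sketched with three rows from the reference leaderboard. The `choose` helper and the 0.02 F1 tolerance used to define "comparable" are illustrative assumptions; the source gives no threshold.

```python
from typing import NamedTuple


class Row(NamedTuple):
    reviewer: str
    f1: float
    avg_call_s: float


def choose(a: Row, b: Row, f1_tolerance: float = 0.02) -> Row:
    """Prefer higher F1, unless the two are within the tolerance,
    in which case prefer the faster reviewer."""
    if abs(a.f1 - b.f1) <= f1_tolerance:
        return min(a, b, key=lambda row: row.avg_call_s)
    return max(a, b, key=lambda row: row.f1)


# Values copied from the reference leaderboard.
glm = Row("glm-5.x", 0.697, 27)
gemini = Row("gemini-or (Flash)", 0.681, 2)
opencode = Row("opencode", 0.811, 48)

close_call = choose(glm, gemini)       # gemini-or (Flash): F1 within 0.02, much faster
clear_gap = choose(opencode, gemini)   # opencode: F1 gap is too large to trade away
```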