Argus multi-model code review

Argus is a pipeline: resolve a roster, estimate cost, dispatch reviewers in parallel, then merge their findings into one review. The quality of the output comes from two ideas working together — a confidence threshold and a corroboration boost.

Pipeline overview

flowchart LR D[Diff / PR / files] --> R[Resolve roster
profile + host rules] R --> C[Estimate cost
gate check] C --> P[Dispatch reviewers
in parallel] P --> J[reviews/*.json
strict schema] J --> M[Merge:
filter + corroborate + cluster] M --> O[merged.md + metrics.json] P --> H[(history.db)] M --> H

Dispatch pattern

The default dispatch is subagent-per-reviewer, capped at 4 concurrent; the rest queue and launch as slots free up. Each subagent runs dispatch.py synchronously for exactly one reviewer and writes that reviewer’s JSON. The legacy / quick path is a single-process dispatch.py --roster a,b,c,d for rosters of ≤4 reviewers. Either way, merge.py runs exactly once at the end.

sequenceDiagram participant O as Orchestrator participant Q as Queue (cap 4) participant S as Subagent (per reviewer) participant F as reviews/*.json O->>Q: enqueue all reviewers loop up to 4 at a time Q->>S: launch dispatch.py (one reviewer, sync) S-->>F: write <reviewer>.json S-->>Q: slot freed end Q-->>O: all reviewers done O->>O: merge.py (runs once)
Subagents must run synchronously

A subagent must run dispatch.py to completion before its session ends — never run_in_background inside the subagent. Otherwise the session ends first and the review JSON is never written.

Merge & corroboration

The merge step is where a noisy panel becomes a clean review:

  • Confidence threshold — drop findings with effective confidence below 80.
  • Corroboration boost — add +15 confidence (capped at 100) when ≥2 reviewers flag the same file within ±3 lines.
  • Anchor-based clustering — ±3-line proximity with dual tolerance (anchor and cluster-max) to avoid chain drift; the reported line is the cluster median and the worst severity wins.
flowchart TD A[All findings from all reviewers] --> B[Cluster by file + anchor +/-3 lines] B --> C{>=2 reviewers in cluster?} C -- yes --> D[+15 confidence boost, cap 100] C -- no --> E[Keep base confidence] D --> F{Effective confidence >= 80?} E --> F F -- no --> G[Drop finding] F -- yes --> H[Emit: median line, worst severity] H --> I[merged.md + metrics.json]

Finding schema & extraction

Every reviewer returns strict-schema JSON findings:

{ "file": "...", "line": 0, "severity": "...", "category": "...",
  "description": "...", "confidence": 0 }

A tolerant extractor handles fenced code blocks, <think> reasoning prefixes (Qwen / DeepSeek-R1 style), and braces inside string values. Other safeguards: a context-window pre-check skips a reviewer if the prompt exceeds 70% of its context; an OpenRouter reasoning-exclude patch avoids providers returning {content:null, reasoning:...}; and failure isolation means one broken reviewer never kills the run (timeouts kill the whole process tree).

Host-CLI awareness

Argus inspects the environment and parent process tree to detect its host CLI, then applies per-host rules. The rule is simple: never ask a host CLI to review its own invocation, so the matching reviewer is always skipped. The add column applies to profile rosters only, not explicit --custom lists.

HostSkipAdd
claudeclaude
codexcodexclaude
geminigeminiclaude
opencodeopencodeclaude
unknown

history.db schema

Every run is logged to a local SQLite database (history.db) for stats and benchmarks.

erDiagram RUNS ||--o{ REVIEWER_RUNS : has REVIEWER_RUNS ||--o{ FINDINGS : produces BENCHMARKS ||--o{ REVIEWER_RUNS : scores RUNS { int id PK text timestamp text mode text roster real est_cost } REVIEWER_RUNS { int id PK int run_id FK text reviewer text route real wall_sec text status } FINDINGS { int id PK int reviewer_run_id FK text file int line text severity int confidence } BENCHMARKS { int id PK text timestamp text fixture real f1 real precision real recall }