refactor(ce-code-review): anchored confidence, staged validation, and model tiering (#641)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -191,6 +191,21 @@ This path works with any ref — a SHA, `origin/main`, a branch name. Automated
If `mode:report-only` or `mode:headless` is active, do **not** run `gh pr checkout <number-or-url>` on the shared checkout. For `mode:report-only`, tell the caller: "mode:report-only cannot switch the shared checkout to review a PR target. Run it from an isolated worktree/checkout for that PR, or run report-only with no target argument on the already checked out branch." For `mode:headless`, emit `Review failed (headless mode). Reason: cannot switch shared checkout. Re-invoke with base:<ref> to review the current checkout, or run from an isolated worktree.` Stop here unless the review is already running in an isolated checkout.
**Skip-condition pre-check.** Before checkout or scope detection, run a PR-state probe to decide whether the review should proceed:
```
gh pr view <number-or-url> --json state,title,body,files
```
Apply skip rules in order:
- `state` is `CLOSED` or `MERGED` -> stop with message `PR is closed/merged; not reviewing.`
- **Trivial-PR judgment**: spawn a lightweight sub-agent (use `model: haiku` in Claude Code; gpt-5.4-nano or equivalent in Codex) with the PR title, body, and changed file paths. The agent's task: "Is this an automated or trivial PR that does not warrant a code review? Consider: dependency lock-file or manifest-only bumps, automated release commits, chore version increments with no substantive code changes. When in doubt, answer no — wrongly skipping a review of a substantive PR is more costly than running an unnecessary review of a trivial one." If the judgment returns yes: stop with message `PR appears to be a trivial automated PR; not reviewing. Run without a PR argument to review the current branch, or pass base:<ref> if review is intended.`
When any skip rule fires, emit the message and stop without dispatching reviewers, switching the checkout, or running scope detection. **Standalone branch mode and `base:` mode are unaffected** -- they always run the full review. **Draft PRs are reviewed normally** -- draft status is not a skip condition; early feedback on in-progress work is valuable.
If no skip rule fires, proceed to the checkout logic below.
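As an illustrative sketch of the skip-rule ordering (the `decide_skip` helper is a hypothetical name, not part of the skill; `state` comes from the `gh pr view --json state` probe and `trivial` is the sub-agent's yes/no verdict):

```python
def decide_skip(state: str, trivial: str):
    """Return the skip message to emit, or None to proceed to checkout.

    state   -- from `gh pr view <number-or-url> --json state` (OPEN/CLOSED/MERGED)
    trivial -- the trivial-PR judgment sub-agent's answer ("yes" or "no")
    """
    # Rule 1: closed/merged PRs are never reviewed. Checked first, so a
    # closed PR never even needs the trivial-PR judgment.
    if state in ("CLOSED", "MERGED"):
        return "PR is closed/merged; not reviewing."
    # Rule 2: trivial automated PRs are skipped with a recovery hint.
    if trivial == "yes":
        return ("PR appears to be a trivial automated PR; not reviewing. "
                "Run without a PR argument to review the current branch, "
                "or pass base:<ref> if review is intended.")
    return None  # no skip rule fired -- proceed to the checkout logic
```

Note that draft status never appears here: drafts are reviewed normally, per the rule above.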
First, verify the worktree is clean before switching branches:
```
@@ -385,13 +400,11 @@ Pass the resulting path list to the `project-standards` persona inside a `<stand
#### Model tiering
Three reviewers inherit the session model with no override: `ce-correctness-reviewer`, `ce-security-reviewer`, and `ce-adversarial-reviewer`. These perform the highest-stakes analysis — logic bugs, security vulnerabilities, adversarial failure scenarios — and should run at whatever capability level the user has configured. If the user is on Opus, these get Opus.
All other persona sub-agents and CE agents use the platform's mid-tier model to reduce cost and latency. In Claude Code, pass `model: "sonnet"` in the Agent tool call. On other platforms, use the equivalent mid-tier (e.g., `gpt-5.4-mini` in Codex as of April 2026). If the platform has no model override mechanism or the available model names are unknown, omit the model parameter and let agents inherit the default -- a working review on the parent model is better than a broken dispatch from an unrecognized model name.
CE always-on agents (`ce-agent-native-reviewer`, `ce-learnings-researcher`) and CE conditional agents (`ce-schema-drift-detector`, `ce-deployment-verification-agent`) also use the mid-tier model since they perform scoped, focused work.
The orchestrator (this skill) also inherits the session model; it handles intent discovery, reviewer selection, finding merge/dedup, and synthesis -- tasks that benefit from the same reasoning capability the user configured.
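As a minimal sketch of the tiering rule (the `agent_params` helper and the `KNOWN_MID_TIER` table are assumptions for illustration, not part of the skill's dispatch code):

```python
# Reviewers that always inherit the session model (highest-stakes analysis).
SESSION_TIER = {
    "ce-correctness-reviewer",
    "ce-security-reviewer",
    "ce-adversarial-reviewer",
}

# Hypothetical platform -> mid-tier model name table.
KNOWN_MID_TIER = {"claude-code": "sonnet"}

def agent_params(agent: str, platform: str) -> dict:
    """Build sub-agent dispatch params, omitting `model` when tiering can't apply."""
    params = {"agent": agent}
    if agent in SESSION_TIER:
        return params  # inherit the session model untouched
    mid = KNOWN_MID_TIER.get(platform)
    if mid is not None:
        params["model"] = mid  # fast mid-tier for scoped persona/CE work
    # Unknown platform or model names: no model key -- inherit the default
    # rather than risk a broken dispatch from an unrecognized name.
    return params
```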
#### Run ID
@@ -435,7 +448,7 @@ Each persona sub-agent writes full JSON (all schema fields) to `.context/compoun
"severity": "P0",
"file": "orders_controller.rb",
"line": 42,
"confidence": 100,
"autofix_class": "gated_auto",
"owner": "downstream-resolver",
"requires_verification": true,
@@ -458,6 +471,8 @@ Detail-tier fields (`why_it_matters`, `evidence`) are in the artifact file only.
Convert multiple reviewer compact JSON returns into one deduplicated, confidence-gated finding set. The compact returns contain merge-tier fields (title, severity, file, line, confidence, autofix_class, owner, requires_verification, pre_existing) plus the optional suggested_fix. Detail-tier fields (why_it_matters, evidence) are on disk in the per-agent artifact files and are not loaded at this stage.
`confidence` is one of 5 discrete anchors (`0`, `25`, `50`, `75`, `100`) with behavioral definitions in the findings schema. Synthesis treats anchors as integers; do not coerce to floats.
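A minimal sketch of that constraint (the helper name is illustrative):

```python
VALID_ANCHORS = {0, 25, 50, 75, 100}

def is_valid_anchor(value) -> bool:
    """True only for the five integer anchors; floats like 0.75 are rejected."""
    # type(...) is int (not isinstance) so bools don't sneak through:
    # isinstance(True, int) is True in Python.
    return type(value) is int and value in VALID_ANCHORS
```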
1. **Validate.** Check each compact return for required top-level and per-finding fields, plus value constraints. Drop malformed returns or findings. Record the drop count.
- **Top-level required:** reviewer (string), findings (array), residual_risks (array), testing_gaps (array). Drop the entire return if any are missing or wrong type.
- **Per-finding required:** title, severity, file, line, confidence, autofix_class, owner, requires_verification, pre_existing
@@ -465,25 +480,78 @@ Convert multiple reviewer compact JSON returns into one deduplicated, confidence
- severity: P0 | P1 | P2 | P3
- autofix_class: safe_auto | gated_auto | manual | advisory
- owner: review-fixer | downstream-resolver | human | release
- confidence: integer in {0, 25, 50, 75, 100}
- line: positive integer
- pre_existing, requires_verification: boolean
- Do not validate against the full schema here -- the full schema (including why_it_matters and evidence) applies to the artifact files on disk, not the compact returns.
2. **Deduplicate.** Compute fingerprint: `normalize(file) + line_bucket(line, +/-3) + normalize(title)`. When fingerprints match, merge: keep highest severity, keep highest anchor, note which reviewers flagged it. Dedup runs over the full validated set (including anchor 50) so cross-reviewer promotion in step 3 can lift matching anchor-50 findings into the actionable tier.
3. **Cross-reviewer agreement.** When 2+ independent reviewers flag the same issue (same fingerprint), promote the merged finding by one anchor step: `50 -> 75`, `75 -> 100`, `100 -> 100`. Cross-reviewer corroboration is a stronger signal than any single reviewer's anchor; the promotion routes a previously-soft finding into the actionable tier or strengthens its already-actionable position. Note the agreement in the Reviewer column of the output (e.g., "security, correctness").
4. **Separate pre-existing.** Pull out findings with `pre_existing: true` into a separate list.
5. **Resolve disagreements.** When reviewers flag the same code region but disagree on severity, autofix_class, or owner, annotate the Reviewer column with the disagreement (e.g., "security (P0), correctness (P1) -- kept P0"). This transparency helps the user understand why a finding was routed the way it was.
6. **Normalize routing.** For each merged finding, set the final `autofix_class`, `owner`, and `requires_verification`. If reviewers disagree, keep the most conservative route. Synthesis may narrow a finding from `safe_auto` to `gated_auto` or `manual`, but must not widen it without new evidence.
6b. **Tie-break the recommended action.** Interactive mode's walk-through and LFG paths present a per-finding recommended action (Apply / Defer / Skip / Acknowledge) derived from the normalized `autofix_class` and `suggested_fix`. When contributing reviewers implied different actions for the same merged finding, synthesis picks the most conservative using the order `Skip > Defer > Apply > Acknowledge`. This guarantees that identical review artifacts produce the same recommendation deterministically, so LFG results are auditable after the fact and the walk-through's recommendation is stable across re-runs. The user may still override per finding via the walk-through's options; this rule only determines what gets labeled "recommended."
6c. **Mode-aware demotion of weak general-quality findings.** Some persona output is real signal but does not warrant primary-findings attention. Reroute it to the existing soft buckets so the primary findings table stays focused on actionable issues.
A finding qualifies for demotion when **all** of these hold:
- Severity is P2 or P3 (P0 and P1 always stay in primary findings)
- `autofix_class` is `advisory` (concrete-fix findings stay in primary)
- **All** contributing reviewers are `testing` or `maintainability` — if any other persona also flagged this finding, cross-reviewer corroboration is present and the finding stays in primary findings regardless of its severity or advisory status (expand the weak-signal list later only with evidence)
When a finding qualifies, route by mode:
- **Interactive and report-only modes:** Move the finding out of the primary findings set. If the contributing reviewer is `testing`, append `<file:line> -- <title>` to `testing_gaps`. If `maintainability`, append the same to `residual_risks`. Record the demotion count for Coverage. The finding does not appear in the Stage 6 findings table. (Use title only -- the compact return omits `why_it_matters`, and report-only mode skips artifact files entirely. Soft-bucket entries are FYI items; readers who want depth can open the per-agent artifact when one exists.)
- **Headless and autofix modes:** Suppress the finding entirely. Record the suppressed count in Coverage as "mode-aware demotion suppressions" so the user can see what was filtered.
Demotion is intentionally narrow. The conservative scope (testing/maintainability + P2/P3 + advisory) is the starting point; do not expand the rule by guessing which other personas overproduce noise. If real review runs show another persona consistently emitting weak signal, expand with evidence.
7. **Confidence gate.** After dedup, promotion, and demotion have shaped the primary set, suppress remaining findings below anchor 75. Exception: P0 findings at anchor 50+ survive the gate -- critical-but-uncertain issues must not be silently dropped. Record the suppressed count by anchor (so Coverage can report "N findings suppressed at anchor 50, M at anchor 25"). The gate runs late deliberately: anchor-50 findings need a chance to be promoted by step 3 (cross-reviewer corroboration) or rerouted by step 6c (mode-aware demotion to soft buckets) before any drop decision.
8. **Partition the work.** Build three sets:
- in-skill fixer queue: only `safe_auto -> review-fixer`
- residual actionable queue: unresolved `gated_auto` or `manual` findings whose owner is `downstream-resolver`
- report-only queue: `advisory` findings plus anything owned by `human` or `release`
9. **Sort.** Order by severity (P0 first) -> anchor (descending) -> file path -> line number.
10. **Collect coverage data.** Union residual_risks and testing_gaps across reviewers.
11. **Preserve CE agent artifacts.** Keep the learnings, agent-native, schema-drift, and deployment-verification outputs alongside the merged finding set. Do not drop unstructured agent output just because it does not match the persona JSON schema.
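Steps 2, 3, and 7 can be sketched as follows (the normalization details and the bucket width are assumptions; the skill's actual fingerprinting may differ):

```python
from collections import defaultdict

# Step 3: one anchor step of promotion on cross-reviewer agreement.
PROMOTE = {50: 75, 75: 100, 100: 100}

def fingerprint(f, bucket_width=7):
    # Assumed normalization: lowercase path/title; bucket line numbers so
    # findings within roughly +/-3 lines collide.
    return (f["file"].lower(), f["line"] // bucket_width, f["title"].strip().lower())

def merge_findings(findings):
    """Step 2: dedup over the full validated set, keeping the strongest fields."""
    groups = defaultdict(list)
    for f in findings:
        groups[fingerprint(f)].append(f)
    merged = []
    for group in groups.values():
        m = {
            "file": group[0]["file"],
            "line": group[0]["line"],
            "title": group[0]["title"],
            "severity": min(g["severity"] for g in group),      # "P0" sorts before "P1"
            "confidence": max(g["confidence"] for g in group),  # keep highest anchor
            "reviewers": sorted({g["reviewer"] for g in group}),
        }
        if len(m["reviewers"]) >= 2:  # step 3: cross-reviewer promotion
            m["confidence"] = PROMOTE.get(m["confidence"], m["confidence"])
        merged.append(m)
    return merged

def confidence_gate(merged):
    """Step 7: runs late, after promotion/demotion have shaped the set."""
    return [f for f in merged
            if f["confidence"] >= 75
            or (f["severity"] == "P0" and f["confidence"] >= 50)]
```

In this sketch two reviewers flagging the same region at anchor 50 merge into one anchor-75 finding that survives the gate, which is exactly why the gate must run after promotion.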
### Stage 5b: Validation pass (externalizing modes only)
Independent verification gate. Spawn one validator sub-agent per surviving finding using `references/validator-template.md`. The validator's job is to re-check the finding against the diff and surrounding code with no commitment to the original persona's analysis. Findings the validator rejects are dropped; findings the validator confirms flow through unchanged.
**When this stage runs:**
| Mode | Runs Stage 5b? | Where |
|------|---------------|-------|
| `headless` | Yes, eagerly | Between Stage 5 and Stage 6 |
| `autofix` | Yes, eagerly | Between Stage 5 and Stage 6 |
| `interactive`, walk-through routing (option A) — per-finding phase | No -- the user is the per-finding validator | n/a |
| `interactive`, walk-through routing (option A) — LFG-the-rest handoff | Yes, on the remaining action set | Before bulk-preview dispatch (same gate as option B) |
| `interactive`, LFG routing (option B) | Yes, on the action set | Before bulk-preview dispatch |
| `interactive`, File-tickets routing (option C) | Yes, on all pending findings | Before tracker dispatch |
| `interactive`, Report-only routing (option D) | No -- nothing is being externalized | n/a |
| `report-only` | No -- read-only mode externalizes nothing | n/a |

When Stage 5b does not run, the merged finding set from Stage 5 flows through to Stage 6 unchanged. When it runs, the steps below execute on the relevant set.
**Steps:**
1. **Select findings to validate.**
- **headless/autofix:** All survivors of Stage 5.
- **interactive LFG (option B) and walk-through LFG-the-rest handoff:** The action set — findings with a recommended action of Apply or Defer. Skip and Acknowledge findings are not being externalized on this path.
- **interactive File-tickets (option C):** All pending findings regardless of recommended action. Option C externalizes every finding as a ticket, so every finding needs validation.
2. **Apply dispatch budget cap.** If the selected set exceeds 15 findings, validate the highest-severity 15 (P0 first, then P1, then P2, then P3, breaking ties by anchor descending). Drop the remainder and record the over-budget count for the Coverage section. The blunt drop is intentional; a review producing 15+ surviving findings is already in territory where a second wave would not change the user's triage approach.
3. **Spawn validators in parallel.** One sub-agent per finding, dispatched concurrently using the validator template. Each validator receives:
- The finding's title, severity, file, line, suggested_fix, original reviewer name, and confidence anchor
- `why_it_matters` when available — loaded from the per-agent artifact file at `.context/compound-engineering/ce-code-review/{run_id}/{reviewer_name}.json`; omit when the file is absent or the artifact write failed. The validator proceeds without it, using the diff and cited code directly.
- The full diff
- Read-tool access to inspect the cited code, callers, guards, framework defaults, and git blame
4. **Collect verdicts.** Each validator returns `{ "validated": true | false, "reason": "<one sentence>" }`.
- `validated: true` -> finding survives unchanged into the next phase (Stage 6 for headless/autofix, dispatch for interactive)
- `validated: false` -> finding is dropped; record the validator's reason in Coverage
- Validator failure (timeout, dispatch error, malformed JSON) -> drop the finding with reason "validator failed"; conservative bias is correct
5. **Use mid-tier model for validators.** Same model class (sonnet) the persona reviewers use. Validators are read-only — same constraints as persona reviewers. They may use non-mutating inspection commands (Read, Grep, Glob, git blame, gh).
6. **Record metrics for Coverage.** Total dispatched, validated true count, validated false count (with reasons), failures, and over-budget drops.
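The budget cap (step 2) and verdict handling (step 4) can be sketched as follows (helper and field names are assumptions for illustration):

```python
SEV_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
VALIDATION_BUDGET = 15

def select_for_validation(findings, budget=VALIDATION_BUDGET):
    """Step 2: validate highest-severity findings first, breaking ties by
    confidence anchor descending. Returns (dispatched, over-budget drops)."""
    ranked = sorted(findings, key=lambda f: (SEV_RANK[f["severity"]], -f["confidence"]))
    return ranked[:budget], ranked[budget:]

def keep_after_verdict(verdict) -> bool:
    """Step 4: only an explicit {"validated": true, ...} keeps a finding.
    Timeouts, dispatch errors, and malformed JSON all drop it -- the
    conservative bias is deliberate."""
    return isinstance(verdict, dict) and verdict.get("validated") is True
```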
**Why per-finding parallel dispatch (not batched):** Independence is the point. A single batched validator looking at all findings together pattern-matches across them and recreates the persona-bias problem. Per-finding parallel dispatch preserves fresh context per call. Per-file batching is a plausible future optimization for reviews with many findings clustered in few files; not implemented today.
### Stage 6: Synthesize and present
Assemble the final report using **pipe-delimited markdown tables for findings** from the review output template included below. The table format is mandatory for finding rows in interactive mode — do not render findings as freeform text blocks or horizontal-rule-separated prose. Other report sections (Applied Fixes, Learnings, Coverage, etc.) use bullet lists and the `---` separator before the verdict, as shown in the template.
@@ -501,7 +569,7 @@ Assemble the final report using **pipe-delimited markdown tables for findings**
8. **Agent-Native Gaps.** Surface ce-agent-native-reviewer results. Omit section if no gaps found.
9. **Schema Drift Check.** If ce-schema-drift-detector ran, summarize whether drift was found. If drift exists, list the unrelated schema objects and the required cleanup command. If clean, say so briefly.
10. **Deployment Notes.** If ce-deployment-verification-agent ran, surface the key Go/No-Go items: blocking pre-deploy checks, the most important verification queries, rollback caveats, and monitoring focus areas. Keep the checklist actionable rather than dropping it into Coverage.
11. **Coverage.** Suppressed count by anchor (e.g., "N findings suppressed at anchor 50, M at anchor 25"), mode-aware demotion count (interactive/report-only) or suppression count (headless/autofix), validator drop count and reasons (when Stage 5b ran), validator over-budget drops (when the 15-cap fired), residual risks, testing gaps, failed/timed-out reviewers, and any intent uncertainty carried by non-interactive modes.
12. **Verdict.** Ready to merge / Ready with fixes / Not ready. Fix order if applicable. When an `explicit` plan has unaddressed requirements, the verdict must reflect it — a PR that's code-clean but missing planned requirements is "Not ready" unless the omission is intentional. When an `inferred` plan has unaddressed requirements, note it in the verdict reasoning but do not block on it alone.
Do not include time estimates.
@@ -565,7 +633,11 @@ Testing gaps:
- <gap>
Coverage:
- Suppressed: <N> findings below anchor 75 (P0 at anchor 50+ retained)
- Mode-aware demotion suppressions: <N> findings suppressed (testing/maintainability advisory P2-P3)
- Validator drops: <N> findings rejected by Stage 5b validator
- <file:line> -- <reason>
- Validator over-budget drops: <N> findings exceeded the 15-cap and were not validated
- Untracked files excluded: <file1>, <file2>
- Failed reviewers: <reviewer>
@@ -642,8 +714,8 @@ After presenting findings and verdict (Stage 6), route the next steps by mode. R
- **Dispatch on selection.** Route by the option letter (A / B / C / D), not by the rendered label string. The option-C label varies by tracker-detection confidence (`File a [TRACKER] ticket per finding without applying fixes` for a named tracker, `File an issue per finding without applying fixes` as the generic fallback, or omitted entirely when no sink is available — see `references/tracker-defer.md`), and options A / B / D have a single canonical label each. The letter is the stable dispatch signal; the canonical labels below are shown for documentation only. A low-confidence run that rendered option C as the generic label routes to the same branch as a high-confidence run that rendered it with the named tracker.
- (A) `Review each finding one by one` — load `references/walkthrough.md` and enter the per-finding walk-through loop. The walk-through accumulates Apply decisions in memory; Defer decisions execute inline via `references/tracker-defer.md`; Skip / Acknowledge decisions are recorded as no-action; `LFG the rest` routes through `references/bulk-preview.md`. At end of the loop, dispatch one fixer subagent for the accumulated Apply set (Step 3). Emit the unified completion report.
- (B) `LFG. Apply the agent's best-judgment action per finding` — first run Stage 5b validation on the action set (Apply / Defer findings). Drop validator-rejected findings with their reasons recorded in Coverage. Then load `references/bulk-preview.md` scoped to every surviving pending `gated_auto` / `manual` finding. On `Proceed`, execute the plan: Apply set → Step 3 fixer dispatch; Defer set → `references/tracker-defer.md`; Skip / Acknowledge → no-op. On `Cancel`, return to this routing question. Emit the unified completion report after execution.
- (C) `File a [TRACKER] ticket per finding without applying fixes` (or the generic `File an issue per finding without applying fixes` when the named-tracker label is not used) — first run Stage 5b validation on every pending finding. Drop validator-rejected findings with their reasons recorded in Coverage. Then load `references/bulk-preview.md` with every surviving finding in the file-tickets bucket. On `Proceed`, route every finding through `references/tracker-defer.md`; no fixes are applied. On `Cancel`, return to this routing question. Emit the unified completion report.
- (D) `Report only — take no further action` — do not enter any dispatch phase. Emit the completion report, then proceed to Step 5 per its gating rule (`fixes_applied_count > 0` from earlier `safe_auto` passes). If no fixes were applied this run, stop after the report.
- The walk-through's completion report, the LFG / File-tickets completion report, and the zero-remaining completion summary all follow the unified completion-report structure documented in `references/walkthrough.md`. Use the same structure across every terminal path.
@@ -70,10 +70,9 @@
"description": "Concrete minimal fix. Omit or null if no good fix is obvious -- a bad suggestion is worse than none."
},
"confidence": {
"type": "integer",
"enum": [0, 25, 50, 75, 100],
"description": "Anchored confidence score. Use exactly one of 0, 25, 50, 75, 100. Each anchor has a behavioral criterion the reviewer must honestly self-apply. 0: Not confident. This is a false positive that does not stand up to light scrutiny, or a pre-existing issue this PR did not introduce. 25: Somewhat confident. Might be a real issue but could also be a false positive; the reviewer could not verify from the diff and surrounding code alone. 50: Moderately confident. The reviewer verified this is a real issue but it may be a nitpick, narrow edge case, or have minimal practical impact. Relative to the diff's other concerns, it is not very important. Style preferences and subjective improvements land here. 75: Highly confident. The reviewer double-checked the diff and confirmed the issue will affect users, downstream callers, or runtime behavior in normal usage. The bug, vulnerability, or contract violation is clearly present and actionable. 100: Absolutely certain. The issue is verifiable from the code itself -- compile error, type mismatch, definitive logic bug, or an explicit project-standards violation with a quotable rule. No interpretation required."
},
"evidence": {
"type": "array",
@@ -98,14 +97,20 @@
"description": "Missing test coverage the reviewer identified",
"items": { "type": "string" }
}
},

"_meta": {
"confidence_anchors": {
"description": "Confidence is one of 5 discrete anchors (0, 25, 50, 75, 100), each tied to a behavioral criterion the reviewer can honestly self-apply. Float values (e.g., 0.73) are not valid -- the model cannot meaningfully calibrate at finer granularity, and discrete anchors prevent false-precision gaming.",
"0": "False positive or pre-existing -- do not report",
"25": "Speculative; could not verify -- do not report",
"50": "Verified real but minor or stylistic -- report only when P0 or when synthesis routes to advisory/soft buckets",
"75": "Highly confident, will affect users or runtime in normal usage -- report",
"100": "Verifiable from code alone (compile error, type mismatch, definitive logic bug, quoted standards violation) -- report"
},
"confidence_thresholds": {
"suppress": "Below anchor 75 -- do not report. Exception: P0 findings at anchor 50+ may be reported (critical-but-uncertain issues must not be silently dropped).",
"report": "Anchor 75 or 100 -- include with full evidence."
},
"severity_definitions": {
"P0": "Critical breakage, exploitable vulnerability, data loss/corruption. Must fix before merge.",
@@ -21,26 +21,26 @@ Use this **exact format** when presenting synthesized review findings. Findings
| # | File | Issue | Reviewer | Confidence | Route |
|
||||
|---|------|-------|----------|------------|-------|
|
||||
| 1 | `orders_controller.rb:42` | User-supplied ID in account lookup without ownership check | security | 0.92 | `gated_auto -> downstream-resolver` |
|
||||
| 1 | `orders_controller.rb:42` | User-supplied ID in account lookup without ownership check | security | 100 | `gated_auto -> downstream-resolver` |
|
||||
|
||||
### P1 -- High
|
||||
|
||||
| # | File | Issue | Reviewer | Confidence | Route |
|
||||
|---|------|-------|----------|------------|-------|
|
||||
| 2 | `export_service.rb:87` | Loads all orders into memory -- unbounded for large accounts | performance | 0.85 | `safe_auto -> review-fixer` |
|
||||
| 3 | `export_service.rb:91` | No pagination -- response size grows linearly with order count | api-contract, performance | 0.80 | `manual -> downstream-resolver` |
|
||||
| 2 | `export_service.rb:87` | Loads all orders into memory -- unbounded for large accounts | performance | 100 | `safe_auto -> review-fixer` |
|
||||
| 3 | `export_service.rb:91` | No pagination -- response size grows linearly with order count | api-contract, performance | 75 | `manual -> downstream-resolver` |
|
||||
|
||||
### P2 -- Moderate
|
||||
|
||||
| # | File | Issue | Reviewer | Confidence | Route |
|
||||
|---|------|-------|----------|------------|-------|
|
||||
| 4 | `export_service.rb:45` | Missing error handling for CSV serialization failure | correctness | 0.75 | `safe_auto -> review-fixer` |
|
||||
| 4 | `export_service.rb:45` | Missing error handling for CSV serialization failure | correctness | 75 | `safe_auto -> review-fixer` |
|
||||
|
||||
### P3 -- Low
|
||||
|
||||
| # | File | Issue | Reviewer | Confidence | Route |
|
||||
|---|------|-------|----------|------------|-------|
|
||||
| 5 | `export_helper.rb:12` | Format detection could use early return instead of nested conditional | maintainability | 0.70 | `advisory -> human` |
|
||||
| 5 | `export_helper.rb:12` | Format detection could use early return instead of nested conditional | maintainability | 75 | `advisory -> human` |
|
||||
|
||||
### Applied Fixes
|
||||
|
||||
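The severity-grouped tables above follow a mechanical layout, so the rendering step can be sketched. This is an illustrative sketch only -- the finding field names (`id`, `file`, `line`, `title`, `reviewer`, `confidence`, `autofix_class`, `owner`) are assumed from the artifact schema elsewhere in this document, not mandated by this format spec:

```python
# Sketch: render merged findings into the severity-grouped tables above.
# Field names are assumed from the artifact schema; empty severity
# levels are omitted, per the format rules.

LABELS = {"P0": "Critical", "P1": "High", "P2": "Moderate", "P3": "Low"}

def render_findings(findings: list[dict]) -> str:
    out = []
    for sev in ("P0", "P1", "P2", "P3"):
        rows = [f for f in findings if f["severity"] == sev]
        if not rows:
            continue  # omit empty severity levels
        out.append(f"### {sev} -- {LABELS[sev]}\n")
        out.append("| # | File | Issue | Reviewer | Confidence | Route |")
        out.append("|---|------|-------|----------|------------|-------|")
        for f in rows:
            out.append(
                f"| {f['id']} | `{f['file']}:{f['line']}` | {f['title']} "
                f"| {f['reviewer']} | {f['confidence']} "
                f"| `{f['autofix_class']} -> {f['owner']}` |"
            )
        out.append("")
    return "\n".join(out)
```

Note the `Confidence` cell is emitted as the raw integer anchor -- rendering it through any float formatting would violate the format rules below.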
@@ -79,7 +79,7 @@ Use this **exact format** when presenting synthesized review findings. Findings

### Coverage

-- Suppressed: 2 findings below 0.60 confidence
- Suppressed: 2 findings below anchor 75 (1 at anchor 50, 1 at anchor 25)
- Residual risks: No rate limiting on export endpoint
- Testing gaps: No test for concurrent export requests
@@ -103,7 +103,7 @@ Sev: P1
File: foo.go:42
Issue: Some problem description
Reviewer(s): adversarial
-Confidence: 0.85
Confidence: 75
Route: advisory -> human
────────────────────────────────────────
Sev: P2
@@ -119,7 +119,7 @@ This fails because: no pipe-delimited tables, no severity-grouped `###` headers,

- **Severity-grouped sections** -- `### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`. Omit empty severity levels.
- **Always include file:line location** for code review issues
- **Reviewer column** shows which persona(s) flagged the issue. Multiple reviewers = cross-reviewer agreement.
-- **Confidence column** shows the finding's confidence score
- **Confidence column** shows the finding's anchor as an integer (`50`, `75`, or `100`). Never render as a float.
- **Route column** shows the synthesized handling decision as ``<autofix_class> -> <owner>``.
- **Header includes** scope, intent, and reviewer team with per-conditional justifications
- **Mode line** -- include `interactive`, `autofix`, `report-only`, or `headless`
@@ -38,15 +38,55 @@ The schema below describes the **full artifact file format** (all fields require

{schema}

-Confidence rubric (0.0-1.0 scale):
-- 0.00-0.29: Not confident / likely false positive. Do not report.
-- 0.30-0.49: Somewhat confident. Do not report -- too speculative for actionable review.
-- 0.50-0.59: Moderately confident. Real but uncertain. Do not report unless P0 severity.
-- 0.60-0.69: Confident enough to flag. Include only when the issue is clearly actionable.
-- 0.70-0.84: Highly confident. Real and important. Report with full evidence.
-- 0.85-1.00: Certain. Verifiable from the code alone. Report.
-Suppress threshold: 0.60. Do not emit findings below 0.60 confidence (except P0 at 0.50+).

**Schema conformance — hard constraints (use these exact values; validation rejects anything else):**

- `severity`: one of `"P0"`, `"P1"`, `"P2"`, `"P3"` — use these exact strings. Do NOT use `"high"`, `"medium"`, `"low"`, `"critical"`, or any other vocabulary, even if your persona's prose discusses priorities in those terms conceptually.
- `autofix_class`: one of `"safe_auto"`, `"gated_auto"`, `"manual"`, `"advisory"`.
- `owner`: one of `"review-fixer"`, `"downstream-resolver"`, `"human"`, `"release"`.
- `evidence`: an ARRAY of strings with at least one element. A single string value is a validation failure — wrap every quote in `["..."]` even when there is only one.
- `pre_existing`: boolean, never null.
- `requires_verification`: boolean, never null.
- `confidence`: one of exactly `0`, `25`, `50`, `75`, or `100` — a discrete anchor, NOT a continuous number. Any other value (e.g., `72`, `0.85`, `"high"`) is a validation failure. Pick the anchor whose behavioral criterion you can honestly self-apply to this finding (see "Confidence rubric" below).

If your persona description uses severity vocabulary like "high-priority" or "critical" in its rubric text, translate to the P0-P3 scale at emit time. "Critical / must-fix" → P0, "important / should-fix" → P1, "worth-noting / could-fix" → P2, "low-signal" → P3. Same for priorities described qualitatively in your analysis — map to P0-P3 on the way out.

**Confidence rubric — use these exact behavioral anchors.** Pick the single anchor whose criterion you can honestly self-apply. Do not pick a value between anchors; only `0`, `25`, `50`, `75`, and `100` are valid. The rubric is anchored on behavior you performed, not on a vague sense of certainty — if you cannot truthfully attach the behavioral claim to the finding, step down to the next anchor.

- **`0` — Not confident at all.** A false positive that does not stand up to light scrutiny, or a pre-existing issue this PR did not introduce. **Do not emit — suppress silently.** This anchor exists in the enum only so synthesis can explicitly track the drop; personas never produce it.
- **`25` — Somewhat confident.** Might be a real issue but could also be a false positive; you could not verify from the diff and surrounding code alone. **Do not emit — suppress silently.** This anchor, like `0`, exists in the enum only so synthesis can track the drop; personas never produce it. If your domain is genuinely uncertain, either gather more evidence (read related files, check call sites, inspect git blame) until you can honestly anchor at `50` or higher, or suppress entirely.
- **`50` — Moderately confident.** You verified this is a real issue but it is a nitpick, narrow edge case, or has minimal practical impact. Style preferences and subjective improvements land here. Surfaces only when synthesis routes weak findings to advisory / residual_risks / testing_gaps soft buckets, or when the finding is P0 (critical-but-uncertain issues are not silently dropped).
- **`75` — Highly confident.** You double-checked the diff and surrounding code and confirmed the issue will affect users, downstream callers, or runtime behavior in normal usage. The bug, vulnerability, or contract violation is clearly present and actionable.
- **`100` — Absolutely certain.** The issue is verifiable from the code itself — compile error, type mismatch, definitive logic bug (off-by-one in a tested algorithm, wrong return type, swapped arguments), or an explicit project-standards violation with a quotable rule. No interpretation required.

**Anchor `75` requires naming a concrete observable consequence** — a wrong result, an unhandled error path, a contract mismatch, a security exposure, missing coverage that a real test scenario would surface. "This could be cleaner" or "I would have written this differently" do not meet this bar — they are advisory observations and land at anchor `50`. When in doubt between `50` and `75`, ask: "will a user, caller, or operator concretely encounter this in normal usage, or is this my opinion about the code's quality?" The former is `75`; the latter is `50`.

Anchor and severity are independent axes. A P2 finding can be anchor `100` if the evidence is airtight; a P0 finding can be anchor `50` if it is an important concern you could not fully verify. Anchor gates where the finding surfaces (drop / soft bucket / actionable); severity orders it within the actionable surface.

Synthesis suppresses anchors `0` and `25` silently. Anchor `50` is dropped from primary findings unless the severity is P0 (P0+50 survives) or synthesis routes it to a soft bucket (testing_gaps, residual_risks, advisory) per mode-aware demotion. Anchors `75` and `100` enter the actionable tier.
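The gating rules above reduce to a small decision function. A minimal sketch, assuming a merged-finding dict with the schema's `confidence` and `severity` fields (the routing labels `"drop"`, `"soft_bucket"`, and `"actionable"` are illustrative names, not part of the spec):

```python
# Sketch of the synthesis anchor gate: 0/25 suppressed silently,
# 50 demoted to a soft bucket unless P0 (the P0 escape), 75/100 actionable.

def gate(finding: dict) -> str:
    anchor = finding["confidence"]   # one of 0, 25, 50, 75, 100
    severity = finding["severity"]   # "P0".."P3"
    if anchor in (0, 25):
        return "drop"                # suppressed silently
    if anchor == 50:
        # P0 escape: critical-but-uncertain findings are not dropped
        return "actionable" if severity == "P0" else "soft_bucket"
    return "actionable"              # anchors 75 and 100
```

Severity never changes *whether* a non-P0 finding survives the gate; it only orders findings inside the actionable tier.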
Example of a schema-valid finding (all required fields, correct enum values, correct array shape):

```json
{
  "title": "User-supplied ID in account lookup without ownership check",
  "severity": "P0",
  "file": "app/controllers/orders_controller.rb",
  "line": 42,
  "why_it_matters": "Any signed-in user can read another user's orders by pasting the target account ID into the URL. The controller looks up the account and returns its orders without verifying the current user owns it. The shipments controller already uses a current_user.owns?(account) guard for the same attack class; matching that pattern fixes this finding.",
  "autofix_class": "gated_auto",
  "owner": "downstream-resolver",
  "requires_verification": true,
  "suggested_fix": "Add current_user.owns?(account) guard before lookup, matching the pattern in shipments_controller.rb",
  "confidence": 100,
  "evidence": [
    "orders_controller.rb:42 -- account = Account.find(params[:account_id])",
    "shipments_controller.rb:38 -- raise NotAuthorized unless current_user.owns?(account)"
  ],
  "pre_existing": false
}
```

The `confidence: 100` is justified because the issue is verifiable from the code alone — the controller fetches by user-supplied ID and returns data without any guard, and the parallel pattern in shipments_controller.rb confirms the project's own convention is being violated.
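The hard constraints above are all mechanically checkable. A stdlib-only sketch of that check, offered as an illustration -- the document does not specify how the real validation is implemented:

```python
# Illustrative enforcement of the schema's hard constraints.
# The enum values below are copied from the constraint list; the
# function shape itself is an assumption, not part of the spec.

SEVERITIES = {"P0", "P1", "P2", "P3"}
AUTOFIX_CLASSES = {"safe_auto", "gated_auto", "manual", "advisory"}
OWNERS = {"review-fixer", "downstream-resolver", "human", "release"}
ANCHORS = {0, 25, 50, 75, 100}

def conformance_errors(finding: dict) -> list[str]:
    errors = []
    if finding.get("severity") not in SEVERITIES:
        errors.append("severity must be one of P0-P3")
    if finding.get("autofix_class") not in AUTOFIX_CLASSES:
        errors.append("autofix_class not in enum")
    if finding.get("owner") not in OWNERS:
        errors.append("owner not in enum")
    ev = finding.get("evidence")
    if not (isinstance(ev, list) and ev and all(isinstance(e, str) for e in ev)):
        errors.append("evidence must be a non-empty array of strings")
    for key in ("pre_existing", "requires_verification"):
        if not isinstance(finding.get(key), bool):
            errors.append(key + " must be a boolean, never null")
    if finding.get("confidence") not in ANCHORS:
        errors.append("confidence must be a discrete anchor: 0/25/50/75/100")
    return errors
```

Run against the example finding above, this returns an empty list; a float `confidence` or a bare-string `evidence` each produce exactly one error.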
Writing `why_it_matters` (required field, every finding):

@@ -55,7 +95,7 @@ The `why_it_matters` field is how the reader — a developer triaging findings,

- **Lead with observable behavior.** Describe what the bug does from the outside — what a user, attacker, operator, or downstream caller experiences. Do not lead with code structure ("The function X does Y..."). Start with the effect ("Any signed-in user can read another user's orders..."). Function and variable names appear later, only when the reader needs them to locate the issue.
- **Explain why the fix resolves the problem.** If you include a `suggested_fix`, the `why_it_matters` should make clear why that specific fix addresses the root cause. When a similar pattern exists elsewhere in the codebase (an existing guard, an established convention, a parallel handler), reference it so the recommendation is grounded in the project's own conventions rather than theoretical best practice.
- **Keep it tight.** Approximately 2-4 sentences plus the minimum code quoted inline to ground the point. Longer framings are a regression — downstream surfaces have narrow display budgets, and verbose `why_it_matters` content gets truncated or skimmed.
-- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging (confidence >= 0.60), you can explain it — the field exists because every finding needs a reason.
- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging at anchor `50` or higher, you can explain it — the field exists because every finding needs a reason.
Illustrative pair — same finding, weak vs. strong framing:

@@ -72,18 +112,27 @@ STRONG (observable behavior first, grounded fix reasoning):
pattern already used in the shipments controller for the same attack.
```

-False-positive categories to actively suppress:
-- Pre-existing issues unrelated to this diff (mark pre_existing: true for unchanged code the diff does not interact with; if the diff makes it newly relevant, it is secondary, not pre-existing)
-- Pedantic style nitpicks that a linter/formatter would catch
-- Code that looks wrong but is intentional (check comments, commit messages, PR description for intent)
-- Issues already handled elsewhere in the codebase (check callers, guards, middleware)
-- Suggestions that restate what the code already does in different words
-- Generic "consider adding" advice without a concrete failure mode

False-positive categories to actively suppress. Do NOT emit a finding when any of these apply — not even at anchor `25` or `50`. These are not edge cases you should route to soft buckets; they are non-findings.

- **Pre-existing issues unrelated to this diff.** Mark `pre_existing: true` only for unchanged code the diff does not interact with. If the diff makes a previously-dormant issue newly relevant (e.g., changes a caller in a way that exposes a bug downstream), it is a secondary finding, not pre-existing. PR-comment and headless externalization filter pre-existing entirely; interactive review surfaces them in a separate section.
- **Pedantic style nitpicks that a linter or formatter would catch.** Missing semicolons, indentation, import ordering, unused-variable warnings the project's tooling already catches. Style belongs to the toolchain.
- **Code that looks wrong but is intentional.** Check comments, commit messages, PR description, or surrounding code for evidence of intent before flagging. A persona-flagged "missing null check" guarded by an upstream `.present?` call is a false positive.
- **Issues already handled elsewhere.** Check callers, guards, middleware, framework defaults, and parallel handlers before flagging. If a controller's input is already validated by a parent middleware, the controller-level check the persona wants to add is redundant.
- **Suggestions that restate what the code already does in different words.** "Consider extracting this into a helper" when the code is already a small helper, "consider adding a guard" when a guard one line up already enforces it.
- **Generic "consider adding" advice without a concrete failure mode.** If you cannot name what breaks, the finding is not actionable. Either find the failure mode or suppress.
- **Issues with a relevant lint-ignore comment.** Code that carries an explicit lint disable comment for the rule you are about to flag (`eslint-disable-next-line no-unused-vars`, `# rubocop:disable Style/StringLiterals`, `# noqa: E501`, etc.) — suppress unless the suppression itself violates a project-standards rule that explicitly forbids disabling that lint for this code shape. The author already chose to suppress; re-flagging it via a different reviewer creates noise and ignores their decision.
- **General code-quality concerns not codified in CLAUDE.md / AGENTS.md.** "This file is getting long," "this method has too many parameters," "this is hard to read" — without a project-standards rule to anchor the concern, these are subjective and waste reviewer time. If the project explicitly bans long files or sets a parameter-count limit in its standards, that is a project-standards finding; otherwise suppress.
- **Speculative future-work concerns with no current signal.** "This might break under load," "what if the requirements change," "this could be hard to test later" — not findings unless the diff introduces concrete evidence the concern is reachable now.

**Advisory observations — route to advisory autofix_class, do not force a decision.** If the honest answer to "what actually breaks if we do not fix this?" is "nothing breaks, but…", the finding is advisory. Set `autofix_class: advisory` and `confidence: 50` so synthesis routes the finding to a soft bucket rather than surfacing it as a primary action item. Do not suppress — the observation may have value; it just does not warrant user judgment. Typical advisory shapes: design asymmetry the PR improves but does not fully resolve, opportunity to consolidate two similar helpers when neither is broken, residual risk worth noting in the report.

**Precedence over the false-positive catalog.** The false-positive catalog above is stricter than the advisory rule — if a shape matches the FP catalog, it is a non-finding and must be suppressed entirely. Do NOT route it to anchor `50` / advisory. The advisory rule applies only to shapes that are NOT in the FP catalog.

Rules:
- You are a leaf reviewer inside an already-running compound-engineering review workflow. Do not invoke compound-engineering skills or agents unless this template explicitly instructs you to. Perform your analysis directly and return findings in the required output format only.
- Suppress any finding you cannot honestly anchor at `50` or higher (the actionable floor is `50`; anchors `0` and `25` are suppressed by synthesis anyway, so emitting them only adds noise). If your persona's domain description sets a stricter floor (e.g., anchor `75` minimum), honor it.
- Every finding in the full artifact file MUST include at least one evidence item grounded in the actual code. The compact return omits evidence -- the evidence requirement applies to the disk artifact only.
-- Set pre_existing to true ONLY for issues in unchanged code that are unrelated to this diff. If the diff makes the issue newly relevant, it is NOT pre-existing.
- Set `pre_existing` to true ONLY for issues in unchanged code that are unrelated to this diff. If the diff makes the issue newly relevant, it is NOT pre-existing.
- You are operationally read-only. The one permitted exception is writing your full analysis to the `.context/` artifact path when a run ID is provided. You may also use non-mutating inspection commands, including read-oriented `git` / `gh` commands, to gather evidence. Do not edit project files, change branches, commit, push, create PRs, or otherwise mutate the checkout or repository state.
- Set `autofix_class` accurately -- not every finding is `advisory`. Use this decision guide:
  - `safe_auto`: The fix is local and deterministic — the fixer can apply it mechanically without design judgment. Examples: extracting a duplicated helper, adding a missing nil/null check, fixing an off-by-one, adding a missing test for an untested code path, removing dead code.
@@ -0,0 +1,85 @@
# Validator Sub-agent Prompt Template

This template is used by Stage 5b to spawn one validator sub-agent per surviving finding before externalization. The validator's job is **independent re-verification**, not re-reasoning. It is a fresh second opinion, not a critic of the original persona's analysis.

---

## Template
````
You are an independent validator for a code review finding. Another reviewer flagged the issue described below. Your job is to verify whether the finding holds up under fresh inspection.

You have no commitment to the original finding. If it is wrong, say so. False positives are common; do not feel pressure to confirm.

<finding-to-validate>
Title: {finding_title}
Severity: {finding_severity}
File: {finding_file}
Line: {finding_line}

Why it matters (the original reviewer's framing):
{finding_why_it_matters}

Suggested fix (if any):
{finding_suggested_fix}

Original reviewer: {finding_reviewer}
Confidence anchor: {finding_confidence}
</finding-to-validate>

<diff>
{diff}
</diff>

<scope-context>
The diff above is the full change being reviewed. The finding is about file {finding_file} around line {finding_line}. Use read tools (Read, Grep, Glob, git blame) to inspect the cited code and its callers, guards, middleware, or framework defaults that might handle the concern elsewhere.
</scope-context>

Your task is to answer three questions:

1. **Is the issue real in the code as written?** Read the cited file and surrounding code. If the code does not actually have the problem the finding describes, the finding is invalid. Common false-positive shapes:
   - The persona missed an existing guard / null check / validation that handles the case
   - The persona misread types or signatures
   - The persona flagged a pattern that is intentional in this codebase (check comments, parallel handlers, project conventions)

2. **Is the issue introduced by THIS diff?** Use git blame or diff inspection. If the cited line predates this PR's commits and the diff does not interact with it (does not call into it, does not change its callers in a way that newly exposes the issue), the finding is pre-existing — not validated for externalization regardless of whether it is a real issue.

3. **Is the issue not handled elsewhere?** Look for guards in callers, middleware in the request chain, framework defaults, type system constraints, or parallel handlers that already address the concern. If the issue is functionally prevented by surrounding infrastructure, the finding is invalid.

Return ONLY this JSON, no prose:

```json
{
  "validated": true | false,
  "reason": "<one sentence explaining the verdict>"
}
```

Examples:

- `{ "validated": true, "reason": "Cited line is new in this diff and lacks the ownership guard used by parallel controllers." }`
- `{ "validated": false, "reason": "Line 87 already guards user.email with .present? check; the null deref the finding describes cannot occur." }`
- `{ "validated": false, "reason": "Cited line dates to 2024-08 (pre-existing); diff does not modify or interact with it." }`
- `{ "validated": false, "reason": "Framework handles the timeout case via Faraday default; no application-level retry needed." }`

Rules:
- Be honest. If the original reviewer was right, validate. If they were wrong, reject. Conservative bias is preferred — when in doubt, reject.
- Do not invent new findings. Your scope is this one finding; surface anything else as a no-vote with reason.
- Do not edit, commit, push, or modify any files. You are operationally read-only.
- If you cannot read the cited file, return `{ "validated": false, "reason": "Could not access file path to verify." }` rather than guessing.
- Return JSON only. No prose, no markdown, no explanation outside the JSON object.
````
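One plausible orchestrator-side consumer of the validator's reply, sketched under stated assumptions: the template mandates a JSON-only reply, but does not say what the orchestrator does with malformed output. Treating anything unparseable as a rejection is an assumption here, chosen to match the validator's own conservative bias ("when in doubt, reject"):

```python
import json

# Hypothetical handling of a validator reply. The JSON shape
# ({"validated": bool, "reason": str}) is from the template above;
# the reject-on-parse-failure fallback is an assumption.

def parse_verdict(raw: str) -> tuple[bool, str]:
    try:
        verdict = json.loads(raw.strip())
        validated = verdict["validated"]
        reason = verdict["reason"]
        if not isinstance(validated, bool) or not isinstance(reason, str):
            raise ValueError("wrong field types")
        return validated, reason
    except (ValueError, KeyError, TypeError):
        # Prose, markdown, or a missing field: conservatively reject.
        return False, "Validator reply was not the required JSON shape."
```

`json.JSONDecodeError` subclasses `ValueError`, so a non-JSON reply falls into the same rejection path as a structurally wrong one.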
## Variable Reference

| Variable | Source | Description |
|----------|--------|-------------|
| `{finding_title}` | Stage 5 merged finding | The persona's title for the issue |
| `{finding_severity}` | Stage 5 merged finding | P0 / P1 / P2 / P3 |
| `{finding_file}` | Stage 5 merged finding | Repo-relative file path |
| `{finding_line}` | Stage 5 merged finding | Primary line number |
| `{finding_why_it_matters}` | Per-agent artifact file (detail tier) | Loaded from disk for this validation; required for the validator to understand the finding |
| `{finding_suggested_fix}` | Stage 5 merged finding (optional) | Pass empty string if not present |
| `{finding_reviewer}` | Stage 5 merged finding | Original persona name (informational; helps validator interpret the framing) |
| `{finding_confidence}` | Stage 5 merged finding | The persona's anchor (informational) |
| `{diff}` | Stage 1 output | Full diff for context |
@@ -10,7 +10,7 @@ Interactive mode only.

The walk-through receives, from the orchestrator:

-- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to `gated_auto` and `manual` findings that survived the Stage 5 confidence gate. Advisory findings are included when they were surfaced to this phase (advisory findings normally live in the report-only queue, but when the review flow routes them here for acknowledgment they take the advisory variant below).
- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to `gated_auto` and `manual` findings that survived the Stage 5 anchor gate (anchor 75+, with P0 escape at anchor 50). Advisory findings are included when they were surfaced to this phase (advisory findings normally live in the report-only queue, but when the review flow routes them here for acknowledgment they take the advisory variant below).
- The cached tracker-detection tuple from `tracker-defer.md` (`{ tracker_name, confidence, named_sink_available, any_sink_available }`). `any_sink_available` determines whether the Defer option is offered; `named_sink_available` + `confidence` determine whether the label names the tracker inline.
- The run id for artifact lookups.
@@ -156,7 +156,7 @@ For each finding's answer:

- **Acknowledge — mark as reviewed** (advisory variant) — record Acknowledge in the in-memory decision list. Advance to the next finding. No side effects.
- **Defer — file a [TRACKER] ticket** — invoke the tracker-defer flow from `tracker-defer.md`. The walk-through's position indicator stays on the current finding during any failure-path sub-question (Retry / Fall back / Convert to Skip). On success, record the tracker URL / reference in the in-memory decision list and advance. On conversion-to-Skip from the failure path, advance with the failure noted in the completion report.
- **Skip — don't apply, don't track** — record Skip in the in-memory decision list. Advance. No side effects.
-- **LFG the rest — apply the agent's best judgment to this and remaining findings** — exit the walk-through loop. Dispatch the bulk preview from `bulk-preview.md`, scoped to the current finding and everything not yet decided. The preview header reports the count of already-decided findings ("K already decided"). If the user picks `Cancel` from the preview, return to the current finding's per-finding question (not to the routing question). If the user picks `Proceed`, execute the plan per `bulk-preview.md` — Apply findings join the in-memory Apply set with the ones the user already picked, Defer findings route through `tracker-defer.md`, Skip / Acknowledge no-op — then proceed to end-of-walk-through dispatch.
- **LFG the rest — apply the agent's best judgment to this and remaining findings** — exit the walk-through loop. Before dispatching the bulk preview, run Stage 5b on the remaining action set (the current finding plus any not yet decided) using the same gate as LFG routing (option B): validator template at `references/validator-template.md`, 15-finding cap, parallel per-finding dispatch. Findings Stage 5b rejects are dropped from the remaining set. Then dispatch the bulk preview from `bulk-preview.md` scoped to the Stage 5b-validated survivors. The preview header reports the count of already-decided findings ("K already decided"). If the user picks `Cancel` from the preview, return to the current finding's per-finding question (not to the routing question). If the user picks `Proceed`, execute the plan per `bulk-preview.md` — Apply findings join the in-memory Apply set with the ones the user already picked, Defer findings route through `tracker-defer.md`, Skip / Acknowledge no-op — then proceed to end-of-walk-through dispatch.
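The Stage 5b gate invoked by "LFG the rest" can be sketched roughly. This is an assumption-laden illustration: `spawn_validator` is a hypothetical stand-in for dispatching the validator sub-agent template, and letting findings beyond the 15-finding cap pass through unvalidated is a guess -- the document does not specify overflow handling:

```python
from concurrent.futures import ThreadPoolExecutor

VALIDATION_CAP = 15  # same cap as LFG routing (option B)

def stage_5b(findings, spawn_validator):
    """Validate up to VALIDATION_CAP findings in parallel; drop rejects.

    spawn_validator(finding) -> (validated: bool, reason: str) is a
    placeholder for the real per-finding validator dispatch. Findings
    beyond the cap pass through unvalidated (an assumption).
    """
    head, tail = findings[:VALIDATION_CAP], findings[VALIDATION_CAP:]
    with ThreadPoolExecutor(max_workers=len(head) or 1) as pool:
        verdicts = list(pool.map(spawn_validator, head))
    survivors = [f for f, (ok, _reason) in zip(head, verdicts) if ok]
    return survivors + tail
```

The bulk preview would then be scoped to the returned survivors, which matches "Findings Stage 5b rejects are dropped from the remaining set."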
---