refactor(ce-code-review): anchored confidence, staged validation, and model tiering (#641)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trevin Chow, 2026-04-21 21:04:29 -07:00, committed by GitHub
parent b104ce46be, commit 5a26a8fbd3
28 changed files with 1201 additions and 119 deletions


---
title: "refactor: Adopt anchored confidence, validation gate, and mode-aware precision in ce-code-review"
type: refactor
status: active
date: 2026-04-21
---
# refactor: Adopt anchored confidence, validation gate, and mode-aware precision in ce-code-review
## Overview
Port the ce-doc-review anchored-confidence pattern into ce-code-review and add three code-review-specific precision controls inspired by Anthropic's official `code-review` plugin: a per-finding validation stage before externalization, mode-aware false-positive policy, and an explicit lint-ignore suppression rule. Also add a PR-mode-only skip-condition pre-check (closed/draft/trivial/already-reviewed) to avoid wasted review cycles.
The goal is to make ce-code-review's externalizing modes (autofix, headless, future PR-comment) materially more precise while preserving interactive mode's broader review surface.
## Problem Frame
ce-code-review currently uses a continuous `confidence: 0.0-1.0` field with a 0.60 suppress gate, a 0.50+ P0 exception, and a `+0.10` cross-reviewer agreement boost. The same false-precision problem ce-doc-review just fixed applies here: personas anchor on round numbers (0.65, 0.72, 0.85), the gate boundary creates a coin-flip band, and the additive boost hides what the score actually measures.
Reviewing Anthropic's official `code-review` plugin (`anthropics/claude-plugins-official/plugins/code-review/commands/code-review.md`) surfaced four additional precision techniques worth adopting:
1. **Anchored 0/25/50/75/100 rubric** — discrete buckets tied to behavioral criteria reduce model-fabricated precision. ce-doc-review already proved this works (commit `6caf3303`); ce-code-review was deferred at the time.
2. **Per-finding validation subagent** — Anthropic's actual command relies on a binary validated/not-validated gate more than on the numeric score. Independent validation catches false positives that confident-sounding personas produce. We rely on cross-reviewer agreement, which only fires when 2+ reviewers happen to converge — many real findings only fire once.
3. **Skip-condition pre-check** — Anthropic skips closed, draft, trivial, or already-reviewed PRs before doing any work. We have no equivalent; PR-mode invocations spend full review effort on PRs that should not be reviewed.
4. **Lint-ignore suppression** — code carrying an explicit `eslint-disable`, `rubocop:disable`, etc. for the rule a reviewer is about to flag should suppress the finding. Not currently in our false-positive catalog.
The right framing for ce-code-review's broader surface is not "narrow to Anthropic's 4-agent shape" but "tier the precision bar by mode": externalizing modes (PR-comment, autofix, headless) need narrow Anthropic-style precision; interactive mode is allowed broader findings as long as weak general-quality concerns route to soft buckets (`advisory` / `residual_risks` / `testing_gaps`) rather than primary findings.
Independent validation as a Stage 5b *gate* (drop rejected findings, keep approved ones) is the right framing. An earlier draft of this plan added a `validated: boolean` field to every finding — that field was YAGNI and is removed. The validator's effect is on the population of surviving findings, not on per-finding metadata.
## Requirements Trace
- R1. Replace continuous `confidence` field with 5 discrete anchor points (0, 25, 50, 75, 100) and a behavioral rubric per anchor. Mirror ce-doc-review's pattern.
- R2. Update Stage 5 synthesis to consume anchor values: `>= 75` filter threshold (P0 exception at 50+), one-anchor cross-reviewer promotion (replaces `+0.10`), anchor-descending sort.
- R3. Add a new Stage 5b validation pass that spawns one validator subagent per surviving finding before externalization. Scope: required for autofix/headless externalization and downstream-resolver handoff; skipped for interactive terminal display where the human is the validator. Validation is process logic — findings the validator rejects are dropped; no metadata field is added to surviving findings.
- R4. Make the false-positive policy mode-aware in synthesis. Headless and autofix apply the narrow Anthropic-style filter (concrete bugs, compile/parse failures, traceable security, explicit standards violations only). Interactive demotes weak general-quality concerns to `advisory` / `residual_risks` / `testing_gaps` rather than suppressing them.
- R5. Add an explicit lint-ignore suppression rule to the subagent template's false-positive catalog: if the code carries a lint disable comment for the rule the reviewer is about to flag, suppress unless the suppression itself violates project standards.
- R6. Add a PR-mode-only skip-condition pre-check (closed, draft, trivial automated, or already-reviewed by Claude). Skip cleanly without dispatching reviewers. Standalone branch and `base:` modes are unaffected.
- R7. Sweep all persona files for hardcoded float confidence references and add mode-aware suppression hints where applicable.
- R8. Update test fixtures and contract tests in `tests/review-skill-contract.test.ts` and any related fixtures.
- R9. Document the migration in `docs/solutions/skill-design/` extending the existing ce-doc-review note, including the rationale for ce-code-review's specific threshold and the validation-stage scoping decision.
## Scope Boundaries
- No change to persona-specific domain logic (what each persona looks for). Only confidence rubric, validation flow, mode-aware policy, and skip-conditions change.
- No change to severity taxonomy (`P0 | P1 | P2 | P3`).
- No change to `autofix_class` or `owner` enums.
- No collapse of the 17-persona architecture to Anthropic's 4-agent shape. ce-code-review's broader surface is intentional.
- No change to the standalone / branch / PR / `base:` scope-resolution paths in Stage 1.
### Deferred to Separate Tasks
- **PR inline comment posting mode**: Anthropic's `--comment` flag posts findings as inline GitHub PR comments via `mcp__github_inline_comment__create_inline_comment` with full-SHA link discipline and committable suggestion blocks for small fixes. We have no PR-comment mode at all today. This is a substantial new mode (link format, suggestion-block handling, deduplication semantics, tracker integration overlap). Worth its own plan; this refactor sets the precision foundation it would build on.
- **Haiku-tier orchestrator-side checks**: Anthropic uses haiku for the skip-condition probe and CLAUDE.md path discovery. We currently use sonnet for everything; pushing cheap checks to haiku is a separate cost-optimization task.
- **Re-evaluating which always-on personas earn their noise**: Anthropic's HIGH-SIGNAL philosophy raises the question of whether `testing` and `maintainability` should remain always-on. Out of scope here — handled by the mode-aware soft-bucket routing in this plan, but a deeper re-think is its own conversation.
## Context & Research
### Relevant Code and Patterns
**Direct port targets (ce-doc-review prior art):**
- `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json` — anchored integer enum precedent
- `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` — verbatim rubric + consolidated false-positive catalog
- `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` — anchor gate, one-anchor promotion, anchor-descending sort
- Commit `6caf3303` — the migration diff is the canonical reference for what to change in this skill
**Files this plan modifies:**
- `plugins/compound-engineering/skills/ce-code-review/SKILL.md` — Stage 1 (skip-condition gate), Stage 5 (anchor gate, promotion), new Stage 5b (validation), Stage 6 (mode-aware false-positive policy)
- `plugins/compound-engineering/skills/ce-code-review/references/findings-schema.json` — confidence enum, threshold table in `_meta`
- `plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md` — anchored rubric, expanded false-positive catalog with lint-ignore rule, mode-aware suppression hints
- `plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md` — verify no float references remain (no behavioral changes needed)
- `plugins/compound-engineering/skills/ce-code-review/references/review-output-template.md` — anchor-as-integer rendering in confidence column
- `plugins/compound-engineering/skills/ce-code-review/references/walkthrough.md` — anchor display in per-finding block
- `plugins/compound-engineering/skills/ce-code-review/references/bulk-preview.md` — anchor rendering if confidence appears
- `plugins/compound-engineering/agents/ce-*-reviewer.agent.md` — sweep for hardcoded float references
- `tests/review-skill-contract.test.ts` — anchor enum assertions, validation-stage assertions, skip-condition assertions
- `tests/fixtures/` — any seeded review fixtures with embedded confidence values
- `docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md` — extend with ce-code-review section
### Institutional Learnings
- `docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md` — the canonical writeup of the anchored-rubric pattern. Establishes the ce-doc-review threshold of `>= 50` and explicitly anticipates ce-code-review's threshold of `>= 75` due to opposite economics (linter backstop, PR-comment cost, ground-truth verifiability of code claims).
- `docs/plans/2026-04-21-001-refactor-ce-doc-review-anchored-confidence-scoring-plan.md` — the ce-doc-review plan, particularly its "Deferred to Separate Tasks" entry naming this exact follow-up. Sequencing rationale ("do ce-doc-review first, observe, then plan ce-code-review") was honored.
### External References
- `anthropics/claude-plugins-official/plugins/code-review/commands/code-review.md` — canonical source for the four code-review-specific patterns (anchored rubric, validation step, skip-conditions, lint-ignore). Note: the README describes a 0/25/50/75/100 scale with threshold 80, but the actual command prompt relies more heavily on the binary validated/not-validated gate (their Step 5) than on the numeric score. We model this faithfully by adopting both the anchored rubric *and* the validation gate, recognizing the validation gate is the load-bearing precision mechanism.
- Two-model comparative analysis (this conversation, 2026-04-21) — original reflection plus second-model critique that surfaced (a) validation gate is more important than the numeric score in the upstream design, (b) false-positive policy should be mode-aware, (c) confidence and validation should be decoupled fields. All three insights are R-traced above.
## Key Technical Decisions
- **Threshold `>= 75`, not `>= 80`**: Matches ce-doc-review's stylistic choice of using the anchor itself as the threshold (no awkward `>= 80` middle-bucket gap that effectively means "100 only" under the discrete scale). At `>= 75`, anchor 75 ("real, will hit in practice") and anchor 100 ("evidence directly confirms") survive; anchors 0 / 25 / 50 are dropped. P0 exception at 50+ preserves the current escape hatch for critical-but-uncertain issues.
- **Validation is process logic, not a metadata field**: An earlier draft of this plan added a `validated: boolean` field to every finding. Removed: rejected findings are dropped, so surviving findings post-validation are validated by definition; in modes where validation does not run, no consumer needs a per-finding flag because the run's mode already tells them whether validation ran. A field that is constant within any mode does no work and the name implies a truth claim it does not carry. Validation stays as a Stage 5b gate; no schema change.
- **Validation is scoped to externalization, not universal**: Validating every finding roughly doubles agent calls. The cost is justified when findings will be posted to GitHub, applied automatically, or handed off to downstream automation — places where false positives have real cost. For interactive terminal display, the user provides the validation by reviewing.
- **One validator subagent per finding, not batched**: Independence is the product. A single batched validator looking at all findings together pattern-matches across them and effectively becomes an opinionated re-reviewer, recreating the persona-bias problem we are escaping. Per-finding parallel dispatch keeps fresh context per call. Per-file batching is a plausible future optimization for reviews with many findings clustered in few files, but not needed today (typical reviews surface 3-8 findings post-gate).
- **Validator dispatch budget cap**: To bound worst-case cost when a review surfaces an unusually large finding set, cap parallel validator dispatch at 15. If more findings survive Stage 5, validate the highest-severity 15 in parallel and queue the rest for a second wave. This is a safety bound; typical reviews never hit it.
- **Mode-aware false-positive policy uses existing soft buckets, not a new schema field**: Weak general-quality findings already have well-defined homes (`residual_risks` for "noticed but couldn't confirm," `testing_gaps` for missing coverage, `advisory` autofix_class for "report-only"). Mode-aware demotion routes weak findings into these buckets in interactive mode and suppresses them in headless/autofix. No new schema needed.
- **One-anchor cross-reviewer promotion replaces `+0.10` boost**: Mirrors ce-doc-review. Cleaner than additive math and semantically meaningful (independent corroboration moves a "real but minor" finding to "real, will hit in practice").
- **Skip-condition gate is PR-mode only**: Standalone, branch, and `base:` modes always run. The closed/draft/trivial/already-reviewed checks only make sense when there's a PR. Already-reviewed detection uses `gh pr view <PR> --comments` filtered for prior Claude-authored comments, the same pattern Anthropic uses.
- **Lint-ignore suppression has a project-standards exception**: If a finding is about a CLAUDE.md/AGENTS.md rule violation and the code uses a lint disable to suppress that specific rule, the suppression itself may violate project standards (e.g., "do not use `eslint-disable-next-line` for security rules"). The rule is "suppress the finding *unless* the suppression itself is the violation."
- **No haiku-tier downgrade in this plan**: The skip-condition pre-check is a natural haiku candidate, but model-tier choices are out of scope here. Use the same mid-tier (sonnet) the rest of the skill uses; haiku is its own optimization plan.
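The lint-ignore suppression rule above is compact enough to sketch. This is illustrative only: `Finding`, `hasLintDisable`, and `shouldSuppressFinding` are hypothetical names, and the real rule ships as prose in the subagent template, not as code.

```typescript
// Illustrative sketch of the lint-ignore suppression decision.
// Shapes and names here are hypothetical, not part of the skill's contract.
interface Finding {
  lintRule?: string;                       // lint rule the finding maps to, e.g. "no-eval"
  suppressionViolatesStandards?: boolean;  // the disable comment itself breaks project rules
}

// Does the line above the flagged code carry a disable comment for this rule?
function hasLintDisable(lineAbove: string, rule: string): boolean {
  const escaped = rule.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(
    `(eslint-disable(?:-next-line)?|rubocop:disable)[^\\n]*${escaped}`
  ).test(lineAbove);
}

// Suppress the finding unless the suppression itself is the violation.
function shouldSuppressFinding(f: Finding, lineAbove: string): boolean {
  if (!f.lintRule || !hasLintDisable(lineAbove, f.lintRule)) return false;
  return !f.suppressionViolatesStandards;
}
```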
## Open Questions
### Resolved During Planning
- **Threshold value (`>= 75` vs `>= 80`)?** Resolved: `>= 75`. Matches ce-doc-review's use of the anchor as the threshold and avoids the "`>= 80` collapses to anchor 100 only" gotcha under a discrete scale.
- **Add a `validated` field on findings or keep validation as process-only?** Resolved: process-only. Surviving findings post-validation are validated by definition; mode metadata in `metadata.json` already tells consumers whether validation ran. A per-finding flag is YAGNI and the name implies a truth claim it does not carry.
- **Validate every finding or only externalizing ones?** Resolved: only externalizing (autofix, headless, downstream-resolver handoff). Interactive uses the human as the validator.
- **One validator per finding, batched, or per file?** Resolved: per finding, parallel. Independence is the design point. Per-file batching is documented as a future optimization if real-world data shows reviews routinely cluster many findings in few files.
- **Adopt PR-comment posting mode in this plan?** Resolved: deferred. It's a substantial new mode and would dilute the precision-foundation focus of this refactor.
- **Should we collapse to Anthropic's 4-agent architecture?** Resolved: no. Our 17-persona surface serves a broader workflow (pre-PR review, learnings, deployment notes). Adopt their precision techniques without their narrowness.
### Deferred to Implementation
- **Exact rubric wording per anchor for code-review economics**. ce-doc-review's wording works as a starting point, but code review has unambiguous ground truth (compile errors, runtime bugs) that doc review lacks. Anchor 100 should reference "directly verifiable from the code without execution" or similar; implementation pass writes the final text.
- **Validator subagent prompt design**. The validator's job is independent re-verification, not re-reasoning. Prompt should give it the finding's title, file, line range, and `why_it_matters`, plus the diff and surrounding code, and ask "is this real, introduced by this diff, and not handled elsewhere?" Final wording during implementation; Anthropic's Step 5 prompt is reference material.
- **Whether to validate findings about to be presented in interactive mode's walk-through**. The walk-through is technically interactive (human in the loop) but the user may LFG-bulk-apply, which crosses into externalization. Decision-deferral candidate: validate before LFG bulk-apply; skip otherwise.
- **Whether persona files need any additional updates beyond a float-reference sweep**. A few personas may carry domain-specific calibration text (e.g., security: "always flag SQL injection at high confidence") that needs anchor-rewriting. Per-file judgment during implementation.
## Implementation Units
- [ ] **Unit 1: Update findings schema with anchored confidence**
**Goal:** Replace continuous `confidence` with integer enum. Update `_meta.confidence_thresholds` to describe the anchor-based gates.
**Requirements:** R1
**Dependencies:** None — this unit establishes the contract every other unit consumes.
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-code-review/references/findings-schema.json`
- Test: `tests/review-skill-contract.test.ts` (schema-shape assertions)
**Approach:**
- Replace `confidence: { type: "number", minimum: 0.0, maximum: 1.0 }` with `confidence: { type: "integer", enum: [0, 25, 50, 75, 100] }`.
- Rewrite `_meta.confidence_thresholds` table to describe anchors and the `>= 75` gate (with P0 exception at 50+).
- No `validated` field — validation is process logic in Stage 5b. Surviving findings post-validation are validated by definition; rejected findings are dropped. See Key Technical Decisions for rationale.
**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json` — anchor enum precedent
**Test scenarios:**
- Happy path — Schema validates a finding with `confidence: 75`.
- Edge case — Schema rejects a finding with `confidence: 0.85` (float not in enum).
- Edge case — Schema rejects a finding with `confidence: 80` (not in enum).
- Edge case — `_meta` documents the threshold semantics in human-readable form (smoke test: assert key strings present).
**Verification:**
- All schema assertions in `tests/review-skill-contract.test.ts` pass with the new shape.
- `bun run release:validate` reports no parity drift.
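Unit 1's contract can be pinned down in a few lines. A sketch of the enum semantics the schema will encode; the helper name is hypothetical, and the authoritative contract is the JSON schema itself:

```typescript
// Sketch of the anchored-confidence contract: an integer enum, not a float range.
const CONFIDENCE_ANCHORS = [0, 25, 50, 75, 100] as const;
type Anchor = (typeof CONFIDENCE_ANCHORS)[number];

// Hypothetical helper mirroring what the schema's enum constraint enforces.
function isValidConfidence(value: unknown): value is Anchor {
  return (
    typeof value === "number" &&
    Number.isInteger(value) &&
    (CONFIDENCE_ANCHORS as readonly number[]).includes(value)
  );
}
```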
---
- [ ] **Unit 2: Rewrite subagent template with anchored rubric, expanded false-positive catalog, and mode-aware hints**
**Goal:** Replace the float rubric with the verbatim 5-anchor behavioral rubric. Expand the false-positive catalog with lint-ignore suppression. Add a mode-aware suppression hint so personas know their findings will be filtered differently in headless/autofix.
**Requirements:** R1, R4, R5
**Dependencies:** Unit 1 (schema must accept anchor values).
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md`
**Approach:**
- Replace the "Confidence rubric (0.0-1.0 scale)" section (lines 41-49) with the 5-anchor rubric, each anchor named and tied to a behavioral criterion the persona can self-apply (e.g., "100: Verifiable from the code alone without running it").
- Update the suppress-threshold sentence to "Suppress threshold: anchor 75. Do not emit findings below anchor 75 (except P0 at anchor 50)."
- Expand the false-positive catalog (lines 75-81) to include the lint-ignore rule explicitly: "Code with an explicit lint disable comment for the rule you are about to flag — suppress unless the suppression itself violates a project-standards rule."
**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` — rubric and false-positive catalog structure
**Test scenarios:**
- Happy path — Template renders with all 5 anchors and behavioral definitions.
- Integration — A test that spawns a persona against a fixture diff returns findings with anchor values.
**Verification:**
- The rubric appears in the template verbatim and matches the schema enum.
- The false-positive catalog includes lint-ignore handling.
- No persona sub-agent prompt references continuous floats after this unit lands.
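The verification criteria above suggest a contract-test predicate of roughly this shape. A hedged sketch only: the real assertions live in `tests/review-skill-contract.test.ts`, and the anchor-label format (`"75:"`) and "lint disable" phrasing are assumptions about the final template text.

```typescript
// Sketch of a Unit 2 contract-test predicate; the exact template wording
// it checks for is assumed, not final.
const ANCHORS = [0, 25, 50, 75, 100];

function templateCoversRubric(template: string): boolean {
  const hasAnchors = ANCHORS.every((a) => template.includes(`${a}:`));
  const hasLintIgnoreRule = template.toLowerCase().includes("lint disable");
  const hasNoFloats = !/confidence[^\n]*0\.\d/.test(template); // no residual float rubric
  return hasAnchors && hasLintIgnoreRule && hasNoFloats;
}
```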
---
- [ ] **Unit 3: Update synthesis Stage 5 with anchor gate, one-anchor promotion, and anchor-descending sort**
**Goal:** Update the merge stage to consume integer anchors. Replace the `0.60` threshold with `>= 75` (P0 exception at 50+). Replace the `+0.10` cross-reviewer boost with one-anchor promotion. Update the sort to use anchor descending.
**Requirements:** R2
**Dependencies:** Units 1, 2.
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-code-review/SKILL.md` (Stage 5)
**Approach:**
- In Stage 5 step 1 ("Validate"), update the `confidence` value constraint from `numeric, 0.0-1.0` to `integer in {0, 25, 50, 75, 100}`.
- In Stage 5 step 2 ("Confidence gate"), change "Suppress findings below 0.60 confidence. Exception: P0 findings at 0.50+" to "Suppress findings below anchor 75. Exception: P0 findings at anchor 50+ survive."
- In Stage 5 step 4 ("Cross-reviewer agreement"), replace "boost the merged confidence by 0.10 (capped at 1.0)" with "promote the merged finding by one anchor step (50 -> 75, 75 -> 100, 100 -> 100). Cross-reviewer corroboration is a stronger signal than any single reviewer's anchor; the promotion routes the finding from the soft tier into the actionable tier or strengthens its already-actionable position."
- In Stage 5 step 9 ("Sort"), change "confidence (descending)" to "anchor (descending)".
- Update the Stage 5 preamble to describe the new contract (integer anchors instead of floats).
**Test scenarios:**
- Happy path — Two reviewers flag the same fingerprint at anchor 50; merged result is anchor 75 (one-anchor promotion).
- Happy path — Two reviewers flag the same fingerprint at anchor 75; merged result is anchor 100.
- Happy path — One reviewer flags at anchor 100; merged result remains anchor 100 (no over-promotion).
- Edge case — A single reviewer flags at anchor 50, no other reviewer agrees; merged result is filtered out (below threshold).
- Edge case — A P0 finding at anchor 50 survives the gate; a P1 finding at anchor 50 does not.
- Edge case — Sort order: two findings at the same severity, one at anchor 100 and one at anchor 75; the anchor-100 finding sorts first.
**Verification:**
- `tests/review-skill-contract.test.ts` synthesis assertions pass with the new gate, promotion, and sort.
- A manual review run against a fixture diff produces expected anchor distributions and routing.
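The gate, promotion, and sort mechanics above can be sketched as follows. Directional only: the finding shape and helper names are hypothetical, since Stage 5 is specified as SKILL.md prose, not executable code.

```typescript
// Directional sketch of the Stage 5 anchor mechanics.
type Anchor = 0 | 25 | 50 | 75 | 100;
type Severity = "P0" | "P1" | "P2" | "P3";
interface Finding { severity: Severity; confidence: Anchor; }

// Gate: anchor >= 75 survives; P0 escape hatch at anchor 50+.
function passesGate(f: Finding): boolean {
  return f.confidence >= 75 || (f.severity === "P0" && f.confidence >= 50);
}

// Cross-reviewer corroboration promotes by exactly one anchor step, capped at 100.
function promote(a: Anchor): Anchor {
  const ladder: Anchor[] = [0, 25, 50, 75, 100];
  return ladder[Math.min(ladder.indexOf(a) + 1, ladder.length - 1)];
}

// Within a severity tier, sort anchor descending.
function byAnchorDesc(a: Finding, b: Finding): number {
  return b.confidence - a.confidence;
}
```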
---
- [ ] **Unit 4: Add Stage 5b validation pass for externalizing findings**
**Goal:** Insert a new synthesis stage that spawns a validator subagent per surviving finding when the run will externalize. Validator says yes -> finding survives; validator says no, times out, or returns malformed output -> finding is dropped. Pure process logic; no metadata change to surviving findings.
**Requirements:** R3
**Dependencies:** Units 1, 3.
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-code-review/SKILL.md` (new Stage 5b between Stage 5 and Stage 6)
- Create: `plugins/compound-engineering/skills/ce-code-review/references/validator-template.md` — the validator subagent's prompt template
**Approach:**
- After Stage 5 merge produces the deduplicated finding set, decide whether validation runs. Validation runs when:
- Mode is `headless` or `autofix`
- Mode is `interactive` and the routing path is LFG (option B) or File-tickets (option C)
- A future PR-comment mode (when added)
- Validation does *not* run when:
- Mode is `report-only`
- Mode is `interactive` and the routing is the per-finding walk-through (option A, where the user is the validator) or Report-only (option D)
- For each surviving finding, spawn one validator subagent in parallel. Validator prompt (in `references/validator-template.md`) gives it: finding title, file, line, `why_it_matters`, the diff, and surrounding code via the platform's read tool. Validator returns `{ "validated": true | false, "reason": "..." }`.
- Findings where validator returns `false` are dropped. Findings where validator returns `true` flow through unchanged into Stage 6 — no field is set on the finding (validation is process logic, not metadata).
- Validator runs at mid-tier (sonnet) like the personas. Validator is read-only — same constraints as persona reviewers.
- **Dispatch budget cap: max 15 parallel validators.** When more than 15 findings survive Stage 5, validate the highest-severity 15 (P0 first, then P1, then P2, then P3, breaking ties by anchor descending) and drop the remainder with a Coverage note. This is a safety bound; typical reviews surface < 10 findings post-gate and never hit the cap. The blunt "drop the rest" behavior is intentional — a review producing 15+ surviving findings is already in territory where a second wave wouldn't change the user's triage approach.
- Record validation drop count and any over-budget drops in Coverage for Stage 6.
- If the validator subagent fails, times out, or returns malformed JSON, treat as a no-vote and drop the finding. Unverified findings should not externalize. Conservative bias is correct.
- **Future optimization (not implemented here):** per-file batching. Group surviving findings by file and dispatch one validator per file (validator reads the file once, evaluates all findings in that file). Real win when reviews cluster many findings in few files (large refactors). Skip until we see real-world data showing it matters; per-finding parallel dispatch is the correct default for typical reviews.
**Execution note:** Add a contract test for the validation stage before wiring it into the orchestrator, so we have a known-good harness for fixture-based verification.
**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md` — output contract structure for the validator template
- Anthropic's Step 5 in `commands/code-review.md` — the validator's job is independent re-verification, not re-reasoning
**Technical design:** *(directional guidance, not implementation specification)*
```
Stage 5 -> merged findings
|
v
Stage 5b: Validation gate
|
+-- mode in {headless, autofix} OR (interactive AND routing in {LFG, File-tickets})?
| YES -> sort findings by severity desc, take top 15
| | spawn one validator subagent per finding, in parallel
| | each validator: { validated: true | false, reason: ... }
| | drop findings the validator rejects; survivors flow through unchanged
| | drop findings beyond the 15-cap with a Coverage note
| NO -> pass through unchanged
|
v
Stage 6 -> synthesize and present
```
**Test scenarios:**
- Happy path — Headless mode, validator confirms a finding; finding survives into Stage 6 unchanged.
- Happy path — Headless mode, validator rejects a finding; finding is dropped and counted in Coverage with the validator's reason.
- Happy path — Interactive mode, walk-through routing; validation stage is skipped entirely, all surviving findings pass through.
- Edge case — Validator subagent times out; finding is dropped.
- Edge case — Validator returns malformed JSON; finding is dropped, drop reason recorded.
- Edge case — 20 findings survive Stage 5 in headless mode; first 15 (sorted by severity desc) validate in parallel, remaining 5 are dropped with Coverage note "5 findings exceeded validator budget cap and were not externalized."
- Integration — Autofix mode applies only validator-approved `safe_auto` findings; a validator-rejected `safe_auto` finding does not enter the fixer queue.
**Verification:**
- `tests/review-skill-contract.test.ts` validation-stage assertions pass.
- Coverage section reports validator drop count and any second-wave deferrals.
- Autofix mode does not apply validator-rejected findings.
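The Stage 5b dispatch rules above can be sketched as a pair of small functions: when validation runs, and how the 15-validator budget cap selects findings. All names are illustrative; the real stage lives in SKILL.md prose.

```typescript
// Directional sketch of Stage 5b dispatch: gate condition plus budget cap.
type Mode = "headless" | "autofix" | "interactive" | "report-only";
type Routing = "walkthrough" | "lfg" | "file-tickets" | "report-only" | null;
type Severity = "P0" | "P1" | "P2" | "P3";
interface Finding { severity: Severity; confidence: number; }

const VALIDATOR_BUDGET = 15;

// Validation runs only for externalizing paths.
function validationRuns(mode: Mode, routing: Routing): boolean {
  if (mode === "headless" || mode === "autofix") return true;
  return mode === "interactive" && (routing === "lfg" || routing === "file-tickets");
}

// Severity ascending P0..P3, ties broken by anchor descending; findings
// past the cap are dropped with a Coverage note.
function selectForValidation(findings: Finding[]): { validate: Finding[]; dropped: Finding[] } {
  const rank: Record<Severity, number> = { P0: 0, P1: 1, P2: 2, P3: 3 };
  const sorted = [...findings].sort(
    (a, b) => rank[a.severity] - rank[b.severity] || b.confidence - a.confidence
  );
  return {
    validate: sorted.slice(0, VALIDATOR_BUDGET),
    dropped: sorted.slice(VALIDATOR_BUDGET),
  };
}
```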
---
- [ ] **Unit 5: Add PR-mode-only skip-condition pre-check in Stage 1**
**Goal:** Before the standard Stage 1 scope-detection runs in PR mode (PR number or URL provided), perform a cheap skip-condition check. Skip cleanly without dispatching reviewers if the PR is closed, draft, marked trivial/automated, or already reviewed by a prior Claude run.
**Requirements:** R6
**Dependencies:** None — this is a pre-stage gate, independent of the schema and synthesis changes.
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-code-review/SKILL.md` (Stage 1 PR/URL path, before the existing checkout step)
**Approach:**
- Add a sub-step at the top of the "PR number or GitHub URL is provided" branch in Stage 1.
- Run a single `gh pr view <number-or-url> --json state,isDraft,title,body,comments` call to fetch all skip-relevant data in one round trip.
- Apply skip rules:
- `state` is `CLOSED` or `MERGED` -> skip with message "PR is closed/merged; not reviewing."
- `isDraft` is `true` -> skip with message "PR is a draft; not reviewing. Re-invoke once it's marked ready."
- `title` matches a trivial-PR pattern (e.g., `^(chore\\(deps\\)|build\\(deps\\)|chore: bump|chore: release)`) AND body is empty/template-only -> skip with message "PR appears to be a trivial automated PR; not reviewing. Pass `mode:headless` or another explicit invocation if review is intended."
- `comments` includes any comment whose body starts with the ce-code-review report header (e.g., `## Code Review` or the headless completion line) -> skip with message "PR already has a ce-code-review report. To re-review, run from the branch (no PR target) or pass `base:<ref>` against the current checkout."
- Skip detection deliberately ignores commits-since-comment. Yes, this over-suppresses when new commits land after a prior review — the user's escape hatch is branch mode or `base:` mode, both of which bypass the PR-mode skip-check entirely. Simpler to detect and explain than commit-vs-comment timestamp logic, and the over-suppression cost is one extra command from the user.
- Skip cleanly: emit the message and stop without dispatching any reviewers or running scope detection.
- Standalone branch and `base:` modes are unaffected — they always run.
**Patterns to follow:**
- Anthropic's Step 1 in `commands/code-review.md` — same set of skip conditions
- Existing Stage 1 "uncommitted changes" check — same shape: probe state, emit message, stop early if conditions don't allow proceeding
**Test scenarios:**
- Happy path — PR is open, draft is false, title is normal, no prior Claude comment; skip-check passes, scope detection runs.
- Edge case — PR is closed; skip-check stops early with the closed message; no reviewers dispatched.
- Edge case — PR is draft; skip-check stops early with the draft message.
- Edge case — PR title is `chore(deps): bump foo from 1.0 to 1.1`; skip-check stops early with the trivial message.
- Edge case — PR has a prior ce-code-review report comment; skip-check stops early with the already-reviewed message regardless of subsequent commits.
- Negative — Standalone mode (no PR argument) does not run skip-check.
- Negative — `base:` mode does not run skip-check.
**Verification:**
- `tests/review-skill-contract.test.ts` skip-check assertions pass.
- A manual run against a closed PR exits cleanly without dispatching reviewers.
---
- [ ] **Unit 6: Add mode-aware false-positive demotion in Stage 5/6**
**Goal:** In Stage 5 (after merge, before validation), demote weak general-quality findings to soft buckets in interactive mode and suppress them in headless/autofix mode. The point is to surface the same content the personas produce, but route weak signal to `residual_risks` / `testing_gaps` / `advisory` rather than primary findings in interactive, and suppress entirely in externalizing modes.
**Requirements:** R4
**Dependencies:** Units 1, 3.
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-code-review/SKILL.md` (Stage 5 step ordering and Stage 6 rendering)
**Approach:**
- Define "weak general-quality finding" precisely: a finding where `severity` is P2 or P3, `autofix_class` is `advisory`, and the persona is `testing` or `maintainability` (the always-on personas most prone to general-quality flagging). This is the conservative definition; it can expand if practice shows other patterns.
- In Stage 5 (after merge, before partition), apply mode-aware demotion:
- **Interactive mode:** Move weak general-quality findings out of the primary findings list. If the finding is from `testing`, append the `title` + `why_it_matters` to `testing_gaps`. If from `maintainability`, append to `residual_risks`. The finding does not appear in the Stage 6 findings table.
- **Headless and autofix modes:** Suppress weak general-quality findings entirely. Record the suppressed count in Coverage.
- **Report-only mode:** Same as interactive — demote to soft buckets, do not suppress.
- Stage 6 rendering already shows `residual_risks` and `testing_gaps`; no template change needed for the demoted destinations. Update the Coverage section to report mode-aware suppressions/demotions distinctly from the existing confidence-gate suppressions.
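A minimal sketch of the demotion routing, with illustrative type and function names (the actual skill expresses this as orchestration prose in SKILL.md, not code):

```typescript
type Mode = "interactive" | "report-only" | "headless" | "autofix";

interface Finding {
  persona: string;
  severity: "P0" | "P1" | "P2" | "P3";
  autofix_class: "safe_auto" | "gated_auto" | "manual" | "advisory";
  title: string;
  why_it_matters: string;
}

interface Buckets {
  findings: Finding[];
  testing_gaps: string[];
  residual_risks: string[];
  suppressed: number; // reported in Coverage for headless/autofix
}

// The conservative "weak general-quality finding" definition from the plan.
const isWeak = (f: Finding) =>
  (f.severity === "P2" || f.severity === "P3") &&
  f.autofix_class === "advisory" &&
  (f.persona === "testing" || f.persona === "maintainability");

function demote(findings: Finding[], mode: Mode): Buckets {
  const out: Buckets = { findings: [], testing_gaps: [], residual_risks: [], suppressed: 0 };
  for (const f of findings) {
    if (!isWeak(f)) { out.findings.push(f); continue; }
    if (mode === "headless" || mode === "autofix") { out.suppressed++; continue; }
    const note = `${f.title}: ${f.why_it_matters}`;
    (f.persona === "testing" ? out.testing_gaps : out.residual_risks).push(note);
  }
  return out;
}
```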
**Test scenarios:**
- Happy path — Interactive mode, a `testing` persona produces a P3 advisory finding; after demotion it appears in `testing_gaps`, not the findings table.
- Happy path — Headless mode, the same finding is suppressed and counted in Coverage.
- Happy path — A `correctness` persona produces a P3 advisory finding; demotion does *not* apply (only `testing` and `maintainability` qualify under the conservative definition), and the finding appears in the findings table.
- Edge case — A `testing` persona produces a P0 finding; demotion does not apply (severity exceeds threshold).
- Edge case — A `maintainability` persona produces a P2 `safe_auto` finding; demotion does not apply (autofix_class is not `advisory`).
**Verification:**
- `tests/review-skill-contract.test.ts` mode-demotion assertions pass.
- Stage 6 output in interactive mode shows demoted findings in `testing_gaps`/`residual_risks`, not in the findings table.
---
- [ ] **Unit 7: Sweep persona files and update walkthrough/template/bulk-preview rendering**
**Goal:** Sweep all reviewer persona files for hardcoded float references (e.g., a persona that says "always file SQL injection at 0.85+") and rewrite them as anchors. Update rendering surfaces to display anchors as integers consistently.
**Requirements:** R7
**Dependencies:** Units 1, 2.
**Files:**
- Modify: `plugins/compound-engineering/agents/ce-correctness-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-testing-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-maintainability-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-project-standards-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-security-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-performance-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-api-contract-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-data-migrations-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-reliability-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-adversarial-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-cli-readiness-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-previous-comments-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-dhh-rails-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-kieran-rails-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-kieran-python-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-kieran-typescript-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-julik-frontend-races-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/ce-swift-ios-reviewer.agent.md` — explicit float bands at lines 75/77/79 (`0.80+` -> anchor 75/100; `0.60-0.79` -> anchor 50; `below 0.60` -> anchor 0/25)
- Modify: `plugins/compound-engineering/skills/ce-code-review/references/review-output-template.md`
- Modify: `plugins/compound-engineering/skills/ce-code-review/references/walkthrough.md`
- Modify: `plugins/compound-engineering/skills/ce-code-review/references/bulk-preview.md`
- Modify: `plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md` (verify no float references remain; no behavioral changes needed)
**Approach:**
- For each persona file: grep for confidence references (`0.\\d`, "0.6", "0.7", etc.) and rewrite to use anchors. Most personas rely on the template and won't need changes; the sweep catches outliers.
- For each rendering surface: update confidence-column rendering from float (`0.85`) to integer-anchor (`75` or `100`). Update walkthrough per-finding block to show anchor.
- For `persona-catalog.md`: no behavioral changes needed; selection rules are unchanged. Verify no float references remain.
- For `review-output-template.md`: update the Confidence column header/format if needed.
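The sweep's exit criterion can be sketched as a regex check. This is an approximation of the grep described in the approach; the pattern is illustrative and may need tuning against real persona text.

```typescript
// Matches float-style confidence values like 0.6, 0.85, 0.92 but not
// anchor integers (75, 100) or version strings (1.0, 1.1).
const FLOAT_CONFIDENCE = /\b0\.\d{1,2}\b/;

function hasFloatConfidence(body: string): boolean {
  return FLOAT_CONFIDENCE.test(body);
}
```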
**Test scenarios:**
- Edge case — Grep for float-confidence references across `agents/` returns nothing after the sweep.
- Happy path — Walkthrough rendering for a finding shows `Confidence: 75` (integer), not `Confidence: 0.85`.
- Happy path — Bulk-preview rendering uses anchor format consistently with walkthrough.
- Happy path — Findings table in `review-output-template.md` shows anchor as integer.
**Verification:**
- No hardcoded float confidence values remain in `plugins/compound-engineering/agents/` or `plugins/compound-engineering/skills/ce-code-review/references/`.
- All rendering surfaces use anchor integers consistently.
---
- [ ] **Unit 8: Update test fixtures and contract tests**
**Goal:** Update `tests/review-skill-contract.test.ts` to assert the new schema, synthesis behavior, validation stage, skip-conditions, and mode-aware demotion. Update or add fixtures with anchor values.
**Requirements:** R8
**Dependencies:** Units 1-6 (the behavior those units produce must already be implemented so the tests pass).
**Files:**
- Modify: `tests/review-skill-contract.test.ts`
- Modify: `tests/fixtures/` (any seeded ce-code-review fixtures with embedded confidence values; check `tests/fixtures/ce-code-review/` if present, or `tests/fixtures/sample-plugin/` if shared)
**Execution note:** Mirror the test additions ce-doc-review's commit `6caf3303` made to `tests/pipeline-review-contract.test.ts` (73 lines added). The pattern is established; copy the structure.
**Approach:**
- Add schema-shape assertions: `confidence` is integer enum, `_meta.confidence_thresholds` describes the new gates.
- Add synthesis assertions: `>= 75` gate, P0 exception at 50, one-anchor promotion, anchor-descending sort.
- Add validation-stage assertions: mode-conditional dispatch, finding survives on validator approval, finding drops on validator rejection or timeout, budget cap drops overflow with Coverage note.
- Add skip-condition assertions: closed/draft/trivial/already-reviewed cases stop early; standalone and `base:` modes do not skip-check.
- Add mode-aware demotion assertions: `testing` P3 advisory in interactive lands in `testing_gaps`; same finding in headless is suppressed.
- Update fixtures with embedded confidence values from float to anchor integers. Convert by behavior: `0.85` -> `75` if "real, will hit in practice"; `0.92+` -> `100` if "verifiable from code."
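The synthesis assertions could take roughly this shape (hypothetical finding shape; the real assertions live in `tests/review-skill-contract.test.ts`):

```typescript
const ANCHORS = [0, 25, 50, 75, 100];

// Asserts every surfaced finding carries an anchor value, passed the
// >= 75 gate (or the P0 exception at 50), and the list is anchor-descending.
function assertAnchoredAndSorted(findings: { confidence: number; severity: string }[]): void {
  for (const f of findings) {
    if (!ANCHORS.includes(f.confidence)) throw new Error(`non-anchor confidence: ${f.confidence}`);
    const survives = f.confidence >= 75 || (f.severity === "P0" && f.confidence >= 50);
    if (!survives) throw new Error("finding below gate leaked into output");
  }
  for (let i = 1; i < findings.length; i++) {
    if (findings[i].confidence > findings[i - 1].confidence) throw new Error("not anchor-descending");
  }
}
```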
**Test scenarios:**
- (Implicit — this unit *is* the test scenarios for prior units.)
**Verification:**
- `bun test` passes with all new assertions.
- `bun run release:validate` passes.
- A targeted test run against a known-bad fixture (a finding the old gate would have surfaced and the new gate should suppress) demonstrates the behavior change.
---
- [ ] **Unit 9: Document the migration in `docs/solutions/`**
**Goal:** Extend the existing ce-doc-review writeup with a ce-code-review section. Capture the threshold-divergence rationale (why `>= 75` for code review vs `>= 50` for doc review), the validation-stage rationale, and the mode-aware policy framing.
**Requirements:** R9
**Dependencies:** Units 1-8 (document what was actually built).
**Files:**
- Modify: `docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md` (add ce-code-review section)
- Optionally split if the file becomes too long: create `docs/solutions/skill-design/code-review-precision-and-validation-2026-04-2X.md` (use today's date)
**Approach:**
- Add a "ce-code-review migration" section after the existing ce-doc-review content.
- Document:
- Threshold choice (`>= 75`) and why it differs from ce-doc-review's `>= 50`. Both pick the anchor as the threshold; doc review surfaces broadly because dismissal is cheap, code review surfaces narrowly because false positives erode trust.
- The validation stage and its scope (externalization only). Reference Anthropic's Step 5 as the design pattern; explain why the upstream's binary validated/not-validated gate is more important than the numeric score.
- Mode-aware false-positive policy and the "demote-not-suppress" rule for interactive mode.
- The lint-ignore suppression rule.
- Link to the ce-code-review SKILL.md and findings-schema.json.
- Add a "When to apply this pattern to a new skill" section so future skill authors know when an anchored rubric + validation gate makes sense vs when continuous confidence is fine.
**Test scenarios:**
- Test expectation: none -- documentation update.
**Verification:**
- The doc reads coherently for someone who hasn't seen either codebase. A new contributor can use it to understand both ce-doc-review and ce-code-review's confidence handling.
- The "when to apply" guidance is concrete enough to be actionable.
## System-Wide Impact
- **Interaction graph:** Stage 5b (new) sits between Stage 5 (merge) and Stage 6 (synthesis). Stage 1 PR-mode path gains a pre-stage skip-check that may exit early. Both interaction-graph changes are localized to ce-code-review; they do not affect callers (`ce-work`, `lfg`, `slfg`, `ce-polish-beta`).
- **Error propagation:** Validator subagent failures (timeout, malformed output, dispatch error) drop the finding rather than abort the review. A failed validator does not block the review; it just means one finding doesn't externalize. Conservative bias is correct.
- **State lifecycle risks:** None. The plan is in-memory orchestration changes; no persistent state migrations. Run-artifact JSON files on disk are unchanged in shape — no new fields. Validator drop count is reported in Coverage but does not appear in the artifact schema.
- **API surface parity:** Headless output envelope is unchanged in shape. The validator's effect is that fewer findings appear in the envelope when validation runs (rejected ones drop out). No new markers; no schema change for downstream consumers.
- **Integration coverage:** Cross-skill: `ce-polish-beta` reads ce-code-review run artifacts; the artifact format is unchanged so no compat work is needed. `ce-work` invokes ce-code-review in headless mode; verify the new validation stage doesn't break the headless contract (it shouldn't — the contract is the envelope shape, which is unchanged).
- **Unchanged invariants:** Severity taxonomy (P0-P3), `autofix_class` enum (`safe_auto`/`gated_auto`/`manual`/`advisory`), `owner` enum (`review-fixer`/`downstream-resolver`/`human`/`release`), persona selection logic, scope-resolution paths, run-artifact directory layout and shape, mode definitions (interactive/autofix/report-only/headless), Stage 6 section ordering. The `pre_existing` field semantics are unchanged.
## Risks & Dependencies
| Risk | Mitigation |
|------|------------|
| Validation stage adds significant latency to externalizing modes | Validation runs in parallel across findings (one subagent per finding). Most reviews surface < 10 findings; parallel mid-tier dispatch is bounded. Headless/autofix users have already accepted the multi-agent latency cost; a few-second add for validation is proportionate. |
| Validator subagent itself produces false negatives (rejects real findings) | Validator failure mode is "drop the finding" — same as our existing confidence-gate suppression. Conservative bias is correct for externalizing modes (better to miss a real finding than post a false one publicly). For interactive walk-through mode, validation is skipped, so per-finding human judgment can still surface borderline findings. |
| Mode-aware demotion in Unit 6 is too narrow (only `testing` and `maintainability`) and lets weak findings from other personas pollute primary results | The conservative definition is intentional. Practice from real review runs will reveal which other personas overproduce weak findings; expand the definition incrementally with evidence rather than guessing. |
| Skip-condition's "trivial PR pattern" misclassifies a non-trivial PR with a chore-style title | The pattern is conservative (`^(chore\\(deps\\)|build\\(deps\\)|chore: bump|chore: release)` requires a colon-prefixed convention). Hand-typed informal commits won't match. If a real PR is misclassified, the user can re-invoke from the branch (no PR target) or with `base:` to bypass the skip-check. Document this in the skip message. |
| Threshold change from 0.60 to 75 is conceptually a stricter gate; some currently-surfaced findings will disappear | This is the desired behavior — stricter gates are the point. A safety net: P0 exception at anchor 50 ensures critical-but-uncertain issues still surface. Monitor real review runs for regressions in the first week and tune the gate or expand the P0 exception if needed. |
| Validator dispatch budget cap (15) drops findings beyond the limit | Drop is loud, not silent: Coverage section reports the over-budget count so the user knows to follow up if a 15+ finding review surfaces. Cap is a worst-case safety bound; typical reviews never hit it. If real-world data shows reviews routinely exceed 15, raise the cap or re-evaluate as second-wave logic. |
| Sequencing this plan against other in-flight ce-code-review work | Branch is `tmchow/review-skill-compare`. No other in-flight ce-code-review PRs noted. Coordinate with anyone working on `ce-polish-beta` (downstream consumer) before the artifact-format change lands. |
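The budget-cap behavior in the risk table is simple enough to sketch (illustrative names; the cap value comes from the plan):

```typescript
const VALIDATOR_BUDGET = 15; // worst-case safety bound on validator dispatches

function applyBudget<T>(findings: T[]): { dispatched: T[]; overBudget: number } {
  const dispatched = findings.slice(0, VALIDATOR_BUDGET);
  // overBudget is reported loudly in the Coverage section, never dropped silently.
  return { dispatched, overBudget: findings.length - dispatched.length };
}
```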
## Documentation / Operational Notes
- Update `plugins/compound-engineering/README.md` if the ce-code-review entry mentions confidence scoring specifics (likely not — most plugin READMEs don't cover internal scoring mechanics).
- The `docs/solutions/skill-design/` writeup (Unit 9) is the primary documentation deliverable.
- Run `bun run release:validate` after Unit 8 to confirm marketplace parity and counts.
- No version bump in plugin manifests — release-please owns this. The work is a `refactor(ce-code-review):` commit (per repo convention).
- After merge, watch the next few real ce-code-review runs in interactive and headless mode to confirm: (a) anchor distribution is sensible, (b) validation stage isn't dropping too many real findings, (c) skip-conditions don't misclassify legitimate PRs, (d) mode-aware demotion produces useful `testing_gaps`/`residual_risks` content.
## Sources & References
- **Origin conversation:** Two-model comparative analysis of ce-code-review vs Anthropic's official `code-review` plugin (this conversation, 2026-04-21). No formal `docs/brainstorms/` document — the conversation itself is the requirements input.
- **Prior plan (sister skill, established pattern):** `docs/plans/2026-04-21-001-refactor-ce-doc-review-anchored-confidence-scoring-plan.md` — explicit "Deferred to Separate Tasks" entry naming this work.
- **Institutional learning:** `docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md` — canonical writeup of the anchored-rubric pattern.
- **Reference commit:** `6caf3303 refactor(ce-doc-review): anchor-based confidence scoring (#622)` — the migration diff for the sister skill.
- **External canonical reference:** `https://github.com/anthropics/claude-code/blob/main/plugins/code-review/commands/code-review.md` — Anthropic's command prompt is the authoritative source for skip-conditions, validation-stage design, and lint-ignore semantics. The README at `https://github.com/anthropics/claude-code/blob/main/plugins/code-review/README.md` is product description only — the command prompt is the real behavior.
- **Files modified by this plan:** `plugins/compound-engineering/skills/ce-code-review/SKILL.md`, `plugins/compound-engineering/skills/ce-code-review/references/findings-schema.json`, `plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md`, `plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md`, `plugins/compound-engineering/skills/ce-code-review/references/review-output-template.md`, `plugins/compound-engineering/skills/ce-code-review/references/walkthrough.md`, `plugins/compound-engineering/skills/ce-code-review/references/bulk-preview.md`, `plugins/compound-engineering/skills/ce-code-review/references/validator-template.md` (new), all `plugins/compound-engineering/agents/ce-*-reviewer.agent.md` files (including the recently-added `ce-swift-ios-reviewer.agent.md`), `tests/review-skill-contract.test.ts`, `tests/fixtures/` (as needed), `docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md`.


```diff
@@ -85,7 +85,7 @@ Filter at `>= 50` for doc review; let the routing menu handle volume. Dismissing
 ## When NOT to port directly
-- Code review workflows (e.g., `ce-code-review`) have linter backstops and public-comment costs. Port the rubric structure, but tune the threshold higher (`>= 75` or `>= 80` per Anthropic). This is out of scope for the ce-doc-review migration; evaluate separately.
+- Code review workflows have linter backstops and public-comment costs. Port the rubric structure, but tune the threshold higher (`>= 75`). See the "ce-code-review migration" section below for the completed port.
 - High-throughput pipelines where the `25` anchor ("couldn't verify") represents most findings. Dropping everything below `50` may be too aggressive; consider surfacing `25` as "needs human triage" instead.
 ## Migration history
```
@@ -121,7 +121,88 @@ Re-evaluation with the tightened criterion shifted cli-printing-press from 21 De
- **Harness drift between iterations.** Iteration-1 orchestrators dispatched parallel persona subagents; iteration-2 orchestrators executed personas inline (nested Agent tool unavailable in that session). This affected side metrics (proposed-fix count on cli-printing-press iteration-2 dropped 15 → 4, likely harness-driven rather than tweak-driven) but did not obscure the tweak's core effect, which was large-magnitude.
- **No absolute-calibration ground truth.** The evaluation measured the migration's stated failure modes disappearing. Whether an anchor-75 finding literally hits 75% of the time remains unmeasured; no labeled doc-review corpus exists.
## ce-code-review migration (2026-04-21)
Ported the same anchored-rubric structure into `ce-code-review` and bundled it with three additional code-review-specific precision controls plus a PR-mode skip-condition pre-check. The two skills now share calibration discipline but diverge on the threshold and on how independent verification is implemented.
### Threshold: `>= 75` (not `>= 50` like ce-doc-review, not `>= 80` like Anthropic)
ce-code-review uses anchor 75 as the gate. P0 findings escape at anchor 50.
`>= 75` matches the ce-doc-review choice of using the anchor itself as the threshold (no awkward middle-bucket gap). At `>= 75`, anchors 75 ("real, will hit in practice") and 100 ("verifiable from code alone") survive; anchors 0/25/50 are dropped. Anthropic's `>= 80` under a discrete `{0,25,50,75,100}` scale would collapse to "anchor 100 only," which is too narrow — it would silence findings where personas can construct the trace but cannot literally read the bug off the code.
The threshold divergence from ce-doc-review (`>= 50`) is correct for the same reasons documented in the "Why the threshold diverges from Anthropic" section above, applied in reverse: code review HAS a linter backstop, IS publicly visible, and code claims ARE ground-truth verifiable. Code review wants narrow precision; doc review wants broad surfacing.
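A minimal sketch of the gate plus the one-anchor agreement promotion that replaces the old `+0.10` boost (names are illustrative; the skill states these rules in SKILL.md prose):

```typescript
const ANCHORS = [0, 25, 50, 75, 100] as const;
type Anchor = (typeof ANCHORS)[number];

// Cross-reviewer agreement promotes a finding one anchor step, capped at 100.
function promoteOneAnchor(a: Anchor): Anchor {
  const i = ANCHORS.indexOf(a);
  return ANCHORS[Math.min(i + 1, ANCHORS.length - 1)];
}

// The >= 75 gate, with the P0 escape at anchor 50.
function survivesGate(confidence: Anchor, severity: string): boolean {
  return confidence >= 75 || (severity === "P0" && confidence >= 50);
}
```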
### Validation pass (Stage 5b): the deferred follow-up, now landed
The ce-doc-review plan deferred a "neutral-scorer second pass" to a follow-up plan. ce-code-review implements it as **Stage 5b**: an independent validator sub-agent per surviving finding, mode-conditional dispatch, and a 15-finding budget cap.
- **Why now for code review, not doc review:** code review has externalizing modes (autofix applies fixes, headless returns findings to programmatic callers) where false positives have real cost — wrong fixes get committed, downstream automation acts on bad signal. Doc review's worst case is a noisy report a user dismisses with a keystroke; code review's worst case is a wrong-fix PR getting merged.
- **Mode-conditional dispatch:** validation runs in `headless`, `autofix`, and the interactive LFG/File-tickets routing paths. It is skipped in interactive walk-through (the human is the per-finding validator) and report-only (nothing is being externalized). This scopes cost to the cases where false positives have real cost.
- **Per-finding parallel dispatch, not batched:** independence is the design point. A single batched validator looking at all findings together pattern-matches across them and recreates the persona-bias problem we are escaping. Per-file batching is left as a future optimization for reviews with many findings clustered in few files.
- **No `validated` field on findings:** an early plan added a `validated: boolean` field; it was removed during planning. Surviving findings post-validation are validated by definition (rejected ones are dropped); in modes where validation does not run, the run's mode tells consumers everything they need. A field constant within any mode does no work.
- **Conservative failure mode:** validator timeout, malformed output, or dispatch error → drop the finding. Unverified findings should not externalize.
The validator's protocol is `{ "validated": true | false, "reason": "<one sentence>" }` answering three questions: is the issue real, is it introduced by THIS diff, and is it not handled elsewhere. Template: `references/validator-template.md`.
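The orchestrator-side handling of that protocol can be sketched as follows (illustrative function; per the conservative failure mode, any reply that is not a well-formed approval drops the finding):

```typescript
// A finding survives Stage 5b only on a well-formed { validated: true } reply.
// Timeouts, malformed JSON, and missing fields are all treated like rejection.
function findingSurvives(rawValidatorOutput: string): boolean {
  try {
    const parsed = JSON.parse(rawValidatorOutput);
    return (
      parsed !== null &&
      typeof parsed === "object" &&
      parsed.validated === true &&
      typeof parsed.reason === "string"
    );
  } catch {
    return false; // malformed output drops the finding
  }
}
```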
### Mode-aware false-positive demotion
ce-code-review's broader persona surface (17 reviewers vs ce-doc-review's 7) means more weak general-quality signal. The higher threshold already enforces stricter precision in externalizing modes; interactive mode uses a different policy: route weak findings to existing soft buckets (`testing_gaps`, `residual_risks`, `advisory`) rather than suppress them.
The demotion rule is intentionally narrow: severity P2 or P3, `autofix_class` advisory, contributing reviewer is `testing` or `maintainability`. Headless and autofix suppress these entirely; interactive and report-only demote them to soft buckets where they remain visible without competing for primary-findings attention.
This is the "tier the precision bar by mode" framing. Synthesis owns it; personas don't change what they flag based on mode.
### Lint-ignore suppression
When code carries an explicit lint-disable comment for the rule a reviewer is about to flag (`eslint-disable-next-line no-unused-vars`, `# rubocop:disable Style/...`, `# noqa: E501`, etc.), suppress the finding unless the suppression itself violates a project-standards rule. The author already chose to suppress; re-flagging via a different reviewer creates noise and ignores their decision.
This is the only entirely new false-positive category in ce-code-review's catalog; the rest were ported from the existing pre-anchor catalog.
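A sketch of what the lint-ignore check amounts to (illustrative function; the patterns cover only the markers named in the catalog, and real detection would need per-linter nuance such as same-line vs next-line disables):

```typescript
// Does the flagged line, or the line directly above it, carry an explicit
// disable comment for the rule being flagged?
function hasLintIgnore(lines: string[], flaggedIndex: number, rule: string): boolean {
  const nearby = [lines[flaggedIndex], lines[flaggedIndex - 1]].filter(
    (l): l is string => l !== undefined
  );
  return nearby.some(
    (l) =>
      l.includes(`eslint-disable-next-line ${rule}`) ||
      l.includes(`rubocop:disable ${rule}`) ||
      l.includes(`noqa: ${rule}`)
  );
}
```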
### PR-mode skip-condition pre-check
Before running the full review on a PR, a single `gh pr view` call probes for skip conditions:
- Closed or merged PR
- Draft PR
- Trivial automated PR (conservative `chore(deps)` / `build(deps)` / release-bump pattern with empty body)
- Already has a ce-code-review report comment
Skip cleanly without dispatching reviewers. Standalone branch and `base:` modes always run — the skip-check is PR-mode only. Already-reviewed detection deliberately ignores commits-since-comment; the escape hatch for "I want to re-review after pushing more commits" is branch mode or `base:` mode, both of which bypass the skip-check entirely.
This avoids the wasted multi-agent review cost on PRs that should not be reviewed (closed, draft, dependabot-style, or already-reviewed). It is the cheapest mechanism in this migration and disproportionately valuable for any team that runs the skill against arbitrary PR queues.
### Files
- `plugins/compound-engineering/skills/ce-code-review/references/findings-schema.json``confidence` is integer enum `[0, 25, 50, 75, 100]` with code-review-specific behavioral definitions in the description; `_meta.confidence_anchors` and `_meta.confidence_thresholds` document the anchors and `>= 75` gate
- `plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md` — verbatim 5-anchor rubric with code-review framing, expanded false-positive catalog including lint-ignore rule, hard schema-conformance constraints rejecting floats
- `plugins/compound-engineering/skills/ce-code-review/references/validator-template.md` — Stage 5b validator subagent prompt
- `plugins/compound-engineering/skills/ce-code-review/SKILL.md` — Stage 5 anchor gate and one-anchor promotion (replaces `+0.10`), Stage 5 step 7c mode-aware demotion, Stage 5b validation pass with budget cap, Stage 1 PR-mode skip-condition pre-check, After-Review options B and C invoke validation before externalizing
- `plugins/compound-engineering/agents/ce-*-reviewer.agent.md` — 18 personas updated from float bands to anchored language, preserving each persona's specific calibration signal
- `plugins/compound-engineering/skills/ce-code-review/references/review-output-template.md` — Confidence column renders as integer (`75`, `100`), not float
- `tests/review-skill-contract.test.ts` — schema, synthesis, validation pass, skip-conditions, mode-aware demotion, and per-persona anchored-language assertions
### When to apply this combined pattern to a new skill
Apply the full bundle (anchored rubric + validation pass + mode-aware demotion + skip-conditions + lint-ignore) when **all** of:
1. The skill is a multi-persona review workflow producing structured findings.
2. The skill has externalizing modes — outputs that get acted on without further human review (PR comments, autofix, downstream automation, headless callers).
3. The skill is invoked frequently enough that wasted runs are visible (skip-conditions are pure win in this case; modest cost in low-volume cases).
Apply only the **anchored rubric** (the ce-doc-review subset) when:
- The skill is single-shot or dismissal is cheap via UI/menu — validation pass adds cost without protecting anything that wasn't already going to be triaged by a human.
- The skill operates on premise/strategy claims that lack ground-truth verification — anchor 100 is unreachable; threshold should be `>= 50`.
Skip the entire pattern when:
- The skill produces a single value, not a population of findings.
- The skill operates on user input where the user IS the source of truth (e.g., interactive Q&A skills).
### Migration history (ce-code-review)
Landed in a single PR with anchored rubric, validation pass, skip-conditions, mode-aware demotion, lint-ignore suppression, and persona sweep all together. The schema change is the load-bearing commit; subagent template, synthesis, and persona updates consume it. Branch: `refactor/ce-code-review-precision-and-validation`. The plan with full decision rationale lives at `docs/plans/2026-04-21-002-refactor-ce-code-review-precision-and-validation-plan.md`.
## Deferred follow-ups
- Port the pattern to `ce-code-review` with a code-review-appropriate threshold (done; see the ce-code-review migration section above)
- Evaluate a neutral-scorer second pass (a cheap agent that re-scores findings independent of the producing persona) once the anchor rubric has been observed in practice
- **PR inline comment posting mode for ce-code-review.** Anthropic's plugin posts findings as inline GitHub PR comments via `mcp__github_inline_comment__create_inline_comment` with full-SHA link discipline and committable suggestion blocks. ce-code-review currently has no PR-comment mode at all (terminal output, fixer auto-apply, or headless return only). Real workflow gap; deferred because it is a substantial new mode (link format, suggestion-block handling, deduplication semantics, tracker integration overlap).
- **Per-file validator batching.** When real-world reviews routinely surface many findings clustered in few files (large refactors), a per-file validator that reads the file once and evaluates all findings against it could meaningfully reduce cost while preserving cross-file independence. Implement when data shows the saving matters.
- **Haiku-tier orchestrator-side checks.** ce-code-review currently uses sonnet for all subagent dispatch including the cheap PR skip-condition probe. Push obvious cheap checks (skip-conditions, standards path discovery) to haiku.
- **Re-evaluate which always-on personas earn their noise.** ce-code-review keeps `testing` and `maintainability` always-on with mode-aware demotion as the safety valve. If real review runs show the demotion is firing constantly, consider making them conditional rather than always-on.


```diff
@@ -68,11 +68,15 @@ Find legitimate-seeming usage patterns that cause bad outcomes. These are not se
 ## Confidence calibration
-Your confidence should be **high (0.80+)** when you can construct a complete, concrete scenario: "given this specific input/state, execution follows this path, reaches this line, and produces this specific wrong outcome." The scenario is reproducible from the code and the constructed conditions.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
-Your confidence should be **moderate (0.60-0.79)** when you can construct the scenario but one step depends on conditions you can see but can't fully confirm -- e.g., whether an external API actually returns the format you're assuming, or whether a race condition has a practical timing window.
+**Anchor 100** — the failure scenario is mechanically constructible: every step in the chain is verifiable from the diff and surrounding code, no assumed runtime conditions.
-Your confidence should be **low (below 0.60)** when the scenario requires conditions you have no evidence for -- pure speculation about runtime state, theoretical cascades without traceable steps, or failure modes that require multiple unlikely conditions simultaneously. Suppress these.
+**Anchor 75** — you can construct a complete, concrete scenario: "given this specific input/state, execution follows this path, reaches this line, and produces this specific wrong outcome." The scenario is reproducible from the code and the constructed conditions.
+**Anchor 50** — you can construct the scenario but one step depends on conditions you can see but can't fully confirm — e.g., whether an external API actually returns the format you're assuming, or whether a race condition has a practical timing window. Surfaces only as P0 escape or soft buckets.
+**Anchor 25 or below — suppress** — the scenario requires conditions you have no evidence for: pure speculation about runtime state, theoretical cascades without traceable steps, or failure modes that require multiple unlikely conditions simultaneously.
 ## What you don't flag
```


@@ -138,11 +138,15 @@ If an action looks like it belongs on this list but you are not sure, flag it as
## Confidence Calibration
**High (0.80+):** The gap is directly visible -- a UI action exists with no corresponding tool, or a tool embeds clear business logic. Traceable from the code alone.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
**Moderate (0.60-0.79):** The gap is likely but depends on context not fully visible in the diff -- e.g., whether a system prompt is assembled dynamically elsewhere.
**Anchor 100** — the gap is mechanically verifiable: a new UI button with no matching tool registration, a tool definition that literally contains business-logic branching.
**Low (below 0.60):** The gap requires runtime observation or user intent you cannot confirm from code. Suppress these.
**Anchor 75** — the gap is directly visible — a UI action exists with no corresponding tool, or a tool embeds clear business logic. Traceable from the code alone.
**Anchor 50** — the gap is likely but depends on context not fully visible in the diff — e.g., whether a system prompt is assembled dynamically elsewhere. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the gap requires runtime observation or user intent you cannot confirm from code.
## Output Format


@@ -21,11 +21,15 @@ You are an API design and contract stability expert who evaluates changes throug
## Confidence calibration
Your confidence should be **high (0.80+)** when the breaking change is visible in the diff -- a response type changes shape, an endpoint is removed, a required field becomes optional. You can point to the exact line where the contract changes.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the contract impact is likely but depends on how consumers use the API -- e.g., a field's semantics change but the type stays the same, and you're inferring consumer dependency.
**Anchor 100** — the breaking change is mechanical: an endpoint route deleted, a required field's name changed in the response schema, a type signature with a new required parameter.
Your confidence should be **low (below 0.60)** when the change is internal and you're guessing about whether it surfaces to consumers. Suppress these.
**Anchor 75** — the breaking change is visible in the diff — a response type changes shape, an endpoint is removed, a required field becomes optional. You can point to the exact line where the contract changes.
**Anchor 50** — the contract impact is likely but depends on how consumers use the API — e.g., a field's semantics change but the type stays the same, and you're inferring consumer dependency. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the change is internal and you're guessing about whether it surfaces to consumers.
## What you don't flag


@@ -41,11 +41,15 @@ Cap findings at 5-7 per review. Focus on the highest-severity issues for the det
## Confidence calibration
Your confidence should be **high (0.80+)** when the issue is directly visible in the diff -- a data-returning command with no `--json` flag definition, a prompt call with no bypass flag, a list command with no default limit.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the pattern is present but context beyond the diff might resolve it -- e.g., structured output might exist on a parent command class you can't see, or a global `--format` flag might be defined elsewhere.
**Anchor 100** — the violation is verifiable from the diff: a command literally has no `--json` definition and prints free-form text, a prompt call with no bypass flag definition.
Your confidence should be **low (below 0.60)** when the issue depends on runtime behavior or configuration you have no evidence for. Suppress these.
**Anchor 75** — the issue is directly visible in the diff — a data-returning command with no `--json` flag definition, a prompt call with no bypass flag, a list command with no default limit.
**Anchor 50** — the pattern is present but context beyond the diff might resolve it — e.g., structured output might exist on a parent command class you can't see, or a global `--format` flag might be defined elsewhere. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the issue depends on runtime behavior or configuration you have no evidence for.
## What you don't flag


@@ -21,11 +21,15 @@ You are a logic and behavioral correctness expert who reads code by mentally exe
## Confidence calibration
Your confidence should be **high (0.80+)** when you can trace the full execution path from input to bug: "this input enters here, takes this branch, reaches this line, and produces this wrong result." The bug is reproducible from the code alone.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the bug depends on conditions you can see but can't fully confirm -- e.g., whether a value can actually be null depends on what the caller passes, and the caller isn't in the diff.
**Anchor 100** — the bug is verifiable from the code alone with zero interpretation: a definitive logic error (off-by-one in a tested algorithm, wrong return type, swapped arguments) or a compile/type error. The execution trace is mechanical.
Your confidence should be **low (below 0.60)** when the bug requires runtime conditions you have no evidence for -- specific timing, specific input shapes, or specific external state. Suppress these.
**Anchor 75** — you can trace the full execution path from input to bug: "this input enters here, takes this branch, reaches this line, and produces this wrong result." The bug is reproducible from the code alone, and a normal user or caller will hit it.
**Anchor 50** — the bug depends on conditions you can see but can't fully confirm — e.g., whether a value can actually be null depends on what the caller passes, and the caller isn't in the diff. Surfaces only as P0 escape or via soft-bucket routing.
**Anchor 25 or below — suppress** — the bug requires runtime conditions you have no evidence for: specific timing, specific input shapes, specific external state.
## What you don't flag


@@ -25,11 +25,15 @@ You are a data integrity and migration safety expert who evaluates schema change
## Confidence calibration
Your confidence should be **high (0.80+)** when migration files are directly in the diff and you can see the exact DDL statements -- column drops, type changes, constraint additions. The risk is concrete and visible.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when you're inferring data impact from application code changes -- e.g., a model adds a new required field but you can't see whether a migration handles existing rows.
**Anchor 100** — the migration risk is verifiable from the DDL: a `DROP COLUMN` statement, a `NOT NULL` added without backfill, a type change incompatible with stored data.
Your confidence should be **low (below 0.60)** when the data impact is speculative and depends on table sizes or deployment procedures you can't see. Suppress these.
**Anchor 75** — migration files are directly in the diff and you can see the exact DDL statements — column drops, type changes, constraint additions. The risk is concrete and visible.
**Anchor 50** — you're inferring data impact from application code changes — e.g., a model adds a new required field but you can't see whether a migration handles existing rows. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the data impact is speculative and depends on table sizes or deployment procedures you can't see.
## What you don't flag


@@ -19,11 +19,15 @@ You are David Heinemeier Hansson (DHH), the creator of Ruby on Rails, reviewing
## Confidence calibration
Your confidence should be **high (0.80+)** when the anti-pattern is explicit in the diff -- a repository wrapper over Active Record, JWT/session replacement, a service layer that merely forwards Rails behavior, or a frontend abstraction that duplicates what Turbo already provides.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the code smells un-Rails-like but there may be repo-specific constraints you cannot see -- for example, a service object that might exist for cross-app reuse or an API boundary that may be externally required.
**Anchor 100** — the anti-pattern is verbatim from a known un-Rails playbook: a Repository class wrapping ActiveRecord with no added behavior, a JWT-session class with `def encode/decode` mirroring `session[:user_id]`.
Your confidence should be **low (below 0.60)** when the complaint would mostly be philosophical or when the alternative is debatable. Suppress these.
**Anchor 75** — the anti-pattern is explicit in the diff — a repository wrapper over Active Record, JWT/session replacement, a service layer that merely forwards Rails behavior, or a frontend abstraction that duplicates what Turbo already provides.
**Anchor 50** — the code smells un-Rails-like but there may be repo-specific constraints you cannot see — for example, a service object that might exist for cross-app reuse or an API boundary that may be externally required. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the complaint would mostly be philosophical or the alternative is debatable.
## What you don't flag


@@ -20,11 +20,15 @@ You are Julik, a seasoned full-stack developer reviewing frontend code through t
## Confidence calibration
Your confidence should be **high (0.80+)** when the race is traceable from the code -- for example, an interval is created with no teardown, a controller schedules async work after disconnect, or a second interaction can obviously start before the first one finishes.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the race depends on runtime timing you cannot fully force from the diff, but the code clearly lacks the guardrails that would prevent it.
**Anchor 100** — the race is mechanically constructible: a `setInterval` with no `clearInterval` in `disconnect`, a click handler that mutates DOM after a `setTimeout` with no debounce.
Your confidence should be **low (below 0.60)** when the concern is mostly speculative or would amount to frontend superstition. Suppress these.
**Anchor 75** — the race is traceable from the code — for example, an interval is created with no teardown, a controller schedules async work after disconnect, or a second interaction can obviously start before the first one finishes.
**Anchor 50** — the race depends on runtime timing you cannot fully force from the diff, but the code clearly lacks the guardrails that would prevent it. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the concern is mostly speculative or would amount to frontend superstition.
## What you don't flag


@@ -20,11 +20,15 @@ You are Kieran, a super senior Python developer with impeccable taste and an exc
## Confidence calibration
Your confidence should be **high (0.80+)** when the missing typing, structural problem, or regression risk is directly visible in the touched code -- for example, a new public function without annotations, catch-and-continue behavior, or an extraction that clearly worsens readability.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the issue is real but partially contextual -- whether a richer data model is warranted, whether a module crossed the complexity line, or whether an exception path is truly harmful in this codebase.
**Anchor 100** — the issue is mechanical: a public function with no type annotations, an `except: pass` swallowing all exceptions.
Your confidence should be **low (below 0.60)** when the finding would mostly be a style preference or depends on conventions you cannot confirm from the diff. Suppress these.
**Anchor 75** — the missing typing, structural problem, or regression risk is directly visible in the touched code — for example, a new public function without annotations, catch-and-continue behavior, or an extraction that clearly worsens readability.
**Anchor 50** — the issue is real but partially contextual — whether a richer data model is warranted, whether a module crossed the complexity line, or whether an exception path is truly harmful in this codebase. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the finding would mostly be a style preference or depends on conventions you cannot confirm from the diff.
## What you don't flag


@@ -20,11 +20,15 @@ You are Kieran, a senior Rails reviewer with a very high bar. You are strict whe
## Confidence calibration
Your confidence should be **high (0.80+)** when you can point to a concrete regression, an objectively confusing extraction, or a Rails convention break that clearly makes the touched code harder to maintain or verify.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the issue is real but partly judgment-based -- naming quality, whether extraction crossed the line into needless complexity, or whether a Turbo pattern is overbuilt for the use case.
**Anchor 100** — the regression is mechanical: a removed callback that was the only thing enforcing an invariant, a renamed method called from existing tests in the diff.
Your confidence should be **low (below 0.60)** when the criticism is mostly stylistic or depends on project context outside the diff. Suppress these.
**Anchor 75** — you can point to a concrete regression, an objectively confusing extraction, or a Rails convention break that clearly makes the touched code harder to maintain or verify.
**Anchor 50** — the issue is real but partly judgment-based — naming quality, whether extraction crossed the line into needless complexity, or whether a Turbo pattern is overbuilt for the use case. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the criticism is mostly stylistic or depends on project context outside the diff.
## What you don't flag


@@ -20,11 +20,15 @@ You are Kieran reviewing TypeScript with a high bar for type safety and code cla
## Confidence calibration
Your confidence should be **high (0.80+)** when the type hole or structural regression is directly visible in the diff -- for example, a new `any`, an unsafe cast, a removed guard, or a refactor that clearly makes a touched module harder to verify.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the issue is partly judgment-based -- naming quality, whether extraction should have happened, or whether a nullable flow is truly unsafe given surrounding code you cannot fully inspect.
**Anchor 100** — the type hole is mechanical: an explicit `any`, a `// @ts-ignore` over genuinely unsafe code, an `as` cast that bypasses a discriminated union exhaustiveness check.
Your confidence should be **low (below 0.60)** when the complaint is mostly taste or depends on broader project conventions. Suppress these.
**Anchor 75** — the type hole or structural regression is directly visible in the diff — for example, a new `any`, an unsafe cast, a removed guard, or a refactor that clearly makes a touched module harder to verify.
**Anchor 50** — the issue is partly judgment-based — naming quality, whether extraction should have happened, or whether a nullable flow is truly unsafe given surrounding code you cannot fully inspect. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the complaint is mostly taste or depends on broader project conventions.
## What you don't flag


@@ -21,11 +21,15 @@ You are a code clarity and long-term maintainability expert who reads code from
## Confidence calibration
Your confidence should be **high (0.80+)** when the structural problem is objectively provable -- the abstraction literally has one implementation and you can see it, the dead code is provably unreachable, the indirection adds a measurable layer with no added behavior.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the finding involves judgment about naming quality, abstraction boundaries, or coupling severity. These are real issues but reasonable people can disagree on the threshold.
**Anchor 100** — the structural problem is verifiable from the code with zero interpretation: dead code reached only by an unreachable branch, an interface with exactly one implementation that can be inlined.
Your confidence should be **low (below 0.60)** when the finding is primarily a style preference or the "better" approach is debatable. Suppress these.
**Anchor 75** — the structural problem is objectively provable: the abstraction literally has one implementation and you can see it, the dead code is provably unreachable, the indirection adds a measurable layer with no added behavior.
**Anchor 50** — the finding involves judgment about naming quality, abstraction boundaries, or coupling severity. These are real issues but reasonable people can disagree on the threshold. Surfaces only as P0 escape or via mode-aware demotion to `residual_risks`.
**Anchor 25 or below — suppress** — the finding is primarily a style preference or the "better" approach is debatable.
## What you don't flag


@@ -21,13 +21,17 @@ You are a runtime performance and scalability expert who reads code through the
## Confidence calibration
Performance findings have a **higher confidence threshold** than other personas because the cost of a miss is low (performance issues are easy to measure and fix later) and false positives waste engineering time on premature optimization.
Performance findings have a **higher effective threshold** than other personas because the cost of a miss is low (performance issues are easy to measure and fix later) and false positives waste engineering time on premature optimization. Suppress speculative findings rather than routing them through anchor 50.
Your confidence should be **high (0.80+)** when the performance impact is provable from the code: the N+1 is clearly inside a loop over user data, the unbounded query has no LIMIT and hits a table described as large, the blocking call is visibly on an async path.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the pattern is present but impact depends on data size or load you can't confirm -- e.g., a query without LIMIT on a table whose size is unknown.
**Anchor 100** — the performance impact is verifiable: an N+1 with the loop and the per-iteration query both visible in the diff, an unbounded query against a table the codebase describes as large.
Your confidence should be **low (below 0.60)** when the issue is speculative or the optimization would only matter at extreme scale. Suppress findings below 0.60 -- performance at that confidence level is noise.
**Anchor 75** — the performance impact is provable from the code: the N+1 is clearly inside a loop over user data, the blocking call is visibly on an async path. Real users will hit it under normal load.
**Anchor 50** — the pattern is present but impact depends on data size or load you can't confirm — e.g., a query without LIMIT on a table whose size is unknown. Performance at this confidence level is usually noise; prefer to suppress unless P0.
**Anchor 25 or below — suppress** — the issue is speculative or the optimization would only matter at extreme scale.
## What you don't flag


@@ -44,11 +44,15 @@ If the PR has no prior review comments, return an empty findings array immediate
## Confidence calibration
Your confidence should be **high (0.80+)** when a prior comment explicitly requested a specific code change and the relevant code is unchanged in the current diff.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when a prior comment suggested a change and the code has changed in the area but doesn't clearly address the feedback.
**Anchor 100** — a prior comment explicitly requested a specific named change ("rename `foo` to `bar`", "remove this `console.log`") and the diff shows the change was not made.
Your confidence should be **low (below 0.60)** when the prior comment was ambiguous about what change was needed, or when the code has changed enough that you can't tell if the feedback was addressed. Suppress these.
**Anchor 75** — a prior comment explicitly requested a specific code change and the relevant code is unchanged in the current diff.
**Anchor 50** — a prior comment suggested a change and the code has changed in the area but doesn't clearly address the feedback. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the prior comment was ambiguous about what change was needed, or the code has changed enough that you can't tell if the feedback was addressed.
## Output format


@@ -43,11 +43,15 @@ In either case, identify which sections apply to the file types in the diff. A s
## Confidence calibration
Your confidence should be **high (0.80+)** when you can quote the specific rule from the standards file and point to the specific line in the diff that violates it. Both the rule and the violation are unambiguous.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the rule exists in the standards file but applying it to this specific case requires judgment -- e.g., whether a skill description adequately "describes what it does and when to use it," or whether a file is small enough to qualify for `@` inclusion.
**Anchor 100** — the violation is verifiable from the code: the standards file has a quotable rule, the diff has a line that mechanically violates it (e.g., "do not use absolute paths in skills" + a literal absolute path), and no interpretation is needed.
Your confidence should be **low (below 0.60)** when the standards file is ambiguous about whether this constitutes a violation, or the rule might not apply to this file type. Suppress these.
**Anchor 75** — you can quote the specific rule from the standards file and point to the specific line in the diff that violates it. Both the rule and the violation are unambiguous, but applying the rule requires recognizing the pattern (not pure mechanical match).
**Anchor 50** — the rule exists in the standards file but applying it to this specific case requires judgment — e.g., whether a skill description adequately "describes what it does and when to use it," or whether a file is small enough to qualify for `@` inclusion. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the standards file is ambiguous about whether this constitutes a violation, or the rule might not apply to this file type.
## What you don't flag


@@ -21,11 +21,15 @@ You are a production reliability and failure mode expert who reads code by askin
## Confidence calibration
Your confidence should be **high (0.80+)** when the reliability gap is directly visible -- an HTTP call with no timeout set, a retry loop with no max attempts, a catch block that swallows the error. You can point to the specific line missing the protection.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the code lacks explicit protection but might be handled by framework defaults or middleware you can't see -- e.g., the HTTP client *might* have a default timeout configured elsewhere.
**Anchor 100** — the gap is mechanical: a `requests.get(url)` with no `timeout=` keyword, an infinite loop with no break, a catch block with `pass` and no log.
Your confidence should be **low (below 0.60)** when the reliability concern is architectural and can't be confirmed from the diff alone. Suppress these.
**Anchor 75** — the reliability gap is directly visible: an HTTP call with no timeout set, a retry loop with no max attempts, a catch block that swallows the error. You can point to the specific line missing the protection.
**Anchor 50** — the code lacks explicit protection but might be handled by framework defaults or middleware you can't see — e.g., the HTTP client *might* have a default timeout configured elsewhere. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the reliability concern is architectural and can't be confirmed from the diff alone.
## What you don't flag


@@ -21,13 +21,17 @@ You are an application security expert who thinks like an attacker looking for t
## Confidence calibration
Security findings have a **lower confidence threshold** than other personas because the cost of missing a real vulnerability is high. A security finding at **0.60 confidence is actionable** and should be reported.
Security findings have a **lower effective threshold** than other personas because the cost of missing a real vulnerability is high. Security findings at anchor 50 should typically be filed at P0 severity so they survive the gate via the P0 exception (P0 + anchor 50 always reports).
Your confidence should be **high (0.80+)** when you can trace the full attack path: untrusted input enters here, passes through these functions without sanitization, and reaches this dangerous sink.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the dangerous pattern is present but you can't fully confirm exploitability -- e.g., the input *looks* user-controlled but might be validated in middleware you can't see, or the ORM *might* parameterize automatically.
**Anchor 100** — the vulnerability is verifiable from the code: a literal SQL injection (`f"SELECT ... {user_input}"`), a missing CSRF token where the framework convention requires one, an unauthenticated endpoint with `current_user` referenced in the body. No interpretation needed.
Your confidence should be **low (below 0.60)** when the attack requires conditions you have no evidence for. Suppress these.
**Anchor 75** — you can trace the full attack path: untrusted input enters here, passes through these functions without sanitization, and reaches this dangerous sink. The exploit is constructible from the code alone.
**Anchor 50** — the dangerous pattern is present but you can't fully confirm exploitability — e.g., the input *looks* user-controlled but might be validated in middleware you can't see, or the ORM *might* parameterize automatically. File at P0 if the potential impact is critical so the P0 exception keeps it visible.
**Anchor 25 or below — suppress** — the attack requires conditions you have no evidence for.
## What you don't flag


@@ -72,11 +72,15 @@ Generic magic-number, threshold, and hardcoded-rate concerns are not Swift-speci
## Confidence calibration
Your confidence should be **high (0.80+)** when the state management bug, retain cycle, or concurrency hazard is directly visible in the diff -- for example, `@ObservedObject` on a locally-created object, a closure capturing `self` strongly in a `sink`, UI mutation from a background context with no `@MainActor`, or a managed-object access outside a `perform` block.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when the issue is real but depends on context outside the diff -- whether a parent actually re-creates a child view (making `@ObservedObject` vs `@StateObject` matter), whether a closure is truly escaping, or whether strict concurrency mode is enabled.
**Anchor 100** — the bug is mechanical: `@ObservedObject` on a locally-instantiated object literal, a closure capturing `self` strongly in a known-escaping context with no `[weak self]`, UI mutation in a `Task.detached` block.
Your confidence should be **low (below 0.60)** when the finding depends on runtime conditions, project-wide architecture decisions you cannot confirm, or is mostly a style preference. Suppress these.
**Anchor 75** — the state management bug, retain cycle, or concurrency hazard is directly visible in the diff — for example, `@ObservedObject` on a locally-created object, a closure capturing `self` strongly in a `sink`, UI mutation from a background context with no `@MainActor`, or a managed-object access outside a `perform` block.
**Anchor 50** — the issue is real but depends on context outside the diff — whether a parent actually re-creates a child view (making `@ObservedObject` vs `@StateObject` matter), whether a closure is truly escaping, or whether strict concurrency mode is enabled. Surfaces only as P0 escape or soft buckets.
**Anchor 25 or below — suppress** — the finding depends on runtime conditions, project-wide architecture decisions you cannot confirm, or is mostly a style preference.
## What you don't flag


@@ -21,11 +21,15 @@ You are a test architecture and coverage expert who evaluates whether the tests
## Confidence calibration
Your confidence should be **high (0.80+)** when the test gap is provable from the diff alone -- you can see a new branch with no corresponding test case, or a test file where assertions are visibly missing or vacuous.
Use the anchored confidence rubric in the subagent template. Persona-specific guidance:
Your confidence should be **moderate (0.60-0.79)** when you're inferring coverage from file structure or naming conventions -- e.g., a new `utils/parser.ts` with no `utils/parser.test.ts`, but you can't be certain tests don't exist in an integration test file.
**Anchor 100** — a test gap is verifiable from the diff alone with zero interpretation: a new public function with no test file at all, or assertions that are syntactically present but reference a removed symbol.
Your confidence should be **low (below 0.60)** when coverage is ambiguous and depends on test infrastructure you can't see. Suppress these.
**Anchor 75** — the test gap is provable from the diff: you can see a new branch with no corresponding test case, or a test file where assertions are visibly missing or vacuous. A normal future code path will hit untested behavior.
**Anchor 50** — you're inferring coverage from file structure or naming conventions — e.g., a new `utils/parser.ts` with no `utils/parser.test.ts`, but you can't be certain tests don't exist in an integration test file. Surfaces only as P0 escape or via mode-aware demotion to `testing_gaps`.
**Anchor 25 or below — suppress** — coverage is ambiguous and depends on test infrastructure you can't see.
## What you don't flag

View File

@@ -191,6 +191,21 @@ This path works with any ref — a SHA, `origin/main`, a branch name. Automated
If `mode:report-only` or `mode:headless` is active, do **not** run `gh pr checkout <number-or-url>` on the shared checkout. For `mode:report-only`, tell the caller: "mode:report-only cannot switch the shared checkout to review a PR target. Run it from an isolated worktree/checkout for that PR, or run report-only with no target argument on the already checked out branch." For `mode:headless`, emit `Review failed (headless mode). Reason: cannot switch shared checkout. Re-invoke with base:<ref> to review the current checkout, or run from an isolated worktree.` Stop here unless the review is already running in an isolated checkout.
**Skip-condition pre-check.** Before checkout or scope detection, run a PR-state probe to decide whether the review should proceed:
```
gh pr view <number-or-url> --json state,title,body,files
```
Apply skip rules in order:
- `state` is `CLOSED` or `MERGED` -> stop with message `PR is closed/merged; not reviewing.`
- **Trivial-PR judgment**: spawn a lightweight sub-agent (use `model: haiku` in Claude Code; gpt-5.4-nano or equivalent in Codex) with the PR title, body, and changed file paths. The agent's task: "Is this an automated or trivial PR that does not warrant a code review? Consider: dependency lock-file or manifest-only bumps, automated release commits, chore version increments with no substantive code changes. When in doubt, answer no — false negatives (skipped reviews that should have run) are more costly than false positives (unnecessary reviews)." If the judgment returns yes: stop with message `PR appears to be a trivial automated PR; not reviewing. Run without a PR argument to review the current branch, or pass base:<ref> if review is intended.`
When any skip rule fires, emit the message and stop without dispatching reviewers, switching the checkout, or running scope detection. **Standalone branch mode and `base:` mode are unaffected** -- they always run the full review. **Draft PRs are reviewed normally** -- draft status is not a skip condition; early feedback on in-progress work is valuable.
If no skip rule fires, proceed to the checkout logic below.
First, verify the worktree is clean before switching branches:
```
@@ -385,13 +400,11 @@ Pass the resulting path list to the `project-standards` persona inside a `<stand
#### Model tiering
Three reviewers inherit the session model with no override: `ce-correctness-reviewer`, `ce-security-reviewer`, and `ce-adversarial-reviewer`. These perform the highest-stakes analysis — logic bugs, security vulnerabilities, adversarial failure scenarios — and should run at whatever capability level the user has configured. If the user is on Opus, these get Opus.
All other persona sub-agents and CE agents use the platform's mid-tier model to reduce cost and latency. In Claude Code, pass `model: "sonnet"` in the Agent tool call. On other platforms, use the equivalent mid-tier (e.g., `gpt-5.4-mini` in Codex as of April 2026). If the platform has no model override mechanism or the available model names are unknown, omit the model parameter and let agents inherit the default -- a working review on the parent model is better than a broken dispatch from an unrecognized model name.
CE always-on agents (ce-agent-native-reviewer, ce-learnings-researcher) and CE conditional agents (ce-schema-drift-detector, ce-deployment-verification-agent) also use the mid-tier model since they perform scoped, focused work.
The orchestrator (this skill) also inherits the session model; it handles intent discovery, reviewer selection, finding merge/dedup, and synthesis -- tasks that benefit from the same reasoning capability the user configured.
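The tiering rule above reduces to a small dispatch decision. A minimal sketch, assuming Claude Code's Agent-tool `model` parameter; `None` means "omit the parameter and inherit the session model", which doubles as the fallback when the platform has no override mechanism.

```python
# The three reviewers that always inherit the session model (highest-stakes analysis).
FULL_CAPABILITY = {
    "ce-correctness-reviewer",
    "ce-security-reviewer",
    "ce-adversarial-reviewer",
}

def model_override(agent_name: str, platform_has_override: bool = True):
    """Return the model parameter for a sub-agent dispatch, or None to inherit."""
    if not platform_has_override:
        # A working review on the parent model beats a broken dispatch
        # from an unrecognized model name.
        return None
    if agent_name in FULL_CAPABILITY:
        return None  # inherit whatever capability level the user configured
    return "sonnet"  # mid-tier for all other persona and CE sub-agents
```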
#### Run ID
@@ -435,7 +448,7 @@ Each persona sub-agent writes full JSON (all schema fields) to `.context/compoun
"severity": "P0",
"file": "orders_controller.rb",
"line": 42,
"confidence": 100,
"autofix_class": "gated_auto",
"owner": "downstream-resolver",
"requires_verification": true,
@@ -458,6 +471,8 @@ Detail-tier fields (`why_it_matters`, `evidence`) are in the artifact file only.
Convert multiple reviewer compact JSON returns into one deduplicated, confidence-gated finding set. The compact returns contain merge-tier fields (title, severity, file, line, confidence, autofix_class, owner, requires_verification, pre_existing) plus the optional suggested_fix. Detail-tier fields (why_it_matters, evidence) are on disk in the per-agent artifact files and are not loaded at this stage.
`confidence` is one of 5 discrete anchors (`0`, `25`, `50`, `75`, `100`) with behavioral definitions in the findings schema. Synthesis treats anchors as integers; do not coerce to floats.
1. **Validate.** Check each compact return for required top-level and per-finding fields, plus value constraints. Drop malformed returns or findings. Record the drop count.
- **Top-level required:** reviewer (string), findings (array), residual_risks (array), testing_gaps (array). Drop the entire return if any are missing or wrong type.
- **Per-finding required:** title, severity, file, line, confidence, autofix_class, owner, requires_verification, pre_existing
@@ -465,25 +480,78 @@ Convert multiple reviewer compact JSON returns into one deduplicated, confidence
- severity: P0 | P1 | P2 | P3
- autofix_class: safe_auto | gated_auto | manual | advisory
- owner: review-fixer | downstream-resolver | human | release
- confidence: integer in {0, 25, 50, 75, 100}
- line: positive integer
- pre_existing, requires_verification: boolean
- Do not validate against the full schema here -- the full schema (including why_it_matters and evidence) applies to the artifact files on disk, not the compact returns.
2. **Deduplicate.** Compute fingerprint: `normalize(file) + line_bucket(line, +/-3) + normalize(title)`. When fingerprints match, merge: keep highest severity, keep highest anchor, note which reviewers flagged it. Dedup runs over the full validated set (including anchor 50) so cross-reviewer promotion in step 3 can lift matching anchor-50 findings into the actionable tier.
3. **Cross-reviewer agreement.** When 2+ independent reviewers flag the same issue (same fingerprint), promote the merged finding by one anchor step: `50 -> 75`, `75 -> 100`, `100 -> 100`. Cross-reviewer corroboration is a stronger signal than any single reviewer's anchor; the promotion routes a previously-soft finding into the actionable tier or strengthens its already-actionable position. Note the agreement in the Reviewer column of the output (e.g., "security, correctness").
4. **Separate pre-existing.** Pull out findings with `pre_existing: true` into a separate list.
5. **Resolve disagreements.** When reviewers flag the same code region but disagree on severity, autofix_class, or owner, annotate the Reviewer column with the disagreement (e.g., "security (P0), correctness (P1) -- kept P0"). This transparency helps the user understand why a finding was routed the way it was.
6. **Normalize routing.** For each merged finding, set the final `autofix_class`, `owner`, and `requires_verification`. If reviewers disagree, keep the most conservative route. Synthesis may narrow a finding from `safe_auto` to `gated_auto` or `manual`, but must not widen it without new evidence.
6b. **Tie-break the recommended action.** Interactive mode's walk-through and LFG paths present a per-finding recommended action (Apply / Defer / Skip / Acknowledge) derived from the normalized `autofix_class` and `suggested_fix`. When contributing reviewers implied different actions for the same merged finding, synthesis picks the most conservative using the order `Skip > Defer > Apply > Acknowledge`. This guarantees that identical review artifacts produce the same recommendation deterministically, so LFG results are auditable after the fact and the walk-through's recommendation is stable across re-runs. The user may still override per finding via the walk-through's options; this rule only determines what gets labeled "recommended."
6c. **Mode-aware demotion of weak general-quality findings.** Some persona output is real signal but does not warrant primary-findings attention. Reroute it to the existing soft buckets so the primary findings table stays focused on actionable issues.
A finding qualifies for demotion when **all** of these hold:
- Severity is P2 or P3 (P0 and P1 always stay in primary findings)
- `autofix_class` is `advisory` (concrete-fix findings stay in primary)
- **All** contributing reviewers are `testing` or `maintainability` — if any other persona also flagged this finding, cross-reviewer corroboration is present and the finding stays in primary findings regardless of its severity or advisory status (expand the weak-signal list later only with evidence)
When a finding qualifies, route by mode:
- **Interactive and report-only modes:** Move the finding out of the primary findings set. If the contributing reviewer is `testing`, append `<file:line> -- <title>` to `testing_gaps`. If `maintainability`, append the same to `residual_risks`. Record the demotion count for Coverage. The finding does not appear in the Stage 6 findings table. (Use title only -- the compact return omits `why_it_matters`, and report-only mode skips artifact files entirely. Soft-bucket entries are FYI items; readers who want depth can open the per-agent artifact when one exists.)
- **Headless and autofix modes:** Suppress the finding entirely. Record the suppressed count in Coverage as "mode-aware demotion suppressions" so the user can see what was filtered.
Demotion is intentionally narrow. The conservative scope (testing/maintainability + P2/P3 + advisory) is the starting point; do not expand the rule by guessing which other personas overproduce noise. If real review runs show another persona consistently emitting weak signal, expand with evidence.
7. **Confidence gate.** After dedup, promotion, and demotion have shaped the primary set, suppress remaining findings below anchor 75. Exception: P0 findings at anchor 50+ survive the gate -- critical-but-uncertain issues must not be silently dropped. Record the suppressed count by anchor (so Coverage can report "N findings suppressed at anchor 50, M at anchor 25"). The gate runs late deliberately: anchor-50 findings need a chance to be promoted by step 3 (cross-reviewer corroboration) or rerouted by step 6c (mode-aware demotion to soft buckets) before any drop decision.
8. **Partition the work.** Build three sets:
- in-skill fixer queue: only `safe_auto -> review-fixer`
- residual actionable queue: unresolved `gated_auto` or `manual` findings whose owner is `downstream-resolver`
- report-only queue: `advisory` findings plus anything owned by `human` or `release`
9. **Sort.** Order by severity (P0 first) -> anchor (descending) -> file path -> line number.
10. **Collect coverage data.** Union residual_risks and testing_gaps across reviewers.
11. **Preserve CE agent artifacts.** Keep the learnings, agent-native, schema-drift, and deployment-verification outputs alongside the merged finding set. Do not drop unstructured agent output just because it does not match the persona JSON schema.
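The anchor arithmetic running through steps 2, 3, and 7 can be sketched compactly: dedup keeps the highest anchor, cross-reviewer corroboration promotes one step, and the late gate suppresses below 75 with the P0-at-50 exception. This is an illustrative sketch, not the synthesis implementation; field values mirror the compact-return schema.

```python
# One-step promotion table for cross-reviewer agreement (step 3).
PROMOTE = {50: 75, 75: 100, 100: 100}

def merge_anchor(anchors: list[int]) -> int:
    """Steps 2-3: dedup keeps the highest anchor; 2+ reviewers promote one step."""
    merged = max(anchors)
    if len(anchors) >= 2:  # independent reviewers converged on this fingerprint
        merged = PROMOTE.get(merged, merged)
    return merged

def survives_gate(severity: str, anchor: int) -> bool:
    """Step 7: suppress below anchor 75, except P0 at anchor 50+."""
    if anchor >= 75:
        return True
    return severity == "P0" and anchor >= 50
```

Running the gate after promotion is what gives a solo anchor-50 finding flagged by two reviewers a path into the actionable tier (`merge_anchor([50, 50])` yields 75).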
### Stage 5b: Validation pass (externalizing modes only)
Independent verification gate. Spawn one validator sub-agent per surviving finding using `references/validator-template.md`. The validator's job is to re-check the finding against the diff and surrounding code with no commitment to the original persona's analysis. Findings the validator rejects are dropped; findings the validator confirms flow through unchanged.
**When this stage runs:**
| Mode | Runs Stage 5b? | Where |
|------|---------------|-------|
| `headless` | Yes, eagerly | Between Stage 5 and Stage 6 |
| `autofix` | Yes, eagerly | Between Stage 5 and Stage 6 |
| `interactive`, walk-through routing (option A) — per-finding phase | No -- the user is the per-finding validator | n/a |
| `interactive`, walk-through routing (option A) — LFG-the-rest handoff | Yes, on the remaining action set | Before bulk-preview dispatch (same gate as option B) |
| `interactive`, LFG routing (option B) | Yes, on the action set | Before bulk-preview dispatch |
| `interactive`, File-tickets routing (option C) | Yes, on all pending findings | Before tracker dispatch |
| `interactive`, Report-only routing (option D) | No -- nothing is being externalized | n/a |
| `report-only` | No -- read-only mode externalizes nothing | n/a |
When Stage 5b does not run, the merged finding set from Stage 5 flows through to Stage 6 unchanged. When it runs, the steps below execute on the relevant set.
**Steps:**
1. **Select findings to validate.**
- **headless/autofix:** All survivors of Stage 5.
- **interactive LFG (option B) and walk-through LFG-the-rest handoff:** The action set — findings with a recommended action of Apply or Defer. Skip and Acknowledge findings are not being externalized on this path.
- **interactive File-tickets (option C):** All pending findings regardless of recommended action. Option C externalizes every finding as a ticket, so every finding needs validation.
2. **Apply dispatch budget cap.** If the selected set exceeds 15 findings, validate the highest-severity 15 (P0 first, then P1, then P2, then P3, breaking ties by anchor descending). Drop the remainder and record the over-budget count for the Coverage section. The blunt drop is intentional; a review producing 15+ surviving findings is already in territory where a second wave would not change the user's triage approach.
3. **Spawn validators in parallel.** One sub-agent per finding, dispatched concurrently using the validator template. Each validator receives:
- The finding's title, severity, file, line, suggested_fix, original reviewer name, and confidence anchor
- `why_it_matters` when available — loaded from the per-agent artifact file at `.context/compound-engineering/ce-code-review/{run_id}/{reviewer_name}.json`; omit when the file is absent or the artifact write failed. The validator proceeds without it, using the diff and cited code directly.
- The full diff
- Read-tool access to inspect the cited code, callers, guards, framework defaults, and git blame
4. **Collect verdicts.** Each validator returns `{ "validated": true | false, "reason": "<one sentence>" }`.
- `validated: true` -> finding survives unchanged into the next phase (Stage 6 for headless/autofix, dispatch for interactive)
- `validated: false` -> finding is dropped; record the validator's reason in Coverage
- Validator failure (timeout, dispatch error, malformed JSON) -> drop the finding with reason "validator failed"; conservative bias is correct
5. **Use the mid-tier model for validators.** Same model class (sonnet) as the persona reviewers use. Validators are read-only — same constraints as persona reviewers. They may use non-mutating inspection commands (Read, Grep, Glob, git blame, gh).
6. **Record metrics for Coverage.** Total dispatched, validated true count, validated false count (with reasons), failures, and over-budget drops.
**Why per-finding parallel dispatch (not batched):** Independence is the point. A single batched validator looking at all findings together pattern-matches across them and recreates the persona-bias problem. Per-finding parallel dispatch preserves fresh context per call. Per-file batching is a plausible future optimization for reviews with many findings clustered in few files; not implemented today.
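Step 2's dispatch-budget cap is a plain sort-and-slice: keep the top 15 by severity (P0 first), breaking ties by anchor descending. A minimal sketch; finding dicts use the merge-tier field names, and the cap value comes from the text above.

```python
SEV_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
BUDGET = 15

def apply_budget(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (validate_set, over_budget_drops) under the 15-finding cap."""
    ordered = sorted(
        findings,
        key=lambda f: (SEV_RANK[f["severity"]], -f["confidence"]),
    )
    # The blunt drop is intentional: record the tail for the Coverage section.
    return ordered[:BUDGET], ordered[BUDGET:]
```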
### Stage 6: Synthesize and present
Assemble the final report using **pipe-delimited markdown tables for findings** from the review output template included below. The table format is mandatory for finding rows in interactive mode — do not render findings as freeform text blocks or horizontal-rule-separated prose. Other report sections (Applied Fixes, Learnings, Coverage, etc.) use bullet lists and the `---` separator before the verdict, as shown in the template.
@@ -501,7 +569,7 @@ Assemble the final report using **pipe-delimited markdown tables for findings**
8. **Agent-Native Gaps.** Surface ce-agent-native-reviewer results. Omit section if no gaps found.
9. **Schema Drift Check.** If ce-schema-drift-detector ran, summarize whether drift was found. If drift exists, list the unrelated schema objects and the required cleanup command. If clean, say so briefly.
10. **Deployment Notes.** If ce-deployment-verification-agent ran, surface the key Go/No-Go items: blocking pre-deploy checks, the most important verification queries, rollback caveats, and monitoring focus areas. Keep the checklist actionable rather than dropping it into Coverage.
11. **Coverage.** Suppressed count by anchor (e.g., "N findings suppressed at anchor 50, M at anchor 25"), mode-aware demotion count (interactive/report-only) or suppression count (headless/autofix), validator drop count and reasons (when Stage 5b ran), validator over-budget drops (when the 15-cap fired), residual risks, testing gaps, failed/timed-out reviewers, and any intent uncertainty carried by non-interactive modes.
12. **Verdict.** Ready to merge / Ready with fixes / Not ready. Fix order if applicable. When an `explicit` plan has unaddressed requirements, the verdict must reflect it — a PR that's code-clean but missing planned requirements is "Not ready" unless the omission is intentional. When an `inferred` plan has unaddressed requirements, note it in the verdict reasoning but do not block on it alone.
Do not include time estimates.
@@ -565,7 +633,11 @@ Testing gaps:
- <gap>
Coverage:
- Suppressed: <N> findings below anchor 75 (P0 at anchor 50+ retained)
- Mode-aware demotion suppressions: <N> findings suppressed (testing/maintainability advisory P2-P3)
- Validator drops: <N> findings rejected by Stage 5b validator
- <file:line> -- <reason>
- Validator over-budget drops: <N> findings exceeded the 15-cap and were not validated
- Untracked files excluded: <file1>, <file2>
- Failed reviewers: <reviewer>
@@ -642,8 +714,8 @@ After presenting findings and verdict (Stage 6), route the next steps by mode. R
- **Dispatch on selection.** Route by the option letter (A / B / C / D), not by the rendered label string. The option-C label varies by tracker-detection confidence (`File a [TRACKER] ticket per finding without applying fixes` for a named tracker, `File an issue per finding without applying fixes` as the generic fallback, or omitted entirely when no sink is available — see `references/tracker-defer.md`), and options A / B / D have a single canonical label each. The letter is the stable dispatch signal; the canonical labels below are shown for documentation only. A low-confidence run that rendered option C as the generic label routes to the same branch as a high-confidence run that rendered it with the named tracker.
- (A) `Review each finding one by one` — load `references/walkthrough.md` and enter the per-finding walk-through loop. The walk-through accumulates Apply decisions in memory; Defer decisions execute inline via `references/tracker-defer.md`; Skip / Acknowledge decisions are recorded as no-action; `LFG the rest` routes through `references/bulk-preview.md`. At end of the loop, dispatch one fixer subagent for the accumulated Apply set (Step 3). Emit the unified completion report.
- (B) `LFG. Apply the agent's best-judgment action per finding` — first run Stage 5b validation on the action set (Apply / Defer findings). Drop validator-rejected findings with their reasons recorded in Coverage. Then load `references/bulk-preview.md` scoped to every surviving pending `gated_auto` / `manual` finding. On `Proceed`, execute the plan: Apply set → Step 3 fixer dispatch; Defer set → `references/tracker-defer.md`; Skip / Acknowledge → no-op. On `Cancel`, return to this routing question. Emit the unified completion report after execution.
- (C) `File a [TRACKER] ticket per finding without applying fixes` (or the generic `File an issue per finding without applying fixes` when the named-tracker label is not used) — first run Stage 5b validation on every pending finding. Drop validator-rejected findings with their reasons recorded in Coverage. Then load `references/bulk-preview.md` with every surviving finding in the file-tickets bucket. On `Proceed`, route every finding through `references/tracker-defer.md`; no fixes are applied. On `Cancel`, return to this routing question. Emit the unified completion report.
- (D) `Report only — take no further action` — do not enter any dispatch phase. Emit the completion report, then proceed to Step 5 per its gating rule (`fixes_applied_count > 0` from earlier `safe_auto` passes). If no fixes were applied this run, stop after the report.
- The walk-through's completion report, the LFG / File-tickets completion report, and the zero-remaining completion summary all follow the unified completion-report structure documented in `references/walkthrough.md`. Use the same structure across every terminal path.

View File

@@ -70,10 +70,9 @@
"description": "Concrete minimal fix. Omit or null if no good fix is obvious -- a bad suggestion is worse than none."
},
"confidence": {
"type": "integer",
"enum": [0, 25, 50, 75, 100],
"description": "Anchored confidence score. Use exactly one of 0, 25, 50, 75, 100. Each anchor has a behavioral criterion the reviewer must honestly self-apply. 0: Not confident. This is a false positive that does not stand up to light scrutiny, or a pre-existing issue this PR did not introduce. 25: Somewhat confident. Might be a real issue but could also be a false positive; the reviewer could not verify from the diff and surrounding code alone. 50: Moderately confident. The reviewer verified this is a real issue but it may be a nitpick, narrow edge case, or have minimal practical impact. Relative to the diff's other concerns, it is not very important. Style preferences and subjective improvements land here. 75: Highly confident. The reviewer double-checked the diff and confirmed the issue will affect users, downstream callers, or runtime behavior in normal usage. The bug, vulnerability, or contract violation is clearly present and actionable. 100: Absolutely certain. The issue is verifiable from the code itself -- compile error, type mismatch, definitive logic bug, or an explicit project-standards violation with a quotable rule. No interpretation required."
},
"evidence": {
"type": "array",
@@ -98,14 +97,20 @@
"description": "Missing test coverage the reviewer identified",
"items": { "type": "string" }
}
},
"_meta": {
"confidence_anchors": {
"description": "Confidence is one of 5 discrete anchors (0, 25, 50, 75, 100), each tied to a behavioral criterion the reviewer can honestly self-apply. Float values (e.g., 0.73) are not valid -- the model cannot meaningfully calibrate at finer granularity, and discrete anchors prevent false-precision gaming.",
"0": "False positive or pre-existing -- do not report",
"25": "Speculative; could not verify -- do not report",
"50": "Verified real but minor or stylistic -- report only when P0 or when synthesis routes to advisory/soft buckets",
"75": "Highly confident, will affect users or runtime in normal usage -- report",
"100": "Verifiable from code alone (compile error, type mismatch, definitive logic bug, quoted standards violation) -- report"
},
"confidence_thresholds": {
"suppress": "Below anchor 75 -- do not report. Exception: P0 findings at anchor 50+ may be reported (critical-but-uncertain issues must not be silently dropped).",
"report": "Anchor 75 or 100 -- include with full evidence."
},
"severity_definitions": {
"P0": "Critical breakage, exploitable vulnerability, data loss/corruption. Must fix before merge.",

View File

@@ -21,26 +21,26 @@ Use this **exact format** when presenting synthesized review findings. Findings
| # | File | Issue | Reviewer | Confidence | Route |
|---|------|-------|----------|------------|-------|
| 1 | `orders_controller.rb:42` | User-supplied ID in account lookup without ownership check | security | 100 | `gated_auto -> downstream-resolver` |
### P1 -- High
| # | File | Issue | Reviewer | Confidence | Route |
|---|------|-------|----------|------------|-------|
| 2 | `export_service.rb:87` | Loads all orders into memory -- unbounded for large accounts | performance | 100 | `safe_auto -> review-fixer` |
| 3 | `export_service.rb:91` | No pagination -- response size grows linearly with order count | api-contract, performance | 75 | `manual -> downstream-resolver` |
### P2 -- Moderate
| # | File | Issue | Reviewer | Confidence | Route |
|---|------|-------|----------|------------|-------|
| 4 | `export_service.rb:45` | Missing error handling for CSV serialization failure | correctness | 75 | `safe_auto -> review-fixer` |
### P3 -- Low
| # | File | Issue | Reviewer | Confidence | Route |
|---|------|-------|----------|------------|-------|
| 5 | `export_helper.rb:12` | Format detection could use early return instead of nested conditional | maintainability | 75 | `advisory -> human` |
### Applied Fixes
@@ -79,7 +79,7 @@ Use this **exact format** when presenting synthesized review findings. Findings
### Coverage
- Suppressed: 2 findings below anchor 75 (1 at anchor 50, 1 at anchor 25)
- Residual risks: No rate limiting on export endpoint
- Testing gaps: No test for concurrent export requests
@@ -103,7 +103,7 @@ Sev: P1
File: foo.go:42
Issue: Some problem description
Reviewer(s): adversarial
Confidence: 75
Route: advisory -> human
────────────────────────────────────────
Sev: P2
@@ -119,7 +119,7 @@ This fails because: no pipe-delimited tables, no severity-grouped `###` headers,
- **Severity-grouped sections** -- `### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`. Omit empty severity levels.
- **Always include file:line location** for code review issues
- **Reviewer column** shows which persona(s) flagged the issue. Multiple reviewers = cross-reviewer agreement.
- **Confidence column** shows the finding's anchor as an integer (`50`, `75`, or `100`). Never render as a float.
- **Route column** shows the synthesized handling decision as ``<autofix_class> -> <owner>``.
- **Header includes** scope, intent, and reviewer team with per-conditional justifications
- **Mode line** -- include `interactive`, `autofix`, `report-only`, or `headless`

View File

@@ -38,15 +38,55 @@ The schema below describes the **full artifact file format** (all fields require
{schema}
Confidence rubric (0.0-1.0 scale):
- 0.00-0.29: Not confident / likely false positive. Do not report.
- 0.30-0.49: Somewhat confident. Do not report -- too speculative for actionable review.
- 0.50-0.59: Moderately confident. Real but uncertain. Do not report unless P0 severity.
- 0.60-0.69: Confident enough to flag. Include only when the issue is clearly actionable.
- 0.70-0.84: Highly confident. Real and important. Report with full evidence.
- 0.85-1.00: Certain. Verifiable from the code alone. Report.
**Schema conformance — hard constraints (use these exact values; validation rejects anything else):**
Suppress threshold: 0.60. Do not emit findings below 0.60 confidence (except P0 at 0.50+).
- `severity`: one of `"P0"`, `"P1"`, `"P2"`, `"P3"` — use these exact strings. Do NOT use `"high"`, `"medium"`, `"low"`, `"critical"`, or any other vocabulary, even if your persona's prose discusses priorities in those terms conceptually.
- `autofix_class`: one of `"safe_auto"`, `"gated_auto"`, `"manual"`, `"advisory"`.
- `owner`: one of `"review-fixer"`, `"downstream-resolver"`, `"human"`, `"release"`.
- `evidence`: an ARRAY of strings with at least one element. A single string value is a validation failure — wrap every quote in `["..."]` even when there is only one.
- `pre_existing`: boolean, never null.
- `requires_verification`: boolean, never null.
- `confidence`: one of exactly `0`, `25`, `50`, `75`, or `100` — a discrete anchor, NOT a continuous number. Any other value (e.g., `72`, `0.85`, `"high"`) is a validation failure. Pick the anchor whose behavioral criterion you can honestly self-apply to this finding (see "Confidence rubric" below).
If your persona description uses severity vocabulary like "high-priority" or "critical" in its rubric text, translate to the P0-P3 scale at emit time. "Critical / must-fix" → P0, "important / should-fix" → P1, "worth-noting / could-fix" → P2, "low-signal" → P3. Same for priorities described qualitatively in your analysis — map to P0-P3 on the way out.
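The hard constraints above can be pictured as a small validation pass. This is an illustrative sketch, not the plugin's actual validator — the `Finding` type and `validateFinding` helper are hypothetical names — but it shows exactly which shapes the constraints reject (float confidence, bare-string evidence, off-vocabulary enums):

```typescript
// Hypothetical sketch of the schema hard constraints; names are illustrative.
type Finding = {
  severity: string
  autofix_class: string
  owner: string
  evidence: unknown
  pre_existing: unknown
  requires_verification: unknown
  confidence: unknown
}

const SEVERITIES = ["P0", "P1", "P2", "P3"]
const AUTOFIX = ["safe_auto", "gated_auto", "manual", "advisory"]
const OWNERS = ["review-fixer", "downstream-resolver", "human", "release"]
const ANCHORS = [0, 25, 50, 75, 100]

function validateFinding(f: Finding): string[] {
  const errors: string[] = []
  if (!SEVERITIES.includes(f.severity)) errors.push(`bad severity: ${f.severity}`)
  if (!AUTOFIX.includes(f.autofix_class)) errors.push(`bad autofix_class: ${f.autofix_class}`)
  if (!OWNERS.includes(f.owner)) errors.push(`bad owner: ${f.owner}`)
  // evidence must be an ARRAY with at least one element; a bare string fails
  if (!Array.isArray(f.evidence) || f.evidence.length === 0)
    errors.push("evidence must be a non-empty array")
  if (typeof f.pre_existing !== "boolean") errors.push("pre_existing must be boolean")
  if (typeof f.requires_verification !== "boolean") errors.push("requires_verification must be boolean")
  // confidence must be a discrete anchor, never a float like 0.85 or an off-anchor 72
  if (typeof f.confidence !== "number" || !ANCHORS.includes(f.confidence))
    errors.push(`bad confidence: ${f.confidence}`)
  return errors
}
```

A finding with `confidence: 0.85` and a string-valued `evidence` would fail on exactly those two checks, which is the behavior the "validation rejects anything else" constraint describes.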
**Confidence rubric — use these exact behavioral anchors.** Pick the single anchor whose criterion you can honestly self-apply. Do not pick a value between anchors; only `0`, `25`, `50`, `75`, and `100` are valid. The rubric is anchored on behavior you performed, not on a vague sense of certainty — if you cannot truthfully attach the behavioral claim to the finding, step down to the next anchor.
- **`0` — Not confident at all.** A false positive that does not stand up to light scrutiny, or a pre-existing issue this PR did not introduce. **Do not emit — suppress silently.** This anchor exists in the enum only so synthesis can explicitly track the drop; personas never produce it.
- **`25` — Somewhat confident.** Might be a real issue but could also be a false positive; you could not verify from the diff and surrounding code alone. **Do not emit — suppress silently.** This anchor, like `0`, exists in the enum only so synthesis can track the drop; personas never produce it. If your domain is genuinely uncertain, either gather more evidence (read related files, check call sites, inspect git blame) until you can honestly anchor at `50` or higher, or suppress entirely.
- **`50` — Moderately confident.** You verified this is a real issue but it is a nitpick, narrow edge case, or has minimal practical impact. Style preferences and subjective improvements land here. Surfaces only when synthesis routes weak findings to advisory / residual_risks / testing_gaps soft buckets, or when the finding is P0 (critical-but-uncertain issues are not silently dropped).
- **`75` — Highly confident.** You double-checked the diff and surrounding code and confirmed the issue will affect users, downstream callers, or runtime behavior in normal usage. The bug, vulnerability, or contract violation is clearly present and actionable.
**Anchor `75` requires naming a concrete observable consequence** — a wrong result, an unhandled error path, a contract mismatch, a security exposure, missing coverage that a real test scenario would surface. "This could be cleaner" or "I would have written this differently" do not meet this bar — they are advisory observations and land at anchor `50`. When in doubt between `50` and `75`, ask: "will a user, caller, or operator concretely encounter this in normal usage, or is this my opinion about the code's quality?" The former is `75`; the latter is `50`.
- **`100` — Absolutely certain.** The issue is verifiable from the code itself — compile error, type mismatch, definitive logic bug (off-by-one in a tested algorithm, wrong return type, swapped arguments), or an explicit project-standards violation with a quotable rule. No interpretation required.
Anchor and severity are independent axes. A P2 finding can be anchor `100` if the evidence is airtight; a P0 finding can be anchor `50` if it is an important concern you could not fully verify. Anchor gates where the finding surfaces (drop / soft bucket / actionable); severity orders it within the actionable surface.
Synthesis suppresses anchors `0` and `25` silently. Anchor `50` is dropped from primary findings unless the severity is P0 (P0+50 survives) or synthesis routes it to a soft bucket (testing_gaps, residual_risks, advisory) per mode-aware demotion. Anchors `75` and `100` enter the actionable tier.
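The gate described above reduces to a small routing function. A minimal sketch, assuming hypothetical names (`anchorGate` is not the plugin's API); it encodes the silent suppression of `0`/`25`, the P0 escape at `50`, and the actionable tier at `75`+:

```typescript
// Minimal sketch of the anchor gate; names are illustrative, not the real API.
type Gate = "suppress" | "soft_bucket" | "actionable"
type Anchor = 0 | 25 | 50 | 75 | 100
type Severity = "P0" | "P1" | "P2" | "P3"

function anchorGate(anchor: Anchor, severity: Severity): Gate {
  if (anchor <= 25) return "suppress" // personas never emit these; synthesis drops them silently
  if (anchor === 50) {
    // P0+50 survives (critical-but-uncertain is not silently dropped);
    // everything else at 50 routes to testing_gaps / residual_risks / advisory
    return severity === "P0" ? "actionable" : "soft_bucket"
  }
  return "actionable" // 75 and 100 enter the actionable tier
}
```

Note that anchor and severity stay independent here: a P3 at anchor 100 is still actionable, and a P2 at anchor 50 is still a soft-bucket candidate.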
Example of a schema-valid finding (all required fields, correct enum values, correct array shape):
```json
{
"title": "User-supplied ID in account lookup without ownership check",
"severity": "P0",
"file": "app/controllers/orders_controller.rb",
"line": 42,
"why_it_matters": "Any signed-in user can read another user's orders by pasting the target account ID into the URL. The controller looks up the account and returns its orders without verifying the current user owns it. The shipments controller already uses a current_user.owns?(account) guard for the same attack class; matching that pattern fixes this finding.",
"autofix_class": "gated_auto",
"owner": "downstream-resolver",
"requires_verification": true,
"suggested_fix": "Add current_user.owns?(account) guard before lookup, matching the pattern in shipments_controller.rb",
"confidence": 100,
"evidence": [
"orders_controller.rb:42 -- account = Account.find(params[:account_id])",
"shipments_controller.rb:38 -- raise NotAuthorized unless current_user.owns?(account)"
],
"pre_existing": false
}
```
The `confidence: 100` is justified because the issue is verifiable from the code alone — the controller fetches by user-supplied ID and returns data without any guard, and the parallel pattern in shipments_controller.rb confirms the project's own convention is being violated.
Writing `why_it_matters` (required field, every finding):
@@ -55,7 +95,7 @@ The `why_it_matters` field is how the reader — a developer triaging findings,
- **Lead with observable behavior.** Describe what the bug does from the outside — what a user, attacker, operator, or downstream caller experiences. Do not lead with code structure ("The function X does Y..."). Start with the effect ("Any signed-in user can read another user's orders..."). Function and variable names appear later, only when the reader needs them to locate the issue.
- **Explain why the fix resolves the problem.** If you include a `suggested_fix`, the `why_it_matters` should make clear why that specific fix addresses the root cause. When a similar pattern exists elsewhere in the codebase (an existing guard, an established convention, a parallel handler), reference it so the recommendation is grounded in the project's own conventions rather than theoretical best practice.
- **Keep it tight.** Approximately 2-4 sentences plus the minimum code quoted inline to ground the point. Longer framings are a regression — downstream surfaces have narrow display budgets, and verbose `why_it_matters` content gets truncated or skimmed.
- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging (confidence >= 0.60), you can explain it — the field exists because every finding needs a reason.
- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging at anchor `50` or higher, you can explain it — the field exists because every finding needs a reason.
Illustrative pair — same finding, weak vs. strong framing:
@@ -72,18 +112,27 @@ STRONG (observable behavior first, grounded fix reasoning):
pattern already used in the shipments controller for the same attack.
```
False-positive categories to actively suppress:
- Pre-existing issues unrelated to this diff (mark pre_existing: true for unchanged code the diff does not interact with; if the diff makes it newly relevant, it is secondary, not pre-existing)
- Pedantic style nitpicks that a linter/formatter would catch
- Code that looks wrong but is intentional (check comments, commit messages, PR description for intent)
- Issues already handled elsewhere in the codebase (check callers, guards, middleware)
- Suggestions that restate what the code already does in different words
- Generic "consider adding" advice without a concrete failure mode
False-positive categories to actively suppress. Do NOT emit a finding when any of these apply — not even at anchor `25` or `50`. These are not edge cases you should route to soft buckets; they are non-findings.
- **Pre-existing issues unrelated to this diff.** Mark `pre_existing: true` only for unchanged code the diff does not interact with. If the diff makes a previously-dormant issue newly relevant (e.g., changes a caller in a way that exposes a bug downstream), it is a secondary finding, not pre-existing. PR-comment and headless externalization filter pre-existing entirely; interactive review surfaces them in a separate section.
- **Pedantic style nitpicks that a linter or formatter would catch.** Missing semicolons, indentation, import ordering, unused-variable warnings the project's tooling already catches. Style belongs to the toolchain.
- **Code that looks wrong but is intentional.** Check comments, commit messages, PR description, or surrounding code for evidence of intent before flagging. A persona-flagged "missing null check" guarded by an upstream `.present?` call is a false positive.
- **Issues already handled elsewhere.** Check callers, guards, middleware, framework defaults, and parallel handlers before flagging. If a controller's input is already validated by a parent middleware, the controller-level check the persona wants to add is redundant.
- **Suggestions that restate what the code already does in different words.** "Consider extracting this into a helper" when the code is already a small helper, "consider adding a guard" when a guard one line up already enforces it.
- **Generic "consider adding" advice without a concrete failure mode.** If you cannot name what breaks, the finding is not actionable. Either find the failure mode or suppress.
- **Issues with a relevant lint-ignore comment.** Code that carries an explicit lint disable comment for the rule you are about to flag (`eslint-disable-next-line no-unused-vars`, `# rubocop:disable Style/StringLiterals`, `# noqa: E501`, etc.) — suppress unless the suppression itself violates a project-standards rule that explicitly forbids disabling that lint for this code shape. The author already chose to suppress; re-flagging it via a different reviewer creates noise and ignores their decision.
- **General code-quality concerns not codified in CLAUDE.md / AGENTS.md.** "This file is getting long," "this method has too many parameters," "this is hard to read" — without a project-standards rule to anchor the concern, these are subjective and waste reviewer time. If the project explicitly bans long files or sets a parameter-count limit in its standards, that is a project-standards finding; otherwise suppress.
- **Speculative future-work concerns with no current signal.** "This might break under load," "what if the requirements change," "this could be hard to test later" — not findings unless the diff introduces concrete evidence the concern is reachable now.
**Advisory observations — route to advisory autofix_class, do not force a decision.** If the honest answer to "what actually breaks if we do not fix this?" is "nothing breaks, but…", the finding is advisory. Set `autofix_class: advisory` and `confidence: 50` so synthesis routes the finding to a soft bucket rather than surfacing it as a primary action item. Do not suppress — the observation may have value; it just does not warrant user judgment. Typical advisory shapes: design asymmetry the PR improves but does not fully resolve, opportunity to consolidate two similar helpers when neither is broken, residual risk worth noting in the report.
**Precedence over the false-positive catalog.** The false-positive catalog above is stricter than the advisory rule — if a shape matches the FP catalog, it is a non-finding and must be suppressed entirely. Do NOT route it to anchor `50` / advisory. The advisory rule applies only to shapes that are NOT in the FP catalog.
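The precedence rule is easy to get backwards, so here is the decision order as a sketch (both predicate names are hypothetical stand-ins for the persona's judgment, not real functions):

```typescript
// Sketch of the precedence rule: the FP catalog wins over the advisory rule.
function routeObservation(
  matchesFpCatalog: boolean, // any shape from the false-positive catalog above
  somethingBreaks: boolean,  // honest answer to "what actually breaks?"
): "suppress" | "advisory" | "finding" {
  if (matchesFpCatalog) return "suppress" // non-finding; never demote to advisory
  if (!somethingBreaks) return "advisory" // anchor 50, autofix_class: advisory
  return "finding"                        // real finding; anchor per the rubric
}
```

The important branch is the first one: a catalog match is suppressed outright even when "nothing breaks, but…" would otherwise make it advisory.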
Rules:
- You are a leaf reviewer inside an already-running compound-engineering review workflow. Do not invoke compound-engineering skills or agents unless this template explicitly instructs you to. Perform your analysis directly and return findings in the required output format only.
- Suppress any finding you cannot honestly anchor at `50` or higher (the actionable floor is `50`; anchors `0` and `25` are suppressed by synthesis anyway, so emitting them only adds noise). If your persona's domain description sets a stricter floor (e.g., anchor `75` minimum), honor it.
- Every finding in the full artifact file MUST include at least one evidence item grounded in the actual code. The compact return omits evidence -- the evidence requirement applies to the disk artifact only.
- Set pre_existing to true ONLY for issues in unchanged code that are unrelated to this diff. If the diff makes the issue newly relevant, it is NOT pre-existing.
- Set `pre_existing` to true ONLY for issues in unchanged code that are unrelated to this diff. If the diff makes the issue newly relevant, it is NOT pre-existing.
- You are operationally read-only. The one permitted exception is writing your full analysis to the `.context/` artifact path when a run ID is provided. You may also use non-mutating inspection commands, including read-oriented `git` / `gh` commands, to gather evidence. Do not edit project files, change branches, commit, push, create PRs, or otherwise mutate the checkout or repository state.
- Set `autofix_class` accurately -- not every finding is `advisory`. Use this decision guide:
- `safe_auto`: The fix is local and deterministic — the fixer can apply it mechanically without design judgment. Examples: extracting a duplicated helper, adding a missing nil/null check, fixing an off-by-one, adding a missing test for an untested code path, removing dead code.

View File

@@ -0,0 +1,85 @@
# Validator Sub-agent Prompt Template
This template is used by Stage 5b to spawn one validator sub-agent per surviving finding before externalization. The validator's job is **independent re-verification**, not re-reasoning. It is a fresh second opinion, not a critic of the original persona's analysis.
---
## Template
```
You are an independent validator for a code review finding. Another reviewer flagged the issue described below. Your job is to verify whether the finding holds up under fresh inspection.
You have no commitment to the original finding. If it is wrong, say so. False positives are common; do not feel pressure to confirm.
<finding-to-validate>
Title: {finding_title}
Severity: {finding_severity}
File: {finding_file}
Line: {finding_line}
Why it matters (the original reviewer's framing):
{finding_why_it_matters}
Suggested fix (if any):
{finding_suggested_fix}
Original reviewer: {finding_reviewer}
Confidence anchor: {finding_confidence}
</finding-to-validate>
<diff>
{diff}
</diff>
<scope-context>
The diff above is the full change being reviewed. The finding is about file {finding_file} around line {finding_line}. Use read tools (Read, Grep, Glob, git blame) to inspect the cited code and its callers, guards, middleware, or framework defaults that might handle the concern elsewhere.
</scope-context>
Your task is to answer three questions:
1. **Is the issue real in the code as written?** Read the cited file and surrounding code. If the code does not actually have the problem the finding describes, the finding is invalid. Common false-positive shapes:
- The persona missed an existing guard / null check / validation that handles the case
- The persona misread types or signatures
- The persona flagged a pattern that is intentional in this codebase (check comments, parallel handlers, project conventions)
2. **Is the issue introduced by THIS diff?** Use git blame or diff inspection. If the cited line predates this PR's commits and the diff does not interact with it (does not call into it, does not change its callers in a way that newly exposes the issue), the finding is pre-existing — not validated for externalization regardless of whether it is a real issue.
3. **Is the issue not handled elsewhere?** Look for guards in callers, middleware in the request chain, framework defaults, type system constraints, or parallel handlers that already address the concern. If the issue is functionally prevented by surrounding infrastructure, the finding is invalid.
Return ONLY this JSON, no prose:
```json
{
"validated": true | false,
"reason": "<one sentence explaining the verdict>"
}
```
Examples:
- `{ "validated": true, "reason": "Cited line is new in this diff and lacks the ownership guard used by parallel controllers." }`
- `{ "validated": false, "reason": "Line 87 already guards user.email with .present? check; the null deref the finding describes cannot occur." }`
- `{ "validated": false, "reason": "Cited line dates to 2024-08 (pre-existing); diff does not modify or interact with it." }`
- `{ "validated": false, "reason": "Framework handles the timeout case via Faraday default; no application-level retry needed." }`
Rules:
- Be honest. If the original reviewer was right, validate. If they were wrong, reject. Conservative bias is preferred — when in doubt, reject.
- Do not invent new findings. Your scope is this one finding; if something else you notice bears on the verdict, fold it into the verdict's `reason` rather than reporting it separately.
- Do not edit, commit, push, or modify any files. You are operationally read-only.
- If you cannot read the cited file, return `{ "validated": false, "reason": "Could not access file path to verify." }` rather than guessing.
- Return JSON only. No prose, no markdown, no explanation outside the JSON object.
```
## Variable Reference
| Variable | Source | Description |
|----------|--------|-------------|
| `{finding_title}` | Stage 5 merged finding | The persona's title for the issue |
| `{finding_severity}` | Stage 5 merged finding | P0 / P1 / P2 / P3 |
| `{finding_file}` | Stage 5 merged finding | Repo-relative file path |
| `{finding_line}` | Stage 5 merged finding | Primary line number |
| `{finding_why_it_matters}` | Per-agent artifact file (detail tier) | Loaded from disk for this validation; required for the validator to understand the finding |
| `{finding_suggested_fix}` | Stage 5 merged finding (optional) | Pass empty string if not present |
| `{finding_reviewer}` | Stage 5 merged finding | Original persona name (informational; helps validator interpret the framing) |
| `{finding_confidence}` | Stage 5 merged finding | The persona's anchor (informational) |
| `{diff}` | Stage 1 output | Full diff for context |
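Assembling the prompt from these variables is plain `{name}` substitution. A minimal sketch, assuming a hypothetical `renderValidatorPrompt` helper (the orchestrator's real mechanism is not specified here); note that JSON braces in the template body, like `{ "validated": ... }`, contain spaces and quotes and therefore never match the placeholder pattern:

```typescript
// Hypothetical interpolation helper for the validator template above.
// Placeholders are {word_chars}; unknown placeholders are left untouched
// so a missing variable is visible in the rendered prompt rather than silently blanked.
function renderValidatorPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (match, key) => (key in vars ? vars[key] : match))
}
```

Per the table, `{finding_suggested_fix}` should be passed as an empty string when absent rather than omitted, so the placeholder does not leak into the prompt.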

View File

@@ -10,7 +10,7 @@ Interactive mode only.
The walk-through receives, from the orchestrator:
- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to `gated_auto` and `manual` findings that survived the Stage 5 confidence gate. Advisory findings are included when they were surfaced to this phase (advisory findings normally live in the report-only queue, but when the review flow routes them here for acknowledgment they take the advisory variant below).
- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to `gated_auto` and `manual` findings that survived the Stage 5 anchor gate (anchor 75+, with P0 escape at anchor 50). Advisory findings are included when they were surfaced to this phase (advisory findings normally live in the report-only queue, but when the review flow routes them here for acknowledgment they take the advisory variant below).
- The cached tracker-detection tuple from `tracker-defer.md` (`{ tracker_name, confidence, named_sink_available, any_sink_available }`). `any_sink_available` determines whether the Defer option is offered; `named_sink_available` + `confidence` determine whether the label names the tracker inline.
- The run id for artifact lookups.
@@ -156,7 +156,7 @@ For each finding's answer:
- **Acknowledge — mark as reviewed** (advisory variant) — record Acknowledge in the in-memory decision list. Advance to the next finding. No side effects.
- **Defer — file a [TRACKER] ticket** — invoke the tracker-defer flow from `tracker-defer.md`. The walk-through's position indicator stays on the current finding during any failure-path sub-question (Retry / Fall back / Convert to Skip). On success, record the tracker URL / reference in the in-memory decision list and advance. On conversion-to-Skip from the failure path, advance with the failure noted in the completion report.
- **Skip — don't apply, don't track** — record Skip in the in-memory decision list. Advance. No side effects.
- **LFG the rest — apply the agent's best judgment to this and remaining findings** — exit the walk-through loop. Dispatch the bulk preview from `bulk-preview.md`, scoped to the current finding and everything not yet decided. The preview header reports the count of already-decided findings ("K already decided"). If the user picks `Cancel` from the preview, return to the current finding's per-finding question (not to the routing question). If the user picks `Proceed`, execute the plan per `bulk-preview.md` — Apply findings join the in-memory Apply set with the ones the user already picked, Defer findings route through `tracker-defer.md`, Skip / Acknowledge no-op — then proceed to end-of-walk-through dispatch.
- **LFG the rest — apply the agent's best judgment to this and remaining findings** — exit the walk-through loop. Before dispatching the bulk preview, run Stage 5b on the remaining action set (the current finding plus any not yet decided) using the same gate as LFG routing (option B): validator template at `references/validator-template.md`, 15-finding cap, parallel per-finding dispatch. Findings Stage 5b rejects are dropped from the remaining set. Then dispatch the bulk preview from `bulk-preview.md` scoped to the Stage 5b-validated survivors. The preview header reports the count of already-decided findings ("K already decided"). If the user picks `Cancel` from the preview, return to the current finding's per-finding question (not to the routing question). If the user picks `Proceed`, execute the plan per `bulk-preview.md` — Apply findings join the in-memory Apply set with the ones the user already picked, Defer findings route through `tracker-defer.md`, Skip / Acknowledge no-op — then proceed to end-of-walk-through dispatch.
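The Stage 5b dispatch invoked by the LFG path can be sketched as follows. This is illustrative only — `dispatchValidator` is a hypothetical stand-in for spawning one validator sub-agent from `validator-template.md` — but it captures the two load-bearing properties: the highest-severity-15 cap and independent per-finding parallel dispatch:

```typescript
// Sketch of the Stage 5b gate: cap at 15 by severity, validate in parallel,
// drop rejected findings from the remaining action set.
const SEVERITY_ORDER = { P0: 0, P1: 1, P2: 2, P3: 3 } as const
type Sev = keyof typeof SEVERITY_ORDER
type Candidate = { id: number; severity: Sev }

async function stage5b(
  findings: Candidate[],
  dispatchValidator: (f: Candidate) => Promise<boolean>, // hypothetical sub-agent spawn
): Promise<Candidate[]> {
  // Budget cap: keep the highest-severity 15, drop the remainder.
  const capped = [...findings]
    .sort((a, b) => SEVERITY_ORDER[a.severity] - SEVERITY_ORDER[b.severity])
    .slice(0, 15)
  // Per-finding parallel dispatch, not batched — independence is the point.
  const verdicts = await Promise.all(capped.map(dispatchValidator))
  return capped.filter((_, i) => verdicts[i])
}
```

Only the survivors reach the bulk preview; a validator rejection removes the finding from the remaining set before the user ever sees the plan.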
---

View File

@@ -170,7 +170,10 @@ describe("ce-code-review contract", () => {
"plugins/compound-engineering/skills/ce-code-review/references/findings-schema.json",
)
const schema = JSON.parse(rawSchema) as {
_meta: { confidence_thresholds: { suppress: string } }
_meta: {
confidence_thresholds: { suppress: string; report: string }
confidence_anchors: Record<string, string>
}
properties: {
findings: {
items: {
@@ -178,6 +181,7 @@ describe("ce-code-review contract", () => {
autofix_class: { enum: string[] }
owner: { enum: string[] }
requires_verification: { type: string }
confidence: { type: string; enum: number[] }
}
required: string[]
}
@@ -201,8 +205,206 @@ describe("ce-code-review contract", () => {
"release",
])
expect(schema.properties.findings.items.properties.requires_verification.type).toBe("boolean")
expect(schema._meta.confidence_thresholds.suppress).toContain("0.60")
// Anchored confidence: integer enum, no floats
expect(schema.properties.findings.items.properties.confidence.type).toBe("integer")
expect(schema.properties.findings.items.properties.confidence.enum).toEqual([0, 25, 50, 75, 100])
// Threshold: anchor 75 (P0 escape at anchor 50)
expect(schema._meta.confidence_thresholds.suppress).toContain("anchor 75")
expect(schema._meta.confidence_thresholds.suppress).toContain("anchor 50")
expect(schema._meta.confidence_thresholds.suppress).toMatch(/P0/)
// Behavioral anchors documented for personas
expect(schema._meta.confidence_anchors).toBeDefined()
expect(schema._meta.confidence_anchors["0"]).toBeDefined()
expect(schema._meta.confidence_anchors["25"]).toBeDefined()
expect(schema._meta.confidence_anchors["50"]).toBeDefined()
expect(schema._meta.confidence_anchors["75"]).toBeDefined()
expect(schema._meta.confidence_anchors["100"]).toBeDefined()
})
test("subagent template carries verbatim 5-anchor rubric and lint-ignore suppression", async () => {
const template = await readRepoFile(
"plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md",
)
// Anchored rubric: each anchor named with behavioral criterion
expect(template).toMatch(/`0`.*Not confident/)
expect(template).toMatch(/`25`.*Somewhat confident/)
expect(template).toMatch(/`50`.*Moderately confident/)
expect(template).toMatch(/`75`.*Highly confident/)
expect(template).toMatch(/`100`.*Absolutely certain/)
// Schema conformance hard constraints reject floats
expect(template).toContain("`0`, `25`, `50`, `75`, or `100`")
expect(template).toMatch(/0\.85.*validation failure/i)
// Lint-ignore rule in false-positive catalog
expect(template).toMatch(/lint.ignore|lint disable|eslint-disable/i)
expect(template).toMatch(/suppress unless the suppression itself violates/i)
// Advisory routing rule preserved
expect(template).toMatch(/Advisory observations.*route to advisory/i)
// Personas never produce anchors 0 or 25 (suppress silently)
expect(template).toMatch(/personas never produce/i)
})
test("Stage 5 synthesis uses anchor gate and one-anchor promotion", async () => {
const content = await readRepoFile("plugins/compound-engineering/skills/ce-code-review/SKILL.md")
// Confidence value constraint is integer enum
expect(content).toMatch(/confidence:\s*integer in \{0, 25, 50, 75, 100\}/)
// Confidence gate at anchor 75 with P0 exception at 50
expect(content).toMatch(/suppress remaining findings below anchor 75/i)
expect(content).toMatch(/P0 findings at anchor 50\+ survive/)
// Confidence gate runs AFTER dedup, promotion, and demotion so anchor-50 findings
// can be promoted by cross-reviewer agreement or rerouted to soft buckets first.
// This is a load-bearing ordering — if the gate runs early, promotion/demotion become unreachable.
expect(content).toMatch(/gate runs late deliberately/i)
// One-anchor promotion replaces +0.10 boost
expect(content).toMatch(/one anchor step.*50 -> 75.*75 -> 100/)
expect(content).not.toContain("boost the merged confidence by 0.10")
// Sort by anchor descending, not "confidence (descending)"
expect(content).toMatch(/anchor \(descending\)/)
})
test("Stage 5b validation pass dispatches conditionally and bounds parallelism", async () => {
const content = await readRepoFile("plugins/compound-engineering/skills/ce-code-review/SKILL.md")
const validatorTemplate = await readRepoFile(
"plugins/compound-engineering/skills/ce-code-review/references/validator-template.md",
)
// Stage 5b exists between Stage 5 and Stage 6
expect(content).toContain("### Stage 5b: Validation pass")
// Mode-conditional dispatch
expect(content).toContain("`headless`")
expect(content).toContain("`autofix`")
expect(content).toContain("walk-through routing (option A)")
expect(content).toContain("LFG routing (option B)")
expect(content).toContain("File-tickets routing (option C)")
expect(content).toMatch(/Report-only routing.*nothing is being externalized/i)
// Per-finding parallel dispatch (not batched)
expect(content).toMatch(/per.finding parallel dispatch/i)
expect(content).toMatch(/Independence is the point/i)
// Budget cap of 15
expect(content).toMatch(/exceeds 15 findings/i)
expect(content).toMatch(/highest-severity 15.*Drop the remainder/i)
// After-Review options B and C invoke validation before externalizing
expect(content).toMatch(/\(B\)\s*`LFG.*first run Stage 5b validation/)
expect(content).toMatch(/\(C\)\s*`File a \[TRACKER\].*first run Stage 5b validation/)
// Validator template exists and is read-only
expect(validatorTemplate).toContain("independent validator")
expect(validatorTemplate).toContain("operationally read-only")
expect(validatorTemplate).toContain('"validated": true | false')
expect(validatorTemplate).toMatch(/introduced by THIS diff/i)
expect(validatorTemplate).toMatch(/handled elsewhere/i)
})
test("PR-mode skip-condition pre-check stops without dispatching reviewers", async () => {
const content = await readRepoFile("plugins/compound-engineering/skills/ce-code-review/SKILL.md")
// Skip-check section exists
expect(content).toContain("**Skip-condition pre-check.**")
// gh pr view fetches state and file list for trivial judgment
expect(content).toMatch(/gh pr view.*--json state,title,body,files/)
// Hard skip rules
expect(content).toMatch(/state.*CLOSED.*MERGED/)
// Draft PRs are explicitly NOT skipped
expect(content).not.toMatch(/isDraft.*true.*stop/)
expect(content).toMatch(/Draft PRs are reviewed normally/)
// Trivial-PR judgment uses lightweight model, not a regex
expect(content).toMatch(/lightweight sub-agent/)
expect(content).toMatch(/model.*haiku/i)
expect(content).not.toMatch(/chore\\?\(deps\\?\)/)
// Skip cleanly without dispatching reviewers
expect(content).toMatch(/stop without dispatching reviewers/)
// Standalone branch and base: modes unaffected
expect(content).toMatch(/Standalone branch mode and `base:` mode are unaffected/)
})
test("mode-aware demotion routes weak general-quality findings to soft buckets", async () => {
const content = await readRepoFile("plugins/compound-engineering/skills/ce-code-review/SKILL.md")
// Mode-aware demotion step exists (sub-step within Stage 5; numbering may shift if steps reorder)
expect(content).toMatch(/Mode-aware demotion of weak general-quality findings/i)
// Conservative scope: testing + maintainability personas only
expect(content).toContain("`testing` or `maintainability`")
// Severity P2 or P3 only (P0/P1 always stay primary)
expect(content).toMatch(/Severity is P2 or P3/)
// autofix_class is advisory
expect(content).toMatch(/`autofix_class` is `advisory`/)
// Interactive/report-only: route to testing_gaps or residual_risks
expect(content).toMatch(/`testing`,?\s*append.*`testing_gaps`/)
expect(content).toMatch(/`maintainability`,?\s*append.*`residual_risks`/)
// Demotion entry uses title-only (compact return omits why_it_matters; report-only has no artifact)
expect(content).toMatch(/append `<file:line> -- <title>` to/)
expect(content).toMatch(/title only.*compact return omits/i)
// Headless/autofix: suppress entirely
expect(content).toMatch(/Headless and autofix modes.*Suppress/)
// Coverage section reports demotion count
expect(content).toMatch(/mode-aware demotion/)
})
test("personas use anchored rubric language and no float references remain", async () => {
const personas = [
"ce-correctness-reviewer",
"ce-testing-reviewer",
"ce-maintainability-reviewer",
"ce-project-standards-reviewer",
"ce-security-reviewer",
"ce-performance-reviewer",
"ce-api-contract-reviewer",
"ce-data-migrations-reviewer",
"ce-reliability-reviewer",
"ce-adversarial-reviewer",
"ce-cli-readiness-reviewer",
"ce-previous-comments-reviewer",
"ce-dhh-rails-reviewer",
"ce-kieran-rails-reviewer",
"ce-kieran-python-reviewer",
"ce-kieran-typescript-reviewer",
"ce-julik-frontend-races-reviewer",
"ce-swift-ios-reviewer",
"ce-agent-native-reviewer",
]
for (const persona of personas) {
const content = await readRepoFile(`plugins/compound-engineering/agents/${persona}.agent.md`)
// Anchored language appears
expect(content).toMatch(/Anchor (75|100)/)
expect(content).toMatch(/Anchor 25 or below.*suppress/i)
// No float confidence references
expect(content).not.toMatch(/0\.\d{2}\+/)
expect(content).not.toMatch(/0\.60-0\.79/)
expect(content).not.toMatch(/below 0\.60/)
}
})
test("documents stack-specific conditional reviewers for the JSON pipeline", async () => {