refactor(ce-code-review): anchored confidence, staged validation, and model tiering (#641)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -85,7 +85,7 @@ Filter at `>= 50` for doc review; let the routing menu handle volume. Dismissing

## When NOT to port directly

- Code review workflows (e.g., `ce-code-review`) have linter backstops and public-comment costs. Port the rubric structure, but tune the threshold higher (`>= 75` or `>= 80` per Anthropic). This is out of scope for the ce-doc-review migration; evaluate separately.

- Code review workflows have linter backstops and public-comment costs. Port the rubric structure, but tune the threshold higher (`>= 75`). See the "ce-code-review migration" section below for the completed port.

- High-throughput pipelines where the `25` anchor ("couldn't verify") represents most findings. Dropping everything below `50` may be too aggressive; consider surfacing `25` as "needs human triage" instead.

## Migration history

@@ -121,7 +121,88 @@ Re-evaluation with the tightened criterion shifted cli-printing-press from 21 De
- **Harness drift between iterations.** Iteration-1 orchestrators dispatched parallel persona subagents; iteration-2 orchestrators executed personas inline (nested Agent tool unavailable in that session). This affected side metrics (proposed-fix count on cli-printing-press iteration-2 dropped 15 → 4, likely harness-driven rather than tweak-driven) but did not obscure the tweak's core effect, which was large-magnitude.
- **No absolute-calibration ground truth.** The evaluation measured the migration's stated failure modes disappearing. Whether an anchor-75 finding literally hits 75% of the time remains unmeasured; no labeled doc-review corpus exists.
## ce-code-review migration (2026-04-21)
Ported the same anchored-rubric structure into `ce-code-review` and bundled it with three additional code-review-specific precision controls. The two skills now share calibration discipline but diverge on threshold and on how independent verification is implemented.
### Threshold: `>= 75` (not `>= 50` like ce-doc-review, not `>= 80` like Anthropic)
ce-code-review uses anchor 75 as the gate. P0 findings escape at anchor 50.
`>= 75` matches the ce-doc-review choice of using the anchor itself as the threshold (no awkward middle-bucket gap). At `>= 75`, anchors 75 ("real, will hit in practice") and 100 ("verifiable from code alone") survive; anchors 0/25/50 are dropped. Anthropic's `>= 80` under a discrete `{0,25,50,75,100}` scale would collapse to "anchor 100 only," which is too narrow — it would silence findings where personas can construct the trace but cannot literally read the bug off the code.
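As a concrete sketch of the gate, assuming a minimal finding shape (the field and function names here are illustrative, not the skill's actual schema):

```typescript
// Discrete-anchor gate as described above. The anchor values and the
// >= 75 gate (with P0 escaping at 50) come from the text; the Finding
// shape is an illustrative assumption.
type Anchor = 0 | 25 | 50 | 75 | 100;

interface Finding {
  title: string;
  severity: "P0" | "P1" | "P2" | "P3";
  confidence: Anchor;
}

const GATE = 75;      // anchors 75 and 100 survive
const P0_ESCAPE = 50; // P0 findings escape at anchor 50

function survivesGate(f: Finding): boolean {
  const threshold = f.severity === "P0" ? P0_ESCAPE : GATE;
  return f.confidence >= threshold;
}
```

Because the scale is discrete, the anchor itself is the threshold; there is no middle-bucket gap to interpolate across.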
The threshold divergence from ce-doc-review (`>= 50`) is correct for the same reasons documented in the "Why the threshold diverges from Anthropic" section above, applied in reverse: code review HAS a linter backstop, IS publicly visible, and code claims ARE ground-truth verifiable. Code review wants narrow precision; doc review wants broad surfacing.
### Validation pass (Stage 5b): the deferred follow-up, now landed
The ce-doc-review plan deferred a "neutral-scorer second pass" to a follow-up plan. ce-code-review implements it as **Stage 5b**: an independent validator sub-agent per surviving finding, mode-conditional dispatch, and a 15-finding budget cap.
- **Why now for code review, not doc review:** code review has externalizing modes (autofix applies fixes, headless returns findings to programmatic callers) where false positives have real cost — wrong fixes get committed, downstream automation acts on bad signal. Doc review's worst case is a noisy report a user dismisses with a keystroke; code review's worst case is a wrong-fix PR getting merged.
- **Mode-conditional dispatch:** validation runs in `headless`, `autofix`, and the interactive LFG/File-tickets routing paths. It is skipped in interactive walk-through (the human is the per-finding validator) and in report-only (nothing is being externalized). This scopes validation cost to exactly the modes where false positives can externalize.
- **Per-finding parallel dispatch, not batched:** independence is the design point. A single batched validator looking at all findings together pattern-matches across them and recreates the persona-bias problem we are escaping. Per-file batching is left as a future optimization for reviews with many findings clustered in few files.
- **No `validated` field on findings:** an early plan added a `validated: boolean` field; it was removed during planning. Surviving findings post-validation are validated by definition (rejected ones are dropped); in modes where validation does not run, the run's mode tells consumers everything they need. A field constant within any mode does no work.
- **Conservative failure mode:** validator timeout, malformed output, or dispatch error → drop the finding. Unverified findings should not externalize.
The validator's protocol is `{ "validated": true | false, "reason": "<one sentence>" }` answering three questions: is the issue real, is it introduced by THIS diff, and is it not handled elsewhere. Template: `references/validator-template.md`.
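A minimal sketch of the Stage 5b contract, assuming a generic dispatch function (the names are stand-ins for the real subagent plumbing, and handling of findings beyond the budget cap is not specified here):

```typescript
// Stage 5b sketch: one independent validator per surviving finding,
// a 15-finding budget cap, and a conservative failure mode (timeout,
// malformed output, or dispatch error drops the finding). The
// dispatchValidator parameter is a stand-in for the real subagent call.
interface ValidatorVerdict {
  validated: boolean;
  reason: string; // one sentence
}

const VALIDATION_BUDGET = 15;

async function validateFindings<F>(
  findings: F[],
  dispatchValidator: (finding: F) => Promise<ValidatorVerdict>,
): Promise<F[]> {
  // Per-finding parallel dispatch: validators never see each other's
  // findings, preserving independence.
  const batch = findings.slice(0, VALIDATION_BUDGET);
  const keep = await Promise.all(
    batch.map(async (finding) => {
      try {
        const verdict = await dispatchValidator(finding);
        // Malformed output (missing boolean) counts as a failure.
        return verdict?.validated === true;
      } catch {
        return false; // timeout or dispatch error: drop
      }
    }),
  );
  return batch.filter((_, i) => keep[i]);
}
```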
### Mode-aware false-positive demotion
ce-code-review's broader persona surface (17 reviewers vs ce-doc-review's 7) produces more weak general-quality signal. The higher threshold already delivers stricter precision in externalizing modes; interactive mode takes a different policy: route weak findings to existing soft buckets (`testing_gaps`, `residual_risks`, `advisory`) rather than suppressing them.
The demotion rule is intentionally narrow: severity P2 or P3, `autofix_class` advisory, contributing reviewer is `testing` or `maintainability`. Headless and autofix suppress these entirely; interactive and report-only demote them to soft buckets where they remain visible without competing for primary-findings attention.
This is the "tier the precision bar by mode" framing. Synthesis owns it; personas don't change what they flag based on mode.
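Sketched as a synthesis-side routing function (field and bucket names are assumptions for illustration; the personas themselves never see the mode):

```typescript
// Mode-aware demotion as stated above: the rule fires only for P2/P3
// advisory findings from the testing or maintainability reviewers.
type Mode = "headless" | "autofix" | "interactive" | "report-only";
type Route = "primary" | "soft-bucket" | "suppressed";

interface ReviewFinding {
  severity: "P0" | "P1" | "P2" | "P3";
  autofix_class: string;
  reviewer: string;
}

function isDemotionCandidate(f: ReviewFinding): boolean {
  return (
    (f.severity === "P2" || f.severity === "P3") &&
    f.autofix_class === "advisory" &&
    (f.reviewer === "testing" || f.reviewer === "maintainability")
  );
}

function routeFinding(f: ReviewFinding, mode: Mode): Route {
  if (!isDemotionCandidate(f)) return "primary";
  // Externalizing modes suppress; human-facing modes keep the finding
  // visible in a soft bucket.
  return mode === "headless" || mode === "autofix" ? "suppressed" : "soft-bucket";
}
```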
### Lint-ignore suppression
When code carries an explicit lint-disable comment for the rule a reviewer is about to flag (`eslint-disable-next-line no-unused-vars`, `# rubocop:disable Style/...`, `# noqa: E501`, etc.), suppress the finding unless the suppression itself violates a project-standards rule. The author has already chosen to suppress that rule; re-flagging it through a different reviewer creates noise and second-guesses that decision.
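A sketch of the detection side, assuming the check runs against the flagged line (the pattern list covers the three comment styles named above and is an illustrative subset, not a complete catalog of linter dialects):

```typescript
// Lint-ignore suppression sketch: the finding is suppressed if the
// flagged line carries a disable comment naming the rule, or a
// blanket disable such as a bare `# noqa`.
const DISABLE_PATTERNS: RegExp[] = [
  /eslint-disable(?:-next-line|-line)?\s+(?<rules>[\w@\/,\s-]+)/,
  /rubocop:disable\s+(?<rules>[\w\/,:\s]+)/,
  /noqa(?::\s*(?<rules>[\w,\s]+))?/,
];

function hasLintDisable(line: string, rule: string): boolean {
  for (const pattern of DISABLE_PATTERNS) {
    const m = pattern.exec(line);
    if (!m) continue;
    const rules = m.groups?.rules;
    // No rule list (e.g. bare `# noqa`) disables everything.
    if (rules === undefined) return true;
    if (rules.split(",").map((r) => r.trim()).includes(rule)) return true;
  }
  return false;
}
```

Per the rule above, a hit here suppresses the finding unless the suppression comment itself violates a project-standards rule.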
This is the only entirely new false-positive category in ce-code-review's catalog; the rest were ported from the existing pre-anchor catalog.
### PR-mode skip-condition pre-check
Before running the full review on a PR, a single `gh pr view` call probes for skip conditions:
- Closed or merged PR
- Draft PR
- Trivial automated PR (conservative `chore(deps)` / `build(deps)` / release-bump pattern with empty body)
- Already has a ce-code-review report comment
Skip cleanly without dispatching reviewers. Standalone branch and `base:` modes always run — the skip-check is PR-mode only. Already-reviewed detection deliberately ignores commits-since-comment; the escape hatch for "I want to re-review after pushing more commits" is branch mode or `base:` mode, both of which bypass the skip-check entirely.
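A sketch of the decision, assuming the probe data comes from `gh pr view --json state,isDraft,title,body,comments` (these are real `gh` JSON fields; the report-marker string and the trivial-title pattern are assumptions):

```typescript
// PR-mode skip-check sketch. One `gh pr view` probe supplies all four
// conditions; no reviewers are dispatched when a skip reason exists.
interface PrProbe {
  state: "OPEN" | "CLOSED" | "MERGED";
  isDraft: boolean;
  title: string;
  body: string;
  comments: { body: string }[];
}

// Hypothetical marker the report comment is assumed to contain.
const REPORT_MARKER = "ce-code-review report";

// Conservative trivial-PR pattern: dependency or release bumps.
const TRIVIAL_TITLE = /^(?:chore|build)\(deps\)|^release/i;

function skipReason(pr: PrProbe): string | null {
  if (pr.state !== "OPEN") return "closed or merged";
  if (pr.isDraft) return "draft";
  if (TRIVIAL_TITLE.test(pr.title) && pr.body.trim() === "") {
    return "trivial automated PR";
  }
  if (pr.comments.some((c) => c.body.includes(REPORT_MARKER))) {
    return "already reviewed"; // commits-since-comment deliberately ignored
  }
  return null; // run the full review
}
```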
This avoids the wasted multi-agent review cost on PRs that should not be reviewed (closed, draft, dependabot-style, or already-reviewed). It is the cheapest mechanism in this migration and disproportionately valuable for any team that runs the skill against arbitrary PR queues.
### Files
- `plugins/compound-engineering/skills/ce-code-review/references/findings-schema.json` — `confidence` is integer enum `[0, 25, 50, 75, 100]` with code-review-specific behavioral definitions in the description; `_meta.confidence_anchors` and `_meta.confidence_thresholds` document the anchors and `>= 75` gate
- `plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md` — verbatim 5-anchor rubric with code-review framing, expanded false-positive catalog including lint-ignore rule, hard schema-conformance constraints rejecting floats
- `plugins/compound-engineering/skills/ce-code-review/references/validator-template.md` — Stage 5b validator subagent prompt
- `plugins/compound-engineering/skills/ce-code-review/SKILL.md` — Stage 5 anchor gate and one-anchor promotion (replaces `+0.10`), Stage 5 step 7c mode-aware demotion, Stage 5b validation pass with budget cap, Stage 1 PR-mode skip-condition pre-check, After-Review options B and C invoke validation before externalizing
- `plugins/compound-engineering/agents/ce-*-reviewer.agent.md` — 18 personas updated from float bands to anchored language, preserving each persona's specific calibration signal
- `plugins/compound-engineering/skills/ce-code-review/references/review-output-template.md` — Confidence column renders as integer (`75`, `100`), not float
- `tests/review-skill-contract.test.ts` — schema, synthesis, validation pass, skip-conditions, mode-aware demotion, and per-persona anchored-language assertions
### When to apply this combined pattern to a new skill
Apply the full bundle (anchored rubric + validation pass + mode-aware demotion + skip-conditions + lint-ignore) when **all** of:
1. The skill is a multi-persona review workflow producing structured findings.
2. The skill has externalizing modes — outputs that get acted on without further human review (PR comments, autofix, downstream automation, headless callers).
3. The skill is invoked frequently enough that wasted runs are visible (skip-conditions are pure win in this case; modest cost in low-volume cases).
Apply only the **anchored rubric** (the ce-doc-review subset) when:
- The skill is single-shot or dismissal is cheap via UI/menu — validation pass adds cost without protecting anything that wasn't already going to be triaged by a human.
- The skill operates on premise/strategy claims that lack ground-truth verification — anchor 100 is unreachable; threshold should be `>= 50`.
Skip the entire pattern when:
- The skill produces a single value, not a population of findings.
- The skill operates on user input where the user IS the source of truth (e.g., interactive Q&A skills).
### Migration history (ce-code-review)
Landed in a single PR with anchored rubric, validation pass, skip-conditions, mode-aware demotion, lint-ignore suppression, and persona sweep all together. The schema change is the load-bearing commit; subagent template, synthesis, and persona updates consume it. Branch: `refactor/ce-code-review-precision-and-validation`. The plan with full decision rationale lives at `docs/plans/2026-04-21-002-refactor-ce-code-review-precision-and-validation-plan.md`.
## Deferred follow-ups
- Port the pattern to `ce-code-review` with a code-review-appropriate threshold
- Evaluate a neutral-scorer second pass (a cheap agent that re-scores findings independent of the producing persona) once the anchor rubric has been observed in practice
- **PR inline comment posting mode for ce-code-review.** Anthropic's plugin posts findings as inline GitHub PR comments via `mcp__github_inline_comment__create_inline_comment` with full-SHA link discipline and committable suggestion blocks. ce-code-review currently has no PR-comment mode at all (terminal output, fixer auto-apply, or headless return only). Real workflow gap; deferred because it is a substantial new mode (link format, suggestion-block handling, deduplication semantics, tracker integration overlap).
- **Per-file validator batching.** When real-world reviews routinely surface many findings clustered in few files (large refactors), a per-file validator that reads the file once and evaluates all findings against it could meaningfully reduce cost while preserving cross-file independence. Implement when data shows the saving matters.
- **Haiku-tier orchestrator-side checks.** ce-code-review currently uses sonnet for all subagent dispatch including the cheap PR skip-condition probe. Push obvious cheap checks (skip-conditions, standards path discovery) to haiku.
- **Re-evaluate which always-on personas earn their noise.** ce-code-review keeps `testing` and `maintainability` always-on with mode-aware demotion as the safety valve. If real review runs show the demotion is firing constantly, consider making them conditional rather than always-on.