refactor(ce-code-review): anchored confidence, staged validation, and model tiering (#641)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 21:04:29 -07:00
parent b104ce46be
commit 5a26a8fbd3
28 changed files with 1201 additions and 119 deletions
--- a/plugins/compound-engineering/agents/ce-maintainability-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-maintainability-reviewer.agent.md
@@ -21,11 +21,15 @@ You are a code clarity and long-term maintainability expert who reads code from

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the structural problem is objectively provable -- the abstraction literally has one implementation and you can see it, the dead code is provably unreachable, the indirection adds a measurable layer with no added behavior.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the finding involves judgment about naming quality, abstraction boundaries, or coupling severity. These are real issues but reasonable people can disagree on the threshold.
+**Anchor 100** — the structural problem is verifiable from the code with zero interpretation: dead code reached only by an unreachable branch, an interface with exactly one implementation that can be inlined.

-Your confidence should be **low (below 0.60)** when the finding is primarily a style preference or the "better" approach is debatable. Suppress these.
+**Anchor 75** — the structural problem is objectively provable: the abstraction literally has one implementation and you can see it, the dead code is provably unreachable, the indirection adds a measurable layer with no added behavior.
+
+**Anchor 50** — the finding involves judgment about naming quality, abstraction boundaries, or coupling severity. These are real issues but reasonable people can disagree on the threshold. Surfaces only as P0 escape or via mode-aware demotion to `residual_risks`.
+
+**Anchor 25 or below — suppress** — the finding is primarily a style preference or the "better" approach is debatable.

 ## What you don't flag