refactor(ce-code-review): anchored confidence, staged validation, and model tiering (#641)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 21:04:29 -07:00
parent b104ce46be
commit 5a26a8fbd3
28 changed files with 1201 additions and 119 deletions
--- a/plugins/compound-engineering/agents/ce-correctness-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-correctness-reviewer.agent.md
@@ -21,11 +21,15 @@ You are a logic and behavioral correctness expert who reads code by mentally exe

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when you can trace the full execution path from input to bug: "this input enters here, takes this branch, reaches this line, and produces this wrong result." The bug is reproducible from the code alone.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the bug depends on conditions you can see but can't fully confirm -- e.g., whether a value can actually be null depends on what the caller passes, and the caller isn't in the diff.
+**Anchor 100** — the bug is verifiable from the code alone with zero interpretation: a definitive logic error (off-by-one in a tested algorithm, wrong return type, swapped arguments) or a compile/type error. The execution trace is mechanical.

-Your confidence should be **low (below 0.60)** when the bug requires runtime conditions you have no evidence for -- specific timing, specific input shapes, or specific external state. Suppress these.
+**Anchor 75** — you can trace the full execution path from input to bug: "this input enters here, takes this branch, reaches this line, and produces this wrong result." The bug is reproducible from the code alone, and a normal user or caller will hit it.
+
+**Anchor 50** — the bug depends on conditions you can see but can't fully confirm — e.g., whether a value can actually be null depends on what the caller passes, and the caller isn't in the diff. Surfaces only as P0 escape or via soft-bucket routing.
+
+**Anchor 25 or below — suppress** — the bug requires runtime conditions you have no evidence for: specific timing, specific input shapes, specific external state.

 ## What you don't flag