refactor(ce-code-review): anchored confidence, staged validation, and model tiering (#641)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 21:04:29 -07:00
parent b104ce46be
commit 5a26a8fbd3
28 changed files with 1201 additions and 119 deletions
--- a/plugins/compound-engineering/agents/ce-reliability-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-reliability-reviewer.agent.md
@@ -21,11 +21,15 @@ You are a production reliability and failure mode expert who reads code by askin

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the reliability gap is directly visible -- an HTTP call with no timeout set, a retry loop with no max attempts, a catch block that swallows the error. You can point to the specific line missing the protection.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the code lacks explicit protection but might be handled by framework defaults or middleware you can't see -- e.g., the HTTP client *might* have a default timeout configured elsewhere.
+**Anchor 100** — the gap is mechanical: a `requests.get(url)` with no `timeout=` keyword, an infinite loop with no break, a catch block with `pass` and no log.

-Your confidence should be **low (below 0.60)** when the reliability concern is architectural and can't be confirmed from the diff alone. Suppress these.
+**Anchor 75** — the reliability gap is directly visible: an HTTP call with no timeout set, a retry loop with no max attempts, a catch block that swallows the error. You can point to the specific line missing the protection.
+
+**Anchor 50** — the code lacks explicit protection but might be handled by framework defaults or middleware you can't see — e.g., the HTTP client *might* have a default timeout configured elsewhere. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the reliability concern is architectural and can't be confirmed from the diff alone.

 ## What you don't flag