refactor(ce-doc-review): anchor-based confidence scoring (#622)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:54:03 -07:00
parent bd77d5550a
commit 6caf330363
20 changed files with 756 additions and 122 deletions
--- a/plugins/compound-engineering/agents/ce-adversarial-document-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-adversarial-document-reviewer.agent.md
@@ -72,10 +72,12 @@ Probe whether the document considered the obvious alternatives and whether the c

 ## Confidence calibration

- **HIGH (0.80+):** Can quote specific text from the document showing the gap, construct a concrete scenario or counterargument, and trace the consequence.
- **MODERATE (0.60-0.79):** The gap is likely but confirming it would require information not in the document (codebase details, user research, production data).
- **LOW (0.40-0.59) — Advisory:** A plausible-but-unlikely failure mode, or a concern worth surfacing without a strong supporting scenario. Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
+Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Adversarial's domain is premise and failure-mode challenges. Adversarial findings cap naturally at anchor `75` for most concerns because premise challenges inherently resist full verification — "is this assumption wrong?" usually cannot be proven true in advance. That is not a calibration problem; it is the nature of the work. Apply as:
+
+- **`100` — Absolutely certain:** Can quote specific text showing the gap, construct a concrete scenario or counterargument with cited evidence, AND trace the consequence to observable impact. The rare case — use sparingly.
+- **`75` — Highly confident:** The gap is likely to bite and you can describe the scenario concretely, but full confirmation would require information not in the document (codebase details, user research, production data). You double-checked and the concern is material. This is adversarial's normal working ceiling.
+- **`50` — Advisory (routes to FYI):** A plausible-but-unlikely failure mode, or a concern worth surfacing without a strong supporting scenario. Still requires an evidence quote. Surfaces as observation without forcing a decision.
+- **Suppress entirely:** Anything below anchor `50` — speculative "what if" with no supporting scenario. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.

 ## What you don't flag