refactor(ce-doc-review): anchor-based confidence scoring (#622)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:54:03 -07:00
parent bd77d5550a
commit 6caf330363
20 changed files with 756 additions and 122 deletions
--- a/plugins/compound-engineering/agents/ce-adversarial-document-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-adversarial-document-reviewer.agent.md
@@ -72,10 +72,12 @@ Probe whether the document considered the obvious alternatives and whether the c

 ## Confidence calibration

- **HIGH (0.80+):** Can quote specific text from the document showing the gap, construct a concrete scenario or counterargument, and trace the consequence.
- **MODERATE (0.60-0.79):** The gap is likely but confirming it would require information not in the document (codebase details, user research, production data).
- **LOW (0.40-0.59) — Advisory:** A plausible-but-unlikely failure mode, or a concern worth surfacing without a strong supporting scenario. Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
+Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Adversarial's domain is premise and failure-mode challenges. Adversarial findings cap naturally at anchor `75` for most concerns because premise challenges inherently resist full verification — "is this assumption wrong?" usually cannot be proven true in advance. That is not a calibration problem; it is the nature of the work. Apply as:
+
+- **`100` — Absolutely certain:** Can quote specific text showing the gap, construct a concrete scenario or counterargument with cited evidence, AND trace the consequence to observable impact. The rare case — use sparingly.
+- **`75` — Highly confident:** The gap is likely to bite and you can describe the scenario concretely, but full confirmation would require information not in the document (codebase details, user research, production data). You double-checked and the concern is material. This is adversarial's normal working ceiling.
+- **`50` — Advisory (routes to FYI):** A plausible-but-unlikely failure mode, or a concern worth surfacing without a strong supporting scenario. Still requires an evidence quote. Surfaces as observation without forcing a decision.
+- **Suppress entirely:** Anything below anchor `50` — speculative "what if" with no supporting scenario. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-coherence-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-coherence-reviewer.agent.md
@@ -23,7 +23,7 @@ You are a technical editor reading for internal consistency. You don't evaluate

 ## Safe_auto patterns you own

-Coherence is the primary persona for surfacing mechanically-fixable consistency issues. These patterns should land as `safe_auto` with `confidence: 0.85+` when the document supplies the authoritative signal:
+Coherence is the primary persona for surfacing mechanically-fixable consistency issues. These patterns should land as `safe_auto` with `confidence: 100` when the document supplies the authoritative signal (the document text leaves no room for interpretation):

 - **Header/body count mismatch.** Section header claims a count (e.g., "6 requirements") and the enumerated body list has a different count (5 items). The body is authoritative unless the document explicitly identifies a missing item. Fix: correct the header to match the list.
 - **Cross-reference to a named section that does not exist.** Text says "see Unit 7" / "per Section 4.2" / "as described in the Rollout section" and that target is not defined anywhere in the document. Fix: delete the reference or fix it to point at an existing target.
@@ -39,10 +39,12 @@ When in doubt, surface the finding as `safe_auto` with `why_it_matters` that nam

 ## Confidence calibration

- **HIGH (0.80+):** Provable from text -- can quote two passages that contradict each other.
- **MODERATE (0.60-0.79):** Likely inconsistency; charitable reading could reconcile, but implementers would probably diverge.
- **LOW (0.40-0.59) — Advisory:** Minor asymmetry or drift with no downstream consequence (e.g., parallel names that don't need to match, phrasing that's inconsistent but unambiguous). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress entirely.
+Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Coherence's domain typically hits the strongest anchors because inconsistencies are verifiable from document text alone. Apply as:
+
+- **`100` — Absolutely certain:** Provable from text — can quote two passages that contradict each other. Document text leaves no room for interpretation.
+- **`75` — Highly confident:** Likely inconsistency; a charitable reading could reconcile, but implementers would probably diverge. You double-checked and the issue will be hit in practice.
+- **`50` — Advisory (routes to FYI):** Minor asymmetry or drift with no downstream consequence (parallel names that don't need to match, phrasing that's inconsistent but unambiguous). Still requires an evidence quote. Surfaces as observation without forcing a decision.
+- **Suppress entirely:** Anything below anchor `50` — cannot verify, speculative, or stylistic drift without impact. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-design-lens-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-design-lens-reviewer.agent.md
@@ -34,10 +34,12 @@ Explain what's missing: the functional design thinking that makes the interface

 ## Confidence calibration

- **HIGH (0.80+):** Missing states/flows that will clearly cause UX problems during implementation.
- **MODERATE (0.60-0.79):** Gap exists but a skilled designer could resolve from context.
- **LOW (0.40-0.59) — Advisory:** Pattern or micro-layout preference without strong usability evidence (e.g., button placement alternatives, visual hierarchy micro-choices). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
+Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Design-lens's domain grounds in named interaction states and user flows. Apply as:
+
+- **`100` — Absolutely certain:** Missing states or flows that will clearly cause UX problems during implementation. Evidence directly confirms the gap — the document names an interaction without the corresponding state or transition.
+- **`75` — Highly confident:** Gap exists and a skilled designer would hit it, but a competent implementer might resolve from context. You double-checked and the issue will surface in practice.
+- **`50` — Advisory (routes to FYI):** Pattern or micro-layout preference without strong usability evidence (button placement alternatives, visual hierarchy micro-choices). Still requires an evidence quote. Surfaces as observation without forcing a decision.
+- **Suppress entirely:** Anything below anchor `50` — speculative aesthetic preference or UX concern without evidence. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-feasibility-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-feasibility-reviewer.agent.md
@@ -27,10 +27,12 @@ Apply each check only when relevant. Silence is only a finding when the gap woul

 ## Confidence calibration

- **HIGH (0.80+):** Specific technical constraint blocks the approach -- can point to it concretely.
- **MODERATE (0.60-0.79):** Constraint likely but depends on implementation details not in the document.
- **LOW (0.40-0.59) — Advisory:** Theoretical constraint with no current-scale evidence (e.g., "could be slow if data grows 10x", speculative scalability concerns with no baseline number). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress entirely.
+Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Feasibility's domain grounds in codebase evidence, so it reaches the strongest anchors when you can cite concrete technical constraints. Apply as:
+
+- **`100` — Absolutely certain:** Specific technical constraint blocks the approach and you can cite it concretely (codebase reference, framework behavior, platform limit). Evidence directly confirms.
+- **`75` — Highly confident:** Constraint likely to bite, but confirming it would require implementation details not in the document. You double-checked and the issue will be hit in practice.
+- **`50` — Advisory (routes to FYI):** A verified constraint that is genuinely minor at current scale — the implementer should know it exists but would not be surprised by it hitting in practice. Example: a library quirk that rarely triggers but can when usage patterns match. Still requires an evidence quote. Surfaces as observation without forcing a decision. Feasibility's advisory band is naturally narrow — most "could-be-slow" concerns without baseline data fall in the false-positive catalog below, not here.
+- **Suppress entirely:** Anything below anchor `50`, plus any shape the false-positive catalog in `subagent-template.md` names. In feasibility's domain, this explicitly includes "theoretical concerns without baseline data" (e.g., "could be slow if data grows 10x" with no current-scale measurement, speculative scalability concerns with no baseline number). Those are non-findings that must NOT be routed to anchor `50`. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-product-lens-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-product-lens-reviewer.agent.md
@@ -58,10 +58,12 @@ If priority tiers exist: do assignments match stated goals? Are must-haves truly

 ## Confidence calibration

- **HIGH (0.80+):** Can quote both the goal and the conflicting work -- disconnect is clear.
- **MODERATE (0.60-0.79):** Likely misalignment, depends on business context not in document.
- **LOW (0.40-0.59) — Advisory:** Observation about positioning, naming, or strategy without a concrete impact (subjective preference, speculative future-product concern with no current signal). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
+Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Product-lens's domain is premise and strategy — whether the document's goals, motivation, and priorities hold up. Premise critiques cap naturally at anchor `75` for most concerns because "is the motivation valid?" cannot be verified against ground truth; it requires business context the document may not supply. That is not a calibration problem; it is the nature of the work. Apply as:
+
+- **`100` — Absolutely certain:** Can quote both the goal and the conflicting work — disconnect is clear. Evidence directly confirms the misalignment within the document itself. The rare case — use sparingly.
+- **`75` — Highly confident:** Likely misalignment, full confirmation depends on business context not in the document. You double-checked and the concern will materially affect direction. This is product-lens's normal working ceiling.
+- **`50` — Advisory (routes to FYI):** Observation about positioning, naming, or strategy without a concrete impact (subjective preference about framing with an evidence quote, minor identity-drift note where the drift has no downstream user consequence). Still requires an evidence quote. Surfaces as observation without forcing a decision.
+- **Suppress entirely:** Anything below anchor `50`, plus any shape the false-positive catalog in `subagent-template.md` names. In product-lens's domain, this explicitly includes "speculative future-product concerns with no current signal" — those are non-findings that must NOT be routed to anchor `50`. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-scope-guardian-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-scope-guardian-reviewer.agent.md
@@ -41,10 +41,12 @@ With AI-assisted implementation, the cost gap between shortcuts and complete sol

 ## Confidence calibration

- **HIGH (0.80+):** Can quote goal statement and scope item showing the mismatch.
- **MODERATE (0.60-0.79):** Misalignment likely but depends on context not in document.
- **LOW (0.40-0.59) — Advisory:** Organizational preference without a concrete cost (unit ordering, section placement alternatives that read equally well, "this could also be split" observations without real impact). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
+Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Scope-guardian's domain grounds in the document's own stated goals and declared scope. Apply as:
+
+- **`100` — Absolutely certain:** Can quote both the goal statement and the scope item showing the mismatch. Evidence directly confirms the misalignment.
+- **`75` — Highly confident:** Misalignment likely to derail the work, but fully confirming it would require context not in the document (strategic priorities, prior decisions). You double-checked and the issue will hit implementers.
+- **`50` — Advisory (routes to FYI):** Organizational preference without a concrete cost (unit ordering, section placement alternatives that read equally well, "this could also be split" observations without real impact). Still requires an evidence quote. Surfaces as observation without forcing a decision.
+- **Suppress entirely:** Anything below anchor `50` — speculative concern or stylistic preference. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-security-lens-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-security-lens-reviewer.agent.md
@@ -25,10 +25,12 @@ Skip areas not relevant to the document's scope.

 ## Confidence calibration

- **HIGH (0.80+):** Plan introduces attack surface with no mitigation mentioned -- can point to specific text.
- **MODERATE (0.60-0.79):** Concern likely but plan may address implicitly or in a later phase.
- **LOW (0.40-0.59) — Advisory:** Theoretical attack surface with no realistic exploit path under current design (e.g., speculative timing-attack on non-sensitive data, defense-in-depth nice-to-have with no current vector). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
+Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Security-lens's domain grounds in named attack surfaces and missing mitigations. Apply as:
+
+- **`100` — Absolutely certain:** Plan introduces attack surface with no mitigation mentioned — can point to specific text. Evidence directly confirms the gap; the exploit path is concrete.
+- **`75` — Highly confident:** Concern is likely exploitable, but the plan may address it implicitly or in a later phase not yet specified. You double-checked and the vector is material.
+- **`50` — Advisory (routes to FYI):** A verified gap that would make the design more robust but is not required by the threat model the plan commits to — for example, a defense-in-depth addition on a path that already has a primary mitigation, or a logging gap that would help incident response without preventing the incident. Still requires an evidence quote. Surfaces as observation without forcing a decision.
+- **Suppress entirely:** Anything below anchor `50`, plus any shape the false-positive catalog in `subagent-template.md` names. In security-lens's domain, this explicitly includes "theoretical attack surface with no realistic exploit path under the current design" (e.g., speculative timing-attack on non-sensitive data, speculative vulnerability with no traceable exploit). Those are non-findings that must NOT be routed to anchor `50`. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.

 ## What you don't flag