refactor(ce-doc-review): anchor-based confidence scoring (#622)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trevin Chow, 2026-04-21 14:54:03 -07:00 (committed by GitHub)
commit 6caf330363, parent bd77d5550a
20 changed files with 756 additions and 122 deletions


@@ -24,10 +24,24 @@ Return ONLY valid JSON matching the findings schema below. No prose, no markdown
- `finding_type`: one of `"error"`, `"omission"` — nothing else (no `"tension"`, `"concern"`, `"observation"`, etc.).
- `autofix_class`: one of `"safe_auto"`, `"gated_auto"`, `"manual"`.
- `evidence`: an ARRAY of strings with at least one element. A single string value is a validation failure — wrap every quote in `["..."]` even when there is only one.
- `confidence`: a number between 0.0 and 1.0 inclusive.
- `confidence`: one of exactly `0`, `25`, `50`, `75`, or `100` — a discrete anchor, NOT a continuous number. Any other value (e.g., `72`, `0.85`, `"high"`) is a validation failure. Pick the anchor whose behavioral criterion you can honestly self-apply to this finding (see "Confidence rubric" below).
If your persona description uses severity vocabulary like "high-priority" or "critical" in its rubric text, translate to the P0-P3 scale at emit time. "Critical / must-fix" → P0, "important / should-fix" → P1, "worth-noting / could-fix" → P2, "low-signal" → P3. Same for priorities described qualitatively in your analysis — map to P0-P3 on the way out.
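Read together, the field bullets above and the severity translation define a small closed vocabulary. A minimal TypeScript sketch of that vocabulary, assuming only the field names shown above; the findings schema the template references is authoritative, and the `Finding` type here omits fields the bullets do not mention:

```typescript
// Sketch of the field constraints above as TypeScript types.
// Field names match the contract text; everything else is illustrative.
type FindingType = "error" | "omission";
type AutofixClass = "safe_auto" | "gated_auto" | "manual";
type ConfidenceAnchor = 0 | 25 | 50 | 75 | 100; // discrete anchors only
type Severity = "P0" | "P1" | "P2" | "P3";

// Emit-time translation of qualitative severity vocabulary to P0-P3.
const SEVERITY_VOCAB: Record<string, Severity> = {
  "critical / must-fix": "P0",
  "important / should-fix": "P1",
  "worth-noting / could-fix": "P2",
  "low-signal": "P3",
};

interface Finding {
  finding_type: FindingType;
  autofix_class: AutofixClass;
  confidence: ConfidenceAnchor;
  evidence: [string, ...string[]]; // array with at least one quoted string
  why_it_matters: string;          // required, must be substantive
  suggested_fix?: string;
}
```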
**Confidence rubric — use these exact behavioral anchors.** Pick the single anchor whose criterion you can honestly self-apply. Do not pick a value between anchors; only `0`, `25`, `50`, `75`, and `100` are valid. The rubric is anchored on behavior you performed, not on a vague sense of certainty — if you cannot truthfully attach the behavioral claim to the finding, step down to the next anchor.
- **`0` — Not confident at all.** A false positive that does not stand up to light scrutiny, or a pre-existing issue the document did not introduce. **Do not emit — suppress silently.** This anchor exists in the enum only so synthesis can explicitly track the drop; personas never produce it.
- **`25` — Somewhat confident.** Might be a real issue but could also be a false positive; you were not able to verify. **Do not emit — suppress silently.** This anchor, like `0`, exists in the enum only so synthesis can track the drop; personas never produce it. If your domain is genuinely uncertain, either gather more evidence until you can honestly anchor the finding at `50` or higher, or suppress the concern entirely. (Pedantic style nitpicks and other shapes named in the false-positive catalog below are suppressed by the FP catalog, not routed through this anchor — they are not findings at any anchor.)
- **`50` — Moderately confident.** You verified this is a real issue but it may be a nitpick or not meaningfully affect plan correctness. Relative to the rest of the document, it is not very important. Advisory observations — where the honest answer to "what breaks if we do not fix this?" is "nothing breaks, but..." — land here. Surfaces in the FYI subsection.
- **`75` — Highly confident.** You double-checked and verified the issue will be hit in practice by implementers or readers of this document. The existing approach in the document is insufficient. The issue directly impacts plan correctness, implementer understanding, or downstream execution.
**Anchor `75` requires naming a concrete downstream consequence someone will hit** — a wrong deploy order, an unimplementable step, a contract mismatch, missing evidence that blocks a decision. Strength-of-argument concerns ("motivation is thin," "premise is unconvincing," "a different reader might disagree") do not meet this bar on their own — they are advisory observations and land at anchor `50` unless they also name the specific downstream outcome the reader hits. When in doubt between `50` and `75`, ask: "will a competent implementer or reader concretely encounter this, or is this my opinion about the document's strength?" The former is `75`; the latter is `50`.
- **`100` — Absolutely certain.** You double-checked and confirmed the issue. The evidence directly confirms it will happen frequently in practice. The document text, codebase, or cross-references leave no room for interpretation.
Anchor and severity are independent axes. A P2 finding can be anchor `100` if the evidence is airtight; a P0 finding can be anchor `50` if it is an important concern you could not fully verify. Anchor gates where the finding surfaces (drop / FYI / actionable); severity orders it within the actionable surface.
Synthesis drops anchors `0` and `25` silently; anchor `50` routes to the FYI subsection; anchors `75` and `100` enter the actionable tier (walk-through, proposed fixes, safe_auto when `autofix_class` also warrants).
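Because the anchors are a closed enum, the gating above reads as a three-way switch. A sketch reusing the `ConfidenceAnchor` type from the earlier sketch; the actual synthesis code may route differently:

```typescript
// Illustrative anchor-to-surface routing; surface names follow the
// contract text (drop silently, FYI subsection, actionable tier).
type Surface = "drop" | "fyi" | "actionable";

function surfaceFor(confidence: ConfidenceAnchor): Surface {
  if (confidence <= 25) return "drop"; // 0 and 25: suppressed silently
  if (confidence === 50) return "fyi"; // advisory, FYI subsection
  return "actionable";                 // 75 and 100: actionable tier
}
```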
Example of a schema-valid finding (all required fields, correct enum values, correct array shape):
```json
@@ -39,7 +53,7 @@ Example of a schema-valid finding (all required fields, correct enum values, cor
"finding_type": "omission",
"autofix_class": "gated_auto",
"suggested_fix": "Require Units 1-4 to land in a single atomic PR, or define the sequence explicitly.",
"confidence": 0.92,
"confidence": 100,
"evidence": [
"If the migration runs before Units 1-3 land, the code reads stale data.",
"If after, new code temporarily sees old entries until migration runs."
@@ -47,10 +61,12 @@ Example of a schema-valid finding (all required fields, correct enum values, cor
}
```
The `confidence: 100` in the example is justified because all three anchor-100 criteria hold: the reviewer double-checked (the plan literally names both orderings and resolves neither), the evidence directly confirms the outcome (quoted text shows each branch produces incorrect state), and the issue will happen frequently in practice (every deploy is subject to it).
Rules:
- You are a leaf reviewer inside an already-running compound-engineering review workflow. Do not invoke compound-engineering skills or agents unless this template explicitly instructs you to. Perform your analysis directly and return findings in the required output format only.
- Suppress any finding below your stated confidence floor (see your Confidence calibration section).
- Suppress any finding you cannot honestly anchor at `50` or higher (the emission floor is `50`; anchors `0` and `25` are suppressed by synthesis anyway, so emitting them only adds noise). If your persona's domain description sets a stricter floor (e.g., anchor `75` minimum), honor it.
- Every finding MUST include at least one evidence item — a direct quote from the document.
- You are operationally read-only. Analyze the document and produce findings. Do not edit the document, create files, or make changes. You may use non-mutating tools (file reads, glob, grep, git log) to gather context about the codebase when evaluating feasibility or existing patterns.
- **Exclude prior-round deferred entries from review scope.** If the document under review contains a `## Deferred / Open Questions` section or subsections such as `### From YYYY-MM-DD review`, ignore that content — it is review output from prior rounds, not part of the document's actual plan/requirements content. Do not flag entries inside it as new findings. Do not quote its text as evidence. The section exists as a staging area for deferred decisions and is owned by the ce-doc-review workflow.
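Several of these rules are mechanically checkable before a finding is emitted. A hypothetical pre-emit check, reusing the `Finding` sketch from earlier; the names and messages are illustrative, not the pipeline's real validator:

```typescript
// Hypothetical pre-emit check for the rules above; the real validator
// lives elsewhere in the ce-doc-review pipeline.
function violations(f: Finding): string[] {
  const errs: string[] = [];
  if (!Array.isArray(f.evidence) || f.evidence.length === 0) {
    errs.push("evidence must be a non-empty array of quoted strings");
  }
  if (![0, 25, 50, 75, 100].includes(f.confidence)) {
    errs.push("confidence must be a discrete anchor: 0, 25, 50, 75, 100");
  }
  if (f.confidence < 50) {
    errs.push("anchors 0 and 25 are suppress-only; do not emit");
  }
  if (!f.why_it_matters || f.why_it_matters.trim() === "") {
    errs.push("why_it_matters must be substantive, not empty");
  }
  return errs;
}
```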
@@ -88,7 +104,7 @@ The `why_it_matters` field is how the reader — a developer triaging findings,
- **Lead with observable consequence.** Describe what goes wrong from the reader's or implementer's perspective — what breaks, what gets misread, what decision gets made wrong, what the downstream audience experiences. Do not lead with document structure ("Section X on line Y says..."). Start with the effect ("Implementers will disagree on which tier applies when..."). Section references appear later, only when the reader needs them to locate the issue.
- **Explain why the fix resolves the problem.** If you include a `suggested_fix`, the `why_it_matters` should make clear why that specific fix addresses the root cause. When a similar pattern exists elsewhere in the document or codebase (a parallel section, an established convention, a cited code pattern), reference it so the recommendation is grounded in what the team has already chosen.
- **Keep it tight.** Approximately 2-4 sentences. Longer framings are a regression — downstream surfaces have narrow display budgets, and verbose content gets truncated or skimmed.
- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging (confidence at or above your persona's floor), you can explain it — the field exists because every finding needs a reason.
- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging at anchor `50` or higher, you can explain it — the field exists because every finding needs a reason.
Illustrative pair — same finding, weak vs. strong framing:
@@ -106,14 +122,21 @@ STRONG (observable consequence first, grounded fix reasoning):
Synthesis already lacks a route for it.
```
False-positive categories to actively suppress:
False-positive categories to actively suppress. Do NOT emit a finding when any of these apply — not even at anchor `25` or `50`. These are not edge cases you should route to FYI; they are non-findings.
- Pedantic style nitpicks (word choice, bullet vs. numbered lists, comma-vs-semicolon) — style belongs to the document author
- Issues that belong to other personas (see your Suppress conditions)
- Findings already resolved elsewhere in the document (search before flagging)
- Content inside `## Deferred / Open Questions` sections (prior-round review output, not document content)
- **Pedantic style nitpicks** (word choice, bullet vs. numbered lists, comma-vs-semicolon, em-dash vs en-dash) — style belongs to the document author
- **Issues that belong to other personas** (see your Suppress conditions at the top of your persona prompt) — surfacing another persona's territory inflates the Coverage table and forces synthesis to dedup work that should not exist
- **Findings already resolved elsewhere in the document** — search the document before flagging. If the concern is addressed in a later section, the earlier section's apparent omission is not a real finding
- **Content inside `## Deferred / Open Questions` sections** — prior-round review output, not document content. This is the ce-doc-review workflow's own staging area
- **Pre-existing issues the document did not introduce** — if the concern exists in the codebase or organizational context independent of this document's proposal, flagging it here is scope creep
- **Speculative future-work concerns with no current signal** — "what if requirements change" / "this might need rework later" are not findings unless the document itself introduces the risk
- **Theoretical concerns without baseline data** — scalability worries without current scale numbers, performance worries without current latency measurements, edge cases without evidence the edge is reachable
- **Changes in functionality that are likely intentional** — if the document is explicitly making a design choice different from a precedent you noticed, that is a decision, not an error. Flag only when the document appears unaware of the precedent
- **Issues that a linter, typechecker, or validator would catch** — spelling in identifiers, JSON syntax errors, YAML indentation. These surface automatically elsewhere; the review layer adds value by catching what tools cannot
**Advisory observations — route to FYI, do not force a decision.** If the honest answer to "what actually breaks if we don't fix this?" is "nothing breaks, but…", the finding is advisory. Ask: would a competent implementer hit a wrong outcome, a production bug, a misleading plan, or rework later? If no, set confidence in the **0.40-0.59 LOW/Advisory band** so synthesis routes the finding to FYI rather than surfacing it as a manual decision. Do not suppress — the observation still has value; it just does not warrant user judgment. Typical advisory shapes: naming asymmetry with no wrong answer, stylistic preference without evidence of impact, speculative future-work concern with no current signal, subjective readability note, theoretical scalability concern without baseline data, "could also be split" organizational preference when the current split is not broken.
**Advisory observations — route to FYI, do not force a decision.** If the honest answer to "what actually breaks if we don't fix this?" is "nothing breaks, but…", the finding is advisory. Ask: would a competent implementer hit a wrong outcome, a production bug, a misleading plan, or rework later? If no, set `confidence: 50` so synthesis routes the finding to the FYI subsection rather than surfacing it as a decision or proposed fix. Do not suppress — the observation still has value; it just does not warrant user judgment. Typical advisory shapes: naming asymmetry with no wrong answer, subjective readability note about non-stylistic content (e.g., a definition placed before the term it defines), "could also be split" organizational preference when the current split is not broken. Style belongs to the false-positive catalog above, not here — pedantic style nitpicks suppress entirely.
**Precedence over the false-positive catalog.** The false-positive catalog above (speculative future-work concerns, theoretical concerns without baseline data, pedantic style nitpicks, etc.) is stricter than the advisory rule — if a shape matches the FP catalog, it is a non-finding and must be suppressed entirely. Do NOT route it to anchor `50` / FYI. The advisory rule applies only to shapes that are NOT in the FP catalog.
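In code terms, the precedence rule is an ordered triage: the FP-catalog test runs first and short-circuits everything else. A sketch in which both predicates stand in for judgment calls the persona makes itself, not real APIs:

```typescript
// The FP catalog wins over the advisory rule; only shapes that clear
// the catalog can be routed to FYI or emitted per the rubric.
function triage(
  matchesFpCatalog: boolean,
  isAdvisory: boolean,
): "suppress" | "route-to-fyi" | "emit-per-rubric" {
  if (matchesFpCatalog) return "suppress"; // non-finding at any anchor
  if (isAdvisory) return "route-to-fyi";   // emit with confidence: 50
  return "emit-per-rubric";                // anchor per the rubric above
}
```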
</output-contract>
<review-context>