refactor(ce-doc-review): anchor-based confidence scoring (#622)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trevin Chow
2026-04-21 14:54:03 -07:00
committed by GitHub
parent bd77d5550a
commit 6caf330363
20 changed files with 756 additions and 122 deletions

View File

@@ -0,0 +1,363 @@
---
title: "Refactor ce-doc-review confidence scoring to anchored rubric"
type: refactor
status: active
date: 2026-04-21
---
# Refactor ce-doc-review confidence scoring to anchored rubric
## Overview
Replace ce-doc-review's continuous `confidence: 0.0-1.0` field with a 5-anchor rubric (`0 | 25 | 50 | 75 | 100`), each anchor tied to a behavioral definition the persona can honestly self-apply. The change adopts the structural techniques from Anthropic's official code-review plugin (anchored scoring, verbatim rubric in the agent prompt, explicit false-positive catalog) while tuning the threshold (`>= 50`) to document-review economics, which invert code review's asymmetries: there is no linter backstop, premise challenges resist verification, surfaced findings are cheap to dismiss via the routing menu, and missed findings derail downstream implementation.
The goal is to eliminate false-precision gaming (personas anchoring on round numbers like 0.65 / 0.72 / 0.85 and implying differentiation that the model cannot actually produce) and replace it with discrete anchors whose meaning is stable and behaviorally grounded.
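For orientation, a minimal sketch of the discrete scale and the tier each anchor routes to (illustrative only; the authoritative rubric wording is written during Unit 1 and Unit 2):

```typescript
// Illustrative only: the discrete scale and where each anchor routes.
type ConfidenceAnchor = 0 | 25 | 50 | 75 | 100;

const anchorRoute: Record<ConfidenceAnchor, string> = {
  0: "dropped (false positive or pre-existing)",
  25: "dropped (could not verify; stylistic without impact)",
  50: "FYI subsection (surface-only, no forced decision)",
  75: "actionable tier (Decision / Proposed fix per autofix_class)",
  100: "actionable tier; safe_auto-eligible when autofix_class warrants",
};
```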
## Problem Frame
Current state: `confidence` is a float between 0.0 and 1.0. Synthesis uses per-severity gates (0.50 / 0.60 / 0.65 / 0.75) and a 0.40 FYI floor. LLM-generated confidence at this granularity is not meaningfully calibrated — personas in practice cluster on round numbers (0.60, 0.65, 0.72, 0.80, 0.85), and the gate boundaries create coin-flip bands where trivial score shifts move findings in and out of the actionable tier.
Evidence surfaced in a recent review run:
- One 0.65 adversarial finding sat exactly at the P2 gate, admitted on score noise rather than signal
- Multiple product-lens findings in the 0.68-0.72 range all shared the same underlying premise ("motivation weak"): false precision layered on redundant signal
- Residual concerns and deferred questions near-duplicated actionable findings, indicating the persona's internal confidence ordering did not distinguish "above-gate finding" from "below-gate concern" coherently
Anthropic's official code-review plugin (`anthropics/claude-plugins-official/plugins/code-review/commands/code-review.md`) solves this with:
- 5 anchor points (0/25/50/75/100) each tied to a behavioral criterion ("double-checked and verified", "wasn't able to verify", "evidence directly confirms")
- A rubric passed verbatim to a separate scoring agent
- Threshold >= 80 (code-review-specific; doc review uses a different threshold)
- Explicit false-positive catalog
This plan ports the structural techniques and tunes the threshold to document-review economics.
## Requirements Trace
- R1. Replace continuous `confidence` field with 5 discrete anchor points (0, 25, 50, 75, 100) and a behavioral rubric per anchor.
- R2. Update synthesis pipeline to consume anchor values (gates, tiebreaks, dedup, promotion, cross-persona boost, FYI floor).
- R3. Update all 7 document-review persona agents' prompts so the rubric is embedded verbatim.
- R4. Add an explicit false-positive catalog to the subagent template (consolidated from scattered current guidance).
- R5. Adopt a doc-review-appropriate filter threshold: >= 50 across severities (drop only the "false positive" and "stylistic-unverified" tiers), replacing the graduated per-severity gates.
- R6. Preserve the existing three-tier routing semantics, mapped to anchors: 50 -> FYI, 75 -> Decision, 100 -> Proposed fix / safe_auto.
- R7. Update rendering surfaces (template, walkthrough, headless envelope) so anchors display consistently as integer scores, not floats.
- R8. Update tests and fixtures without regressing coverage.
- R9. Keep `ce-code-review` unchanged in this PR — it is a separate migration with different economics (see Scope Boundaries).
## Scope Boundaries
- No change to persona-specific domain logic (what each persona looks for). Only the confidence rubric and synthesis consumption change.
- No change to severity taxonomy (`P0 | P1 | P2 | P3`).
- No change to `finding_type` or `autofix_class` enums.
- No change to `residual_risks` / `deferred_questions` schema shape (they remain string arrays).
- No new schema fields (explicitly rejected `finding_type: grounded | pattern | premise` tag — redundant with persona attribution).
### Deferred to Separate Tasks
- **ce-code-review scoring migration**: Same pattern, but code-review economics differ (linter backstop, PR-comment cost, ground-truth verifiability). Threshold likely `>= 75` there, matching Anthropic more closely. Separate plan once ce-doc-review migration is proven in practice.
- **Separate neutral-scorer agent pass**: A second scoring pass where a neutral agent re-scores each finding against the rubric, independent of the producing persona. Structurally valuable (breaks self-serving score inflation) but adds latency and token cost. Evaluate as a follow-up once the anchor rubric is in place and its effect on score inflation can be measured directly.
## Context & Research
### Relevant Code and Patterns
- `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json` — confidence field definition (lines 60-65, continuous 0.0-1.0)
- `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` — schema rule (line 27), advisory band rule (line 116), false-positive list (lines 109-114)
- `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` — per-severity gate table (lines 15-25), FYI floor (line 28), cross-persona boost (line 45), promotion patterns (section 3.6), sort (section 3.8)
- `plugins/compound-engineering/skills/ce-doc-review/references/review-output-template.md` — confidence column rendering (line 67 and section rules)
- `plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md` — confidence display in per-finding block
- `plugins/compound-engineering/agents/document-review/*.md` — 7 persona files. Only `ce-coherence-reviewer.agent.md` currently references a specific confidence floor (`0.85+` for safe_auto patterns, line 26); the others rely on the template
- `tests/pipeline-review-contract.test.ts`, `tests/review-skill-contract.test.ts`, `tests/fixtures/ce-doc-review/seeded-*.md` — test fixtures with embedded confidence values
### Institutional Learnings
No prior `docs/solutions/` entry on scoring calibration. This plan should produce one on completion (under `docs/solutions/workflow/` or `docs/solutions/skill-design/`) documenting the migration and the reasoning behind the doc-review threshold vs Anthropic's code-review threshold, since the tradeoff is non-obvious and future contributors may question the divergence.
### External References
- `anthropics/claude-plugins-official/plugins/code-review/commands/code-review.md` — canonical anchored-rubric pattern. The rubric text and filter approach are the structural model; the threshold is not ported directly (see Key Technical Decisions).
- Calibration research context: LLM verbal-confidence studies show coarse anchor scales outperform continuous numeric scales because continuous scales invite false precision the model cannot produce. This is why Anthropic chose 5 anchors rather than 0-100 continuous.
## Key Technical Decisions
- **5 anchors, not 3 or 10**: Matches Anthropic's proven format. More resolution than Low/Medium/High, still discrete enough to avoid gaming. The anchor values (0/25/50/75/100) are literal integer scores, preserved as integers in the schema.
- **Filter threshold `>= 50`, not `>= 80`**: Doc review has opposite economics from code review. The threshold drops only tier 0 ("false positive, pre-existing, or can't survive light scrutiny") and tier 25 ("might be real but couldn't verify; stylistic-not-in-origin"). Tiers 50+ surface with appropriate routing. Rationale documented inline in the rubric so future contributors see why doc review diverges from Anthropic's `>= 80`.
- **No separate scoring agent (this PR)**: Self-scoring with a rigorous rubric is the first step. Adding a neutral scorer is a follow-up once we can measure whether self-scoring with anchors still inflates scores relative to ground truth.
- **Anchor-to-tier mapping**: 50 -> FYI subsection, 75 -> Decision / Proposed fix, 100 -> eligible for safe_auto when `autofix_class` also warrants. Tier 25 -> dropped. Tier 0 -> dropped. This replaces both the graduated per-severity gate AND the FYI floor with a single anchor-based routing rule (sketched after this list).
- **Cross-persona corroboration promotes by one anchor, not `+0.10`**: When 2+ personas raise the same finding, promote one anchor step (50 -> 75, 75 -> 100). Cleaner than the magic `+0.10` and semantically meaningful (independent corroboration genuinely moves a "verified but nitpick" finding to "very likely, will hit in practice").
- **Tiebreak ordering**: When sorting findings within a severity tier, use anchor descending, then document order (deterministic). Drop the pseudo-precision tiebreak that currently uses float confidence.
- **Preserve reviewer attribution as the persona-calibration signal**: No `finding_type: grounded | pattern | premise` tag. If a persona's domain caps its natural ceiling at 50-75, the anchors and threshold handle it — findings land in FYI or Decision as appropriate. The reviewer name in the output already tells the user which persona raised it; they can apply their own mental model.
- **Strawman rule stays; advisory band rule absorbed into the rubric**: The advisory-band guidance currently lives as a "0.40-0.59 LOW" instruction. Under the new rubric, "advisory observations" map cleanly to tier 25 or 50 depending on verifiability. Rewrite the advisory rule to refer to anchors, not a float range.
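A minimal sketch of the promotion and tiebreak decisions above, using a hypothetical `Finding` shape (the synthesis pipeline itself is prompt-driven; this is illustration, not implementation):

```typescript
type Anchor = 0 | 25 | 50 | 75 | 100;

interface Finding {
  confidence: Anchor;
  documentOrder: number; // position of the cited evidence in the reviewed document
}

// Cross-persona corroboration: promote one anchor step, capped at 100.
function promoteOneStep(anchor: Anchor): Anchor {
  const ladder: Anchor[] = [0, 25, 50, 75, 100];
  return ladder[Math.min(ladder.indexOf(anchor) + 1, ladder.length - 1)];
}

// Within-severity sort: anchor descending, then document order (deterministic).
function sortWithinSeverity(findings: Finding[]): Finding[] {
  return [...findings].sort(
    (a, b) => b.confidence - a.confidence || a.documentOrder - b.documentOrder
  );
}
```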
## Open Questions
### Resolved During Planning
- **Port ce-code-review in the same PR?** No. Different economics require a different threshold; bundling conflates the migration with the threshold tuning. Do ce-doc-review first, observe, then plan ce-code-review.
- **Keep numeric anchors or use semantic labels (weak / plausible / verified / certain)?** Keep numeric. Matches Anthropic, preserves ordinality for synthesis comparisons, keeps the rendering compact (`Tier: 75` vs `Tier: verified-strong`).
- **Add a `finding_type: grounded | pattern | premise` dimension?** No. Redundant with persona attribution and adds decoding overhead without changing what the user does with the finding.
- **Single threshold or severity-graduated?** Single `>= 50` across severities. Severity already sorts the list; an additional gate gradient adds complexity without differentiating signal.
### Deferred to Implementation
- **Exact rubric wording for each anchor.** The implementation pass writes the final text; this plan captures the behavioral criteria. The wording must be concrete enough that a persona can self-apply it without inventing interpretation — "double-checked against evidence" is concrete; "highly confident" is not.
- **Whether any persona needs a persona-specific floor override.** Coherence currently cites `0.85+` as its safe_auto threshold. Under the new scale, "safe_auto" maps to anchor 100 (evidence directly confirms) — no separate floor needed. If any other persona has equivalent persona-specific guidance during implementation, decide per-persona whether to preserve or remove.
- **Fixture value choices.** The seeded plan fixtures carry specific confidence values. Converting `0.85` -> `75` vs `100` is a per-fixture judgment call; the implementer decides based on what the fixture is demonstrating.
## Implementation Units
- [ ] **Unit 1: Update schema and rubric authority file**
**Goal:** Replace the `confidence` field definition with an integer enum and write the canonical behavioral rubric in one place.
**Requirements:** R1
**Dependencies:** None (this unit establishes the contract everything else consumes)
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json`
- Test: `tests/frontmatter.test.ts` (schema-shape test if one exists; otherwise covered by contract tests in later units)
**Approach:**
- Replace `confidence: { type: "number", minimum: 0.0, maximum: 1.0 }` with `confidence: { type: "integer", enum: [0, 25, 50, 75, 100] }`
- Embed the rubric in the `description` field as a multi-line string so agents consuming the schema see it inline. Each anchor point gets a behavioral criterion (see "Patterns to follow" below)
- Keep `"calibrated per persona"` language gone — the rubric is shared, not per-persona
**Patterns to follow:**
- Anthropic's verbatim rubric from `anthropics/claude-plugins-official/plugins/code-review/commands/code-review.md` step 5. Adapt the criteria to document-review context: replace "PR bug" framing with "document issue" framing; replace "directly impacts code functionality" with "directly impacts plan correctness or implementer understanding"; preserve the "double-checked" / "wasn't able to verify" / "evidence directly confirms" behavioral anchors verbatim where they apply
**Test scenarios:**
- Happy path: A JSON finding with `confidence: 75` validates against the schema
- Error path: A JSON finding with `confidence: 0.72` fails validation (continuous values rejected)
- Error path: A JSON finding with `confidence: 10` fails validation (non-anchor integer rejected)
- Edge case: `confidence: 0` validates (false-positive anchor is a legitimate value, not a validation failure — surface-then-drop happens in synthesis)
**Verification:**
- `bun test tests/frontmatter.test.ts` passes
- Manually running the schema validator against a fixture finding with `confidence: 0.85` produces a clear error message
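A sketch of what the replacement fragment might look like, written as a TypeScript literal for readability (the real change edits the JSON file directly; the rubric wording below is placeholder text, not the final criteria):

```typescript
// Placeholder sketch of the replacement `confidence` fragment for findings-schema.json.
// Final rubric wording is an implementation decision; these criteria are illustrative.
const confidence = {
  type: "integer",
  enum: [0, 25, 50, 75, 100],
  description: [
    "0   - false positive, pre-existing, or does not survive light scrutiny",
    "25  - might be real but could not be verified; stylistic without impact",
    "50  - verified real but minor; advisory, routes to FYI",
    "75  - double-checked, will be hit in practice",
    "100 - evidence directly confirms; will happen frequently",
  ].join("\n"),
};
```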
- [ ] **Unit 2: Rewrite rubric guidance in the subagent template**
**Goal:** Update the shared template that all 7 personas include, so the rubric, false-positive catalog, and advisory rule all reference the new anchors.
**Requirements:** R3, R4
**Dependencies:** Unit 1 (schema is the contract this template communicates)
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md`
**Approach:**
- Replace line 27's `confidence: a number between 0.0 and 1.0 inclusive` with the anchor definition plus the full behavioral rubric (5 bullets, one per anchor). The rubric goes in the template verbatim — this is what every persona sees when the template renders
- Rewrite the advisory-band rule (line 116) to refer to anchor 25 or anchor 50 instead of "0.40-0.59 LOW band"
- Consolidate the false-positive catalog (currently lines 109-114, scattered) into a single bulleted list positioned adjacent to the rubric. Add explicit false-positive categories adapted from Anthropic's code-review list: "Issues already resolved elsewhere in the document", "Content inside prior-round Deferred / Open Questions sections", "Stylistic preferences without evidence of impact", "Pre-existing issues the document didn't introduce", "Issues that belong to other personas", "Speculative future-work concerns with no current signal"
- Update the suppress-below-floor rule (line 53) from "your stated confidence floor" to "anchor tier 50 (the actionable floor) unless your persona sets a stricter floor"
- Update the example finding (lines 33-48) to use `confidence: 100` instead of `0.92`, with a one-line inline note explaining why ("all three conditions met: double-checked, will hit in practice, evidence directly confirms")
**Patterns to follow:**
- Structure of the existing autofix_class section (lines 60-63) — three tiers with a one-sentence behavioral definition each. Mirror this format for the confidence anchors
**Test scenarios:**
- Test expectation: none — this is a prompt-content file. Behavioral changes are tested via the persona output-shape tests in Unit 6
**Verification:**
- Rubric text is present verbatim in the template
- No references to float confidence values (0.0-1.0) remain anywhere in the file
- False-positive catalog appears as a single consolidated list, not scattered sentences
- [ ] **Unit 3: Update synthesis pipeline to consume anchor values**
**Goal:** Replace every numeric-confidence comparison in the synthesis pipeline with anchor-based logic.
**Requirements:** R2, R5, R6
**Dependencies:** Unit 1
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md`
**Approach:**
- **Section 3.2 (Confidence Gate):** Replace the per-severity gate table with a single rule: findings with `confidence: 0` or `confidence: 25` are dropped; findings with `confidence: 50` route to FYI; findings with `confidence: 75` or `100` enter the actionable tier and are classified by autofix_class. Delete the separate "FYI floor at 0.40" concept — it is now the `confidence: 50` anchor
- **Section 3.3 (Deduplicate):** Replace "keep the highest confidence" tiebreak with "keep the highest anchor; if tied, keep the first by document order"
- **Section 3.3b (Same-persona redundancy, added in prior session):** Update the kept-finding selection rule to use anchor ordering
- **Section 3.4 (Cross-persona boost):** Replace `+0.10` boost with "promote by one anchor step (50 -> 75, 75 -> 100). Anchor 100 does not promote further. Record the promotion in the Reviewer column (e.g., `coherence, feasibility (+1 anchor)`)"
- **Section 3.5b (Tiebreak):** Update the `suggested_fix present` default-to-Apply gate to reference the anchor ordering for tiebreaks
- **Section 3.6 (Promote):** The "promote manual to safe_auto/gated_auto" logic is orthogonal to confidence and stays as-is; add a note that promotion does not change the confidence anchor (autofix_class and confidence are independent)
- **Section 3.7 (Route):** Update the routing table: anchor 100 + `safe_auto` -> silent apply; anchor 100 + `gated_auto` -> proposed fix (recommended Apply); anchor 75 -> proposed fix / decision per autofix_class; anchor 50 -> FYI subsection regardless of autofix_class (see the sketch after this unit)
- **Section 3.8 (Sort):** Replace "confidence (descending)" with "anchor (descending)" in the sort-key chain
- **Section 3.9 (Residual/Deferred restatement suppression, added in prior session):** No confidence-dependent logic; no change needed
**Patterns to follow:**
- The existing vocabulary-rule pattern at the Phase 4 preamble — a single strong directive followed by examples. Apply the same style to the anchor-routing rules so they cannot drift
**Test scenarios:**
- Happy path: A finding with `confidence: 75, autofix_class: gated_auto` surfaces in the Proposed Fixes bucket
- Happy path: A finding with `confidence: 50, autofix_class: manual` surfaces in the FYI subsection
- Happy path: A finding with `confidence: 100, autofix_class: safe_auto` applies silently
- Edge case: A finding with `confidence: 25` is dropped entirely (not surfaced in FYI, not surfaced in Residual Concerns)
- Edge case: Two personas raise the same finding, both at anchor 50; post-boost anchor is 75 and the finding routes as a Decision
- Edge case: One persona at anchor 100 and one at anchor 50 raise the same finding; merged keeps 100, boost does not apply beyond the cap
**Verification:**
- No float thresholds (0.40, 0.50, 0.60, 0.65, 0.75) remain in the synthesis file
- The routing table explicitly names each anchor and its destination
- Cross-persona boost mentions "anchor step" not "+0.10"
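A sketch of the single gate-and-route rule described above; the names and the 75 + `safe_auto` handling are illustrative assumptions, not the authoritative rule (which lives in synthesis-and-presentation.md):

```typescript
type Anchor = 0 | 25 | 50 | 75 | 100;
type AutofixClass = "safe_auto" | "gated_auto" | "manual";
type Route = "drop" | "fyi" | "decision" | "proposed_fix" | "silent_apply";

// Single gate (>= 50) plus anchor/autofix routing; replaces the per-severity
// gate table and the separate FYI floor.
function route(confidence: Anchor, autofix: AutofixClass): Route {
  if (confidence < 50) return "drop";       // anchors 0 and 25
  if (confidence === 50) return "fyi";      // surface-only, regardless of autofix_class
  if (confidence === 100 && autofix === "safe_auto") return "silent_apply";
  if (autofix === "manual") return "decision";
  return "proposed_fix";                    // 75/100 gated_auto, and 75 safe_auto (assumed)
}
```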
- [ ] **Unit 4: Update rendering surfaces**
**Goal:** Display anchors as integer scores in the user-facing output; remove float-formatting artifacts.
**Requirements:** R7
**Dependencies:** Unit 1, Unit 3
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/review-output-template.md`
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md`
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/open-questions-defer.md` (if it renders confidence)
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/bulk-preview.md` (if it renders confidence)
**Approach:**
- Table `Confidence` columns show the integer score as-is (e.g., `75`), not formatted as a decimal (`0.75`)
- Walkthrough per-finding block displays `confidence 75` not `confidence 0.75`
- Headless envelope template in `synthesis-and-presentation.md` Phase 4 shows the integer anchor
- Add a one-line rubric legend somewhere user-visible so a reader seeing `75` for the first time knows what it means without reading the schema. Candidates: a footer under the Coverage table, or a one-line note at the top of the findings list. Decide during implementation; pick whichever integrates cleanly with the existing layout. (A small rendering sketch follows this unit.)
**Patterns to follow:**
- The existing `Tier` column in the output template (which surfaces internal enum values for transparency). Apply the same treatment to the `Confidence` column so it displays the anchor integer; keep the `Tier` column separate since anchor and tier are independent
**Test scenarios:**
- Happy path: A rendered table shows `75` in the Confidence column, not `0.75` or `75%` or `75 (high)`
- Happy path: Walkthrough per-finding block reads naturally with integer anchor
- Edge case: When a finding was cross-persona-boosted, the display shows the post-boost anchor value (e.g., 75) and the Reviewer column notes the boost (`coherence, feasibility (+1 anchor)`)
**Verification:**
- Rendering a fixture finding end-to-end through the synthesis pipeline produces output with integer anchors throughout, no float values
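A small rendering sketch of the display rules above (column names follow the current output template; the helper names are invented for illustration):

```typescript
interface RenderedFinding {
  confidence: 0 | 25 | 50 | 75 | 100;
  reviewers: string[];
  boosted: boolean; // cross-persona corroboration applied
}

// Confidence cell: the integer anchor as-is, never "0.75", "75%", or "75 (high)".
function confidenceCell(f: RenderedFinding): string {
  return String(f.confidence);
}

// Reviewer cell: note the one-anchor promotion when it fired.
function reviewerCell(f: RenderedFinding): string {
  const names = f.reviewers.join(", ");
  return f.boosted ? `${names} (+1 anchor)` : names;
}
```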
- [ ] **Unit 5: Update persona files**
**Goal:** Remove per-persona references to specific float confidence values; ensure each persona's domain instructions work with the shared rubric.
**Requirements:** R3
**Dependencies:** Unit 2
**Files:**
- Modify: `plugins/compound-engineering/agents/document-review/ce-coherence-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/document-review/ce-adversarial-document-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/document-review/ce-design-lens-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/document-review/ce-feasibility-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/document-review/ce-product-lens-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/document-review/ce-scope-guardian-reviewer.agent.md`
- Modify: `plugins/compound-engineering/agents/document-review/ce-security-lens-reviewer.agent.md`
**Approach:**
- Grep each persona file for `confidence` and float values. Replace any specific numeric references (e.g., coherence's `confidence: 0.85+`) with anchor-based equivalents (`anchor 100 when ... ; otherwise anchor 75`)
- If a persona's domain naturally caps at anchor 75 (e.g., adversarial critiques of premises), add one sentence acknowledging this in the persona's domain rubric so it doesn't over-reach for 100. Do not add a per-persona floor override — the shared >= 50 threshold handles all personas
- Verify each persona's suppress-conditions section still makes sense under anchor vocabulary; rewrite any float-referencing lines
**Patterns to follow:**
- The shared subagent template's rubric, included by every persona. Any persona-specific guidance should defer to the shared rubric and only add calibration hints specific to that persona's domain
**Test scenarios:**
- Test expectation: none per-persona — behavior tested via the contract tests in Unit 6
**Verification:**
- No float confidence values remain in any persona file
- Each persona's prompt reads coherently with the new rubric
- [ ] **Unit 6: Update tests and fixtures**
**Goal:** Update all test fixtures and contract assertions to use anchor values; add a migration-correctness test that rejects float confidence.
**Requirements:** R8
**Dependencies:** Unit 1, Unit 3
**Files:**
- Modify: `tests/pipeline-review-contract.test.ts`
- Modify: `tests/review-skill-contract.test.ts`
- Modify: `tests/fixtures/ce-doc-review/seeded-plan.md`
- Modify: `tests/fixtures/ce-doc-review/seeded-auth-plan.md`
- Test: new contract case in `tests/pipeline-review-contract.test.ts` asserting float confidence is rejected
**Approach:**
- Grep every test and fixture file for `confidence` float values. Convert each per-fixture based on what the fixture is demonstrating:
- Fixtures showing strong findings -> `confidence: 100` or `75`
- Fixtures showing low-confidence findings -> `confidence: 25` or `50`
- Fixtures showing FYI-band findings -> `confidence: 50`
- Update contract assertions that reference threshold values (0.40, 0.60, 0.65) to anchor equivalents (50, 75, 100)
- Add a new contract case: construct a finding with `confidence: 0.72` and assert the schema validator rejects it (sketched after this unit)
**Patterns to follow:**
- Existing test patterns in `tests/pipeline-review-contract.test.ts` for fixture loading and schema validation
**Test scenarios:**
- Happy path: All existing fixtures validate against the new schema after conversion
- Error path: A synthesized finding with `confidence: 0.72` fails validation
- Edge case: A fixture converted from `confidence: 0.65` (previously above-gate for P2) to `confidence: 75` still surfaces in the same tier post-migration (migration does not drop borderline findings)
**Verification:**
- `bun test` passes with 0 failures
- Total test count matches or exceeds pre-migration count (new rejection-test added)
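A sketch of the new rejection contract case, assuming `bun:test` and an Ajv-style validator (the repo's actual validation helper and the finding's required fields may differ):

```typescript
import { describe, expect, it } from "bun:test";
import Ajv from "ajv";

// Mirrors the Unit 1 fragment; the real test should load findings-schema.json
// and build a complete finding fixture with all required fields.
const confidenceFragment = { type: "integer", enum: [0, 25, 50, 75, 100] };

describe("confidence anchor contract", () => {
  const validate = new Ajv().compile(confidenceFragment);

  it("accepts a discrete anchor", () => {
    expect(validate(75)).toBe(true);
  });

  it("rejects continuous floats from the old scale", () => {
    expect(validate(0.72)).toBe(false);
  });

  it("rejects non-anchor integers", () => {
    expect(validate(10)).toBe(false);
  });
});
```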
- [ ] **Unit 7: Document the migration and the threshold divergence**
**Goal:** Write a `docs/solutions/` entry so future contributors understand why doc review uses a different threshold from Anthropic's code-review reference.
**Requirements:** R1-R9 (documents the whole migration)
**Dependencies:** Units 1-6 complete
**Files:**
- Create: `docs/solutions/skill-design/confidence-anchored-scoring.md`
**Approach:**
- Frontmatter: `module: ce-doc-review`, `problem_type: design_pattern`, `tags: [scoring, calibration, personas]`
- Body sections:
- Problem: continuous confidence invites false precision; LLMs cluster on round numbers
- Reference pattern: Anthropic's 5-anchor rubric
- Doc-review-specific divergence: threshold >= 50 vs Anthropic's >= 80, with the economics argument (no linter backstop, premise challenges resist verification, routing menu makes dismissal cheap)
- When to port this pattern: other persona-based review skills with similar economics
- When NOT to port directly: ce-code-review has linter-backstop economics and should tune threshold higher
**Patterns to follow:**
- Existing entries under `docs/solutions/skill-design/` for frontmatter shape and section structure
**Test scenarios:**
- Test expectation: none — documentation file with no executable behavior
**Verification:**
- File validates via whatever existing tooling checks `docs/solutions/` frontmatter (if any)
- A reader unfamiliar with this migration can read the entry and understand both the mechanic and the threshold-tuning rationale
## System-Wide Impact
- **Interaction graph:** The `confidence` field is read by every synthesis step (3.2, 3.3, 3.3b, 3.4, 3.5b, 3.6, 3.7, 3.8), every rendering surface (template, walkthrough, open-questions-defer, bulk-preview, headless envelope), and every persona's output contract. A missed update in any of these leaves a format mismatch that will surface as a validation or rendering bug.
- **Error propagation:** If the schema change lands before the persona prompts update, persona outputs will fail validation and the pipeline will drop all findings. Unit sequencing (Unit 1 before Unit 2 before Unit 5) is load-bearing for this reason.
- **State lifecycle risks:** The multi-round decision primer (R29 suppression, R30 fix-landed) stores prior-round findings in memory. Prior-round findings serialized with float confidence will not match current-round anchor confidence in fingerprint comparisons. Implementation should check whether the primer carries confidence in its fingerprint; if it does, add a one-time migration or a tolerance in the matcher (a sketch follows this list).
- **API surface parity:** ce-code-review has the same field shape and the same kind of synthesis pipeline. It is intentionally NOT updated in this PR (Scope Boundaries). When ce-code-review's migration eventually runs, it can reuse the rubric structure but will need a higher threshold.
- **Integration coverage:** End-to-end test invoking the full ce-doc-review flow against a seeded plan is the only way to verify all the surfaces stay in sync. Unit 6's contract tests should include one such end-to-end case.
- **Unchanged invariants:** Severity taxonomy, finding_type enum, autofix_class enum, rendering structure (sections, coverage table, routing menu), multi-round decision primer shape, chain-linking logic (3.5c), strawman rule. This change is strictly about the confidence dimension; other dimensions remain stable.
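If the primer's fingerprint does carry confidence, a minimal tolerance sketch (hypothetical names; whether the fingerprint includes confidence at all is checked during implementation):

```typescript
// Treat a legacy float fingerprint and its nearest anchor as equivalent so
// prior-round findings still match current-round findings (one-time tolerance).
const ANCHORS = [0, 25, 50, 75, 100];

function toAnchor(confidence: number): number {
  const scaled = confidence <= 1 ? confidence * 100 : confidence; // legacy 0.0-1.0 floats
  return ANCHORS.reduce((best, a) =>
    Math.abs(a - scaled) < Math.abs(best - scaled) ? a : best
  );
}

function confidenceMatches(prior: number, current: number): boolean {
  return toAnchor(prior) === toAnchor(current);
}
```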
## Risks & Dependencies
| Risk | Mitigation |
|------|------------|
| Personas over-cluster on anchor 75 (new version of gaming) | Rubric criteria for 75 vs 100 must be behaviorally distinct: 75 = "double-checked, will hit in practice"; 100 = "evidence directly confirms, will happen frequently". If clustering still occurs post-migration, consider the neutral-scorer follow-up (deferred scope) |
| Tests and fixtures update incompletely, leaving hidden float references | Unit 6 includes a grep-all-fixtures audit step; the new rejection test catches any fixture that slips through |
| Anchor routing rule in synthesis contradicts rendering rule, causing tier/display drift | Unit 3 and Unit 4 share a test case (end-to-end fixture through pipeline) that catches this. Single-source-of-truth routing table in synthesis-and-presentation.md is the canonical reference; rendering reads from it, not reinvents it |
| `confidence: 0` findings surface in user output by mistake (they should drop silently) | Synthesis 3.2 explicitly drops anchor 0 and anchor 25. Contract test in Unit 6 asserts neither surfaces in any output bucket |
| Doc review threshold >= 50 proves too permissive in practice (too many noisy findings surface) | The threshold is easy to tune post-migration (change one rule in synthesis 3.2). Documented in the solution entry (Unit 7) so future contributors know where to adjust |
| Persona prompt changes degrade finding quality | Unit 5 preserves persona-specific domain logic; only confidence-related language changes. Run the reference plan through the migrated flow as a smoke test (Unit 6 end-to-end case) |
## Documentation / Operational Notes
- This is a breaking change for the ce-doc-review schema. Any external consumer of the findings JSON (there are none currently — the schema is internal) would need to update. No external-consumer impact expected.
- No rollout flag needed — the migration is atomic across the skill. Before-and-after review of the same document produces comparable output; the anchor scores replace float scores uniformly.
- The `docs/solutions/skill-design/confidence-anchored-scoring.md` entry (Unit 7) is the canonical explanation for why doc review diverges from Anthropic's code-review threshold. Link to it from the PR description.
## Sources & References
- Anthropic reference rubric: `anthropics/claude-plugins-official/plugins/code-review/commands/code-review.md`
- Current schema: `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json`
- Current synthesis pipeline: `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md`
- Related prior session work: 2026-04-21 review of a ce-doc-review output that surfaced the fine-grained-score gaming problem, leading to this plan

View File

@@ -0,0 +1,127 @@
---
title: "ce-doc-review confidence scoring: anchored rubric over continuous floats"
date: 2026-04-21
category: skill-design
module: compound-engineering / ce-doc-review
problem_type: design_pattern
component: tooling
severity: medium
tags:
- ce-doc-review
- scoring
- calibration
- personas
- persona-rubric
---
# ce-doc-review confidence scoring: anchored rubric over continuous floats
## Problem
Persona-based document review originally used a continuous `confidence` field (0.0 to 1.0) that synthesis compared against per-severity numeric gates (0.50 / 0.60 / 0.65 / 0.75) and a 0.40 FYI floor. In practice the continuous scale invited false precision: personas clustered on round values (0.60, 0.65, 0.72, 0.80, 0.85), and gate boundaries created coin-flip bands where trivial score shifts moved findings in and out of the actionable tier. The personas were not genuinely differentiating 0.65 from 0.72; the model cannot calibrate self-reported confidence at that granularity.
Symptoms surfaced in review output:
- Single personas filing 3+ findings all rated 0.68-0.72, all variants of the same root premise
- Findings at 0.65 admitted into the actionable tier on noise, not signal
- Residual concerns and deferred questions near-duplicated findings already surfaced, indicating the persona's own ordering did not distinguish "raise this" from "note this"
## Reference pattern: Anthropic's anchored rubric
Anthropic's official code-review plugin (`anthropics/claude-plugins-official/plugins/code-review/commands/code-review.md`) solves the calibration problem with 5 discrete anchors (`0`, `25`, `50`, `75`, `100`) each tied to a behavioral criterion the model can honestly self-apply:
- `0` — false positive or pre-existing issue
- `25` — might be real but couldn't verify; stylistic-not-in-CLAUDE.md
- `50` — verified real but nitpick / not very important
- `75` — double-checked, will hit in practice, directly impacts functionality
- `100` — confirmed, evidence directly confirms, will happen frequently
The rubric is passed verbatim to a separate scoring agent. Filter threshold: `>= 80`.
## Solution adopted for ce-doc-review
Port the structural techniques — anchored rubric, verbatim persona-facing text, explicit false-positive catalog — and tune the filter threshold for document-review economics. The doc-review threshold is `>= 50`, not Anthropic's `>= 80`.
### Anchor-to-route mapping
| Anchor | Route |
|--------|-------|
| `0`, `25` | Dropped silently (counted in Coverage only) |
| `50` | FYI subsection (surface-only, no forced decision) |
| `75`, `100` | Actionable tier, classified by `autofix_class` |
Cross-persona corroboration promotes one anchor step (`50 → 75`, `75 → 100`, `100 → 100`). This replaces the prior `+0.10` numeric boost.
Within-severity sort: anchor descending, then document order as the deterministic final tiebreak.
### Files
- `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json` — `confidence` is an integer enum `[0, 25, 50, 75, 100]` with behavioral definitions embedded in the `description` field
- `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` — the rubric section personas see verbatim, plus the consolidated false-positive catalog
- `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` — anchor-based gate in 3.2, anchor-step promotion in 3.4, anchor-sorted ordering in 3.8, anchor+autofix routing in 3.7
- `plugins/compound-engineering/agents/document-review/*.agent.md` — each of the 7 personas carries a persona-specific calibration section that maps domain criteria to the shared anchors
- `tests/pipeline-review-contract.test.ts` — contract tests that assert the schema enforces discrete anchors and the template embeds the rubric
## Why the threshold diverges from Anthropic
Code review and document review have different economics. Anthropic's `>= 80` filter is load-bearing for code review because of three constraints that do not apply to doc review:
1. **Code review has a linter backstop.** CI runs linters, typecheckers, and tests. The LLM reviewer is a second layer on top of automated tooling, and a second layer only adds value by being *more selective*. If automation already catches the 50-75 tier, the LLM surfacing it again is noise.
2. **Code review is high-frequency and publicly visible.** Every surfaced finding becomes a PR comment. A reviewer who cries wolf 5 times gets muted. Precision dominates recall.
3. **Code claims are ground-truth verifiable.** "The code does X" can be proven or refuted by reading it. A 75 in code review often means "I couldn't verify" — which means waiting for someone who can.
Document review inverts all three:
1. **Doc review IS the backstop.** There is no linter that catches a plan's premise gaps or scope drift. A missed finding in the plan derails implementation weeks later.
2. **Doc review is low-frequency and private.** One review per plan, not per PR. Surfaced findings are dismissed with a keystroke via the routing menu; they are not public commentary.
3. **Premise claims have a natural confidence ceiling.** "Is the motivation valid?" and "does this scope match the goal?" cannot be verified against ground truth. Personas working in strategy, premise, and adversarial domains (product-lens, adversarial) legitimately cap at anchors 50-75 because full verification is not possible from document text alone. A `>= 80` filter would silence those personas.
Filter at `>= 50` for doc review; let the routing menu handle volume. Dismissing a surfaced finding is cheap; missing a real concern is expensive.
## When to port this pattern
- Other persona-based review skills with similar economics (no linter backstop, one-shot consumption, dismissal cheap via routing). Default threshold for such skills: `>= 50`.
- Any scoring workflow where the model is asked to self-report confidence on a continuous scale and clustering on round numbers is observed.
## When NOT to port directly
- Code review workflows (e.g., `ce-code-review`) have linter backstops and public-comment costs. Port the rubric structure, but tune the threshold higher (`>= 75` or `>= 80` per Anthropic). This is out of scope for the ce-doc-review migration; evaluate separately.
- High-throughput pipelines where the `25` anchor ("couldn't verify") represents most findings. Dropping everything below `50` may be too aggressive; consider surfacing `25` as "needs human triage" instead.
## Migration history
Landed in a single atomic change because the schema, template, synthesis, rendering, personas, and tests are coupled — a partial migration would have failed validation at every boundary. The schema change is the load-bearing commit; the persona updates and test updates consume it.
## Evaluation
After the migration, an A/B evaluation compared baseline (continuous float) against treatment (anchored integer rubric) across four documents spanning size and type: a 7KB in-repo plan, a 63KB in-repo plan, a 27KB external-repo plan, and a 10KB in-repo brainstorm. Both versions were executed by orchestrator subagents reading their matching skill snapshot as prompt material, dispatching all 7 personas, and emitting the Phase 4 headless envelope. The workspace, per-run envelopes, and timing data live under `.context/compound-engineering/ce-doc-review-eval/` during the evaluation.
### Confirmed effects
- **Score dispersion collapsed.** Baseline produced 7-12 distinct float values per document (typical: 0.45, 0.50, 0.55, 0.65, 0.72, 0.80, 0.85) — the exact false-precision clustering the migration targeted. Treatment concentrated on 2-3 anchors per document. Anchors `0` and `25` were never emitted by any persona, which matches the template's "suppress silently" instruction for those tiers.
- **Cross-persona +1 anchor promotion fires as specified.** Observed on cli-printing-press plan (security-lens + feasibility promoting an IP-range-check finding to anchor 100) and interactive-judgment plan (product-lens + adversarial promoting a premise finding to anchor 100).
- **Chain linking, safe_auto silent-apply, FYI routing, and per-persona redundancy collapse** all exercised correctly on at least one run.
- **The `>= 50` threshold is load-bearing on large plans.** On cli-printing-press, baseline's graduated per-severity gates admitted 13 Decisions; treatment admitted 21. Inspection of the delta confirmed the new findings were genuine concerns the old gates' coin-flip behavior at boundaries was suppressing — not noise. The migration doc's prediction that "missing a real concern is expensive" held in practice.
### Anchor-75 calibration boundary discovered
The evaluation surfaced a boundary issue: on large plans, personas emitted anchor 75 for premise-strength concerns ("motivation is thin," "premise is unconvincing") whose "will be hit in practice" claim was the reviewer's reading, not a concrete downstream outcome. This inflated the actionable tier with strength-of-argument critique that was more appropriately observational.
The subagent template's anchor 75 bullet was refined with a calibration paragraph:
> **Anchor `75` requires naming a concrete downstream consequence someone will hit** — a wrong deploy order, an unimplementable step, a contract mismatch, missing evidence that blocks a decision. Strength-of-argument concerns ("motivation is thin," "premise is unconvincing," "a different reader might disagree") do not meet this bar on their own — they are advisory observations and land at anchor `50` unless they also name the specific downstream outcome the reader hits.
The test the template adds: *"will a competent implementer or reader concretely encounter this, or is this my opinion about the document's strength?"* The former is `75`; the latter is `50`.
Re-evaluation with the tightened criterion shifted cli-printing-press from 21 Decisions/4 FYI to 10 Decisions/23 FYI — premise-strength concerns moved to observational routing. The change was *not* a blanket suppression of premise findings: on interactive-judgment plan, the premise challenge survived the tightening and got cross-persona-promoted to anchor 100, because its concrete consequence was explicit ("8-unit redesign creates maintenance debt across three reference files if the premise is wrong"). The refinement distinguishes grounded premise challenges from hand-wavy framing critique — which is the exact precision the rubric was meant to have from the start.
### Limitations
- **Small corpus.** Four documents is enough to confirm macro patterns (clustering, severity inflation, feature coverage) but not to tune threshold values or anchor boundaries at finer granularity.
- **Harness drift between iterations.** Iteration-1 orchestrators dispatched parallel persona subagents; iteration-2 orchestrators executed personas inline (nested Agent tool unavailable in that session). This affected side metrics (proposed-fix count on cli-printing-press iteration-2 dropped 15 → 4, likely harness-driven rather than tweak-driven) but did not obscure the tweak's core effect, which was large-magnitude.
- **No absolute-calibration ground truth.** The evaluation measured the migration's stated failure modes disappearing. Whether an anchor-75 finding literally hits 75% of the time remains unmeasured; no labeled doc-review corpus exists.
## Deferred follow-ups
- Port the pattern to `ce-code-review` with a code-review-appropriate threshold
- Evaluate a neutral-scorer second pass (a cheap agent that re-scores findings independent of the producing persona) once the anchor rubric has been observed in practice

View File

@@ -72,10 +72,12 @@ Probe whether the document considered the obvious alternatives and whether the c
## Confidence calibration
- **HIGH (0.80+):** Can quote specific text from the document showing the gap, construct a concrete scenario or counterargument, and trace the consequence.
- **MODERATE (0.60-0.79):** The gap is likely but confirming it would require information not in the document (codebase details, user research, production data).
- **LOW (0.40-0.59) — Advisory:** A plausible-but-unlikely failure mode, or a concern worth surfacing without a strong supporting scenario. Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Adversarial's domain is premise and failure-mode challenges. Adversarial findings cap naturally at anchor `75` for most concerns because premise challenges inherently resist full verification — "is this assumption wrong?" usually cannot be proven true in advance. That is not a calibration problem; it is the nature of the work. Apply as:
- **`100` — Absolutely certain:** Can quote specific text showing the gap, construct a concrete scenario or counterargument with cited evidence, AND trace the consequence to observable impact. The rare case — use sparingly.
- **`75` — Highly confident:** The gap is likely to bite and you can describe the scenario concretely, but full confirmation would require information not in the document (codebase details, user research, production data). You double-checked and the concern is material. This is adversarial's normal working ceiling.
- **`50` — Advisory (routes to FYI):** A plausible-but-unlikely failure mode, or a concern worth surfacing without a strong supporting scenario. Still requires an evidence quote. Surfaces as observation without forcing a decision.
- **Suppress entirely:** Anything below anchor `50` — speculative "what if" with no supporting scenario. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.
## What you don't flag

View File

@@ -23,7 +23,7 @@ You are a technical editor reading for internal consistency. You don't evaluate
## Safe_auto patterns you own
Coherence is the primary persona for surfacing mechanically-fixable consistency issues. These patterns should land as `safe_auto` with `confidence: 0.85+` when the document supplies the authoritative signal:
Coherence is the primary persona for surfacing mechanically-fixable consistency issues. These patterns should land as `safe_auto` with `confidence: 100` when the document supplies the authoritative signal (the document text leaves no room for interpretation):
- **Header/body count mismatch.** Section header claims a count (e.g., "6 requirements") and the enumerated body list has a different count (5 items). The body is authoritative unless the document explicitly identifies a missing item. Fix: correct the header to match the list.
- **Cross-reference to a named section that does not exist.** Text says "see Unit 7" / "per Section 4.2" / "as described in the Rollout section" and that target is not defined anywhere in the document. Fix: delete the reference or fix it to point at an existing target.
@@ -39,10 +39,12 @@ When in doubt, surface the finding as `safe_auto` with `why_it_matters` that nam
## Confidence calibration
- **HIGH (0.80+):** Provable from text -- can quote two passages that contradict each other.
- **MODERATE (0.60-0.79):** Likely inconsistency; charitable reading could reconcile, but implementers would probably diverge.
- **LOW (0.40-0.59) — Advisory:** Minor asymmetry or drift with no downstream consequence (e.g., parallel names that don't need to match, phrasing that's inconsistent but unambiguous). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress entirely.
Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Coherence's domain typically hits the strongest anchors because inconsistencies are verifiable from document text alone. Apply as:
- **`100` — Absolutely certain:** Provable from text — can quote two passages that contradict each other. Document text leaves no room for interpretation.
- **`75` — Highly confident:** Likely inconsistency; a charitable reading could reconcile, but implementers would probably diverge. You double-checked and the issue will be hit in practice.
- **`50` — Advisory (routes to FYI):** Minor asymmetry or drift with no downstream consequence (parallel names that don't need to match, phrasing that's inconsistent but unambiguous). Still requires an evidence quote. Surfaces as observation without forcing a decision.
- **Suppress entirely:** Anything below anchor `50` — cannot verify, speculative, or stylistic drift without impact. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.
## What you don't flag

View File

@@ -34,10 +34,12 @@ Explain what's missing: the functional design thinking that makes the interface
## Confidence calibration
- **HIGH (0.80+):** Missing states/flows that will clearly cause UX problems during implementation.
- **MODERATE (0.60-0.79):** Gap exists but a skilled designer could resolve from context.
- **LOW (0.40-0.59) — Advisory:** Pattern or micro-layout preference without strong usability evidence (e.g., button placement alternatives, visual hierarchy micro-choices). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Design-lens's domain grounds in named interaction states and user flows. Apply as:
- **`100` — Absolutely certain:** Missing states or flows that will clearly cause UX problems during implementation. Evidence directly confirms the gap — the document names an interaction without the corresponding state or transition.
- **`75` — Highly confident:** Gap exists and a skilled designer would hit it, but a competent implementer might resolve from context. You double-checked and the issue will surface in practice.
- **`50` — Advisory (routes to FYI):** Pattern or micro-layout preference without strong usability evidence (button placement alternatives, visual hierarchy micro-choices). Still requires an evidence quote. Surfaces as observation without forcing a decision.
- **Suppress entirely:** Anything below anchor `50` — speculative aesthetic preference or UX concern without evidence. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.
## What you don't flag

View File

@@ -27,10 +27,12 @@ Apply each check only when relevant. Silence is only a finding when the gap woul
## Confidence calibration
- **HIGH (0.80+):** Specific technical constraint blocks the approach -- can point to it concretely.
- **MODERATE (0.60-0.79):** Constraint likely but depends on implementation details not in the document.
- **LOW (0.40-0.59) — Advisory:** Theoretical constraint with no current-scale evidence (e.g., "could be slow if data grows 10x", speculative scalability concerns with no baseline number). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress entirely.
Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Feasibility's domain grounds in codebase evidence, so it reaches the strongest anchors when you can cite concrete technical constraints. Apply as:
- **`100` — Absolutely certain:** Specific technical constraint blocks the approach and you can cite it concretely (codebase reference, framework behavior, platform limit). Evidence directly confirms.
- **`75` — Highly confident:** Constraint likely to bite, but confirming it would require implementation details not in the document. You double-checked and the issue will be hit in practice.
- **`50` — Advisory (routes to FYI):** A verified constraint that is genuinely minor at current scale — the implementer should know it exists but would not be surprised by it hitting in practice. Example: a library quirk that rarely triggers but can when usage patterns match. Still requires an evidence quote. Surfaces as observation without forcing a decision. Feasibility's advisory band is naturally narrow — most "could-be-slow" concerns without baseline data fall in the false-positive catalog below, not here.
- **Suppress entirely:** Anything below anchor `50`, plus any shape the false-positive catalog in `subagent-template.md` names. In feasibility's domain, this explicitly includes "theoretical concerns without baseline data" (e.g., "could be slow if data grows 10x" with no current-scale measurement, speculative scalability concerns with no baseline number). Those are non-findings that must NOT be routed to anchor `50`. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.
## What you don't flag

View File

@@ -58,10 +58,12 @@ If priority tiers exist: do assignments match stated goals? Are must-haves truly
## Confidence calibration
- **HIGH (0.80+):** Can quote both the goal and the conflicting work -- disconnect is clear.
- **MODERATE (0.60-0.79):** Likely misalignment, depends on business context not in document.
- **LOW (0.40-0.59) — Advisory:** Observation about positioning, naming, or strategy without a concrete impact (subjective preference, speculative future-product concern with no current signal). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Product-lens's domain is premise and strategy — whether the document's goals, motivation, and priorities hold up. Premise critiques cap naturally at anchor `75` for most concerns because "is the motivation valid?" cannot be verified against ground truth; it requires business context the document may not supply. That is not a calibration problem; it is the nature of the work. Apply as:
- **`100` — Absolutely certain:** Can quote both the goal and the conflicting work — disconnect is clear. Evidence directly confirms the misalignment within the document itself. The rare case — use sparingly.
- **`75` — Highly confident:** Likely misalignment, full confirmation depends on business context not in the document. You double-checked and the concern will materially affect direction. This is product-lens's normal working ceiling.
- **`50` — Advisory (routes to FYI):** Observation about positioning, naming, or strategy without a concrete impact (subjective preference about framing with an evidence quote, minor identity-drift note where the drift has no downstream user consequence). Still requires an evidence quote. Surfaces as observation without forcing a decision.
- **Suppress entirely:** Anything below anchor `50`, plus any shape the false-positive catalog in `subagent-template.md` names. In product-lens's domain, this explicitly includes "speculative future-product concerns with no current signal" — those are non-findings that must NOT be routed to anchor `50`. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.
## What you don't flag

View File

@@ -41,10 +41,12 @@ With AI-assisted implementation, the cost gap between shortcuts and complete sol
## Confidence calibration
- **HIGH (0.80+):** Can quote goal statement and scope item showing the mismatch.
- **MODERATE (0.60-0.79):** Misalignment likely but depends on context not in document.
- **LOW (0.40-0.59) — Advisory:** Organizational preference without a concrete cost (unit ordering, section placement alternatives that read equally well, "this could also be split" observations without real impact). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Scope-guardian's domain grounds in the document's own stated goals and declared scope. Apply as:
- **`100` — Absolutely certain:** Can quote both the goal statement and the scope item showing the mismatch. Evidence directly confirms the misalignment.
- **`75` — Highly confident:** Misalignment likely to derail the work, but fully confirming it would require context not in the document (strategic priorities, prior decisions). You double-checked and the issue will hit implementers.
- **`50` — Advisory (routes to FYI):** Organizational preference without a concrete cost (unit ordering, section placement alternatives that read equally well, "this could also be split" observations without real impact). Still requires an evidence quote. Surfaces as observation without forcing a decision.
- **Suppress entirely:** Anything below anchor `50` — speculative concern or stylistic preference. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.
## What you don't flag

View File

@@ -25,10 +25,12 @@ Skip areas not relevant to the document's scope.
## Confidence calibration
- **HIGH (0.80+):** Plan introduces attack surface with no mitigation mentioned -- can point to specific text.
- **MODERATE (0.60-0.79):** Concern likely but plan may address implicitly or in a later phase.
- **LOW (0.40-0.59) — Advisory:** Theoretical attack surface with no realistic exploit path under current design (e.g., speculative timing-attack on non-sensitive data, defense-in-depth nice-to-have with no current vector). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
- **Below 0.40:** Suppress.
Use the shared anchored rubric (see `subagent-template.md` — Confidence rubric). Security-lens's domain grounds in named attack surfaces and missing mitigations. Apply as:
- **`100` — Absolutely certain:** Plan introduces attack surface with no mitigation mentioned — can point to specific text. Evidence directly confirms the gap; the exploit path is concrete.
- **`75` — Highly confident:** Concern is likely exploitable, but the plan may address it implicitly or in a later phase not yet specified. You double-checked and the vector is material.
- **`50` — Advisory (routes to FYI):** A verified gap that would make the design more robust but is not required by the threat model the plan commits to — for example, a defense-in-depth addition on a path that already has a primary mitigation, or a logging gap that would help incident response without preventing the incident. Still requires an evidence quote. Surfaces as observation without forcing a decision.
- **Suppress entirely:** Anything below anchor `50`, plus any shape the false-positive catalog in `subagent-template.md` names. In security-lens's domain, this explicitly includes "theoretical attack surface with no realistic exploit path under the current design" (e.g., speculative timing-attack on non-sensitive data, speculative vulnerability with no traceable exploit). Those are non-findings that must NOT be routed to anchor `50`. Do not emit; anchors `0` and `25` exist in the enum only so synthesis can track drops.
## What you don't flag

View File

@@ -177,7 +177,7 @@ Cross-session persistence is out of scope. A new invocation of ce-doc-review on
## Phases 3-5: Synthesis, Presentation, and Next Action
After all dispatched agents return, read `references/synthesis-and-presentation.md` for the synthesis pipeline (validate, per-severity gate, dedup, cross-persona agreement boost, resolve contradictions, auto-promotion, route by three tiers with FYI subsection), `safe_auto` fix application, headless-envelope output, and the handoff to the routing question.
After all dispatched agents return, read `references/synthesis-and-presentation.md` for the synthesis pipeline (validate, anchor-based gate, dedup, cross-persona agreement promotion, resolve contradictions, auto-promotion, route by three tiers with FYI subsection), `safe_auto` fix application, headless-envelope output, and the handoff to the routing question.
For the four-option routing question and per-finding walk-through (interactive mode), read `references/walkthrough.md`. For the bulk-action preview used by LFG, Append-to-Open-Questions, and walk-through `LFG-the-rest`, read `references/bulk-preview.md`. Do not load these files before agent dispatch completes.

View File

@@ -10,8 +10,8 @@ Interactive mode only.
Three call sites:
1. **Routing option B (top-level LFG)** — after the user picks `LFG. Apply the agent's best-judgment action per finding` from the routing question, but before any action executes. Scope: every pending `gated_auto` / above-gate `manual` finding.
2. **Routing option C (top-level Append-to-Open-Questions)** — after the user picks `Append findings to the doc's Open Questions section and proceed` but before any append runs. Scope: every pending `gated_auto` / above-gate `manual` finding. Every finding appears under `Appending to Open Questions (N):` regardless of the agent's natural recommendation, because option C is batch-defer.
1. **Routing option B (top-level LFG)** — after the user picks `LFG. Apply the agent's best-judgment action per finding` from the routing question, but before any action executes. Scope: every pending `gated_auto` or `manual` finding at confidence anchor `75` or `100`.
2. **Routing option C (top-level Append-to-Open-Questions)** — after the user picks `Append findings to the doc's Open Questions section and proceed` but before any append runs. Scope: every pending `gated_auto` or `manual` finding at confidence anchor `75` or `100`. Every finding appears under `Appending to Open Questions (N):` regardless of the agent's natural recommendation, because option C is batch-defer.
3. **Walk-through `LFG the rest`** — after the user picks `LFG the rest — apply the agent's best judgment to this and remaining findings` from a per-finding question, but before the remaining findings are resolved. Scope: the current finding and everything not yet decided. Already-decided findings from the walk-through are not included in the preview.
In all three cases the user confirms with `Proceed` or backs out with `Cancel`. No per-item decisions inside the preview — per-item decisioning is the walk-through's role.

View File

@@ -58,10 +58,9 @@
"description": "Concrete fix text. Omit or null if no good fix is obvious -- a bad suggestion is worse than none."
},
"confidence": {
"type": "number",
"description": "Reviewer confidence in this finding, calibrated per persona",
"minimum": 0.0,
"maximum": 1.0
"type": "integer",
"enum": [0, 25, 50, 75, 100],
"description": "Anchored confidence score. Use exactly one of 0, 25, 50, 75, 100. Each anchor has a behavioral criterion the reviewer must honestly self-apply. 0: Not confident at all. This is a false positive that does not stand up to light scrutiny, or a pre-existing issue the document did not introduce. 25: Somewhat confident. Might be a real issue but could also be a false positive; the reviewer was not able to verify. 50: Moderately confident. The reviewer verified this is a real issue but it may be a nitpick or not meaningfully affect plan correctness. Relative to the rest of the document, it is not very important. Advisory observations (the honest answer to 'what breaks if we do not fix this?' is 'nothing breaks, but...') land here. 75: Highly confident. The reviewer double-checked and verified the issue will be hit in practice by implementers or readers of this document. The existing approach is insufficient. The issue is important and will directly impact plan correctness, implementer understanding, or downstream execution. 100: Absolutely certain. The reviewer double-checked and confirmed the issue. The evidence directly confirms it will happen frequently in practice. The document text, codebase, or cross-references leave no room for interpretation."
},
"evidence": {
"type": "array",

View File

@@ -42,8 +42,8 @@ Fields come from the finding's schema:
- `{title}` — the finding's title field
- `{section}` — the finding's section field, unmodified (human-readable)
- `{severity}` — P0 / P1 / P2 / P3
- `{reviewer}` — the persona that produced the finding (after dedup, the persona with the highest confidence; surface all co-flagging personas if multiple)
- `{confidence}` — rounded to 2 decimal places
- `{reviewer}` — the persona that produced the finding (after dedup, the persona with the highest confidence anchor; surface all co-flagging personas if multiple)
- `{confidence}` — the integer anchor (`50`, `75`, or `100`), emitted without a decimal point or percent sign
- `{why_it_matters}` — the full why_it_matters text, preserving the framing guidance from the subagent template
HTML-comment fields (machine-readable, used by Step 4 dedup):
@@ -133,7 +133,7 @@ Starting document state:
### From 2026-04-10 review
- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 0.87)
- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 75)
The alias exists without documented external consumers...
@@ -152,7 +152,7 @@ After appending two findings in a 2026-04-18 session:
### From 2026-04-10 review
- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 0.87)
- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 75)
The alias exists without documented external consumers...
@@ -160,14 +160,14 @@ After appending two findings in a 2026-04-18 session:
### From 2026-04-18 review
- **Unit 2/3 merge judgment call** — Scope Boundaries (P2, scope-guardian, confidence 0.78)
- **Unit 2/3 merge judgment call** — Scope Boundaries (P2, scope-guardian, confidence 75)
The two units update consumer sites that deploy together. Splitting
adds dependency tracking without enabling independent delivery.
<!-- dedup-key: section="scope boundaries" title="unit 23 merge judgment call" evidence="the two units update consumer sites that deploy together" -->
- **Strawman alternatives on migration strategy** — Unit 3 Files (P2, coherence, confidence 0.72)
- **Strawman alternatives on migration strategy** — Unit 3 Files (P2, coherence, confidence 75)
The fix options list (a) through (c) as alternatives, but (b) and (c)
are "accept the regression" framings that don't solve the problem the

View File

@@ -8,6 +8,8 @@ This template describes the Phase 4 interactive presentation — what the user s
**Vocabulary note.** Internal enum values (`safe_auto`, `gated_auto`, `manual`, `FYI`) live in the schema and synthesis pipeline. User-facing rendered text uses plain-language labels instead: fixes (for `safe_auto`), proposed fixes (for `gated_auto`), decisions (for `manual`), and FYI observations (for `FYI`). The `Tier` column in the tables below is the one place that still names the internal enum so the user can see the synthesis decision; everything else reads as plain language.
**Confidence column.** The `Confidence` column shows the integer anchor value (`50`, `75`, or `100`) — never a decimal or percentage. Anchor `50` = advisory (routed to FYI); anchor `75` = verified, will hit in practice; anchor `100` = certain, evidence directly confirms. Anchors `0` and `25` are dropped by synthesis before this layer and never appear in the rendered output. Cross-persona agreement promotes by one anchor step; when this happens, the Reviewer column notes it (e.g., `coherence, feasibility (+1 anchor)`).
## Example
```markdown
@@ -35,7 +37,7 @@ Applied 5 fixes. 4 items need attention (2 errors, 2 omissions). 2 FYI observati
| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 1 | Requirements Trace | Goal states "offline support" but technical approach assumes persistent connectivity | coherence | 0.92 | manual |
| 1 | Requirements Trace | Goal states "offline support" but technical approach assumes persistent connectivity | coherence | 100 | manual |
### P1 — Should Fix
@@ -43,13 +45,13 @@ Applied 5 fixes. 4 items need attention (2 errors, 2 omissions). 2 FYI observati
| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 2 | Scope Boundaries | 8 of 12 units build admin infrastructure; only 2 touch stated goal | scope-guardian | 0.80 | manual |
| 2 | Scope Boundaries | 8 of 12 units build admin infrastructure; only 2 touch stated goal | scope-guardian | 75 | manual |
#### Omissions
| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 3 | Implementation Unit 3 | Plan proposes custom auth but does not mention existing Devise setup or migration path | feasibility | 0.85 | gated_auto |
| 3 | Implementation Unit 3 | Plan proposes custom auth but does not mention existing Devise setup or migration path | feasibility | 100 | gated_auto |
### P2 — Consider Fixing
@@ -57,7 +59,7 @@ Applied 5 fixes. 4 items need attention (2 errors, 2 omissions). 2 FYI observati
| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 4 | API Design | Public webhook endpoint has no rate limiting mentioned | security-lens | 0.75 | gated_auto |
| 4 | API Design | Public webhook endpoint has no rate limiting mentioned | security-lens | 75 | gated_auto |
### FYI Observations
@@ -65,12 +67,12 @@ Low-confidence observations surfaced without requiring a decision. Content advis
| # | Section | Observation | Reviewer | Confidence |
|---|---------|-------------|----------|------------|
| 1 | Naming | Filename `plan.md` is asymmetric with command name `user-auth`; could go either way | coherence | 0.52 |
| 2 | Risk Analysis | Rollout-cadence decision may benefit from monitoring thresholds, though not blocking | scope-guardian | 0.48 |
| 1 | Naming | Filename `plan.md` is asymmetric with command name `user-auth`; could go either way | coherence | 50 |
| 2 | Risk Analysis | Rollout-cadence decision may benefit from monitoring thresholds, though not blocking | scope-guardian | 50 |
### Residual Concerns
Residual concerns are issues the reviewers noticed but could not confirm with above-gate confidence. These are not actionable; they appear here for transparency only and are not promoted into the review surface.
Residual concerns are issues the reviewers noticed but could not confirm at confidence anchor `50` or higher. These are not actionable; they appear here for transparency only and are not promoted into the review surface.
| # | Concern | Source |
|---|---------|--------|
@@ -93,7 +95,9 @@ Residual concerns are issues the reviewers noticed but could not confirm with ab
| product-lens | not activated | -- | -- | -- | -- | -- | -- |
| design-lens | not activated | -- | -- | -- | -- | -- | -- |
Dropped: 3 (anchors 0/25 suppressed)
Chains: 1 root with 2 dependents
Restated: 2 (residual/deferred items suppressed as duplicates of actionable findings)
```
## Section Rules
@@ -101,10 +105,12 @@ Chains: 1 root with 2 dependents
- **Summary line**: Always present after the reviewer list. Format: "Applied N fixes. K items need attention (X errors, Y omissions). Z FYI observations." Omit any zero clause except the FYI clause when zero (it's informative that none surfaced).
- **Applied fixes**: List all fixes that were applied automatically (`safe_auto` tier). Include enough detail per fix to convey the substance — especially for fixes that add content or touch document meaning. Omit section if none.
- **P0-P3 sections**: Only include sections that have actionable findings (`gated_auto` or `manual`). Omit empty severity levels. Within each severity, separate into **Errors** and **Omissions** sub-headers. Omit a sub-header if that severity has none of that type. The `Tier` column surfaces whether a finding is `gated_auto` (concrete fix exists, Apply recommended in walk-through) or `manual` (requires user judgment).
- **FYI Observations**: Low-confidence `manual` findings above the 0.40 FYI floor but below the per-severity gate. Surface here for transparency; these are not actionable and do not enter the walk-through. Omit section if none.
- **FYI Observations**: Findings at confidence anchor `50` regardless of `autofix_class`. Surface here for transparency; these are not actionable and do not enter the walk-through. Omit section if none.
- **Residual Concerns**: Concerns noted by personas that could not be anchored at confidence `50` or higher. Listed for transparency; not promoted into the review surface (cross-persona agreement promotion runs on findings that already survived the gate, per synthesis step 3.4). Omit section if none.
- **Deferred Questions**: Questions for later workflow stages. Omit if none.
- **Coverage**: Always include. All counts are **post-synthesis**. **Findings** must equal Auto + Proposed + Decisions + FYI exactly — if deduplication merged a finding across personas, attribute it to the persona with the highest confidence and reduce the other persona's count. **Residual** = count of `residual_risks` from this persona's raw output (not the promoted subset in the Residual Concerns section). The `Auto` column counts `safe_auto` findings, `Proposed` counts `gated_auto`, `Decisions` counts above-gate `manual`, and `FYI` counts below-gate `manual` findings at or above the 0.40 FYI floor.
- **Compact rendering for FYI / Residual / Deferred (high-count mode)**: When the combined count across these three sections is **5 or more**, collapse each section to a one-line summary followed by the items as a tight bullet list (no table, no per-item `Why` elaboration). Rationale: these sections are observational, not decision-forcing — when they are lengthy, they bury the actionable tiers above them. A P0/P1/P2 actionable finding stays fully rendered regardless of how many FYI/Residual/Deferred items exist. When the combined count is 4 or fewer, render each section as today.
- **Coverage**: Always include. All counts are **post-synthesis**. **Findings** must equal Auto + Proposed + Decisions + FYI exactly — if deduplication merged a finding across personas, attribute it to the persona with the highest confidence anchor and reduce the other persona's count. **Residual** = count of `residual_risks` from this persona's raw output (not the promoted subset in the Residual Concerns section). The `Auto` column counts `safe_auto` findings at anchor `100`, `Proposed` counts `gated_auto` findings at anchor `75` or `100`, `Decisions` counts `manual` findings at anchor `75` or `100`, and `FYI` counts findings at anchor `50` regardless of `autofix_class`. Findings at anchors `0` or `25` were dropped by synthesis and do not appear in any column. Do NOT invent additional columns (e.g., `Dropped`, `Surviving`). The column schema above is the canonical set.
- **Coverage footnote lines** (optional, appear below the table when non-zero): `Dropped: N (anchors 0/25 suppressed)` when synthesis 3.2 dropped any findings. `Chains: N root(s) with M dependents` when premise-dependency chains exist. `Restated: N (residual/deferred items suppressed as duplicates of actionable findings)` when synthesis 3.9 suppressed any restatements. These footnotes — not the summary line, not per-persona columns — are the canonical location for cross-cutting counts that don't fit the per-persona shape. Order: `Dropped:`, then `Chains:`, then `Restated:`, each on its own line. Omit any footnote whose count is zero.
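The Coverage column math above, sketched in TypeScript for one persona's post-3.7 findings (illustrative only; the `CountedFinding` shape is an assumption mirroring the schema fields, and the template, not this code, renders the row):

```typescript
// Coverage columns for one persona, computed over its post-synthesis finding set.
// Assumes 3.7 demotions already ran, so no safe_auto finding remains below anchor 100.
interface CountedFinding {
  confidence: 50 | 75 | 100;
  autofix_class: "safe_auto" | "gated_auto" | "manual";
}

function coverageColumns(findings: CountedFinding[]) {
  const auto = findings.filter((f) => f.autofix_class === "safe_auto" && f.confidence === 100).length;
  const proposed = findings.filter((f) => f.autofix_class === "gated_auto" && f.confidence >= 75).length;
  const decisions = findings.filter((f) => f.autofix_class === "manual" && f.confidence >= 75).length;
  const fyi = findings.filter((f) => f.confidence === 50).length;
  // Invariant: Findings must equal Auto + Proposed + Decisions + FYI exactly.
  return { findings: auto + proposed + decisions + fyi, auto, proposed, decisions, fyi };
}
```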
## Chain-Rendering Rules

View File

@@ -24,10 +24,24 @@ Return ONLY valid JSON matching the findings schema below. No prose, no markdown
- `finding_type`: one of `"error"`, `"omission"` — nothing else (no `"tension"`, `"concern"`, `"observation"`, etc.).
- `autofix_class`: one of `"safe_auto"`, `"gated_auto"`, `"manual"`.
- `evidence`: an ARRAY of strings with at least one element. A single string value is a validation failure — wrap every quote in `["..."]` even when there is only one.
- `confidence`: a number between 0.0 and 1.0 inclusive.
- `confidence`: one of exactly `0`, `25`, `50`, `75`, or `100` — a discrete anchor, NOT a continuous number. Any other value (e.g., `72`, `0.85`, `"high"`) is a validation failure. Pick the anchor whose behavioral criterion you can honestly self-apply to this finding (see "Confidence rubric" below).
If your persona description uses severity vocabulary like "high-priority" or "critical" in its rubric text, translate to the P0-P3 scale at emit time. "Critical / must-fix" → P0, "important / should-fix" → P1, "worth-noting / could-fix" → P2, "low-signal" → P3. Same for priorities described qualitatively in your analysis — map to P0-P3 on the way out.
**Confidence rubric — use these exact behavioral anchors.** Pick the single anchor whose criterion you can honestly self-apply. Do not pick a value between anchors; only `0`, `25`, `50`, `75`, and `100` are valid. The rubric is anchored on behavior you performed, not on a vague sense of certainty — if you cannot truthfully attach the behavioral claim to the finding, step down to the next anchor.
- **`0` — Not confident at all.** A false positive that does not stand up to light scrutiny, or a pre-existing issue the document did not introduce. **Do not emit — suppress silently.** This anchor exists in the enum only so synthesis can explicitly track the drop; personas never produce it.
- **`25` — Somewhat confident.** Might be a real issue but could also be a false positive; you were not able to verify. **Do not emit — suppress silently.** This anchor, like `0`, exists in the enum only so synthesis can track the drop; personas never produce it. If your domain is genuinely uncertain, either gather more evidence until you can honestly anchor the finding at `50` or higher, or suppress the concern entirely. (Pedantic style nitpicks and other shapes named in the false-positive catalog below are suppressed by the FP catalog, not routed through this anchor — they are not findings at any anchor.)
- **`50` — Moderately confident.** You verified this is a real issue but it may be a nitpick or not meaningfully affect plan correctness. Relative to the rest of the document, it is not very important. Advisory observations — where the honest answer to "what breaks if we do not fix this?" is "nothing breaks, but..." — land here. Surfaces in the FYI subsection.
- **`75` — Highly confident.** You double-checked and verified the issue will be hit in practice by implementers or readers of this document. The existing approach in the document is insufficient. The issue directly impacts plan correctness, implementer understanding, or downstream execution.
**Anchor `75` requires naming a concrete downstream consequence someone will hit** — a wrong deploy order, an unimplementable step, a contract mismatch, missing evidence that blocks a decision. Strength-of-argument concerns ("motivation is thin," "premise is unconvincing," "a different reader might disagree") do not meet this bar on their own — they are advisory observations and land at anchor `50` unless they also name the specific downstream outcome the reader hits. When in doubt between `50` and `75`, ask: "will a competent implementer or reader concretely encounter this, or is this my opinion about the document's strength?" The former is `75`; the latter is `50`.
- **`100` — Absolutely certain.** You double-checked and confirmed the issue. The evidence directly confirms it will happen frequently in practice. The document text, codebase, or cross-references leave no room for interpretation.
Anchor and severity are independent axes. A P2 finding can be anchor `100` if the evidence is airtight; a P0 finding can be anchor `50` if it is an important concern you could not fully verify. Anchor gates where the finding surfaces (drop / FYI / actionable); severity orders it within the actionable surface.
Synthesis drops anchors `0` and `25` silently; anchor `50` routes to the FYI subsection; anchors `75` and `100` enter the actionable tier (walk-through, proposed fixes, safe_auto when `autofix_class` also warrants).
Example of a schema-valid finding (all required fields, correct enum values, correct array shape):
```json
@@ -39,7 +53,7 @@ Example of a schema-valid finding (all required fields, correct enum values, cor
"finding_type": "omission",
"autofix_class": "gated_auto",
"suggested_fix": "Require Units 1-4 to land in a single atomic PR, or define the sequence explicitly.",
"confidence": 0.92,
"confidence": 100,
"evidence": [
"If the migration runs before Units 1-3 land, the code reads stale data.",
"If after, new code temporarily sees old entries until migration runs."
@@ -47,10 +61,12 @@ Example of a schema-valid finding (all required fields, correct enum values, cor
}
```
The `confidence: 100` in the example is justified because all three anchor-100 criteria hold: the reviewer double-checked (the plan literally names both orderings and resolves neither), the evidence directly confirms the outcome (quoted text shows each branch produces incorrect state), and the issue will happen frequently in practice (every deploy is subject to it).
Rules:
- You are a leaf reviewer inside an already-running compound-engineering review workflow. Do not invoke compound-engineering skills or agents unless this template explicitly instructs you to. Perform your analysis directly and return findings in the required output format only.
- Suppress any finding below your stated confidence floor (see your Confidence calibration section).
- Suppress any finding you cannot honestly anchor at `50` or higher (the emission floor is `50`; anchors `0` and `25` are suppressed by synthesis anyway, so emitting them only adds noise). If your persona's domain description sets a stricter floor (e.g., anchor `75` minimum), honor it.
- Every finding MUST include at least one evidence item — a direct quote from the document.
- You are operationally read-only. Analyze the document and produce findings. Do not edit the document, create files, or make changes. You may use non-mutating tools (file reads, glob, grep, git log) to gather context about the codebase when evaluating feasibility or existing patterns.
- **Exclude prior-round deferred entries from review scope.** If the document under review contains a `## Deferred / Open Questions` section or subsections such as `### From YYYY-MM-DD review`, ignore that content — it is review output from prior rounds, not part of the document's actual plan/requirements content. Do not flag entries inside it as new findings. Do not quote its text as evidence. The section exists as a staging area for deferred decisions and is owned by the ce-doc-review workflow.
@@ -88,7 +104,7 @@ The `why_it_matters` field is how the reader — a developer triaging findings,
- **Lead with observable consequence.** Describe what goes wrong from the reader's or implementer's perspective — what breaks, what gets misread, what decision gets made wrong, what the downstream audience experiences. Do not lead with document structure ("Section X on line Y says..."). Start with the effect ("Implementers will disagree on which tier applies when..."). Section references appear later, only when the reader needs them to locate the issue.
- **Explain why the fix resolves the problem.** If you include a `suggested_fix`, the `why_it_matters` should make clear why that specific fix addresses the root cause. When a similar pattern exists elsewhere in the document or codebase (a parallel section, an established convention, a cited code pattern), reference it so the recommendation is grounded in what the team has already chosen.
- **Keep it tight.** Approximately 2-4 sentences. Longer framings are a regression — downstream surfaces have narrow display budgets, and verbose content gets truncated or skimmed.
- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging (confidence at or above your persona's floor), you can explain it — the field exists because every finding needs a reason.
- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging at anchor `50` or higher, you can explain it — the field exists because every finding needs a reason.
Illustrative pair — same finding, weak vs. strong framing:
@@ -106,14 +122,21 @@ STRONG (observable consequence first, grounded fix reasoning):
Synthesis already lacks a route for it.
```
False-positive categories to actively suppress:
False-positive categories to actively suppress. Do NOT emit a finding when any of these apply — not even at anchor `25` or `50`. These are not edge cases you should route to FYI; they are non-findings.
- Pedantic style nitpicks (word choice, bullet vs. numbered lists, comma-vs-semicolon) — style belongs to the document author
- Issues that belong to other personas (see your Suppress conditions)
- Findings already resolved elsewhere in the document (search before flagging)
- Content inside `## Deferred / Open Questions` sections (prior-round review output, not document content)
- **Pedantic style nitpicks** (word choice, bullet vs. numbered lists, comma-vs-semicolon, em-dash vs en-dash) — style belongs to the document author
- **Issues that belong to other personas** (see your Suppress conditions at the top of your persona prompt) — surfacing another persona's territory inflates the Coverage table and forces synthesis to dedup work that should not exist
- **Findings already resolved elsewhere in the document** — search the document before flagging. If the concern is addressed in a later section, the earlier section's apparent omission is not a real finding
- **Content inside `## Deferred / Open Questions` sections** — prior-round review output, not document content. This is the ce-doc-review workflow's own staging area
- **Pre-existing issues the document did not introduce** — if the concern exists in the codebase or organizational context independent of this document's proposal, flagging it here is scope creep
- **Speculative future-work concerns with no current signal** — "what if requirements change" / "this might need rework later" are not findings unless the document itself introduces the risk
- **Theoretical concerns without baseline data** — scalability worries without current scale numbers, performance worries without current latency measurements, edge cases without evidence the edge is reachable
- **Changes in functionality that are likely intentional** — if the document is explicitly making a design choice different from a precedent you noticed, that is a decision, not an error. Flag only when the document appears unaware of the precedent
- **Issues that a linter, typechecker, or validator would catch** — spelling in identifiers, JSON syntax errors, YAML indentation. These surface automatically elsewhere; the review layer adds value by catching what tools cannot
**Advisory observations — route to FYI, do not force a decision.** If the honest answer to "what actually breaks if we don't fix this?" is "nothing breaks, but…", the finding is advisory. Ask: would a competent implementer hit a wrong outcome, a production bug, a misleading plan, or rework later? If no, set confidence in the **0.40-0.59 LOW/Advisory band** so synthesis routes the finding to FYI rather than surfacing it as a manual decision. Do not suppress — the observation still has value; it just does not warrant user judgment. Typical advisory shapes: naming asymmetry with no wrong answer, stylistic preference without evidence of impact, speculative future-work concern with no current signal, subjective readability note, theoretical scalability concern without baseline data, "could also be split" organizational preference when the current split is not broken.
**Advisory observations — route to FYI, do not force a decision.** If the honest answer to "what actually breaks if we don't fix this?" is "nothing breaks, but…", the finding is advisory. Ask: would a competent implementer hit a wrong outcome, a production bug, a misleading plan, or rework later? If no, set `confidence: 50` so synthesis routes the finding to the FYI subsection rather than surfacing it as a decision or proposed fix. Do not suppress — the observation still has value; it just does not warrant user judgment. Typical advisory shapes: naming asymmetry with no wrong answer, subjective readability note about non-stylistic content (e.g., a term used before the passage that defines it), "could also be split" organizational preference when the current split is not broken. Style belongs to the false-positive catalog above, not here — pedantic style nitpicks suppress entirely.
**Precedence over the false-positive catalog.** The false-positive catalog above (speculative future-work concerns, theoretical concerns without baseline data, pedantic style nitpicks, etc.) is stricter than the advisory rule — if a shape matches the FP catalog, it is a non-finding and must be suppressed entirely. Do NOT route it to anchor `50` / FYI. The advisory rule applies only to shapes that are NOT in the FP catalog.
</output-contract>
<review-context>

View File

@@ -12,23 +12,25 @@ Check each agent's returned JSON against the findings schema:
- Drop findings with invalid enum values (including the pre-rename `auto` / `present` values from older personas — treat those as malformed until all persona output has been regenerated)
- Note the agent name for any malformed output in the Coverage section
### 3.2 Confidence Gate (Per-Severity)
**Do not narrate remap / validation diagnostics to the user.** Schema-drift notes ("persona X returned unknown enum Y, remapped to Z"), persona-prompt-drift commentary, and other validator-internal diagnostics are maintainer-facing information. They do not belong in the Phase 4 output the user reads. If a persona's output is malformed, the only user-visible consequence is a Coverage-row annotation (e.g., the persona shows fewer findings or a `malformed` marker). Everything else stays internal.
Gate findings using per-severity thresholds rather than a single flat floor:
### 3.2 Confidence Gate (Anchor-Based)
| Severity | Gate |
|----------|------|
| P0 | 0.50 |
| P1 | 0.60 |
| P2 | 0.65 |
| P3 | 0.75 |
Gate findings by their `confidence` anchor value. Anchors are discrete integers (`0`, `25`, `50`, `75`, `100`) with behavioral definitions documented in `references/findings-schema.json` and embedded in the persona rubric (`references/subagent-template.md`). This replaces the prior continuous 0.0-1.0 scale and its per-severity gates — doc-review economics do not warrant threshold gradation by severity, and coarse anchors prevent false-precision gaming.
Findings at or above their severity's gate survive for classification. Findings below the gate are evaluated for FYI-eligibility:
| Anchor | Meaning | Route |
|--------|---------|-------|
| `0` | False positive or pre-existing issue | Drop silently |
| `25` | Might be real but could not verify | Drop silently |
| `50` | Verified real but nitpick / advisory / not very important | Surface in FYI subsection |
| `75` | Double-checked, will hit in practice, directly impacts correctness | Enter actionable tier (classify by `autofix_class`) |
| `100` | Evidence directly confirms; will happen frequently | Enter actionable tier (classify by `autofix_class`) |
- **FYI-eligible** when `autofix_class` is `manual` and confidence is between 0.40 (FYI floor) and the severity gate. These surface in an FYI subsection at the presentation layer (see 3.7) but do not enter the walk-through or any bulk action — they exist as observational value without forcing a decision.
- **Dropped** when confidence is below 0.40, or when the finding is `safe_auto` / `gated_auto` but below the severity gate (auto-apply findings need confidence above the decision gate to avoid silent mistakes).
- **Dropped silently** (anchors `0` and `25`): these do not surface in any output bucket — not as findings, not as FYI observations, not as residual concerns. Record the total drop count as a Coverage footnote line when non-zero: `Dropped: N (anchors 0/25 suppressed)`. The footnote appears below the Coverage table, alongside the `Chains:` footnote when both apply. This is the canonical location for drop-count reporting — not the summary line and not a per-persona Coverage column. Omit the footnote when N is zero.
- **FYI-subsection** (anchor `50`): surface in the presentation layer's FYI subsection regardless of `autofix_class`. These do not enter the walk-through or any bulk action — observational value without forcing a decision. Advisory observations ("nothing breaks, but...") naturally land here.
- **Actionable** (anchors `75` and `100`): enter the classification pipeline. Route by `autofix_class` (see 3.7).
Record the drop count and the FYI count in Coverage.
**Why this threshold, not Anthropic's ≥ 80 code-review threshold:** Document review has opposite economics from code review. There is no linter backstop — the review IS the backstop. Premise-level concerns (product-lens, adversarial) naturally cap at anchors 50-75 because "is the motivation valid?" cannot be verified against ground truth. The routing menu already makes dismissal cheap (Skip, Append to Open Questions), so surfaced-and-skipped is a low-cost outcome while missed-and-shipped derails downstream implementation. Filter low (`≥ 50`) and let the routing menu handle volume.
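A minimal TypeScript sketch of the three-way split (illustrative only; the `Finding` shape and field names are assumptions mirroring the schema, not shipped code):

```typescript
type Anchor = 0 | 25 | 50 | 75 | 100;

interface Finding {
  confidence: Anchor;
  autofix_class: "safe_auto" | "gated_auto" | "manual";
}

type GateRoute = "drop" | "fyi" | "actionable";

// 3.2: anchors 0/25 drop silently, 50 routes to FYI, 75/100 enter the actionable tier.
function gateByAnchor(f: Finding): GateRoute {
  if (f.confidence <= 25) return "drop";
  if (f.confidence === 50) return "fyi";
  return "actionable"; // 75 or 100; classified by autofix_class in 3.7
}
```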
### 3.3 Deduplicate
@@ -37,14 +39,38 @@ Fingerprint each finding using `normalize(section) + normalize(title)`. Normaliz
When fingerprints match across personas:
- If the findings recommend opposing actions (e.g., one says cut, the other says keep), do not merge — preserve both for contradiction resolution in 3.5
- Otherwise merge: keep the highest severity, keep the highest confidence, union all evidence arrays, note all agreeing reviewers (e.g., "coherence, feasibility")
- **Coverage attribution:** Attribute the merged finding to the persona with the highest confidence. Decrement the losing persona's Findings count and the corresponding route bucket so totals stay exact.
- Otherwise merge: keep the highest severity, keep the highest confidence anchor (if tied, keep the finding appearing first in document order — deterministic, not probabilistic), union all evidence arrays, note all agreeing reviewers (e.g., "coherence, feasibility")
- **Coverage attribution:** Attribute the merged finding to the persona with the highest confidence anchor. If anchors tie, attribute to the persona whose entry appeared first in document order. Decrement the losing persona's Findings count and the corresponding route bucket so totals stay exact.
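The merge rule, sketched under the same caveat (the `MergeCandidate` shape and the `docOrder` position index are assumptions; the prose above, not this code, is the specification):

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

interface MergeCandidate {
  section: string;
  title: string;
  severity: Severity;
  confidence: 0 | 25 | 50 | 75 | 100;
  evidence: string[];
  reviewer: string;
  docOrder: number; // hypothetical position index recorded during validation
}

const normalize = (s: string) => s.toLowerCase().replace(/[^a-z0-9 ]/g, "").trim();
const fingerprint = (f: MergeCandidate) => `${normalize(f.section)}|${normalize(f.title)}`;

const severityRank: Record<Severity, number> = { P0: 0, P1: 1, P2: 2, P3: 3 };

// Merge two same-fingerprint findings that do not recommend opposing actions.
function merge(a: MergeCandidate, b: MergeCandidate): MergeCandidate {
  // Winner: highest anchor; on a tie, the finding appearing first in document order.
  const winner =
    a.confidence !== b.confidence
      ? (a.confidence > b.confidence ? a : b)
      : (a.docOrder <= b.docOrder ? a : b);
  const loser = winner === a ? b : a;
  return {
    ...winner,
    // Keep the highest severity (P0 outranks P1, and so on).
    severity: severityRank[a.severity] <= severityRank[b.severity] ? a.severity : b.severity,
    evidence: [...new Set([...a.evidence, ...b.evidence])], // union of evidence arrays
    reviewer: `${winner.reviewer}, ${loser.reviewer}`, // note all agreeing reviewers
  };
}
```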
### 3.4 Cross-Persona Agreement Boost
### 3.3b Same-Persona Premise Redundancy Collapse
When 2+ independent personas flagged the same merged finding (from 3.3), boost the merged finding's confidence by +0.10 (capped at 1.0). Independent corroboration is strong signal — multiple reviewers converging on the same issue is more reliable than any single reviewer's confidence. Note the boost in the Reviewer column of the output (e.g., "coherence, feasibility +0.10").
A single persona sometimes files multiple findings that share the same root premise expressed at different sections or wrapped in different framing (e.g., product-lens firing five variants of "motivation is weak" attached to Motivation, Unit 4b, Key Technical Decisions, and two other sections). Cross-persona dedup (3.3) does not catch this — it fingerprints on section+title, which differ even when the underlying concern is the same. Surfacing all N variants over-weights one persona's perspective relative to the other five and inflates the P2 Decisions tier with near-duplicate signal.
This replaces the earlier residual-concern promotion step. Findings below the confidence gate are not promoted back into the review surface; they appear in Coverage as residual concerns only. If a below-gate finding is genuinely important, the reviewer should raise their confidence or provide stronger evidence rather than relying on a promotion rule.
For each persona, cluster that persona's surviving findings by shared root premise. A cluster forms when 3 or more findings from the same persona share:
- The same `finding_type` (error or omission)
- Substantially overlapping `why_it_matters` phrasing (same key nouns/verbs signaling the same concern, e.g., "motivation", "justification", "premise unsupported", "scope creep")
- Fixes that would all be obviated by the same upstream decision (e.g., "add the triggering incident" would moot all five motivation-weakness findings)
For each cluster of size N ≥ 3:
- Keep the single finding with the strongest evidence (highest confidence anchor, or if tied, the one citing the most concrete document reference)
- Demote the remaining N-1 findings to FYI-subsection status (anchor `50`), regardless of their original anchor
- On the kept finding, note in the Reviewer column that the persona raised N-1 related variants (e.g., `product-lens (+4 related variants demoted to FYI)`)
This runs per-persona before the 3.4 cross-persona agreement promotion. Cross-persona agreement on the *kept* finding still qualifies for the anchor-step promotion in 3.4; demoted variants do not participate in cross-persona promotion (they are observational only after collapse).
Do NOT collapse across personas at this step — different personas surfacing the same concern is exactly the independence signal the cross-persona boost rewards. Collapse applies within one persona's output only.
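A sketch of the collapse applied to one already-identified cluster. Cluster detection itself (the `why_it_matters` overlap test) stays a judgment call and is not reduced to code here; evidence length stands in, loosely, for "most concrete document reference":

```typescript
interface ClusterFinding {
  confidence: 0 | 25 | 50 | 75 | 100;
  evidence: string[];
  reviewerNote?: string;
}

// Keep the strongest finding in a same-persona cluster; demote the rest to FYI (anchor 50).
function collapseCluster(cluster: ClusterFinding[]): ClusterFinding[] {
  if (cluster.length < 3) return cluster; // below the cluster threshold, leave untouched
  const kept = [...cluster].sort(
    (a, b) =>
      b.confidence - a.confidence ||
      b.evidence.join("").length - a.evidence.join("").length // rough concreteness proxy
  )[0];
  return cluster.map((f) =>
    f === kept
      ? { ...f, reviewerNote: `(+${cluster.length - 1} related variants demoted to FYI)` }
      : { ...f, confidence: 50 } // demoted regardless of original anchor
  );
}
```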
### 3.4 Cross-Persona Agreement Promotion
When 2+ independent personas flagged the same merged finding (from 3.3), promote the merged finding's anchor by one step: `50 → 75`, `75 → 100`. Anchor `100` does not promote further (already at the ceiling). Findings at anchors `0` or `25` do not reach this step (they were dropped in 3.2).
Independent corroboration is strong signal — multiple reviewers converging on the same issue is more reliable than any single reviewer's anchor. Promoting by one anchor step is semantically meaningful (a "verified but nitpick" finding that two personas independently surface is plausibly "will hit in practice"). This replaces the prior `+0.10` boost — the magic-number bump was calibrated to the continuous scale and no longer applies.
Note the promotion in the Reviewer column of the output (e.g., `coherence, feasibility (+1 anchor)`).
This replaces the earlier residual-concern promotion step. Findings at anchors `0` / `25` are not promoted back into the review surface; they appear only as drop counts in Coverage. If a dropped finding is genuinely important, the reviewer should raise their anchor to `50` or higher through stronger evidence rather than relying on a promotion rule.
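The one-step promotion, as a sketch:

```typescript
type Anchor = 0 | 25 | 50 | 75 | 100;

// Applied once per merged finding flagged by 2+ independent personas.
function promoteOneStep(anchor: Anchor): Anchor {
  if (anchor === 50) return 75;
  if (anchor === 75) return 100;
  return anchor; // 100 is the ceiling; 0/25 never reach this step (dropped in 3.2)
}
```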
### 3.5 Resolve Contradictions
@@ -113,7 +139,7 @@ If multiple candidates match the criteria, elevate ALL of them. The criteria abo
If none match, skip the rest of this step — no chains exist.
**Dependent assignment under multiple roots.** When multiple roots exist and a candidate dependent could plausibly link to more than one, assign it to the root whose rejection most directly dissolves the dependent's concern. If ambiguity remains, assign to the higher-confidence root. A dependent never links to more than one root — a single `depends_on` value.
**Dependent assignment under multiple roots.** When multiple roots exist and a candidate dependent could plausibly link to more than one, assign it to the root whose rejection most directly dissolves the dependent's concern. If ambiguity remains, assign to the root with the higher confidence anchor; if anchors tie, assign to the root appearing first in document order. A dependent never links to more than one root — a single `depends_on` value.
**Step 2: Identify dependents.** For each candidate root, scan the remaining findings for dependents. The predicate must match the cascade trigger in `references/walkthrough.md` — dependents cascade when the user rejects (Skip/Defer) the root, so dependency is defined on the rejection branch, not the acceptance branch. A finding is a dependent of a root when:
@@ -131,9 +157,9 @@ Test with the substitution check: "If the user rejects the root (Skip/Defer), do
When uncertain, default to NOT linking. A mis-linked chain hides a real issue; leaving a finding unlinked only costs one extra decision.
**Step 4: Annotate.** On each dependent, record `depends_on: <root_finding_id>` (use section + normalized title as the id). On each root, record `dependents: [<dependent_ids>]`. Cap `dependents` at 6 entries per root — if more than 6 candidates link to the same root, keep the top 6 by severity then confidence and leave the rest unlinked (over-aggressive chaining risks obscuring independent concerns).
**Step 4: Annotate.** On each dependent, record `depends_on: <root_finding_id>` (use section + normalized title as the id). On each root, record `dependents: [<dependent_ids>]`. Cap `dependents` at 6 entries per root — if more than 6 candidates link to the same root, keep the top 6 by severity, then confidence anchor (descending), then document order as the deterministic final tiebreak; leave the rest unlinked (over-aggressive chaining risks obscuring independent concerns).
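A sketch of the cap and its tiebreaks (the `Dependent` shape and `docOrder` index are assumptions):

```typescript
interface Dependent {
  severity: "P0" | "P1" | "P2" | "P3";
  confidence: 50 | 75 | 100;
  docOrder: number; // hypothetical section-position index
}

const rank = { P0: 0, P1: 1, P2: 2, P3: 3 } as const;

// Keep at most `max` dependents per root: severity first, then anchor descending, then document order.
function capDependents(candidates: Dependent[], max = 6): Dependent[] {
  return [...candidates]
    .sort(
      (a, b) =>
        rank[a.severity] - rank[b.severity] ||
        b.confidence - a.confidence ||
        a.docOrder - b.docOrder
    )
    .slice(0, max); // the rest stay unlinked
}
```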
Do NOT reclassify, re-route, or change confidence of any finding in this step. Linking is purely annotative; the walk-through and presentation use the annotation, synthesis proper does not.
Do NOT reclassify, re-route, or change the confidence anchor of any finding in this step. Linking is purely annotative; the walk-through and presentation use the annotation, synthesis proper does not.
**Step 5: Report in Coverage.** Add a line to the coverage summary: `Chains: N root(s) with M total dependents`. When N = 0, omit the line.
@@ -177,12 +203,19 @@ Do not promote if the finding involves scope or priority changes where the autho
**Severity and autofix_class are independent.** A P1 finding can be `safe_auto` if the correct fix is obvious. The test is not "how important?" but "is there one clear correct fix, or does this require judgment?"
| Autofix Class | Route |
|---------------|-------|
| `safe_auto` | Apply silently in Phase 4. Requires `suggested_fix`. Demote to `gated_auto` if missing. |
| `gated_auto` | Enter the per-finding walk-through with Apply marked (recommended). Requires `suggested_fix`. Demote to `manual` if missing. |
| `manual` | Enter the per-finding walk-through with user-judgment framing. `suggested_fix` is optional. |
| FYI-subsection | `manual` findings below the severity gate but at or above the FYI floor (0.40) — surface in a distinct FYI subsection of the presentation, do not enter the walk-through or any bulk action. |
**Anchor and autofix_class are also independent.** Anchor gates the finding into a surface (FYI vs actionable); `autofix_class` decides what the actionable surface does with it. Both are consulted in this step.
Findings reaching 3.7 have already been gated to anchors `50`, `75`, or `100` by 3.2 (anchors `0` and `25` were dropped).
| Anchor | Autofix Class | Route |
|--------|---------------|-------|
| `100` | `safe_auto` | Apply silently in Phase 4. Requires `suggested_fix`. Demote to `gated_auto` if missing. |
| `100` | `gated_auto` | Enter the per-finding walk-through with Apply marked (recommended). Requires `suggested_fix`. Demote to `manual` if missing. |
| `100` | `manual` | Enter the per-finding walk-through with user-judgment framing. `suggested_fix` is optional. |
| `75` | `safe_auto` | Demote to `gated_auto` before routing — silent apply is reserved for anchor `100` findings where evidence directly confirms the fix. Enter the walk-through with Apply marked (recommended). |
| `75` | `gated_auto` | Enter the per-finding walk-through with Apply marked (recommended). Requires `suggested_fix`. Demote to `manual` if missing. |
| `75` | `manual` | Enter the per-finding walk-through with user-judgment framing. `suggested_fix` is optional. |
| `50` | any | Surface in the FYI subsection regardless of `autofix_class`. Do not enter the walk-through or any bulk action. These are observations, not decisions. |
**Auto-eligible patterns for safe_auto:** summary/detail mismatch (body authoritative over overview), wrong counts, missing list entries derivable from elsewhere in the document, stale internal cross-references, terminology drift, prose/diagram contradictions where prose is more detailed, missing steps mechanically implied by other content, unstated thresholds implied by surrounding context.
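The routing table, sketched in TypeScript with both demotion rules folded in (field names are assumptions mirroring the schema; the table above is authoritative):

```typescript
type AutofixClass = "safe_auto" | "gated_auto" | "manual";
type Route = "apply_silently" | "walkthrough_apply_recommended" | "walkthrough_manual" | "fyi";

function routeFinding(anchor: 50 | 75 | 100, autofix: AutofixClass, hasSuggestedFix: boolean): Route {
  if (anchor === 50) return "fyi"; // any autofix_class: observations, not decisions

  // Demotions: silent apply is reserved for anchor 100, and auto classes need a suggested_fix.
  if (autofix === "safe_auto" && (anchor === 75 || !hasSuggestedFix)) autofix = "gated_auto";
  if (autofix === "gated_auto" && !hasSuggestedFix) autofix = "manual";

  if (autofix === "safe_auto") return "apply_silently"; // anchor 100 with a suggested_fix
  if (autofix === "gated_auto") return "walkthrough_apply_recommended";
  return "walkthrough_manual";
}
```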
@@ -190,17 +223,33 @@ Do not promote if the finding involves scope or priority changes where the autho
### 3.8 Sort
Sort findings for presentation: P0 → P1 → P2 → P3, then by finding type (errors before omissions), then by confidence (descending), then by document order (section position).
Sort findings for presentation: P0 → P1 → P2 → P3, then by finding type (errors before omissions), then by confidence anchor (descending: `100` first, then `75`, then `50`), then by document order (section position) as the deterministic final tiebreak.
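The comparator, sketched under the same hypothetical `docOrder` assumption:

```typescript
interface SortableFinding {
  severity: "P0" | "P1" | "P2" | "P3";
  finding_type: "error" | "omission";
  confidence: 50 | 75 | 100;
  docOrder: number; // hypothetical section-position index
}

const sevRank = { P0: 0, P1: 1, P2: 2, P3: 3 } as const;
const typeRank = { error: 0, omission: 1 } as const;

// P0 → P3, errors before omissions, anchor descending, then document order.
function sortForPresentation(findings: SortableFinding[]): SortableFinding[] {
  return [...findings].sort(
    (a, b) =>
      sevRank[a.severity] - sevRank[b.severity] ||
      typeRank[a.finding_type] - typeRank[b.finding_type] ||
      b.confidence - a.confidence ||
      a.docOrder - b.docOrder
  );
}
```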
### 3.9 Suppress Restatements in Residual Concerns and Deferred Questions
Persona outputs carry `residual_risks` and `deferred_questions` arrays alongside `findings`. After the actionable-tier set is finalized (post-3.7 routing), personas often re-surface the same substance in their residual/deferred arrays — the persona's own finding and the persona's own residual concern are about the same issue. Rendering both sections verbatim inflates the output with restatements that carry no new signal.
For every `residual_risk` and `deferred_question` across all persona outputs, check against the finalized surfaced-finding set (actionable findings at confidence anchor `75` or `100`, plus FYI-subsection findings at anchor `50`). Drop the residual/deferred item if either of these holds:
- **Section-and-substance overlap.** The residual/deferred item names the same section as an actionable finding AND its substance fuzzy-matches the finding's `title` or `why_it_matters` (shared key nouns/verbs indicating the same concern).
- **Question form of an actionable finding.** A deferred question whose subject is directly answered by or obviated by an actionable finding's recommendation. Example: actionable finding "Motivation cites no real incident" → deferred question "Is there a concrete triggering event?" — the finding already raised this; the question restates it interrogatively.
Do NOT drop residual/deferred items that introduce genuinely new signal (a concern or question the actionable findings do not touch). When in doubt, keep — this pass is for obvious restatements, not borderline calls.
Run this pass on the merged set across all personas. Record the count dropped as a Coverage footnote line when non-zero: `Restated: N (residual/deferred items suppressed as duplicates of actionable findings)`. Ordering: footnotes appear in the sequence `Dropped:`, `Chains:`, `Restated:` below the Coverage table, each on its own line. Omit any footnote whose count is zero.
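A rough sketch of the section-and-substance check only; a token-overlap heuristic stands in for the fuzzy key-noun/verb match, and the question-form criterion stays a judgment call:

```typescript
interface ActionableFinding { section: string; title: string; why_it_matters: string; }
interface ResidualItem { section: string; text: string; }

const tokens = (s: string) =>
  new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 3));

// True when the residual/deferred item restates a surfaced finding's section and substance.
function isRestatement(item: ResidualItem, finding: ActionableFinding): boolean {
  if (item.section.toLowerCase() !== finding.section.toLowerCase()) return false;
  const itemTokens = tokens(item.text);
  const findingTokens = tokens(`${finding.title} ${finding.why_it_matters}`);
  const shared = [...itemTokens].filter((t) => findingTokens.has(t)).length;
  return shared >= 4; // when in doubt, keep: require substantial overlap before suppressing
}
```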
## Phase 4: Apply and Present
**User-facing vocabulary rule (applies to ALL user-visible output in Phase 4, not just the rendered template).** Internal enum values — `safe_auto`, `gated_auto`, `manual`, `FYI` — stay inside the schema and synthesis prose. Every word the user sees in Phase 4 output, including free-text narration between sections, transition preambles, status lines, and confirmation messages, MUST use user-facing vocabulary: "fixes" (for `safe_auto`), "proposed fixes" (for `gated_auto`), "decisions" (for `manual` findings at anchor `75` or `100`), "FYI observations" (for any finding at anchor `50`). The only exception is the `Tier` column in rendered tables, which is explicitly documented as surfacing the internal enum for transparency. Do NOT emit narration like "safe_auto fixes applied" or "N safe_auto findings" — write "fixes applied" or "N fixes" instead.
### Apply safe_auto fixes
Apply all `safe_auto` findings to the document in a single pass:
Apply only `safe_auto` findings **at confidence anchor `100`** to the document in a single pass. This matches the 3.7 routing table: anchor `100` + `safe_auto` silent-applies; anchor `75` + `safe_auto` was demoted to `gated_auto` in 3.7 and enters the walk-through instead; anchor `50` + any `autofix_class` routes to FYI and must never auto-apply.
- Edit the document inline using the platform's edit tool
- Track what was changed for the "Applied fixes" section in the rendered output (`safe_auto` is the internal enum; the rendered section header reads "Applied fixes")
- Do not ask for approval — these have one clear correct fix
- Do not ask for approval — these have one clear correct fix AND evidence directly confirms (anchor `100`)
- Do NOT silent-apply any `safe_auto` finding at anchor `75` or `50`. If a finding reaches this step with `autofix_class: safe_auto` and anchor below `100`, the 3.7 routing rule was not applied correctly; re-run 3.7 for that finding before continuing.
List every applied fix in the output summary so the user can see what changed. Use enough detail to convey the substance of each fix (section, what was changed, reviewer attribution). This is especially important for fixes that add content or touch document meaning — the user should not have to diff the document to understand what the review did.
@@ -208,7 +257,7 @@ List every applied fix in the output summary so the user can see what changed. U
After safe_auto fixes apply, remaining findings split into buckets:
- `gated_auto` and `manual` findings at or above the severity gate → enter the routing question (see Unit 5 / `references/walkthrough.md`)
- `gated_auto` and `manual` findings at confidence anchor `75` or `100` → enter the routing question (see Unit 5 / `references/walkthrough.md`)
- FYI-subsection findings → surface in the presentation only, no routing
- Zero actionable findings remaining → skip the routing question; flow directly to Phase 5 terminal question
@@ -223,25 +272,25 @@ Applied N fixes:
Proposed fixes (concrete fix, requires user confirmation):
[P0] Section: <section> — <title> (<reviewer>, confidence <N>)
[P0] Section: <section> — <title> (<reviewer>, confidence <anchor>)
Why: <why_it_matters>
Suggested fix: <suggested_fix>
Decisions (requires user judgment):
[P1] Section: <section> — <title> (<reviewer>, confidence <N>)
[P1] Section: <section> — <title> (<reviewer>, confidence <anchor>)
Why: <why_it_matters>
Suggested fix: <suggested_fix or "none">
Dependents (would resolve if this root is rejected):
[P2] Section: <section> — <title> (<reviewer>, confidence <N>)
[P2] Section: <section> — <title> (<reviewer>, confidence <anchor>)
Why: <why_it_matters>
[P2] Section: <section> — <title> (<reviewer>, confidence <N>)
[P2] Section: <section> — <title> (<reviewer>, confidence <anchor>)
Why: <why_it_matters>
FYI observations (low-confidence, no decision required):
FYI observations (anchor 50, no decision required):
[P3] Section: <section> — <title> (<reviewer>, confidence <N>)
[P3] Section: <section> — <title> (<reviewer>, confidence <anchor>)
Why: <why_it_matters>
Residual concerns:
@@ -250,10 +299,16 @@ Residual concerns:
Deferred questions:
- <question> (<source>)
Dropped: N (anchors 0/25 suppressed)
Chains: N root(s) with M dependents
Restated: N (residual/deferred items suppressed as duplicates of actionable findings)
Review complete
```
Omit any section with zero items. The section headers reflect user-facing vocabulary: the "Proposed fixes" bucket carries `gated_auto` findings (the persona has a concrete fix; the user confirms), "Decisions" carries above-gate `manual` findings (judgment calls), and "FYI observations" carries `manual` findings between the 0.40 FYI floor and the per-severity gate. When a root has dependents, render the root at its normal position in the severity-sorted list and nest its dependents as an indented `Dependents (...)` sub-block immediately below. Do not re-list dependents at their own severity position — they appear only under their root. End with "Review complete" as the terminal signal so callers can detect completion.
Omit any section with zero items. The section headers reflect user-facing vocabulary: the "Proposed fixes" bucket carries `gated_auto` findings at anchor `75` or `100` (the persona has a concrete fix; the user confirms), "Decisions" carries `manual` findings at anchor `75` or `100` (judgment calls), and "FYI observations" carries any finding at anchor `50` regardless of `autofix_class`. When a root has dependents, render the root at its normal position in the severity-sorted list and nest its dependents as an indented `Dependents (...)` sub-block immediately below. Do not re-list dependents at their own severity position — they appear only under their root. End with "Review complete" as the terminal signal so callers can detect completion.
**Compact rendering for FYI observations, residual concerns, and deferred questions (high-count mode).** When the combined count of these three buckets is 5 or more, collapse each to a one-line count followed by a tight bullet list without per-item `Why` expansion. Actionable buckets (Proposed fixes / Decisions) remain fully rendered regardless. This mirrors the interactive-mode rule in `references/review-output-template.md` so both modes produce the same shape.
**Interactive mode:**
@@ -266,6 +321,8 @@ Brief summary at the top: "Applied N fixes. K items need attention (X errors, Y
Include the Coverage table, applied fixes, FYI observations (as a distinct subsection), residual concerns, and deferred questions.
**All tables MUST be pipe-delimited markdown (`| col | col |`). Do NOT use ASCII box-drawing characters (`┌ ┬ ┐ ├ ┼ ┤ └ ┴ ┘ │ ─`) under any circumstances, including for the Coverage table.** This rule restates the template's formatting requirement at the point of rendering so it cannot drift. Pipe-delimited tables render correctly across all target harnesses; box-drawing characters break rendering in some and violate the repo convention documented in root `AGENTS.md`.
### R29 Rejected-Finding Suppression (Round 2+)
When the orchestrator is running round 2+ on the same document in the same session, the decision primer (see `SKILL.md` — Decision primer) carries forward every prior-round Skipped, Deferred, and Acknowledged finding. Synthesis suppresses re-raised rejected findings rather than re-surfacing them to the user. Acknowledged is treated as a rejected-class decision here: the user saw the finding, chose not to act on it (no Apply, no Defer append), and wants it on record — equivalent to Skip for suppression purposes.
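A rough sketch of that suppression step, assuming the primer's prior-round decisions can be looked up by finding title (the lookup key and names are assumptions, not the primer's actual format):
```typescript
// Hedged sketch of R29: drop re-raised findings the user already
// dispositioned in a prior round (Skipped, Deferred, or Acknowledged).
type PriorDecision = "Skipped" | "Deferred" | "Acknowledged" | "Applied"

interface RaisedFinding {
  title: string
}

function suppressRejected(
  findings: RaisedFinding[],
  priorDecisions: Map<string, PriorDecision>, // assumed: keyed by title
): RaisedFinding[] {
  const carriedForward = new Set<PriorDecision>(["Skipped", "Deferred", "Acknowledged"])
  return findings.filter((f) => {
    const prior = priorDecisions.get(f.title)
    return prior === undefined || !carriedForward.has(prior)
  })
}
```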

View File

@@ -25,7 +25,7 @@ D. Report only — take no further action
The per-finding `(recommended)` labeling lives inside the walk-through (option A) and the bulk preview (options B/C), where it's applied per-finding from synthesis step 3.5b's `recommended_action`. The routing question itself does not recommend one of A/B/C/D because the right route depends on user intent (engage / trust / triage / skim), not on the finding-set shape — a rule that mapped finding-set shape to routing recommendation (e.g., "most findings are Apply-shaped → recommend LFG") would pressure users toward automated paths in ways that conflict with the user-intent framing.
If all remaining findings are FYI-subsection-only (no `gated_auto` or above-gate `manual` findings), skip the routing question entirely and flow to the Phase 5 terminal question.
If all remaining findings are FYI-subsection-only (no `gated_auto` or `manual` findings at confidence anchor `75` or `100`), skip the routing question entirely and flow to the Phase 5 terminal question.
**Append-availability adaptation.** When `references/open-questions-defer.md` has cached `append_available: false` at Phase 4 start (e.g., read-only document, unwritable filesystem), option C is suppressed from the routing question because every per-finding Defer would fail into the open-questions failure path. The menu shows three options (A / B / D) and the stem appends one line explaining why (e.g., `Append to Open Questions unavailable — document is read-only in this environment.`). This mirrors the per-finding option B suppression described under "Adaptations" below — both routing-level and per-finding Defer paths share the same availability signal so the user never sees Defer surfaced at one level and omitted at the other.
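As a sketch of how both levels can share one availability signal (option letters from the routing question above; the note text is the example line quoted there):
```typescript
// Hedged sketch: build the routing menu, dropping option C when the
// Open Questions append path is known to be unavailable.
interface RoutingMenu {
  options: Array<"A" | "B" | "C" | "D">
  note?: string
}

function buildRoutingMenu(appendAvailable: boolean): RoutingMenu {
  if (appendAvailable) return { options: ["A", "B", "C", "D"] }
  return {
    options: ["A", "B", "D"],
    note: "Append to Open Questions unavailable — document is read-only in this environment.",
  }
}
```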
@@ -42,7 +42,7 @@ If all remaining findings are FYI-subsection-only (no `gated_auto` or above-gate
The walk-through receives, from the orchestrator:
- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to `gated_auto` and `manual` findings that survived the per-severity confidence gate. FYI-subsection findings are not included — they surface in the final report only and have no walk-through entry.
- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to actionable findings (confidence anchor `75` or `100` with `autofix_class` `gated_auto` or `manual`). FYI-subsection findings (anchor `50`) are not included — they surface in the final report only and have no walk-through entry.
- The run id for artifact lookups (when applicable).
- Premise-dependency chain annotations from synthesis step 3.5c: each finding may carry `depends_on: <root_id>` or `dependents: [<ids>]`.
@@ -246,7 +246,7 @@ Every terminal path of Interactive mode emits the same completion report structu
### Minimum required fields
- **Per-finding entries:** for every finding the flow touched, a line with — at minimum — title, severity, the action taken (Applied / Deferred / Skipped / Acknowledged), the append location for Deferred entries, a one-line reason for Skipped entries (grounded in the finding's confidence or the one-line `why_it_matters` snippet), and the acknowledgement reason for Acknowledged entries (e.g., `Apply picked but no suggested_fix available`).
- **Per-finding entries:** for every finding the flow touched, a line with — at minimum — title, severity, the action taken (Applied / Deferred / Skipped / Acknowledged), the append location for Deferred entries, a one-line reason for Skipped entries (grounded in the finding's confidence anchor or the one-line `why_it_matters` snippet), and the acknowledgement reason for Acknowledged entries (e.g., `Apply picked but no suggested_fix available`).
- **Summary counts by action:** totals per bucket (e.g., `4 applied, 2 deferred, 2 skipped`). Include an `acknowledged` count when any entries land in that bucket; omit the label when the count is zero.
- **Failures called out explicitly:** any Apply that failed (e.g., document write error, or the defensive no-fix fallback skipping an Apply-set entry), any Open-Questions append that failed. Failures surface above the per-finding list so they are not missed.
- **End-of-review verdict:** carried over from Phase 4's Coverage section.
@@ -257,7 +257,7 @@ Failures first (above the per-finding list), then per-finding entries grouped by
### Zero-findings degenerate case
When the routing question was skipped because no `gated_auto` / above-gate `manual` findings remained after `safe_auto`, the completion report collapses to its summary-counts + verdict form with one added line — the count of `safe_auto` fixes applied. The summary wording:
When the routing question was skipped because no `gated_auto` / `manual` findings at confidence anchor `75` or `100` remained after `safe_auto`, the completion report collapses to its summary-counts + verdict form with one added line — the count of `safe_auto` fixes applied. The summary wording:
No FYI or residual concerns:

View File

@@ -56,7 +56,7 @@ Seed map (run this plan through ce-doc-review to verify):
- PII handling during migration window unstated — compliance
gap independent of premise
- FYI candidates (4, confidence 0.40-0.65 at P3):
- FYI candidates (4, anchor 50 at P3):
- naming preference ("AuthContext" vs "SessionContext" — both
legible in the code)
- speculative future-work concern (could reuse this for a
@@ -65,7 +65,7 @@ Seed map (run this plan through ce-doc-review to verify):
- unit-organization preference (could group by route rather
than by endpoint class — current split also reads fine)
- drop-worthy P3s (3, confidence 0.55-0.74):
- drop-worthy P3s (3, anchors 0/25):
- vague performance concern without baseline ("could be slow
under load")
- theoretical multi-region concern not relevant to single-region

View File

@@ -26,14 +26,14 @@ Seed map (run this plan through ce-doc-review to verify):
scope-guardian complexity challenge (is this abstraction warranted),
product-lens trajectory concern (does this paint the system into a
corner)
- FYI candidates (5, confidence 0.40-0.65 at P3): filename-symmetry
- FYI candidates (5, anchor 50 at P3): filename-symmetry
observation, drift note, stylistic preference without evidence of
impact, speculative future-work concern, subjective readability note
- drop-worthy P3s (3, confidence 0.55-0.74): vague style nitpick, low-
- drop-worthy P3s (3, anchors 0/25): vague style nitpick, low-
signal "consider X" residual, theoretical scalability concern without
current evidence
The descriptions intentionally vary in evidence quality so the confidence
The descriptions intentionally vary in evidence quality so the anchor
gate is exercised.
-->
@@ -205,7 +205,7 @@ one-command rename. (Seeded manual: scope-guardian complexity challenge
- The plan's section ordering could be improved; "Miscellaneous Notes"
feels like a catch-all. (Seeded drop: vague style nitpick at P3,
confidence should register below 0.75 gate.)
should register at anchor 0 or 25 and drop silently.)
- Consider whether the schema migration strategy scales if the codebase
grows 10x. (Seeded drop: theoretical scalability concern without
current evidence, P3.)

View File

@@ -372,6 +372,51 @@ describe("ce-doc-review contract", () => {
expect(enumValues).not.toContain("present")
})
test("findings schema enforces discrete confidence anchors", async () => {
const schema = JSON.parse(
await readRepoFile("plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json")
)
const confidence = schema.properties.findings.items.properties.confidence
// Anchored integer enum, not continuous float
expect(confidence.type).toBe("integer")
expect(confidence.enum).toEqual([0, 25, 50, 75, 100])
// No stale continuous-range properties
expect(confidence.minimum).toBeUndefined()
expect(confidence.maximum).toBeUndefined()
// Rubric text embedded in the description so persona agents see it
expect(confidence.description).toContain("Absolutely certain")
expect(confidence.description).toContain("Highly confident")
expect(confidence.description).toContain("Moderately confident")
expect(confidence.description).toContain("double-checked")
expect(confidence.description).toContain("evidence directly confirms")
})
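// For reference, a minimal sketch of the confidence property shape the test
// above expects findings-schema.json to carry (illustrative, not the file's
// actual text; the full rubric wording lives in the schema description):
//   "confidence": {
//     "type": "integer",
//     "enum": [0, 25, 50, 75, 100],
//     "description": "<anchor rubric embedding phrases such as 'Absolutely
//       certain', 'Highly confident', 'Moderately confident', 'double-checked',
//       and 'evidence directly confirms'>"
//   }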
test("subagent template embeds anchor rubric and bans float confidence", async () => {
const template = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md"
)
// Rubric section embedded verbatim in the persona-facing template
expect(template).toContain("Confidence rubric")
expect(template).toContain("`0`")
expect(template).toContain("`25`")
expect(template).toContain("`50`")
expect(template).toContain("`75`")
expect(template).toContain("`100`")
// Example finding uses anchor, not float
expect(template).toContain('"confidence": 100')
expect(template).not.toMatch(/"confidence":\s*0\.\d+/)
// Advisory observations route to anchor 50, not to a 0.40-0.59 band
expect(template).toContain("`confidence: 50`")
expect(template).not.toContain("0.400.59 LOW/Advisory band")
expect(template).not.toContain("0.40-0.59 LOW/Advisory band")
})
test("subagent template carries framing guidance and strawman rule", async () => {
const template = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md"
@@ -397,30 +442,30 @@ describe("ce-doc-review contract", () => {
expect(template).toContain("<decision-primer-rules>")
})
test("synthesis pipeline routes three tiers with per-severity gates and FYI subsection", async () => {
test("synthesis pipeline routes three tiers with anchor-based gating and FYI subsection", async () => {
const synthesis = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md"
)
// Per-severity confidence gate with the specific thresholds
expect(synthesis).toContain("Per-Severity")
expect(synthesis).toMatch(/P0\s*\|\s*0\.50/)
expect(synthesis).toMatch(/P1\s*\|\s*0\.60/)
expect(synthesis).toMatch(/P2\s*\|\s*0\.65/)
expect(synthesis).toMatch(/P3\s*\|\s*0\.75/)
// Anchor-based confidence gate
expect(synthesis).toContain("Anchor-Based")
expect(synthesis).toMatch(/`0`\s*\|/)
expect(synthesis).toMatch(/`25`\s*\|/)
expect(synthesis).toMatch(/`50`\s*\|/)
expect(synthesis).toMatch(/`75`\s*\|/)
expect(synthesis).toMatch(/`100`\s*\|/)
// FYI floor at 0.40 for low-confidence manual findings
expect(synthesis).toContain("0.40")
expect(synthesis).toContain("FYI floor")
// Anchor 50 routes to FYI, anchors 75/100 enter actionable tier
expect(synthesis).toContain("FYI subsection")
// Three-tier routing table present
// Three-tier routing table present (autofix_class)
expect(synthesis).toContain("`safe_auto`")
expect(synthesis).toContain("`gated_auto`")
expect(synthesis).toContain("`manual`")
// Cross-persona agreement boost (replaces residual-concern promotion)
expect(synthesis).toContain("Cross-Persona Agreement Boost")
expect(synthesis).toContain("+0.10")
// Cross-persona agreement promotion (replaces +0.10 boost)
expect(synthesis).toContain("Cross-Persona Agreement Promotion")
expect(synthesis).toContain("one anchor step")
// R29 and R30 round-2 rules
expect(synthesis).toContain("R29 Rejected-Finding Suppression")