refactor(ce-doc-review): anchor-based confidence scoring (#622)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trevin Chow
2026-04-21 14:54:03 -07:00
committed by GitHub
parent bd77d5550a
commit 6caf330363
20 changed files with 756 additions and 122 deletions

View File

@@ -177,7 +177,7 @@ Cross-session persistence is out of scope. A new invocation of ce-doc-review on
## Phases 3-5: Synthesis, Presentation, and Next Action
After all dispatched agents return, read `references/synthesis-and-presentation.md` for the synthesis pipeline (validate, per-severity gate, dedup, cross-persona agreement boost, resolve contradictions, auto-promotion, route by three tiers with FYI subsection), `safe_auto` fix application, headless-envelope output, and the handoff to the routing question.
After all dispatched agents return, read `references/synthesis-and-presentation.md` for the synthesis pipeline (validate, anchor-based gate, dedup, cross-persona agreement promotion, resolve contradictions, auto-promotion, route by three tiers with FYI subsection), `safe_auto` fix application, headless-envelope output, and the handoff to the routing question.
For the four-option routing question and per-finding walk-through (interactive mode), read `references/walkthrough.md`. For the bulk-action preview used by LFG, Append-to-Open-Questions, and walk-through `LFG-the-rest`, read `references/bulk-preview.md`. Do not load these files before agent dispatch completes.

View File

@@ -10,8 +10,8 @@ Interactive mode only.
Three call sites:
1. **Routing option B (top-level LFG)** — after the user picks `LFG. Apply the agent's best-judgment action per finding` from the routing question, but before any action executes. Scope: every pending `gated_auto` / above-gate `manual` finding.
2. **Routing option C (top-level Append-to-Open-Questions)** — after the user picks `Append findings to the doc's Open Questions section and proceed` but before any append runs. Scope: every pending `gated_auto` / above-gate `manual` finding. Every finding appears under `Appending to Open Questions (N):` regardless of the agent's natural recommendation, because option C is batch-defer.
1. **Routing option B (top-level LFG)** — after the user picks `LFG. Apply the agent's best-judgment action per finding` from the routing question, but before any action executes. Scope: every pending `gated_auto` or `manual` finding at confidence anchor `75` or `100`.
2. **Routing option C (top-level Append-to-Open-Questions)** — after the user picks `Append findings to the doc's Open Questions section and proceed` but before any append runs. Scope: every pending `gated_auto` or `manual` finding at confidence anchor `75` or `100`. Every finding appears under `Appending to Open Questions (N):` regardless of the agent's natural recommendation, because option C is batch-defer.
3. **Walk-through `LFG the rest`** — after the user picks `LFG the rest — apply the agent's best judgment to this and remaining findings` from a per-finding question, but before the remaining findings are resolved. Scope: the current finding and everything not yet decided. Already-decided findings from the walk-through are not included in the preview.
In all three cases the user confirms with `Proceed` or backs out with `Cancel`. No per-item decisions inside the preview — per-item decisioning is the walk-through's role.

View File

@@ -58,10 +58,9 @@
"description": "Concrete fix text. Omit or null if no good fix is obvious -- a bad suggestion is worse than none."
},
"confidence": {
"type": "number",
"description": "Reviewer confidence in this finding, calibrated per persona",
"minimum": 0.0,
"maximum": 1.0
"type": "integer",
"enum": [0, 25, 50, 75, 100],
"description": "Anchored confidence score. Use exactly one of 0, 25, 50, 75, 100. Each anchor has a behavioral criterion the reviewer must honestly self-apply. 0: Not confident at all. This is a false positive that does not stand up to light scrutiny, or a pre-existing issue the document did not introduce. 25: Somewhat confident. Might be a real issue but could also be a false positive; the reviewer was not able to verify. 50: Moderately confident. The reviewer verified this is a real issue but it may be a nitpick or not meaningfully affect plan correctness. Relative to the rest of the document, it is not very important. Advisory observations (the honest answer to 'what breaks if we do not fix this?' is 'nothing breaks, but...') land here. 75: Highly confident. The reviewer double-checked and verified the issue will be hit in practice by implementers or readers of this document. The existing approach is insufficient. The issue is important and will directly impact plan correctness, implementer understanding, or downstream execution. 100: Absolutely certain. The reviewer double-checked and confirmed the issue. The evidence directly confirms it will happen frequently in practice. The document text, codebase, or cross-references leave no room for interpretation."
},
"evidence": {
"type": "array",

View File

@@ -42,8 +42,8 @@ Fields come from the finding's schema:
- `{title}` — the finding's title field
- `{section}` — the finding's section field, unmodified (human-readable)
- `{severity}` — P0 / P1 / P2 / P3
- `{reviewer}` — the persona that produced the finding (after dedup, the persona with the highest confidence; surface all co-flagging personas if multiple)
- `{confidence}` — rounded to 2 decimal places
- `{reviewer}` — the persona that produced the finding (after dedup, the persona with the highest confidence anchor; surface all co-flagging personas if multiple)
- `{confidence}` — the integer anchor (`50`, `75`, or `100`), emitted without a decimal point or percent sign
- `{why_it_matters}` — the full why_it_matters text, preserving the framing guidance from the subagent template
HTML-comment fields (machine-readable, used by Step 4 dedup):
@@ -133,7 +133,7 @@ Starting document state:
### From 2026-04-10 review
- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 0.87)
- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 75)
The alias exists without documented external consumers...
@@ -152,7 +152,7 @@ After appending two findings in a 2026-04-18 session:
### From 2026-04-10 review
- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 0.87)
- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 75)
The alias exists without documented external consumers...
@@ -160,14 +160,14 @@ After appending two findings in a 2026-04-18 session:
### From 2026-04-18 review
- **Unit 2/3 merge judgment call** — Scope Boundaries (P2, scope-guardian, confidence 0.78)
- **Unit 2/3 merge judgment call** — Scope Boundaries (P2, scope-guardian, confidence 75)
The two units update consumer sites that deploy together. Splitting
adds dependency tracking without enabling independent delivery.
<!-- dedup-key: section="scope boundaries" title="unit 23 merge judgment call" evidence="the two units update consumer sites that deploy together" -->
- **Strawman alternatives on migration strategy** — Unit 3 Files (P2, coherence, confidence 0.72)
- **Strawman alternatives on migration strategy** — Unit 3 Files (P2, coherence, confidence 75)
The fix options list (a) through (c) as alternatives, but (b) and (c)
are "accept the regression" framings that don't solve the problem the

View File

@@ -8,6 +8,8 @@ This template describes the Phase 4 interactive presentation — what the user s
**Vocabulary note.** Internal enum values (`safe_auto`, `gated_auto`, `manual`, `FYI`) live in the schema and synthesis pipeline. User-facing rendered text uses plain-language labels instead: fixes (for `safe_auto`), proposed fixes (for `gated_auto`), decisions (for `manual`), and FYI observations (for `FYI`). The `Tier` column in the tables below is the one place that still names the internal enum so the user can see the synthesis decision; everything else reads as plain language.
**Confidence column.** The `Confidence` column shows the integer anchor value (`50`, `75`, or `100`) — never a decimal or percentage. Anchor `50` = advisory (routed to FYI); anchor `75` = verified, will hit in practice; anchor `100` = certain, evidence directly confirms. Anchors `0` and `25` are dropped by synthesis before this layer and never appear in the rendered output. Cross-persona agreement promotes by one anchor step; when this happens, the Reviewer column notes it (e.g., `coherence, feasibility (+1 anchor)`).
## Example
```markdown
@@ -35,7 +37,7 @@ Applied 5 fixes. 4 items need attention (2 errors, 2 omissions). 2 FYI observati
| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 1 | Requirements Trace | Goal states "offline support" but technical approach assumes persistent connectivity | coherence | 0.92 | manual |
| 1 | Requirements Trace | Goal states "offline support" but technical approach assumes persistent connectivity | coherence | 100 | manual |
### P1 — Should Fix
@@ -43,13 +45,13 @@ Applied 5 fixes. 4 items need attention (2 errors, 2 omissions). 2 FYI observati
| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 2 | Scope Boundaries | 8 of 12 units build admin infrastructure; only 2 touch stated goal | scope-guardian | 0.80 | manual |
| 2 | Scope Boundaries | 8 of 12 units build admin infrastructure; only 2 touch stated goal | scope-guardian | 75 | manual |
#### Omissions
| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 3 | Implementation Unit 3 | Plan proposes custom auth but does not mention existing Devise setup or migration path | feasibility | 0.85 | gated_auto |
| 3 | Implementation Unit 3 | Plan proposes custom auth but does not mention existing Devise setup or migration path | feasibility | 100 | gated_auto |
### P2 — Consider Fixing
@@ -57,7 +59,7 @@ Applied 5 fixes. 4 items need attention (2 errors, 2 omissions). 2 FYI observati
| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 4 | API Design | Public webhook endpoint has no rate limiting mentioned | security-lens | 0.75 | gated_auto |
| 4 | API Design | Public webhook endpoint has no rate limiting mentioned | security-lens | 75 | gated_auto |
### FYI Observations
@@ -65,12 +67,12 @@ Low-confidence observations surfaced without requiring a decision. Content advis
| # | Section | Observation | Reviewer | Confidence |
|---|---------|-------------|----------|------------|
| 1 | Naming | Filename `plan.md` is asymmetric with command name `user-auth`; could go either way | coherence | 0.52 |
| 2 | Risk Analysis | Rollout-cadence decision may benefit from monitoring thresholds, though not blocking | scope-guardian | 0.48 |
| 1 | Naming | Filename `plan.md` is asymmetric with command name `user-auth`; could go either way | coherence | 50 |
| 2 | Risk Analysis | Rollout-cadence decision may benefit from monitoring thresholds, though not blocking | scope-guardian | 50 |
### Residual Concerns
Residual concerns are issues the reviewers noticed but could not confirm with above-gate confidence. These are not actionable; they appear here for transparency only and are not promoted into the review surface.
Residual concerns are issues the reviewers noticed but could not confirm at confidence anchor `50` or higher. These are not actionable; they appear here for transparency only and are not promoted into the review surface.
| # | Concern | Source |
|---|---------|--------|
@@ -93,7 +95,9 @@ Residual concerns are issues the reviewers noticed but could not confirm with ab
| product-lens | not activated | -- | -- | -- | -- | -- | -- |
| design-lens | not activated | -- | -- | -- | -- | -- | -- |
Dropped: 3 (anchors 0/25 suppressed)
Chains: 1 root with 2 dependents
Restated: 2 (residual/deferred items suppressed as duplicates of actionable findings)
```
## Section Rules
@@ -101,10 +105,12 @@ Chains: 1 root with 2 dependents
- **Summary line**: Always present after the reviewer list. Format: "Applied N fixes. K items need attention (X errors, Y omissions). Z FYI observations." Omit any clause whose count is zero, except the FYI clause (a zero FYI count is informative: it says none surfaced).
- **Applied fixes**: List all fixes that were applied automatically (`safe_auto` tier). Include enough detail per fix to convey the substance — especially for fixes that add content or touch document meaning. Omit section if none.
- **P0-P3 sections**: Only include sections that have actionable findings (`gated_auto` or `manual`). Omit empty severity levels. Within each severity, separate into **Errors** and **Omissions** sub-headers. Omit a sub-header if that severity has none of that type. The `Tier` column surfaces whether a finding is `gated_auto` (concrete fix exists, Apply recommended in walk-through) or `manual` (requires user judgment).
- **FYI Observations**: Low-confidence `manual` findings above the 0.40 FYI floor but below the per-severity gate. Surface here for transparency; these are not actionable and do not enter the walk-through. Omit section if none.
- **FYI Observations**: Findings at confidence anchor `50` regardless of `autofix_class`. Surface here for transparency; these are not actionable and do not enter the walk-through. Omit section if none.
- **Residual Concerns**: Residual concerns noted by personas that did not make it above the confidence gate. Listed for transparency; not promoted into the review surface (cross-persona agreement promotion runs on findings that already survived the gate, per synthesis step 3.4). Omit section if none.
- **Deferred Questions**: Questions for later workflow stages. Omit if none.
- **Coverage**: Always include. All counts are **post-synthesis**. **Findings** must equal Auto + Proposed + Decisions + FYI exactly — if deduplication merged a finding across personas, attribute it to the persona with the highest confidence and reduce the other persona's count. **Residual** = count of `residual_risks` from this persona's raw output (not the promoted subset in the Residual Concerns section). The `Auto` column counts `safe_auto` findings, `Proposed` counts `gated_auto`, `Decisions` counts above-gate `manual`, and `FYI` counts below-gate `manual` findings at or above the 0.40 FYI floor.
- **Compact rendering for FYI / Residual / Deferred (high-count mode)**: When the combined count across these three sections is **5 or more**, collapse each section to a one-line summary followed by the items as a tight bullet list (no table, no per-item `Why` elaboration). Rationale: these sections are observational, not decision-forcing — when they are lengthy, they bury the actionable tiers above them. A P0/P1/P2 actionable finding stays fully rendered regardless of how many FYI/Residual/Deferred items exist. When the combined count is 4 or fewer, render each section as today.
- **Coverage**: Always include. All counts are **post-synthesis**. **Findings** must equal Auto + Proposed + Decisions + FYI exactly — if deduplication merged a finding across personas, attribute it to the persona with the highest confidence anchor and reduce the other persona's count. **Residual** = count of `residual_risks` from this persona's raw output (not the promoted subset in the Residual Concerns section). The `Auto` column counts `safe_auto` findings at anchor `100`, `Proposed` counts `gated_auto` findings at anchor `75` or `100`, `Decisions` counts `manual` findings at anchor `75` or `100`, and `FYI` counts findings at anchor `50` regardless of `autofix_class`. Findings at anchors `0` or `25` were dropped by synthesis and do not appear in any column. Do NOT invent additional columns (e.g., `Dropped`, `Surviving`). The column schema above is the canonical set.
- **Coverage footnote lines** (optional, appear below the table when non-zero): `Dropped: N (anchors 0/25 suppressed)` when synthesis 3.2 dropped any findings. `Chains: N root(s) with M dependents` when premise-dependency chains exist. `Restated: N (residual/deferred items suppressed as duplicates of actionable findings)` when synthesis 3.9 suppressed any restatements. These footnotes — not the summary line, not per-persona columns — are the canonical location for cross-cutting counts that don't fit the per-persona shape. Order: `Dropped:`, then `Chains:`, then `Restated:`, each on its own line. Omit any footnote whose count is zero.
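A minimal sketch of the footnote-emission rule, assuming the counts were already computed by synthesis steps 3.2, 3.5c, and 3.9 (function and parameter names are illustrative, not part of the pipeline):
```python
def coverage_footnotes(dropped: int, chain_roots: int,
                       chain_dependents: int, restated: int) -> list[str]:
    """Hypothetical sketch: canonical footnote order below the Coverage table."""
    lines = []
    if dropped:
        lines.append(f"Dropped: {dropped} (anchors 0/25 suppressed)")
    if chain_roots:
        lines.append(f"Chains: {chain_roots} root(s) with {chain_dependents} dependents")
    if restated:
        lines.append(f"Restated: {restated} (residual/deferred items "
                     f"suppressed as duplicates of actionable findings)")
    return lines  # each footnote renders on its own line; zero counts are omitted
```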
## Chain-Rendering Rules

View File

@@ -24,10 +24,24 @@ Return ONLY valid JSON matching the findings schema below. No prose, no markdown
- `finding_type`: one of `"error"`, `"omission"` — nothing else (no `"tension"`, `"concern"`, `"observation"`, etc.).
- `autofix_class`: one of `"safe_auto"`, `"gated_auto"`, `"manual"`.
- `evidence`: an ARRAY of strings with at least one element. A single string value is a validation failure — wrap every quote in `["..."]` even when there is only one.
- `confidence`: a number between 0.0 and 1.0 inclusive.
- `confidence`: one of exactly `0`, `25`, `50`, `75`, or `100` — a discrete anchor, NOT a continuous number. Any other value (e.g., `72`, `0.85`, `"high"`) is a validation failure. Pick the anchor whose behavioral criterion you can honestly self-apply to this finding (see "Confidence rubric" below).
If your persona description uses severity vocabulary like "high-priority" or "critical" in its rubric text, translate to the P0-P3 scale at emit time. "Critical / must-fix" → P0, "important / should-fix" → P1, "worth-noting / could-fix" → P2, "low-signal" → P3. Same for priorities described qualitatively in your analysis — map to P0-P3 on the way out.
**Confidence rubric — use these exact behavioral anchors.** Pick the single anchor whose criterion you can honestly self-apply. Do not pick a value between anchors; only `0`, `25`, `50`, `75`, and `100` are valid. The rubric is anchored on behavior you performed, not on a vague sense of certainty — if you cannot truthfully attach the behavioral claim to the finding, step down to the next anchor.
- **`0` — Not confident at all.** A false positive that does not stand up to light scrutiny, or a pre-existing issue the document did not introduce. **Do not emit — suppress silently.** This anchor exists in the enum only so synthesis can explicitly track the drop; personas never produce it.
- **`25` — Somewhat confident.** Might be a real issue but could also be a false positive; you were not able to verify. **Do not emit — suppress silently.** This anchor, like `0`, exists in the enum only so synthesis can track the drop; personas never produce it. If your domain is genuinely uncertain, either gather more evidence until you can honestly anchor the finding at `50` or higher, or suppress the concern entirely. (Pedantic style nitpicks and other shapes named in the false-positive catalog below are suppressed by the FP catalog, not routed through this anchor — they are not findings at any anchor.)
- **`50` — Moderately confident.** You verified this is a real issue but it may be a nitpick or not meaningfully affect plan correctness. Relative to the rest of the document, it is not very important. Advisory observations — where the honest answer to "what breaks if we do not fix this?" is "nothing breaks, but..." — land here. Surfaces in the FYI subsection.
- **`75` — Highly confident.** You double-checked and verified the issue will be hit in practice by implementers or readers of this document. The existing approach in the document is insufficient. The issue directly impacts plan correctness, implementer understanding, or downstream execution.
**Anchor `75` requires naming a concrete downstream consequence someone will hit** — a wrong deploy order, an unimplementable step, a contract mismatch, missing evidence that blocks a decision. Strength-of-argument concerns ("motivation is thin," "premise is unconvincing," "a different reader might disagree") do not meet this bar on their own — they are advisory observations and land at anchor `50` unless they also name the specific downstream outcome the reader hits. When in doubt between `50` and `75`, ask: "will a competent implementer or reader concretely encounter this, or is this my opinion about the document's strength?" The former is `75`; the latter is `50`.
- **`100` — Absolutely certain.** You double-checked and confirmed the issue. The evidence directly confirms it will happen frequently in practice. The document text, codebase, or cross-references leave no room for interpretation.
Anchor and severity are independent axes. A P2 finding can be anchor `100` if the evidence is airtight; a P0 finding can be anchor `50` if it is an important concern you could not fully verify. Anchor gates where the finding surfaces (drop / FYI / actionable); severity orders it within the actionable surface.
Synthesis drops anchors `0` and `25` silently; anchor `50` routes to the FYI subsection; anchors `75` and `100` enter the actionable tier (walk-through, proposed fixes, safe_auto when `autofix_class` also warrants).
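A minimal validation sketch of the anchor contract, assuming a standalone Python check (names are illustrative; the real validation happens in synthesis step 3.1):
```python
VALID_ANCHORS = (0, 25, 50, 75, 100)

def is_valid_anchor(value) -> bool:
    """Hypothetical check mirroring the schema's integer enum."""
    # 72, 0.85, and "high" all fail: the enum admits exactly five integers.
    return isinstance(value, int) and not isinstance(value, bool) and value in VALID_ANCHORS
```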
Example of a schema-valid finding (all required fields, correct enum values, correct array shape):
```json
@@ -39,7 +53,7 @@ Example of a schema-valid finding (all required fields, correct enum values, cor
"finding_type": "omission",
"autofix_class": "gated_auto",
"suggested_fix": "Require Units 1-4 to land in a single atomic PR, or define the sequence explicitly.",
"confidence": 0.92,
"confidence": 100,
"evidence": [
"If the migration runs before Units 1-3 land, the code reads stale data.",
"If after, new code temporarily sees old entries until migration runs."
@@ -47,10 +61,12 @@ Example of a schema-valid finding (all required fields, correct enum values, cor
}
```
The `confidence: 100` in the example is justified because all three anchor-100 criteria hold: the reviewer double-checked (the plan literally names both orderings and resolves neither), the evidence directly confirms the outcome (quoted text shows each branch produces incorrect state), and the issue will happen frequently in practice (every deploy is subject to it).
Rules:
- You are a leaf reviewer inside an already-running compound-engineering review workflow. Do not invoke compound-engineering skills or agents unless this template explicitly instructs you to. Perform your analysis directly and return findings in the required output format only.
- Suppress any finding below your stated confidence floor (see your Confidence calibration section).
- Suppress any finding you cannot honestly anchor at `50` or higher (the actionable floor is `50`; anchors `0` and `25` are suppressed by synthesis anyway, so emitting them only adds noise). If your persona's domain description sets a stricter floor (e.g., anchor `75` minimum), honor it.
- Every finding MUST include at least one evidence item — a direct quote from the document.
- You are operationally read-only. Analyze the document and produce findings. Do not edit the document, create files, or make changes. You may use non-mutating tools (file reads, glob, grep, git log) to gather context about the codebase when evaluating feasibility or existing patterns.
- **Exclude prior-round deferred entries from review scope.** If the document under review contains a `## Deferred / Open Questions` section or subsections such as `### From YYYY-MM-DD review`, ignore that content — it is review output from prior rounds, not part of the document's actual plan/requirements content. Do not flag entries inside it as new findings. Do not quote its text as evidence. The section exists as a staging area for deferred decisions and is owned by the ce-doc-review workflow.
@@ -88,7 +104,7 @@ The `why_it_matters` field is how the reader — a developer triaging findings,
- **Lead with observable consequence.** Describe what goes wrong from the reader's or implementer's perspective — what breaks, what gets misread, what decision gets made wrong, what the downstream audience experiences. Do not lead with document structure ("Section X on line Y says..."). Start with the effect ("Implementers will disagree on which tier applies when..."). Section references appear later, only when the reader needs them to locate the issue.
- **Explain why the fix resolves the problem.** If you include a `suggested_fix`, the `why_it_matters` should make clear why that specific fix addresses the root cause. When a similar pattern exists elsewhere in the document or codebase (a parallel section, an established convention, a cited code pattern), reference it so the recommendation is grounded in what the team has already chosen.
- **Keep it tight.** Approximately 2-4 sentences. Longer framings are a regression — downstream surfaces have narrow display budgets, and verbose content gets truncated or skimmed.
- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging (confidence at or above your persona's floor), you can explain it — the field exists because every finding needs a reason.
- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging at anchor `50` or higher, you can explain it — the field exists because every finding needs a reason.
Illustrative pair — same finding, weak vs. strong framing:
@@ -106,14 +122,21 @@ STRONG (observable consequence first, grounded fix reasoning):
Synthesis already lacks a route for it.
```
False-positive categories to actively suppress:
False-positive categories to actively suppress. Do NOT emit a finding when any of these apply — not even at anchor `25` or `50`. These are not edge cases you should route to FYI; they are non-findings.
- Pedantic style nitpicks (word choice, bullet vs. numbered lists, comma-vs-semicolon) — style belongs to the document author
- Issues that belong to other personas (see your Suppress conditions)
- Findings already resolved elsewhere in the document (search before flagging)
- Content inside `## Deferred / Open Questions` sections (prior-round review output, not document content)
- **Pedantic style nitpicks** (word choice, bullet vs. numbered lists, comma-vs-semicolon, em-dash vs en-dash) — style belongs to the document author
- **Issues that belong to other personas** (see your Suppress conditions at the top of your persona prompt) — surfacing another persona's territory inflates the Coverage table and forces synthesis to dedup work that should not exist
- **Findings already resolved elsewhere in the document** — search the document before flagging. If the concern is addressed in a later section, the earlier section's apparent omission is not a real finding
- **Content inside `## Deferred / Open Questions` sections** — prior-round review output, not document content. This is the ce-doc-review workflow's own staging area
- **Pre-existing issues the document did not introduce** — if the concern exists in the codebase or organizational context independent of this document's proposal, flagging it here is scope creep
- **Speculative future-work concerns with no current signal** — "what if requirements change" / "this might need rework later" are not findings unless the document itself introduces the risk
- **Theoretical concerns without baseline data** — scalability worries without current scale numbers, performance worries without current latency measurements, edge cases without evidence the edge is reachable
- **Changes in functionality that are likely intentional** — if the document is explicitly making a design choice different from a precedent you noticed, that is a decision, not an error. Flag only when the document appears unaware of the precedent
- **Issues that a linter, typechecker, or validator would catch** — spelling in identifiers, JSON syntax errors, YAML indentation. These surface automatically elsewhere; the review layer adds value by catching what tools cannot
**Advisory observations — route to FYI, do not force a decision.** If the honest answer to "what actually breaks if we don't fix this?" is "nothing breaks, but…", the finding is advisory. Ask: would a competent implementer hit a wrong outcome, a production bug, a misleading plan, or rework later? If no, set confidence in the **0.40-0.59 LOW/Advisory band** so synthesis routes the finding to FYI rather than surfacing it as a manual decision. Do not suppress — the observation still has value; it just does not warrant user judgment. Typical advisory shapes: naming asymmetry with no wrong answer, stylistic preference without evidence of impact, speculative future-work concern with no current signal, subjective readability note, theoretical scalability concern without baseline data, "could also be split" organizational preference when the current split is not broken.
**Advisory observations — route to FYI, do not force a decision.** If the honest answer to "what actually breaks if we don't fix this?" is "nothing breaks, but…", the finding is advisory. Ask: would a competent implementer hit a wrong outcome, a production bug, a misleading plan, or rework later? If no, set `confidence: 50` so synthesis routes the finding to the FYI subsection rather than surfacing it as a decision or proposed fix. Do not suppress — the observation still has value; it just does not warrant user judgment. Typical advisory shapes: naming asymmetry with no wrong answer, subjective readability note about non-stylistic content (e.g., a definition placed before the term it defines), "could also be split" organizational preference when the current split is not broken. Style belongs to the false-positive catalog above, not here — pedantic style nitpicks suppress entirely.
**Precedence over the false-positive catalog.** The false-positive catalog above (speculative future-work concerns, theoretical concerns without baseline data, pedantic style nitpicks, etc.) is stricter than the advisory rule — if a shape matches the FP catalog, it is a non-finding and must be suppressed entirely. Do NOT route it to anchor `50` / FYI. The advisory rule applies only to shapes that are NOT in the FP catalog.
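A sketch of that precedence order, with the two tests passed in as booleans because the underlying criteria are judgment calls rather than code (all names are illustrative):
```python
def emit_decision(matches_fp_catalog: bool, is_advisory: bool,
                  rubric_anchor: int) -> tuple[str, int | None]:
    """Hypothetical sketch: the FP catalog wins over the advisory rule."""
    if matches_fp_catalog:            # pedantic style, speculative future work, ...
        return ("suppress", None)     # a non-finding at any anchor; never routed to FYI
    if is_advisory:                   # honest answer: "nothing breaks, but..."
        return ("emit", 50)           # synthesis routes anchor 50 to the FYI subsection
    return ("emit", rubric_anchor)    # anchor per the behavioral rubric above
```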
</output-contract>
<review-context>

View File

@@ -12,23 +12,25 @@ Check each agent's returned JSON against the findings schema:
- Drop findings with invalid enum values (including the pre-rename `auto` / `present` values from older personas — treat those as malformed until all persona output has been regenerated)
- Note the agent name for any malformed output in the Coverage section
### 3.2 Confidence Gate (Per-Severity)
**Do not narrate remap / validation diagnostics to the user.** Schema-drift notes ("persona X returned unknown enum Y, remapped to Z"), persona-prompt-drift commentary, and other validator-internal diagnostics are maintainer-facing information. They do not belong in the Phase 4 output the user reads. If a persona's output is malformed, the only user-visible consequence is a Coverage-row annotation (e.g., the persona shows fewer findings or a `malformed` marker). Everything else stays internal.
Gate findings using per-severity thresholds rather than a single flat floor:
### 3.2 Confidence Gate (Anchor-Based)
| Severity | Gate |
|----------|------|
| P0 | 0.50 |
| P1 | 0.60 |
| P2 | 0.65 |
| P3 | 0.75 |
Gate findings by their `confidence` anchor value. Anchors are discrete integers (`0`, `25`, `50`, `75`, `100`) with behavioral definitions documented in `references/findings-schema.json` and embedded in the persona rubric (`references/subagent-template.md`). This replaces the prior continuous 0.0-1.0 scale and its per-severity gates — doc-review economics do not warrant threshold gradation by severity, and coarse anchors prevent false-precision gaming.
Findings at or above their severity's gate survive for classification. Findings below the gate are evaluated for FYI-eligibility:
| Anchor | Meaning | Route |
|--------|---------|-------|
| `0` | False positive or pre-existing issue | Drop silently |
| `25` | Might be real but could not verify | Drop silently |
| `50` | Verified real but nitpick / advisory / not very important | Surface in FYI subsection |
| `75` | Double-checked, will hit in practice, directly impacts correctness | Enter actionable tier (classify by `autofix_class`) |
| `100` | Evidence directly confirms; will happen frequently | Enter actionable tier (classify by `autofix_class`) |
- **FYI-eligible** when `autofix_class` is `manual` and confidence is between 0.40 (FYI floor) and the severity gate. These surface in an FYI subsection at the presentation layer (see 3.7) but do not enter the walk-through or any bulk action — they exist as observational value without forcing a decision.
- **Dropped** when confidence is below 0.40, or when the finding is `safe_auto` / `gated_auto` but below the severity gate (auto-apply findings need confidence above the decision gate to avoid silent mistakes).
- **Dropped silently** (anchors `0` and `25`): these do not surface in any output bucket — not as findings, not as FYI observations, not as residual concerns. Record the total drop count as a Coverage footnote line when non-zero: `Dropped: N (anchors 0/25 suppressed)`. The footnote appears below the Coverage table, alongside the `Chains:` footnote when both apply. This is the canonical location for drop-count reporting — not the summary line and not a per-persona Coverage column. Omit the footnote when N is zero.
- **FYI-subsection** (anchor `50`): surface in the presentation layer's FYI subsection regardless of `autofix_class`. These do not enter the walk-through or any bulk action — observational value without forcing a decision. Advisory observations ("nothing breaks, but...") naturally land here.
- **Actionable** (anchors `75` and `100`): enter the classification pipeline. Route by `autofix_class` (see 3.7).
Record the drop count and the FYI count in Coverage.
**Why this threshold, not Anthropic's ≥ 80 code-review threshold:** Document review has opposite economics from code review. There is no linter backstop — the review IS the backstop. Premise-level concerns (product-lens, adversarial) naturally cap at anchors 50-75 because "is the motivation valid?" cannot be verified against ground truth. The routing menu already makes dismissal cheap (Skip, Append to Open Questions), so surfaced-and-skipped is a low-cost outcome while missed-and-shipped derails downstream implementation. Filter low (`≥ 50`) and let the routing menu handle volume.
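A table-driven sketch of the gate, assuming findings arrive as dicts whose `confidence` already passed 3.1 validation (names are illustrative):
```python
ANCHOR_ROUTE = {0: "drop", 25: "drop", 50: "fyi", 75: "actionable", 100: "actionable"}

def gate(findings: list[dict]) -> tuple[list[dict], list[dict], int]:
    """Hypothetical sketch of 3.2: split validated findings by anchor."""
    actionable, fyi, dropped = [], [], 0
    for f in findings:
        route = ANCHOR_ROUTE[f["confidence"]]
        if route == "drop":
            dropped += 1          # surfaces only as the `Dropped:` Coverage footnote
        elif route == "fyi":
            fyi.append(f)         # FYI subsection; never walk-through or bulk action
        else:
            actionable.append(f)  # classified by autofix_class in 3.7
    return actionable, fyi, dropped
```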
### 3.3 Deduplicate
@@ -37,14 +39,38 @@ Fingerprint each finding using `normalize(section) + normalize(title)`. Normaliz
When fingerprints match across personas:
- If the findings recommend opposing actions (e.g., one says cut, the other says keep), do not merge — preserve both for contradiction resolution in 3.5
- Otherwise merge: keep the highest severity, keep the highest confidence, union all evidence arrays, note all agreeing reviewers (e.g., "coherence, feasibility")
- **Coverage attribution:** Attribute the merged finding to the persona with the highest confidence. Decrement the losing persona's Findings count and the corresponding route bucket so totals stay exact.
- Otherwise merge: keep the highest severity, keep the highest confidence anchor (if tied, keep the finding appearing first in document order — deterministic, not probabilistic), union all evidence arrays, note all agreeing reviewers (e.g., "coherence, feasibility")
- **Coverage attribution:** Attribute the merged finding to the persona with the highest confidence anchor. If anchors tie, attribute to the persona whose entry appeared first in document order. Decrement the losing persona's Findings count and the corresponding route bucket so totals stay exact.
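A sketch of the fingerprint-and-merge rule. The `normalize` body is a guess at the normalization the prose describes, and severity strings are assumed to be `"P0"`-`"P3"` so lexical `min` keeps the highest severity:
```python
import re

def normalize(text: str) -> str:
    """Hypothetical: lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def fingerprint(finding: dict) -> str:
    return normalize(finding["section"]) + "|" + normalize(finding["title"])

def merge(first: dict, second: dict) -> dict:
    """Merge two same-fingerprint, non-contradicting findings (3.3).

    `first` is the finding appearing first in document order, so a tied
    anchor deterministically keeps it.
    """
    winner, loser = (first, second) if first["confidence"] >= second["confidence"] else (second, first)
    winner["severity"] = min(first["severity"], second["severity"])  # "P0" < "P1" lexically
    winner["evidence"] = winner["evidence"] + loser["evidence"]      # union evidence arrays
    winner["reviewers"] = sorted(set(first.get("reviewers", []) + second.get("reviewers", [])))
    return winner
```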
### 3.4 Cross-Persona Agreement Boost
### 3.3b Same-Persona Premise Redundancy Collapse
When 2+ independent personas flagged the same merged finding (from 3.3), boost the merged finding's confidence by +0.10 (capped at 1.0). Independent corroboration is strong signal — multiple reviewers converging on the same issue is more reliable than any single reviewer's confidence. Note the boost in the Reviewer column of the output (e.g., "coherence, feasibility +0.10").
A single persona sometimes files multiple findings that share the same root premise expressed at different sections or wrapped in different framing (e.g., product-lens firing five variants of "motivation is weak" attached to Motivation, Unit 4b, Key Technical Decisions, and two other sections). Cross-persona dedup (3.3) does not catch this — it fingerprints on section+title, which differ even when the underlying concern is the same. Surfacing all N variants over-weights one persona's perspective relative to the other five and inflates the P2 Decisions tier with near-duplicate signal.
This replaces the earlier residual-concern promotion step. Findings below the confidence gate are not promoted back into the review surface; they appear in Coverage as residual concerns only. If a below-gate finding is genuinely important, the reviewer should raise their confidence or provide stronger evidence rather than relying on a promotion rule.
For each persona, cluster that persona's surviving findings by shared root premise. A cluster forms when 3 or more findings from the same persona share:
- The same `finding_type` (error or omission)
- Substantially overlapping `why_it_matters` phrasing (same key nouns/verbs signaling the same concern, e.g., "motivation", "justification", "premise unsupported", "scope creep")
- Fixes that would all be obviated by the same upstream decision (e.g., "add the triggering incident" would moot all five motivation-weakness findings)
For each cluster of size N ≥ 3:
- Keep the single finding with the strongest evidence (highest confidence anchor, or if tied, the one citing the most concrete document reference)
- Demote the remaining N-1 findings to FYI-subsection status (anchor `50`), regardless of their original anchor
- On the kept finding, note in the Reviewer column that the persona raised N-1 related variants (e.g., `product-lens (+4 related variants demoted to FYI)`)
This runs per-persona before the 3.4 cross-persona promotion. Cross-persona agreement on the *kept* finding still qualifies for the anchor-step promotion in 3.4; demoted variants do not participate in cross-persona promotion (they are observational only after collapse).
Do NOT collapse across personas at this step — different personas surfacing the same concern is exactly the independence signal the cross-persona boost rewards. Collapse applies within one persona's output only.
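A structural sketch of the collapse. `shares_premise` stands in for the three criteria above, which require judgment rather than string comparison, and the evidence-concreteness tiebreak is omitted (all names are illustrative):
```python
def collapse_premise_clusters(persona_findings: list[dict], shares_premise) -> list[dict]:
    """Hypothetical sketch of 3.3b: demote N-1 variants of one shared premise."""
    remaining = list(persona_findings)
    out = []
    while remaining:
        seed = remaining.pop(0)
        cluster = [seed] + [f for f in remaining if shares_premise(seed, f)]
        if len(cluster) < 3:                           # below the collapse threshold
            out.append(seed)
            continue
        ids = {id(f) for f in cluster}
        remaining = [f for f in remaining if id(f) not in ids]
        cluster.sort(key=lambda f: -f["confidence"])   # keep the strongest evidence
        kept, demoted = cluster[0], cluster[1:]
        for f in demoted:
            f["confidence"] = 50          # FYI-subsection status, whatever the original anchor
            f["demoted_variant"] = True   # excluded from the 3.4 promotion
        kept["reviewer_note"] = f"(+{len(demoted)} related variants demoted to FYI)"
        out.append(kept)
        out.extend(demoted)
    return out
```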
### 3.4 Cross-Persona Agreement Promotion
When 2+ independent personas flagged the same merged finding (from 3.3), promote the merged finding's anchor by one step: `50 → 75`, `75 → 100`. Anchor `100` does not promote further (already at the ceiling). Findings at anchors `0` or `25` do not reach this step (they were dropped in 3.2).
Independent corroboration is strong signal — multiple reviewers converging on the same issue is more reliable than any single reviewer's anchor. Promoting by one anchor step is semantically meaningful (a "verified but nitpick" finding that two personas independently surface is plausibly "will hit in practice"). This replaces the prior `+0.10` boost — the magic-number bump was calibrated to the continuous scale and no longer applies.
Note the promotion in the Reviewer column of the output (e.g., `coherence, feasibility (+1 anchor)`).
This replaces the earlier residual-concern promotion step. Findings at anchors `0` / `25` are not promoted back into the review surface; they appear only as drop counts in Coverage. If a dropped finding is genuinely important, the reviewer should raise their anchor to `50` or higher through stronger evidence rather than relying on a promotion rule.
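The promotion itself reduces to a small lookup; a sketch, assuming the `demoted_variant` flag from 3.3b and a `reviewers` list from the 3.3 merge (names are illustrative):
```python
PROMOTE = {50: 75, 75: 100, 100: 100}  # one anchor step; 100 is the ceiling

def apply_agreement_promotion(finding: dict) -> None:
    """Hypothetical sketch of 3.4. Anchors 0/25 never reach this step."""
    if finding.get("demoted_variant"):            # 3.3b variants are observational only
        return
    if len(finding.get("reviewers", [])) >= 2:    # 2+ independent personas agree
        finding["confidence"] = PROMOTE[finding["confidence"]]
        finding["reviewer_note"] = "(+1 anchor)"  # rendered in the Reviewer column
```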
### 3.5 Resolve Contradictions
@@ -113,7 +139,7 @@ If multiple candidates match the criteria, elevate ALL of them. The criteria abo
If none match, skip the rest of this step — no chains exist.
**Dependent assignment under multiple roots.** When multiple roots exist and a candidate dependent could plausibly link to more than one, assign it to the root whose rejection most directly dissolves the dependent's concern. If ambiguity remains, assign to the higher-confidence root. A dependent never links to more than one root — a single `depends_on` value.
**Dependent assignment under multiple roots.** When multiple roots exist and a candidate dependent could plausibly link to more than one, assign it to the root whose rejection most directly dissolves the dependent's concern. If ambiguity remains, assign to the root with the higher confidence anchor; if anchors tie, assign to the root appearing first in document order. A dependent never links to more than one root — a single `depends_on` value.
**Step 2: Identify dependents.** For each candidate root, scan the remaining findings for dependents. The predicate must match the cascade trigger in `references/walkthrough.md` — dependents cascade when the user rejects (Skip/Defer) the root, so dependency is defined on the rejection branch, not the acceptance branch. A finding is a dependent of a root when:
@@ -131,9 +157,9 @@ Test with the substitution check: "If the user rejects the root (Skip/Defer), do
When uncertain, default to NOT linking. A mis-linked chain hides a real issue; leaving a finding unlinked only costs one extra decision.
**Step 4: Annotate.** On each dependent, record `depends_on: <root_finding_id>` (use section + normalized title as the id). On each root, record `dependents: [<dependent_ids>]`. Cap `dependents` at 6 entries per root — if more than 6 candidates link to the same root, keep the top 6 by severity then confidence and leave the rest unlinked (over-aggressive chaining risks obscuring independent concerns).
**Step 4: Annotate.** On each dependent, record `depends_on: <root_finding_id>` (use section + normalized title as the id). On each root, record `dependents: [<dependent_ids>]`. Cap `dependents` at 6 entries per root — if more than 6 candidates link to the same root, keep the top 6 by severity, then confidence anchor (descending), then document order as the deterministic final tiebreak; leave the rest unlinked (over-aggressive chaining risks obscuring independent concerns).
Do NOT reclassify, re-route, or change confidence of any finding in this step. Linking is purely annotative; the walk-through and presentation use the annotation, synthesis proper does not.
Do NOT reclassify, re-route, or change the confidence anchor of any finding in this step. Linking is purely annotative; the walk-through and presentation use the annotation, synthesis proper does not.
**Step 5: Report in Coverage.** Add a line to the coverage summary: `Chains: N root(s) with M total dependents`. When N = 0, omit the line.
@@ -177,12 +203,19 @@ Do not promote if the finding involves scope or priority changes where the autho
**Severity and autofix_class are independent.** A P1 finding can be `safe_auto` if the correct fix is obvious. The test is not "how important?" but "is there one clear correct fix, or does this require judgment?"
| Autofix Class | Route |
|---------------|-------|
| `safe_auto` | Apply silently in Phase 4. Requires `suggested_fix`. Demote to `gated_auto` if missing. |
| `gated_auto` | Enter the per-finding walk-through with Apply marked (recommended). Requires `suggested_fix`. Demote to `manual` if missing. |
| `manual` | Enter the per-finding walk-through with user-judgment framing. `suggested_fix` is optional. |
| FYI-subsection | `manual` findings below the severity gate but at or above the FYI floor (0.40) — surface in a distinct FYI subsection of the presentation, do not enter the walk-through or any bulk action. |
**Anchor and autofix_class are also independent.** Anchor gates the finding into a surface (FYI vs actionable); `autofix_class` decides what the actionable surface does with it. Both are consulted in this step.
Findings reaching 3.7 have already been gated to anchors `50`, `75`, or `100` by 3.2 (anchors `0` and `25` were dropped).
| Anchor | Autofix Class | Route |
|--------|---------------|-------|
| `100` | `safe_auto` | Apply silently in Phase 4. Requires `suggested_fix`. Demote to `gated_auto` if missing. |
| `100` | `gated_auto` | Enter the per-finding walk-through with Apply marked (recommended). Requires `suggested_fix`. Demote to `manual` if missing. |
| `100` | `manual` | Enter the per-finding walk-through with user-judgment framing. `suggested_fix` is optional. |
| `75` | `safe_auto` | Demote to `gated_auto` before routing — silent apply is reserved for anchor `100` findings where evidence directly confirms the fix. Enter the walk-through with Apply marked (recommended). |
| `75` | `gated_auto` | Enter the per-finding walk-through with Apply marked (recommended). Requires `suggested_fix`. Demote to `manual` if missing. |
| `75` | `manual` | Enter the per-finding walk-through with user-judgment framing. `suggested_fix` is optional. |
| `50` | any | Surface in the FYI subsection regardless of `autofix_class`. Do not enter the walk-through or any bulk action. These are observations, not decisions. |
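A sketch of the routing table as a function, folding both demotion rules into the cascade (names are illustrative; the table above is authoritative):
```python
def route(anchor: int, autofix_class: str, has_suggested_fix: bool) -> str:
    """Hypothetical sketch of 3.7 for findings that survived the 3.2 gate."""
    if anchor == 50:
        return "fyi"                          # regardless of autofix_class
    if autofix_class == "safe_auto" and (anchor < 100 or not has_suggested_fix):
        autofix_class = "gated_auto"          # silent apply needs anchor 100 plus a fix
    if autofix_class == "safe_auto":
        return "apply_silently"
    if autofix_class == "gated_auto" and not has_suggested_fix:
        autofix_class = "manual"              # no suggested_fix: demote again
    if autofix_class == "gated_auto":
        return "walkthrough_apply_recommended"
    return "walkthrough_user_judgment"        # manual
```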
**Auto-eligible patterns for safe_auto:** summary/detail mismatch (body authoritative over overview), wrong counts, missing list entries derivable from elsewhere in the document, stale internal cross-references, terminology drift, prose/diagram contradictions where prose is more detailed, missing steps mechanically implied by other content, unstated thresholds implied by surrounding context.
@@ -190,17 +223,33 @@ Do not promote if the finding involves scope or priority changes where the autho
### 3.8 Sort
Sort findings for presentation: P0 → P1 → P2 → P3, then by finding type (errors before omissions), then by confidence (descending), then by document order (section position).
Sort findings for presentation: P0 → P1 → P2 → P3, then by finding type (errors before omissions), then by confidence anchor (descending: `100` first, then `75`, then `50`), then by document order (section position) as the deterministic final tiebreak.
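As a sort key, assuming each finding carries a `doc_position` index (names are illustrative):
```python
SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
TYPE_RANK = {"error": 0, "omission": 1}

def presentation_key(f: dict) -> tuple:
    """Hypothetical 3.8 sort key: lower tuples sort first."""
    return (SEVERITY_RANK[f["severity"]],
            TYPE_RANK[f["finding_type"]],
            -f["confidence"],           # 100, then 75, then 50
            f["doc_position"])          # deterministic final tiebreak

# usage: findings.sort(key=presentation_key)
```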
### 3.9 Suppress Restatements in Residual Concerns and Deferred Questions
Persona outputs carry `residual_risks` and `deferred_questions` arrays alongside `findings`. After the actionable-tier set is finalized (post-3.7 routing), personas often re-surface the same substance in their residual/deferred arrays — the persona's own finding and the persona's own residual concern are about the same issue. Rendering both sections verbatim inflates the output with restatements that carry no new signal.
For every `residual_risk` and `deferred_question` across all persona outputs, check against the finalized actionable-finding set (findings at confidence anchor `75` or `100`, plus FYI-subsection findings at anchor `50`). Drop the residual/deferred item if either of these holds:
- **Section-and-substance overlap.** The residual/deferred item names the same section as an actionable finding AND its substance fuzzy-matches the finding's `title` or `why_it_matters` (shared key nouns/verbs indicating the same concern).
- **Question form of an actionable finding.** A deferred question whose subject is directly answered by or obviated by an actionable finding's recommendation. Example: actionable finding "Motivation cites no real incident" → deferred question "Is there a concrete triggering event?" — the finding already raised this; the question restates it interrogatively.
Do NOT drop residual/deferred items that introduce genuinely new signal (a concern or question the actionable findings do not touch). When in doubt, keep — this pass is for obvious restatements, not borderline calls.
Run this pass on the merged set across all personas. Record the count dropped as a Coverage footnote line when non-zero: `Restated: N (residual/deferred items suppressed as duplicates of actionable findings)`. Ordering: footnotes appear in the sequence `Dropped:`, `Chains:`, `Restated:` below the Coverage table, each on its own line. Omit any footnote whose count is zero.
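A sketch of the suppression pass. `fuzzy_matches` stands in for the shared-key-nouns/verbs judgment, and only the section-and-substance criterion is modeled — the question-form criterion resists mechanical expression (all names are illustrative):
```python
def suppress_restatements(items: list[dict], actionable: list[dict],
                          fuzzy_matches) -> tuple[list[dict], int]:
    """Hypothetical sketch of 3.9 over the merged residual/deferred set."""
    kept, dropped = [], 0
    for item in items:
        restated = any(
            item["section"] == f["section"] and (
                fuzzy_matches(item["text"], f["title"])
                or fuzzy_matches(item["text"], f["why_it_matters"]))
            for f in actionable
        )
        if restated:
            dropped += 1         # feeds the `Restated: N (...)` Coverage footnote
        else:
            kept.append(item)    # genuinely new signal; when in doubt, keep
    return kept, dropped
```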
## Phase 4: Apply and Present
**User-facing vocabulary rule (applies to ALL user-visible output in Phase 4, not just the rendered template).** Internal enum values — `safe_auto`, `gated_auto`, `manual`, `FYI` — stay inside the schema and synthesis prose. Every word the user sees in Phase 4 output, including free-text narration between sections, transition preambles, status lines, and confirmation messages, MUST use user-facing vocabulary: "fixes" (for `safe_auto`), "proposed fixes" (for `gated_auto`), "decisions" (for `manual` findings at anchor `75` or `100`), "FYI observations" (for any finding at anchor `50`). The only exception is the `Tier` column in rendered tables, which is explicitly documented as surfacing the internal enum for transparency. Do NOT emit narration like "safe_auto fixes applied" or "N safe_auto findings" — write "fixes applied" or "N fixes" instead.
### Apply safe_auto fixes
Apply all `safe_auto` findings to the document in a single pass:
Apply only `safe_auto` findings **at confidence anchor `100`** to the document in a single pass. This matches the 3.7 routing table: anchor `100` + `safe_auto` silent-applies; anchor `75` + `safe_auto` was demoted to `gated_auto` in 3.7 and enters the walk-through instead; anchor `50` + any `autofix_class` routes to FYI and must never auto-apply.
- Edit the document inline using the platform's edit tool
- Track what was changed for the "Applied fixes" section in the rendered output (`safe_auto` is the internal enum; the rendered section header reads "Applied fixes")
- Do not ask for approval — these have one clear correct fix
- Do not ask for approval — these have one clear correct fix AND evidence that directly confirms it (anchor `100`)
- Do NOT silent-apply any `safe_auto` finding at anchor `75` or `50`. If a finding reaches this step with `autofix_class: safe_auto` and anchor below `100`, the 3.7 routing rule was not applied correctly; re-run 3.7 for that finding before continuing.
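A defensive sketch of that guard (names are illustrative):
```python
def assert_safe_to_apply(finding: dict) -> None:
    """Hypothetical guard run before each silent apply in Phase 4."""
    if finding["autofix_class"] != "safe_auto" or finding["confidence"] != 100:
        # 3.7 routing was not applied correctly; re-route instead of applying.
        raise RuntimeError("re-run 3.7 for this finding before continuing")
```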
List every applied fix in the output summary so the user can see what changed. Use enough detail to convey the substance of each fix (section, what was changed, reviewer attribution). This is especially important for fixes that add content or touch document meaning — the user should not have to diff the document to understand what the review did.
@@ -208,7 +257,7 @@ List every applied fix in the output summary so the user can see what changed. U
After safe_auto fixes apply, remaining findings split into buckets:
- `gated_auto` and `manual` findings at or above the severity gate → enter the routing question (see Unit 5 / `references/walkthrough.md`)
- `gated_auto` and `manual` findings at confidence anchor `75` or `100` → enter the routing question (see Unit 5 / `references/walkthrough.md`)
- FYI-subsection findings → surface in the presentation only, no routing
- Zero actionable findings remaining → skip the routing question; flow directly to Phase 5 terminal question
@@ -223,25 +272,25 @@ Applied N fixes:
Proposed fixes (concrete fix, requires user confirmation):
[P0] Section: <section> — <title> (<reviewer>, confidence <N>)
[P0] Section: <section> — <title> (<reviewer>, confidence <anchor>)
Why: <why_it_matters>
Suggested fix: <suggested_fix>
Decisions (requires user judgment):
[P1] Section: <section> — <title> (<reviewer>, confidence <N>)
[P1] Section: <section> — <title> (<reviewer>, confidence <anchor>)
Why: <why_it_matters>
Suggested fix: <suggested_fix or "none">
Dependents (would resolve if this root is rejected):
[P2] Section: <section> — <title> (<reviewer>, confidence <N>)
[P2] Section: <section> — <title> (<reviewer>, confidence <anchor>)
Why: <why_it_matters>
[P2] Section: <section> — <title> (<reviewer>, confidence <N>)
[P2] Section: <section> — <title> (<reviewer>, confidence <anchor>)
Why: <why_it_matters>
FYI observations (low-confidence, no decision required):
FYI observations (anchor 50, no decision required):
[P3] Section: <section> — <title> (<reviewer>, confidence <N>)
[P3] Section: <section> — <title> (<reviewer>, confidence <anchor>)
Why: <why_it_matters>
Residual concerns:
@@ -250,10 +299,16 @@ Residual concerns:
Deferred questions:
- <question> (<source>)
Dropped: N (anchors 0/25 suppressed)
Chains: N root(s) with M dependents
Restated: N (residual/deferred items suppressed as duplicates of actionable findings)
Review complete
```
Omit any section with zero items. The section headers reflect user-facing vocabulary: the "Proposed fixes" bucket carries `gated_auto` findings (the persona has a concrete fix; the user confirms), "Decisions" carries above-gate `manual` findings (judgment calls), and "FYI observations" carries `manual` findings between the 0.40 FYI floor and the per-severity gate. When a root has dependents, render the root at its normal position in the severity-sorted list and nest its dependents as an indented `Dependents (...)` sub-block immediately below. Do not re-list dependents at their own severity position — they appear only under their root. End with "Review complete" as the terminal signal so callers can detect completion.
Omit any section with zero items. The section headers reflect user-facing vocabulary: the "Proposed fixes" bucket carries `gated_auto` findings at anchor `75` or `100` (the persona has a concrete fix; the user confirms), "Decisions" carries `manual` findings at anchor `75` or `100` (judgment calls), and "FYI observations" carries any finding at anchor `50` regardless of `autofix_class`. When a root has dependents, render the root at its normal position in the severity-sorted list and nest its dependents as an indented `Dependents (...)` sub-block immediately below. Do not re-list dependents at their own severity position — they appear only under their root. End with "Review complete" as the terminal signal so callers can detect completion.
**Compact rendering for FYI observations, residual concerns, and deferred questions (high-count mode).** When the combined count of these three buckets is 5 or more, collapse each to a one-line count followed by a tight bullet list without per-item `Why` expansion. Actionable buckets (Proposed fixes / Decisions) remain fully rendered regardless. This mirrors the interactive-mode rule in `references/review-output-template.md` so both modes produce the same shape.
**Interactive mode:**
@@ -266,6 +321,8 @@ Brief summary at the top: "Applied N fixes. K items need attention (X errors, Y
Include the Coverage table, applied fixes, FYI observations (as a distinct subsection), residual concerns, and deferred questions.
**All tables MUST be pipe-delimited markdown (`| col | col |`). Do NOT use ASCII box-drawing characters (`┌ ┬ ┐ ├ ┼ ┤ └ ┴ ┘ │ ─`) under any circumstances, including for the Coverage table.** This rule restates the template's formatting requirement at the point of rendering so it cannot drift. Pipe-delimited tables render correctly across all target harnesses; box-drawing characters break rendering in some and violate the repo convention documented in root `AGENTS.md`.
### R29 Rejected-Finding Suppression (Round 2+)
When the orchestrator is running round 2+ on the same document in the same session, the decision primer (see `SKILL.md` — Decision primer) carries forward every prior-round Skipped, Deferred, and Acknowledged finding. Synthesis suppresses re-raised rejected findings rather than re-surfacing them to the user. Acknowledged is treated as a rejected-class decision here: the user saw the finding, chose not to act on it (no Apply, no Defer append), and wants it on record — equivalent to Skip for suppression purposes.

View File

@@ -25,7 +25,7 @@ D. Report only — take no further action
The per-finding `(recommended)` labeling lives inside the walk-through (option A) and the bulk preview (options B/C), where it's applied per-finding from synthesis step 3.5b's `recommended_action`. The routing question itself does not recommend one of A/B/C/D because the right route depends on user intent (engage / trust / triage / skim), not on the finding-set shape — a rule that mapped finding-set shape to routing recommendation (e.g., "most findings are Apply-shaped → recommend LFG") would pressure users toward automated paths in ways that conflict with the user-intent framing.
If all remaining findings are FYI-subsection-only (no `gated_auto` or above-gate `manual` findings), skip the routing question entirely and flow to the Phase 5 terminal question.
If all remaining findings are FYI-subsection-only (no `gated_auto` or `manual` findings at confidence anchor `75` or `100`), skip the routing question entirely and flow to the Phase 5 terminal question.
**Append-availability adaptation.** When `references/open-questions-defer.md` has cached `append_available: false` at Phase 4 start (e.g., read-only document, unwritable filesystem), option C is suppressed from the routing question because every per-finding Defer would fail into the open-questions failure path. The menu shows three options (A / B / D) and the stem appends one line explaining why (e.g., `Append to Open Questions unavailable — document is read-only in this environment.`). This mirrors the per-finding option B suppression described under "Adaptations" below — both routing-level and per-finding Defer paths share the same availability signal so the user never sees Defer surfaced at one level and omitted at the other.
@@ -42,7 +42,7 @@ If all remaining findings are FYI-subsection-only (no `gated_auto` or above-gate
The walk-through receives, from the orchestrator:
- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to `gated_auto` and `manual` findings that survived the per-severity confidence gate. FYI-subsection findings are not included — they surface in the final report only and have no walk-through entry.
- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to actionable findings (confidence anchor `75` or `100` with `autofix_class` `gated_auto` or `manual`). FYI-subsection findings (anchor `50`) are not included — they surface in the final report only and have no walk-through entry.
- The run id for artifact lookups (when applicable).
- Premise-dependency chain annotations from synthesis step 3.5c: each finding may carry `depends_on: <root_id>` or `dependents: [<ids>]`.
@@ -246,7 +246,7 @@ Every terminal path of Interactive mode emits the same completion report structu
### Minimum required fields
- **Per-finding entries:** for every finding the flow touched, a line with — at minimum — title, severity, the action taken (Applied / Deferred / Skipped / Acknowledged), the append location for Deferred entries, a one-line reason for Skipped entries (grounded in the finding's confidence or the one-line `why_it_matters` snippet), and the acknowledgement reason for Acknowledged entries (e.g., `Apply picked but no suggested_fix available`).
- **Per-finding entries:** for every finding the flow touched, a line with — at minimum — title, severity, the action taken (Applied / Deferred / Skipped / Acknowledged), the append location for Deferred entries, a one-line reason for Skipped entries (grounded in the finding's confidence anchor or the one-line `why_it_matters` snippet), and the acknowledgement reason for Acknowledged entries (e.g., `Apply picked but no suggested_fix available`).
- **Summary counts by action:** totals per bucket (e.g., `4 applied, 2 deferred, 2 skipped`). Include an `acknowledged` count when any entries land in that bucket; omit the label when the count is zero.
- **Failures called out explicitly:** any Apply that failed (e.g., document write error, or the defensive no-fix fallback skipping an Apply-set entry), any Open-Questions append that failed. Failures surface above the per-finding list so they are not missed.
- **End-of-review verdict:** carried over from Phase 4's Coverage section.
@@ -257,7 +257,7 @@ Failures first (above the per-finding list), then per-finding entries grouped by
### Zero-findings degenerate case
When the routing question was skipped because no `gated_auto` / above-gate `manual` findings remained after `safe_auto`, the completion report collapses to its summary-counts + verdict form with one added line — the count of `safe_auto` fixes applied. The summary wording:
When the routing question was skipped because no `gated_auto` / `manual` findings at confidence anchor `75` or `100` remained after `safe_auto`, the completion report collapses to its summary-counts + verdict form with one added line — the count of `safe_auto` fixes applied. The summary wording:
No FYI or residual concerns: