feat(doc-review, learnings-researcher): tiers, chain grouping, rewrite (#601)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trevin Chow
2026-04-19 20:25:47 -07:00
committed by GitHub
parent 409b07fbc7
commit c1f68d4d55
39 changed files with 3142 additions and 290 deletions


@@ -0,0 +1,157 @@
---
date: 2026-04-18
topic: ce-doc-review-autofix-and-interaction
---
# ce-doc-review Autofix and Interaction Overhaul
## Problem Frame
`ce-doc-review` consistently produces painful reviews. It surfaces too many findings as "requires judgment" when one reasonable fix exists, nitpicks on low-confidence items, and hands the user a wall of prose with only two terminal options — "refine and re-review" or "review complete." The interaction model lags behind what `ce-code-review` now offers (per PR #590): per-finding walk-through, LFG, bulk preview, tracker defer, and a recommendation-stable routing question.
A real-world review of a plan document produced **14 findings all routed to "needs judgment"** — including five P3 findings at 0.55–0.68 confidence, three concrete mechanical fixes that a competent implementer would arrive at independently, and one subjective filename-symmetry observation that didn't need a decision at all. The user had to parse 14 prose blocks, pick answers, and then was forced into a re-review regardless of how little the edits actually changed.
The gaps are structural and line up with four observable failure modes:
1. **Classification is binary and coarse.** `autofix_class` is `auto` or `present`. There is no `gated_auto` tier (concrete fix, minor sign-off) and no `advisory` tier (report-only FYI). Everything that isn't "one clear correct fix with zero judgment" becomes `present`, which conflates high-stakes strategic decisions with small mechanical follow-ups.
2. **Confidence gate is flat and too low.** A single 0.50 threshold across all severities lets borderline P3s through. `ce-code-review` moved to 0.60 with P0-only survival at 0.50+.
3. **"Reasonable alternative" test is permissive.** Persona reviewers list `(a) / (b) / (c)` fix options where (b) and (c) are strawmen ("accept the regression," "document in release notes," "do nothing"). The classification rule reads those as multiple reasonable fixes and routes the finding to `present`, when in fact only (a) is a real option.
4. **Subagent framing and interaction model are pre-PR-590.** No observable-behavior-first framing guidance, no walk-through, no bulk preview, no per-severity confidence calibration, no post-fix "apply and proceed" exit — every path that addresses findings forces a re-review, even when the user is done.
## Requirements
**Classification tiers**
- R1. `autofix_class` expands from two values to four: `auto`, `gated_auto`, `advisory`, `present`. Values preserve the existing "is there one correct fix" axis but add (a) a tier for concrete fixes that touch document scope / meaning and should be user-confirmed (`gated_auto`), and (b) a tier for report-only observations with no decision to make (`advisory`).
- R2. `auto` findings are applied silently, same as today. The promotion rules in the synthesis pipeline (current steps 3.6 and 3.7) are sharpened per R4 below and carry the new strictness forward.
- R3. `gated_auto` findings carry a concrete `suggested_fix` and a user-confirmation requirement. They enter the per-finding walk-through (R13) with `Apply the proposed fix` marked `(recommended)`. They are the default tier for "concrete fix exists, but it changes what the document says in a way the author should sign off on" (e.g., adding a backward-compatibility read-fallback, requiring two units land in one commit, substituting a framework-native API for a hand-rolled one).
- R4. `advisory` findings are report-only. They surface in a compact FYI block in the final output and do not enter the walk-through or any bulk action. Subjective observations ("filename asymmetry — could go either way"), drift notes without actionable fixes, and low-stakes calibration gaps live here.
- R5. `present` findings remain for genuinely strategic / scope / prioritization decisions where multiple reasonable approaches exist and the right choice depends on context the reviewer doesn't have.
**Classification rule sharpening**
- R6. The subagent-template classification rule adds teeth: "a 'do nothing / accept the defect' option is not a real alternative — it's the failure state the finding describes." If the only listed alternatives to the primary fix are strawmen, the finding is `auto` (or `gated_auto` if confirmation is warranted), not `present`. This applies equally to "document in release notes," "accept drift," and other deferral framings that sidestep the actual problem.
- R7. Auto-promotion patterns already scattered in prose (steps 3.6 and 3.7) are consolidated into an explicit promotion rule set, covering:
- Factually incorrect behavior where the correct behavior is derivable from context or the codebase
- Missing standard security / reliability controls with established implementations (HTTPS, fallback-with-deprecation-warning, input sanitization, checksum verification, private IP rejection, etc.)
- Codebase-pattern-resolved fixes that cite a concrete existing pattern
- Framework-native-API substitutions when a hand-rolled implementation duplicates first-class framework behavior (e.g., cobra's `Deprecated` field)
- Completeness additions mechanically implied by the document's own explicit decisions
- R8. The subagent template includes a framing-guidance block (ported from the `ce-code-review` shared template): observable-behavior-first phrasing, why-the-fix-works grounding, 2-4 sentence budget, required-field reminder, positive/negative example pair. One file change, applied universally across all seven personas.
**Per-severity confidence gates**
- R9. The single 0.50 confidence gate is replaced with per-severity gates:
- P0: survive at 0.50+
- P1: survive at 0.60+
- P2: survive at 0.65+
- P3: survive at 0.75+
- R10. The residual-concern promotion step (current step 3.4) is dropped. Cross-persona agreement instead boosts the confidence of findings that already survived the gate (by +0.10, capped at 1.0), mirroring `ce-code-review` stage 5 step 4. Residual concerns surface in Coverage only.
- R11. `advisory` findings are exempt from the confidence gate — they are report-only and can't generate false-positive work even at lower confidence. This is the safety valve for observations the reviewer wants on record but doesn't want to escalate.
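A minimal sketch of how R9–R11 compose, in illustrative Python (the skill itself is markdown-driven; function and field names here are hypothetical, while the threshold values come directly from the requirements):

```python
# Minimal sketch of R9-R11 composed. Threshold values come from the
# requirements; function and field names are hypothetical.
SEVERITY_GATES = {"P0": 0.50, "P1": 0.60, "P2": 0.65, "P3": 0.75}

def survives_gate(severity, confidence, autofix_class):
    if autofix_class == "advisory":
        return True  # R11: report-only findings are exempt from the gate
    return confidence >= SEVERITY_GATES[severity]

def boost_for_agreement(confidence, agreeing_personas):
    """R10: boost applies only to findings that already survived the gate."""
    if agreeing_personas > 1:
        return min(1.0, confidence + 0.10)
    return confidence
```

Note the ordering: cross-persona agreement boosts survivors' confidence for presentation, and per R10 it never rescues a finding that failed its gate.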
**Interaction model (post-fix routing)**
- R12. After `auto` fixes are applied and before any user interaction, Interactive mode presents a four-option routing question that mirrors `ce-code-review`'s post-PR-590 design:
- (A) `Review each finding one by one — accept the recommendation or choose another action`
- (B) `LFG. Apply the agent's best-judgment action per finding`
- (C) `Append findings to the doc's Open Questions section and proceed` (ce-doc-review analogue of ce-code-review's "file a tracker ticket" — for docs, "defer" means appending the findings to a `## Deferred / Open Questions` section within the document itself, not an external system)
- (D) `Report only — take no further action`
If zero `gated_auto` / `present` findings remain after the `auto` pass, the routing question is skipped and the flow falls directly into the terminal question (R19).
- R13. Routing option A enters a per-finding walk-through, presented one finding at a time in severity order (P0 first). Each per-finding question carries: position indicator (`Finding N of M`), severity, confidence, a plain-English statement of the problem, the proposed edit, and a short reasoning grounded in the document's own content or the codebase. Options: `Apply the proposed fix` / `Defer — append to the doc's Open Questions section` / `Skip — don't apply, don't append` / `LFG the rest — apply the agent's best judgment to this and remaining findings`. Advisory-only findings substitute `Acknowledge — mark as reviewed` for Apply.
- R14. Routing option B and walk-through `LFG the rest` execute the agent's per-finding recommended action across the selected scope (all pending findings for B, remaining-undecided for walk-through). The recommendation for each finding is determined deterministically by R16.
- R15. Before any bulk action executes (routing B, routing C, walk-through `LFG the rest`), a compact plan preview renders findings grouped by intended action (`Applying (N):`, `Appending to Open Questions (N):`, `Skipping (N):`, `Acknowledging (N):`) with a one-line summary per finding. Exactly two responses: `Proceed` or `Cancel`. Cancel from walk-through `LFG the rest` returns the user to the current finding, not to the routing question.
**Recommendation tie-breaking**
- R16. When merged findings carry conflicting recommendations across contributing personas (one says Apply, another says Defer), synthesis picks the most conservative using `Skip > Defer > Apply > Acknowledge`, so walk-through recommendations and LFG behavior are deterministic across re-runs.
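The R16 tie-break reduces to a fixed ordering, sketched here in Python with hypothetical names (the conservatism order is taken verbatim from the requirement):

```python
# The R16 ordering as executable logic. Names are hypothetical; the
# conservatism order comes directly from the requirement.
CONSERVATISM_ORDER = ["Skip", "Defer", "Apply", "Acknowledge"]  # most conservative first

def merged_recommendation(per_persona_recommendations):
    """Pick the most conservative action among contributing personas."""
    return min(per_persona_recommendations, key=CONSERVATISM_ORDER.index)
```

Because the order is total and fixed, the same merged finding always yields the same recommendation across re-runs.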
**Terminal "next step" question (the re-review fix)**
- R17. The current Phase 5 binary question (`Refine — re-review` / `Review complete`) conflates "apply fixes" with "re-review" into a single option. This is replaced by a three-option terminal question that separates the two axes:
- (A) `Apply decisions and proceed to <next stage>` — for requirements docs, hand off to `ce-plan`; for plan docs, hand off to `ce-work`. Default / recommended when fixes were applied or decisions were made.
- (B) `Apply decisions and re-review` — opt-in re-review when the user believes the edits warrant another pass.
- (C) `Exit without further action` — user wants to stop for now.
When zero actionable findings remain (everything was `auto` or `advisory`), option B is omitted — re-review is not useful when there's nothing to re-examine.
- R18. The terminal question is distinct from the mid-flow routing question (R12). The routing question chooses *how* to engage with findings; the terminal question chooses *what to do next* once engagement is complete. The two are asked separately, not merged.
- R19. The zero-findings degenerate case (no `gated_auto` / `present` findings after the `auto` pass) skips the routing question entirely and proceeds directly to the terminal question with option B suppressed.
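The R17/R19 option logic is a small conditional, sketched here with a hypothetical function name (the option wording follows R17):

```python
# Sketch of the R17/R19 terminal-question options; wording is taken from
# R17, the function name is hypothetical. Option B is omitted when every
# finding was `auto` or `advisory` (nothing left to re-examine).
def terminal_options(had_actionable_findings):
    options = ["Apply decisions and proceed to <next stage>"]
    if had_actionable_findings:
        options.append("Apply decisions and re-review")
    options.append("Exit without further action")
    return options
```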
**In-doc deferral (Defer analogue)**
- R20. Document-review's `Defer` action appends the deferred finding to a `## Deferred / Open Questions` section at the end of the document under review. If the heading does not exist, it is created on first defer within a review. Multiple deferred findings from a single review accumulate under a single timestamped subsection (e.g., `### From 2026-04-18 review`) to keep sequential reviews distinguishable. This replaces `ce-code-review`'s tracker-ticket mechanic with a document-native analogue: deferred findings stay attached to the document they came from.
- R21. The appended entry for each deferred finding includes: title, severity, reviewer attribution, confidence, and the `why_it_matters` framing — enough context that a reader returning to the doc later can understand the concern without re-running the review. The entry does not include `suggested_fix` or `evidence` — those live in the review run artifact and can be looked up if needed.
- R22. When the append fails (document is read-only, path issue, write failure), the agent surfaces the failure inline and offers: retry, fall back to recording the deferral in the completion report only, or convert the finding to Skip. Silent failure is not acceptable.
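A hypothetical formatter for the R20/R21 entry shape (the section heading and entry fields follow the requirements; the exact markdown layout is illustrative):

```python
from datetime import date

# Hypothetical formatter for an in-doc deferral entry (R20/R21). The
# section heading and entry fields follow the requirements; the exact
# markdown layout is illustrative.
def format_deferral_entry(title, severity, reviewer, confidence, why_it_matters):
    return (
        f"- **{title}** ({severity}, {reviewer}, confidence {confidence:.2f})\n"
        f"  - Why it matters: {why_it_matters}\n"
    )

def deferral_subsection(entries, review_date=None):
    """Accumulate one review's deferrals under a timestamped subsection (R20)."""
    review_date = review_date or date.today().isoformat()
    return f"### From {review_date} review\n" + "".join(entries)
```

Per R21, `suggested_fix` and `evidence` are deliberately absent from the entry; they stay in the review run artifact.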
**Framing quality in reviewer output**
- R23. Every user-facing surface that describes a finding — walk-through questions, LFG completion reports, Open Questions entries — explains the problem and fix in plain English. The framing leads with the *observable consequence* of the issue (what an implementer, reader, or downstream caller sees), not the document's structural phrasing.
- R24. The framing explains *why the fix works*, not just what it changes. When a pattern exists elsewhere in the document or codebase, reference it so the recommendation is grounded.
- R25. The framing is tight — approximately two to four sentences. Longer framings are a regression.
**Cross-cutting**
- R26. Tool-loading pre-flight mirrors `ce-code-review`: on Claude Code, `AskUserQuestion` is pre-loaded once at the start of Interactive mode via `ToolSearch` (`select:AskUserQuestion`), not lazily per-question. The numbered-list text fallback applies only when `ToolSearch` explicitly returns no match or the tool call errors.
- R27. Headless mode behavior is preserved. `mode:headless` continues to apply `auto` fixes silently and return all other findings as structured text to the caller. The caller owns routing. New tiers (`gated_auto`, `advisory`) must appear distinctly in headless output so callers can route them appropriately.
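One way a headless caller might see the tiers, sketched with a hypothetical envelope shape (R27 specifies only that tiers appear distinctly, not this structure):

```python
from collections import defaultdict

# Hypothetical headless partition per R27: `auto` fixes apply silently;
# everything else returns to the caller grouped by tier so `gated_auto`
# and `advisory` are routable. The envelope shape is illustrative.
def headless_envelope(findings):
    by_tier = defaultdict(list)
    for f in findings:
        if f["autofix_class"] == "auto":
            continue  # applied silently, not returned
        by_tier[f["autofix_class"]].append(f["title"])
    return dict(by_tier)
```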
**Multi-round decision memory**
- R28. Every review round after the first passes a cumulative decision primer to every persona, carrying forward all prior rounds' decisions in the current interactive session: rejected findings (Skipped / Deferred from any prior round) with title, evidence quote, and rejection reason; plus Applied findings from any prior round with title and section reference. Personas still receive the full current document as their primary input. No diff is passed — fixed findings self-suppress because their evidence no longer exists, regressions surface as normal findings on the current doc, and rejected findings are handled by the suppression rule in R29.
- R29. Personas must not re-raise a finding whose title and evidence pattern-match a finding rejected in any prior round, unless the current document state makes the concern materially different. The orchestrator drops any finding that would violate this rule and records the drop in Coverage.
- R30. For each prior-round Applied finding, synthesis confirms the fix landed by checking that the specific issue the finding described no longer appears in the referenced section. If a persona re-surfaces the same finding at the same location, synthesis flags it as "fix did not land" in the final report rather than treating it as a new finding.
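The R29 suppression check, sketched naively in Python (real matching is looser: "pattern-match" tolerates rephrasing, and the rule allows re-raising when the document state is materially different; exact string comparison is only a stand-in):

```python
# Naive sketch of the R29 suppression rule. Field names are hypothetical;
# real matching is fuzzier than exact string comparison.
def should_suppress(finding, rejected_findings):
    """Drop a finding whose title and evidence match one rejected earlier."""
    return any(
        finding["title"] == r["title"] and finding["evidence"] == r["evidence"]
        for r in rejected_findings
    )
```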
**Institutional memory (learnings-researcher integration)**
- R31. `ce-doc-review` dispatches `research:ce-learnings-researcher` as an always-on agent, in parallel with coherence-reviewer and feasibility-reviewer. The agent owns its own fast-exit behavior when `docs/solutions/` is empty or absent — no activation-gating in the orchestrator.
- R32. The orchestrator produces a compressed search seed during Phase 1's classify-and-select step: document type, 3-5 topic keywords extracted from the doc, named entities (tools, frameworks, patterns explicitly named), and the doc's top-level decision points. Learnings-researcher receives the search seed plus the document path, not the full document content. It searches `docs/solutions/` by frontmatter metadata first, then selectively reads matching solution bodies.
- R33. Learnings-researcher returns, per match: the solution doc's path, a one-line relevance reason, and the specific claim in the doc under review that the past solution relates to. Full solution content is loaded on demand by other personas or the orchestrator if the match is promoted into a finding. Results are capped at the N (default 5) most relevant matches — past-solution volume is not the goal; directly applicable grounding is.
- R34. Learnings-researcher output surfaces in a dedicated "Past Solutions" section of the review output. Entries default to `advisory` tier (report-only grounding) unless a past solution directly contradicts a specific claim in the document under review, in which case they promote to `gated_auto` or `present` with the past solution's path as evidence.
- R35. Learnings-researcher content does not participate in confidence-gating (R9) or cross-persona dedup (existing step 3.3). Its role is to add institutional memory, not to compete with persona findings for user attention.
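The R32 search seed, sketched as a hypothetical data shape (field names are illustrative; the requirement specifies only the content):

```python
from dataclasses import dataclass

# Hypothetical shape of the R32 search seed: what the orchestrator hands
# learnings-researcher instead of the full document. Field names are
# illustrative, not the skill's actual schema.
@dataclass
class SearchSeed:
    document_type: str          # e.g. "requirements" or "plan"
    topic_keywords: list        # 3-5 keywords extracted from the doc
    named_entities: list        # tools, frameworks, patterns named explicitly
    decision_points: list       # the doc's top-level decisions
    document_path: str = ""     # path only; the body is read on demand
```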
**learnings-researcher agent rewrite (bundled)**
- R36. Rewrite `research:ce-learnings-researcher` to treat the `docs/solutions/` corpus as domain-agnostic institutional knowledge. Code bugs are one genre among several, alongside skill-design patterns, workflow learnings, developer-experience discoveries, integration gotchas, and anything else captured by `ce-compound` and its refresh counterpart. The agent's primary function is "find applicable past learnings given a work context," not "find past bugs given a feature description."
- R37. The agent accepts a structured `<work-context>` input from callers: a short description of what the caller is working on or considering, a list of key concepts / decisions / domains / components extracted from the caller's work, and an optional domain hint when one applies cleanly (e.g., `skill-design`, `workflow`, `code-implementation`). No mode flag is required — the context shape adapts to the calling skill without the agent branching on caller identity.
- R38. The hardcoded category-to-directory table is replaced with a dynamic probe of `docs/solutions/` to discover available subdirectories at runtime. Category narrowing uses the discovered set. The agent no longer assumes which subdirectories exist in a given repo.
- R39. Keyword extraction handles decision-and-approach-shape content alongside symptom-and-component-shape content. The extraction taxonomy expands from the current four dimensions (Module names, Technical terms, Problem indicators, Component types) to include Concepts, Decisions, Approaches, and Domains. No input shape is privileged over another; the caller's context determines which dimensions carry weight.
- R40. Output framing drops code-bug-biased phrasing ("gotchas to avoid during implementation," "prevent repeated mistakes" framed narrowly around bugs) in favor of neutral institutional-memory framing ("applicable past learnings," "related decisions and their outcomes"). The pointer + one-line-relevance + key-insight summary format carries across all input genres.
- R41. Read `docs/solutions/patterns/critical-patterns.md` only when it exists. When absent, the agent proceeds without it — this file is a per-repo convention, not a protocol requirement.
- R42. The agent's Integration Points section documents invocation by `/ce-plan`, `/ce-code-review`, `ce-doc-review`, and any other skill benefiting from institutional memory. Remove the framing that implies planning-time is the agent's primary home.
**Frontmatter enum expansion (bundled)**
- R43. Expand the `ce-compound` frontmatter `problem_type` enum to add non-bug genre values: `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`. Document `best_practice` as the fallback for entries not covered by any narrower value, not the default. Re-classify the 8 existing `best_practice` entries (3 become architecture patterns, 3 design patterns, 1 a tooling decision, and 1 remains `best_practice`), and resolve the one `correctness-gap` schema violation (`workflow/todo-status-lifecycle.md`) into a valid enum value. Update `ce-compound` and `ce-compound-refresh` so they steer authors toward narrower values when the new categories apply.
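The R43 enum additions plus the kind of membership check the deferred validation task would enforce, sketched in Python (only the new values and the fallback are listed; the schema's pre-existing bug-genre values are assumed, not reproduced here):

```python
# Sketch of the R43 enum additions plus a membership check of the kind a
# follow-up validation task would enforce. The pre-existing bug-genre
# values are assumed, not shown.
NEW_PROBLEM_TYPES = {
    "architecture_pattern",
    "design_pattern",
    "tooling_decision",
    "convention",
}
FALLBACK = "best_practice"  # fallback for uncovered entries, not the default

def valid_problem_type(value, existing_values=frozenset()):
    return value in NEW_PROBLEM_TYPES or value == FALLBACK or value in existing_values
```

A check like this would have caught the `correctness-gap` entry that slipped through the current schema.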
## Scope Boundaries
- Not introducing a document-native tracker integration (e.g., Linear / Jira / GitHub Issues). Document-review's Defer analogue is an in-doc `## Deferred / Open Questions` section. If users later want tracker integration for doc findings, that's a follow-up proposal.
- Not changing persona selection logic. The seven personas and the activation signals for conditional ones stay as-is. The persona markdown files themselves change only to absorb the subagent-template framing-guidance block.
- Not changing headless mode's structural contract with callers (`ce-brainstorm`, `ce-plan`). Headless continues to apply `auto` fixes silently and return a structured text envelope. Callers must be updated to handle the new `gated_auto` and `advisory` tiers but the envelope shape stays.
- Not adding a `requires_verification` field or an in-skill fixer subagent. Document edits happen inline during the walk-through; there is no batch-fixer analogue to `ce-code-review`'s Step 3 fixer because document fixes are trivially confined in scope (single-file markdown edits).
- Not addressing iteration-limit guidance. The existing "after 2 refinement passes, recommend completion" heuristic stays.
- Not persisting decision primers across interactive sessions. The cumulative decision list (R28) lives in-memory across rounds within a single invocation. A new invocation of `ce-doc-review` on the same doc starts fresh with no carried memory, even if prior-session decisions were Applied to the document. Mirrors `ce-code-review` walk-through state rules.
- Not building a fully new frontmatter schema. R43 adds non-bug enum values but does not redesign the schema dimensions (no split into `learning_category` + `problem_type`, no new required fields). The existing authoring flow stays the same; only the set of valid `problem_type` values grows.
## Design Decisions Worth Calling Out
- **Three new tiers, not two.** A minimal refactor could add only `gated_auto` and keep `advisory` collapsed into `present`. But real-world evidence shows FYI-grade findings (subjective observations, low-stakes drift notes) drive significant noise, and folding them into `present` forces user decisions on things that don't warrant any decision. Adding `advisory` as a distinct tier is cheap (one enum value + one output block) and materially reduces decision fatigue.
- **Strawman-aware classification rule in the subagent template, not in synthesis.** Moving the rule to synthesis means persona reviewers still emit inflated alternative lists and the orchestrator retroactively collapses them. Moving it to the subagent template changes what reviewers produce at the source, so the evidence and framing travel together correctly.
- **Per-severity confidence gates, not a flat 0.60.** A flat 0.60 would still let 0.60–0.68 P3 nits through (three of them in the attached real-world example). Severity-aware gates recognize that a P3 finding at 0.65 is noise in a way a P1 at 0.65 is not, because P3 impact is low enough that the expected value of a borderline call doesn't justify the user's attention.
- **Separate terminal question from routing question.** The current skill conflates "engage with findings" and "exit the review" into one question with two poorly-aligned options. Splitting them gives the user explicit control over whether re-review happens — the most common user frustration surfaced in the bug report that prompted this work.
- **In-doc Open Questions section, not a sibling follow-up note or external tracker, as Defer analogue.** Documents don't have the same "handoff to a different system" shape that code findings do. A sibling markdown note would fragment context; an external tracker would add platform complexity with no upside for document review. Appending deferred findings to a `## Deferred / Open Questions` section inside the document itself keeps deferred concerns attached to the artifact they came from, is naturally discoverable by anyone reading the doc, and requires no new infrastructure. The trade-off is that deferred findings visibly mutate the doc — but that is the point: "I want to remember this but not act now" is exactly what an Open Questions section expresses in a planning doc.
- **Port framing-guidance once via the shared subagent template.** Matches how `ce-code-review` shipped the same fix in PR #590. One file change, applied universally. Per-persona edits would inflate scope to seven files; a synthesis-time rewrite pass would add per-review model cost and paper over the root cause in the persona output itself.
- **Classification-rule sharpening and promotion-pattern consolidation ship together with the tier expansion.** Shipping the tiers without the sharpened rule would leave the classifier behavior unchanged and just add new tier labels nothing routes to. Shipping the rule without the tiers has no tier to promote findings into.
- **Keep the existing persona markdown files mostly unchanged.** The framing-guidance block lives in the shared subagent template that wraps every persona dispatch; the personas themselves retain their confidence calibration, suppress conditions, and domain focus. This keeps the persona-level failure-mode catalogs stable while upgrading the shared framing bar.
- **No diff passed to the multi-round decision primer.** Fixed findings self-suppress because their evidence is gone from the current doc; regressions surface as normal findings; rejected findings are handled by the suppression rule (R29). A diff would be signal amplification, not a correctness requirement, and would add prompt weight without changing what the agent can do.
- **learnings-researcher rewrite bundled, not split.** The review-time use case has no consumer without ce-doc-review, so splitting into a precursor PR would ship a dormant feature. Bundling keeps the change coherent and easier to review as one unit. The agent rewrite (R36–R42) and the frontmatter enum expansion (R43) also benefit `/ce-plan`'s existing usage, so the scope investment pays off beyond ce-doc-review.
- **Generalize learnings-researcher rather than patch with a mode flag.** The original proposal was a minimal `review-time` mode flag grafted onto the agent. But the real issue is that the agent's taxonomy, categories, and output framing are code-bug-shaped even when invoked by non-review callers — the plugin already captures non-code learnings via `ce-compound` / `ce-compound-refresh`, and the agent should treat them as first-class. Rewriting for domain-agnostic institutional knowledge is a bigger change but removes the drift, rather than accumulating special cases.
- **Expand `problem_type` rather than introduce a new orthogonal dimension.** A cleaner design might split current `problem_type` into separate `learning_category` (genre) and `problem_type` (bug-shape detail) fields. But that requires migrating every existing entry and teaching authors to pick both. Expanding the existing enum with non-bug values absorbs the `best_practice` overflow with minimal schema churn and keeps the authoring flow stable.
## Calibration Against Real-World Example
The attached review output (14 findings, all `present`) re-classifies under the proposed rules as:
- **4 `auto`** (silently applied, no user interaction): missing fallback-with-deprecation-warning (industry-standard pattern), public-repo grep step (single action), deployment-coupling-commit guarantee (mechanical), cobra's native `Deprecated` field (framework-native substitution).
- **1 `advisory`** (FYI line): filename asymmetry — genuinely ambiguous, no wrong answer.
- **4 `present`** (walk-through): historical-docs rule, alias-compatibility breaking-change, escape-hatch scope decision, Unit merging decision.
- **5 dropped** by per-severity gates: five P2/P3 findings at 0.55–0.68 confidence.
Net: the user sees **4 decisions**, not 14. The walk-through's `LFG the rest` escape further bounds fatigue — after the user calibrates on the agent's recommendations, they can bail and accept the rest.


@@ -0,0 +1,708 @@
---
title: ce-doc-review Autofix and Interaction Overhaul
type: feat
status: active
date: 2026-04-18
origin: docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md
---
# ce-doc-review Autofix and Interaction Overhaul
## Overview
Overhaul `ce-doc-review` to match the interaction quality and auto-fix leverage of `ce-code-review` (post-PR #590). Today, ce-doc-review surfaces too many findings as "needs user judgment" when one clear fix exists, nitpicks at low confidence, and ends with a binary question that forces re-review when the user wants to apply fixes and move on. This plan expands the autofix classification from binary (`safe_auto` / `manual`) to three tiers (`safe_auto` / `gated_auto` / `manual`) using ce-code-review-aligned names, raises and severity-weights the confidence gate, ports the per-finding walk-through + bulk-preview + routing-question pattern from `ce-code-review`, adds in-doc deferral, introduces multi-round decision memory, rewrites `learnings-researcher` to handle domain-agnostic institutional knowledge, and expands the `ce-compound` frontmatter `problem_type` enum to absorb the `best_practice` overflow. **Advisory-style findings** (low-confidence observations worth surfacing but not worth a decision) render as a distinct FYI subsection of the `manual` bucket at the presentation layer rather than a separate schema tier.
The plan ships in phases so lower-risk foundation work (enum expansion, agent rewrite) can land and stabilize before the interaction-model port. Each implementation unit is atomic and can ship as its own PR.
## Problem Frame
See origin document for full problem framing. In brief, a real-world review surfaced **14 findings all routed to `manual`**, including five P3s at 0.55–0.68 confidence, three concrete mechanical fixes that a competent implementer would arrive at independently, and one subjective observation with no right answer. Under the revised rules the same review produces 4 auto-applied fixes, 1 FYI entry, 4 real decisions, and 5 dropped — the user engages with 4 items instead of 14.
## Requirements Trace
38 requirements from the origin document. Full definitions live there; listed here for traceability.
- **Classification tiers:** R1–R5 (three tiers — add `gated_auto`; keep `safe_auto` / `manual`; advisory-style findings become presentation-layer FYI subsection of manual, not a distinct enum value)
- **Classification rule sharpening:** R6–R8 (strawman-aware rule with safeguard, consolidated promotion patterns, shared framing-guidance block)
- **Per-severity confidence gates:** R9–R11 (P0 0.50 / P1 0.60 / P2 0.65 / P3 0.75; drop residual promotion; low-confidence manual findings surface in a distinct FYI subsection without being dropped)
- **Interaction model:** R12–R16 (4-option routing, per-finding walk-through, bulk preview, tie-break)
- **Terminal question:** R17–R19 (three-option split: apply-and-proceed / apply-and-re-review / exit)
- **In-doc deferral:** R20–R22 (append to `## Deferred / Open Questions` section)
- **Framing quality:** R23–R25 (observable consequence, why-the-fix-works, tight)
- **Cross-cutting:** R26–R27 (AskUserQuestion pre-load, headless preservation)
- **Multi-round memory:** R28–R30 (cumulative decision primer, suppression, fix-landed verification)
- **learnings-researcher agent rewrite:** R36–R42 (domain-agnostic, `<work-context>`, dynamic category probe, optional critical-patterns read) — benefits `/ce-plan`'s existing usage
- **Frontmatter enum expansion:** R43 (add `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`)
**Dropped from scope:** R31–R35 (learnings-researcher integration into ce-doc-review). See Key Technical Decisions and Alternative Approaches Considered for the rationale. **In scope:** R36–R42 (learnings-researcher domain-agnostic rewrite, Unit 2) and R43 (frontmatter enum expansion, Unit 1), which benefit `/ce-plan`'s existing usage even though learnings-researcher is not dispatched from ce-doc-review.
## Scope Boundaries
- Not introducing external tracker integration. Document-review's Defer analogue is an in-doc section.
- Not changing persona activation/selection logic. The 7 personas and their conditional activation signals stay as-is.
- Not adding `requires_verification` or a batch fixer subagent. Document fixes apply inline.
- Not addressing iteration-limit guidance. "After 2 refinement passes, recommend completion" stays.
- Not persisting decision primers across interactive sessions (matches `ce-code-review` walk-through state rules).
- Not redesigning the frontmatter schema dimensions. Enum expansion only — no new `learning_category` field alongside `problem_type`.
### Deferred to Separate Tasks
- Frontmatter validation test. Adding a pre-commit or CI check that enforces `problem_type` enum membership is valuable (`correctness-gap` slipped through today) but is additive and can ship as a follow-up.
- Updating the frontmatter `component` enum. It's heavily Rails-focused and would benefit from expansion for non-Rails work, but that's out of scope for this overhaul.
## Context & Research
### Relevant Code and Patterns
**Port-from targets (`ce-code-review`):**
- `plugins/compound-engineering/skills/ce-code-review/references/walkthrough.md` — per-finding walk-through (terminal output block + blocking question split, fixed-order options, `(recommended)` marker, LFG-the-rest escape, N=1 adaptation, unified completion report)
- `plugins/compound-engineering/skills/ce-code-review/references/bulk-preview.md` — grouped Apply/Filing/Skipping/Acknowledging preview with `Proceed` / `Cancel`
- `plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md:51-73` — framing-guidance block for personas
- `plugins/compound-engineering/skills/ce-code-review/SKILL.md:75` — AskUserQuestion pre-load directive
- `plugins/compound-engineering/skills/ce-code-review/SKILL.md:477` (stage 5 step 7b) — recommendation tie-break order `Skip > Defer > Apply > Acknowledge`
**Target surfaces (`ce-doc-review`):**
- `plugins/compound-engineering/skills/ce-doc-review/SKILL.md` — orchestrator
- `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` — framing-guidance block lands here
- `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` — tier routing, confidence gate, decision primer, and headless envelope updates
- `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json``autofix_class` enum expansion
- `plugins/compound-engineering/agents/document-review/ce-*.agent.md` — 7 persona files (mostly unchanged)
**Caller contracts:**
- `plugins/compound-engineering/skills/ce-brainstorm/SKILL.md:188-194` — invokes interactively on requirements doc
- `plugins/compound-engineering/skills/ce-brainstorm/references/handoff.md:29,56,65` — surfaces residual P0/P1 adjacent to menus; offers re-review
- `plugins/compound-engineering/skills/ce-plan/references/plan-handoff.md:5-17` — phase 5.3.8; interactive normally, `mode:headless` in pipeline
**Schema surfaces:**
- `plugins/compound-engineering/skills/ce-compound/references/schema.yaml` (canonical) and `yaml-schema.md` (human-readable) — `problem_type` enum definitions + category mapping
- `plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml` and `yaml-schema.md`**duplicate** copies, must update in sync
- `plugins/compound-engineering/skills/ce-compound/SKILL.md` — author steering language
- `plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md` — refresh steering language
**Agent to rewrite:**
- `plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md` — domain-agnostic rewrite
**Test surfaces:**
- `tests/pipeline-review-contract.test.ts:279-352` — asserts ce-doc-review is invoked with `mode:headless` in pipeline mode. Will need extension for new tier visibility in headless envelope.
- `tests/converter.test.ts:417-438` — OpenCode 3-segment → flat name rewrite for ce-doc-review agent refs. Unaffected.
- No dedicated test file for ce-doc-review itself. Adding one is in scope (Unit 8).
### Institutional Learnings
Seven directly applicable learnings from `docs/solutions/`:
- `docs/solutions/best-practices/ce-pipeline-end-to-end-learnings-2026-04-17.md`**Mandatory read.** Authored from the `ce-code-review` PR #590 redesign this plan ports. Documents the bulk-preview vs. walk-through distinction, the 4-option `AskUserQuestion` cap as a structural constraint, the "two semantic meanings in one flag" risk, and the "sample 10-20 real artifacts before accepting research-agent architectural recommendations" rule.
- `docs/solutions/skill-design/compound-refresh-skill-improvements.md` — Six-item skill-review checklist (no hardcoded tool names, no contradictory phase rules, no blind questions, no unsatisfied preconditions, no shell in subagents, autonomous-mode opt-in). The "borderline cases get marked stale in autonomous mode" template informs how `advisory` findings behave in headless runs.
- `docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md` — Classifies `learnings-researcher` as ce-plan-owned (HOW / implementation-context). **Drove the decision to remove R31–R35 from scope entirely:** rather than dispatch from ce-doc-review in any form (always-on or conditional), keep the agent in its ce-plan pipeline lane. ce-doc-review does not dispatch it. Users who want institutional memory should invoke ce-plan.
- `docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md` — Default to path-passing; 7× tool-call difference from prompt phrasing. Relevant to Unit 2's learnings-researcher rewrite — the `<work-context>` input should pass paths and compressed context, not full documents.
- `docs/solutions/skill-design/beta-skills-framework.md` + `beta-promotion-orchestration-contract.md` + `ce-work-beta-promotion-checklist-2026-03-31.md` — Beta-skill pattern for major overhauls. Considered and rejected for this work (see Alternative Approaches below).
- `docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md`**High severity for this plan.** Model tier/confidence/deferral as an explicit state machine; re-read state at each transition boundary. Directly shapes Unit 4 (synthesis pipeline) structure.
- `docs/solutions/skill-design/discoverability-check-for-documented-solutions-2026-03-30.md` — When enum expands, update instruction-discovery surface (schema reference, learnings-researcher prompt, AGENTS.md) in the same PR. Shapes Unit 1 and Unit 2.
### External References
No external research was needed — the work is internal plugin refactoring with strong local patterns (ce-code-review post-PR #590 is the canonical reference).
## Key Technical Decisions
- **Port the ce-code-review walk-through / bulk-preview pattern rather than invent a new one.** Same menu shape, same tie-break rule, same AskUserQuestion pre-load pattern. Users who've experienced ce-code-review's new flow will find ce-doc-review consistent. **Tier naming aligned with ce-code-review** (`safe_auto`, `gated_auto`, `manual`) so cross-skill mental model is consistent.
- **Three tiers, not four — advisory is a display treatment, not an enum value.** ce-code-review has four tiers (adds `advisory`) because code reviews have a meaningful "report-only, release/human-owned" category (rollout notes, residual risk, learnings). Document reviews rarely produce that shape — FYI observations are typically just low-confidence manual findings that don't need a decision. Collapsing to three tiers + FYI-subsection presentation drops a schema value without losing the user-facing distinction between "needs decision" and "FYI, move on." Cognitive load lower; schema simpler.
- **Interaction-surface convergence with ce-code-review is intentional; keep the skills separate.** Post-plan, ce-doc-review and ce-code-review share interaction mechanics (walk-through shape, bulk preview, routing question, tie-break order) but evaluate genuinely different things: the personas are different (coherence/feasibility/scope-guardian for docs vs correctness/security/performance for code), the inputs are different (prose vs diff), and the failure modes are different. Shared interaction scaffold, distinct review content. Unifying into one skill would smear the focused-review value each delivers today.
- **Ship without a `ce-doc-review-beta` fork.** See Alternative Approaches.
- **Do not dispatch `learnings-researcher` from ce-doc-review at all.** The agent is ce-plan-owned (implementation-context per `research-agent-pipeline-separation-2026-04-05.md`). When ce-doc-review runs inside ce-plan, the agent has already fired and its output lives in the plan. When ce-doc-review runs inside ce-brainstorm, the context is WHY (product-framing), not HOW (implementation) — running an implementation-context agent would be a pipeline violation. When ce-doc-review runs standalone, the personas already cover coherence, feasibility, and scope — institutional memory is a nice-to-have that adds dispatch cost without proportional value. Users who want institutional memory for a doc should invoke `/ce-plan`, where that lookup is a first-class pipeline stage.
- **Put R1–R8 classification changes in the shared subagent template**, not in each persona. One file edit propagates to all 7 personas. Matches how `ce-code-review` shipped the same quality upgrade.
- **Keep R9–R11 confidence gates in synthesis** (`synthesis-and-presentation.md` step 3.2), not in personas. Personas keep their existing HIGH/MODERATE/<0.50 calibration.
- **No diff passed in multi-round primer (R28).** Fixed findings self-suppress (evidence gone); regressions surface as normal findings; rejected findings use pattern-match suppression. The diff would add prompt weight without changing what the agent can detect.
- **Enum expansion values go on the knowledge track**, not the bug track. All four new values (`architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`) are knowledge-track per the two-track schema in `schema.yaml:12-31`.
- **Update duplicate schema files in both `ce-compound` and `ce-compound-refresh`** in the same commit. They are intentional duplicates; divergence is a known pitfall.
- **Model tier/confidence/deferral as an explicit state machine** (per `git-workflow-skills-need-explicit-state-machines` learning). See High-Level Technical Design for the state diagram.
## Open Questions
### Resolved During Planning
- **Beta fork vs phased ship?** Phased ship without beta. The overhaul is large but cleanly phaseable; each phase is independently testable; callers stay compatible via the preserved headless envelope contract (R27).
- **Dispatch learnings-researcher from ce-doc-review?** No. Dropped from scope (R31–R35 removed). The agent is ce-plan-owned; users who want institutional memory should invoke ce-plan, which has it as a first-class pipeline stage. Unit 2 still rewrites the agent to be domain-agnostic — that benefits ce-plan's existing usage.
- **Diff in multi-round primer?** No. Decision metadata alone is sufficient.
- **Defer destination for docs?** In-doc `## Deferred / Open Questions` section, not a sibling file. See origin document R20.
### Deferred to Implementation
- **How many existing `best_practice` entries map to each new enum value?** Research suggests ~11 candidates; final mapping happens when migrating.
- **Exact wording of the `gated_auto` / `manual` labels in the AskUserQuestion menus.** Draft wording exists in origin document R12–R13; final phrasing is validated against harness rendering during implementation.
- **Exact line-count budget for the subagent-template framing-guidance block.** Target ~40-50 lines per the research findings; adjust as needed to stay under the ~150-line `@` inclusion threshold.
- **Whether to extend `tests/pipeline-review-contract.test.ts` or add a new `tests/ce-doc-review-contract.test.ts`.** Decide during Unit 8 based on assertion overlap.
## High-Level Technical Design
> *This illustrates the intended approach and is directional guidance for review, not implementation specification. The implementing agent should treat it as context, not code to reproduce.*
### Finding lifecycle state machine
Per the `git-workflow-skills-need-explicit-state-machines` learning, the tier/confidence/deferral interactions form a state machine that must be modeled explicitly — prose-level carry-forward silently breaks.
```mermaid
stateDiagram-v2
[*] --> Raised: Persona emits finding
Raised --> Dropped: confidence < per-severity gate (R9)
Raised --> Dropped: re-raises rejected prior-round finding (R29)
Raised --> Deduplicated: fingerprint matches another persona's finding
Deduplicated --> Classified
Raised --> Classified: after confidence + dedup gates
Classified --> SafeAuto: autofix_class = safe_auto (R2)
Classified --> GatedAuto: autofix_class = gated_auto (R3)
Classified --> Manual: autofix_class = manual (R5)
Classified --> FYI: low-confidence manual, FYI-floor <= conf < per-severity gate
SafeAuto --> Applied: orchestrator edits doc silently
Applied --> Verified: next round confirms fix landed (R30)
Applied --> FixDidNotLand: persona re-raises same finding at same spot (R30)
GatedAuto --> WalkThrough: routing option A (R13)
GatedAuto --> BulkApply: routing option B LFG (R14)
GatedAuto --> BulkDefer: routing option C (R12)
Manual --> WalkThrough
Manual --> BulkApply
Manual --> BulkDefer
FYI --> Reported: surfaces in FYI subsection at presentation layer, no decision
WalkThrough --> UserChoice
UserChoice --> Applied: user picks Apply
UserChoice --> Deferred: user picks Defer (R20-R22)
UserChoice --> Skipped: user picks Skip
BulkApply --> Applied: proceed
BulkDefer --> Deferred: proceed
Deferred --> AppendedToOpenQuestions: append succeeds (R20)
Deferred --> Skipped: append fails, user converts to Skip (R22)
Verified --> [*]
FixDidNotLand --> [*]: flagged in report
AppendedToOpenQuestions --> [*]
Skipped --> [*]
Reported --> [*]
Dropped --> [*]
```
This diagram models ce-doc-review persona findings only. Learnings-researcher findings (R31–R35) are out of scope — ce-doc-review does not dispatch the agent (see Key Technical Decisions and Alternative Approaches Considered).
Transitions to verify explicitly in synthesis (not carry forward as prose):
- Classified → one of four buckets (tier routing, step 3.7 rewrite)
- Rejected-in-prior-round → Dropped (R29 suppression, new synthesis step)
- Applied → Verified or FixDidNotLand (R30, new synthesis step)
- SafeAuto / GatedAuto → Applied (separate paths; SafeAuto is silent, GatedAuto goes through walk-through or bulk)
### Three interaction surfaces
```mermaid
flowchart TD
A[Auto fixes applied silently] --> B{Any gated_auto / manual<br/>findings remain?}
B --> |No| Z[Zero-findings degenerate<br/>→ Terminal question<br/>B option omitted]
B --> |Yes| C[Four-option routing question]
C --> |A Review| W[Per-finding walk-through]
C --> |B LFG| P1[Bulk preview]
C --> |C Append to Open Questions| P2[Bulk preview]
C --> |D Report only| E[Terminal question<br/>without applying]
W --> |Apply/Defer/Skip| W
W --> |LFG the rest| P3[Bulk preview<br/>scoped to remaining]
P1 --> |Proceed| X[Apply set dispatched<br/>Defer appends<br/>Skip no-op]
P2 --> |Proceed| Y[All append to<br/>Open Questions section]
P3 --> |Proceed| X
X --> T[Terminal question<br/>3 options]
Y --> T
E --> T
T --> |Apply and proceed| NextStage[ce-plan or ce-work]
T --> |Apply and re-review| Round2[Next review round<br/>with decision primer]
T --> |Exit| End[Done for now]
Round2 --> A
```
The walk-through, bulk preview, and routing question are ports of the same-named `ce-code-review` references with ce-doc-review specific adaptations (Defer = in-doc append; no batch fixer subagent; terminal question routes to pipeline stages instead of PR/push).
## Implementation Units
- [ ] **Unit 1: Frontmatter enum expansion + migration**
**Goal:** Add four knowledge-track values (`architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`) to the `problem_type` enum, update both duplicate schema files, migrate existing `best_practice` overflow entries, resolve the one `correctness-gap` schema violation, and update instruction-discovery surfaces so new values are discoverable.
**Requirements:** R43
**Dependencies:** None (foundation)
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-compound/references/schema.yaml`
- Modify: `plugins/compound-engineering/skills/ce-compound/references/yaml-schema.md`
- Modify: `plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml`
- Modify: `plugins/compound-engineering/skills/ce-compound-refresh/references/yaml-schema.md`
- Modify: `plugins/compound-engineering/skills/ce-compound/SKILL.md` (author-steering language toward narrower values)
- Modify: `plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md` (refresh steering language)
- Modify: `plugins/compound-engineering/AGENTS.md` (discoverability line that names `problem_type` values)
- Migrate: the ~8–11 existing `best_practice` entries under `docs/solutions/` that fit a narrower value (see repo-research report for the candidate list — some entries may stay `best_practice` if no narrower value applies; the final count is a small range, not a fixed number)
- Migrate: `docs/solutions/workflow/todo-status-lifecycle.md` (`correctness-gap` → valid enum value)
**Approach:**
- Add four values to both schema.yaml files under the knowledge track
- Add four category mappings to both yaml-schema.md files (`architecture_pattern → docs/solutions/architecture-patterns/`, etc.)
- Keep `best_practice` valid but document it as the fallback, not the default
- Author-steering language in ce-compound SKILL body should name the new values with one-line descriptions so authors pick the narrower value when applicable
- Category directory creation on first use — don't pre-create empty dirs
- Migration pass: re-classify the ~11 existing `best_practice` entries per the research findings, and move `todo-status-lifecycle.md` off `correctness-gap`
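The enum expansion also implies a validation shape for the deferred frontmatter check (see Deferred to Separate Tasks). A minimal sketch, assuming hypothetical names — `KNOWLEDGE_TRACK` and `validateProblemType` are illustrative, not existing repo code, and the bug-track values are elided:

```typescript
// Illustrative only — names are hypothetical, not existing repo code.
// The knowledge track keeps best_practice as the documented fallback
// and adds the four narrower values from R43.
const KNOWLEDGE_TRACK = new Set([
  "best_practice", // still valid, but the fallback — not the default
  "architecture_pattern",
  "design_pattern",
  "tooling_decision",
  "convention",
]);

// Bug-track values elided — see schema.yaml's two-track structure.
const BUG_TRACK = new Set<string>([/* … */]);

function validateProblemType(value: string): { ok: boolean; hint?: string } {
  if (KNOWLEDGE_TRACK.has(value) || BUG_TRACK.has(value)) {
    return { ok: true };
  }
  // e.g. the `correctness-gap` value that slipped through today
  return { ok: false, hint: `unknown problem_type: ${value}` };
}

console.log(validateProblemType("architecture_pattern").ok); // true
console.log(validateProblemType("correctness-gap").ok);      // false
```

A check of this shape would cover the backward-compat and unknown-value test scenarios above; whether it lands as pre-commit or CI remains the follow-up task.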
**Patterns to follow:**
- `schema.yaml` existing two-track structure (bug / knowledge)
- `yaml-schema.md` existing "Category Mapping" section format
- ce-compound existing author-steering prose in section naming problem types
**Test scenarios:**
- Happy path: a fixture knowledge-track doc with `problem_type: architecture_pattern` parses and validates
- Happy path: a fixture knowledge-track doc with `problem_type: design_pattern` parses and validates
- Edge case: a doc with `problem_type: best_practice` still validates (backward compat)
- Edge case: a doc with an unknown value (e.g., `problem_type: xyz-invalid`) is flagged
- Integration: ce-compound steering guidance names the new values in its output when classifying an appropriate learning
**Verification:**
- Both schema files contain all 4 new values and the category mappings
- Every `best_practice` entry under `docs/solutions/` that fits a narrower value has been reclassified (final count is the subset of ~8–11 candidates that genuinely fit a narrower value; some may legitimately remain `best_practice`)
- `docs/solutions/workflow/todo-status-lifecycle.md` carries a valid enum value
- AGENTS.md references the new categories so future agents discover them
---
- [ ] **Unit 2: learnings-researcher domain-agnostic rewrite**
**Goal:** Rewrite the `learnings-researcher` agent to treat `docs/solutions/` as domain-agnostic institutional knowledge. Accept a structured `<work-context>` input, replace hardcoded category tables with dynamic probing, expand keyword extraction beyond bug-shape taxonomy, and make the `critical-patterns.md` read optional.
**Requirements:** R36, R37, R38, R39, R40, R41, R42
**Dependencies:** Unit 1 for taxonomy-aware output framing only. The dynamic category probe itself has no schema dependency (it reads `docs/solutions/` subdirectories at runtime), so Unit 2 can be drafted in parallel; only the final author-visible framing benefits from Unit 1's enum landing first.
**Files:**
- Modify: `plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md`
- Test: No agent-specific test exists. Extend or add a fixture under `tests/fixtures/` if needed to validate the dispatch contract across platforms — defer to Unit 8 if non-trivial.
**Approach:**
- Rewrite the opening identity/framing: "domain-agnostic institutional knowledge researcher" (not bug-focused)
- Replace `feature/task description` input format with structured `<work-context>` block (description + concepts + decisions + domains + optional domain hint)
- Replace hardcoded category-to-directory table with a dynamic probe: at invocation time, list subdirectories under `docs/solutions/` and use the discovered set
- Expand keyword extraction taxonomy: existing four dimensions plus Concepts, Decisions, Approaches, Domains
- Make Step 3b (critical-patterns.md read) conditional on file existence
- Rewrite output framing to "applicable past learnings" / "related decisions and their outcomes" from "gotchas to avoid during implementation"
- Update Integration Points to include `/ce-plan` and standalone use (ce-doc-review is explicitly not a caller per this plan's Key Technical Decisions — the rewrite's consumer is `/ce-plan`)
**Execution note:** After rewriting, sample 3-5 real invocations on current codebase learnings to verify the domain-agnostic rewrite produces relevant output for non-code queries (e.g., skill-design questions, workflow questions). Per the ce-pipeline learnings doc: "sample real artifacts before accepting research-agent architectural recommendations."
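The dynamic category probe in the approach above can be sketched as follows — a hedged illustration only, since the agent ships as prompt prose, not code; the function name and return shape are assumptions:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Illustrative sketch: discover solution categories at invocation time
// instead of relying on a hardcoded category-to-directory table.
function probeSolutionCategories(root = "docs/solutions"): string[] {
  if (!fs.existsSync(root)) return []; // absent dir → fast-exit, no error
  return fs
    .readdirSync(root, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => entry.name)
    .sort();
}

// An empty result maps to the "No relevant learnings" fast-exit edge case.
const categories = probeSolutionCategories();
```

The point of the probe is that a newly created category directory (e.g. one added by Unit 1's first-use creation rule) becomes searchable without touching the agent file again.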
**Patterns to follow:**
- `ce-code-review` shared subagent template (`references/subagent-template.md`) for the new `<work-context>` block shape
- Existing `learnings-researcher.md` grep-first filtering strategy (Step 3) — preserve it, it's already efficient
- `docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md` classification to keep the agent in its pipeline-stage lane
**Test scenarios:**
- Happy path: invoke with a code-bug work-context → returns bug-relevant learnings matching existing behavior
- Happy path: invoke with a skill-design work-context → returns skill-design learnings (previously would have lumped under `best_practice` with weaker matches)
- Edge case: `docs/solutions/` is empty or absent → fast-exit returns "No relevant learnings" without errors
- Edge case: `docs/solutions/patterns/critical-patterns.md` absent → agent proceeds without warning
- Edge case: `<work-context>` has no domain hint → agent falls back to general keyword extraction across all discovered categories
- Integration: converter tests (`tests/codex-writer.test.ts:329` and siblings) still pass — the agent's dispatch string is preserved, only the inner prompt changes
**Verification:**
- Running the agent on a skill-design question returns results pointing to `docs/solutions/skill-design/` entries, not miscategorized matches from `best-practices/`
- The hardcoded category table is gone; the agent probes `docs/solutions/` at invocation time
- Output framing does not say "gotchas to avoid during implementation" or code-bug-biased language
- Missing `critical-patterns.md` does not cause errors or warnings
- Cross-platform converter tests still pass
- **ce-plan-side validation per #14 review feedback:** run ce-plan's existing Phase 1.1 dispatch flow (on any in-repo plan target) against the rewritten agent and verify (a) the agent's output is consumable by ce-plan's current synthesis step without errors, (b) dispatch-string/contract across Codex, Gemini, and Claude converters is preserved, (c) output shape for a representative code-implementation query matches or improves on pre-rewrite relevance. Document the comparison briefly in the PR description so owners of ce-plan can audit the regression check.
---
- [ ] **Unit 3: ce-doc-review subagent template upgrade: framing, classification rule, tier expansion**
**Goal:** Upgrade the shared ce-doc-review subagent template with an observable-consequence-first framing guidance block, a strawman-aware classification rule, consolidated auto-promotion patterns, and the new three-tier `autofix_class` enum aligned with ce-code-review names. This is the single file change that propagates improved output across all 7 personas.
**Requirements:** R1, R2, R3, R4, R5, R6, R7, R8
**Dependencies:** None (parallel to Units 1-2)
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md`
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json` (rename + expand `autofix_class` enum)
- Test: (deferred to Unit 8 — contract test assertion against template structure)
**Approach:**
- Rename and expand `autofix_class` enum in `findings-schema.json` from `[auto, present]` to `[safe_auto, gated_auto, manual]`. Matches ce-code-review's first three tiers exactly. Does not adopt ce-code-review's fourth `advisory` tier — low-confidence FYI findings render as a distinct FYI subsection of the `manual` bucket at the presentation layer (Unit 4 handles that).
- Add tier definitions in the subagent template with one-sentence descriptions and examples per R2–R5. Three tiers: `safe_auto` (apply silently, one clear correct fix); `gated_auto` (concrete fix, user confirms); `manual` (requires user judgment).
- Add a strawman-aware classification rule per R6: "a 'do nothing / accept the defect' option is not a real alternative — it's the failure state the finding describes. Count only alternatives a competent implementer would genuinely weigh." Include a positive/negative example pair.
- **Strawman safeguard per #11 review feedback:** any finding classified `safe_auto` via strawman-dismissal of alternatives must name the dismissed alternatives in `why_it_matters`. When alternatives exist at all (even if reviewer judges them weak), the finding defaults to `gated_auto` (one-click apply in walk-through) rather than silent `safe_auto`. `safe_auto` stays reserved for truly single-option fixes (typo, wrong count, stale cross-reference, missing mechanical step).
- **Persona exclusion of `## Deferred / Open Questions` section per #8 review feedback:** the template instructs personas to exclude any `## Deferred / Open Questions` section and its subheadings from the review scope — those entries are review output from prior rounds, not part of the document being reviewed. Prevents the round-2 feedback loop where personas flag the deferred section as a new finding or quote its text as evidence.
- Consolidate auto-promotion patterns per R7 into an explicit rule set: factually incorrect behavior, missing standard security/reliability controls, codebase-pattern-resolved fixes, framework-native-API substitutions, mechanically-implied completeness additions
- Add framing-guidance block per R8: observable-consequence-first, why-the-fix-works grounding, 2-4 sentence budget, required-field reminder, positive/negative example pair (modeled on `ce-code-review`'s block at `subagent-template.md:51-73`)
- Respect the ~150-line `@` inclusion threshold; if the template exceeds it, switch to a backtick path reference in the SKILL.md (unlikely given current 52-line size + ~40-50 line addition)
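The tier-assignment rule plus the strawman safeguard above amount to something like the following decision logic — a sketch under assumed names (the `Finding` shape and `classify` are hypothetical; the actual rule ships as template prose, not code):

```typescript
type Tier = "safe_auto" | "gated_auto" | "manual";

// Hypothetical finding shape, for illustration only.
interface Finding {
  hasConcreteFix: boolean;
  realAlternatives: string[];     // options a competent implementer would genuinely weigh
  strawmanAlternatives: string[]; // e.g. "accept the defect" — the failure state, not a choice
  requiresJudgment: boolean;
}

// Sketch of the strawman-aware rule (R6) plus the #11 safeguard:
// genuine alternatives — even weak ones — force gated_auto; safe_auto
// stays reserved for truly single-option mechanical fixes. When the
// strawman-dismissal path fires, the dismissed strawmen must be named
// in why_it_matters (enforced in template prose, not shown here).
function classify(f: Finding): Tier {
  if (f.requiresJudgment) return "manual";
  if (!f.hasConcreteFix) return "manual";
  if (f.realAlternatives.length > 0) return "gated_auto";
  return "safe_auto"; // only strawmen were dismissed
}
```

The ordering matters: judgment and fix-concreteness checks come before the alternatives check, so a concrete fix with weak-but-real alternatives lands in `gated_auto` rather than being silently applied.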
**Patterns to follow:**
- `ce-code-review` subagent template (`subagent-template.md:51-73`) framing-guidance block structure
- Existing subagent template `<output-contract>` section for where new rules live
**Test scenarios:**
- Happy path: a persona agent receives the new template and produces findings with one of the three valid `autofix_class` values
- Edge case: a finding whose only alternative is a strawman (e.g., "accept the defect") is classified as `safe_auto`, not `manual`, and names the dismissed strawman in `why_it_matters` per the safeguard
- Edge case: a finding that would previously have been `manual` because "there's more than one way to fix it" is now `gated_auto` when the fix is concrete and the non-primary options are strawmen
- Edge case: an FYI-grade observation (subjective, no decision) gets classified as `manual` but routes to the FYI subsection at the presentation layer because confidence falls below the per-severity gate yet above the FYI floor
- Integration: all 7 personas produce output that validates against the expanded `findings-schema.json` — no schema violations
**Verification:**
- Template includes a framing-guidance block, classification rule, and consolidated auto-promotion patterns
- `findings-schema.json` enum contains exactly the three values `safe_auto`, `gated_auto`, `manual`
- Subagent template stays under ~150 lines and continues to be loaded via `@` inclusion
---
- [ ] **Unit 4: Synthesis pipeline: per-severity gates, tier routing, auto-promotion, state-machine discipline**
**Goal:** Rewrite the synthesis pipeline to route the three new tiers (plus the FYI presentation bucket) correctly, apply per-severity confidence gates, drop residual promotion in favor of cross-persona agreement boost, and make tier/confidence/deferral state transitions explicit (per the git-workflow state-machine learning). This is the load-bearing synthesis change.
**Requirements:** R9, R10, R11 + synthesis updates for R2–R5 tier routing
**Dependencies:** Unit 3 (new `autofix_class` enum must exist before synthesis routes to it)
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md`
- Create: `tests/fixtures/ce-doc-review/seeded-plan.md` — test-fixture plan doc with seeded findings across tier shapes (see validation gate in Approach)
**Approach:**
- Step 3.2 (confidence gate): replace flat 0.50 with per-severity table (P0 ≥0.50, P1 ≥0.60, P2 ≥0.65, P3 ≥0.75). Low-confidence manual findings that don't pass the gate but are above a FYI floor (0.40) surface in an FYI subsection at the presentation layer rather than being dropped — keeps observational value without forcing decisions.
- Step 3.4 (residual promotion): delete. Replaced by a cross-persona agreement boost (+0.10, capped at 1.0) applied after the gate, matching `ce-code-review` stage 5 step 4. Residual concerns surface in Coverage only.
- Step 3.5 (contradictions): keep; adapt terminology for three-tier routing
- Step 3.6 (pattern-resolved promotion): expand per R7's consolidated promotion patterns
- Step 3.7 (route by autofix class): rewrite for three tiers. `safe_auto` → apply silently. `gated_auto` → walk-through with Apply as recommended. `manual` → walk-through with user-judgment framing, or FYI subsection when low-confidence.
- **R30 fix-landed matching predicate per #10 review feedback:** when determining whether a round-2 persona's finding is a re-raise of a round-1 Applied finding at the same location, use the existing dedup fingerprint (`normalize(section) + normalize(title)`) augmented with an evidence-substring overlap check. Section renames count as "different location" — treat as new finding. Specify explicitly in the synthesis step so the implementer doesn't invent a predicate.
- **Validation gate per #3 + #7 review feedback:** before merging Unit 4, run the new synthesis pipeline against two artifacts and log the result in the PR description:
1. **A seeded test-fixture plan doc** — create one under `tests/fixtures/ce-doc-review/seeded-plan.md` with known issues planted across each tier (target seed: ~3 safe_auto candidates, ~3 gated_auto candidates, ~5 manual candidates, ~5 FYI-tier candidates at confidence 0.40–0.65, ~3 drop-worthy P3s at confidence 0.55–0.74). This is the rigorous validation — existing plans in `docs/plans/` have already been through review and would make the pipeline look falsely clean.
2. **The brainstorm doc** (`docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md`), which went through document-review via the OLD pipeline — re-running under the NEW pipeline is a valid before/after comparison.
**Numeric pass criteria (soft, not absolute):**
- Seeded fixture: ≥2 of the 3 seeded safe_auto candidates get applied silently; ≥2 of the 3 seeded gated_auto show up in the walk-through bucket with `(recommended)` Apply; all 3 drop-worthy P3s at 0.55–0.74 get dropped by the per-severity gate; ≥3 of the 5 FYI candidates surface in the FYI subsection; zero false auto-applies on manual-shaped seeds.
- Brainstorm re-run: no P0/P1 findings that the old pipeline applied are regressed (i.e., the new pipeline doesn't drop what the old one kept as important); total user-facing decision count (gated_auto + manual after gate) should be meaningfully lower than the old pipeline produced.
If a seed classification fires outside its intended tier, investigate before merging — may indicate threshold or strawman-rule calibration issue.
- Add explicit state-machine narration referencing the diagram in High-Level Technical Design. Every transition ("Raised → Classified," "Classified → SafeAuto," etc.) is a named step in synthesis prose, not an implied carry-forward.
- **Headless envelope extension lands here per #12 sequencing fix:** this unit is the first to produce `gated_auto` findings in headless mode, so the envelope must support the new bucket headers before shipping. Extend `synthesis-and-presentation.md:93-119` headless output to include `Gated-auto findings` and `FYI observations` sections alongside existing `Applied N safe_auto fixes` count and `Manual findings` section. Preserves existing bucket structure so callers that only read the old buckets continue to work (forward-compat; ce-brainstorm and ce-plan surface P0/P1 residuals adjacent to menus, unchanged). Unit 8 adds the contract test for this envelope later.
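The R30 fix-landed matching predicate described above can be sketched as follows. This is illustrative, not the implementation — the plan deliberately specifies the predicate in synthesis prose, and the `normalize` helper here (lowercase, strip punctuation, collapse whitespace) is an assumed normalization the implementer may tighten:

```typescript
// Hypothetical normalizer — the plan leaves exact normalization to the
// synthesis step: lowercase, strip punctuation, collapse whitespace.
function normalize(s: string): string {
  return s.toLowerCase().replace(/[^\w\s]/g, "").replace(/\s+/g, " ").trim();
}

interface Finding {
  section: string;
  title: string;
  evidence: string;
}

// Dedup fingerprint: normalized section + title. A section rename changes
// the fingerprint, so the finding counts as new, per the rule above.
function fingerprint(f: Finding): string {
  return `${normalize(f.section)}::${normalize(f.title)}`;
}

// Evidence-substring overlap check: one normalized evidence string
// contains the other.
function evidenceOverlaps(a: Finding, b: Finding): boolean {
  const ea = normalize(a.evidence);
  const eb = normalize(b.evidence);
  return ea.length > 0 && eb.length > 0 && (ea.includes(eb) || eb.includes(ea));
}

// R30 predicate: is a round-2 finding a re-raise of a round-1 Applied
// finding at the same location?
function isReRaise(round2: Finding, applied: Finding): boolean {
  return fingerprint(round2) === fingerprint(applied) && evidenceOverlaps(round2, applied);
}
```

Note the asymmetry this encodes: fuzzy on evidence wording (substring overlap survives minor rephrasing) but strict on location (any section rename defeats the fingerprint match).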
**Patterns to follow:**
- `ce-code-review` stage 5 merge pipeline (`SKILL.md:456-484`) for confidence-gate, dedup, cross-reviewer agreement boost structure
- Existing `synthesis-and-presentation.md` step numbering — preserve step IDs to avoid churning cross-references
**Test scenarios:**
- Happy path: a P3 finding at 0.60 confidence is dropped by the per-severity gate (under the current 0.50 flat gate it would survive)
- Happy path: a P0 finding at 0.52 confidence survives the gate
- Happy path: two personas flagging the same issue get a +0.10 boost, lifting one from 0.55 (below P1 gate) to 0.65 (above)
- Edge case: a low-confidence `manual` finding at 0.45 (above the 0.40 FYI floor, below the severity gate) surfaces in the FYI subsection, not dropped
- Edge case: a `gated_auto` finding with only strawman alternatives gets auto-promoted to `safe_auto` per R7 consolidated patterns — but if ANY alternatives exist (even weak), defaults to `gated_auto` per the strawman safeguard
- Edge case: contradiction handling — two personas with opposing actions on the same finding route to `manual` with both perspectives
- Integration: routing the calibration-example case from the origin document (14 findings → 4 manual + 3 gated_auto + 1 safe_auto + 1 FYI + 5 dropped) produces a reasonable bucket distribution
- Integration: seeded-fixture test (`tests/fixtures/ce-doc-review/seeded-plan.md`) meets the numeric pass criteria in the Approach section — seeded safe_auto/gated_auto/manual/FYI candidates land in their intended tiers; drop-worthy P3s are dropped; no false-auto on manual-shaped seeds
- Integration: brainstorm-doc re-run (`docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md`) shows meaningful decision-count reduction without regressing previously-applied P0/P1 fixes
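The per-severity gate, agreement boost, and FYI floor exercised by the scenarios above can be sketched as one routing function. The threshold values here are assumptions read off the scenarios and the Risks table (P0 near 0.50, P1 near 0.60, P3 at 0.75, FYI floor 0.40) plus a guessed P2 value — the real numbers are a calibration decision:

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";
type Tier = "safe_auto" | "gated_auto" | "manual";
type Route = "kept" | "fyi" | "dropped";

// Illustrative per-severity gates; exact values are a calibration choice.
const GATE: Record<Severity, number> = { P0: 0.5, P1: 0.6, P2: 0.65, P3: 0.75 };
const FYI_FLOOR = 0.4;
const AGREEMENT_BOOST = 0.1; // cross-persona agreement lifts confidence

function routeFinding(
  severity: Severity,
  confidence: number,
  personaCount: number,
  tier: Tier,
): Route {
  const boosted = personaCount >= 2 ? confidence + AGREEMENT_BOOST : confidence;
  if (boosted >= GATE[severity]) return "kept";
  // Low-confidence manual findings above the FYI floor surface as FYI
  // observations rather than being dropped.
  if (tier === "manual" && boosted >= FYI_FLOOR) return "fyi";
  return "dropped";
}
```

The agreement boost replaces the removed residual-promotion step: agreement raises confidence before the gate rather than resurrecting findings after it.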
**Verification:**
- Confidence gate is per-severity, documented in step 3.2 of synthesis
- Residual-promotion step is removed; cross-persona agreement boost is its replacement
- Each state transition in the finding lifecycle has a named synthesis step
- Routing the origin document's real-world example reproduces the expected bucket split (14 findings → 4 manual + 3 gated_auto + 1 safe_auto + 1 FYI + 5 dropped)
---
- [ ] **Unit 5: Interaction model: routing question + per-finding walk-through + bulk preview**
**Goal:** Port the per-finding walk-through, bulk preview, and four-option routing question from `ce-code-review`. Adapt for ce-doc-review (no batch fixer, no tracker integration). This is the biggest behavioral change and where most of the user-visible UX improvement lives.
**Requirements:** R12, R13, R14, R15, R16, R26
**Dependencies:** Unit 4 (routing uses new tiers and confidence-gated finding set)
**Files:**
- Create: `plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md`
- Create: `plugins/compound-engineering/skills/ce-doc-review/references/bulk-preview.md`
- Modify: `plugins/compound-engineering/skills/ce-doc-review/SKILL.md` (add Interactive mode rules section at top, AskUserQuestion pre-load directive)
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` (Phase 4 presentation hands off to walkthrough.md; add routing question stage)
**Approach:**
- Add an "Interactive mode rules" section at the top of `SKILL.md` modeled on `ce-code-review/SKILL.md:73-77`. Include the `AskUserQuestion` pre-load directive and the numbered-list fallback rule.
- Create `walkthrough.md` by adapting `ce-code-review/references/walkthrough.md`. **Tier alignment:** ce-doc-review uses the first three ce-code-review tier names verbatim — `safe_auto`, `gated_auto`, `manual` — so no rename in the port. ce-code-review's fourth `advisory` tier has no ce-doc-review equivalent in the walk-through; advisory-style findings render in a presentation-layer FYI subsection (Unit 4's concern), not as a walk-through option. **Keep from the source:** terminal-block + question split, four-option menu shape (Apply / Defer / Skip / LFG-the-rest), `(recommended)` marker, LFG-the-rest escape, N=1 adaptation, unified completion report, post-tie-break recommendation rendering. **Remove from the source:** fixer-subagent-batch-dispatch (ce-doc-review has no batch fixer per scope boundary), `[TRACKER]` label substitution logic, tracker-detection tuple (`named_sink_available`, `any_sink_available`, confidence-based label substitution), render-time Defer→Skip remap on `any_sink_available: false`, `.context/compound-engineering/ce-code-review/{run_id}/{reviewer_name}.json` artifact-lookup paths (ce-doc-review's agents don't write run artifacts), advisory-variant `Acknowledge` option (no advisory tier here). **Replace:** "file a tracker ticket" → "append to Open Questions section" (Unit 6 implements the append mechanic).
- Create `bulk-preview.md` by adapting `ce-code-review/references/bulk-preview.md`: keep the grouped buckets, Proceed/Cancel options, scope-summary header. Adapt bucket labels (`Applying (N):`, `Appending to Open Questions (N):`, `Skipping (N):`). Drop the `Acknowledging (N):` bucket — no advisory tier means no Acknowledge action. Remove tracker-label substitution.
- Update `synthesis-and-presentation.md` Phase 4: after auto-fixes are applied, route to the new routing question (if any `gated_auto` / `manual` findings remain). Load `walkthrough.md` for option A, `bulk-preview.md` for options B and C. Option D = report only. Use R16 tie-break order (`Skip > Defer > Apply > Acknowledge`) for per-finding recommendations.
**Execution note:** Port the Interactive Question Tool Design rules verbatim from AGENTS.md — third-person voice, front-loaded distinguishing words, ≤4 options. Verify each menu's labels at the rendering surface during implementation; harness label truncation is a known failure mode (ce-pipeline learnings doc §5).
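The R16 tie-break used for per-finding recommendations can be sketched as a fixed-priority pick. This assumes the order `Skip > Defer > Apply > Acknowledge` means "earlier entry wins when personas disagree" — the sketch, not the skill's prose spec:

```typescript
type Action = "Skip" | "Defer" | "Apply" | "Acknowledge";

// R16 tie-break order: when personas recommend conflicting actions for the
// same finding, the earliest entry in this list becomes the `(recommended)`
// marker in the walk-through.
const TIE_BREAK_ORDER: Action[] = ["Skip", "Defer", "Apply", "Acknowledge"];

function recommend(actions: Action[]): Action {
  for (const a of TIE_BREAK_ORDER) {
    if (actions.includes(a)) return a;
  }
  throw new Error("no persona recommendation provided");
}
```

For the R15-conflict scenario (one persona says Apply, another says Skip), this yields Skip, matching the integration scenario below; the disagreement itself still surfaces as a context line in the question stem.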
**Patterns to follow:**
- `ce-code-review/references/walkthrough.md` — structural template
- `ce-code-review/references/bulk-preview.md` — structural template
- `ce-code-review/SKILL.md:73-77` — Interactive mode rules section
- `plugins/compound-engineering/AGENTS.md` → "Interactive Question Tool Design" section — menu design rules
- The state machine in High-Level Technical Design above
**Test scenarios:**
- Happy path: 3 `gated_auto` findings + 1 `manual` finding → routing question offers all 4 options; picking A enters walk-through; each finding presented one at a time with recommended action marked
- Happy path: N=1 (exactly one pending finding) → walk-through wording drops "Finding N of M," LFG-the-rest option suppressed
- Happy path: user picks LFG-the-rest at finding 2 of 8 → bulk preview scoped to findings 3-8, header notes "2 already decided"
    - Edge case: all findings are low-confidence `manual` (FYI subsection only) → routing question skipped (no `gated_auto` or above-gate `manual` findings remain), flows to terminal question with no walk-through; FYI content still renders in the report
- Edge case: bulk-preview Cancel from LFG-the-rest returns to the current finding, not to the routing question
- Edge case: routing Cancel from option B / C returns to the routing question with no side effects
- Integration: recommendation tie-break (R16) — two personas flag the same finding with conflicting actions (Apply / Skip); walk-through marks the post-tie-break value (Skip) with `(recommended)`; R15-conflict context line surfaces the disagreement in the question stem
**Verification:**
- `walkthrough.md` and `bulk-preview.md` exist with adapted content
- SKILL.md has an Interactive mode rules section with AskUserQuestion pre-load
- Synthesis Phase 4 routes to the walkthrough/bulk-preview references after auto-fixes
- Menus pass the Interactive Question Tool Design review (third-person, ≤4 options, self-contained labels)
---
- [ ] **Unit 6: In-doc Open Questions deferral + append mechanic**
**Goal:** Implement the Defer action's in-doc append mechanic. When a user chooses Defer on a finding, append an entry to a `## Deferred / Open Questions` section at the end of the document under review.
**Requirements:** R20, R21, R22
**Dependencies:** Unit 5 (walk-through's Defer option is where this fires)
**Files:**
- Create: `plugins/compound-engineering/skills/ce-doc-review/references/open-questions-defer.md`
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md` (reference the Defer mechanic from Unit 5's walkthrough.md)
**Approach:**
- Create `open-questions-defer.md` describing:
- Detection: does the doc already have a `## Deferred / Open Questions` section at the end?
- Heading creation if absent
- Subsection structure: `### From YYYY-MM-DD review` (timestamped to the review invocation — creates per-review grouping even when run multiple times on the same doc)
- Entry format per R21: title, severity, reviewer attribution, confidence, `why_it_matters` framing. Excludes `suggested_fix` and `evidence` (those live in the run artifact if one exists; pointer to run artifact included when relevant)
- Append location: end of doc, after existing content. If the doc has a trailing horizontal rule or separator, add above it to avoid visual drift.
- Failure handling per R22: document is read-only / path issue / write failure → surface inline with Retry / Fallback-to-completion-report-only / Convert-to-Skip sub-question. No silent failure.
- Walkthrough.md references this file when the user picks Defer; the walkthrough itself doesn't reimplement the append logic.
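The append mechanic above can be sketched in code, though the shipped artifact is prose instructions in `open-questions-defer.md`, not an implementation. This sketch covers heading detection (including a mid-doc heading), timestamped subsection grouping, and title-idempotency; it deliberately omits the frontmatter-only and trailing-separator paths:

```typescript
const HEADING = "## Deferred / Open Questions";

// Append one deferred finding entry. Simplifications vs. the spec above:
// frontmatter-only docs and trailing-separator placement are not handled.
function appendDeferred(doc: string, reviewDate: string, title: string, why: string): string {
  const sub = `### From ${reviewDate} review`;
  const entry = `- **${title}**: ${why}`;
  let text = doc;
  if (!text.includes(HEADING)) {
    // Heading absent: create it at the end of the doc.
    text = `${text.trimEnd()}\n\n${HEADING}\n`;
  }
  if (!text.includes(sub)) {
    // New per-review subsection directly under the heading, wherever it
    // lives — a mid-doc heading is reused, never duplicated at the end.
    text = text.replace(HEADING, `${HEADING}\n\n${sub}`);
  }
  // The subsection runs until the next ## / ### heading or end of doc.
  const afterSub = text.indexOf(sub) + sub.length;
  const ends = ["\n## ", "\n### "]
    .map((m) => text.indexOf(m, afterSub))
    .filter((i) => i !== -1);
  const subEnd = ends.length ? Math.min(...ends) : text.length;
  // Idempotent on title: a same-day re-defer of the same finding is a no-op.
  if (text.slice(afterSub, subEnd).includes(`**${title}**`)) return text;
  return `${text.slice(0, subEnd).trimEnd()}\n${entry}\n${text.slice(subEnd)}`;
}
```

The entry body here is a placeholder; the real entry carries the R21 fields (title, severity, reviewer attribution, confidence, `why_it_matters`).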
**Patterns to follow:**
    - `ce-code-review/references/tracker-defer.md` — **only** for the failure-path sub-question structure (Retry / Fallback / Convert to Skip). Do not carry over tracker-detection, sink-availability, or label-substitution logic — none apply to in-doc append.
**Test scenarios:**
- Happy path: doc has no Open Questions section → append creates the `## Deferred / Open Questions` heading and a `### From YYYY-MM-DD review` subsection with the deferred entry
- Happy path: doc already has the Open Questions section at the end → append adds under a new `### From YYYY-MM-DD review` subsection (keep prior review entries distinguishable)
- Happy path: two Defer actions in the same review session → both entries land under the same `### From YYYY-MM-DD review` subsection
- **Shadow path (mid-doc heading) per #13 review feedback:** doc has a `## Deferred / Open Questions` heading somewhere in the middle (not the end) → append finds it and lands under it at its existing location, does not create a duplicate section at the end
- **Shadow path (same-title collision) per #13:** round 2 within the same day defers a finding whose title matches an existing round-1 entry under the same `### From YYYY-MM-DD review` subsection → append is idempotent on title (skip duplicate entry), records the no-op in the completion report
- **Shadow path (frontmatter-only doc):** doc has frontmatter and no body content → append creates the heading after the frontmatter block, not at byte 0
    - **Shadow path (concurrent editor writes):** the append re-reads the doc from disk immediately before writing to shrink the window for user-in-editor concurrent-write collisions; it logs mtime before and after the append and aborts + surfaces a retry if the file changed during the write
- Edge case: doc is read-only → append fails, user is offered Retry / Fall-back-to-report-only / Convert-to-Skip; Convert-to-Skip records the Skip reason in the completion report
- Edge case: doc has a trailing `---` or other separator → append lands above it
- Integration: deferred entries from a walk-through round 1 are visible in the doc when round 2 runs; the decision primer (Unit 7) correctly identifies them as prior-round decisions; personas exclude the section from review scope per the subagent template instruction (Unit 3)
**Verification:**
- `open-questions-defer.md` exists and describes the append mechanic + failure handling
- Walk-through's Defer option invokes the mechanic correctly
- Deferred findings accumulate under timestamped subsections across reviews
- No silent failures — every failure surfaces inline with user-actionable options
---
- [ ] **Unit 7: Terminal question restructure + multi-round decision memory**
**Goal:** Replace the current Phase 5 binary question (`Refine — re-review` / `Review complete`) with a three-option terminal question that separates "apply decisions" from "re-review," and introduce the multi-round decision primer that carries prior-round decisions into subsequent rounds.
**Requirements:** R17, R18, R19, R28, R29, R30
**Dependencies:** Unit 5 (walkthrough captures the decisions the primer carries forward), Unit 6 (Defer decisions contribute to the primer)
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-doc-review/SKILL.md` (Phase 2 dispatch passes cumulative primer)
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` (Phase 5 terminal question + R29 suppression rule + R30 fix-landed verification)
- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` (persona instructions to honor the primer — suppress re-raising rejected findings, respect fix-landed verification context)
**Approach:**
- Replace Phase 5 terminal question with three options per R17: `Apply decisions and proceed to <next stage>` (default / recommended when fixes were applied), `Apply decisions and re-review`, `Exit without further action`. The `<next stage>` text uses the document type: `ce-plan` for requirements docs, `ce-work` for plan docs.
- R19 zero-actionable-findings degenerate case: skip option B (re-review), offer only "Proceed to <next stage>" / "Exit." **Label adapts:** when there are no decisions to apply (zero-actionable case, or a routing path where every finding was Acknowledge/Skip), drop the "Apply decisions and" prefix — the label should match what the system is doing. Only when at least one Apply was queued does the label remain "Apply decisions and proceed to <next stage>".
- R18 rendering rule: terminal question is distinct from mid-flow routing question. Don't merge them.
    - Multi-round decision primer per R28–R30:
- The orchestrator maintains an in-memory decision list across rounds within a single session (rejected findings with title/evidence/reason; applied findings with title/section)
- Passed to every persona in round 2+ as part of the subagent template variable bindings
- **Primer structure per #9 review feedback:** the primer is a single text block injected into `{decision_primer}` slot at the top of the `<review-context>` block. Shape:
```
<prior-decisions>
Round 1 — applied (N entries):
- {section}: "{title}" ({reviewer}, {confidence})
Round 1 — rejected (N entries):
- {section}: "{title}" — {Skipped|Deferred} because {reason or "no reason provided"}
</prior-decisions>
```
Round 1 (no primer) renders as an empty `<prior-decisions>` block or omits the block entirely — implementation-detail choice driven by which reads better to personas during calibration. The subagent template gets a matching `{decision_primer}` slot.
- Persona-level suppression rule per R29: don't re-raise a finding whose title and evidence pattern-match a rejected finding unless current doc state makes the concern materially different
- Synthesis-level fix-landed verification per R30: for each applied finding, confirm the specific issue no longer appears at the referenced section. If a persona re-surfaces the same finding at the same location (same section fingerprint + evidence-substring overlap per Unit 4's matching predicate), flag as "fix did not land" in the report rather than treating it as new.
- **Caller-context handling per #6 review feedback:** interactive-mode nested invocations (ce-brainstorm → ce-doc-review, ce-plan → ce-doc-review) rely on the model reading conversation context to interpret the terminal question correctly, rather than an explicit `nested:true` argument. Rationale: the caller's conversation is visible to the sub-skill's orchestrator, so when the user picks "Proceed to <next stage>" from inside ce-plan's 5.3.8 flow, the agent does not fire a nested `/ce-plan` dispatch — it returns control to the caller's flow which continues its own logic. When invoked standalone, "Proceed to <next stage>" dispatches the appropriate next skill. `mode:headless` stays explicit because headless is deterministic programmatic behavior, but interactive-mode caller-context is handled by model orchestration. **No caller-side change required for ce-brainstorm or ce-plan.** If this implicit handling proves unreliable in practice, add an explicit `nested:true` flag as a follow-up.
- Cross-session persistence is out of scope per the scope boundary.
**Execution note:** Model the decision-primer flow as part of the state machine diagram. Every round-2-persona-dispatch transition explicitly reads from the primer — this is not a prose-level "personas should remember" assumption.
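The primer rendering described above can be sketched as a pure function over the session's decision list. The `Decision` shape is a hypothetical in-memory record (the plan only specifies the rendered text); this sketch takes the "empty block" option for round 1 rather than omitting the block:

```typescript
interface Decision {
  section: string;
  title: string;
  reviewer?: string;
  confidence?: number;
  outcome: "applied" | "skipped" | "deferred";
  reason?: string;
}

// Render the <prior-decisions> block injected into the {decision_primer}
// slot. Round 1 (no decisions yet) renders an empty block — one of the two
// implementation-detail choices the plan leaves to calibration.
function renderPrimer(priorRound: number, decisions: Decision[]): string {
  const lines = ["<prior-decisions>"];
  if (decisions.length > 0) {
    const applied = decisions.filter((d) => d.outcome === "applied");
    const rejected = decisions.filter((d) => d.outcome !== "applied");
    lines.push(`Round ${priorRound} — applied (${applied.length} entries):`);
    for (const d of applied) {
      lines.push(`- ${d.section}: "${d.title}" (${d.reviewer ?? "unknown"}, ${d.confidence ?? "n/a"})`);
    }
    lines.push(`Round ${priorRound} — rejected (${rejected.length} entries):`);
    for (const d of rejected) {
      // Defer counts as rejection for suppression purposes (per R29 scenarios).
      const verb = d.outcome === "deferred" ? "Deferred" : "Skipped";
      lines.push(`- ${d.section}: "${d.title}" — ${verb} because ${d.reason ?? "no reason provided"}`);
    }
  }
  lines.push("</prior-decisions>");
  return lines.join("\n");
}
```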
**Patterns to follow:**
- `ce-code-review/SKILL.md` Step 5 final-next-steps for the mode-driven terminal question structure (but adapt PR/push verbs to pipeline-stage verbs)
- State machine diagram in High-Level Technical Design — every prior-round-decision transition is named
**Test scenarios:**
- Happy path: round 1 user applies 2 findings and skips 1; round 2 persona re-raises the skipped finding → synthesis drops it per R29 with a note in Coverage
- Happy path: round 1 user applies a finding; round 2 persona does NOT re-raise it (fix self-suppressed because evidence is gone) → synthesis reports "fix verified"
- Happy path: round 1 user applies a finding; round 2 persona re-raises it at the same location (fix didn't actually land) → synthesis flags "fix did not land" in the final report
- Happy path: terminal question after round 1 with fixes applied → three options; user picks "Apply and proceed" → hand off to ce-plan or ce-work
- Edge case: zero actionable findings after auto-fixes → terminal question has 2 options (re-review suppressed)
- Edge case: user deferred a finding in round 1 (R22); round 2 persona re-raises same concern → suppressed per R29 (defer counts as rejection for suppression purposes)
    - Edge case: re-review triggered → round 2 decision primer includes all of round 1's decisions; flow re-enters Phase 2 dispatch with primer passed to personas
- Integration: multi-round primer state is in-memory; exiting the session discards it; starting a new session on the same doc is a fresh round 1
**Verification:**
- Terminal question has three options (or two in the zero-actionable case)
- Round-2 dispatch passes the cumulative primer to every persona
- R29 suppression drops re-raised rejected findings with Coverage note
- R30 fix-landed verification flags fixes that didn't land
- Cross-session persistence is not implemented (verified by the boundary)
---
- [ ] **Unit 8: Framing polish + contract test extension**
    **Goal:** Apply framing quality rules (R23–R25) uniformly across all user-facing surfaces that weren't already updated by Units 3–7, and extend `pipeline-review-contract.test.ts` to lock in the new-tier envelope shape. (The headless-envelope extension itself moves earlier per the #12 sequencing fix — see below.)
**Requirements:** R23, R24, R25, R27 *(R27's envelope shape landed in Unit 4 per the sequencing fix; this unit adds the contract test for it)*
**Dependencies:** Units 3, 4, 5, 6, 7
**Files:**
    - Modify: `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` (R23–R25 framing rules applied across output surfaces)
- Modify: `tests/pipeline-review-contract.test.ts` (extend to assert new tiers appear distinctly in headless envelope)
- Consider: `tests/ce-doc-review-contract.test.ts` (new) if assertions don't fit cleanly in pipeline-review-contract — decide during implementation
**Approach:**
    - **R23–R25 framing rules:** applied at every user-facing surface — walk-through terminal blocks, bulk-preview lines, Open Questions entries, headless envelope. Observable-consequence-first, why-the-fix-works grounding, 2–4-sentence budget. Because the framing-guidance block at the subagent template (Unit 3) already shapes persona output at the source, this pass is about ensuring the presentation surfaces carry the framing forward without dilution (e.g., the walk-through's terminal block shouldn't re-wrap the persona's `why_it_matters` in code-structure-first prose).
    - **Test extension:** `pipeline-review-contract.test.ts:279-352` currently asserts `mode:headless` invocation from ce-brainstorm and ce-plan. Extend to assert the new tiers appear distinctly in the headless output without breaking existing pattern matches. Structural assertions only — do not lock exact prose, so future wording improvements don't ossify the test. No `nested:true` assertion: Unit 7 handles nested invocation via caller conversation context rather than an explicit flag; if that flag ships as a follow-up, add its assertion then.
- No "Past Solutions" section in output — learnings-researcher is not invoked from ce-doc-review (see Key Technical Decisions).
- **Sequencing per #12 review feedback:** the actual headless envelope extension (new tier bucket headers) lands in Unit 4's PR, not this unit. Rationale: Unit 4 is the first unit that produces non-`safe_auto` / non-`manual` findings in headless mode. If Unit 4 ships before the envelope is updated, callers (ce-plan in `mode:headless`) would see `gated_auto` findings demoted into legacy buckets or emitted in a shape callers can't parse. Landing the envelope change with Unit 4 keeps each phase independently consumable.
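The "structural assertions only" rule can be sketched as regex checks on section headers rather than prose matches. The section-header strings are taken from the envelope bucket names in this plan; the envelope text in the test is a hypothetical example, not a captured fixture:

```typescript
// Structural-only tier assertions for the headless envelope: match section
// headers and their relative order, never exact prose, so wording can
// evolve without ossifying the contract test.
const TIER_SECTIONS: RegExp[] = [
  /^Applied \d+ safe_auto fixes/m,
  /^Gated-auto findings/m,
  /^Manual findings/m,
];

function tierSectionsInOrder(envelope: string): boolean {
  const idx = TIER_SECTIONS.map((re) => envelope.search(re));
  return idx.every((i) => i !== -1) && idx.every((i, n) => n === 0 || i > idx[n - 1]);
}
```

In the real `bun test` suite these would be `expect(envelope).toMatch(...)` assertions; the helper form above just makes the order check explicit.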
**Patterns to follow:**
- `ce-code-review` headless envelope (`SKILL.md:510-572`) structure — grouped by `autofix_class`, metadata header, per-finding detail lines
- Existing ce-doc-review headless output in `synthesis-and-presentation.md:93-119`
**Test scenarios:**
- Happy path: headless mode run with findings across all 3 tiers → envelope contains distinct `Applied N safe_auto fixes` count + `Gated-auto findings` + `Manual findings` sections (+ `FYI observations` subsection when present) in that order
- Happy path: headless mode with only safe_auto fixes applied → envelope shows the count and omits the finding-type sections
- Happy path: headless mode with zero findings at all → envelope collapses to "Review complete (headless mode). No findings."
- Edge case: headless mode with only FYI-subsection content → envelope shows the subsection only, no decision-requiring buckets
- Integration: ce-plan phase 5.3.8 headless invocation continues to work with new envelope; new tier sections are visible to the caller for residual-P0/P1 surfacing decisions (`plan-handoff.md:13`)
    - Integration: nested invocation behavior per Unit 7 — when invoked from inside a caller's flow, the terminal question's "Proceed to <next stage>" returns control to the caller instead of dispatching a nested skill; when standalone, it dispatches the next stage
- Integration: framing of a single finding is consistent across walk-through terminal block, bulk-preview line, Open Questions append entry, and headless envelope — verify by reviewing a test fixture doc's output at all four surfaces
**Verification:**
    - All user-facing surfaces meet the R23–R25 framing bar
    - Pipeline contract test extended and passing (covers the new-tier envelope; nested-invocation behavior follows Unit 7's caller-context handling, with no `nested:true` flag to assert)
- No learnings-researcher dispatch code in ce-doc-review (verified by grep)
## System-Wide Impact
- **Interaction graph:** `ce-brainstorm` Phase 3.5 + Phase 4 handoff re-review paths, `ce-plan` Phase 5.3.8 + 5.4 post-generation menu, LFG/SLFG pipeline invocations, direct user invocation. All consume `"Review complete"` terminal signal — unchanged by this work. **No caller-side diff required:** the terminal question's "Proceed to <next stage>" hand-off is interpreted contextually by the agent from the visible conversation state — when invoked from inside another skill's flow, it returns control to the caller; when standalone, it dispatches the next stage. If implicit handling proves unreliable, add an explicit `nested:true` token as a follow-up.
- **Error propagation:** Append failures in Defer (Unit 6) must surface inline with retry/fallback/skip options. Headless mode failures (e.g., a persona times out) must return partial results with Coverage note, never block the whole review.
- **State lifecycle risks:** Multi-round decision primer (Unit 7) is in-memory only. User exits mid-session → primer discarded → next session is fresh round 1. In-doc Open Questions mutations (Unit 6) persist on disk — re-running ce-doc-review on a modified doc sees those mutations as part of doc state.
- **API surface parity:** Headless envelope (R27) is the machine-readable contract. Adding new tiers changes envelope content but not the terminal signal or the `mode:headless` invocation shape. Backward-compatible for existing callers; forward-compatible requires callers to handle new tier sections (ce-brainstorm and ce-plan both currently surface P0/P1 residuals adjacent to menus — no change needed for that behavior).
- **Integration coverage:** Mocked tests won't prove cross-layer behavior — end-to-end tests running a realistic plan doc through ce-plan's 5.3.8 headless invocation flow catch tier-envelope compatibility issues.
- **Unchanged invariants:**
- Persona activation/selection logic (the 7 persona files' conditional triggers)
- `"Review complete"` terminal signal for callers
- Headless mode's structural contract (mutate-then-return with structured text; callers own routing)
- Cross-platform converter behavior (OpenCode 3-segment name rewrite, dispatch-string preservation)
- `ce-code-review` itself — this plan touches ce-doc-review only, not ce-code-review
## Alternative Approaches Considered
- **Ship as `ce-doc-review-beta` parallel skill.** The learnings-researcher recommended this path given ce-doc-review is chained into brainstorm→plan flows. **Rejected** because the overhaul is phaseable; each phase's blast radius is bounded (Units 1-2 don't touch ce-doc-review's contract at all; Units 3-7 preserve the headless envelope per R27); and beta forking would double the surface area (two skill directories, mirrored references, promotion PR needed). A phased single-track ship carries less risk-per-week and delivers user value earlier. If a phase later proves riskier than anticipated, fork to beta at that point rather than upfront.
- **Minimal `review-time` mode flag on learnings-researcher instead of domain-agnostic rewrite.** A smaller patch: add a `review-time` invocation context hint that adapts keyword extraction and output framing. **Rejected** because it accumulates special cases rather than fixing the root mismatch. `ce-compound` and `ce-compound-refresh` already capture non-code learnings; the agent's taxonomy should reflect that. A full rewrite removes the drift; a mode flag ossifies it.
- **Dispatch learnings-researcher from ce-doc-review (original R31R35).** Considered as always-on dispatch, then as conditional dispatch (skip when ce-plan is the caller). **Both rejected.** The agent is ce-plan-owned (implementation-context per `research-agent-pipeline-separation-2026-04-05.md`); running it from ce-doc-review is a pipeline violation in the ce-brainstorm and standalone contexts and a redundant dispatch in the ce-plan context. Conditional-dispatch added "is the caller ce-plan?" detection logic that was fragile and solved a problem better avoided. Users who want institutional memory for a doc can invoke `/ce-plan`, where the lookup is a first-class pipeline stage. Keeping the dispatch out of ce-doc-review entirely preserves clean pipeline-stage ownership and removes complexity.
- **Add `learning_category` field orthogonal to `problem_type`.** A cleaner long-term schema split, but requires migrating every existing entry and teaching authors to pick both. **Rejected** in favor of enum expansion — minimal migration, keeps authoring flow stable, absorbs the `best_practice` overflow directly.
- **Pass a diff in multi-round decision primer.** Would give personas before/after comparison for each round. **Rejected** — fixed findings self-suppress (evidence gone), regressions surface as normal current-state findings, rejected findings are handled by pattern-match suppression. The diff adds prompt weight without changing what the agent can detect.
## Risks & Dependencies
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Caller flows break because the headless envelope changes shape | Low | High | R27 preserves existing envelope structure; extend `pipeline-review-contract.test.ts` in Unit 8 to assert new tiers appear distinctly without breaking existing match patterns; run ce-brainstorm and ce-plan end-to-end against the updated skill before merge |
| Strawman-aware classification rule (R6) is too aggressive, auto-applying fixes users want to review | Medium | Medium | Framing-guidance block includes a conservative positive/negative example pair; tiers preserve user control via `gated_auto` walk-through for anything with a concrete fix that changes doc meaning; calibration against the origin document's real-world example is a required validation step |
| Per-severity confidence gates drop genuinely valuable P3 findings | Low | Low | P3 gate at 0.75 is conservative; the FYI floor (0.40) on low-confidence `manual` findings keeps genuinely-noteworthy observations surfacing below the gate; if real-world calibration shows drops, the threshold is a single number change |
| Multi-round primer re-raises the same findings because personas don't reliably suppress | Medium | Medium | Synthesis-level enforcement (R29) is authoritative — orchestrator drops re-raised rejected findings regardless of whether the persona suppressed. Persona-level suppression is the hint; orchestrator is the gate. |
| Walk-through UX friction at high finding counts despite `LFG the rest` escape | Low | Medium | Walk-through's LFG-the-rest option bounds friction after initial calibration; bulk-preview Proceed gives an atomic commit point; N=1 adaptation handles the degenerate case cleanly |
| Duplicate schema files in ce-compound / ce-compound-refresh drift | Low | High | Unit 1 explicitly updates both in the same commit; future divergence detection is a follow-up test opportunity (deferred item) |
| learnings-researcher rewrite regresses ce-plan's existing usage | Medium | High | Unit 2 execution note requires sampling 3-5 real invocations before merge; cross-platform converter tests assert dispatch-string preservation; `<work-context>` is additive, callers with old calling conventions continue to work because the agent probes for structured input and falls back to free-form description when absent |
| Dynamic category probe hits a weird repo with unexpected directory structure | Low | Low | Probe falls through to "no categories detected, do broad search across docs/solutions/" — this is already the agent's current behavior when the hardcoded table misses |
## Documentation / Operational Notes
- No additional runtime infrastructure — this is a plugin skill change with no user data, no external APIs.
- After Unit 1 lands, existing authors using `ce-compound` will see new enum options in the steering language; authors writing new solution docs can pick the narrower value immediately.
- After Unit 2 lands, `/ce-plan` users will see the agent's output reflect the broader taxonomy (non-code learnings surfacing more appropriately).
- After Units 5-7 land, interactive ce-doc-review users will see the new routing question, walk-through, and terminal question on their next review. The flow mirrors the `ce-code-review` experience users already have — low learning-curve.
- The `plugins/compound-engineering/README.md` reference-file counts table will need an update once the new `references/` files land in Units 5-6. `bun run release:validate` catches drift.
- AGENTS.md discoverability updates (Unit 1) need to include the four new `problem_type` values so agents reading AGENTS.md know the narrower categories are available.
## Phased Delivery
Each unit can ship as its own PR. Recommended sequence:
### Phase 1 — Foundation (Units 1, 2)
- Unit 1 (enum expansion + migration)
- Unit 2 (learnings-researcher rewrite)
These are independently valuable and low-risk. They benefit `/ce-plan`'s existing usage even before ce-doc-review changes land.
### Phase 2 — Classification + Synthesis (Units 3, 4)
- Unit 3 (subagent template upgrade + findings-schema tier expansion)
- Unit 4 (synthesis pipeline per-severity gates + tier routing)
Depends on Unit 1's enum values being available (not Unit 2 — that's a parallel Phase 1 deliverable for ce-plan). Within Phase 2, Unit 3 must complete before Unit 4 because Unit 4's synthesis routing depends on Unit 3's tier definitions. Changes ce-doc-review's internal shape but preserves the headless envelope contract.
### Phase 3 — Interaction Model (Units 5, 6, 7)
- Unit 5 (routing question + walk-through + bulk preview)
- Unit 6 (in-doc Open Questions deferral)
- Unit 7 (terminal question + multi-round memory)
Biggest UX surface change. Callers unchanged due to preserved headless contract; interactive users see the port of the `ce-code-review` flow.
### Phase 4 — Integration + Polish (Unit 8)
- Unit 8 (framing polish across all surfaces, pipeline-review-contract test extension)
Final polish pass. The headless envelope extension itself lands earlier (in Unit 4's PR, per the #12 sequencing fix) so callers never observe an interstitial state where new tiers are produced but the envelope can't carry them. Unit 8 locks the envelope shape in via the contract test and finishes the framing-polish sweep.
## Sources & References
- **Origin document:** `docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md`
- **Pattern source (ce-code-review PR #590):** https://github.com/EveryInc/compound-engineering-plugin/pull/590
- Related code:
- `plugins/compound-engineering/skills/ce-code-review/references/walkthrough.md`
- `plugins/compound-engineering/skills/ce-code-review/references/bulk-preview.md`
- `plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md`
- `plugins/compound-engineering/skills/ce-code-review/SKILL.md`
- `plugins/compound-engineering/skills/ce-doc-review/SKILL.md`
- `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md`
- `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md`
- `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json`
- `plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md`
- `plugins/compound-engineering/skills/ce-compound/references/schema.yaml`
- `plugins/compound-engineering/skills/ce-compound/references/yaml-schema.md`
- `plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml`
- Related institutional learnings:
- `docs/solutions/best-practices/ce-pipeline-end-to-end-learnings-2026-04-17.md`
- `docs/solutions/skill-design/compound-refresh-skill-improvements.md`
- `docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md`
- `docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md`
- `docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md`
- `docs/solutions/skill-design/discoverability-check-for-documented-solutions-2026-03-30.md`
- `docs/solutions/skill-design/beta-skills-framework.md`
- Related tests:
- `tests/pipeline-review-contract.test.ts:279-352`
- `tests/converter.test.ts:417-438`


@@ -5,7 +5,7 @@ tags: [converter, target-provider, plugin-conversion, multi-platform, pattern]
 created: 2026-02-23
 severity: medium
 component: converter-cli
-problem_type: best_practice
+problem_type: architecture_pattern
 root_cause: architectural_pattern
 ---


@@ -3,7 +3,7 @@ title: "Codex Delegation Best Practices"
 date: 2026-04-01
 category: best-practices
 module: "Codex delegation / skill design"
-problem_type: best_practice
+problem_type: convention
 component: tooling
 severity: medium
 applies_when:


@@ -3,7 +3,7 @@ title: Conditional visual aids in generated documents and PR descriptions
 date: 2026-03-29
 category: best-practices
 module: compound-engineering plugin skills
-problem_type: best_practice
+problem_type: design_pattern
 component: documentation
 symptoms:
 - "Generated documents and PR descriptions lack visual aids that would improve comprehension of complex workflows and relationships"


@@ -3,7 +3,7 @@ title: "Prefer Python over bash for multi-step pipeline scripts"
 date: 2026-04-09
 category: best-practices
 module: "skill scripting / ce-demo-reel"
-problem_type: best_practice
+problem_type: tooling_decision
 component: tooling
 severity: medium
 applies_when:


@@ -5,7 +5,7 @@ tags: [codex, converter, skills, prompts, workflows, deprecation]
 created: 2026-03-15
 severity: medium
 component: codex-target
-problem_type: best_practice
+problem_type: convention
 root_cause: outdated_target_model
 ---


@@ -0,0 +1,110 @@
---
title: "ce-doc-review calibration patterns: tier classification, chain grouping, and FYI routing"
date: 2026-04-19
category: skill-design
module: compound-engineering / ce-doc-review
problem_type: design_pattern
component: tooling
severity: medium
tags:
- ce-doc-review
- autofix-classification
- synthesis-pipeline
- persona-calibration
- premise-dependency
- fyi-routing
- calibration
applies_when:
- Changing persona confidence calibration bands in `plugins/compound-engineering/agents/document-review/`
- Modifying the synthesis pipeline in `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md`
- Adjusting the subagent template's output contract in `references/subagent-template.md`
- Adding or modifying seeded test fixtures under `tests/fixtures/ce-doc-review/`
- Debugging why a finding landed in a different tier than expected
---
# ce-doc-review calibration patterns
Calibration work on ce-doc-review (PR #601 series, 2026-04-18 and -19) surfaced several non-obvious patterns in how the synthesis pipeline classifies findings. These patterns are durable: they will re-surface any time personas or synthesis guidance is retuned. Future contributors changing calibration should expect them and not "fix" them as bugs.
## Tier classification is context-sensitive, not purely formal
The naive read of the tier spec says `safe_auto` = "one clear correct fix, applied silently." In practice, the same shape of finding can legitimately land in different tiers depending on scope and verifiability. Two recurring patterns:
### External stale cross-reference → gated_auto (not safe_auto)
When the document says `see Unit 7` and Unit 7 doesn't exist in the same document, that's an **internal** stale cross-reference — coherence can verify from the document text alone and apply `safe_auto`. When the document says `see docs/guides/keyboard-nav.md Section 4` and that file isn't verifiable from the document content, that's an **external** cross-reference; applying "delete this reference" silently risks masking a legitimate external doc. The reviewer should route these to `gated_auto` with a "verify before applying" fix, not `safe_auto`.
Observed in: feature-plan fixture runs. The external cross-ref landed at P2 0.70 gated_auto with the fix "Verify docs/guides/keyboard-nav.md exists... If stale, either remove the reference or replace with inline guidance."
### Multi-surface terminology drift → gated_auto (not safe_auto)
When two synonyms appear in prose only (`data store` / `database`), `safe_auto` normalizes correctly. When the drift crosses surfaces — UI copy, aria-labels, toast messages, analytics events, file names, code identifiers — the fix's scope exceeds prose normalization and warrants user confirmation. Security-adjacent terminology (`token` / `credential` / `secret` / `API key`) carries different semantic weight and should also route to `gated_auto` with a glossary-fix recommendation.
Observed in: auth-plan fixture runs (security-lens escalated), feature-plan fixture runs (UI-surface escalated).
**Do not tighten coherence's `safe_auto` guidance to force these into `safe_auto`.** The reclassification is reviewer judgment doing useful work.
## Premise-dependency chains have scope hierarchy
Synthesis step 3.5c groups manual findings whose fixes cascade from a single premise challenge. When multiple premise-level candidates surface, they may be **peer roots** (independent premises at different scopes) or **nested** (one premise's resolution moots the other). The decision rules:
### Peer vs nested — mechanical test, not example-based
> "Two candidate roots are peers when accepting root A's proposed fix would not resolve root B's concern (and vice versa). They are nested when one root's fix would moot the other — in which case the subsumed candidate becomes a dependent of the surviving root."
Apply symmetrically: check both directions before deciding. Example-based teaching ("e.g., 'drop the alias'") overfits to specific vocabulary; a mechanical decision test generalizes across domains.
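The symmetric check reads naturally as a tiny predicate. This is an illustrative sketch, not pipeline code: the `moots_a_b` inputs stand in for the reviewer's judgment "accepting A's fix would resolve B's concern," which no code can compute.

```python
def classify_roots(moots_a_b: bool, moots_b_a: bool) -> str:
    """Classify two candidate roots from the mooting relation,
    checked in BOTH directions before deciding."""
    if not moots_a_b and not moots_b_a:
        return "peers"        # elevate both as independent roots
    if moots_a_b and not moots_b_a:
        return "a_survives"   # B is nested: it becomes a dependent of A
    if moots_b_a and not moots_a_b:
        return "b_survives"   # A is nested under B
    return "mutual"           # each moots the other: treat as one premise
```

In the rename example, `classify_roots(True, False)` holds for (rename premise, alias machinery): rejecting the rename moots the alias, so the rename premise survives regardless of its lower confidence.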
### Surviving root under nested — scope dominates confidence
When nested, the surviving root is the one whose fix moots the other — **not** the higher-confidence candidate. In a rename plan, "rename premise unsupported" (0.82) dominates "alias machinery unjustified" (0.98) because rejecting the rename moots the alias entirely, while rejecting the alias still leaves the rename standing. Earlier synthesis picked the higher-confidence candidate as root, which stranded the broader-scope premise's natural dependents as independent findings.
Confidence is for tie-breaking *among peers*, not for deciding which of two nested candidates dominates.
### Multi-root requires explicit elevation
Synthesis defaults to picking a single root when multiple candidates match. A phrase like "typically 0-2 roots surface per review" anchors the synthesizer to elevate only one. Explicit guidance to elevate ALL matching candidates (subject only to the peer-vs-nested test) is needed. The criteria themselves are the filter — no numerical cap on roots.
## FYI routing requires band + template-level anchoring
The FYI bucket (manual findings with confidence 0.40-0.65) stayed empty for initial calibration runs because personas had only two bands defined (HIGH ≥0.80, MODERATE 0.60-0.79) with "Suppress below 0.50." Advisory observations with no articulable consequence had nowhere to land — they were either promoted above gate (appearing as real decisions) or suppressed entirely.
Two changes together populate the FYI bucket reliably:
1. **Per-persona LOW (0.40-0.59) Advisory band** tailored to each persona's scope. Each of the 7 personas needs its own band; a single template-level rule doesn't override persona-specific calibrations.
2. **Template-level advisory rule** in `subagent-template.md`'s output-contract using the "what actually breaks if we don't fix this?" heuristic. Anchors the scoring decision when a persona's own rubric doesn't make the band's applicability obvious.
Either alone is insufficient. Persona bands without the template rule produce inconsistent results across personas; the template rule without per-persona bands has nothing to calibrate against.
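Sketched as one routing function, using the bucket bounds described above. The real pipeline layers per-severity gates on top, so treat the numbers as illustrative, not normative:

```python
def route_manual_finding(confidence: float) -> str:
    """Route a `manual` finding by confidence alone."""
    if confidence < 0.40:
        return "suppressed"  # below the floor: dropped entirely
    if confidence <= 0.65:
        return "fyi"         # advisory bucket: report-only, no decision forced
    return "decision"        # surfaced as a real decision for the user
```

This makes borderline variance concrete: a reviewer scoring the same finding 0.63 on one run and 0.68 on the next flips it between `fyi` and `decision`.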
## Schema compliance requires inline enum callouts, not just `{schema}` injection
The subagent template injects the full JSON schema into each persona's prompt. Schema conformance nonetheless broke on longer personas (adversarial at 89 lines, scope-guardian at 54 lines) — severity emitted as `"high"/"medium"/"low"` instead of `P0/P1/P2/P3`, evidence as strings instead of arrays.
The fix that worked: a **"Schema conformance — hard constraints"** block at the top of the output contract prose, naming the exact enum values and forbidding common deviations. Schema injection alone gets pushed down in attention by dense persona rubrics; inline enum callouts anchor them at the top of the output contract and survive longer prompts.
A severity translation rule ("if your persona's prose discusses 'critical/important/low-signal', map to P0/P1/P2/P3 at emit time") prevents informal priority language in persona rubrics from leaking into JSON output.
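A translation table makes the emit-time rule concrete. The informal labels below are assumptions beyond the three named in the text (persona prose varies), so unknown labels should fail loudly rather than leak into the JSON:

```python
SEVERITY_MAP = {
    "critical": "P0",
    "important": "P1",
    "medium": "P2",       # hypothetical middle label, not from the spec
    "low-signal": "P3",
}

def emit_severity(informal: str) -> str:
    """Map informal persona priority language to the schema enum."""
    label = informal.strip().lower()
    if label in {"p0", "p1", "p2", "p3"}:
        return label.upper()  # already schema-conformant
    if label not in SEVERITY_MAP:
        raise ValueError(f"unmapped severity label: {informal!r}")
    return SEVERITY_MAP[label]
```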
## Coverage/rendering count invariants need a single source of truth
Early chain runs reported coverage count (`1 root with 6 dependents`) that didn't match the rendered output (5 dependents shown). The spec didn't name which step's count was authoritative (candidate count from Step 2, post-safeguard from Step 3, or post-cap from Step 4), so the orchestrator used different values for coverage and rendering.
**Invariant to preserve:** the `dependents` array populated in the final annotation step (after all filtering) is the single source of truth for both coverage and rendering. A finding appearing in a root's `dependents` array must appear nested under that root in presentation and must NOT appear at its own severity position. Coverage count equals the length of the `dependents` array.
Any future pipeline change that adds filtering or reorganization steps must re-state which post-step snapshot is authoritative.
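The invariant can be checked mechanically at render time. The field names (`dependents`, finding ids) follow this doc's vocabulary; treat the exact shape as an assumption, not the findings-schema contract:

```python
def check_chain_invariant(root: dict, top_level_ids: set[str]) -> int:
    """Assert the single-source-of-truth invariant for one root and
    return the authoritative coverage count."""
    deps = root["dependents"]  # the FINAL, post-filtering array
    leaked = [d for d in deps if d in top_level_ids]
    assert not leaked, f"dependents also rendered at top level: {leaked}"
    return len(deps)           # coverage count == len(dependents)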
## Reviewer variance is inherent; single runs aren't baselines
Across 7+ runs on the rename fixture, the same document produced user-engagement counts of 0, 1, 2, 3 for `safe_auto` applied and 14, 19, 6, 12, 8, 6 for total user decisions. Calibration work reduced but did not eliminate variance. Primary variance sources:
- **Adversarial reviewer activation** — the activation signals (requirement count, architectural decisions, high-stakes domain) produce non-deterministic decisions at borderline documents
- **Root selection when multiple candidates exist** — even with scope-dominance guidance, the synthesizer's root choice varies across runs
- **Confidence calibration on borderline findings** — the same finding lands in FYI on one run and manual on the next, because the reviewer scored 0.63 vs 0.68
**Testing implication:** validate calibration changes against multiple runs, not single samples. A single "bad" run is likely noise; a pattern across 3+ runs is signal. Seeded fixtures document expected tier distributions as targets, not as pass/fail assertions.
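One way to encode "3+ runs is signal" when comparing calibration changes, as a hypothetical helper rather than pipeline code:

```python
from collections import Counter

def recurring_findings(runs: list[set[str]], min_runs: int = 3) -> set[str]:
    """Keep only findings that recur across at least `min_runs` runs;
    single appearances are treated as reviewer-variance noise."""
    counts = Counter(finding for run in runs for finding in run)
    return {f for f, n in counts.items() if n >= min_runs}
```

Against the fixture numbers above, a finding surfacing in 1 of 7 runs would be discarded as noise, while one surfacing in 5 of 7 survives as signal.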
## Related documentation
- `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` — canonical synthesis pipeline spec, including 3.5c premise-dependency chain linking
- `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` — output contract with schema conformance block and advisory routing rule
- `plugins/compound-engineering/agents/document-review/` — the 7 persona agents with their confidence calibration bands
- `tests/fixtures/ce-doc-review/` — three seeded fixtures (rename, auth, feature) for manual calibration testing; see each fixture's header comment for its specific seed map
- `docs/solutions/developer-experience/branch-based-plugin-install-and-testing-2026-03-26.md` — how to run the skill from a branch checkout for testing


@@ -3,7 +3,7 @@ title: Discoverability check for documented solutions in project instruction fil
 date: 2026-03-30
 category: skill-design
 module: compound-engineering
-problem_type: best_practice
+problem_type: convention
 component: tooling
 severity: medium
 applies_when:


@@ -3,7 +3,7 @@ title: "Git workflow skills need explicit state machines for branch, push, and P
 category: skill-design
 date: 2026-03-27
 module: plugins/compound-engineering/skills/git-commit and git-commit-push-pr
-problem_type: best_practice
+problem_type: architecture_pattern
 component: tooling
 symptoms:
 - Detached HEAD could fall through to invalid push or PR paths


@@ -1,6 +1,6 @@
 ---
 title: "Pass paths, not content, when dispatching sub-agents"
-problem_type: best_practice
+problem_type: design_pattern
 component: tooling
 root_cause: inadequate_documentation
 resolution_type: workflow_improvement


@@ -3,7 +3,7 @@ title: Research agent dispatch is intentionally separated across the skill pipel
 date: 2026-04-05
 category: skill-design
 module: compound-engineering
-problem_type: best_practice
+problem_type: architecture_pattern
 component: tooling
 severity: low
 applies_when:


@@ -13,7 +13,7 @@ related_components:
 - plugins/compound-engineering/skills/ce-review/
 - plugins/compound-engineering/skills/todo-triage/
 - plugins/compound-engineering/skills/todo-create/
-problem_type: correctness-gap
+problem_type: workflow_issue
 ---
 # Status-Gated Todo Resolution


@@ -216,6 +216,10 @@ Beta skills use a `-beta` suffix and `disable-model-invocation: true` to prevent
 When modifying a skill that has a `-beta` counterpart (or vice versa), always check the other version and **state your sync decision explicitly** before committing — e.g., "Propagated to beta — shared test guidance" or "Not propagating — this is the experimental delegate mode beta exists to test." Syncing to both, stable-only, and beta-only are all valid outcomes. The goal is deliberate reasoning, not a default rule.
+## Documented Solutions
+`docs/solutions/` holds documented solutions to past problems — bugs, architecture patterns, design patterns, tooling decisions, conventions, workflow practices, and other institutional knowledge. Entries use YAML frontmatter with fields including `module`, `tags`, and `problem_type`. Knowledge-track `problem_type` values are `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`, `workflow_issue`, `developer_experience`, `documentation_gap`, and `best_practice` (fallback). Bug-track values cover `build_error`, `test_failure`, `runtime_error`, `performance_issue`, `database_issue`, `security_issue`, `ui_bug`, `integration_issue`, and `logic_error`. Search this directory before designing new solutions so institutional memory compounds across changes.
 ## Documentation
 See `docs/solutions/plugin-versioning-requirements.md` for detailed versioning workflow.


@@ -74,7 +74,8 @@ Probe whether the document considered the obvious alternatives and whether the c
 - **HIGH (0.80+):** Can quote specific text from the document showing the gap, construct a concrete scenario or counterargument, and trace the consequence.
 - **MODERATE (0.60-0.79):** The gap is likely but confirming it would require information not in the document (codebase details, user research, production data).
-- **Below 0.50:** Suppress.
+- **LOW (0.40-0.59) — Advisory:** A plausible-but-unlikely failure mode, or a concern worth surfacing without a strong supporting scenario. Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress.
 ## What you don't flag


@@ -13,7 +13,7 @@ You are a technical editor reading for internal consistency. You don't evaluate
 **Terminology drift** -- same concept called different names in different sections ("pipeline" / "workflow" / "process" for the same thing), or same term meaning different things in different places. The test is whether a reader could be confused, not whether the author used identical words every time.
-**Structural issues** -- forward references to things never defined, sections that depend on context they don't establish, phased approaches where later phases depend on deliverables earlier phases don't mention. Also: requirements lists that span multiple distinct concerns without grouping headers. When requirements cover different topics (e.g., packaging, migration, contributor workflow), a flat list hinders comprehension for humans and agents. Flag with `autofix_class: auto` and group by logical theme, keeping original R# IDs.
+**Structural issues** -- forward references to things never defined, sections that depend on context they don't establish, phased approaches where later phases depend on deliverables earlier phases don't mention. Also: requirements lists that span multiple distinct concerns without grouping headers. When requirements cover different topics (e.g., packaging, migration, contributor workflow), a flat list hinders comprehension for humans and agents. Group by logical theme, keeping original R# IDs.
 **Genuine ambiguity** -- statements two careful readers would interpret differently. Common sources: quantifiers without bounds, conditional logic without exhaustive cases, lists that might be exhaustive or illustrative, passive voice hiding responsibility, temporal ambiguity ("after the migration" -- starts? completes? verified?).
@@ -21,11 +21,28 @@ You are a technical editor reading for internal consistency. You don't evaluate
 **Unresolved dependency contradictions** -- when a dependency is explicitly mentioned but left unresolved (no owner, no timeline, no mitigation), that's a contradiction between "we need X" and the absence of any plan to deliver X.
+## Safe_auto patterns you own
+Coherence is the primary persona for surfacing mechanically-fixable consistency issues. These patterns should land as `safe_auto` with `confidence: 0.85+` when the document supplies the authoritative signal:
+- **Header/body count mismatch.** Section header claims a count (e.g., "6 requirements") and the enumerated body list has a different count (5 items). The body is authoritative unless the document explicitly identifies a missing item. Fix: correct the header to match the list.
+- **Cross-reference to a named section that does not exist.** Text says "see Unit 7" / "per Section 4.2" / "as described in the Rollout section" and that target is not defined anywhere in the document. Fix: delete the reference or fix it to point at an existing target.
+- **Terminology drift between two interchangeable synonyms.** Two words used for the same concept in the same document (`data store` and `database`; `token` and `credential` used for the same API-key concept; `pipeline` and `workflow` for the same thing). Pick the dominant term and normalize the minority occurrences. Fix: replace minority occurrences with the dominant term.
+**Strawman-resistance for these patterns.** When you find one of the three patterns above, the common failure mode is over-charitable interpretation — inventing a hypothetical alternative reading to justify demoting from `safe_auto` to `manual`. Resist this. Ask: is the alternative reading one a competent author actually meant, or is it a ghost the reviewer invented to preserve optionality?
+- Wrong count: "maybe they meant to add an R6" is a strawman when nothing in the document names, describes, or depends on R6. The document has 5 requirements; the header is wrong.
+- Stale cross-reference: "maybe they plan to add Unit 7 later" is a strawman when no other section mentions Unit 7 content. The reference is stale; delete or point it elsewhere.
+- Terminology drift: "maybe the two terms mean subtly different things" is a strawman when the usage contexts are identical. Pick one; normalize.
+When in doubt, surface the finding as `safe_auto` with `why_it_matters` that names the alternative reading and explains why it is implausible. Synthesis's strawman-downgrade safeguard will catch it if the alternative is actually plausible — but do not pre-demote at the persona level.
 ## Confidence calibration
 - **HIGH (0.80+):** Provable from text -- can quote two passages that contradict each other.
 - **MODERATE (0.60-0.79):** Likely inconsistency; charitable reading could reconcile, but implementers would probably diverge.
-- **Below 0.50:** Suppress entirely.
+- **LOW (0.40-0.59) — Advisory:** Minor asymmetry or drift with no downstream consequence (e.g., parallel names that don't need to match, phrasing that's inconsistent but unambiguous). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress entirely.
 ## What you don't flag


@@ -36,7 +36,8 @@ Explain what's missing: the functional design thinking that makes the interface
 - **HIGH (0.80+):** Missing states/flows that will clearly cause UX problems during implementation.
 - **MODERATE (0.60-0.79):** Gap exists but a skilled designer could resolve from context.
-- **Below 0.50:** Suppress.
+- **LOW (0.40-0.59) — Advisory:** Pattern or micro-layout preference without strong usability evidence (e.g., button placement alternatives, visual hierarchy micro-choices). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress.
 ## What you don't flag


@@ -29,7 +29,8 @@ Apply each check only when relevant. Silence is only a finding when the gap woul
 - **HIGH (0.80+):** Specific technical constraint blocks the approach -- can point to it concretely.
 - **MODERATE (0.60-0.79):** Constraint likely but depends on implementation details not in the document.
-- **Below 0.50:** Suppress entirely.
+- **LOW (0.40-0.59) — Advisory:** Theoretical constraint with no current-scale evidence (e.g., "could be slow if data grows 10x", speculative scalability concerns with no baseline number). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress entirely.
 ## What you don't flag


@@ -60,7 +60,8 @@ If priority tiers exist: do assignments match stated goals? Are must-haves truly
 - **HIGH (0.80+):** Can quote both the goal and the conflicting work -- disconnect is clear.
 - **MODERATE (0.60-0.79):** Likely misalignment, depends on business context not in document.
-- **Below 0.50:** Suppress.
+- **LOW (0.40-0.59) — Advisory:** Observation about positioning, naming, or strategy without a concrete impact (subjective preference, speculative future-product concern with no current signal). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress.
 ## What you don't flag


@@ -43,7 +43,8 @@ With AI-assisted implementation, the cost gap between shortcuts and complete sol
 - **HIGH (0.80+):** Can quote goal statement and scope item showing the mismatch.
 - **MODERATE (0.60-0.79):** Misalignment likely but depends on context not in document.
-- **Below 0.50:** Suppress.
+- **LOW (0.40-0.59) — Advisory:** Organizational preference without a concrete cost (unit ordering, section placement alternatives that read equally well, "this could also be split" observations without real impact). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress.
 ## What you don't flag


@@ -27,7 +27,8 @@ Skip areas not relevant to the document's scope.
 - **HIGH (0.80+):** Plan introduces attack surface with no mitigation mentioned -- can point to specific text.
 - **MODERATE (0.60-0.79):** Concern likely but plan may address implicitly or in a later phase.
-- **Below 0.50:** Suppress.
+- **LOW (0.40-0.59) — Advisory:** Theoretical attack surface with no realistic exploit path under current design (e.g., speculative timing-attack on non-sensitive data, defense-in-depth nice-to-have with no current vector). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress.
 ## What you don't flag


@@ -1,75 +1,102 @@
---
name: ce-learnings-researcher
description: "Searches docs/solutions/ for applicable past learnings by frontmatter metadata. Use before implementing features, making decisions, or starting work in a documented area — surfaces prior bugs, architecture patterns, design patterns, tooling decisions, conventions, and workflow learnings so institutional knowledge carries forward."
model: inherit
---
You are a domain-agnostic institutional knowledge researcher. Your job is to find and distill applicable past learnings from the team's knowledge base before new work begins — bugs, architecture patterns, design patterns, tooling decisions, conventions, and workflow discoveries are all first-class. Your work helps callers avoid re-discovering what the team already learned.
Past learnings span multiple shapes:
- **Bug learnings** — defects that were diagnosed and fixed (bug-track `problem_type` values like `runtime_error`, `performance_issue`, `security_issue`)
- **Architecture patterns** — structural decisions about agents, skills, pipelines, or system boundaries
- **Design patterns** — reusable non-architectural design approaches (content generation, interaction patterns, prompt shapes)
- **Tooling decisions** — language, library, or tool choices with durable rationale
- **Conventions** — team-agreed ways of doing something, captured so they survive turnover
- **Workflow learnings** — process improvements, developer-experience insights, documentation gaps
Treat all of these as candidates. Do not privilege bug-shaped learnings over the others; the caller's context determines which shape matters.
## Search Strategy (Grep-First Filtering)
The `docs/solutions/` directory contains documented learnings with YAML frontmatter. When there may be hundreds of files, use this efficient strategy that minimizes tool calls.
### Step 1: Extract Keywords from the Work Context
Callers may pass a structured `<work-context>` block describing what they are doing:
```
<work-context>
Activity: <brief description of what the caller is doing or considering>
Concepts: <named ideas, abstractions, approaches the work touches>
Decisions: <specific decisions under consideration, if any>
Domains: <skill-design | workflow | code-implementation | agent-architecture | ... — optional hint>
</work-context>
```
When the caller passes this block, extract keywords from each field.
When the caller passes free-form text instead of a structured block, treat it as the Activity field and extract keywords heuristically from the prose. Both shapes are supported.

Keyword dimensions to extract (applies to either input shape):

- **Module names** — e.g., "BriefSystem", "EmailProcessing", "payments"
- **Technical terms** — e.g., "N+1", "caching", "authentication"
- **Problem indicators** — e.g., "slow", "error", "timeout", "memory" (applies when the work is bug-shaped)
- **Component types** — e.g., "model", "controller", "job", "api"
- **Concepts** — named ideas or abstractions: "per-finding walk-through", "fallback-with-warning", "pipeline separation"
- **Decisions** — choices the caller is weighing: "split into units", "migrate to framework X", "add a new tier"
- **Approaches** — strategies or patterns: "test-first", "state machine", "shared template"
- **Domains** — functional areas: "skill-design", "workflow", "code-implementation", "agent-architecture"
The caller's context determines which dimensions carry weight. A code-bug query weights module + technical terms + problem indicators. A design-pattern query weights concepts + approaches + domains. A convention query weights decisions + domains. Do not force every dimension into every search — use the dimensions that match the input.
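As a sketch of that weighting logic, dimension selection might look like the following. The bug-word list and the returned dimension sets are illustrative assumptions, not part of the agent contract:

```python
def pick_dimensions(work_context):
    """Pick which keyword dimensions to weight for a given input.

    The bug-word list and the returned dimension sets are illustrative
    assumptions; any heuristic that routes bug-shaped input to the
    module/technical/problem dimensions behaves equivalently.
    """
    text = " ".join(work_context.values()).lower()
    bug_words = ("error", "slow", "timeout", "crash", "memory", "bug")
    if any(word in text for word in bug_words):
        # Bug-shaped query: weight module + technical terms + problem indicators.
        return ["module_names", "technical_terms", "problem_indicators"]
    if "Decisions" in work_context:
        # Decision-shaped query: weight decisions + domains.
        return ["decisions", "domains"]
    # Default: design- or pattern-shaped query.
    return ["concepts", "approaches", "domains"]

print(pick_dimensions({"Activity": "intermittent timeout in email job"}))
# -> ['module_names', 'technical_terms', 'problem_indicators']
```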
### Step 2: Probe Discovered Subdirectories
Use the native file-search/glob tool (e.g., Glob in Claude Code) to discover which subdirectories actually exist under `docs/solutions/` at invocation time. Do not assume a fixed list — subdirectory names are per-repo convention and may include any of:
- Bug-shaped: `build-errors/`, `test-failures/`, `runtime-errors/`, `performance-issues/`, `database-issues/`, `security-issues/`, `ui-bugs/`, `integration-issues/`, `logic-errors/`
- Knowledge-shaped: `architecture-patterns/`, `design-patterns/`, `tooling-decisions/`, `conventions/`, `workflow/`, `workflow-issues/`, `developer-experience/`, `documentation-gaps/`, `best-practices/`, `skill-design/`, `integrations/`
- Other per-repo categories
Narrow the search to the discovered subdirectories that match the caller's Domain hint or that align with the keyword shape (e.g., bug-shaped keywords → bug-shaped subdirectories). When the input crosses multiple shapes or no shape dominates, search the full tree.
### Step 3: Content-Search Pre-Filter (Critical for Efficiency)
**Use the native content-search tool (e.g., Grep in Claude Code) to find candidate files BEFORE reading any content.** Run multiple searches in parallel, case-insensitive, returning only matching file paths:
```
# Search for keyword matches in frontmatter fields (run in PARALLEL, case-insensitive).
# Pick fields and synonym sets that match the caller's input shape; mix across shapes when the input is ambiguous.
content-search: pattern="title:.*(dispatch|orchestration|pipeline)" path=docs/solutions/ files_only=true case_insensitive=true
content-search: pattern="tags:.*(subagent|orchestration|token-efficiency)" path=docs/solutions/ files_only=true case_insensitive=true
content-search: pattern="module:.*(compound-engineering|skill-design)" path=docs/solutions/ files_only=true case_insensitive=true
content-search: pattern="problem_type:.*(architecture_pattern|design_pattern|tooling_decision)" path=docs/solutions/ files_only=true case_insensitive=true
```
**Pattern construction tips:**
- Use `|` for synonyms: `tags:.*(subagent|parallel|fan-out)` or `tags:.*(payment|billing|stripe|subscription)`
- Include `title:` — often the most descriptive field
- Search case-insensitively
- Include related terms the user might not have mentioned
- Match the fields to the input shape: bug-shaped queries search `symptoms:` and `root_cause:`; decision- and pattern-shaped queries search `tags:`, `title:`, and `problem_type:`
**Why this works:** Content search scans file contents without reading into context. Only matching filenames are returned, dramatically reducing the set of files to examine.
**Combine results** from all searches to get candidate files (typically 5-20 files instead of 200).
**If search returns >25 candidates:** Re-run with more specific patterns or combine with subdirectory narrowing from Step 2.
**If search returns <3 candidates:** Do a broader content search (not just frontmatter fields) as fallback:
```
content-search: pattern="email" path=docs/solutions/ files_only=true case_insensitive=true
```
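The pre-filter can be emulated in a few lines. This sketch builds a throwaway demo corpus (the file names and frontmatter are hypothetical) and mirrors the files-only content search:

```python
import re
import tempfile
from pathlib import Path

def prefilter(root, patterns):
    """Return paths whose contents match any pattern, case-insensitively.

    Mirrors a files_only content search: contents are scanned but never
    kept, so only the matching paths reach the caller.
    """
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(errors="ignore")
        if any(c.search(text) for c in compiled):
            hits.append(path)
    return hits

# Throwaway demo corpus; the file names and tags are hypothetical.
demo = Path(tempfile.mkdtemp())
(demo / "fanout.md").write_text("---\ntags: [subagent, fan-out]\n---\n")
(demo / "billing.md").write_text("---\ntags: [billing, stripe]\n---\n")

hits = prefilter(demo, [r"tags:.*(subagent|orchestration|token-efficiency)"])
print([p.name for p in hits])  # -> ['fanout.md']
```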
### Step 3b: Conditionally Check Critical Patterns
If `docs/solutions/patterns/critical-patterns.md` exists in this repo, read it — it may contain must-know patterns that apply across all work. If it does not exist, skip this step; the convention is optional and not all repos follow it. Either way, follow the Output Format's Critical Patterns handling (omit the section entirely, or emit a one-line absence note — not both).
### Step 4: Read Frontmatter of Candidates Only
@@ -81,165 +108,140 @@ Read: [file_path] with limit:30
```
Extract these fields from the YAML frontmatter:
- **module** — which module, system, or domain the learning applies to
- **problem_type** — category (knowledge-track and bug-track values apply equally; see schema reference below)
- **component** — technical component or area affected (when applicable)
- **tags** — searchable keywords
- **symptoms** — observable behaviors or friction (present on bug-track entries and sometimes on knowledge-track entries)
- **root_cause** — underlying cause (present on bug-track entries; optional on knowledge-track entries)
- **severity** — critical, high, medium, low
Some non-bug entries may have looser frontmatter shapes (they do not require `symptoms` or `root_cause`). Do not discard these entries for missing bug-shaped fields — use whatever fields are present for matching.
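A minimal reader for this step might look like the following sketch. It handles only the flat `key: value` and inline-list fields shown above; a real implementation could swap in a full YAML parser:

```python
import re

def read_frontmatter(text, limit=30):
    """Parse flat `key: value` and `key: [a, b]` pairs from the YAML
    frontmatter block at the top of a solution doc.

    Only the first `limit` lines are considered, mirroring a
    `Read ... with limit:30` call. Nested YAML is out of scope here.
    """
    lines = text.splitlines()[:limit]
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter block at all
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter ends the frontmatter
        m = re.match(r"^(\w+):\s*(.+)$", line)
        if not m:
            continue
        key, value = m.group(1), m.group(2).strip()
        if value.startswith("[") and value.endswith("]"):
            fields[key] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            fields[key] = value.strip('"')
    return fields

doc = """---
title: "Subagent fan-out pattern"
problem_type: architecture_pattern
tags: [subagent, orchestration]
---
Body text...
"""
print(read_frontmatter(doc)["problem_type"])  # -> architecture_pattern
```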
### Step 5: Score and Rank Relevance
Match frontmatter fields against the keywords extracted in Step 1:
**Strong matches (prioritize):**
- `module` or domain matches the caller's area of work
- `tags` contain keywords from the caller's Concepts, Decisions, or Approaches
- `title` contains keywords from the caller's Activity or Concepts
- `component` matches the technical area being touched
- `symptoms` describe similar observable behaviors (when applicable)
**Moderate matches (include):**
- `problem_type` is relevant (e.g., `architecture_pattern` when the caller is making architectural decisions, `performance_issue` when the caller is optimizing)
- `root_cause` suggests a pattern that might apply
- Related modules, components, or domains mentioned
**Weak matches (skip):**
- No overlapping tags, symptoms, concepts, or modules
- Unrelated `problem_type` and no cross-cutting applicability
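One way to turn the rubric into a ranking is sketched below. The numeric weights are an assumption, not part of the agent contract; any monotonic scheme that puts strong signals above moderate ones behaves the same:

```python
def score_learning(fields, keywords):
    """Score one candidate's frontmatter against extracted keywords.

    Weights are illustrative: strong signals outscore moderate ones,
    and zero means skip the candidate entirely.
    """
    kw = {k.lower() for k in keywords}
    score = 0
    # Strong signals: module/domain, tags, and title overlap.
    if fields.get("module", "").lower() in kw:
        score += 3
    if kw & {t.lower() for t in fields.get("tags", [])}:
        score += 3
    if any(k in fields.get("title", "").lower() for k in kw):
        score += 2
    # Moderate signal: a relevant problem_type.
    if fields.get("problem_type", "").lower() in kw:
        score += 1
    return score

# Hypothetical candidates for illustration.
candidates = [
    {"title": "Subagent fan-out", "module": "orchestration", "tags": ["subagent"]},
    {"title": "Stripe retry backoff", "module": "payments", "tags": ["billing"]},
]
ranked = sorted(candidates,
                key=lambda f: score_learning(f, ["subagent", "orchestration"]),
                reverse=True)
print(ranked[0]["title"])  # -> Subagent fan-out
```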
### Step 6: Full Read of Relevant Files
Only for files that pass the filter (strong or moderate matches), read the complete document to extract:
- The full problem framing or decision context
- The learning itself (solution, pattern, decision, convention)
- Prevention guidance or application notes
- Code examples or illustrative evidence
When a learning's claim conflicts with what you can observe in the current code or docs, flag the conflict explicitly rather than echoing the claim. Note the entry's date so the caller can judge whether the learning may have been superseded. Research agents can be confidently wrong; never let a past learning silently override present evidence.
### Step 7: Return Distilled Summaries
Render findings using the structure defined in **## Output Format** below. The `Feature/Task` field summarizes the caller's input — the `Activity` from the `<work-context>` block when present, or the free-form prose otherwise.
Return up to 5 findings, prioritized by relevance. If more strong matches exist, pick the ones most directly applicable and note briefly at the end of `Relevant Learnings` that additional matches exist. Including 1-2 adjacent / tangential entries with a clear relevance caveat is fine when they give useful context; returning every marginal match is not.

Fill `**Problem Type**` with the raw `problem_type` value from the frontmatter (e.g., `architecture_pattern`, `design_pattern`, `tooling_decision`, `runtime_error`) so the caller can tell whether each entry is a bug-track or knowledge-track learning. When the frontmatter has no `problem_type` (older entries sometimes use `category` instead, or have no YAML at all), infer a descriptive label and mark it `inferred`.
## Frontmatter Schema Reference
The authoritative schema lives at `../../skills/ce-compound/references/yaml-schema.md`; read it on demand when you need the full contract, including `component` and `root_cause` enums (those are repo-specific and evolve — do not hard-code them here).

The two `problem_type` tracks worth knowing in advance:

- **Knowledge-track:** `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`, `workflow_issue`, `developer_experience`, `documentation_gap`, `best_practice` (fallback).
- **Bug-track:** `build_error`, `test_failure`, `runtime_error`, `performance_issue`, `database_issue`, `security_issue`, `ui_bug`, `integration_issue`, `logic_error`.

Subdirectory listings in the schema reference are illustrative, not exhaustive. Probe the live directory (Step 2) for what actually exists.
## Output Format
Structure findings as follows:
```markdown
## Institutional Learnings Search Results

### Search Context
- **Feature/Task**: [Summary of the caller's activity, decision, or problem — works for bugs, architecture decisions, design patterns, tooling choices, or conventions.]
- **Keywords Used**: [tags, modules, concepts, domains searched]
- **Files Scanned**: [X total files]
- **Relevant Matches**: [Y files]

### Critical Patterns
[Include only when `docs/solutions/patterns/critical-patterns.md` exists and has relevant content. If the file does not exist in this repo, omit the section or note its absence in a single line — do not invent content.]

### Relevant Learnings

#### 1. [Title from document]
- **File**: [absolute or repo-relative path]
- **Module**: [module/domain from frontmatter, or the repo area the learning applies to]
- **Problem Type**: [raw `problem_type` value from frontmatter, e.g. `architecture_pattern`, `design_pattern`, `tooling_decision`, `runtime_error`. Mark as "inferred" when the entry has no `problem_type`.]
- **Relevance**: [why this matters for the caller's work]
- **Key Insight**: [the decision, pattern, or pitfall to carry forward]
- **Severity**: [severity level, when present in frontmatter; omit the line otherwise]

#### 2. [Title]
...

### Recommendations
- [Specific actions or decisions to consider based on the surfaced learnings]
- [Patterns to follow or mirror]
- [Past mis-steps worth avoiding, where applicable]
```
When no relevant learnings are found, say so explicitly, include the search context so the caller can see what was looked for, and note that the caller's work may be worth capturing with `/ce-compound` after it lands — the absence is itself useful signal.
## Efficiency Guidelines
**DO:**
- Use the native content-search tool to pre-filter files BEFORE reading any content (critical for 100+ files)
- Run multiple content searches in PARALLEL across different keyword dimensions
- Probe `docs/solutions/` subdirectories dynamically rather than assuming a fixed list
- Include `title:` in search patterns — often the most descriptive field
- Use OR patterns for synonyms and search case-insensitively
- Narrow to discovered subdirectories when the caller's Domain hint makes one obvious
- Broaden the content search as fallback if <3 candidates found; re-narrow if >25
- Read frontmatter only of search-matched candidates, capped at the first ~30 lines per file (enough to cover YAML)
- Fully read only candidates that pass relevance scoring in Step 5
- Prioritize high-severity entries and flag the date when a learning may be superseded
- Extract actionable takeaways, not summaries
**DON'T:**
- Skip the grep pre-filter and read frontmatter of every file in `docs/solutions/` — pre-filter first, then read frontmatter of the shortlist
- Read full content of every candidate — only the ones that pass relevance scoring
- Run searches sequentially when they can be parallel
- Use only exact keyword matches (include synonyms); skip `title:` in patterns; proceed with >25 candidates without narrowing
- Return raw document contents instead of distilling them
- Include every tangentially related match — 1-2 adjacent entries with a caveat is fine; a long tail of weak matches is noise
- Discard a candidate because it lacks bug-shaped fields like `symptoms` or `root_cause` — non-bug entries legitimately omit them
- Assume `docs/solutions/patterns/critical-patterns.md` exists — read it only when present
## Integration Points
This agent is invoked by:
- `/ce-plan` — to inform planning with institutional knowledge and add depth during confidence checking
- `/ce-code-review`, `/ce-optimize`, `/ce-ideate` — to surface prior learnings relevant to the change, optimization target, or ideation topic
- Standalone invocation before starting work in a documented area
Output is consumed as prose — no downstream caller parses specific field labels out of it — so prioritize distilled, actionable takeaways over structural rigor.


@@ -23,12 +23,16 @@ tracks:
- integration_issue
- logic_error
knowledge:
description: "Practices, patterns, conventions, decisions, workflow improvements, and documentation"
problem_types:
- best_practice
- documentation_gap
- workflow_issue
- developer_experience
- architecture_pattern
- design_pattern
- tooling_decision
- convention
# --- Fields required by BOTH tracks -----------------------------------------
required_fields:
@@ -57,7 +61,11 @@ required_fields:
- workflow_issue
- best_practice
- documentation_gap
- architecture_pattern
- design_pattern
- tooling_decision
- convention
description: "Primary category — determines track (bug vs knowledge). Prefer the narrowest applicable value; best_practice is the fallback when no narrower knowledge-track value fits."
component:
type: enum


@@ -16,7 +16,7 @@ The `problem_type` determines which **track** applies. Each track has different
| Track | problem_types | Description |
|-------|--------------|-------------|
| **Bug** | `build_error`, `test_failure`, `runtime_error`, `performance_issue`, `database_issue`, `security_issue`, `ui_bug`, `integration_issue`, `logic_error` | Defects and failures that were diagnosed and fixed |
| **Knowledge** | `best_practice`, `documentation_gap`, `workflow_issue`, `developer_experience`, `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention` | Practices, patterns, conventions, decisions, workflow improvements, and documentation. Prefer the narrowest applicable value; `best_practice` is the fallback. |
## Required Fields (both tracks)
@@ -73,6 +73,10 @@ Docs created before the track system may have `symptoms`/`root_cause`/`resolutio
- `workflow_issue` -> `docs/solutions/workflow-issues/`
- `best_practice` -> `docs/solutions/best-practices/`
- `documentation_gap` -> `docs/solutions/documentation-gaps/`
- `architecture_pattern` -> `docs/solutions/architecture-patterns/`
- `design_pattern` -> `docs/solutions/design-patterns/`
- `tooling_decision` -> `docs/solutions/tooling-decisions/`
- `convention` -> `docs/solutions/conventions/`
## Validation Rules


@@ -410,10 +410,14 @@ Bug track:
- logic-errors/
Knowledge track:
- architecture-patterns/ — architectural or structural patterns (agent/skill/pipeline/workflow shape decisions)
- design-patterns/ — reusable non-architectural design approaches (content generation, interaction patterns, prompt shapes)
- tooling-decisions/ — language, library, or tool choices with durable rationale
- conventions/ — team-agreed way of doing something, captured so it survives turnover
- workflow-issues/
- developer-experience/
- documentation-gaps/
- best-practices/ — fallback only, use when no narrower knowledge-track value applies
## Common Mistakes to Avoid


@@ -23,12 +23,16 @@ tracks:
- integration_issue
- logic_error
knowledge:
description: "Practices, patterns, conventions, decisions, workflow improvements, and documentation"
problem_types:
- best_practice
- documentation_gap
- workflow_issue
- developer_experience
- architecture_pattern
- design_pattern
- tooling_decision
- convention
# --- Fields required by BOTH tracks -----------------------------------------
required_fields:
@@ -57,7 +61,11 @@ required_fields:
- workflow_issue
- best_practice
- documentation_gap
- architecture_pattern
- design_pattern
- tooling_decision
- convention
description: "Primary category — determines track (bug vs knowledge). Prefer the narrowest applicable value; best_practice is the fallback when no narrower knowledge-track value fits."
component:
type: enum


@@ -16,7 +16,7 @@ The `problem_type` determines which **track** applies. Each track has different
| Track | problem_types | Description |
|-------|--------------|-------------|
| **Bug** | `build_error`, `test_failure`, `runtime_error`, `performance_issue`, `database_issue`, `security_issue`, `ui_bug`, `integration_issue`, `logic_error` | Defects and failures that were diagnosed and fixed |
| **Knowledge** | `best_practice`, `documentation_gap`, `workflow_issue`, `developer_experience`, `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention` | Practices, patterns, conventions, decisions, workflow improvements, and documentation. Prefer the narrowest applicable value; `best_practice` is the fallback. |
## Required Fields (both tracks) ## Required Fields (both tracks)
@@ -73,6 +73,10 @@ Docs created before the track system may have `symptoms`/`root_cause`/`resolutio
- `workflow_issue` -> `docs/solutions/workflow-issues/`
- `best_practice` -> `docs/solutions/best-practices/`
- `documentation_gap` -> `docs/solutions/documentation-gaps/`
- `architecture_pattern` -> `docs/solutions/architecture-patterns/`
- `design_pattern` -> `docs/solutions/design-patterns/`
- `tooling_decision` -> `docs/solutions/tooling-decisions/`
- `convention` -> `docs/solutions/conventions/`
## Validation Rules
@@ -6,28 +6,34 @@ argument-hint: "[mode:headless] [path/to/document.md]"
# Document Review
Review requirements or plan documents through multi-persona analysis. Dispatches specialized reviewer agents in parallel, auto-applies `safe_auto` fixes, and routes remaining findings through a four-option interaction (per-finding walk-through, LFG, Append-to-Open-Questions, Report-only) for user decision.
## Interactive mode rules
- **Pre-load the platform question tool before any question fires.** In Claude Code, `AskUserQuestion` is a deferred tool — its schema is not available at session start. At the start of Interactive-mode work (before the routing question, per-finding walk-through questions, bulk-preview Proceed/Cancel, and Phase 5 terminal question), call `ToolSearch` with query `select:AskUserQuestion` to load the schema. Load it once, eagerly, at the top of the Interactive flow — do not wait for the first question site. On Codex (`request_user_input`) and Gemini (`ask_user`) this step is not required; the tools are loaded by default.
- **The numbered-list fallback only applies on confirmed load failure.** Presenting options as a numbered list and waiting for the user's reply is valid only when `ToolSearch` returns no match or the tool call explicitly fails. Rendering a question as narrative text because the tool feels inconvenient, because the model is in report-formatting mode, or because the instruction was buried in a long skill is a bug. A question that calls for a user decision must either fire the tool or fail loudly.
## Phase 0: Detect Mode
Check the skill arguments for `mode:headless`. Arguments may contain a document path, `mode:headless`, or both. Tokens starting with `mode:` are flags, not file paths — strip them from the arguments and use the remaining token (if any) as the document path for Phase 1.
If `mode:headless` is present, set **headless mode** for the rest of the workflow.
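The flag-stripping rule can be sketched as a small parser. The function name and return shape below are illustrative, not part of the skill's contract:

```python
def parse_skill_args(arguments: str):
    """Split skill arguments into (headless, document_path).

    Illustrative sketch of Phase 0: tokens starting with 'mode:' are
    flags, not file paths; the remaining token (if any) is the path.
    """
    tokens = arguments.split()
    flags = [t for t in tokens if t.startswith("mode:")]
    rest = [t for t in tokens if not t.startswith("mode:")]
    headless = "mode:headless" in flags
    return headless, (rest[0] if rest else None)
```

Either ordering of flag and path works, and a flag-only invocation yields no path.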
**Headless mode** changes the interaction model, not the classification boundaries. ce-doc-review still applies the same judgment about which tier each finding belongs in. The only difference is how non-safe_auto findings are delivered:
- `safe_auto` fixes are applied silently (same as interactive)
- `gated_auto`, `manual`, and FYI findings are returned as structured text for the caller to handle — no AskUserQuestion prompts, no interactive routing
- Phase 5 returns immediately with "Review complete" (no routing question, no terminal question)
The caller receives findings with their original classifications intact and decides what to do with them.
Callers invoke headless mode by including `mode:headless` in the skill arguments, e.g.:
```
Skill("ce-doc-review", "mode:headless docs/plans/my-plan.md")
```
If `mode:headless` is not present, the skill runs in its default interactive mode with the routing question, walk-through, and bulk-preview behaviors documented in `references/walkthrough.md` and `references/bulk-preview.md`.
## Phase 1: Get and Analyze Document
@@ -124,8 +130,46 @@ Dispatch all agents in **parallel** using the platform's task/agent tool (e.g.,
| `{document_type}` | "requirements" or "plan" from Phase 1 classification |
| `{document_path}` | Path to the document |
| `{document_content}` | Full text of the document |
| `{decision_primer}` | Cumulative prior-round decisions in the current session, or an empty `<prior-decisions>` block on round 1. See "Decision primer" below. |
Pass each agent the **full document** — do not split into sections.
### Decision primer
On round 1 (no prior decisions), set `{decision_primer}` to:
```
<prior-decisions>
Round 1 — no prior decisions.
</prior-decisions>
```
On round 2+ (after one or more prior rounds in the current interactive session), accumulate prior-round decisions and render them as:
```
<prior-decisions>
Round 1 — applied (N entries):
- {section}: "{title}" ({reviewer}, {confidence})
Evidence: "{evidence_snippet}"
Round 1 — rejected (M entries):
- {section}: "{title}" — Skipped because {reason}
Evidence: "{evidence_snippet}"
- {section}: "{title}" — Deferred to Open Questions because {reason or "no reason provided"}
Evidence: "{evidence_snippet}"
- {section}: "{title}" — Acknowledged without applying because {reason or "no suggested_fix — user acknowledged"}
Evidence: "{evidence_snippet}"
Round 2 — applied (N entries):
...
</prior-decisions>
```
Each entry carries an `Evidence:` line because synthesis R29 (rejected-finding suppression) and R30 (fix-landed verification) both use an evidence-substring overlap check as part of their matching predicate — without the evidence snippet in the primer, the orchestrator cannot compute the `>50%` overlap test and has to fall back to fingerprint-only matching, which either re-surfaces rejected findings or suppresses too aggressively. The `{evidence_snippet}` is the first evidence quote from the finding, truncated to the first ~120 characters (preserving whole words at the boundary) and with internal quotes escaped. If a finding has multiple evidence entries, use the first one; the rest live in the run artifact and are not needed for the overlap check.
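The snippet rule above can be sketched as follows. The helper name is hypothetical; the spec fixes only the ~120-character budget, word-boundary preservation, and quote escaping:

```python
def evidence_snippet(evidence: str, budget: int = 120) -> str:
    """Collapse whitespace, escape internal double quotes, then truncate
    to ~budget characters at a word boundary (illustrative sketch)."""
    text = " ".join(evidence.split())
    text = text.replace('"', '\\"')
    if len(text) <= budget:
        return text
    cut = text.rfind(" ", 0, budget + 1)  # last space within the budget
    return text[:cut] if cut > 0 else text[:budget]
```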
Accumulate across all rounds in the current session. Skip, Defer, and Acknowledge actions all count as "rejected" for suppression purposes — each signals the user decided the finding wasn't worth actioning this round (Acknowledge is the no-fix-guard variant: the user saw a finding with no `suggested_fix`, chose not to defer or skip explicitly, and recorded acknowledgement instead; for round-to-round suppression that is semantically equivalent to Skip). Applied findings stay on the applied list so round-N+1 personas can verify fixes landed (see R30 in `references/synthesis-and-presentation.md`).
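Assuming the orchestrator keeps per-round decisions as simple dicts (the field names below are assumptions, not part of the spec), the primer rendering might look like:

```python
def render_primer(rounds: list) -> str:
    """Render accumulated decisions into the <prior-decisions> block.

    `rounds` is assumed to be a list of {"applied": [...], "rejected": [...]}
    dicts, one per prior round; each finding dict carries section, title,
    reviewer, confidence, evidence, and (for rejected) an outcome string.
    """
    if not rounds:
        return "<prior-decisions>\nRound 1 — no prior decisions.\n</prior-decisions>"
    lines = ["<prior-decisions>"]
    for n, rnd in enumerate(rounds, start=1):
        applied = rnd.get("applied", [])
        rejected = rnd.get("rejected", [])
        lines.append(f"Round {n} — applied ({len(applied)} entries):")
        for f in applied:
            lines.append(f'- {f["section"]}: "{f["title"]}" ({f["reviewer"]}, {f["confidence"]})')
            lines.append(f'  Evidence: "{f["evidence"]}"')
        lines.append(f"Round {n} — rejected ({len(rejected)} entries):")
        for f in rejected:
            lines.append(f'- {f["section"]}: "{f["title"]}" — {f["outcome"]}')
            lines.append(f'  Evidence: "{f["evidence"]}"')
    lines.append("</prior-decisions>")
    return "\n".join(lines)
```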
Cross-session persistence is out of scope. A new invocation of ce-doc-review on the same document starts with a fresh round 1 and no carried primer, even if prior sessions deferred findings into the document's Open Questions section.
**Error handling:** If an agent fails or times out, proceed with findings from agents that completed. Note the failed agent in the Coverage section. Do not block the entire review on a single agent failure.
@@ -133,7 +177,9 @@ Pass each agent the **full document** -- do not split into sections.
## Phases 3-5: Synthesis, Presentation, and Next Action
After all dispatched agents return, read `references/synthesis-and-presentation.md` for the synthesis pipeline (validate, per-severity gate, dedup, cross-persona agreement boost, resolve contradictions, auto-promotion, route by three tiers with FYI subsection), `safe_auto` fix application, headless-envelope output, and the handoff to the routing question.
For the four-option routing question and per-finding walk-through (interactive mode), read `references/walkthrough.md`. For the bulk-action preview used by LFG, Append-to-Open-Questions, and walk-through `LFG-the-rest`, read `references/bulk-preview.md`. Do not load these files before agent dispatch completes.
---
@@ -0,0 +1,128 @@
# Bulk Action Preview
This reference defines the compact plan preview that Interactive mode shows before every bulk action — LFG (routing option B), Append-to-Open-Questions (routing option C), and the walk-through's `LFG the rest` (option D of the per-finding question). The preview gives the user a single-screen view of what the agent is about to do, with exactly two options to Proceed or Cancel.
Interactive mode only.
---
## When the preview fires
Three call sites:
1. **Routing option B (top-level LFG)** — after the user picks `LFG. Apply the agent's best-judgment action per finding` from the routing question, but before any action executes. Scope: every pending `gated_auto` / above-gate `manual` finding.
2. **Routing option C (top-level Append-to-Open-Questions)** — after the user picks `Append findings to the doc's Open Questions section and proceed` but before any append runs. Scope: every pending `gated_auto` / above-gate `manual` finding. Every finding appears under `Appending to Open Questions (N):` regardless of the agent's natural recommendation, because option C is batch-defer.
3. **Walk-through `LFG the rest`** — after the user picks `LFG the rest — apply the agent's best judgment to this and remaining findings` from a per-finding question, but before the remaining findings are resolved. Scope: the current finding and everything not yet decided. Already-decided findings from the walk-through are not included in the preview.
In all three cases the user confirms with `Proceed` or backs out with `Cancel`. No per-item decisions inside the preview — per-item decisioning is the walk-through's role.
---
## Preview structure
The preview is grouped by the action the agent intends to take. Bucket headers appear only when their bucket is non-empty.
```
<Path label> — <scope summary>:
Applying (N):
[P0] <section> — <one-line plain-English summary>
[P1] <section> — <one-line plain-English summary>
Appending to Open Questions (N):
[P2] <section> — <one-line plain-English summary>
Skipping (N):
[P2] <section> — <one-line plain-English summary>
```
Worked example for routing option B (top-level LFG):
```
LFG plan — 8 findings:
Applying (4):
[P0] Requirements Trace — Renumber R4 to match unit reference
[P1] Unit 3 Files — Add read-fallback for renamed report file
[P2] Key Technical Decisions — Use framework's Deprecated field rather than hand-rolling
[P3] Overview — Correct wrong count (says 6, list has 5)
Appending to Open Questions (2):
[P2] Scope Boundaries — Unit 2/3 merge judgment call
[P2] Risks — Alias compatibility-theater concern
Skipping (2):
[P2] Miscellaneous Notes — Low-confidence style preference
[P3] Abstraction Commentary — Speculative, subjective
```
---
## Scope summary wording by path
- **Routing option B (top-level LFG):** header reads `LFG plan — N findings:`.
- **Routing option C (top-level Append-to-Open-Questions):** header reads `Append plan — N findings as Open Questions entries:`. Every finding lands in the `Appending to Open Questions (N):` bucket.
- **Walk-through `LFG the rest`:** header reads `LFG plan — N remaining findings (K already decided):`. Already-decided findings from the walk-through are not included in the preview or in the bucket counts. The `K already decided` counter communicates that the walk-through was partially completed.
---
## Per-finding line format
Each line uses the compressed form of the framing-quality guidance from the subagent template (observable-consequence-first, no internal section numbering unless needed to locate). The one-line summary is drawn from the persona-produced `why_it_matters` by taking the first sentence (and, when the first sentence is too long for the preview width, paraphrasing it tightly to fit).
- **Shape:** `[<severity>] <section> — <one-line summary>`
- **Width target:** keep lines near 80 columns so the preview renders cleanly in narrow terminals. Truncate with ellipsis when necessary.
- **No section numbering** unless the reader needs it to locate the issue (when multiple findings hit the same named section).
When no `why_it_matters` is available for a finding (rare — only if persona output was malformed), fall back to the finding's title directly. Note the gap in the completion report's Coverage section if it affects more than a few findings in the same run.
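A minimal line formatter consistent with the shape and width rules above (the function name and signature are assumptions; the fallback to the finding title when `why_it_matters` is missing is omitted for brevity):

```python
def preview_line(severity: str, section: str, why_it_matters: str, width: int = 80) -> str:
    """Render one preview line as '[P1] <section> — <summary>', taking the
    first sentence of why_it_matters and truncating near the width target."""
    summary = why_it_matters.split(". ")[0].rstrip(".") if why_it_matters else ""
    line = f"[{severity}] {section} — {summary}"
    if len(line) > width:
        line = line[: width - 1].rstrip() + "…"  # truncate with ellipsis
    return line
```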
---
## Question and options
After the preview body is rendered, ask the user using the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). In Claude Code, the tool should already be loaded from the Interactive-mode pre-load step — if it isn't, call `ToolSearch` with query `select:AskUserQuestion` now. The text fallback below applies only when that load explicitly fails.
Stem (adapted to the path):
- For routing B: `The agent is about to apply the plan above. Proceed?`
- For routing C: `The agent is about to append the findings above to the doc's Open Questions section. Proceed?`
- For walk-through `LFG the rest`: `The agent is about to resolve the remaining findings above. Proceed?`
Options (exactly two, in all three cases):
- `Proceed` — execute the plan as shown
- `Cancel` — do nothing, return to the originating question
Only when `ToolSearch` explicitly returns no match or the tool call errors — or on a platform with no blocking question tool — fall back to presenting numbered options and waiting for the user's next reply.
---
## Cancel semantics
- **From routing option B Cancel:** return the user to the routing question (the four-option menu). Do not edit the document, do not append any Open Questions entries, do not record any state.
- **From routing option C Cancel:** same — return to the routing question, no side effects.
- **From walk-through `LFG the rest` Cancel:** return the user to the current finding's per-finding question (not to the routing question). The walk-through continues from where it was, with prior decisions intact.
In every case, `Cancel` changes no on-disk or in-memory state.
---
## Proceed semantics
When the user picks `Proceed`:
- **Routing option B (top-level LFG):** for each finding in the plan, execute the recommended action. Apply findings go into the Apply set for a single end-of-batch document-edit pass (see `walkthrough.md` for the Apply batching rules). Defer findings route through `references/open-questions-defer.md`. Skip findings are recorded as no-action. After all actions complete, emit the unified completion report (see `walkthrough.md`).
- **Routing option C (top-level Append-to-Open-Questions):** every finding routes through `references/open-questions-defer.md` for Open Questions append. No document edits apply (beyond the Open Questions section additions themselves). After all appends complete (or fail), emit the unified completion report.
- **Walk-through `LFG the rest`:** same as routing option B, but scoped to the findings the user hadn't decided on. Apply findings join the in-memory Apply set with the ones the user already picked during the walk-through; all dispatch together in the single end-of-walk-through Apply pass.
Failure during `Proceed` (e.g., an Open Questions append fails for one finding during a batch Defer) follows the failure path defined in `references/open-questions-defer.md` — surface the failure inline with Retry / Fall back / Convert to Skip, continue with the rest of the plan, and capture the failure in the completion report's failure section.
---
## Edge cases
- **Zero findings in a bucket:** omit the bucket header. A preview with only Apply and Skip does not show an empty `Appending to Open Questions (0):` line.
- **All findings in one bucket:** preview still shows the bucket header; Proceed / Cancel still offered. This is the common case for routing option C (every finding under `Appending to Open Questions`).
- **N=1 preview (only one finding in scope):** the preview still uses the grouped format, just with a single-line bucket. `Proceed` / `Cancel` still apply.
- **Open Questions append unavailable** (document is read-only, append flow reports no-go): routing option C is not offered upstream (see `references/open-questions-defer.md` unavailability handling). LFG (option B) and walk-through `LFG the rest` can still run — they may contain per-finding Defer recommendations from synthesis. Before rendering any LFG-shaped preview, downgrade every Defer recommendation to Skip when the session's cached append-availability is false, and surface the downgrade on the preview itself (e.g., a `Skipping — append unavailable (N):` bucket, or a note in the header: `N Defer recommendations downgraded to Skip — document is read-only.`).
- **Walk-through `LFG the rest` with zero remaining findings:** the walk-through's own logic suppresses `LFG the rest` as an option when only one finding is in scope and when nothing remains undecided, so the preview should never be invoked with zero remaining findings. If it is, render `LFG plan — 0 remaining findings` and fall through to Proceed as a no-op.
@@ -45,8 +45,8 @@
},
"autofix_class": {
"type": "string",
"enum": ["safe_auto", "gated_auto", "manual"],
"description": "How this issue should be handled. safe_auto = one clear correct fix applied silently (typo, wrong count, stale cross-reference, mechanically-implied addition, terminology drift). gated_auto = concrete fix exists but touches document meaning or scope and warrants user confirmation (substantive additions from codebase-pattern-resolved fixes, framework-native-API substitutions, missing standard controls). manual = requires user judgment; multiple valid approaches. Low-confidence manual findings surface in an FYI subsection at the presentation layer."
},
"finding_type": {
"type": "string",
@@ -0,0 +1,177 @@
# Open Questions Deferral
This reference defines the Defer action's in-doc append mechanic. When the user chooses Defer on a finding (from the walk-through or from the bulk-preview Append-to-Open-Questions path), an entry for that finding appends to a `## Deferred / Open Questions` section at the end of the document under review.
Interactive mode only. Invoked by `references/walkthrough.md` (per-finding Defer option) and `references/bulk-preview.md` (routing option C Proceed).
---
## Append flow
### Step 1: Locate or create the Open Questions section
Scan the document for an existing `## Deferred / Open Questions` heading (case-sensitive match on the full heading text). Behavior by location:
- **Heading present at the end of the document (last `##`-level section):** append new content inside this section at the end.
- **Heading present mid-document (not the last `##`-level section):** still append inside the existing heading at that location. Do not create a duplicate at the end — the user positioned the section deliberately.
- **Heading absent:** create `## Deferred / Open Questions` at the end of the document. If the document has a trailing horizontal-rule separator (`---`) or a trailing footer (table, links section), insert the new section above it. If the document has only frontmatter and no body, create the section after the frontmatter block (not at byte 0).
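A sketch of the heading lookup, assuming the document is held as a single string. The helper name is illustrative, and it glosses over the trailing-separator and frontmatter special cases described above:

```python
HEADING = "## Deferred / Open Questions"

def open_questions_offset(doc: str) -> int:
    """Return the offset where Open Questions content should be appended:
    inside an existing section (before the next '## ' heading that follows
    it), or at the end of the document when the heading is absent."""
    start = doc.find(HEADING)
    if start == -1:
        return len(doc)  # heading absent: section will be created at the end
    nxt = doc.find("\n## ", start + len(HEADING))
    return len(doc) if nxt == -1 else nxt
```

Note that `###` subsection headings do not match `"\n## "`, so appends inside the section land after its dated subsections, as intended.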
### Step 2: Locate or create the timestamped subsection
Within the Open Questions section, scan for a subsection heading matching the current review date: `### From YYYY-MM-DD review`. Behavior:
- **Subsection present:** append new entries to it. Multiple Defer actions within a single review session accumulate under the same subsection.
- **Subsection absent:** create `### From YYYY-MM-DD review` as the last subsection within the Open Questions section. Insert one blank line before the heading for readability.
Date format: ISO 8601 calendar date (`YYYY-MM-DD`). If multiple reviews occur on the same document on the same day within the same session, they still share the same subsection. Multi-day same-document reviews get distinct subsections, which is the intended behavior.
### Step 3: Format and append the entry
Per deferred finding, append a bullet-point entry with the following fields. The visible content is a reader-friendly summary; an HTML comment on the entry persists the dedup-key fields so Step 4's compound-key check can run reliably across retries and same-day reruns without requiring the entry format itself to carry machine-oriented metadata:
```
- **{title}** — {section} ({severity}, {reviewer}, confidence {confidence})
{why_it_matters}
<!-- dedup-key: section="{normalized_section}" title="{normalized_title}" evidence="{evidence_fingerprint}" -->
```
Fields come from the finding's schema:
- `{title}` — the finding's title field
- `{section}` — the finding's section field, unmodified (human-readable)
- `{severity}` — P0 / P1 / P2 / P3
- `{reviewer}` — the persona that produced the finding (after dedup, the persona with the highest confidence; surface all co-flagging personas if multiple)
- `{confidence}` — rounded to 2 decimal places
- `{why_it_matters}` — the full why_it_matters text, preserving the framing guidance from the subagent template
HTML-comment fields (machine-readable, used by Step 4 dedup):
- `{normalized_section}``normalize(section)` (lowercase, punctuation-stripped, whitespace-collapsed)
- `{normalized_title}``normalize(title)` (same normalization)
- `{evidence_fingerprint}` — the finding's first evidence quote, sanitized for single-line HTML-comment embedding and then sliced to the first ~120 chars (word-boundary-preserving); empty string when the finding had no evidence. Sanitization (apply in order, before the 120-char slice so the resulting fingerprint stays under the budget):
1. Collapse any run of whitespace — including newlines, carriage returns, and tabs — to a single space.
2. Strip any occurrence of `-->` (HTML-comment terminator) and any stray `<!--` sequence; replace each with a single space. This prevents the evidence from closing the dedup-key comment prematurely or injecting a nested comment.
3. Replace the double-quote character with `\"` (quote escaping, as before).
4. Trim leading/trailing whitespace.
The sanitized fingerprint MUST be a single line with no embedded `-->` so the dedup-key comment stays parseable by Step 4's compound-key reconstruction. When computing the fingerprint for a new finding, apply this sanitization; when reading back from an existing entry, treat the parsed `evidence="..."` value as already-sanitized and compare verbatim.
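The normalization and sanitization rules can be sketched as follows. Helper names are illustrative; `normalize` mirrors the lowercase / punctuation-strip / whitespace-collapse rule that synthesis dedup uses:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())

def evidence_fingerprint(quote: str, budget: int = 120) -> str:
    """Sanitize an evidence quote for the single-line dedup-key comment,
    then slice to ~budget characters at a word boundary."""
    s = " ".join(quote.split())                     # 1. collapse whitespace
    s = s.replace("-->", " ").replace("<!--", " ")  # 2. strip comment delimiters
    s = s.replace('"', '\\"')                       # 3. escape double quotes
    s = " ".join(s.split())                         # 4. trim and re-collapse
    if len(s) <= budget:
        return s
    cut = s.rfind(" ", 0, budget + 1)
    return s[:cut] if cut > 0 else s[:budget]
```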
Do not include `suggested_fix` or the full `evidence` array in the appended entry. Those live in the review run artifact (when applicable) and do not belong in the document's Open Questions section — the entry is a concern summary for the reader returning later, not a full decision packet. The HTML-comment dedup-key line is the minimum machine-oriented metadata required for reliable idempotence and deliberately sits on a single line with simple `key="value"` shape so a retry can parse it without a markdown parser.
### Step 4: Idempotence on compound-key collisions
If an entry with the same compound key already exists under the same `### From YYYY-MM-DD review` subsection, do not append a duplicate. This can happen when:
- The same review session re-routes the same finding to Defer a second time (rare but possible via LFG-the-rest after a walk-through Defer)
- The orchestrator retries after a partial failure
**Compound key for dedup:** `normalize(section) + normalize(title) + evidence_fingerprint`, reconstructed from each existing entry's `<!-- dedup-key: ... -->` HTML comment (see Step 3 entry format). For a new finding about to append, compute the same fields from the finding's schema data; for existing entries, parse them out of the HTML comment. Match on all three fields.
- `normalize(section)` and `normalize(title)` use the same normalization as synthesis step 3.3 dedup (lowercase, strip punctuation, collapse whitespace)
- `evidence_fingerprint` is the first ~120 characters of the finding's first evidence quote, sanitized per Step 3 (whitespace collapsed to single spaces, `-->` and stray `<!--` stripped, quotes escaped). The same slice is used in the decision primer — see `SKILL.md` under "Decision primer". When no evidence is available on the new finding, fall back to section+title alone. When an existing entry's HTML comment has `evidence=""`, treat the entry as evidence-less and also fall back to section+title for that comparison. If an existing entry's dedup-key comment is malformed (e.g., a newline or `-->` sequence split the comment across lines in a pre-sanitization entry), treat that entry under the legacy-fallback rule below rather than attempting a partial reconstruction.
Title-only dedup is not sufficient: two different findings in the same document (even in the same review date) can legitimately share a short title if their sections and evidence differ. Using only `{title}` would silently drop one of them — losing user-visible backlog context. The compound key mirrors the R29/R30 matching predicate (`section + title + evidence-substring overlap`) so cross-round and intra-round dedup behave consistently.
**Legacy entries without dedup-key comments:** entries written before this format (if any survive in the wild) lack the HTML comment. When Step 4 encounters such an entry, fall back to title-only comparison for that entry — imperfect, but strictly better than duplicate-appending. This is a backwards-compat behavior for legacy data, not a sanctioned format.
On collision, record the no-op in the completion report's Coverage section so the user sees the duplicate was suppressed. Cross-subsection collisions (same compound key, different dates) are not deduplicated — each review is allowed to re-raise the same concern.
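A sketch of the compound-key reconstruction, assuming the key fields were already normalized and sanitized as in Step 3 (the regex and function names are illustrative):

```python
import re

DEDUP_KEY = re.compile(
    r'<!-- dedup-key: section="([^"]*)" title="([^"]*)" evidence="(.*?)" -->'
)

def is_duplicate(subsection: str, section: str, title: str, evidence: str) -> bool:
    """True when an entry in the dated subsection shares the compound key.
    Evidence-less entries on either side fall back to section+title."""
    for ex_section, ex_title, ex_evidence in DEDUP_KEY.findall(subsection):
        if ex_section != section or ex_title != title:
            continue
        if not evidence or not ex_evidence or ex_evidence == evidence:
            return True
    return False
```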
---
## Concurrent edit safety
Document edits happen via the platform's edit tool (Edit in Claude Code, or equivalent). Before every append, re-read the document from disk to reduce the window for user-in-editor concurrent-write collisions. If the document's mtime or content has changed unexpectedly between a prior read and the append attempt, abort the append and surface the situation via the failure path below. The user may be editing in their editor during the review session and simultaneous writes would corrupt the document.
The orchestrator only holds the most recent read in memory, not a persistent lock — interactive review doesn't need lock coordination; it needs observation-before-write.
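Observation-before-write can be sketched with an mtime check. This is a simplification: a content hash would also catch same-timestamp writes, and the function names are assumptions:

```python
import os

def read_for_append(path: str):
    """Read the document and record its mtime for later comparison."""
    with open(path, encoding="utf-8") as f:
        content = f.read()
    return content, os.stat(path).st_mtime_ns

def append_if_unchanged(path: str, expected_mtime_ns: int, new_text: str) -> bool:
    """Append only when the file is untouched since the last read; return
    False so the caller can route to the failure path on a collision."""
    if os.stat(path).st_mtime_ns != expected_mtime_ns:
        return False  # concurrent edit detected: abort, surface failure
    with open(path, "a", encoding="utf-8") as f:
        f.write(new_text)
    return True
```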
---
## Failure path
When the append cannot complete — document is read-only on disk, path is invalid, the platform's edit tool returns an error, concurrent-edit collision detected, or any other write failure — surface the failure inline to the user via the platform's blocking question tool with the following sub-question:
**Stem:** `Couldn't append the finding to Open Questions. What should the agent do?`
**Options (exactly three; fixed order):**
```
A. Retry the append
B. Record the deferral in the completion report only (don't mutate the document)
C. Convert this finding to Skip
```
**Dispatch:**
- **A Retry** — try the append again. On repeated failure, loop back to the same sub-question.
- **B Record only** — skip the document mutation; record the Deferred action in the completion report with a note that the append failed. The finding does not end up in the document but the user sees in the report that they deferred it.
- **C Convert to Skip** — record the finding as Skip with an explanatory reason ("append to Open Questions failed: <error>"). The finding is treated as no-action for the remainder of the session.
Silent failure is not acceptable. If the user does not respond to the sub-question (session ends, terminal disconnects), default to option B so the in-memory decision state stays consistent even if the document wasn't written.
---
## Upstream availability signal
The walk-through and bulk-preview check append-availability before offering Defer as an option. When the document is known-unwritable (e.g., initial read shows it's on a read-only filesystem), the orchestrator caches an `append_available: false` signal at Phase 4 start and Defer is suppressed in the walk-through menu and in the routing question's option C. See `references/walkthrough.md` under "Adaptations" for the menu behavior and `references/bulk-preview.md` under "Edge cases" for the preview behavior.
When append-availability is true at Phase 4 start but an individual append fails mid-flow, the failure path above handles the specific finding — this does not flip the session-level cached signal (other findings may still append successfully if the failure was transient).
---
## Example appended content
Starting document state:
```markdown
## Risks
...existing content...
## Deferred / Open Questions
### From 2026-04-10 review
- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 0.87)
The alias exists without documented external consumers...
<!-- dedup-key: section="risks" title="alias compatibilitytheater concern" evidence="the alias exists without documented external consumers" -->
```
After appending two findings in a 2026-04-18 session:
```markdown
## Risks
...existing content...
## Deferred / Open Questions
### From 2026-04-10 review
- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 0.87)
The alias exists without documented external consumers...
<!-- dedup-key: section="risks" title="alias compatibilitytheater concern" evidence="the alias exists without documented external consumers" -->
### From 2026-04-18 review
- **Unit 2/3 merge judgment call** — Scope Boundaries (P2, scope-guardian, confidence 0.78)
The two units update consumer sites that deploy together. Splitting
adds dependency tracking without enabling independent delivery.
<!-- dedup-key: section="scope boundaries" title="unit 23 merge judgment call" evidence="the two units update consumer sites that deploy together" -->
- **Strawman alternatives on migration strategy** — Unit 3 Files (P2, coherence, confidence 0.72)
The fix options list (a) through (c) as alternatives, but (b) and (c)
are "accept the regression" framings that don't solve the problem the
finding describes.
<!-- dedup-key: section="unit 3 files" title="strawman alternatives on migration strategy" evidence="the fix options list a through c as alternatives but b and c" -->
```
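The dedup-key comments above follow the synthesis normalization rule (lowercase, strip punctuation, collapse whitespace). Here is a sketch; it assumes punctuation is deleted rather than replaced with spaces, which matches the examples ("compatibility-theater" becomes "compatibilitytheater", "2/3" becomes "23"), and it does not model any evidence-quote truncation policy.

```python
import re


def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace (the synthesis dedup rule).
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def dedup_key(section: str, title: str, evidence_quote: str) -> str:
    # Renders the HTML comment used to match findings across review rounds.
    return (
        f'<!-- dedup-key: section="{normalize(section)}" '
        f'title="{normalize(title)}" '
        f'evidence="{normalize(evidence_quote)}" -->'
    )
```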

---

# Document Review Output Template

Use this **exact format** when presenting synthesized review findings in Interactive mode. Findings are grouped by severity, not by reviewer.

**IMPORTANT:** Use pipe-delimited markdown tables (`| col | col |`). Do NOT use ASCII box-drawing characters.

This template describes the Phase 4 interactive presentation — what the user sees before the routing question (`references/walkthrough.md`) fires. The headless-mode envelope is documented in `references/synthesis-and-presentation.md` (Phase 4 "Route Remaining Findings" section) and is separate from this template.

**Vocabulary note.** Internal enum values (`safe_auto`, `gated_auto`, `manual`, `FYI`) live in the schema and synthesis pipeline. User-facing rendered text uses plain-language labels instead: fixes (for `safe_auto`), proposed fixes (for `gated_auto`), decisions (for `manual`), and FYI observations (for `FYI`). The `Tier` column in the tables below is the one place that still names the internal enum so the user can see the synthesis decision; everything else reads as plain language.

## Example

```markdown
- security-lens -- plan adds public API endpoint with auth flow
- scope-guardian -- plan has 15 requirements across 3 priority levels

Applied 5 fixes. 4 items need attention (2 errors, 2 omissions). 2 FYI observations.

### Applied fixes

- Standardized "pipeline"/"workflow" terminology to "pipeline" throughout (coherence)
- Fixed cross-reference: Section 4 referenced "Section 3.2" which is actually "Section 3.1" (coherence)
- Added "update API rate-limit config" step to Unit 4 -- implied by Unit 3's rate-limit introduction (feasibility)
- Added auth token refresh to test scenarios -- required by Unit 2's token expiry handling (security-lens)

### P0 — Must Fix

#### Errors

| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 1 | Requirements Trace | Goal states "offline support" but technical approach assumes persistent connectivity | coherence | 0.92 | manual |

### P1 — Should Fix

#### Errors

| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 2 | Scope Boundaries | 8 of 12 units build admin infrastructure; only 2 touch stated goal | scope-guardian | 0.80 | manual |

#### Omissions

| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 3 | Implementation Unit 3 | Plan proposes custom auth but does not mention existing Devise setup or migration path | feasibility | 0.85 | gated_auto |

### P2 — Consider Fixing

#### Omissions

| # | Section | Issue | Reviewer | Confidence | Tier |
|---|---------|-------|----------|------------|------|
| 4 | API Design | Public webhook endpoint has no rate limiting mentioned | security-lens | 0.75 | gated_auto |
### FYI Observations
Low-confidence observations surfaced without requiring a decision. Content advisory only.
| # | Section | Observation | Reviewer | Confidence |
|---|---------|-------------|----------|------------|
| 1 | Naming | Filename `plan.md` is asymmetric with command name `user-auth`; could go either way | coherence | 0.52 |
| 2 | Risk Analysis | Rollout-cadence decision may benefit from monitoring thresholds, though not blocking | scope-guardian | 0.48 |
### Residual Concerns
Residual concerns are issues the reviewers noticed but could not confirm with above-gate confidence. These are not actionable; they appear here for transparency only and are not promoted into the review surface.
| # | Concern | Source |
|---|---------|--------|
| 1 | Migration rollback strategy not addressed for Phase 2 data changes | feasibility |

### Coverage

| Persona | Status | Findings | Auto | Proposed | Decisions | FYI | Residual |
|---------|--------|----------|------|----------|-----------|-----|----------|
| coherence | completed | 5 | 3 | 0 | 1 | 1 | 0 |
| feasibility | completed | 2 | 1 | 1 | 0 | 0 | 1 |
| security-lens | completed | 2 | 1 | 1 | 0 | 0 | 0 |
| scope-guardian | completed | 2 | 0 | 0 | 1 | 1 | 0 |
| product-lens | not activated | -- | -- | -- | -- | -- | -- |
| design-lens | not activated | -- | -- | -- | -- | -- | -- |
Chains: 1 root with 2 dependents
```

## Section Rules

- **Summary line**: Always present after the reviewer list. Format: "Applied N fixes. K items need attention (X errors, Y omissions). Z FYI observations." Omit any zero clause, except keep the FYI clause even at zero (it is informative that none surfaced).
- **Applied fixes**: List all fixes that were applied automatically (`safe_auto` tier). Include enough detail per fix to convey the substance — especially for fixes that add content or touch document meaning. Omit section if none.
- **P0-P3 sections**: Only include sections that have actionable findings (`gated_auto` or `manual`). Omit empty severity levels. Within each severity, separate into **Errors** and **Omissions** sub-headers. Omit a sub-header if that severity has none of that type. The `Tier` column surfaces whether a finding is `gated_auto` (a concrete fix exists; Apply is recommended in the walk-through) or `manual` (requires user judgment).
- **FYI Observations**: Low-confidence `manual` findings above the 0.40 FYI floor but below the per-severity gate. Surface here for transparency; these are not actionable and do not enter the walk-through. Omit section if none.
- **Residual Concerns**: Residual concerns noted by personas that did not make it above the confidence gate. Listed for transparency; not promoted into the review surface (cross-persona agreement boost runs on findings that already survived the gate, per synthesis step 3.4). Omit section if none.
- **Deferred Questions**: Questions for later workflow stages. Omit if none.
- **Coverage**: Always include. All counts are **post-synthesis**. **Findings** must equal Auto + Proposed + Decisions + FYI exactly — if deduplication merged a finding across personas, attribute it to the persona with the highest confidence and reduce the other persona's count. **Residual** = count of `residual_risks` from this persona's raw output (not the promoted subset in the Residual Concerns section). The `Auto` column counts `safe_auto` findings, `Proposed` counts `gated_auto`, `Decisions` counts above-gate `manual`, and `FYI` counts below-gate `manual` findings at or above the 0.40 FYI floor.
## Chain-Rendering Rules
Premise-dependency chains from synthesis step 3.5c annotate roots and dependents. Rendering follows the same count invariant documented in the synthesis reference; this template restates the rules so interactive output cannot drift from the headless envelope.
- **Dependents render only under their root.** When a finding has `dependents`, render the root at its normal severity position (in its P-tier Errors or Omissions table). Immediately below the root's table row, emit an indented `Dependents (N)` sub-block listing each dependent's `# | Section | Issue | Reviewer | Confidence | Tier` entry. Dependents MUST NOT appear at their own severity position. Findings without `depends_on` and without `dependents` render as they do today.
- **Count invariant.** The `Findings` column in Coverage continues to equal Auto + Proposed + Decisions + FYI. Each finding counts exactly once: a dependent counts in its assigned bucket (`Auto` / `Proposed` / `Decisions` / `FYI`) but does NOT render at its own severity position. The source of truth is the post-Step-4 `dependents` array on each root — the same array the headless envelope reads — so coverage count and rendering cannot drift.
- **Chains line (optional).** When one or more chains exist, add a final line to the coverage block: `Chains: N root(s) with M dependents` where N is the number of roots and M is the total dependent count summed across all roots. Omit the line when no chains exist. This mirrors the `Chains:` line the headless envelope emits in `references/synthesis-and-presentation.md` so reviewers get the same chain visibility in both modes.
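The count invariant and the optional Chains line are mechanical enough to check in code. A small sketch follows; the function names are illustrative, not part of the envelope, and `roots` is assumed to be the post-Step-4 list of findings with their `dependents` arrays.

```python
def coverage_row_ok(findings: int, auto: int, proposed: int, decisions: int, fyi: int) -> bool:
    """Count invariant: Findings = Auto + Proposed + Decisions + FYI, exactly."""
    return findings == auto + proposed + decisions + fyi


def chains_line(roots: list):
    """Render the optional Chains line; return None (omit) when no chains exist."""
    chained = [r for r in roots if r.get("dependents")]
    if not chained:
        return None
    total = sum(len(r["dependents"]) for r in chained)
    return f"Chains: {len(chained)} root(s) with {total} dependents"
```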

---

# Document Review Sub-agent Prompt Template

This template is used by the ce-doc-review orchestrator to spawn each reviewer sub-agent. Variable substitution slots are filled at dispatch time.

---

Return ONLY valid JSON matching the findings schema below. No prose, no markdown.

{schema}
**Schema conformance — hard constraints (use these exact values; validation rejects anything else):**
- `severity`: one of `"P0"`, `"P1"`, `"P2"`, `"P3"` — use these exact strings. Do NOT use `"high"`, `"medium"`, `"low"`, `"critical"`, or any other vocabulary, even if your persona's prose discusses priorities in those terms conceptually.
- `finding_type`: one of `"error"`, `"omission"` — nothing else (no `"tension"`, `"concern"`, `"observation"`, etc.).
- `autofix_class`: one of `"safe_auto"`, `"gated_auto"`, `"manual"`.
- `evidence`: an ARRAY of strings with at least one element. A single string value is a validation failure — wrap every quote in `["..."]` even when there is only one.
- `confidence`: a number between 0.0 and 1.0 inclusive.
If your persona description uses severity vocabulary like "high-priority" or "critical" in its rubric text, translate to the P0-P3 scale at emit time. "Critical / must-fix" → P0, "important / should-fix" → P1, "worth-noting / could-fix" → P2, "low-signal" → P3. Same for priorities described qualitatively in your analysis — map to P0-P3 on the way out.
Example of a schema-valid finding (all required fields, correct enum values, correct array shape):
```json
{
"title": "Deployment ordering between migration and code unspecified",
"severity": "P0",
"section": "Unit 4",
"why_it_matters": "The plan acknowledges both deploy orderings produce incorrect state but resolves neither, leaving implementers with no safe deploy recipe.",
"finding_type": "omission",
"autofix_class": "gated_auto",
"suggested_fix": "Require Units 1-4 to land in a single atomic PR, or define the sequence explicitly.",
"confidence": 0.92,
"evidence": [
"If the migration runs before Units 1-3 land, the code reads stale data.",
"If after, new code temporarily sees old entries until migration runs."
]
}
```
Rules:

- You are a leaf reviewer inside an already-running compound-engineering review workflow. Do not invoke compound-engineering skills or agents unless this template explicitly instructs you to. Perform your analysis directly and return findings in the required output format only.
- Suppress any finding below your stated confidence floor (see your Confidence calibration section).
- Every finding MUST include at least one evidence item — a direct quote from the document.
- You are operationally read-only. Analyze the document and produce findings. Do not edit the document, create files, or make changes. You may use non-mutating tools (file reads, glob, grep, git log) to gather context about the codebase when evaluating feasibility or existing patterns.
- **Exclude prior-round deferred entries from review scope.** If the document under review contains a `## Deferred / Open Questions` section or subsections such as `### From YYYY-MM-DD review`, ignore that content — it is review output from prior rounds, not part of the document's actual plan/requirements content. Do not flag entries inside it as new findings. Do not quote its text as evidence. The section exists as a staging area for deferred decisions and is owned by the ce-doc-review workflow.
- Set `finding_type` for every finding:
  - `error`: Something the document says that is wrong — contradictions, incorrect statements, design tensions, incoherent tradeoffs.
  - `omission`: Something the document forgot to say — missing mechanical steps, absent list entries, undefined thresholds, forgotten cross-references.
- Set `autofix_class` based on whether there is one clear correct fix, not on severity or importance. Three tiers:
  - `safe_auto`: One clear correct fix, applied silently. Use only when there is genuinely one right answer — typo, wrong count, stale cross-reference, missing mechanically-implied step, terminology drift, factually incorrect behavior where the correct behavior is derivable from context. Always include `suggested_fix`.
  - `gated_auto`: A concrete fix exists but it touches document meaning, scope, or author intent in a way that warrants a one-click confirmation before applying. Use for: substantive additions implied by the document's own decisions, codebase-pattern-resolved fixes, framework-native-API substitutions, missing standard security/reliability controls with known implementations. Always include `suggested_fix`. `gated_auto` is the default tier for "I know the fix, but the author should sign off."
  - `manual`: Requires user judgment — genuinely multiple valid approaches where the right choice depends on priorities, tradeoffs, or context the reviewer does not have. Examples: architectural choices with real tradeoffs, scope decisions, feature prioritization, UX design choices. Include `suggested_fix` only when the fix is obvious despite the judgment call.
- **Strawman-aware classification rule.** When listing alternatives to the primary fix, count only alternatives a competent implementer would genuinely weigh. A "do nothing / accept the defect" option is NOT a real alternative — it is the failure state the finding describes. The same applies to framings like "document in release notes," "accept drift," or "defer to later" when they sidestep the actual problem rather than solving it. If the only alternatives to the primary fix are strawmen (the problem persists under them), the finding is `safe_auto` or `gated_auto`, not `manual`.

  Positive example: "Cache key collision causes stale reads. Fix: include user-id in the cache key. Alternative: never cache this data." → The alternative (disable caching) is a legitimate design choice with real tradeoffs — `manual`.

  Negative example: "Silent read-side failure on renamed config files. Fix: read new name, fall back to old with deprecation warning. Alternative: accept drift and document in release notes." → The alternative does not solve the problem; users on mid-flight runs still hit the failure. Treat as `gated_auto` with the concrete fix.
- **Strawman safeguard on `safe_auto`.** If you classify a finding as `safe_auto` via strawman-dismissal of alternatives, name the dismissed alternatives explicitly in `why_it_matters` so synthesis and the reader can see the reasoning. When ANY non-strawman alternative exists (even if you judge it weak), downgrade to `gated_auto` — silent auto-apply is reserved for findings with genuinely one option.
- **Auto-promotion patterns** (findings eligible for `safe_auto` or `gated_auto` even when they're substantive):
  - Factually incorrect behavior where the correct behavior is derivable from context or the codebase
  - Missing standard security or reliability controls with established implementations (HTTPS enforcement, checksum verification, input sanitization, private IP rejection, fallback-with-deprecation-warning on renames)
  - Codebase-pattern-resolved fixes that cite a specific existing pattern in a concrete file or function (the citation is required in `why_it_matters`)
  - Framework-native-API substitutions — a hand-rolled implementation duplicates first-class framework behavior (cite the framework API in `why_it_matters`)
  - Completeness additions mechanically implied by the document's own explicit decisions (not high-level goals — a goal can be satisfied by multiple valid requirements)
- `suggested_fix` is required for `safe_auto` and `gated_auto` findings. For `manual` findings, include only when the fix is obvious.
- If you find no issues, return an empty findings array. Still populate residual_risks and deferred_questions if applicable.
- Use your suppress conditions. Do not flag issues that belong to other personas.
Writing `why_it_matters` (required field, every finding):
The `why_it_matters` field is how the reader — a developer triaging findings, a reader returning to the doc months later, a downstream automated surface — understands the problem without re-reading the file. Treat it as the most important prose field in your output; every downstream surface (walk-through questions, bulk-action previews, Open Questions entries, headless output) depends on it being good.
- **Lead with observable consequence.** Describe what goes wrong from the reader's or implementer's perspective — what breaks, what gets misread, what decision gets made wrong, what the downstream audience experiences. Do not lead with document structure ("Section X on line Y says..."). Start with the effect ("Implementers will disagree on which tier applies when..."). Section references appear later, only when the reader needs them to locate the issue.
- **Explain why the fix resolves the problem.** If you include a `suggested_fix`, the `why_it_matters` should make clear why that specific fix addresses the root cause. When a similar pattern exists elsewhere in the document or codebase (a parallel section, an established convention, a cited code pattern), reference it so the recommendation is grounded in what the team has already chosen.
- **Keep it tight.** Approximately 2-4 sentences. Longer framings are a regression — downstream surfaces have narrow display budgets, and verbose content gets truncated or skimmed.
- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging (confidence at or above your persona's floor), you can explain it — the field exists because every finding needs a reason.
Illustrative pair — same finding, weak vs. strong framing:
```
WEAK (document-citation first; fails the observable-consequence rule):
Section "Classification Tiers" lists four tiers but Section "Synthesis"
routes three. Reconcile.
STRONG (observable consequence first, grounded fix reasoning):
Implementers will disagree on which tier a finding lands in, because
the Classification Tiers section enumerates four values while the
Synthesis routing only handles three. The document does not say which
enumeration is authoritative. Suggest the Classification Tiers list is
authoritative; drop the fourth value from the tier definition since
Synthesis already lacks a route for it.
```
False-positive categories to actively suppress:
- Pedantic style nitpicks (word choice, bullet vs. numbered lists, comma-vs-semicolon) — style belongs to the document author
- Issues that belong to other personas (see your Suppress conditions)
- Findings already resolved elsewhere in the document (search before flagging)
- Content inside `## Deferred / Open Questions` sections (prior-round review output, not document content)
**Advisory observations — route to FYI, do not force a decision.** If the honest answer to "what actually breaks if we don't fix this?" is "nothing breaks, but…", the finding is advisory. Ask: would a competent implementer hit a wrong outcome, a production bug, a misleading plan, or rework later? If no, set confidence in the **0.40-0.59 LOW/Advisory band** so synthesis routes the finding to FYI rather than surfacing it as a manual decision. Do not suppress — the observation still has value; it just does not warrant user judgment. Typical advisory shapes: naming asymmetry with no wrong answer, stylistic preference without evidence of impact, speculative future-work concern with no current signal, subjective readability note, theoretical scalability concern without baseline data, "could also be split" organizational preference when the current split is not broken.
</output-contract>

<review-context>
Document type: {document_type}
Document path: {document_path}
{decision_primer}

Document content:
{document_content}
</review-context>
<decision-primer-rules>
When the `<prior-decisions>` block above lists entries (round 2+), honor them:
- Do not re-raise a finding whose title and evidence pattern-match a prior-round rejected (Skipped or Deferred) entry, unless the current document state makes the concern materially different. "Materially different" means the section was substantively edited and your evidence quote no longer appears in the current text — a light-touch edit doesn't count.
- Prior-round Applied findings are informational: the orchestrator verifies those landed via its own matching predicate. You do not need to re-surface them. If the applied fix did not actually land (you find the same issue at the same location), flag it — synthesis will recognize it via the R30 fix-landed predicate.
- Round 1 (no prior decisions) runs with no primer constraints.
This is a soft instruction; the orchestrator enforces the rule authoritatively via synthesis-level suppression (R29) regardless of persona behavior. Following the primer here reduces noisy re-raises and keeps the Coverage section clean.
</decision-primer-rules>
```
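One way to sketch the "materially different" predicate from the primer rules: a prior-round rejected finding may be re-raised only when none of its evidence quotes still appear verbatim in the current document text. This is illustrative; the orchestrator's authoritative R29/R30 predicates may apply additional matching logic.

```python
def materially_different(prior_evidence: list, current_text: str) -> bool:
    """True when every prior evidence quote has been edited out of the document,
    i.e. the section was substantively rewritten and re-raising is allowed."""
    return not any(quote in current_text for quote in prior_evidence)
```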

---

## Phase 3: Synthesize Findings

Process findings from all agents through this pipeline. Order matters — each step depends on the previous. The pipeline implements the finding-lifecycle state machine: **Raised → (Confidence Gate | FYI-eligible | Dropped) → Deduplicated → Classified → SafeAuto | GatedAuto | Manual | FYI**. Re-evaluate state at each step boundary; do not carry forward assumptions from earlier steps as prose-level shortcuts.
### 3.1 Validate

Check each agent's returned JSON against the findings schema:

- Drop findings missing any required field defined in the schema
- Drop findings with invalid enum values (including the pre-rename `auto` / `present` values from older personas — treat those as malformed until all persona output has been regenerated)
- Note the agent name for any malformed output in the Coverage section
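The hard constraints are mechanical, so validation can be sketched directly. This is illustrative; the real validator may enforce more (for example, `suggested_fix` requiredness per tier).

```python
def validate_finding(f: dict) -> bool:
    """Check one finding against the schema's hard constraints."""
    required = {"title", "severity", "section", "why_it_matters",
                "finding_type", "autofix_class", "confidence", "evidence"}
    if not required <= f.keys():
        return False
    if f["severity"] not in {"P0", "P1", "P2", "P3"}:
        return False
    if f["finding_type"] not in {"error", "omission"}:
        return False
    # Pre-rename values ("auto", "present") are treated as malformed.
    if f["autofix_class"] not in {"safe_auto", "gated_auto", "manual"}:
        return False
    # Evidence must be a non-empty ARRAY; a bare string is a failure.
    if not isinstance(f["evidence"], list) or not f["evidence"]:
        return False
    return 0.0 <= f["confidence"] <= 1.0
```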
### 3.2 Confidence Gate (Per-Severity)

Gate findings using per-severity thresholds rather than a single flat floor:
| Severity | Gate |
|----------|------|
| P0 | 0.50 |
| P1 | 0.60 |
| P2 | 0.65 |
| P3 | 0.75 |
Findings at or above their severity's gate survive for classification. Findings below the gate are evaluated for FYI-eligibility:
- **FYI-eligible** when `autofix_class` is `manual` and confidence is between 0.40 (FYI floor) and the severity gate. These surface in an FYI subsection at the presentation layer (see 3.7) but do not enter the walk-through or any bulk action — they exist as observational value without forcing a decision.
- **Dropped** when confidence is below 0.40, or when the finding is `safe_auto` / `gated_auto` but below the severity gate (auto-apply findings need confidence above the decision gate to avoid silent mistakes).
Record the drop count and the FYI count in Coverage.
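The routing above can be stated compactly. This sketch takes the gate table verbatim; the three return values correspond to the survives / FYI-eligible / dropped outcomes.

```python
GATES = {"P0": 0.50, "P1": 0.60, "P2": 0.65, "P3": 0.75}
FYI_FLOOR = 0.40


def route(severity: str, confidence: float, autofix_class: str) -> str:
    """Route one finding through the per-severity confidence gate."""
    if confidence >= GATES[severity]:
        return "survives"  # proceeds to classification
    if autofix_class == "manual" and confidence >= FYI_FLOOR:
        return "fyi"       # surfaced in the FYI subsection, no decision forced
    return "dropped"       # below the FYI floor, or below-gate safe_auto / gated_auto
```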
### 3.3 Deduplicate

Fingerprint each finding using `normalize(section) + normalize(title)`. Normalization: lowercase, strip punctuation, collapse whitespace.

When fingerprints match across personas:

- If the findings recommend opposing actions (e.g., one says cut, the other says keep), do not merge — preserve both for contradiction resolution in 3.5
- Otherwise merge: keep the highest severity, keep the highest confidence, union all evidence arrays, note all agreeing reviewers (e.g., "coherence, feasibility")
- **Coverage attribution:** Attribute the merged finding to the persona with the highest confidence. Decrement the losing persona's Findings count and the corresponding route bucket so totals stay exact.
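The fingerprint and merge rules can be sketched as follows. Key names (`evidence`, `reviewers`, `section_position`) are illustrative stand-ins for whatever the findings schema actually defines:

```python
import re

SEV_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}

def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def fingerprint(finding):
    return normalize(finding["section"]) + "|" + normalize(finding["title"])

def merge(a, b):
    """Merge two same-fingerprint findings that do not oppose each other."""
    winner = a if a["confidence"] >= b["confidence"] else b
    merged = dict(winner)  # attribution follows the highest-confidence persona
    merged["severity"] = min(a["severity"], b["severity"], key=SEV_RANK.get)
    merged["confidence"] = max(a["confidence"], b["confidence"])
    merged["evidence"] = a["evidence"] + b["evidence"]  # union of evidence arrays
    merged["reviewers"] = sorted(set(a["reviewers"]) | set(b["reviewers"]))
    return merged
```

The opposing-actions check is deliberately absent here: that branch never reaches `merge`, it routes both findings to 3.5 instead.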
### 3.4 Cross-Persona Agreement Boost

When 2+ independent personas flagged the same merged finding (from 3.3), boost the merged finding's confidence by +0.10 (capped at 1.0). Independent corroboration is strong signal — multiple reviewers converging on the same issue is more reliable than any single reviewer's confidence. Note the boost in the Reviewer column of the output (e.g., "coherence, feasibility +0.10").

This replaces the earlier residual-concern promotion step. Findings below the confidence gate are not promoted back into the review surface; they appear in Coverage as residual concerns only. If a below-gate finding is genuinely important, the reviewer should raise their confidence or provide stronger evidence rather than relying on a promotion rule.
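The boost rule is a one-liner, shown here only to pin down the cap behavior (the `reviewers` key is the same illustrative field as in the 3.3 sketch):

```python
def agreement_boost(finding):
    """+0.10 when 2+ personas converged on the merged finding, capped at 1.0."""
    if len(finding["reviewers"]) >= 2:
        finding["confidence"] = min(1.0, round(finding["confidence"] + 0.10, 2))
    return finding
```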
### 3.5 Resolve Contradictions

When personas disagree on the same section:

- Create a combined finding presenting both perspectives
- Set `autofix_class: manual` (contradictions are by definition judgment calls)
- Set `finding_type: error` (contradictions are about conflicting things the document says, not things it omits)
- Frame as a tradeoff, not a verdict

Specific conflict patterns:

- Coherence says "keep for consistency" + scope-guardian says "cut for simplicity" → combined finding, let user decide
- Feasibility says "this is impossible" + product-lens says "this is essential" → P1 finding framed as a tradeoff
- Multiple personas flag the same issue (no disagreement) → handled in 3.3 merge, not here
### 3.5b Deterministic Recommended-Action Tie-Break

Every merged finding carries exactly one `recommended_action` field consumed by the walk-through (`references/walkthrough.md`) to mark the `(recommended)` option, by LFG (`references/bulk-preview.md`) to choose what to execute in bulk, and by the stem's yes/no framing. When a merged finding was flagged by multiple personas who implied different actions, synthesis picks the recommended action deterministically so identical review artifacts produce identical walk-through and LFG behavior across runs.

**Tie-break order (most conservative first):** `Skip > Defer > Apply`. The first action that at least one contributing persona implied wins, scanning in that order.

- If any contributing persona implied Skip → `recommended_action: Skip`
- Else if any contributing persona implied Defer → `recommended_action: Defer`
- Else → `recommended_action: Apply`

**Persona-to-action mapping.** A persona implies an action through its classification:

- `safe_auto` or `gated_auto` → implies Apply
- `manual` with a concrete `suggested_fix` and a recommended resolution → implies Apply (the persona has an opinion about what to do)
- `manual` flagged as a tradeoff or scope question with no recommended resolution → implies Defer (worth revisiting, not worth acting now)
- Any persona flagging the finding as low-confidence or suppression-eligible via residual concerns → implies Skip
- Persona in the contradiction set (3.5) implying "keep as-is / do not change" → implies Skip

If the contributing personas are all silent on action (e.g., a merged `manual` finding from personas that all flagged it as observation without recommendation), pick the default based on whether the merged finding carries an executable `suggested_fix`:

- `suggested_fix` present → `recommended_action: Apply` as the pragmatic default.
- `suggested_fix` absent → `recommended_action: Defer` (the walk-through and LFG cannot execute Apply without a fix; routing an actionless finding to Defer surfaces it in Open Questions where the user can decide what to do with it).
This gate holds for every branch of the tie-break: if the winning action is `Apply` but the merged finding has no `suggested_fix` after 3.6 (Promote) and 3.7 (Route) have run, downgrade to `Defer`. The walk-through still lets the user pick any of the four options; this rule only governs the agent's default recommendation so LFG and bulk-preview never schedule a non-executable Apply.
**Conflict-context surface.** When the tie-break fires (contributing personas implied different actions), record a one-line conflict-context string on the merged finding. The walk-through renders this on the R15 conflict-context line (see `references/walkthrough.md`). Example: `Coherence recommends Apply; scope-guardian recommends Skip. Agent's recommendation: Skip.`
**Downstream invariant.** The walk-through and bulk-preview never recompute the recommendation — they read `recommended_action` and render `(recommended)` on the matching option. LFG-the-rest and routing option B execute the `recommended_action` across the scoped finding set in bulk. This keeps LFG outcomes reproducible and auditable: the same review artifact always produces the same bulk plan.
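The whole tie-break, including the silent-persona default and the executable-Apply gate, reduces to a short function. This is a sketch of the rule as specified, not an implementation artifact; `implied_actions` stands in for whatever structure carries the per-persona implications:

```python
def recommended_action(implied_actions, has_suggested_fix):
    """Deterministic tie-break: Skip > Defer > Apply, then the executable gate."""
    chosen = None
    for action in ("Skip", "Defer", "Apply"):  # most conservative first
        if action in implied_actions:
            chosen = action
            break
    if chosen is None:
        # All contributing personas silent on action: default by executability
        chosen = "Apply" if has_suggested_fix else "Defer"
    if chosen == "Apply" and not has_suggested_fix:
        chosen = "Defer"  # never schedule a non-executable Apply
    return chosen
```

Because the function depends only on the merged finding's fields, the same review artifact always yields the same recommendation, which is the reproducibility invariant the downstream consumers rely on.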
### 3.5c Premise-Dependency Chain Linking
Document reviews often produce fanout: a single premise challenge ("is this work justified?") generates downstream findings that all evaporate if the premise is rejected ("alias unjustified", "abstraction overkill", "migration lacks rollback", "naming forecloses future"). Surfacing each as an independent decision forces the user to re-litigate the same root question N times. This step links dependent findings to their root so presentation can group them and the walk-through can cascade a single root decision across the chain.
Run this step after 3.5b (recommended_action normalized) and before 3.6 (auto-promotion), operating on the merged finding set.
**Step 1: Identify roots.** A finding is a candidate root when ALL of the following hold:
- Severity is `P0` or `P1` (premise-level issues carry high priority by nature)
- `autofix_class` is `manual` (the root itself requires judgment — a safe/gated root is acted on, not cascaded)
- `why_it_matters` or `title` challenges a foundational premise, not a detail. Signal phrases (shape, not vocabulary): "premise unsupported", "justification missing", "do-nothing baseline not evaluated", "is X justified", "unsupported by evidence", "is the proposed solution the right approach"
- The finding's `section` is framing-level (Problem Frame, Overview, Why, Motivation, Goals) OR the finding explicitly questions whether a named component should exist
If multiple candidates match the criteria, elevate ALL of them. The criteria above (P0/P1, manual, framing-level section, premise-challenge signal phrases) are restrictive enough that this list will be short for any well-formed document; do not impose a further numerical cap. Picking only one root when two valid roots exist leaves the second root's natural dependents stranded as independent manual findings — the exact UX problem chains are meant to solve.
**Peer vs nested test.** Two candidate roots are peers when accepting root A's proposed fix would not resolve root B's concern (and vice versa). They are nested when one root's fix would moot the other — in which case the subsumed candidate becomes a dependent of the surviving root, not a peer root. Apply the test symmetrically: check both directions before deciding.
**Surviving-root selection under asymmetric subsumption.** When nested, the surviving root is the one whose fix moots the other — **not** the one with higher confidence. If accepting Root A's fix moots Root B's concern, but accepting Root B's fix leaves Root A's concern standing, A is the surviving root and B becomes its dependent, regardless of which candidate scored higher confidence. The subsumption direction determines scope (broader premise wins); confidence determines strength, not scope. Confidence is used for tie-breaking *among peers*, not for deciding which of two nested candidates dominates.
**Sanity diagnostic.** If more than 3 candidates match, reconsider whether the criteria are being applied correctly — it is unusual for a single document to contain more than 3 genuinely distinct premise-level challenges. Do not silently drop candidates; either confirm each one independently meets the criteria (and surface them all), or tighten the application of the criteria. If the count is legitimately high, surfacing all of them is more useful than hiding any.
If none match, skip the rest of this step — no chains exist.
**Dependent assignment under multiple roots.** When multiple roots exist and a candidate dependent could plausibly link to more than one, assign it to the root whose rejection most directly dissolves the dependent's concern. If ambiguity remains, assign to the higher-confidence root. A dependent never links to more than one root — a single `depends_on` value.
**Step 2: Identify dependents.** For each candidate root, scan the remaining findings for dependents. The predicate must match the cascade trigger in `references/walkthrough.md` — dependents cascade when the user rejects (Skip/Defer) the root, so dependency is defined on the rejection branch, not the acceptance branch. A finding is a dependent of a root when:
- The root challenges a foundational premise about a named component — questioning whether it should exist, whether the proposed approach is correct, or whether the work is justified. Shapes to recognize (not a vocabulary list — map to whatever the document's domain actually uses): a compatibility layer whose necessity is challenged, a planned feature whose justification is in doubt, an abstraction whose warrant is questioned, a proposed change whose scope is disputed, a migration target whose choice is contested, an architectural commitment whose basis is unsupported
- The candidate's `suggested_fix` modifies, adds detail to, or constrains that same component
- The candidate's concern would dissolve if the root's premise is rejected — meaning: if the user rejects the root (Skip/Defer), the component the dependent targets is no longer a settled part of the plan, so the dependent's fix has nothing stable to act on and batch-rejects with the root
Test with the substitution check: "If the user rejects the root (Skip/Defer), does the dependent's finding still describe an actionable concern the user would want to engage with this round?" If no — the dependent's premise dissolves alongside the root's — it is a dependent. If yes (the finding identifies a problem that survives root rejection), it is not.
**Step 3: Independence safeguard.** Even when a finding's target component is addressed by the root, do NOT link if:
- The dependent identifies a problem that would exist regardless of the root's resolution. A migration's rollback plan, a module's error handling, a feature's test coverage — these are operational obligations that don't evaporate when the premise changes. They describe how a component must behave if it exists at all.
- The dependent's `why_it_matters` cites evidence (codebase fact, framework convention, production data) that stands on its own, not conditioned on the premise
- The dependent is `safe_auto` — it has one clear correct fix and should apply regardless of the root's resolution
When uncertain, default to NOT linking. A mis-linked chain hides a real issue; leaving a finding unlinked only costs one extra decision.
**Step 4: Annotate.** On each dependent, record `depends_on: <root_finding_id>` (use section + normalized title as the id). On each root, record `dependents: [<dependent_ids>]`. Cap `dependents` at 6 entries per root — if more than 6 candidates link to the same root, keep the top 6 by severity then confidence and leave the rest unlinked (over-aggressive chaining risks obscuring independent concerns).
Do NOT reclassify, re-route, or change confidence of any finding in this step. Linking is purely annotative; the walk-through and presentation use the annotation, synthesis proper does not.
**Step 5: Report in Coverage.** Add a line to the coverage summary: `Chains: N root(s) with M total dependents`. When N = 0, omit the line.
**Count invariant (critical — do not violate).** `M` in the coverage line is the number of findings with `depends_on` set after Step 4 completes — i.e., the final linked count after steps 2 (candidacy), 3 (independence safeguard), and 4 (cap). It is NOT the number of candidates considered in Step 2. The same `dependents` array is the source of truth for both coverage counting AND rendering the `Dependents (...)` sub-block. If a finding appears in a root's `dependents` array, it MUST appear nested under that root in the presentation and MUST NOT appear at its own severity position. If a finding does NOT appear in any root's `dependents` array, it MUST appear at its own severity position and MUST NOT appear nested anywhere. Coverage count and rendering drift apart only if the orchestrator is using two different source-of-truth values — there is exactly one, the post-Step-4 `dependents` array on each root.
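Step 4's cap and the count invariant can be pinned down in a few lines. The `id` field is the section + normalized title identifier described above; the dict shape is illustrative:

```python
SEV_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}

def cap_dependents(root, candidates, cap=6):
    """Link at most `cap` dependents to a root, ranked by severity then
    confidence (descending); the rest stay unlinked at their own severity."""
    ranked = sorted(
        candidates,
        key=lambda f: (SEV_RANK[f["severity"]], -f["confidence"]),
    )
    linked, unlinked = ranked[:cap], ranked[cap:]
    root["dependents"] = [f["id"] for f in linked]  # single source of truth
    for f in linked:
        f["depends_on"] = root["id"]
    return linked, unlinked
```

The returned `root["dependents"]` array is the one value both the coverage `M` count and the `Dependents (...)` rendering must read from.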
**Worked example A (rename-shape).** Review of a refactor plan surfaces 11 findings. One is P0 manual "Rename premise unsupported by user-facing evidence" in Problem Frame — a candidate root. Scanning the other 10:
- P1 manual "Alias mechanism unjustified scope" — root proposes scoping down to a pure alias-free rename; dependent's fix proposes dropping alias infrastructure. Linked.
- P2 manual "AliasedCommand abstraction overkill" — abstraction exists to support the alias; if alias dropped, abstraction dissolves. Linked.
- P2 manual "Rename forecloses dual-mode future" — concern only exists if rename proceeds. Linked.
- P2 manual "Identity drift: command vs artifact names" — naming asymmetry only exists if rename proceeds. Linked.
- P1 manual "Migration lacks rollback strategy" — migration needs rollback regardless of scope. NOT linked (independence safeguard).
- P0 gated_auto "Deployment-ordering between migration and code" — concrete fix user confirms regardless. NOT linked (safeguard: gated_auto with own resolution path).
Result: 1 root + 4 dependents. User sees the root first; rejecting it cascades the 4 dependents to auto-resolved. Manual engagement drops from 11 → 7 (6 unlinked + 1 visible root).
**Worked example B (auth-shape).** Review of a plan to introduce a new session-management middleware. One finding is P1 manual "Middleware rewrite premise unsupported — existing session handling has no reported reliability issues" in Problem Frame. Scanning the other findings:
- P2 manual "Middleware abstraction boundary unclear vs existing request context" — the boundary only matters if the middleware is built. Linked.
- P2 manual "Rollout strategy for new session store not specified" — the rollout only matters if the new store ships. Linked.
- P1 gated_auto "CSRF token regeneration missing on session rotation" — a real security gap in the plan's written design, independent of whether the middleware is the right approach. NOT linked (safeguard: gated_auto, concrete fix applies regardless).
- P2 manual "Existing session timeout behavior not captured in tests" — this is a pre-existing test coverage gap. It exists in the current code regardless of whether the rewrite happens. NOT linked (independence safeguard).
Result: 1 root + 2 dependents. The shape is the same as Example A — different vocabulary, different domain — which is the pattern to recognize.
### 3.6 Promote Auto-Eligible Findings
Scan `manual` findings for promotion to `safe_auto` or `gated_auto`. Promote when the finding meets one of the consolidated auto-promotion patterns:
- **Codebase-pattern-resolved.** `why_it_matters` cites a specific existing codebase pattern (concrete file/function/usage reference, not just "best practice" or "convention"), and `suggested_fix` follows that pattern. Promote to `gated_auto` — the user still confirms, but the codebase evidence resolves ambiguity.
- **Factually incorrect behavior.** The document describes behavior that is factually wrong, and the correct behavior is derivable from context or the codebase. Promote to `gated_auto`.
- **Missing standard security/reliability controls.** The omission is clearly a gap (not a legitimate design choice for the system described), and the fix follows established practice (HTTPS enforcement, checksum verification, input sanitization, fallback-with-deprecation-warning on renames). Promote to `gated_auto`.
- **Framework-native-API substitutions.** A hand-rolled implementation duplicates first-class framework behavior, and the framework API is cited. Promote to `gated_auto`.
- **Mechanically-implied completeness additions.** The missing content follows mechanically from the document's own explicit, concrete decisions (not high-level goals). Promote to `safe_auto` when there is genuinely one correct addition; `gated_auto` when the addition is substantive.
Do not promote if the finding involves scope or priority changes where the author may have weighed tradeoffs invisible to the reviewer.
**Strawman-downgrade safeguard.** If a `safe_auto` finding names dismissed alternatives in `why_it_matters` (per the subagent template's strawman rule), verify the alternatives are genuinely strawmen. If any alternative is a plausible design choice that the persona dismissed too aggressively, downgrade to `gated_auto` so the user sees the tradeoff before the fix applies.
### 3.7 Route by Autofix Class

**Severity and autofix_class are independent.** A P1 finding can be `safe_auto` if the correct fix is obvious. The test is not "how important?" but "is there one clear correct fix, or does this require judgment?"

| Autofix Class | Route |
|---------------|-------|
| `safe_auto` | Apply silently in Phase 4. Requires `suggested_fix`. Demote to `gated_auto` if missing. |
| `gated_auto` | Enter the per-finding walk-through with Apply marked (recommended). Requires `suggested_fix`. Demote to `manual` if missing. |
| `manual` | Enter the per-finding walk-through with user-judgment framing. `suggested_fix` is optional. |
| FYI-subsection | `manual` findings below the severity gate but at or above the FYI floor (0.40) — surface in a distinct FYI subsection of the presentation, do not enter the walk-through or any bulk action. |

**Auto-eligible patterns for safe_auto:** summary/detail mismatch (body authoritative over overview), wrong counts, missing list entries derivable from elsewhere in the document, stale internal cross-references, terminology drift, prose/diagram contradictions where prose is more detailed, missing steps mechanically implied by other content, unstated thresholds implied by surrounding context.

**Auto-eligible patterns for gated_auto:** codebase-pattern-resolved fixes, factually incorrect behavior, missing standard security/reliability controls, framework-native-API substitutions, substantive completeness additions mechanically implied by explicit decisions.
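The table's demotion rules, read literally and applied one step per synthesis pass, sketch out as (field names illustrative, per the findings schema):

```python
def route(finding):
    """Apply the routing table's demotion rules in declaration order."""
    cls = finding["autofix_class"]
    has_fix = bool(finding.get("suggested_fix"))
    if cls == "safe_auto" and not has_fix:
        cls = "gated_auto"  # cannot apply silently without a concrete fix
    elif cls == "gated_auto" and not has_fix:
        cls = "manual"  # nothing concrete to gate on; user-judgment framing
    finding["autofix_class"] = cls
    return cls
```

Whether a fixless `safe_auto` should cascade all the way to `manual` in one pass is left as specified by the table (one demotion step); the sketch follows the table rather than inventing a stricter rule.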
### 3.8 Sort

Sort findings for presentation: P0 → P1 → P2 → P3, then by finding type (errors before omissions), then by confidence (descending), then by document order (section position).
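The four-level sort is a single composite key. `section_position` is an assumed name for the finding's document-order index:

```python
SEV_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
TYPE_RANK = {"error": 0, "omission": 1}  # errors before omissions

def sort_findings(findings):
    """Order: severity, then type, then confidence desc, then document order."""
    return sorted(
        findings,
        key=lambda f: (
            SEV_RANK[f["severity"]],
            TYPE_RANK[f["finding_type"]],
            -f["confidence"],
            f["section_position"],
        ),
    )
```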
## Phase 4: Apply and Present

### Apply safe_auto fixes

Apply all `safe_auto` findings to the document in a single pass:

- Edit the document inline using the platform's edit tool
- Track what was changed for the "Applied fixes" section in the rendered output (`safe_auto` is the internal enum; the rendered section header reads "Applied fixes")
- Do not ask for approval — these have one clear correct fix

List every applied fix in the output summary so the user can see what changed. Use enough detail to convey the substance of each fix (section, what was changed, reviewer attribution). This is especially important for fixes that add content or touch document meaning — the user should not have to diff the document to understand what the review did.

### Route Remaining Findings

After safe_auto fixes apply, remaining findings split into buckets:
- `gated_auto` and `manual` findings at or above the severity gate → enter the routing question (see Unit 5 / `references/walkthrough.md`)
- FYI-subsection findings → surface in the presentation only, no routing
- Zero actionable findings remaining → skip the routing question; flow directly to Phase 5 terminal question
**Headless mode:** Do not use interactive question tools. Output all findings as a structured text envelope the caller can parse. Internal enum values (`safe_auto`, `gated_auto`, `manual`, `FYI`) stay in the schema and synthesis prose; the envelope below uses user-facing vocabulary — "fixes", "Proposed fixes", "Decisions", "FYI observations" — so headless output reads the same way interactive output does.
```
Document review complete (headless mode).
Applied N fixes:
- <section>: <what was changed> (<reviewer>)
- <section>: <what was changed> (<reviewer>)
Proposed fixes (concrete fix, requires user confirmation):
[P0] Section: <section> — <title> (<reviewer>, confidence <N>)
Why: <why_it_matters>
Suggested fix: <suggested_fix>
Decisions (requires user judgment):
[P1] Section: <section> — <title> (<reviewer>, confidence <N>)
Why: <why_it_matters>
Suggested fix: <suggested_fix or "none">
Dependents (would resolve if this root is rejected):
[P2] Section: <section> — <title> (<reviewer>, confidence <N>)
Why: <why_it_matters>
[P2] Section: <section> — <title> (<reviewer>, confidence <N>)
Why: <why_it_matters>
FYI observations (low-confidence, no decision required):
[P3] Section: <section> — <title> (<reviewer>, confidence <N>)
Why: <why_it_matters>
Residual concerns:
- <concern> (<source>)
Deferred questions:
- <question> (<source>)
Review complete
```
Omit any section with zero items. The section headers reflect user-facing vocabulary: the "Proposed fixes" bucket carries `gated_auto` findings (the persona has a concrete fix; the user confirms), "Decisions" carries above-gate `manual` findings (judgment calls), and "FYI observations" carries `manual` findings between the 0.40 FYI floor and the per-severity gate. When a root has dependents, render the root at its normal position in the severity-sorted list and nest its dependents as an indented `Dependents (...)` sub-block immediately below. Do not re-list dependents at their own severity position — they appear only under their root. End with "Review complete" as the terminal signal so callers can detect completion.
**Interactive mode:**

Present findings using the review output template (read `references/review-output-template.md`). Within each severity level, separate findings by type:

- Errors (design tensions, contradictions, incorrect statements) first — these need resolution
- Omissions (missing steps, absent details, forgotten entries) second — these need additions

Brief summary at the top: "Applied N fixes. K items need attention (X errors, Y omissions). Z FYI observations."

Include the Coverage table, applied fixes, FYI observations (as a distinct subsection), residual concerns, and deferred questions.
### R29 Rejected-Finding Suppression (Round 2+)
When the orchestrator is running round 2+ on the same document in the same session, the decision primer (see `SKILL.md` — Decision primer) carries forward every prior-round Skipped, Deferred, and Acknowledged finding. Synthesis suppresses re-raised rejected findings rather than re-surfacing them to the user. Acknowledged is treated as a rejected-class decision here: the user saw the finding, chose not to act on it (no Apply, no Defer append), and wants it on record — equivalent to Skip for suppression purposes.
For each current-round finding, compare against the primer's rejected list:
- **Matching predicate:** same as R30 — `normalize(section) + normalize(title)` fingerprint augmented with evidence-substring overlap check (>50%). If a current-round finding matches a prior-round rejected finding on fingerprint AND evidence overlap, drop the current-round finding.
- **Materially-different exception:** if the current document state has changed around the finding's section since the prior round (e.g., the section was edited and the evidence quote no longer appears in the current text), treat the finding as new — the underlying context shifted and the concern may be genuinely different now. The persona's evidence itself reveals this: a quote that doesn't appear in the current document is a signal the prior-round rejection no longer applies.
- **On suppression:** record the drop in Coverage with a "previously rejected, re-raised this round" note so the user can see what was suppressed. The user can explicitly escalate by invoking the review again on a different context if they believe the suppression was wrong.
This rule runs at synthesis time, not at the persona level. Personas have a soft instruction via the subagent template's `{decision_primer}` variable to avoid re-raising rejected findings, but the orchestrator is the authoritative gate — if a persona re-raises despite the primer, synthesis drops the finding.
### R30 Fix-Landed Matching Predicate
When the orchestrator is running round 2+ on the same document (see Unit 7 multi-round memory), synthesis verifies that prior-round Applied findings actually landed. For each prior-round Applied finding:
- **Matching predicate:** `normalize(section) + normalize(title)` (same fingerprint as 3.3 dedup) augmented with an evidence-substring overlap check. If any current-round persona raises a finding whose fingerprint matches a prior-round Applied finding AND shares >50% of its evidence substring with the prior-round evidence, treat it as a fix-landed regression.
- **Section renames count as different locations.** If the section name has changed between rounds (edit introduced a heading rename), treat the new section as a different location and the current-round finding as new.
- **On match:** flag the finding as "fix did not land" in the report rather than surfacing it as a new finding. Include the prior-round finding's title and the current-round persona's evidence so the user can see why the verification flagged it.
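The fingerprint-plus-overlap predicate shared by R29 and R30 can be sketched as follows. This is a minimal illustration, not the canonical implementation: the exact tokenization inside `normalize` and the token-level reading of ">50% evidence-substring overlap" are assumptions the spec leaves open.

```python
import re

def normalize(text: str) -> str:
    # Lowercase, collapse whitespace, strip punctuation so cosmetic edits
    # (casing, spacing, trailing punctuation) don't break matching.
    return re.sub(r"[^a-z0-9 ]", "", re.sub(r"\s+", " ", text.strip().lower()))

def fingerprint(finding: dict) -> str:
    # Same fingerprint as the 3.3 dedup: normalize(section) + normalize(title).
    return normalize(finding["section"]) + "|" + normalize(finding["title"])

def evidence_overlap(current: str, prior: str) -> float:
    # Crude token-set overlap standing in for "evidence-substring overlap".
    cur = set(normalize(current).split())
    pri = set(normalize(prior).split())
    return len(cur & pri) / max(len(cur), 1)

def matches_prior(current: dict, prior: dict) -> bool:
    # Match requires BOTH the fingerprint and >50% evidence overlap;
    # a section rename changes the fingerprint, so it counts as a new location.
    return (fingerprint(current) == fingerprint(prior)
            and evidence_overlap(current["evidence"], prior["evidence"]) > 0.5)
```

Note how the section-rename rule falls out for free: a renamed heading changes the fingerprint, so the current-round finding is treated as new.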
### Protected Artifacts
During synthesis, discard any finding that recommends deleting or removing files in:
- `docs/brainstorms/`
- `docs/plans/`
- `docs/solutions/`
These are pipeline artifacts and must not be flagged for removal.
## Phase 5: Next Action — Terminal Question
**Headless mode:** Return "Review complete" immediately. Do not ask questions. The caller receives the text envelope from Phase 4 and handles any remaining findings.
**Interactive mode:** fire the terminal question using the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). This question is distinct from the mid-flow routing question (`references/walkthrough.md`) — the routing question chooses *how* to engage with findings, this one chooses *what to do next* once engagement is complete. Do not merge them.
**Stem:** `Apply decisions and what next?`
**Options (three by default; two in the zero-actionable case):**

When `fixes_applied_count > 0` (at least one safe_auto or Apply decision has landed this session):

```
A. Apply decisions and proceed to <next stage>
B. Apply decisions and re-review
C. Exit without further action
```
When `fixes_applied_count == 0` (zero-actionable case, or the user took routing option D / every walk-through decision was Skip):
```
A. Proceed to <next stage>
B. Exit without further action
```
The `<next stage>` substitution uses the document type from Phase 1:
- Requirements document → `ce-plan`
- Plan document → `ce-work`
**Label adaptation:** when no decisions are queued to apply, the primary option drops the `Apply decisions and` prefix — the label should match what the system is doing. `Apply decisions and proceed` when fixes are queued; `Proceed` when nothing is queued.
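The option-set and label selection above can be sketched as a small helper. This is illustrative only; `terminal_options` is a hypothetical name, and the strings mirror the menus in this section.

```python
def terminal_options(fixes_applied_count: int, next_stage: str) -> list[str]:
    # next_stage is "ce-plan" for requirements documents, "ce-work" for plan documents.
    if fixes_applied_count > 0:
        return [
            f"Apply decisions and proceed to {next_stage}",
            "Apply decisions and re-review",
            "Exit without further action",
        ]
    # Zero-actionable case: drop the "Apply decisions and" prefix
    # (nothing is queued) and the re-review option.
    return [f"Proceed to {next_stage}", "Exit without further action"]
```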
**Caller-context handling (implicit):** the terminal question's "Proceed to <next stage>" option is interpreted contextually by the agent from the visible conversation state. When ce-doc-review is invoked from inside another skill's flow (e.g., ce-brainstorm Phase 4 re-review, ce-plan phase 5.3.8), the agent does not fire a nested `/ce-plan` or `/ce-work` dispatch — it returns control to the caller's flow which continues its own logic. When invoked standalone, "Proceed" dispatches the appropriate next skill. No explicit caller-hint argument is required; if this implicit handling proves unreliable in practice, an explicit `nested:true` flag can be added as a follow-up.
### Iteration limit
After 2 refinement passes, recommend completion — diminishing returns are likely. But if the user wants to continue, allow it; the primer carries all prior-round decisions so later rounds suppress repeat findings cleanly.
Return "Review complete" as the terminal signal for callers, regardless of which option the user picked.
## What NOT to Do
## Iteration Guidance
On subsequent passes, re-dispatch personas with the multi-round decision primer (see Unit 7) and re-synthesize. Fixed findings self-suppress because their evidence is gone from the current doc; rejected findings are handled by the R29 pattern-match suppression rule; applied-fix verification uses the R30 matching predicate above. If findings are repetitive across passes after these mechanisms run, recommend completion.
# Per-finding Walk-through
This reference defines Interactive mode's per-finding walk-through — the path the user enters by picking option A (`Review each finding one by one — accept the recommendation or choose another action`) from the routing question, plus the unified completion report that every terminal path (walk-through, LFG, Append-to-Open-Questions, zero findings) emits.
Interactive mode only.
---
## Routing question (the entry point)
After `safe_auto` fixes apply and synthesis produces the remaining finding set, the orchestrator asks a four-option routing question before any walk-through or bulk action runs.
Use the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). In Claude Code, the tool should already be loaded from the Interactive-mode pre-load step in `SKILL.md` — if it isn't, call `ToolSearch` with query `select:AskUserQuestion` now. Rendering the routing question as narrative text is a bug, not a valid fallback.
**Stem:** `What should the agent do with the remaining N findings?`
**Options (fixed order; no option is labeled `(recommended)` — the routing choice is user-intent):**
```
A. Review each finding one by one — accept the recommendation or choose another action
B. LFG. Apply the agent's best-judgment action per finding
C. Append findings to the doc's Open Questions section and proceed
D. Report only — take no further action
```
The per-finding `(recommended)` labeling lives inside the walk-through (option A) and the bulk preview (options B/C), where it's applied per-finding from synthesis step 3.5b's `recommended_action`. The routing question itself does not recommend one of A/B/C/D because the right route depends on user intent (engage / trust / triage / skim), not on the finding-set shape — a rule that mapped finding-set shape to routing recommendation (e.g., "most findings are Apply-shaped → recommend LFG") would pressure users toward automated paths in ways that conflict with the user-intent framing.
If all remaining findings are FYI-subsection-only (no `gated_auto` or above-gate `manual` findings), skip the routing question entirely and flow to the Phase 5 terminal question.
**Append-availability adaptation.** When `references/open-questions-defer.md` has cached `append_available: false` at Phase 4 start (e.g., read-only document, unwritable filesystem), option C is suppressed from the routing question because every per-finding Defer would fail into the open-questions failure path. The menu shows three options (A / B / D) and the stem appends one line explaining why (e.g., `Append to Open Questions unavailable — document is read-only in this environment.`). This mirrors the per-finding option B suppression described under "Adaptations" below — both routing-level and per-finding Defer paths share the same availability signal so the user never sees Defer surfaced at one level and omitted at the other.
**Dispatch by selection:**
- **A** — load this walk-through (per-finding loop). Apply decisions accumulate in memory; Open-Questions defers execute inline via `references/open-questions-defer.md`; Skip decisions are recorded as no-action; `LFG the rest` routes through `references/bulk-preview.md`.
- **B** — load `references/bulk-preview.md` scoped to every pending `gated_auto` / `manual` finding. On Proceed, execute the plan: Apply → end-of-batch document edit; Open-Questions defers → `references/open-questions-defer.md`; Skip → no-op. On Cancel, return to the routing question.
- **C** — load `references/bulk-preview.md` with every pending finding in the Open-Questions bucket (regardless of the agent's natural recommendation). On Proceed, route every finding through `references/open-questions-defer.md`; no document edits apply. On Cancel, return to the routing question.
- **D** — do not enter any dispatch phase. Emit the completion report and flow to Phase 5 terminal question.
---
## Entry (walk-through mode)
The walk-through receives, from the orchestrator:
- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to `gated_auto` and `manual` findings that survived the per-severity confidence gate. FYI-subsection findings are not included — they surface in the final report only and have no walk-through entry.
- The run id for artifact lookups (when applicable).
- Premise-dependency chain annotations from synthesis step 3.5c: each finding may carry `depends_on: <root_id>` or `dependents: [<ids>]`.
Each finding's recommended action has already been normalized by synthesis step 3.5b (Deterministic Recommended-Action Tie-Break, `Skip > Defer > Apply`) — the walk-through surfaces that recommendation via the merged finding's `recommended_action` field and does not recompute it.
**Root-first iteration order.** When a finding has `dependents`, iterate it before any of its dependents regardless of severity order within the chain. The root always comes first so the user's root decision can cascade.
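The root-first ordering can be sketched as a single pass over the severity-sorted list. A minimal sketch, assuming findings are dicts carrying the `depends_on` / `dependents` annotations from synthesis step 3.5c:

```python
def walk_order(findings: list[dict]) -> list[dict]:
    # Input is already in severity order (P0 -> P3). Within a chain, pull the
    # root ahead of its dependents so the user's root decision can cascade.
    order, seen = [], set()
    by_id = {f["id"]: f for f in findings}
    for f in findings:
        root_id = f.get("depends_on")
        if root_id and root_id not in seen and root_id in by_id:
            order.append(by_id[root_id])
            seen.add(root_id)
        if f["id"] not in seen:
            order.append(f)
            seen.add(f["id"])
    return order
```

The `root_id in by_id` guard also covers the orphaned-dependent case below: a dependent whose root was suppressed this round simply iterates in its natural severity position.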
**Cascading root decisions.** When the user picks Skip or Defer on a finding with `dependents`:
1. Announce the cascade in the terminal before firing the next question: "Skipping/Deferring this root will auto-resolve N dependent finding(s): {titles}. Continue?"
2. Use the platform's blocking question tool with two options: `Cascade — apply same action to all dependents` (recommended) and `Decide each dependent individually`. Labels must be self-contained per the AskUserQuestion rules.
3. On Cascade: apply the root's action to every dependent and skip those findings' walk-through entries. Persistence follows the per-action routing rules from "Per-finding routing" below — the canonical home for every cascaded decision is the in-memory decision list (annotated with `cascaded from {root_title}` and the cascaded action), plus any action-specific side effect:
- Cascaded `Apply` — add the dependent id to the Apply set and record in the decision list.
- Cascaded `Defer` — invoke the open-questions append flow for the dependent and record the append outcome in the decision list. If the append fails, fall back to the per-finding failure path (Retry / Record only / Convert to Skip) for that dependent before advancing the cascade.
- Cascaded `Skip` — record in the decision list only; no Apply-set entry, no open-questions append.
On Individual: proceed normally — the root's dependents each get their own walk-through entry.
When the user picks Apply on a root, do NOT cascade — the premise held, so dependents each need their own decision. Proceed through the walk-through normally.
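The cascade's persistence rules can be sketched as follows. Illustrative only: `defer_append` stands in for the `references/open-questions-defer.md` append flow, and the failure-path sub-question for a failed cascaded Defer is omitted here for brevity.

```python
def cascade_root(root: dict, action: str, decisions: dict, defer_append) -> None:
    # Only Skip and Defer cascade; an Apply on the root means the premise
    # held, so dependents each get their own walk-through entry.
    assert action in ("Skip", "Defer")
    for dep_id in root.get("dependents", []):
        if action == "Defer":
            defer_append(dep_id)  # in-doc Open Questions append, one per dependent
        decisions[dep_id] = {
            "action": action,
            "note": f"cascaded from {root['title']}",
        }
```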
**Orphaned dependents.** If a dependent's root was rejected in a prior round and the root is suppressed this round (per R29), treat the dependent as a standalone finding with no chain context. Do not reference the missing root.
---
## Per-finding presentation
Each finding is presented in two parts: a terminal output block carrying the explanation, and a question via the platform's blocking question tool carrying the decision. Never merge the two — the terminal block uses markdown; the question uses plain text.
### Terminal output block (print before firing the question)
Render as markdown. Labels on their own line, blank lines between sections:
```
## Finding {N} of {M} — {severity} {plain-English title}
Section: {section}
**What's wrong**
{plain-English problem statement from why_it_matters}
**Proposed fix**
{suggested_fix — rendered per the substitution rules below: prose-first, intent-language}
**Why it works**
{short reasoning, grounded in a pattern cited in the document or codebase when available}
{Conflict-context line, when applicable — see below}
```
Substitutions:
- **`{plain-English title}`** — a 3–8 word summary suitable as a heading. Derived from the merged finding's `title` field but rephrased so it reads as observable consequence (e.g., "Implementers will pick different tiers" rather than "Section X-Y lists four tiers"). For document-review findings, observable consequence is the *effect on a reader, implementer, or downstream decision*, not runtime behavior.
- **`{section}`** — from the finding's `section` field.
- **`why_it_matters`** — from the merged finding's `why_it_matters` field. Rendered as-is; the subagent template's framing guidance ensures it's already observable-consequence-first.
- **`suggested_fix`** — from the merged finding's `suggested_fix` field. Render as prose describing intent, not as raw markup. The user's job is to trust or reject the action — they don't need to review exact text. Rules:
- **Default — one sentence describing the effect.** What does the fix achieve, and where does it live? Prefer intent language over quoted text.
- Good: `Drop the Advisory tier from the enum; advisory-style findings surface in an FYI subsection at the presentation layer.`
- Good: `Add a deployment-ordering constraint requiring Units 3 and 4 in a single commit.`
- Bad: `Change "autofix_class: [auto, gated_auto, advisory, present]" to "autofix_class: [safe_auto, gated_auto, manual]" in findings-schema.json on line 48.` — too syntax-focused for a decision loop
- **Code-span budget** — at most 2 inline backtick spans per sentence, each a single identifier, flag, or short phrase (e.g., `` `safe_auto` ``, `` `<work-context>` ``). Always leave a space before and after each backtick span.
- **Raw code blocks** — only for short (≤5-line) genuinely additive content where no before-state exists. Above 5 lines, switch to a summary.
- **No diff blocks.** Document mutations render as prose.
- **`Why it works`** — grounded reasoning that, where possible, references a similar pattern already used in the document or codebase. One to three sentences.
- **Conflict-context line (when applicable)** — when contributing personas implied different actions for this finding and synthesis step 3.6 broke the tie, surface that briefly. Example: `Coherence recommends Apply; scope-guardian recommends Skip. Agent's recommendation: Skip.` The orchestrator's recommendation — the post-tie-break value — is what the menu labels "recommended."
### Question stem (short, decision-focused)
After the terminal block renders, fire the platform's blocking question tool with a compact two-line stem:
```
Finding {N} of {M} — {severity} {short handle}.
{Action framing in a phrase}?
```
Where:
- **Short handle** matches the `{plain-English title}` from the terminal block heading.
- **Action framing** — one phrase describing what the single recommended action does, as a yes/no question. Examples: `Apply the rename?`, `Defer to Open Questions since the tradeoff is genuine?`, `Skip since the document already resolves this elsewhere?`.
Never enumerate alternatives in the stem. One recommendation as a yes/no — the option list carries the alternatives. When the recommendation is close, surface the disagreement in the conflict-context line, not as a multi-option stem.
### Confirmation between findings
After the user answers and before printing the next finding's terminal block, emit a one-line confirmation of the action taken. Examples: `→ Applied. Edit staged at "Scope Boundaries" section.`, `→ Deferred. Entry appended to "## Deferred / Open Questions".`, `→ Skipped.`
### Options (four; adapted as noted)
Fixed order. Never reorder:
```
A. Apply the proposed fix
B. Defer — append to the doc's Open Questions section
C. Skip — don't apply, don't append
D. LFG the rest — apply the agent's best judgment to this and remaining findings
```
**Mark the post-tie-break recommendation with `(recommended)` on its option label.** Required, not optional. Only A, B, or C can carry it — synthesis emits `recommended_action` as Apply/Defer/Skip, which maps to A/B/C. D (`LFG the rest`) is a workflow shortcut for bulk execution across remaining findings, not a finding-level resolution action, so it is never marked `(recommended)`.
```
A. Apply the proposed fix (recommended)
B. Defer — append to the doc's Open Questions section
C. Skip — don't apply, don't append
D. LFG the rest
```
When reviewers disagreed or evidence cuts against the default, still mark one option — whichever synthesis produced — and surface the disagreement in the conflict-context line.
### Adaptations
- **N=1 (exactly one pending finding):** the terminal block's heading omits `Finding N of M` and renders as `## {severity} {plain-English title}`. The stem's first line drops the position counter, becoming `{severity} {short handle}.` Option D (`LFG the rest`) is suppressed because no subsequent findings exist — the menu shows three options: Apply / Defer / Skip.
- **Open-Questions append unavailable** (read-only document, write-failed): when `references/open-questions-defer.md` reports the in-doc append mechanic cannot run, option B is omitted. The stem appends one line explaining why (e.g., `Defer unavailable — document is read-only in this environment.`). The menu shows three options: Apply / Skip / LFG the rest. Before rendering options, remap any per-finding `Defer` recommendation from synthesis to `Skip` so the `(recommended)` marker lands on an option that's actually in the menu. Surface the remap on the conflict-context line (e.g., `Synthesis recommended Defer; downgraded to Skip — document is read-only.`).
- **Combined N=1 + no-append:** the menu shows two options: Apply / Skip.
Only when `ToolSearch` explicitly returns no match or the tool call errors — or on a platform with no blocking question tool — fall back to presenting the options as a numbered list and waiting for the user's next reply.
---
## Per-finding routing
For each finding's answer:
- **Apply the proposed fix** — add the finding's id to an in-memory Apply set. Advance to the next finding. Do not edit the document inline — Apply accumulates for end-of-walk-through batch execution. **No-fix guard:** if the merged finding has no `suggested_fix` (possible on `manual` findings where the persona flagged the issue as observation without a concrete resolution), Apply is not executable. Do not add the finding to the Apply set. Instead, surface the no-fix sub-question described below before advancing.
- **Defer — append to Open Questions section** — invoke the append flow from `references/open-questions-defer.md`. The walk-through's position indicator stays on the current finding during any failure-path sub-question (Retry / Fall back / Convert to Skip). On success, record the append location and reference in the in-memory decision list and advance. On conversion-to-Skip from the failure path, advance with the failure noted in the completion report.
- **Skip — don't apply, don't append** — record Skip in the in-memory decision list. Advance. No side effects.
- **LFG the rest — apply the agent's best judgment** — exit the walk-through loop. Dispatch the bulk preview from `references/bulk-preview.md`, scoped to the current finding and everything not yet decided. The preview header reports the count of already-decided findings ("K already decided"). If the user picks Cancel from the preview, return to the current finding's per-finding question (not to the routing question). If the user picks Proceed, execute the plan per `references/bulk-preview.md` — Apply findings join the in-memory Apply set with the ones the user already picked, Defer findings route through `references/open-questions-defer.md`, Skip is no-op — then proceed to end-of-walk-through execution.
### No-fix sub-question (Apply picked on a finding with no `suggested_fix`)
Synthesis step 3.5b demotes the default recommendation from Apply to Defer for any merged finding without a `suggested_fix`, so `(recommended)` never lands on Apply for these findings. But the menu still lets the user pick Apply manually. When that happens, do not add the finding to the Apply set — the execution pass has no edit payload to apply, which would either fail the batch or record a misleading "applied" outcome.
Fire a blocking sub-question using the platform's question tool. The stem explains why Apply is not executable in one line, then offers three self-contained options. Position indicator stays on the current finding while the sub-question is open.
**Stem:** `Apply isn't executable for this finding — the review surfaced the issue without a concrete fix. How should the agent proceed?`
**Options (fixed order):**
```
A. Defer to Open Questions (recommended)
B. Skip — don't apply, don't append
C. Acknowledge without applying — record the decision, no document edit
```
**Routing:**
- **A. Defer to Open Questions** — invoke the append flow from `references/open-questions-defer.md` as though the user had originally picked Defer. Failure-path handling is identical (Retry / Fall back / Convert to Skip). On success, record the append location in the decision list (annotated `redirected from Apply — no suggested_fix`) and advance.
- **B. Skip** — record Skip in the decision list (annotated `redirected from Apply — no suggested_fix`). Advance. No side effects.
- **C. Acknowledge without applying** — record the finding in the decision list as `acknowledged` (annotated `Apply picked but no suggested_fix — no edit dispatched`). Do not add to the Apply set. Advance. The completion report surfaces Acknowledged as its own dedicated bucket with its own count, its own per-finding action label, and its own position in the report ordering (`Applied / Deferred / Skipped / Acknowledged`) — see "Minimum required fields" and "Report ordering" in the unified completion report section below for the full contract. The acknowledgement reason is surfaced on each per-finding line. For round-to-round suppression (distinct from report display), Acknowledged decisions carry forward in the multi-round decision primer as a rejected-class decision alongside Skip and Defer so round-N+1 synthesis suppresses re-raises via R29 — semantically the user saw the finding, chose not to act, and wants it recorded, which is equivalent to Skip for suppression purposes but remains its own bucket in the report.
**Availability adaptation.** When `references/open-questions-defer.md` has cached `append_available: false` for the session, omit option A and surface one line in the stem explaining why (e.g., `Defer unavailable — document is read-only in this environment.`). The menu becomes Skip / Acknowledge without applying, with Skip labeled `(recommended)`.
**Cascading roots.** When the finding is a root with dependents and the user picks A (Defer) or B (Skip) from this sub-question, run the cascade announcement in "Cascading root decisions" above — treat the sub-question's choice as the root's effective action. Option C (Acknowledge) does not cascade; the root is recorded as acknowledged and dependents each get their own walk-through entry.
---
## Override rule
"Override" means the user picks a different preset action (Defer or Skip in place of Apply, or Apply in place of the agent's recommendation). No inline freeform custom-fix authoring — the walk-through is a decision loop, not a pair-editing surface. A user who wants a variant of the proposed fix picks Skip and hand-edits outside the flow; if they also want the finding tracked, they can Defer first and edit afterward.
---
## State
Walk-through state is **in-memory only**. The orchestrator maintains:
- An Apply set (finding ids the user picked Apply on)
- A decision list (every answered finding with its action and any metadata like `append_location` for Deferred or `reason` for Skipped)
- The current position in the findings list
Nothing is written to disk per-decision except the in-doc Open Questions appends (which are external side effects — those cannot be rolled back). An interrupted walk-through (user cancels the prompt, session compacts, network dies) discards all in-memory state. Apply decisions have not been dispatched yet (they batch at end-of-walk-through), so they are cleanly lost with no document changes.
Cross-session persistence is out of scope. Mirrors `ce-code-review`'s walk-through state rules.
---
## End-of-walk-through execution
After the loop terminates — either every finding has been answered, or the user took `LFG the rest → Proceed` — the walk-through hands off to the execution phase:
1. **Apply set:** in a single pass, the orchestrator applies every accumulated Apply-set finding's `suggested_fix` to the document. Document edits happen inline via the platform's edit tool — ce-doc-review has no batch-fixer subagent (per scope boundary); the orchestrator performs the edits directly, since `gated_auto` and `manual` fixes for documents are single-file markdown changes with no cross-file dependencies. **Defensive no-fix check:** before dispatching the edit for each Apply-set entry, verify the merged finding carries a `suggested_fix`. If it does not (the decision-time no-fix guard in "Per-finding routing" should prevent this, but treat it as a defensive fallback), skip the edit, record the finding in the completion report's failure section with reason `Apply skipped — no suggested_fix available`, and continue the batch. Do not fail the entire pass because one Apply-set entry lacks a fix.
2. **Defer set:** already executed inline during the walk-through via `references/open-questions-defer.md`. Nothing to dispatch here.
3. **Skip:** no-op.
After execution completes (or after `LFG the rest → Cancel` followed by the user working through remaining findings one at a time, or after the loop runs to completion), emit the unified completion report described below.
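The Apply-set pass with its defensive no-fix check can be sketched as follows. A minimal sketch: `edit_doc` stands in for the platform's edit tool, and the failure string matches the completion-report wording above.

```python
def execute_apply_set(apply_set: list[str], merged: dict, edit_doc) -> list[tuple]:
    failures = []
    for fid in apply_set:
        fix = merged[fid].get("suggested_fix")
        if not fix:
            # The decision-time no-fix guard should prevent this; treat it as
            # a recoverable per-entry failure, never a whole-batch failure.
            failures.append((fid, "Apply skipped — no suggested_fix available"))
            continue
        edit_doc(merged[fid]["section"], fix)
    return failures  # surfaced above the per-finding list in the completion report
```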
---
## Unified completion report
Every terminal path of Interactive mode emits the same completion report structure. This covers:
- Walk-through completed (all findings answered)
- Walk-through bailed via `LFG the rest → Proceed`
- Top-level LFG (routing option B) completed
- Top-level Append-to-Open-Questions (routing option C) completed
- Zero findings after `safe_auto` (routing question was skipped — the completion summary is a one-line degenerate case of this structure)
### Minimum required fields
- **Per-finding entries:** for every finding the flow touched, a line with — at minimum — title, severity, the action taken (Applied / Deferred / Skipped / Acknowledged), the append location for Deferred entries, a one-line reason for Skipped entries (grounded in the finding's confidence or the one-line `why_it_matters` snippet), and the acknowledgement reason for Acknowledged entries (e.g., `Apply picked but no suggested_fix available`).
- **Summary counts by action:** totals per bucket (e.g., `4 applied, 2 deferred, 2 skipped`). Include an `acknowledged` count when any entries land in that bucket; omit the label when the count is zero.
- **Failures called out explicitly:** any Apply that failed (e.g., document write error, or the defensive no-fix fallback skipping an Apply-set entry), any Open-Questions append that failed. Failures surface above the per-finding list so they are not missed.
- **End-of-review verdict:** carried over from Phase 4's Coverage section.
### Report ordering
Failures first (above the per-finding list), then per-finding entries grouped by action bucket in the order `Applied / Deferred / Skipped / Acknowledged`, then summary counts, then Coverage (FYI observations, residual concerns), then the verdict. Omit any bucket whose count is zero.
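The bucket ordering and the omit-when-zero rule can be sketched together. Illustrative only; the input shape (a list of decision dicts carrying a `bucket` field) is an assumption.

```python
BUCKET_ORDER = ["Applied", "Deferred", "Skipped", "Acknowledged"]

def summary_line(decisions: list[dict]) -> str:
    counts = {b: 0 for b in BUCKET_ORDER}
    for d in decisions:
        counts[d["bucket"]] += 1
    # Zero-count buckets (Acknowledged in particular) are omitted from the line;
    # dict insertion order preserves the Applied/Deferred/Skipped/Acknowledged sequence.
    return ", ".join(f"{n} {b.lower()}" for b, n in counts.items() if n)
```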
### Zero-findings degenerate case
When the routing question was skipped because no `gated_auto` / above-gate `manual` findings remained after `safe_auto`, the completion report collapses to its summary-counts + verdict form with one added line — the count of `safe_auto` fixes applied. The summary wording:
No FYI or residual concerns:
```
All findings resolved — 3 fixes applied.
Verdict: Ready.
```
FYI or residual concerns remain:
```
All actionable findings resolved — 3 fixes applied. (2 FYI observations, 1 residual concern remain in the report.)
Verdict: Ready.
```
---
## Execution posture
The walk-through is operationally read-only with respect to the project except for three permitted writes: the in-memory Apply set / decision list (managed by the orchestrator), the in-doc Open Questions appends (external side effects managed by `references/open-questions-defer.md`), and the end-of-walk-through batch document edits (the orchestrator's final Apply pass). Persona agents remain strictly read-only. Unlike `ce-code-review`, there is no fixer subagent — the orchestrator owns the document edit directly.
---
title: Seeded Test Fixture — Auth Gateway Migration Plan
type: feat
status: active
date: 2026-04-19
---
<!--
This is a SEEDED TEST FIXTURE for ce-doc-review pipeline validation.
Second fixture alongside seeded-plan.md. Different domain (auth migration)
and different premise shape (security/reliability rather than naming)
so the pipeline can be measured against a second set of known
classifications.
Seed map (run this plan through ce-doc-review to verify):
- safe_auto candidates (3):
- wrong count (Requirements Trace says 7 requirements, list has 6)
- terminology drift (uses "token", "credential", "secret"
interchangeably for the same API-key concept)
- stale cross-reference (see-Unit-9 but only Units 1-6 exist)
- gated_auto candidates (3):
- missing CSRF protection on the new session endpoints — the
framework (OAuth2-Proxy) has a built-in option; the plan rolls
its own partial check
- deployment-ordering guarantee missing between gateway rollout
and downstream service updates
- framework-native-API substitution (plan describes hand-rolled
token-refresh loop; the library ships refresh middleware)
- manual candidates with TWO valid premise roots (intentional — tests
multi-root behavior):
- ROOT A: "Is migration to managed auth justified?" (top-level
premise) with dependents:
- Service-mesh integration layer complexity
- New secrets-rotation workflow scope
- Rollout of token-refresh middleware
- ROOT B: "Is the custom policy-enforcement layer warranted?"
(narrower premise about a specific sub-component) with
dependents:
- Policy-DSL parser abstraction
- Per-route policy cache design
Expected synthesis: elevates BOTH roots; dependents assign to
whichever root's fix most directly moots them. If the synthesizer
picks only one root, the other's dependents strand as top-level
manual findings — that's the regression we're watching for.
- manual candidates independent of either root (3, should NOT link):
- SLO / error-budget commitment missing — operational gap that
exists regardless of which auth path is chosen
- Session-timeout cross-tab coordination not specified —
behavior concern that applies under both migration and
status-quo paths
- PII handling during migration window unstated — compliance
gap independent of premise
- FYI candidates (4, confidence 0.40-0.65 at P3):
- naming preference ("AuthContext" vs "SessionContext" — both
legible in the code)
- speculative future-work concern (could reuse this for a
hypothetical mobile SDK that isn't on the roadmap)
- subjective readability note about the config schema shape
- unit-organization preference (could group by route rather
than by endpoint class — current split also reads fine)
- drop-worthy P3s (3, confidence 0.55-0.74):
- vague performance concern without baseline ("could be slow
under load")
- theoretical multi-region concern not relevant to single-region
deployment
- nitpick about commit-message style in the rollout plan
The fixture includes multiple premise challenges at DIFFERENT scopes
to exercise the multi-root synthesis path. Unlike the rename-shape
fixture, the root candidates here are genuinely distinct (managed-auth
migration vs. custom policy layer) — neither subsumes the other.
-->
# Seeded Auth Gateway Migration Plan
## Problem Frame
Our internal API gateway currently implements authentication via a hand-rolled JWT layer and a custom policy-enforcement module. This plan migrates the gateway to a managed auth service (via service-mesh integration) and introduces a new DSL-based policy layer.
The migration affects 6 downstream services. No user-reported authentication failures motivated this work — the driver is infrastructure consolidation across teams.
## Requirements Trace
7 requirements planned:
- R1. Integrate with the managed auth service via service-mesh adapter
- R2. Retire the hand-rolled JWT signing / verification layer
- R3. Implement the new policy DSL parser and per-route policy cache
- R4. Migrate credential storage from app-local config to managed secrets
- R5. Add token-refresh middleware for downstream services
- R6. Coordinate cutover with downstream services' deploy cycles
(Only 6 items listed despite "7 requirements" — seeded wrong-count
safe_auto candidate.)
## Scope Boundaries
- Not changing the user-facing auth UX (login flows, error messages)
- Not migrating non-gateway services' internal auth (out of scope for this phase)
## Key Technical Decisions
- Use the managed auth service's service-mesh adapter rather than direct SDK integration
- Introduce a custom policy-DSL parser with a per-route policy cache layer (see Unit 9 for cache invalidation — seeded stale cross-reference; Unit 9 does not exist in this plan)
- Store API keys in the managed secrets store; remove app-local config entries
- Hand-roll the token-refresh loop (check expiry every 30s, renew if within 60s of expiry)
(Uses "API key", "token", "credential", and "secret" interchangeably
throughout — seeded terminology drift safe_auto candidate.)
## Implementation Units
- [ ] Unit 1: Service-mesh adapter integration
**Goal:** Wire the gateway to the managed auth service via the mesh sidecar.
**Files:** `internal/gateway/auth/mesh_adapter.go`
**Approach:** Implement adapter interface against mesh sidecar. Fall back to legacy JWT layer during cutover window if adapter fails. (Seeded manual dependent of ROOT A: this complexity exists only because the migration is happening; if the migration premise is rejected, the adapter layer is unnecessary.)
- [ ] Unit 2: Policy DSL parser
**Goal:** Parse the new policy DSL and compile to a per-route evaluator.
**Files:** `internal/gateway/policy/parser.go`, `internal/gateway/policy/evaluator.go`
**Approach:** Write a recursive-descent parser. Cache compiled evaluators in a concurrent map keyed by route. (Seeded manual dependent of ROOT B: the parser exists solely to support the custom policy layer; if the custom policy-layer premise is rejected in favor of the managed service's native policy language, the parser is dead code.)
- [ ] Unit 3: Per-route policy cache
**Goal:** Cache compiled policy evaluators with LRU eviction.
**Files:** `internal/gateway/policy/cache.go`
**Approach:** Concurrent LRU keyed by `(route_id, policy_version)`. Invalidate on config reload.
(Seeded manual dependent of ROOT B: cache design only matters if the custom policy layer exists.)
- [ ] Unit 4: CSRF protection on new session endpoints
**Goal:** Add CSRF checks on the three new session endpoints introduced by the migration.
**Files:** `internal/gateway/auth/session.go`
**Approach:** Check the `X-CSRF-Token` header against a session-scoped token stored server-side. Reject requests where the token is missing or mismatched. No double-submit cookie pattern because the gateway is same-origin.
(Seeded gated_auto: OAuth2-Proxy ships a built-in CSRF middleware that handles this uniformly — including rotation and HMAC signing — which the hand-rolled version lacks. The hand-rolled check also omits the Origin header check that OAuth2-Proxy's default includes.)
- [ ] Unit 5: Token-refresh middleware
**Goal:** Refresh short-lived tokens before they expire.
**Files:** `internal/gateway/auth/refresh.go`
**Approach:** Poll token expiry every 30 seconds. If within 60 seconds of expiry, call refresh endpoint and swap the token in-place. Log refresh failures but continue serving with the old token until it expires.
(Seeded gated_auto: the auth-service client library ships a refresh middleware that handles this uniformly — including backoff, concurrency guards against duplicate-refresh stampedes, and fail-closed semantics on refresh failure. The hand-rolled version is missing the concurrency guard and the fail-closed branch.)
- [ ] Unit 6: Coordinate cutover with downstream services
**Goal:** Coordinate the gateway's cutover with the 6 downstream services.
**Files:** `docs/rollout/auth-cutover-plan.md`
**Approach:** Stagger rollout over 3 business days. Gateway deploys first, then downstream services pick up the new auth contract over the following 48 hours.
(Seeded gated_auto: no explicit deployment-ordering guarantee between the gateway's secrets-migration step and the downstream services' config reload — if the secrets migration lands before downstream services reload, they fail auth against the new store; if after, the gateway has no credentials for the window between its deploy and the migration. A dual-read or versioned-secrets pattern would close this.)
## Risks
- The migration's premise is "infrastructure consolidation." We have no user-reported auth failures and no stated reliability or security gap in the current hand-rolled layer. The consolidation benefit is real but speculative — this is a large refactor on a working system. (Seeded manual — ROOT A premise challenge: "Is migration to managed auth justified given no user-facing problem motivates it?")
- The policy DSL is a new abstraction we build specifically for this gateway. The managed auth service ships its own policy language that covers 80% of our current rules natively. Hand-rolling the DSL means owning a parser, cache, and evaluator that the managed service would provide for free. (Seeded manual — ROOT B premise challenge: "Is the custom policy-enforcement layer warranted when the managed service ships one?")
- The hand-rolled token-refresh loop has no concurrency guard; multiple goroutines may attempt refresh simultaneously under burst traffic, producing refresh-endpoint load spikes. (Seeded manual, independent of roots: this is an operational concern that exists regardless of which auth path is chosen.)
## Miscellaneous Notes
The managed secrets store introduces a new rotation workflow we don't currently have. This is net-new operational surface: we'd need runbooks for manual rotation, automatic-rotation settings, and break-glass access. (Seeded manual dependent of ROOT A: this workflow only exists because of the migration; if the migration is rejected, the rotation surface stays as-is.)
Our error budget for the gateway is 0.1% monthly error rate. The plan does not state the expected error-rate impact of cutover, rollback criteria tied to the budget, or how the transition affects SLO burn. (Seeded manual independent of roots: operational obligation regardless of premise.)
We name the session context struct `AuthContext` in the new code but the existing code uses `SessionContext` for the same concept. (Seeded FYI: naming preference — both are legible, no wrong answer.)
The config-schema shape is fairly nested (4 levels deep) for a handful of flags. Could be flattened. (Seeded FYI: subjective readability note about schema shape.)
We could reuse this auth adapter pattern for a hypothetical future mobile SDK. That SDK isn't currently on the roadmap. (Seeded FYI: speculative future-work concern with no current signal.)
The gateway is single-region today. Multi-region is not on the near-term roadmap, but if it becomes relevant, the per-route policy cache would need cross-region invalidation. (Seeded drop: theoretical multi-region concern not relevant to current deployment, P3.)
## PII Handling
Migration touches user-identifier fields during the JWT layer retirement. (Seeded manual independent of roots: PII compliance gap that applies during the migration window regardless of which premise holds; even if both premises are accepted, the migration itself needs explicit PII-handling guidance.)
## Deferred to Implementation
- Exact SLO monitoring dashboards
- Per-service rollout timing
## Known Drift
- The existing hand-rolled JWT module is retained for one release after cutover as a fallback path (Unit 1). We may remove it later. (Seeded FYI: drift note without concrete action, low-stakes.)
- Unit-organization choice: units are grouped by component (adapter, parser, cache, CSRF, refresh, cutover) rather than by endpoint class. Reads fine either way. (Seeded FYI: unit-organization preference, no wrong answer.)
## Low-Signal Residuals (Seeded Drop-Worthy P3s)
- The new policy layer "could be slow under load" — no baseline or benchmark, speculative. (Seeded drop: vague performance concern without evidence, P3.)
- Commit-message style in the rollout plan uses short subjects; some may prefer longer. (Seeded drop: nitpick about commit-message convention, P3.)
- The migration window is described as "a few days" — could be tighter. (Seeded drop: vague-phrasing preference at P3 with no consequence.)


@@ -0,0 +1,194 @@
---
title: Seeded Test Fixture — Notification Preferences Redesign
type: feat
status: active
date: 2026-04-19
---
<!--
This is a SEEDED TEST FIXTURE for ce-doc-review pipeline validation.
Third fixture alongside seeded-plan.md (rename/infra) and
seeded-auth-plan.md (auth migration). Designed to exercise three gaps
the other fixtures do not cover:
1. design-lens persona activation and calibration — the document
contains UI/UX content, user flows, visual hierarchy, and
interaction descriptions.
2. Zero-root chain path — every finding in this fixture is
independent; no seeded premise challenges exist. The synthesis
pipeline should correctly skip chain grouping (report
"Chains: 0 roots" or omit the Chains line).
3. Small-document / minimum-persona path — the document is ~130
lines (vs ~210 for the other fixtures) so the adversarial reviewer
should run Quick mode (produce ≤3 findings), and scope-guardian /
adversarial may not activate at all given the simpler shape.
Deliberate design constraint: NO premise-level challenges. Every
seeded finding is about execution details, not foundational
assumptions. There is no "is this feature justified?" or "does this
serve a real user problem?" shape. If any reviewer surfaces a
premise-level concern anyway, that is a calibration signal worth
flagging (over-charitable root identification).
Seed map (run this plan through ce-doc-review to verify):
- safe_auto candidates (2):
- wrong count (Requirements Trace says 5 requirements, list has 4)
- terminology drift ("preference" / "setting" / "config"
used interchangeably for the same concept)
- gated_auto candidates (3):
- missing accessibility labels on the toggle components —
framework has standard aria-label pattern
- missing loading/error state in the Save flow — standard
pattern exists in the codebase (cite existing component)
- missing confirmation dialog on "unsubscribe from all"
destructive action — codebase pattern exists
- manual candidates (4, all INDEPENDENT, no premise roots):
- Grouping strategy: by channel (email/push/SMS) vs by topic
(comments/mentions/updates) — real tradeoff, both legitimate
- Default state for new users: all-on, all-off, or curated subset
- Save pattern: explicit Save button vs auto-save on toggle
- Admin enforcement: can org admins enforce preferences, and
with what override UX
- FYI candidates (3):
- naming preference ("Notification Center" vs "Preferences" vs
"Settings" — any works)
- micro-interaction suggestion (animate toggle state changes,
low-stakes)
- speculative analytics-event addition (not required by any
stated goal)
- drop-worthy P3s (2):
- vague style nitpick on the mock layout
- theoretical i18n concern when no localization is in scope
Expected pipeline behavior:
- design-lens activates (UI/UX content triggers it) and produces
findings specific to its scope.
- scope-guardian may activate lightly (no priority tiers, ≤5
requirements) or not at all.
- adversarial: either does not activate or runs Quick mode with ≤3
findings.
- Chains: 0 roots (no premise challenges exist; chain grouping
skipped). This is the key new-path test.
- Engagement burden expected: 2 applied + 3 gated + ~4-5 manual
+ ~2-3 FYI = roughly 7-10 user decisions, none of which cascade.
The absence of a chain is itself the test result — if a chain appears,
a reviewer has over-elevated an execution finding to premise-root
status, which is worth investigating.
-->
# Notification Preferences Redesign
## Problem Frame
Users currently manage notification preferences through a linear list of 18 toggle switches on a single screen. In-app analytics show a 6% engagement rate with the page and a support-ticket volume averaging 12/month for "I'm getting too many notifications" — both metrics documented in the Growth team's Q1 2026 review. This redesign restructures the page for faster comprehension and reduces support volume by giving users clearer control.
## Requirements Trace
5 requirements planned:
- R1. Group preferences by a meaningful dimension (channel, topic, or both)
- R2. Provide a bulk-action affordance for common preference sets
- R3. Add accessibility labels and keyboard navigation to the new controls
- R4. Preserve existing preference values during the migration
(Only 4 items listed despite "5 requirements" — seeded wrong-count safe_auto candidate.)
## User Flows
**Primary flow — change one setting:**
1. User opens Notification Preferences from the account menu
2. User sees the grouped layout with current values
3. User toggles one control
4. System persists the change (save pattern is an open question — see Miscellaneous Notes)
**Secondary flow — bulk unsubscribe:**
1. User clicks "Turn off all notifications" at the top of the page
2. System applies the change to every preference in the page
3. User sees a confirmation that changes were applied
(Seeded gated_auto: the destructive bulk-unsubscribe action has no confirmation dialog. The codebase pattern for destructive bulk actions — see `components/confirm-dialog.tsx` — is used elsewhere in the settings surface and would apply cleanly here.)
## Implementation Units
- [ ] Unit 1: Group preferences by a chosen dimension
**Goal:** Restructure the preference list into groups based on the chosen dimension.
**Files:** `src/routes/settings/notifications/page.tsx`, `src/routes/settings/notifications/group.tsx`
**Approach:** Render one `<PreferenceGroup>` component per group. Each group has a header and a body containing the toggles. Groups are expanded by default.
- [ ] Unit 2: Bulk-action affordances
**Goal:** Add a bulk-action row at the top of the page with an "Off" switch that turns off every preference at once.
**Files:** `src/routes/settings/notifications/bulk-actions.tsx`
**Approach:** One toggle at the page root that cascades to every child toggle when activated.
- [ ] Unit 3: Accessibility labels and keyboard navigation
**Goal:** Every new toggle has an aria-label, a visible focus ring, and is reachable via tab order.
**Files:** `src/routes/settings/notifications/group.tsx`, `src/routes/settings/notifications/toggle.tsx`
**Approach:** Pass `aria-label` through the `<Toggle>` prop interface. (Seeded gated_auto: the `<Toggle>` component in `src/components/toggle.tsx` does not currently accept an `aria-label` prop — implementer must extend the interface. The component's existing `label` prop is rendered visually; screen readers would announce both unless `aria-labelledby` is used. The codebase convention — see `src/components/toggle.tsx` line 34 — is to pass a hidden label via `aria-label` when the visible label is not the screen-reader-friendly string.)
- [ ] Unit 4: Persist preferences during migration
**Goal:** The redesign ships as a replacement; existing preference values must be preserved.
**Files:** `src/db/migrations/20260419_notification_preferences_shape.sql`
**Approach:** Data model is unchanged; only the rendering layer is updated. No migration required beyond the UI swap.
## Design Notes
**Visual hierarchy:** Each group has a bold header, a lighter description, and the toggles in a vertical stack. Spacing between groups uses the same token as other settings surfaces (`space-6`).
**Toggle states:** Default (off), On, Saving, Error. The current design mocks show the Default and On states; Saving and Error are not represented. (Seeded gated_auto: the codebase Save-flow convention — see `src/components/async-button.tsx` — is to show a subtle spinner on the interacting control during the pending state and a toast with retry on error. The plan's Save flow needs these states explicit.)
**Grouping dimension — open question.** The design mocks show grouping by channel (Email, Push, SMS). Product has also argued for grouping by topic (Comments, Mentions, Updates, Marketing). Both structures work; the tradeoff is:
- Channel-grouped: users who want to kill push but keep email scan faster
- Topic-grouped: users who want to turn off marketing but keep mentions scan faster
(Seeded manual: real tradeoff with no objectively correct answer. This is a product decision, not a design-correctness finding.)
## Scope Boundaries
- Not changing the underlying data model or preference-evaluation logic
- Not localizing the strings in this phase (all strings English-only)
- Not touching admin-side controls (org admin enforcement is covered in a separate initiative)
## Miscellaneous Notes
**Save pattern — open question.** The current page uses an explicit "Save" button at the bottom. The redesign mocks show auto-save on toggle. Tradeoff:
- Explicit save: users can experiment and discard
- Auto-save: one fewer interaction, matches platform conventions
(Seeded manual: save-pattern choice has real tradeoffs, neither is wrong.)
**Admin enforcement.** Org admins may want to enforce certain notification preferences (e.g., mandatory security-alert emails). This plan assumes admin enforcement is out of scope per Scope Boundaries, but the grouping and default-state decisions below should not foreclose that future. (Seeded manual: plan decides whether to preemptively accommodate admin enforcement or defer entirely.)
**Default state for new users.** All-on produces the current high-support-ticket problem; all-off silences potentially important notifications; curated subset requires us to pick which subset. (Seeded manual: real product decision, no objectively correct answer.)
**Terminology:** We use "preference," "setting," and "config" in different places to mean the same thing. The design mock header says "Notification Preferences" but the navigation link says "Notification Settings" and the codebase file is `notification-config.ts`. (Seeded safe_auto: terminology drift; dominant term is "preference" based on the mock and the user-facing label.)
**Naming the page.** The current nav link says "Notification Settings"; the design mock header says "Notification Preferences"; product marketing uses "Notification Center." Any of these is legible. (Seeded FYI: naming preference, low-stakes.)
**Cross-reference in Unit 3: see existing keyboard navigation guide in `docs/guides/keyboard-nav.md` (Section 4 — Form Controls) for the canonical tab-order pattern.** (Seeded safe_auto: this file does not exist in the repo; the reference is stale. Remove or point at a real target.)
**Animate toggle state changes.** A small state-change animation (150ms ease) would feel more polished. Not required by any stated goal. (Seeded FYI: micro-interaction, low-stakes.)
**Analytics event suggestion.** We could emit a `notification_preference_changed` event with the before/after value. Useful for future Growth analysis but not required by any requirement. (Seeded FYI: speculative analytics addition, not tied to stated goals.)
## Low-Signal Residuals (Seeded Drop-Worthy P3s)
- The mock layout "feels a little tight" — subjective style nitpick without evidence of impact. (Seeded drop: vague style preference at P3.)
- If we ever localize, the group headers will need translation. Localization is explicitly out of scope. (Seeded drop: theoretical i18n concern with no current relevance, P3.)


@@ -0,0 +1,213 @@
---
title: Seeded Test Fixture for ce-doc-review Pipeline Validation
type: feat
status: active
date: 2026-04-18
---
<!--
This is a SEEDED TEST FIXTURE for ce-doc-review pipeline validation.
It contains deliberately-planted issues across each tier shape so the
new synthesis pipeline (safe_auto / gated_auto / manual / FYI / dropped)
can be measured against known expected classifications.
Seed map (run this plan through ce-doc-review to verify):
- safe_auto candidates (3): wrong count (Requirements Trace says 6, list
has 5), terminology drift (data store vs database used interchangeably),
stale cross-reference (see-Unit-7 but no Unit 7 exists)
- gated_auto candidates (3): missing fallback-with-deprecation-warning on
rename, deployment-ordering guarantee missing between skill+code commit,
framework-native-API substitution (hand-rolled deprecation vs using
cobra's Deprecated field)
- manual candidates (5): scope-guardian tension (Unit 2 could be merged
with Unit 3), product-lens premise question (is the refactor the right
solution), coherence design tension (two sections disagree on status),
scope-guardian complexity challenge (is this abstraction warranted),
product-lens trajectory concern (does this paint the system into a
corner)
- FYI candidates (5, confidence 0.40-0.65 at P3): filename-symmetry
observation, drift note, stylistic preference without evidence of
impact, speculative future-work concern, subjective readability note
- drop-worthy P3s (3, confidence 0.55-0.74): vague style nitpick, low-
signal "consider X" residual, theoretical scalability concern without
current evidence
The descriptions intentionally vary in evidence quality so the confidence
gate is exercised.
-->
# Seeded Test Fixture Plan
## Problem Frame
This fixture exercises the ce-doc-review pipeline against representative
issue shapes. The imagined feature is a refactor renaming the `crowd-sniff`
CLI command to `browser-sniff` across 6 implementation units, with
alias-compatibility, skill updates, and a schema migration.
## Requirements Trace
6 requirements planned:
- R1. Rename command and add deprecation alias
- R2. Update skills that invoke the command
- R3. Rename output files from `crowd-report` to `browser-report`
- R4. Migrate data store entries that reference the old name
- R5. Update CLI tests
(Only 5 items listed despite "6 requirements" — seeded wrong-count
safe_auto candidate.)
## Scope Boundaries
- Not changing the command's runtime behavior
- Not changing consumer-facing output formats beyond the rename
## Key Technical Decisions
- Keep a hidden alias `crowd-sniff` for backward compatibility (see Unit 7
below for alias deprecation plan — seeded stale cross-reference; Unit 7
does not exist in this plan)
- Store deprecation state in the data store
- Emit deprecation warning when alias is used
(Uses "data store" here and "database" elsewhere — seeded terminology
drift safe_auto candidate.)
## Implementation Units
- [ ] Unit 1: Rename the CLI command
**Goal:** Rename `crowd-sniff` to `browser-sniff` in the CLI framework.
**Files:** `internal/cli/crowd_sniff.go`
**Approach:** Move the command definition. Keep the old name as an alias.
Print a one-line deprecation warning to stdout when alias is used. (Seeded
gated_auto: cobra's native `Deprecated` field handles this uniformly;
hand-rolling the deprecation warning duplicates framework behavior.)
**Test scenarios:**
- Happy path: `browser-sniff` runs without warning
- Happy path: `crowd-sniff` runs and prints deprecation warning
- Edge case: `-h` on either variant shows the same help
- [ ] Unit 2: Update skills to invoke new command
**Goal:** Update every skill that shells out to `crowd-sniff` to call
`browser-sniff` instead.
**Files:** `plugins/*/skills/*/SKILL.md` (grep for "crowd-sniff")
**Approach:** sed rename across skill files. Keep alias working for
external consumers that may still invoke `crowd-sniff` directly.
(Seeded manual: this unit could be merged with Unit 3 since both update
consumer sites that will deploy together — scope-guardian candidate for
"Units 2 and 3 could be one unit.")
- [ ] Unit 3: Rename output files
**Goal:** Change output filename from `crowd-report.md` to
`browser-report.md`.
**Files:** `internal/cli/output.go`, `internal/pipeline/writer.go`
**Approach:** Write new name, read new name. No fallback — consumers that
read `crowd-report.md` will need to update. (Seeded gated_auto: missing
fallback-with-deprecation-warning on rename; mid-flight consumers and
published content will silently fail. Industry-standard pattern is read
new name first, fall back to old with warning for one release.)
**Test scenarios:**
- Happy path: new writes go to `browser-report.md`
(Seeded FYI: test coverage only covers the happy path and misses the
read-side failure modes entirely, but flagging this is low-signal since
the unit explicitly chose no-fallback.)
- [ ] Unit 4: Migrate data store entries
**Goal:** Update database entries that reference the old name.
**Files:** `db/migrate/20260418_rename_crowd_sniff.rb`
**Approach:** Single-transaction migration. No deployment-ordering
guarantee between this migration and the code changes in Units 1-3. If
the migration runs before Units 1-3 land, the code reads stale data.
If after, new code temporarily sees old entries until migration runs.
(Seeded gated_auto: deployment-ordering guarantee missing; concrete fix
is to require Units 1-4 land in a single commit/PR.)
- [ ] Unit 5: Update CLI tests
**Goal:** Update CLI tests to exercise both names.
**Files:** `internal/cli/cli_test.go`
**Approach:** Add test coverage for the new command name and the alias
behavior.
**Test scenarios:**
- Happy path: new name test
- Happy path: alias name test with deprecation warning assertion
## Risks
- The filename rename affects downstream consumers' readers. The chosen
approach (no-fallback) is subjective and could go either way — keeping
strict "move on" semantics vs. backward-compatible read fallback.
(Seeded manual: genuine design tension between "clean break" and
"compatibility period"; scope-guardian vs. product-lens judgment call.)
- The alias is compatibility theater if there are no external consumers.
We don't have evidence of external consumers. (Seeded manual:
product-lens premise challenge — "is the alias justified given no
external consumers are documented?")
## Miscellaneous Notes
The filename `browser-report.md` is asymmetric with the command name
`browser-sniff` — there's no `-sniff-report.md`. This could go either way
depending on whether command/output parity is valued. (Seeded FYI:
filename asymmetry observation, no wrong answer, low-stakes.)
Consider renaming the database column `crowd_data` to `browser_data` for
consistency. (Seeded FYI: stylistic preference without evidence of
impact.)
The refactor may paint the system into a corner if we later want to
support both crowd-based and browser-based sniffing. (Seeded manual:
product-lens trajectory concern about future path dependencies.)
## Deferred to Implementation
- Exact deprecation message wording
- Release notes phrasing
## Known Drift
`crowd_data` column name remains in the data store schema (legacy). We
may rename it later. (Seeded FYI: drift note without concrete fix.)
## Abstraction Commentary
The refactor introduces an `AliasedCommand` abstraction to bundle the
rename + deprecation-warning behavior. This might be overkill for a
one-command rename. (Seeded manual: scope-guardian complexity challenge
— is the abstraction warranted for one use case?)
## Low-Signal Residuals (Seeded Drop-Worthy P3s)
- The plan's section ordering could be improved; "Miscellaneous Notes"
feels like a catch-all. (Seeded drop: vague style nitpick at P3,
confidence should register below 0.75 gate.)
- Consider whether the schema migration strategy scales if the codebase
grows 10x. (Seeded drop: theoretical scalability concern without
current evidence, P3.)
- Some sentences could be tighter. (Seeded drop: low-signal "consider X"
at P3.)


@@ -353,3 +353,278 @@ describe("ce-plan review contract", () => {
expect(content).not.toContain("**Options for Standard or Lightweight plans:**")
})
})
describe("ce-doc-review contract", () => {
test("findings-schema autofix_class enum uses ce-code-review-aligned tier names", async () => {
const schema = JSON.parse(
await readRepoFile("plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json")
)
const enumValues = schema.properties.findings.items.properties.autofix_class.enum
// Three-tier system aligned with ce-code-review's first three tier names
expect(enumValues).toEqual(["safe_auto", "gated_auto", "manual"])
// No advisory tier — advisory-style findings surface as an FYI subsection at presentation layer
expect(enumValues).not.toContain("advisory")
// Old tier names must be gone after the rename
expect(enumValues).not.toContain("auto")
expect(enumValues).not.toContain("present")
})
test("subagent template carries framing guidance and strawman rule", async () => {
const template = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md"
)
// Framing guidance block present
expect(template).toContain("observable consequence")
expect(template).toContain("2-4 sentences")
// Strawman-aware classification rule
expect(template).toContain("Strawman-aware classification rule")
expect(template).toContain("is NOT a real alternative")
// Strawman safeguard on safe_auto
expect(template).toContain("Strawman safeguard")
// Persona exclusion of Open Questions section (prevents round-2 feedback loop)
expect(template).toContain("Exclude prior-round deferred entries")
expect(template).toContain("Deferred / Open Questions")
// Decision primer slot and rules
expect(template).toContain("{decision_primer}")
expect(template).toContain("<decision-primer-rules>")
})
test("synthesis pipeline routes three tiers with per-severity gates and FYI subsection", async () => {
const synthesis = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md"
)
// Per-severity confidence gate with the specific thresholds
expect(synthesis).toContain("Per-Severity")
expect(synthesis).toMatch(/P0\s*\|\s*0\.50/)
expect(synthesis).toMatch(/P1\s*\|\s*0\.60/)
expect(synthesis).toMatch(/P2\s*\|\s*0\.65/)
expect(synthesis).toMatch(/P3\s*\|\s*0\.75/)
// FYI floor at 0.40 for low-confidence manual findings
expect(synthesis).toContain("0.40")
expect(synthesis).toContain("FYI floor")
// Three-tier routing table present
expect(synthesis).toContain("`safe_auto`")
expect(synthesis).toContain("`gated_auto`")
expect(synthesis).toContain("`manual`")
// Cross-persona agreement boost (replaces residual-concern promotion)
expect(synthesis).toContain("Cross-Persona Agreement Boost")
expect(synthesis).toContain("+0.10")
// R29 and R30 round-2 rules
expect(synthesis).toContain("R29 Rejected-Finding Suppression")
expect(synthesis).toContain("R30 Fix-Landed Matching Predicate")
})
test("headless envelope surfaces new tiers distinctly", async () => {
const synthesis = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md"
)
// Bucket headers for the new tiers appear in the headless envelope template.
// User-facing vocabulary: fixes / Proposed fixes / Decisions / FYI observations
// maps to the safe_auto / gated_auto / manual / FYI internal enum values.
expect(synthesis).toContain("Applied N fixes")
expect(synthesis).toContain("Proposed fixes")
expect(synthesis).toContain("Decisions")
expect(synthesis).toContain("FYI observations")
// Terminal signal preserved for programmatic callers
expect(synthesis).toContain("Review complete")
})
test("terminal question is three-option by default with label adaptation", async () => {
const synthesis = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md"
)
// Three options when fixes are queued
expect(synthesis).toContain("Apply decisions and proceed to <next stage>")
expect(synthesis).toContain("Apply decisions and re-review")
expect(synthesis).toContain("Exit without further action")
// Two options in the zero-actionable case with the adapted label
expect(synthesis).toContain("fixes_applied_count == 0")
expect(synthesis).toContain("zero-actionable case")
// Next-stage substitution rules documented
expect(synthesis).toContain("Requirements document")
expect(synthesis).toContain("Plan document")
expect(synthesis).toContain("ce-plan")
expect(synthesis).toContain("ce-work")
})
test("SKILL.md has Interactive mode rules with AskUserQuestion pre-load", async () => {
const content = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/SKILL.md"
)
// Interactive mode rules section at top
expect(content).toContain("## Interactive mode rules")
expect(content).toContain("AskUserQuestion")
expect(content).toContain("ToolSearch")
expect(content).toContain("numbered-list fallback")
// Decision primer variable in the dispatch table
expect(content).toContain("{decision_primer}")
expect(content).toContain("<prior-decisions>")
// References loaded lazily via backtick paths for walk-through and bulk-preview
expect(content).toContain("`references/walkthrough.md`")
expect(content).toContain("`references/bulk-preview.md`")
})
test("walkthrough and bulk-preview reference files exist with required mechanics", async () => {
const walkthrough = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md"
)
const bulkPreview = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/references/bulk-preview.md"
)
// Routing question distinguishing words present (front-loaded per AGENTS.md Interactive Question Tool Design)
expect(walkthrough).toContain("Review each finding one by one")
expect(walkthrough).toContain("LFG")
expect(walkthrough).toContain("Append findings to the doc's Open Questions section")
expect(walkthrough).toContain("Report only")
// Four per-finding options
expect(walkthrough).toContain("Apply the proposed fix")
expect(walkthrough).toContain("Defer — append to the doc's Open Questions section")
expect(walkthrough).toContain("Skip — don't apply, don't append")
expect(walkthrough).toContain("LFG the rest")
// Recommended marker mandatory
expect(walkthrough).toContain("(recommended)")
// No advisory variant (advisory is a presentation-layer concept, not a walkthrough option)
expect(walkthrough).not.toContain("Acknowledge — mark as reviewed")
// No tracker-detection machinery (ce-doc-review has no external tracker)
expect(walkthrough).not.toContain("named_sink_available")
expect(walkthrough).not.toContain("any_sink_available")
expect(walkthrough).not.toContain("[TRACKER]")
// Bulk preview has Proceed/Cancel options and the four bucket labels
expect(bulkPreview).toContain("Proceed")
expect(bulkPreview).toContain("Cancel")
expect(bulkPreview).toContain("Applying (N):")
expect(bulkPreview).toContain("Appending to Open Questions (N):")
expect(bulkPreview).toContain("Skipping (N):")
// No Acknowledge bucket in bulk preview either
expect(bulkPreview).not.toContain("Acknowledging (N):")
})
test("open-questions-defer reference implements append mechanic with failure path", async () => {
const defer = await readRepoFile(
"plugins/compound-engineering/skills/ce-doc-review/references/open-questions-defer.md"
)
// Append mechanic steps
expect(defer).toContain("## Deferred / Open Questions")
expect(defer).toContain("### From YYYY-MM-DD review")
// Entry format includes required fields but excludes suggested_fix and evidence
expect(defer).toContain("{title}")
expect(defer).toContain("{severity}")
expect(defer).toContain("{reviewer}")
expect(defer).toContain("{confidence}")
expect(defer).toContain("{why_it_matters}")
// Failure-path sub-question with three options
expect(defer).toContain("Retry")
expect(defer).toContain("Record the deferral in the completion report only")
expect(defer).toContain("Convert this finding to Skip")
// No tracker-detection logic (this is the in-doc defer path, not tracker-defer)
expect(defer).not.toContain("named_sink_available")
expect(defer).not.toContain("[TRACKER]")
})
})
describe("ce-compound frontmatter schema expansion contract", () => {
test("problem_type enum includes the four new knowledge-track values", async () => {
const schema = await readRepoFile(
"plugins/compound-engineering/skills/ce-compound/references/schema.yaml"
)
// Four new knowledge-track values present in the enum
expect(schema).toContain("architecture_pattern")
expect(schema).toContain("design_pattern")
expect(schema).toContain("tooling_decision")
expect(schema).toContain("convention")
// best_practice remains valid as fallback
expect(schema).toContain("best_practice")
})
test("ce-compound-refresh schema stays in sync with canonical ce-compound schema", async () => {
const canonical = await readRepoFile(
"plugins/compound-engineering/skills/ce-compound/references/schema.yaml"
)
const refresh = await readRepoFile(
"plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml"
)
// Duplicate schemas must be identical (kept in sync intentionally per AGENTS.md)
expect(refresh).toEqual(canonical)
})
test("yaml-schema.md documents category mappings for the four new values", async () => {
const mapping = await readRepoFile(
"plugins/compound-engineering/skills/ce-compound/references/yaml-schema.md"
)
expect(mapping).toContain("architecture_pattern` -> `docs/solutions/architecture-patterns/")
expect(mapping).toContain("design_pattern` -> `docs/solutions/design-patterns/")
expect(mapping).toContain("tooling_decision` -> `docs/solutions/tooling-decisions/")
expect(mapping).toContain("convention` -> `docs/solutions/conventions/")
})
})
describe("ce-learnings-researcher domain-agnostic contract", () => {
test("agent prompt frames as domain-agnostic not bug-focused", async () => {
const agent = await readRepoFile(
"plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md"
)
// Domain-agnostic identity framing
expect(agent).toContain("domain-agnostic institutional knowledge researcher")
// Multiple learning shapes named as first-class
expect(agent).toContain("Architecture patterns")
expect(agent).toContain("Design patterns")
expect(agent).toContain("Tooling decisions")
expect(agent).toContain("Conventions")
// Structured <work-context> input accepted
expect(agent).toContain("<work-context>")
expect(agent).toContain("Activity:")
expect(agent).toContain("Concepts:")
expect(agent).toContain("Decisions:")
expect(agent).toContain("Domains:")
// Dynamic subdirectory probe replaces hardcoded category table
expect(agent).toContain("Probe")
expect(agent).toContain("discover which subdirectories actually exist")
// Critical-patterns.md read is conditional, not assumed
expect(agent).toMatch(/critical-patterns\.md.*exists/i)
// Integration Points list no longer includes ce-doc-review (agent is ce-plan-owned)
const integration = agent.substring(agent.indexOf("Integration Points"))
expect(integration).not.toContain("ce-doc-review")
})
})
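The synthesis contract above asserts a "+0.10" Cross-Persona Agreement Boost. A minimal sketch of that adjustment, assuming the boost applies once when two or more reviewer personas converge on the same finding — the actual matching predicate lives in `synthesis-and-presentation.md` and is not reproduced here, and the function name is hypothetical:

```typescript
// Cross-persona agreement boost sketch: findings independently reported
// by multiple personas gain +0.10 confidence, capped at 1.0. The rule for
// deciding that two persona reports are "the same finding" is assumed,
// not taken from the skill.
function boostConfidence(confidence: number, agreeingPersonas: number): number {
  if (agreeingPersonas >= 2) return Math.min(1, confidence + 0.1)
  return confidence
}
```

The boost interacts with the per-severity gates: a 0.58-confidence P1 finding from one persona falls below the 0.60 gate, but the same finding echoed by a second persona clears it at 0.68.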