diff --git a/docs/brainstorms/2026-04-17-ce-review-interactive-judgment-requirements.md b/docs/brainstorms/2026-04-17-ce-review-interactive-judgment-requirements.md new file mode 100644 index 0000000..eccd3b0 --- /dev/null +++ b/docs/brainstorms/2026-04-17-ce-review-interactive-judgment-requirements.md @@ -0,0 +1,155 @@ +--- +date: 2026-04-17 +topic: ce-review-interactive-judgment +--- + +# ce:review Interactive Judgment Loop + +## Problem Frame + +`ce:review`'s Interactive mode produces a report, auto-applies `safe_auto` fixes, and then asks a single bucket-level policy question covering every remaining `gated_auto` and `manual` finding as a group. The findings themselves are presented as a pipe-delimited table grouped by severity. + +Two problems surface repeatedly: + +1. **Judgment calls are hard to make.** When a finding needs human judgment, the table row rarely gives enough context to decide confidently. The user is asked to approve or defer a bucket of findings they haven't individually understood. +2. **High-volume feedback is unreason-able.** A review producing 8-12 findings turns into a scrolling table the user can't engage with. There's no way to respond to individual items meaningfully — the only choice is "approve the whole bucket" or "defer the whole bucket." + +The result is that Interactive mode mostly degrades into rubber-stamping or wholesale deferral. The "judgment" in `gated_auto` / `manual` routing is never actually exercised per-finding. + +## Requirements + +**Routing after `safe_auto` fixes** + +- R1. After `safe_auto` fixes are applied, if any `gated_auto` or `manual` findings remain, Interactive mode presents a four-option routing question that replaces today's bucket-level policy question. +- R2. When zero `gated_auto` / `manual` findings remain after `safe_auto`, the routing question is skipped. Interactive mode shows a brief completion summary (e.g., "All findings resolved — N `safe_auto` fixes applied.") before handing off to the final-next-steps flow. +- R3. The routing question names the detected tracker inline (e.g., "File a Linear ticket per finding") only when detection is high-confidence — the tracker is explicitly named in `CLAUDE.md` / `AGENTS.md` or equivalent project documentation. When detection is lower-confidence, the label uses a generic form (e.g., "File an issue per finding") and the agent confirms the tracker with the user before executing any ticket creation. +- R4. The four routing options are: + - (A) `Review each finding one by one — accept the recommendation or choose another action` + - (B) `LFG. Apply the agent's best-judgment action per finding` + - (C) `File a [TRACKER] ticket per finding without applying fixes` + - (D) `Report only — take no further action` +- R5. Routing option C is a batch-defer shortcut: it files tickets for every pending `gated_auto` / `manual` finding without per-finding confirmation. The walk-through's own Defer option is per-finding; C skips that interactivity. + +**Per-finding walk-through (routing option: Review)** + +- R6. When the user picks the walk-through, findings are presented one at a time in severity order (P0 first). Each per-finding question opens with a position indicator (e.g., "Finding 3 of 8 (P1):") so the user can judge how many decisions remain. +- R7. Each per-finding question includes: plain-English statement of what the bug does, severity, confidence, the proposed fix (diff or concrete action), and a short reasoning for why the fix is right (grounded in codebase patterns when available). +- R8. Per-finding options: + - `Apply the proposed fix` + - `Defer — file a [TRACKER] ticket` + - `Skip — don't apply, don't track` + - `LFG the rest — apply the agent's best judgment to this and remaining findings` +- R9. For findings with no concrete fix to apply (advisory-only), option A becomes `Acknowledge — mark as reviewed`. Defer, Skip, and LFG the rest remain unchanged. +- R10. "Override" on a per-finding question means picking a different preset action (Defer or Skip in place of Apply); no inline freeform custom fix authoring. A user who wants a custom fix picks Skip and hand-edits outside the flow. + +When exactly one `gated_auto` / `manual` finding remains, the walk-through's wording adapts for N=1 (e.g., "Review the finding" rather than "Review each finding one by one"), and `LFG the rest` is suppressed because no subsequent findings exist to bulk-handle. + +**LFG path (routing option: LFG)** + +- R11. LFG applies the per-finding action the agent would have recommended in the walk-through — Apply, Defer, or Skip. There is no separate confidence threshold; confidence already shapes what the agent recommends. The top-level LFG option scopes to every `gated_auto` / `manual` finding; the walk-through's `LFG the rest` (R8) scopes to the current finding and everything not yet decided. Both share the same per-finding mechanic and the same bulk preview (R13-R14). +- R12. LFG (and `LFG the rest`) produces a single completion report after execution that must include, at minimum: + - per-finding entries with: title, severity, action taken (Applied / Deferred / Skipped / Acknowledged), the tracker URL or in-session task reference for Deferred entries, and a one-line reason for Skipped entries grounded in the finding's confidence or content + - summary counts by action + - any failures called out explicitly (fix application failed, ticket creation failed) + - the existing end-of-review verdict + +**Bulk action preview** + +- R13. Before executing any bulk action — top-level LFG (routing option B), top-level File tickets (routing option C), or walk-through `LFG the rest` (R8) — Interactive mode presents a compact plan preview and asks the user to confirm with `Proceed` or back out with `Cancel`. Two options. No per-item decisions in the preview; per-item decisioning is the walk-through's role. +- R14. The preview content groups findings by the action the agent intends to take (e.g., `Applying (N):`, `Filing [TRACKER] tickets (N):`, `Skipping (N):`, `Acknowledging (N):`). Each finding gets one line under its bucket, written as a compressed form of the framing-quality bar (R22-R25) — observable behavior over code structure, no function or variable names unless needed to locate the issue. For walk-through `LFG the rest`, the preview scopes to remaining findings only and notes how many are already decided (e.g., "LFG plan — 5 remaining findings (3 already decided)"). + +**Recommendation tie-breaking** + +- R15. When merged findings carry conflicting recommendations across contributing reviewers (e.g., one reviewer says Apply, another says Defer), synthesis picks the most conservative action using the order `Skip > Defer > Apply` so that LFG and walk-through behavior are deterministic and auditable post-hoc. + +**Defer behavior and tracker detection** + +- R16. Defer actions (from the walk-through, from the LFG path, or from routing option C) file a ticket in the project's tracker. +- R17. The SKILL.md instruction for tracker detection is minimal: the agent determines the project's tracker from whatever documentation is obvious (primarily `CLAUDE.md` / `AGENTS.md`), without an enumerated checklist of files to read. +- R18. When tracker detection is uncertain, the agent prefers durable external trackers over in-session-only primitives and communicates both the fallback behavior and the durability trade-off to the user before executing any Defer action. +- R19. If a Defer action fails at ticket-creation time (API error, auth expiry, rate limit, malformed body), the agent surfaces the failure inline and offers: retry, fall back to the next available sink, or convert the finding to Skip with the error recorded in the completion report. Silent failure is not acceptable. +- R20. When no external tracker is detectable and no harness task-tracking primitive is available on the current platform (e.g., CI contexts, converted targets without task binding), Defer is not offered as a menu option. The routing question and walk-through omit Defer paths and the agent tells the user why. +- R21. The internal `.context/compound-engineering/todos/` system is **not** part of the fallback chain. It is on a deprecation path and must not be extended by this work. + +**Framing quality (cross-cutting)** + +- R22. Every user-facing surface that describes a finding — per-finding walk-through questions, LFG completion reports, and ticket bodies filed by Defer actions — explains the problem and fix in plain English that a reader can understand without opening the file. +- R23. The framing leads with the *observable behavior* of the bug (what a user, attacker, or operator sees), not the code structure. Function and variable names appear only when the reader needs them to locate the issue. +- R24. The framing explains *why the fix works*, not just what it changes. When a similar pattern exists elsewhere in the codebase, reference it so the recommendation is grounded. +- R25. The framing is tight: approximately two to four sentences plus the minimum code needed to ground it. Longer framings are a regression. + +*Illustrative pair — weak vs. strong framing for the same finding:* + +> **Weak (code-citation style):** +> *orders_controller.rb:42 — missing authorization check. Add `current_user.owns?(account)` guard before query.* +> +> **Strong (framed for a human):** +> *Any signed-in user can read another user's orders by pasting the target account ID into the URL. The controller looks up the account and returns its orders without verifying the current user owns it. Adding a one-line ownership guard before the lookup matches the pattern already used in the shipments controller for the same attack.* + +- R26. R22-R25 depend on reviewer personas producing framing-suitable `why_it_matters` and `evidence` fields. If the planning-phase sample shows existing persona outputs do not meet this bar, persona prompt upgrades (or a synthesis-time rewrite pass) land with or before this work. + +**Mode boundaries** + +- R27. Only Interactive mode changes behavior. Autofix, Report-only, and Headless modes are unchanged. +- R28. The existing post-review "final next steps" flow (push fixes / create PR / exit) runs only when one or more fixes were applied to the working tree. It is skipped after routing option C (File tickets per finding) and option D (Report only), and skipped when LFG or the walk-through completes without any Apply action. + +## Success Criteria + +- A user facing a review with one high-stakes finding can decide confidently about the fix without rereading the file. +- A user facing a review with 8+ findings has a clear path to either engage per-item or trust the agent's judgment in one keystroke. +- A user who starts the walk-through but runs out of attention can bail mid-flow into a bulk action without losing the findings still ahead of them. +- Deferred findings land in the team's actual tracker (not a `.context/` file that gets forgotten). +- LFG runs feel honest: the completion report makes clear what was applied and why, so a user can audit the agent's judgment post-hoc. +- For reviews with three or more `gated_auto` / `manual` findings, Review is picked at a meaningful share of the time — LFG alone is not disproportionately the default, so the intervention actually shifts engagement upward rather than renaming the rubber-stamp. +- A first-time user of Interactive mode understands which routing options cause external side effects (fixes applied to the working tree, tickets filed in an external tracker) before choosing, without needing external docs. + +## Scope Boundaries + +- No new `ce:fix` skill. All changes live inside `ce:review`. +- No changes to the findings schema, persona agents, merge/dedup pipeline, or autofix-mode residual-todo creation in this work. +- No inline freeform fix authoring in the walk-through. The walk-through is a decision loop, not a pair-programming surface. +- The "approve the fix's intent but write a variant" case is explicitly unsupported in v1. Users in that situation pick Skip and hand-edit outside the flow; if they want the variant tracked, they file a ticket manually. +- No changes to Autofix, Report-only, or Headless mode behavior. +- The pre-menu findings table format (pipe-delimited, severity-grouped) is intentionally unchanged. The walk-through is the engagement surface for high-volume feedback; the table only needs to be scannable enough to reach the routing menu. Restructuring the table format is a separate follow-up if it proves necessary. +- Phasing out the internal `.context/compound-engineering/todos/` system and the `/todo-create`, `/todo-triage`, `/todo-resolve` skills is acknowledged as the long-term direction but is not scoped into this redesign. A separate follow-up covers that cleanup. +- The current bucket-level policy question wording (`Review and approve specific gated fixes` / `Leave as residual work` / `Report only`) is removed and replaced by the four-option routing question. No backward-compatibility shim. + +## Key Decisions + +- **Expand Interactive mode, no new skill.** Review and fix stay colocated; the review artifact, routing metadata, and fixer subagent are already wired up. A separate `ce:fix` skill would split state and add reintegration cost without clear benefit. +- **Four-option routing upfront, not an escape hatch buried inside the walk-through.** LFG and tracker-deferral are legitimate primary intents for many reviews, not fallbacks. Offering them as peers to the walk-through is honest about how users actually want to engage. +- **LFG = auto-accept recommendations, not a separate confidence policy.** Keeps the mental model simple. Confidence is already baked into whether the agent recommends Apply, Defer, or Skip for a given finding. +- **Tracker detection is reasoning-based, not rote.** Agents are smart enough to read the obvious documentation. An enumerated checklist of files in SKILL.md is pure rationale-discipline tax and caps the agent at the sources we happened to list. +- **Harness task tracking is the last-resort fallback, not internal todos.** Aligns with the deprecation direction for the internal todo system. Honest about the fact that in-session tasks don't survive past the session. +- **Override in the walk-through = pick a different preset action.** No freeform custom fixes. Keeps the interaction a decision loop and avoids turning it into a pair-programming transcript. Users who want custom fixes Skip and hand-edit. +- **Internal-todos deprecation ships a durability regression for some users.** A subset of users today treat `.context/compound-engineering/todos/` as persistent defer storage; removing it from the fallback chain means those users lose cross-session durability for Defer actions until they either document a tracker in `CLAUDE.md` / `AGENTS.md` or the broader phase-out lands. The trade is acknowledged and deliberate, not a silent regression; the mitigation is the separate phase-out cleanup referenced in Scope Boundaries. + +## Dependencies / Assumptions + +- The cross-platform blocking question tool (`AskUserQuestion` / `request_user_input` / `ask_user`) caps at 4 options. All menu designs respect this. +- Option labels across every menu in the flow (routing question, per-finding question, Stop-asking follow-up) must be self-contained, use third-person voice for the agent, and front-load a distinguishing word so they survive truncation in harnesses that hide description text. +- The walk-through writes per-finding decisions to the run artifact (e.g., `.context/compound-engineering/ce-review//walkthrough-state.json`) after each decision, so partial progress is inspectable post-hoc. Formal cross-session resumption is out of scope. +- Findings already carry enough detail (title, severity, confidence, file, line, autofix_class, suggested_fix, why_it_matters, evidence) to support the framing requirements. If some reviewers don't reliably produce plain-English `why_it_matters`, the framing quality bar may require prompt upgrades to those personas — flagged below as a question for planning. +- The existing per-run artifact directory (`.context/compound-engineering/ce-review//`) and the fixer subagent flow remain the underlying mechanics for applying fixes. +- The merged finding set produced by the existing Stage 5 merge pipeline carries only merge-tier fields; detail-tier fields (`why_it_matters`, `evidence`) live in the per-agent artifact files on disk. The per-finding walk-through enriches each merged finding by reading the contributing reviewer's artifact file at `.context/compound-engineering/ce-review//{reviewer}.json`, using the same `file + line_bucket(line, +/-3) + normalize(title)` matching that headless mode already uses. When no artifact match exists (merge-synthesized finding, or failed artifact write), the walk-through degrades to title plus `suggested_fix` and notes the gap. +- The four-option routing design is built to the cross-platform question tool's 4-option cap. A future fifth primary routing intent would require replacing an existing option, chaining a follow-up question, or pressuring the platform cap — the design does not provide pressure relief for this case. +- Autofix mode continues to write residual actionable work to `.context/compound-engineering/todos/` in this redesign, while Interactive-LFG and Defer actions route to external trackers per R16-R21. This temporary divergence is acknowledged — aligning autofix mode's residual sink with the new tracker routing is separate cleanup work tracked in the follow-up referenced in Scope Boundaries. + +## Outstanding Questions + +### Resolve Before Planning + +None. All product decisions are made. + +### Deferred to Planning + +- [Affects Problem Frame][Needs research] Sample recent `.context/compound-engineering/ce-review//` run artifacts to confirm the rubber-stamping / wholesale-deferral failure mode the Problem Frame asserts. If the dominant failure is something else (users disengage before the bucket question, report itself is unreadable), the four-option routing may not be the right intervention. +- [Affects R22-R26][Technical] Do reviewer personas reliably produce plain-English `why_it_matters` today, or does the framing bar require prompt upgrades and/or a synthesis-time rewrite pass? Planning should inspect a sample of recent review artifacts to decide before committing to R22-R25 as achievable without persona changes. +- [Affects R18][Technical] The concrete sequencing of the fallback chain on each target platform (e.g., GitHub Issues via `gh` vs harness task tracking, how to detect `gh` availability cheaply) is intentionally left out of the requirements so detection stays principle-based. Planning resolves the specific sequencing and detection heuristics per target environment. +- [Affects R18][Technical] If no documented tracker is found and `gh` is unavailable on the current platform, should the fallback to harness task tracking happen silently or should the agent confirm once per session? Default expectation: confirm once so users are not surprised by in-session-only behavior. +- [Affects R6][Technical] Whether the walk-through presents findings strictly in severity order (current default) or groups them by file first and then severity within each file. File-grouping may feel more coherent when many findings touch the same file, but it interacts with `Stop asking` semantics (a file-grouped bulk-accept applies to different findings than a severity-first bulk-accept). +- [Affects R7][Needs validation] Whether surfacing reviewer persona names in each per-finding question (e.g., `julik-frontend-races-reviewer`) helps user judgment or is noise. If validation shows noise, omit reviewer attribution from R7's required content or replace with a short category label. + +## Next Steps + +`-> /ce:plan` for structured implementation planning diff --git a/docs/plans/2026-04-17-002-feat-ce-review-interactive-judgment-plan.md b/docs/plans/2026-04-17-002-feat-ce-review-interactive-judgment-plan.md new file mode 100644 index 0000000..adeec7a --- /dev/null +++ b/docs/plans/2026-04-17-002-feat-ce-review-interactive-judgment-plan.md @@ -0,0 +1,638 @@ +--- +title: "feat: Add interactive judgment loop to ce:review" +type: feat +status: completed +date: 2026-04-17 +origin: docs/brainstorms/2026-04-17-ce-review-interactive-judgment-requirements.md +--- + +# feat: Add interactive judgment loop to ce:review + +## Overview + +Redesign `ce:review`'s Interactive mode post-review flow. The current single bucket-level policy question (Review and approve specific gated fixes / Leave as residual work / Report only) gets replaced with a four-option routing question (**Review** walk-through / **LFG** / **File** tickets / **Report** only). The Review path walks findings one at a time with plain-English framing and per-finding actions (Apply / Defer / Skip / LFG the rest). The LFG, File-tickets, and LFG-the-rest paths show a compact plan preview (Proceed / Cancel) before executing. Defer actions file tickets in the project's tracker (reasoning-based detection with GitHub Issues or harness task primitive as fallback). + +A small framing-guidance upgrade to the shared reviewer subagent template ensures every user-facing surface — the walk-through, bulk preview, and ticket bodies — explains findings in plain English, observable behavior first, not code structure. The upgrade applies universally across all 16+ persona agents via a single file change, fixing both the null-`why_it_matters` schema violations observed in adversarial and api-contract and the code-structure-first framing observed in correctness and maintainability. + +All other `ce:review` modes (Autofix, Report-only, Headless) and the existing merge/dedup pipeline, persona dispatch, and safe_auto fixer flow remain unchanged. + +## Problem Frame + +Today's Interactive mode mostly degrades into rubber-stamping or wholesale deferral: + +1. **Judgment calls are hard to make.** When a finding needs human judgment, today's pipe-delimited table row rarely gives enough context to decide confidently. The user is asked to approve or defer a bucket of findings they haven't individually understood. +2. **High-volume feedback is unreason-able.** A review with 8-12 findings turns into a scrolling table. There's no way to respond to individual items meaningfully — only to "approve the whole bucket" or "defer the whole bucket." + +The result: the `gated_auto` / `manual` routing tiers exist in the schema but are never actually exercised per-finding in practice. See origin document for the full problem frame. + +## Requirements Trace + +### Routing after `safe_auto` fixes + +- R1. Four-option routing question replaces today's bucket-level policy question *(see origin)* +- R2. Zero-findings path skips the routing question and shows a completion summary +- R3. Routing question names the detected tracker inline only when detection is high-confidence +- R4. Four options: `Review each finding one by one...`, `LFG. Apply the agent's best-judgment action per finding`, `File a [TRACKER] ticket per finding...`, `Report only...` +- R5. Routing option C is a batch-defer shortcut — distinct from the walk-through's per-finding Defer + +### Per-finding walk-through + +- R6. Walk-through presents findings one at a time in severity order with a position indicator +- R7. Per-finding question content: plain-English problem, severity, confidence, proposed fix, reasoning +- R8. Per-finding options: Apply / Defer / Skip / LFG the rest +- R9. Advisory-only findings substitute `Acknowledge — mark as reviewed` for option A +- R10. Override = pick a different preset action; no inline freeform custom fixes +- N=1 adaptation: walk-through wording adapts and `LFG the rest` is suppressed + +### LFG path + +- R11. LFG applies the per-finding action the agent would recommend; top-level scope vs. walk-through D scope distinction +- R12. Single completion report with required fields after any LFG execution + +### Bulk action preview + +- R13. Compact preview with `Proceed` / `Cancel` before every bulk action (LFG, File tickets, LFG the rest) +- R14. Preview content grouped by intended action; one line per finding in compressed framing-quality form + +### Recommendation tie-breaking + +- R15. When reviewers disagree on per-finding action, synthesis picks the most conservative using `Skip > Defer > Apply` + +### Defer behavior and tracker detection + +- R16-R21. Defer files tickets in project's tracker; minimal reasoning-based detection; fallback to GitHub Issues then harness task primitive; failure surfaces inline; no-sink omits Defer entirely; internal `.context/` todo system explicitly out of fallback chain + +### Framing quality (cross-cutting) + +- R22-R26. All user-facing finding surfaces (walk-through questions, LFG completion reports, ticket bodies, bulk-preview lines) explain in plain English, observable-behavior-first, tight 2-4 sentences. Planning resolves: delivered by a small framing-guidance upgrade to the shared reviewer subagent template (Unit 2), applied once at the source rather than rewritten downstream. Per-persona file edits beyond the shared template are deferred as follow-up. + +### Mode boundaries + +- R27. Only Interactive mode changes behavior. Autofix / Report-only / Headless unchanged +- R28. Final-next-steps flow (push / PR / exit) runs only when one or more fixes landed in the working tree + +## Scope Boundaries + +- No new `ce:fix` skill. All changes live inside `ce:review`. +- No changes to the findings schema, merge/dedup routing beyond the recommended-action tie-breaking in R15, or autofix-mode residual-todo creation. +- Small framing-guidance updates to the shared reviewer subagent template are in scope (see Unit 2). Per-persona file edits are out of scope for v1 — the shared-template change affects all personas at once, which is deliberately the "small upgrade" chosen over a synthesis-time rewrite pass. +- No inline freeform fix authoring in the walk-through — the walk-through is a decision loop, not a pair-programming surface. +- The "approve intent, write a variant" case is unsupported in v1; user picks Skip and hand-edits. +- No changes to Autofix, Report-only, or Headless mode behavior. +- The pre-menu findings table format (pipe-delimited, severity-grouped) stays unchanged. +- The current bucket-level policy question wording is removed entirely — no backward-compatibility shim. + +### Deferred to Separate Tasks + +- **Per-persona file edits beyond the shared template:** deferred. Unit 2 updates the shared subagent template to add R22-R25 framing guidance, which applies universally to all personas. If post-ship sampling shows specific personas still produce weak framing, targeted per-persona file upgrades land as follow-up. +- **Phasing out the internal `.context/compound-engineering/todos/` todo system and the `/todo-create`, `/todo-triage`, `/todo-resolve` skills:** long-term direction acknowledged in origin. Separate cleanup. +- **Script-first architecture for the tracker defer dispatch and bulk preview rendering:** considered during planning. Deferred to v2 — current ce:review is entirely prose-based orchestration; adding new scripts expands redesign footprint and cross-language test surface. Re-evaluate after usage data. + +## Context & Research + +### Relevant Code and Patterns + +**Current `ce:review` structure to modify:** +- `plugins/compound-engineering/skills/ce-review/SKILL.md` — single orchestrator, 744 lines. After Review section at lines 603-715 is the primary edit target. +- Current bucket policy question at `plugins/compound-engineering/skills/ce-review/SKILL.md:615-640`. The stem violates AGENTS.md third-person rule ("What should I do...") — the redesign fixes this. +- Stage 5 merge pipeline at `plugins/compound-engineering/skills/ce-review/SKILL.md:451-479`. Existing "most conservative route" rule at line 471 is extended for R15. +- Headless detail-tier enrichment at `plugins/compound-engineering/skills/ce-review/SKILL.md:568-572`. The walk-through reuses this exact matching rule verbatim. +- Safe_auto fixer dispatch at `plugins/compound-engineering/skills/ce-review/SKILL.md:664-671` ("Spawn exactly one fixer subagent..."). The walk-through's Apply actions accumulate and dispatch at the end of the walk-through to preserve this "one fixer, consistent tree" guarantee. +- Findings schema at `plugins/compound-engineering/skills/ce-review/references/findings-schema.json`. No schema changes; R15 tie-breaking operates on existing fields. + +**Patterns to mirror:** +- Four-option menu format: `plugins/compound-engineering/skills/ce-ideate/references/post-ideation-workflow.md:137-150`. Front-loaded distinguishing words, self-contained labels, third-person agent voice. +- Per-item walk-through with progress header: `plugins/compound-engineering/skills/todo-triage/SKILL.md:20-29`. Uses numbered chat prompts; the ce:review walk-through must upgrade to `AskUserQuestion`. +- Per-agent review loop with Accept / Reject / Discuss: `plugins/compound-engineering/skills/ce-plan/references/deepening-workflow.md:195-216`. +- Pipe-delimited findings table rhythm for the pre-menu: `plugins/compound-engineering/skills/ce-review/references/review-output-template.md`. + +**AGENTS.md rules that materially shape this plan:** +- `plugins/compound-engineering/AGENTS.md:122-134` — Interactive Question Tool Design (4-option cap; self-contained labels; third-person agent voice; front-loaded distinguishing words; target-named when ambiguous) +- `plugins/compound-engineering/AGENTS.md:117-119` — Cross-platform question tool phrasing. Every new question uses "the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini)" plus a fallback path. +- `plugins/compound-engineering/AGENTS.md:109-114` — Rationale discipline. Extract the walk-through, bulk preview, and tracker defer flows to `references/` because they are conditional (Interactive mode only) and would otherwise add ~200 lines to every invocation. +- `plugins/compound-engineering/AGENTS.md:155-165` — Platform-specific variables in skills. The walk-through state file path is pre-resolved from the existing run-id pattern. + +### Institutional Learnings + +- `docs/solutions/skill-design/compound-refresh-skill-improvements.md` — Phrase interactive-question-tool references as platform-agnostic ("`AskUserQuestion` in Claude Code, `request_user_input` in Codex") with explicit "stop to wait for the answer" language. Gate new interactive surfaces on explicit `mode:interactive` (the existing default), never on "no question tool = headless" auto-detection. +- `docs/solutions/skill-design/beta-promotion-orchestration-contract.md` — Mode contracts are load-bearing. `tests/review-skill-contract.test.ts` asserts the ce:review mode surface; any behavior change must ship the contract test update in the same PR. +- `docs/solutions/workflow/todo-status-lifecycle.md` — Apply outcomes in Interactive mode must continue routing through the existing `ready` todo pipeline (preserving the `downstream-resolver` contract). Defer routes to the new tracker path. Skip produces no downstream artifact. Do not invent a new `pending`-producing path. +- `docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md` — Stateful per-item walkthroughs need explicit transitions. The walk-through's "no more findings" and "LFG the rest" are distinct terminal transitions; encode each explicitly rather than collapsing. +- `docs/solutions/best-practices/codex-delegation-best-practices-2026-04-01.md` — Skill body size is a multiplicative cost driver. Move Interactive-mode detail to `references/` because it runs on a minority of invocations. +- `docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md` — If Defer invokes a sub-agent for ticket composition, pass paths (to merged findings artifact) rather than content. Also: "per-item walk" phrasing can cause 7x tool-call amplification in Claude Code vs. "bulk find, then filter" phrasing — the walk-through spec iterates over merged findings in memory, not by re-scanning per finding. + +### External References + +None used. Local patterns are strong; no framework/security/compliance unknowns. + +## Key Technical Decisions + +- **Extract walk-through, bulk preview, and tracker defer to `references/` files.** SKILL.md is already 744 lines; these three surfaces are conditional (Interactive mode, when gated/manual findings remain) and would inflate the body by ~200 lines paid on every invocation. Respects `plugins/compound-engineering/AGENTS.md:109-114`. + +- **R15 tie-breaking extends the existing Stage 5 "most conservative route" rule.** The rule at `SKILL.md:471` already does this for `autofix_class` / `owner`. R15 adds the same discipline for the recommended *action* (Apply / Defer / Skip), using order `Skip > Defer > Apply`. Same Stage 5 sub-step, same philosophy — no new architectural seam. + +- **R22-R25 framing quality is delivered by a small framing-guidance upgrade in the shared reviewer subagent template, not a synthesis-time rewrite pass.** Planning-phase sampling of 15+ recent review artifacts across 5 personas showed two distinct gaps: + 1. *Consistency gap:* `adversarial-reviewer` and `api-contract-reviewer` produced `why_it_matters: null` on every finding in at least one recent run (schema violation — field is required). + 2. *Quality gap:* `correctness-reviewer` and `maintainability-reviewer` populate `why_it_matters` but lead with code-structure-first framing; observable-behavior-first (R23) failed in roughly 5 of 7 sampled findings. + + Considered options: (a) synthesis-time rewrite pass (new Stage 5b with per-finding model dispatch) — rejected as over-engineered for the gap, adds recurring per-review cost, and papers over a schema violation rather than fixing it; (b) per-persona file upgrades across 5 personas — rejected as scope inflation for v1; (c) shared-template upgrade — chosen. One file change (the persona subagent template) adds framing guidance that every dispatched persona receives, fixing both gaps at the source with bounded scope. If post-ship sampling shows specific personas still fail, targeted per-persona edits land as follow-up. + +- **Apply actions in the walk-through accumulate and dispatch at the end.** The walk-through collects Apply decisions in memory, and after the loop exits, dispatches one fixer subagent for the full accumulated set. Trade-off the user experiences: a fix failure surfaces at the end of the walk-through, not at the decision moment. The alternative — per-finding fixer dispatch — costs per-finding fixer overhead, spawns racey mid-walk-through processes, and complicates the user model (when is the Apply "real"?). The unified end-of-walk-through dispatch also means the fixer sees the whole set at once and can handle inter-fix dependencies (two Applies touching overlapping regions) in one pass rather than sequentially. The existing Step 3 fixer prompt needs a small update to acknowledge the heterogeneous queue (gated_auto + manual mix, not just safe_auto); tracked in Unit 3. + +- **Tracker detection stays reasoning-based per R14 / R17.** No enumerated checklist of files. Agent reads `CLAUDE.md` / `AGENTS.md` and whatever else it judges relevant. When evidence is ambiguous, the label is generic ("File an issue per finding") and the agent confirms the tracker with the user before executing any Defer. GitHub Issues is the only concrete fallback named by the spec; the harness task primitive is a last-resort with a clear durability warning. + +- **Prose-based v1, not script-first.** Deterministic logic (preview rendering, tracker dispatch) is a script-first candidate per `docs/solutions/skill-design/script-first-skill-architecture.md`. Deferred to v2 — current ce:review is entirely prose-based orchestration; adding two new scripts expands the redesign footprint and introduces cross-language test surface. Revisit after usage data. + +- **Walk-through state is in-memory only, not persisted per-decision.** The walk-through accumulates Apply / Defer / Skip / Acknowledge decisions in orchestrator memory. Formal cross-session resumption is out of scope; an interrupted walk-through simply loses its in-flight state (prior Applies have not been dispatched yet since they batch at the end). Avoids the complexity of state-file schema design, external-edit staleness detection, and `.context/` lifecycle management — all for a feature (inspectable partial state) that has no consumer. + +- **Tracker-availability probes run at most once per session, cached for the rest of the run.** When the routing question needs to decide whether to offer option C with a tracker name, a single probe sequence runs (e.g., read `CLAUDE.md` / `AGENTS.md`, then `gh auth status` if relevant, then any MCP-tracker availability checks). The `{ tracker_name, confidence, sink_available }` tuple is cached; subsequent Defer actions in the same session reuse it without re-probing. Probes fire only when the routing question is about to be asked — never speculatively at the start of a review. + +- **Third-person voice in all new question stems and labels.** The current bucket question's stem ("What should I do...") violates `plugins/compound-engineering/AGENTS.md:127`. The redesign fixes this for the new surfaces — "What should the agent do next?" style. + +## Open Questions + +### Resolved During Planning + +- **Do reviewer personas reliably produce framing-quality `why_it_matters` today?** No, with two distinct failure modes: (a) `adversarial` and `api-contract` produced `why_it_matters: null` on every finding in one recent run (schema violation); (b) `correctness` and `maintainability` populate the field but 5 of 7 sampled findings lead with code structure instead of observable behavior. Resolution: a small framing-guidance upgrade to the shared reviewer subagent template (Unit 2) addresses both gaps at the source — single file change, universal effect across all personas. Fixes the schema-violation bug inline; no separate deferred item needed. +- **Apply in walk-through: per-finding or batched?** Batched at end of walk-through. User experience: fix results surface at the end. Also gives the fixer the whole Apply set at once for dependency-aware application. The existing Step 3 fixer prompt needs a small update to acknowledge the heterogeneous queue (tracked in Unit 3). +- **Script-first for tracker dispatch and preview?** Deferred to v2. Prose-based for this work to match existing ce:review shape. +- **Where does R15 tie-breaking land in the pipeline?** In Stage 5 merge as an extension of the existing conservative-route rule, immediately after the current step 7 ("Normalize routing"). +- **Extract new logic to `references/`?** Yes — three new reference files (walk-through, bulk preview, tracker defer). + +### Deferred to Implementation + +- **Exact `AskUserQuestion` label wording for `LFG the rest` and related bail-out moments.** Requirements pin semantics ("LFG the rest — apply the agent's best judgment to this and remaining findings"), but harness-specific label truncation behavior may require minor phrasing tweaks during authoring. Validate against each target platform during implementation. +- **Exact framing-guidance prose for the subagent template (Unit 2).** The block must be tight (add a paragraph or two, not pages), include a positive/negative example pair, and reinforce the required-field constraint. Word during implementation against recent artifacts. +- **GitHub Issues availability check command.** Left to the agent's reasoning at runtime per R14 / R17 (e.g., `gh auth status` + `gh repo view --json hasIssuesEnabled`, or cheaper signal). Not pre-specified. +- **Fixer subagent prompt updates for heterogeneous Apply queue.** Today's Step 3 fixer prompt was scoped to the safe_auto queue. The walk-through's Apply set may contain gated_auto or manual findings whose suggested_fix needs the same execution care. Prompt iteration during Unit 3 authoring; may become its own small prompt edit inside ce-review SKILL.md. +- **Whether reviewer-name attribution survives in per-finding questions.** Origin document defers this as a validation question. Keep in for v1 implementation and validate via usage after shipping. + +## High-Level Technical Design + +> *This illustrates the intended flow and is directional guidance for review, not implementation specification.* + +```mermaid +flowchart TD + A[Stage 5: Merge & dedup] --> A1[R15 tie-breaking
Skip > Defer > Apply] + A1 --> C[Stage 6: Synthesize & present table
framing reads persona output directly] + C --> D{Any gated/manual
findings remain?} + D -->|No| Z[Completion summary -> final-next-steps] + D -->|Yes| E[Step 2: Four-option routing] + + E -->|A: Review| F[Walk-through loop] + E -->|B: LFG| P[Bulk preview] + E -->|C: File tickets| P + E -->|D: Report only| Z2[Stop; no action] + + F --> G{Per-finding decision} + G -->|Apply| G1[Accumulate Apply set] + G -->|Defer| G2[Tracker-defer dispatch] + G -->|Skip| G3[No action] + G -->|LFG the rest| P2[Bulk preview
scoped to remaining] + + G1 --> G + G2 --> G + G3 --> G + G -->|End of list| H[Step 3: Dispatch fixer
for accumulated Apply set] + + P -->|Proceed| Q[Execute: apply/defer/skip per agent recommendation] + P -->|Cancel| E + P2 -->|Proceed| Q + P2 -->|Cancel| F + + Q --> H + H --> I{Any fixes
applied?} + Z2 --> Z + I -->|Yes| Z + I -->|No| Z3[Skip final-next-steps;
exit after report] +``` + +The diagram shows the conceptual flow; exact prose sub-steps and `references/` delegation land in the implementation units below. + +## Implementation Units + +- [ ] **Unit 1: Add recommended-action tie-breaking to Stage 5 merge** + +**Goal:** Extend the existing Stage 5 "most conservative route" rule to resolve conflicting per-finding recommendations (Apply / Defer / Skip) into a single deterministic value per merged finding, so LFG and walk-through Apply/Defer/Skip decisions are auditable. + +**Requirements:** R15 + +**Dependencies:** None + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-review/SKILL.md` (Stage 5, after existing step 7) +- Test: `tests/review-skill-contract.test.ts` — add assertion that the Stage 5 prose mentions the tie-breaking rule and the order `Skip > Defer > Apply` + +**Approach:** +- Add a new sub-step (e.g., "7b. Recommended-action tie-breaking") immediately after the existing "Normalize routing" step at `SKILL.md:471` +- State the rule verbatim: when merged findings carry conflicting recommendations, pick the most conservative using `Skip > Defer > Apply` +- Reference the existing same-philosophy rule for `autofix_class` so the extension reads as continuation, not novelty + +**Patterns to follow:** +- Existing conservative-route prose at `plugins/compound-engineering/skills/ce-review/SKILL.md:98` and `:471` +- The schema's `_meta.return_tiers` structure for what the merged finding carries + +**Test scenarios:** +- *Happy path:* reviewer A recommends Apply and reviewer B recommends Defer on a merged finding -> merged recommendation is Defer +- *Happy path:* reviewer A Defer and reviewer B Skip -> merged recommendation is Skip +- *Happy path:* all contributing reviewers recommend Apply -> merged recommendation is Apply +- *Edge case:* single reviewer (no merge happened) -> that reviewer's recommendation passes through unchanged +- *Edge case:* a finding with only `autofix_class: advisory` carries no apply/defer/skip recommendation — the tie-breaking rule is a no-op (not an error) + +**Verification:** +- The SKILL.md Stage 5 section names the rule and the order. +- `bun test tests/review-skill-contract.test.ts` passes. + +--- + +- [ ] **Unit 2: Upgrade shared reviewer subagent template with R22-R25 framing guidance** + +**Goal:** Add framing guidance for the `why_it_matters` field to the shared reviewer subagent template so all persona agents produce observable-behavior-first framing (fixing the R23 gap observed in correctness and maintainability) and never emit null `why_it_matters` (fixing the schema violation observed in adversarial and api-contract). One file change, universal effect across all 16+ persona agents. + +**Requirements:** R22, R23, R24, R25, R26 + +**Dependencies:** None (can author in parallel with Unit 1) + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` — add a dedicated framing-guidance block for the `why_it_matters` field +- Test: `tests/review-skill-contract.test.ts` — add assertions on the presence of the framing-guidance block and its key constraints + +**Approach:** +- Current subagent template already instructs personas to return JSON per schema, but offers no guidance on *how* to write `why_it_matters` beyond the schema's one-line description ("Impact and failure mode -- not 'what is wrong' but 'what breaks'"). +- Add a new `why_it_matters` guidance block to the template that the orchestrator dispatches verbatim to every persona. Content: + - Lead with the observable behavior (what a user, attacker, or operator sees) — not the code structure. Function and variable names appear only when the reader needs them to locate the issue. + - Explain *why* the recommended fix works, not just what it changes. When a similar pattern exists elsewhere in the codebase, reference it so the recommendation is grounded. + - Tight: approximately 2-4 sentences plus the minimum code needed to ground it. Longer is a regression. + - `why_it_matters` is required by the schema. Empty, null, or single-phrase entries are validation failures — always produce substantive content grounded in the evidence the reviewer collected. +- Include a positive/negative example pair so personas have a concrete calibration anchor. +- Because the shared template is loaded verbatim by every dispatched persona, this change fixes both gaps at the source for every reviewer in one edit — no per-persona file editing. + +**Patterns to follow:** +- The existing structure of `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` (the canonical template all personas receive via the dispatch path at `plugins/compound-engineering/skills/ce-review/SKILL.md:405-445`). +- The illustrative framing pair from `docs/brainstorms/2026-04-17-ce-review-interactive-judgment-requirements.md` (R22-R25 section). Reuse verbatim or paraphrase tightly. + +**Test scenarios:** +- *Template structure:* the subagent template contains a dedicated section instructing personas on `why_it_matters` framing (observable-behavior-first, 2-4 sentences, grounded in evidence, required field). +- *Template example:* the template includes a positive/negative framing example pair. +- *Integration (post-merge sampling):* after the template change lands, sample one fresh review artifact from each of correctness, maintainability, adversarial, api-contract, security, reliability. Verify `why_it_matters` is populated (never null) and leads with observable behavior in the majority of cases. +- *Edge case:* a persona still produces weak framing on some subset of findings — not a regression of this unit; tracked as a per-persona follow-up. + +**Verification:** +- The subagent template contains the framing-guidance block, the required-field reminder, and an example pair. +- A fresh review run's artifact files show populated `why_it_matters` for every finding (no null values). +- Spot-check the first sentence of `why_it_matters` across 5+ fresh findings: each leads with observable behavior, not code structure. + +--- + +- [ ] **Unit 3: Author per-finding walk-through** + +**Goal:** The `Review each finding one by one` path — present findings one at a time with the required per-finding content (R7), options (R8-R10), advisory variant (R9), mode+position indicator (R6), N=1 adaptation, R15 conflict surfacing, and no-sink handling. Hand off Apply decisions as a batch to the existing fixer subagent at end of loop. Implements R6-R12 (walk-through scope). + +**Requirements:** R6, R7, R8, R9, R10, R11 (walk-through scope of LFG), R12 (completion report for the walk-through's Apply / Defer / Skip decisions) + +**Dependencies:** Unit 2 (walk-through display reads persona-produced `why_it_matters` directly; the upgraded template ensures that content is R22-R25-quality) + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-review/references/walkthrough.md` +- Modify: `plugins/compound-engineering/skills/ce-review/SKILL.md` — add a sub-step under After Review Step 2 (e.g., Step 2c) that delegates to the reference +- Test: `tests/review-skill-contract.test.ts` — assertions on the existence of `references/walkthrough.md` and on the four-option label set for per-finding questions + +**Approach:** +- Walk-through iterates merged findings in severity order (P0 → P3), reading each finding's `why_it_matters` and evidence directly from the persona artifact (same lookup rule headless mode uses at `SKILL.md:568-572`). Unit 2's template upgrade ensures persona output meets the framing bar; no synthesis-time rewrite happens here. +- Each question uses the platform's blocking question tool (`AskUserQuestion` / `request_user_input` / `ask_user`) with: + - Stem: opens with a mode+position indicator ("Review mode — Finding 3 of 8 (P1):"), then the persona-supplied plain-English problem and the proposed fix + - When R15 tie-breaking narrowed a conflict across reviewers, the stem surfaces that context briefly (e.g., "Correctness recommends Apply; Testing recommends Skip. Agent's recommendation: Skip.") so the user sees the orchestrator's final call and the disagreement context at once. The orchestrator's recommendation is what's labeled "recommended" on the option set. + - Four options (R8): `Apply the proposed fix` / `Defer — file a [TRACKER] ticket` / `Skip — don't apply, don't track` / `LFG the rest — apply the agent's best judgment to this and remaining findings` + - For advisory-only findings: option A becomes `Acknowledge — mark as reviewed` (R9). Remaining options unchanged. +- Per-finding routing: + - Apply -> accumulate the finding id into an in-memory Apply set; advance + - Defer -> invoke the tracker-defer flow (see Unit 5); on success record the tracker URL; on failure present Retry / Fall back / Convert-to-Skip. The walk-through position indicator stays on the current finding during this sub-flow. + - Skip -> record Skip; advance + - Acknowledge -> record Acknowledge; advance (advisory-only path) + - LFG the rest -> exit the walk-through loop; dispatch the bulk preview (Unit 4) scoped to remaining findings, with already-decided count inline. If the preview's Cancel is picked, return the user to the current finding's per-finding question (not to the routing question). +- Walk-through state is in-memory only (not written to disk). An interrupted walk-through discards in-flight decisions; prior Applies have not been dispatched yet because Apply accumulates for end-of-walk-through batch dispatch. +- After the walk-through loop terminates (all findings decided, or user took LFG-the-rest Proceed, or all decisions were non-Apply), the unit hands off to the end-of-walk-through dispatch: one fixer subagent receives the accumulated Apply set; Defer set has already executed inline; Skip / Acknowledge no-op. The existing Step 3 fixer subagent prompt needs a small update acknowledging the queue is heterogeneous (gated_auto + manual mix, not just safe_auto) — tracked in this unit's approach even though the prompt lives outside this plan's edit surface today. +- N=1 adaptation: when exactly one gated/manual finding remains, the header wording is "Review the finding" rather than "Review each finding one by one"; `LFG the rest` is omitted from the option set (three options). +- No-sink adaptation: when Unit 5's detection returns `sink_available: false`, option B ("Defer — file a ticket") is omitted from the per-finding question. The stem tells the user why ("Defer unavailable on this platform — no tracker or task-tracking primitive detected."). +- Override clarification (R10): picking Defer or Skip instead of Apply is "override"; no inline freeform fix authoring; users who want a variant Skip and hand-edit. + +**Completion report (shared with Unit 4 per T5):** when the walk-through terminates — or any bulk action (LFG / File tickets / LFG the rest) finishes executing, or the zero-findings path runs — emit one unified completion report per R12's minimum fields: per-finding entries (title, severity, action taken, tracker URL for Deferred, one-line reason for Skipped), summary counts by action, explicit failure callouts, and the existing end-of-review verdict. The report structure is identical across paths; only the data differs. + +**Execution note:** The walk-through is operationally read-only except for two permitted writes — the in-memory Apply-set accumulator, and the tracker-defer dispatch (Unit 5). Persona agents remain strictly read-only. + +**Patterns to follow:** +- `plugins/compound-engineering/skills/todo-triage/SKILL.md:20-29` — per-item prompt and progress header (model upgrade: use `AskUserQuestion` instead of numbered chat options) +- `plugins/compound-engineering/skills/ce-review/SKILL.md:568-572` — artifact lookup for persona-produced `why_it_matters` and evidence +- `plugins/compound-engineering/skills/ce-plan/references/deepening-workflow.md:195-216` — per-agent loop with third-person agent voice +- `plugins/compound-engineering/skills/ce-review/references/review-output-template.md` — severity-grouped rhythm (for consistency with the table preceding the menu) + +**Test scenarios:** +- *Happy path:* 3-finding review, user picks Apply / Defer / Skip one per finding -> walk-through completes; end-of-walk-through fixer dispatch receives a 1-element Apply set; one Linear ticket was filed; completion report shows 1 applied / 1 deferred with URL / 1 skipped +- *Happy path N=1:* 1-finding review, question wording adapts and `LFG the rest` is suppressed (three options) +- *Advisory variant:* advisory-only finding -> option A reads `Acknowledge — mark as reviewed` +- *LFG the rest:* at finding 2 of 5, user picks LFG the rest -> walk-through exits, bulk preview is invoked scoped to findings 2-5 with "1 already decided" note; Cancel from the preview returns the user to finding 2, not to the routing question +- *Override:* user picks Skip on a finding with a concrete proposed fix -> walk-through records Skip (not Apply) +- *R15 conflict surface:* a finding where reviewers recommended different actions -> walk-through stem surfaces the conflict and the orchestrator's final recommendation; user picks the orchestrator's recommendation and moves on +- *Defer failure mid-walk-through:* user picks Defer on finding 3 of 5; `gh issue create` returns 403; Retry / Fall back / Convert-to-Skip sub-question appears; user picks Convert-to-Skip; position indicator stays at 3 of 5; completion report's failure callout names the finding and reason +- *Edge case (interruption):* user cancels the AskUserQuestion mid-walk-through -> prior in-memory Apply/Defer/Skip decisions are lost; any Defers that already executed remain in the tracker (they were external side effects); Skip/Acknowledge/Apply-pending states are discarded; no end-of-walk-through fixer dispatch runs +- *No-sink:* detection returns `sink_available: false` -> per-finding question shows three options (no Defer); stem explains why +- *Integration:* a walk-through Apply action adds the finding to the Apply set; after walk-through completes, Step 3's fixer subagent receives the accumulated set with a prompt update noting the heterogeneous queue + +**Verification:** +- Running `ce:review` interactive on a 3+-finding fixture yields a walk-through where each question shows mode+position + framing + options correctly. +- The end-of-walk-through fixer dispatch runs once with all Apply decisions; no per-finding fixer calls during the loop. +- The unified completion report is emitted on every terminal path (walk-through complete, LFG-rest Proceed, LFG-rest Cancel followed by user picking Stop). + +--- + +- [ ] **Unit 4: Author bulk action preview** + +**Goal:** The compact plan preview shown before every bulk action (top-level LFG, top-level File tickets, and walk-through `LFG the rest`). Implements R13-R14 and the LFG half of R12 (the post-execution completion report is shared). + +**Requirements:** R13, R14 (R12 completion report is shared with Unit 3 per T5) + +**Dependencies:** Unit 2 (preview lines read persona-produced `why_it_matters` directly in compressed form; the upgraded subagent template ensures that content meets the framing bar) + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-review/references/bulk-preview.md` +- Modify: `plugins/compound-engineering/skills/ce-review/SKILL.md` — After Review Step 2 dispatches to this reference for options B and C; Unit 3's walk-through dispatches for `LFG the rest` +- Test: `tests/review-skill-contract.test.ts` — assert existence of the reference and that the preview contract uses exactly `Proceed` / `Cancel` + +**Approach:** +- Preview renders findings grouped by the action the agent intends to take: `Applying (N):`, `Filing [TRACKER] tickets (N):`, `Skipping (N):`, `Acknowledging (N):` +- Each finding line: severity tag + file:line + compressed plain-English summary + action phrase. One line per finding, max ~80 columns +- Compressed framing follows R22-R25 spirit: observable behavior over code structure, no function/variable names unless needed to locate. Draw from the persona-produced `why_it_matters` (post-Unit 2 template upgrade) in condensed form; the preview line is essentially the first sentence of the finding's framing +- For `LFG the rest`: preview header reads "LFG plan — N remaining findings (K already decided)"; already-decided findings are not included in the preview +- Question: `AskUserQuestion` / `request_user_input` / `ask_user` with exactly two options: + - `Proceed` + - `Cancel — back to routing` (for top-level) or `Cancel — back to walk-through` (for LFG the rest) +- Cancel returns to the originating question without changing state +- Proceed dispatches the plan: Apply set -> Step 3 fixer; Defer set -> tracker-defer flow (Unit 5); Skip/Acknowledge -> no action; then flows to completion report + +**Technical design:** *(directional)* + +Preview layout: + +``` +LFG plan — 8 findings (tracker: Linear): + +Applying (4): + [P0] orders_controller.rb:42 — Add ownership guard before order lookup + [P1] webhook_handler.rb:120 — Raise on unhandled error instead of swallowing + [P2] user_serializer.rb:14 — Drop internal_id from serialized response + [P3] string_utils.rb:8 — Rename ambiguous helper for clarity + +Filing Linear tickets (2): + [P2] billing_service.rb:230 — N+1 on refund batch (no concrete fix) + [P2] session_helper.rb:12 — Session reset behavior needs discussion + +Skipping (2): + [P2] report_worker.rb:55 — Recommendation is speculative; low confidence + [P3] readme.md:14 — Style preference, subjective + +A) Proceed +B) Cancel +``` + +**Patterns to follow:** +- Compact tabular rhythm from `plugins/compound-engineering/skills/ce-review/references/review-output-template.md` +- Third-person labels and front-loaded distinguishing words per `plugins/compound-engineering/AGENTS.md:122-134` +- Conditional visual aid guidance from `docs/solutions/best-practices/conditional-visual-aids-in-generated-documents-2026-03-29.md` + +**Test scenarios:** +- *Happy path (LFG, top-level):* 8 findings mixed across actions -> preview shows grouped buckets with correct counts; Proceed advances to dispatch; Cancel returns to routing +- *Happy path (File tickets, top-level):* every finding appears under `Filing [TRACKER] tickets (N):` regardless of the agent's natural recommendation, because option C is batch-defer +- *Happy path (LFG the rest):* walk-through has decided 3 findings; preview scopes to 5 remaining with "3 already decided" in header +- *Edge case:* zero findings in a bucket -> that bucket header is omitted from the preview (no empty `Skipping (0):` line) +- *Edge case:* all findings map to a single bucket -> preview still shows the bucket header; Proceed/Cancel still offered +- *Advisory preview:* for advisory-only findings appearing under `Acknowledging (N):`, the action phrase is "Mark as reviewed" +- *Cross-platform:* when the platform has no blocking question tool, preview falls back to numbered options and waits for user input + +**Verification:** +- Three call sites (Step 2 option B, Step 2 option C, walk-through `LFG the rest`) render the preview correctly. +- Cancel returns to the originating question; Proceed executes the plan. +- Preview lines all meet the compressed framing bar. + +--- + +- [ ] **Unit 5: Author tracker detection and defer execution** + +**Goal:** Tracker detection, fallback chain, ticket body composition, failure path, and the no-sink case. Implements R16-R21 and R3's tracker-name-inline-when-confident rule. + +**Requirements:** R3 (partial — tracker naming), R13 (partial — tracker name in preview), R16, R17, R18, R19, R20, R21 + +**Dependencies:** None (can be authored in parallel with Units 3 and 4) + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-review/references/tracker-defer.md` +- Modify: `plugins/compound-engineering/skills/ce-review/SKILL.md` — After Review Step 2 references this file for tracker-name-in-label logic and for Defer execution +- Test: `tests/review-skill-contract.test.ts` — assertions on reference existence and on R21's "internal `.context/` todos out of fallback chain" being explicit in the prose + +**Approach:** +- **Detection (reasoning-based per R14 / R17):** Agent reads project documentation — primarily `CLAUDE.md` / `AGENTS.md` — and determines the tracker from whatever evidence is obvious. No enumerated checklist. A tracker can be surfaced via MCP tool (e.g., Linear MCP), CLI (e.g., `gh`), or direct API — all are acceptable. When the tracker is named explicitly (e.g., "issues go in Linear", a Linear URL, a project board link), confidence is high. When the signal is conflicting or absent, confidence is low. +- **Probe timing and caching (T3):** Availability probes (e.g., `gh auth status`, MCP-tracker reachability) run at most once per session and only when the routing question is about to be asked — not speculatively at review start, not per-Defer, not per-walk-through-finding. The resulting `{ tracker_name, confidence, sink_available }` tuple is held in orchestrator memory for the rest of the run. If a named tracker's availability is uncertain from documentation alone (tracker mentioned but no MCP/CLI invocation visible to the agent), the probe resolves the uncertainty once. +- **Label logic (R3):** If confidence is high AND the tracker's sink is available, the routing question and walk-through Defer label include the tracker name verbatim (e.g., `File a Linear ticket per finding`). If confidence is low or sink is uncertain, labels read generically (`File an issue per finding`) and the agent confirms the tracker with the user before executing any Defer. +- **Fallback chain (R18 principle-based):** Prefer durable external trackers over in-session-only primitives. Concrete fallbacks in order of preference: named tracker (MCP / CLI / API the agent can invoke) -> GitHub Issues via `gh` if authenticated and the repo has issues enabled -> the harness's task-tracking primitive (`TaskCreate` in Claude Code, `update_plan` in Codex) with an explicit durability notice to the user. Never fall back to `.context/compound-engineering/todos/` (R21 — explicit out-of-scope). +- **No-sink case (R20):** When no external tracker is detectable and no harness primitive is available (e.g., CI, converted targets without task binding), Defer is not offered as a menu option. Routing option C is omitted; walk-through option B is omitted; the agent tells the user why. +- **Ticket composition:** Title = merged finding's title. Body uses the persona-produced `why_it_matters` and evidence (read from the per-agent artifact via the same rule as headless enrichment at `SKILL.md:568-572`), plus severity, confidence, reviewer attribution, and finding_id. Labels include severity tag when the tracker supports labels. +- **Failure path (R19):** On ticket-creation failure, surface the error inline via a blocking question: `Retry` / `Fall back to next available sink` / `Convert to Skip (record the failure)`. The completion report captures the failure. When a high-confidence named tracker fails at execution, the session's cached `sink_available` for that tracker is invalidated so subsequent Defers in the same session fall through to the next tier rather than retrying a confirmed-broken sink. +- **Once-per-session confirmation:** When the fallback to harness task primitive is in effect, confirm once per session before the first Defer action: "No documented tracker and `gh` unavailable — will create in-session tasks that won't survive this session. Proceed for this and subsequent Defer actions?" + +**Patterns to follow:** +- `plugins/compound-engineering/skills/report-bug-ce/SKILL.md:104-122` — only existing `gh issue create` usage; pattern for optional labels and fallback body +- `plugins/compound-engineering/skills/ce-debug/SKILL.md:40-42` — consuming tracker URLs (Linear / Jira) via MCP tools or URL fetching; the principle-based "try, fall back, ask" style transposed to write-path +- `plugins/compound-engineering/AGENTS.md:117-119` — cross-platform question phrasing for the failure-path follow-up and the harness-fallback confirmation +- `docs/solutions/integrations/cross-platform-model-field-normalization-2026-03-29.md` — per-tracker behavior matrix as a model for stating Linear / GitHub Issues / harness primitive / no-tracker behavior explicitly + +**Test scenarios:** +- *Happy path, named tracker:* `CLAUDE.md` mentions "file bugs in Linear" -> routing label reads "File a Linear ticket per finding"; Defer dispatch creates a Linear ticket +- *Happy path, GitHub Issues fallback:* no tracker documented, `gh` authenticated and issues enabled -> Defer creates a GitHub issue; label reads "File an issue per finding"; agent confirms the tracker choice before executing +- *Happy path, harness fallback:* no tracker documented, `gh` unavailable -> once-per-session confirmation with durability warning; Defer calls `TaskCreate` / `update_plan` per platform +- *No-sink:* no tracker, no `gh`, no harness primitive -> routing option C is omitted; walk-through option B is omitted; the user is told why in the routing question's stem +- *Failure path:* `gh issue create` returns 403 -> inline `Retry / Fall back / Convert to Skip` question; completion report captures the failure +- *Label confidence:* `CLAUDE.md` says "bugs in Linear, features in GitHub Issues" -> ambiguous. Label is generic; agent confirms before dispatch +- *Integration:* persona-produced `why_it_matters` (post-Unit 2 template upgrade) is used in the ticket body; reviewer attribution and finding_id are included +- *Probe timing:* tracker probes do not fire for a review whose routing question is skipped (R2 zero-findings case) — the probe only runs when option C is a candidate to present +- *Edge case:* ticket body exceeds a tracker's max length -> truncate with "…(continued in ce-review run artifact: )" and include the finding_id for reference + +**Verification:** +- The reference file covers detection, label logic, fallback chain, failure path, no-sink, and harness-fallback confirmation in that order. +- Running Interactive mode against a repo with Linear documented produces a routing label naming Linear and creates a Linear-shaped ticket on Defer. + +--- + +- [ ] **Unit 6: Restructure After Review Step 2 as four-option routing** + +**Goal:** Replace the current bucket-level policy question with the four-option routing question that dispatches to the walk-through (Unit 3), bulk preview (Unit 4), or tracker-defer (Unit 5). Implements R1-R5 and R27 (mode boundary — only Interactive changes). + +**Requirements:** R1, R2, R3, R4, R5, R27 + +**Dependencies:** Units 3, 4, 5 (routing dispatches to all three) + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-review/SKILL.md` — After Review section (lines ~603-715); replace current Step 2 entirely +- Test: `tests/review-skill-contract.test.ts` — add assertions on the four-option set, stem voice, and tracker-name-conditional behavior; preserve existing assertions on Autofix / Report-only / Headless behavior + +**Approach:** +- Rewrite the "Choose policy by mode" subsection for Interactive mode only. Autofix / Report-only / Headless prose is unchanged +- New Interactive mode flow: + 1. Apply `safe_auto -> review-fixer` findings automatically without asking (unchanged) + 2. **R2 zero-check:** If no `gated_auto` / `manual` findings remain after safe_auto, show a one-line completion summary ("All findings resolved — N safe_auto fixes applied.") and proceed to Step 5 (final-next-steps) + 3. **R3 tracker pre-detection:** Dispatch to the tracker detection logic from `references/tracker-defer.md`; receive a `{ tracker_name, confidence, sink_available }` tuple + 4. **R1 routing question** via the platform's blocking question tool with: + - Stem (third-person, per AGENTS.md:127): "What should the agent do with the remaining N findings?" + - Four options (R4) — only options with sinks are shown (R20): + - (A) `Review each finding one by one — accept the recommendation or choose another action` + - (B) `LFG. Apply the agent's best-judgment action per finding` + - (C) `File a [TRACKER] ticket per finding without applying fixes` (label uses the concrete tracker name only when confidence is high; otherwise reads "File an issue per finding"; omitted entirely when `sink_available == false`) + - (D) `Report only — take no further action` + 5. Dispatch on selection: + - A -> `references/walkthrough.md` + - B -> `references/bulk-preview.md` (LFG plan scoped to all gated/manual findings) -> on Proceed, execute Apply set via Step 3, Defer set via Unit 5, Skip/Acknowledge no-op + - C -> `references/bulk-preview.md` (all findings under `Filing [TRACKER] tickets`) -> on Proceed, execute Defer set via Unit 5 for every finding; no fixes applied + - D -> skip to Step 5 (final-next-steps) with no action +- Remove the current bucket policy question and its routing blocks entirely (no shim — origin document Scope Boundary "no backward-compatibility shim") + +**Patterns to follow:** +- Four-option routing label patterns from `plugins/compound-engineering/skills/ce-ideate/references/post-ideation-workflow.md:137-150` +- Existing After Review mode-routing structure at `plugins/compound-engineering/skills/ce-review/SKILL.md:615-662` (replace the Interactive branch; leave Autofix / Report-only / Headless branches untouched) +- Cross-platform question phrasing at `plugins/compound-engineering/AGENTS.md:117-119` + +**Test scenarios:** +- *Happy path:* a review with 5 gated/manual findings and Linear tracker detected -> routing question shows all four options, option C reads "File a Linear ticket per finding", stem is third-person +- *R2 zero-case:* all findings resolved by safe_auto -> routing question is skipped; completion summary is shown; Step 5 runs +- *R3 low-confidence tracker:* ambiguous documentation -> option C label is generic ("File an issue per finding"); agent confirms the tracker before Defer on option C selection +- *R20 no-sink:* no tracker, no gh, no harness primitive -> option C is omitted; three options presented instead of four +- *Option A:* walk-through is dispatched with all findings +- *Option B:* bulk preview is dispatched scoped to all findings; Proceed executes +- *Option C:* bulk preview is dispatched with all findings under the Filing bucket +- *Option D:* Step 5 runs with no action taken +- *Third-person voice:* stem uses "the agent" not "I" / "me" +- *Mode isolation (R27):* same fixture under `mode:autofix` / `mode:report-only` / `mode:headless` shows unchanged behavior + +**Verification:** +- `bun test tests/review-skill-contract.test.ts` passes with new assertions. +- The After Review section no longer contains the old bucket policy question wording. +- Dispatch to `references/walkthrough.md`, `references/bulk-preview.md`, and `references/tracker-defer.md` is explicit. + +--- + +- [ ] **Unit 7: Condition Step 5 final-next-steps on applied fixes** + +**Goal:** The existing "final next steps" flow (push fixes / create PR / exit) only runs when at least one fix landed in the working tree. Skips for options C, D, and for LFG / walk-through completions with no Apply action. Implements R28. + +**Requirements:** R28 + +**Dependencies:** Unit 6 (the routing flow must track whether any fix was applied) + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-review/SKILL.md` — Step 5 (lines ~697-715) +- Test: `tests/review-skill-contract.test.ts` — assertions on the Step 5 gating prose + +**Approach:** +- After Unit 6's routing resolves and Unit 3 / Unit 4 / Unit 5 execute, the flow tracks a `fixes_applied_count` (incremented when Step 3 fixer succeeds on any Apply decision) +- Step 5's existing prompt is gated: if `fixes_applied_count == 0`, skip Step 5 entirely and exit the skill after the completion report +- Explicit skip conditions: + - Option C ran (File tickets per finding): no fixes landed; skip Step 5 + - Option D ran (Report only): no fixes landed; skip Step 5 + - LFG ran but the agent's recommendations contained no Apply: no fixes landed; skip Step 5 + - Walk-through completed with all Skip / Defer / Acknowledge: no fixes landed; skip Step 5 +- When fixes did land, Step 5 runs exactly as today — PR mode / branch mode / on-main mode + +**Patterns to follow:** +- Existing Step 5 mode-aware phrasing at `plugins/compound-engineering/skills/ce-review/SKILL.md:697-715` + +**Test scenarios:** +- *Happy path:* walk-through with 2 Apply decisions -> fixer runs -> Step 5 runs (offers push/PR/exit) +- *Option D:* Report only -> Step 5 is skipped; skill exits after report +- *Option C:* File tickets -> tickets filed, no fixes applied -> Step 5 is skipped +- *LFG with zero Applies:* all recommendations were Defer or Skip -> Step 5 is skipped +- *Walk-through all Skip:* no Apply decisions -> Step 5 is skipped +- *Mixed walk-through:* 1 Apply + 2 Defer + 1 Skip -> Step 5 runs + +**Verification:** +- The SKILL.md Step 5 section names the gating condition. +- `bun test tests/review-skill-contract.test.ts` passes with the new gating assertions. +- Running Interactive mode with option D or C exits after the report; running with Apply decisions offers Step 5 as today. + +--- + +- [ ] **Unit 8: Update orchestration contract test** + +**Goal:** `tests/review-skill-contract.test.ts` encodes the updated ce:review contract for all modes, so callers (`lfg`, `slfg`, any future orchestrator) stay validated. + +**Requirements:** R27 (mode boundary assertions), plus contract assertions from Units 1, 2, 3, 4, 5, 6, 7 + +**Dependencies:** Units 1-7 + +**Files:** +- Modify: `tests/review-skill-contract.test.ts` +- Verify (no change): `plugins/compound-engineering/skills/ce-review/SKILL.md` (already updated by Units 1-7) + +**Approach:** +- Add **structural assertions** (check for presence of landmarks and files, not exact copy): + - Stage 5 prose mentions a tie-breaking rule for conflicting recommendations (Unit 1). Assert presence of the three action tokens (`Skip`, `Defer`, `Apply`) and the word `conservative` in Stage 5; do not lock to a specific punctuation between them so prose can be edited for clarity. + - `references/walkthrough.md` exists (Unit 3). + - `references/bulk-preview.md` exists (Unit 4). + - `references/tracker-defer.md` exists and states `.context/compound-engineering/todos/` is not in the fallback chain (Unit 5). + - `references/subagent-template.md` contains a framing-guidance block for `why_it_matters` (Unit 2). Assert presence of "observable behavior" and the required-field reminder; do not lock to exact copy of the example pair. + - After Review Step 2 (Interactive branch) presents four options (Unit 6). Assert the four distinguishing words appear (`Review`, `LFG`, `File`, `Report`) as standalone tokens; do not lock the full option label copy. + - After Review Step 2's stem does not contain first-person "I" / "me" (Unit 6, AGENTS.md:127). + - Step 5 prose gates on fixes-applied (Unit 7). Assert presence of a conditional landmark; do not lock to exact phrasing. +- Preserve existing assertions for Autofix / Report-only / Headless mode prose (R27). These branches are unchanged by this work; the test locks that in. +- Confirm no reference to legacy `todos/` in the fallback chain. +- **Philosophy:** the contract test is a regression guard, not authoring ossification. Assert presence of stable landmarks (file paths, required tokens, mode branches) rather than exact prose. Wording improvements in future PRs should not break the test. + +**Patterns to follow:** +- Existing assertion style in `tests/review-skill-contract.test.ts:1-257` +- `bun:test` conventions and the existing `parseFrontmatter` helper + +**Test scenarios:** +- *Happy path:* `bun test tests/review-skill-contract.test.ts` passes after Units 1-7 land +- *Regression guard:* removing a routing option entirely (dropping one of the four distinguishing words) fails the test; re-wording a label for clarity does NOT fail the test +- *Regression guard:* re-introducing first-person "I" / "me" in the Step 2 stem fails the test +- *Mode isolation:* removing or modifying Autofix / Report-only / Headless prose fails the test (ensures R27 is enforced in the contract) + +**Verification:** +- The test suite passes after all units land. +- The test file is the single source of truth for the Interactive-mode contract shape. + +## System-Wide Impact + +- **Interaction graph:** The new After Review Step 2 dispatches to three new reference files (`walkthrough.md`, `bulk-preview.md`, `tracker-defer.md`). Framing quality is delivered upstream via the shared subagent template (Unit 2) — no new orchestrator-owned inline stage. The existing Step 3 fixer subagent is called once at the end of Apply accumulation (walk-through path) or once after Proceed (LFG path). Step 5 becomes conditional on `fixes_applied_count > 0`. +- **Error propagation:** Tracker failures surface inline via a Retry / Fallback / Convert-to-Skip follow-up question. When a high-confidence named tracker fails at execution, its cached sink-available state is invalidated for the rest of the session. Fixer failures continue to use today's bounded-rounds retry. +- **State lifecycle risks:** Walk-through state is in-memory only; an interrupted walk-through discards in-flight decisions and no fixer dispatch runs. Defer actions that already executed during the walk-through remain in the tracker (external side effects cannot be rolled back). The tracker-detection tuple is cached in orchestrator memory for the run. +- **API surface parity:** All new questions use `AskUserQuestion` / `request_user_input` / `ask_user` with fallback prose for platforms that lack the tool. Third-person agent voice applies uniformly. +- **Integration coverage:** The `lfg`, `slfg`, and other ce:review callers operate in `mode:autofix`, `mode:report-only`, or `mode:headless` — all three are unchanged. Unit 8's contract test asserts this explicitly. No behavior change for those callers. +- **Unchanged invariants:** Findings schema, persona dispatch (Stage 3-4), merge pipeline routing logic beyond R15, safe_auto fixer flow, run-id generation, headless output envelope, headless detail-tier artifact enrichment rule, the existing bucket policy question behavior under modes other than Interactive (it is removed, but since it only existed in the Interactive branch this is an in-mode change), and the pre-menu findings table format. + +## Risks & Dependencies + +| Risk | Mitigation | +|------|------------| +| Unit 2 template upgrade doesn't land the framing quality we want (personas still produce code-structure-first `why_it_matters`) | The change is a single file edit, so iterating the prose is cheap. Post-merge sampling verifies uptake; if specific personas still fail, targeted per-persona edits land as follow-up (deferred-tasks list) | +| Unit 2 template change causes unintended behavior changes in other review fields | The framing guidance is scoped to `why_it_matters` only. Other schema fields (title, severity, evidence, etc.) are untouched in the template edit. Contract test asserts the other fields' existing instructions are preserved | +| Tracker detection confidently names the wrong tracker at runtime | R3 label-confidence qualifier: only name the tracker inline when detection is high-confidence AND sink-available. On execution failure, cached sink-available state is invalidated so fallback fires on the next Defer rather than retrying a confirmed-broken sink. Failure path always offers the user a path out (Retry / Fall back / Skip) | +| Tracker probes add latency before the routing question appears | Probes run at most once per session and only when option C is a candidate (skipped on zero-findings path). Acceptable added latency: single `gh auth status` call plus MCP dispatch checks | +| Apply set from the walk-through is heterogeneous (gated_auto + manual), differing from the safe_auto queue the fixer was designed for | Unit 3 calls out the small Step 3 fixer prompt update needed to acknowledge the heterogeneous queue. Prompt iteration lands alongside Unit 3 | +| Scope spans 8 units across SKILL.md, shared subagent template, and 3 new reference files | Unit boundaries keep individual changes focused. Units 1, 2, 3, 4, 5 can author in parallel; Unit 6 is the integration point that depends on 3/4/5; Units 7/8 follow. Single-PR shipping acceptable given the reduced scope (no Stage 5b) | +| Cross-platform test regression in `tests/review-skill-contract.test.ts` from prose-wording improvements | Unit 8 uses structural assertions (landmarks, file paths, required tokens, mode branches) rather than exact prose. Wording improvements in future PRs should not break the test (philosophy documented in the unit approach) | +| The "approve intent, write a variant" edge case surfaces user friction in v1 | Documented in Scope Boundaries and in the walk-through's override rule (R10). Track as candidate for v2 | +| Four-option routing menu has no headroom for a future fifth intent | Documented in Dependencies / Assumptions. A future fifth intent would require promoting a follow-up sub-question or demoting one of the four options — both are acceptable follow-up costs | + +## Documentation / Operational Notes + +- Update `plugins/compound-engineering/README.md` if the redesign changes the skill's externally visible capabilities (the routing question stem and options will appear in user-facing help). Defer the README change to an end-of-PR unit; the skill-level docs are the source of truth. +- No rollout, feature flag, or monitoring changes needed — this is a prose-level skill authoring change behind `mode:interactive` (the default). Callers using other modes are unaffected. +- Run `bun run release:validate` as part of verification; the plugin.json descriptions/counts are not changed by this work, but the validator catches regressions if they appear. + +## Sources & References + +- **Origin document:** [docs/brainstorms/2026-04-17-ce-review-interactive-judgment-requirements.md](../brainstorms/2026-04-17-ce-review-interactive-judgment-requirements.md) +- Primary edit targets: `plugins/compound-engineering/skills/ce-review/SKILL.md` (After Review section, Stage 5) and `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` (framing guidance for `why_it_matters`) +- New reference files: `plugins/compound-engineering/skills/ce-review/references/{walkthrough.md,bulk-preview.md,tracker-defer.md}` +- Findings schema: `plugins/compound-engineering/skills/ce-review/references/findings-schema.json` (no changes) +- Contract test: `tests/review-skill-contract.test.ts` +- Project standards: `plugins/compound-engineering/AGENTS.md` (§Interactive Question Tool Design, §Cross-Platform User Interaction, §Rationale Discipline) +- Institutional learnings: `docs/solutions/skill-design/compound-refresh-skill-improvements.md`, `beta-promotion-orchestration-contract.md`, `workflow/todo-status-lifecycle.md`, `skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md`, `best-practices/codex-delegation-best-practices-2026-04-01.md`, `skill-design/pass-paths-not-content-to-subagents-2026-03-26.md` +- Related prior work: `plugins/compound-engineering/skills/todo-triage/SKILL.md` (per-item walk-through precedent), `plugins/compound-engineering/skills/ce-ideate/references/post-ideation-workflow.md` (four-option menu precedent), `plugins/compound-engineering/skills/ce-plan/references/deepening-workflow.md` (per-agent loop precedent) diff --git a/docs/solutions/best-practices/ce-pipeline-end-to-end-learnings-2026-04-17.md b/docs/solutions/best-practices/ce-pipeline-end-to-end-learnings-2026-04-17.md new file mode 100644 index 0000000..d79b5ca --- /dev/null +++ b/docs/solutions/best-practices/ce-pipeline-end-to-end-learnings-2026-04-17.md @@ -0,0 +1,300 @@ +--- +title: "End-to-end learnings from running the full CE pipeline on a substantial feature" +date: 2026-04-17 +category: best-practices +module: plugins/compound-engineering +problem_type: best_practice +component: development_workflow +severity: medium +applies_when: + - Running ce:brainstorm → ce:plan → ce:work → ce:review on any non-trivial feature (more than ~1 unit of implementation work) + - Orchestrating the full compound-engineering pipeline end-to-end in a single session + - Deciding when to insert document-review passes between pipeline stages + - Any feature that introduces a new user-facing flow, especially bulk actions or single-keystroke commitments + - Any time a research agent returns a confident architectural recommendation that would add a stage, schema field, or module +tags: [compound-engineering, ce-pipeline, ce-brainstorm, ce-plan, ce-work, ce-review, document-review, workflow, hitl, pipeline-discipline] +--- + +# End-to-end learnings from running the full CE pipeline on a substantial feature + +## Context + +The compound-engineering pipeline is designed as a sequence of progressively more expensive stages: `ce:brainstorm` → `document-review` → `ce:plan` → `document-review` → `ce:work` → `ce:review` → `resolve-pr-feedback`. Each stage operates on a different artifact (requirements doc, plan doc, diff, PR) and applies a different lens (exploration, critique, execution, synthesis, defense). + +It is tempting, on a substantial feature, to collapse this sequence — jump from a rough idea to implementation, or skip document-review because the plan "looks right." A recent session ran the full pipeline end-to-end on a non-trivial feature: redesigning the Interactive mode of `ce:review` with a per-finding walk-through, a compact bulk-action preview, a four-option routing model, and defer-to-tracker integration. + +The cross-cutting insight from that run is that **the pipeline itself compounds**. Issues that would have been cheap to fix at brainstorm time became expensive in PR review; issues document-review caught at plan time would have corrupted implementation if they had slipped through. Each stage catches a different class of problem, and each cheaper stage eliminates issues before they become expensive ones downstream. The value of running the pipeline in full isn't process-for-its-own-sake — it is that the stages are not redundant. They find different things. + +This document codifies the concrete patterns that surfaced repeatedly so future runs — by humans or agents — inherit the lessons instead of rediscovering them. + +--- + +## Guidance + +### 1. Sample actual evidence before accepting research-agent claims + +Research agents and sub-agents return confident conclusions. Treat those conclusions as hypotheses, not facts, whenever an architectural decision rides on them. "Did you check?" is the correct response to any recommendation framed as "our analysis shows..." when the downstream cost of being wrong is a new stage, a new schema, or a new module. + +The concrete practice: + +- When a research agent recommends a structural intervention (new stage, new field, new module), name the specific artifacts the claim is derived from. +- Sample 10-20 real artifacts across the relevant axes. +- Compare what the sampled evidence actually shows to what the research claim asserts. +- Update the intervention to match the evidence, not the claim. + +Sampled evidence is often directionally correct but mechanistically wrong — and the mechanism is what determines the fix. + +### 2. Run document-review after brainstorm AND after plan + +Document-review is not a single gate. It operates differently on requirements (is this the right problem, framed coherently?) than on plans (does this design hold together, and does it contradict its own scope?). Skipping either application is a different failure mode: + +- Skipping post-brainstorm doc-review: you plan the wrong thing. +- Skipping post-plan doc-review: you implement a plan with internal contradictions. + +Multiple doc-review personas routinely catch architectural contradictions — a unit that adds a schema field the plan's own scope boundary forbade, a feature whose framing undermines its stated goal. These are cheap catches at plan time, expensive in implementation, and nearly unfixable in PR review. + +### 3. Treat "trust the agent" UX options as rubber-stamp vectors + +Any feature that offers a single-keystroke commit-a-lot action is a rubber-stamping risk, regardless of how well it is labeled. If the redesign's goal is *reducing* rubber-stamping, any such action needs a visible plan the user can inspect before executing. + +The pattern: + +- Compact preview grouped by action class (Applying / Filing / Skipping). +- Proceed / Cancel gate before execution. +- Preview is cheap to render and hard to misuse. + +This is the right surface for *reviewing a pre-computed plan*. It is explicitly the wrong surface for *per-item decisions* — a numbered list with per-item options looks efficient at low volume and collapses working memory at high volume. + +### 4. Distinguish bulk-preview ergonomics from per-item walk-through ergonomics + +Two different review modalities with different affordances: + +| Modality | Good for | Bad for | +|---|---|---| +| Bulk preview grouped by action | Reviewing a pre-computed plan | Making per-item decisions | +| Per-item walk-through | Making per-item decisions | Reviewing dozens of items at once | + +Mixing the two — a numbered list with per-row options — feels dense and efficient until volume hits. Then it breaks. Decide which modality each surface is, and commit. + +### 5. Treat tool/platform caps as structural constraints + +Cross-platform tool limits (e.g., the 4-option cap on `AskUserQuestion`) are not annoyances to route around. They force design decisions. Collapsing a 5-option set into 4 + a follow-up question is architecturally different from a 5-option set. Accept the cap early and design for it; do not fight it in implementation and pay for it later. + +### 6. Never conflate two semantic meanings in one flag + +Flag names that read sensibly in one callsite can be silently wrong in another. The symptom is a flag whose definition ("is X available?") is consistent, but whose *use* answers two different questions ("can we invoke X?" vs. "should we offer X as an option?"). One flag cannot answer both correctly. + +When a flag's meaning depends on the caller, split it (see Example 2 below). + +This pattern recurs in the codebase. Prior instances surfaced during the `batch_confirm` collapse in document-review (session history) — a three-tier routing was collapsed to two because the middle tier conflated "high confidence in the fix" with "needs user judgment." And in the signal-word tightening for plan deepening, where "strengthen" / "confidence gaps" as standalone trigger words conflated targeted-edit intent with holistic-deepening intent, producing false positives until tightened to require "deepen" explicitly. + +### 7. Contract tests assert structure, not prose + +A contract test that pins exact wording becomes a tax on future copy improvements. Every wording refinement breaks the test even though the contract is intact. The philosophy is "regression guard, not authoring ossification." + +Assert: file existence, required section headings, required tokens, regex on distinguishing words. Do not assert: sentence-level wording, punctuation, or phrasing that copy editors will legitimately touch. This parallels the structural-evaluation practice used in skill-creator evals, where assertion names map to concrete fields in the output JSON (`overlap_detected`, `update_not_create`) rather than subjective prose judgments. + +### 8. Don't cite external plugins or tools in durable artifacts + +External references may be useful *in dialogue* during brainstorming — "plugin X's review flow does Y, what if we did Z?" — but should not appear in requirements docs, plan docs, PR descriptions, or commit messages. Artifacts need to stand on their own. + +- Dialogue: "X's design is interesting because..." +- Artifact: re-frame the same insight in self-contained terms that do not depend on the reader knowing X. + +The cost of violating this is low-visibility: the artifact reads fine today, but a future reader (or re-user of the pattern) hits an unexplained proper noun with no resolution path. + +### 9. Skill bodies are product code — author them accordingly + +Skills are the instruction substrate for future dispatch. Violations in a skill being shipped propagate into every future invocation. The authoring rules that apply to agent definitions apply equally to skill bodies: + +- Third-person agent voice ("What should the agent do?", not "What should I do?"). +- Front-load distinguishing words so truncated labels remain differentiable. +- Rationale discipline: conditional and late-sequence blocks must explain *why*, not just *what*, because agents landing mid-skill need the reasoning to route correctly. + +### 10. Each pipeline stage catches a different class of issue + +Don't skip stages because "the previous one looked fine." The value distribution across stages: + +| Stage | Catches | Relative cost to fix | +|---|---|---| +| Brainstorm | Wrong problem, wrong framing | Cheapest | +| Doc-review (requirements) | Incoherent requirements, missing constraints | Cheap | +| Plan | Wrong design | Medium | +| Doc-review (plan) | Self-contradicting plan, scope violations | Medium | +| Work | Execution bugs | Expensive | +| ce:review | Scope drift in implementation | Expensive | +| PR review | Subtle semantic conflations (flags, schema, contracts) | Most expensive | + +The stages are not redundant. Each catches things the others structurally cannot. + +--- + +## Why This Matters + +- **Cheaper stages eliminate expensive bugs.** The `sink_available` conflation (Example 2) was caught in PR review; had it shipped, it would have been a user-visible bug in an interactive flow. A hypothetical new "Stage 5b synthesis-time rewrite pass" would have added a persistent stage and per-finding model dispatch to the pipeline had it not been caught at plan time by sampling real artifacts instead of accepting a research claim. +- **Document-review finds contradictions authors miss.** The plan draft contained a unit that added a new field to merged findings — a schema change that contradicted the plan's own "no changes to the findings schema" scope boundary. The authors did not see this; multiple doc-review personas did. (session history: this same pattern appears across testing-addressed-gate, universal-planning, and the deepen-plan work — adversarial and scope-guardian reviewers consistently catch scope contradictions.) +- **Rubber-stamping risk is invisible without a preview gate.** A compact preview is cheap to implement and hard to misuse. Its absence is invisible until an interactive flow has been rubber-stamped in production. This was the exact failure mode in an earlier LFG-autopilot session where 6 of 7 reviewers scored just below the 80 threshold on legitimately fixable issues and were auto-suppressed. +- **Contract tests that ossify prose become a hidden tax on iteration.** Every future wording improvement triggers a false-positive test break, which trains contributors to either skip wording improvements or mechanically update tests without thinking. Neither is the intended outcome. +- **Pipelines compound only if run in full.** Running brainstorm-then-work is not compound engineering. It is ad-hoc engineering with extra syntax. The compounding effect comes from stages catching each other's misses. + +--- + +## When to Apply + +- Running `ce:brainstorm` → `ce:plan` → `ce:work` → `ce:review` on any non-trivial feature (more than ~1 unit of implementation work). +- Any feature that introduces a new user-facing flow, especially one with bulk actions, routing decisions, or single-keystroke commitments. +- Any time a research agent or sub-agent returns a confident architectural recommendation that would add a stage, a schema field, or a module. +- Any PR whose scope boundary is explicitly stated ("no changes to X schema", "no new stages") — doc-review both the requirements and the plan before implementation starts. +- Any contract test or snapshot test being written against generated documentation. +- Any flag whose name could plausibly answer more than one question. +- Any skill body being authored or revised. + +--- + +## Examples + +### Example 1: Sampling-over-assumption (Stage 5b → shared-template upgrade) + +**Before** — a research agent asserted "personas will not reliably produce R22-R25 framing." The plan drafted a new Stage 5b synthesis-time rewrite pass to enforce framing post-hoc via a new per-finding model dispatch. + +**Intervention** — user pushback: "are you sure?" Sampled 15+ real review artifacts across 5 personas. + +**Sampled finding** — the research was directionally correct but mechanistically wrong. The actual issues were: + +- Null `why_it_matters` fields in `adversarial` and `api-contract` personas. +- Code-structure-first framing (vs. impact-first) in `correctness` and `maintainability` personas. + +**After** — intervention changed from "new per-finding model-dispatch stage" to "one-file shared-template upgrade" (`references/subagent-template.md`). Smaller surface area, cheaper to implement, targets the actual failure modes. No new stage, no recurring per-review model cost. + +This mirrors a prior pattern (session history): in the `feat/plan-review-personas` work, a model-tiering assumption ("Codex probably ignores the `sonnet` param") was challenged with "are you sure other platforms ignore it?" Checking the converter code revealed `model: sonnet` was already propagated to all targets, flipping the design from Claude-Code-only to universal. + +### Example 2: The `sink_available` split + +**Before** — one flag, used in two places with two different meanings: + +``` +# Detection output +{ tracker_name, confidence, sink_available } + +# sink_available definition: "the detected tracker can be invoked" + +# Callsite A — label logic +if confidence == "high" and sink_available: + label = f"File a {tracker_name} ticket..." +else: + label = "File a ticket..." # generic + +# Callsite B — no-sink suppression (subtly wrong) +if not sink_available: + omit_option_C() + # Question really being answered: "should we offer Defer at all?" + # which is NOT the same as "can we invoke the named tracker?" +``` + +The bug: when `sink_available = false` for the named tracker but GitHub Issues via `gh` or the harness task primitive *would* work, Callsite B silently drops Defer even though a fallback sink is available. + +**After** — two flags, one meaning each: + +``` +# Detection output +{ tracker_name, confidence, named_sink_available, any_sink_available } + +# named_sink_available — the specifically-named tracker is invokable +# any_sink_available — any tier in the fallback chain works + +# Callsite A — label logic uses the narrow flag +if confidence == "high" and named_sink_available: + label = f"File a {tracker_name} ticket..." +elif any_sink_available: + label = "File a ticket..." # generic, fallback works +# else: option omitted + +# Callsite B — suppression uses the broad flag +if not any_sink_available: + omit_option_C() +``` + +The two callsites now answer their respective questions correctly. A repo with no documented tracker but working `gh` correctly offers Defer with a generic label instead of silently suppressing. + +### Example 3: Structural-vs-prose contract test assertion + +**Before:** + +``` +def test_release_notes_contract(): + doc = (root / "RELEASE_NOTES.md").read_text() + assert "only when one or more fixes landed" in doc + assert "applied during the review" in doc +``` + +Every rephrase of either sentence breaks the test, even when the contract is intact. + +**After:** + +``` +def test_release_notes_contract(): + doc_path = root / "RELEASE_NOTES.md" + assert doc_path.exists(), "release notes file must be generated" + + doc = doc_path.read_text() + + # Required sections (structural landmarks) + assert "## Fixes applied" in doc + assert "## Findings deferred" in doc + + # Required distinguishing tokens + assert re.search(r"\bfix(es)?\b.*\bland", doc, re.I), \ + "must describe fixes landing" + assert re.search(r"\bdefer(red)?\b", doc, re.I), \ + "must describe deferrals" +``` + +Structural landmarks (file exists, section exists, token present) are the contract. Sentence-level wording is not. This matches the structural-evaluation style used in skill-creator evals, where assertion names map to concrete fields in output JSON (`overlap_detected`, `update_not_create`). + +### Example 4: Preview gate for bulk "trust the agent" action + +**Before** — an LFG-style routing option executes the full bulk plan on one keystroke. Looks efficient; is a rubber-stamp vector. + +**After** — LFG presents a compact preview grouped by action class, then gates execution behind explicit Proceed/Cancel: + +``` +Review plan: + +Applying (3): + - src/auth.ts:44 fix stale session on logout + - src/auth.ts:112 null-check refresh token + - src/api.ts:87 handle 429 retry-after + +Filing (2): + - src/ui/modal.tsx:23 a11y focus trap (defer) + - src/db/migrate.ts:9 idempotency audit (defer) + +Skipping (1): + - docs/README.md:4 prose nit + +[Proceed] [Cancel] +``` + +The plan is visible. Rubber-stamping is now an explicit, informed act rather than a side effect of UI design. + +### Example 5: External plugin references stay in dialogue + +**Dialogue (acceptable):** "Plugin X's review flow groups findings by file, which works well for their navigation-driven use case. What if we grouped by action class instead, since our Interactive mode is decision-driven?" + +**Artifact (acceptable):** "Findings are grouped by action class (Applying / Filing / Skipping) because Interactive mode is decision-driven: the user's question at this surface is 'what is about to happen?', not 'where in the tree am I?'." + +**Artifact (not acceptable):** "Findings are grouped by action class, similar to plugin X's review flow but adapted for our decision-driven Interactive mode." + +The artifact version stands on its own without the external reference. A future reader does not need to know X to understand the design. *(auto memory [claude]: this rule was applied throughout the ce:review redesign session — the requirements doc, plan, and PR description all re-framed externally-inspired patterns in self-contained terms.)* + +--- + +## Related + +- [research-agent-pipeline-separation-2026-04-05.md](../skill-design/research-agent-pipeline-separation-2026-04-05.md) — Establishes the brainstorm / plan / work stage separation. This learning extends downstream to doc-review, ce:review, and resolve-pr-feedback, and focuses on what issues surface at each stage rather than what research dispatches. +- [compound-refresh-skill-improvements.md](../skill-design/compound-refresh-skill-improvements.md) — The 6-item skill review checklist is a natural companion for review-time prevention rules, particularly around cross-phase consistency and blind-user-question avoidance. +- [beta-promotion-orchestration-contract.md](../skill-design/beta-promotion-orchestration-contract.md) — Contract-tests-enforce-orchestration-assumptions pattern for the ce:review surface; direct prior art for structural assertion philosophy. +- [todo-status-lifecycle.md](../workflow/todo-status-lifecycle.md) — Closely related on the downstream half of the CE pipeline; "make state transitions load-bearing, not advisory" parallels per-finding judgment discipline. +- [git-workflow-skills-need-explicit-state-machines-2026-03-27.md](../skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md) — Methodologically aligned ("state machine over prose" ≈ "structural assertions over prose"), different domain. +- [pass-paths-not-content-to-subagents-2026-03-26.md](../skill-design/pass-paths-not-content-to-subagents-2026-03-26.md) — Companion for any subagent-template changes, particularly around instruction phrasing. +- [codex-delegation-best-practices-2026-04-01.md](codex-delegation-best-practices-2026-04-01.md) — Canonical example of sampling-evidence-over-assumption at depth (6 evaluation iterations, empirical token measurement). diff --git a/docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md b/docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md index b839a84..cf47920 100644 --- a/docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md +++ b/docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md @@ -69,6 +69,7 @@ A "dependency-analyzer" agent that identifies library versions and compatibility ## Related - `docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md` -- related agent dispatch optimization pattern (token efficiency, not deduplication) -- `docs/solutions/skill-design/beta-skills-framework.md` -- documents the pipeline chain (note: pipeline description is stale, references `deepen-plan` which has been merged into `ce:plan`) +- `docs/solutions/skill-design/beta-skills-framework.md` -- documents the pipeline chain and the beta-skills rollout pattern that plugs into it +- `docs/solutions/best-practices/ce-pipeline-end-to-end-learnings-2026-04-17.md` -- extends this framing downstream (document-review, ce:review, resolve-pr-feedback) with meta-observations from running the full pipeline end-to-end on a feature - Commit f7a14b76 on `tmchow/slack-analyst-agent` -- the Slack researcher pass-through optimization that prompted this analysis - GitHub issue #492 -- `repo-research-analyst` self-recursion bug (fixed, separate concern) diff --git a/plugins/compound-engineering/agents/review/agent-native-reviewer.md b/plugins/compound-engineering/agents/review/agent-native-reviewer.md index e9c71d6..8520817 100644 --- a/plugins/compound-engineering/agents/review/agent-native-reviewer.md +++ b/plugins/compound-engineering/agents/review/agent-native-reviewer.md @@ -2,7 +2,7 @@ name: agent-native-reviewer description: "Reviews code to ensure agent-native parity -- any action a user can take, an agent can also take. Use after adding UI features, agent tools, or system prompts." model: inherit -color: cyan +color: blue tools: Read, Grep, Glob, Bash --- diff --git a/plugins/compound-engineering/skills/ce-review/SKILL.md b/plugins/compound-engineering/skills/ce-review/SKILL.md index 807b88a..ebb6037 100644 --- a/plugins/compound-engineering/skills/ce-review/SKILL.md +++ b/plugins/compound-engineering/skills/ce-review/SKILL.md @@ -70,6 +70,11 @@ All tokens are optional. Each one present means one less thing to infer. When ab - **Never commit, push, or create a PR** from headless mode. The caller owns those decisions. - **End with "Review complete" as the terminal signal** so callers can detect completion. If all reviewers fail or time out, emit `Code review degraded (headless mode). Reason: 0 of N reviewers returned results.` followed by "Review complete". +### Interactive mode rules + +- **Pre-load the platform question tool before any question fires.** In Claude Code, `AskUserQuestion` is a deferred tool — its schema is not available at session start. At the start of Interactive-mode work (before Stage 2 intent-ambiguity questions, the After-Review routing question, walk-through per-finding questions, bulk-preview Proceed/Cancel, and tracker-defer failure sub-questions), call `ToolSearch` with query `select:AskUserQuestion` to load the schema. Load it **once, eagerly, at the top of the Interactive flow** — do not wait for the first question site and do not decide it on a per-site basis. On Codex (`request_user_input`) and Gemini (`ask_user`) this step is not required; the tools are loaded by default. +- **The numbered-list fallback only applies on confirmed load failure.** The skill's fallback pattern — "present the options as a numbered list and wait for the user's reply" — is valid **only** when `ToolSearch` returns no match or the tool call explicitly fails. Rendering a question as narrative text because the tool feels inconvenient, because the model is in report-formatting mode, or because the instruction was buried in a long skill is a bug. A question that calls for a user decision must either fire the tool or fail loudly. + ## Severity Scale All reviewers use P0-P3: @@ -316,7 +321,7 @@ Pass this to every reviewer in their spawn prompt. Intent shapes *how hard each **When intent is ambiguous:** -- **Interactive mode:** Ask one question using the platform's interactive question tool (AskUserQuestion in Claude Code, request_user_input in Codex): "What is the primary goal of these changes?" Do not spawn reviewers until intent is established. +- **Interactive mode:** Ask one question using the platform's interactive question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini): "What is the primary goal of these changes?" Do not spawn reviewers until intent is established. **Claude Code only:** if `AskUserQuestion` has not yet been loaded this session (per the Interactive mode rules pre-load), call `ToolSearch` with query `select:AskUserQuestion` first before asking. On Codex (`request_user_input`) and Gemini (`ask_user`) this preload step does not apply — the platform-native question tool is loaded by default. - **Autofix/report-only/headless modes:** Infer intent conservatively from the branch name, diff, PR metadata, and caller context. Note the uncertainty in Coverage or Verdict reasoning instead of blocking. ### Stage 2b: Plan discovery (requirements verification) @@ -469,6 +474,7 @@ Convert multiple reviewer compact JSON returns into one deduplicated, confidence 5. **Separate pre-existing.** Pull out findings with `pre_existing: true` into a separate list. 6. **Resolve disagreements.** When reviewers flag the same code region but disagree on severity, autofix_class, or owner, annotate the Reviewer column with the disagreement (e.g., "security (P0), correctness (P1) -- kept P0"). This transparency helps the user understand why a finding was routed the way it was. 7. **Normalize routing.** For each merged finding, set the final `autofix_class`, `owner`, and `requires_verification`. If reviewers disagree, keep the most conservative route. Synthesis may narrow a finding from `safe_auto` to `gated_auto` or `manual`, but must not widen it without new evidence. +7b. **Tie-break the recommended action.** Interactive mode's walk-through and LFG paths present a per-finding recommended action (Apply / Defer / Skip / Acknowledge) derived from the normalized `autofix_class` and `suggested_fix`. When contributing reviewers implied different actions for the same merged finding, synthesis picks the most conservative using the order `Skip > Defer > Apply > Acknowledge`. This guarantees that identical review artifacts produce the same recommendation deterministically, so LFG results are auditable after the fact and the walk-through's recommendation is stable across re-runs. The user may still override per finding via the walk-through's options; this rule only determines what gets labeled "recommended." 8. **Partition the work.** Build three sets: - in-skill fixer queue: only `safe_auto -> review-fixer` - residual actionable queue: unresolved `gated_auto` or `manual` findings whose owner is `downstream-resolver` @@ -617,26 +623,29 @@ After presenting findings and verdict (Stage 6), route the next steps by mode. R **Interactive mode** - Apply `safe_auto -> review-fixer` findings automatically without asking. These are safe by definition. -- Ask a policy question **using the platform's blocking question tool** (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini) only when `gated_auto` or `manual` findings remain after safe fixes. Do not replace with a conversational open-ended question. Adapt the options to match what actually remains: +- **Zero-remaining case:** if no `gated_auto` or `manual` findings remain after the `safe_auto` pass, skip the routing question entirely. Emit a one-line completion summary phrased so advisory and pre-existing findings (which are not handled by this flow) are not implied to be cleared. When no advisory or pre-existing findings remain in the report, `All findings resolved — N safe_auto fixes applied.` is accurate. When advisory and/or pre-existing findings do remain, use the qualified form `All actionable findings resolved — N safe_auto fixes applied. (K advisory, J pre-existing findings remain in the report.)`, omitting any zero-count clause. Follow the summary with the existing end-of-review verdict, then proceed to Step 5 per the gating rule there. +- **Tracker pre-detection:** before rendering the routing question, consult `references/tracker-defer.md` for the session's tracker tuple `{ tracker_name, confidence, named_sink_available, any_sink_available }`. The probe runs at most once per session and is cached for the rest of the run. `named_sink_available` drives the option C label (inline tracker name only when the named sink can actually be invoked). `any_sink_available` drives whether option C is offered at all (it can still be offered when the named tracker is unreachable but `gh` or the harness task primitive works). +- **Verify question-tool pre-load (checklist, Claude Code only).** Before firing the routing question in Claude Code, confirm `AskUserQuestion` is loaded (per Interactive mode rules at the top of this skill). If not yet loaded this session, call `ToolSearch` with query `select:AskUserQuestion` now. Do not proceed to the routing question without this verification. Rendering the question as narrative text is a bug, not a valid fallback. On Codex (`request_user_input`) and Gemini (`ask_user`) this checklist does not apply — the platform-native question tool is loaded by default and there is no `ToolSearch` preload step to perform. +- **Routing question.** Ask using the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). Stem: `What should the agent do with the remaining N findings?` — use third-person voice referring to "the agent", not first-person "me" / "I". Options: - **When `gated_auto` findings are present** (with or without `manual`): ``` - Safe fixes have been applied. What should I do with the remaining findings? - 1. Review and approve specific gated fixes (Recommended) - 2. Leave as residual work - 3. Report only -- no further action + (A) Review each finding one by one — accept the recommendation or choose another action + (B) LFG. Apply the agent's best-judgment action per finding + (C) File a [TRACKER] ticket per finding without applying fixes + (D) Report only — take no further action ``` - **When only `manual` findings remain** (no `gated_auto`): - ``` - Safe fixes have been applied. The remaining findings need manual resolution. What should I do? - 1. Leave as residual work (Recommended) - 2. Report only -- no further action - ``` + Render option C per `references/tracker-defer.md`: when `confidence = high` AND `named_sink_available = true`, replace `[TRACKER]` with the concrete name and keep the full label (e.g., `File a Linear ticket per finding without applying fixes`). When `any_sink_available = true` but either `confidence = low` or `named_sink_available = false` (a fallback tier like GitHub Issues or the harness task primitive is working instead), use the generic label `File an issue per finding without applying fixes` — this is a whole-label substitution, not a `[TRACKER]` token swap. When `any_sink_available = false`, **omit option C entirely** and add one line to the stem explaining why (e.g., `Defer unavailable — no tracker or task-tracking primitive detected on this platform.`). The three remaining options (A, B, D) survive. - If no blocking question tool is available, present the applicable numbered options as text and wait for the user's selection before proceeding. -- If no `gated_auto` or `manual` findings remain after safe fixes, skip the policy question entirely — report what was fixed and proceed to next steps. -- Only include `gated_auto` findings in the fixer queue after the user explicitly approves the specific items. Do not widen the queue based on severity alone. + The numbered-list text fallback applies only when `ToolSearch` explicitly returns no match for the platform's question tool or the tool call errors. It does not apply when the agent simply hasn't loaded the tool yet — in that case, load it now (see the verification checklist above). On platforms genuinely without a blocking question tool, present the applicable options as a numbered list and wait for the user's reply. + +- **Dispatch on selection.** Route by the option letter (A / B / C / D), not by the rendered label string. The option-C label varies by tracker-detection confidence (`File a [TRACKER] ticket per finding without applying fixes` for a named tracker, `File an issue per finding without applying fixes` as the generic fallback, or omitted entirely when no sink is available — see `references/tracker-defer.md`), and options A / B / D have a single canonical label each. The letter is the stable dispatch signal; the canonical labels below are shown for documentation only. A low-confidence run that rendered option C as the generic label routes to the same branch as a high-confidence run that rendered it with the named tracker. + - (A) `Review each finding one by one` — load `references/walkthrough.md` and enter the per-finding walk-through loop. The walk-through accumulates Apply decisions in memory; Defer decisions execute inline via `references/tracker-defer.md`; Skip / Acknowledge decisions are recorded as no-action; `LFG the rest` routes through `references/bulk-preview.md`. At end of the loop, dispatch one fixer subagent for the accumulated Apply set (Step 3). Emit the unified completion report. + - (B) `LFG. Apply the agent's best-judgment action per finding` — load `references/bulk-preview.md` scoped to every pending `gated_auto` / `manual` finding. On `Proceed`, execute the plan: Apply set → Step 3 fixer dispatch; Defer set → `references/tracker-defer.md`; Skip / Acknowledge → no-op. On `Cancel`, return to this routing question. Emit the unified completion report after execution. + - (C) `File a [TRACKER] ticket per finding without applying fixes` (or the generic `File an issue per finding without applying fixes` when the named-tracker label is not used) — load `references/bulk-preview.md` with every pending finding in the file-tickets bucket (regardless of the agent's natural recommendation). On `Proceed`, route every finding through `references/tracker-defer.md`; no fixes are applied. On `Cancel`, return to this routing question. Emit the unified completion report. + - (D) `Report only — take no further action` — do not enter any dispatch phase. Emit the completion report, then proceed to Step 5 per its gating rule (`fixes_applied_count > 0` from earlier `safe_auto` passes). If no fixes were applied this run, stop after the report. + +- The walk-through's completion report, the LFG / File-tickets completion report, and the zero-remaining completion summary all follow the unified completion-report structure documented in `references/walkthrough.md`. Use the same structure across every terminal path. **Autofix mode** @@ -696,7 +705,18 @@ After presenting findings and verdict (Stage 6), route the next steps by mode. R #### Step 5: Final next steps -**Interactive mode only:** after the fix-review cycle completes (clean verdict or the user chose to stop), offer next steps based on the entry mode. Reuse the resolved review base/default branch from Stage 1 when known; do not hard-code only `main`/`master`. +**Interactive mode only.** After the fix-review cycle completes (clean verdict or the user chose to stop), offer next steps based on the entry mode. Reuse the resolved review base/default branch from Stage 1 when known; do not hard-code only `main`/`master`. + +**The gate is total fixes applied this run, not routing option.** Track `fixes_applied_count` across the whole Interactive invocation. This counter includes both the `safe_auto` fixes applied automatically before the routing question (see Step 2 Interactive mode) AND any Apply decisions executed by routing option A (walk-through) or option B (LFG). Routing options C (File tickets) and D (Report only) add zero to this counter; neither does a walk-through that ends with only Skip / Defer / Acknowledge, and neither does an LFG whose recommendations were all Defer / Skip / Acknowledge. + +Step 5 runs only when `fixes_applied_count > 0`. If the counter is zero — no `safe_auto` fixes were applied AND the routing path produced no additional Apply — skip Step 5 entirely and exit after the completion report. Asking "push fixes?" when nothing changed in the working tree is incoherent. + +Common outcomes: + +- `safe_auto` produced fixes AND the user picked any routing option → Step 5 runs (counter > 0 from the safe_auto pass alone). +- No `safe_auto` fixes AND the user picked option C or D → Step 5 skipped. +- No `safe_auto` fixes AND walk-through / LFG finished with zero Applies → Step 5 skipped. +- Zero-remaining case (no `gated_auto` / `manual` after `safe_auto`) with at least one `safe_auto` fix → Step 5 runs; the routing question was never asked but the counter is > 0. - **PR mode (entered via PR number/URL):** - **Push fixes** -- push commits to the existing PR branch diff --git a/plugins/compound-engineering/skills/ce-review/references/bulk-preview.md b/plugins/compound-engineering/skills/ce-review/references/bulk-preview.md new file mode 100644 index 0000000..1f865c2 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-review/references/bulk-preview.md @@ -0,0 +1,132 @@ +# Bulk Action Preview + +This reference defines the compact plan preview that Interactive mode shows before every bulk action — LFG (routing option B), File tickets per finding (routing option C), and the walk-through's `LFG the rest` (option D of the per-finding question). The preview gives the user a single-screen view of what the agent is about to do, with exactly two options to Proceed or Cancel. + +Interactive mode only. + +--- + +## When the preview fires + +Three call sites: + +1. **Routing option B (top-level LFG)** — after the user picks `LFG. Apply the agent's best-judgment action per finding` from the routing question, but before any action executes. Scope: every pending `gated_auto` / `manual` finding. +2. **Routing option C (top-level File tickets)** — after the user picks `File a [TRACKER] ticket per finding without applying fixes` but before any ticket is filed. Scope: every pending `gated_auto` / `manual` finding. Every finding appears under `Filing [TRACKER] tickets (N):` regardless of the agent's natural recommendation, because option C is batch-defer. +3. **Walk-through `LFG the rest`** — after the user picks `LFG the rest — apply the agent's best judgment to this and remaining findings` from a per-finding question, but before the remaining findings are resolved. Scope: the current finding and everything not yet decided. Already-decided findings from the walk-through are not included in the preview. + +In all three cases the user confirms with `Proceed` or backs out with `Cancel`. No per-item decisions inside the preview — per-item decisioning is the walk-through's role. + +--- + +## Preview structure + +The preview is grouped by the action the agent intends to take. Bucket headers appear only when their bucket is non-empty. + +``` +[ (tracker: )]: + +Applying (N): + [P0] : + [P1] : + +Filing [TRACKER] tickets (N): + [P2] : + +Skipping (N): + [P2] : + +Acknowledging (N): + [P3] : +``` + +Worked example, for routing option B (top-level LFG): + +``` +LFG plan — 8 findings (tracker: Linear): + +Applying (4): + [P0] orders_controller.rb:42 — Add ownership guard before order lookup + [P1] webhook_handler.rb:120 — Raise on unhandled error instead of swallowing + [P2] user_serializer.rb:14 — Drop internal_id from serialized response + [P3] string_utils.rb:8 — Rename ambiguous helper for clarity + +Filing Linear tickets (2): + [P2] billing_service.rb:230 — N+1 on refund batch (no concrete fix) + [P2] session_helper.rb:12 — Session reset behavior needs discussion + +Skipping (2): + [P2] report_worker.rb:55 — Recommendation is speculative; low confidence + [P3] readme.md:14 — Style preference, subjective +``` + +--- + +## Scope summary wording by path + +- **Routing option B (top-level LFG):** header reads `LFG plan — N findings[ (tracker: )]:`. +- **Routing option C (top-level File tickets):** header reads `File plan — N findings as [TRACKER] tickets:`. Every finding lands in the `Filing [TRACKER] tickets (N):` bucket. +- **Walk-through LFG the rest:** header reads `LFG plan — N remaining findings (K already decided)[ (tracker: )]:`. Already-decided findings from the walk-through are **not** included in the preview or in the bucket counts. The `K already decided` counter communicates that the walk-through was partially completed. + +When the detected tracker is low-confidence or generic (see `tracker-defer.md`), the `(tracker: )` annotation is omitted from the header and the `Filing [TRACKER] tickets` bucket header uses the generic form (`Filing tickets (N):`). + +--- + +## Per-finding line format + +Each line uses the compressed form of the framing-quality bar from the plan (R22-R25 — observable-behavior-first, no function / variable names unless needed to locate). The one-line summary is drawn from the persona-produced `why_it_matters` by taking the first sentence (and, when the first sentence is too long for the preview width, paraphrasing it tightly to fit). + +- **Shape:** `[] :` +- **Width target:** keep lines near 80 columns so the preview renders cleanly in narrow terminals. Truncate with ellipsis when necessary. +- **No function / variable names inline** unless the reader needs them to locate the issue. +- **Advisory bucket phrasing:** the `Acknowledging (N):` bucket describes the advisory content in one line. No "fix" phrase — advisory findings have no concrete fix. + +When no `why_it_matters` is available for a finding (e.g., Unit 2's template upgrade hasn't fully propagated through the persona run, or the artifact file was unreadable), fall back to the finding's title directly. Note the gap in the completion report's Coverage section if it affects more than a few findings in the same run. + +--- + +## Question and options + +After the preview body is rendered, ask the user using the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). In Claude Code, the tool should already be loaded from the Interactive-mode pre-load step — if it isn't, call `ToolSearch` with query `select:AskUserQuestion` now. The text fallback below applies only when that load explicitly fails. + +Stem (adapted to the path): +- For routing B: `The agent is about to apply the plan above. Proceed?` +- For routing C: `The agent is about to file the tickets above. Proceed?` +- For walk-through `LFG the rest`: `The agent is about to resolve the remaining findings above. Proceed?` + +Options (exactly two, in all three cases): +- `Proceed` — execute the plan as shown +- `Cancel` — do nothing, return to the originating question + +Only when `ToolSearch` explicitly returns no match or the tool call errors — or on a platform with no blocking question tool — fall back to presenting numbered options and waiting for the user's next reply. + +--- + +## Cancel semantics + +- **From routing option B Cancel:** return the user to the routing question (the four-option menu). Do not apply any fixes, file any tickets, or record any state. The session's cached tracker-detection tuple is preserved. +- **From routing option C Cancel:** same — return to the routing question, no side effects. +- **From walk-through `LFG the rest` Cancel:** return the user to the current finding's per-finding question (not to the routing question). The walk-through continues from where it was, with prior decisions intact. + +In every case, `Cancel` changes no on-disk or external state. + +--- + +## Proceed semantics + +When the user picks `Proceed`: + +- **Routing option B (top-level LFG):** for each finding in the plan, execute the recommended action. Apply findings go into the Apply set for a single end-of-batch fixer dispatch (see `walkthrough.md` for the Apply batching rules). Defer findings route through `tracker-defer.md`. Skip / Acknowledge findings are recorded as no-action. After all actions complete, emit the unified completion report (see `walkthrough.md`). +- **Routing option C (top-level File tickets):** every finding routes through `tracker-defer.md` for ticket creation. No fixes are applied. After all tickets have been filed (or failed), emit the unified completion report. +- **Walk-through `LFG the rest`:** same as routing option B, but scoped to the findings the user hadn't decided on. Apply findings join the in-memory Apply set with the ones the user already picked during the walk-through; all dispatch together in the single end-of-walk-through fixer call. + +Failure during `Proceed` (e.g., ticket creation fails for one finding during a batch Defer) follows the failure path defined in `tracker-defer.md` — surface the failure inline with Retry / Fallback / Skip, continue with the rest of the plan, and capture the failure in the completion report's failure section. + +--- + +## Edge cases + +- **Zero findings in a bucket:** omit the bucket header. A preview with only Apply and Skip does not show an empty `Filing tickets (0):` or `Acknowledging (0):` line. +- **All findings in one bucket:** preview still shows the bucket header; Proceed / Cancel still offered. This is the common case for routing option C (every finding under `Filing tickets`). +- **N=1 preview (only one finding in scope):** the preview still uses the grouped format, just with a single-line bucket. `Proceed` / `Cancel` still apply. +- **No tracker available:** option C is not offered upstream (see `tracker-defer.md` no-sink handling). LFG (option B) and walk-through `LFG the rest` can still run — they may contain per-finding Defer recommendations from Stage 5. Before rendering any LFG-shaped preview, downgrade every Defer recommendation to Skip when the session's cached `any_sink_available` is false, and surface the downgrade on the preview itself (e.g., a `Skipping — defer sink unavailable (N):` bucket, or a note in the header: `N Defer recommendations downgraded to Skip — no tracker sink`). This is a preview-time runtime step, not Stage 5 tie-breaking — step 7b only orders conflicting reviewer recommendations (`Skip > Defer > Apply > Acknowledge`, as defined in `SKILL.md` Stage 5 step 7b) and has no knowledge of sink availability. +- **Walk-through `LFG the rest` with zero remaining findings:** the walk-through's own logic suppresses `LFG the rest` as an option when N=1 and otherwise, so the preview should never be invoked with zero remaining findings. If it is, render `LFG plan — 0 remaining findings` and fall through to Proceed with no-op. diff --git a/plugins/compound-engineering/skills/ce-review/references/subagent-template.md b/plugins/compound-engineering/skills/ce-review/references/subagent-template.md index ad3f39d..7d2e8a1 100644 --- a/plugins/compound-engineering/skills/ce-review/references/subagent-template.md +++ b/plugins/compound-engineering/skills/ce-review/references/subagent-template.md @@ -48,6 +48,30 @@ Confidence rubric (0.0-1.0 scale): Suppress threshold: 0.60. Do not emit findings below 0.60 confidence (except P0 at 0.50+). +Writing `why_it_matters` (required field, every finding): + +The `why_it_matters` field is how the reader — a developer triaging findings, a ticket-body reader months later, or a downstream automated surface — understands the problem without re-reading the file. Treat it as the most important prose field in your output; every downstream surface (walk-through questions, bulk-action previews, ticket bodies, headless output) depends on it being good. + +- **Lead with observable behavior.** Describe what the bug does from the outside — what a user, attacker, operator, or downstream caller experiences. Do not lead with code structure ("The function X does Y..."). Start with the effect ("Any signed-in user can read another user's orders..."). Function and variable names appear later, only when the reader needs them to locate the issue. +- **Explain why the fix resolves the problem.** If you include a `suggested_fix`, the `why_it_matters` should make clear why that specific fix addresses the root cause. When a similar pattern exists elsewhere in the codebase (an existing guard, an established convention, a parallel handler), reference it so the recommendation is grounded in the project's own conventions rather than theoretical best practice. +- **Keep it tight.** Approximately 2-4 sentences plus the minimum code quoted inline to ground the point. Longer framings are a regression — downstream surfaces have narrow display budgets, and verbose `why_it_matters` content gets truncated or skimmed. +- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging (confidence >= 0.60), you can explain it — the field exists because every finding needs a reason. + +Illustrative pair — same finding, weak vs. strong framing: + +``` +WEAK (code-citation first; fails the observable-behavior rule): + orders_controller.rb:42 has a missing authorization check. + Add current_user.owns?(account) guard before the query. + +STRONG (observable behavior first, grounded fix reasoning): + Any signed-in user can read another user's orders by pasting the + target account ID into the URL. The controller looks up the account + and returns its orders without verifying the current user owns it. + Adding a one-line ownership guard before the lookup matches the + pattern already used in the shipments controller for the same attack. +``` + False-positive categories to actively suppress: - Pre-existing issues unrelated to this diff (mark pre_existing: true for unchanged code the diff does not interact with; if the diff makes it newly relevant, it is secondary, not pre-existing) - Pedantic style nitpicks that a linter/formatter would catch diff --git a/plugins/compound-engineering/skills/ce-review/references/tracker-defer.md b/plugins/compound-engineering/skills/ce-review/references/tracker-defer.md new file mode 100644 index 0000000..984884e --- /dev/null +++ b/plugins/compound-engineering/skills/ce-review/references/tracker-defer.md @@ -0,0 +1,138 @@ +# Tracker Detection and Defer Execution + +This reference covers how Interactive mode's Defer actions file tickets in the project's tracker. It is loaded by `SKILL.md` when the routing question needs to decide whether to offer option C (File tickets), when the walk-through's Defer option executes, and when the bulk-preview of option C (File tickets per finding) is shown. + +Interactive mode only. Autofix, Report-only, and Headless modes do not use this reference. + +--- + +## Detection + +The agent determines the project's tracker from whatever documentation is obvious. Primary sources: `CLAUDE.md` and `AGENTS.md` at the repo root and in relevant subdirectories. Supplementary signals (when primary documentation is ambiguous): `CONTRIBUTING.md`, `README.md`, PR templates under `.github/`, visible tracker URLs in the repo. + +A tracker can be surfaced via MCP tool (e.g., a Linear MCP server), CLI (e.g., `gh`), or direct API. All are acceptable. The detection output is a tuple with two availability flags — one for the named tracker specifically (drives label confidence) and one for the full fallback chain (drives whether Defer is offered at all): + +``` +{ tracker_name, confidence, named_sink_available, any_sink_available } +``` + +Where: +- `tracker_name` — human-readable name ("Linear", "GitHub Issues", "Jira"), or `null` when detection cannot identify a specific tracker +- `confidence` — `high` when the tracker is named explicitly in documentation (or via a linked URL to a specific project/workspace) and is unambiguously the project's canonical tracker; `low` when the signal is thin, conflicting, or implied only +- `named_sink_available` — `true` only when the agent can actually invoke the detected tracker (MCP tool is loaded, CLI is authenticated, or API credentials are in environment); `false` when the tracker is documented but no tool reaches it, or when no tracker is found at all. Drives label confidence: inline tracker naming requires this to be `true`. +- `any_sink_available` — `true` when any tier in the fallback chain (named tracker, GitHub Issues via `gh`, or harness task-tracking primitive) can be invoked this session. Drives whether Defer is offered: no-sink behavior fires only when this is `false`. + +Detection is reasoning-based. Do not maintain an enumerated checklist of files to read. Read the obvious sources and form a confident conclusion; when the obvious sources don't resolve, the label falls back to generic wording and the agent confirms with the user before executing. + +--- + +## Probe timing and caching + +Availability probes run **at most once per session** and **only when the routing question is about to be asked**. Never speculatively at review start, never per-Defer, never per-walk-through-finding. The cached tuple is reused for every Defer action in the same run. + +Typical probe sequence: + +1. Read `CLAUDE.md` / `AGENTS.md` for tracker references. If nothing found, set `tracker_name = null`, `confidence = low`. +2. **Probe the named tracker when one was found.** For GitHub Issues, run `gh auth status` and `gh repo view --json hasIssuesEnabled`. For Linear or other MCP-backed trackers, verify the relevant MCP tool is loaded and responsive. For API-backed trackers, verify credentials in environment. Set `named_sink_available` from the probe result. +3. **Probe the fallback tiers to compute `any_sink_available`.** Even when the named tracker was found and probed, the fallback tiers matter for the "no-sink" decision so that a run with no documented tracker but working `gh` still offers Defer. Stop at the first working tier: + - If `named_sink_available = true`: `any_sink_available = true` (no further probes needed). + - Otherwise, probe GitHub Issues via `gh auth status` + `gh repo view --json hasIssuesEnabled` (skip if already probed in step 2). If it works, `any_sink_available = true`. + - Otherwise, check the harness task-tracking primitive. `TaskCreate` / `update_plan` are typically always present when the skill runs inside their harness — treat as available unless the session is in a context that explicitly forbids it (e.g., converted targets without task binding). + - If every tier fails, `any_sink_available = false`. + +When the routing question is skipped entirely (R2 zero-findings case), no probes run. When the cached tuple is reused across a session, any `named_sink_available = true` from the session's first probe stays cached — do not re-probe per Defer. + +--- + +## Label logic + +- When `confidence = high` AND `named_sink_available = true`: the routing question's option C and the walk-through's per-finding Defer option both include the tracker name verbatim. Example: `File a Linear ticket per finding`, `Defer — file a Linear ticket`. +- When `any_sink_available = true` but either `confidence = low` or `named_sink_available = false` (a fallback tier is working instead): the labels read generically — `File an issue per finding`, `Defer — file a ticket`. Before executing the first Defer of the session, the agent confirms the effective tracker choice with the user using the platform's blocking question tool. +- When `any_sink_available = false`: option C is omitted from the routing question, option B (Defer) is omitted from the walk-through per-finding options, and the agent tells the user why in the routing question's stem. + +--- + +## Fallback chain + +When the named tracker is unavailable or no tracker is named, fall back in this order. Prefer durable external trackers over in-session-only primitives. + +1. **Named tracker** (MCP tool, CLI, or API the agent can invoke directly) +2. **GitHub Issues via `gh`** — when `gh auth status` succeeds and the current repo has issues enabled (`gh repo view --json hasIssuesEnabled` returns `true`) +3. **Harness task-tracking primitive** — `TaskCreate` in Claude Code, `update_plan` in Codex, or the equivalent on other target platforms — used as a last resort and only after a once-per-session durability confirmation (below) + +Never fall back to `.context/compound-engineering/todos/`. The internal-todos system is on a deprecation path (see plan scope boundaries) and must not be extended by this Defer path. + +--- + +## Once-per-session harness-fallback confirmation + +When the fallback to harness task-tracking primitive is in effect, and before the first Defer action of the session executes, the agent asks the user once using the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). In Claude Code, `AskUserQuestion` is a deferred tool — before the first call this session, load its schema via `ToolSearch` with query `select:AskUserQuestion`. + +> No documented tracker was found and `gh` is not available. Defer actions will create in-session tasks that do not survive past this session. Proceed for this and subsequent Defer actions? + +Options: +- `Proceed with in-session tasks` — the agent continues with harness task creation for every Defer in this run +- `Cancel — leave findings as residual in the report` — the agent converts all pending Defers to Skip with a note, and surfaces the findings in the completion report's residual-work section + +The confirmation is cached for the session. Subsequent Defer actions do not re-prompt. + +Only when `ToolSearch` explicitly returns no match or the tool call errors — or on a platform with no blocking question tool — fall back to numbered options and waiting for the user's reply. + +--- + +## Ticket composition + +Every Defer action creates a ticket with the following content, adapted to the tracker's capabilities: + +- **Title:** the merged finding's `title` (schema-capped at 10 words). +- **Body:** + - Plain-English problem statement — reads the persona-produced `why_it_matters` from the contributing reviewer's artifact file at `.context/compound-engineering/ce-review//{reviewer}.json`, using the same `file + line_bucket(line, +/-3) + normalize(title)` matching headless mode uses (see SKILL.md Stage 6 detail enrichment). Falls back to the merged finding's `title`, `severity`, `file`, and `suggested_fix` (when present) when no artifact match is available — these fields are guaranteed in the merge-tier compact return. + - Suggested fix (when present in the finding's `suggested_fix`). + - Evidence (direct quotes from the reviewer's artifact). + - Metadata block: `Severity: `, `Confidence: `, `Reviewer(s): `, `Finding ID: `. +- **Labels** (when the tracker supports labels): severity tag (`P0`, `P1`, `P2`, `P3`) and, when the tracker convention supports it, a category label sourced from the reviewer name. +- **Length cap:** when the composed body would exceed a tracker's body length limit, truncate with `... (continued in ce-review run artifact: .context/compound-engineering/ce-review//)` and include the finding_id in both the truncated body and the metadata block so the artifact is discoverable. + +The finding_id is a stable fingerprint composed as `normalize(file) + line_bucket(line, +/-3) + normalize(title)` — the same fingerprint used by the merge pipeline. + +--- + +## Failure path + +When ticket creation fails at execution (API error, auth expiry mid-session, rate limit, malformed body rejected, 4xx/5xx response), the agent surfaces the failure inline and asks the user using the platform's blocking question tool: + +Stem: +> Defer failed: returned . How should the agent handle this finding? + +Options: +- `Retry on ` — re-attempt the same tracker once more (useful for transient errors) +- `Fall back to next sink` — move this finding's Defer to the next tier in the fallback chain (e.g., from Linear to GitHub Issues, or from GitHub Issues to harness task primitive) +- `Convert to Skip — record the failure` — abandon this Defer, note the failure in the completion report's failure section, and continue the walk-through or bulk flow + +When a high-confidence named tracker fails at execution, the cached `named_sink_available` is set to `false` for the rest of the session. Subsequent Defer actions fall straight through to the next tier without retrying a confirmed-broken sink. `any_sink_available` is only downgraded to `false` when every tier has been confirmed broken — a failed Linear call that succeeds via `gh` keeps `any_sink_available = true`. + +Only when `ToolSearch` explicitly returns no match or the tool call errors — or on a platform with no blocking question tool — fall back to numbered options and waiting for the user's reply. + +--- + +## Per-tracker behavior + +Concrete behavior per tracker at execution time. The agent may invoke any of these through the appropriate interface (MCP, CLI, or API) — the choice depends on what is available in the current environment. + +| Tracker | Interface | Invocation sketch | Body format | Labels | +|---------|-----------|-------------------|-------------|--------| +| Linear | MCP (preferred) or API | Create issue in the project/workspace identified by documentation; assign to the reporter if the MCP tool exposes user context | Markdown | Severity priority field if the MCP exposes it; otherwise include severity in body | +| GitHub Issues | `gh issue create` | Repo defaults to the current repo. Use `--label` for severity tag when labels exist; omit `--label` if the repo has no label fixture. Fall back to a label-less issue on first failure. | Markdown | `--label P0` / `--label P1` / etc. when labels exist | +| Jira | MCP or API | Create issue in the project identified by documentation; Jira's markdown dialect differs from GitHub's — use plain text in the body when MCP does not handle conversion | Plain text when MCP does not handle markdown | Severity priority field | +| Harness task primitive (last resort) | `TaskCreate` / `update_plan` / platform equivalent | Create one task per finding with subject = title and description = compact version of the body. No labels. Warn the user that tasks will not survive past the session (see once-per-session confirmation above). | Plain text, compact | None | +| No sink available | — | Defer option is omitted; findings remain in the report's residual-work section | — | — | + +When uncertain, prefer "drop with explicit user-facing notice" over "pass through silently and hope." A Defer that produces no durable artifact and no user message is data loss. + +--- + +## Cross-platform notes + +The question-tool name varies by platform. Use the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). In Claude Code the tool should already be loaded from the Interactive-mode pre-load step — if it isn't, call `ToolSearch` with query `select:AskUserQuestion` now. Only when that load explicitly fails, or on a platform with no blocking tool, fall back to numbered options and waiting for the user's next reply before proceeding. + +The fallback chain's final tier (harness task-tracking primitive) does not exist on every target platform. When converted for a platform that has no equivalent of `TaskCreate` / `update_plan`, the agent should treat that platform as "no harness sink" and move directly to the no-sink behavior (omit Defer from menus and tell the user why). diff --git a/plugins/compound-engineering/skills/ce-review/references/walkthrough.md b/plugins/compound-engineering/skills/ce-review/references/walkthrough.md new file mode 100644 index 0000000..3872119 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-review/references/walkthrough.md @@ -0,0 +1,246 @@ +# Per-finding Walk-through + +This reference defines Interactive mode's per-finding walk-through — the path the user enters by picking option A (`Review each finding one by one — accept the recommendation or choose another action`) from the routing question. It also covers the unified completion report that every terminal path (walk-through, LFG, File tickets, zero findings) emits. + +Interactive mode only. + +--- + +## Entry + +The walk-through receives, from the orchestrator: + +- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to `gated_auto` and `manual` findings that survived the Stage 5 confidence gate. Advisory findings are included when they were surfaced to this phase (advisory findings normally live in the report-only queue, but when the review flow routes them here for acknowledgment they take the advisory variant below). +- The cached tracker-detection tuple from `tracker-defer.md` (`{ tracker_name, confidence, named_sink_available, any_sink_available }`). `any_sink_available` determines whether the Defer option is offered; `named_sink_available` + `confidence` determine whether the label names the tracker inline. +- The run id for artifact lookups. + +Each finding's recommended action has already been normalized by Stage 5 (step 7b — tie-break on action). The walk-through surfaces that recommendation to the user but does not recompute it. + +--- + +## Per-finding presentation + +Each finding is presented in two parts: a **terminal output block** carrying the explanation, and a **question** via the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini) carrying the decision. Never merge the two — the terminal block uses markdown; the question uses plain text. + +In Claude Code the tool should already be loaded from the Interactive-mode pre-load step in `SKILL.md` — if it isn't, call `ToolSearch` with query `select:AskUserQuestion` now. Rendering the per-finding question as narrative text is a bug, not a valid fallback. + +### Terminal output block (print before firing the question) + +Render as markdown. Labels on their own line, blank lines between sections: + +``` +## Finding {N} of {M} — {severity} {plain-English title} + +{file}:{line} + +**What's wrong** + +{plain-English problem statement from why_it_matters} + +**Proposed fix** + +{suggested_fix — rendered per the substitution rules below: prose-first, intent-language} + +**Why it works** + +{short reasoning, grounded in a codebase pattern when available} + +{R15 conflict context line, when applicable} +``` + +Substitutions: + +- **`{plain-English title}`:** a 3-8 word summary suitable as a heading. Derived from the merged finding's `title` field but rephrased so it reads as observable behavior (e.g., "Path traversal in loadUserFromCache" rather than "Missing userId validation on line 36"). +- **`why_it_matters`:** read the contributing reviewer's artifact file at `.context/compound-engineering/ce-review/{run_id}/{reviewer_name}.json` using the same `file + line_bucket(line, +/-3) + normalize(title)` matching that headless mode uses (see `SKILL.md` Stage 6 detail enrichment). When multiple reviewers flagged the merged finding, try them in the order they appear in the merged finding's reviewer list. Use the first match. +- **`suggested_fix`:** from the merged finding's `suggested_fix` field. Render as prose describing **intent**, not as syntax. The fixer subagent owns the exact code — the walk-through just needs enough for the user to trust or reject the action. Rules: + - **Default — one sentence describing the effect.** What does the fix achieve, and where does it live? Prefer intent language over quoted code. + - ✅ `Throw on non-2xx response before parsing JSON.` + - ✅ `` Replace `==` with `===` on line 42. `` + - ✅ `` Add a `response.ok` check after the fetch and throw on non-2xx. `` + - ✅ `Extract the request-building logic into a helper and call it from both sites.` + - ❌ `` Add `if (!response.ok) throw new Error(`HTTP ${response.status}`);` after the `await fetch(...)` call, before `response.json()`. `` — nested backticks, multiple code spans, full statement quoted; renders broken in terminal. + - **Code-span budget: at most 2 inline backtick spans per sentence, each a single identifier, operator, or short phrase** (e.g., `` `response.ok` ``, `` `===` ``, `` `fetchUserById` ``). Never embed full statements, template literals, or code requiring nested backticks. If the intent can't be stated within that budget, the prose is too close to syntax — restate at a higher level, or switch to summary + artifact pointer. + - **Always leave a space before and after every backtick span.** Without it, the terminal's markdown renderer eats the delimiters and runs the words together. + - **Raw code block — only for short (≤5 line) genuinely additive new code** where no before-state exists (new file, new function, new guard at the top of an empty body). Above 5 lines, switch to summary + pointer. + - **Summary + artifact pointer** — when prose can't capture the fix: one-sentence transformation + key symbol/location + `Full fix: .context/compound-engineering/ce-review/{run_id}/{reviewer_name}.json → findings[].suggested_fix`. + - **No diff blocks.** Modifications to existing code render as prose. +- **`Why it works`:** grounded reasoning that, where possible, references a similar pattern already used elsewhere in the codebase (e.g., "matches the format-validation pattern already used at src/cli/io.ts:41"). One to three sentences. +- **R15 conflict context line (when applicable):** when contributing reviewers implied different actions for this finding and Stage 5 step 7b broke the tie, surface that briefly. Example: `Correctness recommends Apply; Testing recommends Skip (low confidence). Agent's recommendation: Skip.` The orchestrator's recommendation — the post-tie-break value — is what the menu labels "recommended." + +When no artifact match exists for the finding (merge-synthesized finding, or the persona's artifact write failed), the terminal block degrades to the heading + `suggested_fix` only (omit the `What's wrong` and `Why it works` sections) and records the gap for the Coverage section of the completion report. + +### Question stem (short, decision-focused) + +After the terminal block renders, fire the platform's blocking question tool with a compact two-line stem: + +``` +Finding {N} of {M} — {severity} {short handle}. +{Action framing in a phrase}? +``` + +Where: + +- **Short handle:** matches the `{plain-English title}` from the terminal block heading. +- **Action framing:** one phrase describing what the *single recommended action* does, as a yes/no question. Examples: `Apply the format-validation + path.resolve guard?`, `Skip the fix since the fixture is being deleted?`, `Defer and file a rotation ticket?`. + +Never enumerate alternatives in the stem. One recommendation as a yes/no — the option list carries the alternatives. When the recommendation is close, surface the disagreement in the R15 conflict context line, not as a multi-option stem. + +Example (recommendation = Apply): + +``` +Finding 3 of 8 — P1 path traversal in loadUserFromCache. +Apply the format-validation + path.resolve guard? +``` + +Example (recommendation = Skip because content context overrides default): + +``` +Finding 1 of 9 — P0 hardcoded admin token. +Skip the fix since the fixture is being deleted? +(Security recommends Apply; file context recommends Skip. Agent's recommendation: Skip.) +``` + +Never embed code blocks, diff syntax, or the full fix/reasoning in the stem. + +### Confirmation between findings + +After the user answers and before printing the next finding's terminal block, emit a one-line confirmation of the action taken. Examples: `→ Applied. Fix staged at src/utils/api-client.ts:36-37.`, `→ Deferred. Ticket filed: .`, `→ Skipped.`, `→ Acknowledged.` + +### Options (four, or adapted as noted) + +Fixed order. Never reorder: + +``` +1. Apply the proposed fix +2. Defer — file a [TRACKER] ticket +3. Skip — don't apply, don't track +4. LFG the rest — apply the agent's best judgment to this and remaining findings +``` + +Render the `[TRACKER]` label per `tracker-defer.md`: when `confidence = high` AND `named_sink_available = true`, replace `[TRACKER]` with the concrete tracker name (e.g., `Defer — file a Linear ticket`). When `any_sink_available = true` but either `confidence = low` or `named_sink_available = false`, use the generic whole label `Defer — file a ticket` — whole-label substitution, not a `[TRACKER]` token swap. + +**Mark the post-tie-break recommendation with `(recommended)` on its option label.** Required, not optional. Any of the four options can carry it: + +``` +1. Apply the proposed fix (recommended) +2. Defer — file a ticket +3. Skip — don't apply, don't track +4. LFG the rest +``` + +``` +1. Apply the proposed fix +2. Defer — file a ticket +3. Skip — don't apply, don't track (recommended) +4. LFG the rest +``` + +When reviewers disagreed or content context cuts against the default, still mark one option — whichever Stage 5 step 7b produced — and surface the disagreement in the R15 conflict context line. + +### Adaptations + +- **Advisory-only finding:** when the finding's `autofix_class` is `advisory` (no actionable fix), option A is replaced with `Acknowledge — mark as reviewed`. The other three options remain. The advisory variant is the only case where `Acknowledge` appears in the menu. +- **N=1 (exactly one pending finding):** the terminal block's heading omits `Finding N of M` and renders as `## {severity} {plain-English title}`. The stem's first line drops the position counter, becoming `{severity} {short handle}.` Option D (`LFG the rest`) is suppressed because no subsequent findings exist — the menu shows three options: Apply / Defer / Skip (or Acknowledge, for advisory). +- **No-sink (Defer option unavailable):** when the tracker-detection tuple reports `any_sink_available: false` (every tier in the fallback chain — named tracker, GitHub Issues, harness primitive — is unreachable), option B (`Defer`) is omitted. The stem appends one line explaining why (e.g., `Defer unavailable on this platform — no tracker or task-tracking primitive detected.`). The menu shows three options: Apply / Skip / LFG the rest (and Acknowledge in place of Apply for advisory-only findings). **Before rendering the options, remap any per-finding `Defer` recommendation produced by Stage 5 step 7b to `Skip`** so the `(recommended)` marker always lands on an option that is actually in the menu. When the remap fires, surface it on the R15 conflict context line (e.g., `Stage 5 recommended Defer; downgraded to Skip — no tracker sink available.`). This is a render-time runtime step mirroring the Defer→Skip downgrade that `bulk-preview.md` performs for LFG previews; Stage 5 step 7b has no knowledge of sink availability and only orders conflicting reviewer recommendations. +- **Combined N=1 + no-sink:** the menu shows two options: Apply / Skip (or Acknowledge / Skip). + +Only when `ToolSearch` explicitly returns no match or the tool call errors — or on a platform with no blocking question tool — fall back to presenting the options as a numbered list and waiting for the user's next reply. + +--- + +## Per-finding routing + +For each finding's answer: + +- **Apply the proposed fix** — add the finding's id to an in-memory Apply set. Advance to the next finding. Do not dispatch the fixer inline — Apply accumulates for end-of-walk-through batch dispatch. +- **Acknowledge — mark as reviewed** (advisory variant) — record Acknowledge in the in-memory decision list. Advance to the next finding. No side effects. +- **Defer — file a [TRACKER] ticket** — invoke the tracker-defer flow from `tracker-defer.md`. The walk-through's position indicator stays on the current finding during any failure-path sub-question (Retry / Fall back / Convert to Skip). On success, record the tracker URL / reference in the in-memory decision list and advance. On conversion-to-Skip from the failure path, advance with the failure noted in the completion report. +- **Skip — don't apply, don't track** — record Skip in the in-memory decision list. Advance. No side effects. +- **LFG the rest — apply the agent's best judgment to this and remaining findings** — exit the walk-through loop. Dispatch the bulk preview from `bulk-preview.md`, scoped to the current finding and everything not yet decided. The preview header reports the count of already-decided findings ("K already decided"). If the user picks `Cancel` from the preview, return to the current finding's per-finding question (not to the routing question). If the user picks `Proceed`, execute the plan per `bulk-preview.md` — Apply findings join the in-memory Apply set with the ones the user already picked, Defer findings route through `tracker-defer.md`, Skip / Acknowledge no-op — then proceed to end-of-walk-through dispatch. + +--- + +## Override rule + +"Override" means the user picks a different preset action (Defer or Skip in place of Apply, or Apply in place of the agent's recommendation). No inline freeform custom-fix authoring — the walk-through is a decision loop, not a pair-programming surface. A user who wants a variant of the proposed fix picks Skip and hand-edits outside the flow; if they also want the finding tracked, they file a ticket manually. This trade is explicit in v1's scope boundaries. + +--- + +## State + +Walk-through state is **in-memory only**. The orchestrator maintains: + +- An Apply set (finding ids the user picked Apply on) +- A decision list (every answered finding with its action and any metadata like `tracker_url` for Deferred or `reason` for Skipped) +- The current position in the findings list + +Nothing is written to disk per-decision. An interrupted walk-through (user cancels the prompt, session compacts, network dies) discards all in-memory state. Defer actions that already executed remain in the tracker — those are external side effects and cannot be rolled back. Apply decisions have not been dispatched yet (they batch at end-of-walk-through), so they are cleanly lost with no code changes. + +Formal cross-session resumption is out of scope for v1. + +--- + +## End-of-walk-through dispatch + +After the loop terminates — either every finding has been answered, or the user took `LFG the rest → Proceed` — the walk-through hands off to the dispatch phase: + +1. **Apply set:** spawn one fixer subagent for the full accumulated Apply set. The fixer receives the set as its input queue and applies all changes in one pass against the current working tree. This preserves the existing "one fixer, consistent tree" mechanic and gives the fixer the full set at once to handle inter-fix dependencies (two Applies touching overlapping regions). The existing Step 3 fixer prompt needs a small update to acknowledge this queue may be heterogeneous (`gated_auto` and `manual` mix, not just `safe_auto`) — authored alongside this reference. +2. **Defer set:** already executed inline during the walk-through. Nothing to dispatch here. +3. **Skip / Acknowledge:** no-op. + +After dispatch completes (or after `LFG the rest → Cancel` followed by the user working through remaining findings one at a time, or after the loop runs to completion), emit the unified completion report described below. + +--- + +## Unified completion report + +Every terminal path of Interactive mode emits the same completion report structure. This covers: + +- Walk-through completed (all findings answered) +- Walk-through bailed via `LFG the rest → Proceed` +- Top-level LFG (routing option B) completed +- Top-level File tickets (routing option C) completed +- Zero findings after `safe_auto` (routing question was skipped — the completion summary is a one-line degenerate case of this structure) + +### Minimum required fields (per R12) + +- **Per-finding entries:** for every finding the flow touched, a line with — at minimum — title, severity, the action taken (Applied / Deferred / Skipped / Acknowledged), the tracker URL or in-session task reference for Deferred entries, and a one-line reason for Skipped entries (grounded in the finding's confidence or the one-line `why_it_matters` snippet). +- **Summary counts by action:** totals per bucket (e.g., `4 applied, 2 deferred, 2 skipped`). +- **Failures called out explicitly:** any fix application that failed, any ticket creation that failed (with the reason returned by the tracker). Failures are surfaced above the per-finding list so they are not missed. +- **End-of-review verdict:** the existing Stage 6 verdict (Ready to merge / Ready with fixes / Not ready), computed from the residual state after all actions complete. + +### Coverage section + +Carry forward the existing Coverage data (suppressed-finding count, residual risks, testing gaps, failed reviewers) and add one new element: + +- **Framing-enrichment gaps:** count of findings where artifact lookup returned no match (merge-synthesized findings, or failed persona artifact writes). Name the personas contributing those gaps so the data feeds any future persona-upgrade decision. A trail of gaps per run tells the team which persona agents still need attention. + +### Report ordering + +The report appears after all execution completes. Ordering inside the report: failures first (above the per-finding list), then per-finding entries grouped by action bucket in the order `Applied / Deferred / Skipped / Acknowledged`, then summary counts, then Coverage, then the verdict. + +### Zero-findings degenerate case + +When the routing question was skipped because no `gated_auto` / `manual` findings remained after `safe_auto`, the completion report collapses to its summary-counts + verdict form with one added line — the count of `safe_auto` fixes applied. The summary wording mirrors `SKILL.md` Step 2 Interactive mode's zero-remaining case: the unqualified `All findings resolved` form is only accurate when no advisory or pre-existing findings remain. When advisory and/or pre-existing findings remain in the report, use the qualified form that names what was cleared and names what still remains. Examples: + +No remaining advisory or pre-existing findings: + +``` +All findings resolved — 3 safe_auto fixes applied. + +Verdict: Ready with fixes. +``` + +Advisory and/or pre-existing findings remain in the report: + +``` +All actionable findings resolved — 3 safe_auto fixes applied. (2 advisory, 1 pre-existing findings remain in the report.) + +Verdict: Ready with fixes. +``` + +--- + +## Execution posture + +The walk-through is operationally read-only except for two permitted writes: the in-memory Apply set / decision list (managed by the orchestrator) and the tracker-defer dispatch (external ticket creation, described in `tracker-defer.md`). Persona agents remain strictly read-only. The end-of-walk-through fixer dispatch is the single point where file modifications happen — governed by the existing Step 3 fixer contract in `SKILL.md`. diff --git a/tests/review-skill-contract.test.ts b/tests/review-skill-contract.test.ts index dd88e17..69cd928 100644 --- a/tests/review-skill-contract.test.ts +++ b/tests/review-skill-contract.test.ts @@ -72,18 +72,80 @@ describe("ce-review contract", () => { test("documents policy-driven routing and residual handoff", async () => { const content = await readRepoFile("plugins/compound-engineering/skills/ce-review/SKILL.md") + // Routing taxonomy and fixer queue semantics expect(content).toContain("## Action Routing") expect(content).toContain("Only `safe_auto -> review-fixer` enters the in-skill fixer queue automatically.") - expect(content).toContain( - "Only include `gated_auto` findings in the fixer queue after the user explicitly approves the specific items.", - ) - expect(content).toContain( - "If no `gated_auto` or `manual` findings remain after safe fixes, skip the policy question entirely", - ) + + // Interactive mode four-option routing structure: each distinguishing word must appear + // as a routing-option label so truncation-safe menus stay intact. + // Assert presence rather than exact copy — wording can be improved without breaking the test. + expect(content).toMatch(/\(A\)\s*`Review each finding one by one/) + expect(content).toMatch(/\(B\)\s*`LFG\./) + expect(content).toMatch(/\(C\)\s*`File a \[TRACKER\] ticket/) + expect(content).toMatch(/\(D\)\s*`Report only/) + + // The new routing question dispatches to focused reference files, not inline prose. + expect(content).toContain("references/walkthrough.md") + expect(content).toContain("references/bulk-preview.md") + expect(content).toContain("references/tracker-defer.md") + + // Stem is third-person (AGENTS.md:127 — no first-person "I" / "me" in the new routing question). + // The Interactive branch of After Review Step 2 must not reintroduce the removed bucket-policy wording. + expect(content).not.toContain("What should I do with the remaining findings?") + expect(content).not.toContain("What should I do?") + + // Zero-remaining case: routing question is skipped with a completion summary. + expect(content).toMatch(/skip the routing question entirely/i) + + // Stage 5 tie-breaking rule — the walk-through's recommendation is deterministic. + expect(content).toMatch(/Skip\s*>\s*Defer\s*>\s*Apply/) + + // Autofix-mode residual todo handoff is preserved (mode isolation). expect(content).toContain( "In autofix mode, create durable todo files only for unresolved actionable findings whose final owner is `downstream-resolver`.", ) expect(content).toContain("If only advisory outputs remain, create no todos.") + + // Tracker fallback chain explicitly forbids extending the internal todos system. + const trackerDefer = await readRepoFile( + "plugins/compound-engineering/skills/ce-review/references/tracker-defer.md", + ) + expect(trackerDefer).toContain(".context/compound-engineering/todos/") + expect(trackerDefer).toMatch(/Never fall back to `\.context\/compound-engineering\/todos\//) + + // Subagent template carries the why_it_matters framing guidance that replaces the + // rejected synthesis-time rewrite pass. Assert presence of the observable-behavior + // rule and the required-field reminder without pinning exact prose. + const subagentTemplate = await readRepoFile( + "plugins/compound-engineering/skills/ce-review/references/subagent-template.md", + ) + expect(subagentTemplate).toMatch(/observable behavior/i) + expect(subagentTemplate).toMatch(/required/i) + + // walkthrough.md carries the four per-finding option labels (Apply / Defer / Skip / + // LFG the rest). Assert presence of each distinguishing word so renaming an option + // breaks the test. Exact label wording may be refined for clarity — these assertions + // check the structural contract, not the prose. + const walkthrough = await readRepoFile( + "plugins/compound-engineering/skills/ce-review/references/walkthrough.md", + ) + expect(walkthrough).toContain("Apply the proposed fix") + expect(walkthrough).toContain("Defer — file a [TRACKER] ticket") + expect(walkthrough).toContain("Skip — don't apply, don't track") + expect(walkthrough).toMatch(/LFG the rest/) + + // bulk-preview.md contract: exactly Proceed / Cancel, no third option. + const bulkPreview = await readRepoFile( + "plugins/compound-engineering/skills/ce-review/references/bulk-preview.md", + ) + expect(bulkPreview).toContain("Proceed") + expect(bulkPreview).toContain("Cancel") + + // Step 5 final-next-steps flow is gated on fixes-applied count, not routing option. + expect(content).toContain("fixes_applied_count") + expect(content).toMatch(/Step 5 runs only when `fixes_applied_count > 0`/i) + + // Final-next-steps wording preserved. expect(content).toContain("**On the resolved review base/default branch:**") expect(content).toContain("git push --set-upstream origin HEAD") expect(content).not.toContain("**On main/master:**")