feat(ce-plan): reduce token usage by extracting conditional references (#489)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -93,6 +93,10 @@ When adding or modifying skills, verify compliance with the skill spec:

This resolves relative to the SKILL.md and substitutes content before the model sees it. If a file is over ~150 lines, prefer a backtick path even if it is always needed.
- [ ] For files the agent needs to *execute* (scripts, shell templates), always use backtick paths -- `@` would inline the script as text content instead of keeping it as an executable file

### Conditional and Late-Sequence Extraction

Skill content loaded at trigger time is carried in every subsequent message — every tool call, agent dispatch, and response. This carrying cost compounds across the session. For skills that orchestrate many tool or agent calls, extract blocks to `references/` when they are conditional (only execute under specific conditions) or late-sequence (only needed after many prior calls) and represent a meaningful share of the skill (~20%+). The more tool/agent calls a skill makes, the more aggressively to extract. Replace extracted blocks with a 1-3 line stub stating the condition and a backtick path reference (e.g., "Read `references/deepening-workflow.md`"). Never use `@` for extracted blocks — it inlines content at load time, defeating the extraction.
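
An extracted conditional block collapses to a stub shaped like the following (the condition wording and file path here are illustrative, not prescribed):

```text
##### Deepening Execution

When deepening is warranted, read `references/deepening-workflow.md`
and execute its steps, then return here for the next phase.
```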

### Writing Style

- [ ] Use imperative/infinitive form (verb-first instructions)
@@ -587,35 +587,7 @@ For larger `Deep` plans, extend the core template only when useful with sections

#### 4.4 Visual Communication in Plan Documents

Section 3.4 covers diagrams about the *solution being planned* (pseudo-code, mermaid sequences, state diagrams). The existing Section 4.3 mermaid rule encourages those solution-design diagrams within Technical Design and per-unit fields. This guidance covers a different concern: visual aids that help readers *navigate and comprehend the plan document itself* -- dependency graphs, interaction diagrams, and comparison tables that make plan structure scannable.

Visual aids are conditional on content patterns, not on plan depth classification -- a Lightweight plan about a complex multi-unit workflow may warrant a dependency graph; a Deep plan about a straightforward feature may not.

**When to include:**

| Plan describes... | Visual aid | Placement |
|---|---|---|
| 4+ implementation units with non-linear dependencies (parallelism, diamonds, fan-in/fan-out) | Mermaid dependency graph | Before or after the Implementation Units heading |
| System-Wide Impact naming 3+ interacting surfaces or cross-layer effects | Mermaid interaction or component diagram | Within the System-Wide Impact section |
| Problem/Overview involving 3+ behavioral modes, states, or variants | Markdown comparison table | Within Overview or Problem Frame |
| Key Technical Decisions with 3+ interacting decisions, or Alternative Approaches with 3+ alternatives | Markdown comparison table | Within the relevant section |

**When to skip:**
- The plan has 3 or fewer units in a straight dependency chain -- the Dependencies field on each unit is sufficient
- Prose already communicates the relationships clearly
- The visual would duplicate what the High-Level Technical Design section already shows
- The visual describes code-level detail (specific method names, SQL columns, API field lists)

**Format selection:**
- **Mermaid** (default) for dependency graphs and interaction diagrams -- 5-15 nodes, no in-box annotations, standard flowchart shapes. Use `TB` (top-to-bottom) direction so diagrams stay narrow in both rendered and source form. Source should be readable as fallback in diff views and terminals.
- **ASCII/box-drawing diagrams** for annotated flows that need rich in-box content -- file path layouts, decision logic branches, multi-column spatial arrangements. More expressive than mermaid when the diagram's value comes from annotations within nodes. Follow 80-column max for code blocks, use vertical stacking.
- **Markdown tables** for mode/variant comparisons and decision/approach comparisons.
- Keep diagrams proportionate to the plan. A 6-unit linear chain gets a simple 6-node graph. A complex dependency graph with fan-out and fan-in may need 10-15 nodes -- that is fine if every node earns its place.
- Place inline at the point of relevance, not in a separate section.
- Plan-structure level only -- unit dependencies, component interactions, mode comparisons, impact surfaces. Not implementation architecture, data schemas, or code structure (those belong in Section 3.4).
- Prose is authoritative: when a visual aid and its surrounding prose disagree, the prose governs.
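
For illustration only, a dependency graph following these rules -- `TB` direction, no in-box annotations, a diamond with fan-in -- might look like the following (the unit names are hypothetical):

```mermaid
flowchart TB
    U1[Unit 1: schema change] --> U2[Unit 2: service layer]
    U1 --> U3[Unit 3: background job]
    U2 --> U4[Unit 4: API endpoint]
    U3 --> U4
    U4 --> U5[Unit 5: integration tests]
```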

After generating a visual aid, verify it accurately represents the plan sections it illustrates -- correct dependency edges, no missing surfaces, no merged units.

When the plan contains 4+ implementation units with non-linear dependencies, 3+ interacting surfaces in System-Wide Impact, 3+ behavioral modes/variants in Overview or Problem Frame, or 3+ interacting decisions in Key Technical Decisions or alternatives in Alternative Approaches, read `references/visual-communication.md` for diagram and table guidance. This covers plan-structure visuals (dependency graphs, interaction diagrams, comparison tables) — not solution-design diagrams, which are covered in Section 3.4.

### Phase 5: Final Review, Write File, and Handoff
@@ -701,323 +673,12 @@ Build a risk profile. Treat these as high-risk signals:

If the plan already appears sufficiently grounded and the thin-grounding override does not apply, report "Confidence check passed — no sections need strengthening" and skip to Phase 5.3.8 (Document Review). Document-review always runs regardless of whether deepening was needed — the two tools catch different classes of issues.

##### 5.3.3 Score Confidence Gaps
##### 5.3.3–5.3.7 Deepening Execution

Use a checklist-first, risk-weighted scoring pass.

When deepening is warranted, read `references/deepening-workflow.md` for confidence scoring checklists, section-to-agent dispatch mapping, execution mode selection, research execution, interactive finding review, and plan synthesis instructions. Execute steps 5.3.3 through 5.3.7 from that file, then return here for 5.3.8.

For each section, compute:
- **Trigger count** - number of checklist problems that apply
- **Risk bonus** - add 1 if the topic is high-risk and this section is materially relevant to that risk
- **Critical-section bonus** - add 1 for `Key Technical Decisions`, `Implementation Units`, `System-Wide Impact`, `Risks & Dependencies`, or `Open Questions` in `Standard` or `Deep` plans

##### 5.3.8–5.4 Document Review, Final Checks, and Post-Generation Options

Treat a section as a candidate if:
- it hits **2+ total points**, or
- it hits **1+ point** in a high-risk domain and the section is materially important

Choose only the top **2-5** sections by score. If deepening a lightweight plan (high-risk exception), cap at **1-2** sections.

If the plan already has a `deepened:` date:
- Prefer sections that have not yet been substantially strengthened, if their scores are comparable
- Revisit an already-deepened section only when it still scores clearly higher than alternatives

**Section Checklists:**

**Requirements Trace**
- Requirements are vague or disconnected from implementation units
- Success criteria are missing or not reflected downstream
- Units do not clearly advance the traced requirements
- Origin requirements are not clearly carried forward

**Context & Research / Sources & References**
- Relevant repo patterns are named but never used in decisions or implementation units
- Cited learnings or references do not materially shape the plan
- High-risk work lacks appropriate external or internal grounding
- Research is generic instead of tied to this repo or this plan

**Key Technical Decisions**
- A decision is stated without rationale
- Rationale does not explain tradeoffs or rejected alternatives
- The decision does not connect back to scope, requirements, or origin context
- An obvious design fork exists but the plan never addresses why one path won

**Open Questions**
- Product blockers are hidden as assumptions
- Planning-owned questions are incorrectly deferred to implementation
- Resolved questions have no clear basis in repo context, research, or origin decisions
- Deferred items are too vague to be useful later

**High-Level Technical Design (when present)**
- The sketch uses the wrong medium for the work
- The sketch contains implementation code rather than pseudo-code
- The non-prescriptive framing is missing or weak
- The sketch does not connect to the key technical decisions or implementation units

**High-Level Technical Design (when absent)** *(Standard or Deep plans only)*
- The work involves DSL design, API surface design, multi-component integration, complex data flow, or state-heavy lifecycle
- Key technical decisions would be easier to validate with a visual or pseudo-code representation
- The approach section of implementation units is thin and a higher-level technical design would provide context

**Implementation Units**
- Dependency order is unclear or likely wrong
- File paths or test file paths are missing where they should be explicit
- Units are too large, too vague, or broken into micro-steps
- Approach notes are thin or do not name the pattern to follow
- Test scenarios are vague (don't name inputs and expected outcomes), skip applicable categories (e.g., no error paths for a unit with failure modes, no integration scenarios for a unit crossing layers), or are disproportionate to the unit's complexity
- Feature-bearing units have blank or missing test scenarios (feature-bearing units require actual test scenarios; the `Test expectation: none` annotation is only valid for non-feature-bearing units)
- Verification outcomes are vague or not expressed as observable results

**System-Wide Impact**
- Affected interfaces, callbacks, middleware, entry points, or parity surfaces are missing
- Failure propagation is underexplored
- State lifecycle, caching, or data integrity risks are absent where relevant
- Integration coverage is weak for cross-layer work

**Risks & Dependencies / Documentation / Operational Notes**
- Risks are listed without mitigation
- Rollout, monitoring, migration, or support implications are missing when warranted
- External dependency assumptions are weak or unstated
- Security, privacy, performance, or data risks are absent where they obviously apply

Use the plan's own `Context & Research` and `Sources & References` as evidence. If those sections cite a pattern, learning, or risk that never affects decisions, implementation units, or verification, treat that as a confidence gap.

##### 5.3.4 Report and Dispatch Targeted Research

Before dispatching agents, report what sections are being strengthened and why:

```text
Strengthening [section names] — [brief reason for each, e.g., "decision rationale is thin", "cross-boundary effects aren't mapped"]
```

For each selected section, choose the smallest useful agent set. Do **not** run every agent. Use at most **1-3 agents per section** and usually no more than **8 agents total**.

Use fully-qualified agent names inside Task calls.

**Deterministic Section-to-Agent Mapping:**

**Requirements Trace / Open Questions classification**
- `compound-engineering:workflow:spec-flow-analyzer` for missing user flows, edge cases, and handoff gaps
- `compound-engineering:research:repo-research-analyst` (Scope: `architecture, patterns`) for repo-grounded patterns, conventions, and implementation reality checks

**Context & Research / Sources & References gaps**
- `compound-engineering:research:learnings-researcher` for institutional knowledge and past solved problems
- `compound-engineering:research:framework-docs-researcher` for official framework or library behavior
- `compound-engineering:research:best-practices-researcher` for current external patterns and industry guidance
- Add `compound-engineering:research:git-history-analyzer` only when historical rationale or prior art is materially missing

**Key Technical Decisions**
- `compound-engineering:review:architecture-strategist` for design integrity, boundaries, and architectural tradeoffs
- Add `compound-engineering:research:framework-docs-researcher` or `compound-engineering:research:best-practices-researcher` when the decision needs external grounding beyond repo evidence

**High-Level Technical Design**
- `compound-engineering:review:architecture-strategist` for validating that the technical design accurately represents the intended approach and identifying gaps
- `compound-engineering:research:repo-research-analyst` (Scope: `architecture, patterns`) for grounding the technical design in existing repo patterns and conventions
- Add `compound-engineering:research:best-practices-researcher` when the technical design involves a DSL, API surface, or pattern that benefits from external validation

**Implementation Units / Verification**
- `compound-engineering:research:repo-research-analyst` (Scope: `patterns`) for concrete file targets, patterns to follow, and repo-specific sequencing clues
- `compound-engineering:review:pattern-recognition-specialist` for consistency, duplication risks, and alignment with existing patterns
- Add `compound-engineering:workflow:spec-flow-analyzer` when sequencing depends on user flow or handoff completeness

**System-Wide Impact**
- `compound-engineering:review:architecture-strategist` for cross-boundary effects, interface surfaces, and architectural knock-on impact
- Add the specific specialist that matches the risk:
  - `compound-engineering:review:performance-oracle` for scalability, latency, throughput, and resource-risk analysis
  - `compound-engineering:review:security-sentinel` for auth, validation, exploit surfaces, and security boundary review
  - `compound-engineering:review:data-integrity-guardian` for migrations, persistent state safety, consistency, and data lifecycle risks

**Risks & Dependencies / Operational Notes**
- Use the specialist that matches the actual risk:
  - `compound-engineering:review:security-sentinel` for security, auth, privacy, and exploit risk
  - `compound-engineering:review:data-integrity-guardian` for persistent data safety, constraints, and transaction boundaries
  - `compound-engineering:review:data-migration-expert` for migration realism, backfills, and production data transformation risk
  - `compound-engineering:review:deployment-verification-agent` for rollout checklists, rollback planning, and launch verification
  - `compound-engineering:review:performance-oracle` for capacity, latency, and scaling concerns

**Agent Prompt Shape:**

For each selected section, pass:
- The scope prefix from the mapping above when the agent supports scoped invocation
- A short plan summary
- The exact section text
- Why the section was selected, including which checklist triggers fired
- The plan depth and risk profile
- A specific question to answer

Instruct the agent to return:
- findings that change planning quality
- stronger rationale, sequencing, verification, risk treatment, or references
- no implementation code
- no shell commands
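
Assembled from that shape, a dispatch prompt might read as follows. This is a hypothetical sketch; the section excerpt, trigger names, and question are invented:

```text
Agent: compound-engineering:review:architecture-strategist
Scope: architecture
Plan summary: <2-3 sentence summary of the plan>
Section text: <exact Key Technical Decisions text>
Why selected: 2 checklist triggers (decision stated without rationale;
  no rejected alternatives) plus critical-section bonus
Depth/risk: Standard plan, high-risk (auth boundary)
Question: Do the stated decisions hold up against the repo's existing
  middleware boundaries, and what tradeoffs are left unstated?
Return: findings that change planning quality; stronger rationale or
  references; no implementation code; no shell commands.
```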

##### 5.3.5 Choose Research Execution Mode

Use the lightest mode that will work:

- **Direct mode** - Default. Use when the selected section set is small and the parent can safely read the agent outputs inline.
- **Artifact-backed mode** - Use only when the selected research scope is large enough that inline returns would create unnecessary context pressure.

Signals that justify artifact-backed mode:
- More than 5 agents are likely to return meaningful findings
- The selected section excerpts are long enough that repeating them in multiple agent outputs would be wasteful
- The topic is high-risk and likely to attract bulky source-backed analysis

If artifact-backed mode is not clearly warranted, stay in direct mode.

Artifact-backed mode uses a per-run scratch directory under `.context/compound-engineering/ce-plan/deepen/`.
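
A minimal sketch of creating that per-run scratch directory. The timestamp-based run id is an assumption; any collision-free identifier works:

```shell
# Create a unique per-run scratch directory for artifact-backed mode.
# The run-id naming scheme here is illustrative, not prescribed.
SCRATCH_DIR=".context/compound-engineering/ce-plan/deepen/run-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$SCRATCH_DIR"
```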

##### 5.3.6 Run Targeted Research

Launch the selected agents in parallel using the execution mode chosen above. If the current platform does not support parallel dispatch, run them sequentially instead.

Prefer local repo and institutional evidence first. Use external research only when the gap cannot be closed responsibly from repo context or already-cited sources.

If a selected section can be improved by reading the origin document more carefully, do that before dispatching external agents.

**Direct mode:** Have each selected agent return its findings directly to the parent. Keep the return payload focused: strongest findings only, the evidence or sources that matter, the concrete planning improvement implied by the finding.

**Artifact-backed mode:** For each selected agent, instruct it to write one compact artifact file in the scratch directory and return only a short completion summary. Each artifact should contain: target section, why selected, 3-7 findings, source-backed rationale, the specific plan change implied by each finding. No implementation code, no shell commands.
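
One such artifact might be laid out like this. The filename and field labels are illustrative; only the listed contents are prescribed:

```text
# key-technical-decisions.md
Target section: Key Technical Decisions
Why selected: decision stated without rationale; no rejected alternatives
Findings:
1. <finding> — rationale: <source-backed evidence> — plan change: <specific edit>
2. <finding> — rationale: <source-backed evidence> — plan change: <specific edit>
```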

If an artifact is missing or clearly malformed, re-run that agent or fall back to direct-mode reasoning for that section.

If agent outputs conflict:
- Prefer repo-grounded and origin-grounded evidence over generic advice
- Prefer official framework documentation over secondary best-practice summaries when the conflict is about library behavior
- If a real tradeoff remains, record it explicitly in the plan

##### 5.3.6b Interactive Finding Review (Interactive Mode Only)

Skip this step in auto mode — proceed directly to 5.3.7.

In interactive mode, present each agent's findings to the user before integration. For each agent that returned findings:

1. **Summarize the agent and its target section** — e.g., "The architecture-strategist reviewed Key Technical Decisions and found:"
2. **Present the findings concisely** — bullet the key points, not the raw agent output. Include enough context for the user to evaluate: what the agent found, what evidence supports it, and what plan change it implies.
3. **Ask the user** using the platform's blocking question tool when available (see Interaction Method):
   - **Accept** — integrate these findings into the plan
   - **Reject** — discard these findings entirely
   - **Discuss** — the user wants to talk through the findings before deciding

If the user chooses "Discuss", engage in brief dialogue about the findings and then re-ask with only accept/reject (no discuss option on the second ask). The user makes a deliberate choice either way.

When presenting findings from multiple agents targeting the same section, present them one agent at a time so the user can make independent decisions. Do not merge findings from different agents before showing them.

After all agents have been reviewed, carry only the accepted findings forward to 5.3.7.

If the user accepted no findings, report "No findings accepted — plan unchanged." If artifact-backed mode was used, clean up the scratch directory before continuing. Then proceed directly to Phase 5.4 (skip document-review and synthesis — the plan was not modified). This interactive-mode-only skip does not apply in auto mode; auto mode always proceeds through 5.3.7 and 5.3.8.

If findings were accepted and the plan was modified, proceed through 5.3.7 and 5.3.8 as normal — document-review acts as a quality gate on the changes.
Strengthen only the selected sections. Keep the plan coherent and preserve its overall structure.
|
||||
|
||||
**In interactive mode:** Only integrate findings the user accepted in 5.3.6b. If some findings from different agents touch the same section, reconcile them coherently but do not reintroduce rejected findings.
|
||||
|
||||
Allowed changes:
|
||||
- Clarify or strengthen decision rationale
|
||||
- Tighten requirements trace or origin fidelity
|
||||
- Reorder or split implementation units when sequencing is weak
|
||||
- Add missing pattern references, file/test paths, or verification outcomes
|
||||
- Expand system-wide impact, risks, or rollout treatment where justified
|
||||
- Reclassify open questions between `Resolved During Planning` and `Deferred to Implementation` when evidence supports the change
|
||||
- Strengthen, replace, or add a High-Level Technical Design section when the work warrants it and the current representation is weak
|
||||
- Strengthen or add per-unit technical design fields where the unit's approach is non-obvious
|
||||
- Add or update `deepened: YYYY-MM-DD` in frontmatter when the plan was substantively improved
|
||||
|
||||
Do **not**:
|
||||
- Add implementation code — no imports, exact method signatures, or framework-specific syntax. Pseudo-code sketches and DSL grammars are allowed
|
||||
- Add git commands, commit choreography, or exact test command recipes
|
||||
- Add generic `Research Insights` subsections everywhere
|
||||
- Rewrite the entire plan from scratch
|
||||
- Invent new product requirements, scope changes, or success criteria without surfacing them explicitly
|
||||
|
||||
If research reveals a product-level ambiguity that should change behavior or scope:
|
||||
- Do not silently decide it here
|
||||
- Record it under `Open Questions`
|
||||
- Recommend `ce:brainstorm` if the gap is truly product-defining
|
||||
|
||||
##### 5.3.8 Document Review
|
||||
|
||||
After the confidence check (and any deepening), run the `document-review` skill on the plan file. Pass the plan path as the argument. When this step is reached, it is mandatory — do not skip it because the confidence check already ran. The two tools catch different classes of issues.
|
||||
|
||||
The confidence check and document-review are complementary:
|
||||
- The confidence check strengthens rationale, sequencing, risk treatment, and grounding
|
||||
- Document-review checks coherence, feasibility, scope alignment, and surfaces role-specific issues
|
||||
|
||||
If document-review returns findings that were auto-applied, note them briefly when presenting handoff options. If residual P0/P1 findings were surfaced, mention them so the user can decide whether to address them before proceeding.
|
||||
|
||||
When document-review returns "Review complete", proceed to Final Checks.
|
||||
|
||||
**Pipeline mode:** If invoked from an automated workflow such as LFG, SLFG, or any `disable-model-invocation` context, run `document-review` with `mode:headless` and the plan path. Headless mode applies auto-fixes silently and returns structured findings without interactive prompts. Address any P0/P1 findings before returning control to the caller.
|
||||
|
||||

##### 5.3.9 Final Checks and Cleanup

Before proceeding to post-generation options:
- Confirm the plan is stronger in specific ways, not merely longer
- Confirm the planning boundary is intact
- Confirm origin decisions were preserved when an origin document exists

If artifact-backed mode was used:
- Clean up the temporary scratch directory after the plan is safely updated
- If cleanup is not practical on the current platform, note where the artifacts were left

#### 5.4 Post-Generation Options

**Pipeline mode:** If invoked from an automated workflow such as LFG, SLFG, or any `disable-model-invocation` context, skip the interactive menu below and return control to the caller immediately. The plan file has already been written, the confidence check has already run, and document-review has already run — the caller (e.g., lfg, slfg) determines the next step.

After document-review completes, present the options using the platform's blocking question tool when available (see Interaction Method). Otherwise present numbered options in chat and wait for the user's reply before proceeding.

**Question:** "Plan ready at `docs/plans/YYYY-MM-DD-NNN-<type>-<name>-plan.md`. What would you like to do next?"

**Options:**
1. **Start `/ce:work`** - Begin implementing this plan in the current environment (recommended)
2. **Open plan in editor** - Open the plan file for review
3. **Run additional document review** - Another pass for further refinement
4. **Share to Proof** - Upload the plan for collaborative review and sharing
5. **Start `/ce:work` in another session** - Begin implementing in a separate agent session when the current platform supports it
6. **Create Issue** - Create an issue in the configured tracker

Based on selection:
- **Open plan in editor** → Open `docs/plans/<plan_filename>.md` using the current platform's file-open or editor mechanism (e.g., `open` on macOS, `xdg-open` on Linux, or the IDE's file-open API)
- **Run additional document review** → Load the `document-review` skill with the plan path for another pass
- **Share to Proof** → Upload the plan:

  ```bash
  CONTENT=$(cat docs/plans/<plan_filename>.md)
  TITLE="Plan: <plan title from frontmatter>"
  RESPONSE=$(curl -s -X POST https://www.proofeditor.ai/share/markdown \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg title "$TITLE" --arg markdown "$CONTENT" --arg by "ai:compound" '{title: $title, markdown: $markdown, by: $by}')")
  # `// empty` guards against a missing tokenUrl so failure leaves PROOF_URL blank
  PROOF_URL=$(echo "$RESPONSE" | jq -r '.tokenUrl // empty')
  ```

  Display `View & collaborate in Proof: <PROOF_URL>` if successful, then return to the options
- **`/ce:work`** → Call `/ce:work` with the plan path
- **`/ce:work` in another session** → If the current platform supports launching a separate agent session, start `/ce:work` with the plan path there. Otherwise, explain the limitation briefly and offer to run `/ce:work` in the current session instead.
- **Create Issue** → Follow the Issue Creation section below
- **Other** → Accept free text for revisions and loop back to options

## Issue Creation

When the user selects "Create Issue", detect their project tracker from `AGENTS.md` or, if needed for compatibility, `CLAUDE.md`:

1. Look for `project_tracker: github` or `project_tracker: linear`
2. If GitHub:

   ```bash
   gh issue create --title "<type>: <title>" --body-file <plan_path>
   ```

3. If Linear:

   ```bash
   linear issue create --title "<title>" --description "$(cat <plan_path>)"
   ```

4. If no tracker is configured:
   - Ask which tracker they use using the platform's blocking question tool when available (see Interaction Method)
   - Suggest adding the tracker to `AGENTS.md` for future runs
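
The detection in step 1 can be sketched as a small helper. The `project_tracker: <name>` line format comes from step 1; the lookup order and everything else here is an illustrative assumption:

```shell
# Return the configured tracker name, checking AGENTS.md first and
# falling back to CLAUDE.md; print "none" when neither declares one.
detect_tracker() {
  for f in AGENTS.md CLAUDE.md; do
    [ -f "$f" ] || continue
    tracker=$(grep -m1 '^project_tracker:' "$f" | awk '{print $2}')
    if [ -n "$tracker" ]; then
      printf '%s\n' "$tracker"
      return 0
    fi
  done
  printf 'none\n'
}
```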

After issue creation:
- Display the issue URL
- Ask whether to proceed to `/ce:work`

When reaching this phase, read `references/plan-handoff.md` for document review instructions (5.3.8), final checks and cleanup (5.3.9), post-generation options menu (5.4), and issue creation. Do not load this file earlier. Document review is mandatory — do not skip it even if the confidence check already ran.

NEVER CODE! Research, decide, and write the plan.
@@ -0,0 +1,238 @@

# Deepening Workflow

This file contains the confidence-check execution path (5.3.3-5.3.7). Load it only when the deepening gate at 5.3.2 determines that deepening is warranted.

## 5.3.3 Score Confidence Gaps

Use a checklist-first, risk-weighted scoring pass.

For each section, compute:
- **Trigger count** - number of checklist problems that apply
- **Risk bonus** - add 1 if the topic is high-risk and this section is materially relevant to that risk
- **Critical-section bonus** - add 1 for `Key Technical Decisions`, `Implementation Units`, `System-Wide Impact`, `Risks & Dependencies`, or `Open Questions` in `Standard` or `Deep` plans

Treat a section as a candidate if:
- it hits **2+ total points**, or
- it hits **1+ point** in a high-risk domain and the section is materially important

Choose only the top **2-5** sections by score. If deepening a lightweight plan (high-risk exception), cap at **1-2** sections.
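
A worked example of this scoring, with invented trigger counts for a hypothetical Standard high-risk plan:

```text
Key Technical Decisions: 2 triggers + 1 risk + 1 critical-section = 4 → candidate
System-Wide Impact:      1 trigger  + 1 risk + 1 critical-section = 3 → candidate
Requirements Trace:      1 trigger  + 0      + 0                  = 1 → not selected
```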
|
||||
|
||||
If the plan already has a `deepened:` date:
|
||||
- Prefer sections that have not yet been substantially strengthened, if their scores are comparable
|
||||
- Revisit an already-deepened section only when it still scores clearly higher than alternatives
|
||||
|
||||
**Section Checklists:**

**Requirements Trace**

- Requirements are vague or disconnected from implementation units
- Success criteria are missing or not reflected downstream
- Units do not clearly advance the traced requirements
- Origin requirements are not clearly carried forward

**Context & Research / Sources & References**

- Relevant repo patterns are named but never used in decisions or implementation units
- Cited learnings or references do not materially shape the plan
- High-risk work lacks appropriate external or internal grounding
- Research is generic instead of tied to this repo or this plan

**Key Technical Decisions**

- A decision is stated without rationale
- Rationale does not explain tradeoffs or rejected alternatives
- The decision does not connect back to scope, requirements, or origin context
- An obvious design fork exists but the plan never addresses why one path won

**Open Questions**

- Product blockers are hidden as assumptions
- Planning-owned questions are incorrectly deferred to implementation
- Resolved questions have no clear basis in repo context, research, or origin decisions
- Deferred items are too vague to be useful later

**High-Level Technical Design (when present)**

- The sketch uses the wrong medium for the work
- The sketch contains implementation code rather than pseudo-code
- The non-prescriptive framing is missing or weak
- The sketch does not connect to the key technical decisions or implementation units

**High-Level Technical Design (when absent)** *(Standard or Deep plans only)*

- The work involves DSL design, API surface design, multi-component integration, complex data flow, or state-heavy lifecycle
- Key technical decisions would be easier to validate with a visual or pseudo-code representation
- The approach section of implementation units is thin and a higher-level technical design would provide context

**Implementation Units**

- Dependency order is unclear or likely wrong
- File paths or test file paths are missing where they should be explicit
- Units are too large, too vague, or broken into micro-steps
- Approach notes are thin or do not name the pattern to follow
- Test scenarios are vague (don't name inputs and expected outcomes), skip applicable categories (e.g., no error paths for a unit with failure modes, no integration scenarios for a unit crossing layers), or are disproportionate to the unit's complexity
- Feature-bearing units have blank or missing test scenarios (feature-bearing units require actual test scenarios; the `Test expectation: none` annotation is only valid for non-feature-bearing units)
- Verification outcomes are vague or not expressed as observable results

**System-Wide Impact**

- Affected interfaces, callbacks, middleware, entry points, or parity surfaces are missing
- Failure propagation is underexplored
- State lifecycle, caching, or data integrity risks are absent where relevant
- Integration coverage is weak for cross-layer work

**Risks & Dependencies / Documentation / Operational Notes**

- Risks are listed without mitigation
- Rollout, monitoring, migration, or support implications are missing when warranted
- External dependency assumptions are weak or unstated
- Security, privacy, performance, or data risks are absent where they obviously apply

Use the plan's own `Context & Research` and `Sources & References` as evidence. If those sections cite a pattern, learning, or risk that never affects decisions, implementation units, or verification, treat that as a confidence gap.

## 5.3.4 Report and Dispatch Targeted Research

Before dispatching agents, report what sections are being strengthened and why:

```text
Strengthening [section names] — [brief reason for each, e.g., "decision rationale is thin", "cross-boundary effects aren't mapped"]
```

For each selected section, choose the smallest useful agent set. Do **not** run every agent. Use at most **1-3 agents per section** and usually no more than **8 agents total**.

Use fully-qualified agent names inside Task calls.

**Deterministic Section-to-Agent Mapping:**

**Requirements Trace / Open Questions classification**

- `compound-engineering:workflow:spec-flow-analyzer` for missing user flows, edge cases, and handoff gaps
- `compound-engineering:research:repo-research-analyst` (Scope: `architecture, patterns`) for repo-grounded patterns, conventions, and implementation reality checks

**Context & Research / Sources & References gaps**

- `compound-engineering:research:learnings-researcher` for institutional knowledge and past solved problems
- `compound-engineering:research:framework-docs-researcher` for official framework or library behavior
- `compound-engineering:research:best-practices-researcher` for current external patterns and industry guidance
- Add `compound-engineering:research:git-history-analyzer` only when historical rationale or prior art is materially missing

**Key Technical Decisions**

- `compound-engineering:review:architecture-strategist` for design integrity, boundaries, and architectural tradeoffs
- Add `compound-engineering:research:framework-docs-researcher` or `compound-engineering:research:best-practices-researcher` when the decision needs external grounding beyond repo evidence

**High-Level Technical Design**

- `compound-engineering:review:architecture-strategist` for validating that the technical design accurately represents the intended approach and identifying gaps
- `compound-engineering:research:repo-research-analyst` (Scope: `architecture, patterns`) for grounding the technical design in existing repo patterns and conventions
- Add `compound-engineering:research:best-practices-researcher` when the technical design involves a DSL, API surface, or pattern that benefits from external validation

**Implementation Units / Verification**

- `compound-engineering:research:repo-research-analyst` (Scope: `patterns`) for concrete file targets, patterns to follow, and repo-specific sequencing clues
- `compound-engineering:review:pattern-recognition-specialist` for consistency, duplication risks, and alignment with existing patterns
- Add `compound-engineering:workflow:spec-flow-analyzer` when sequencing depends on user flow or handoff completeness

**System-Wide Impact**

- `compound-engineering:review:architecture-strategist` for cross-boundary effects, interface surfaces, and architectural knock-on impact
- Add the specific specialist that matches the risk:
  - `compound-engineering:review:performance-oracle` for scalability, latency, throughput, and resource-risk analysis
  - `compound-engineering:review:security-sentinel` for auth, validation, exploit surfaces, and security boundary review
  - `compound-engineering:review:data-integrity-guardian` for migrations, persistent state safety, consistency, and data lifecycle risks

**Risks & Dependencies / Operational Notes**

- Use the specialist that matches the actual risk:
  - `compound-engineering:review:security-sentinel` for security, auth, privacy, and exploit risk
  - `compound-engineering:review:data-integrity-guardian` for persistent data safety, constraints, and transaction boundaries
  - `compound-engineering:review:data-migration-expert` for migration realism, backfills, and production data transformation risk
  - `compound-engineering:review:deployment-verification-agent` for rollout checklists, rollback planning, and launch verification
  - `compound-engineering:review:performance-oracle` for capacity, latency, and scaling concerns

**Agent Prompt Shape:**

For each selected section, pass:

- The scope prefix from the mapping above when the agent supports scoped invocation
- A short plan summary
- The exact section text
- Why the section was selected, including which checklist triggers fired
- The plan depth and risk profile
- A specific question to answer

Instruct the agent to return:

- findings that change planning quality
- stronger rationale, sequencing, verification, risk treatment, or references
- no implementation code
- no shell commands

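Taken together, a dispatched prompt might follow a skeleton like the one below. The agent, scores, section, and wording are all illustrative, not a required format:

```text
Agent: compound-engineering:review:architecture-strategist
Scope: architecture

Plan summary: <2-3 sentence summary of the plan>
Target section: Key Technical Decisions
Section text: <exact text of the section>
Why selected: decision stated without rationale; obvious design fork never
  addressed (2 checklist triggers fired)
Plan depth / risk: Standard, high-risk (data migration)
Question: Which candidate approach better fits the repo's existing module
  boundaries, and why?

Return: findings that change planning quality; stronger rationale or
references; no implementation code; no shell commands.
```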
## 5.3.5 Choose Research Execution Mode

Use the lightest mode that will work:

- **Direct mode** - Default. Use when the selected section set is small and the parent can safely read the agent outputs inline.
- **Artifact-backed mode** - Use only when the selected research scope is large enough that inline returns would create unnecessary context pressure.

Signals that justify artifact-backed mode:

- More than 5 agents are likely to return meaningful findings
- The selected section excerpts are long enough that repeating them in multiple agent outputs would be wasteful
- The topic is high-risk and likely to attract bulky source-backed analysis

If artifact-backed mode is not clearly warranted, stay in direct mode.

Artifact-backed mode uses a per-run scratch directory under `.context/compound-engineering/ce-plan/deepen/`.

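Creating that per-run scratch directory can be sketched as below, assuming a POSIX shell. The run-id naming scheme (timestamp plus PID) is illustrative, not prescribed:

```shell
# Create a per-run scratch directory for artifact-backed mode.
# The run-id scheme here is illustrative only.
run_id="$(date +%Y%m%d-%H%M%S)-$$"
scratch_dir=".context/compound-engineering/ce-plan/deepen/${run_id}"
mkdir -p "$scratch_dir"
echo "$scratch_dir"
```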
## 5.3.6 Run Targeted Research

Launch the selected agents in parallel using the execution mode chosen above. If the current platform does not support parallel dispatch, run them sequentially instead.

Prefer local repo and institutional evidence first. Use external research only when the gap cannot be closed responsibly from repo context or already-cited sources.

If a selected section can be improved by reading the origin document more carefully, do that before dispatching external agents.

**Direct mode:** Have each selected agent return its findings directly to the parent. Keep the return payload focused: strongest findings only, the evidence or sources that matter, the concrete planning improvement implied by the finding.

**Artifact-backed mode:** For each selected agent, instruct it to write one compact artifact file in the scratch directory and return only a short completion summary. Each artifact should contain: target section, why selected, 3-7 findings, source-backed rationale, the specific plan change implied by each finding. No implementation code, no shell commands.

If an artifact is missing or clearly malformed, re-run that agent or fall back to direct-mode reasoning for that section.

If agent outputs conflict:

- Prefer repo-grounded and origin-grounded evidence over generic advice
- Prefer official framework documentation over secondary best-practice summaries when the conflict is about library behavior
- If a real tradeoff remains, record it explicitly in the plan

## 5.3.6b Interactive Finding Review (Interactive Mode Only)

Skip this step in auto mode — proceed directly to 5.3.7.

In interactive mode, present each agent's findings to the user before integration. For each agent that returned findings:

1. **Summarize the agent and its target section** — e.g., "The architecture-strategist reviewed Key Technical Decisions and found:"
2. **Present the findings concisely** — bullet the key points, not the raw agent output. Include enough context for the user to evaluate: what the agent found, what evidence supports it, and what plan change it implies.
3. **Ask the user** using the platform's blocking question tool when available (see Interaction Method):
   - **Accept** — integrate these findings into the plan
   - **Reject** — discard these findings entirely
   - **Discuss** — the user wants to talk through the findings before deciding

If the user chooses "Discuss", engage in brief dialogue about the findings and then re-ask with only accept/reject (no discuss option on the second ask). The user makes a deliberate choice either way.

When presenting findings from multiple agents targeting the same section, present them one agent at a time so the user can make independent decisions. Do not merge findings from different agents before showing them.

After all agents have been reviewed, carry only the accepted findings forward to 5.3.7.

If the user accepted no findings, report "No findings accepted — plan unchanged." If artifact-backed mode was used, clean up the scratch directory before continuing. Then proceed directly to Phase 5.4 (skip document-review and synthesis — the plan was not modified). This interactive-mode-only skip does not apply in auto mode; auto mode always proceeds through 5.3.7 and 5.3.8.

If findings were accepted and the plan was modified, proceed through 5.3.7 and 5.3.8 as normal — document-review acts as a quality gate on the changes.

## 5.3.7 Synthesize and Update the Plan

Strengthen only the selected sections. Keep the plan coherent and preserve its overall structure.

**In interactive mode:** Only integrate findings the user accepted in 5.3.6b. If findings from different agents touch the same section, reconcile them coherently, but do not reintroduce rejected findings.

Allowed changes:

- Clarify or strengthen decision rationale
- Tighten requirements trace or origin fidelity
- Reorder or split implementation units when sequencing is weak
- Add missing pattern references, file/test paths, or verification outcomes
- Expand system-wide impact, risks, or rollout treatment where justified
- Reclassify open questions between `Resolved During Planning` and `Deferred to Implementation` when evidence supports the change
- Strengthen, replace, or add a High-Level Technical Design section when the work warrants it and the current representation is weak
- Strengthen or add per-unit technical design fields where the unit's approach is non-obvious
- Add or update `deepened: YYYY-MM-DD` in frontmatter when the plan was substantively improved

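The `deepened:` frontmatter update can be sketched as below. The temp file stands in for a real plan, and the awk approach is one possible mechanism, not a prescribed one; it assumes frontmatter delimited by `---` lines:

```shell
# Set or update `deepened: YYYY-MM-DD` in YAML frontmatter.
# A temp file stands in for the real plan here.
plan="$(mktemp)"
printf -- '---\ntitle: Example plan\n---\n\nBody text\n' > "$plan"
today="$(date +%Y-%m-%d)"

if grep -q '^deepened:' "$plan"; then
  # Refresh the existing date
  sed "s/^deepened:.*/deepened: ${today}/" "$plan" > "$plan.tmp" && mv "$plan.tmp" "$plan"
else
  # Insert the field right after the opening --- delimiter
  awk -v d="$today" 'NR == 1 && $0 == "---" { print; print "deepened: " d; next } { print }' \
    "$plan" > "$plan.tmp" && mv "$plan.tmp" "$plan"
fi

grep '^deepened:' "$plan"
```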
Do **not**:

- Add implementation code — no imports, exact method signatures, or framework-specific syntax. Pseudo-code sketches and DSL grammars are allowed
- Add git commands, commit choreography, or exact test command recipes
- Add generic `Research Insights` subsections everywhere
- Rewrite the entire plan from scratch
- Invent new product requirements, scope changes, or success criteria without surfacing them explicitly

If research reveals a product-level ambiguity that should change behavior or scope:

- Do not silently decide it here
- Record it under `Open Questions`
- Recommend `ce:brainstorm` if the gap is truly product-defining

@@ -0,0 +1,87 @@

# Plan Handoff

This file contains post-plan-writing instructions: document review, post-generation options, and issue creation. Load it after the plan file has been written and the confidence check (5.3.1-5.3.7) is complete.

## 5.3.8 Document Review

After the confidence check (and any deepening), run the `document-review` skill on the plan file. Pass the plan path as the argument. When this step is reached, it is mandatory — do not skip it because the confidence check already ran. The two tools catch different classes of issues.

The confidence check and document-review are complementary:

- The confidence check strengthens rationale, sequencing, risk treatment, and grounding
- Document-review checks coherence, feasibility, and scope alignment, and surfaces role-specific issues

If document-review returns findings that were auto-applied, note them briefly when presenting handoff options. If residual P0/P1 findings were surfaced, mention them so the user can decide whether to address them before proceeding.

When document-review returns "Review complete", proceed to Final Checks.

**Pipeline mode:** If invoked from an automated workflow such as LFG, SLFG, or any `disable-model-invocation` context, run `document-review` with `mode:headless` and the plan path. Headless mode applies auto-fixes silently and returns structured findings without interactive prompts. Address any P0/P1 findings before returning control to the caller.

## 5.3.9 Final Checks and Cleanup

Before proceeding to post-generation options:

- Confirm the plan is stronger in specific ways, not merely longer
- Confirm the planning boundary is intact
- Confirm origin decisions were preserved when an origin document exists

If artifact-backed mode was used:

- Clean up the temporary scratch directory after the plan is safely updated
- If cleanup is not practical on the current platform, note where the artifacts were left

## 5.4 Post-Generation Options

**Pipeline mode:** If invoked from an automated workflow such as LFG, SLFG, or any `disable-model-invocation` context, skip the interactive menu below and return control to the caller immediately. The plan file has already been written, the confidence check has already run, and document-review has already run — the caller (e.g., lfg, slfg) determines the next step.

After document-review completes, present the options using the platform's blocking question tool when available (see Interaction Method). Otherwise present numbered options in chat and wait for the user's reply before proceeding.

**Question:** "Plan ready at `docs/plans/YYYY-MM-DD-NNN-<type>-<name>-plan.md`. What would you like to do next?"

**Options:**

1. **Start `/ce:work`** - Begin implementing this plan in the current environment (recommended)
2. **Open plan in editor** - Open the plan file for review
3. **Run additional document review** - Another pass for further refinement
4. **Share to Proof** - Upload the plan for collaborative review and sharing
5. **Start `/ce:work` in another session** - Begin implementing in a separate agent session when the current platform supports it
6. **Create Issue** - Create an issue in the configured tracker

Based on selection:

- **Open plan in editor** -> Open `docs/plans/<plan_filename>.md` using the current platform's file-open or editor mechanism (e.g., `open` on macOS, `xdg-open` on Linux, or the IDE's file-open API)
- **Run additional document review** -> Load the `document-review` skill with the plan path for another pass
- **Share to Proof** -> Upload the plan:

  ```bash
  CONTENT=$(cat docs/plans/<plan_filename>.md)
  TITLE="Plan: <plan title from frontmatter>"
  RESPONSE=$(curl -s -X POST https://www.proofeditor.ai/share/markdown \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg title "$TITLE" --arg markdown "$CONTENT" --arg by "ai:compound" '{title: $title, markdown: $markdown, by: $by}')")
  PROOF_URL=$(echo "$RESPONSE" | jq -r '.tokenUrl')
  ```

  Display `View & collaborate in Proof: <PROOF_URL>` if successful, then return to the options
- **`/ce:work`** -> Call `/ce:work` with the plan path
- **`/ce:work` in another session** -> If the current platform supports launching a separate agent session, start `/ce:work` with the plan path there. Otherwise, explain the limitation briefly and offer to run `/ce:work` in the current session instead.
- **Create Issue** -> Follow the Issue Creation section below
- **Other** -> Accept free text for revisions and loop back to options

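The editor-opening fallback chain in the first bullet can be sketched as follows. The plan path is illustrative, and the sketch only selects an opener rather than launching it:

```shell
# Pick a platform-appropriate opener for the plan file without launching it.
plan="docs/plans/example-plan.md"   # illustrative path
if command -v open >/dev/null 2>&1; then
  opener="open"       # macOS
elif command -v xdg-open >/dev/null 2>&1; then
  opener="xdg-open"   # most Linux desktops
else
  opener=""           # fall back to the IDE's file-open API
fi
echo "opener=${opener:-ide-api} plan=$plan"
```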
## Issue Creation

When the user selects "Create Issue", detect their project tracker from `AGENTS.md` or, if needed for compatibility, `CLAUDE.md`:

1. Look for `project_tracker: github` or `project_tracker: linear`
2. If GitHub:

   ```bash
   gh issue create --title "<type>: <title>" --body-file <plan_path>
   ```

3. If Linear:

   ```bash
   linear issue create --title "<title>" --description "$(cat <plan_path>)"
   ```

4. If no tracker is configured:
   - Ask which tracker they use, via the platform's blocking question tool when available (see Interaction Method)
   - Suggest adding the tracker to `AGENTS.md` for future runs

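Step 1's lookup can be sketched like this; the sample `AGENTS.md` is created in a temp directory purely for illustration:

```shell
# Detect the configured tracker from AGENTS.md, falling back to CLAUDE.md.
workdir="$(mktemp -d)"
printf 'project_tracker: github\n' > "$workdir/AGENTS.md"   # sample config

tracker=""
for f in "$workdir/AGENTS.md" "$workdir/CLAUDE.md"; do
  if [ -f "$f" ]; then
    tracker="$(sed -n 's/^project_tracker:[[:space:]]*//p' "$f" | head -n 1)"
    [ -n "$tracker" ] && break
  fi
done
echo "tracker=${tracker:-not configured}"   # prints "tracker=github"
```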
After issue creation:

- Display the issue URL
- Ask whether to proceed to `/ce:work`

@@ -0,0 +1,31 @@

# Visual Communication in Plan Documents

Section 3.4 covers diagrams about the *solution being planned* (pseudo-code, mermaid sequences, state diagrams). The existing Section 4.3 mermaid rule encourages those solution-design diagrams within Technical Design and per-unit fields. This guidance covers a different concern: visual aids that help readers *navigate and comprehend the plan document itself* -- dependency graphs, interaction diagrams, and comparison tables that make plan structure scannable.

Visual aids are conditional on content patterns, not on plan depth classification -- a Lightweight plan about a complex multi-unit workflow may warrant a dependency graph; a Deep plan about a straightforward feature may not.

**When to include:**

| Plan describes... | Visual aid | Placement |
|---|---|---|
| 4+ implementation units with non-linear dependencies (parallelism, diamonds, fan-in/fan-out) | Mermaid dependency graph | Before or after the Implementation Units heading |
| System-Wide Impact naming 3+ interacting surfaces or cross-layer effects | Mermaid interaction or component diagram | Within the System-Wide Impact section |
| Problem/Overview involving 3+ behavioral modes, states, or variants | Markdown comparison table | Within Overview or Problem Frame |
| Key Technical Decisions with 3+ interacting decisions, or Alternative Approaches with 3+ alternatives | Markdown comparison table | Within the relevant section |

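For the first row's case, a minimal dependency graph might look like the following sketch (unit names are illustrative); the diamond shape is what makes the graph worth drawing:

```mermaid
flowchart TB
    U1["Unit 1: schema change"] --> U2["Unit 2: backfill script"]
    U1 --> U3["Unit 3: read-path API"]
    U2 --> U4["Unit 4: cutover and cleanup"]
    U3 --> U4
```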
**When to skip:**

- The plan has 3 or fewer units in a straight dependency chain -- the Dependencies field on each unit is sufficient
- Prose already communicates the relationships clearly
- The visual would duplicate what the High-Level Technical Design section already shows
- The visual describes code-level detail (specific method names, SQL columns, API field lists)

**Format selection:**

- **Mermaid** (default) for dependency graphs and interaction diagrams -- 5-15 nodes, no in-box annotations, standard flowchart shapes. Use `TB` (top-to-bottom) direction so diagrams stay narrow in both rendered and source form. Source should be readable as a fallback in diff views and terminals.
- **ASCII/box-drawing diagrams** for annotated flows that need rich in-box content -- file path layouts, decision logic branches, multi-column spatial arrangements. More expressive than mermaid when the diagram's value comes from annotations within nodes. Follow an 80-column max for code blocks and use vertical stacking.
- **Markdown tables** for mode/variant comparisons and decision/approach comparisons.
- Keep diagrams proportionate to the plan. A 6-unit linear chain gets a simple 6-node graph. A complex dependency graph with fan-out and fan-in may need 10-15 nodes -- that is fine if every node earns its place.
- Place visuals inline at the point of relevance, not in a separate section.
- Stay at the plan-structure level only -- unit dependencies, component interactions, mode comparisons, impact surfaces. Not implementation architecture, data schemas, or code structure (those belong in Section 3.4).
- Prose is authoritative: when a visual aid and its surrounding prose disagree, the prose governs.

After generating a visual aid, verify it accurately represents the plan sections it illustrates -- correct dependency edges, no missing surfaces, no merged units.

@@ -118,10 +118,11 @@ describe("ce:plan testing contract", () => {
 
 describe("ce:plan review contract", () => {
   test("requires document review after confidence check", async () => {
-    const content = await readRepoFile("plugins/compound-engineering/skills/ce-plan/SKILL.md")
+    // Document review instructions extracted to references/plan-handoff.md
+    const content = await readRepoFile("plugins/compound-engineering/skills/ce-plan/references/plan-handoff.md")
 
     // Phase 5.3.8 runs document-review before final checks (5.3.9)
-    expect(content).toContain("##### 5.3.8 Document Review")
+    expect(content).toContain("## 5.3.8 Document Review")
     expect(content).toContain("`document-review` skill")
 
     // Document review must come before final checks so auto-applied edits are validated
@@ -130,16 +131,24 @@ describe("ce:plan review contract", () => {
     expect(docReviewIdx).toBeLessThan(finalChecksIdx)
   })
 
-  test("uses headless mode in pipeline context", async () => {
+  test("SKILL.md stub points to plan-handoff reference", async () => {
     const content = await readRepoFile("plugins/compound-engineering/skills/ce-plan/SKILL.md")
 
+    // Stub references the handoff file and marks document review as mandatory
+    expect(content).toContain("`references/plan-handoff.md`")
+    expect(content).toContain("Document review is mandatory")
+  })
+
+  test("uses headless mode in pipeline context", async () => {
+    const content = await readRepoFile("plugins/compound-engineering/skills/ce-plan/references/plan-handoff.md")
+
     // Pipeline mode runs document-review headlessly, not skipping it
     expect(content).toContain("document-review` with `mode:headless`")
     expect(content).not.toContain("skip document-review and return control")
   })
 
   test("handoff options recommend ce:work after review", async () => {
-    const content = await readRepoFile("plugins/compound-engineering/skills/ce-plan/SKILL.md")
+    const content = await readRepoFile("plugins/compound-engineering/skills/ce-plan/references/plan-handoff.md")
 
     // ce:work is recommended (review already happened)
     expect(content).toContain("**Start `/ce:work`** - Begin implementing this plan in the current environment (recommended)")