feat(ce-work): reduce token usage by extracting late-sequence references (#540)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:48:21 -07:00
parent 31b0686c2e
commit bb59547a2e
6 changed files with 543 additions and 285 deletions
--- a/docs/plans/2026-04-09-001-feat-ce-work-token-extraction-plan.md
+++ b/docs/plans/2026-04-09-001-feat-ce-work-token-extraction-plan.md
@@ -0,0 +1,205 @@
+---
+title: "feat(ce-work): reduce token usage by extracting late-sequence references"
+type: feat
+status: completed
+date: 2026-04-09
+---
+
+# feat(ce-work): reduce token usage by extracting late-sequence references
+
+## Overview
+
+Apply the "conditional and late-sequence extraction" pattern (established in PR #489 for ce:plan) to ce:work and ce:work-beta. Both skills carry Phase 3/4 shipping content through the entire Phase 2 execution loop without using it. Extracting this late-sequence content into on-demand reference files eliminates that compounding context cost.
+
+## Problem Frame
+
+ce:work is the longest-running skill in the plugin: a typical execution session involves 20-60+ tool calls across Phases 0-4. Phase 3 (quality check) and Phase 4 (ship it) content, plus the duplicative Quality Checklist and Code Review Tiers summary sections, ride in context for the entire Phase 2 execution loop without being used until the very end. The resulting token cost compounds in proportion to message count.
+
+ce:work-beta already extracted its Codex delegation workflow into `references/codex-delegation-workflow.md` (315 lines), but its Phase 3/4 content has the same late-sequence problem as stable. Both variants benefit from the same extraction.
+
+## Requirements Trace
+
+- R1. Extract late-sequence blocks (Phase 3 + Phase 4 + Quality Checklist + Code Review Tiers) into an on-demand reference file for ce:work
+- R2. Extract the same late-sequence blocks for ce:work-beta
+- R3. Replace extracted blocks with 1-3 line stubs per the AGENTS.md "Conditional and Late-Sequence Extraction" rule
+- R4. Update contract tests to read from reference files where assertions moved
+
+## Scope Boundaries
+
+- Not changing any behavioral content — purely restructuring for token efficiency
+- Not extracting Phase 0, Phase 1, or Phase 2 content (needed during the core execution loop)
+- Not extracting Key Principles or Common Pitfalls (small, general-purpose guidance used throughout)
+- Not extracting ce:work-beta's Argument Parsing or Codex Delegation Mode sections (already handled or needed early)
+- Beta is on a separate evolutionary track from stable — extraction follows the same pattern but the files are independent, not shared
+
+## Context & Research
+
+### Relevant Code and Patterns
+
+- `plugins/compound-engineering/skills/ce-plan/SKILL.md` — established extraction pattern with stub syntax
+- `plugins/compound-engineering/skills/ce-plan/references/plan-handoff.md` — example of late-sequence extraction
+- `plugins/compound-engineering/skills/ce-brainstorm/references/handoff.md` — another late-sequence extraction (ce:brainstorm already did this)
+- `plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md` — beta already uses extraction for its conditional delegation workflow
+- `tests/pipeline-review-contract.test.ts` — existing contract tests for ce:work (lines 9-98) and ce:work-beta (lines 100-219)
+- `plugins/compound-engineering/AGENTS.md` — "Conditional and Late-Sequence Extraction" rule
+
+### Institutional Learnings
+
+- PR #489 validated that extracting ~36% of ce:plan saved ~130,000-167,000 context tokens per session with zero premature reference file reads
+- ce:brainstorm has already applied the same pattern (Phase 3/4 extracted to `references/requirements-capture.md` and `references/handoff.md`)
+
+## Key Technical Decisions
+
+- **Bundle Phase 3 + Phase 4 + Quality Checklist + Code Review Tiers into one reference file**: These are all used at the same point in the workflow (after all Phase 2 tasks complete). The Quality Checklist is "Before creating PR" and Code Review Tiers duplicates Phase 3 Step 2 — they're the same workflow stage. One file is simpler than four. This matches the bundling strategy ce:brainstorm used for its late-sequence content.
+- **Keep Key Principles, Common Pitfalls in SKILL.md**: They're small (~40 lines combined) and provide behavioral guardrails throughout execution. Extracting them saves little and risks execution quality.
+- **Independent reference files for stable and beta**: Per AGENTS.md skill self-containment rules, each skill's references directory is its own unit. Beta already has a `references/` directory with `codex-delegation-workflow.md`; the shipping workflow file goes alongside it. Stable creates its `references/` directory fresh.
+
+## Implementation Units
+
+- [x] **Unit 1: Create `references/shipping-workflow.md` for ce:work**
+
+**Goal:** Extract Phase 3 (Quality Check), Phase 4 (Ship It), Quality Checklist, and Code Review Tiers into a single reference file for the stable skill.
+
+**Requirements:** R1, R3
+
+**Dependencies:** None
+
+**Files:**
+- Create: `plugins/compound-engineering/skills/ce-work/references/shipping-workflow.md`
+- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md`
+
+**Approach:**
+- Move Phase 3 (lines 271-315), Phase 4 (lines 317-374), Quality Checklist (lines 408-423), and Code Review Tiers (lines 425-435) into the new reference file
+- Add a header comment: "This file contains the shipping workflow (Phase 3-4). Load it only when all Phase 2 tasks are complete and execution transitions to quality check."
+- Replace Phase 3 + Phase 4 in SKILL.md with a 2-line stub stating the condition and backtick path reference
+- Remove the standalone Quality Checklist and Code Review Tiers sections at the bottom of SKILL.md (they're consolidated into the reference file)
+
+**Patterns to follow:**
+- `plugins/compound-engineering/skills/ce-plan/references/plan-handoff.md` — late-sequence extraction with header comment and stub pattern
+- `plugins/compound-engineering/skills/ce-brainstorm/references/handoff.md` — same pattern for brainstorm's shipping phase
+
+**Test scenarios:**
+- Happy path: SKILL.md stub contains backtick path to `references/shipping-workflow.md` and states the loading condition
+- Happy path: reference file contains Phase 3 (quality checks, code review, final validation, operational validation plan) and Phase 4 (screenshots, commit/PR, plan status update, notify user) and the quality checklist and code review tiers
+- Edge case: SKILL.md does not contain `gh pr create` — the existing contract test at line 35 continues to pass since this string was never in ce:work SKILL.md
+
+**Verification:**
+- SKILL.md line count decreases by ~130 lines (445 -> ~315)
+- Reference file contains all Phase 3, Phase 4, Quality Checklist, and Code Review Tiers content
+- SKILL.md stub clearly states when to load the reference
+
+---
+
+- [x] **Unit 2: Create `references/shipping-workflow.md` for ce:work-beta**
+
+**Goal:** Extract the same late-sequence shipping content from ce:work-beta into its already-existing references directory, alongside the existing `codex-delegation-workflow.md`.
+
+**Requirements:** R2, R3
+
+**Dependencies:** None (can run in parallel with Unit 1)
+
+**Files:**
+- Create: `plugins/compound-engineering/skills/ce-work-beta/references/shipping-workflow.md`
+- Modify: `plugins/compound-engineering/skills/ce-work-beta/SKILL.md`
+
+**Approach:**
+- Move Phase 3 (lines 336-381), Phase 4 (lines 382-438), Quality Checklist (lines 481-496), and Code Review Tiers (lines 498-508) into the new reference file
+- Same header comment pattern as Unit 1
+- Replace with the same 2-line stub pattern
+- Remove standalone Quality Checklist and Code Review Tiers sections
+- Beta has an additional Phase 2 subsection ("Frontend Design Guidance" at lines 322-328) that stays in SKILL.md since it's used during execution
+- The Codex Delegation Mode stub (lines 442-444) stays untouched — it's a separate extraction
+
+**Sync decision:** Propagating extraction to beta — this is a structural optimization that applies equally to both variants. The shipping workflow content is identical between stable and beta.
+
+**Patterns to follow:**
+- Unit 1 output for stable variant
+- Beta's existing `codex-delegation-workflow.md` extraction as precedent
+
+**Test scenarios:**
+- Happy path: beta SKILL.md stub contains backtick path to `references/shipping-workflow.md`
+- Happy path: beta reference file contains the same Phase 3/4 content as stable's reference
+- Edge case: existing `codex-delegation-workflow.md` reference is untouched
+
+**Verification:**
+- Beta SKILL.md line count decreases by ~130 lines (518 -> ~388)
+- Beta `references/` directory now contains both `codex-delegation-workflow.md` and `shipping-workflow.md`
+
+---
+
+- [x] **Unit 3: Update contract tests**
+
+**Goal:** Update existing contract tests to read assertions from reference files where content moved, and add stub pointer tests.
+
+**Requirements:** R4
+
+**Dependencies:** Unit 1, Unit 2
+
+**Files:**
+- Modify: `tests/pipeline-review-contract.test.ts`
+
+**Approach:**
+
+Tests that need restructuring (some assertions move to reference file, negative assertions may stay on SKILL.md):
+- "requires code review before shipping" (line 10) — positive assertions (`"2. **Code Review**"`, tier names, `ce:review`, `mode:autofix`, quality checklist review line) read from `references/shipping-workflow.md`; negative assertions (`not.toContain("Consider Code Review")`, `not.toContain("Code Review** (Optional)")`) stay reading SKILL.md to confirm extraction completeness
+- "delegates commit and PR to dedicated skills" (line 28) — positive assertions (`git-commit-push-pr`, `git-commit`) read from `references/shipping-workflow.md`; negative assertions (`not.toContain("gh pr create")`) stay reading SKILL.md
+- "ce:work-beta mirrors review and commit delegation" (line 39) — same dual-read pattern from beta's reference and beta's SKILL.md
+- "quality checklist says Testing addressed" (line 66) — positive assertion (`"Testing addressed"`) reads from `references/shipping-workflow.md`; negative assertions (`not.toContain("Tests pass...")`) stay reading SKILL.md
+- "ce:work-beta mirrors testing deliberation and checklist changes" (line 77) — testing deliberation stays reading beta SKILL.md; checklist assertions read from beta reference
+
+Tests that stay unchanged (content not extracted):
+- "includes per-task testing deliberation in execution loop" (line 52) — Phase 2 content, stays in SKILL.md
+- "ce:work remains the stable non-delegating surface" (line 91) — checks SKILL.md absence of delegation content
+- All ce:work-beta delegation contract tests (lines 100-219) — check SKILL.md stubs and delegation reference
+
+New tests to add:
+- Stub pointer test: SKILL.md contains backtick path `references/shipping-workflow.md` (for both stable and beta)
+- Negative test: SKILL.md does not contain `"2. **Code Review**"` directly (confirms extraction, not duplication)
+
+**Patterns to follow:**
+- Lines 283-289 in `tests/pipeline-review-contract.test.ts` — PR #489's stub pointer test pattern (`"SKILL.md stub points to plan-handoff reference"`)
+
+**Test scenarios:**
+- Happy path: all existing ce:work and ce:work-beta contract tests pass after updating file paths
+- Happy path: new stub pointer tests verify both SKILL.md files reference `shipping-workflow.md`
+- Edge case: tests checking Phase 2 content (testing deliberation, delegation routing) still read from SKILL.md unchanged
+
+**Verification:**
+- `bun test tests/pipeline-review-contract.test.ts` passes
+- No contract test makes positive assertions against SKILL.md for content that moved to a reference file (negative assertions stay on SKILL.md to confirm extraction completeness)
+
+## System-Wide Impact
+
+- **Interaction graph:** No behavioral change — content is restructured, not modified. The agent reads the same instructions, just from a reference file instead of inline.
+- **Error propagation:** If the reference file read fails at runtime, the agent lacks shipping instructions. Low risk, since file reads are reliable and the files are co-located in the skill directory.
+- **API surface parity:** Both stable and beta get the same extraction. Beta's existing Codex delegation reference is untouched.
+- **Integration coverage:** Contract tests in `tests/pipeline-review-contract.test.ts` are the primary integration surface.
+- **Unchanged invariants:** Phase 0-2 execution behavior, subagent dispatch, test discovery, and all other execution-time content remain inline and unchanged.
+
+## Risks & Dependencies
+
+| Risk | Mitigation |
+|------|------------|
+| Contract tests break if file paths change | Unit 3 explicitly updates all affected tests |
+| Agent fails to load reference file at the right time | Stub wording follows the validated pattern from PR #489 and ce:brainstorm |
+| Beta-specific content accidentally dropped | Unit 2 only extracts Phase 3/4 content identical to stable; delegation stubs/references are untouched |
+
+## Token Savings Estimate
+
+| Skill | Extraction | Lines | Est. tokens | Loaded when |
+|---|---|---|---|---|
+| ce:work | `references/shipping-workflow.md` | ~130 | ~2,200 | All Phase 2 tasks complete |
+| ce:work-beta | `references/shipping-workflow.md` | ~130 | ~2,200 | All Phase 2 tasks complete |
+
+**ce:work reduction:** 445 lines (~6,500 tokens) -> ~315 lines (~4,600 tokens) — **~29% reduction**
+
+**ce:work-beta reduction:** 518 lines (~7,600 tokens) -> ~388 lines (~5,700 tokens) — **~25% reduction**
+
+**Per-session savings (each skill):** For a typical 40-message execution session:
+- Shipping workflow: ~2,200 tokens x ~32 messages before it's needed = **~70,400 context tokens per session**
+
+## Sources & References
+
+- Related PRs: #489 (ce:plan extraction — established the pattern)
+- Related code: `plugins/compound-engineering/AGENTS.md` (extraction rule)
+- Precedent: ce:brainstorm already applied this pattern to its Phase 3/4 content
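The arithmetic behind the plan's Token Savings Estimate can be reproduced with a short sketch. The helper names `deferredContextSavings` and `reductionPercent` are hypothetical and exist only for illustration; the inputs are the plan's own estimates (~2,200 tokens, ~32 messages, 445/315 and 518/388 lines), not measurements.

```typescript
// Hypothetical helpers for illustration only; not part of the repository.

// Context tokens saved by not carrying a reference file through the
// messages that precede the point where it is actually loaded.
function deferredContextSavings(referenceTokens: number, messagesBeforeLoad: number): number {
  return referenceTokens * messagesBeforeLoad;
}

// Percentage reduction in SKILL.md size, rounded to the nearest integer.
function reductionPercent(beforeLines: number, afterLines: number): number {
  return Math.round(((beforeLines - afterLines) / beforeLines) * 100);
}

// Inputs are the plan's estimates, assumed rather than measured.
const perSessionSavings = deferredContextSavings(2200, 32);
const stableReduction = reductionPercent(445, 315);
const betaReduction = reductionPercent(518, 388);

console.log(perSessionSavings); // 70400
console.log(stableReduction);   // 29
console.log(betaReduction);     // 25
```

The model assumes every message before the load point would otherwise have carried the full extracted content, so actual savings depend on how many messages a given session needs before Phase 3 begins.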
--- a/plugins/compound-engineering/skills/ce-work-beta/SKILL.md
+++ b/plugins/compound-engineering/skills/ce-work-beta/SKILL.md
@@ -333,109 +333,9 @@ Determine how to proceed based on what was provided in `<input_document>`.
   - Create new tasks if scope expands
   - Keep user informed of major milestones

-### Phase 3: Quality Check
+### Phase 3-4: Quality Check and Ship It

-1. **Run Core Quality Checks**
-
-   Always run before submitting:
-
-   ```bash
-   # Run full test suite (use project's test command)
-   # Examples: bin/rails test, npm test, pytest, go test, etc.
-
-   # Run linting (per AGENTS.md)
-   # Use linting-agent before pushing to origin
-   ```
-
-2. **Code Review** (REQUIRED)
-
-   Every change gets reviewed before shipping. The depth scales with the change's risk profile, but review itself is never skipped.
-
-   **Tier 2: Full review (default)** — REQUIRED unless Tier 1 criteria are explicitly met. Invoke the `ce:review` skill with `mode:autofix` to run specialized reviewer agents, auto-apply safe fixes, and surface residual work as todos. When the plan file path is known, pass it as `plan:<path>`. This is the mandatory default — proceed to Tier 1 only after confirming every criterion below.
-
-   **Tier 1: Inline self-review** — A lighter alternative permitted only when **all four** criteria are true. Before choosing Tier 1, explicitly state which criteria apply and why. If any criterion is uncertain, use Tier 2.
-   - Purely additive (new files only, no existing behavior modified)
-   - Single concern (one skill, one component — not cross-cutting)
-   - Pattern-following (implementation mirrors an existing example with no novel logic)
-   - Plan-faithful (no scope growth, no deferred questions resolved with surprising answers)
-
-3. **Final Validation**
-   - All tasks marked completed
-   - Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)
-   - Linting passes
-   - Code follows existing patterns
-   - Figma designs match (if applicable)
-   - No console errors or warnings
-   - If the plan has a `Requirements Trace`, verify each requirement is satisfied by the completed work
-   - If any `Deferred to Implementation` questions were noted, confirm they were resolved during execution
-
-4. **Prepare Operational Validation Plan** (REQUIRED)
-   - Add a `## Post-Deploy Monitoring & Validation` section to the PR description for every change.
-   - Include concrete:
-     - Log queries/search terms
-     - Metrics or dashboards to watch
-     - Expected healthy signals
-     - Failure signals and rollback/mitigation trigger
-     - Validation window and owner
-   - If there is truly no production/runtime impact, still include the section with: `No additional operational monitoring required` and a one-line reason.
-
-### Phase 4: Ship It
-
-1. **Capture and Upload Screenshots for UI Changes** (REQUIRED for any UI work)
-
-   For **any** design changes, new views, or UI modifications, capture and upload screenshots before creating the PR:
-
-   **Step 1: Start dev server** (if not running)
-   ```bash
-   bin/dev  # Run in background
-   ```
-
-   **Step 2: Capture screenshots with agent-browser CLI**
-   ```bash
-   agent-browser open http://localhost:3000/[route]
-   agent-browser snapshot -i
-   agent-browser screenshot output.png
-   ```
-   See the `agent-browser` skill for detailed usage.
-
-   **Step 3: Upload using imgup skill**
-   ```bash
-   skill: imgup
-   # Then upload each screenshot:
-   imgup -h pixhost screenshot.png  # pixhost works without API key
-   # Alternative hosts: catbox, imagebin, beeimg
-   ```
-
-   **What to capture:**
-   - **New screens**: Screenshot of the new UI
-   - **Modified screens**: Before AND after screenshots
-   - **Design implementation**: Screenshot showing Figma design match
-
-2. **Commit and Create Pull Request**
-
-   Load the `git-commit-push-pr` skill to handle committing, pushing, and PR creation. The skill handles convention detection, branch safety, logical commit splitting, adaptive PR descriptions, and attribution badges.
-
-   When providing context for the PR description, include:
-   - The plan's summary and key decisions
-   - Testing notes (tests added/modified, manual testing performed)
-   - Screenshot URLs from step 1 (if applicable)
-   - Figma design link (if applicable)
-   - The Post-Deploy Monitoring & Validation section (see Phase 3 Step 4)
-
-   If the user prefers to commit without creating a PR, load the `git-commit` skill instead.
-
-3. **Update Plan Status**
-
-   If the input document has YAML frontmatter with a `status` field, update it to `completed`:
-   ```
-   status: active  →  status: completed
-   ```
-
-4. **Notify User**
-   - Summarize what was completed
-   - Link to PR (if one was created)
-   - Note any follow-up work needed
-   - Suggest next steps if applicable
+When all Phase 2 tasks are complete and execution transitions to quality check, read `references/shipping-workflow.md` for the full shipping workflow: quality checks, code review, final validation, PR creation, and notification.

 ---

@@ -478,35 +378,6 @@ When `delegation_active` is true after argument parsing, read `references/codex-
 - Don't leave features 80% done
 - A finished feature that ships beats a perfect feature that doesn't

-## Quality Checklist
-
-Before creating PR, verify:
-
-- [ ] All clarifying questions asked and answered
-- [ ] All tasks marked completed
-- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)
-- [ ] Linting passes (use linting-agent)
-- [ ] Code follows existing patterns
-- [ ] Figma designs match implementation (if applicable)
-- [ ] Before/after screenshots captured and uploaded (for UI changes)
-- [ ] Commit messages follow conventional format
-- [ ] PR description includes Post-Deploy Monitoring & Validation section (or explicit no-impact rationale)
-- [ ] Code review completed (inline self-review or full `ce:review`)
-- [ ] PR description includes summary, testing notes, and screenshots
-- [ ] PR description includes Compound Engineered badge with accurate model and harness
-
-## Code Review Tiers
-
-Every change gets reviewed. The tier determines depth, not whether review happens.
-
-**Tier 2 (full review)** — REQUIRED default. Invoke `ce:review mode:autofix` with `plan:<path>` when available. Safe fixes are applied automatically; residual work surfaces as todos. Always use this tier unless all four Tier 1 criteria are explicitly confirmed.
-
-**Tier 1 (inline self-review)** — permitted only when all four are true (state each explicitly before choosing):
-- Purely additive (new files only, no existing behavior modified)
-- Single concern (one skill, one component — not cross-cutting)
-- Pattern-following (mirrors an existing example, no novel logic)
-- Plan-faithful (no scope growth, no surprising deferred-question resolutions)
-
 ## Common Pitfalls to Avoid

 - **Analysis paralysis** - Don't overthink, read the plan and execute
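The dual-read pattern that the plan describes for the contract tests (a stub pointer assertion plus a negative assertion on SKILL.md) might look roughly like the sketch below. `checkExtraction` and the sample SKILL.md body are hypothetical; the real assertions live in `tests/pipeline-review-contract.test.ts` and read the files from disk.

```typescript
// Hypothetical sketch; the actual contract tests read files from disk.
interface ExtractionCheck {
  stubPointsToReference: boolean; // SKILL.md names the reference in backticks
  contentStillInline: boolean;    // extracted content was left behind in SKILL.md
}

function checkExtraction(
  skillBody: string,
  referencePath: string,
  movedMarker: string,
): ExtractionCheck {
  return {
    stubPointsToReference: skillBody.includes("`" + referencePath + "`"),
    contentStillInline: skillBody.includes(movedMarker),
  };
}

// Abbreviated sample of the stub text, assumed for illustration.
const sampleSkillBody = [
  "### Phase 3-4: Quality Check and Ship It",
  "",
  "When all Phase 2 tasks are complete, read `references/shipping-workflow.md` for the full shipping workflow.",
].join("\n");

const extraction = checkExtraction(
  sampleSkillBody,
  "references/shipping-workflow.md",
  "2. **Code Review**",
);

console.log(extraction.stubPointsToReference); // true
console.log(extraction.contentStillInline);    // false
```

Checking both directions matters: the pointer test alone would still pass if the content were duplicated rather than moved.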
--- a/plugins/compound-engineering/skills/ce-work-beta/references/shipping-workflow.md
+++ b/plugins/compound-engineering/skills/ce-work-beta/references/shipping-workflow.md
@@ -0,0 +1,136 @@
+# Shipping Workflow
+
+This file contains the shipping workflow (Phase 3-4). Load it only when all Phase 2 tasks are complete and execution transitions to quality check.
+
+## Phase 3: Quality Check
+
+1. **Run Core Quality Checks**
+
+   Always run before submitting:
+
+   ```bash
+   # Run full test suite (use project's test command)
+   # Examples: bin/rails test, npm test, pytest, go test, etc.
+
+   # Run linting (per AGENTS.md)
+   # Use linting-agent before pushing to origin
+   ```
+
+2. **Code Review** (REQUIRED)
+
+   Every change gets reviewed before shipping. The depth scales with the change's risk profile, but review itself is never skipped.
+
+   **Tier 2: Full review (default)** -- REQUIRED unless Tier 1 criteria are explicitly met. Invoke the `ce:review` skill with `mode:autofix` to run specialized reviewer agents, auto-apply safe fixes, and surface residual work as todos. When the plan file path is known, pass it as `plan:<path>`. This is the mandatory default -- proceed to Tier 1 only after confirming every criterion below.
+
+   **Tier 1: Inline self-review** -- A lighter alternative permitted only when **all four** criteria are true. Before choosing Tier 1, explicitly state which criteria apply and why. If any criterion is uncertain, use Tier 2.
+   - Purely additive (new files only, no existing behavior modified)
+   - Single concern (one skill, one component -- not cross-cutting)
+   - Pattern-following (implementation mirrors an existing example with no novel logic)
+   - Plan-faithful (no scope growth, no deferred questions resolved with surprising answers)
+
+3. **Final Validation**
+   - All tasks marked completed
+   - Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)
+   - Linting passes
+   - Code follows existing patterns
+   - Figma designs match (if applicable)
+   - No console errors or warnings
+   - If the plan has a `Requirements Trace`, verify each requirement is satisfied by the completed work
+   - If any `Deferred to Implementation` questions were noted, confirm they were resolved during execution
+
+4. **Prepare Operational Validation Plan** (REQUIRED)
+   - Add a `## Post-Deploy Monitoring & Validation` section to the PR description for every change.
+   - Include concrete:
+     - Log queries/search terms
+     - Metrics or dashboards to watch
+     - Expected healthy signals
+     - Failure signals and rollback/mitigation trigger
+     - Validation window and owner
+   - If there is truly no production/runtime impact, still include the section with: `No additional operational monitoring required` and a one-line reason.
+
+## Phase 4: Ship It
+
+1. **Capture and Upload Screenshots for UI Changes** (REQUIRED for any UI work)
+
+   For **any** design changes, new views, or UI modifications, capture and upload screenshots before creating the PR:
+
+   **Step 1: Start dev server** (if not running)
+   ```bash
+   bin/dev  # Run in background
+   ```
+
+   **Step 2: Capture screenshots with agent-browser CLI**
+   ```bash
+   agent-browser open http://localhost:3000/[route]
+   agent-browser snapshot -i
+   agent-browser screenshot output.png
+   ```
+   See the `agent-browser` skill for detailed usage.
+
+   **Step 3: Upload using imgup skill**
+   ```bash
+   skill: imgup
+   # Then upload each screenshot:
+   imgup -h pixhost screenshot.png  # pixhost works without API key
+   # Alternative hosts: catbox, imagebin, beeimg
+   ```
+
+   **What to capture:**
+   - **New screens**: Screenshot of the new UI
+   - **Modified screens**: Before AND after screenshots
+   - **Design implementation**: Screenshot showing Figma design match
+
+2. **Update Plan Status**
+
+   If the input document has YAML frontmatter with a `status` field, update it to `completed`:
+   ```
+   status: active  ->  status: completed
+   ```
+
+3. **Commit and Create Pull Request**
+
+   Load the `git-commit-push-pr` skill to handle committing, pushing, and PR creation. The skill handles convention detection, branch safety, logical commit splitting, adaptive PR descriptions, and attribution badges.
+
+   When providing context for the PR description, include:
+   - The plan's summary and key decisions
+   - Testing notes (tests added/modified, manual testing performed)
+   - Screenshot URLs from step 1 (if applicable)
+   - Figma design link (if applicable)
+   - The Post-Deploy Monitoring & Validation section (see Phase 3 Step 4)
+
+   If the user prefers to commit without creating a PR, load the `git-commit` skill instead.
+
+4. **Notify User**
+   - Summarize what was completed
+   - Link to PR (if one was created)
+   - Note any follow-up work needed
+   - Suggest next steps if applicable
+
+## Quality Checklist
+
+Before creating PR, verify:
+
+- [ ] All clarifying questions asked and answered
+- [ ] All tasks marked completed
+- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)
+- [ ] Linting passes (use linting-agent)
+- [ ] Code follows existing patterns
+- [ ] Figma designs match implementation (if applicable)
+- [ ] Before/after screenshots captured and uploaded (for UI changes)
+- [ ] Commit messages follow conventional format
+- [ ] PR description includes Post-Deploy Monitoring & Validation section (or explicit no-impact rationale)
+- [ ] Code review completed (inline self-review or full `ce:review`)
+- [ ] PR description includes summary, testing notes, and screenshots
+- [ ] PR description includes Compound Engineered badge with accurate model and harness
+
+## Code Review Tiers
+
+Every change gets reviewed. The tier determines depth, not whether review happens.
+
+**Tier 2 (full review)** -- REQUIRED default. Invoke `ce:review mode:autofix` with `plan:<path>` when available. Safe fixes are applied automatically; residual work surfaces as todos. Always use this tier unless all four Tier 1 criteria are explicitly confirmed.
+
+**Tier 1 (inline self-review)** -- permitted only when all four are true (state each explicitly before choosing):
+- Purely additive (new files only, no existing behavior modified)
+- Single concern (one skill, one component -- not cross-cutting)
+- Pattern-following (mirrors an existing example, no novel logic)
+- Plan-faithful (no scope growth, no surprising deferred-question resolutions)
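The tier rule in the Code Review Tiers section above (Tier 2 by default, Tier 1 only when all four criteria are explicitly confirmed, and Tier 2 whenever any criterion is uncertain) can be expressed as a small decision function. This is an illustrative sketch only: the type and function names are hypothetical, and the skill itself applies the rule through agent instructions rather than code.

```typescript
// Hypothetical model of the tier rule; names are for illustration only.
type CriterionState = "met" | "unmet" | "uncertain";

interface Tier1Criteria {
  purelyAdditive: CriterionState;   // new files only, no existing behavior modified
  singleConcern: CriterionState;    // one skill or component, not cross-cutting
  patternFollowing: CriterionState; // mirrors an existing example, no novel logic
  planFaithful: CriterionState;     // no scope growth or surprising resolutions
}

// Tier 2 (full review) is the default. Tier 1 (inline self-review) is
// allowed only when every criterion is explicitly met; "uncertain"
// counts the same as "unmet".
function chooseReviewTier(criteria: Tier1Criteria): 1 | 2 {
  const states: CriterionState[] = [
    criteria.purelyAdditive,
    criteria.singleConcern,
    criteria.patternFollowing,
    criteria.planFaithful,
  ];
  return states.every((state) => state === "met") ? 1 : 2;
}

const docsOnlyChange: Tier1Criteria = {
  purelyAdditive: "met",
  singleConcern: "met",
  patternFollowing: "met",
  planFaithful: "met",
};

console.log(chooseReviewTier(docsOnlyChange)); // 1
console.log(chooseReviewTier({ ...docsOnlyChange, planFaithful: "uncertain" })); // 2
```

Treating "uncertain" as a failure keeps the default conservative, which matches the instruction to fall back to Tier 2 when in doubt.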
--- a/plugins/compound-engineering/skills/ce-work/SKILL.md
+++ b/plugins/compound-engineering/skills/ce-work/SKILL.md
@@ -268,109 +268,9 @@ Determine how to proceed based on what was provided in `<input_document>`.
   - Create new tasks if scope expands
   - Keep user informed of major milestones

-### Phase 3: Quality Check
+### Phase 3-4: Quality Check and Ship It

-1. **Run Core Quality Checks**
-
-   Always run before submitting:
-
-   ```bash
-   # Run full test suite (use project's test command)
-   # Examples: bin/rails test, npm test, pytest, go test, etc.
-
-   # Run linting (per AGENTS.md)
-   # Use linting-agent before pushing to origin
-   ```
-
-2. **Code Review** (REQUIRED)
-
-   Every change gets reviewed before shipping. The depth scales with the change's risk profile, but review itself is never skipped.
-
-   **Tier 2: Full review (default)** — REQUIRED unless Tier 1 criteria are explicitly met. Invoke the `ce:review` skill with `mode:autofix` to run specialized reviewer agents, auto-apply safe fixes, and surface residual work as todos. When the plan file path is known, pass it as `plan:<path>`. This is the mandatory default — proceed to Tier 1 only after confirming every criterion below.
-
-   **Tier 1: Inline self-review** — A lighter alternative permitted only when **all four** criteria are true. Before choosing Tier 1, explicitly state which criteria apply and why. If any criterion is uncertain, use Tier 2.
-   - Purely additive (new files only, no existing behavior modified)
-   - Single concern (one skill, one component — not cross-cutting)
-   - Pattern-following (implementation mirrors an existing example with no novel logic)
-   - Plan-faithful (no scope growth, no deferred questions resolved with surprising answers)
-
-3. **Final Validation**
-   - All tasks marked completed
-   - Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)
-   - Linting passes
-   - Code follows existing patterns
-   - Figma designs match (if applicable)
-   - No console errors or warnings
-   - If the plan has a `Requirements Trace`, verify each requirement is satisfied by the completed work
-   - If any `Deferred to Implementation` questions were noted, confirm they were resolved during execution
-
-4. **Prepare Operational Validation Plan** (REQUIRED)
-   - Add a `## Post-Deploy Monitoring & Validation` section to the PR description for every change.
-   - Include concrete:
-     - Log queries/search terms
-     - Metrics or dashboards to watch
-     - Expected healthy signals
-     - Failure signals and rollback/mitigation trigger
-     - Validation window and owner
-   - If there is truly no production/runtime impact, still include the section with: `No additional operational monitoring required` and a one-line reason.
-
-### Phase 4: Ship It
-
-1. **Capture and Upload Screenshots for UI Changes** (REQUIRED for any UI work)
-
-   For **any** design changes, new views, or UI modifications, capture and upload screenshots before creating the PR:
-
-   **Step 1: Start dev server** (if not running)
-   ```bash
-   bin/dev  # Run in background
-   ```
-
-   **Step 2: Capture screenshots with agent-browser CLI**
-   ```bash
-   agent-browser open http://localhost:3000/[route]
-   agent-browser snapshot -i
-   agent-browser screenshot output.png
-   ```
-   See the `agent-browser` skill for detailed usage.
-
-   **Step 3: Upload using imgup skill**
-   ```bash
-   skill: imgup
-   # Then upload each screenshot:
-   imgup -h pixhost screenshot.png  # pixhost works without API key
-   # Alternative hosts: catbox, imagebin, beeimg
-   ```
-
-   **What to capture:**
-   - **New screens**: Screenshot of the new UI
-   - **Modified screens**: Before AND after screenshots
-   - **Design implementation**: Screenshot showing Figma design match
-
-2. **Commit and Create Pull Request**
-
-   Load the `git-commit-push-pr` skill to handle committing, pushing, and PR creation. The skill handles convention detection, branch safety, logical commit splitting, adaptive PR descriptions, and attribution badges.
-
-   When providing context for the PR description, include:
-   - The plan's summary and key decisions
-   - Testing notes (tests added/modified, manual testing performed)
-   - Screenshot URLs from step 1 (if applicable)
-   - Figma design link (if applicable)
-   - The Post-Deploy Monitoring & Validation section (see Phase 3 Step 4)
-
-   If the user prefers to commit without creating a PR, load the `git-commit` skill instead.
-
-3. **Update Plan Status**
-
-   If the input document has YAML frontmatter with a `status` field, update it to `completed`:
-   ```
-   status: active  →  status: completed
-   ```
-
-4. **Notify User**
-   - Summarize what was completed
-   - Link to PR (if one was created)
-   - Note any follow-up work needed
-   - Suggest next steps if applicable
+When all Phase 2 tasks are complete and execution transitions to quality check, read `references/shipping-workflow.md` for the full shipping workflow: quality checks, code review, final validation, PR creation, and notification.

 ## Key Principles

@@ -405,35 +305,6 @@ Determine how to proceed based on what was provided in `<input_document>`.
 - Don't leave features 80% done
 - A finished feature that ships beats a perfect feature that doesn't

-## Quality Checklist
-
-Before creating PR, verify:
-
-- [ ] All clarifying questions asked and answered
-- [ ] All tasks marked completed
-- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)
-- [ ] Linting passes (use linting-agent)
-- [ ] Code follows existing patterns
-- [ ] Figma designs match implementation (if applicable)
-- [ ] Before/after screenshots captured and uploaded (for UI changes)
-- [ ] Commit messages follow conventional format
-- [ ] PR description includes Post-Deploy Monitoring & Validation section (or explicit no-impact rationale)
-- [ ] Code review completed (inline self-review or full `ce:review`)
-- [ ] PR description includes summary, testing notes, and screenshots
-- [ ] PR description includes Compound Engineered badge with accurate model and harness
-
-## Code Review Tiers
-
-Every change gets reviewed. The tier determines depth, not whether review happens.
-
-**Tier 2 (full review)** — REQUIRED default. Invoke `ce:review mode:autofix` with `plan:<path>` when available. Safe fixes are applied automatically; residual work surfaces as todos. Always use this tier unless all four Tier 1 criteria are explicitly confirmed.
-
-**Tier 1 (inline self-review)** — permitted only when all four are true (state each explicitly before choosing):
-- Purely additive (new files only, no existing behavior modified)
-- Single concern (one skill, one component — not cross-cutting)
-- Pattern-following (mirrors an existing example, no novel logic)
-- Plan-faithful (no scope growth, no surprising deferred-question resolutions)
-
 ## Common Pitfalls to Avoid

 - **Analysis paralysis** - Don't overthink, read the plan and execute
--- a/plugins/compound-engineering/skills/ce-work/references/shipping-workflow.md
+++ b/plugins/compound-engineering/skills/ce-work/references/shipping-workflow.md
@@ -0,0 +1,136 @@
+# Shipping Workflow
+
+This file contains the shipping workflow (Phase 3-4). Load it only when all Phase 2 tasks are complete and execution transitions to quality check.
+
+## Phase 3: Quality Check
+
+1. **Run Core Quality Checks**
+
+   Always run before submitting:
+
+   ```bash
+   # Run full test suite (use project's test command)
+   # Examples: bin/rails test, npm test, pytest, go test, etc.
+
+   # Run linting (per AGENTS.md)
+   # Use linting-agent before pushing to origin
+   ```
+
+2. **Code Review** (REQUIRED)
+
+   Every change gets reviewed before shipping. The depth scales with the change's risk profile, but review itself is never skipped.
+
+   **Tier 2: Full review (default)** -- REQUIRED unless Tier 1 criteria are explicitly met. Invoke the `ce:review` skill with `mode:autofix` to run specialized reviewer agents, auto-apply safe fixes, and surface residual work as todos. When the plan file path is known, pass it as `plan:<path>`. This is the mandatory default -- proceed to Tier 1 only after confirming every criterion below.
+
+   **Tier 1: Inline self-review** -- A lighter alternative permitted only when **all four** criteria are true. Before choosing Tier 1, explicitly state which criteria apply and why. If any criterion is uncertain, use Tier 2.
+   - Purely additive (new files only, no existing behavior modified)
+   - Single concern (one skill, one component -- not cross-cutting)
+   - Pattern-following (implementation mirrors an existing example with no novel logic)
+   - Plan-faithful (no scope growth, no deferred questions resolved with surprising answers)
+
+3. **Final Validation**
+   - All tasks marked completed
+   - Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)
+   - Linting passes
+   - Code follows existing patterns
+   - Figma designs match (if applicable)
+   - No console errors or warnings
+   - If the plan has a `Requirements Trace`, verify each requirement is satisfied by the completed work
+   - If any `Deferred to Implementation` questions were noted, confirm they were resolved during execution
+
+4. **Prepare Operational Validation Plan** (REQUIRED)
+   - Add a `## Post-Deploy Monitoring & Validation` section to the PR description for every change.
+   - Include concrete:
+     - Log queries/search terms
+     - Metrics or dashboards to watch
+     - Expected healthy signals
+     - Failure signals and rollback/mitigation trigger
+     - Validation window and owner
+   - If there is truly no production/runtime impact, still include the section with: `No additional operational monitoring required` and a one-line reason.
+
+## Phase 4: Ship It
+
+1. **Capture and Upload Screenshots for UI Changes** (REQUIRED for any UI work)
+
+   For **any** design changes, new views, or UI modifications, capture and upload screenshots before creating the PR:
+
+   **Step 1: Start dev server** (if not running)
+   ```bash
+   bin/dev  # Run in background
+   ```
+
+   **Step 2: Capture screenshots with agent-browser CLI**
+   ```bash
+   agent-browser open http://localhost:3000/[route]
+   agent-browser snapshot -i
+   agent-browser screenshot output.png
+   ```
+   See the `agent-browser` skill for detailed usage.
+
+   **Step 3: Upload using imgup skill**
+   ```bash
+   skill: imgup
+   # Then upload each screenshot:
+   imgup -h pixhost screenshot.png  # pixhost works without API key
+   # Alternative hosts: catbox, imagebin, beeimg
+   ```
+
+   **What to capture:**
+   - **New screens**: Screenshot of the new UI
+   - **Modified screens**: Before AND after screenshots
+   - **Design implementation**: Screenshot showing Figma design match
+
+2. **Update Plan Status**
+
+   If the input document has YAML frontmatter with a `status` field, update it to `completed`:
+   ```
+   status: active  ->  status: completed
+   ```
+
+3. **Commit and Create Pull Request**
+
+   Load the `git-commit-push-pr` skill to handle committing, pushing, and PR creation. The skill handles convention detection, branch safety, logical commit splitting, adaptive PR descriptions, and attribution badges.
+
+   When providing context for the PR description, include:
+   - The plan's summary and key decisions
+   - Testing notes (tests added/modified, manual testing performed)
+   - Screenshot URLs from step 1 (if applicable)
+   - Figma design link (if applicable)
+   - The Post-Deploy Monitoring & Validation section (see Phase 3 Step 4)
+
+   If the user prefers to commit without creating a PR, load the `git-commit` skill instead.
+
+4. **Notify User**
+   - Summarize what was completed
+   - Link to PR (if one was created)
+   - Note any follow-up work needed
+   - Suggest next steps if applicable
+
+## Quality Checklist
+
+Before creating PR, verify:
+
+- [ ] All clarifying questions asked and answered
+- [ ] All tasks marked completed
+- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)
+- [ ] Linting passes (use linting-agent)
+- [ ] Code follows existing patterns
+- [ ] Figma designs match implementation (if applicable)
+- [ ] Before/after screenshots captured and uploaded (for UI changes)
+- [ ] Commit messages follow conventional format
+- [ ] PR description includes Post-Deploy Monitoring & Validation section (or explicit no-impact rationale)
+- [ ] Code review completed (inline self-review or full `ce:review`)
+- [ ] PR description includes summary, testing notes, and screenshots
+- [ ] PR description includes Compound Engineered badge with accurate model and harness
+
+## Code Review Tiers
+
+Every change gets reviewed. The tier determines depth, not whether review happens.
+
+**Tier 2 (full review)** -- REQUIRED default. Invoke `ce:review mode:autofix` with `plan:<path>` when available. Safe fixes are applied automatically; residual work surfaces as todos. Always use this tier unless all four Tier 1 criteria are explicitly confirmed.
+
+**Tier 1 (inline self-review)** -- permitted only when all four are true (state each explicitly before choosing):
+- Purely additive (new files only, no existing behavior modified)
+- Single concern (one skill, one component -- not cross-cutting)
+- Pattern-following (mirrors an existing example, no novel logic)
+- Plan-faithful (no scope growth, no surprising deferred-question resolutions)
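Reviewer note (not part of the commit): the tier rule at the end of `shipping-workflow.md` can be read as a simple predicate, sketched below in TypeScript. The names `Tier1Criteria`, `ReviewTier`, and `chooseReviewTier` are hypothetical and do not exist in the repository. The sketch assumes each criterion can be recorded as an explicit boolean, and that a missing or uncertain criterion counts as unconfirmed.

```typescript
// Hypothetical shape: one flag per Tier 1 criterion listed in the workflow.
interface Tier1Criteria {
  purelyAdditive: boolean
  singleConcern: boolean
  patternFollowing: boolean
  planFaithful: boolean
}

type ReviewTier = "tier1-inline-self-review" | "tier2-full-review"

// Partial input models uncertainty: any criterion that is absent or false
// falls back to Tier 2, which the workflow treats as the required default.
function chooseReviewTier(criteria: Partial<Tier1Criteria>): ReviewTier {
  const allConfirmed =
    criteria.purelyAdditive === true &&
    criteria.singleConcern === true &&
    criteria.patternFollowing === true &&
    criteria.planFaithful === true
  return allConfirmed ? "tier1-inline-self-review" : "tier2-full-review"
}
```

Defaulting to Tier 2 on any gap mirrors the wording "If any criterion is uncertain, use Tier 2"; both outcomes still involve a review, only the depth differs.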
--- a/tests/pipeline-review-contract.test.ts
+++ b/tests/pipeline-review-contract.test.ts
@@ -9,27 +9,34 @@ async function readRepoFile(relativePath: string): Promise<string> {
 describe("ce:work review contract", () => {
  test("requires code review before shipping", async () => {
    const content = await readRepoFile("plugins/compound-engineering/skills/ce-work/SKILL.md")
+    // Review content extracted to references/shipping-workflow.md
+    const shipping = await readRepoFile("plugins/compound-engineering/skills/ce-work/references/shipping-workflow.md")

-    // Phase 3 has a mandatory code review step (not optional)
-    expect(content).toContain("2. **Code Review**")
+    // SKILL.md should not contain extracted content
+    expect(content).not.toContain("2. **Code Review**")
    expect(content).not.toContain("Consider Code Review")
    expect(content).not.toContain("Code Review** (Optional)")

-    // Two-tier rubric
-    expect(content).toContain("**Tier 1: Inline self-review**")
-    expect(content).toContain("**Tier 2: Full review (default)**")
-    expect(content).toContain("ce:review")
-    expect(content).toContain("mode:autofix")
+    // Phase 3 has a mandatory code review step in the reference file
+    expect(shipping).toContain("2. **Code Review**")
+
+    // Two-tier rubric in reference file
+    expect(shipping).toContain("**Tier 1: Inline self-review**")
+    expect(shipping).toContain("**Tier 2: Full review (default)**")
+    expect(shipping).toContain("ce:review")
+    expect(shipping).toContain("mode:autofix")

    // Quality checklist includes review
-    expect(content).toContain("Code review completed (inline self-review or full `ce:review`)")
+    expect(shipping).toContain("Code review completed (inline self-review or full `ce:review`)")
  })

  test("delegates commit and PR to dedicated skills", async () => {
    const content = await readRepoFile("plugins/compound-engineering/skills/ce-work/SKILL.md")
+    // Commit/PR delegation content extracted to references/shipping-workflow.md
+    const shipping = await readRepoFile("plugins/compound-engineering/skills/ce-work/references/shipping-workflow.md")

-    expect(content).toContain("`git-commit-push-pr` skill")
-    expect(content).toContain("`git-commit` skill")
+    expect(shipping).toContain("`git-commit-push-pr` skill")
+    expect(shipping).toContain("`git-commit` skill")

    // Should not contain inline PR templates or attribution placeholders
    expect(content).not.toContain("gh pr create")
@@ -38,14 +45,16 @@ describe("ce:work review contract", () => {

  test("ce:work-beta mirrors review and commit delegation", async () => {
    const beta = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md")
+    // Review/commit content extracted to references/shipping-workflow.md
+    const shipping = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/references/shipping-workflow.md")

-    // Both have mandatory review
-    expect(beta).toContain("2. **Code Review**")
+    // Extracted content in reference file
+    expect(shipping).toContain("2. **Code Review**")
+    expect(shipping).toContain("`git-commit-push-pr` skill")
+    expect(shipping).toContain("`git-commit` skill")
+
+    // Negative assertions stay on SKILL.md
    expect(beta).not.toContain("Consider Code Review")
-
-    // Both delegate to git skills
-    expect(beta).toContain("`git-commit-push-pr` skill")
-    expect(beta).toContain("`git-commit` skill")
    expect(beta).not.toContain("gh pr create")
  })

@@ -65,27 +74,57 @@ describe("ce:work review contract", () => {

  test("quality checklist says 'Testing addressed' not 'Tests pass'", async () => {
    const content = await readRepoFile("plugins/compound-engineering/skills/ce-work/SKILL.md")
+    // Quality checklist extracted to references/shipping-workflow.md
+    const shipping = await readRepoFile("plugins/compound-engineering/skills/ce-work/references/shipping-workflow.md")

-    // New language present
-    expect(content).toContain("Testing addressed")
+    // New language present in reference file
+    expect(shipping).toContain("Testing addressed")

-    // Old language fully removed
+    // Old language fully removed from both
    expect(content).not.toContain("Tests pass (run project's test command)")
    expect(content).not.toContain("- All tests pass")
+    expect(shipping).not.toContain("Tests pass (run project's test command)")
  })

  test("ce:work-beta mirrors testing deliberation and checklist changes", async () => {
    const beta = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md")
+    // Checklist extracted to references/shipping-workflow.md
+    const shipping = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/references/shipping-workflow.md")

-    // Testing deliberation in loop
+    // Testing deliberation stays in SKILL.md (Phase 2 content)
    expect(beta).toContain("Assess testing coverage")

-    // New checklist language
-    expect(beta).toContain("Testing addressed")
+    // New checklist language in reference file
+    expect(shipping).toContain("Testing addressed")

-    // Old language removed
+    // Old language removed from both
    expect(beta).not.toContain("Tests pass (run project's test command)")
    expect(beta).not.toContain("- All tests pass")
+    expect(shipping).not.toContain("Tests pass (run project's test command)")
+  })
+
+  test("SKILL.md stub points to shipping-workflow reference", async () => {
+    const content = await readRepoFile("plugins/compound-engineering/skills/ce-work/SKILL.md")
+
+    // Stub references the shipping-workflow file
+    expect(content).toContain("`references/shipping-workflow.md`")
+
+    // Extracted content is not in SKILL.md
+    expect(content).not.toContain("2. **Code Review**")
+    expect(content).not.toContain("## Quality Checklist")
+    expect(content).not.toContain("## Code Review Tiers")
+  })
+
+  test("ce:work-beta SKILL.md stub points to shipping-workflow reference", async () => {
+    const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md")
+
+    // Stub references the shipping-workflow file
+    expect(content).toContain("`references/shipping-workflow.md`")
+
+    // Extracted content is not in SKILL.md
+    expect(content).not.toContain("2. **Code Review**")
+    expect(content).not.toContain("## Quality Checklist")
+    expect(content).not.toContain("## Code Review Tiers")
  })

  test("ce:work remains the stable non-delegating surface", async () => {