feat(ce-work-beta): add beta Codex delegation mode (#476)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 00:29:12 -07:00
parent 044a035e77
commit 31b0686c2e
15 changed files with 1549 additions and 1889 deletions
--- a/plugins/compound-engineering/skills/ce-plan/SKILL.md
+++ b/plugins/compound-engineering/skills/ce-plan/SKILL.md
@@ -191,7 +191,6 @@ Look for signals such as:
 - The user explicitly asks for TDD, test-first, or characterization-first work
 - The origin document calls for test-first implementation or exploratory hardening of legacy code
 - Local research shows the target area is legacy, weakly tested, or historically fragile, suggesting characterization coverage before changing behavior
- The user asks for external delegation, says "use codex", "delegate mode", or mentions token conservation -- add `Execution target: external-delegate` to implementation units that are pure code writing

 When the signal is clear, carry it forward silently in the relevant implementation units.

@@ -357,7 +356,7 @@ For each unit, include:
 - **Dependencies** - what must exist first
 - **Files** - repo-relative file paths to create, modify, or test (never absolute paths)
 - **Approach** - key decisions, data flow, component boundaries, or integration notes
- **Execution note** - optional, only when the unit benefits from a non-default execution posture such as test-first, characterization-first, or external delegation
+- **Execution note** - optional, only when the unit benefits from a non-default execution posture such as test-first or characterization-first
 - **Technical design** - optional pseudo-code or diagram when the unit's approach is non-obvious and prose alone would leave it ambiguous. Frame explicitly as directional guidance, not implementation specification
 - **Patterns to follow** - existing code or conventions to mirror
 - **Test scenarios** - enumerate the specific test cases the implementer should write, right-sized to the unit's complexity and risk. Consider each category below and include scenarios from every category that applies to this unit. A simple config change may need one scenario; a payment flow may need a dozen. The quality signal is specificity — each scenario should name the input, action, and expected outcome so the implementer doesn't have to invent coverage. For units with no behavioral change (pure config, scaffolding, styling), use `Test expectation: none -- [reason]` instead of leaving the field blank.
@@ -373,7 +372,6 @@ Use `Execution note` sparingly. Good uses include:
 - `Execution note: Start with a failing integration test for the request/response contract.`
 - `Execution note: Add characterization coverage before modifying this legacy parser.`
 - `Execution note: Implement new domain behavior test-first.`
- `Execution note: Execution target: external-delegate`

 Do not expand units into literal `RED/GREEN/REFACTOR` substeps.

@@ -512,7 +510,7 @@ deepened: YYYY-MM-DD  # optional, set when the confidence check substantively st
 **Approach:**
 - [Key design or sequencing decision]

-**Execution note:** [Optional test-first, characterization-first, external-delegate, or other execution posture signal]
+**Execution note:** [Optional test-first, characterization-first, or other execution posture signal]

 **Technical design:** *(optional -- pseudo-code or diagram when the unit's approach is non-obvious. Directional guidance, not implementation specification.)*

--- a/plugins/compound-engineering/skills/ce-work-beta/SKILL.md
+++ b/plugins/compound-engineering/skills/ce-work-beta/SKILL.md
@@ -2,7 +2,7 @@
 name: ce:work-beta
 description: "[BETA] Execute work with external delegate support. Same as ce:work but includes experimental Codex delegation mode for token-conserving code implementation."
 disable-model-invocation: true
-argument-hint: "[Plan doc path or description of work. Blank to auto use latest plan doc]"
+argument-hint: "[Plan doc path or description of work. Blank to auto use latest plan doc] [delegate:codex]"
 ---

 # Work Execution Command
@@ -13,10 +13,62 @@ Execute work efficiently while maintaining quality and finishing features.

 This command takes a work document (plan, specification, or todo file) or a bare prompt describing the work, and executes it systematically. The focus is on **shipping complete features** by understanding requirements quickly, following existing patterns, and maintaining quality throughout.

+**Beta rollout note:** Invoke `ce:work-beta` manually when you want to trial Codex delegation. During the beta period, planning and workflow handoffs remain pointed at stable `ce:work` to avoid dual-path orchestration complexity.
+
 ## Input Document

 <input_document> #$ARGUMENTS </input_document>

+## Argument Parsing
+
+Parse `$ARGUMENTS` for the following optional tokens. Strip each recognized token before interpreting the remainder as the plan file path or bare prompt.
+
+| Token | Example | Effect |
+|-------|---------|--------|
+| `delegate:codex` | `delegate:codex` | Activate Codex delegation mode for plan execution |
+| `delegate:local` | `delegate:local` | Deactivate delegation even if enabled in config |
+
+All tokens are optional. When absent, fall back to the resolution chain below.
+
+**Fuzzy activation:** Also recognize imperative delegation-intent phrases such as "use codex", "delegate to codex", "codex mode", or "delegate mode" as equivalent to `delegate:codex`. A bare mention of "codex" in a prompt (e.g., "fix codex converter bugs") must NOT activate delegation -- only clear delegation intent triggers it.
+
+**Fuzzy deactivation:** Also recognize phrases such as "no codex", "local mode", or "standard mode" as equivalent to `delegate:local`.
+
+### Settings Resolution Chain
+
+After extracting tokens from arguments, resolve the delegation state using this precedence chain:
+
+1. **Argument flag** -- `delegate:codex` or `delegate:local` from the current invocation (highest priority)
+2. **Config file** -- extract settings from the config block below. Value `codex` for `work_delegate` activates delegation; `false` deactivates.
+3. **Hard default** -- `false` (delegation off)
+
+**Config (pre-resolved):**
+!`cat "$(git rev-parse --show-toplevel 2>/dev/null)/.compound-engineering/config.local.yaml" 2>/dev/null || cat "$(dirname "$(git rev-parse --path-format=absolute --git-common-dir 2>/dev/null)")/.compound-engineering/config.local.yaml" 2>/dev/null || echo '__NO_CONFIG__'`
+
+If the block above contains YAML key-value pairs, extract values for the keys listed below.
+If it shows `__NO_CONFIG__`, the file does not exist — all settings fall through to defaults.
+If it shows an unresolved command string, read `.compound-engineering/config.local.yaml` from the repo root using the native file-read tool (e.g., Read in Claude Code, read_file in Codex). If the file does not exist, all settings fall through to defaults.
+
+If any setting has an unrecognized value, fall through to the hard default for that setting.
+
+Config keys:
+- `work_delegate` -- `codex` or default `false`
+- `work_delegate_consent` -- `true` or default `false`
+- `work_delegate_sandbox` -- `yolo` (default) or `full-auto`
+- `work_delegate_decision` -- `auto` (default) or `ask`
+- `work_delegate_model` -- Codex model to use (default `gpt-5.4`). Passthrough — any valid model name accepted.
+- `work_delegate_effort` -- `minimal`, `low`, `medium`, `high` (default), or `xhigh`
+
+Store the resolved state for downstream consumption:
+- `delegation_active` -- boolean, whether delegation mode is on
+- `delegation_source` -- `argument`, `config`, or `default` -- how delegation was resolved (used by the environment guard to decide notification verbosity)
+- `sandbox_mode` -- `yolo` or `full-auto` (from config or default `yolo`)
+- `consent_granted` -- boolean (from config `work_delegate_consent`)
+- `delegate_model` -- string (from config or default `gpt-5.4`)
+- `delegate_effort` -- string (from config or default `high`)
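
The precedence chain above can be sketched as a small shell function. This is only an illustration under stated assumptions, not part of the skill: `resolve_delegation` and its two positional parameters are hypothetical names, the `work_delegate` value is assumed to have been extracted from the YAML already, and reporting `config` as the source when the config says `false` is an assumption.

```shell
# Hypothetical helper: resolve delegation state with the precedence
# argument flag > config file > hard default.
# $1 = token from the invocation ("delegate:codex", "delegate:local", or "")
# $2 = value of work_delegate from the config file ("codex", "false", or "")
resolve_delegation() {
  arg_token="$1"
  config_value="$2"

  # 1. Argument flag (highest priority)
  case "$arg_token" in
    delegate:codex) echo "delegation_active=true delegation_source=argument"; return ;;
    delegate:local) echo "delegation_active=false delegation_source=argument"; return ;;
  esac

  # 2. Config file
  case "$config_value" in
    codex) echo "delegation_active=true delegation_source=config"; return ;;
    false) echo "delegation_active=false delegation_source=config"; return ;;
  esac

  # 3. Hard default: missing or unrecognized values mean delegation is off.
  echo "delegation_active=false delegation_source=default"
}
```

For example, `resolve_delegation delegate:local codex` reports delegation off with source `argument`, because the argument flag outranks the config value.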
+
+---
+
 ## Execution Workflow

 ### Phase 0: Input Triage
@@ -126,6 +178,8 @@ Determine how to proceed based on what was provided in `<input_document>`.

 4. **Choose Execution Strategy**

+   **Delegation routing gate:** If `delegation_active` is true AND the input is a plan file (not a bare prompt), read `references/codex-delegation-workflow.md` and follow its Pre-Delegation Checks and Delegation Decision flow. If all checks pass and delegation proceeds, force **serial execution** and proceed directly to Phase 2 using the workflow's batched execution loop. If any check disables delegation, fall through to the standard strategy table below. If delegation is active but the input is a bare prompt (no plan file), set `delegation_active` to false with a brief note: "Codex delegation requires a plan file -- using standard mode." and continue with the standard strategy selection below.
+
   After creating the task list, decide how to execute based on the plan's size and dependency structure:

   | Strategy | When to use |
@@ -144,8 +198,6 @@ Determine how to proceed based on what was provided in `<input_document>`.

   After each subagent completes, update the plan checkboxes and task list before dispatching the next dependent unit.

-   For genuinely large plans needing persistent inter-agent communication (agents challenging each other's approaches, shared coordination across 10+ tasks), see Swarm Mode below which uses Agent Teams.
-
 ### Phase 2: Execute

 1. **Task Execution Loop**
@@ -158,7 +210,9 @@ Determine how to proceed based on what was provided in `<input_document>`.
     - Read any referenced files from the plan or discovered during Phase 0
     - Look for similar patterns in codebase
     - Find existing test files for implementation files being changed (Test Discovery — see below)
-     - Implement following existing conventions
+     - If delegation_active: branch to the Codex Delegation Execution Loop
+       (see `references/codex-delegation-workflow.md`)
+     - Otherwise: implement following existing conventions
     - Add, update, or remove tests to match implementation changes (see Test Discovery below)
     - Run System-Wide Test Check (see below)
     - Run tests after changes
@@ -385,94 +439,9 @@ Determine how to proceed based on what was provided in `<input_document>`.

 ---

-## Swarm Mode with Agent Teams (Optional)
+## Codex Delegation Mode

-For genuinely large plans where agents need to communicate with each other, challenge approaches, or coordinate across 10+ tasks with persistent specialized roles, use agent team capabilities if available (e.g., Agent Teams in Claude Code, multi-agent workflows in Codex).
-
-**Agent teams are typically experimental and require opt-in.** Do not attempt to use agent teams unless the user explicitly requests swarm mode or agent teams, and the platform supports it.
-
-### When to Use Agent Teams vs Subagents
-
-| Agent Teams | Subagents (standard mode) |
-|-------------|---------------------------|
-| Agents need to discuss and challenge each other's approaches | Each task is independent — only the result matters |
-| Persistent specialized roles (e.g., dedicated tester running continuously) | Workers report back and finish |
-| 10+ tasks with complex cross-cutting coordination | 3-8 tasks with clear dependency chains |
-| User explicitly requests "swarm mode" or "agent teams" | Default for most plans |
-
-Most plans should use subagent dispatch from standard mode. Agent teams add significant token cost and coordination overhead — use them when the inter-agent communication genuinely improves the outcome.
-
-### Agent Teams Workflow
-
-1. **Create team** — use your available team creation mechanism
-2. **Create task list** — parse Implementation Units into tasks with dependency relationships
-3. **Spawn teammates** — assign specialized roles (implementer, tester, reviewer) based on the plan's needs. Give each teammate the plan file path and their specific task assignments
-4. **Coordinate** — the lead monitors task completion, reassigns work if someone gets stuck, and spawns additional workers as phases unblock
-5. **Cleanup** — shut down all teammates, then clean up the team resources
-
---
-
-## External Delegate Mode (Optional)
-
-For plans where token conservation matters, delegate code implementation to an external delegate (currently Codex CLI) while keeping planning, review, and git operations in the current agent.
-
-This mode integrates with the existing Phase 1 Step 4 strategy selection as a **task-level modifier** - the strategy (inline/serial/parallel) still applies, but the implementation step within each tagged task delegates to the external tool instead of executing directly.
-
-### When to Use External Delegation
-
-| External Delegation | Standard Mode |
-|---------------------|---------------|
-| Task is pure code implementation | Task requires research or exploration |
-| Plan has clear acceptance criteria | Task is ambiguous or needs iteration |
-| Token conservation matters (e.g., Max20 plan) | Unlimited plan or small task |
-| Files to change are well-scoped | Changes span many interconnected files |
-
-### Enabling External Delegation
-
-External delegation activates when any of these conditions are met:
- The user says "use codex for this work", "delegate to codex", or "delegate mode"
- A plan implementation unit contains `Execution target: external-delegate` in its Execution note (set by ce:plan)
-
-The specific delegate tool is resolved at execution time. Currently the only supported delegate is Codex CLI. Future delegates can be added without changing plan files.
-
-### Environment Guard
-
-Before attempting delegation, check whether the current agent is already running inside a delegate's sandbox. Delegation from within a sandbox will fail silently or recurse.
-
-Check for known sandbox indicators:
- `CODEX_SANDBOX` environment variable is set
- `CODEX_SESSION_ID` environment variable is set
- The filesystem is read-only at `.git/` (Codex sandbox blocks git writes)
-
-If any indicator is detected, print "Already running inside a delegate sandbox - using standard mode." and proceed with standard execution for that task.
-
-### External Delegation Workflow
-
-When external delegation is active, follow this workflow for each tagged task. Do not skip delegation because a task seems "small", "simple", or "faster inline". The user or plan explicitly requested delegation.
-
-1. **Check availability**
-
-   Verify the delegate CLI is installed. If not found, print "Delegate CLI not installed - continuing with standard mode." and proceed normally.
-
-2. **Build prompt** — For each task, assemble a prompt from the plan's implementation unit (Goal, Files, Approach, Conventions from project CLAUDE.md/AGENTS.md). Include rules: no git commits, no PRs, run `git status` and `git diff --stat` when done. Never embed credentials or tokens in the prompt - pass auth through environment variables.
-
-3. **Write prompt to file** — Save the assembled prompt to a unique temporary file to avoid shell quoting issues and cross-task races. Use a unique filename per task.
-
-4. **Delegate** — Run the delegate CLI, piping the prompt file via stdin (not argv expansion, which hits `ARG_MAX` on large prompts). Omit the model flag to use the delegate's default model, which stays current without manual updates.
-
-5. **Review diff** — After the delegate finishes, verify the diff is non-empty and in-scope. Run the project's test/lint commands. If the diff is empty or out-of-scope, fall back to standard mode for that task.
-
-6. **Commit** — The current agent handles all git operations. The delegate's sandbox blocks `.git/index.lock` writes, so the delegate cannot commit. Stage changes and commit with a conventional message.
-
-7. **Error handling** — On any delegate failure (rate limit, error, empty diff), fall back to standard mode for that task. Track consecutive failures - after 3 consecutive failures, disable delegation for remaining tasks and print "Delegate disabled after 3 consecutive failures - completing remaining tasks in standard mode."
-
-### Mixed-Model Attribution
-
-When some tasks are executed by the delegate and others by the current agent, use the following attribution in Phase 4:
-
- If all tasks used the delegate: attribute to the delegate model
- If all tasks used standard mode: attribute to the current agent's model
- If mixed: use `Generated with [CURRENT_MODEL] + [DELEGATE_MODEL] via [HARNESS]` and note which tasks were delegated in the PR description
+When `delegation_active` is true after argument parsing, read `references/codex-delegation-workflow.md` for the complete delegation workflow: pre-checks, batching, prompt template, execution loop, and result classification.

 ---

--- a/plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
+++ b/plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md
@@ -0,0 +1,315 @@
+# Codex Delegation Workflow
+
+When `delegation_active` is true, code implementation is delegated to the Codex CLI (`codex exec`) instead of being implemented directly. The orchestrating Claude Code agent retains control of planning, review, git operations, and orchestration.
+
+## Delegation Decision
+
+If `work_delegate_decision` is `ask`, present the recommendation and wait for the user's choice before proceeding.
+
+**When recommending Codex delegation, single batch:**
+
+> "Codex delegation active. [N] implementation units -- delegating in one batch."
+> 1. Delegate to Codex *(recommended)*
+> 2. Execute with Claude Code instead
+
+**When recommending Codex delegation, multiple batches:**
+
+> "Codex delegation active. [N] implementation units -- delegating in [X] batches."
+> 1. Delegate to Codex *(recommended)*
+> 2. Execute with Claude Code instead
+
+**When recommending Claude Code (all units are trivial):**
+
+> "Codex delegation active, but these are small changes where the cost of delegating outweighs having Claude Code do them."
+> 1. Execute with Claude Code *(recommended)*
+> 2. Delegate to Codex anyway
+
+If the user chooses the delegation option, proceed to Pre-Delegation Checks below. If the user chooses the Claude Code option, set `delegation_active` to false and return to standard execution in the parent skill.
+
+If `work_delegate_decision` is `auto` (the default), state the execution plan in one line and proceed without waiting: "Codex delegation active. Delegating [N] units in [X] batch(es)." If all units are trivial, set `delegation_active` to false and proceed: "Codex delegation active. All units are trivial -- executing with Claude Code."
+
+## Pre-Delegation Checks
+
+Run these checks **once before the first batch**. If any check fails, fall back to standard mode for the remainder of the plan execution. Do not re-run on subsequent batches.
+
+**0. Platform Gate**
+
+Codex delegation is only supported when the orchestrating agent is running in Claude Code. If the current session is Codex, Gemini CLI, OpenCode, or any other platform, set `delegation_active` to false and proceed in standard mode.
+
+**1. Environment Guard**
+
+Check whether the current agent is already running inside a Codex sandbox:
+
+```bash
+if [ -n "$CODEX_SANDBOX" ] || [ -n "$CODEX_SESSION_ID" ]; then
+  echo "inside_sandbox=true"
+else
+  echo "inside_sandbox=false"
+fi
+```
+
+If `inside_sandbox` is true, delegation would recurse or fail.
+
+- If `delegation_source` is `argument`: emit "Already inside Codex sandbox -- using standard mode." and set `delegation_active` to false.
+- If `delegation_source` is `config` or `default`: set `delegation_active` to false silently.
+
+**2. Availability Check**
+
+**Codex availability (pre-resolved):**
+!`command -v codex >/dev/null 2>&1 && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_FOUND"`
+
+If the line above shows `CODEX_AVAILABLE`, proceed to the next check.
+If it shows `CODEX_NOT_FOUND`, the Codex CLI is not installed. Emit "Codex CLI not found (install via `npm install -g @openai/codex` or `brew install codex`) -- using standard mode." and set `delegation_active` to false.
+If it shows an unresolved command string, run `command -v codex` using a shell tool. If the command prints a path, proceed. If it fails or prints nothing, emit the same message and set `delegation_active` to false.
+
+**3. Consent Flow**
+
+If `consent_granted` is not true (from config `work_delegate_consent`):
+
+Present a one-time consent warning using the platform's blocking question tool (AskUserQuestion in Claude Code). The consent warning explains:
+- Delegation sends implementation units to `codex exec` as a structured prompt
+- **yolo mode** (`--yolo`): Full system access including network. Required for verification steps that run tests or install dependencies. **Recommended.**
+- **full-auto mode** (`--full-auto`): Workspace-write sandbox, no network access.
+
+Present the sandbox mode choice: (1) yolo (recommended), (2) full-auto.
+
+On acceptance:
+- Resolve the repo root: `git rev-parse --show-toplevel`. Write `work_delegate_consent: true` and `work_delegate_sandbox: <chosen-mode>` to `<repo-root>/.compound-engineering/config.local.yaml`
+- To write: (1) if file or directory does not exist, create `<repo-root>/.compound-engineering/` and write the YAML file; (2) if file exists, merge new keys preserving existing keys
+- Update `consent_granted` and `sandbox_mode` in the resolved state
+
+On decline:
+- Ask whether to disable delegation entirely for this project
+- If yes: write `work_delegate: false` to `<repo-root>/.compound-engineering/config.local.yaml` (using the same repo root resolved above). To write: (1) if file or directory does not exist, create `<repo-root>/.compound-engineering/` and write the YAML file; (2) if file exists, merge new keys preserving existing keys. Set `delegation_active` to false, proceed in standard mode
+- If no: set `delegation_active` to false for this invocation only, proceed in standard mode
+
+**Headless consent:** If running in a headless or non-interactive context, delegation proceeds only if `work_delegate_consent` is already `true` in the config file. If consent is not recorded, set `delegation_active` to false silently.
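
The "merge new keys preserving existing keys" write described above might look roughly like the following. It is a sketch that assumes the config file holds only flat, top-level `key: value` lines; `set_config_key` is a hypothetical helper, and a real implementation would more likely use a YAML-aware tool.

```shell
# Hypothetical helper: set one top-level key in a flat YAML file,
# creating the directory and file if needed and keeping all other keys.
# $1 = config file path, $2 = key, $3 = value
set_config_key() {
  file="$1"; key="$2"; value="$3"
  mkdir -p "$(dirname "$file")"
  touch "$file"
  tmp="$(mktemp)"
  # Copy every line except an existing entry for this key.
  # grep exits non-zero when nothing is selected, so tolerate that.
  grep -v "^${key}:" "$file" > "$tmp" || true
  # Append the new value for the key.
  printf '%s: %s\n' "$key" "$value" >> "$tmp"
  # Overwrite in place so the original file's permissions are kept.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Calling it once per key, for instance with `work_delegate_consent` and `true`, leaves any unrelated keys in the file untouched.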
+
+## Batching
+
+By default, delegate all units in one batch. If the plan has more than 5 units, split it into batches at the plan's own phase boundaries or in groups of roughly 5, never separating units that share files. Skip delegation entirely if every unit is trivial.
+
+## Prompt Template
+
+At the start of delegated execution, generate a short unique run ID (e.g., 8 hex chars from a timestamp or random source). All scratch files for this invocation go under `.context/compound-engineering/codex-delegation/<run-id>/`. Create the directory if it does not exist.
+
+Before each batch, write a prompt file to `.context/compound-engineering/codex-delegation/<run-id>/prompt-batch-<batch-num>.md`.
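
One way to produce the run ID and scratch directory is sketched below. The helper names `generate_run_id` and `make_scratch_dir` are hypothetical, and reading 4 bytes from `/dev/urandom` is just one possible "random source"; the skill text does not prescribe a method.

```shell
# Hypothetical helper: print an 8-character hex run ID from 4 random bytes.
generate_run_id() {
  od -An -N4 -tx1 /dev/urandom | tr -d ' \n'
}

# Hypothetical helper: create the per-run scratch directory and print its path.
# $1 = repo root (defaults to the current directory)
make_scratch_dir() {
  dir="${1:-.}/.context/compound-engineering/codex-delegation/$(generate_run_id)"
  mkdir -p "$dir" && echo "$dir"
}
```

The prompt file for batch 1 would then live at `<scratch-dir>/prompt-batch-1.md`, matching the path pattern above.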
+
+Build the prompt from the batch's implementation units using these XML-tagged sections:
+
+```xml
+<task>
+[For a single-unit batch: Goal from the implementation unit.
+For a multi-unit batch: list each unit with its Goal, stating the concrete
+job, repository context, and expected end state for each.]
+</task>
+
+<files>
+[Combined file list from all units in the batch -- files to create, modify, or read.]
+</files>
+
+<patterns>
+[File paths from all units' "Patterns to follow" fields. If no patterns:
+"No explicit patterns referenced -- follow existing conventions in the
+modified files."]
+</patterns>
+
+<approach>
+[For a single-unit batch: Approach from the unit.
+For a multi-unit batch: list each unit's approach, noting dependencies
+and suggested ordering.]
+</approach>
+
+<constraints>
+- Do NOT run git commit, git push, or create PRs -- the orchestrating agent handles all git operations
+- Restrict all modifications to files within the repository root
+- Keep changes tightly scoped to the stated task -- avoid unrelated refactors, renames, or cleanup
+- Resolve the task fully before stopping -- do not stop at the first plausible answer
+- If you discover mid-execution that you need to modify files outside the repo root, complete what you can within the repo and report what you could not do via the result schema issues field
+</constraints>
+
+<testing>
+Before writing tests, check whether the plan's test scenarios cover all
+categories that apply to each unit. Supplement gaps before writing tests:
+- Happy path: core input/output pairs from each unit's goal
+- Edge cases: boundary values, empty/nil inputs, type mismatches
+- Error/failure paths: invalid inputs, permission denials, downstream failures
+- Integration: cross-layer scenarios that mocks alone won't prove
+
+Write tests that name specific inputs and expected outcomes. If your changes
+touch code with callbacks, middleware, or event handlers, verify the
+interaction chain works end-to-end.
+</testing>
+
+<verify>
+After implementing, run ALL test files together in a single command (not
+per-file). Cross-file contamination (e.g., mocked globals leaking between
+test files) only surfaces when tests run in the same process. If tests
+fail, fix the issues and re-run until they pass. Do not report status
+"completed" unless verification passes. This is your responsibility --
+the orchestrator will not re-run verification independently.
+
+[Test and lint commands from the project. Use the union of all units'
+verification commands as a single combined invocation.]
+</verify>
+
+<output_contract>
+Report your result via the --output-schema mechanism. Fill in every field:
+- status: "completed" ONLY if all changes were made AND verification passes,
+  "partial" if incomplete, "failed" if no meaningful progress
+- files_modified: array of file paths you changed
+- issues: array of strings describing any problems, gaps, or out-of-scope
+  work discovered
+- summary: one-paragraph description of what was done
+- verification_summary: what you ran to verify (command and outcome).
+  Example: "Ran `bun test` -- 14 tests passed, 0 failed."
+  If no verification was possible, say why.
+</output_contract>
+```
+
+## Result Schema
+
+Write the result schema to `.context/compound-engineering/codex-delegation/<run-id>/result-schema.json` once at the start of delegated execution:
+
+```json
+{
+  "type": "object",
+  "properties": {
+    "status": { "enum": ["completed", "partial", "failed"] },
+    "files_modified": { "type": "array", "items": { "type": "string" } },
+    "issues": { "type": "array", "items": { "type": "string" } },
+    "summary": { "type": "string" },
+    "verification_summary": { "type": "string" }
+  },
+  "required": ["status", "files_modified", "issues", "summary", "verification_summary"],
+  "additionalProperties": false
+}
+```
+
+Each batch's result is written to `.context/compound-engineering/codex-delegation/<run-id>/result-batch-<batch-num>.json` via the `-o` flag. On plan failure, files are left in place for debugging.
+
+If the result JSON is absent or malformed after a successful exit code, classify as task failure.
+
+## Execution Loop
+
+Initialize a `consecutive_failures` counter at 0 before the first batch.
+
+**Clean-baseline preflight:** Before the first batch, verify there are no uncommitted changes to tracked files:
+
+```bash
+git diff --quiet HEAD
+```
+
+This intentionally ignores untracked files. Only staged or unstaged modifications to tracked files make rollback unsafe. However, if untracked files exist at paths in the batch's planned Files list, rollback (`git clean -fd -- <paths>`) would delete them. If such overlaps are detected, warn the user and recommend committing or stashing those files before proceeding.
+
+If tracked files are dirty, stop and present options: (1) commit current changes, (2) stash explicitly (`git stash push -m "pre-delegation"`), (3) continue in standard mode (sets `delegation_active` to false). Do not auto-stash user changes.
+
+**Delegation invocation:** For each batch, execute these as **separate Bash tool calls** (not combined into one):
+
+**Step A — Launch (background, separate Bash call):**
+
+Write the prompt file, then make a single Bash tool call with `run_in_background: true` set on the tool parameter. This call returns immediately and has no timeout ceiling.
+
+```bash
+# Substitute the resolved sandbox_mode value (yolo or full-auto) from the skill state
+SANDBOX_MODE="<sandbox_mode>"
+
+# Resolve sandbox flag
+if [ "$SANDBOX_MODE" = "full-auto" ]; then
+  SANDBOX_FLAG="--full-auto"
+else
+  SANDBOX_FLAG="--dangerously-bypass-approvals-and-sandbox"
+fi
+
+codex exec \
+  -m "<delegate_model>" \
+  -c 'model_reasoning_effort="<delegate_effort>"' \
+  $SANDBOX_FLAG \
+  --output-schema .context/compound-engineering/codex-delegation/<run-id>/result-schema.json \
+  -o .context/compound-engineering/codex-delegation/<run-id>/result-batch-<batch-num>.json \
+  - < .context/compound-engineering/codex-delegation/<run-id>/prompt-batch-<batch-num>.md
+```
+
+Critical: `run_in_background: true` must be set as a **Bash tool parameter**, not as a shell `&` suffix. The tool parameter is what removes the timeout ceiling. A shell `&` inside a foreground Bash call still hits the 2-minute default timeout.
+
+Quoting is critical for the `-c` flag: use single quotes around the entire key=value and double quotes around the TOML string value inside. Example: `-c 'model_reasoning_effort="high"'`.
+
+Do not improvise CLI flags or modify this invocation template.
+
+**Step B — Poll (foreground, separate Bash calls):**
+
+After the launch call returns, make a **new, separate** foreground Bash tool call that polls for the result file. This keeps the agent's turn active so the user cannot interfere with the working tree.
+
+```bash
+RESULT_FILE=".context/compound-engineering/codex-delegation/<run-id>/result-batch-<batch-num>.json"
+for i in $(seq 1 6); do
+  test -s "$RESULT_FILE" && echo "DONE" && exit 0
+  sleep 10
+done
+echo "Waiting for Codex..."
+```
+
+If the output is "Waiting for Codex...", issue the same polling command again as another separate Bash call. Repeat until the output is "DONE", then read the result file and proceed to classification.
+
+**Polling termination conditions:** Stop polling when any of these conditions is met:
+
+- **Result file appears** (output is "DONE") -- proceed to result classification normally.
+- **Background process exits with non-zero code** -- classify as CLI failure (row 1). Roll back and fall back to standard mode.
+- **Background process exits with zero code but result file is absent** -- classify as task failure (row 2: exit 0, result JSON missing). Roll back and increment `consecutive_failures`.
+- **5 polling rounds** elapse (~5 minutes) without the result file appearing and without a background process notification -- treat as a hung process. Classify as CLI failure (row 1). Roll back and fall back to standard mode.
+
+**Result classification:** Codex is responsible for running verification internally and fixing failures before reporting -- the orchestrator does not re-run verification independently.
+
+| # | Signal | Classification | Action |
+|---|--------|---------------|--------|
+| 1 | Exit code != 0 | CLI failure | Roll back to HEAD. Fall back to standard mode for ALL remaining work. |
+| 2 | Exit code 0, result JSON missing or malformed | Task failure | Roll back to HEAD. Increment `consecutive_failures`. |
+| 3 | Exit code 0, `status: "failed"` | Task failure | Roll back to HEAD. Increment `consecutive_failures`. |
+| 4 | Exit code 0, `status: "partial"` | Partial success | Keep the diff. Complete remaining work locally, verify, and commit. Increment `consecutive_failures`. |
+| 5 | Exit code 0, `status: "completed"` | Success | Commit changes. Reset `consecutive_failures` to 0. |
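
The table above can be read as a simple decision function. The sketch below is illustrative only: `classify_result` and its output labels are hypothetical, and the `grep`/`sed` extraction of `status` is a rough stand-in for a real JSON parser (it could be fooled by a `"status"` string appearing inside another field).

```shell
# Hypothetical helper: map an exit code and a result file to a classification.
# $1 = exit code of the codex process, $2 = path to the batch result JSON
classify_result() {
  exit_code="$1"
  result_file="$2"

  # Row 1: any non-zero exit is a CLI failure.
  if [ "$exit_code" -ne 0 ]; then
    echo "cli_failure"; return
  fi
  # Row 2: exit 0 but the result file is missing or empty.
  if [ ! -s "$result_file" ]; then
    echo "task_failure"; return
  fi

  # Rough extraction of the first "status" string value in the file.
  status="$(grep -o '"status"[[:space:]]*:[[:space:]]*"[a-z]*"' "$result_file" \
    | head -n 1 | sed 's/.*"\([a-z]*\)"$/\1/')"

  case "$status" in
    completed) echo "success" ;;          # row 5
    partial)   echo "partial_success" ;;  # row 4
    *)         echo "task_failure" ;;     # rows 2-3: "failed" or malformed
  esac
}
```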
+
+**Result handoff — surface to user:** After reading the result JSON and before committing or rolling back, display a summary so the user sees what happened. Format:
+
+> **Codex batch <batch-num> — <classification>**
+> <summary from result JSON>
+>
+> **Files:** <comma-separated list from files_modified>
+> **Verification:** <verification_summary from result JSON>
+> **Issues:** <issues list, or "None">
+
+On failure or partial results, include the classification reason (e.g., "status: failed", "result JSON missing") so the user understands why the orchestrator is rolling back or completing locally.
+
+Keep this brief — the goal is transparency, not a wall of text. One short block per batch.
+
+**Rollback procedure:**
+
+```bash
+git checkout -- .
+git clean -fd -- <paths from the batch's combined Files list>
+```
+
+Do NOT run `git clean -fd` without path arguments: it would delete every untracked file in the working tree, including files unrelated to the batch.
+
+**Commit on success:**
+
+```bash
+git add $(git diff --name-only HEAD; git ls-files --others --exclude-standard)
+git commit -m "feat(<scope>): <batch summary>"
+```
+
+**Between batches** (plans split into multiple batches): Report what completed, test results, and what's next. Continue immediately unless the user intervenes -- the checkpoint exists so the user *can* steer, not so they *must*.
+
+**Circuit breaker:** After 3 consecutive failures, set `delegation_active` to false and emit: "Codex delegation disabled after 3 consecutive failures -- completing remaining units in standard mode."
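The interaction between the table's `consecutive_failures` updates and the circuit breaker can be sketched as a small state object. This is illustrative only: `DelegationState`, `record`, and `FAILURE_LIMIT` are hypothetical names, and clearing `delegation_active` on a CLI failure (row 1) is an assumption about how "fall back to standard mode for ALL remaining work" would be tracked.

```python
from dataclasses import dataclass

FAILURE_LIMIT = 3  # from the text: disable after 3 consecutive failures


@dataclass
class DelegationState:
    consecutive_failures: int = 0
    delegation_active: bool = True

    def record(self, row: int) -> None:
        """Update state from the classification table row of a finished batch."""
        if row == 1:
            # CLI failure: assumed to disable delegation immediately.
            self.delegation_active = False
        elif row in (2, 3, 4):
            # Task failures and partial successes both increment the counter.
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_LIMIT:
                self.delegation_active = False
        elif row == 5:
            # Success resets the counter.
            self.consecutive_failures = 0
        else:
            raise ValueError(f"unknown classification row: {row}")
```

Note that a partial success keeps the diff but still counts toward the breaker, as the table specifies.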
+
+**Scratch cleanup:** After the last batch completes:
+
+```bash
+rm -rf .context/compound-engineering/codex-delegation/<run-id>/
+```
+
+## Mixed-Model Attribution
+
+Attribute the work according to how its units were executed:
+
+- If all units used delegation: attribute to the Codex model
+- If all units used standard mode: attribute to the current agent's model
+- If mixed (some units executed by Codex, others locally): note which units were delegated in the PR description and credit both models
--- a/plugins/compound-engineering/skills/ce-work/SKILL.md
+++ b/plugins/compound-engineering/skills/ce-work/SKILL.md
@@ -143,8 +143,6 @@ Determine how to proceed based on what was provided in `<input_document>`.

   After each subagent completes, update the plan checkboxes and task list before dispatching the next dependent unit.

-   For genuinely large plans needing persistent inter-agent communication (agents challenging each other's approaches, shared coordination across 10+ tasks), see Swarm Mode below which uses Agent Teams.
-
 ### Phase 2: Execute

 1. **Task Execution Loop**
@@ -374,35 +372,6 @@ Determine how to proceed based on what was provided in `<input_document>`.
   - Note any follow-up work needed
   - Suggest next steps if applicable

---
-
-## Swarm Mode with Agent Teams (Optional)
-
-For genuinely large plans where agents need to communicate with each other, challenge approaches, or coordinate across 10+ tasks with persistent specialized roles, use agent team capabilities if available (e.g., Agent Teams in Claude Code, multi-agent workflows in Codex).
-
-**Agent teams are typically experimental and require opt-in.** Do not attempt to use agent teams unless the user explicitly requests swarm mode or agent teams, and the platform supports it.
-
-### When to Use Agent Teams vs Subagents
-
-| Agent Teams | Subagents (standard mode) |
-|-------------|---------------------------|
-| Agents need to discuss and challenge each other's approaches | Each task is independent — only the result matters |
-| Persistent specialized roles (e.g., dedicated tester running continuously) | Workers report back and finish |
-| 10+ tasks with complex cross-cutting coordination | 3-8 tasks with clear dependency chains |
-| User explicitly requests "swarm mode" or "agent teams" | Default for most plans |
-
-Most plans should use subagent dispatch from standard mode. Agent teams add significant token cost and coordination overhead — use them when the inter-agent communication genuinely improves the outcome.
-
-### Agent Teams Workflow
-
-1. **Create team** — use your available team creation mechanism
-2. **Create task list** — parse Implementation Units into tasks with dependency relationships
-3. **Spawn teammates** — assign specialized roles (implementer, tester, reviewer) based on the plan's needs. Give each teammate the plan file path and their specific task assignments
-4. **Coordinate** — the lead monitors task completion, reassigns work if someone gets stuck, and spawns additional workers as phases unblock
-5. **Cleanup** — shut down all teammates, then clean up the team resources
-
---
-
 ## Key Principles

 ### Start Fast, Execute Faster
--- a/plugins/compound-engineering/skills/orchestrating-swarms/SKILL.md
+++ b/plugins/compound-engineering/skills/orchestrating-swarms/SKILL.md
--- a/plugins/compound-engineering/skills/slfg/SKILL.md
+++ b/plugins/compound-engineering/skills/slfg/SKILL.md
@@ -1,35 +0,0 @@
---
-name: slfg
-description: Full autonomous engineering workflow using swarm mode for parallel execution
-argument-hint: "[feature description]"
-disable-model-invocation: true
---
-
-Swarm-enabled LFG. Run these steps in order, parallelizing where indicated. Do not stop between steps — complete every step through to the end.
-
-## Sequential Phase
-
-1. **Optional:** If the `ralph-loop` skill is available, run `/ralph-loop:ralph-loop "finish all slash commands" --completion-promise "DONE"`. If not available or it fails, skip and continue to step 2 immediately.
-2. `/ce:plan $ARGUMENTS` — If ce:plan reported the task is non-software and cannot be processed in pipeline mode, stop the pipeline and inform the user that SLFG requires software tasks. Otherwise, **record the plan file path** from `docs/plans/` for steps 4 and 6.
-3. `/ce:work` — **Use swarm mode**: Make a Task list and launch an army of agent swarm subagents to build the plan
-
-## Parallel Phase
-
-After work completes, launch steps 4 and 5 as **parallel swarm agents** (both only need code to be written):
-
-4. `/ce:review mode:report-only plan:<plan-path-from-step-2>` — spawn as background Task agent
-5. `/compound-engineering:test-browser` — spawn as background Task agent
-
-Wait for both to complete before continuing.
-
-## Autofix Phase
-
-6. `/ce:review mode:autofix plan:<plan-path-from-step-2>` — run sequentially after the parallel phase so it can safely mutate the checkout, apply `safe_auto` fixes, and emit residual todos for step 7
-
-## Finalize Phase
-
-7. `/compound-engineering:todo-resolve` — resolve findings, compound on learnings, clean up completed todos
-8. `/compound-engineering:feature-video` — record the final walkthrough and add to PR
-9. Output `<promise>DONE</promise>` when video is in PR
-
-Start with step 1 now.