feat(ce-optimize): Auto-research loop for tuning system prompts / vector clustering / evaluating different code solutions / etc (#446)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
38  plugins/compound-engineering/skills/ce-optimize/README.md  Normal file
@@ -0,0 +1,38 @@
# `ce-optimize`

Run iterative optimization loops for problems where you can try multiple variants and score them with the same measurement setup.

## When To Use It

Use `/ce-optimize` when:

- The right change is not obvious up front
- You can generate several plausible variants
- You have a repeatable measurement harness
- "Better" can be expressed as a hard metric or an LLM-as-judge evaluation

Good fits:

- Tuning memory, timeout, concurrency, or batch-size settings where you can measure crashes, latency, throughput, or error rate
- Improving clustering, ranking, search, or recommendation quality where hard metrics alone can be gamed
- Optimizing prompts where both output quality and token cost matter

Usually not a good fit:

- One-shot bug fixes with an obvious root cause
- Changes without a repeatable measurement harness
- Problems where "better" cannot be measured or judged consistently

## Quick Start

- Start with [`references/example-hard-spec.yaml`](./references/example-hard-spec.yaml) for objective targets
- Start with [`references/example-judge-spec.yaml`](./references/example-judge-spec.yaml) when semantics matter and you need LLM-as-judge
- Keep the first run serial, small, and cheap until the harness is trustworthy
- Avoid introducing new dependencies until the baseline and evaluation loop are stable

## Docs

- [`SKILL.md`](./SKILL.md): full orchestration workflow and runtime rules
- [`references/usage-guide.md`](./references/usage-guide.md): example prompts and practical "when/how to use this skill" guidance
- [`references/optimize-spec-schema.yaml`](./references/optimize-spec-schema.yaml): optimization spec schema
- [`references/experiment-log-schema.yaml`](./references/experiment-log-schema.yaml): experiment log schema
659  plugins/compound-engineering/skills/ce-optimize/SKILL.md  Normal file
@@ -0,0 +1,659 @@
---
name: ce-optimize
description: "Run metric-driven iterative optimization loops. Define a measurable goal, build measurement scaffolding, then run parallel experiments that try many approaches, measure each against hard gates and/or LLM-as-judge quality scores, keep improvements, and converge toward the best solution. Use when optimizing clustering quality, search relevance, build performance, prompt quality, or any measurable outcome that benefits from systematic experimentation. Inspired by Karpathy's autoresearch, generalized for multi-file code changes and non-ML domains."
argument-hint: "[path to optimization spec YAML, or describe the optimization goal]"
---

# Iterative Optimization Loop

Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.

## Interaction Method

Use the platform's blocking question tool when available (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.

## Input

<optimization_input> #$ARGUMENTS </optimization_input>

If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."

## Optimization Spec Schema

Reference the spec schema for validation:

`references/optimize-spec-schema.yaml`

## Experiment Log Schema

Reference the experiment log schema for state management:

`references/experiment-log-schema.yaml`

## Quick Start

For a first run, optimize for signal and safety, not maximum throughput:

- Start from `references/example-hard-spec.yaml` when the metric is objective and cheap to measure
- Use `references/example-judge-spec.yaml` only when actual quality requires semantic judgment
- Prefer `execution.mode: serial` and `execution.max_concurrent: 1`
- Cap the first run with `stopping.max_iterations: 4` and `stopping.max_hours: 1`
- Avoid new dependencies until the baseline and measurement harness are trusted
- For judge mode, start with `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5`

For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see:

`references/usage-guide.md`

---

## Persistence Discipline

**CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.**

The files under `.context/compound-engineering/ce-optimize/<spec-name>/` are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.

This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.

**If you produce a results table in the conversation without writing those results to disk first, you have a bug.** The conversation is for the user's benefit. The experiment log file is for durability.

### Core Rules

1. **Write each experiment result to disk IMMEDIATELY after measurement** — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.

2. **VERIFY every critical write** — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.

3. **Re-read from disk at every phase boundary and before every decision** — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.

4. **The experiment log is append-only during Phase 3** — never rewrite the full file. Append new experiment entries. Update the `best` section in place only when a new best is found. This prevents data loss if a write is interrupted.

5. **Per-experiment result markers for crash recovery** — each experiment writes a `result.yaml` marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.

6. **Strategy digest is written after every batch, before generating new hypotheses** — the agent reads the digest (not its memory) when deciding what to try next.

7. **Never present results to the user without writing them to disk first** — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.

### Mandatory Disk Checkpoints

These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.

| Checkpoint | File Written | Phase |
|---|---|---|
| CP-0: Spec saved | `spec.yaml` | Phase 0, after user approval |
| CP-1: Baseline recorded | `experiment-log.yaml` (initial with baseline) | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | `experiment-log.yaml` (hypothesis_backlog section) | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | `experiment-log.yaml` (append experiment entry) | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | `experiment-log.yaml` (outcomes + best) + `strategy-digest.md` | Phase 3.5, after batch evaluation |
| CP-5: Final summary | `experiment-log.yaml` (final state) | Phase 4, at wrap-up |

**Format of a verification step:**
1. Write the file using the native file-write tool
2. Read the file back using the native file-read tool
3. Confirm the expected content is present
4. If verification fails, retry the write. If it fails twice, alert the user.
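
A minimal shell-level sketch of the same write-then-verify loop, in case the native file tools are unavailable; the `iteration: 7` marker is a hypothetical identifier for the entry just written:

```bash
# Sketch: append an entry, read it back, retry once, then alert.
LOG=".context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml"
ENTRY="/tmp/exp-entry.yaml"     # hypothetical file holding the new YAML entry

cat "$ENTRY" >> "$LOG"                               # write
if ! grep -q "iteration: 7" "$LOG"; then             # read back and confirm
  echo "verification failed, retrying write" >&2
  cat "$ENTRY" >> "$LOG"
  grep -q "iteration: 7" "$LOG" || echo "write failed twice, alert the user" >&2
fi
```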

### File Locations (all under `.context/compound-engineering/ce-optimize/<spec-name>/`)

| File | Purpose | Written When |
|------|---------|-------------|
| `spec.yaml` | Optimization spec (immutable during run) | Phase 0 (CP-0) |
| `experiment-log.yaml` | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
| `strategy-digest.md` | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
| `<worktree>/result.yaml` | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |

### On Resume

When Phase 0.4 detects an existing run:
1. Read the experiment log from disk — this is the ground truth
2. Scan worktree directories for `result.yaml` markers not yet in the log
3. Recover any measured-but-unlogged experiments
4. Continue from where the log left off
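
One possible shape for the marker scan in step 2, assuming the worktrees created by `scripts/experiment-worktree.sh` live under `optimize-exp/<spec-name>/exp-<NNN>` and that logged entries can be matched by their `iteration:` line:

```bash
# Sketch: recover measured-but-unlogged experiments from their result.yaml markers.
SPEC="<spec-name>"
LOG=".context/compound-engineering/ce-optimize/$SPEC/experiment-log.yaml"

for marker in optimize-exp/"$SPEC"/exp-*/result.yaml; do
  [ -f "$marker" ] || continue
  n=$(basename "$(dirname "$marker")" | sed 's/^exp-0*//')   # exp-003 -> 3
  if ! grep -q "iteration: $n\$" "$LOG"; then
    echo "unlogged measurement found: $marker (iteration $n)"
  fi
done
```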

---

## Phase 0: Setup

### 0.1 Determine Input Type

Check whether the input is:
- **A spec file path** (ends in `.yaml` or `.yml`): read and validate it
- **A description of the optimization goal**: help the user create a spec interactively

### 0.2 Load or Create Spec

**If spec file provided:**
1. Read the YAML spec file. The orchestrating agent parses YAML natively -- no shell script parsing.
2. Validate against `references/optimize-spec-schema.yaml`:
   - All required fields present
   - `name` is lowercase kebab-case and safe to use in git refs / worktree paths
   - `metric.primary.type` is `hard` or `judge`
   - If type is `judge`, `metric.judge` section exists with `rubric` and `scoring`
   - At least one degenerate gate defined
   - `measurement.command` is non-empty
   - `scope.mutable` and `scope.immutable` each have at least one entry
   - Gate check operators are valid (`>=`, `<=`, `>`, `<`, `==`, `!=`)
   - `execution.max_concurrent` is at least 1
   - `execution.max_concurrent` does not exceed 6 when backend is `worktree`
3. If validation fails, report errors and ask the user to fix them

**If description provided:**
1. Analyze the project to understand what can be measured
2. **Detect whether the optimization target is qualitative or quantitative** — this determines `type: hard` vs `type: judge` and is the single most important spec decision:

   **Use `type: hard`** when:
   - The metric is a scalar number with a clear "better" direction
   - The metric is objectively measurable (build time, test pass rate, latency, memory usage)
   - No human judgment is needed to evaluate "is this result actually good?"
   - Examples: reduce build time, increase test coverage, reduce API latency, decrease bundle size

   **Use `type: judge`** when:
   - The quality of the output requires semantic understanding to evaluate
   - A human reviewer would need to look at the results to say "this is better"
   - Proxy metrics exist but can mislead (e.g., "more clusters" does not mean "better clusters")
   - The optimization could produce degenerate solutions that look good on paper
   - Examples: clustering quality, search relevance, summarization quality, code readability, UX copy, recommendation relevance

   **IMPORTANT**: If the target is qualitative, **strongly recommend `type: judge`**. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:
   - **Degenerate gates** (hard, cheap, fast): catch obviously broken solutions — e.g., "all items in 1 cluster" or "0% coverage". Run first. If gates fail, skip the expensive judge step.
   - **LLM-as-judge** (the actual optimization target): sample outputs, score them against a rubric, aggregate. This is what the loop optimizes.
   - **Diagnostics** (logged, not gated): distribution stats, counts, timing — useful for understanding WHY a judge score changed.

   If the user insists on `type: hard` for a qualitative target, proceed but warn that the results may optimize a misleading proxy.

3. **Design the sampling strategy** (for `type: judge`):

   Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"

   Walk through these questions:
   - **What does one "item" look like?** (a cluster, a search result page, a summary, etc.)
   - **What are the natural size/quality strata?** (e.g., large clusters vs small clusters vs singletons)
   - **Where are quality failures most likely?** (e.g., very large clusters may be degenerate merges; singletons may be missed groupings)
   - **What total sample size balances cost vs signal?** (default: 30 items, adjust based on output volume)

   Example stratified sampling for clustering:
   ```yaml
   stratification:
     - bucket: "top_by_size"      # largest clusters — check for degenerate mega-clusters
       count: 10
     - bucket: "mid_range"        # middle of non-solo cluster size range — representative quality
       count: 10
     - bucket: "small_clusters"   # clusters with 2-3 items — check if connections are real
       count: 10
   singleton_sample: 15           # singletons — check for false negatives (items that should cluster)
   ```

   The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".

   **Singleton evaluation is critical when the goal involves coverage** — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.

4. **Design the rubric** (for `type: judge`):

   Help the user define the scoring rubric. A good rubric:
   - Has a 1-5 scale (or similar) with concrete descriptions for each level
   - Includes supplementary fields that help diagnose issues (e.g., `distinct_topics`, `outlier_count`)
   - Is specific enough that two judges would give similar scores
   - Does NOT assume bigger/more is better — "3 items per cluster average" is not inherently good or bad

   Example for clustering:
   ```yaml
   rubric: |
     Rate this cluster 1-5:
     - 5: All items clearly about the same issue/feature
     - 4: Strong theme, minor outliers
     - 3: Related but covers 2-3 sub-topics that could reasonably be split
     - 2: Weak connection — items share superficial similarity only
     - 1: Unrelated items grouped together
     Also report: distinct_topics (integer), outlier_count (integer)
   ```

5. Guide the user through the remaining spec fields:
   - What degenerate cases should be rejected? (gates — e.g., "solo_pct <= 0.95" catches all-singletons, "max_cluster_size <= 500" catches mega-clusters)
   - What command runs the measurement?
   - What files can be modified? What is immutable?
   - Any constraints or dependencies?
   - If this is the first run: recommend `execution.mode: serial`, `execution.max_concurrent: 1`, `stopping.max_iterations: 4`, and `stopping.max_hours: 1`
   - If `type: judge`: recommend `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5` until the rubric and harness are trusted
6. Write the spec to `.context/compound-engineering/ce-optimize/<spec-name>/spec.yaml`
7. Present the spec to the user for approval before proceeding

### 0.3 Search Prior Learnings

Dispatch `compound-engineering:research:learnings-researcher` to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.

### 0.4 Run Identity Detection

Check if `optimize/<spec-name>` branch already exists:

```bash
git rev-parse --verify "optimize/<spec-name>" 2>/dev/null
```

**If branch exists**, check for an existing experiment log at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`.

Present the user with a choice via the platform question tool:
- **Resume**: read ALL state from the experiment log on disk (do not rely on any in-memory context from a prior session). Recover any measured-but-unlogged experiments by scanning worktree directories for `result.yaml` markers. Continue from the last iteration number in the log.
- **Fresh start**: archive the old branch to `optimize-archive/<spec-name>/archived-<timestamp>`, clear the experiment log, start from scratch
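
A sketch of what the fresh-start path could look like at the shell level; the timestamp format is illustrative:

```bash
# Sketch: archive the previous optimization branch, then clear local scratch state.
TS=$(date -u +%Y%m%dT%H%M%SZ)
git branch -m "optimize/<spec-name>" "optimize-archive/<spec-name>/archived-$TS"
rm -f ".context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml" \
      ".context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md"
```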

### 0.5 Create Optimization Branch and Scratch Space

```bash
git checkout -b "optimize/<spec-name>"   # or switch to existing if resuming
```

Create scratch directory:
```bash
mkdir -p .context/compound-engineering/ce-optimize/<spec-name>/
```

---

## Phase 1: Measurement Scaffolding

**This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.**

### 1.1 Clean-Tree Gate

Verify no uncommitted changes to files within `scope.mutable` or `scope.immutable`:

```bash
git status --porcelain
```

Filter the output against the scope paths. If any in-scope files have uncommitted changes:
- Report which files are dirty
- Ask the user to commit or stash before proceeding
- Do NOT continue until the working tree is clean for in-scope files
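
One way to do the scope filtering directly in git: pass the scope entries as pathspecs so the porcelain output only covers in-scope files. The paths below are the ones from `references/example-hard-spec.yaml` and stand in for whatever the spec declares:

```bash
# Sketch: check only in-scope paths for uncommitted changes.
dirty=$(git status --porcelain -- \
  "src/build/" "config/build.yaml" \
  "tools/eval/evaluate.py" "tests/fixtures/" "scripts/ci/")

if [ -n "$dirty" ]; then
  echo "In-scope files have uncommitted changes:"
  echo "$dirty"
  echo "Commit or stash these before starting the optimization run."
fi
```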

### 1.2 Build or Validate Measurement Harness

**If user provides a measurement harness** (the `measurement.command` already exists):
1. Run it once via the measurement script:
   ```bash
   bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
   ```
2. Validate the JSON output:
   - Contains keys for all degenerate gate metric names
   - Contains keys for all diagnostic metric names
   - Values are numeric or boolean as expected
3. If validation fails, report what is missing and ask the user to fix the harness
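
A shell-level sketch of the key check in step 2, assuming `jq` is available, that `scripts/measure.sh` prints the harness JSON on stdout, and using the gate and diagnostic names from `references/example-hard-spec.yaml`; the orchestrator can equally inspect the JSON natively:

```bash
# Sketch: confirm the harness output contains every expected metric key.
out=$(bash scripts/measure.sh "python evaluate.py" 300 "tools/eval")

for key in build_passed test_pass_rate build_seconds artifact_size_mb peak_memory_mb; do
  if ! echo "$out" | jq -e --arg k "$key" 'has($k)' >/dev/null; then
    echo "measurement output is missing key: $key"
  fi
done
```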

**If agent must build the harness:**
1. Analyze the codebase to understand the current approach and what should be measured
2. Build an evaluation script (e.g., `evaluate.py`, `evaluate.sh`, or equivalent)
3. Add the evaluation script path to `scope.immutable` -- the experiment agent must not modify it
4. Run it once and validate the output
5. Present the harness and its output to the user for review

### 1.3 Establish Baseline

Run the measurement harness on the current code.

**If stability mode is `repeat`:**
1. Run the harness `repeat_count` times
2. Aggregate results using the configured aggregation method (median, mean, min, max)
3. Calculate variance across runs
4. If variance exceeds `noise_threshold`, warn the user and suggest increasing `repeat_count`
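
A rough shell sketch of the repeat-and-aggregate step for a single metric, using the `repeat_count: 3` / `aggregation: median` settings from the hard-metric example spec; `jq` and `awk` stand in for whatever the orchestrator does natively, and it assumes `scripts/measure.sh` prints the harness JSON on stdout:

```bash
# Sketch: run the harness repeat_count times and take the median of one metric.
runs=3
metric="build_seconds"
values=$(for i in $(seq "$runs"); do
  bash scripts/measure.sh "python evaluate.py" 300 "tools/eval" | jq -r ".${metric}"
done)

median=$(echo "$values" | sort -n | awk '{a[NR]=$1} END {
  if (NR % 2) print a[(NR + 1) / 2];
  else print (a[NR / 2] + a[NR / 2 + 1]) / 2
}')
echo "baseline ${metric} (median of ${runs} runs): ${median}"
```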

Record the baseline in the experiment log:
```yaml
baseline:
  timestamp: "<current ISO 8601 timestamp>"
  gates:
    <gate_name>: <value>
    ...
  diagnostics:
    <diagnostic_name>: <value>
    ...
```

If primary type is `judge`, also run the judge evaluation on baseline output to establish the starting judge score.

### 1.4 Parallelism Readiness Probe

Run the parallelism probe script:
```bash
bash scripts/parallel-probe.sh "<project_directory>" "<measurement.command>" "<measurement.working_directory>" <shared_files...>
```

Read the JSON output. Present any blockers to the user with suggested mitigations. Treat the probe as intentionally narrow: it should inspect the measurement command, the measurement working directory, and explicitly declared shared files, not the entire repository.

### 1.5 Worktree Budget Check

Count existing worktrees:
```bash
bash scripts/experiment-worktree.sh count
```

If count + `execution.max_concurrent` would exceed 12:
- Warn the user
- Suggest cleaning up existing worktrees or reducing `max_concurrent`
- Do NOT block -- the user may proceed at their own risk
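
If cleanup is needed, plain git worktree commands are enough; this is a sketch and the path removed is only an example of a stale, no-longer-referenced worktree:

```bash
# Sketch: inspect and trim experiment worktrees, then prune stale entries.
git worktree list
git worktree remove optimize-exp/<spec-name>/exp-002   # example stale worktree
git worktree prune
```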

### 1.6 Write Baseline to Disk (CP-1)

**MANDATORY CHECKPOINT.** Before presenting results to the user, write the initial experiment log with baseline metrics to disk:

1. Create the experiment log file at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`
2. Include all required top-level sections from `references/experiment-log-schema.yaml`: `spec`, `run_id`, `started_at`, `baseline`, `experiments`, and `best`
3. Seed `experiments` as an empty array and seed `best` from the baseline snapshot (use `iteration: 0`, baseline metrics, and baseline judge scores if present) so later phases have a valid current-best state to compare against
4. Optionally seed `hypothesis_backlog: []` here as well so the log shape is stable before Phase 2 populates it
5. **Verify**: read the file back and confirm the required sections are present and the baseline values match
6. Only THEN present results to the user

### 1.7 User Approval Gate

Present to the user via the platform question tool:

- **Baseline metrics**: all gate values, diagnostic values, and judge scores (if applicable)
- **Experiment log location**: show the file path so the user knows where results are saved
- **Parallel readiness**: probe results, any blockers, mitigations applied
- **Clean-tree status**: confirmed clean
- **Worktree budget**: current count and projected usage
- **Judge budget**: estimated per-experiment judge cost and configured `max_total_cost_usd` cap (or an explicit note that spend is uncapped)

**Options:**
1. **Proceed** -- approve baseline and parallel config, move to Phase 2
2. **Adjust spec** -- modify spec settings before proceeding
3. **Fix issues** -- user needs to resolve blockers first

Do NOT proceed to Phase 2 until the user explicitly approves.

If primary type is `judge` and `max_total_cost_usd` is null, call that out as uncapped spend and require explicit approval before proceeding.

**State re-read:** After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward.

---

## Phase 2: Hypothesis Generation

### 2.1 Analyze Current Approach

Read the code within `scope.mutable` to understand:
- The current implementation approach
- Obvious improvement opportunities
- Constraints and dependencies between components

Optionally dispatch `compound-engineering:research:repo-research-analyst` for deeper codebase analysis if the scope is large or unfamiliar.

### 2.2 Generate Hypothesis List

Generate an initial set of hypotheses. Each hypothesis should have:
- **Description**: what to try
- **Category**: one of the standard categories (signal-extraction, graph-signals, embedding, algorithm, preprocessing, parameter-tuning, architecture, data-handling) or a domain-specific category
- **Priority**: high, medium, or low based on expected impact and feasibility
- **Required dependencies**: any new packages or tools needed

Include user-provided hypotheses if any were given as input.

Aim for 10-30 hypotheses in the initial backlog. More can be generated during the loop based on learnings.

### 2.3 Dependency Pre-Approval

Collect all unique new dependencies across all hypotheses.

If any hypotheses require new dependencies:
1. Present the full dependency list to the user via the platform question tool
2. Ask for bulk approval
3. Mark each hypothesis's `dep_status` as `approved` or `needs_approval`

Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval.

### 2.4 Record Hypothesis Backlog (CP-2)

**MANDATORY CHECKPOINT.** Write the initial backlog to the experiment log file and verify:
```yaml
hypothesis_backlog:
  - description: "Remove template boilerplate before embedding"
    category: "signal-extraction"
    priority: high
    dep_status: approved
    required_deps: []
  - description: "Try HDBSCAN clustering algorithm"
    category: "algorithm"
    priority: medium
    dep_status: needs_approval
    required_deps: ["scikit-learn"]
```

---

## Phase 3: Optimization Loop

This phase repeats in batches until a stopping criterion is met.

### 3.1 Batch Selection

Select hypotheses for this batch:
- Build a runnable backlog by excluding hypotheses with `dep_status: needs_approval`
- If `execution.mode` is `serial`, force `batch_size = 1`
- Otherwise, `batch_size = min(runnable_backlog_size, execution.max_concurrent)`
- Prefer diversity: select from different categories when possible
- Within a category, select by priority (high first)

If the backlog is empty and no new hypotheses can be generated, proceed to Phase 4 (wrap-up).
If the backlog is non-empty but no runnable hypotheses remain because everything needs approval or is otherwise blocked, proceed to Phase 4 so the user can approve dependencies instead of spinning forever.

### 3.2 Dispatch Experiments

For each hypothesis in the batch, dispatch according to `execution.mode`. In `serial` mode, run exactly one experiment to completion before selecting the next hypothesis. In `parallel` mode, dispatch the full batch concurrently.

**Worktree backend:**
1. Create experiment worktree:
   ```bash
   WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>)   # creates optimize-exp/<spec_name>/exp-<NNN>
   ```
2. Apply port parameterization if configured (set env vars for the measurement script)
3. Fill the experiment prompt template (`references/experiment-prompt-template.md`) with:
   - Iteration number, spec name
   - Hypothesis description and category
   - Current best and baseline metrics
   - Mutable and immutable scope
   - Constraints and approved dependencies
   - Rolling window of last 10 experiments (concise summaries)
4. Dispatch a subagent with the filled prompt, working in the experiment worktree

**Codex backend:**
1. Check environment guard -- do NOT delegate if already inside a Codex sandbox:
   ```bash
   # If these exist, we're already in Codex -- fall back to subagent
   test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
   ```
2. Fill the experiment prompt template
3. Write the filled prompt to a temp file
4. Dispatch via Codex:
   ```bash
   cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
   ```
5. Security posture: use the user's selection (ask once per session if not set in spec)

### 3.3 Collect and Persist Results

Process experiments as they complete — do NOT wait for the entire batch to finish before writing results.

For each completed experiment, **immediately**:

1. **Run measurement** in the experiment's worktree:
   ```bash
   bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>
   ```
   - If stability mode is `repeat`, run the measurement harness `repeat_count` times in that working directory and aggregate the results exactly as in Phase 1 before evaluating gates or ranking the experiment.
   - Use the aggregated metrics as the experiment's score; if variance exceeds `noise_threshold`, record that in learnings so the operator knows the result is noisy.

2. **Write crash-recovery marker** — immediately after measurement, write `result.yaml` in the experiment worktree containing the raw metrics. This ensures the measurement is recoverable even if the agent crashes before updating the main log.

3. **Read raw JSON output** from the measurement script

4. **Evaluate degenerate gates** (a shell-level sketch follows at the end of this section):
   - For each gate in `metric.degenerate_gates`, parse the operator and threshold
   - Compare the metric value against the threshold
   - If ANY gate fails: mark outcome as `degenerate`, skip judge evaluation, save money

5. **If gates pass AND primary type is `judge`**:
   - Read the experiment's output (cluster assignments, search results, etc.)
   - Apply stratified sampling per `metric.judge.stratification` config (using `sample_seed`)
   - Group samples into batches of `metric.judge.batch_size`
   - Fill the judge prompt template (`references/judge-prompt-template.md`) for each batch
   - Dispatch `ceil(sample_size / batch_size)` parallel judge sub-agents
   - Each sub-agent returns structured JSON scores
   - Aggregate scores: compute the configured primary judge field from `metric.judge.scoring.primary` (which should match `metric.primary.name`) plus any `scoring.secondary` values
   - If `singleton_sample > 0`: also dispatch singleton evaluation sub-agents

6. **If gates pass AND primary type is `hard`**:
   - Use the metric value directly from the measurement output

7. **IMMEDIATELY append to experiment log on disk (CP-3)** — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml` right now. Use the transitional outcome `measured` once the experiment has valid metrics but has not yet been compared to the current best. Update the outcome to `kept`, `reverted`, or another terminal state in the evaluation step, but the raw metrics are on disk and safe from context compaction.

8. **VERIFY the write (CP-3 verification)** — read the experiment log back from disk and confirm the entry just written is present. If verification fails, retry the write. Do NOT proceed to the next experiment until this entry is confirmed on disk.

**Why immediately + verify?** The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to `results.tsv` after every single experiment — this skill must do the same with the experiment log. The verification step catches silent write failures that would otherwise lose data.
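
A sketch of how one gate check from step 4 could be evaluated at the shell level, with `jq` plus `awk` doing the numeric comparison; the gate name and check string come from the example specs, `result.json` stands in for the captured harness output, and the orchestrator is free to do the same comparison natively instead:

```bash
# Sketch: evaluate one degenerate gate such as `test_pass_rate >= 1.0`.
gate_name="test_pass_rate"
gate_check=">= 1.0"

value=$(jq -r --arg k "$gate_name" '.[$k]' result.json)   # metric from the harness JSON
read -r op threshold <<<"$gate_check"                      # split into operator and threshold

if awk -v v="$value" -v t="$threshold" -v op="$op" 'BEGIN {
      ok = (op == ">=" && v+0 >= t+0) || (op == "<=" && v+0 <= t+0) ||
           (op == ">"  && v+0 >  t+0) || (op == "<"  && v+0 <  t+0) ||
           (op == "==" && v+0 == t+0) || (op == "!=" && v+0 != t+0)
      exit !ok
    }'; then
  echo "gate $gate_name passed ($value $op $threshold)"
else
  echo "gate $gate_name FAILED ($value $op $threshold) -> outcome: degenerate"
fi
```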

### 3.4 Evaluate Batch

After all experiments in the batch have been measured:

1. **Rank** experiments by primary metric improvement:
   - For hard metrics: compare to the current best using `metric.primary.direction` (`maximize` means higher is better, `minimize` means lower is better), and require the absolute improvement to exceed `measurement.stability.noise_threshold` before treating it as a real win
   - For judge metrics: compare the configured primary judge score (`metric.judge.scoring.primary` / `metric.primary.name`) to the current best, and require it to exceed `minimum_improvement`

2. **Identify the best experiment** that passes all gates and improves the primary metric

3. **If best improves on current best: KEEP**
   - Commit the experiment branch first so the winning diff exists as a real commit before any merge or cherry-pick
   - Include only mutable-scope changes in that commit; if no eligible diff remains, treat the experiment as non-improving and revert it
   - Merge the committed experiment branch into the optimization branch
   - Use the message `optimize(<spec-name>): <hypothesis description>` for the experiment commit
   - After the merge succeeds, clean up the winner's experiment worktree and branch; the integrated commit on the optimization branch is the durable artifact
   - This is now the new baseline for subsequent batches

4. **Check file-disjoint runners-up** (up to `max_runner_up_merges_per_batch`):
   - For each runner-up that also improved, check file-level disjointness with the kept experiment
   - **File-level disjointness**: two experiments are disjoint if they modified completely different files. Same file = overlapping, even if different lines. A git-based sketch of this check follows this list.
   - If disjoint: cherry-pick the runner-up onto the new baseline, re-run full measurement
   - If combined measurement is strictly better: keep the cherry-pick (outcome: `runner_up_kept`), then clean up that runner-up's experiment worktree and branch
   - Otherwise: revert the cherry-pick, log as "promising alone but neutral/harmful in combination" (outcome: `runner_up_reverted`), then clean up the runner-up's experiment worktree and branch
   - Stop after first failed combination

5. **Handle deferred deps**: experiments that need unapproved dependencies get outcome `deferred_needs_approval`

6. **Revert all others**: cleanup worktrees, log as `reverted`
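
A sketch of the file-level disjointness check using plain git; the branch names are illustrative and follow the `optimize-exp/<spec-name>/exp-<NNN>` layout created by the worktree script:

```bash
# Sketch: two experiments are file-disjoint if their changed-file lists do not overlap.
base="optimize/<spec-name>"                  # optimization branch before this batch
kept="optimize-exp/<spec-name>/exp-004"      # hypothetical kept experiment
runner_up="optimize-exp/<spec-name>/exp-006" # hypothetical runner-up

overlap=$(comm -12 \
  <(git diff --name-only "$base"..."$kept" | sort) \
  <(git diff --name-only "$base"..."$runner_up" | sort))

if [ -z "$overlap" ]; then
  echo "disjoint -> cherry-pick the runner-up onto the new baseline and re-measure"
else
  echo "overlapping files:"; echo "$overlap"
fi
```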

### 3.5 Update State (CP-4)

**MANDATORY CHECKPOINT.** By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state and verifies.

1. **Re-read the experiment log from disk** — do not trust in-memory state. The log is the source of truth.

2. **Finalize outcomes** — update experiment entries from step 3.4 evaluation (mark `kept`, `reverted`, `runner_up_kept`, etc.). Write these outcome updates to disk immediately.

3. **Update the `best` section** in the experiment log if a new best was found. Write to disk.

4. **Write strategy digest** to `.context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md`:
   - Categories tried so far (with success/failure counts)
   - Key learnings from this batch and overall
   - Exploration frontier: what categories and approaches remain untried
   - Current best metrics and improvement from baseline

5. **Generate new hypotheses** based on learnings:
   - Re-read the strategy digest from disk (not from memory)
   - Read the rolling window (last 10 experiments from the log on disk)
   - Do NOT read the full experiment log -- use the digest for broad context
   - Add new hypotheses to the backlog and write the updated backlog to disk

6. **Write updated hypothesis backlog to disk** — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones.

**CP-4 Verification:** Read the experiment log back from disk. Confirm: (a) all experiment outcomes from this batch are finalized, (b) the `best` section reflects the current best, (c) the hypothesis backlog is updated. Read `strategy-digest.md` back and confirm it exists. Only THEN proceed to the next batch or stopping criteria check.

**Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.**

### 3.6 Check Stopping Criteria

Stop the loop if ANY of these are true:
- **Target reached**: `stopping.target_reached` is true, `metric.primary.target` is set, and the primary metric reaches that target according to `metric.primary.direction` (`>=` for `maximize`, `<=` for `minimize`)
- **Max iterations**: total experiments run >= `stopping.max_iterations`
- **Max hours**: wall-clock time since Phase 3 start >= `stopping.max_hours`
- **Judge budget exhausted**: cumulative judge spend >= `metric.judge.max_total_cost_usd` (if set)
- **Plateau**: no improvement for `stopping.plateau_iterations` consecutive experiments
- **Manual stop**: user interrupts (save state and proceed to Phase 4)
- **Empty backlog**: no hypotheses remain and no new ones can be generated

If no stopping criterion is met, proceed to the next batch (step 3.1).

### 3.7 Cross-Cutting Concerns

**Codex failure cascade**: Track consecutive Codex delegation failures. After 3 consecutive failures, auto-disable Codex for remaining experiments and fall back to subagent dispatch. Log the switch.

**Error handling**: If an experiment's measurement command crashes, times out, or produces malformed output:
- Log as outcome `error` or `timeout` with the error message
- Revert the experiment (cleanup worktree)
- The loop continues with remaining experiments in the batch

**Progress reporting**: After each batch, report:
- Batch N of estimated M (based on backlog size)
- Experiments run this batch and total
- Current best metric and improvement from baseline
- Cumulative judge cost (if applicable)

**Crash recovery**: See Persistence Discipline section. Per-experiment `result.yaml` markers are written in step 3.3. Individual experiment results are appended to the log immediately in step 3.3. Batch-level state (outcomes, best, digest) is written in step 3.5. On resume (Phase 0.4), the log on disk is the ground truth — scan for any `result.yaml` markers not yet reflected in the log.

---

## Phase 4: Wrap-Up

### 4.1 Present Deferred Hypotheses

If any hypotheses were deferred due to unapproved dependencies:
1. List them with their dependency requirements
2. Ask the user whether to approve, skip, or save for a future run
3. If approved: add to backlog and offer to re-enter Phase 3 for one more round

### 4.2 Summarize Results

Present a comprehensive summary:

```
Optimization: <spec-name>
Duration: <wall-clock time>
Total experiments: <count>
  Kept: <count> (including <runner_up_kept_count> runner-up merges)
  Reverted: <count>
  Degenerate: <count>
  Errors: <count>
  Deferred: <count>

Baseline -> Final:
  <primary_metric>: <baseline_value> -> <final_value> (<delta>)
  <gate_metrics>: ...
  <diagnostics>: ...

Judge cost: $<total_judge_cost_usd> (if applicable)

Key improvements:
  1. <kept experiment 1 hypothesis> (+<delta>)
  2. <kept experiment 2 hypothesis> (+<delta>)
  ...
```

### 4.3 Preserve and Offer Next Steps

The optimization branch (`optimize/<spec-name>`) is preserved with all commits from kept experiments.
The experiment log and strategy digest remain in local `.context/...` scratch space for resume and audit on this machine only; they do not travel with the branch because `.context/` is gitignored.

Present post-completion options via the platform question tool:

1. **Run `/ce:review`** on the cumulative diff (baseline to final). Load the `ce:review` skill with `mode:autofix` on the optimization branch.
2. **Run `/ce:compound`** to document the winning strategy as an institutional learning.
3. **Create PR** from the optimization branch to the default branch.
4. **Continue** with more experiments: re-enter Phase 3 with the current state. State re-read first.
5. **Done** -- leave the optimization branch for manual review.

### 4.4 Cleanup

Clean up scratch space:
```bash
# Keep the experiment log for local resume/audit on this machine
# Remove temporary batch artifacts
rm -f .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
```

Do NOT delete the experiment log if the user may resume locally or wants a local audit trail. If they need a durable shared artifact, summarize or export the results into a tracked path before cleanup.
Do NOT delete experiment worktrees that are still being referenced.
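
When the user does want a durable shared artifact, one possible export looks like this; the destination path is purely illustrative:

```bash
# Sketch: copy the scratch-state results into a tracked path before cleanup.
DEST="docs/optimizations/<spec-name>"

mkdir -p "$DEST"
cp ".context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml" "$DEST/"
git add "$DEST" && git commit -m "docs(<spec-name>): export optimization experiment log"
```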
@@ -0,0 +1,64 @@
# Minimal first-run template for objective metrics.
# Start here when "better" is a scalar value from the measurement harness.

name: improve-build-latency
description: Reduce build latency without regressing correctness

metric:
  primary:
    type: hard
    name: build_seconds
    direction: minimize
  degenerate_gates:
    - name: build_passed
      check: "== 1"
      description: The build must stay green
    - name: test_pass_rate
      check: ">= 1.0"
      description: Required tests must keep passing
  diagnostics:
    - name: artifact_size_mb
    - name: peak_memory_mb

measurement:
  command: "python evaluate.py"
  timeout_seconds: 300
  working_directory: "tools/eval"
  stability:
    mode: repeat
    repeat_count: 3
    aggregation: median
    noise_threshold: 0.05

scope:
  mutable:
    - "src/build/"
    - "config/build.yaml"
  immutable:
    - "tools/eval/evaluate.py"
    - "tests/fixtures/"
    - "scripts/ci/"

execution:
  mode: serial
  backend: worktree
  max_concurrent: 1

parallel:
  port_strategy: none
  shared_files: []

dependencies:
  approved: []

constraints:
  - "Keep output artifacts backward compatible"
  - "Do not skip required validation steps"

stopping:
  max_iterations: 4
  max_hours: 1
  plateau_iterations: 3
  target_reached: true

max_runner_up_merges_per_batch: 0
@@ -0,0 +1,78 @@
# Minimal first-run template for qualitative metrics.
# Start here when true quality requires semantic judgment, not a proxy metric.

name: improve-search-relevance
description: Improve semantic relevance of search results without obvious failures

metric:
  primary:
    type: judge
    name: mean_score
    direction: maximize
  degenerate_gates:
    - name: result_count
      check: ">= 5"
      description: Return enough results to judge quality
    - name: empty_query_failures
      check: "== 0"
      description: Empty or trivial queries must not fail
  diagnostics:
    - name: latency_ms
    - name: recall_at_10
  judge:
    rubric: |
      Rate each result set from 1-5 for relevance:
      - 5: Results are directly relevant and well ordered
      - 4: Mostly relevant with minor ordering issues
      - 3: Mixed relevance or one obvious miss
      - 2: Weak relevance, several misses, or poor ordering
      - 1: Mostly irrelevant
      Also report: ambiguous (boolean)
    scoring:
      primary: mean_score
      secondary:
        - ambiguous_rate
    model: haiku
    sample_size: 10
    batch_size: 5
    sample_seed: 42
    minimum_improvement: 0.2
    max_total_cost_usd: 5

measurement:
  command: "python eval_search.py"
  timeout_seconds: 300
  working_directory: "tools/eval"

scope:
  mutable:
    - "src/search/"
    - "config/search.yaml"
  immutable:
    - "tools/eval/eval_search.py"
    - "tests/fixtures/"
    - "docs/"

execution:
  mode: serial
  backend: worktree
  max_concurrent: 1

parallel:
  port_strategy: none
  shared_files: []

dependencies:
  approved: []

constraints:
  - "Preserve the existing search response shape"
  - "Do not add new dependencies on the first run"

stopping:
  max_iterations: 4
  max_hours: 1
  plateau_iterations: 3
  target_reached: true

max_runner_up_merges_per_batch: 0
@@ -0,0 +1,257 @@
# Experiment Log Schema
# This is the canonical schema for the experiment log file that accumulates
# across an optimization run.
#
# Location: .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml
#
# PERSISTENCE MODEL:
# The experiment log on disk is the SINGLE SOURCE OF TRUTH. The agent's
# in-memory context is expendable and will be compacted during long runs.
#
# Write discipline:
# - Each experiment entry is APPENDED immediately after its measurement
#   completes (SKILL.md step 3.3), before batch evaluation
# - Outcome fields may be updated in-place after batch evaluation (step 3.5)
# - The `best` section is updated after each batch if a new best is found
# - The `hypothesis_backlog` is updated after each batch
# - The agent re-reads this file from disk at every phase boundary
#
# The orchestrator does NOT read the full log each iteration -- it uses a
# rolling window (last 10 experiments) + a strategy digest file for
# hypothesis generation. But the full log exists on disk for resume,
# crash recovery, and post-run analysis.

# ============================================================================
# TOP-LEVEL STRUCTURE
# ============================================================================

structure:

  spec:
    type: string
    required: true
    description: "Name of the optimization spec this log belongs to"

  run_id:
    type: string
    required: true
    description: "Unique identifier for this optimization run (timestamp-based). Distinguishes resumed runs from fresh starts."

  started_at:
    type: string
    format: "ISO 8601 timestamp"
    required: true

  baseline:
    type: object
    required: true
    description: "Metrics measured on the original code before any optimization"
    children:
      timestamp:
        type: string
        format: "ISO 8601 timestamp"
      gates:
        type: object
        description: "Key-value pairs of gate metric names to their baseline values"
      diagnostics:
        type: object
        description: "Key-value pairs of diagnostic metric names to their baseline values"
      judge:
        type: object
        description: "Judge scores on the baseline (only when primary type is 'judge')"
        children:
          # All fields from the scoring config appear here
          # Plus:
          sample_seed:
            type: integer
          judge_cost_usd:
            type: number

  experiments:
    type: array
    required: true
    description: "Ordered list of all experiments, including kept, reverted, errored, and deferred"
    items:
      type: object
      # See EXPERIMENT ENTRY below

  best:
    type: object
    required: true
    description: "Summary of the current best result"
    children:
      iteration:
        type: integer
        description: "Iteration number of the best experiment (use 0 for the baseline snapshot before any experiment is kept)"
      metrics:
        type: object
        description: "All metric values from the current best state (seed with baseline metrics during CP-1)"
      judge:
        type: object
        description: "Judge scores from the best experiment (only when primary type is 'judge')"
      total_judge_cost_usd:
        type: number
        description: "Running total of all judge costs across all experiments"

  hypothesis_backlog:
    type: array
    description: "Remaining hypotheses not yet tested"
    items:
      type: object
      children:
        description:
          type: string
        category:
          type: string
        priority:
          type: string
          enum: [high, medium, low]
        dep_status:
          type: string
          enum: [approved, needs_approval, not_applicable]
        required_deps:
          type: array
          items:
            type: string

# ============================================================================
# EXPERIMENT ENTRY
# ============================================================================

experiment_entry:
  required_children:

    iteration:
      type: integer
      description: "Sequential experiment number (1-indexed, monotonically increasing)"

    batch:
      type: integer
      description: "Batch number this experiment was part of. Multiple experiments in the same batch ran in parallel."

    hypothesis:
      type: string
      description: "Human-readable description of what this experiment tried"

    category:
      type: string
      description: "Category for grouping and diversity selection (e.g., signal-extraction, graph-signals, embedding, algorithm, preprocessing)"

    outcome:
      type: enum
      values:
        - measured                  # measurement finished and metrics were persisted, awaiting batch evaluation
        - kept                      # primary metric improved, gates passed -> merged to optimization branch
        - reverted                  # primary metric did not improve or was worse -> changes discarded
        - degenerate                # degenerate gate failed -> immediately reverted, no judge evaluation
        - error                     # measurement command crashed, timed out, or produced malformed output
        - deferred_needs_approval   # experiment needs an unapproved dependency -> set aside for batch approval
        - timeout                   # measurement command exceeded timeout_seconds
        - runner_up_kept            # file-disjoint runner-up that was cherry-picked and re-measured successfully
        - runner_up_reverted        # file-disjoint runner-up that was cherry-picked but combined measurement was not better
      description: >
        Load-bearing state: the loop branches on this value.
        'measured' is the only non-terminal state and exists so CP-3 can persist
        raw metrics before batch-level comparison decides the final outcome.
        'kept' and 'runner_up_kept' advance the optimization branch.
        'deferred_needs_approval' items are re-presented at wrap-up.
        All other states are terminal for that experiment.

  optional_children:

    changes:
      type: array
      description: "Files modified by this experiment"
      items:
        type: object
        children:
          file:
            type: string
          summary:
            type: string

    gates:
      type: object
      description: "Gate metric values from the measurement command"

    gates_passed:
      type: boolean
      description: "Whether all degenerate gates passed"

    diagnostics:
      type: object
      description: "Diagnostic metric values from the measurement command"

    judge:
      type: object
      description: "Judge evaluation scores (only when primary type is 'judge' and gates passed)"
      children:
        # All fields from scoring.primary and scoring.secondary appear here
        # Plus:
        judge_cost_usd:
          type: number
          description: "Cost of judge calls for this experiment"

    primary_delta:
      type: string
      description: "Change in primary metric from current best (e.g., '+0.7', '-0.3')"

    learnings:
      type: string
      description: "What was learned from this experiment. The agent reads these to avoid re-trying similar approaches and to inform new hypothesis generation."

    commit:
      type: string
      description: "Git commit SHA on the optimization branch (only for 'kept' and 'runner_up_kept' outcomes)"

    deferred_reason:
      type: string
      description: "Why this experiment was deferred (only for 'deferred_needs_approval' outcome)"

    error_message:
      type: string
      description: "Error details (only for 'error' and 'timeout' outcomes)"

    merged_with:
      type: integer
      description: "Iteration number of the experiment this was merged with (only for 'runner_up_kept' and 'runner_up_reverted')"

# ============================================================================
# OUTCOME STATE TRANSITIONS
# ============================================================================
#
# proposed (in hypothesis_backlog)
#   -> selected for batch
#   -> experiment dispatched
#   -> measurement completed
#     -> gates failed                 -> outcome: degenerate
#     -> measurement error            -> outcome: error
#     -> measurement timeout          -> outcome: timeout
#     -> gates passed
#       -> persist raw metrics        -> outcome: measured
#       -> judge evaluated (if type: judge)
#         -> best in batch, improved  -> outcome: kept
#         -> runner-up, file-disjoint -> cherry-pick + re-measure
#           -> combined better        -> outcome: runner_up_kept
#           -> combined not better    -> outcome: runner_up_reverted
#         -> not improved             -> outcome: reverted
#     -> needs unapproved dep         -> outcome: deferred_needs_approval
#
# Only 'kept' and 'runner_up_kept' produce a commit on the optimization branch.
# Only 'deferred_needs_approval' items are re-presented at wrap-up for approval.

# ============================================================================
# STRATEGY DIGEST (separate file)
# ============================================================================
#
# Written after each batch to:
#   .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
#
# Contains a compressed summary of:
# - What hypothesis categories have been tried
# - Which approaches succeeded (kept) and which failed (reverted)
# - The exploration frontier: what hasn't been tried yet
# - Key learnings that should inform next hypotheses
#
# The orchestrator reads the strategy digest (not the full experiment log)
# when generating new hypotheses between batches.
@@ -0,0 +1,89 @@
# Experiment Worker Prompt Template

This template is used by the orchestrator to dispatch each experiment to a subagent or Codex. Variable substitution slots are filled at spawn time.

---

## Template

```
You are an optimization experiment worker.

Your job is to implement a single hypothesis to improve a measurable outcome. You will modify code within a defined scope, then stop. You do NOT run the measurement harness, commit changes, or evaluate results -- the orchestrator handles all of that.

<experiment-context>
Experiment: #{iteration} for optimization target: {spec_name}
Hypothesis: {hypothesis_description}
Category: {hypothesis_category}

Current best metrics:
{current_best_metrics}

Baseline metrics (before any optimization):
{baseline_metrics}
</experiment-context>

<scope-rules>
You MAY modify files in these paths:
{scope_mutable}

You MUST NOT modify files in these paths:
{scope_immutable}

CRITICAL: Do not modify any file outside the mutable scope. The measurement harness and evaluation data are immutable by design -- the agent cannot game the metric by changing how it is measured.
</scope-rules>

<constraints>
{constraints}
</constraints>

<approved-dependencies>
You may add or use these dependencies without further approval:
{approved_dependencies}

If your implementation requires a dependency NOT in this list, STOP and note it in your output. Do not install unapproved dependencies.
</approved-dependencies>

<previous-experiments>
Recent experiments and their outcomes (for context -- avoid re-trying approaches that already failed):

{recent_experiment_summaries}
</previous-experiments>

<instructions>
1. Read and understand the relevant code in the mutable scope
2. Implement the hypothesis described above
3. Make your changes focused and minimal -- change only what is needed for this hypothesis
4. Do NOT run the measurement harness (the orchestrator handles this)
5. Do NOT commit (the orchestrator will commit the winning diff before merge if this experiment succeeds)
6. Do NOT modify files outside the mutable scope
7. When done, run `git diff --stat` so the orchestrator can see your changes
8. If you discover you need an unapproved dependency, note it and stop

Focus on implementing the hypothesis well. The orchestrator will measure and evaluate the results.
</instructions>
```

## Variable Reference

| Variable | Source | Description |
|----------|--------|-------------|
| `{iteration}` | Experiment counter | Sequential experiment number |
| `{spec_name}` | Spec file `name` field | Optimization target identifier |
| `{hypothesis_description}` | Hypothesis backlog | What this experiment should try |
| `{hypothesis_category}` | Hypothesis backlog | Category (signal-extraction, algorithm, etc.) |
| `{current_best_metrics}` | Experiment log `best` section | Current best metric values (compact YAML or key: value pairs) |
| `{baseline_metrics}` | Experiment log `baseline` section | Original baseline before any optimization |
| `{scope_mutable}` | Spec `scope.mutable` | List of files/dirs the worker may modify |
| `{scope_immutable}` | Spec `scope.immutable` | List of files/dirs the worker must not touch |
| `{constraints}` | Spec `constraints` | Free-text constraints to follow |
| `{approved_dependencies}` | Spec `dependencies.approved` | Dependencies approved for use |
| `{recent_experiment_summaries}` | Rolling window (last 10) from experiment log | Compact summaries: hypothesis, outcome, learnings |

## Notes

- This template works for both subagent and Codex dispatch. No platform-specific assumptions.
- For Codex dispatch: write the filled template to a temp file and pipe via stdin (`cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1`).
- For subagent dispatch: pass the filled template as the subagent prompt.
- Keep `{recent_experiment_summaries}` concise -- 2-3 lines per experiment, last 10 only. Do not include the full experiment log.
- The worker should NOT read the full experiment log or strategy digest. It receives only what the orchestrator provides.
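
For illustration only, one possible compact shape for `{recent_experiment_summaries}` (the exact format is up to the orchestrator; field names follow the experiment log schema, values are made up):

```yaml
- iteration: 12
  hypothesis: "Lower clustering threshold from 0.8 to 0.7"
  outcome: reverted
  learnings: "Coverage rose but judge coherence dropped sharply"
- iteration: 13
  hypothesis: "Weight title tokens higher than body text"
  outcome: kept
  learnings: "Improved coherence on large clusters"
```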
@@ -0,0 +1,110 @@
# Judge Evaluation Prompt Template

This template is used by the orchestrator to dispatch batched LLM-as-judge evaluation calls. Each judge sub-agent evaluates a batch of sampled output items and returns structured JSON scores.

The orchestrator:
1. Reads the experiment's output
2. Selects samples per the stratification config (using fixed seed)
3. Groups samples into batches of `judge.batch_size`
4. Dispatches `ceil(sample_size / batch_size)` parallel sub-agents using this template
5. Aggregates returned JSON scores
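
The sampling and batching above are driven by the spec's `metric.judge` block. A minimal sketch (field names come from the spec schema; values and bucket names are illustrative):

```yaml
metric:
  judge:
    model: haiku
    sample_size: 10      # total items sampled per experiment
    batch_size: 5        # ceil(10 / 5) = 2 parallel judge sub-agents
    sample_seed: 42      # fixed seed keeps samples comparable across experiments
    stratification:
      - bucket: "large_clusters"
        count: 5
      - bucket: "small_clusters"
        count: 5
```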

---

## Item Evaluation Template

```
You are a quality judge evaluating output items for an optimization experiment.

Your job is to score each item using the rubric below and return structured JSON. Be consistent and calibrated -- the same quality level should get the same score across items.

<rubric>
{rubric}
</rubric>

<items>
{items_json}
</items>

<output-contract>
Return ONLY a valid JSON array. No prose, no markdown, no explanation outside the JSON.

Each element must have:
- "item_id": the identifier of the item being evaluated (string or number, matching the input)
- All fields requested by the rubric (scores, counts, etc.)
- "ambiguous": true if you cannot confidently score this item (e.g., insufficient context, borderline case). When ambiguous, still provide your best-guess score but flag it.

Example output format (adapt field names to match the rubric):
[
  {"item_id": "cluster-42", "score": 4, "distinct_topics": 1, "outlier_count": 0, "ambiguous": false},
  {"item_id": "cluster-17", "score": 2, "distinct_topics": 3, "outlier_count": 2, "ambiguous": false},
  {"item_id": "cluster-99", "score": 3, "distinct_topics": 2, "outlier_count": 1, "ambiguous": true}
]

Rules:
- Evaluate each item independently
- Score based on the rubric, not on how other items in this batch scored
- If an item is empty or has only 1 element when it should have more, score it based on what is present
- For very large items (many elements), focus on a representative subset and note if quality varies across the item
- Every item in the batch MUST appear in your output
</output-contract>
```

## Singleton Evaluation Template

```
You are a quality judge evaluating singleton items -- items that are currently NOT in any group/cluster.

Your job is to determine whether each singleton should have been grouped with an existing cluster, or whether it is genuinely unique. Return structured JSON.

<rubric>
{singleton_rubric}
</rubric>

<singletons>
{singletons_json}
</singletons>

<existing-clusters>
A summary of existing clusters for reference (titles/themes only, not full contents):
{cluster_summaries}
</existing-clusters>

<output-contract>
Return ONLY a valid JSON array. No prose, no markdown, no explanation outside the JSON.

Each element must have:
- "item_id": the identifier of the singleton
- All fields requested by the singleton rubric (should_cluster, best_cluster_id, confidence, etc.)

Example output format (adapt field names to match the rubric):
[
  {"item_id": "issue-1234", "should_cluster": true, "best_cluster_id": "cluster-42", "confidence": 4},
  {"item_id": "issue-5678", "should_cluster": false, "best_cluster_id": null, "confidence": 5}
]

Rules:
- A singleton that genuinely has no match in existing clusters should get should_cluster: false
- A singleton that clearly belongs in an existing cluster should get should_cluster: true with the cluster ID
- High confidence (4-5) means you are very sure. Low confidence (1-2) means the item is borderline.
- Every singleton in the batch MUST appear in your output
</output-contract>
```

## Variable Reference

| Variable | Source | Description |
|----------|--------|-------------|
| `{rubric}` | Spec `metric.judge.rubric` | User-defined scoring rubric |
| `{items_json}` | Sampled output items | JSON array of items to evaluate (one batch worth) |
| `{singleton_rubric}` | Spec `metric.judge.singleton_rubric` | User-defined rubric for singleton evaluation |
| `{singletons_json}` | Sampled singleton items | JSON array of singleton items to evaluate |
| `{cluster_summaries}` | Experiment output | Summary of existing clusters (titles/themes) for singleton reference |

## Notes

- Designed for Haiku by default -- prompts are concise and well-structured for smaller models
- The rubric is part of the immutable measurement harness -- the experiment agent cannot modify it
- The `ambiguous` flag on items helps the orchestrator identify noisy evaluations without forcing bad scores
- For singleton evaluation, the orchestrator provides cluster summaries (not full contents) to keep judge context lean
- Each sub-agent evaluates one batch independently -- sub-agents do not see each other's results
@@ -0,0 +1,392 @@
# Optimization Spec Schema
# This is the canonical schema for optimization spec files created by users
# to configure a /ce-optimize run. The orchestrating agent validates specs
# against this schema before proceeding.
#
# Usage: Create a YAML file matching this schema and pass it to /ce-optimize.
# The agent reads this spec, validates required fields, and uses it to
# configure the entire optimization run.

# ============================================================================
# REQUIRED FIELDS
# ============================================================================

required_fields:

  name:
    type: string
    pattern: "^[a-z0-9]+(?:-[a-z0-9]+)*$"
    description: "Unique identifier for this optimization run (lowercase kebab-case, safe for git refs and worktree paths)"
    example: "improve-issue-clustering"

  description:
    type: string
    description: "Human-readable description of the optimization goal"
    example: "Improve coherence and coverage of issue/PR clusters"

  metric:
    type: object
    description: "Three-tier metric configuration"
    required_children:

      primary:
        type: object
        description: "The metric the loop optimizes against"
        required_children:

          type:
            type: enum
            values:
              - hard   # scalar metric from measurement command (e.g., build time, test pass rate)
              - judge  # LLM-as-judge quality score from sampled outputs
            description: "Whether the primary metric comes from the measurement command directly or from LLM-as-judge evaluation"

          name:
            type: string
            description: "Metric name -- must match a key in the measurement command's JSON output (for hard type) or a scoring field (for judge type)"
            example: "cluster_coherence"

          direction:
            type: enum
            values:
              - maximize
              - minimize
            description: "Whether higher or lower is better"

        optional_children:

          baseline:
            type: number
            default: null
            description: "Filled automatically during Phase 1 baseline measurement. Do not set manually."

          target:
            type: number
            default: null
            description: "Optional target value. Loop stops when this is reached."
            example: 4.2

      degenerate_gates:
        type: array
        description: "Fast boolean checks that reject obviously broken solutions before expensive evaluation. Run first, before the primary metric or judge."
        required: true
        items:
          type: object
          required_children:
            name:
              type: string
              description: "Metric name -- must match a key in the measurement command's JSON output"
            check:
              type: string
              description: "Comparison operator and threshold. Supported operators: >=, <=, >, <, ==, !="
              example: "<= 0.10"
          optional_children:
            description:
              type: string
              description: "Human-readable explanation of what this gate catches"

    optional_children:

      diagnostics:
        type: array
        default: []
        description: "Metrics logged for understanding but never gated on. Useful for understanding WHY a primary metric changed."
        items:
          type: object
          required_children:
            name:
              type: string
              description: "Metric name -- must match a key in the measurement command's JSON output"

      judge:
        type: object
        description: "LLM-as-judge configuration. Required when metric.primary.type is 'judge'. Ignored when type is 'hard'."
        required_when: "metric.primary.type == 'judge'"
        required_children:
          rubric:
            type: string
            description: "Multi-line rubric text sent to the judge model. Must instruct the judge to return JSON."
            example: |
              Rate this cluster 1-5:
              - 5: All items clearly about the same issue/feature
              - 4: Strong theme, minor outliers
              - 3: Related but covers 2-3 sub-topics
              - 2: Weak connection
              - 1: Unrelated items grouped together
          scoring:
            type: object
            required_children:
              primary:
                type: string
                description: "Field name from judge JSON output to use as the primary optimization target"
                example: "mean_score"
            optional_children:
              secondary:
                type: array
                default: []
                description: "Additional scoring fields to log (not optimized against)"
        optional_children:
          model:
            type: enum
            values:
              - haiku
              - sonnet
            default: haiku
            description: "Model to use for judge evaluation. Haiku is cheaper and faster; Sonnet is more nuanced."
          sample_size:
            type: integer
            default: 10
            description: "Total number of output items to sample for judge evaluation per experiment"
          stratification:
            type: array
            default: null
            description: "Stratified sampling buckets. If null, uses uniform random sampling."
            items:
              type: object
              required_children:
                bucket:
                  type: string
                  description: "Bucket name for this stratum"
                count:
                  type: integer
                  description: "Number of items to sample from this bucket"
          singleton_sample:
            type: integer
            default: 0
            description: "Number of singleton items to sample for false-negative evaluation"
          singleton_rubric:
            type: string
            default: null
            description: "Rubric for evaluating sampled singletons. Required if singleton_sample > 0."
          sample_seed:
            type: integer
            default: 42
            description: "Fixed seed for reproducible sampling across experiments"
          batch_size:
            type: integer
            default: 5
            description: "Number of samples per judge sub-agent batch. Controls parallelism vs overhead."
          minimum_improvement:
            type: number
            default: 0.3
            description: "Minimum judge score improvement required to accept an experiment as 'better'. Accounts for sample-composition variance when output structure changes between experiments. Distinct from measurement.stability.noise_threshold, which handles run-to-run flakiness."
          max_total_cost_usd:
            type: number
            default: 5
            description: "Stop judge evaluation when cumulative judge spend reaches this cap. This is a first-run safety default; raise it only after the rubric and harness are trustworthy. Set to null only with explicit user approval."

  measurement:
    type: object
    description: "How to run the measurement harness"
    required_children:
      command:
        type: string
        description: "Shell command that runs the evaluation and outputs JSON to stdout. The JSON must contain keys matching all gate names and diagnostic names."
        example: "python evaluate.py"
    optional_children:
      timeout_seconds:
        type: integer
        default: 600
        description: "Maximum seconds for the measurement command to run before being killed"
      output_format:
        type: enum
        values:
          - json
        default: json
        description: "Format of the measurement command's stdout. Currently only JSON is supported."
      working_directory:
        type: string
        default: "."
        description: "Working directory for the measurement command, relative to the repo root"
      stability:
        type: object
        default: { mode: "stable" }
        description: "How to handle metric variance across runs"
        required_children:
          mode:
            type: enum
            values:
              - stable  # run once, trust the result
              - repeat  # run N times, aggregate
            default: stable
        optional_children:
          repeat_count:
            type: integer
            default: 5
            description: "Number of times to run the harness when mode is 'repeat'"
          aggregation:
            type: enum
            values:
              - median
              - mean
              - min
              - max
            default: median
            description: "How to combine repeated measurements into a single value"
          noise_threshold:
            type: number
            default: 0.02
            description: "An improvement must exceed this value to count as a real improvement rather than noise. Applied to hard metrics only."

  scope:
    type: object
    description: "What the experiment agent is allowed to modify"
    required_children:
      mutable:
        type: array
        description: "Files and directories the agent MAY modify during experiments"
        items:
          type: string
          description: "File path or directory (relative to repo root). Directories match all files within."
        example:
          - "src/clustering/"
          - "src/preprocessing/"
          - "config/clustering.yaml"
      immutable:
        type: array
        description: "Files and directories the agent MUST NOT modify. The measurement harness should always be listed here."
        items:
          type: string
        example:
          - "evaluate.py"
          - "tests/fixtures/"
          - "data/"

# ============================================================================
# OPTIONAL FIELDS
# ============================================================================

optional_fields:

  execution:
    type: object
    default: { mode: "parallel", backend: "worktree", max_concurrent: 4 }
    description: "How experiments are executed"
    optional_children:
      mode:
        type: enum
        values:
          - parallel  # run experiments simultaneously (default)
          - serial    # run one at a time
        default: parallel
      backend:
        type: enum
        values:
          - worktree  # git worktrees for isolation (default)
          - codex     # Codex sandboxes for isolation
        default: worktree
      max_concurrent:
        type: integer
        default: 4
        minimum: 1
        description: "Maximum experiments to run in parallel. Capped at 6 for worktree backend. 8+ only valid for Codex backend."
      codex_security:
        type: enum
        values:
          - full-auto  # --full-auto (workspace write)
          - yolo       # --dangerously-bypass-approvals-and-sandbox
        default: null
        description: "Codex security posture. If null, user is asked once per session."

  parallel:
    type: object
    default: {}
    description: "Parallelism configuration discovered or set during Phase 1"
    optional_children:
      port_strategy:
        type: enum
        values:
          - parameterized  # use env var for port
          - none           # no port parameterization needed
        default: null
        description: "If null, auto-detected during Phase 1 parallelism probe"
      port_env_var:
        type: string
        default: null
        description: "Environment variable name for port parameterization (e.g., EVAL_PORT)"
      port_base:
        type: integer
        default: null
        description: "Base port number. Each experiment gets port_base + experiment_index."
      shared_files:
        type: array
        default: []
        description: "Files that must be copied into each experiment worktree (e.g., SQLite databases)"
        items:
          type: string
      exclusive_resources:
        type: array
        default: []
        description: "Resources requiring exclusive access (e.g., 'gpu'). If non-empty, forces serial mode."
        items:
          type: string

  dependencies:
    type: object
    default: { approved: [] }
    description: "Dependency management for experiments"
    optional_children:
      approved:
        type: array
        default: []
        description: "Pre-approved new dependencies that experiments may add"
        items:
          type: string

  constraints:
    type: array
    default: []
    description: "Free-text constraints that experiment agents must follow"
    items:
      type: string
    example:
      - "Do not change the output format of clusters"
      - "Preserve backward compatibility with existing cluster consumers"

  stopping:
    type: object
    default: { max_iterations: 100, max_hours: 8, plateau_iterations: 10, target_reached: true }
    description: "When the optimization loop should stop. Any criterion can trigger a stop."
    optional_children:
      max_iterations:
        type: integer
        default: 100
        description: "Stop after this many total experiments"
      max_hours:
        type: number
        default: 8
        description: "Stop after this many hours of wall-clock time"
      plateau_iterations:
        type: integer
        default: 10
        description: "Stop if no improvement for this many consecutive experiments"
      target_reached:
        type: boolean
        default: true
        description: "Stop when the primary metric reaches the target value (if set)"

  max_runner_up_merges_per_batch:
    type: integer
    default: 1
    description: "Maximum number of file-disjoint runner-up experiments to attempt merging per batch after keeping the best experiment"

# ============================================================================
# VALIDATION RULES
# ============================================================================

validation_rules:
  - "All required fields must be present"
  - "name must be lowercase kebab-case (`^[a-z0-9]+(?:-[a-z0-9]+)*$`)"
  - "metric.primary.type must be 'hard' or 'judge'"
  - "If metric.primary.type is 'judge', metric.judge must be present with rubric and scoring"
  - "metric.degenerate_gates must have at least one entry"
  - "measurement.command must be a non-empty string"
  - "scope.mutable must have at least one entry"
  - "scope.immutable must have at least one entry"
  - "Gate check operators must be one of: >=, <=, >, <, ==, !="
  - "execution.max_concurrent must be >= 1"
  - "execution.max_concurrent must not exceed 6 when execution.backend is 'worktree'"
  - "If parallel.exclusive_resources is non-empty, execution.mode should be 'serial'"
  - "If metric.judge.singleton_sample > 0, metric.judge.singleton_rubric must be present"
  - "If metric.primary.type is 'judge' and metric.judge.max_total_cost_usd is null, the user should explicitly approve uncapped spend"
  - "stopping must have at least one non-default criterion or use defaults"
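
# ============================================================================
# MINIMAL EXAMPLE (illustrative)
# ============================================================================
#
# A sketch of a small hard-metric spec using only fields defined above.
# Values are placeholders, not recommendations:
#
#   name: reduce-build-time
#   description: "Reduce CI build wall-clock time"
#   metric:
#     primary:
#       type: hard
#       name: build_seconds
#       direction: minimize
#     degenerate_gates:
#       - name: test_pass_rate
#         check: ">= 1.0"
#   measurement:
#     command: "python evaluate.py"
#     timeout_seconds: 600
#   scope:
#     mutable:
#       - "build/"
#     immutable:
#       - "evaluate.py"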
@@ -0,0 +1,127 @@
# `/ce-optimize` Usage Guide

## What This Skill Is For

`/ce-optimize` is for hard engineering problems where:

1. You can try multiple code or config variants.
2. You can run the same evaluation against each variant.
3. You want the skill to keep the good variants and reject the bad ones.

It is best for "search the space and score the results" work, not one-shot implementation work.

## When To Use It

Use `/ce-optimize` when the problem looks like:

- "Find the smallest memory limit that stops OOM crashes without wasting RAM."
- "Tune clustering parameters without collapsing everything into one garbage cluster."
- "Find a prompt that is cheaper but still produces summaries good enough for downstream clustering."
- "Compare several ranking, retrieval, batching, or threshold strategies against the same harness."

Choose `type: hard` when success is objective and cheap to measure:

- Memory usage
- Latency
- Throughput
- Test pass rate
- Build time

Choose `type: judge` when a numeric metric can be gamed or when human usefulness matters:

- Cluster coherence
- Search relevance
- Summary quality
- Prompt quality
- Classification quality with semantic edge cases

## When Not To Use It

`/ce-optimize` is usually the wrong tool when:

- The fix is obvious and does not need experimentation
- There is no repeatable measurement harness
- The search space is trivial and only has one plausible answer
- The cost of evaluating variants is too high to justify multiple runs

## How To Think About It

The pattern is:

1. Define the target.
2. Build or validate the measurement harness first.
3. Generate multiple plausible variants.
4. Run the same evaluation loop against each variant.
5. Keep the variants that improve the target without violating guard rails.

The core rule is simple:

- If a hard metric captures "better," optimize the hard metric.
- If a hard metric can be gamed, add LLM-as-judge.

Example: lowering a clustering threshold may increase cluster coverage. That sounds good until everything ends up in one giant cluster. Hard metrics may say "improved"; an LLM judge sampling real clusters can say "this is trash."

## First-Run Advice

For the first run:

- Prefer `execution.mode: serial`
- Set `execution.max_concurrent: 1`
- Keep `stopping.max_iterations` small
- Keep `stopping.max_hours` small
- Avoid new dependencies until the baseline is trustworthy
- In judge mode, use a small sample and a low cost cap

The goal of the first run is to validate the harness, not to win the optimization immediately.
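
Mapped onto the spec fields above, a conservative first run might look like this (a partial sketch with illustrative values, not a complete spec):

```yaml
execution:
  mode: serial
  max_concurrent: 1

stopping:
  max_iterations: 10
  max_hours: 1

metric:
  judge:
    sample_size: 5
    max_total_cost_usd: 2
```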

## Example Prompts

### 1. Memory Tuning

```text
Use /ce-optimize to find the smallest memory setting that keeps this service stable under our load test.

The current container limit is 512 MB and the app sometimes OOM-crashes. Do not just jump to 8 GB. Try a small set of realistic memory limits, run the same load test for each one, and score the results using:
- did the process OOM
- did tail latency spike badly
- did GC pauses become excessive

Prefer the smallest memory limit that passes the guard rails.
```

### 2. Clustering Quality

```text
Use /ce-optimize to improve issue and PR clustering quality.

We have about 18k open issues and PRs. We want to test changes that improve clustering quality, reduce singleton clusters, and improve match quality within each cluster.

Do not mutate the shared default database. Copy it for the run, then use per-experiment copies when needed.

Do not optimize only for coverage. Use LLM-as-judge to sample clusters and confirm they still preserve real semantic similarity instead of collapsing into giant low-quality clusters.
```

### 3. Prompt Optimization

```text
Use /ce-optimize to create a summarization prompt for issues and PRs that minimizes token spend while still producing summaries that are good enough for downstream clustering.

I want the loop to compare prompt variants, measure token cost, and judge whether the summaries preserve the distinctions needed to cluster related issues together without merging unrelated ones.
```

## Choosing Between Hard Metrics And Judge Mode

Use hard metrics alone when:

- "Better" is obvious from the numbers.

Add judge mode when:

- The numbers can improve while the real output gets worse.

Common pattern:

- Hard gates reject broken outputs.
- Judge mode scores the surviving candidates for actual usefulness.

That hybrid setup is often the best default for ranking, clustering, and prompt work.
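
In spec terms, that hybrid usually means hard degenerate gates plus a judge primary metric. A minimal sketch (field names come from the spec schema; gate names, thresholds, and the shortened rubric are illustrative):

```yaml
metric:
  primary:
    type: judge
    name: mean_score
    direction: maximize
  degenerate_gates:
    - name: singleton_rate
      check: "<= 0.40"
    - name: largest_cluster_share
      check: "<= 0.25"
  judge:
    rubric: |
      Rate this cluster 1-5:
      - 5: All items clearly about the same issue/feature
      - 1: Unrelated items grouped together
    scoring:
      primary: "mean_score"
```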
293
plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh
Executable file
293
plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh
Executable file
@@ -0,0 +1,293 @@
#!/bin/bash

# Experiment Worktree Manager
# Creates, cleans up, and manages worktrees for optimization experiments.
# Each experiment gets an isolated worktree with copied shared resources.
#
# Usage:
#   experiment-worktree.sh create <spec_name> <exp_index> <base_branch> [shared_file ...]
#   experiment-worktree.sh cleanup <spec_name> <exp_index>
#   experiment-worktree.sh cleanup-all <spec_name>
#   experiment-worktree.sh count
#
# Worktrees are created at: .worktrees/optimize-<spec>-exp-<NNN>/
# Branches are named: optimize-exp/<spec>/exp-<NNN>
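#
# Example (spec name, base branch, and shared file are illustrative):
#   experiment-worktree.sh create improve-issue-clustering 3 optimize/improve-issue-clustering data/app.db
#   experiment-worktree.sh cleanup improve-issue-clustering 3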

set -euo pipefail

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

GIT_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) || {
  echo -e "${RED}Error: Not in a git repository${NC}" >&2
  exit 1
}

WORKTREE_DIR="$GIT_ROOT/.worktrees"

experiment_branch_name() {
  local spec_name="${1:?Error: spec_name required}"
  local padded_index="${2:?Error: padded_index required}"

  # Keep experiment refs outside optimize/<spec> so they do not collide
  # with the long-lived optimization branch namespace.
  echo "optimize-exp/${spec_name}/exp-${padded_index}"
}

ensure_worktree_exclude() {
  local exclude_file
  exclude_file=$(git rev-parse --git-path info/exclude)

  mkdir -p "$(dirname "$exclude_file")"

  if ! grep -q "^\.worktrees$" "$exclude_file" 2>/dev/null; then
    echo ".worktrees" >> "$exclude_file"
  fi
}

is_registered_worktree() {
  local worktree_path="${1:?Error: worktree_path required}"

  git worktree list --porcelain | awk -v target="$worktree_path" '
    $1 == "worktree" && $2 == target { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

is_branch_checked_out() {
  local branch_name="${1:?Error: branch_name required}"
  local branch_ref="refs/heads/$branch_name"

  git worktree list --porcelain | awk -v target="$branch_ref" '
    $1 == "branch" && $2 == target { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

reset_worktree_to_base() {
  local worktree_path="${1:?Error: worktree_path required}"
  local branch_name="${2:?Error: branch_name required}"
  local base_branch="${3:?Error: base_branch required}"
  local current_branch

  current_branch=$(git -C "$worktree_path" symbolic-ref --quiet --short HEAD 2>/dev/null || true)
  if [[ "$current_branch" != "$branch_name" ]]; then
    echo -e "${RED}Error: Existing worktree is on unexpected branch: ${current_branch:-detached} (expected $branch_name)${NC}" >&2
    echo -e "${RED}Clean up the stale worktree before rerunning this experiment.${NC}" >&2
    return 1
  fi

  echo -e "${YELLOW}Resetting existing experiment worktree to base: $branch_name -> $base_branch${NC}" >&2
  git -C "$worktree_path" reset --hard "$base_branch" >/dev/null
  git -C "$worktree_path" clean -fdx >/dev/null
}

# Create an experiment worktree
create_worktree() {
  local spec_name="${1:?Error: spec_name required}"
  local exp_index="${2:?Error: exp_index required}"
  local base_branch="${3:?Error: base_branch required}"
  shift 3

  local padded_index
  padded_index=$(printf "%03d" "$exp_index")
  local worktree_name="optimize-${spec_name}-exp-${padded_index}"
  local branch_name
  branch_name=$(experiment_branch_name "$spec_name" "$padded_index")
  local worktree_path="$WORKTREE_DIR/$worktree_name"

  # Check if worktree already exists
  if [[ -d "$worktree_path" ]]; then
    if ! git -C "$worktree_path" rev-parse --is-inside-work-tree >/dev/null 2>&1 || \
       ! is_registered_worktree "$worktree_path"; then
      echo -e "${RED}Error: Existing path is not a valid registered git worktree: $worktree_path${NC}" >&2
      echo -e "${RED}Remove or repair that directory before rerunning the experiment.${NC}" >&2
      return 1
    fi

    echo -e "${YELLOW}Worktree already exists: $worktree_path${NC}" >&2
    reset_worktree_to_base "$worktree_path" "$branch_name" "$base_branch"
  else
    mkdir -p "$WORKTREE_DIR"
    ensure_worktree_exclude

    # Create worktree from the base branch
    if ! git worktree add -b "$branch_name" "$worktree_path" "$base_branch" --quiet 2>/dev/null; then
      if git show-ref --verify --quiet "refs/heads/$branch_name"; then
        if is_branch_checked_out "$branch_name"; then
          echo -e "${RED}Error: Existing experiment branch is already checked out: $branch_name${NC}" >&2
          echo -e "${RED}Clean up the stale worktree before rerunning this experiment.${NC}" >&2
          return 1
        fi

        echo -e "${YELLOW}Resetting existing experiment branch to base: $branch_name -> $base_branch${NC}" >&2
        git branch -f "$branch_name" "$base_branch" >/dev/null
        git worktree add "$worktree_path" "$branch_name" --quiet
      else
        echo -e "${RED}Error: Failed to create worktree for $branch_name from $base_branch${NC}" >&2
        return 1
      fi
    fi
  fi

  # Copy .env files from main repo
  for f in "$GIT_ROOT"/.env*; do
    if [[ -f "$f" ]]; then
      local basename
      basename=$(basename "$f")
      if [[ "$basename" != ".env.example" ]]; then
        cp "$f" "$worktree_path/$basename"
      fi
    fi
  done

  # Copy shared files
  for shared_file in "$@"; do
    if [[ -f "$GIT_ROOT/$shared_file" ]]; then
      local dir
      dir=$(dirname "$worktree_path/$shared_file")
      mkdir -p "$dir"
      cp "$GIT_ROOT/$shared_file" "$worktree_path/$shared_file"
    elif [[ -d "$GIT_ROOT/$shared_file" ]]; then
      local dir
      dir=$(dirname "$worktree_path/$shared_file")
      mkdir -p "$dir"
      rm -rf "$worktree_path/$shared_file"
      cp -R "$GIT_ROOT/$shared_file" "$worktree_path/$shared_file"
    fi
  done

  echo "$worktree_path"
}

# Clean up a single experiment worktree
cleanup_worktree() {
  local spec_name="${1:?Error: spec_name required}"
  local exp_index="${2:?Error: exp_index required}"

  local padded_index
  padded_index=$(printf "%03d" "$exp_index")
  local worktree_name="optimize-${spec_name}-exp-${padded_index}"
  local branch_name
  branch_name=$(experiment_branch_name "$spec_name" "$padded_index")
  local worktree_path="$WORKTREE_DIR/$worktree_name"

  if [[ -d "$worktree_path" ]]; then
    git worktree remove "$worktree_path" --force 2>/dev/null || {
      # If worktree remove fails, try manual cleanup
      rm -rf "$worktree_path" 2>/dev/null || true
      git worktree prune 2>/dev/null || true
    }
  fi

  # Delete the experiment branch
  git branch -D "$branch_name" 2>/dev/null || true

  echo -e "${GREEN}Cleaned up: $worktree_name${NC}" >&2
}

# Clean up all experiment worktrees for a spec
cleanup_all() {
  local spec_name="${1:?Error: spec_name required}"
  local prefix="optimize-${spec_name}-exp-"
  local count=0

  if [[ ! -d "$WORKTREE_DIR" ]]; then
    echo -e "${YELLOW}No worktrees directory found${NC}" >&2
    return 0
  fi

  for worktree_path in "$WORKTREE_DIR"/${prefix}*; do
    if [[ -d "$worktree_path" ]]; then
      local worktree_name
      worktree_name=$(basename "$worktree_path")
      # Extract index from name
      local index_str="${worktree_name#$prefix}"

      git worktree remove "$worktree_path" --force 2>/dev/null || {
        rm -rf "$worktree_path" 2>/dev/null || true
      }

      # Delete the branch
      local branch_name
      branch_name=$(experiment_branch_name "$spec_name" "$index_str")
      git branch -D "$branch_name" 2>/dev/null || true

      count=$((count + 1))
    fi
  done

  git worktree prune 2>/dev/null || true

  # Clean up empty worktree directory
  if [[ -d "$WORKTREE_DIR" ]] && [[ -z "$(ls -A "$WORKTREE_DIR" 2>/dev/null)" ]]; then
    rmdir "$WORKTREE_DIR" 2>/dev/null || true
  fi

  echo -e "${GREEN}Cleaned up $count experiment worktree(s) for $spec_name${NC}" >&2
}

# Count total worktrees (for budget check)
count_worktrees() {
  local count=0
  if [[ -d "$WORKTREE_DIR" ]]; then
    for worktree_path in "$WORKTREE_DIR"/*; do
      if [[ -d "$worktree_path" ]] && [[ -e "$worktree_path/.git" ]]; then
        count=$((count + 1))
      fi
    done
  fi
  echo "$count"
}

# Main
main() {
  local command="${1:-help}"

  case "$command" in
    create)
      shift
      create_worktree "$@"
      ;;
    cleanup)
      shift
      cleanup_worktree "$@"
      ;;
    cleanup-all)
      shift
      cleanup_all "$@"
      ;;
    count)
      count_worktrees
      ;;
    help)
      cat << 'EOF'
Experiment Worktree Manager

Usage:
  experiment-worktree.sh create <spec_name> <exp_index> <base_branch> [shared_file ...]
  experiment-worktree.sh cleanup <spec_name> <exp_index>
  experiment-worktree.sh cleanup-all <spec_name>
  experiment-worktree.sh count

Commands:
  create       Create an experiment worktree with copied shared files
  cleanup      Remove a single experiment worktree and its branch
  cleanup-all  Remove all experiment worktrees for a spec
  count        Count total active worktrees (for budget checking)

Worktrees: .worktrees/optimize-<spec>-exp-<NNN>/
Branches:  optimize-exp/<spec>/exp-<NNN>
EOF
      ;;
    *)
      echo -e "${RED}Unknown command: $command${NC}" >&2
      exit 1
      ;;
  esac
}

main "$@"
90
plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh
Executable file
90
plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh
Executable file
@@ -0,0 +1,90 @@
#!/bin/bash

# Measurement Runner
# Runs a measurement command, captures JSON output, and handles timeouts.
# The orchestrating agent (not this script) evaluates gates and handles
# stability repeats.
#
# Usage: measure.sh <command> <timeout_seconds> [working_directory] [KEY=VALUE ...]
#
# Arguments:
#   command           - Shell command to run (e.g., "python evaluate.py")
#   timeout_seconds   - Maximum seconds before killing the command
#   working_directory - Directory to run the command in (default: .)
#   KEY=VALUE         - Optional environment variables to set before running
#
# Output:
#   stdout: Raw JSON output from the measurement command
#   stderr: Passed through from the measurement command
#   exit code: Same as the measurement command (124 for timeout)
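#
# Example (command and port value are illustrative; EVAL_PORT follows the spec
# schema's port_env_var example):
#   measure.sh "python evaluate.py" 600 . EVAL_PORT=4801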

set -euo pipefail

# Parse arguments
COMMAND="${1:?Error: command argument required}"
TIMEOUT="${2:?Error: timeout_seconds argument required}"
shift 2

WORKDIR="."
if [[ $# -gt 0 ]] && [[ "$1" != *=* ]]; then
  WORKDIR="$1"
  shift
fi

# Set any KEY=VALUE environment variables
for arg in "$@"; do
  if [[ "$arg" == *=* ]]; then
    export "$arg"
  fi
done

# Change to working directory
cd "$WORKDIR" || {
  echo "Error: cannot cd to $WORKDIR" >&2
  exit 1
}

run_with_timeout() {
  if command -v timeout >/dev/null 2>&1; then
    timeout "$TIMEOUT" bash -c "$COMMAND"
    return
  fi

  if command -v gtimeout >/dev/null 2>&1; then
    gtimeout "$TIMEOUT" bash -c "$COMMAND"
    return
  fi

  if command -v python3 >/dev/null 2>&1; then
    python3 - "$TIMEOUT" "$COMMAND" <<'PY'
import os
import signal
import subprocess
import sys

timeout_seconds = int(sys.argv[1])
command = sys.argv[2]
proc = subprocess.Popen(["bash", "-c", command], start_new_session=True)

try:
    sys.exit(proc.wait(timeout=timeout_seconds))
except subprocess.TimeoutExpired:
    os.killpg(proc.pid, signal.SIGTERM)
    try:
        proc.wait(timeout=5)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
    sys.exit(124)
PY
    return
  fi

  echo "Error: no timeout implementation available (tried timeout, gtimeout, python3)" >&2
  exit 1
}

# Run the measurement command with timeout
# timeout returns 124 if the command times out
# We pass stdout and stderr through directly
run_with_timeout
127
plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh
Executable file
127
plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh
Executable file
@@ -0,0 +1,127 @@
#!/bin/bash

# Parallelism Probe
# Detects common parallelism blockers in the target project.
# Output is advisory -- the skill presents results to the user for approval.
#
# Usage: parallel-probe.sh <project_directory> [measurement_command] [measurement_workdir] [shared_file ...]
#
# Arguments:
#   project_directory   - Root directory of the project to probe
#   measurement_command - The measurement command from the spec (optional, for port detection)
#   measurement_workdir - Measurement working directory relative to project root (default: .)
#   shared_file         - Explicitly declared shared files that parallel runs depend on
#
# Output:
#   JSON to stdout with:
#     mode: "parallel" | "serial" | "user-decision"
#     blockers: [ { type, description, suggestion } ]
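#
# Example output (illustrative):
#   {
#     "mode": "user-decision",
#     "blockers": [
#       {
#         "type": "shared_file",
#         "description": "Found 1 SQLite database file(s)",
#         "suggestion": "Copy database files into each experiment worktree"
#       }
#     ],
#     "blocker_count": 1
#   }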

set -euo pipefail

PROJECT_DIR="${1:?Error: project_directory argument required}"
MEASUREMENT_CMD="${2:-}"
MEASUREMENT_WORKDIR="${3:-.}"

shift 3 2>/dev/null || shift $# 2>/dev/null || true
SHARED_FILES=()
if [[ $# -gt 0 ]]; then
  SHARED_FILES=("$@")
fi

cd "$PROJECT_DIR" || {
  echo '{"mode":"serial","blockers":[{"type":"error","description":"Cannot access project directory","suggestion":"Check path"}]}'
  exit 0
}

if ! command -v python3 >/dev/null 2>&1; then
  echo '{"mode":"serial","blockers":[{"type":"missing_dependency","description":"python3 is required for structured probe output","suggestion":"Install python3 or skip the probe and review parallel-readiness manually"}],"blocker_count":1}'
  exit 0
fi

BLOCKERS="[]"
SCAN_PATHS=()

add_blocker() {
  local type="$1"
  local desc="$2"
  local suggestion="$3"
  BLOCKERS=$(echo "$BLOCKERS" | python3 -c "
import json, sys
b = json.load(sys.stdin)
b.append({'type': '$type', 'description': '''$desc''', 'suggestion': '''$suggestion'''})
print(json.dumps(b))
" 2>/dev/null || echo "$BLOCKERS")
}

add_scan_path() {
  local candidate="$1"

  if [[ -z "$candidate" ]]; then
    return
  fi

  if [[ -e "$candidate" ]]; then
    SCAN_PATHS+=("$candidate")
  fi
}

add_scan_path "$MEASUREMENT_WORKDIR"

if [[ ${#SHARED_FILES[@]} -gt 0 ]]; then
  for shared_file in "${SHARED_FILES[@]}"; do
    add_scan_path "$shared_file"
  done
fi

if [[ ${#SCAN_PATHS[@]} -eq 0 ]]; then
  SCAN_PATHS=(".")
fi

# Check 1: Hardcoded ports in measurement command
if [[ -n "$MEASUREMENT_CMD" ]]; then
  # Look for common port patterns in the command itself
  # (POSIX ERE: use [[:space:]] and a plain group, since \s and (?:...) are not portable)
  if echo "$MEASUREMENT_CMD" | grep -qE '(--port([[:space:]]+|=)[0-9]+|:[[:space:]]*[0-9]{4,5}|PORT=[0-9]+|localhost:[0-9]+)'; then
    add_blocker "port" "Measurement command contains hardcoded port reference" "Parameterize port via environment variable (e.g., PORT=\$EVAL_PORT)"
  fi
fi

# Check 2: SQLite databases in the measurement workdir or declared shared files
SQLITE_FILES=$(find "${SCAN_PATHS[@]}" -maxdepth 4 -type f \( -name '*.db' -o -name '*.sqlite' -o -name '*.sqlite3' \) ! -path '*/.git/*' ! -path '*/node_modules/*' ! -path '*/.claude/*' ! -path '*/.context/*' ! -path '*/.worktrees/*' 2>/dev/null | head -10 || true)
if [[ -n "$SQLITE_FILES" ]]; then
  FILE_COUNT=$(echo "$SQLITE_FILES" | wc -l | tr -d ' ')
  add_blocker "shared_file" "Found $FILE_COUNT SQLite database file(s)" "Copy database files into each experiment worktree"
fi

# Check 3: Lock/PID files in the measurement workdir or declared shared files
LOCK_FILES=$(find "${SCAN_PATHS[@]}" -maxdepth 4 -type f \( -name '*.lock' -o -name '*.pid' \) ! -path '*/.git/*' ! -path '*/node_modules/*' ! -path '*/.claude/*' ! -path '*/.context/*' ! -path '*/.worktrees/*' ! -name 'package-lock.json' ! -name 'yarn.lock' ! -name 'bun.lock' ! -name 'bun.lockb' ! -name 'Gemfile.lock' ! -name 'poetry.lock' ! -name 'Cargo.lock' 2>/dev/null | head -10 || true)
if [[ -n "$LOCK_FILES" ]]; then
  FILE_COUNT=$(echo "$LOCK_FILES" | wc -l | tr -d ' ')
  add_blocker "lock_file" "Found $FILE_COUNT lock/PID file(s) that may cause contention" "Ensure measurement command cleans up lock files, or run in serial mode"
fi

# Check 4: Exclusive resource hints in the measurement command
if [[ -n "$MEASUREMENT_CMD" ]] && echo "$MEASUREMENT_CMD" | grep -qiE '(cuda|gpu|tensorflow|torch|nvidia-smi|CUDA_VISIBLE_DEVICES)'; then
  add_blocker "exclusive_resource" "Measurement command appears to use GPU or another exclusive accelerator" "GPU is typically an exclusive resource -- consider serial mode or device parameterization"
fi

# Determine mode
BLOCKER_COUNT=$(echo "$BLOCKERS" | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null || echo "0")

if [[ "$BLOCKER_COUNT" == "0" ]]; then
  MODE="parallel"
elif echo "$BLOCKERS" | python3 -c "import json,sys; b=json.load(sys.stdin); exit(0 if any(x['type']=='exclusive_resource' for x in b) else 1)" 2>/dev/null; then
  MODE="serial"
else
  MODE="user-decision"
fi

# Output JSON result
python3 -c "
import json
print(json.dumps({
    'mode': '$MODE',
    'blockers': $BLOCKERS,
    'blocker_count': $BLOCKER_COUNT
}, indent=2))
"