feat(ce-optimize): Auto-research loop for tuning system prompts / vector clustering / evaluating different code solutions / etc (#446)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,977 @@
|
||||
# Iterative Optimization Loop Skill — Requirements Brainstorm
|
||||
|
||||
## Problem Statement
|
||||
|
||||
CE has strong knowledge-compounding (learn from past work) and multi-agent review (quality gates), but no skill for **metric-driven iterative optimization** — the pattern where you define a measurable goal, build measurement scaffolding, then run an automated loop that tries many approaches, measures each, keeps improvements, and converges toward the best solution.
|
||||
|
||||
### Motivating Example
|
||||
|
||||
A project builds issue/PR clusters for a large open-source repo. Currently only ~20% of issues/PRs land in clusters with >1 item. The suspected achievable target is ~95%. Getting there requires testing many hypotheses:
|
||||
|
||||
- Extracting signal (unique user-entered text) from noise (PR/issue template boilerplate that makes all vectors too similar)
|
||||
- Using issue-to-PR links as a new clustering signal
|
||||
- Adjusting similarity thresholds
|
||||
- Trying different embedding models or chunking strategies
|
||||
- Combining multiple signals (text similarity + link graph + label overlap + author patterns)
|
||||
- Pre-filtering or normalizing template sections before embedding
|
||||
|
||||
No single hypothesis will get from 20% to 95%. It requires systematic experimentation — trying dozens or hundreds of variations, measuring each, and building on successes.
|
||||
|
||||
## Landscape Analysis
|
||||
|
||||
### Karpathy's AutoResearch (March 2026, 21k+ stars)
|
||||
|
||||
The simplest and most influential model. Core design:
|
||||
|
||||
- **One mutable file** (`train.py`) — the agent edits only this
|
||||
- **One immutable evaluator** (`prepare.py`) — the agent cannot touch measurement
|
||||
- **One instruction file** (`program.md`) — defines objectives, constraints, stopping criteria
|
||||
- **One metric** (`val_bpb`) — scalar, lower is better
|
||||
- **Linear keep/revert loop**: modify -> commit -> run -> measure -> if improved keep, else `git reset`
|
||||
- **History**: `results.tsv` accumulates all experiment results; git log preserves successful commits
|
||||
- **Result**: 700 experiments in 2 days, 20 discovered optimizations, ~12 experiments/hour
|
||||
|
||||
**Strengths**: Dead simple. Git-native history. Easy to understand and debug.
|
||||
**Weaknesses**: Linear — can't explore multiple directions simultaneously. Single scalar metric. No backtracking to earlier promising states.
|
||||
|
||||
### AIDE / WecoAI
|
||||
|
||||
- **Tree search** in solution space — each script is a node, LLM patches spawn children
|
||||
- Can backtrack to any previous node and explore alternatives
|
||||
- 4x more Kaggle medals than linear agents on MLE-Bench
|
||||
- More complex but better at escaping local optima
|
||||
|
||||
### Sakana AI Scientist v2
|
||||
|
||||
- **Agentic tree search** with parallel experiment execution
|
||||
- VLM feedback for analyzing figures
|
||||
- Full paper generation with automated peer review
|
||||
- Overkill for code optimization but shows the value of tree-structured exploration
|
||||
|
||||
### DSPy (Stanford)
|
||||
|
||||
- Automated prompt/weight optimization for LLM programs
|
||||
- Bayesian optimization (MIPROv2), iterative feedback (GEPA), coordinate ascent (COPRO)
|
||||
- Shows that different optimization strategies suit different problem shapes
|
||||
|
||||
### Existing Claude Code AutoResearch Forks
|
||||
|
||||
- `uditgoenka/autoresearch` — packages the pattern as a Claude Code skill
|
||||
- `autoexp` — generalized for any project with a quantifiable metric
|
||||
- Multiple teams report 50-80% improvements over 30-70 iterations overnight
|
||||
|
||||
## Key Design Decisions
|
||||
|
||||
### 1. Linear vs. Tree Search
|
||||
|
||||
| Approach | Pros | Cons |
|---|---|---|
| Linear (autoresearch) | Simple, easy to understand, git-native | Can't explore multiple directions, stuck in local optima |
| Tree search (AIDE) | Can backtrack, explore alternatives | More complex state management, harder to review |
| Hybrid: linear with manual branch points | Best of both — simple default, user chooses when to fork | Requires user interaction to fork |
|
||||
|
||||
**Recommendation**: Start with linear keep/revert (Karpathy model) as the default. Add optional "branch point" support where the user can snapshot the current best and start a new exploration direction. Each direction is its own branch. This keeps the core loop simple while allowing multi-direction exploration when needed.
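
To make the default concrete, here is a minimal sketch of the keep/revert loop. The `implement` and `run_eval` callables are hypothetical placeholders; the real skill layers gates, judging, and parallel batches (described below) on top of this core.

```python
# Minimal sketch of the linear keep/revert loop (hypothetical helpers).
import subprocess

def run(cmd: str) -> None:
    subprocess.run(cmd, shell=True, check=True)

def optimize(hypotheses, implement, run_eval, direction="maximize"):
    best = run_eval()  # baseline from the immutable measurement harness
    for hyp in hypotheses:
        implement(hyp)                        # edit only files in the mutable scope
        run(f'git commit -am "experiment: {hyp}"')
        score = run_eval()
        improved = score > best if direction == "maximize" else score < best
        if improved:
            best = score                      # keep: the commit stays on the branch
        else:
            run("git reset --hard HEAD~1")    # revert: drop the experiment commit
    return best
```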
|
||||
|
||||
### 2. What Gets Measured — The Three-Tier Metric Architecture
|
||||
|
||||
AutoResearch uses a single scalar metric (val_bpb). That works when you have an objective function with clear ground truth. Most real-world optimization problems don't have one — especially when the quality of the output requires human judgment.
|
||||
|
||||
**Key insight**: Hard scalar metrics are often the wrong optimization target. For clustering, "bigger clusters" isn't inherently better. "Fewer singletons" isn't inherently better. A solution with 35% singletons where every cluster is coherent beats a solution with 5% singletons where clusters are garbage. Hard metrics catch *degenerate* solutions; *quality* requires judgment.
|
||||
|
||||
**Three tiers**:
|
||||
|
||||
1. **Degenerate-case gates** (hard, cheap, fully automated):
|
||||
- Catch obviously broken solutions before expensive evaluation
|
||||
- Examples: "all items in 1 cluster" (degenerate merge), "all singletons" (degenerate split), "runtime > 10 minutes" (performance regression)
|
||||
- These are fast boolean checks: pass/fail. If any gate fails, the experiment is immediately reverted without running the expensive judge
|
||||
- Think of these as "sanity checks" not "optimization targets"
|
||||
|
||||
2. **LLM-as-judge quality score** (the actual optimization target):
|
||||
- For problems where quality requires judgment, this IS the primary metric
|
||||
- Cost-controlled via stratified sampling (not exhaustive)
|
||||
- Produces a scalar score the loop can optimize against
|
||||
- Can include multiple dimensions (coherence, granularity, completeness)
|
||||
- See detailed design below
|
||||
|
||||
3. **Diagnostics** (logged for understanding, not gated on):
|
||||
- Distribution stats, counts, histograms
|
||||
- Useful for understanding WHY a judge score changed
|
||||
- Examples: median cluster size, singleton %, largest cluster size, cluster count
|
||||
- Logged in the experiment record but never used for keep/revert decisions
|
||||
|
||||
**When to use which configuration**:
|
||||
|
||||
| Problem Type | Degenerate Gates | Primary Metric | Example |
|---|---|---|---|
| Objective function exists | Yes | Hard metric (scalar) | Build time, test pass rate, API latency |
| Quality requires judgment | Yes | LLM-as-judge score | Clustering quality, search relevance, content generation |
| Hybrid | Yes | Hard metric + LLM-judge as guard rail | Latency (optimize) + response quality (must not drop) |
|
||||
|
||||
**Recommendation**: Support all three tiers. The user declares whether the primary optimization target is a hard metric or an LLM-judge score. Degenerate gates always run first (cheap). Judge runs only on experiments that pass gates.
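
As a sketch of how the tier-1 gates might be checked against the harness output — gate entries shaped like the `degenerate_gates` examples later in this document (a `name` plus a `check` expression such as `"<= 0.10"`):

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq,
       "<": operator.lt, ">": operator.gt}

def gates_pass(metrics: dict, gates: list[dict]) -> tuple[bool, list[str]]:
    """Return (all_passed, failures) for checks like {'name': 'singleton_pct', 'check': '<= 0.80'}."""
    failures = []
    for gate in gates:
        op_str, threshold = gate["check"].split(None, 1)
        expected = (threshold.lower() == "true") if threshold.lower() in ("true", "false") else float(threshold)
        value = metrics[gate["name"]]
        if not OPS[op_str](value, expected):
            failures.append(f"{gate['name']}: {value} fails {gate['check']}")
    return (not failures), failures
```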
|
||||
|
||||
### 3. What the Agent Can Edit
|
||||
|
||||
AutoResearch constrains the agent to one file. This is elegant but too restrictive for most software projects.
|
||||
|
||||
**Recommendation**: Define an explicit allowlist of mutable files/directories and an explicit denylist (measurement harness, test fixtures, evaluation data). The agent operates within the allowlist. The measurement harness is immutable — the agent cannot game the metric by changing how it's measured.
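
A sketch of how the orchestrator might enforce that boundary when collecting an experiment's diff (paths and helper names are illustrative):

```python
import subprocess

def changed_files(worktree: str) -> list[str]:
    out = subprocess.run(["git", "diff", "--name-only", "HEAD"],
                         cwd=worktree, capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line]

def within_scope(files: list[str], mutable: list[str], immutable: list[str]) -> bool:
    for path in files:
        if any(path.startswith(p) for p in immutable):
            return False  # touched the measurement harness, fixtures, or data
        if not any(path.startswith(p) for p in mutable):
            return False  # outside the declared allowlist
    return True
```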
|
||||
|
||||
### 4. Measurement Scaffolding First
|
||||
|
||||
This is critical, and it is what distinguishes the skill from "just run the code in a loop":
|
||||
|
||||
1. **Define the measurement spec** before any optimization begins
|
||||
2. **Build and validate the measurement harness** — ensure it produces reliable, reproducible results
|
||||
3. **Establish baseline** — run the harness on the current code to get starting metrics
|
||||
4. Only then begin the optimization loop
|
||||
|
||||
**Recommendation**: Make this a hard phase gate. The skill refuses to enter the optimization loop until the measurement harness passes a validation check (runs successfully, produces expected metric types, baseline is recorded).
|
||||
|
||||
### 5. History and Memory
|
||||
|
||||
What gets remembered across iterations:
|
||||
|
||||
- **Results log**: Every experiment's metrics, hypothesis, and outcome (kept/reverted)
|
||||
- **Git history**: Successful experiments are commits; branches are preserved
|
||||
- **Hypothesis log**: What was tried, why, what was learned — prevents re-trying failed approaches
|
||||
- **Strategy evolution**: As the agent learns what works, it should adapt its exploration strategy
|
||||
|
||||
**Recommendation**: A structured experiment log (YAML or JSON) that captures: iteration number, hypothesis, changes made, metrics before/after, outcome (kept/reverted/error), and learnings. The agent reads this before proposing the next hypothesis. Git branches are preserved for all kept experiments.
|
||||
|
||||
### 6. How Long It Runs
|
||||
|
||||
- AutoResearch runs "indefinitely until manually stopped"
|
||||
- Real-world needs: time budgets, iteration budgets, metric targets, or "until no improvement for N iterations"
|
||||
|
||||
**Recommendation**: Support multiple stopping criteria — any one of them can trigger a stop (a combined check is sketched after this list):
|
||||
- Target metric reached
|
||||
- Max iterations
|
||||
- Max wall-clock time
|
||||
- No improvement for N consecutive iterations
|
||||
- Manual stop (user interrupts)
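
A sketch of how these criteria might combine, assuming the `stopping` block from the spec examples below plus the primary target passed in separately (field names are illustrative):

```python
import time

def should_stop(stopping: dict, target: float | None, state: dict) -> str | None:
    """state carries: iteration, started_at (epoch seconds), best, iterations_without_improvement."""
    if target is not None and state["best"] >= target:        # assumes a maximize direction
        return "target_reached"
    if state["iteration"] >= stopping["max_iterations"]:
        return "max_iterations"
    if (time.time() - state["started_at"]) / 3600 >= stopping["max_hours"]:
        return "max_hours"
    if state["iterations_without_improvement"] >= stopping["plateau_iterations"]:
        return "plateau"
    return None                          # manual stop: the user simply interrupts the loop
```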
|
||||
|
||||
### 7. Parallelism
|
||||
|
||||
AutoResearch is single-threaded. AIDE and AI Scientist run parallel experiments. For CE:
|
||||
|
||||
- **Phase 1 (v1)**: Single-threaded linear loop. Simple, debuggable, works with git worktrees.
|
||||
- **Phase 2 (future)**: Parallel experiments using multiple worktrees or Codex sandboxes. Each experiment is independent.
|
||||
|
||||
**Recommendation**: Start single-threaded. Design the experiment log and branching model to support parallelism later.
|
||||
|
||||
### 8. Integration with Existing CE Skills
|
||||
|
||||
The optimization loop should compose with existing CE capabilities:
|
||||
|
||||
- **`/ce:ideate`** or **`/ce:brainstorm`** to generate initial hypothesis space
|
||||
- **Learnings researcher** to check if similar optimization was done before
|
||||
- **`/ce:compound`** to capture the winning strategy as institutional knowledge after the loop completes
|
||||
- **`/ce:review`** optionally on the final winning diff before it's merged
|
||||
|
||||
## Proposed Skill: `/ce-optimize`
|
||||
|
||||
### Workflow Phases
|
||||
|
||||
```
Phase 0: Setup
|-- Read/create optimization spec (target metric, guard rails, mutable files, constraints)
|-- Search learnings for prior related optimization attempts
'-- Validate spec completeness

Phase 1: Measurement Scaffolding (HARD GATE - user must approve before Phase 2)
|-- If user provides harness:
|   |-- Review docs (or document usage if undocumented)
|   |-- Run harness once against current implementation
|   '-- Confirm baseline measurement is accurate with user
|-- If agent builds harness:
|   |-- Build measurement harness (immutable evaluator)
|   |-- Run validation: harness executes, produces expected metric types
|   '-- Establish baseline metrics
|-- Parallelism readiness probe:
|   |-- Check for hardcoded ports -> parameterize via env var
|   |-- Check for shared DB files (SQLite, etc.) -> plan copy strategy
|   |-- Check for shared external services -> warn user
|   |-- Check for exclusive resource needs (GPU, etc.)
|   '-- Produce parallel_readiness assessment
|-- Stability validation (if mode: repeat):
|   |-- Run harness repeat_count times
|   |-- Verify variance is within noise_threshold
|   '-- Confirm aggregation method produces stable baseline
'-- GATE: Present baseline + parallel readiness to user. Refuse to proceed until approved.

Phase 2: Hypothesis Generation + Dependency Approval
|-- Analyze the problem space (read code, understand current approach)
|-- Generate initial hypothesis list (agent + optionally /ce:ideate)
|-- Prioritize by expected impact and feasibility
|-- Identify new dependencies across ALL planned hypotheses
|-- Present dependency list for bulk approval
'-- Record hypothesis backlog (with dep approval status per hypothesis)

Phase 3: Optimization Loop (repeats in parallel batches)
|-- Select batch of hypotheses (batch_size = min(backlog, max_concurrent))
|   '-- Prefer diversity: mix different hypothesis categories per batch
|-- For each experiment in batch (PARALLEL by default):
|   |-- Create worktree or Codex sandbox
|   |-- Copy shared resources (DB files, data files)
|   |-- Apply parameterization (ports, env vars)
|   |-- Implement hypothesis (within mutable scope)
|   |-- Run measurement harness (respecting stability config)
|   '-- Collect metrics + diff
|-- Wait for batch completion
|-- Evaluate results:
|   |-- Rank by primary metric improvement
|   |-- Filter by guard rails (reject any that violate)
|   |-- If best > current: KEEP (merge to optimization branch)
|   |-- If best has unapproved dep: mark deferred_needs_approval
|   '-- All others: REVERT (log results, clean up worktrees)
|-- Handle unapproved deps:
|   '-- Set aside, don't block pipeline, batch-ask at end or check-in
|-- Update experiment log with ALL results (kept + reverted)
|-- Re-baseline: remaining hypotheses evaluated against new best
|-- Generate new hypotheses based on learnings from this batch
|-- Check stopping criteria
'-- Next batch

Phase 4: Wrap-Up
|-- Present deferred hypotheses needing dep approval (if any)
|-- Summarize results: baseline -> final metrics, total iterations, kept improvements
|-- Preserve ALL experiment branches for reference
|-- Optionally run /ce:review on cumulative diff
|-- Optionally run /ce:compound to capture winning strategy as learning
'-- Report to user
```
|
||||
|
||||
### Optimization Spec File Format
|
||||
|
||||
See "Updated Spec File Format" in the Resolved Design Decisions section below for the full spec with parallel execution and stability config.
|
||||
|
||||
### Experiment Log Format
|
||||
|
||||
```yaml
|
||||
# .context/compound-engineering/optimize/experiment-log.yaml
|
||||
spec: "improve-issue-clustering"
|
||||
|
||||
baseline:
|
||||
timestamp: "2026-03-29T10:00:00Z"
|
||||
gates:
|
||||
largest_cluster_pct: 0.02
|
||||
singleton_pct: 0.79
|
||||
cluster_count: 342
|
||||
runtime_seconds: 45
|
||||
diagnostics:
|
||||
singleton_pct: 0.79
|
||||
median_cluster_size: 2
|
||||
cluster_count: 342
|
||||
avg_cluster_size: 2.8
|
||||
p95_cluster_size: 7
|
||||
judge:
|
||||
mean_score: 3.1
|
||||
pct_scoring_4plus: 0.33
|
||||
mean_distinct_topics: 1.8
|
||||
singleton_false_negative_pct: 0.45 # 45% of sampled singletons should be clustered
|
||||
sample_seed: 42
|
||||
judge_cost_usd: 0.42
|
||||
|
||||
experiments:
|
||||
- iteration: 1
|
||||
batch: 1
|
||||
hypothesis: "Remove PR template boilerplate before embedding to reduce noise"
|
||||
category: "signal-extraction"
|
||||
changes:
|
||||
- file: "src/preprocessing/text_cleaner.py"
|
||||
summary: "Added template detection and removal using common PR template patterns"
|
||||
gates:
|
||||
largest_cluster_pct: 0.03
|
||||
singleton_pct: 0.62
|
||||
cluster_count: 489
|
||||
runtime_seconds: 48
|
||||
gates_passed: true
|
||||
diagnostics:
|
||||
singleton_pct: 0.62
|
||||
median_cluster_size: 3
|
||||
cluster_count: 489
|
||||
avg_cluster_size: 3.4
|
||||
judge:
|
||||
mean_score: 3.8
|
||||
pct_scoring_4plus: 0.57
|
||||
mean_distinct_topics: 1.4
|
||||
singleton_false_negative_pct: 0.31
|
||||
judge_cost_usd: 0.38
|
||||
outcome: "kept"
|
||||
primary_delta: "+0.7" # mean_score: 3.1 -> 3.8
|
||||
learnings: "Template removal significantly improved coherence. Clusters now group by actual issue content rather than shared boilerplate. Singleton rate dropped 17pp."
|
||||
commit: "abc123"
|
||||
|
||||
- iteration: 2
|
||||
batch: 1 # same batch as iteration 1 (ran in parallel)
|
||||
hypothesis: "Lower similarity threshold from 0.85 to 0.75"
|
||||
category: "clustering-algorithm"
|
||||
changes:
|
||||
- file: "config/clustering.yaml"
|
||||
summary: "Changed similarity_threshold from 0.85 to 0.75"
|
||||
gates:
|
||||
largest_cluster_pct: 0.08
|
||||
singleton_pct: 0.35
|
||||
cluster_count: 210
|
||||
runtime_seconds: 47
|
||||
gates_passed: true
|
||||
diagnostics:
|
||||
singleton_pct: 0.35
|
||||
median_cluster_size: 5
|
||||
cluster_count: 210
|
||||
judge:
|
||||
mean_score: 2.4
|
||||
pct_scoring_4plus: 0.13
|
||||
mean_distinct_topics: 3.1 # clusters covering too many unrelated topics
|
||||
singleton_false_negative_pct: 0.12
|
||||
judge_cost_usd: 0.41
|
||||
outcome: "reverted"
|
||||
primary_delta: "-0.7" # mean_score: 3.1 -> 2.4
|
||||
learnings: "Lower threshold pulled in more items but destroyed coherence. Clusters became grab-bags. The hard metrics looked good (fewer singletons!) but judge correctly identified the quality drop. Validates that singleton_pct alone is a misleading optimization target."
|
||||
|
||||
- iteration: 3
|
||||
batch: 2 # new batch, runs on top of iteration 1's changes
|
||||
hypothesis: "Use issue-to-PR link graph as additional clustering signal"
|
||||
category: "graph-signals"
|
||||
changes:
|
||||
- file: "src/clustering/signals.py"
|
||||
summary: "Added link-graph signal extraction from issue-PR references"
|
||||
- file: "src/clustering/merger.py"
|
||||
summary: "Combined text similarity with link-graph signal using weighted average"
|
||||
gates:
|
||||
largest_cluster_pct: 0.04
|
||||
singleton_pct: 0.48
|
||||
cluster_count: 520
|
||||
runtime_seconds: 52
|
||||
gates_passed: true
|
||||
diagnostics:
|
||||
singleton_pct: 0.48
|
||||
median_cluster_size: 3
|
||||
cluster_count: 520
|
||||
judge:
|
||||
mean_score: 4.1
|
||||
pct_scoring_4plus: 0.70
|
||||
mean_distinct_topics: 1.2
|
||||
singleton_false_negative_pct: 0.22
|
||||
judge_cost_usd: 0.39
|
||||
outcome: "kept"
|
||||
primary_delta: "+0.3" # mean_score: 3.8 -> 4.1 (from iteration 1 baseline)
|
||||
learnings: "Link graph is a strong complementary signal. Issues referencing the same PR are almost always related. Judge scores jumped — 70% of clusters now score 4+. Singleton false negatives dropped further."
|
||||
commit: "def456"
|
||||
|
||||
- iteration: 4
|
||||
batch: 2
|
||||
hypothesis: "Add scikit-learn HDBSCAN for hierarchical density clustering"
|
||||
category: "clustering-algorithm"
|
||||
changes: []
|
||||
gates_passed: false # not evaluated — deferred
|
||||
outcome: "deferred_needs_approval"
|
||||
deferred_reason: "Requires unapproved dependency: scikit-learn"
|
||||
learnings: "Set aside for batch approval at end of loop."
|
||||
|
||||
best:
|
||||
iteration: 3
|
||||
judge:
|
||||
mean_score: 4.1
|
||||
pct_scoring_4plus: 0.70
|
||||
total_judge_cost_usd: 1.60 # running total across all experiments
|
||||
```
|
||||
|
||||
## Hypothesis Generation Strategies
|
||||
|
||||
For the clustering example, here's the kind of hypothesis space the agent should explore:
|
||||
|
||||
### Signal Extraction
|
||||
- Remove PR/issue template boilerplate before embedding
|
||||
- Extract only user-authored text (strip auto-generated sections)
|
||||
- Weight title more heavily than body
|
||||
- Use code snippets / file paths mentioned as signals
|
||||
- Extract error messages and stack traces as high-signal features
|
||||
|
||||
### Graph-Based Signals
|
||||
- Issue-to-PR links (issues referencing same PR are related)
|
||||
- Cross-references between issues (`#123` mentions)
|
||||
- Author patterns (same author filing similar issues)
|
||||
- Label co-occurrence
|
||||
- Milestone/project board grouping
|
||||
|
||||
### Embedding & Similarity
|
||||
- Try different embedding models (different size/quality tradeoffs)
|
||||
- Chunk long issues before embedding vs. truncate vs. summarize
|
||||
- Weighted combination of multiple similarity signals
|
||||
- Asymmetric similarity (issue-to-PR vs. issue-to-issue)
|
||||
|
||||
### Clustering Algorithm
|
||||
- Adjust similarity thresholds (per-signal or combined)
|
||||
- Try hierarchical clustering vs. graph-based community detection
|
||||
- Two-pass: coarse clusters then split/merge refinement
|
||||
- Minimum cluster size constraints
|
||||
- Handle outlier issues that genuinely don't cluster
|
||||
|
||||
### Pre-processing
|
||||
- Normalize markdown formatting
|
||||
- Deduplicate near-identical issues before clustering
|
||||
- Language detection and translation for multilingual repos
|
||||
- Time-decay weighting (recent issues weighted more)
|
||||
|
||||
## Resolved Design Decisions
|
||||
|
||||
### D1: Measurement Harness Ownership -> DECIDED: Agent builds, user validates
|
||||
|
||||
The agent builds the measurement harness in Phase 1 and evaluates it against the current implementation. If the user provides an existing harness, the agent documents how to use it (or reviews existing docs), runs it once, and confirms the baseline measurement is accurate. Either way, the user reviews and approves before the loop starts. This is a hard gate.
|
||||
|
||||
### D2: Flaky Metrics -> DECIDED: User-configurable, default stable
|
||||
|
||||
The spec supports a `stability` block:
|
||||
|
||||
```yaml
measurement:
  command: "python evaluate.py"
  stability:
    mode: "stable"            # default: run once, trust the result
    # mode: "repeat"          # run N times, aggregate
    # repeat_count: 5         # how many runs
    # aggregation: "median"   # median | mean | min | max | custom
    # noise_threshold: 0.02   # improvement must exceed this to count
```
|
||||
|
||||
When `mode: repeat`, the harness runs `repeat_count` times. The `aggregation` function reduces results to a single value per metric. The `noise_threshold` prevents accepting improvements within the noise floor. Default is `stable` — run once, trust it.
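
A sketch of how that configuration might be applied, assuming the harness returns one scalar per run and that higher is better:

```python
import statistics

AGGREGATORS = {"median": statistics.median, "mean": statistics.fmean,
               "min": min, "max": max}

def measure(run_once, stability: dict) -> float:
    if stability.get("mode", "stable") == "stable":
        return run_once()
    runs = [run_once() for _ in range(stability["repeat_count"])]
    return AGGREGATORS[stability["aggregation"]](runs)

def is_improvement(new: float, best: float, stability: dict) -> bool:
    # Improvements inside the noise floor are not accepted.
    return (new - best) > stability.get("noise_threshold", 0.0)
```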
|
||||
|
||||
### D3: New Dependencies -> DECIDED: Pre-approve expected, defer surprises
|
||||
|
||||
During Phase 2 (Hypothesis Generation), the agent outlines expected new dependencies across all planned variations and gets bulk approval up front. If an experiment during the loop discovers it needs an unapproved dependency, the agent:
|
||||
1. Sets that hypothesis aside (marks it `deferred_needs_approval` in the experiment log)
|
||||
2. Continues with other hypotheses that don't need new deps
|
||||
3. At the end of the loop (or at a user check-in), presents the deferred hypotheses and their dep requirements for batch approval
|
||||
4. If approved, those hypotheses enter the next iteration batch
|
||||
|
||||
This prevents blocking the pipeline on interactive approval during long unattended runs.
|
||||
|
||||
### D4: LLM-as-Judge -> DECIDED: Include in v1 (cost-controlled via sampling)
|
||||
|
||||
LLM-as-judge is essential for problems where quality requires judgment — it's often the *actual* optimization target, not a nice-to-have. Hard metrics catch degenerate cases but can't tell you whether clusters are coherent or search results are relevant.
|
||||
|
||||
**Cost control via stratified sampling**:
|
||||
- Don't judge every output item — sample a representative set
|
||||
- Stratified sampling ensures coverage of edge cases (small clusters, large clusters, singletons)
|
||||
- Default: ~30 samples per evaluation (configurable)
|
||||
- At ~$0.01-0.03 per judgment call, 30 samples = ~$0.30-0.90 per experiment
|
||||
- Over 100 experiments = $30-90 total — manageable
|
||||
|
||||
**Sampling strategy**:
|
||||
```yaml
judge:
  sample_size: 30
  stratification:
    - bucket: "small"    # 2-3 items
      count: 10
    - bucket: "medium"   # 4-10 items
      count: 10
    - bucket: "large"    # 11+ items
      count: 10
  # For singletons: sample 10 and ask "should any of these be in a cluster?"
  singleton_sample: 10
```
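
A sketch of stratified sampling with a fixed seed, assuming clusters are simple lists of item ids and the bucket boundaries above:

```python
import random

def bucket(cluster: list) -> str:
    n = len(cluster)
    return "small" if n <= 3 else "medium" if n <= 10 else "large"

def sample_outputs(clusters, singletons, config, seed=42):
    rng = random.Random(seed)  # fixed seed keeps samples comparable across experiments
    by_bucket = {"small": [], "medium": [], "large": []}
    for c in clusters:
        if len(c) >= 2:
            by_bucket[bucket(c)].append(c)
    picked = []
    for stratum in config["stratification"]:
        pool = by_bucket[stratum["bucket"]]
        picked += rng.sample(pool, min(stratum["count"], len(pool)))
    picked_singletons = rng.sample(singletons, min(config["singleton_sample"], len(singletons)))
    return picked, picked_singletons
```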
|
||||
|
||||
**Rubric-based scoring** (user-defined, per problem):
|
||||
```yaml
judge:
  rubric: |
    Rate this cluster 1-5:
    - 5: All items clearly about the same issue/feature
    - 4: Strong theme, minor outliers
    - 3: Related but covers 2-3 sub-topics
    - 2: Weak connection
    - 1: Unrelated items grouped together

    Also answer:
    - How many distinct sub-topics does this cluster represent?
    - Should any items be removed from this cluster?

  scoring:
    primary: "mean_score"            # mean of 1-5 ratings
    secondary: "pct_scoring_4plus"   # % of samples scoring 4 or 5
    output_format: "json"            # {"score": 4, "distinct_topics": 1, "remove_items": []}
```
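
A sketch of aggregating the per-sample judge JSON into the primary and secondary scores named above (field names are assumed from the `output_format` comment and the singleton evaluation described below):

```python
import statistics

def aggregate_judgments(cluster_judgments: list[dict], singleton_judgments: list[dict]) -> dict:
    scores = [j["score"] for j in cluster_judgments]
    return {
        "mean_score": round(statistics.fmean(scores), 2),
        "pct_scoring_4plus": round(sum(s >= 4 for s in scores) / len(scores), 2),
        "mean_distinct_topics": round(
            statistics.fmean(j["distinct_topics"] for j in cluster_judgments), 2),
        "singleton_false_negative_pct": round(
            sum(j["should_cluster"] for j in singleton_judgments) / max(len(singleton_judgments), 1), 2),
    }
```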
|
||||
|
||||
**Judge execution order**:
|
||||
1. Run degenerate-case gates (fast, free) -- reject obviously broken solutions
|
||||
2. Run hard metrics (fast, free) -- collect diagnostics
|
||||
3. Only if gates pass: run LLM-as-judge on sampled outputs (slow, costs money)
|
||||
4. Keep/revert decision uses judge score as primary metric
|
||||
|
||||
**Judge consistency**:
|
||||
- Use the same sample indices across experiments when possible (same random seed)
|
||||
- This reduces noise from sample variance — you're comparing the same clusters across runs
|
||||
- When the output structure changes (different number of clusters), re-sample but log the seed change
|
||||
|
||||
**Judge model selection**:
|
||||
- Default: Haiku (fast, cheap, good enough for rubric-based scoring)
|
||||
- Option: Sonnet for nuanced judgment (2-3x cost)
|
||||
- The judge prompt is part of the immutable measurement harness — the agent cannot modify it
|
||||
|
||||
**Singleton evaluation** (the non-obvious case):
|
||||
- Low singleton % isn't automatically good. High singleton % isn't automatically bad.
|
||||
- Sample singletons and ask the judge: "Given these other clusters, should this item be in one of them? Which one? Or is it genuinely unique?"
|
||||
- This catches false-negative clustering (items that should cluster but don't) AND validates true singletons
|
||||
|
||||
### D5: Codex Support -> DECIDED: Include from v1
|
||||
|
||||
Based on patterns from PRs #364/#365 in the compound-engineering plugin:
|
||||
|
||||
**Dispatch pattern**: Write experiment prompt to a temp file, pipe to `codex exec` via stdin:
|
||||
```bash
cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
```
|
||||
|
||||
**Security posture**: User selects once per session (same as ce-work-beta):
|
||||
- Workspace write (`--full-auto`)
|
||||
- Full access (`--dangerously-bypass-approvals-and-sandbox`)
|
||||
|
||||
**Result collection**: Inspect working directory diff after `codex exec` completes. No structured result format — Codex writes files, orchestrator reads the diff and runs the measurement harness.
|
||||
|
||||
**Guard rails**:
|
||||
- Check for `CODEX_SANDBOX` / `CODEX_SESSION_ID` env vars to prevent recursive delegation
|
||||
- 3 consecutive delegate failures auto-disable Codex for remaining experiments
|
||||
- Orchestrator retains control of git operations, measurement, and keep/revert decisions
|
||||
|
||||
### D6: Parallel Execution -> DECIDED: Parallel by default
|
||||
|
||||
Experiments run in parallel by default. The user can specify serial execution if the system under test requires it. The skill actively probes for parallelism blockers.
|
||||
|
||||
See full parallel execution design below.
|
||||
|
||||
---
|
||||
|
||||
## Parallel Execution Design
|
||||
|
||||
### Default: Parallel Experiments
|
||||
|
||||
The optimization loop dispatches multiple experiments simultaneously unless the user explicitly requests serial execution. This is the primary throughput lever — running 4-8 experiments in parallel vs. 1 at a time means 4-8x more iterations per hour.
|
||||
|
||||
### Isolation Strategy
|
||||
|
||||
Each parallel experiment needs full filesystem isolation. Two mechanisms, selectable per session:
|
||||
|
||||
**Local worktrees** (default):
|
||||
```
.claude/worktrees/optimize-exp-001/   # full repo copy
.claude/worktrees/optimize-exp-002/
.claude/worktrees/optimize-exp-003/
```
|
||||
- Created via `git worktree add` with a unique branch per experiment
|
||||
- Each worktree gets its own copy of shared resources (see below)
|
||||
- Cleaned up after measurement: kept experiments merge to the optimization branch, reverted experiments have their worktree removed (setup sketched below)
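
A sketch of per-experiment worktree setup (helper names, the `.optimize-env` file, and the port scheme are illustrative):

```python
import shutil, subprocess
from pathlib import Path

def run(args) -> None:
    subprocess.run(args, check=True)

def create_experiment_worktree(spec_name: str, exp_index: int,
                               shared_files: list[str], base_port: int = 9000) -> Path:
    branch = f"optimize/{spec_name}/exp-{exp_index:03d}"
    path = Path(f".claude/worktrees/optimize-exp-{exp_index:03d}")
    run(["git", "worktree", "add", "-b", branch, str(path)])
    for rel in shared_files:                        # e.g. ["data/clusters.db"]
        (path / rel).parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(rel, path / rel)               # each experiment works on its own copy
    env_port = base_port + exp_index                # parameterized port for the harness
    (path / ".optimize-env").write_text(f"EVAL_PORT={env_port}\n")
    return path
```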
|
||||
|
||||
**Codex sandboxes** (opt-in):
|
||||
- Each experiment dispatched as an independent `codex exec` invocation
|
||||
- Codex provides built-in filesystem isolation
|
||||
- Orchestrator collects diffs after completion
|
||||
- Best for maximizing parallelism (no local resource limits)
|
||||
|
||||
**Hybrid** (future):
|
||||
- Use Codex for implementation, local worktree for measurement
|
||||
- Useful when measurement requires local resources (GPU, specific hardware, large datasets)
|
||||
|
||||
### Parallelism Blocker Detection (Phase 1)
|
||||
|
||||
During Phase 1 (Measurement Scaffolding), the skill actively probes for common parallelism blockers:
|
||||
|
||||
**Port conflicts**:
|
||||
- Run the measurement harness and check if it binds to fixed ports
|
||||
- Search config and code for hardcoded port numbers
|
||||
- If found: parameterize via environment variable (e.g., `PORT=0` for random, or `BASE_PORT + experiment_index`)
|
||||
- Add to spec: `parallel.port_strategy: "parameterized"` with the env var name
|
||||
|
||||
**Shared database files**:
|
||||
- Check for SQLite databases, local file-based stores
|
||||
- If found: each experiment gets a copy of the database in its worktree
|
||||
- Cleanup: remove copies after measurement
|
||||
- Add to spec: `parallel.shared_files: ["data/clusters.db"]` with copy strategy
|
||||
|
||||
**Shared external services**:
|
||||
- Check if the system writes to a shared external database, API, or queue
|
||||
- If found: warn user, suggest serial mode or test database isolation
|
||||
- This is a hard blocker for parallel unless the user confirms isolation
|
||||
|
||||
**Resource contention**:
|
||||
- Check for GPU usage, large memory requirements
|
||||
- If the system needs exclusive access to a resource, serial mode is required
|
||||
- Add to spec: `parallel.exclusive_resources: ["gpu"]`
|
||||
|
||||
**Detection output**: Phase 1 produces a `parallel_readiness` assessment:
|
||||
```yaml
parallel:
  mode: "parallel"         # parallel | serial | user-decision
  max_concurrent: 4        # default, adjustable
  blockers_found: []       # or list of issues
  mitigations_applied:
    - type: "port_parameterization"
      env_var: "EVAL_PORT"
      strategy: "base_port_plus_index"
      base: 9000
    - type: "database_copy"
      source: "data/clusters.db"
      strategy: "copy_per_worktree"
  blockers_unresolved: []  # these force serial unless user resolves
```
|
||||
|
||||
### Parallel Loop Mechanics
|
||||
|
||||
```
|
||||
Orchestrator (main branch)
|
||||
|
|
||||
|-- Batch N experiments from hypothesis backlog
|
||||
| (batch_size = min(backlog_size, max_concurrent))
|
||||
|
|
||||
|-- For each experiment in batch (parallel):
|
||||
| |-- Create worktree / Codex sandbox
|
||||
| |-- Copy shared resources (DB files, etc.)
|
||||
| |-- Apply parameterization (ports, env vars)
|
||||
| |-- Implement hypothesis (agent edits mutable files)
|
||||
| |-- Run measurement harness
|
||||
| |-- Collect metrics + diff
|
||||
| |-- Clean up shared resource copies
|
||||
|
|
||||
|-- Wait for all experiments in batch to complete
|
||||
|
|
||||
|-- Evaluate results:
|
||||
| |-- Rank by primary metric improvement
|
||||
| |-- Filter by guard rails
|
||||
| |-- Select best experiment that passes all guards
|
||||
| |-- If best > current best: KEEP (merge to optimization branch)
|
||||
| |-- All others: REVERT (remove worktrees, log results)
|
||||
| |-- If none improve: log all results, advance to next batch
|
||||
|
|
||||
|-- Update experiment log with all results (kept + reverted)
|
||||
|-- Update hypothesis backlog based on learnings from ALL experiments
|
||||
|-- Check stopping criteria
|
||||
|-- Next batch
|
||||
```
|
||||
|
||||
### Parallel-Aware Keep/Revert
|
||||
|
||||
With parallel experiments, multiple experiments might improve the metric but conflict with each other (they modify the same files in incompatible ways). Resolution strategy:
|
||||
|
||||
1. **Non-overlapping changes**: If the best experiment's changes don't overlap with the second-best, consider keeping both (merge sequentially, re-measure after merge to confirm); the file-level overlap check is sketched after this list
|
||||
2. **Overlapping changes**: Keep only the best. Log the second-best as "promising but conflicts with experiment N" for potential future retry on top of the new baseline
|
||||
3. **Re-baseline**: After keeping any experiment, the reverted experiments in the batch are re-assessed against the new baseline (without re-running them) — their hypotheses go back into the backlog for potential retry
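
A sketch of the file-level overlap check, assuming each experiment's changed files were captured when its diff was collected (the `.files` attribute is hypothetical):

```python
def files_overlap(kept_files: set[str], runner_up_files: set[str]) -> bool:
    """Same file modified by both experiments counts as overlapping, even on different lines."""
    return bool(kept_files & runner_up_files)

def select_mergeable_runner_ups(kept, runner_ups, max_merges: int = 1) -> list:
    """runner_ups: improving experiments ranked by metric, each with a .files set."""
    mergeable = []
    for exp in runner_ups:
        if len(mergeable) >= max_merges:
            break
        if not files_overlap(kept.files, exp.files):
            mergeable.append(exp)  # still must be re-measured after the cherry-pick
    return mergeable
```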
|
||||
|
||||
### Experiment Prompt Template (for Codex dispatch)
|
||||
|
||||
```markdown
|
||||
# Optimization Experiment #{iteration}
|
||||
|
||||
## Context
|
||||
You are running experiment #{iteration} for optimization target: {spec.name}
|
||||
Current best metrics: {current_best_metrics}
|
||||
Baseline metrics: {baseline_metrics}
|
||||
|
||||
## Your Hypothesis
|
||||
{hypothesis.description}
|
||||
|
||||
## What To Change
|
||||
Modify ONLY files in the mutable scope:
|
||||
{spec.scope.mutable}
|
||||
|
||||
DO NOT modify:
|
||||
{spec.scope.immutable}
|
||||
|
||||
## Constraints
|
||||
{spec.constraints}
|
||||
{approved_dependencies}
|
||||
|
||||
## Previous Experiments (for context)
|
||||
{recent_experiment_summaries}
|
||||
|
||||
## Instructions
|
||||
1. Implement the hypothesis
|
||||
2. Do NOT run the measurement harness (orchestrator handles this)
|
||||
3. Do NOT commit (orchestrator handles this)
|
||||
4. Run `git diff --stat` when done so the orchestrator can see your changes
|
||||
```
|
||||
|
||||
### Concurrency Limits
|
||||
|
||||
```yaml
parallel:
  max_concurrent: 4              # default for local worktrees
  # max_concurrent: 8            # default for Codex (no local resource limits)
  codex_rate_limit: 10           # max Codex invocations per minute
  worktree_cleanup: "immediate"  # or "batch" (clean up after full batch)
```
|
||||
|
||||
---
|
||||
|
||||
## Updated Spec File Format
|
||||
|
||||
### Example A: Hard-Metric Primary (build performance, test pass rate)
|
||||
|
||||
```yaml
# .context/compound-engineering/optimize/spec.yaml
name: "reduce-build-time"
description: "Reduce CI build time while maintaining test pass rate"

metric:
  primary:
    type: "hard"           # hard | judge
    name: "build_time_seconds"
    direction: "minimize"
    baseline: null         # filled by Phase 1
    target: 60             # optional target to stop at

degenerate_gates:          # fast boolean checks, run first
  - name: "test_pass_rate"
    check: ">= 1.0"        # all tests must pass
  - name: "build_exits_zero"
    check: "== true"

diagnostics:
  - name: "cache_hit_rate"
  - name: "slowest_step"
  - name: "total_test_count"

measurement:
  command: "python evaluate.py"
  timeout_seconds: 600
  output_format: "json"
  stability:
    mode: "stable"
```
|
||||
|
||||
### Example B: LLM-Judge Primary (clustering quality, search relevance)
|
||||
|
||||
```yaml
|
||||
# .context/compound-engineering/optimize/spec.yaml
|
||||
name: "improve-issue-clustering"
|
||||
description: "Improve coherence and coverage of issue/PR clusters"
|
||||
|
||||
metric:
|
||||
primary:
|
||||
type: "judge"
|
||||
name: "cluster_coherence"
|
||||
direction: "maximize"
|
||||
baseline: null
|
||||
target: 4.2 # mean judge score (1-5 scale)
|
||||
|
||||
degenerate_gates: # cheap checks that reject obviously broken solutions
|
||||
- name: "largest_cluster_pct"
|
||||
description: "% of all items in the single largest cluster"
|
||||
check: "<= 0.10" # if >10% of items are in one cluster, it's degenerate
|
||||
- name: "singleton_pct"
|
||||
description: "% of items that are singletons"
|
||||
check: "<= 0.80" # if >80% singletons, clustering isn't working at all
|
||||
- name: "cluster_count"
|
||||
check: ">= 10" # fewer than 10 clusters for 18k items is degenerate
|
||||
- name: "runtime_seconds"
|
||||
check: "<= 600"
|
||||
|
||||
diagnostics: # logged for understanding, never gated on
|
||||
- name: "singleton_pct" # note: same metric can be diagnostic AND gate
|
||||
- name: "median_cluster_size"
|
||||
- name: "cluster_count"
|
||||
- name: "avg_cluster_size"
|
||||
- name: "p95_cluster_size"
|
||||
|
||||
judge:
|
||||
model: "haiku" # haiku (cheap) | sonnet (nuanced)
|
||||
sample_size: 30
|
||||
stratification:
|
||||
- bucket: "small" # 2-3 items per cluster
|
||||
count: 10
|
||||
- bucket: "medium" # 4-10 items
|
||||
count: 10
|
||||
- bucket: "large" # 11+ items
|
||||
count: 10
|
||||
singleton_sample: 10 # also sample singletons to check false negatives
|
||||
sample_seed: 42 # fixed seed for cross-experiment consistency
|
||||
rubric: |
|
||||
Rate this cluster 1-5:
|
||||
- 5: All items clearly about the same issue/feature
|
||||
- 4: Strong theme, minor outliers
|
||||
- 3: Related but covers 2-3 sub-topics
|
||||
- 2: Weak connection
|
||||
- 1: Unrelated items grouped together
|
||||
|
||||
Also answer in JSON:
|
||||
- "score": your 1-5 rating
|
||||
- "distinct_topics": how many distinct sub-topics this cluster represents
|
||||
- "outlier_count": how many items don't belong
|
||||
singleton_rubric: |
|
||||
This item is currently a singleton (not in any cluster).
|
||||
Given the cluster titles listed below, should this item be in one of them?
|
||||
|
||||
Answer in JSON:
|
||||
- "should_cluster": true/false
|
||||
- "best_cluster_id": cluster ID it belongs in (or null)
|
||||
- "confidence": 1-5 how confident you are
|
||||
scoring:
|
||||
primary: "mean_score" # what the loop optimizes
|
||||
secondary:
|
||||
- "pct_scoring_4plus" # % of samples scoring 4+
|
||||
- "mean_distinct_topics" # lower is better (tighter clusters)
|
||||
- "singleton_false_negative_pct" # % of sampled singletons that should be clustered
|
||||
|
||||
measurement:
|
||||
command: "python evaluate.py" # outputs JSON with gate + diagnostic metrics
|
||||
timeout_seconds: 600
|
||||
output_format: "json"
|
||||
stability:
|
||||
mode: "stable"
|
||||
|
||||
scope:
|
||||
mutable:
|
||||
- "src/clustering/"
|
||||
- "src/preprocessing/"
|
||||
- "config/clustering.yaml"
|
||||
immutable:
|
||||
- "evaluate.py"
|
||||
- "tests/fixtures/"
|
||||
- "data/"
|
||||
|
||||
execution:
|
||||
mode: "parallel"
|
||||
backend: "worktree"
|
||||
max_concurrent: 4
|
||||
codex_security: null
|
||||
|
||||
parallel:
|
||||
port_strategy: null
|
||||
shared_files: ["data/clusters.db"]
|
||||
exclusive_resources: []
|
||||
|
||||
dependencies:
|
||||
approved: []
|
||||
|
||||
constraints:
|
||||
- "Do not change the output format of clusters"
|
||||
- "Preserve backward compatibility with existing cluster consumers"
|
||||
|
||||
stopping:
|
||||
max_iterations: 100
|
||||
max_hours: 8
|
||||
plateau_iterations: 10
|
||||
target_reached: true
|
||||
```
|
||||
|
||||
### Evaluation Execution Order (per experiment)
|
||||
|
||||
```
1. Run measurement command (evaluate.py)
   -> Produces JSON with gate metrics + diagnostics
   -> Fast, free

2. Check degenerate gates
   -> If ANY gate fails: REVERT immediately, log as "degenerate"
   -> Do NOT run the judge (saves money)

3. If primary type is "judge": Run LLM-as-judge
   -> Sample outputs according to stratification config
   -> Send each sample to judge model with rubric
   -> Aggregate scores per scoring config
   -> This is the number the loop optimizes against

4. Keep/revert decision
   -> Based on primary metric (hard or judge score)
   -> Must also pass all degenerate gates (already checked in step 2)
```
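
A sketch of that ordering, with the gate checker and judge passed in as callables (spec field paths assumed from Example A above); the keep/revert decision itself stays with the orchestrator:

```python
def evaluate_experiment(run_harness, spec: dict, gates_pass, judge_fn) -> dict:
    metrics = run_harness()                               # step 1: cheap, gates + diagnostics
    ok, failures = gates_pass(metrics, spec["degenerate_gates"])
    if not ok:
        return {"outcome": "degenerate", "failures": failures, "diagnostics": metrics}
    if spec["metric"]["primary"]["type"] == "judge":      # step 3: only pay for the judge if gates pass
        primary = judge_fn()["mean_score"]                # samples, scores, and aggregates
    else:
        primary = metrics[spec["metric"]["primary"]["name"]]
    return {"outcome": "measured", "primary": primary, "diagnostics": metrics}
```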
|
||||
|
||||
---
|
||||
|
||||
## Open Questions (Remaining)
|
||||
|
||||
1. **Should the agent propose hypotheses, or should the user provide them?**
|
||||
- Both — agent generates from analysis, user can inject ideas, agent prioritizes
|
||||
|
||||
2. **Judge calibration across experiments**
|
||||
- LLM judges can drift or be inconsistent across calls
|
||||
- Should we include "anchor samples" — a fixed set of clusters with known scores — in every judge batch to detect drift?
|
||||
- If anchor scores shift >0.5 from baseline, re-calibrate or flag for user review
|
||||
|
||||
3. **Judge rubric iteration**
|
||||
- The rubric itself might need improvement after seeing early results
|
||||
- But changing the rubric mid-loop invalidates comparisons to earlier experiments
|
||||
- Solution: if rubric changes, re-judge the current best with the new rubric to re-baseline?
|
||||
|
||||
4. **Relationship to `/lfg` and `/slfg`?**
|
||||
- `/lfg` is autonomous execution of a single task
|
||||
- `/ce-optimize` is autonomous execution of an iterative search
|
||||
- `/ce-optimize` can delegate each experiment to Codex (decided D5)
|
||||
- Local experiments use subagent dispatch similar to `/ce:review`
|
||||
|
||||
5. **Branch strategy details?**
|
||||
- Main optimization branch: `optimize/<spec-name>`
|
||||
- Each kept experiment is a commit on that branch
|
||||
- Branch points create `optimize/<spec-name>/direction-<N>`
|
||||
- All branches preserved for later reference and comparison
|
||||
|
||||
6. **Batch size adaptation?**
|
||||
- Should the batch size grow/shrink based on success rate?
|
||||
- High success rate -> larger batches (more exploration)
|
||||
- Low success rate -> smaller batches (more focused)
|
||||
- Or keep it simple and let the user tune `max_concurrent`
|
||||
|
||||
7. **Hypothesis diversity within a batch?**
|
||||
- Should parallel experiments in the same batch be intentionally diverse?
|
||||
- E.g., one threshold tweak + one new signal + one preprocessing change
|
||||
- Or let the prioritization algorithm decide naturally?
|
||||
|
||||
8. **Judge cost budgets?**
|
||||
- Should the spec include a `max_judge_cost_usd` budget?
|
||||
- When budget is exhausted, switch to hard-metrics-only mode or stop?
|
||||
- Or just track cost in the log and let the user decide?
|
||||
|
||||
## What Makes This Different From "Just Using AutoResearch"
|
||||
|
||||
AutoResearch is designed for ML training on a single GPU. CE's version needs to handle:
|
||||
|
||||
1. **Multi-file changes** — real code changes span multiple files
|
||||
2. **Complex metrics** — not just one scalar, but primary + guard rails + diagnostics
|
||||
3. **Varied execution environments** — not just `python train.py` but arbitrary commands
|
||||
4. **Integration with existing workflows** — learnings, review, ideation
|
||||
5. **User-in-the-loop** — pause for approval on scope-expanding changes, inject new hypotheses
|
||||
6. **Knowledge capture** — document what worked and why for the team, not just for the agent's context
|
||||
7. **Non-ML domains** — clustering, search quality, API performance, test coverage, build times, etc.
|
||||
|
||||
## Success Criteria for This Skill
|
||||
|
||||
- User can define an optimization target in <15 minutes
|
||||
- Measurement scaffolding is validated before the loop starts
|
||||
- Loop runs unattended for hours, producing measurable improvement
|
||||
- All experiments are preserved in git for later reference
|
||||
- The winning strategy is documented as a learning
|
||||
- A human reviewing the experiment log can understand what was tried and why
|
||||
- The skill handles failures gracefully (bad experiments don't corrupt state)
|
||||
|
||||
## Lessons from First Run (2026-03-30)
|
||||
|
||||
The skill was tested on the clustering problem for ~90 minutes. Results:
|
||||
|
||||
**What worked:**
|
||||
- Ran 16 experiments, improved multi_member_pct from 31.4% to 72.1%
|
||||
- Explored multiple algorithm modes (basic, refine, bounded union-find)
|
||||
- Correctly identified size-bounded union-find as the winning approach
|
||||
- Hypothesis diversity across parameter sweeps was reasonable
|
||||
|
||||
**What failed:**
|
||||
|
||||
1. **No LLM-as-judge evaluation** -- The skill defaulted to `type: hard` and optimized `multi_member_pct` as the primary metric. This is a proxy metric that can mislead. A solution that puts 72% of items in clusters is useless if the clusters are incoherent. The Phase 0.2 interactive spec creation did not actively probe whether the target was qualitative or guide toward judge mode.
|
||||
|
||||
**Fix applied**: Phase 0.2 now includes explicit qualitative vs quantitative detection, concrete examples of when to use each type, sampling strategy guidance with walkthrough questions, and rubric design guidance. The skill now strongly recommends `type: judge` for qualitative targets.
|
||||
|
||||
2. **No disk persistence** -- Experiment results existed only in the conversation context (as a table dumped to chat). If the session had been compacted or crashed, all 90 minutes of results would have been lost. This directly contradicts the Karpathy model where `results.tsv` is written after every single experiment.
|
||||
|
||||
**Fix applied**: Added mandatory disk checkpoints (CP-0 through CP-5) at every phase boundary. Each checkpoint requires a write-then-verify cycle: write the file, read it back, confirm the content is present. The persistence discipline section now explicitly states "If you produce a results table in the conversation without writing those results to disk first, you have a bug."
|
||||
|
||||
3. **Sampling strategy not prompted** -- Even if `type: judge` had been used, the skill didn't guide the user through designing a sampling strategy. For clustering, the user wants stratified sampling across: top clusters by size (check for mega-clusters), mid-range clusters (representative quality), small clusters (check if connections are real), and singletons (check for false negatives). This domain-specific guidance was missing.
|
||||
|
||||
**Fix applied**: Phase 0.2 now walks through sampling strategy design with concrete questions and domain-specific examples.
|
||||
|
||||
**Key takeaway**: The skill had all the right machinery in the schema and templates, but the SKILL.md instructions didn't guide the agent forcefully enough toward using that machinery. Instructions that say "if judge type, do X" are ignored when the skill silently defaults to the hard type. Instructions need to actively detect the right path and steer toward it.
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. Re-test with the clustering use case using `type: judge` to validate the judge loop works end-to-end
|
||||
2. Verify disk persistence works on a long run (2+ hours) with context compaction
|
||||
3. Test with a second use case (e.g., prompt optimization, build performance) to validate generality
|
||||
4. Consider adding anchor samples for judge calibration across experiments (Open Question #2)
|
||||
5. Consider judge cost budgets (Open Question #8)
|
||||
@@ -0,0 +1,664 @@
|
||||
---
|
||||
title: "feat(ce-optimize): Add iterative optimization loop skill"
|
||||
type: feat
|
||||
status: completed
|
||||
date: 2026-03-29
|
||||
origin: docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md
|
||||
deepened: 2026-03-29
|
||||
---
|
||||
|
||||
# feat(ce-optimize): Add iterative optimization loop skill
|
||||
|
||||
## Overview
|
||||
|
||||
Add a new `/ce-optimize` skill that implements metric-driven iterative optimization — the pattern where you define a measurable goal, build measurement scaffolding first, then run an automated loop that tries many parallel experiments, measures each against hard gates and/or LLM-as-judge quality scores, keeps improvements, and converges toward the best solution. Inspired by Karpathy's autoresearch but generalized for multi-file code changes, complex metrics, and non-ML domains.
|
||||
|
||||
## Problem Frame
|
||||
|
||||
CE has knowledge-compounding and quality gates but no skill for systematic experimentation. When a developer needs to improve a measurable outcome (clustering quality, build performance, search relevance), they currently iterate manually — one change at a time, eyeballing results. This skill automates the modify-measure-decide cycle, runs experiments in parallel via worktrees or Codex sandboxes, and preserves all experiment history in git for later reference. (see origin: `docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md`)
|
||||
|
||||
## Requirements Trace
|
||||
|
||||
- R1. User can define an optimization target (spec file) in <15 minutes
|
||||
- R2. Measurement scaffolding is validated before the loop starts (hard phase gate)
|
||||
- R3. Three-tier metric architecture: degenerate gates (cheap boolean checks) -> LLM-as-judge quality score (sampled, cost-controlled) -> diagnostics (logged, not gated)
|
||||
- R4. LLM-as-judge with stratified sampling and user-defined rubric is a first-class primary metric type, not deferred
|
||||
- R5. Experiments run in parallel by default using worktree isolation or Codex sandboxes
|
||||
- R6. Parallelism blockers (ports, shared DBs, exclusive resources) are actively detected and mitigated during Phase 1
|
||||
- R7. Dependencies are pre-approved in bulk during hypothesis generation; unapproved deps defer the hypothesis without blocking the pipeline
|
||||
- R8. Flaky metrics are configurable (repeat N times, aggregate via median/mean, noise threshold)
|
||||
- R9. All experiments preserved in git for later reference; experiment log captures hypothesis, metrics, outcome, and learnings
|
||||
- R10. The winning strategy is documented via `/ce:compound` integration
|
||||
- R11. Codex support from v1 using established `codex exec` stdin-pipe pattern
|
||||
- R12. Loop handles failures gracefully (bad experiments don't corrupt state)
|
||||
- R13. Multiple stopping criteria: target reached, max iterations, max hours, plateau (N iterations no improvement), manual stop
|
||||
|
||||
## Scope Boundaries
|
||||
|
||||
- No tree search / backtracking in v1 — linear keep/revert with optional manual branch points only
|
||||
- No batch size adaptation — fixed `max_concurrent`, user-tunable
|
||||
- No LLM-as-judge calibration anchors in v1 — deferred to future iteration
|
||||
- No rubric mid-loop iteration protocol in v1
|
||||
- No judge cost budget enforcement — cost tracked in log, user decides
|
||||
- This plan covers the skill, reference files, and scripts. It does not cover changes to the CLI converter or other targets
|
||||
|
||||
## Context & Research
|
||||
|
||||
### Relevant Code and Patterns
|
||||
|
||||
- **Skill format**: `plugins/compound-engineering/skills/ce-work/SKILL.md` — multi-phase skill with YAML frontmatter, `#$ARGUMENTS` input, parallel subagent dispatch
|
||||
- **Parallel dispatch**: `plugins/compound-engineering/skills/ce-review/SKILL.md` — spawns N reviewers in parallel, merges structured JSON results
|
||||
- **Subagent template**: `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` — confidence rubric, false-positive suppression
|
||||
- **Codex delegation**: `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` — `codex exec` stdin pipe, security posture, 3-failure auto-disable, environment guard
|
||||
- **Worktree management**: `plugins/compound-engineering/skills/git-worktree/SKILL.md` + `scripts/worktree-manager.sh`
|
||||
- **Scratch space**: `.context/compound-engineering/<skill-name>/` with per-run subdirs for concurrent runs
|
||||
- **State file patterns**: YAML frontmatter in plan files, JSON schemas in ce:review references
|
||||
- **Skill-to-skill references**: `Load the <skill> skill` for pass-through; `/ce:compound` slash syntax for published commands
|
||||
|
||||
### Institutional Learnings
|
||||
|
||||
- **State machine design is mandatory** for multi-phase workflows — re-read state after every transition, never carry stale values (`docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md`)
|
||||
- **Script-first for measurement harnesses** — 60-75% token savings by moving mechanical work (parsing, classification, aggregation) into bundled scripts (`docs/solutions/skill-design/script-first-skill-architecture.md`)
|
||||
- **Confidence rubric pattern** — use 0.0-1.0 scale with explicit suppression threshold (0.60 proven in production), define false-positive categories (`ce:review subagent-template.md`)
|
||||
- **Pass paths not content to sub-agents** — orchestrator discovers paths, workers read what they need (`docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md`)
|
||||
- **State transitions must be load-bearing** — if experiment states exist (proposed/running/measured/evaluated), at least one consumer must branch on them (`docs/solutions/workflow/todo-status-lifecycle.md`)
|
||||
- **Branch name sanitization** — `/` to `~` is injective for filesystem paths (`docs/solutions/developer-experience/branch-based-plugin-install-and-testing-2026-03-26.md`)
|
||||
|
||||
## Key Technical Decisions

- **Linear keep/revert with parallel batches**: Each batch runs N experiments in parallel; the best-in-batch is kept if it improves on the current best, all others are reverted. Simpler than tree search, compatible with git-native workflows. (see origin: Decision 1)
- **Three-tier metrics**: Degenerate gates (fast, free, boolean) -> LLM-as-judge or hard primary metric -> diagnostics (logged only). Gates run first to avoid wasting judge calls on obviously broken solutions. (see origin: Decision 2)
- **LLM-as-judge via stratified sampling**: ~30 samples per evaluation, stratified by output category (small/medium/large clusters), with a user-defined rubric. Cost: ~$0.30-0.90 per experiment. The judge prompt is immutable (part of the measurement harness). The judge score requires `minimum_improvement` (default 0.3 on a 1-5 scale) to accept as "better" — this accounts for sample-composition variance when output structure changes between experiments. (see origin: D4)
- **Model-parsed spec, script-executed measurement**: The orchestrating agent reads and parses the YAML spec file directly (agents are natively capable of YAML handling). The measurement script receives flat arguments (command, timeout, working directory), runs the command, and returns raw JSON output. The agent evaluates gates and aggregates stability repeats. This follows the established plugin pattern where no shell scripts parse YAML — the model interprets structure, scripts handle I/O.
- **Parallel-batch merge strategy**: When multiple experiments in a batch improve the metric: (1) Keep the best experiment, merge it to the optimization branch. (2) For each runner-up that also improved: check **file-level disjointness** with the kept experiment (same file modified by both = overlapping, even if different lines). (3) If disjoint: cherry-pick the runner-up onto the new baseline, re-run full measurement. (4) If the combined measurement is strictly better: keep the cherry-pick. Otherwise revert and log as "promising alone but neutral/harmful in combination." (5) Process runners-up in descending metric order; stop after the first failed combination. Config: `max_runner_up_merges_per_batch` (default: 1). Rationale: two changes that each independently improve a metric can interfere when combined (e.g., one tightens thresholds while another loosens them). This is expected, not a bug. (A git-level sketch follows this list.)
- **Worktree isolation for parallel experiments**: Each experiment gets a git worktree under `.worktrees/` (aligned with existing convention) with copied shared resources. Codex sandboxes are an opt-in alternative. The orchestrator retains git control. Max concurrent is capped at 6 for the worktree backend (git performance degrades beyond ~10-15 concurrent worktrees); 8+ is only valid for the Codex backend. (see origin: D6)
- **Codex dispatch via stdin pipe**: Write the prompt to a temp file, pipe it to `codex exec`, collect the diff after completion. Security posture is selected once per session. (see origin: D5)
- **Context window management via rolling window + strategy digest**: The experiment log grows without bound (20-30 lines per experiment). The orchestrator does NOT read the full log each iteration. Instead: (1) maintain a rolling window of the last 10 experiments in working memory, (2) after each batch write a strategy digest summarizing what categories have been tried, what succeeded/failed, and the exploration frontier, (3) read the full log only in filtered sections (e.g., by category) when checking whether a specific hypothesis was already tried. The full log remains the durable ground truth on disk.
- **Judge dispatch via batched parallel sub-agents**: The orchestrator selects samples per the stratification config, groups them into batches of `judge.batch_size` (default: 10), and dispatches `ceil(sample_size / batch_size)` parallel sub-agents. Each sub-agent evaluates its batch and returns structured JSON scores. The orchestrator aggregates. This follows the ce:review parallel reviewer dispatch pattern and avoids the overhead of spawning one sub-agent per sample.

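A git-level sketch of the runner-up handling described above, using made-up branch names and a hypothetical measurement invocation; the real loop drives this through the experiment log and the bundled scripts rather than ad-hoc shell:

```bash
# File-level disjointness check between the batch winner and a runner-up
# (branch names are invented; the loop reads these from the experiment log).
winner=optimize/my-spec-exp-004
runner=optimize/my-spec-exp-007
base=optimize/my-spec

winner_files=$(git diff --name-only "$base...$winner" | sort)
runner_files=$(git diff --name-only "$base...$runner" | sort)

git checkout "$base"
git merge "$winner"                                  # keep the best experiment

if [ -z "$(comm -12 <(echo "$winner_files") <(echo "$runner_files"))" ]; then
  git cherry-pick "$runner"                          # disjoint: trial the combination
  bash scripts/measure.sh "bun run eval" 600 .       # re-measure (hypothetical arguments)
  # If the combined result is not strictly better: git reset --hard HEAD~1 and log it.
fi
```
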
## Open Questions

### Resolved During Planning

- **Skill naming**: `ce-optimize` with directory `ce-optimize/`. The frontmatter name now matches the directory and slash command.
- **Where does experiment state live**: `.context/compound-engineering/ce-optimize/<spec-name>/` — contains the spec, experiment log, strategy digest, and per-batch scratch. Cleaned after successful completion, except the final experiment log, which moves to the optimization branch.
- **How are experiment branches named**: `optimize/<spec-name>` for the main optimization branch. Per-experiment worktree branches: `optimize/<spec-name>-exp-<NNN>` (a suffix rather than a nested path, because git cannot hold both a branch and another branch nested under it, e.g. `optimize/foo` and `optimize/foo/exp-001`). Sanitized with `/` to `~` for filesystem paths.
- **Judge model selection**: Haiku by default (fast, cheap), Sonnet optional. Specified in the spec file.
- **Who parses the YAML spec**: The orchestrating agent (model), not the measurement script. No CE scripts parse YAML — the established pattern is model reads structure, scripts handle I/O. The measurement script receives flat arguments and returns raw JSON.
- **Judge dispatch mechanism**: Batched parallel sub-agents following the ce:review pattern. The orchestrator selects samples, groups them into batches of `judge.batch_size` (default: 10), dispatches parallel sub-agents, and aggregates JSON scores.
- **Branch collision on re-run**: Phase 0 detects an existing `optimize/<spec-name>` branch and experiment log. It presents the user with a choice: resume (inherit existing state, continue from the last iteration) or fresh start (archive the old branch to `optimize/<spec-name>-archived-<timestamp>`, clear the log). (Sketched after this list.)
- **Judge score comparability**: Add `judge.minimum_improvement` (default: 0.3 on a 1-5 scale) as the minimum improvement to accept. This accounts for sample-composition variance when output structure changes. Distinct from `noise_threshold`, which handles run-to-run flakiness.

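At the git level, the fresh-start path could be as small as renaming the old branch out of the way before recreating it. In this sketch the spec name, the timestamp format, the `main` base branch, and the log filename are all illustrative assumptions:

```bash
spec="my-spec"                                   # hypothetical spec name
stamp="$(date +%Y%m%d-%H%M%S)"
git branch -m "optimize/${spec}" "optimize/${spec}-archived-${stamp}"   # preserve old history
git checkout -b "optimize/${spec}" main                                 # assumes 'main' as base
: > ".context/compound-engineering/ce-optimize/${spec}/experiment-log.yaml"  # clear the log (assumed filename)
```
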
### Deferred to Implementation

- **Exact gate check evaluation**: The agent interprets operator strings like `">= 0.85"` from the spec and evaluates them against metric values. The exact edge cases depend on what metric shapes users provide.
- **Codex exec flag compatibility**: The exact `codex exec` flags may change. The skill should check `codex --version` and adapt.
- **Worktree cleanup timing**: Whether to clean up worktrees immediately after each batch or defer to end-of-loop may depend on disk space constraints discovered at runtime.
- **Harness bug discovered mid-loop**: If the measurement harness itself has a bug discovered during the loop, the user must fix it manually. The harness is immutable by design — the agent cannot modify it. After the fix, the user should re-baseline and resume (or start fresh). The exact UX for this depends on implementation.

## High-Level Technical Design

> *This illustrates the intended approach and is directional guidance for review, not an implementation specification. The implementing agent should treat it as context, not code to reproduce.*

```
+-----------------+
| User provides   |
| goal + scope    |
+--------+--------+
         |
+--------v--------+
| Phase 0: Setup  |
| Create/load spec|
+--------+--------+
         |
+--------v-----------+
| Phase 1: Scaffold  |
| Build/validate     |
| harness + baseline |
| Probe parallelism  |
+--------+-----------+
         |
    [USER GATE]
         |
+--------v-----------+
| Phase 2: Hypotheses|
| Generate + approve |
| deps in bulk       |
+--------+-----------+
         |
+--------------v--------------+
| Phase 3: Optimize Loop      |
|                             |
|  +--- Batch N hypotheses    |
|  |                          |
|  |  +--+ Worktree/Codex     |
|  |  |  | per experiment     |
|  |  |  | implement          |
|  |  |  | measure            |
|  |  |  | collect metrics    |
|  |  +--+                    |
|  |                          |
|  +--- Evaluate batch        |
|  |    gates -> judge -> rank|
|  |    KEEP best / REVERT    |
|  |                          |
|  +--- Update log + backlog  |
|  +--- Check stop criteria   |
|  +--- Next batch            |
+--------------+--------------+
               |
      +--------v--------+
      | Phase 4: Wrap-Up|
      | Summarize       |
      | /ce:compound    |
      | /ce:review      |
      +--------+--------+
               |
            [DONE]
```

## Implementation Units

### Phase A: Reference Files and Scripts (no dependencies between units)

- [ ] **Unit 1: Optimization spec schema**

**Goal:** Define the YAML schema for the optimization spec file that users create to configure an optimization run. An abbreviated, hypothetical example is sketched at the end of this unit.

**Requirements:** R1, R3, R4, R5, R8, R13

**Dependencies:** None

**Files:**
- Create: `plugins/compound-engineering/skills/ce-optimize/references/optimize-spec-schema.yaml`

**Approach:**
- Define a commented YAML schema document (not JSON Schema — YAML is more readable in a skill-authoring context) that the skill references to validate user-provided specs
- Cover all three metric tiers: `metric.primary` (type: hard|judge), `metric.degenerate_gates`, `metric.diagnostics`, `metric.judge`
- Include `measurement` (command, timeout, stability), `scope` (mutable/immutable), `execution` (mode, backend, max_concurrent), `parallel` (port strategy, shared files, exclusive resources), `dependencies`, `constraints`, `stopping`
- Include inline comments explaining each field, valid values, and defaults
- Use the two example specs from the brainstorm (hard-metric primary and LLM-judge primary) as validation targets

**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-review/references/findings-schema.json` for structured schema reference
- `plugins/compound-engineering/skills/ce-compound/references/schema.yaml` for YAML schema with inline comments

**Test scenarios:**
- Schema covers all fields from both example specs in the brainstorm
- Required vs optional fields are clearly marked
- Default values are documented for every optional field

**Verification:**
- A user reading only this file can create a valid spec without consulting other docs

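To make the schema concrete, here is an abbreviated, hypothetical spec of the kind the schema needs to describe. The top-level sections follow the bullets above; the nested field names, values, file location, and project command are illustrative assumptions only:

```bash
# Hypothetical user-written spec (illustrative, not the normative schema).
cat > optimize-spec.yaml <<'YAML'
name: issue-clustering
metric:
  primary: { type: hard, name: multi_item_cluster_rate, goal: maximize }
  degenerate_gates:
    - check: "cluster_count >= 10"          # boolean gate, evaluated by the agent
  diagnostics: [mean_cluster_size, singleton_count]
measurement:
  command: "bun run scripts/eval-clusters.ts"   # project-specific command (assumed)
  timeout_seconds: 600
scope:
  mutable: [src/clustering/]
  immutable: [scripts/eval-clusters.ts]
execution: { mode: parallel, backend: worktree, max_concurrent: 4 }
stopping: { target: 0.95, max_iterations: 100 }
YAML
```
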
---

- [ ] **Unit 2: Experiment log schema**

**Goal:** Define the YAML schema for the experiment log that accumulates across the optimization run. An illustrative entry is sketched at the end of this unit.

**Requirements:** R9, R12

**Dependencies:** None

**Files:**
- Create: `plugins/compound-engineering/skills/ce-optimize/references/experiment-log-schema.yaml`

**Approach:**
- Define the structure: baseline metrics, experiments array (iteration, batch, hypothesis, category, changes, gates, diagnostics, judge, outcome, primary_delta, learnings, commit), and best-so-far summary
- Include all experiment outcome states: `kept`, `reverted`, `degenerate`, `error`, `deferred_needs_approval`, `timeout`
- These states are load-bearing — the loop branches on them (per todo-status-lifecycle learning)

**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-compound/references/schema.yaml`

**Test scenarios:**
- Schema covers the full experiment log example from the brainstorm
- All outcome states documented with transition rules

**Verification:**
- An implementer reading this schema can produce or parse an experiment log without ambiguity

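For orientation, a single hypothetical entry of the kind this schema has to cover might look like the following; field names mirror the list above, while the values, filename, and changed paths are invented:

```bash
# Illustrative experiment-log entry, appended by the orchestrator only.
cat >> experiment-log.yaml <<'YAML'
- iteration: 17
  batch: 5
  hypothesis: "Strip issue-template boilerplate before embedding"
  category: signal-extraction
  changes: [src/clustering/preprocess.ts]
  gates: { all_passed: true }
  judge: { mean_score: 3.9, samples: 30 }
  outcome: kept
  primary_delta: 0.04
  learnings: "Template stripping helps most on short issues"
  commit: abc1234
YAML
```
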
---

- [ ] **Unit 3: Experiment worker prompt template**

**Goal:** Define the prompt template used to dispatch each experiment to a subagent or Codex.

**Requirements:** R5, R11

**Dependencies:** None

**Files:**
- Create: `plugins/compound-engineering/skills/ce-optimize/references/experiment-prompt-template.md`

**Approach:**
- Template with variable substitution slots: `{iteration}`, `{spec.name}`, `{current_best_metrics}`, `{hypothesis.description}`, `{scope.mutable}`, `{scope.immutable}`, `{constraints}`, `{approved_dependencies}`, `{recent_experiment_summaries}`
- Include explicit instructions: implement only, do NOT run the harness, do NOT commit, do NOT modify immutable files
- Include a `git diff --stat` instruction at the end so the orchestrator can collect changes
- Follow the path-not-content pattern — pass file paths for large context, inline only small structural data

**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` for variable substitution pattern and output contract

**Test scenarios:**
- Template produces a clear, unambiguous prompt when all slots are filled
- Immutable file constraints are prominent and unambiguous
- Works for both subagent and Codex dispatch (no platform-specific assumptions in template body)

**Verification:**
- An implementer can fill this template and dispatch it without needing to read other reference files

---

- [ ] **Unit 4: Judge evaluation prompt template**

**Goal:** Define the prompt template for LLM-as-judge evaluation of sampled outputs.

**Requirements:** R3, R4

**Dependencies:** None

**Files:**
- Create: `plugins/compound-engineering/skills/ce-optimize/references/judge-prompt-template.md`

**Approach:**
- Two template sections: cluster/item evaluation (using the user's rubric from the spec) and singleton evaluation (using the user's singleton_rubric)
- Template includes: the rubric text, the sample data to evaluate, and explicit JSON output format instructions
- Include confidence calibration guidance adapted from ce:review's rubric pattern: each judge call returns a score + structured metadata
- Template is designed for Haiku by default — keep prompts concise and well-structured for smaller models
- Include the false-positive suppression concept: the judge should flag a sample as ambiguous rather than forcing a score

**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` — confidence rubric structure, JSON output contract

**Test scenarios:**
- Template works with both the cluster coherence rubric and a generic quality rubric
- JSON output format is unambiguous and parseable
- Template handles edge cases: empty clusters, single-item clusters, very large clusters

**Verification:**
- Filling this template with a rubric and sample data produces a prompt that a model can respond to with valid JSON

---

- [ ] **Unit 5: Measurement runner script**

**Goal:** Create a script that runs the measurement command, captures JSON output, and handles timeouts and errors. The orchestrating agent (not this script) evaluates gates and handles stability repeats. A sketch of the core is at the end of this unit.

**Requirements:** R2, R12

**Dependencies:** None

**Files:**
- Create: `plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh`

**Approach:**
- Division of labor follows the established plugin pattern: scripts handle I/O, the model interprets structure
- Input: flat positional arguments only — command to run, timeout in seconds, working directory, optional environment variables (KEY=VALUE pairs for port parameterization)
- Steps: set environment variables -> cd to working directory -> run measurement command with timeout -> capture stdout (expected JSON) and stderr (for error context) -> exit with the command's exit code
- Output: raw JSON from the measurement command to stdout, stderr passed through. No post-processing, no YAML parsing, no gate evaluation — the orchestrating agent handles all of that after reading the script's output
- Handle: command timeout (via the `timeout` command), non-zero exit (pass through), stderr capture for error diagnosis
- The script does NOT: parse YAML spec files, evaluate gate checks, aggregate stability repeats, or produce structured result envelopes. These are all orchestrator responsibilities.

**Patterns to follow:**
- `plugins/compound-engineering/skills/git-worktree/scripts/worktree-manager.sh` — flat positional arguments, no structured data parsing
- `plugins/compound-engineering/skills/resolve-pr-feedback/scripts/get-pr-comments` — simple script that runs a command and returns JSON

**Test scenarios:**
- Command succeeds: JSON output passed through to stdout
- Command fails (non-zero exit): exit code passed through, stderr available
- Command times out: timeout exit code returned
- Environment variables applied: PORT env var set before command runs

**Verification:**
- Script can be run standalone with a command and timeout and returns the command's raw output

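A minimal sketch of what the runner's core could look like under the flat-argument contract above. The usage line and env-var handling are assumptions; the real script would add argument validation:

```bash
#!/usr/bin/env bash
# Usage (assumed): measure.sh "<command>" <timeout_seconds> <workdir> [KEY=VALUE ...]
cmd="$1"; timeout_s="$2"; workdir="$3"; shift 3

# Export any KEY=VALUE pairs, e.g. PORT=4011 for port parameterization.
for kv in "$@"; do export "$kv"; done

cd "$workdir" || exit 1

# Run the measurement command. stdout (expected JSON) and stderr pass through
# untouched; the command's exit code -- or 124 on timeout -- is propagated.
timeout "${timeout_s}s" bash -c "$cmd"
```
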
---

- [ ] **Unit 6: Parallelism probe script**

**Goal:** Create a script that detects common parallelism blockers in the target project. The port and shared-file checks are sketched at the end of this unit.

**Requirements:** R5, R6

**Dependencies:** None

**Files:**
- Create: `plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh`

**Approach:**
- Input: spec file path (for the measurement command and mutable scope), project directory
- Checks:
  1. Port detection: search measurement command output and config files for hardcoded port patterns (`:\d{4,5}`, `PORT=`, `--port`, `bind`, `listen`)
  2. Shared file detection: check for SQLite files (`.db`, `.sqlite`, `.sqlite3`), local file stores in mutable/measurement paths
  3. Lock file detection: check for `.lock`, `.pid` files created by the measurement command
  4. Resource contention: check for GPU references (`cuda`, `torch.device`, `gpu`), large memory markers
- Output: JSON with `mode` (parallel|serial|user-decision), `blockers_found` array, `mitigations` array, `unresolved` array
- This is advisory — the skill presents results to the user for approval and does not auto-mitigate

**Patterns to follow:**
- `plugins/compound-engineering/skills/git-worktree/scripts/worktree-manager.sh`

**Test scenarios:**
- No blockers found: mode = parallel
- Port hardcoded: detected and reported with suggested mitigation
- SQLite file in scope: detected and reported
- Multiple blockers: all listed

**Verification:**
- Script can be run against a sample project directory and produces valid JSON

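As a flavor of the detection logic, the first two checks might reduce to pattern scans like the following. This is a simplified sketch; the real script would also run the lock-file and GPU checks and emit the `mitigations` and `unresolved` arrays:

```bash
project_dir="${1:-.}"
blockers=()

# Check 1: hardcoded port patterns anywhere in tracked files.
if git -C "$project_dir" grep -qE ':[0-9]{4,5}|PORT=|--port|bind|listen' -- .; then
  blockers+=('"hardcoded-port"')
fi

# Check 2: SQLite files that parallel experiments would contend on.
if find "$project_dir" -maxdepth 3 \( -name '*.db' -o -name '*.sqlite' -o -name '*.sqlite3' \) | grep -q .; then
  blockers+=('"sqlite-file"')
fi

if [ "${#blockers[@]}" -eq 0 ]; then mode="parallel"; else mode="user-decision"; fi
printf '{"mode":"%s","blockers_found":[%s]}\n' "$mode" "$(IFS=,; echo "${blockers[*]}")"
```
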
---

- [ ] **Unit 7: Experiment worktree manager script**

**Goal:** Create a script that manages experiment worktrees — creation with shared file copying, and cleanup. The create path is sketched at the end of this unit.

**Requirements:** R5, R6, R12

**Dependencies:** None

**Files:**
- Create: `plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh`

**Approach:**
- Subcommands: `create`, `cleanup`, `cleanup-all`
- `create`: takes spec name, experiment index, list of shared files to copy, base branch
  - Creates a worktree at `.worktrees/optimize-<spec>-exp-<NNN>/` on branch `optimize/<spec>-exp-<NNN>`
  - Copies shared files from the main repo into the worktree
  - Copies `.env`, `.env.local` if they exist (per existing worktree convention)
  - Applies port parameterization if configured (writes the env var to the worktree's `.env`)
  - Returns the worktree path
- `cleanup`: removes a single experiment worktree and its branch
- `cleanup-all`: removes all experiment worktrees for a given spec name
- Error handling: verify git repo, check for existing worktrees, handle cleanup of partially created worktrees

**Patterns to follow:**
- `plugins/compound-engineering/skills/git-worktree/scripts/worktree-manager.sh` — worktree creation, `.env` copying, branch management

**Test scenarios:**
- Create worktree: directory exists, branch created, shared files copied
- Create with port parameterization: env var written to worktree
- Cleanup: worktree removed, branch deleted
- Cleanup-all: all experiment worktrees for spec removed
- Partial failure: cleanup handles partially created state

**Verification:**
- Script can create and clean up worktrees in a test git repo

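The create path is mostly a `git worktree add` plus file copies. A sketch with an assumed argument order; the real script adds validation and the `cleanup`/`cleanup-all` subcommands:

```bash
# create <spec> <exp_index> <base_branch> [shared_file ...]   (assumed interface)
spec="$1"; idx="$2"; base="$3"; shift 3

wt_path=".worktrees/optimize-${spec}-exp-${idx}"
branch="optimize/${spec}-exp-${idx}"

git worktree add -b "$branch" "$wt_path" "$base"

# Copy env files and any spec-declared shared resources into the worktree.
for f in .env .env.local "$@"; do
  if [ -f "$f" ]; then
    mkdir -p "$(dirname "$wt_path/$f")"
    cp "$f" "$wt_path/$f"
  fi
done

echo "$wt_path"   # orchestrator captures this as the experiment's working directory
```
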
---

### Phase B: Core Skill (depends on all Phase A units)

- [ ] **Unit 8: SKILL.md — Phase 0 (Setup) and Phase 1 (Measurement Scaffolding)**

**Goal:** Create the SKILL.md file with frontmatter, Phase 0 (setup, spec validation, run identity, learnings search), and Phase 1 (harness validation, baseline, parallelism probe, clean-tree gate, user approval gate).

**Requirements:** R1, R2, R6, R8

**Dependencies:** Units 1-7

**Files:**
- Create: `plugins/compound-engineering/skills/ce-optimize/SKILL.md`

**Approach:**

*Frontmatter:*
- `name: ce-optimize`
- `description:` — rich description covering what it does (iterative optimization), when to use it (measurable improvement goals), and key capabilities (parallel experiments, LLM-as-judge, git-native history)
- No `disable-model-invocation` — this is a v1 skill, not beta

*Phase 0: Setup*
- Accept a spec file path as argument, or interactively create one guided by the spec schema reference (`references/optimize-spec-schema.yaml`)
- Agent reads and validates the spec (required fields, valid metric types, valid operators). The agent parses YAML natively — no shell script parsing.
- Search learnings via `compound-engineering:research:learnings-researcher` for prior optimization work on similar topics
- **Run identity detection**: Check if the `optimize/<spec-name>` branch already exists. If yes, check for an existing experiment log. Present the user with a choice via the platform question tool: resume (inherit existing state, continue from the last iteration) or fresh start (archive the old branch to `optimize/<spec-name>-archived-<timestamp>`, clear the log)
- Create or switch to the optimization branch
- Create the scratch directory: `.context/compound-engineering/ce-optimize/<spec-name>/`

*Phase 1: Measurement Scaffolding (HARD GATE)*
- **Clean-tree gate**: Verify `git status` shows no uncommitted changes to files within `scope.mutable` or `scope.immutable`. If dirty, require commit or stash before proceeding. (Sketched at the end of this unit.)
- If the user provides a measurement harness: run it once via the measurement script (pass command and timeout as flat args), validate that the JSON output matches expected metric names, present the baseline to the user
- If the agent must build the harness: analyze the codebase, build the evaluation script, validate it, present the baseline to the user
- Run the parallelism probe script, present results
- **Worktree budget check**: Count existing worktrees. Warn if total + `max_concurrent` would exceed 12.
- If stability mode is repeat: run the harness `repeat_count` times, agent aggregates results (median/mean/min/max), validate variance is within `noise_threshold`
- GATE: Present baseline metrics + parallel readiness + clean-tree status to the user. Use the platform question tool. Refuse to proceed until approved.
- State re-read: after gate approval, re-read the spec and baseline from disk (per state-machine learning)

**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-work/SKILL.md` — Phase 0 input triage and Phase 1 setup pattern
- `plugins/compound-engineering/skills/ce-plan/SKILL.md` — Phase 0 resume detection pattern

**Test scenarios:**
- Spec validation catches missing required fields
- Existing optimization branch detected: resume and fresh-start paths both work
- Clean-tree gate: blocks on dirty worktree, passes on clean
- Baseline measurement: harness runs and produces valid JSON
- Parallelism probe: blockers detected and presented

**Verification:**
- YAML frontmatter passes `bun test tests/frontmatter.test.ts`
- All reference file paths use backtick syntax (no markdown links)
- Cross-platform question tool pattern used for user gate

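The clean-tree gate itself can be a scoped `git status` check. A sketch, with the scope paths as placeholders that would come from the parsed spec:

```bash
# Fail fast if anything inside the optimization scope has uncommitted changes.
scope_paths=("src/clustering/" "scripts/eval-clusters.ts")   # placeholders from spec scope

if [ -n "$(git status --porcelain -- "${scope_paths[@]}")" ]; then
  echo "Working tree is dirty inside the optimization scope; commit or stash first." >&2
  exit 1
fi
```
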
---

- [ ] **Unit 9: SKILL.md — Phase 2 (Hypothesis Generation)**

**Goal:** Add Phase 2 to the SKILL.md — hypothesis generation, categorization, dependency pre-approval, and backlog recording.

**Requirements:** R7

**Dependencies:** Unit 8

**Files:**
- Modify: `plugins/compound-engineering/skills/ce-optimize/SKILL.md`

**Approach:**

*Phase 2: Hypothesis Generation*
- Analyze mutable scope code to understand current approach
- Generate hypothesis list — optionally via `compound-engineering:research:repo-research-analyst` for deeper codebase analysis
- Categorize hypotheses (signal-extraction, graph-signals, embedding, algorithm, preprocessing, etc.)
- Identify new dependencies across all hypotheses
- Present dependency list for bulk approval via platform question tool
- Record hypothesis backlog in experiment log file (with dep approval status per hypothesis)
- Include user-provided hypotheses if any were given as input

**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-ideate/SKILL.md` — hypothesis generation, categorization, iterative refinement

**Test scenarios:**
- Hypotheses generated from codebase analysis
- User-provided hypotheses merged into backlog
- Dependencies identified and presented for bulk approval
- Hypotheses needing unapproved deps marked in backlog

**Verification:**
- Hypothesis backlog recorded in experiment log with categories and dep status

---

- [ ] **Unit 10: SKILL.md — Phase 3 (Optimization Loop)**

**Goal:** Add Phase 3 to the SKILL.md — the core parallel batch dispatch, measurement, judge evaluation, keep/revert logic, and stopping criteria. This is the largest and riskiest unit.

**Requirements:** R3, R4, R5, R9, R11, R12, R13

**Dependencies:** Unit 9

**Files:**
- Modify: `plugins/compound-engineering/skills/ce-optimize/SKILL.md`

**Approach:**

*Phase 3: Optimization Loop*
- For each batch:
  1. Select hypotheses (batch_size = min(backlog_size, max_concurrent)). Prefer diversity across categories within each batch.
  2. Dispatch experiments in parallel:
     - **Worktree backend**: create a worktree per experiment (via script), dispatch a subagent with the experiment prompt template (`references/experiment-prompt-template.md`)
     - **Codex backend**: write the prompt to a temp file, dispatch via `codex exec` stdin pipe (per ce-work-beta pattern)
     - Environment guard: check for `CODEX_SANDBOX`/`CODEX_SESSION_ID` to prevent recursive delegation (the dispatch guard and judge fan-out are sketched at the end of this unit)
  3. Wait for batch completion
  4. For each completed experiment:
     - Run the measurement script in the experiment's worktree (flat args: command, timeout, working dir, env vars)
     - Agent reads the raw JSON output, evaluates degenerate gates
     - If gates pass and the primary type is judge: dispatch batched parallel judge sub-agents per the judge prompt template (`references/judge-prompt-template.md`). Group samples into batches of `judge.batch_size` (default: 10), dispatch `ceil(sample_size / batch_size)` sub-agents. Aggregate returned JSON scores.
     - If gates pass and the primary type is hard: use the hard metric value directly
     - Record all results in the experiment log
  5. Evaluate the batch using the parallel-batch merge strategy (see Key Technical Decisions):
     - Rank by primary metric improvement (hard metric delta or judge `mean_score` delta, must exceed `minimum_improvement`)
     - Best improves on current: KEEP (merge experiment branch to optimization branch)
     - Check file-disjoint runners-up: cherry-pick, re-measure, keep if combined is strictly better
     - Handle deferred deps: mark hypothesis `deferred_needs_approval`, continue
     - All others: REVERT (log, cleanup worktree)
  6. Update experiment log with ALL results from this batch
  7. Write strategy digest summarizing categories tried, successes, failures, exploration frontier
  8. Generate new hypotheses based on learnings from this batch (read rolling window of last 10 experiments + strategy digest, not full log)
  9. Check stopping criteria (target reached, max iterations, max hours, plateau, manual stop)
  10. State re-read: re-read current best from experiment log before next batch

*Cross-cutting concerns:*
- **Codex failure cascade**: 3 consecutive delegate failures auto-disable Codex for the remaining experiments, falling back to subagents
- **Error handling**: experiment errors (command crash, timeout, malformed output) are logged as `outcome: error` and the experiment is reverted. The loop continues.
- **Progress reporting**: after each batch, report: batch N of ~M, experiments run, current best metric, improvement from baseline, cumulative judge cost
- **Manual stop**: if the user interrupts, save the current experiment log state and offer wrap-up
- **Crash recovery**: each experiment writes a `result.yaml` marker in its worktree upon measurement completion. On resume, scan for completed-but-unlogged experiments before starting a new batch.

**Execution note:** execution target is external-delegate — this unit is large and well-specified

**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-review/SKILL.md` — parallel subagent dispatch (Stage 4), structured result merging (Stage 5)
- `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` — Codex delegation section
- `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` — sub-agent prompt structure and JSON output contract

**Test scenarios:**
- Spec with hard primary metric: gates + hard metric evaluation, no judge calls
- Spec with judge primary metric: gates -> batched judge sub-agents -> keep/revert based on aggregated judge score
- Parallel batch of 4 experiments: all dispatched, results collected, best kept, others reverted
- Experiment that violates a degenerate gate: immediately reverted, no judge call, no judge cost
- Experiment needing unapproved dep: deferred, pipeline continues
- Codex dispatch failure: fallback to subagent after 3 failures
- Plateau stopping: 10 consecutive batches with no improvement -> stop
- Flaky metric with repeat mode: agent runs harness N times, aggregates, applies noise threshold
- Runner-up merge: file-disjoint runner-up cherry-picked, re-measured, combined is better -> kept
- Runner-up merge fails: combined is worse than best-only -> runner-up reverted, logged
- Context management: after 50 experiments, strategy digest used instead of full log

**Verification:**
- Experiment log updated after every batch (not just at end)
- Strategy digest file written after every batch
- Worktrees cleaned up after measurement
- All reference file paths use backtick syntax
- Script references use relative paths (`bash scripts/measure.sh`)

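Two small mechanics from this phase, sketched with the caveat that the exact `codex` invocation varies by version (the skill checks `codex --version` first): the recursion guard plus stdin dispatch, and the judge fan-out arithmetic.

```bash
# Recursion guard: never delegate to Codex from inside a Codex session.
prompt_file="$1"   # filled experiment prompt, already written to a temp file
if [ -n "${CODEX_SANDBOX:-}" ] || [ -n "${CODEX_SESSION_ID:-}" ]; then
  echo "Already inside a Codex session; dispatch this experiment to a subagent instead." >&2
  exit 2
fi
codex exec < "$prompt_file"   # exact flags may differ across codex versions

# Judge fan-out: ceil(sample_size / batch_size) parallel judge sub-agents.
sample_size=30; batch_size=10
echo $(( (sample_size + batch_size - 1) / batch_size ))   # -> 3
```
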
---

- [ ] **Unit 11: SKILL.md — Phase 4 (Wrap-Up)**

**Goal:** Add Phase 4 to the SKILL.md — deferred hypothesis presentation, result summary, branch preservation, and integration with ce:review and ce:compound.

**Requirements:** R9, R10

**Dependencies:** Unit 10

**Files:**
- Modify: `plugins/compound-engineering/skills/ce-optimize/SKILL.md`

**Approach:**

*Phase 4: Wrap-Up*
- Present deferred hypotheses needing dep approval (if any)
- Summarize: baseline -> final metrics, total iterations run, kept count, reverted count, judge cost total
- Preserve optimization branch with all commits
- Offer post-completion options via platform question tool:
  1. Run `/ce:review` on cumulative diff (baseline -> final)
  2. Run `/ce:compound` to document the winning strategy
  3. Create PR from optimization branch
  4. Continue with more experiments (re-enter Phase 3)
  5. Done

**Patterns to follow:**
- `plugins/compound-engineering/skills/ce-work/SKILL.md` — Phase 4 (Ship It) post-completion options
- `plugins/compound-engineering/skills/lfg/SKILL.md` — skill-to-skill handoff pattern

**Test scenarios:**
- Deferred hypotheses presented with dep requirements
- Summary includes all key metrics and cost data
- Each post-completion option works (ce:review, ce:compound, PR creation, continue, done)
- "Continue" re-enters Phase 3 cleanly with state re-read

**Verification:**
- Optimization branch preserved with full commit history
- Post-completion options use platform question tool pattern

---

### Phase C: Registration (depends on Unit 11)

- [ ] **Unit 12: Plugin registration and validation**

**Goal:** Register the new skill in plugin documentation and validate consistency.

**Requirements:** R1

**Dependencies:** Unit 11

**Files:**
- Modify: `plugins/compound-engineering/README.md`

**Approach:**
- Add `ce-optimize` to the skills table in README.md with description
- Update skill count in README.md
- Run `bun run release:validate` to verify plugin consistency
- Do NOT bump version in plugin.json or marketplace.json (per versioning rules)

**Patterns to follow:**
- Existing skill table entries in `plugins/compound-engineering/README.md`

**Test scenarios:**
- `bun run release:validate` passes
- Skill count in README matches actual skill count
- Skill table entry is alphabetically placed and has accurate description

**Verification:**
- `bun run release:validate` exits 0
- `bun test` passes (especially frontmatter tests)

## System-Wide Impact

- **Interaction graph:** The skill dispatches to learnings-researcher (Phase 0), repo-research-analyst (Phase 2), parallel judge sub-agents (Phase 3), and optionally ce:review and ce:compound (Phase 4). It creates git worktrees and branches. It invokes Codex as an external process.
- **Error propagation:** Experiment failures are contained — each runs in an isolated worktree. Failures are logged and reverted. The optimization branch only advances on successful, validated improvements. If the orchestrator crashes mid-batch, each completed experiment should have a `result.yaml` marker in its worktree; on resume the orchestrator scans for completed-but-unlogged experiments before starting a new batch.
- **State lifecycle risks:** The experiment log is the critical state artifact. It must be written after each batch (not just at the end) to survive crashes. Log atomicity is ensured by the batch-then-evaluate architecture — only the single-threaded orchestrator writes to the log, never concurrent workers.
- **Context window pressure:** The experiment log grows ~25 lines per experiment. At 100 experiments that is ~2,500 lines of YAML. The orchestrator manages this via a rolling summary window (last 10 experiments) + a strategy digest file, never reading the full log unless filtering by category for duplicate-hypothesis detection. (A digest fragment is sketched after this list.)
- **Branch collision:** If `optimize/<spec-name>` already exists from a prior run, Phase 0 detects it and offers resume vs. fresh start. This prevents accidental overwrites of prior experiment history.
- **Dirty working tree:** Phase 1 includes a clean-tree gate: `git status` must show no uncommitted changes to files within `scope.mutable` or `scope.immutable`. If dirty, require commit or stash before proceeding. This prevents baseline measurement from differing between the main worktree and experiment worktrees.
- **Worktree budget:** Optimization worktrees live under `.worktrees/` (same convention as the git-worktree skill). Before creating experiment worktrees, check the total worktree count (including non-optimize worktrees from ce:work or ce:review). Refuse to exceed 12 total worktrees to prevent git performance degradation.
- **API surface parity:** This is a new skill, no existing surface to maintain parity with.
- **Integration coverage:** The parallelism readiness probe should be validated against real projects with known blockers (SQLite DBs, hardcoded ports) to ensure detection works.

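A hypothetical digest fragment, to show how compact this can stay even after dozens of experiments; the structure, field names, and filename are illustrative, not normative:

```bash
# Illustrative strategy digest, rewritten by the orchestrator after each batch.
cat > strategy-digest.yaml <<'YAML'
best_so_far: { experiment: exp-041, primary: 0.71, delta_from_baseline: 0.51 }
categories_tried:
  signal-extraction: { attempts: 9, kept: 3 }
  graph-signals:     { attempts: 6, kept: 1 }
  embedding:         { attempts: 4, kept: 0, note: "model swaps regressed judge score" }
frontier:
  - "Combine link-graph signal with template stripping"
  - "Sweep similarity threshold around 0.82-0.88"
YAML
```
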
## Risks & Dependencies

- **Codex exec flags may change** — the skill should detect `codex` version and adapt. Mitigate by checking `codex --version` before first dispatch.
- **Worktree disk usage** — parallel experiments with large repos consume disk. Mitigate by cleaning up worktrees immediately after measurement, capping at 6 concurrent for worktree backend, and enforcing a 12-worktree budget across all CE skills.
- **LLM-as-judge consistency** — judge scores may vary across calls for the same input. Mitigate by using fixed sample seeds, requiring `minimum_improvement` threshold (default 0.3) to accept, and logging per-sample scores for post-hoc analysis. v2 can add anchor-based calibration.
- **Long-running unattended execution** — the loop may run for hours. Mitigate by saving experiment log after every batch, writing per-experiment `result.yaml` markers for crash recovery, and designing for graceful resume from saved state.
- **Context window exhaustion** — experiment log grows ~25 lines per experiment. Mitigate with rolling summary window (last 10 experiments) + strategy digest file. The orchestrator never reads the full log in one pass.
- **Judge API rate limiting** — if using Claude API for judge calls, rate limits could throttle parallel judge evaluation. Mitigate by batching judge calls (10 per sub-agent) to reduce total API calls, and adding a brief delay between judge sub-agent dispatches if rate-limited.
- **Runner-up merge interactions** — two independently beneficial changes can be harmful in combination. Mitigate by re-measuring after every merge, stopping after the first failed combination per batch, and logging interactions as learnings.

## Documentation / Operational Notes

- Update `plugins/compound-engineering/README.md` skill table
- No new MCP servers or external dependencies for the plugin itself
- The skill will appear in Claude Code's skill list automatically once the SKILL.md exists

## Sources & References

- **Origin document:** [docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md](docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md)
- Related code: `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` (Codex delegation), `plugins/compound-engineering/skills/ce-review/SKILL.md` (parallel dispatch)
- Related PRs: #364 (Codex security posture), #365 (Codex exec pitfalls)
- External: Karpathy autoresearch (github.com/karpathy/autoresearch), AIDE/WecoAI (github.com/WecoAI/aideml)
- Learnings: `docs/solutions/skill-design/script-first-skill-architecture.md`, `docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md`, `docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md`, `docs/solutions/workflow/todo-status-lifecycle.md`