feat(ce-debug): environment sanity, assumption audit, more techniques (#649)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,35 @@
# Defense-in-Depth
When a bug is caused by invalid state reaching a vulnerable code path, fixing just one layer leaves the door open for different code paths, refactors, or mocks to re-introduce the same bug. Defense-in-depth makes the bug structurally harder to re-create by validating at multiple layers.

Not every bug warrants this. Use when:
- The root-cause pattern exists in 3+ other files (grep the fix signature)
- The bug would have been catastrophic in production
- The vulnerable operation is dangerous regardless of caller (destructive side effects, security-sensitive, irreversible)

Skip when the root cause is a one-off logic error with no realistic recurrence path.
## The four layers
Pick the layers that apply. Not every bug needs all four.

| Layer | Purpose | Apply when | Example |
|-------|---------|------------|---------|
| 1. Entry validation | Reject obviously invalid input at the API boundary | The bug was caused by a caller passing bad data that should have been rejected | Throw if `workingDirectory` is empty or doesn't exist, before any downstream code touches it |
| 2. Invariant / business-logic check | Enforce that data makes sense for this operation | The operation has preconditions that entry validation cannot express | Assert `user.state === 'verified'` before issuing a password reset |
| 3. Environment guard | Refuse dangerous operations in contexts where they make no sense | The operation can be catastrophic if run in the wrong environment | In tests (`NODE_ENV === 'test'`), refuse `git init` outside the OS temp dir |
| 4. Diagnostic breadcrumb | Capture forensic context before the risky operation | Other layers might still be bypassed; future failures need evidence | Log `{ directory, cwd, env, stack }` immediately before `git init` |
## Applying the pattern
1. Trace the data flow from the bad value's origin through every function that passed it along.
2. Map the checkpoints: at which of those points could validation have rejected the bad value earlier?
3. Add guards at the appropriate layers. Each guard should be as narrow as possible — validating exactly what this layer is responsible for, not duplicating checks from other layers.
4. Test each guard independently: construct a case that bypasses layer 1 and verify layer 2 still catches it.
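The layers above can be sketched as independent guards around one risky operation. A minimal sketch, assuming a Node/TypeScript codebase; `initRepo` and its messages are hypothetical, and layer 2 is omitted because business-logic invariants are domain-specific:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical destructive operation guarded at three of the four layers
// (layer 2, the business-logic invariant, depends on the domain).
function initRepo(workingDirectory: string): void {
  // Layer 1: entry validation. Reject obviously invalid input at the boundary.
  if (!workingDirectory || !fs.existsSync(workingDirectory)) {
    throw new Error(`workingDirectory invalid: ${JSON.stringify(workingDirectory)}`);
  }

  // Layer 3: environment guard. In tests, refuse to touch anything
  // outside the OS temp dir, no matter who the caller is.
  if (
    process.env.NODE_ENV === "test" &&
    !path.resolve(workingDirectory).startsWith(os.tmpdir())
  ) {
    throw new Error(`refusing to run outside tmpdir during tests: ${workingDirectory}`);
  }

  // Layer 4: diagnostic breadcrumb. Forensic context captured just
  // before the risky call, in case the other layers are ever bypassed.
  console.error("initRepo", { directory: workingDirectory, cwd: process.cwd() });

  // ...the actual destructive work (e.g. `git init`) would happen here...
}
```

Each guard catches a distinct failure class: layer 1 stops bad callers, layer 3 stops bad environments, and layer 4 leaves evidence when both are bypassed.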
## Common mistakes
- **Duplicating the same check at every layer.** Each layer should catch a distinct class of failure. If layer 2 just repeats layer 1, the second one is noise.
- **Adding guards speculatively without a bug to justify them.** Defense-in-depth is a response to an observed failure mode, not a generic code-hygiene practice.
- **Leaving layer 4 (diagnostic breadcrumb) out.** When layers 1-3 still get bypassed — they will, eventually — the breadcrumb is what makes the next bug debuggable.

@@ -38,6 +38,44 @@ Use `console.error()` in tests — logger output may be suppressed. Log before t

---
## Multi-Component Boundary Instrumentation
Root-cause tracing walks one call chain. When a bug crosses subsystems — CI → build → signing, API → service → database, frontend → API → background worker — the failure localizes poorly to a single chain. Instead, instrument every component boundary in one run, capture what enters and what exits each, and let the evidence point to the failing layer.

**Shape:**
1. List the component boundaries data crosses from trigger to observed symptom.
2. At each boundary, log what enters and what exits — include the values, relevant environment, and a short tag identifying the boundary.
3. Run the scenario once.
4. Read the log linearly, comparing each "exits" value to the next "enters" value.
5. The boundary where data first stops matching expectation is the failing layer.

**Worked example (app signing on CI):**

```bash
# Layer 1: workflow env
echo "=== workflow env ==="
# prints SET/UNSET without echoing the secret value itself
echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"

# Layer 2: build script env
echo "=== build script env ==="
echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"

# Layer 3: signing stage keychain state
echo "=== keychain ==="
security list-keychains
security find-identity -v

# Layer 4: the actual signing call
codesign --sign "$IDENTITY" --verbose=4 "$APP"
```
One run, and the log shows precisely which layer drops the value — secrets → workflow ✓, workflow → build ✗ — so the investigation focuses on the workflow-to-build-script inheritance, not on signing.

**When this beats backward tracing:** when the symptom is far from the trigger (many components apart), when components are owned by different systems (CI vs app code), or when the "call stack" is conceptual rather than literal (message bus, HTTP, process boundaries). Backward tracing still applies within each layer once the failing layer is identified.

---
## Git Bisect for Regressions
When a bug is a regression ("it worked before"), use binary search to find the breaking commit:

@@ -87,6 +125,41 @@ A 5% reproduction rate confirms the bug exists but suggests timing or data sensi

- Does the input size, encoding, or edge value matter?
- Is the data order significant (sorted vs random)?

**Test-order pollution.** If an individual test passes in isolation but fails when the suite runs, tests are leaking state into one another:

- Run the failing test alone — if it passes, pollution is confirmed
- Run the failing test's file alone — narrows pollution to same-file or cross-file
- Run the suite with randomized test order (most runners support a seed flag) — a different failing-test neighbor each run implies global state mutation
- Bisect the preceding tests: run the failing test with just the first half of the earlier tests, then the second half, then narrow
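The bisection in the last step can be mechanized. A sketch that assumes a single polluting test; `failsAfter` is a stand-in for "run this subset of earlier tests, then the failing test, and report whether it failed", which in practice is a test-runner invocation with an explicit file list:

```typescript
// Binary-search the preceding tests for the one whose presence
// makes the failing test fail. Assumes exactly one polluter; if the
// failure needs a combination of tests, fall back to full delta debugging.
function findPolluter(
  precedingTests: string[],
  failsAfter: (subset: string[]) => boolean,
): string | undefined {
  if (precedingTests.length === 0 || !failsAfter(precedingTests)) {
    return undefined; // pollution does not come from this set
  }
  let candidates = precedingTests;
  while (candidates.length > 1) {
    const firstHalf = candidates.slice(0, Math.ceil(candidates.length / 2));
    // If the first half alone still triggers the failure, the polluter
    // is in it; otherwise it must be in the second half.
    candidates = failsAfter(firstHalf)
      ? firstHalf
      : candidates.slice(firstHalf.length);
  }
  return candidates[0];
}
```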

Common culprits once isolated: module-level state, mocks not torn down, temp files not cleaned up, database rows not rolled back, environment variables mutated and not restored.

---
## Repro Minimization
Once a bug reproduces reliably, the reproduction is often large — a 500-line integration test, a huge payload, a lengthy form-filling sequence. A smaller reproduction makes every subsequent investigation step faster and localizes the actual trigger.

**Delta debugging (manual):**
1. Cut the reproduction in half.
2. Does it still fail? If yes, discard the other half; recurse on what remains. If no, the failing behavior depends on something in the half you cut — put it back and cut the other half instead.
3. Continue until no further reduction is possible without losing the failure.
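The loop can be mechanized when the repro is expressible as a list of lines or steps. A simplified sketch of delta debugging (not the full ddmin algorithm); `stillFails` stands in for re-running the reproduction:

```typescript
// Repeatedly try to drop chunks of the repro while `stillFails`
// keeps reporting the failure; halve the chunk size when no chunk
// at the current size can be removed.
function minimize<T>(repro: T[], stillFails: (candidate: T[]) => boolean): T[] {
  let current = repro;
  let chunkSize = Math.ceil(current.length / 2);
  while (chunkSize >= 1) {
    let removedSomething = false;
    for (let start = 0; start < current.length; start += chunkSize) {
      const candidate = [
        ...current.slice(0, start),
        ...current.slice(start + chunkSize),
      ];
      if (candidate.length > 0 && stillFails(candidate)) {
        current = candidate;  // the removed chunk was irrelevant: drop it
        removedSomething = true;
        start -= chunkSize;   // re-test from the same position
      }
    }
    if (!removedSomething) chunkSize = Math.floor(chunkSize / 2);
  }
  return current;
}
```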

**For input payloads:**
- Remove fields one at a time (or half at a time) while confirming the bug persists
- Shrink string values to the minimum length that still triggers the bug
- Replace complex nested structures with the smallest shape that reproduces

**For test sequences:**
- Remove setup steps that don't appear to affect the failing assertion
- Inline helpers into the test to see what actually runs
- Remove other assertions to isolate which one fails and on what state

The minimized repro often reveals the root cause directly — "the bug only triggers when the string contains a tab character" is a much louder signal than "the bug triggers in this 500-line integration test."

---
## Framework-Specific Debugging

@@ -111,6 +184,30 @@ A 5% reproduction rate confirms the bug exists but suggests timing or data sensi

---
## Stepping Debugger vs Instrumentation
Print-debugging is the default first resort — it is fast to add and scales across many cases. But in some situations an interactive stepping debugger converges on the root cause far faster. The rule of thumb:
- **Reach for a stepping debugger when:** the failing code path is localized (a specific function or tight call chain), the bug is reliably reproducible, and you need precise state at a known point — values of many locals at once, the exact shape of a structure, or the progression of state across a loop. One break, inspect everything.
- **Reach for instrumentation when:** the bug is intermittent, spans many calls or distributed components, or happens in a context where breaking execution is disruptive (production, concurrent code whose timing matters, long-running processes). Instrumentation captures diffuse behavior across time and environments.

Mixed use is common: instrument first to localize, then attach a debugger at the localized point.

**Entry points by language:**

| Language | Interactive breakpoint | Attach to running process |
|----------|------------------------|---------------------------|
| Python | `breakpoint()` in code, or `python -m pdb script.py` | `python -m pdb -p <pid>` (Python 3.14+ only); on earlier versions, instrument the target with `rpdb` / `remote-pdb` and connect after it triggers |
| Node.js | `debugger;` in code + `node --inspect-brk`, then connect via Chrome DevTools or VS Code | `kill -SIGUSR1 <pid>` to enable the inspector on the running process (Linux/macOS), then connect Chrome DevTools or VS Code to the default port 9229 |
| Ruby | `binding.irb` (stdlib), `binding.pry` (pry gem), `debugger` (debug gem), `rdbg` | `rdbg --attach <pid>` with `debug` gem loaded |
| Go | `dlv debug` or `dlv test`, then `break`, `continue`, `print` | `dlv attach <pid>` |
| Rust / C / C++ | `lldb target/debug/binary` or `gdb binary`, then `break`, `run`, `print` | `lldb -p <pid>` / `gdb -p <pid>` |
| Browser JS | `debugger;` in code, or DevTools Sources → set breakpoint | DevTools attaches to page automatically |

For test runs, most test runners integrate with the above — e.g., `node --inspect-brk $(which jest)`, `pytest --pdb`, `rspec` with `binding.pry`, `dlv test`. Prefer the runner's integration over trying to attach post-hoc.

---
## Race Condition Investigation
When timing or concurrency is suspected:

@@ -132,6 +229,43 @@ await new Promise(r => setTimeout(r, 100));

- Event handlers that assume emission order
- Database writes that assume read consistency

**Condition-based waits instead of arbitrary delays.** Flaky tests are often built on `setTimeout`/`sleep` calls that guess at how long an operation takes. These pass on fast machines and fail under load or in CI. Replace the guess with polling the condition the test actually depends on, bounded by a timeout:

```typescript
// before: races under load
await new Promise(r => setTimeout(r, 50));
expect(getResult()).toBeDefined();

// after: waits for the condition
await waitFor(() => getResult() !== undefined, 'result available', 5000);
expect(getResult()).toBeDefined();
```
Arbitrary delays remain correct only when testing actual timing behavior (debounce intervals, throttle windows) — in that case, comment why the specific duration is needed.
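`waitFor` in the example is not a built-in of any test framework; one minimal implementation, polling on an interval and failing loudly with the label once the timeout expires:

```typescript
// Poll `condition` until it returns true, or throw a labelled error
// once `timeoutMs` has elapsed.
async function waitFor(
  condition: () => boolean,
  label: string,
  timeoutMs: number,
  intervalMs = 25,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() > deadline) {
      throw new Error(`timed out after ${timeoutMs}ms waiting for: ${label}`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```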

---
## Heisenbugs and the Observer Effect
When adding `console.log`, attaching a debugger, or inserting instrumentation causes the bug to disappear, the observation is changing the system's behavior. That is itself diagnostic — do not conclude "fixed." The bug is still present; your instrumentation perturbed it out of sight.

**What the disappearance tells you:**
- **Timing-sensitive:** Instrumentation slowed the code enough that a race condition no longer wins. Investigate concurrency, async ordering, and shared mutable state rather than the nominal logic.
- **Garbage-collection-sensitive:** Logging allocated memory and triggered a GC that hid the symptom. Look at memory pressure, finalizers, object lifecycle.
- **Optimization-dependent:** Instrumentation prevented a compiler/JIT optimization that was producing wrong results. Rare but real (especially in C/C++/Rust release builds).
- **Buffering-dependent:** Log flushing changed I/O ordering. Often indicates unflushed writes elsewhere.
- **Async-ordering-sensitive:** Log I/O introduced a microtask boundary that reorders subsequent operations. Look for code that implicitly depends on synchronous ordering.

**How to investigate without perturbing:**
- Non-blocking instrumentation: write to a ring buffer in memory, dump it only after failure is observed
- Sampling profilers instead of tracing: external observation of what's running without injecting code into the path
- Platform-level instrumentation: `strace`, `dtrace`, eBPF, platform profilers that don't require code changes
- Post-mortem evidence: core dumps, heap snapshots, captured state from after the failure, without observing during
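The first option sketched in miniature: recording is a plain array write, far cheaper than console or file I/O on the hot path, and the buffer is dumped only after the failure has been observed:

```typescript
// Fixed-size in-memory trace buffer. Recording costs one array write,
// so it perturbs timing far less than logging on the hot path.
class RingBuffer {
  private slots: string[];
  private next = 0;
  private filled = false;

  constructor(private capacity: number) {
    this.slots = new Array(capacity);
  }

  record(entry: string): void {
    this.slots[this.next] = entry;
    this.next = (this.next + 1) % this.capacity;
    if (this.next === 0) this.filled = true;
  }

  // Call only after the failure is observed; returns oldest entry first.
  dump(): string[] {
    return this.filled
      ? [...this.slots.slice(this.next), ...this.slots.slice(0, this.next)]
      : this.slots.slice(0, this.next);
  }
}
```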

The defining rule: if the bug is sensitive to observation, the fix must survive re-introduction of the observation. A fix that only works while instrumentation is present is itself a heisenbug.

---
## Browser Debugging

@@ -159,3 +293,82 @@ agent-browser screenshot bug-evidence.png

**Console errors:** Check browser console output for JavaScript errors, failed network requests, and CORS issues. These often reveal the root cause of UI bugs before any code tracing is needed.

**Network tab:** Check for failed API requests, unexpected response codes, or missing CORS headers. A 422 or 500 response from the backend narrows the investigation immediately.

---
## Evidence Harvesting Across Systems
When a bug spans a real environment — production, staging, a multi-service setup — the richest evidence usually already exists in logs, traces, and error-tracker payloads. Use it rather than reproducing from scratch when possible.

**Follow a single request end-to-end.** Pick one concrete failing request (an exact timestamp, user ID, or event ID from an error tracker). Then:
- Search every relevant log source for that identifier — correlation ID, request ID, trace ID, user ID
- Assemble the timeline in order: edge → API → service → database → downstream calls → response
- Note where the timeline has gaps (missing logs) or contradictions (timestamps out of order, IDs that don't propagate)

One traced request usually reveals the root cause faster than a dozen attempts to reproduce.

**Correlation IDs.** Most web frameworks either attach a request ID automatically or accept one via header (`X-Request-ID`, `traceparent`). When the project has one, every log line and every downstream call should carry it. If it's missing or not propagated, that is itself a finding — propagation gaps mean the agent cannot assemble the timeline, and neither could the on-call human who investigates the next incident.
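A framework-free sketch of the propagation idea (`handleRequest` and its shape are illustrative; `X-Request-ID` is one common header convention): reuse the incoming ID when present, mint one otherwise, and stamp it on every log line and downstream call:

```typescript
import { randomUUID } from "node:crypto";

// Reuse the caller's request ID if present; mint one otherwise.
function requestId(incomingHeaders: Record<string, string>): string {
  return incomingHeaders["x-request-id"] ?? randomUUID();
}

// Every log line and every downstream request carries the same ID,
// so a single grep can reassemble the whole timeline later.
function handleRequest(incomingHeaders: Record<string, string>) {
  const id = requestId(incomingHeaders);
  const log = (msg: string) => console.error(`[req=${id}] ${msg}`);
  const downstreamHeaders = { ...incomingHeaders, "x-request-id": id };
  log("handling request");
  return { id, downstreamHeaders };
}
```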

**Timestamp triangulation.** When the failing operation has no shared ID, timestamps are the fallback. Constrain every log query to a narrow window around the observed failure, then look for the first anomaly in order. Watch for clock skew between services — a 30-second drift between two hosts reorders evidence and misleads triangulation.

**Error tracker payloads.** Sentry, Bugsnag, Honeybadger, AppSignal and similar tools capture stack traces, breadcrumbs, user context, request state, and release metadata at the moment of failure. Read the full payload before tracing code — it often contains the exact file:line, the variable state, and the breadcrumbs leading to the error. Grouping rules sometimes hide frequency and variant information; expand to see every instance rather than just the representative one.

**APM / distributed traces.** When the project has Datadog APM, Honeycomb, New Relic, or an OpenTelemetry collector, the trace view shows the full call tree across services with timings. Look for: unexpectedly long spans (blocking or slow dependency), failed spans in the middle of the chain, spans that should exist but don't (missing instrumentation also masks bugs).

**Preserve before investigating.** Error trackers and log systems have retention windows. Before starting a long investigation, export or snapshot the key evidence (event ID, trace ID, full stack trace, breadcrumbs) so it doesn't age out mid-session.

---
## System Boundary Checks
Many bugs live at the boundary between an application and the system it runs on — network, database, filesystem, OS. A fast pass through these boundaries often eliminates whole categories of suspicion before deep code tracing.

**Network.**
- DNS resolution: `dig <host>`, `nslookup <host>`, `host <host>` — does the name resolve to what you expect from this host?
- Reachability: `curl -v https://host/path` — full headers, redirects, TLS errors
- Status codes and headers: check response for 4xx/5xx, unexpected redirects, missing CORS headers, content-encoding surprises
- Connection state: `ss -tan` / `netstat -an` / `lsof -i` — open connections, listening ports, connections in TIME_WAIT or CLOSE_WAIT
- TLS: `openssl s_client -connect host:443` — certificate chain, expiry, SNI mismatches

**Database.**
- Query plan: `EXPLAIN` / `EXPLAIN ANALYZE` on the suspect query — is it using the expected index, or scanning a large table?
- Slow query log / recent queries: most databases surface the N slowest recent queries — failing queries often show up there
- Locks and transactions: inspect the lock/transaction tables (`pg_locks`, `information_schema.innodb_trx`, `sys.dm_tran_locks`) — is the operation waiting on a long-held lock?
- Connection pool: is the app exhausting its pool? Are connections leaking?
- Replication lag (if read replicas are in the path): a read right after a write may hit a replica that hasn't caught up yet

**Filesystem.**
- Existence and permissions: `ls -la <path>` — does the file exist, is it readable/writable by the running user?
- Case sensitivity: bugs that only appear on Linux (not macOS) are often case mismatches
- Open handles: `lsof <path>` or `lsof -p <pid>` — is something still holding the file, preventing write/unlink?
- Disk space: `df -h` — out-of-space errors sometimes surface as cryptic write failures elsewhere
- File watching / inotify limits: EMFILE or "too many open files" often means an inotify/FD limit, not a leak in your code
- Path separators and encoding: Windows-style paths in Unix code, or UTF-8 paths in a non-UTF-8 locale

**Processes and signals.** Check whether the running process is actually the version you think it is (`ps aux | grep`, cross-reference the pid to its build time). Zombies, orphaned workers, and crashed-then-restarted-with-old-code processes all masquerade as code bugs.

---
## Bug-Class Pattern Checklist
Before deep tracing, run down this checklist. Many bugs match a recognizable class, and the class implies where to look first. Check whether the observed symptom fits any of these patterns:
- **Time and timezone:** off-by-hours errors near midnight, failures specifically during DST transitions, epoch/milliseconds confusion, naive vs timezone-aware datetimes mixed, UTC-vs-local assumed incorrectly
- **Encoding and locale:** mojibake in output, byte-vs-character length off-by-one, BOM at the start of a file breaking parsers, non-ASCII characters missing, locale-sensitive comparisons producing inconsistent results
- **Floating-point precision:** comparisons that "should" be equal but aren't, NaN propagating through a calculation and silently corrupting downstream results, very large or very small numbers losing precision
- **Integer overflow / underflow:** wraparound on bounded integer types, `int32` overflows in languages without arbitrary-precision integers, negative values where non-negative was assumed
- **Off-by-one and boundaries:** empty-collection edge case, first or last element missing, inclusive vs exclusive range mismatch, fencepost errors
- **Cache staleness:** correct behavior immediately after a change, wrong behavior after some time, fixed by restart or cache flush; includes HTTP caches, CDN caches, app-level memoization, browser service workers
- **Permissions / auth:** works for one user and not another, works in dev without auth layer but fails in prod with it, works with superuser but not with the actual operating identity
- **Dependency or version drift:** works on one machine but not another, lockfile out of sync with manifest, transitive dependency updated and changed behavior, native module built against a different runtime version
- **Path / case sensitivity:** works on macOS and fails on Linux (case), works on Linux and fails on Windows (path separators, reserved names like `CON`/`PRN`)
- **Concurrency / ordering:** works in serial test mode, fails in parallel; works one way and fails another when randomized
- **Stale build artifacts:** `dist/`, `.next/`, compiled `.pyc`, generated code, Docker image layers — rebuild from clean and see if it reproduces
- **Observer effect (heisenbug):** bug vanishes when logging, debugger, or profiler is attached — see the Heisenbugs section above
- **TOCTOU (time-of-check vs time-of-use):** a check passed a moment ago but the underlying state changed before the dependent action ran

Pattern-matching here is cheap. Spending 30 seconds checking whether the symptom fits a known class can eliminate hours of speculative tracing.
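Two of the classes above are cheap to demonstrate directly, floating-point equality and epoch seconds-vs-milliseconds confusion:

```typescript
// Floating-point: values that "should" be equal often are not;
// compare against a tolerance instead of with ===.
const sum = 0.1 + 0.2;                                     // 0.30000000000000004
const exactlyEqual = sum === 0.3;                          // false
const closeEnough = Math.abs(sum - 0.3) < Number.EPSILON;  // true

// Epoch units: a seconds timestamp fed to a milliseconds API lands in 1970.
const epochSeconds = 1700000000;                           // November 2023, in seconds
const wrongYear = new Date(epochSeconds).getUTCFullYear();        // 1970
const rightYear = new Date(epochSeconds * 1000).getUTCFullYear(); // 2023
console.log({ exactlyEqual, closeEnough, wrongYear, rightYear });
```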