feat(ce-debug): add systematic debugging skill (#543)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 10:23:48 -07:00
parent b979143ad0
commit e38223ae91
5 changed files with 444 additions and 195 deletions
--- a/plugins/compound-engineering/skills/ce-debug/references/investigation-techniques.md
+++ b/plugins/compound-engineering/skills/ce-debug/references/investigation-techniques.md
@@ -0,0 +1,161 @@
+# Investigation Techniques
+
+Techniques for deeper investigation when standard code tracing is not enough. Load this when a bug does not reproduce reliably, involves timing or concurrency, or requires framework-specific tracing.
+
+---
+
+## Root-Cause Tracing
+
+When a bug manifests deep in the call stack, the instinct is to fix where the error appears. That treats a symptom. Instead, trace backward through the call chain to find where the bad state originated.
+
+**Backward tracing:**
+
+- Start at the error
+- At each level, ask: where did this value come from? Who called this function? What state was passed in?
+- Keep going upstream until you find the point where valid state first became invalid. That point is the root cause
+
+**Worked example:**
+
+```
+Symptom: API returns 500 with "Cannot read property 'email' of undefined"
+Where it crashes: sendWelcomeEmail(user.email) in NotificationService
+Who called this? UserController.create() after saving the user record
+What was passed? user = await UserRepo.create(params) — but create() returns undefined on duplicate key
+Original cause: UserRepo.create() silently swallows duplicate key errors and returns undefined instead of throwing
+```
+
+The fix belongs at the origin (UserRepo.create should throw on duplicate key), not where the error appeared (NotificationService).
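+
+A minimal sketch of fixing at the origin, assuming an in-memory stand-in for the repository. `DuplicateKeyError`, the `Map`-backed store, and the method shape are hypothetical; only the `UserRepo.create` name comes from the example above.
+
+```javascript
+// Hypothetical error type; a real codebase may already define its own.
+class DuplicateKeyError extends Error {
+  constructor(key) {
+    super(`Duplicate key: ${key}`);
+    this.name = 'DuplicateKeyError';
+  }
+}
+
+// In-memory stand-in for UserRepo (assumption: email is the unique key).
+class UserRepo {
+  constructor() {
+    this.byEmail = new Map();
+  }
+
+  async create(params) {
+    if (this.byEmail.has(params.email)) {
+      // Fail loudly at the origin instead of returning undefined.
+      throw new DuplicateKeyError(params.email);
+    }
+    const user = { id: this.byEmail.size + 1, ...params };
+    this.byEmail.set(params.email, user);
+    return user;
+  }
+}
+```
+
+With this shape, a duplicate insert surfaces as an error at `create()`, so downstream callers never receive an undefined `user`.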
+
+**When manual tracing stalls**, add instrumentation:
+
+```javascript
+// Before the problematic operation
+const stack = new Error().stack;
+console.error('DEBUG [operation]:', { value, cwd: process.cwd(), stack });
+```
+
+In tests, use `console.error()` rather than the application logger, whose output may be suppressed. Log before the dangerous operation, not after it fails, so the captured state is the input that led to the failure.
+
+---
+
+## Git Bisect for Regressions
+
+When a bug is a regression ("it worked before"), use binary search to find the breaking commit:
+
+```bash
+git bisect start
+git bisect bad                    # current commit is broken
+git bisect good <known-good-ref> # a commit where it worked
+# git bisect checks out a commit midway between good and bad: test it
+# mark it with `git bisect good` or `git bisect bad`, then repeat until the breaking commit is found
+git bisect reset                  # return to original branch when done
+```
+
+For automated bisection with a test script:
+
+```bash
+git bisect start HEAD <known-good-ref>
+git bisect run <test-command>
+```
+
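+The test command should exit 0 for good and 1-127 for bad, with one exception: exit code 125 tells `git bisect` to skip a commit that cannot be tested (for example, one that does not build).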
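+
+A small sketch of how a Node-based bisect script might pick its exit code. `bisectExitCode`, `buildOk`, and `testsPass` are hypothetical names; how the build and test results are obtained is left to the project.
+
+```javascript
+// Maps the outcome of one bisect step to the exit code `git bisect run` expects.
+// buildOk and testsPass are hypothetical inputs from the project's own build and test steps.
+function bisectExitCode({ buildOk, testsPass }) {
+  if (!buildOk) return 125; // commit cannot be tested: ask bisect to skip it
+  return testsPass ? 0 : 1; // 0 marks the commit good, 1 marks it bad
+}
+
+// In a script passed to `git bisect run`, the last line would be something like:
+//   process.exit(bisectExitCode({ buildOk, testsPass }));
+```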
+
+---
+
+## Intermittent Bug Techniques
+
+When a bug does not reproduce reliably after 2-3 attempts:
+
+**Logging traps.** Add targeted logging at the suspected failure point and run the scenario repeatedly. Capture the state that differs between passing and failing runs.
+
+**Statistical reproduction.** Run the failing scenario in a loop to establish a reproduction rate:
+
+```bash
+fails=0; for i in $(seq 1 20); do <test-command> || fails=$((fails + 1)); done
+echo "failed $fails of 20 runs"
+```
+
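+A 5% reproduction rate (about 1 failure in 20 runs) confirms the bug exists but suggests timing or data sensitivity. At rates that low, 20 runs can easily show no failure at all, so raise the run count before concluding the bug is gone.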
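+
+To judge how many runs are enough, the arithmetic can be sketched as follows, assuming runs are independent and fail at a fixed rate. The function names are hypothetical.
+
+```javascript
+// Probability of seeing at least one failure in `runs` independent runs.
+function chanceOfAtLeastOneFailure(failureRate, runs) {
+  return 1 - Math.pow(1 - failureRate, runs);
+}
+
+// Smallest run count that shows at least one failure with the given confidence.
+function runsNeeded(failureRate, confidence = 0.95) {
+  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
+}
+
+console.log(chanceOfAtLeastOneFailure(0.05, 20).toFixed(2)); // prints "0.64"
+console.log(runsNeeded(0.05)); // prints 59
+```
+
+Under these assumptions, a 20-run loop catches a 5% bug only about 64% of the time, and roughly 59 runs are needed to see it with 95% confidence.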
+
+**Environment isolation.** Systematically eliminate variables:
+- Same test, different machine?
+- Same test, different data seed?
+- Same test, serial vs parallel execution?
+- Same test, with vs without network access?
+
+**Data-dependent triggers.** If the bug only appears with certain data, identify the trigger condition:
+- What is unique about the failing input?
+- Does the input size, encoding, or edge value matter?
+- Is the data order significant (sorted vs random)?
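+
+One way to isolate the trigger is to shrink a failing input until every remaining element is needed to reproduce the bug. The sketch below assumes the input is an array and that failure can be checked with a predicate. It is a simple one-at-a-time removal loop (a simplified relative of delta debugging), and `shrinkFailingInput` and `fails` are hypothetical names.
+
+```javascript
+// Removes elements one at a time while the failure still reproduces.
+// `fails(candidate)` must return true when the bug appears for that input.
+function shrinkFailingInput(items, fails) {
+  let current = items.slice();
+  let changed = true;
+  while (changed) {
+    changed = false;
+    for (let i = 0; i < current.length; i++) {
+      const candidate = current.slice(0, i).concat(current.slice(i + 1));
+      if (fails(candidate)) {
+        current = candidate; // element i was not needed to trigger the bug
+        changed = true;
+        break;
+      }
+    }
+  }
+  return current;
+}
+
+// Example: the (pretend) bug needs both 3 and 7 to be present.
+const trigger = (arr) => arr.includes(3) && arr.includes(7);
+console.log(shrinkFailingInput([1, 3, 5, 7, 9], trigger)); // prints [ 3, 7 ]
+```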
+
+---
+
+## Framework-Specific Debugging
+
+### Rails
+- Check callbacks: model callbacks such as `before_save` and `after_commit`, and controller filters such as `around_action`. These execute implicitly and can alter state
+- Check middleware chain: `bin/rails middleware` lists the full stack (`rake middleware` on older Rails versions)
+- Check Active Record query generation: `.to_sql` on any relation
+- Use `Rails.logger.debug` with tagged logging for request tracing
+
+### Node.js
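+- Async stack traces: `async`/`await` code gets async call chains by default since Node 12, so no flag is needed; frames are typically lost across callback-style APIs, and rewriting the suspect path with `await` usually restores them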
+- Unhandled rejections: check for missing `.catch()` or `await` on promises
+- Event loop delays: read `process.hrtime.bigint()` before and after suspect operations, or sample loop lag with `monitorEventLoopDelay()` from `perf_hooks`
+- Memory leaks: `--inspect` flag + Chrome DevTools heap snapshots
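+
+A minimal timing helper for the event loop check above, assuming the suspect operation can be wrapped in a function. `timeOperation` and its `label` parameter are hypothetical names; `process.hrtime.bigint()` is the standard Node.js monotonic clock.
+
+```javascript
+// Times a sync or async operation with the monotonic high-resolution clock
+// and logs the duration to stderr, even when the operation throws.
+async function timeOperation(label, operation) {
+  const start = process.hrtime.bigint();
+  try {
+    return await operation();
+  } finally {
+    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
+    console.error(`DEBUG [${label}]: ${elapsedMs.toFixed(2)} ms`);
+  }
+}
+```
+
+Usage: `const rows = await timeOperation('load-users', () => loadUsers());`, where `loadUsers` stands in for the operation under suspicion.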
+
+### Python
+- Traceback enrichment: `traceback.print_exc()` in except blocks
+- `pdb.set_trace()` or `breakpoint()` for interactive debugging
+- `sys.settrace()` for execution tracing
+- `logging.basicConfig(level=logging.DEBUG)` for verbose output
+
+---
+
+## Race Condition Investigation
+
+When timing or concurrency is suspected:
+
+**Timing isolation.** Add deliberate delays at suspect points to widen the race window and make it reproducible:
+
+```javascript
+// Simulate slow operation to expose race
+await new Promise(r => setTimeout(r, 100));
+```
+
+**Shared mutable state.** Search for variables, caches, or database rows accessed by multiple threads or processes without synchronization. Common patterns:
+- Global or module-level mutable state
+- Cache reads without locks
+- Database rows read and then updated without a row lock or a version check (optimistic locking)
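+
+The read-then-update pattern can be reproduced in a few lines. The sketch below assumes a plain object as the shared state and uses a deliberate delay to widen the race window; in Node.js the interleaving happens at `await` points rather than between threads. `unsafeIncrement`, `createMutex`, and `demo` are hypothetical names.
+
+```javascript
+const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
+
+// Unsafe read-modify-write: the delay sits between the read and the write.
+async function unsafeIncrement(store) {
+  const current = store.count;
+  await sleep(10);
+  store.count = current + 1;
+}
+
+// Serializes critical sections by chaining them on a single promise.
+function createMutex() {
+  let tail = Promise.resolve();
+  return (criticalSection) => {
+    const run = tail.then(criticalSection);
+    tail = run.catch(() => {}); // keep the chain usable if a section throws
+    return run;
+  };
+}
+
+async function demo() {
+  const racy = { count: 0 };
+  await Promise.all([unsafeIncrement(racy), unsafeIncrement(racy)]);
+
+  const safe = { count: 0 };
+  const withLock = createMutex();
+  await Promise.all([
+    withLock(() => unsafeIncrement(safe)),
+    withLock(() => unsafeIncrement(safe)),
+  ]);
+
+  return { racy: racy.count, safe: safe.count };
+}
+
+demo().then((result) => console.log(result)); // prints { racy: 1, safe: 2 }
+```
+
+Both unsynchronized increments read `0` before either writes, so one update is lost. Serializing the same operations restores the expected count.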
+
+**Async ordering.** Check whether operations assume a specific execution order that is not guaranteed:
+- `Promise.all` over operations that depend on each other's side effects (they run concurrently, with no guaranteed completion order)
+- Event handlers that assume emission order
+- Database writes based on an earlier read that may be stale by the time the write runs
+
+---
+
+## Browser Debugging
+
+When investigating UI bugs with `agent-browser` or equivalent tools:
+
+```bash
+# Open the affected page
+agent-browser open http://localhost:${PORT:-3000}/affected/route
+
+# Capture current state
+agent-browser snapshot -i
+
+# Interact with the page
+agent-browser click @ref          # click an element
+agent-browser fill @ref "text"    # fill a form field
+agent-browser snapshot -i         # capture state after interaction
+
+# Save visual evidence
+agent-browser screenshot bug-evidence.png
+```
+
+**Port detection:** Check project instruction files (`AGENTS.md`, `CLAUDE.md`) for port references, then `package.json` dev scripts, then `.env` files, falling back to `3000`.
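+
+The lookup order above can be sketched as a pure function over file contents. The regular expressions are illustrative heuristics, not a specification, and `detectPort` with its parameter names is hypothetical; reading the files from disk is left to the caller.
+
+```javascript
+// Returns the first port found, checking sources in priority order:
+// instruction files, then the package.json dev script, then .env, then 3000.
+function detectPort({ instructionText = '', packageJsonText = '', envText = '' } = {}) {
+  // 1. Instruction files: look for "localhost:NNNN" or "port NNNN" (heuristic).
+  const fromInstructions = instructionText.match(/(?:localhost:|port[ =:]+)(\d{2,5})/i);
+  if (fromInstructions) return Number(fromInstructions[1]);
+
+  // 2. package.json: look for a port flag or PORT= in the dev script (heuristic).
+  try {
+    const devScript = JSON.parse(packageJsonText).scripts?.dev ?? '';
+    const fromScript = devScript.match(/(?:--port[ =]|-p |PORT=)(\d{2,5})/);
+    if (fromScript) return Number(fromScript[1]);
+  } catch {
+    // Missing or invalid package.json: fall through to the next source.
+  }
+
+  // 3. .env: a line of the form PORT=NNNN.
+  const fromEnv = envText.match(/^PORT=(\d{2,5})\s*$/m);
+  if (fromEnv) return Number(fromEnv[1]);
+
+  // 4. Fallback.
+  return 3000;
+}
+```
+
+For example, `detectPort({ envText: 'PORT=4000' })` yields `4000`, and calling it with no arguments falls back to `3000`.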
+
+**Console errors:** Check browser console output for uncaught JavaScript errors and CORS violations. These often reveal the root cause of UI bugs before any code tracing is needed.
+
+**Network tab:** Check for failed API requests, unexpected response codes, or missing CORS headers. A 422 or 500 response from the backend narrows the investigation immediately.