claude-engineering-plugin/plugins/compound-engineering/skills/ce-debug/references/investigation-techniques.md
Trevin Chow e38223ae91 feat(ce-debug): add systematic debugging skill (#543)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 10:23:48 -07:00


# Investigation Techniques
Techniques for deeper investigation when standard code tracing is not enough. Load this when a bug does not reproduce reliably, involves timing or concurrency, or requires framework-specific tracing.
---
## Root-Cause Tracing
When a bug manifests deep in the call stack, the instinct is to fix where the error appears. That treats a symptom. Instead, trace backward through the call chain to find where the bad state originated.
**Backward tracing:**
- Start at the error
- At each level, ask: where did this value come from? Who called this function? What state was passed in?
- Keep going upstream until finding the point where valid state first became invalid — that is the root cause
**Worked example:**
```
Symptom: API returns 500 with "Cannot read property 'email' of undefined"
Where it crashes: NotificationService.sendWelcomeEmail(user) reads user.email
Who called this? UserController.create() after saving the user record
What was passed? user = await UserRepo.create(params) — but create() returns undefined on duplicate key
Original cause: UserRepo.create() silently swallows duplicate key errors and returns undefined instead of throwing
```
The fix belongs at the origin (UserRepo.create should throw on duplicate key), not where the error appeared (NotificationService).
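A minimal sketch of that fix at the origin. The in-memory `Map` storage and the `DuplicateKeyError` class are illustrative assumptions, not the real repository code:

```javascript
// Hypothetical error type; a real ORM or driver would surface its own duplicate-key error.
class DuplicateKeyError extends Error {
  constructor(key) {
    super(`duplicate key: ${key}`);
    this.name = 'DuplicateKeyError';
  }
}

// In-memory stand-in for the UserRepo from the worked example.
class UserRepo {
  constructor() {
    this.byEmail = new Map();
  }

  create(params) {
    if (this.byEmail.has(params.email)) {
      // Fail at the origin instead of handing undefined to the caller.
      throw new DuplicateKeyError(params.email);
    }
    const user = { id: this.byEmail.size + 1, ...params };
    this.byEmail.set(params.email, user);
    return user;
  }
}
```

With this shape, a second `create()` for the same email throws inside the repository, so the stack trace points at the duplicate key rather than at a later property read.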
**When manual tracing stalls**, add instrumentation:
```
// Before the problematic operation; `value` is whatever state is under suspicion
const stack = new Error().stack; // captures the call chain that led here
console.error('DEBUG [operation]:', { value, cwd: process.cwd(), stack });
```
In tests, use `console.error()` rather than the application logger, whose output may be suppressed. Log before the dangerous operation, not after it fails, so the state is captured even when the operation throws.
---
## Git Bisect for Regressions
When a bug is a regression ("it worked before"), use binary search to find the breaking commit:
```bash
git bisect start
git bisect bad # current commit is broken
git bisect good <known-good-ref> # a commit where it worked
# git bisect will checkout a middle commit — test it
# mark as good or bad, repeat until the breaking commit is found
git bisect reset # return to original branch when done
```
For automated bisection with a test script:
```bash
git bisect start HEAD <known-good-ref>
git bisect run <test-command>
```
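`git bisect run` reads the test command's exit code: 0 marks the commit good, 1 through 127 (except 125) marks it bad, and 125 skips a commit that cannot be tested, such as one that fails to build. Any other code aborts the bisect.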
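One way to keep untestable commits from being blamed is to separate "could not build" from "test failed". A sketch of that mapping follows; `bisectExitCode` and the `npm` commands in the usage comment are hypothetical placeholders, while the special meaning of exit code 125 (skip this commit) comes from `git bisect run` itself:

```javascript
// Map the outcome of one bisect step to the exit code `git bisect run` expects.
// buildStatus and testStatus are the exit codes of hypothetical build and test commands.
function bisectExitCode(buildStatus, testStatus) {
  if (buildStatus !== 0) return 125; // commit cannot be tested: ask bisect to skip it
  return testStatus === 0 ? 0 : 1; // 0 marks the commit good, 1 marks it bad
}

// Usage inside a script passed to `git bisect run` (commands are placeholders):
//   const { spawnSync } = require('node:child_process');
//   const build = spawnSync('npm', ['run', 'build'], { stdio: 'inherit' });
//   const test = build.status === 0
//     ? spawnSync('npm', ['test'], { stdio: 'inherit' })
//     : { status: 1 };
//   process.exit(bisectExitCode(build.status, test.status));
```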
---
## Intermittent Bug Techniques
When a bug does not reproduce reliably after 2-3 attempts:
**Logging traps.** Add targeted logging at the suspected failure point and run the scenario repeatedly. Capture the state that differs between passing and failing runs.
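Comparing captured state by eye gets error-prone once the snapshots grow. A small sketch of the comparison step, assuming each run logs a plain object of state (`diffSnapshots` is a hypothetical helper name):

```javascript
// Return the keys whose values differ between two captured state snapshots.
// Values are compared by their JSON form, which is enough for plain logged data.
function diffSnapshots(passing, failing) {
  const keys = new Set([...Object.keys(passing), ...Object.keys(failing)]);
  const diff = {};
  for (const key of keys) {
    const a = JSON.stringify(passing[key]);
    const b = JSON.stringify(failing[key]);
    if (a !== b) diff[key] = { passing: passing[key], failing: failing[key] };
  }
  return diff;
}
```

Keys that appear in only one snapshot show up with `undefined` on the other side, which is often the interesting case.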
**Statistical reproduction.** Run the failing scenario in a loop to establish a reproduction rate:
```bash
for i in $(seq 1 20); do echo "Run $i:"; <test-command> && echo "PASS" || echo "FAIL"; done
```
Even a low rate, such as 1 failure in 20 runs (5%), confirms the bug exists and points toward timing or data sensitivity.
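The same measurement can be scripted, and the measured rate then says how many clean runs are needed before a fix is believable. A sketch under the assumption that runs are independent; `reproductionRate` and `runsToVerifyFix` are hypothetical helper names:

```javascript
// Run a scenario n times and report the fraction of runs that fail.
// `scenario` is a hypothetical function that throws on failure.
function reproductionRate(scenario, n) {
  let failures = 0;
  for (let i = 0; i < n; i++) {
    try {
      scenario(i);
    } catch {
      failures += 1;
    }
  }
  return failures / n;
}

// Consecutive passing runs needed before a fix is credible. If the bug still
// failed at `rate`, n straight passes would occur with probability (1 - rate)^n,
// so solve (1 - rate)^n <= 1 - confidence for n.
function runsToVerifyFix(rate, confidence = 0.95) {
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - rate));
}
```

At a 5% rate, 20 runs miss the bug about a third of the time (0.95^20 ≈ 0.36), and `runsToVerifyFix(0.05)` gives 59 consecutive passes for 95% confidence.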
**Environment isolation.** Systematically eliminate variables:
- Same test, different machine?
- Same test, different data seed?
- Same test, serial vs parallel execution?
- Same test, with vs without network access?
**Data-dependent triggers.** If the bug only appears with certain data, identify the trigger condition:
- What is unique about the failing input?
- Does the input size, encoding, or edge value matter?
- Is the data order significant (sorted vs random)?
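
When the failing input is a list of records, one way to find what is unique about it is to shrink it mechanically. The sketch below is a simple greedy reduction, not full delta debugging; `stillFails` is a hypothetical predicate that re-runs the scenario and returns true when the bug appears:

```javascript
// Greedy input shrinking: drop one element at a time and keep the removal
// whenever the reduced input still triggers the failure.
function shrinkFailingInput(input, stillFails) {
  let current = [...input];
  let i = 0;
  while (i < current.length) {
    const candidate = [...current.slice(0, i), ...current.slice(i + 1)];
    if (stillFails(candidate)) {
      current = candidate; // element i was irrelevant; retry the same index
    } else {
      i += 1; // element i is needed to trigger the bug; keep it
    }
  }
  return current;
}
```

Because relative order is preserved, the result also shows whether order matters: if the shrunken input stops failing when reordered, the trigger is order-dependent.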
---
## Framework-Specific Debugging
### Rails
- Check callbacks: model callbacks (`before_save`, `after_commit`) and controller filters (`around_action`) execute implicitly and can alter state
- Check middleware chain: `bin/rails middleware` lists the full stack (`rake middleware` on Rails 4 and earlier)
- Check Active Record query generation: `.to_sql` on any relation
- Use `Rails.logger.debug` with tagged logging for request tracing
### Node.js
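- Async stack traces: enabled by default since Node.js 12 (older versions need the `--async-stack-traces` flag), but only across `await`; frames are lost through raw `.then()` chains and callbacks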
- Unhandled rejections: check for missing `.catch()` or `await` on promises
- Event loop delays: `process.hrtime.bigint()` before and after suspect operations, or `perf_hooks.monitorEventLoopDelay()` for a running histogram
- Memory leaks: `--inspect` flag + Chrome DevTools heap snapshots
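
The timing check above can be wrapped once and reused. A minimal sketch for synchronous operations, which are the ones that block the event loop; `timeSync` is a hypothetical helper name:

```javascript
// Time a synchronous suspect operation with the monotonic high-resolution clock.
function timeSync(label, operation) {
  const start = process.hrtime.bigint(); // nanoseconds as a BigInt
  const result = operation();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.error(`DEBUG [${label}]: ${elapsedMs.toFixed(2)} ms`);
  return { result, elapsedMs };
}
```

Anything that spends more than a few milliseconds here delays every other callback waiting on the loop for that long.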
### Python
- Traceback enrichment: `traceback.print_exc()` in except blocks
- `pdb.set_trace()` or `breakpoint()` for interactive debugging
- `sys.settrace()` for execution tracing
- `logging.basicConfig(level=logging.DEBUG)` for verbose output
---
## Race Condition Investigation
When timing or concurrency is suspected:
**Timing isolation.** Add deliberate delays at suspect points to widen the race window and make it reproducible:
```
// Simulate slow operation to expose race
await new Promise(r => setTimeout(r, 100));
```
**Shared mutable state.** Search for variables, caches, or database rows accessed by multiple threads or processes without synchronization. Common patterns:
- Global or module-level mutable state
- Cache reads without locks
- Database rows read then updated without optimistic locking
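
The read-then-update pattern can be illustrated with a version check. This is a sketch against a hypothetical in-memory row; in a real database the check and the write have to be one atomic statement, for example an `UPDATE` whose `WHERE` clause includes the expected version:

```javascript
// Hypothetical in-memory row with a version column.
const row = { balance: 100, version: 1 };

// Write only if the row is still at the version the caller read.
// Returns false when another writer got there first (caller should re-read and retry).
function updateIfUnchanged(target, expectedVersion, changes) {
  if (target.version !== expectedVersion) return false;
  Object.assign(target, changes, { version: expectedVersion + 1 });
  return true;
}

// Two workers read the same snapshot, then both try to write.
const readA = { ...row };
const readB = { ...row };
const wroteA = updateIfUnchanged(row, readA.version, { balance: readA.balance - 30 });
const wroteB = updateIfUnchanged(row, readB.version, { balance: readB.balance - 50 });
```

Without the version check, the second write would silently overwrite the first and the balance would end at 50 instead of reflecting both withdrawals. With it, the second writer is told to retry.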
**Async ordering.** Check whether operations assume a specific execution order that is not guaranteed:
- Promise.all with dependent operations
- Event handlers that assume emission order
- Database writes that assume an earlier read is still current
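
The `Promise.all` case is the easiest to reproduce in isolation. A sketch using a hypothetical in-memory store and artificial delays; `createUser`, `createProfile`, and the timings are assumptions chosen to make the race deterministic:

```javascript
// Hypothetical in-memory store standing in for a database.
const users = new Map();
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function createUser(id) {
  await delay(20); // simulated slow insert
  users.set(id, { id });
}

async function createProfile(userId) {
  await delay(5); // faster than createUser, so it runs first
  if (!users.has(userId)) throw new Error(`user ${userId} does not exist yet`);
  return { userId };
}

// Racy: both operations start at once, but createProfile depends on createUser.
function racy(id) {
  return Promise.all([createUser(id), createProfile(id)]);
}

// Fixed: await the dependency before starting the dependent operation.
async function sequenced(id) {
  await createUser(id);
  return createProfile(id);
}
```

`racy()` rejects because the profile lookup runs before the user insert lands; `sequenced()` resolves. Reserve `Promise.all` for operations that are truly independent of each other.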
---
## Browser Debugging
When investigating UI bugs with `agent-browser` or equivalent tools:
```bash
# Open the affected page
agent-browser open http://localhost:${PORT:-3000}/affected/route
# Capture current state
agent-browser snapshot -i
# Interact with the page
agent-browser click @ref # click an element
agent-browser fill @ref "text" # fill a form field
agent-browser snapshot -i # capture state after interaction
# Save visual evidence
agent-browser screenshot bug-evidence.png
```
**Port detection:** Check project instruction files (`AGENTS.md`, `CLAUDE.md`) for port references, then `package.json` dev scripts, then `.env` files, falling back to `3000`.
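That lookup order can be sketched as a function over the file contents. Everything here is an illustrative assumption: the `detectPort` name, the argument shape (file text, or an empty string when the file is absent), and the regular expressions, which only cover common spellings such as `localhost:4000`, `--port 4000`, `-p 4000`, and `PORT=4000`:

```javascript
// Hypothetical helper: each argument is the text of a file, or '' if absent.
function detectPort({ instructions = '', packageJson = '', env = '' } = {}) {
  // 1. Instruction files: look for a localhost URL with an explicit port.
  const fromDocs = instructions.match(/localhost:(\d{2,5})/);
  if (fromDocs) return Number(fromDocs[1]);

  // 2. package.json dev script: look for --port 4000, -p 4000, or PORT=4000.
  if (packageJson) {
    const dev = (JSON.parse(packageJson).scripts || {}).dev || '';
    const fromScript = dev.match(/(?:--port[= ]|-p |PORT=)(\d{2,5})/);
    if (fromScript) return Number(fromScript[1]);
  }

  // 3. .env file: a PORT=... line.
  const fromEnv = env.match(/^PORT=(\d{2,5})\s*$/m);
  if (fromEnv) return Number(fromEnv[1]);

  // 4. Fallback.
  return 3000;
}
```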
**Console errors:** Check browser console output for JavaScript errors, failed network requests, and CORS issues. These often reveal the root cause of UI bugs before any code tracing is needed.
**Network tab:** Check for failed API requests, unexpected response codes, or missing CORS headers. A 422 or 500 response from the backend narrows the investigation immediately.