feat(testing): close the testing gap in ce:work, ce:plan, and testing-reviewer (#438)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 13:07:05 -07:00
parent a301a08205
commit 35678b8add
8 changed files with 399 additions and 6 deletions
--- a/docs/plans/2026-03-29-001-feat-testing-addressed-gate-plan.md
+++ b/docs/plans/2026-03-29-001-feat-testing-addressed-gate-plan.md
@@ -0,0 +1,239 @@
+---
+title: "feat: Close the testing gap in ce:work, ce:plan, and testing-reviewer"
+type: feat
+status: active
+date: 2026-03-29
+origin: docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md
+---
+
+# feat: Close the testing gap in ce:work, ce:plan, and testing-reviewer
+
+## Overview
+
+Targeted edits to three skill/agent files to make "no tests" a deliberate decision rather than an accidental omission. Adds per-task testing deliberation in ce:work's execution loop, blank-test-scenarios handling in ce:plan's review, and a missing-test-pattern check in the testing-reviewer agent. Ships with contract tests following the existing repo pattern.
+
+## Problem Frame
+
+ce:work has thorough testing instructions but two narrow gaps let untested behavioral changes slip through silently: the quality gate says "All tests pass" (vacuously true with no tests), and ce:plan allows blank test scenarios without annotation. The testing-reviewer catches some gaps after the fact but doesn't flag the broad pattern of behavioral changes with zero test additions. (see origin: docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md)
+
+## Requirements Trace
+
+- R1. ce:plan units with no test scenarios should annotate why, not leave the field blank
+- R2. Blank test scenarios on feature-bearing units treated as incomplete in Phase 5.1 review
+- R3. Per-task testing deliberation in ce:work's execution loop before marking a task done
+- R4. Quality checklist and Final Validation updated from "Tests pass" to "Testing addressed"
+- R5. Apply R3 and R4 to ce:work-beta with explicit sync decision
+- R6. testing-reviewer adds a check for behavioral changes with no corresponding test additions
+- R7. New check complements existing checks (untested branches, weak assertions, brittle tests, missing edge cases)
+- R8. Contract tests verifying each behavioral change ships as intended
+
+## Scope Boundaries
+
+- Prompt-level changes only -- no CI enforcement, no programmatic gates
+- No new abstractions (no "testing assessment artifacts" or structured output schemas)
+- No changes to testing-reviewer's output format (findings JSON stays the same)
+- Deliberate test omission with justification is a valid outcome
+
+## Context & Research
+
+### Relevant Code and Patterns
+
+- `plugins/compound-engineering/skills/ce-plan/SKILL.md` — Phase 5.1 review checklist at lines 583-601, test scenario quality checks at lines 591-592. Two edit sites: instruction prose for Test scenarios at line 339 (section 3.5), and plan output template with HTML comment at line 499
+- `plugins/compound-engineering/skills/ce-work/SKILL.md` — Phase 2 task loop at lines ~143-155, Final Validation at lines 287-295 ("All tests pass"), Quality Checklist at lines 427-443 ("Tests pass (run project's test command)")
+- `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` — Identical loop/checklist structure. Final Validation at lines 296-304, Quality Checklist at lines 500-516
+- `plugins/compound-engineering/agents/review/testing-reviewer.md` — 4 existing checks in "What you're hunting for" (lines 15-20), confidence calibration (lines 22-29), output format (lines 37-48)
+- `tests/pipeline-review-contract.test.ts` — Contract tests for ce:work, ce:work-beta, ce:brainstorm, ce:plan using `readRepoFile()` + `toContain`/`not.toContain` assertions
+- `tests/review-skill-contract.test.ts` — Contract tests for ce:review agent using same pattern, includes frontmatter parsing and cross-file schema alignment
+
+### Institutional Learnings
+
+- Beta-to-stable sync must be explicit per AGENTS.md (lines 161-163). The existing `pipeline-review-contract.test.ts` already tests ce:work-beta mirrors ce:work's review contract — follow same pattern.
+- Skill review checklist warns against contradictory rules across phases — the new "testing deliberation" must complement, not contradict, existing "Run tests after changes" instruction.
+- Use negative assertions (`not.toContain`) to prevent regression — assert old "Tests pass" / "All tests pass" language is fully replaced.
+
+## Key Technical Decisions
+
+- **Testing deliberation goes after "Run tests after changes" in the loop**: This is the natural deliberation point — tests have just run (or not), and the agent should assess whether testing was adequately addressed before marking the task done. Placing it earlier (before test execution) would be premature; placing it at "Mark task as completed" would intermingle it with completion bookkeeping.
+- **Annotation uses existing template field, not a new field**: `Test expectation: none -- [reason]` goes in the Test scenarios section rather than adding a new template field. This keeps the template stable and leverages the existing Phase 5.1 check surface.
+- **New testing-reviewer check is a 5th bullet, not a replacement**: It's conceptually distinct from check #1 (untested branches within new code). Check #1 looks at branch coverage within tests that exist; the new check flags when no tests exist at all for behavioral changes.
+- **Contract tests extend existing files**: New ce:work/ce:plan assertions go in `pipeline-review-contract.test.ts`. Testing-reviewer assertion goes in `review-skill-contract.test.ts`. This follows the established convention rather than creating a new file.
+
+## Open Questions
+
+### Resolved During Planning
+
+- **Where does testing deliberation go in the loop?** After "Run tests after changes" (bullet 8) and before "Mark task as completed" (bullet 9). The agent has just run tests or skipped them — now it deliberates.
+- **What annotation format for units with no tests?** `Test expectation: none -- [reason]` in the Test scenarios field. Follows existing template structure.
+- **Where does the new check go in testing-reviewer?** 5th bullet in "What you're hunting for" after the existing 4 checks.
+- **New test file or extend existing?** Extend existing — `pipeline-review-contract.test.ts` for skill changes, `review-skill-contract.test.ts` for the agent change.
+
+### Deferred to Implementation
+
+- Exact wording of the testing deliberation prompt in the execution loop — should be concise and action-oriented, final phrasing determined during implementation
+- Whether the testing-reviewer's "What you don't flag" section needs a corresponding exclusion for non-behavioral changes (config, formatting, comments) — inspect during implementation
+
+## Implementation Units
+
+- [ ] **Unit 1: ce:plan — Blank test scenarios handling**
+
+**Goal:** Make blank test scenarios on feature-bearing units flagged as incomplete during plan review, and establish the annotation convention for units that genuinely need no tests.
+
+**Requirements:** R1, R2
+
+**Dependencies:** None
+
+**Files:**
+- Modify: `plugins/compound-engineering/skills/ce-plan/SKILL.md`
+
+**Approach:**
+- Two edit sites in ce:plan for the annotation convention:
+  - The instruction prose (section 3.5, around line 339) that describes how to write Test scenarios — mention the `Test expectation: none -- [reason]` convention here so the planner agent learns it when reading instructions
+  - The plan output template (around line 499) which contains the HTML comment `<!-- Include only categories that apply to this unit. Omit categories that don't. -->` — update this comment to also show the annotation convention for units with no test scenarios
+- In Phase 5.1 review checklist (after line 592), add a new bullet: blank or missing test scenarios on a feature-bearing unit (as defined by ce:plan's existing Plan Quality Bar language) should be flagged as incomplete
+- In the Phase 5.3.3 confidence-scoring checklist for Implementation Units (around line 717), add a parallel item so the confidence check also catches blank test scenarios
+
+**Patterns to follow:**
+- Existing Phase 5.1 test scenario quality checks at lines 591-592
+- The unit template comment style at line 499
+- ce:plan's existing "feature-bearing unit" terminology in the Plan Quality Bar
+
+**Test scenarios:**
+- Happy path: Plan with a feature-bearing unit that has `Test expectation: none -- config-only change` in test scenarios -> Phase 5.1 review accepts it
+- Error path: Plan with a feature-bearing unit that has a completely blank/absent Test scenarios field -> Phase 5.1 review flags it as incomplete
+- Happy path: Plan with a non-feature-bearing unit (scaffolding, config) that uses the annotation -> accepted without issue
+
+**Verification:**
+- Phase 5.1 checklist explicitly addresses blank test scenarios
+- Plan template comment mentions the `Test expectation: none -- [reason]` convention
+- Confidence scoring checklist includes blank test scenarios as a scoring trigger
+
+---
+
+- [ ] **Unit 2: ce:work and ce:work-beta — Testing deliberation and checklist update**
+
+**Goal:** Add per-task testing deliberation to the execution loop and update both checklist surfaces from "Tests pass" to "Testing addressed."
+
+**Requirements:** R3, R4, R5
+
+**Dependencies:** None
+
+**Files:**
+- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md`
+- Modify: `plugins/compound-engineering/skills/ce-work-beta/SKILL.md`
+
+**Approach:**
+- In the Phase 2 task execution loop (lines ~143-155 in ce:work, ~144-156 in ce:work-beta), add a **new bullet** between "Run tests after changes" and "Mark task as completed". The new bullet should prompt the agent to assess: did this task change behavior? If yes, were tests written or updated? If no tests were added, what is the justification? Keep it concise — 2-3 questions in one bullet, matching the existing loop bullet style. Do not expand into a multi-paragraph section
+- In the Quality Checklist (ce:work line ~433, ce:work-beta line ~506), replace `- [ ] Tests pass (run project's test command)` with `- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)`
+- In the Final Validation (ce:work line ~289, ce:work-beta line ~298), replace `- All tests pass` with `- Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)`
+- Ensure both files receive identical changes
+
+**Sync decision:** Propagating to beta — shared testing deliberation guidance, not experimental delegate-mode behavior.
+
+**Patterns to follow:**
+- Existing execution loop bullet style at lines 138-155
+- Existing Quality Checklist item style (checkbox with parenthetical guidance)
+- The mandatory review pattern (which was also synced identically between stable and beta)
+
+**Test scenarios:**
+- Happy path: ce:work execution loop includes the testing deliberation step in the correct position (after "Run tests" and before "Mark task as completed")
+- Happy path: Quality Checklist contains "Testing addressed" and does not contain "Tests pass (run project's test command)"
+- Happy path: Final Validation contains "Testing addressed" and does not contain "All tests pass"
+- Integration: ce:work-beta has identical testing deliberation and checklist wording as ce:work
+
+**Verification:**
+- Both files contain the testing deliberation step in the execution loop
+- Both files' Quality Checklist and Final Validation use "Testing addressed" language
+- Old "Tests pass" and "All tests pass" language is fully removed from both files
+
+---
+
+- [ ] **Unit 3: testing-reviewer — Behavioral changes with no test additions check**
+
+**Goal:** Add a 5th check to the testing-reviewer agent that flags behavioral code changes in the diff with zero corresponding test additions or modifications.
+
+**Requirements:** R6, R7
+
+**Dependencies:** None
+
+**Files:**
+- Modify: `plugins/compound-engineering/agents/review/testing-reviewer.md`
+
+**Approach:**
+- Add a 5th bold-titled bullet in "What you're hunting for" (after the existing 4th check at line 20). The check should: describe the pattern (behavioral code changes — new logic branches, state mutations, API changes — with zero corresponding test file additions or modifications in the diff), explain what makes it distinct from check #1 (which looks at untested branches *within* code that has tests, while this flags when no tests exist at all), and note that non-behavioral changes (config, formatting, comments, type-only changes) are excluded
+- Consider adding a corresponding item in "What you don't flag" for non-behavioral changes if it adds clarity
+
+**Patterns to follow:**
+- Existing check format: bold title followed by `--` and explanation
+- Existing checks use specific, concrete language ("new `if/else`, `switch`, `try/catch`")
+- Confidence calibration tiers (High 0.80+ when provable from diff alone)
+
+**Test scenarios:**
+- Happy path: testing-reviewer.md "What you're hunting for" section contains the behavioral-changes-with-no-tests check
+- Happy path: Check is described as distinct from existing untested-branches check
+
+**Verification:**
+- testing-reviewer.md has 5 checks in "What you're hunting for" instead of 4
+- The new check specifically addresses "behavioral changes with no corresponding test additions"
+
+---
+
+- [ ] **Unit 4: Contract tests for all changes**
+
+**Goal:** Add contract tests that verify each skill/agent modification ships as intended, following the existing string-assertion pattern.
+
+**Requirements:** R8
+
+**Dependencies:** Units 1, 2, 3
+
+**Files:**
+- Modify: `tests/pipeline-review-contract.test.ts`
+- Modify: `tests/review-skill-contract.test.ts`
+
+**Approach:**
+- In `pipeline-review-contract.test.ts`, extend the existing `ce:work review contract` describe block with new tests:
+  - ce:work includes testing deliberation in execution loop
+  - ce:work Quality Checklist contains "Testing addressed" and does not contain "Tests pass (run project's test command)"
+  - ce:work Final Validation contains "Testing addressed" and does not contain "All tests pass"
+  - ce:work-beta mirrors all testing deliberation and checklist changes
+- In `pipeline-review-contract.test.ts`, extend or add a `ce:plan review contract` test:
+  - ce:plan Phase 5.1 review addresses blank test scenarios on feature-bearing units
+- In `review-skill-contract.test.ts`, add a new describe block for testing-reviewer:
+  - testing-reviewer includes the behavioral-changes-with-no-test-additions check
+
+Use negative assertions (`not.toContain`) for the old checklist language to prevent regression.
+
+**Patterns to follow:**
+- `readRepoFile()` helper + `expect(content).toContain(...)` / `expect(content).not.toContain(...)` in existing contract tests
+- ce:work-beta mirror test pattern at pipeline-review-contract.test.ts lines 39-50
+- `describe`/`test` block naming convention in both files
+
+**Test scenarios:**
+- Happy path: All new contract tests pass after Units 1-3 are complete
+- Error path: Reverting any skill change causes the corresponding contract test to fail (verified by inspection of assertion specificity)
+
+**Verification:**
+- `bun test` passes with the new contract tests
+- Each R3-R7 change surface has at least one contract test assertion
+
+## System-Wide Impact
+
+- **Interaction graph:** These are prompt-level skill edits. No callbacks, middleware, or runtime dependencies. The testing-reviewer is invoked by ce:review which is invoked by ce:work — the chain is: ce:work -> ce:review -> testing-reviewer. Changes to the reviewer's check list affect what ce:review surfaces but not how it surfaces it.
+- **Error propagation:** Not applicable — no runtime error paths. If the testing deliberation prompt is poorly worded, the worst case is the agent ignores it (same as today).
+- **API surface parity:** ce:work and ce:work-beta must remain in sync per AGENTS.md. Contract tests enforce this.
+- **Unchanged invariants:** The testing-reviewer's output format (JSON with `findings`, `residual_risks`, `testing_gaps`) is unchanged. The plan template's structure is unchanged — only the comment and Phase 5.1 checklist are modified.
+
+## Risks & Dependencies
+
+| Risk | Mitigation |
+|------|------------|
+| Testing deliberation prompt is too verbose and gets ignored by the agent | Keep it concise — 2-3 questions, not a paragraph. Match the existing loop bullet style. |
+| Old "Tests pass" language persists in one location, creating contradiction | Negative contract test assertions (`not.toContain`) catch any leftover old language |
+| ce:work-beta drifts from ce:work | Contract tests explicitly assert both files contain identical testing changes |
+
+## Sources & References
+
+- **Origin document:** [docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md](docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md)
+- Related learning: `docs/solutions/skill-design/beta-promotion-orchestration-contract.md`
+- Related learning: `docs/solutions/skill-design/compound-refresh-skill-improvements.md` (avoid contradictory rules across phases)
+- Related test: `tests/pipeline-review-contract.test.ts`
+- Related test: `tests/review-skill-contract.test.ts`