feat(testing): close the testing gap in ce:work, ce:plan, and testing-reviewer (#438)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 13:07:05 -07:00
parent a301a08205
commit 35678b8add
8 changed files with 399 additions and 6 deletions
--- a/docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md
+++ b/docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md
@@ -0,0 +1,82 @@
+---
+date: 2026-03-29
+topic: testing-addressed-gate
+---
+
+# Close the Testing Gap in ce:work and ce:plan
+
+## Problem Frame
+
+ce:work has extensive testing instructions -- test discovery, test-first execution posture, system-wide test checks, and a test scenario completeness checklist. But two narrow gaps let untested behavioral changes slip through silently:
+
+1. **ce:work's quality gate says "All tests pass"** -- which is vacuously true when no tests exist. A passing empty test suite is indistinguishable from a passing comprehensive one. "No tests" can be a deliberate decision or an accidental omission, and the skill doesn't distinguish between the two.
+
+2. **ce:plan allows blank test scenarios without annotation** -- when a plan unit has no test scenarios, it's ambiguous whether the planner assessed testing and determined none were needed, or simply didn't think about it. ce:plan already requires test scenarios for feature-bearing units (Plan Quality Bar, Phase 5.1 review), but non-feature-bearing units legitimately omit them, and the template doesn't require saying so.
+
+The testing-reviewer in ce:review catches some of these after the fact by examining diffs for untested branches and missing edge case coverage. But it doesn't specifically flag the broader pattern: behavioral changes with no corresponding test additions at all.
+
+The existing testing instructions are thorough but generic. The gap isn't volume of instructions -- it's specificity at the right moments. This targets focused changes at three layers: planning (ce:plan annotation), execution (ce:work per-task deliberation), and review (testing-reviewer detection).
+
+## Requirements
+
+**ce:plan -- Handle the Blank Case**
+
+- R1. When a plan unit has no test scenarios, the planner should annotate why (e.g., "Test expectation: none -- config-only, no behavioral change") rather than leaving the field blank
+- R2. A blank or missing test scenarios field on a feature-bearing unit should be treated as incomplete during ce:plan's Phase 5.1 review, not silently accepted
+
+---
+
+**ce:work -- Per-Task Testing Deliberation**
+
+- R3. Before marking a task done, ce:work's execution loop should include an explicit testing deliberation: did this task change behavior? If yes, were tests written or updated? If no tests were added, why not? This is a prompt for deliberation at the point of action, not a formal artifact
+- R4. The Phase 3 quality checklist item "Tests pass (run project's test command)" and the Final Validation item "All tests pass" should both be updated to "Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)"
+- R5. Apply R3 and R4 to ce:work-beta (AGENTS.md requires explicit sync decisions for beta counterparts)
+
+---
+
+**testing-reviewer -- Flag the Missing-Test Pattern**
+
+- R6. The testing-reviewer agent should add a new check: when the diff contains behavioral code changes (new logic branches, state mutations, API changes) with zero corresponding test additions or modifications, flag it as a finding
+- R7. This check complements the existing checks (untested branches, weak assertions, brittle tests, missing edge cases) -- it catches the case those miss: no tests at all for new behavior
+
+**Contract Tests -- Practice What We Preach**
+
+- R8. Add contract tests verifying each behavioral change ships as intended. Following the existing pattern in `pipeline-review-contract.test.ts` and `review-skill-contract.test.ts` (string assertions against skill/agent file content):
+  - ce:work includes per-task testing deliberation in the execution loop (R3)
+  - ce:work checklist says "Testing addressed", not "Tests pass" or "All tests pass" (R4)
+  - ce:work-beta mirrors the testing deliberation and checklist changes (R5)
+  - ce:plan Phase 5.1 review treats blank test scenarios on feature-bearing units as incomplete (R2)
+  - testing-reviewer agent includes the behavioral-changes-with-no-test-additions check (R6)
+
+## Success Criteria
+
+- A diff with behavioral changes and no test changes gets flagged by the testing-reviewer (R6) -- the detective layer catches it on real artifacts
+- ce:plan units without test scenarios either have an explicit annotation or get flagged during plan review (R1-R2) -- the preventive layer operates at planning time
+- ce:work's execution loop prompts testing deliberation per task, and the checklist makes the agent explicitly consider whether testing was addressed, not just whether the suite is green (R3-R4)
+- "No tests needed" with justification remains a valid outcome -- the goal is deliberate decisions, not forced ceremony
+
+## Scope Boundaries
+
+- Not adding CI-level enforcement or programmatic gates -- these are prompt-level changes
+- Not adding new abstractions like "testing assessment artifacts" or structured output schemas
+- Not mandating coverage thresholds or specific testing frameworks
+- Not changing the testing-reviewer's output format -- adding one check within its existing review protocol
+
+## Key Decisions
+
+- **Layered approach -- deliberation + detection**: ce:work's per-task deliberation (R3) prompts the agent to think about testing at the point of action. The testing-reviewer (R6) operates on the actual diff as a backstop. Instruction specificity at the right moment matters -- "did you address testing for this task?" is a much more targeted prompt than "tests pass."
+- **Targeted edits over a new system**: Rather than introducing a "testing assessment gate" abstraction, make focused changes to ce:plan, ce:work, and testing-reviewer that close the identified gaps.
+- **Deliberate omission is a first-class outcome**: "No tests needed" with justification is valid. The goal is making "no tests" a deliberate decision, not an accidental one.
+
+## Outstanding Questions
+
+### Deferred to Planning
+
+- [Affects R1][Technical] What's the lightest-weight annotation for plan units that genuinely need no tests -- a field, a comment, or a convention?
+- [Affects R6][Needs research] Review the testing-reviewer's current check implementation to determine where the new "behavioral changes with no test changes" check fits in its analysis protocol
+- [Affects R3][Technical] Where in ce:work's execution loop (Phase 2 task loop) does the testing deliberation prompt fit -- after "Run tests after changes" or as part of "Mark task as completed"?
+- [Affects R4-R5][Resolved] ce:work's Phase 3 checklist is plaintext markdown in SKILL.md (line ~433 and ~289). ce:work-beta has the same pattern. The change is editing bullet points, no dynamic infrastructure.
+
+## Next Steps
+
+-> `/ce:plan` for structured implementation planning
--- a/docs/plans/2026-03-29-001-feat-testing-addressed-gate-plan.md
+++ b/docs/plans/2026-03-29-001-feat-testing-addressed-gate-plan.md
@@ -0,0 +1,239 @@
+---
+title: "feat: Close the testing gap in ce:work, ce:plan, and testing-reviewer"
+type: feat
+status: active
+date: 2026-03-29
+origin: docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md
+---
+
+# feat: Close the testing gap in ce:work, ce:plan, and testing-reviewer
+
+## Overview
+
+Targeted edits to three skill/agent files to make "no tests" a deliberate decision rather than an accidental omission. Adds per-task testing deliberation in ce:work's execution loop, blank-test-scenarios handling in ce:plan's review, and a missing-test-pattern check in the testing-reviewer agent. Ships with contract tests following the existing repo pattern.
+
+## Problem Frame
+
+ce:work has thorough testing instructions but two narrow gaps let untested behavioral changes slip through silently: the quality gate says "All tests pass" (vacuously true with no tests), and ce:plan allows blank test scenarios without annotation. The testing-reviewer catches some gaps after the fact but doesn't flag the broad pattern of behavioral changes with zero test additions. (see origin: docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md)
+
+## Requirements Trace
+
+- R1. ce:plan units with no test scenarios should annotate why, not leave the field blank
+- R2. Blank test scenarios on feature-bearing units treated as incomplete in Phase 5.1 review
+- R3. Per-task testing deliberation in ce:work's execution loop before marking a task done
+- R4. Quality checklist and Final Validation updated from "Tests pass" to "Testing addressed"
+- R5. Apply R3 and R4 to ce:work-beta with explicit sync decision
+- R6. testing-reviewer adds a check for behavioral changes with no corresponding test additions
+- R7. New check complements existing checks (untested branches, weak assertions, brittle tests, missing edge cases)
+- R8. Contract tests verifying each behavioral change ships as intended
+
+## Scope Boundaries
+
+- Prompt-level changes only -- no CI enforcement, no programmatic gates
+- No new abstractions (no "testing assessment artifacts" or structured output schemas)
+- No changes to testing-reviewer's output format (findings JSON stays the same)
+- Deliberate test omission with justification is a valid outcome
+
+## Context & Research
+
+### Relevant Code and Patterns
+
+- `plugins/compound-engineering/skills/ce-plan/SKILL.md` — Phase 5.1 review checklist at lines 583-601, test scenario quality checks at lines 591-592. Two edit sites: instruction prose for Test scenarios at line 339 (section 3.5), and plan output template with HTML comment at line 499
+- `plugins/compound-engineering/skills/ce-work/SKILL.md` — Phase 2 task loop at lines ~143-155, Final Validation at lines 287-295 ("All tests pass"), Quality Checklist at lines 427-443 ("Tests pass (run project's test command)")
+- `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` — Identical loop/checklist structure. Final Validation at lines 296-304, Quality Checklist at lines 500-516
+- `plugins/compound-engineering/agents/review/testing-reviewer.md` — 4 existing checks in "What you're hunting for" (lines 15-20), confidence calibration (lines 22-29), output format (lines 37-48)
+- `tests/pipeline-review-contract.test.ts` — Contract tests for ce:work, ce:work-beta, ce:brainstorm, ce:plan using `readRepoFile()` + `toContain`/`not.toContain` assertions
+- `tests/review-skill-contract.test.ts` — Contract tests for ce:review agent using same pattern, includes frontmatter parsing and cross-file schema alignment
+
+### Institutional Learnings
+
+- Beta-to-stable sync must be explicit per AGENTS.md (lines 161-163). The existing `pipeline-review-contract.test.ts` already tests ce:work-beta mirrors ce:work's review contract — follow same pattern.
+- Skill review checklist warns against contradictory rules across phases — the new "testing deliberation" must complement, not contradict, existing "Run tests after changes" instruction.
+- Use negative assertions (`not.toContain`) to prevent regression — assert old "Tests pass" / "All tests pass" language is fully replaced.
+
+## Key Technical Decisions
+
+- **Testing deliberation goes after "Run tests after changes" in the loop**: This is the natural deliberation point — tests have just run (or not), and the agent should assess whether testing was adequately addressed before marking the task done. Placing it earlier (before test execution) would be premature; placing it at "Mark task as completed" would intermingle it with completion bookkeeping.
+- **Annotation uses existing template field, not a new field**: `Test expectation: none -- [reason]` goes in the Test scenarios section rather than adding a new template field. This keeps the template stable and leverages the existing Phase 5.1 check surface.
+- **New testing-reviewer check is a 5th bullet, not a replacement**: It's conceptually distinct from check #1 (untested branches within new code). Check #1 looks at branch coverage within tests that exist; the new check flags when no tests exist at all for behavioral changes.
+- **Contract tests extend existing files**: New ce:work/ce:plan assertions go in `pipeline-review-contract.test.ts`. Testing-reviewer assertion goes in `review-skill-contract.test.ts`. This follows the established convention rather than creating a new file.
+
+## Open Questions
+
+### Resolved During Planning
+
+- **Where does testing deliberation go in the loop?** After "Run tests after changes" (bullet 8) and before "Mark task as completed" (bullet 9). The agent has just run tests or skipped them — now it deliberates.
+- **What annotation format for units with no tests?** `Test expectation: none -- [reason]` in the Test scenarios field. Follows existing template structure.
+- **Where does the new check go in testing-reviewer?** 5th bullet in "What you're hunting for" after the existing 4 checks.
+- **New test file or extend existing?** Extend existing — `pipeline-review-contract.test.ts` for skill changes, `review-skill-contract.test.ts` for the agent change.
+
+### Deferred to Implementation
+
+- Exact wording of the testing deliberation prompt in the execution loop — should be concise and action-oriented, final phrasing determined during implementation
+- Whether the testing-reviewer's "What you don't flag" section needs a corresponding exclusion for non-behavioral changes (config, formatting, comments) — inspect during implementation
+
+## Implementation Units
+
+- [ ] **Unit 1: ce:plan — Blank test scenarios handling**
+
+**Goal:** Make blank test scenarios on feature-bearing units flagged as incomplete during plan review, and establish the annotation convention for units that genuinely need no tests.
+
+**Requirements:** R1, R2
+
+**Dependencies:** None
+
+**Files:**
+- Modify: `plugins/compound-engineering/skills/ce-plan/SKILL.md`
+
+**Approach:**
+- Two edit sites in ce:plan for the annotation convention:
+  - The instruction prose (section 3.5, around line 339) that describes how to write Test scenarios — mention the `Test expectation: none -- [reason]` convention here so the planner agent learns it when reading instructions
+  - The plan output template (around line 499) which contains the HTML comment `<!-- Include only categories that apply to this unit. Omit categories that don't. -->` — update this comment to also show the annotation convention for units with no test scenarios
+- In Phase 5.1 review checklist (after line 592), add a new bullet: blank or missing test scenarios on a feature-bearing unit (as defined by ce:plan's existing Plan Quality Bar language) should be flagged as incomplete
+- In the Phase 5.3.3 confidence-scoring checklist for Implementation Units (around line 717), add a parallel item so the confidence check also catches blank test scenarios
+
+**Patterns to follow:**
+- Existing Phase 5.1 test scenario quality checks at lines 591-592
+- The unit template comment style at line 499
+- ce:plan's existing "feature-bearing unit" terminology in the Plan Quality Bar
+
+**Test scenarios:**
+- Happy path: Plan with a feature-bearing unit that has `Test expectation: none -- config-only change` in test scenarios -> Phase 5.1 review accepts it
+- Error path: Plan with a feature-bearing unit that has a completely blank/absent Test scenarios field -> Phase 5.1 review flags it as incomplete
+- Happy path: Plan with a non-feature-bearing unit (scaffolding, config) that uses the annotation -> accepted without issue
+
+**Verification:**
+- Phase 5.1 checklist explicitly addresses blank test scenarios
+- Plan template comment mentions the `Test expectation: none -- [reason]` convention
+- Confidence scoring checklist includes blank test scenarios as a scoring trigger
+
+---
+
+- [ ] **Unit 2: ce:work and ce:work-beta — Testing deliberation and checklist update**
+
+**Goal:** Add per-task testing deliberation to the execution loop and update both checklist surfaces from "Tests pass" to "Testing addressed."
+
+**Requirements:** R3, R4, R5
+
+**Dependencies:** None
+
+**Files:**
+- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md`
+- Modify: `plugins/compound-engineering/skills/ce-work-beta/SKILL.md`
+
+**Approach:**
+- In the Phase 2 task execution loop (lines ~143-155 in ce:work, ~144-156 in ce:work-beta), add a **new bullet** between "Run tests after changes" and "Mark task as completed". The new bullet should prompt the agent to assess: did this task change behavior? If yes, were tests written or updated? If no tests were added, what is the justification? Keep it concise — 2-3 questions in one bullet, matching the existing loop bullet style. Do not expand into a multi-paragraph section
+- In the Quality Checklist (ce:work line ~433, ce:work-beta line ~506), replace `- [ ] Tests pass (run project's test command)` with `- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)`
+- In the Final Validation (ce:work line ~289, ce:work-beta line ~298), replace `- All tests pass` with `- Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)`
+- Ensure both files receive identical changes
+
+**Sync decision:** Propagating to beta — shared testing deliberation guidance, not experimental delegate-mode behavior.
+
+**Patterns to follow:**
+- Existing execution loop bullet style at lines 138-155
+- Existing Quality Checklist item style (checkbox with parenthetical guidance)
+- The mandatory review pattern (which was also synced identically between stable and beta)
+
+**Test scenarios:**
+- Happy path: ce:work execution loop includes the testing deliberation step in the correct position (after "Run tests" and before "Mark task as completed")
+- Happy path: Quality Checklist contains "Testing addressed" and does not contain "Tests pass (run project's test command)"
+- Happy path: Final Validation contains "Testing addressed" and does not contain "All tests pass"
+- Integration: ce:work-beta has identical testing deliberation and checklist wording as ce:work
+
+**Verification:**
+- Both files contain the testing deliberation step in the execution loop
+- Both files' Quality Checklist and Final Validation use "Testing addressed" language
+- Old "Tests pass" and "All tests pass" language is fully removed from both files
+
+---
+
+- [ ] **Unit 3: testing-reviewer — Behavioral changes with no test additions check**
+
+**Goal:** Add a 5th check to the testing-reviewer agent that flags behavioral code changes in the diff with zero corresponding test additions or modifications.
+
+**Requirements:** R6, R7
+
+**Dependencies:** None
+
+**Files:**
+- Modify: `plugins/compound-engineering/agents/review/testing-reviewer.md`
+
+**Approach:**
+- Add a 5th bold-titled bullet in "What you're hunting for" (after the existing 4th check at line 20). The check should: describe the pattern (behavioral code changes — new logic branches, state mutations, API changes — with zero corresponding test file additions or modifications in the diff), explain what makes it distinct from check #1 (which looks at untested branches *within* code that has tests, while this flags when no tests exist at all), and note that non-behavioral changes (config, formatting, comments, type-only changes) are excluded
+- Consider adding a corresponding item in "What you don't flag" for non-behavioral changes if it adds clarity
+
+**Patterns to follow:**
+- Existing check format: bold title followed by `--` and explanation
+- Existing checks use specific, concrete language ("new `if/else`, `switch`, `try/catch`")
+- Confidence calibration tiers (High 0.80+ when provable from diff alone)
+
+**Test scenarios:**
+- Happy path: testing-reviewer.md "What you're hunting for" section contains the behavioral-changes-with-no-tests check
+- Happy path: Check is described as distinct from existing untested-branches check
+
+**Verification:**
+- testing-reviewer.md has 5 checks in "What you're hunting for" instead of 4
+- The new check specifically addresses "behavioral changes with no corresponding test additions"
+
+---
+
+- [ ] **Unit 4: Contract tests for all changes**
+
+**Goal:** Add contract tests that verify each skill/agent modification ships as intended, following the existing string-assertion pattern.
+
+**Requirements:** R8
+
+**Dependencies:** Units 1, 2, 3
+
+**Files:**
+- Modify: `tests/pipeline-review-contract.test.ts`
+- Modify: `tests/review-skill-contract.test.ts`
+
+**Approach:**
+- In `pipeline-review-contract.test.ts`, extend the existing `ce:work review contract` describe block with new tests:
+  - ce:work includes testing deliberation in execution loop
+  - ce:work Quality Checklist contains "Testing addressed" and does not contain "Tests pass (run project's test command)"
+  - ce:work Final Validation contains "Testing addressed" and does not contain "All tests pass"
+  - ce:work-beta mirrors all testing deliberation and checklist changes
+- In `pipeline-review-contract.test.ts`, extend or add a `ce:plan review contract` test:
+  - ce:plan Phase 5.1 review addresses blank test scenarios on feature-bearing units
+- In `review-skill-contract.test.ts`, add a new describe block for testing-reviewer:
+  - testing-reviewer includes the behavioral-changes-with-no-test-additions check
+
+Use negative assertions (`not.toContain`) for the old checklist language to prevent regression.
+
+**Patterns to follow:**
+- `readRepoFile()` helper + `expect(content).toContain(...)` / `expect(content).not.toContain(...)` in existing contract tests
+- ce:work-beta mirror test pattern at pipeline-review-contract.test.ts lines 39-50
+- `describe`/`test` block naming convention in both files
+
+**Test scenarios:**
+- Happy path: All new contract tests pass after Units 1-3 are complete
+- Error path: Reverting any skill change causes the corresponding contract test to fail (verified by inspection of assertion specificity)
+
+**Verification:**
+- `bun test` passes with the new contract tests
+- Each R3-R7 change surface has at least one contract test assertion
+
+## System-Wide Impact
+
+- **Interaction graph:** These are prompt-level skill edits. No callbacks, middleware, or runtime dependencies. The testing-reviewer is invoked by ce:review which is invoked by ce:work — the chain is: ce:work -> ce:review -> testing-reviewer. Changes to the reviewer's check list affect what ce:review surfaces but not how it surfaces it.
+- **Error propagation:** Not applicable — no runtime error paths. If the testing deliberation prompt is poorly worded, the worst case is the agent ignores it (same as today).
+- **API surface parity:** ce:work and ce:work-beta must remain in sync per AGENTS.md. Contract tests enforce this.
+- **Unchanged invariants:** The testing-reviewer's output format (JSON with `findings`, `residual_risks`, `testing_gaps`) is unchanged. The plan template's structure is unchanged — only the comment and Phase 5.1 checklist are modified.
+
+## Risks & Dependencies
+
+| Risk | Mitigation |
+|------|------------|
+| Testing deliberation prompt is too verbose and gets ignored by the agent | Keep it concise — 2-3 questions, not a paragraph. Match the existing loop bullet style. |
+| Old "Tests pass" language persists in one location, creating contradiction | Negative contract test assertions (`not.toContain`) catch any leftover old language |
+| ce:work-beta drifts from ce:work | Contract tests explicitly assert both files contain identical testing changes |
+
+## Sources & References
+
+- **Origin document:** [docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md](docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md)
+- Related learning: `docs/solutions/skill-design/beta-promotion-orchestration-contract.md`
+- Related learning: `docs/solutions/skill-design/compound-refresh-skill-improvements.md` (avoid contradictory rules across phases)
+- Related test: `tests/pipeline-review-contract.test.ts`
+- Related test: `tests/review-skill-contract.test.ts`