feat(testing): close the testing gap in ce:work, ce:plan, and testing-reviewer (#438)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Trevin Chow
2026-03-29 13:07:05 -07:00
committed by GitHub
parent a301a08205
commit 35678b8add
8 changed files with 399 additions and 6 deletions

View File

@@ -0,0 +1,239 @@
---
title: "feat: Close the testing gap in ce:work, ce:plan, and testing-reviewer"
type: feat
status: active
date: 2026-03-29
origin: docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md
---
# feat: Close the testing gap in ce:work, ce:plan, and testing-reviewer
## Overview
Targeted edits to three skill/agent files to make "no tests" a deliberate decision rather than an accidental omission. Adds per-task testing deliberation in ce:work's execution loop, blank-test-scenarios handling in ce:plan's review, and a missing-test-pattern check in the testing-reviewer agent. Ships with contract tests following the existing repo pattern.
## Problem Frame
ce:work has thorough testing instructions but two narrow gaps let untested behavioral changes slip through silently: the quality gate says "All tests pass" (vacuously true with no tests), and ce:plan allows blank test scenarios without annotation. The testing-reviewer catches some gaps after the fact but doesn't flag the broad pattern of behavioral changes with zero test additions. (see origin: docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md)
## Requirements Trace
- R1. ce:plan units with no test scenarios should annotate why, not leave the field blank
- R2. Blank test scenarios on feature-bearing units treated as incomplete in Phase 5.1 review
- R3. Per-task testing deliberation in ce:work's execution loop before marking a task done
- R4. Quality checklist and Final Validation updated from "Tests pass" to "Testing addressed"
- R5. Apply R3 and R4 to ce:work-beta with explicit sync decision
- R6. testing-reviewer adds a check for behavioral changes with no corresponding test additions
- R7. New check complements existing checks (untested branches, weak assertions, brittle tests, missing edge cases)
- R8. Contract tests verifying each behavioral change ships as intended
## Scope Boundaries
- Prompt-level changes only -- no CI enforcement, no programmatic gates
- No new abstractions (no "testing assessment artifacts" or structured output schemas)
- No changes to testing-reviewer's output format (findings JSON stays the same)
- Deliberate test omission with justification is a valid outcome
## Context & Research
### Relevant Code and Patterns
- `plugins/compound-engineering/skills/ce-plan/SKILL.md` — Phase 5.1 review checklist at lines 583-601, test scenario quality checks at lines 591-592. Two edit sites: instruction prose for Test scenarios at line 339 (section 3.5), and plan output template with HTML comment at line 499
- `plugins/compound-engineering/skills/ce-work/SKILL.md` — Phase 2 task loop at lines ~143-155, Final Validation at lines 287-295 ("All tests pass"), Quality Checklist at lines 427-443 ("Tests pass (run project's test command)")
- `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` — Identical loop/checklist structure. Final Validation at lines 296-304, Quality Checklist at lines 500-516
- `plugins/compound-engineering/agents/review/testing-reviewer.md` — 4 existing checks in "What you're hunting for" (lines 15-20), confidence calibration (lines 22-29), output format (lines 37-48)
- `tests/pipeline-review-contract.test.ts` — Contract tests for ce:work, ce:work-beta, ce:brainstorm, ce:plan using `readRepoFile()` + `toContain`/`not.toContain` assertions
- `tests/review-skill-contract.test.ts` — Contract tests for ce:review agent using same pattern, includes frontmatter parsing and cross-file schema alignment
### Institutional Learnings
- Beta-to-stable sync must be explicit per AGENTS.md (lines 161-163). The existing `pipeline-review-contract.test.ts` already tests ce:work-beta mirrors ce:work's review contract — follow same pattern.
- Skill review checklist warns against contradictory rules across phases — the new "testing deliberation" must complement, not contradict, existing "Run tests after changes" instruction.
- Use negative assertions (`not.toContain`) to prevent regression — assert old "Tests pass" / "All tests pass" language is fully replaced.
## Key Technical Decisions
- **Testing deliberation goes after "Run tests after changes" in the loop**: This is the natural deliberation point — tests have just run (or not), and the agent should assess whether testing was adequately addressed before marking the task done. Placing it earlier (before test execution) would be premature; placing it at "Mark task as completed" would intermingle it with completion bookkeeping.
- **Annotation uses existing template field, not a new field**: `Test expectation: none -- [reason]` goes in the Test scenarios section rather than adding a new template field. This keeps the template stable and leverages the existing Phase 5.1 check surface.
- **New testing-reviewer check is a 5th bullet, not a replacement**: It's conceptually distinct from check #1 (untested branches within new code). Check #1 looks at branch coverage within tests that exist; the new check flags when no tests exist at all for behavioral changes.
- **Contract tests extend existing files**: New ce:work/ce:plan assertions go in `pipeline-review-contract.test.ts`. Testing-reviewer assertion goes in `review-skill-contract.test.ts`. This follows the established convention rather than creating a new file.
## Open Questions
### Resolved During Planning
- **Where does testing deliberation go in the loop?** After "Run tests after changes" (bullet 8) and before "Mark task as completed" (bullet 9). The agent has just run tests or skipped them — now it deliberates.
- **What annotation format for units with no tests?** `Test expectation: none -- [reason]` in the Test scenarios field. Follows existing template structure.
- **Where does the new check go in testing-reviewer?** 5th bullet in "What you're hunting for" after the existing 4 checks.
- **New test file or extend existing?** Extend existing — `pipeline-review-contract.test.ts` for skill changes, `review-skill-contract.test.ts` for the agent change.
### Deferred to Implementation
- Exact wording of the testing deliberation prompt in the execution loop — should be concise and action-oriented, final phrasing determined during implementation
- Whether the testing-reviewer's "What you don't flag" section needs a corresponding exclusion for non-behavioral changes (config, formatting, comments) — inspect during implementation
## Implementation Units
- [ ] **Unit 1: ce:plan — Blank test scenarios handling**
**Goal:** Make blank test scenarios on feature-bearing units flagged as incomplete during plan review, and establish the annotation convention for units that genuinely need no tests.
**Requirements:** R1, R2
**Dependencies:** None
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-plan/SKILL.md`
**Approach:**
- Two edit sites in ce:plan for the annotation convention:
- The instruction prose (section 3.5, around line 339) that describes how to write Test scenarios — mention the `Test expectation: none -- [reason]` convention here so the planner agent learns it when reading instructions
- The plan output template (around line 499) which contains the HTML comment `<!-- Include only categories that apply to this unit. Omit categories that don't. -->` — update this comment to also show the annotation convention for units with no test scenarios
- In Phase 5.1 review checklist (after line 592), add a new bullet: blank or missing test scenarios on a feature-bearing unit (as defined by ce:plan's existing Plan Quality Bar language) should be flagged as incomplete
- In the Phase 5.3.3 confidence-scoring checklist for Implementation Units (around line 717), add a parallel item so the confidence check also catches blank test scenarios
**Patterns to follow:**
- Existing Phase 5.1 test scenario quality checks at lines 591-592
- The unit template comment style at line 499
- ce:plan's existing "feature-bearing unit" terminology in the Plan Quality Bar
**Test scenarios:**
- Happy path: Plan with a feature-bearing unit that has `Test expectation: none -- config-only change` in test scenarios -> Phase 5.1 review accepts it
- Error path: Plan with a feature-bearing unit that has a completely blank/absent Test scenarios field -> Phase 5.1 review flags it as incomplete
- Happy path: Plan with a non-feature-bearing unit (scaffolding, config) that uses the annotation -> accepted without issue
**Verification:**
- Phase 5.1 checklist explicitly addresses blank test scenarios
- Plan template comment mentions the `Test expectation: none -- [reason]` convention
- Confidence scoring checklist includes blank test scenarios as a scoring trigger
---
- [ ] **Unit 2: ce:work and ce:work-beta — Testing deliberation and checklist update**
**Goal:** Add per-task testing deliberation to the execution loop and update both checklist surfaces from "Tests pass" to "Testing addressed."
**Requirements:** R3, R4, R5
**Dependencies:** None
**Files:**
- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md`
- Modify: `plugins/compound-engineering/skills/ce-work-beta/SKILL.md`
**Approach:**
- In the Phase 2 task execution loop (lines ~143-155 in ce:work, ~144-156 in ce:work-beta), add a **new bullet** between "Run tests after changes" and "Mark task as completed". The new bullet should prompt the agent to assess: did this task change behavior? If yes, were tests written or updated? If no tests were added, what is the justification? Keep it concise — 2-3 questions in one bullet, matching the existing loop bullet style. Do not expand into a multi-paragraph section
- In the Quality Checklist (ce:work line ~433, ce:work-beta line ~506), replace `- [ ] Tests pass (run project's test command)` with `- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)`
- In the Final Validation (ce:work line ~289, ce:work-beta line ~298), replace `- All tests pass` with `- Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)`
- Ensure both files receive identical changes
**Sync decision:** Propagating to beta — shared testing deliberation guidance, not experimental delegate-mode behavior.
**Patterns to follow:**
- Existing execution loop bullet style at lines 138-155
- Existing Quality Checklist item style (checkbox with parenthetical guidance)
- The mandatory review pattern (which was also synced identically between stable and beta)
**Test scenarios:**
- Happy path: ce:work execution loop includes the testing deliberation step in the correct position (after "Run tests" and before "Mark task as completed")
- Happy path: Quality Checklist contains "Testing addressed" and does not contain "Tests pass (run project's test command)"
- Happy path: Final Validation contains "Testing addressed" and does not contain "All tests pass"
- Integration: ce:work-beta has identical testing deliberation and checklist wording as ce:work
**Verification:**
- Both files contain the testing deliberation step in the execution loop
- Both files' Quality Checklist and Final Validation use "Testing addressed" language
- Old "Tests pass" and "All tests pass" language is fully removed from both files
---
- [ ] **Unit 3: testing-reviewer — Behavioral changes with no test additions check**
**Goal:** Add a 5th check to the testing-reviewer agent that flags behavioral code changes in the diff with zero corresponding test additions or modifications.
**Requirements:** R6, R7
**Dependencies:** None
**Files:**
- Modify: `plugins/compound-engineering/agents/review/testing-reviewer.md`
**Approach:**
- Add a 5th bold-titled bullet in "What you're hunting for" (after the existing 4th check at line 20). The check should: describe the pattern (behavioral code changes — new logic branches, state mutations, API changes — with zero corresponding test file additions or modifications in the diff), explain what makes it distinct from check #1 (which looks at untested branches *within* code that has tests, while this flags when no tests exist at all), and note that non-behavioral changes (config, formatting, comments, type-only changes) are excluded
- Consider adding a corresponding item in "What you don't flag" for non-behavioral changes if it adds clarity
**Patterns to follow:**
- Existing check format: bold title followed by `--` and explanation
- Existing checks use specific, concrete language ("new `if/else`, `switch`, `try/catch`")
- Confidence calibration tiers (High 0.80+ when provable from diff alone)
**Test scenarios:**
- Happy path: testing-reviewer.md "What you're hunting for" section contains the behavioral-changes-with-no-tests check
- Happy path: Check is described as distinct from existing untested-branches check
**Verification:**
- testing-reviewer.md has 5 checks in "What you're hunting for" instead of 4
- The new check specifically addresses "behavioral changes with no corresponding test additions"
---
- [ ] **Unit 4: Contract tests for all changes**
**Goal:** Add contract tests that verify each skill/agent modification ships as intended, following the existing string-assertion pattern.
**Requirements:** R8
**Dependencies:** Units 1, 2, 3
**Files:**
- Modify: `tests/pipeline-review-contract.test.ts`
- Modify: `tests/review-skill-contract.test.ts`
**Approach:**
- In `pipeline-review-contract.test.ts`, extend the existing `ce:work review contract` describe block with new tests:
- ce:work includes testing deliberation in execution loop
- ce:work Quality Checklist contains "Testing addressed" and does not contain "Tests pass (run project's test command)"
- ce:work Final Validation contains "Testing addressed" and does not contain "All tests pass"
- ce:work-beta mirrors all testing deliberation and checklist changes
- In `pipeline-review-contract.test.ts`, extend or add a `ce:plan review contract` test:
- ce:plan Phase 5.1 review addresses blank test scenarios on feature-bearing units
- In `review-skill-contract.test.ts`, add a new describe block for testing-reviewer:
- testing-reviewer includes the behavioral-changes-with-no-test-additions check
Use negative assertions (`not.toContain`) for the old checklist language to prevent regression.
**Patterns to follow:**
- `readRepoFile()` helper + `expect(content).toContain(...)` / `expect(content).not.toContain(...)` in existing contract tests
- ce:work-beta mirror test pattern at pipeline-review-contract.test.ts lines 39-50
- `describe`/`test` block naming convention in both files
**Test scenarios:**
- Happy path: All new contract tests pass after Units 1-3 are complete
- Error path: Reverting any skill change causes the corresponding contract test to fail (verified by inspection of assertion specificity)
**Verification:**
- `bun test` passes with the new contract tests
- Each R3-R7 change surface has at least one contract test assertion
## System-Wide Impact
- **Interaction graph:** These are prompt-level skill edits. No callbacks, middleware, or runtime dependencies. The testing-reviewer is invoked by ce:review which is invoked by ce:work — the chain is: ce:work -> ce:review -> testing-reviewer. Changes to the reviewer's check list affect what ce:review surfaces but not how it surfaces it.
- **Error propagation:** Not applicable — no runtime error paths. If the testing deliberation prompt is poorly worded, the worst case is the agent ignores it (same as today).
- **API surface parity:** ce:work and ce:work-beta must remain in sync per AGENTS.md. Contract tests enforce this.
- **Unchanged invariants:** The testing-reviewer's output format (JSON with `findings`, `residual_risks`, `testing_gaps`) is unchanged. The plan template's structure is unchanged — only the comment and Phase 5.1 checklist are modified.
## Risks & Dependencies
| Risk | Mitigation |
|------|------------|
| Testing deliberation prompt is too verbose and gets ignored by the agent | Keep it concise — 2-3 questions, not a paragraph. Match the existing loop bullet style. |
| Old "Tests pass" language persists in one location, creating contradiction | Negative contract test assertions (`not.toContain`) catch any leftover old language |
| ce:work-beta drifts from ce:work | Contract tests explicitly assert both files contain identical testing changes |
## Sources & References
- **Origin document:** [docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md](docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md)
- Related learning: `docs/solutions/skill-design/beta-promotion-orchestration-contract.md`
- Related learning: `docs/solutions/skill-design/compound-refresh-skill-improvements.md` (avoid contradictory rules across phases)
- Related test: `tests/pipeline-review-contract.test.ts`
- Related test: `tests/review-skill-contract.test.ts`