diff --git a/docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md b/docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md new file mode 100644 index 0000000..54219b2 --- /dev/null +++ b/docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md @@ -0,0 +1,82 @@ +--- +date: 2026-03-29 +topic: testing-addressed-gate +--- + +# Close the Testing Gap in ce:work and ce:plan + +## Problem Frame + +ce:work has extensive testing instructions -- test discovery, test-first execution posture, system-wide test checks, and a test scenario completeness checklist. But two narrow gaps let untested behavioral changes slip through silently: + +1. **ce:work's quality gate says "All tests pass"** -- which is vacuously true when no tests exist. A passing empty test suite is indistinguishable from a passing comprehensive one. "No tests" can be a deliberate decision or an accidental omission, and the skill doesn't distinguish between the two. + +2. **ce:plan allows blank test scenarios without annotation** -- when a plan unit has no test scenarios, it's ambiguous whether the planner assessed testing and determined none were needed, or simply didn't think about it. ce:plan already requires test scenarios for feature-bearing units (Plan Quality Bar, Phase 5.1 review), but non-feature-bearing units legitimately omit them, and the template doesn't require saying so. + +The testing-reviewer in ce:review catches some of these after the fact by examining diffs for untested branches and missing edge case coverage. But it doesn't specifically flag the broader pattern: behavioral changes with no corresponding test additions at all. + +The existing testing instructions are thorough but generic. The gap isn't volume of instructions -- it's specificity at the right moments. This targets focused changes at three layers: planning (ce:plan annotation), execution (ce:work per-task deliberation), and review (testing-reviewer detection). + +## Requirements + +**ce:plan -- Handle the Blank Case** + +- R1. When a plan unit has no test scenarios, the planner should annotate why (e.g., "Test expectation: none -- config-only, no behavioral change") rather than leaving the field blank +- R2. A blank or missing test scenarios field on a feature-bearing unit should be treated as incomplete during ce:plan's Phase 5.1 review, not silently accepted + +--- + +**ce:work -- Per-Task Testing Deliberation** + +- R3. Before marking a task done, ce:work's execution loop should include an explicit testing deliberation: did this task change behavior? If yes, were tests written or updated? If no tests were added, why not? This is a prompt for deliberation at the point of action, not a formal artifact +- R4. The Phase 3 quality checklist item "Tests pass (run project's test command)" and the Final Validation item "All tests pass" should both be updated to "Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)" +- R5. Apply R3 and R4 to ce:work-beta (AGENTS.md requires explicit sync decisions for beta counterparts) + +--- + +**testing-reviewer -- Flag the Missing-Test Pattern** + +- R6. The testing-reviewer agent should add a new check: when the diff contains behavioral code changes (new logic branches, state mutations, API changes) with zero corresponding test additions or modifications, flag it as a finding +- R7. This check complements the existing checks (untested branches, weak assertions, brittle tests, missing edge cases) -- it catches the case those miss: no tests at all for new behavior + +**Contract Tests -- Practice What We Preach** + +- R8. Add contract tests verifying each behavioral change ships as intended. Following the existing pattern in `pipeline-review-contract.test.ts` and `review-skill-contract.test.ts` (string assertions against skill/agent file content): + - ce:work includes per-task testing deliberation in the execution loop (R3) + - ce:work checklist says "Testing addressed", not "Tests pass" or "All tests pass" (R4) + - ce:work-beta mirrors the testing deliberation and checklist changes (R5) + - ce:plan Phase 5.1 review treats blank test scenarios on feature-bearing units as incomplete (R2) + - testing-reviewer agent includes the behavioral-changes-with-no-test-additions check (R6) + +## Success Criteria + +- A diff with behavioral changes and no test changes gets flagged by the testing-reviewer (R6) -- the detective layer catches it on real artifacts +- ce:plan units without test scenarios either have an explicit annotation or get flagged during plan review (R1-R2) -- the preventive layer operates at planning time +- ce:work's execution loop prompts testing deliberation per task, and the checklist makes the agent explicitly consider whether testing was addressed, not just whether the suite is green (R3-R4) +- "No tests needed" with justification remains a valid outcome -- the goal is deliberate decisions, not forced ceremony + +## Scope Boundaries + +- Not adding CI-level enforcement or programmatic gates -- these are prompt-level changes +- Not adding new abstractions like "testing assessment artifacts" or structured output schemas +- Not mandating coverage thresholds or specific testing frameworks +- Not changing the testing-reviewer's output format -- adding one check within its existing review protocol + +## Key Decisions + +- **Layered approach -- deliberation + detection**: ce:work's per-task deliberation (R3) prompts the agent to think about testing at the point of action. The testing-reviewer (R6) operates on the actual diff as a backstop. Instruction specificity at the right moment matters -- "did you address testing for this task?" is a much more targeted prompt than "tests pass." +- **Targeted edits over a new system**: Rather than introducing a "testing assessment gate" abstraction, make focused changes to ce:plan, ce:work, and testing-reviewer that close the identified gaps. +- **Deliberate omission is a first-class outcome**: "No tests needed" with justification is valid. The goal is making "no tests" a deliberate decision, not an accidental one. + +## Outstanding Questions + +### Deferred to Planning + +- [Affects R1][Technical] What's the lightest-weight annotation for plan units that genuinely need no tests -- a field, a comment, or a convention? +- [Affects R6][Needs research] Review the testing-reviewer's current check implementation to determine where the new "behavioral changes with no test changes" check fits in its analysis protocol +- [Affects R3][Technical] Where in ce:work's execution loop (Phase 2 task loop) does the testing deliberation prompt fit -- after "Run tests after changes" or as part of "Mark task as completed"? +- [Affects R4-R5][Resolved] ce:work's Phase 3 checklist is plaintext markdown in SKILL.md (line ~433 and ~289). ce:work-beta has the same pattern. The change is editing bullet points, no dynamic infrastructure. + +## Next Steps + +-> `/ce:plan` for structured implementation planning diff --git a/docs/plans/2026-03-29-001-feat-testing-addressed-gate-plan.md b/docs/plans/2026-03-29-001-feat-testing-addressed-gate-plan.md new file mode 100644 index 0000000..3b737fd --- /dev/null +++ b/docs/plans/2026-03-29-001-feat-testing-addressed-gate-plan.md @@ -0,0 +1,239 @@ +--- +title: "feat: Close the testing gap in ce:work, ce:plan, and testing-reviewer" +type: feat +status: active +date: 2026-03-29 +origin: docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md +--- + +# feat: Close the testing gap in ce:work, ce:plan, and testing-reviewer + +## Overview + +Targeted edits to three skill/agent files to make "no tests" a deliberate decision rather than an accidental omission. Adds per-task testing deliberation in ce:work's execution loop, blank-test-scenarios handling in ce:plan's review, and a missing-test-pattern check in the testing-reviewer agent. Ships with contract tests following the existing repo pattern. + +## Problem Frame + +ce:work has thorough testing instructions but two narrow gaps let untested behavioral changes slip through silently: the quality gate says "All tests pass" (vacuously true with no tests), and ce:plan allows blank test scenarios without annotation. The testing-reviewer catches some gaps after the fact but doesn't flag the broad pattern of behavioral changes with zero test additions. (see origin: docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md) + +## Requirements Trace + +- R1. ce:plan units with no test scenarios should annotate why, not leave the field blank +- R2. Blank test scenarios on feature-bearing units treated as incomplete in Phase 5.1 review +- R3. Per-task testing deliberation in ce:work's execution loop before marking a task done +- R4. Quality checklist and Final Validation updated from "Tests pass" to "Testing addressed" +- R5. Apply R3 and R4 to ce:work-beta with explicit sync decision +- R6. testing-reviewer adds a check for behavioral changes with no corresponding test additions +- R7. New check complements existing checks (untested branches, weak assertions, brittle tests, missing edge cases) +- R8. Contract tests verifying each behavioral change ships as intended + +## Scope Boundaries + +- Prompt-level changes only -- no CI enforcement, no programmatic gates +- No new abstractions (no "testing assessment artifacts" or structured output schemas) +- No changes to testing-reviewer's output format (findings JSON stays the same) +- Deliberate test omission with justification is a valid outcome + +## Context & Research + +### Relevant Code and Patterns + +- `plugins/compound-engineering/skills/ce-plan/SKILL.md` — Phase 5.1 review checklist at lines 583-601, test scenario quality checks at lines 591-592. Two edit sites: instruction prose for Test scenarios at line 339 (section 3.5), and plan output template with HTML comment at line 499 +- `plugins/compound-engineering/skills/ce-work/SKILL.md` — Phase 2 task loop at lines ~143-155, Final Validation at lines 287-295 ("All tests pass"), Quality Checklist at lines 427-443 ("Tests pass (run project's test command)") +- `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` — Identical loop/checklist structure. Final Validation at lines 296-304, Quality Checklist at lines 500-516 +- `plugins/compound-engineering/agents/review/testing-reviewer.md` — 4 existing checks in "What you're hunting for" (lines 15-20), confidence calibration (lines 22-29), output format (lines 37-48) +- `tests/pipeline-review-contract.test.ts` — Contract tests for ce:work, ce:work-beta, ce:brainstorm, ce:plan using `readRepoFile()` + `toContain`/`not.toContain` assertions +- `tests/review-skill-contract.test.ts` — Contract tests for ce:review agent using same pattern, includes frontmatter parsing and cross-file schema alignment + +### Institutional Learnings + +- Beta-to-stable sync must be explicit per AGENTS.md (lines 161-163). The existing `pipeline-review-contract.test.ts` already tests ce:work-beta mirrors ce:work's review contract — follow same pattern. +- Skill review checklist warns against contradictory rules across phases — the new "testing deliberation" must complement, not contradict, existing "Run tests after changes" instruction. +- Use negative assertions (`not.toContain`) to prevent regression — assert old "Tests pass" / "All tests pass" language is fully replaced. + +## Key Technical Decisions + +- **Testing deliberation goes after "Run tests after changes" in the loop**: This is the natural deliberation point — tests have just run (or not), and the agent should assess whether testing was adequately addressed before marking the task done. Placing it earlier (before test execution) would be premature; placing it at "Mark task as completed" would intermingle it with completion bookkeeping. +- **Annotation uses existing template field, not a new field**: `Test expectation: none -- [reason]` goes in the Test scenarios section rather than adding a new template field. This keeps the template stable and leverages the existing Phase 5.1 check surface. +- **New testing-reviewer check is a 5th bullet, not a replacement**: It's conceptually distinct from check #1 (untested branches within new code). Check #1 looks at branch coverage within tests that exist; the new check flags when no tests exist at all for behavioral changes. +- **Contract tests extend existing files**: New ce:work/ce:plan assertions go in `pipeline-review-contract.test.ts`. Testing-reviewer assertion goes in `review-skill-contract.test.ts`. This follows the established convention rather than creating a new file. + +## Open Questions + +### Resolved During Planning + +- **Where does testing deliberation go in the loop?** After "Run tests after changes" (bullet 8) and before "Mark task as completed" (bullet 9). The agent has just run tests or skipped them — now it deliberates. +- **What annotation format for units with no tests?** `Test expectation: none -- [reason]` in the Test scenarios field. Follows existing template structure. +- **Where does the new check go in testing-reviewer?** 5th bullet in "What you're hunting for" after the existing 4 checks. +- **New test file or extend existing?** Extend existing — `pipeline-review-contract.test.ts` for skill changes, `review-skill-contract.test.ts` for the agent change. + +### Deferred to Implementation + +- Exact wording of the testing deliberation prompt in the execution loop — should be concise and action-oriented, final phrasing determined during implementation +- Whether the testing-reviewer's "What you don't flag" section needs a corresponding exclusion for non-behavioral changes (config, formatting, comments) — inspect during implementation + +## Implementation Units + +- [ ] **Unit 1: ce:plan — Blank test scenarios handling** + +**Goal:** Make blank test scenarios on feature-bearing units flagged as incomplete during plan review, and establish the annotation convention for units that genuinely need no tests. + +**Requirements:** R1, R2 + +**Dependencies:** None + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-plan/SKILL.md` + +**Approach:** +- Two edit sites in ce:plan for the annotation convention: + - The instruction prose (section 3.5, around line 339) that describes how to write Test scenarios — mention the `Test expectation: none -- [reason]` convention here so the planner agent learns it when reading instructions + - The plan output template (around line 499) which contains the HTML comment `` — update this comment to also show the annotation convention for units with no test scenarios +- In Phase 5.1 review checklist (after line 592), add a new bullet: blank or missing test scenarios on a feature-bearing unit (as defined by ce:plan's existing Plan Quality Bar language) should be flagged as incomplete +- In the Phase 5.3.3 confidence-scoring checklist for Implementation Units (around line 717), add a parallel item so the confidence check also catches blank test scenarios + +**Patterns to follow:** +- Existing Phase 5.1 test scenario quality checks at lines 591-592 +- The unit template comment style at line 499 +- ce:plan's existing "feature-bearing unit" terminology in the Plan Quality Bar + +**Test scenarios:** +- Happy path: Plan with a feature-bearing unit that has `Test expectation: none -- config-only change` in test scenarios -> Phase 5.1 review accepts it +- Error path: Plan with a feature-bearing unit that has a completely blank/absent Test scenarios field -> Phase 5.1 review flags it as incomplete +- Happy path: Plan with a non-feature-bearing unit (scaffolding, config) that uses the annotation -> accepted without issue + +**Verification:** +- Phase 5.1 checklist explicitly addresses blank test scenarios +- Plan template comment mentions the `Test expectation: none -- [reason]` convention +- Confidence scoring checklist includes blank test scenarios as a scoring trigger + +--- + +- [ ] **Unit 2: ce:work and ce:work-beta — Testing deliberation and checklist update** + +**Goal:** Add per-task testing deliberation to the execution loop and update both checklist surfaces from "Tests pass" to "Testing addressed." + +**Requirements:** R3, R4, R5 + +**Dependencies:** None + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md` +- Modify: `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` + +**Approach:** +- In the Phase 2 task execution loop (lines ~143-155 in ce:work, ~144-156 in ce:work-beta), add a **new bullet** between "Run tests after changes" and "Mark task as completed". The new bullet should prompt the agent to assess: did this task change behavior? If yes, were tests written or updated? If no tests were added, what is the justification? Keep it concise — 2-3 questions in one bullet, matching the existing loop bullet style. Do not expand into a multi-paragraph section +- In the Quality Checklist (ce:work line ~433, ce:work-beta line ~506), replace `- [ ] Tests pass (run project's test command)` with `- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)` +- In the Final Validation (ce:work line ~289, ce:work-beta line ~298), replace `- All tests pass` with `- Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)` +- Ensure both files receive identical changes + +**Sync decision:** Propagating to beta — shared testing deliberation guidance, not experimental delegate-mode behavior. + +**Patterns to follow:** +- Existing execution loop bullet style at lines 138-155 +- Existing Quality Checklist item style (checkbox with parenthetical guidance) +- The mandatory review pattern (which was also synced identically between stable and beta) + +**Test scenarios:** +- Happy path: ce:work execution loop includes the testing deliberation step in the correct position (after "Run tests" and before "Mark task as completed") +- Happy path: Quality Checklist contains "Testing addressed" and does not contain "Tests pass (run project's test command)" +- Happy path: Final Validation contains "Testing addressed" and does not contain "All tests pass" +- Integration: ce:work-beta has identical testing deliberation and checklist wording as ce:work + +**Verification:** +- Both files contain the testing deliberation step in the execution loop +- Both files' Quality Checklist and Final Validation use "Testing addressed" language +- Old "Tests pass" and "All tests pass" language is fully removed from both files + +--- + +- [ ] **Unit 3: testing-reviewer — Behavioral changes with no test additions check** + +**Goal:** Add a 5th check to the testing-reviewer agent that flags behavioral code changes in the diff with zero corresponding test additions or modifications. + +**Requirements:** R6, R7 + +**Dependencies:** None + +**Files:** +- Modify: `plugins/compound-engineering/agents/review/testing-reviewer.md` + +**Approach:** +- Add a 5th bold-titled bullet in "What you're hunting for" (after the existing 4th check at line 20). The check should: describe the pattern (behavioral code changes — new logic branches, state mutations, API changes — with zero corresponding test file additions or modifications in the diff), explain what makes it distinct from check #1 (which looks at untested branches *within* code that has tests, while this flags when no tests exist at all), and note that non-behavioral changes (config, formatting, comments, type-only changes) are excluded +- Consider adding a corresponding item in "What you don't flag" for non-behavioral changes if it adds clarity + +**Patterns to follow:** +- Existing check format: bold title followed by `--` and explanation +- Existing checks use specific, concrete language ("new `if/else`, `switch`, `try/catch`") +- Confidence calibration tiers (High 0.80+ when provable from diff alone) + +**Test scenarios:** +- Happy path: testing-reviewer.md "What you're hunting for" section contains the behavioral-changes-with-no-tests check +- Happy path: Check is described as distinct from existing untested-branches check + +**Verification:** +- testing-reviewer.md has 5 checks in "What you're hunting for" instead of 4 +- The new check specifically addresses "behavioral changes with no corresponding test additions" + +--- + +- [ ] **Unit 4: Contract tests for all changes** + +**Goal:** Add contract tests that verify each skill/agent modification ships as intended, following the existing string-assertion pattern. + +**Requirements:** R8 + +**Dependencies:** Units 1, 2, 3 + +**Files:** +- Modify: `tests/pipeline-review-contract.test.ts` +- Modify: `tests/review-skill-contract.test.ts` + +**Approach:** +- In `pipeline-review-contract.test.ts`, extend the existing `ce:work review contract` describe block with new tests: + - ce:work includes testing deliberation in execution loop + - ce:work Quality Checklist contains "Testing addressed" and does not contain "Tests pass (run project's test command)" + - ce:work Final Validation contains "Testing addressed" and does not contain "All tests pass" + - ce:work-beta mirrors all testing deliberation and checklist changes +- In `pipeline-review-contract.test.ts`, extend or add a `ce:plan review contract` test: + - ce:plan Phase 5.1 review addresses blank test scenarios on feature-bearing units +- In `review-skill-contract.test.ts`, add a new describe block for testing-reviewer: + - testing-reviewer includes the behavioral-changes-with-no-test-additions check + +Use negative assertions (`not.toContain`) for the old checklist language to prevent regression. + +**Patterns to follow:** +- `readRepoFile()` helper + `expect(content).toContain(...)` / `expect(content).not.toContain(...)` in existing contract tests +- ce:work-beta mirror test pattern at pipeline-review-contract.test.ts lines 39-50 +- `describe`/`test` block naming convention in both files + +**Test scenarios:** +- Happy path: All new contract tests pass after Units 1-3 are complete +- Error path: Reverting any skill change causes the corresponding contract test to fail (verified by inspection of assertion specificity) + +**Verification:** +- `bun test` passes with the new contract tests +- Each R3-R7 change surface has at least one contract test assertion + +## System-Wide Impact + +- **Interaction graph:** These are prompt-level skill edits. No callbacks, middleware, or runtime dependencies. The testing-reviewer is invoked by ce:review which is invoked by ce:work — the chain is: ce:work -> ce:review -> testing-reviewer. Changes to the reviewer's check list affect what ce:review surfaces but not how it surfaces it. +- **Error propagation:** Not applicable — no runtime error paths. If the testing deliberation prompt is poorly worded, the worst case is the agent ignores it (same as today). +- **API surface parity:** ce:work and ce:work-beta must remain in sync per AGENTS.md. Contract tests enforce this. +- **Unchanged invariants:** The testing-reviewer's output format (JSON with `findings`, `residual_risks`, `testing_gaps`) is unchanged. The plan template's structure is unchanged — only the comment and Phase 5.1 checklist are modified. + +## Risks & Dependencies + +| Risk | Mitigation | +|------|------------| +| Testing deliberation prompt is too verbose and gets ignored by the agent | Keep it concise — 2-3 questions, not a paragraph. Match the existing loop bullet style. | +| Old "Tests pass" language persists in one location, creating contradiction | Negative contract test assertions (`not.toContain`) catch any leftover old language | +| ce:work-beta drifts from ce:work | Contract tests explicitly assert both files contain identical testing changes | + +## Sources & References + +- **Origin document:** [docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md](docs/brainstorms/2026-03-29-testing-addressed-gate-requirements.md) +- Related learning: `docs/solutions/skill-design/beta-promotion-orchestration-contract.md` +- Related learning: `docs/solutions/skill-design/compound-refresh-skill-improvements.md` (avoid contradictory rules across phases) +- Related test: `tests/pipeline-review-contract.test.ts` +- Related test: `tests/review-skill-contract.test.ts` diff --git a/plugins/compound-engineering/agents/review/testing-reviewer.md b/plugins/compound-engineering/agents/review/testing-reviewer.md index e3eb308..e19d004 100644 --- a/plugins/compound-engineering/agents/review/testing-reviewer.md +++ b/plugins/compound-engineering/agents/review/testing-reviewer.md @@ -17,6 +17,7 @@ You are a test architecture and coverage expert who evaluates whether the tests - **Tests that don't assert behavior (false confidence)** -- tests that call a function but only assert it doesn't throw, assert truthiness instead of specific values, or mock so heavily that the test verifies the mocks, not the code. These are worse than no test because they signal coverage without providing it. - **Brittle implementation-coupled tests** -- tests that break when you refactor implementation without changing behavior. Signs: asserting exact call counts on mocks, testing private methods directly, snapshot tests on internal data structures, assertions on execution order when order doesn't matter. - **Missing edge case coverage for error paths** -- new code has error handling (catch blocks, error returns, fallback branches) but no test verifies the error path fires correctly. The happy path is tested; the sad path is not. +- **Behavioral changes with no test additions** -- the diff modifies behavior (new logic branches, state mutations, changed API contracts, altered control flow) but adds or modifies zero test files. This is distinct from untested branches above, which checks coverage *within* code that has tests. This check flags when the diff contains behavioral changes with no corresponding test work at all. Non-behavioral changes (config edits, formatting, comments, type-only annotations, dependency bumps) are excluded. ## Confidence calibration diff --git a/plugins/compound-engineering/skills/ce-plan/SKILL.md b/plugins/compound-engineering/skills/ce-plan/SKILL.md index 549061b..98fd730 100644 --- a/plugins/compound-engineering/skills/ce-plan/SKILL.md +++ b/plugins/compound-engineering/skills/ce-plan/SKILL.md @@ -336,7 +336,7 @@ For each unit, include: - **Execution note** - optional, only when the unit benefits from a non-default execution posture such as test-first, characterization-first, or external delegation - **Technical design** - optional pseudo-code or diagram when the unit's approach is non-obvious and prose alone would leave it ambiguous. Frame explicitly as directional guidance, not implementation specification - **Patterns to follow** - existing code or conventions to mirror -- **Test scenarios** - enumerate the specific test cases the implementer should write, right-sized to the unit's complexity and risk. Consider each category below and include scenarios from every category that applies to this unit. A simple config change may need one scenario; a payment flow may need a dozen. The quality signal is specificity — each scenario should name the input, action, and expected outcome so the implementer doesn't have to invent coverage. +- **Test scenarios** - enumerate the specific test cases the implementer should write, right-sized to the unit's complexity and risk. Consider each category below and include scenarios from every category that applies to this unit. A simple config change may need one scenario; a payment flow may need a dozen. The quality signal is specificity — each scenario should name the input, action, and expected outcome so the implementer doesn't have to invent coverage. For units with no behavioral change (pure config, scaffolding, styling), use `Test expectation: none -- [reason]` instead of leaving the field blank. - **Happy path behaviors** - core functionality with expected inputs and outputs - **Edge cases** (when the unit has meaningful boundaries) - boundary values, empty inputs, nil/null states, concurrent access - **Error and failure paths** (when the unit has failure modes) - invalid input, downstream service failures, timeout behavior, permission denials @@ -496,7 +496,7 @@ deepened: YYYY-MM-DD # optional, set when the confidence check substantively st - [Existing file, class, or pattern] **Test scenarios:** - + - [Scenario: specific input/action -> expected outcome. Prefix with category — Happy path, Edge case, Error path, or Integration — to signal intent] **Verification:** @@ -622,6 +622,7 @@ Before finalizing, check: - If test-first or characterization-first posture was explicit or strongly implied, the relevant units carry it forward with a lightweight `Execution note` - Each feature-bearing unit has test scenarios from every applicable category (happy path, edge cases, error paths, integration) — right-sized to the unit's complexity, not padded or skimped - Test scenarios name specific inputs, actions, and expected outcomes without becoming test code +- Feature-bearing units with blank or missing test scenarios are flagged as incomplete — feature-bearing units must have actual test scenarios, not just an annotation. The `Test expectation: none -- [reason]` annotation is only valid for non-feature-bearing units (pure config, scaffolding, styling) - Deferred items are explicit and not hidden as fake certainty - If a High-Level Technical Design section is included, it uses the right medium for the work, carries the non-prescriptive framing, and does not contain implementation code (no imports, exact signatures, or framework-specific syntax) - Per-unit technical design fields, if present, are concise and directional rather than copy-paste-ready @@ -748,6 +749,7 @@ If the plan already has a `deepened:` date: - Units are too large, too vague, or broken into micro-steps - Approach notes are thin or do not name the pattern to follow - Test scenarios are vague (don't name inputs and expected outcomes), skip applicable categories (e.g., no error paths for a unit with failure modes, no integration scenarios for a unit crossing layers), or are disproportionate to the unit's complexity +- Feature-bearing units have blank or missing test scenarios (feature-bearing units require actual test scenarios; the `Test expectation: none` annotation is only valid for non-feature-bearing units) - Verification outcomes are vague or not expressed as observable results **System-Wide Impact** diff --git a/plugins/compound-engineering/skills/ce-work-beta/SKILL.md b/plugins/compound-engineering/skills/ce-work-beta/SKILL.md index ddc2dd9..8328e58 100644 --- a/plugins/compound-engineering/skills/ce-work-beta/SKILL.md +++ b/plugins/compound-engineering/skills/ce-work-beta/SKILL.md @@ -151,6 +151,7 @@ Determine how to proceed based on what was provided in ``. - Add, update, or remove tests to match implementation changes (see Test Discovery below) - Run System-Wide Test Check (see below) - Run tests after changes + - Assess testing coverage: did this task change behavior? If yes, were tests written or updated? If no tests were added, is the justification deliberate (e.g., pure config, no behavioral change)? - Mark task as completed - Evaluate for incremental commit (see below) ``` @@ -295,7 +296,7 @@ Determine how to proceed based on what was provided in ``. 3. **Final Validation** - All tasks marked completed - - All tests pass + - Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed) - Linting passes - Code follows existing patterns - Figma designs match (if applicable) @@ -503,7 +504,7 @@ Before creating PR, verify: - [ ] All clarifying questions asked and answered - [ ] All tasks marked completed -- [ ] Tests pass (run project's test command) +- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed) - [ ] Linting passes (use linting-agent) - [ ] Code follows existing patterns - [ ] Figma designs match implementation (if applicable) diff --git a/plugins/compound-engineering/skills/ce-work/SKILL.md b/plugins/compound-engineering/skills/ce-work/SKILL.md index cbeee35..6644f1a 100644 --- a/plugins/compound-engineering/skills/ce-work/SKILL.md +++ b/plugins/compound-engineering/skills/ce-work/SKILL.md @@ -150,6 +150,7 @@ Determine how to proceed based on what was provided in ``. - Add, update, or remove tests to match implementation changes (see Test Discovery below) - Run System-Wide Test Check (see below) - Run tests after changes + - Assess testing coverage: did this task change behavior? If yes, were tests written or updated? If no tests were added, is the justification deliberate (e.g., pure config, no behavioral change)? - Mark task as completed - Evaluate for incremental commit (see below) ``` @@ -286,7 +287,7 @@ Determine how to proceed based on what was provided in ``. 3. **Final Validation** - All tasks marked completed - - All tests pass + - Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed) - Linting passes - Code follows existing patterns - Figma designs match (if applicable) @@ -430,7 +431,7 @@ Before creating PR, verify: - [ ] All clarifying questions asked and answered - [ ] All tasks marked completed -- [ ] Tests pass (run project's test command) +- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed) - [ ] Linting passes (use linting-agent) - [ ] Code follows existing patterns - [ ] Figma designs match implementation (if applicable) diff --git a/tests/pipeline-review-contract.test.ts b/tests/pipeline-review-contract.test.ts index b2423b5..91b138f 100644 --- a/tests/pipeline-review-contract.test.ts +++ b/tests/pipeline-review-contract.test.ts @@ -48,6 +48,45 @@ describe("ce:work review contract", () => { expect(beta).toContain("`git-commit` skill") expect(beta).not.toContain("gh pr create") }) + + test("includes per-task testing deliberation in execution loop", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work/SKILL.md") + + // Testing deliberation exists in the execution loop + expect(content).toContain("Assess testing coverage") + + // Deliberation is between "Run tests after changes" and "Mark task as completed" + const runTestsIdx = content.indexOf("Run tests after changes") + const assessIdx = content.indexOf("Assess testing coverage") + const markDoneIdx = content.indexOf("Mark task as completed") + expect(runTestsIdx).toBeLessThan(assessIdx) + expect(assessIdx).toBeLessThan(markDoneIdx) + }) + + test("quality checklist says 'Testing addressed' not 'Tests pass'", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work/SKILL.md") + + // New language present + expect(content).toContain("Testing addressed") + + // Old language fully removed + expect(content).not.toContain("Tests pass (run project's test command)") + expect(content).not.toContain("- All tests pass") + }) + + test("ce:work-beta mirrors testing deliberation and checklist changes", async () => { + const beta = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + // Testing deliberation in loop + expect(beta).toContain("Assess testing coverage") + + // New checklist language + expect(beta).toContain("Testing addressed") + + // Old language removed + expect(beta).not.toContain("Tests pass (run project's test command)") + expect(beta).not.toContain("- All tests pass") + }) }) describe("ce:brainstorm review contract", () => { @@ -64,6 +103,19 @@ describe("ce:brainstorm review contract", () => { }) }) +describe("ce:plan testing contract", () => { + test("flags blank test scenarios on feature-bearing units as incomplete", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-plan/SKILL.md") + + // Phase 5.1 review checklist addresses blank test scenarios + expect(content).toContain("blank or missing test scenarios") + expect(content).toContain("Test expectation: none") + + // Template comment mentions the annotation convention + expect(content).toContain("Test expectation: none -- [reason]") + }) +}) + describe("ce:plan review contract", () => { test("requires document review after confidence check", async () => { const content = await readRepoFile("plugins/compound-engineering/skills/ce-plan/SKILL.md") diff --git a/tests/review-skill-contract.test.ts b/tests/review-skill-contract.test.ts index 9b63b6c..f5fae42 100644 --- a/tests/review-skill-contract.test.ts +++ b/tests/review-skill-contract.test.ts @@ -245,3 +245,18 @@ describe("ce-review contract", () => { expect(slfg).toContain("/ce:review mode:autofix") }) }) + +describe("testing-reviewer contract", () => { + test("includes behavioral-changes-with-no-test-additions check", async () => { + const content = await readRepoFile("plugins/compound-engineering/agents/review/testing-reviewer.md") + + // New check exists in "What you're hunting for" section + expect(content).toContain("Behavioral changes with no test additions") + + // Check is distinct from untested branches check + expect(content).toContain("distinct from untested branches") + + // Non-behavioral changes are excluded + expect(content).toContain("Non-behavioral changes") + }) +})