diff --git a/plugins/compound-engineering/skills/ce-review/SKILL.md b/plugins/compound-engineering/skills/ce-review/SKILL.md index 8a31b3e..6bf2703 100644 --- a/plugins/compound-engineering/skills/ce-review/SKILL.md +++ b/plugins/compound-engineering/skills/ce-review/SKILL.md @@ -62,7 +62,7 @@ All tokens are optional. Each one present means one less thing to infer. When ab - **Skip all user questions.** Never use the platform question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini) or other interactive prompts. Infer intent conservatively if the diff metadata is thin. - **Require a determinable diff scope.** If headless mode cannot determine a diff scope (no branch, PR, or `base:` ref determinable without user interaction), emit `Review failed (headless mode). Reason: no diff scope detected. Re-invoke with a branch name, PR number, or base:.` and stop without dispatching agents. - **Apply only `safe_auto -> review-fixer` findings in a single pass.** No bounded re-review rounds. Leave `gated_auto`, `manual`, `human`, and `release` work unresolved and return them in the structured output. -- **Return all non-auto findings as structured text output.** Use the headless output envelope format (see Stage 6 below) preserving severity, autofix_class, owner, requires_verification, confidence, evidence[], and pre_existing per finding. +- **Return all non-auto findings as structured text output.** Use the headless output envelope format (see Stage 6 below) preserving severity, autofix_class, owner, requires_verification, confidence, pre_existing, and suggested_fix per finding. Enrich with detail-tier fields (why_it_matters, evidence[]) from the per-agent artifact files on disk (see Detail enrichment in Stage 6). - **Write a run artifact** under `.context/compound-engineering/ce-review//` summarizing findings, applied fixes, and advisory outputs. Include the artifact path in the structured output. - **Do not create todo files.** The caller receives structured findings and routes downstream work itself. - **Do not switch the shared checkout.** If the caller passes an explicit PR or branch target, `mode:headless` must run in an isolated checkout/worktree or stop instead of running `gh pr checkout` / `git checkout`. When stopping, emit `Review failed (headless mode). Reason: cannot switch shared checkout. Re-invoke with base: to review the current checkout, or run from an isolated worktree.` @@ -339,6 +339,8 @@ If a plan is found, read its **Requirements Trace** (R1, R2, etc.) and **Impleme Read the diff and file list from Stage 1. The 4 always-on personas and 2 CE always-on agents are automatic. For each cross-cutting and stack-specific conditional persona in the persona catalog included below, decide whether the diff warrants it. This is agent judgment, not keyword matching. +**File-type awareness for conditional selection:** Instruction-prose files (Markdown skill definitions, JSON schemas, config files) are product code but do not benefit from runtime-focused reviewers. The adversarial reviewer's techniques (race conditions, cascade failures, abuse cases) target executable code behavior. For diffs that only change instruction-prose files, skip adversarial unless the prose describes auth, payment, or data-mutation behavior. Count only executable code lines toward line-count thresholds. + **`previous-comments` is PR-only.** Only select this persona when Stage 1 gathered PR metadata (PR number or URL was provided as an argument, or `gh pr view` returned metadata for the current branch). Skip it entirely for standalone branch reviews with no associated PR -- there are no prior comments to check. Stack-specific personas are additive. A Rails UI change may warrant `kieran-rails` plus `julik-frontend-races`; a TypeScript API diff may warrant `kieran-typescript` plus `api-contract` and `reliability`. @@ -385,6 +387,19 @@ CE always-on agents (agent-native-reviewer, learnings-researcher) and CE conditi The orchestrator (this skill) stays on the default model because it handles intent discovery, reviewer selection, finding merge/dedup, and synthesis -- tasks that benefit from stronger reasoning. +#### Run ID + +Generate a unique run identifier before dispatching any agents. This ID scopes all agent artifact files and the post-review run artifact to the same directory. + +```bash +RUN_ID=$(date +%Y%m%d-%H%M%S)-$(head -c4 /dev/urandom | od -An -tx1 | tr -d ' ') +mkdir -p ".context/compound-engineering/ce-review/$RUN_ID" +``` + +Pass `{run_id}` to every persona sub-agent so they can write their full analysis to `.context/compound-engineering/ce-review/{run_id}/{reviewer_name}.json`. + +**Report-only mode:** Skip run-id generation and directory creation. Do not pass `{run_id}` to agents. Agents return compact JSON only with no file write, consistent with report-only's no-write contract. + #### Spawning Omit the `mode` parameter when dispatching sub-agents so the user's configured permission settings apply. Do not pass `mode: "auto"`. @@ -396,45 +411,71 @@ Spawn each selected persona reviewer as a parallel sub-agent using the subagent 3. The JSON output contract from the findings schema included below 4. PR metadata: title, body, and URL when reviewing a PR (empty string otherwise). Passed in a `` block so reviewers can verify code against stated intent 5. Review context: intent summary, file list, diff -6. **For `project-standards` only:** the standards file path list from Stage 3b, wrapped in a `` block appended to the review context +6. Run ID and reviewer name for the artifact file path +7. **For `project-standards` only:** the standards file path list from Stage 3b, wrapped in a `` block appended to the review context -Persona sub-agents are **read-only**: they review and return structured JSON. They do not edit files or propose refactors. +Persona sub-agents are **read-only** with respect to the project: they review and return structured JSON. They do not edit project files or propose refactors. The one permitted write is saving their full analysis to the `.context/` artifact path specified in the output contract. -Read-only here means **non-mutating**, not "no shell access." Reviewer sub-agents may use non-mutating inspection commands when needed to gather evidence or verify scope, including read-oriented `git` / `gh` usage such as `git diff`, `git show`, `git blame`, `git log`, and `gh pr view`. They must not edit files, change branches, commit, push, create PRs, or otherwise mutate the checkout or repository state. +Read-only here means **non-mutating**, not "no shell access." Reviewer sub-agents may use non-mutating inspection commands when needed to gather evidence or verify scope, including read-oriented `git` / `gh` usage such as `git diff`, `git show`, `git blame`, `git log`, and `gh pr view`. They must not edit project files, change branches, commit, push, create PRs, or otherwise mutate the checkout or repository state. -Each persona sub-agent returns JSON matching the findings schema included below: +Each persona sub-agent writes full JSON (all schema fields) to `.context/compound-engineering/ce-review/{run_id}/{reviewer_name}.json` and returns compact JSON with merge-tier fields only: ```json { "reviewer": "security", - "findings": [...], + "findings": [ + { + "title": "User-supplied ID in account lookup without ownership check", + "severity": "P0", + "file": "orders_controller.rb", + "line": 42, + "confidence": 0.92, + "autofix_class": "gated_auto", + "owner": "downstream-resolver", + "requires_verification": true, + "pre_existing": false, + "suggested_fix": "Add current_user.owns?(account) guard before lookup" + } + ], "residual_risks": [...], "testing_gaps": [...] } ``` +Detail-tier fields (`why_it_matters`, `evidence`) are in the artifact file only. `suggested_fix` is optional in both tiers -- included in compact returns when present so the orchestrator has fix context for auto-apply decisions. If the file write fails, the compact return still provides everything the merge needs. + **CE always-on agents** (agent-native-reviewer, learnings-researcher) are dispatched as standard Agent calls in parallel with the persona agents. Give them the same review context bundle the personas receive: entry mode, any PR metadata gathered in Stage 1, intent summary, review base branch name when known, `BASE:` marker, file list, diff, and `UNTRACKED:` scope notes. Do not invoke them with a generic "review this" prompt. Their output is unstructured and synthesized separately in Stage 6. **CE conditional agents** (schema-drift-detector, deployment-verification-agent) are also dispatched as standard Agent calls when applicable. Pass the same review context bundle plus the applicability reason (for example, which migration files triggered the agent). For schema-drift-detector specifically, pass the resolved review base branch explicitly so it never assumes `main`. Their output is unstructured and must be preserved for Stage 6 synthesis just like the CE always-on agents. ### Stage 5: Merge findings -Convert multiple reviewer JSON payloads into one deduplicated, confidence-gated finding set. +Convert multiple reviewer compact JSON returns into one deduplicated, confidence-gated finding set. The compact returns contain merge-tier fields (title, severity, file, line, confidence, autofix_class, owner, requires_verification, pre_existing) plus the optional suggested_fix. Detail-tier fields (why_it_matters, evidence) are on disk in the per-agent artifact files and are not loaded at this stage. -1. **Validate.** Check each output against the schema. Drop malformed findings (missing required fields). Record the drop count. +1. **Validate.** Check each compact return for required top-level and per-finding fields, plus value constraints. Drop malformed returns or findings. Record the drop count. + - **Top-level required:** reviewer (string), findings (array), residual_risks (array), testing_gaps (array). Drop the entire return if any are missing or wrong type. + - **Per-finding required:** title, severity, file, line, confidence, autofix_class, owner, requires_verification, pre_existing + - **Value constraints:** + - severity: P0 | P1 | P2 | P3 + - autofix_class: safe_auto | gated_auto | manual | advisory + - owner: review-fixer | downstream-resolver | human | release + - confidence: numeric, 0.0-1.0 + - line: positive integer + - pre_existing, requires_verification: boolean + - Do not validate against the full schema here -- the full schema (including why_it_matters and evidence) applies to the artifact files on disk, not the compact returns. 2. **Confidence gate.** Suppress findings below 0.60 confidence. Exception: P0 findings at 0.50+ confidence survive the gate -- critical-but-uncertain issues must not be silently dropped. Record the suppressed count. This matches the persona instructions and the schema's confidence thresholds. -3. **Deduplicate.** Compute fingerprint: `normalize(file) + line_bucket(line, +/-3) + normalize(title)`. When fingerprints match, merge: keep highest severity, keep highest confidence with strongest evidence, union evidence, note which reviewers flagged it. +3. **Deduplicate.** Compute fingerprint: `normalize(file) + line_bucket(line, +/-3) + normalize(title)`. When fingerprints match, merge: keep highest severity, keep highest confidence, note which reviewers flagged it. 4. **Cross-reviewer agreement.** When 2+ independent reviewers flag the same issue (same fingerprint), boost the merged confidence by 0.10 (capped at 1.0). Cross-reviewer agreement is strong signal -- independent reviewers converging on the same issue is more reliable than any single reviewer's confidence. Note the agreement in the Reviewer column of the output (e.g., "security, correctness"). 5. **Separate pre-existing.** Pull out findings with `pre_existing: true` into a separate list. -5. **Resolve disagreements.** When reviewers flag the same code region but disagree on severity, autofix_class, or owner, record the disagreement in the finding's evidence (e.g., "security rated P0, correctness rated P1 -- keeping P0"). This transparency helps the user understand why a finding was routed the way it was. -6. **Normalize routing.** For each merged finding, set the final `autofix_class`, `owner`, and `requires_verification`. If reviewers disagree, keep the most conservative route. Synthesis may narrow a finding from `safe_auto` to `gated_auto` or `manual`, but must not widen it without new evidence. -7. **Partition the work.** Build three sets: +6. **Resolve disagreements.** When reviewers flag the same code region but disagree on severity, autofix_class, or owner, annotate the Reviewer column with the disagreement (e.g., "security (P0), correctness (P1) -- kept P0"). This transparency helps the user understand why a finding was routed the way it was. +7. **Normalize routing.** For each merged finding, set the final `autofix_class`, `owner`, and `requires_verification`. If reviewers disagree, keep the most conservative route. Synthesis may narrow a finding from `safe_auto` to `gated_auto` or `manual`, but must not widen it without new evidence. +8. **Partition the work.** Build three sets: - in-skill fixer queue: only `safe_auto -> review-fixer` - residual actionable queue: unresolved `gated_auto` or `manual` findings whose owner is `downstream-resolver` - report-only queue: `advisory` findings plus anything owned by `human` or `release` -8. **Sort.** Order by severity (P0 first) -> confidence (descending) -> file path -> line number. -9. **Collect coverage data.** Union residual_risks and testing_gaps across reviewers. -10. **Preserve CE agent artifacts.** Keep the learnings, agent-native, schema-drift, and deployment-verification outputs alongside the merged finding set. Do not drop unstructured agent output just because it does not match the persona JSON schema. +9. **Sort.** Order by severity (P0 first) -> confidence (descending) -> file path -> line number. +10. **Collect coverage data.** Union residual_risks and testing_gaps across reviewers. +11. **Preserve CE agent artifacts.** Keep the learnings, agent-native, schema-drift, and deployment-verification outputs alongside the merged finding set. Do not drop unstructured agent output just because it does not match the persona JSON schema. ### Stage 6: Synthesize and present @@ -524,6 +565,12 @@ Coverage: Review complete ``` +**Detail enrichment (headless only):** The headless envelope includes `Why:`, `Evidence:`, and `Suggested fix:` lines. After merge (Stage 5), read the per-agent artifact files from `.context/compound-engineering/ce-review/{run_id}/` for only the findings that survived dedup and confidence gating. + - **Field tiers:** `Why:` and `Evidence:` are detail-tier -- load from per-agent artifact files. `Suggested fix:` is merge-tier -- use it directly from the compact return without artifact lookup. + - **Artifact matching:** For each surviving finding, look up its detail-tier fields in the artifact files of the contributing reviewers. Match on `file + line_bucket(line, +/-3)` (the same tolerance used in Stage 5 dedup) within each contributing reviewer's artifact. When multiple artifact entries fall within the line bucket, apply `normalize(title)` to both the merged finding's title and each candidate entry's title as a tie-breaker. + - **Reviewer order:** Try contributing reviewers in the order they appear in the merged finding's reviewer list; use the first match. + - **No-match fallback:** If no artifact file contains a match (all writes failed, or the finding was synthesized during merge), omit the `Why:` and `Evidence:` lines for that finding and note the gap in Coverage. The `Suggested fix:` line can still be populated from the compact return since it is merge-tier. + **Formatting rules:** - The `[needs-verification]` marker appears only on findings where `requires_verification: true`. - The `Artifact:` line gives callers the path to the full run artifact for machine-readable access to the complete findings schema. The text envelope is the primary handoff; the artifact is for debugging and full-fidelity access. @@ -626,10 +673,11 @@ After presenting findings and verdict (Stage 6), route the next steps by mode. R #### Step 4: Emit artifacts and downstream handoff - In interactive, autofix, and headless modes, write a per-run artifact under `.context/compound-engineering/ce-review//` containing: - - synthesized findings + - synthesized findings (merged output from Stage 5) - applied fixes - residual actionable work - advisory-only outputs + Per-agent full-detail JSON files (`{reviewer_name}.json`) are already present in this directory from Stage 4 dispatch. - In autofix mode, create durable todo files only for unresolved actionable findings whose final owner is `downstream-resolver`. Load the `todo-create` skill for the canonical directory path, naming convention, YAML frontmatter structure, and template. Each todo should map the finding's severity to the todo priority (`P0`/`P1` -> `p1`, `P2` -> `p2`, `P3` -> `p3`) and set `status: ready` since these findings have already been triaged by synthesis. - Do not create todos for `advisory` findings, `owner: human`, `owner: release`, or protected-artifact cleanup suggestions. - If only advisory outputs remain, create no todos. diff --git a/plugins/compound-engineering/skills/ce-review/references/findings-schema.json b/plugins/compound-engineering/skills/ce-review/references/findings-schema.json index 445d7ec..146b7c7 100644 --- a/plugins/compound-engineering/skills/ce-review/references/findings-schema.json +++ b/plugins/compound-engineering/skills/ce-review/references/findings-schema.json @@ -124,6 +124,11 @@ "downstream-resolver": "Turn this into residual work for later resolution.", "human": "A person must make a judgment call before code changes should continue.", "release": "Operational or rollout follow-up; do not convert into code-fix work automatically." + }, + "return_tiers": { + "description": "Finding fields are split into two tiers. The full schema (with all required fields) applies to the artifact file on disk. The compact return to the orchestrator omits detail-tier fields. Both are valid uses of this schema in different contexts.", + "merge_tier": "Returned to orchestrator: title, severity, file, line, confidence, autofix_class, owner, requires_verification, pre_existing, suggested_fix (optional). Plus top-level reviewer, residual_risks, testing_gaps.", + "detail_tier": "Required in artifact file, omitted from compact return: why_it_matters, evidence. The artifact file must pass full schema validation including all required fields. Headless output depends on why_it_matters and evidence being present in the artifact." } } } diff --git a/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md b/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md index 6e1187b..024490e 100644 --- a/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md +++ b/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md @@ -33,7 +33,7 @@ Spawned when the orchestrator identifies relevant patterns in the diff. The orch | `api-contract` | `compound-engineering:review:api-contract-reviewer` | Route definitions, serializer/interface changes, event schemas, exported type signatures, API versioning | | `data-migrations` | `compound-engineering:review:data-migrations-reviewer` | Migration files, schema changes, backfill scripts, data transformations | | `reliability` | `compound-engineering:review:reliability-reviewer` | Error handling, retry logic, circuit breakers, timeouts, background jobs, async handlers, health checks | -| `adversarial` | `compound-engineering:review:adversarial-reviewer` | Diff has >=50 changed non-test, non-generated, non-lockfile lines, OR touches auth, payments, data mutations, external API integrations, or other high-risk domains | +| `adversarial` | `compound-engineering:review:adversarial-reviewer` | Diff has >=50 changed lines of executable code (not prose/instruction Markdown, JSON schemas, or config), OR touches auth, payments, data mutations, external API integrations, or other high-risk domains regardless of file type | | `cli-readiness` | `compound-engineering:review:cli-readiness-reviewer` | CLI command definitions, argument parsing, CLI framework usage, command handler implementations | | `previous-comments` | `compound-engineering:review:previous-comments-reviewer` | **PR-only.** Reviewing a PR that has existing review comments or review threads from prior review rounds. Skip entirely when no PR metadata was gathered in Stage 1. | diff --git a/plugins/compound-engineering/skills/ce-review/references/subagent-template.md b/plugins/compound-engineering/skills/ce-review/references/subagent-template.md index bd57dfa..ad3f39d 100644 --- a/plugins/compound-engineering/skills/ce-review/references/subagent-template.md +++ b/plugins/compound-engineering/skills/ce-review/references/subagent-template.md @@ -18,7 +18,23 @@ You are a specialist code reviewer. -Return ONLY valid JSON matching the findings schema below. No prose, no markdown, no explanation outside the JSON object. +You produce up to two outputs depending on whether a run ID was provided: + +1. **Artifact file (when run ID is present).** If a Run ID appears in below, WRITE your full analysis (all schema fields, including why_it_matters, evidence, and suggested_fix) as JSON to: + .context/compound-engineering/ce-review/{run_id}/{reviewer_name}.json + This is the ONE write operation you are permitted to make. Use the platform's file-write tool. + If the write fails, continue -- the compact return still provides everything the merge needs. + If no Run ID is provided (the field is empty or absent), skip this step entirely -- do not attempt any file write. + +2. **Compact return (always).** RETURN compact JSON to the parent with ONLY merge-tier fields per finding: + title, severity, file, line, confidence, autofix_class, owner, requires_verification, pre_existing, suggested_fix. + Do NOT include why_it_matters or evidence in the returned JSON. + Include reviewer, residual_risks, and testing_gaps at the top level. + +The full file preserves detail for downstream consumers (headless output, debugging). +The compact return keeps the orchestrator's context lean for merge and synthesis. + +The schema below describes the **full artifact file format** (all fields required). For the compact return, follow the field list above -- omit why_it_matters and evidence even though the schema marks them as required. {schema} @@ -42,9 +58,9 @@ False-positive categories to actively suppress: Rules: - You are a leaf reviewer inside an already-running compound-engineering review workflow. Do not invoke compound-engineering skills or agents unless this template explicitly instructs you to. Perform your analysis directly and return findings in the required output format only. -- Every finding MUST include at least one evidence item grounded in the actual code. +- Every finding in the full artifact file MUST include at least one evidence item grounded in the actual code. The compact return omits evidence -- the evidence requirement applies to the disk artifact only. - Set pre_existing to true ONLY for issues in unchanged code that are unrelated to this diff. If the diff makes the issue newly relevant, it is NOT pre-existing. -- You are operationally read-only. You may use non-mutating inspection commands, including read-oriented `git` / `gh` commands, to gather evidence. Do not edit files, change branches, commit, push, create PRs, or otherwise mutate the checkout or repository state. +- You are operationally read-only. The one permitted exception is writing your full analysis to the `.context/` artifact path when a run ID is provided. You may also use non-mutating inspection commands, including read-oriented `git` / `gh` commands, to gather evidence. Do not edit project files, change branches, commit, push, create PRs, or otherwise mutate the checkout or repository state. - Set `autofix_class` accurately -- not every finding is `advisory`. Use this decision guide: - `safe_auto`: The fix is local and deterministic — the fixer can apply it mechanically without design judgment. Examples: extracting a duplicated helper, adding a missing nil/null check, fixing an off-by-one, adding a missing test for an untested code path, removing dead code. - `gated_auto`: A concrete fix exists but it changes contracts, permissions, or crosses a module boundary in a way that deserves explicit approval. Examples: adding authentication to an unprotected endpoint, changing a public API response shape, switching from soft-delete to hard-delete. @@ -63,6 +79,9 @@ Rules: +Run ID: {run_id} +Reviewer name: {reviewer_name} + Intent: {intent_summary} Changed files: {file_list} @@ -83,3 +102,5 @@ Diff: | `{pr_metadata}` | Stage 1 output | PR title, body, and URL when reviewing a PR. Empty string when reviewing a branch or standalone checkout | | `{file_list}` | Stage 1 output | List of changed files from the scope step | | `{diff}` | Stage 1 output | The actual diff content to review | +| `{run_id}` | Stage 4 output | Unique review run identifier for the artifact directory | +| `{reviewer_name}` | Stage 3 output | Persona or agent name used as the artifact filename stem | diff --git a/plugins/compound-engineering/skills/resolve-pr-feedback/SKILL.md b/plugins/compound-engineering/skills/resolve-pr-feedback/SKILL.md index 8d49538..e79b973 100644 --- a/plugins/compound-engineering/skills/resolve-pr-feedback/SKILL.md +++ b/plugins/compound-engineering/skills/resolve-pr-feedback/SKILL.md @@ -92,7 +92,7 @@ If the gate does not fire, proceed to step 4. The common case (first review roun 1. **Assign concern categories** from this fixed list: `error-handling`, `validation`, `type-safety`, `naming`, `performance`, `testing`, `security`, `documentation`, `style`, `architecture`, `other`. Each item (new and previously-resolved) gets exactly one category based on what the feedback is about. -2. **Group by category + spatial proximity**. Two items form a potential cluster when they share a concern category AND are spatially proximate (same file, or files in the same directory subtree). Clusters can span new and previously-resolved threads. +2. **Group by category + spatial proximity**. Form groups from all categorized items -- new and previously-resolved together, not new items only. Two items form a potential cluster when they share a concern category AND are spatially proximate (same file, or files in the same directory subtree). | Thematic match | Spatial proximity | Action | |---|---|---|