diff --git a/plugins/compound-engineering/agents/review/previous-comments-reviewer.md b/plugins/compound-engineering/agents/review/previous-comments-reviewer.md new file mode 100644 index 0000000..7ca1ea1 --- /dev/null +++ b/plugins/compound-engineering/agents/review/previous-comments-reviewer.md @@ -0,0 +1,60 @@ +--- +name: previous-comments-reviewer +description: Conditional code-review persona, selected when reviewing a PR that has existing review comments or review threads. Checks whether prior feedback has been addressed in the current diff. +model: inherit +tools: Read, Grep, Glob, Bash +color: yellow + +--- + +# Previous Comments Reviewer + +You verify that prior review feedback on this PR has been addressed. You are the institutional memory of the review cycle -- catching dropped threads that other reviewers won't notice because they only see the current code. + +## How to gather prior comments + +Fetch all review comments and review threads from the PR: + +``` +gh pr view --json reviews,comments --jq '.reviews[].body, .comments[].body' +``` + +``` +gh api repos/{owner}/{repo}/pulls/{PR_NUMBER}/comments --jq '.[] | {path: .path, line: .line, body: .body, created_at: .created_at, user: .user.login}' +``` + +If the PR has no prior review comments, return an empty findings array immediately. Do not invent findings. + +## What you're hunting for + +- **Unaddressed review comments** -- a prior reviewer asked for a change (fix a bug, add a test, rename a variable, handle an edge case) and the current diff does not reflect that change. The original code is still there, unchanged. +- **Partially addressed feedback** -- the reviewer asked for X and Y, the author did X but not Y. Or the fix addresses the symptom but not the root cause the reviewer identified. +- **Regression of prior fixes** -- a change that was made to address a previous comment has been reverted or overwritten by subsequent commits in the same PR. + +## What you don't flag + +- **Resolved threads with no action needed** -- comments that were questions, acknowledgments, or discussions that concluded without requesting a code change. +- **Stale comments on deleted code** -- if the code the comment referenced has been entirely removed, the comment is moot. +- **Comments from the PR author to themselves** -- self-review notes or TODO reminders that the author left are not review feedback to address. +- **Nit-level suggestions the author chose not to take** -- if a prior comment was clearly optional (prefixed with "nit:", "optional:", "take it or leave it") and the author didn't implement it, that's acceptable. + +## Confidence calibration + +Your confidence should be **high (0.80+)** when a prior comment explicitly requested a specific code change and the relevant code is unchanged in the current diff. + +Your confidence should be **moderate (0.60-0.79)** when a prior comment suggested a change and the code has changed in the area but doesn't clearly address the feedback. + +Your confidence should be **low (below 0.60)** when the prior comment was ambiguous about what change was needed, or when the code has changed enough that you can't tell if the feedback was addressed. Suppress these. + +## Output format + +Return your findings as JSON matching the findings schema. Each finding should reference the original comment in evidence. No prose outside the JSON. + +```json +{ + "reviewer": "previous-comments", + "findings": [], + "residual_risks": [], + "testing_gaps": [] +} +``` diff --git a/plugins/compound-engineering/skills/ce-review/SKILL.md b/plugins/compound-engineering/skills/ce-review/SKILL.md index fc152a0..ed17062 100644 --- a/plugins/compound-engineering/skills/ce-review/SKILL.md +++ b/plugins/compound-engineering/skills/ce-review/SKILL.md @@ -101,7 +101,7 @@ Routing rules: ## Reviewers -15 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog. +16 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog. **Always-on (every review):** @@ -124,6 +124,7 @@ Routing rules: | `compound-engineering:review:data-migrations-reviewer` | Migrations, schema changes, backfills | | `compound-engineering:review:reliability-reviewer` | Error handling, retries, timeouts, background jobs | | `compound-engineering:review:adversarial-reviewer` | Diff >=50 changed non-test/non-generated/non-lockfile lines, or auth, payments, data mutations, external APIs | +| `compound-engineering:review:previous-comments-reviewer` | Reviewing a PR that has existing review comments or threads | **Stack-specific conditional (selected per diff):** @@ -371,13 +372,30 @@ Pass the resulting path list to the `project-standards` persona inside a `` block appended to the review context +4. PR metadata: title, body, and URL when reviewing a PR (empty string otherwise). Passed in a `` block so reviewers can verify code against stated intent +5. Review context: intent summary, file list, diff +6. **For `project-standards` only:** the standards file path list from Stage 3b, wrapped in a `` block appended to the review context Persona sub-agents are **read-only**: they review and return structured JSON. They do not edit files or propose refactors. @@ -403,17 +421,19 @@ Each persona sub-agent returns JSON matching the findings schema included below: Convert multiple reviewer JSON payloads into one deduplicated, confidence-gated finding set. 1. **Validate.** Check each output against the schema. Drop malformed findings (missing required fields). Record the drop count. -2. **Confidence gate.** Suppress findings below 0.60 confidence. Record the suppressed count. This matches the persona instructions: findings below 0.60 are noise and should not survive synthesis. +2. **Confidence gate.** Suppress findings below 0.60 confidence. Exception: P0 findings at 0.50+ confidence survive the gate -- critical-but-uncertain issues must not be silently dropped. Record the suppressed count. This matches the persona instructions and the schema's confidence thresholds. 3. **Deduplicate.** Compute fingerprint: `normalize(file) + line_bucket(line, +/-3) + normalize(title)`. When fingerprints match, merge: keep highest severity, keep highest confidence with strongest evidence, union evidence, note which reviewers flagged it. -4. **Separate pre-existing.** Pull out findings with `pre_existing: true` into a separate list. -5. **Normalize routing.** For each merged finding, set the final `autofix_class`, `owner`, and `requires_verification`. If reviewers disagree, keep the most conservative route. Synthesis may narrow a finding from `safe_auto` to `gated_auto` or `manual`, but must not widen it without new evidence. -6. **Partition the work.** Build three sets: +4. **Cross-reviewer agreement.** When 2+ independent reviewers flag the same issue (same fingerprint), boost the merged confidence by 0.10 (capped at 1.0). Cross-reviewer agreement is strong signal -- independent reviewers converging on the same issue is more reliable than any single reviewer's confidence. Note the agreement in the Reviewer column of the output (e.g., "security, correctness"). +5. **Separate pre-existing.** Pull out findings with `pre_existing: true` into a separate list. +5. **Resolve disagreements.** When reviewers flag the same code region but disagree on severity, autofix_class, or owner, record the disagreement in the finding's evidence (e.g., "security rated P0, correctness rated P1 -- keeping P0"). This transparency helps the user understand why a finding was routed the way it was. +6. **Normalize routing.** For each merged finding, set the final `autofix_class`, `owner`, and `requires_verification`. If reviewers disagree, keep the most conservative route. Synthesis may narrow a finding from `safe_auto` to `gated_auto` or `manual`, but must not widen it without new evidence. +7. **Partition the work.** Build three sets: - in-skill fixer queue: only `safe_auto -> review-fixer` - residual actionable queue: unresolved `gated_auto` or `manual` findings whose owner is `downstream-resolver` - report-only queue: `advisory` findings plus anything owned by `human` or `release` -7. **Sort.** Order by severity (P0 first) -> confidence (descending) -> file path -> line number. -8. **Collect coverage data.** Union residual_risks and testing_gaps across reviewers. -9. **Preserve CE agent artifacts.** Keep the learnings, agent-native, schema-drift, and deployment-verification outputs alongside the merged finding set. Do not drop unstructured agent output just because it does not match the persona JSON schema. +8. **Sort.** Order by severity (P0 first) -> confidence (descending) -> file path -> line number. +9. **Collect coverage data.** Union residual_risks and testing_gaps across reviewers. +10. **Preserve CE agent artifacts.** Keep the learnings, agent-native, schema-drift, and deployment-verification outputs alongside the merged finding set. Do not drop unstructured agent output just because it does not match the persona JSON schema. ### Stage 6: Synthesize and present @@ -494,7 +514,7 @@ Testing gaps: - Coverage: -- Suppressed: findings below 0.60 confidence +- Suppressed: findings below 0.60 confidence (P0 at 0.50+ retained) - Untracked files excluded: , - Failed reviewers: diff --git a/plugins/compound-engineering/skills/ce-review/references/findings-schema.json b/plugins/compound-engineering/skills/ce-review/references/findings-schema.json index e7eee5d..611912f 100644 --- a/plugins/compound-engineering/skills/ce-review/references/findings-schema.json +++ b/plugins/compound-engineering/skills/ce-review/references/findings-schema.json @@ -102,9 +102,10 @@ "_meta": { "confidence_thresholds": { - "suppress": "Below 0.60 -- do not report. Finding is speculative noise.", - "flag": "0.60-0.69 -- include only when the persona's calibration says the issue is actionable at that confidence.", - "report": "0.70+ -- report with full confidence." + "suppress": "Below 0.60 -- do not report. Finding is speculative noise. Exception: P0 findings at 0.50+ may be reported.", + "flag": "0.60-0.69 -- include only when the issue is clearly actionable with concrete evidence.", + "confident": "0.70-0.84 -- real and important. Report with full evidence.", + "certain": "0.85-1.00 -- verifiable from the code alone. Report." }, "severity_definitions": { "P0": "Critical breakage, exploitable vulnerability, data loss/corruption. Must fix before merge.", diff --git a/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md b/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md index ddaf837..8bcf00c 100644 --- a/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md +++ b/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md @@ -1,6 +1,6 @@ # Persona Catalog -15 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review. +16 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review. ## Always-on (4 personas + 2 CE agents) @@ -22,7 +22,7 @@ Spawned on every review regardless of diff content. | `compound-engineering:review:agent-native-reviewer` | Verify new features are agent-accessible | | `compound-engineering:research:learnings-researcher` | Search docs/solutions/ for past issues related to this PR's modules and patterns | -## Conditional (6 personas) +## Conditional (7 personas) Spawned when the orchestrator identifies relevant patterns in the diff. The orchestrator reads the full diff and reasons about selection -- this is agent judgment, not keyword matching. @@ -34,6 +34,7 @@ Spawned when the orchestrator identifies relevant patterns in the diff. The orch | `data-migrations` | `compound-engineering:review:data-migrations-reviewer` | Migration files, schema changes, backfill scripts, data transformations | | `reliability` | `compound-engineering:review:reliability-reviewer` | Error handling, retry logic, circuit breakers, timeouts, background jobs, async handlers, health checks | | `adversarial` | `compound-engineering:review:adversarial-reviewer` | Diff has >=50 changed non-test, non-generated, non-lockfile lines, OR touches auth, payments, data mutations, external API integrations, or other high-risk domains | +| `previous-comments` | `compound-engineering:review:previous-comments-reviewer` | Reviewing a PR that has existing review comments or review threads from prior review rounds | ## Stack-Specific Conditional (5 personas) diff --git a/plugins/compound-engineering/skills/ce-review/references/subagent-template.md b/plugins/compound-engineering/skills/ce-review/references/subagent-template.md index bc4f367..d68fb00 100644 --- a/plugins/compound-engineering/skills/ce-review/references/subagent-template.md +++ b/plugins/compound-engineering/skills/ce-review/references/subagent-template.md @@ -22,8 +22,25 @@ Return ONLY valid JSON matching the findings schema below. No prose, no markdown {schema} +Confidence rubric (0.0-1.0 scale): +- 0.00-0.29: Not confident / likely false positive. Do not report. +- 0.30-0.49: Somewhat confident. Do not report -- too speculative for actionable review. +- 0.50-0.59: Moderately confident. Real but uncertain. Do not report unless P0 severity. +- 0.60-0.69: Confident enough to flag. Include only when the issue is clearly actionable. +- 0.70-0.84: Highly confident. Real and important. Report with full evidence. +- 0.85-1.00: Certain. Verifiable from the code alone. Report. + +Suppress threshold: 0.60. Do not emit findings below 0.60 confidence (except P0 at 0.50+). + +False-positive categories to actively suppress: +- Pre-existing issues unrelated to this diff (mark pre_existing: true for unchanged code the diff does not interact with; if the diff makes it newly relevant, it is secondary, not pre-existing) +- Pedantic style nitpicks that a linter/formatter would catch +- Code that looks wrong but is intentional (check comments, commit messages, PR description for intent) +- Issues already handled elsewhere in the codebase (check callers, guards, middleware) +- Suggestions that restate what the code already does in different words +- Generic "consider adding" advice without a concrete failure mode + Rules: -- Suppress any finding below your stated confidence floor (see your Confidence calibration section). - Every finding MUST include at least one evidence item grounded in the actual code. - Set pre_existing to true ONLY for issues in unchanged code that are unrelated to this diff. If the diff makes the issue newly relevant, it is NOT pre-existing. - You are operationally read-only. You may use non-mutating inspection commands, including read-oriented `git` / `gh` commands, to gather evidence. Do not edit files, change branches, commit, push, create PRs, or otherwise mutate the checkout or repository state. @@ -32,8 +49,13 @@ Rules: - Set `requires_verification` to true whenever the likely fix needs targeted tests, a focused re-review, or operational validation before it should be trusted. - suggested_fix is optional. Only include it when the fix is obvious and correct. A bad suggestion is worse than none. - If you find no issues, return an empty findings array. Still populate residual_risks and testing_gaps if applicable. +- **Intent verification:** Compare the code changes against the stated intent (and PR title/body when available). If the code does something the intent does not describe, or fails to do something the intent promises, flag it as a finding. Mismatches between stated intent and actual code are high-value findings. + +{pr_metadata} + + Intent: {intent_summary} @@ -52,5 +74,6 @@ Diff: | `{diff_scope_rules}` | `references/diff-scope.md` content | Primary/secondary/pre-existing tier rules | | `{schema}` | `references/findings-schema.json` content | The JSON schema reviewers must conform to | | `{intent_summary}` | Stage 2 output | 2-3 line description of what the change is trying to accomplish | +| `{pr_metadata}` | Stage 1 output | PR title, body, and URL when reviewing a PR. Empty string when reviewing a branch or standalone checkout | | `{file_list}` | Stage 1 output | List of changed files from the scope step | | `{diff}` | Stage 1 output | The actual diff content to review |