feat(ce-review): improve signal-to-noise with confidence rubric, FP suppression, and intent verification (#434)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:34:59 -07:00
parent ae69680e95
commit 03f5aa65b0
5 changed files with 122 additions and 17 deletions
--- a/plugins/compound-engineering/skills/ce-review/references/findings-schema.json
+++ b/plugins/compound-engineering/skills/ce-review/references/findings-schema.json
@@ -102,9 +102,10 @@

  "_meta": {
    "confidence_thresholds": {
-      "suppress": "Below 0.60 -- do not report. Finding is speculative noise.",
-      "flag": "0.60-0.69 -- include only when the persona's calibration says the issue is actionable at that confidence.",
-      "report": "0.70+ -- report with full confidence."
+      "suppress": "Below 0.60 -- do not report. Finding is speculative noise. Exception: P0 findings at 0.50+ may be reported.",
+      "flag": "0.60-0.69 -- include only when the issue is clearly actionable with concrete evidence.",
+      "confident": "0.70-0.84 -- real and important. Report with full evidence.",
+      "certain": "0.85-1.00 -- verifiable from the code alone. Report."
    },
    "severity_definitions": {
      "P0": "Critical breakage, exploitable vulnerability, data loss/corruption. Must fix before merge.",
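
Reviewer note: the revised `confidence_thresholds` can be read as a small gating function. The sketch below is one possible interpretation, not code from this commit. The function names and the `actionable` flag are hypothetical. `actionable` stands in for the reviewer's judgment that an issue is "clearly actionable with concrete evidence". The sketch also treats the P0 exception ("may be reported") as always allowed, which is an assumption.

```python
from typing import Literal

Band = Literal["suppress", "flag", "confident", "certain"]


def confidence_band(confidence: float) -> Band:
    """Map a 0.0-1.0 confidence score to the band names used in _meta."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= 0.85:
        return "certain"
    if confidence >= 0.70:
        return "confident"
    if confidence >= 0.60:
        return "flag"
    return "suppress"


def should_report(confidence: float, severity: str, actionable: bool = False) -> bool:
    """Decide whether a finding survives the confidence gate.

    Assumption: a P0 finding at 0.50+ is always let through, although the
    schema only says it "may" be reported.
    """
    band = confidence_band(confidence)
    if band == "suppress":
        return severity == "P0" and confidence >= 0.50
    if band == "flag":
        # Hypothetical flag: the reviewer judged the issue clearly actionable.
        return actionable
    return True
```

Under these assumptions, `should_report(0.55, "P0")` is true, while a P0 finding at 0.45 is still suppressed.
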
--- a/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md
+++ b/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md
@@ -1,6 +1,6 @@
 # Persona Catalog

-15 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review.
+16 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review.

 ## Always-on (4 personas + 2 CE agents)

@@ -22,7 +22,7 @@ Spawned on every review regardless of diff content.
 | `compound-engineering:review:agent-native-reviewer` | Verify new features are agent-accessible |
 | `compound-engineering:research:learnings-researcher` | Search docs/solutions/ for past issues related to this PR's modules and patterns |

-## Conditional (6 personas)
+## Conditional (7 personas)

 Spawned when the orchestrator identifies relevant patterns in the diff. The orchestrator reads the full diff and reasons about selection -- this is agent judgment, not keyword matching.

@@ -34,6 +34,7 @@ Spawned when the orchestrator identifies relevant patterns in the diff. The orch
 | `data-migrations` | `compound-engineering:review:data-migrations-reviewer` | Migration files, schema changes, backfill scripts, data transformations |
 | `reliability` | `compound-engineering:review:reliability-reviewer` | Error handling, retry logic, circuit breakers, timeouts, background jobs, async handlers, health checks |
 | `adversarial` | `compound-engineering:review:adversarial-reviewer` | Diff has >=50 changed non-test, non-generated, non-lockfile lines, OR touches auth, payments, data mutations, external API integrations, or other high-risk domains |
+| `previous-comments` | `compound-engineering:review:previous-comments-reviewer` | Reviewing a PR that has existing review comments or review threads from prior review rounds |

 ## Stack-Specific Conditional (5 personas)

--- a/plugins/compound-engineering/skills/ce-review/references/subagent-template.md
+++ b/plugins/compound-engineering/skills/ce-review/references/subagent-template.md
@@ -22,8 +22,25 @@ Return ONLY valid JSON matching the findings schema below. No prose, no markdown

 {schema}

+Confidence rubric (0.0-1.0 scale):
+- 0.00-0.29: Not confident / likely false positive. Do not report.
+- 0.30-0.49: Somewhat confident. Do not report -- too speculative for actionable review.
+- 0.50-0.59: Moderately confident. Real but uncertain. Do not report unless P0 severity.
+- 0.60-0.69: Confident enough to flag. Include only when the issue is clearly actionable.
+- 0.70-0.84: Highly confident. Real and important. Report with full evidence.
+- 0.85-1.00: Certain. Verifiable from the code alone. Report.
+
+Suppress threshold: 0.60. Do not emit findings below 0.60 confidence (except P0 at 0.50+).
+
+False-positive categories to actively suppress:
+- Pre-existing issues unrelated to this diff (mark pre_existing: true for unchanged code the diff does not interact with; if the diff makes it newly relevant, it is secondary, not pre-existing)
+- Pedantic style nitpicks that a linter/formatter would catch
+- Code that looks wrong but is intentional (check comments, commit messages, PR description for intent)
+- Issues already handled elsewhere in the codebase (check callers, guards, middleware)
+- Suggestions that restate what the code already does in different words
+- Generic "consider adding" advice without a concrete failure mode
+
 Rules:
-- Suppress any finding below your stated confidence floor (see your Confidence calibration section).
 - Every finding MUST include at least one evidence item grounded in the actual code.
 - Set pre_existing to true ONLY for issues in unchanged code that are unrelated to this diff. If the diff makes the issue newly relevant, it is NOT pre-existing.
 - You are operationally read-only. You may use non-mutating inspection commands, including read-oriented `git` / `gh` commands, to gather evidence. Do not edit files, change branches, commit, push, create PRs, or otherwise mutate the checkout or repository state.
@@ -32,8 +49,13 @@ Rules:
 - Set `requires_verification` to true whenever the likely fix needs targeted tests, a focused re-review, or operational validation before it should be trusted.
 - suggested_fix is optional. Only include it when the fix is obvious and correct. A bad suggestion is worse than none.
 - If you find no issues, return an empty findings array. Still populate residual_risks and testing_gaps if applicable.
+- **Intent verification:** Compare the code changes against the stated intent (and PR title/body when available). If the code does something the intent does not describe, or fails to do something the intent promises, flag it as a finding. Mismatches between stated intent and actual code are high-value findings.
 </output-contract>

+<pr-context>
+{pr_metadata}
+</pr-context>
+
 <review-context>
 Intent: {intent_summary}

@@ -52,5 +74,6 @@ Diff:
 | `{diff_scope_rules}` | `references/diff-scope.md` content | Primary/secondary/pre-existing tier rules |
 | `{schema}` | `references/findings-schema.json` content | The JSON schema reviewers must conform to |
 | `{intent_summary}` | Stage 2 output | 2-3 line description of what the change is trying to accomplish |
+| `{pr_metadata}` | Stage 1 output | PR title, body, and URL when reviewing a PR. Empty string when reviewing a branch or standalone checkout |
 | `{file_list}` | Stage 1 output | List of changed files from the scope step |
 | `{diff}` | Stage 1 output | The actual diff content to review |