feat(ce-review): improve signal-to-noise with confidence rubric, FP suppression, and intent verification (#434)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Trevin Chow
2026-03-29 00:34:59 -07:00
committed by GitHub
parent ae69680e95
commit 03f5aa65b0
5 changed files with 122 additions and 17 deletions

View File

@@ -101,7 +101,7 @@ Routing rules:
## Reviewers
15 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog.
16 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog.
**Always-on (every review):**
@@ -124,6 +124,7 @@ Routing rules:
| `compound-engineering:review:data-migrations-reviewer` | Migrations, schema changes, backfills |
| `compound-engineering:review:reliability-reviewer` | Error handling, retries, timeouts, background jobs |
| `compound-engineering:review:adversarial-reviewer` | Diff >=50 changed non-test/non-generated/non-lockfile lines, or auth, payments, data mutations, external APIs |
| `compound-engineering:review:previous-comments-reviewer` | Reviewing a PR that has existing review comments or threads |
**Stack-specific conditional (selected per diff):**
@@ -371,13 +372,30 @@ Pass the resulting path list to the `project-standards` persona inside a `<stand
### Stage 4: Spawn sub-agents
#### Model tiering
Persona sub-agents do focused, scoped work and should use cheaper/faster models to reduce cost and latency. The orchestrator itself stays on the default (most capable) model.
| Platform | Persona agent model | How to set |
|----------|-------------------|------------|
| **Claude Code** | Haiku | Pass `model: "haiku"` in the Agent tool call |
| **Codex** | GPT-4o mini | Set the model parameter to `gpt-4o-mini` in the agent spawn config |
| **Other platforms** | Cheapest capable model available | Use the platform's equivalent of a fast/cheap tier |
CE always-on agents (agent-native-reviewer, learnings-researcher) and CE conditional agents (schema-drift-detector, deployment-verification-agent) also use the cheaper model tier since they perform scoped, focused work.
The orchestrator (this skill) stays on the default model because it handles intent discovery, reviewer selection, finding merge/dedup, and synthesis -- tasks that benefit from stronger reasoning.
#### Spawning
Spawn each selected persona reviewer as a parallel sub-agent using the subagent template included below. Each persona sub-agent receives:
1. Their persona file content (identity, failure modes, calibration, suppress conditions)
2. Shared diff-scope rules from the diff-scope reference included below
3. The JSON output contract from the findings schema included below
4. Review context: intent summary, file list, diff
5. **For `project-standards` only:** the standards file path list from Stage 3b, wrapped in a `<standards-paths>` block appended to the review context
4. PR metadata: title, body, and URL when reviewing a PR (empty string otherwise). Passed in a `<pr-context>` block so reviewers can verify code against stated intent
5. Review context: intent summary, file list, diff
6. **For `project-standards` only:** the standards file path list from Stage 3b, wrapped in a `<standards-paths>` block appended to the review context
Persona sub-agents are **read-only**: they review and return structured JSON. They do not edit files or propose refactors.
@@ -403,17 +421,19 @@ Each persona sub-agent returns JSON matching the findings schema included below:
Convert multiple reviewer JSON payloads into one deduplicated, confidence-gated finding set.
1. **Validate.** Check each output against the schema. Drop malformed findings (missing required fields). Record the drop count.
2. **Confidence gate.** Suppress findings below 0.60 confidence. Record the suppressed count. This matches the persona instructions: findings below 0.60 are noise and should not survive synthesis.
2. **Confidence gate.** Suppress findings below 0.60 confidence. Exception: P0 findings at 0.50+ confidence survive the gate -- critical-but-uncertain issues must not be silently dropped. Record the suppressed count. This matches the persona instructions and the schema's confidence thresholds.
3. **Deduplicate.** Compute fingerprint: `normalize(file) + line_bucket(line, +/-3) + normalize(title)`. When fingerprints match, merge: keep highest severity, keep highest confidence with strongest evidence, union evidence, note which reviewers flagged it.
4. **Separate pre-existing.** Pull out findings with `pre_existing: true` into a separate list.
5. **Normalize routing.** For each merged finding, set the final `autofix_class`, `owner`, and `requires_verification`. If reviewers disagree, keep the most conservative route. Synthesis may narrow a finding from `safe_auto` to `gated_auto` or `manual`, but must not widen it without new evidence.
6. **Partition the work.** Build three sets:
4. **Cross-reviewer agreement.** When 2+ independent reviewers flag the same issue (same fingerprint), boost the merged confidence by 0.10 (capped at 1.0). Cross-reviewer agreement is strong signal -- independent reviewers converging on the same issue is more reliable than any single reviewer's confidence. Note the agreement in the Reviewer column of the output (e.g., "security, correctness").
5. **Separate pre-existing.** Pull out findings with `pre_existing: true` into a separate list.
5. **Resolve disagreements.** When reviewers flag the same code region but disagree on severity, autofix_class, or owner, record the disagreement in the finding's evidence (e.g., "security rated P0, correctness rated P1 -- keeping P0"). This transparency helps the user understand why a finding was routed the way it was.
6. **Normalize routing.** For each merged finding, set the final `autofix_class`, `owner`, and `requires_verification`. If reviewers disagree, keep the most conservative route. Synthesis may narrow a finding from `safe_auto` to `gated_auto` or `manual`, but must not widen it without new evidence.
7. **Partition the work.** Build three sets:
- in-skill fixer queue: only `safe_auto -> review-fixer`
- residual actionable queue: unresolved `gated_auto` or `manual` findings whose owner is `downstream-resolver`
- report-only queue: `advisory` findings plus anything owned by `human` or `release`
7. **Sort.** Order by severity (P0 first) -> confidence (descending) -> file path -> line number.
8. **Collect coverage data.** Union residual_risks and testing_gaps across reviewers.
9. **Preserve CE agent artifacts.** Keep the learnings, agent-native, schema-drift, and deployment-verification outputs alongside the merged finding set. Do not drop unstructured agent output just because it does not match the persona JSON schema.
8. **Sort.** Order by severity (P0 first) -> confidence (descending) -> file path -> line number.
9. **Collect coverage data.** Union residual_risks and testing_gaps across reviewers.
10. **Preserve CE agent artifacts.** Keep the learnings, agent-native, schema-drift, and deployment-verification outputs alongside the merged finding set. Do not drop unstructured agent output just because it does not match the persona JSON schema.
### Stage 6: Synthesize and present
@@ -494,7 +514,7 @@ Testing gaps:
- <gap>
Coverage:
- Suppressed: <N> findings below 0.60 confidence
- Suppressed: <N> findings below 0.60 confidence (P0 at 0.50+ retained)
- Untracked files excluded: <file1>, <file2>
- Failed reviewers: <reviewer>