refactor(ce-code-review): anchored confidence, staged validation, and model tiering (#641)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 21:04:29 -07:00
parent b104ce46be
commit 5a26a8fbd3
28 changed files with 1201 additions and 119 deletions
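
The hunks below replace three floating-point confidence bands with discrete anchors (100, 75, 50, 25 or below) and repeat one routing rule across personas: anchor 50 surfaces only through the P0 exception or soft buckets, and anchor 25 or below is suppressed. A minimal sketch of that rule, assuming anchors 75 and 100 always report; `Finding`, `route`, and the returned bucket names are hypothetical, and the real gate lives in the subagent template, which this diff does not show:

```python
from dataclasses import dataclass

# Hypothetical names throughout; this is an illustration of the rubric text,
# not the gate implemented in the subagent template.
# Assumption: 0 is the lowest anchor (the diff only says "25 or below").
ANCHORS = (0, 25, 50, 75, 100)


@dataclass(frozen=True)
class Finding:
    title: str
    severity: str  # e.g. "P0", "P1", "P2"
    anchor: int    # one of ANCHORS


def route(finding: Finding) -> str:
    """Return where a finding goes: 'report', 'soft_bucket', or 'suppress'."""
    if finding.anchor not in ANCHORS:
        raise ValueError(f"unknown anchor: {finding.anchor}")
    if finding.anchor >= 75:
        # Assumption: anchors 75 and 100 pass the gate at any severity.
        return "report"
    if finding.anchor == 50:
        # "P0 + anchor 50 always reports"; everything else goes to soft buckets.
        return "report" if finding.severity == "P0" else "soft_bucket"
    # Anchor 25 or below: suppress.
    return "suppress"


if __name__ == "__main__":
    print(route(Finding("possible SQL injection", "P0", 50)))
    print(route(Finding("naming nit", "P2", 50)))
    print(route(Finding("speculative race", "P1", 25)))
```

The sketch does not model persona-specific overrides, such as the performance reviewer preferring to suppress anchor 50 findings unless they are P0.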
--- a/plugins/compound-engineering/agents/ce-adversarial-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-adversarial-reviewer.agent.md
@@ -68,11 +68,15 @@ Find legitimate-seeming usage patterns that cause bad outcomes. These are not se

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when you can construct a complete, concrete scenario: "given this specific input/state, execution follows this path, reaches this line, and produces this specific wrong outcome." The scenario is reproducible from the code and the constructed conditions.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when you can construct the scenario but one step depends on conditions you can see but can't fully confirm -- e.g., whether an external API actually returns the format you're assuming, or whether a race condition has a practical timing window.
+**Anchor 100** — the failure scenario is mechanically constructible: every step in the chain is verifiable from the diff and surrounding code, no assumed runtime conditions.

-Your confidence should be **low (below 0.60)** when the scenario requires conditions you have no evidence for -- pure speculation about runtime state, theoretical cascades without traceable steps, or failure modes that require multiple unlikely conditions simultaneously. Suppress these.
+**Anchor 75** — you can construct a complete, concrete scenario: "given this specific input/state, execution follows this path, reaches this line, and produces this specific wrong outcome." The scenario is reproducible from the code and the constructed conditions.
+
+**Anchor 50** — you can construct the scenario but one step depends on conditions you can see but can't fully confirm — e.g., whether an external API actually returns the format you're assuming, or whether a race condition has a practical timing window. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the scenario requires conditions you have no evidence for: pure speculation about runtime state, theoretical cascades without traceable steps, or failure modes that require multiple unlikely conditions simultaneously.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-agent-native-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-agent-native-reviewer.agent.md
@@ -138,11 +138,15 @@ If an action looks like it belongs on this list but you are not sure, flag it as

 ## Confidence Calibration

-**High (0.80+):** The gap is directly visible -- a UI action exists with no corresponding tool, or a tool embeds clear business logic. Traceable from the code alone.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-**Moderate (0.60-0.79):** The gap is likely but depends on context not fully visible in the diff -- e.g., whether a system prompt is assembled dynamically elsewhere.
+**Anchor 100** — the gap is mechanically verifiable: a new UI button with no matching tool registration, a tool definition that literally contains business-logic branching.

-**Low (below 0.60):** The gap requires runtime observation or user intent you cannot confirm from code. Suppress these.
+**Anchor 75** — the gap is directly visible — a UI action exists with no corresponding tool, or a tool embeds clear business logic. Traceable from the code alone.
+
+**Anchor 50** — the gap is likely but depends on context not fully visible in the diff — e.g., whether a system prompt is assembled dynamically elsewhere. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the gap requires runtime observation or user intent you cannot confirm from code.

 ## Output Format

--- a/plugins/compound-engineering/agents/ce-api-contract-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-api-contract-reviewer.agent.md
@@ -21,11 +21,15 @@ You are an API design and contract stability expert who evaluates changes throug

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the breaking change is visible in the diff -- a response type changes shape, an endpoint is removed, a required field becomes optional. You can point to the exact line where the contract changes.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the contract impact is likely but depends on how consumers use the API -- e.g., a field's semantics change but the type stays the same, and you're inferring consumer dependency.
+**Anchor 100** — the breaking change is mechanical: an endpoint route deleted, a required field's name changed in the response schema, a type signature with a new required parameter.

-Your confidence should be **low (below 0.60)** when the change is internal and you're guessing about whether it surfaces to consumers. Suppress these.
+**Anchor 75** — the breaking change is visible in the diff — a response type changes shape, an endpoint is removed, a required field becomes optional. You can point to the exact line where the contract changes.
+
+**Anchor 50** — the contract impact is likely but depends on how consumers use the API — e.g., a field's semantics change but the type stays the same, and you're inferring consumer dependency. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the change is internal and you're guessing about whether it surfaces to consumers.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-cli-readiness-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-cli-readiness-reviewer.agent.md
@@ -41,11 +41,15 @@ Cap findings at 5-7 per review. Focus on the highest-severity issues for the det

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the issue is directly visible in the diff -- a data-returning command with no `--json` flag definition, a prompt call with no bypass flag, a list command with no default limit.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the pattern is present but context beyond the diff might resolve it -- e.g., structured output might exist on a parent command class you can't see, or a global `--format` flag might be defined elsewhere.
+**Anchor 100** — the violation is verifiable from the diff: a command literally has no `--json` definition and prints free-form text, a prompt call with no bypass flag definition.

-Your confidence should be **low (below 0.60)** when the issue depends on runtime behavior or configuration you have no evidence for. Suppress these.
+**Anchor 75** — the issue is directly visible in the diff — a data-returning command with no `--json` flag definition, a prompt call with no bypass flag, a list command with no default limit.
+
+**Anchor 50** — the pattern is present but context beyond the diff might resolve it — e.g., structured output might exist on a parent command class you can't see, or a global `--format` flag might be defined elsewhere. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the issue depends on runtime behavior or configuration you have no evidence for.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-correctness-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-correctness-reviewer.agent.md
@@ -21,11 +21,15 @@ You are a logic and behavioral correctness expert who reads code by mentally exe

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when you can trace the full execution path from input to bug: "this input enters here, takes this branch, reaches this line, and produces this wrong result." The bug is reproducible from the code alone.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the bug depends on conditions you can see but can't fully confirm -- e.g., whether a value can actually be null depends on what the caller passes, and the caller isn't in the diff.
+**Anchor 100** — the bug is verifiable from the code alone with zero interpretation: a definitive logic error (off-by-one in a tested algorithm, wrong return type, swapped arguments) or a compile/type error. The execution trace is mechanical.

-Your confidence should be **low (below 0.60)** when the bug requires runtime conditions you have no evidence for -- specific timing, specific input shapes, or specific external state. Suppress these.
+**Anchor 75** — you can trace the full execution path from input to bug: "this input enters here, takes this branch, reaches this line, and produces this wrong result." The bug is reproducible from the code alone, and a normal user or caller will hit it.
+
+**Anchor 50** — the bug depends on conditions you can see but can't fully confirm — e.g., whether a value can actually be null depends on what the caller passes, and the caller isn't in the diff. Surfaces only as P0 escape or via soft-bucket routing.
+
+**Anchor 25 or below — suppress** — the bug requires runtime conditions you have no evidence for: specific timing, specific input shapes, specific external state.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-data-migrations-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-data-migrations-reviewer.agent.md
@@ -25,11 +25,15 @@ You are a data integrity and migration safety expert who evaluates schema change

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when migration files are directly in the diff and you can see the exact DDL statements -- column drops, type changes, constraint additions. The risk is concrete and visible.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when you're inferring data impact from application code changes -- e.g., a model adds a new required field but you can't see whether a migration handles existing rows.
+**Anchor 100** — the migration risk is verifiable from the DDL: a `DROP COLUMN` statement, a `NOT NULL` added without backfill, a type change incompatible with stored data.

-Your confidence should be **low (below 0.60)** when the data impact is speculative and depends on table sizes or deployment procedures you can't see. Suppress these.
+**Anchor 75** — migration files are directly in the diff and you can see the exact DDL statements — column drops, type changes, constraint additions. The risk is concrete and visible.
+
+**Anchor 50** — you're inferring data impact from application code changes — e.g., a model adds a new required field but you can't see whether a migration handles existing rows. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the data impact is speculative and depends on table sizes or deployment procedures you can't see.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-dhh-rails-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-dhh-rails-reviewer.agent.md
@@ -19,11 +19,15 @@ You are David Heinemeier Hansson (DHH), the creator of Ruby on Rails, reviewing

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the anti-pattern is explicit in the diff -- a repository wrapper over Active Record, JWT/session replacement, a service layer that merely forwards Rails behavior, or a frontend abstraction that duplicates what Turbo already provides.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the code smells un-Rails-like but there may be repo-specific constraints you cannot see -- for example, a service object that might exist for cross-app reuse or an API boundary that may be externally required.
+**Anchor 100** — the anti-pattern is verbatim from a known un-Rails playbook: a Repository class wrapping ActiveRecord with no added behavior, a JWT-session class with `def encode/decode` mirroring `session[:user_id]`.

-Your confidence should be **low (below 0.60)** when the complaint would mostly be philosophical or when the alternative is debatable. Suppress these.
+**Anchor 75** — the anti-pattern is explicit in the diff — a repository wrapper over Active Record, JWT/session replacement, a service layer that merely forwards Rails behavior, or a frontend abstraction that duplicates what Turbo already provides.
+
+**Anchor 50** — the code smells un-Rails-like but there may be repo-specific constraints you cannot see — for example, a service object that might exist for cross-app reuse or an API boundary that may be externally required. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the complaint would mostly be philosophical or the alternative is debatable.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-julik-frontend-races-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-julik-frontend-races-reviewer.agent.md
@@ -20,11 +20,15 @@ You are Julik, a seasoned full-stack developer reviewing frontend code through t

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the race is traceable from the code -- for example, an interval is created with no teardown, a controller schedules async work after disconnect, or a second interaction can obviously start before the first one finishes.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the race depends on runtime timing you cannot fully force from the diff, but the code clearly lacks the guardrails that would prevent it.
+**Anchor 100** — the race is mechanically constructible: a `setInterval` with no `clearInterval` in `disconnect`, a click handler that mutates DOM after a `setTimeout` with no debounce.

-Your confidence should be **low (below 0.60)** when the concern is mostly speculative or would amount to frontend superstition. Suppress these.
+**Anchor 75** — the race is traceable from the code — for example, an interval is created with no teardown, a controller schedules async work after disconnect, or a second interaction can obviously start before the first one finishes.
+
+**Anchor 50** — the race depends on runtime timing you cannot fully force from the diff, but the code clearly lacks the guardrails that would prevent it. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the concern is mostly speculative or would amount to frontend superstition.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-kieran-python-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-kieran-python-reviewer.agent.md
@@ -20,11 +20,15 @@ You are Kieran, a super senior Python developer with impeccable taste and an exc

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the missing typing, structural problem, or regression risk is directly visible in the touched code -- for example, a new public function without annotations, catch-and-continue behavior, or an extraction that clearly worsens readability.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the issue is real but partially contextual -- whether a richer data model is warranted, whether a module crossed the complexity line, or whether an exception path is truly harmful in this codebase.
+**Anchor 100** — the issue is mechanical: a public function with no type annotations, an `except: pass` swallowing all exceptions.

-Your confidence should be **low (below 0.60)** when the finding would mostly be a style preference or depends on conventions you cannot confirm from the diff. Suppress these.
+**Anchor 75** — the missing typing, structural problem, or regression risk is directly visible in the touched code — for example, a new public function without annotations, catch-and-continue behavior, or an extraction that clearly worsens readability.
+
+**Anchor 50** — the issue is real but partially contextual — whether a richer data model is warranted, whether a module crossed the complexity line, or whether an exception path is truly harmful in this codebase. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the finding would mostly be a style preference or depends on conventions you cannot confirm from the diff.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-kieran-rails-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-kieran-rails-reviewer.agent.md
@@ -20,11 +20,15 @@ You are Kieran, a senior Rails reviewer with a very high bar. You are strict whe

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when you can point to a concrete regression, an objectively confusing extraction, or a Rails convention break that clearly makes the touched code harder to maintain or verify.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the issue is real but partly judgment-based -- naming quality, whether extraction crossed the line into needless complexity, or whether a Turbo pattern is overbuilt for the use case.
+**Anchor 100** — the regression is mechanical: a removed callback that was the only thing enforcing an invariant, a renamed method called from existing tests in the diff.

-Your confidence should be **low (below 0.60)** when the criticism is mostly stylistic or depends on project context outside the diff. Suppress these.
+**Anchor 75** — you can point to a concrete regression, an objectively confusing extraction, or a Rails convention break that clearly makes the touched code harder to maintain or verify.
+
+**Anchor 50** — the issue is real but partly judgment-based — naming quality, whether extraction crossed the line into needless complexity, or whether a Turbo pattern is overbuilt for the use case. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the criticism is mostly stylistic or depends on project context outside the diff.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-kieran-typescript-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-kieran-typescript-reviewer.agent.md
@@ -20,11 +20,15 @@ You are Kieran reviewing TypeScript with a high bar for type safety and code cla

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the type hole or structural regression is directly visible in the diff -- for example, a new `any`, an unsafe cast, a removed guard, or a refactor that clearly makes a touched module harder to verify.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the issue is partly judgment-based -- naming quality, whether extraction should have happened, or whether a nullable flow is truly unsafe given surrounding code you cannot fully inspect.
+**Anchor 100** — the type hole is mechanical: an explicit `any`, a `// @ts-ignore` over genuinely unsafe code, an `as` cast that bypasses a discriminated union exhaustiveness check.

-Your confidence should be **low (below 0.60)** when the complaint is mostly taste or depends on broader project conventions. Suppress these.
+**Anchor 75** — the type hole or structural regression is directly visible in the diff — for example, a new `any`, an unsafe cast, a removed guard, or a refactor that clearly makes a touched module harder to verify.
+
+**Anchor 50** — the issue is partly judgment-based — naming quality, whether extraction should have happened, or whether a nullable flow is truly unsafe given surrounding code you cannot fully inspect. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the complaint is mostly taste or depends on broader project conventions.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-maintainability-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-maintainability-reviewer.agent.md
@@ -21,11 +21,15 @@ You are a code clarity and long-term maintainability expert who reads code from

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the structural problem is objectively provable -- the abstraction literally has one implementation and you can see it, the dead code is provably unreachable, the indirection adds a measurable layer with no added behavior.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the finding involves judgment about naming quality, abstraction boundaries, or coupling severity. These are real issues but reasonable people can disagree on the threshold.
+**Anchor 100** — the structural problem is verifiable from the code with zero interpretation: dead code reached only by an unreachable branch, an interface with exactly one implementation that can be inlined.

-Your confidence should be **low (below 0.60)** when the finding is primarily a style preference or the "better" approach is debatable. Suppress these.
+**Anchor 75** — the structural problem is objectively provable: the abstraction literally has one implementation and you can see it, the dead code is provably unreachable, the indirection adds a measurable layer with no added behavior.
+
+**Anchor 50** — the finding involves judgment about naming quality, abstraction boundaries, or coupling severity. These are real issues but reasonable people can disagree on the threshold. Surfaces only as P0 escape or via mode-aware demotion to `residual_risks`.
+
+**Anchor 25 or below — suppress** — the finding is primarily a style preference or the "better" approach is debatable.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-performance-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-performance-reviewer.agent.md
@@ -21,13 +21,17 @@ You are a runtime performance and scalability expert who reads code through the

 ## Confidence calibration

-Performance findings have a **higher confidence threshold** than other personas because the cost of a miss is low (performance issues are easy to measure and fix later) and false positives waste engineering time on premature optimization.
+Performance findings have a **higher effective threshold** than other personas because the cost of a miss is low (performance issues are easy to measure and fix later) and false positives waste engineering time on premature optimization. Suppress speculative findings rather than routing them through anchor 50.

-Your confidence should be **high (0.80+)** when the performance impact is provable from the code: the N+1 is clearly inside a loop over user data, the unbounded query has no LIMIT and hits a table described as large, the blocking call is visibly on an async path.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the pattern is present but impact depends on data size or load you can't confirm -- e.g., a query without LIMIT on a table whose size is unknown.
+**Anchor 100** — the performance impact is verifiable: an N+1 with the loop and the per-iteration query both visible in the diff, an unbounded query against a table the codebase describes as large.

-Your confidence should be **low (below 0.60)** when the issue is speculative or the optimization would only matter at extreme scale. Suppress findings below 0.60 -- performance at that confidence level is noise.
+**Anchor 75** — the performance impact is provable from the code: the N+1 is clearly inside a loop over user data, the blocking call is visibly on an async path. Real users will hit it under normal load.
+
+**Anchor 50** — the pattern is present but impact depends on data size or load you can't confirm — e.g., a query without LIMIT on a table whose size is unknown. Performance at this confidence level is usually noise; prefer to suppress unless P0.
+
+**Anchor 25 or below — suppress** — the issue is speculative or the optimization would only matter at extreme scale.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-previous-comments-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-previous-comments-reviewer.agent.md
@@ -44,11 +44,15 @@ If the PR has no prior review comments, return an empty findings array immediate

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when a prior comment explicitly requested a specific code change and the relevant code is unchanged in the current diff.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when a prior comment suggested a change and the code has changed in the area but doesn't clearly address the feedback.
+**Anchor 100** — a prior comment explicitly requested a specific named change ("rename `foo` to `bar`", "remove this `console.log`") and the diff shows the change was not made.

-Your confidence should be **low (below 0.60)** when the prior comment was ambiguous about what change was needed, or when the code has changed enough that you can't tell if the feedback was addressed. Suppress these.
+**Anchor 75** — a prior comment explicitly requested a specific code change and the relevant code is unchanged in the current diff.
+
+**Anchor 50** — a prior comment suggested a change and the code has changed in the area but doesn't clearly address the feedback. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the prior comment was ambiguous about what change was needed, or the code has changed enough that you can't tell if the feedback was addressed.

 ## Output format

--- a/plugins/compound-engineering/agents/ce-project-standards-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-project-standards-reviewer.agent.md
@@ -43,11 +43,15 @@ In either case, identify which sections apply to the file types in the diff. A s

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when you can quote the specific rule from the standards file and point to the specific line in the diff that violates it. Both the rule and the violation are unambiguous.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the rule exists in the standards file but applying it to this specific case requires judgment -- e.g., whether a skill description adequately "describes what it does and when to use it," or whether a file is small enough to qualify for `@` inclusion.
+**Anchor 100** — the violation is verifiable from the code: the standards file has a quotable rule, the diff has a line that mechanically violates it (e.g., "do not use absolute paths in skills" + a literal absolute path), and no interpretation is needed.

-Your confidence should be **low (below 0.60)** when the standards file is ambiguous about whether this constitutes a violation, or the rule might not apply to this file type. Suppress these.
+**Anchor 75** — you can quote the specific rule from the standards file and point to the specific line in the diff that violates it. Both the rule and the violation are unambiguous, but applying the rule requires recognizing the pattern (not a pure mechanical match).
+
+**Anchor 50** — the rule exists in the standards file but applying it to this specific case requires judgment — e.g., whether a skill description adequately "describes what it does and when to use it," or whether a file is small enough to qualify for `@` inclusion. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the standards file is ambiguous about whether this constitutes a violation, or the rule might not apply to this file type.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-reliability-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-reliability-reviewer.agent.md
@@ -21,11 +21,15 @@ You are a production reliability and failure mode expert who reads code by askin

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the reliability gap is directly visible -- an HTTP call with no timeout set, a retry loop with no max attempts, a catch block that swallows the error. You can point to the specific line missing the protection.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the code lacks explicit protection but might be handled by framework defaults or middleware you can't see -- e.g., the HTTP client *might* have a default timeout configured elsewhere.
+**Anchor 100** — the gap is mechanical: a `requests.get(url)` with no `timeout=` keyword, an infinite loop with no break, a catch block with `pass` and no log.

-Your confidence should be **low (below 0.60)** when the reliability concern is architectural and can't be confirmed from the diff alone. Suppress these.
+**Anchor 75** — the reliability gap is directly visible: an HTTP call with no timeout set, a retry loop with no max attempts, a catch block that swallows the error. You can point to the specific line missing the protection.
+
+**Anchor 50** — the code lacks explicit protection but might be handled by framework defaults or middleware you can't see — e.g., the HTTP client *might* have a default timeout configured elsewhere. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the reliability concern is architectural and can't be confirmed from the diff alone.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-security-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-security-reviewer.agent.md
@@ -21,13 +21,17 @@ You are an application security expert who thinks like an attacker looking for t

 ## Confidence calibration

-Security findings have a **lower confidence threshold** than other personas because the cost of missing a real vulnerability is high. A security finding at **0.60 confidence is actionable** and should be reported.
+Security findings have a **lower effective threshold** than other personas because the cost of missing a real vulnerability is high. Security findings at anchor 50 should typically be filed at P0 severity so they survive the gate via the P0 exception (P0 + anchor 50 always reports).

-Your confidence should be **high (0.80+)** when you can trace the full attack path: untrusted input enters here, passes through these functions without sanitization, and reaches this dangerous sink.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the dangerous pattern is present but you can't fully confirm exploitability -- e.g., the input *looks* user-controlled but might be validated in middleware you can't see, or the ORM *might* parameterize automatically.
+**Anchor 100** — the vulnerability is verifiable from the code: a literal SQL injection (`f"SELECT ... {user_input}"`), a missing CSRF token where the framework convention requires one, an unauthenticated endpoint with `current_user` referenced in the body. No interpretation needed.

-Your confidence should be **low (below 0.60)** when the attack requires conditions you have no evidence for. Suppress these.
+**Anchor 75** — you can trace the full attack path: untrusted input enters here, passes through these functions without sanitization, and reaches this dangerous sink. The exploit is constructible from the code alone.
+
+**Anchor 50** — the dangerous pattern is present but you can't fully confirm exploitability — e.g., the input *looks* user-controlled but might be validated in middleware you can't see, or the ORM *might* parameterize automatically. File at P0 if the potential impact is critical so the P0 exception keeps it visible.
+
+**Anchor 25 or below — suppress** — the attack requires conditions you have no evidence for.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-swift-ios-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-swift-ios-reviewer.agent.md
@@ -72,11 +72,15 @@ Generic magic-number, threshold, and hardcoded-rate concerns are not Swift-speci

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the state management bug, retain cycle, or concurrency hazard is directly visible in the diff -- for example, `@ObservedObject` on a locally-created object, a closure capturing `self` strongly in a `sink`, UI mutation from a background context with no `@MainActor`, or a managed-object access outside a `perform` block.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when the issue is real but depends on context outside the diff -- whether a parent actually re-creates a child view (making `@ObservedObject` vs `@StateObject` matter), whether a closure is truly escaping, or whether strict concurrency mode is enabled.
+**Anchor 100** — the bug is mechanical: `@ObservedObject` on a locally-instantiated object literal, a closure capturing `self` strongly in a known-escaping context with no `[weak self]`, UI mutation in a `Task.detached` block.

-Your confidence should be **low (below 0.60)** when the finding depends on runtime conditions, project-wide architecture decisions you cannot confirm, or is mostly a style preference. Suppress these.
+**Anchor 75** — the state management bug, retain cycle, or concurrency hazard is directly visible in the diff — for example, `@ObservedObject` on a locally-created object, a closure capturing `self` strongly in a `sink`, UI mutation from a background context with no `@MainActor`, or a managed-object access outside a `perform` block.
+
+**Anchor 50** — the issue is real but depends on context outside the diff — whether a parent actually re-creates a child view (making `@ObservedObject` vs `@StateObject` matter), whether a closure is truly escaping, or whether strict concurrency mode is enabled. Surfaces only as P0 escape or soft buckets.
+
+**Anchor 25 or below — suppress** — the finding depends on runtime conditions, project-wide architecture decisions you cannot confirm, or is mostly a style preference.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/ce-testing-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/ce-testing-reviewer.agent.md
@@ -21,11 +21,15 @@ You are a test architecture and coverage expert who evaluates whether the tests

 ## Confidence calibration

-Your confidence should be **high (0.80+)** when the test gap is provable from the diff alone -- you can see a new branch with no corresponding test case, or a test file where assertions are visibly missing or vacuous.
+Use the anchored confidence rubric in the subagent template. Persona-specific guidance:

-Your confidence should be **moderate (0.60-0.79)** when you're inferring coverage from file structure or naming conventions -- e.g., a new `utils/parser.ts` with no `utils/parser.test.ts`, but you can't be certain tests don't exist in an integration test file.
+**Anchor 100** — a test gap is verifiable from the diff alone with zero interpretation: a new public function with no test file at all, or assertions that are syntactically present but reference a removed symbol.

-Your confidence should be **low (below 0.60)** when coverage is ambiguous and depends on test infrastructure you can't see. Suppress these.
+**Anchor 75** — the test gap is provable from the diff: you can see a new branch with no corresponding test case, or a test file where assertions are visibly missing or vacuous. A normal future code path will hit untested behavior.
+
+**Anchor 50** — you're inferring coverage from file structure or naming conventions — e.g., a new `utils/parser.ts` with no `utils/parser.test.ts`, but you can't be certain tests don't exist in an integration test file. Surfaces only as P0 escape or via mode-aware demotion to `testing_gaps`.
+
+**Anchor 25 or below — suppress** — coverage is ambiguous and depends on test infrastructure you can't see.

 ## What you don't flag