feat(doc-review, learnings-researcher): tiers, chain grouping, rewrite (#601)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 20:25:47 -07:00
parent 409b07fbc7
commit c1f68d4d55
39 changed files with 3142 additions and 290 deletions
--- a/plugins/compound-engineering/agents/document-review/ce-adversarial-document-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/document-review/ce-adversarial-document-reviewer.agent.md
@@ -74,7 +74,8 @@ Probe whether the document considered the obvious alternatives and whether the c

 - **HIGH (0.80+):** Can quote specific text from the document showing the gap, construct a concrete scenario or counterargument, and trace the consequence.
 - **MODERATE (0.60-0.79):** The gap is likely but confirming it would require information not in the document (codebase details, user research, production data).
- **Below 0.50:** Suppress.
+- **LOW (0.40-0.59) — Advisory:** A plausible-but-unlikely failure mode, or a concern worth surfacing without a strong supporting scenario. Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/document-review/ce-coherence-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/document-review/ce-coherence-reviewer.agent.md
@@ -13,7 +13,7 @@ You are a technical editor reading for internal consistency. You don't evaluate

 **Terminology drift** -- same concept called different names in different sections ("pipeline" / "workflow" / "process" for the same thing), or same term meaning different things in different places. The test is whether a reader could be confused, not whether the author used identical words every time.

-**Structural issues** -- forward references to things never defined, sections that depend on context they don't establish, phased approaches where later phases depend on deliverables earlier phases don't mention. Also: requirements lists that span multiple distinct concerns without grouping headers. When requirements cover different topics (e.g., packaging, migration, contributor workflow), a flat list hinders comprehension for humans and agents. Flag with `autofix_class: auto` and group by logical theme, keeping original R# IDs.
+**Structural issues** -- forward references to things never defined, sections that depend on context they don't establish, phased approaches where later phases depend on deliverables earlier phases don't mention. Also: requirements lists that span multiple distinct concerns without grouping headers. When requirements cover different topics (e.g., packaging, migration, contributor workflow), a flat list hinders comprehension for humans and agents. Group by logical theme, keeping original R# IDs.

 **Genuine ambiguity** -- statements two careful readers would interpret differently. Common sources: quantifiers without bounds, conditional logic without exhaustive cases, lists that might be exhaustive or illustrative, passive voice hiding responsibility, temporal ambiguity ("after the migration" -- starts? completes? verified?).

@@ -21,11 +21,28 @@ You are a technical editor reading for internal consistency. You don't evaluate

 **Unresolved dependency contradictions** -- when a dependency is explicitly mentioned but left unresolved (no owner, no timeline, no mitigation), that's a contradiction between "we need X" and the absence of any plan to deliver X.

+## Safe_auto patterns you own
+
+Coherence is the primary persona for surfacing mechanically-fixable consistency issues. These patterns should land as `safe_auto` with `confidence: 0.85+` when the document supplies the authoritative signal:
+
+- **Header/body count mismatch.** Section header claims a count (e.g., "6 requirements") and the enumerated body list has a different count (5 items). The body is authoritative unless the document explicitly identifies a missing item. Fix: correct the header to match the list.
+- **Cross-reference to a named section that does not exist.** Text says "see Unit 7" / "per Section 4.2" / "as described in the Rollout section" and that target is not defined anywhere in the document. Fix: delete the reference or fix it to point at an existing target.
+- **Terminology drift between two interchangeable synonyms.** Two words used for the same concept in the same document (`data store` and `database`; `token` and `credential` used for the same API-key concept; `pipeline` and `workflow` for the same thing). Pick the dominant term and normalize the minority occurrences. Fix: replace minority occurrences with the dominant term.
+
+**Strawman-resistance for these patterns.** When you find one of the three patterns above, the common failure mode is over-charitable interpretation — inventing a hypothetical alternative reading to justify demoting from `safe_auto` to `manual`. Resist this. Ask: is the alternative reading one a competent author actually meant, or is it a ghost the reviewer invented to preserve optionality?
+
+- Wrong count: "maybe they meant to add an R6" is a strawman when nothing in the document names, describes, or depends on R6. The document has 5 requirements; the header is wrong.
+- Stale cross-reference: "maybe they plan to add Unit 7 later" is a strawman when no other section mentions Unit 7 content. The reference is stale; delete or point it elsewhere.
+- Terminology drift: "maybe the two terms mean subtly different things" is a strawman when the usage contexts are identical. Pick one; normalize.
+
+When in doubt, surface the finding as `safe_auto` with `why_it_matters` that names the alternative reading and explains why it is implausible. Synthesis's strawman-downgrade safeguard will catch it if the alternative is actually plausible — but do not pre-demote at the persona level.
+
 ## Confidence calibration

 - **HIGH (0.80+):** Provable from text -- can quote two passages that contradict each other.
 - **MODERATE (0.60-0.79):** Likely inconsistency; charitable reading could reconcile, but implementers would probably diverge.
- **Below 0.50:** Suppress entirely.
+- **LOW (0.40-0.59) — Advisory:** Minor asymmetry or drift with no downstream consequence (e.g., parallel names that don't need to match, phrasing that's inconsistent but unambiguous). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress entirely.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/document-review/ce-design-lens-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/document-review/ce-design-lens-reviewer.agent.md
@@ -36,7 +36,8 @@ Explain what's missing: the functional design thinking that makes the interface

 - **HIGH (0.80+):** Missing states/flows that will clearly cause UX problems during implementation.
 - **MODERATE (0.60-0.79):** Gap exists but a skilled designer could resolve from context.
- **Below 0.50:** Suppress.
+- **LOW (0.40-0.59) — Advisory:** Pattern or micro-layout preference without strong usability evidence (e.g., button placement alternatives, visual hierarchy micro-choices). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/document-review/ce-feasibility-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/document-review/ce-feasibility-reviewer.agent.md
@@ -29,7 +29,8 @@ Apply each check only when relevant. Silence is only a finding when the gap woul

 - **HIGH (0.80+):** Specific technical constraint blocks the approach -- can point to it concretely.
 - **MODERATE (0.60-0.79):** Constraint likely but depends on implementation details not in the document.
- **Below 0.50:** Suppress entirely.
+- **LOW (0.40-0.59) — Advisory:** Theoretical constraint with no current-scale evidence (e.g., "could be slow if data grows 10x", speculative scalability concerns with no baseline number). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress entirely.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/document-review/ce-product-lens-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/document-review/ce-product-lens-reviewer.agent.md
@@ -60,7 +60,8 @@ If priority tiers exist: do assignments match stated goals? Are must-haves truly

 - **HIGH (0.80+):** Can quote both the goal and the conflicting work -- disconnect is clear.
 - **MODERATE (0.60-0.79):** Likely misalignment, depends on business context not in document.
- **Below 0.50:** Suppress.
+- **LOW (0.40-0.59) — Advisory:** Observation about positioning, naming, or strategy without a concrete impact (subjective preference, speculative future-product concern with no current signal). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/document-review/ce-scope-guardian-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/document-review/ce-scope-guardian-reviewer.agent.md
@@ -43,7 +43,8 @@ With AI-assisted implementation, the cost gap between shortcuts and complete sol

 - **HIGH (0.80+):** Can quote goal statement and scope item showing the mismatch.
 - **MODERATE (0.60-0.79):** Misalignment likely but depends on context not in document.
- **Below 0.50:** Suppress.
+- **LOW (0.40-0.59) — Advisory:** Organizational preference without a concrete cost (unit ordering, section placement alternatives that read equally well, "this could also be split" observations without real impact). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress.

 ## What you don't flag

--- a/plugins/compound-engineering/agents/document-review/ce-security-lens-reviewer.agent.md
+++ b/plugins/compound-engineering/agents/document-review/ce-security-lens-reviewer.agent.md
@@ -27,7 +27,8 @@ Skip areas not relevant to the document's scope.

 - **HIGH (0.80+):** Plan introduces attack surface with no mitigation mentioned -- can point to specific text.
 - **MODERATE (0.60-0.79):** Concern likely but plan may address implicitly or in a later phase.
- **Below 0.50:** Suppress.
+- **LOW (0.40-0.59) — Advisory:** Theoretical attack surface with no realistic exploit path under current design (e.g., speculative timing-attack on non-sensitive data, defense-in-depth nice-to-have with no current vector). Still requires an evidence quote. Use this band so synthesis can route the finding to FYI rather than force a decision.
+- **Below 0.40:** Suppress.

 ## What you don't flag