- Changing persona confidence calibration bands in `plugins/compound-engineering/agents/document-review/`
- Modifying the synthesis pipeline in `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md`
- Adjusting the subagent template's output contract in `references/subagent-template.md`
- Adding or modifying seeded test fixtures under `tests/fixtures/ce-doc-review/`
- Debugging why a finding landed in a different tier than expected
---
# ce-doc-review calibration patterns
Calibration work on ce-doc-review (PR #601 series, 2026-04-18 and -19) surfaced several non-obvious patterns in how the synthesis pipeline classifies findings. These patterns are durable: they will re-surface any time personas or synthesis guidance are retuned. Future contributors changing calibration should expect them and not "fix" them as bugs.
## Tier classification is context-sensitive, not purely formal
The naive read of the tier spec says `safe_auto` = "one clear correct fix, applied silently." In practice, the same shape of finding can legitimately land in different tiers depending on scope and verifiability. Two recurring patterns:
### External stale cross-reference → gated_auto (not safe_auto)
When the document says `see Unit 7` and Unit 7 doesn't exist in the same document, that's an **internal** stale cross-reference — the coherence persona can verify it from the document text alone and apply `safe_auto`. When the document says `see docs/guides/keyboard-nav.md Section 4` and that file isn't verifiable from the document content, that's an **external** cross-reference; applying "delete this reference" silently risks removing a pointer to a doc that actually exists. The reviewer should route these to `gated_auto` with a "verify before applying" fix, not `safe_auto`.
Observed in: feature-plan fixture runs. The external cross-ref landed at P2, confidence 0.70, `gated_auto`, with the fix "Verify docs/guides/keyboard-nav.md exists... If stale, either remove the reference or replace with inline guidance."
### Cross-surface terminology drift → gated_auto (not safe_auto)
When two synonyms appear in prose only (`data store` / `database`), `safe_auto` normalizes correctly. When the drift crosses surfaces — UI copy, aria-labels, toast messages, analytics events, file names, code identifiers — the fix's scope exceeds prose normalization and warrants user confirmation. Security-adjacent terminology (`token` / `credential` / `secret` / `API key`) carries different semantic weight and should also route to `gated_auto` with a glossary-fix recommendation.
**Do not tighten coherence's `safe_auto` guidance to force these into `safe_auto`.** The reclassification is reviewer judgment doing useful work.
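To make the two routing rules concrete, here is a minimal sketch of the judgment as a decision rule. Everything here is hypothetical: the function names, the `surfaces` vocabulary, and the security-term set. The real pipeline expresses this judgment in persona prose, not code.

```python
SECURITY_TERMS = {"token", "credential", "secret", "api key"}

def route_cross_reference(resolves_within_document: bool) -> str:
    """Internal stale cross-refs are verifiable from the document text
    alone; external ones need a verify-before-applying gate."""
    return "safe_auto" if resolves_within_document else "gated_auto"

def route_terminology_drift(surfaces: set, terms: set) -> str:
    """Prose-only synonym drift normalizes silently; cross-surface or
    security-adjacent drift warrants user confirmation."""
    if surfaces - {"prose"}:             # UI copy, aria-labels, identifiers, ...
        return "gated_auto"
    if {t.lower() for t in terms} & SECURITY_TERMS:
        return "gated_auto"              # recommend a glossary fix
    return "safe_auto"

# The fixture examples from above:
assert route_cross_reference(resolves_within_document=True) == "safe_auto"
assert route_cross_reference(resolves_within_document=False) == "gated_auto"
assert route_terminology_drift({"prose"}, {"data store", "database"}) == "safe_auto"
assert route_terminology_drift({"prose", "aria-label"}, {"data store"}) == "gated_auto"
```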
## Premise-dependency chains have scope hierarchy
Synthesis step 3.5c groups manual findings whose fixes cascade from a single premise challenge. When multiple premise-level candidates surface, they may be **peer roots** (independent premises at different scopes) or **nested** (one premise's resolution moots the other). The decision rules:
### Peer vs nested — mechanical test, not example-based
> "Two candidate roots are peers when accepting root A's proposed fix would not resolve root B's concern (and vice versa). They are nested when one root's fix would moot the other — in which case the subsumed candidate becomes a dependent of the surviving root."
Apply symmetrically: check both directions before deciding. Example-based teaching ("e.g., 'drop the alias'") overfits to specific vocabulary; a mechanical decision test generalizes across domains.
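As a sketch of the test (hypothetical names; `moots(x, y)` stands in for the reviewer judgment "accepting x's proposed fix would resolve y's concern", which no code can supply):

```python
def relation(a, b, moots) -> str:
    """Mechanical peer-vs-nested test, checked in both directions."""
    if not moots(a, b) and not moots(b, a):
        return "peers"   # independent premises; both survive as roots
    return "nested"      # one candidate becomes a dependent of the other
```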
### Surviving root under nested — scope dominates confidence
When nested, the surviving root is the one whose fix moots the other — **not** the higher-confidence candidate. In a rename plan, "rename premise unsupported" (0.82) dominates "alias machinery unjustified" (0.98) because rejecting the rename moots the alias entirely, while rejecting the alias still leaves the rename standing. Earlier synthesis picked the higher-confidence candidate as root, which stranded the broader-scope premise's natural dependents as independent findings.
Confidence is for tie-breaking *among peers*, not for deciding which of two nested candidates dominates.
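A minimal sketch of the selection rule, with the rename-plan example encoded as a toy `moots` predicate (names hypothetical):

```python
def surviving_root(a, b, moots):
    """Under nesting, scope dominates confidence: the survivor is the
    candidate whose fix moots the other, never the higher-scored one."""
    if moots(a, b) and not moots(b, a):
        return a
    if moots(b, a) and not moots(a, b):
        return b
    raise ValueError("not a nested pair -- apply the peer rules instead")

# Rejecting the rename moots the alias, so the 0.82 candidate dominates
# the 0.98 one.
moots = lambda x, y: (x, y) == ("rename premise", "alias machinery")
assert surviving_root("rename premise", "alias machinery", moots) == "rename premise"
```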
### Multi-root requires explicit elevation
Synthesis defaults to picking a single root when multiple candidates match. A phrase like "typically 0–2 roots surface per review" anchors the synthesizer to elevate only one. Explicit guidance to elevate ALL matching candidates (subject only to the peer-vs-nested test) is needed. The criteria themselves are the filter — no numerical cap on roots.
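A hypothetical shape for "elevate ALL matching candidates", assuming a `moots` predicate as above. Note there is no count check anywhere; the pairwise test is the only filter.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    name: str
    confidence: float
    dependents: list = field(default_factory=list)

def elevate_roots(candidates, moots):
    """Elevate every candidate that survives the pairwise nested test."""
    roots = []
    for cand in candidates:
        parent = next((r for r in roots if moots(r, cand)), None)
        if parent is not None:
            parent.dependents.append(cand)   # subsumed -> dependent
            continue
        for r in [r for r in roots if moots(cand, r)]:
            roots.remove(r)                  # cand moots an earlier root
            cand.dependents.append(r)
        roots.append(cand)
    return roots
```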
## FYI routing requires band + template-level anchoring
The FYI bucket (manual findings with confidence 0.40–0.65) stayed empty across initial calibration runs because personas defined only two bands (HIGH ≥0.80, MODERATE 0.60–0.79) plus "Suppress below 0.50." Advisory observations with no articulable consequence had nowhere to land — they were either promoted above the gate (appearing as real decisions) or suppressed entirely.
Two changes together populate the FYI bucket reliably:
1. **Per-persona LOW (0.40–0.59) Advisory band** tailored to each persona's scope. Each of the 7 personas needs its own band; a single template-level rule doesn't override persona-specific calibrations.
2. **Template-level advisory rule** in `subagent-template.md`'s output contract, using the "what actually breaks if we don't fix this?" heuristic. This anchors the scoring decision when a persona's own rubric doesn't make the band's applicability obvious.
Either alone is insufficient. Persona bands without the template rule produce inconsistent results across personas; the template rule without per-persona bands has nothing to calibrate against.
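For orientation, the bucket boundaries reduce to a rule like the following. This is a sketch assuming the synthesis-level FYI range above; the actual cutoffs live in each persona's band prose and vary per persona.

```python
def route_manual_finding(confidence: float) -> str:
    """Synthesis-level routing by confidence (hypothetical helper; the
    per-persona bands calibrate the scores that feed into this)."""
    if confidence > 0.65:
        return "decision"     # presented above the gate
    if confidence >= 0.40:
        return "fyi"          # advisory: no articulable consequence
    return "suppressed"

assert route_manual_finding(0.68) == "decision"
assert route_manual_finding(0.63) == "fyi"
```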
## Schema compliance requires inline enum callouts, not just `{schema}` injection
The subagent template injects the full JSON schema into each persona's prompt. Schema conformance nonetheless broke on longer personas (adversarial at 89 lines, scope-guardian at 54 lines) — severity emitted as `"high"/"medium"/"low"` instead of `P0/P1/P2/P3`, evidence as strings instead of arrays.
The fix that worked: a **"Schema conformance — hard constraints"** block at the top of the output contract prose, naming the exact enum values and forbidding common deviations. Schema injection alone gets pushed down in attention by dense persona rubrics; inline enum callouts sit at the top of the output contract, stay salient, and survive longer prompts.
A severity translation rule ("if your persona's prose discusses 'critical/important/low-signal', map to P0/P1/P2/P3 at emit time") prevents informal priority language in persona rubrics from leaking into JSON output.
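The fix itself is prose in the template, but the same guard can be stated mechanically. A sketch, with an illustrative severity mapping (the real translation rule is persona prose, and informal terms vary per rubric):

```python
ALLOWED_SEVERITIES = {"P0", "P1", "P2", "P3"}

# Illustrative emit-time translation only; not the template's actual table.
SEVERITY_MAP = {
    "critical": "P0",
    "important": "P1",
    "high": "P1",
    "medium": "P2",
    "low": "P3",
    "low-signal": "P3",
}

def normalize_finding(finding: dict) -> dict:
    """Post-hoc guard for the two deviations observed on longer personas:
    informal severity strings and string-valued evidence."""
    sev = str(finding["severity"])
    if sev not in ALLOWED_SEVERITIES:
        finding["severity"] = SEVERITY_MAP[sev.lower()]
    if isinstance(finding.get("evidence"), str):
        finding["evidence"] = [finding["evidence"]]  # schema requires an array
    return finding

assert normalize_finding({"severity": "high", "evidence": "x"}) == {
    "severity": "P1",
    "evidence": ["x"],
}
```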
## Coverage/rendering count invariants need a single source of truth
Early chain runs reported coverage count (`1 root with 6 dependents`) that didn't match the rendered output (5 dependents shown). The spec didn't name which step's count was authoritative (candidate count from Step 2, post-safeguard from Step 3, or post-cap from Step 4), so the orchestrator used different values for coverage and rendering.
**Invariant to preserve:** the `dependents` array populated in the final annotation step (after all filtering) is the single source of truth for both coverage and rendering. A finding appearing in a root's `dependents` array must appear nested under that root in presentation and must NOT appear at its own severity position. Coverage count equals the length of the `dependents` array.
Any future pipeline change that adds filtering or reorganization steps must re-state which post-step snapshot is authoritative.
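A sketch of the invariant as assertions, with hypothetical field names (`coverage_count`, `id`; the real annotation schema may differ):

```python
def check_dependents_invariant(roots: list, rendered_top_level_ids: set) -> None:
    """The dependents array populated in the final annotation step is
    authoritative for both coverage and rendering."""
    for root in roots:
        deps = root["dependents"]
        # Coverage count equals the length of the dependents array.
        assert root["coverage_count"] == len(deps)
        for dep in deps:
            # A dependent renders nested under its root, never at its
            # own severity position.
            assert dep["id"] not in rendered_top_level_ids
```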
## Reviewer variance is inherent; single runs aren't baselines
Across 7+ runs on the rename fixture, the same document produced `safe_auto` applied counts of 0, 1, 2, and 3, and total user-decision counts of 14, 19, 6, 12, 8, and 6. Calibration work reduced but did not eliminate variance. Primary variance sources:
- **Adversarial reviewer activation** — the activation signals (requirement count, architectural decisions, high-stakes domain) produce non-deterministic decisions at borderline documents
- **Root selection when multiple candidates exist** — even with scope-dominance guidance, the synthesizer's root choice varies across runs
- **Confidence calibration on borderline findings** — the same finding lands in FYI on one run and manual on the next, because the reviewer scored 0.63 vs 0.68
**Testing implication:** validate calibration changes against multiple runs, not single samples. A single "bad" run is likely noise; a pattern across 3+ runs is signal. Seeded fixtures document expected tier distributions as targets, not as pass/fail assertions.
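A minimal sketch of what "validate against multiple runs" looks like in a test harness, assuming each finding carries a `tier` field:

```python
from collections import Counter

def tier_distribution(runs: list) -> dict:
    """Mean per-tier finding count across runs. Compare this distribution
    to a fixture's documented target; never assert on a single run."""
    totals = Counter()
    for findings in runs:
        for f in findings:
            totals[f["tier"]] += 1
    return {tier: count / len(runs) for tier, count in totals.items()}

# e.g. 3+ runs of the rename fixture, then eyeball drift from the
# fixture's expected tier distribution:
# print(tier_distribution([run1, run2, run3]))
```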
## Key files
- `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` — output contract with schema conformance block and advisory routing rule
- `plugins/compound-engineering/agents/document-review/` — the 7 persona agents with their confidence calibration bands
- `tests/fixtures/ce-doc-review/` — three seeded fixtures (rename, auth, feature) for manual calibration testing; see each fixture's header comment for its specific seed map
- `docs/solutions/developer-experience/branch-based-plugin-install-and-testing-2026-03-26.md` — how to run the skill from a branch checkout for testing