feat(doc-review, learnings-researcher): tiers, chain grouping, rewrite (#601)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Trevin Chow
2026-04-19 20:25:47 -07:00
committed by GitHub
parent 409b07fbc7
commit c1f68d4d55
39 changed files with 3142 additions and 290 deletions


@@ -5,7 +5,7 @@ tags: [converter, target-provider, plugin-conversion, multi-platform, pattern]
created: 2026-02-23
severity: medium
component: converter-cli
-problem_type: best_practice
+problem_type: architecture_pattern
root_cause: architectural_pattern
---


@@ -3,7 +3,7 @@ title: "Codex Delegation Best Practices"
date: 2026-04-01
category: best-practices
module: "Codex delegation / skill design"
-problem_type: best_practice
+problem_type: convention
component: tooling
severity: medium
applies_when:


@@ -3,7 +3,7 @@ title: Conditional visual aids in generated documents and PR descriptions
date: 2026-03-29
category: best-practices
module: compound-engineering plugin skills
-problem_type: best_practice
+problem_type: design_pattern
component: documentation
symptoms:
- "Generated documents and PR descriptions lack visual aids that would improve comprehension of complex workflows and relationships"


@@ -3,7 +3,7 @@ title: "Prefer Python over bash for multi-step pipeline scripts"
date: 2026-04-09
category: best-practices
module: "skill scripting / ce-demo-reel"
-problem_type: best_practice
+problem_type: tooling_decision
component: tooling
severity: medium
applies_when:


@@ -5,7 +5,7 @@ tags: [codex, converter, skills, prompts, workflows, deprecation]
created: 2026-03-15
severity: medium
component: codex-target
-problem_type: best_practice
+problem_type: convention
root_cause: outdated_target_model
---


@@ -0,0 +1,110 @@
---
title: "ce-doc-review calibration patterns: tier classification, chain grouping, and FYI routing"
date: 2026-04-19
category: skill-design
module: compound-engineering / ce-doc-review
problem_type: design_pattern
component: tooling
severity: medium
tags:
- ce-doc-review
- autofix-classification
- synthesis-pipeline
- persona-calibration
- premise-dependency
- fyi-routing
- calibration
applies_when:
- Changing persona confidence calibration bands in `plugins/compound-engineering/agents/document-review/`
- Modifying the synthesis pipeline in `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md`
- Adjusting the subagent template's output contract in `references/subagent-template.md`
- Adding or modifying seeded test fixtures under `tests/fixtures/ce-doc-review/`
- Debugging why a finding landed in a different tier than expected
---
# ce-doc-review calibration patterns
Calibration work on ce-doc-review (PR #601 series, 2026-04-18 and -19) surfaced several non-obvious patterns in how the synthesis pipeline classifies findings. These patterns are durable: they will re-surface any time personas or synthesis guidance are retuned. Future contributors changing calibration should expect them and not "fix" them as bugs.
## Tier classification is context-sensitive, not purely formal
The naive read of the tier spec says `safe_auto` = "one clear correct fix, applied silently." In practice, the same shape of finding can legitimately land in different tiers depending on scope and verifiability. Two recurring patterns:
### External stale cross-reference → gated_auto (not safe_auto)
When the document says `see Unit 7` and Unit 7 doesn't exist in the same document, that's an **internal** stale cross-reference — coherence can verify from the document text alone and apply `safe_auto`. When the document says `see docs/guides/keyboard-nav.md Section 4` and that file isn't verifiable from the document content, that's an **external** cross-reference; applying "delete this reference" silently risks masking a legitimate external doc. The reviewer should route these to `gated_auto` with a "verify before applying" fix, not `safe_auto`.
Observed in: feature-plan fixture runs. The external cross-ref landed at P2 0.70 gated_auto with the fix "Verify docs/guides/keyboard-nav.md exists... If stale, either remove the reference or replace with inline guidance."
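The routing rule above can be sketched as a small classifier. Everything here is hypothetical (the function name, the tier strings, and the "contains a path separator or `.md` suffix" test for externality are illustrations, not the skill's actual implementation):

```python
def route_stale_cross_ref(target: str, document_text: str) -> str:
    """Route a cross-reference finding to an autofix tier (hypothetical sketch).

    A reference is 'external' when its target lives outside the document and
    so cannot be verified from the document content alone.
    """
    if "/" in target or target.endswith(".md"):
        # Deleting an external reference silently could mask a legitimate
        # doc; require confirmation with a "verify before applying" fix.
        return "gated_auto"
    # Internal targets ("Unit 7") are verifiable from the document text:
    # if the section genuinely isn't there, the fix is unambiguous.
    return "safe_auto" if target not in document_text else "no_finding"
```

The key property is that the tier depends on verifiability, not on the finding's shape.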
### Multi-surface terminology drift → gated_auto (not safe_auto)
When two synonyms appear in prose only (`data store` / `database`), `safe_auto` normalizes correctly. When the drift crosses surfaces — UI copy, aria-labels, toast messages, analytics events, file names, code identifiers — the fix's scope exceeds prose normalization and warrants user confirmation. Security-adjacent terminology (`token` / `credential` / `secret` / `API key`) carries different semantic weight and should also route to `gated_auto` with a glossary-fix recommendation.
Observed in: auth-plan fixture runs (security-lens escalated), feature-plan fixture runs (UI-surface escalated).
**Do not tighten coherence's `safe_auto` guidance to force these into `safe_auto`.** The reclassification is reviewer judgment doing useful work.
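The same scope-sensitivity can be sketched for terminology drift. The surface labels and the security-term set below are illustrative assumptions, not the personas' real vocabulary:

```python
# Hypothetical set of security-adjacent terms that carry distinct semantics.
SECURITY_TERMS = {"token", "credential", "secret", "API key"}

def route_terminology_drift(synonyms: set[str], surfaces: set[str]) -> str:
    """Route a terminology-drift finding (hypothetical sketch).

    surfaces: where the drifting terms appear, e.g. {"prose"} or
    {"prose", "aria-label", "analytics-event"}.
    """
    if synonyms & SECURITY_TERMS:
        # Security-adjacent synonyms are not interchangeable; recommend a
        # glossary fix and ask the user.
        return "gated_auto"
    if surfaces <= {"prose"}:
        # Drift confined to prose is a mechanical normalization.
        return "safe_auto"
    # Crossing UI copy, identifiers, events, etc. exceeds prose scope.
    return "gated_auto"
```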
## Premise-dependency chains have scope hierarchy
Synthesis step 3.5c groups manual findings whose fixes cascade from a single premise challenge. When multiple premise-level candidates surface, they may be **peer roots** (independent premises at different scopes) or **nested** (one premise's resolution moots the other). The decision rules:
### Peer vs nested — mechanical test, not example-based
> "Two candidate roots are peers when accepting root A's proposed fix would not resolve root B's concern (and vice versa). They are nested when one root's fix would moot the other — in which case the subsumed candidate becomes a dependent of the surviving root."
Apply symmetrically: check both directions before deciding. Example-based teaching ("e.g., 'drop the alias'") overfits to specific vocabulary; a mechanical decision test generalizes across domains.
### Surviving root under nested — scope dominates confidence
When nested, the surviving root is the one whose fix moots the other — **not** the higher-confidence candidate. In a rename plan, "rename premise unsupported" (0.82) dominates "alias machinery unjustified" (0.98) because rejecting the rename moots the alias entirely, while rejecting the alias still leaves the rename standing. Earlier synthesis picked the higher-confidence candidate as root, which stranded the broader-scope premise's natural dependents as independent findings.
Confidence is for tie-breaking *among peers*, not for deciding which of two nested candidates dominates.
### Multi-root requires explicit elevation
Synthesis defaults to picking a single root when multiple candidates match. A phrase like "typically 0–2 roots surface per review" anchors the synthesizer to elevate only one. Explicit guidance to elevate ALL matching candidates (subject only to the peer-vs-nested test) is needed. The criteria themselves are the filter — no numerical cap on roots.
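The section's three rules compose into one resolution pass. This is a sketch under assumed data shapes (`Candidate`, a `moots` set naming concerns a fix would resolve); it is not the synthesis spec's actual representation:

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    name: str
    confidence: float
    moots: set[str] = field(default_factory=set)  # names this fix would resolve

def resolve_roots(candidates):
    """Elevate ALL surviving roots; nested candidates become dependents.

    The mechanical test, applied symmetrically: A and B are peers when
    neither fix moots the other; they are nested when one moots the other,
    in which case the mooting candidate survives regardless of confidence
    (confidence is only for tie-breaking among peers).
    """
    survivors = list(candidates)
    dependents = {}
    for a in candidates:
        for b in candidates:
            if a is b:
                continue
            a_moots_b = b.name in a.moots
            b_moots_a = a.name in b.moots
            if a_moots_b and not b_moots_a and b in survivors:
                survivors.remove(b)
                dependents.setdefault(a.name, []).append(b.name)
    return survivors, dependents
```

On the rename example: the 0.82 rename premise moots the 0.98 alias finding but not vice versa, so the rename premise survives as root despite its lower confidence, and every peer root that survives is elevated with no numerical cap.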
## FYI routing requires band + template-level anchoring
The FYI bucket (manual findings with confidence 0.40–0.65) stayed empty for initial calibration runs because personas had only two bands defined (HIGH ≥0.80, MODERATE 0.60–0.79) with "Suppress below 0.50." Advisory observations with no articulable consequence had nowhere to land — they were either promoted above gate (appearing as real decisions) or suppressed entirely.
Two changes together populate the FYI bucket reliably:
1. **Per-persona LOW (0.40–0.59) Advisory band** tailored to each persona's scope. Each of the 7 personas needs its own band; a single template-level rule doesn't override persona-specific calibrations.
2. **Template-level advisory rule** in `subagent-template.md`'s output-contract using the "what actually breaks if we don't fix this?" heuristic. Anchors the scoring decision when a persona's own rubric doesn't make the band's applicability obvious.
Either alone is insufficient. Persona bands without the template rule produce inconsistent results across personas; the template rule without per-persona bands has nothing to calibrate against.
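The band structure can be sketched as follows. The band floors mirror the values above, but the dict shape and function are hypothetical illustrations, not the persona files' real format:

```python
# Hypothetical per-persona bands (floor, label), highest floor first.
# Real calibrations live in each persona file and differ per persona.
PERSONA_BANDS = {
    "coherence":   [(0.80, "high"), (0.60, "moderate"), (0.40, "advisory")],
    "adversarial": [(0.80, "high"), (0.60, "moderate"), (0.40, "advisory")],
}

def route_manual_finding(persona: str, confidence: float) -> str:
    for floor, band in PERSONA_BANDS[persona]:
        if confidence >= floor:
            # Template-level heuristic: if nothing actually breaks when the
            # finding is ignored, it belongs in the advisory (FYI) band.
            return "fyi" if band == "advisory" else "manual"
    return "suppressed"  # below every band floor
```

Without the advisory entry in a persona's band list, a 0.55 finding falls through to suppression, which is exactly the empty-FYI-bucket failure described above.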
## Schema compliance requires inline enum callouts, not just `{schema}` injection
The subagent template injects the full JSON schema into each persona's prompt. Schema conformance nonetheless broke on longer personas (adversarial at 89 lines, scope-guardian at 54 lines) — severity emitted as `"high"/"medium"/"low"` instead of `P0/P1/P2/P3`, evidence as strings instead of arrays.
The fix that worked: a **"Schema conformance — hard constraints"** block at the top of the output contract prose, naming the exact enum values and forbidding common deviations. Schema injection alone gets pushed down in attention by dense persona rubrics; inline enum callouts anchor them at the top of the output contract and survive longer prompts.
A severity translation rule ("if your persona's prose discusses 'critical/important/low-signal', map to P0/P1/P2/P3 at emit time") prevents informal priority language in persona rubrics from leaking into JSON output.
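A minimal sketch of that emit-time translation, assuming a lookup-plus-validate shape (the mapping entries beyond "critical/important/low-signal" are illustrative guesses, not the template's actual rule):

```python
# Hypothetical emit-time map from informal persona language to the schema's
# severity enum; the schema's only legal values are P0/P1/P2/P3.
SEVERITY_MAP = {
    "critical": "P0",
    "important": "P1",
    "high": "P1",
    "medium": "P2",
    "low": "P3",
    "low-signal": "P3",
}

def emit_severity(informal: str) -> str:
    value = SEVERITY_MAP.get(informal.lower(), informal)
    if value not in {"P0", "P1", "P2", "P3"}:
        raise ValueError(f"severity {value!r} violates schema enum")
    return value
```

Failing loudly on unmapped values is the point: a hard constraint at emit time is what keeps informal rubric language out of the JSON.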
## Coverage/rendering count invariants need a single source of truth
Early chain runs reported a coverage count (`1 root with 6 dependents`) that didn't match the rendered output (5 dependents shown). The spec didn't name which step's count was authoritative (candidate count from Step 2, post-safeguard from Step 3, or post-cap from Step 4), so the orchestrator used different values for coverage and rendering.
**Invariant to preserve:** the `dependents` array populated in the final annotation step (after all filtering) is the single source of truth for both coverage and rendering. A finding appearing in a root's `dependents` array must appear nested under that root in presentation and must NOT appear at its own severity position. Coverage count equals the length of the `dependents` array.
Any future pipeline change that adds filtering or reorganization steps must re-state which post-step snapshot is authoritative.
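The invariant can be stated as an executable check. The dict shapes (`coverage_count`, `dependents`, `id`) are hypothetical stand-ins for whatever the final annotation step actually emits:

```python
def check_chain_invariant(roots, standalone_findings):
    """Assert the single-source-of-truth invariant (hypothetical shapes)."""
    standalone_ids = {f["id"] for f in standalone_findings}
    for root in roots:
        deps = root["dependents"]
        # Coverage count must equal the length of the final dependents array.
        assert root["coverage_count"] == len(deps)
        for dep in deps:
            # A dependent renders nested under its root, never also at its
            # own severity position.
            assert dep["id"] not in standalone_ids
    return True
```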
## Reviewer variance is inherent; single runs aren't baselines
Across 7+ runs on the rename fixture, the same document produced user-engagement counts of 0, 1, 2, 3 for `safe_auto` applied and 14, 19, 6, 12, 8, 6 for total user decisions. Calibration work reduced but did not eliminate variance. Primary variance sources:
- **Adversarial reviewer activation** — the activation signals (requirement count, architectural decisions, high-stakes domain) produce non-deterministic decisions at borderline documents
- **Root selection when multiple candidates exist** — even with scope-dominance guidance, the synthesizer's root choice varies across runs
- **Confidence calibration on borderline findings** — the same finding lands in FYI on one run and manual on the next, because the reviewer scored 0.63 vs 0.68
**Testing implication:** validate calibration changes against multiple runs, not single samples. A single "bad" run is likely noise; a pattern across 3+ runs is signal. Seeded fixtures document expected tier distributions as targets, not as pass/fail assertions.
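One way to operationalize "distributions as targets": summarize each fixture's runs and compare the median, not any single run, against the expected range. The helper below is a sketch, not tooling that exists in the repo:

```python
from statistics import median

def summarize_runs(run_decision_counts: list[int]) -> dict:
    """Summarize per-run user-decision counts for a seeded fixture.

    With 3+ runs, the median is the value to compare against the fixture's
    expected range; a single outlier run is likely noise.
    """
    return {
        "runs": len(run_decision_counts),
        "median": median(run_decision_counts),
        "spread": max(run_decision_counts) - min(run_decision_counts),
    }
```

Applied to the rename-fixture decision counts quoted above, the spread alone (13 decisions between best and worst run) shows why a pass/fail assertion on one run would be meaningless.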
## Related documentation
- `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` — canonical synthesis pipeline spec, including 3.5c premise-dependency chain linking
- `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` — output contract with schema conformance block and advisory routing rule
- `plugins/compound-engineering/agents/document-review/` — the 7 persona agents with their confidence calibration bands
- `tests/fixtures/ce-doc-review/` — three seeded fixtures (rename, auth, feature) for manual calibration testing; see each fixture's header comment for its specific seed map
- `docs/solutions/developer-experience/branch-based-plugin-install-and-testing-2026-03-26.md` — how to run the skill from a branch checkout for testing


@@ -3,7 +3,7 @@ title: Discoverability check for documented solutions in project instruction fil
date: 2026-03-30
category: skill-design
module: compound-engineering
-problem_type: best_practice
+problem_type: convention
component: tooling
severity: medium
applies_when:


@@ -3,7 +3,7 @@ title: "Git workflow skills need explicit state machines for branch, push, and P
category: skill-design
date: 2026-03-27
module: plugins/compound-engineering/skills/git-commit and git-commit-push-pr
-problem_type: best_practice
+problem_type: architecture_pattern
component: tooling
symptoms:
- Detached HEAD could fall through to invalid push or PR paths


@@ -1,6 +1,6 @@
---
title: "Pass paths, not content, when dispatching sub-agents"
-problem_type: best_practice
+problem_type: design_pattern
component: tooling
root_cause: inadequate_documentation
resolution_type: workflow_improvement


@@ -3,7 +3,7 @@ title: Research agent dispatch is intentionally separated across the skill pipel
date: 2026-04-05
category: skill-design
module: compound-engineering
-problem_type: best_practice
+problem_type: architecture_pattern
component: tooling
severity: low
applies_when:


@@ -13,7 +13,7 @@ related_components:
- plugins/compound-engineering/skills/ce-review/
- plugins/compound-engineering/skills/todo-triage/
- plugins/compound-engineering/skills/todo-create/
-problem_type: correctness-gap
+problem_type: workflow_issue
---
# Status-Gated Todo Resolution