# Judge Evaluation Prompt Template
This template is used by the orchestrator to dispatch batched LLM-as-judge evaluation calls. Each judge sub-agent evaluates a batch of sampled output items and returns structured JSON scores.
The orchestrator:

- Reads the experiment's output
- Selects samples per the stratification config (using a fixed seed)
- Groups samples into batches of `judge.batch_size`
- Dispatches `ceil(sample_size / batch_size)` parallel sub-agents using this template (see the sketch below)
- Aggregates the returned JSON scores
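To make the batching arithmetic concrete, here is a minimal Python sketch of that dispatch loop. The names (`run_judge`, `sample_size`) are illustrative assumptions rather than the orchestrator's actual API, and stratified sampling is reduced to a plain seeded sample.

```python
import math
import random

def dispatch_judge_batches(items, sample_size, batch_size, seed, run_judge):
    """Sample, batch, and fan out judge evaluation calls.

    `run_judge` stands in for whatever renders the template and invokes
    one judge sub-agent; it must return the JSON array described by the
    output contract. Stratification is elided -- a plain seeded sample
    stands in for it.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible sample selection
    sampled = rng.sample(items, min(sample_size, len(items)))

    # ceil(sample_size / batch_size) batches, matching the dispatch count above
    n_batches = math.ceil(len(sampled) / batch_size)
    batches = [sampled[i * batch_size:(i + 1) * batch_size]
               for i in range(n_batches)]

    # The real orchestrator runs these in parallel; sequential here for clarity.
    scores = []
    for batch in batches:
        scores.extend(run_judge(batch))  # each call returns one parsed JSON array
    return scores
```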
## Item Evaluation Template

```
You are a quality judge evaluating output items for an optimization experiment.
Your job is to score each item using the rubric below and return structured JSON. Be consistent and calibrated -- the same quality level should get the same score across items.
<rubric>
{rubric}
</rubric>
<items>
{items_json}
</items>
<output-contract>
Return ONLY a valid JSON array. No prose, no markdown, no explanation outside the JSON.
Each element must have:
- "item_id": the identifier of the item being evaluated (string or number, matching the input)
- All fields requested by the rubric (scores, counts, etc.)
- "ambiguous": true if you cannot confidently score this item (e.g., insufficient context, borderline case). When ambiguous, still provide your best-guess score but flag it.
Example output format (adapt field names to match the rubric):
[
  {"item_id": "cluster-42", "score": 4, "distinct_topics": 1, "outlier_count": 0, "ambiguous": false},
  {"item_id": "cluster-17", "score": 2, "distinct_topics": 3, "outlier_count": 2, "ambiguous": false},
  {"item_id": "cluster-99", "score": 3, "distinct_topics": 2, "outlier_count": 1, "ambiguous": true}
]
Rules:
- Evaluate each item independently
- Score based on the rubric, not on how other items in this batch scored
- If an item is empty or has only 1 element when it should have more, score it based on what is present
- For very large items (many elements), focus on a representative subset and note if quality varies across the item
- Every item in the batch MUST appear in your output
</output-contract>
```
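A consumer of this contract can enforce the hard requirements (a valid JSON array, every batch item present) before aggregating. Below is a minimal sketch, assuming the raw model text is already in hand; `parse_judge_output` is a hypothetical helper, not part of the harness:

```python
import json

def parse_judge_output(raw_text, expected_item_ids):
    """Parse one judge response and enforce the output contract."""
    results = json.loads(raw_text)
    if not isinstance(results, list):
        raise ValueError("judge output must be a JSON array")

    # Every item in the batch MUST appear in the output.
    seen = {str(r["item_id"]) for r in results}
    missing = {str(i) for i in expected_item_ids} - seen
    if missing:
        raise ValueError(f"judge omitted items: {sorted(missing)}")

    # Ambiguous items keep their best-guess scores; the orchestrator can
    # down-weight or re-sample them separately.
    ambiguous = [r["item_id"] for r in results if r.get("ambiguous")]
    return results, ambiguous
```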
## Singleton Evaluation Template

```
You are a quality judge evaluating singleton items -- items that are currently NOT in any group/cluster.
Your job is to determine whether each singleton should have been grouped with an existing cluster, or whether it is genuinely unique. Return structured JSON.
<rubric>
{singleton_rubric}
</rubric>
<singletons>
{singletons_json}
</singletons>
<existing-clusters>
A summary of existing clusters for reference (titles/themes only, not full contents):
{cluster_summaries}
</existing-clusters>
<output-contract>
Return ONLY a valid JSON array. No prose, no markdown, no explanation outside the JSON.
Each element must have:
- "item_id": the identifier of the singleton
- All fields requested by the singleton rubric (should_cluster, best_cluster_id, confidence, etc.)
Example output format (adapt field names to match the rubric):
[
  {"item_id": "issue-1234", "should_cluster": true, "best_cluster_id": "cluster-42", "confidence": 4},
  {"item_id": "issue-5678", "should_cluster": false, "best_cluster_id": null, "confidence": 5}
]
Rules:
- A singleton that genuinely has no match in existing clusters should get should_cluster: false
- A singleton that clearly belongs in an existing cluster should get should_cluster: true with the cluster ID
- High confidence (4-5) means you are very sure. Low confidence (1-2) means the item is borderline.
- Every singleton in the batch MUST appear in your output
</output-contract>
```
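One plausible way to aggregate these judgments is a confidence-filtered missed-clustering rate. The sketch below is an assumption about aggregation, not the orchestrator's documented behavior; only the field names come from the contract above.

```python
def summarize_singleton_judgments(results, min_confidence=3):
    """Turn singleton judge output into a missed-clustering rate.

    Judgments below `min_confidence` are excluded as borderline; the
    threshold of 3 is an arbitrary illustrative choice.
    """
    confident = [r for r in results if r.get("confidence", 0) >= min_confidence]
    if not confident:
        return {"missed_rate": None, "n_confident": 0, "suggested_merges": {}}

    missed = [r for r in confident if r["should_cluster"]]
    return {
        # Fraction of confidently-judged singletons that should have clustered
        "missed_rate": len(missed) / len(confident),
        "n_confident": len(confident),
        "suggested_merges": {r["item_id"]: r["best_cluster_id"] for r in missed},
    }
```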
## Variable Reference

| Variable | Source | Description |
|---|---|---|
| `{rubric}` | Spec `metric.judge.rubric` | User-defined scoring rubric |
| `{items_json}` | Sampled output items | JSON array of items to evaluate (one batch worth) |
| `{singleton_rubric}` | Spec `metric.judge.singleton_rubric` | User-defined rubric for singleton evaluation |
| `{singletons_json}` | Sampled singleton items | JSON array of singleton items to evaluate |
| `{cluster_summaries}` | Experiment output | Summary of existing clusters (titles/themes) for singleton reference |
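These variables are substituted into the templates before dispatch. One detail worth a sketch: the templates contain literal JSON braces in their example output, so a naive `str.format` call would raise on them; plain per-variable replacement sidesteps that. The helper below is illustrative, not the orchestrator's actual code.

```python
import json  # used by the usage example below

def render_judge_prompt(template, variables):
    """Fill {name} placeholders without touching literal JSON braces."""
    prompt = template
    for name, value in variables.items():
        prompt = prompt.replace("{" + name + "}", value)
    return prompt

# Usage -- values come from the sources in the table above:
# prompt = render_judge_prompt(item_template, {
#     "rubric": spec["metric"]["judge"]["rubric"],
#     "items_json": json.dumps(batch, indent=2),
# })
```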
## Notes

- Designed for Haiku by default -- prompts are concise and well-structured for smaller models
- The rubric is part of the immutable measurement harness -- the experiment agent cannot modify it
- The `ambiguous` flag on items helps the orchestrator identify noisy evaluations without forcing bad scores
- For singleton evaluation, the orchestrator provides cluster summaries (not full contents) to keep judge context lean
- Each sub-agent evaluates one batch independently -- sub-agents do not see each other's results