# DSPy.rb Optimization
## MIPROv2
MIPROv2 (Multiprompt Instruction PRoposal Optimizer, version 2) is the primary instruction tuner in DSPy.rb. It proposes new instructions and few-shot demonstrations per predictor, evaluates them on mini-batches, and retains candidates that improve the metric. It ships as a separate gem to keep the Gaussian Process dependency tree out of apps that do not need it.
### Installation
```ruby
# Gemfile
gem "dspy"
gem "dspy-miprov2"
```
Bundler auto-requires `dspy/miprov2`. No additional `require` statement is needed.
### AutoMode presets
Use `DSPy::Teleprompt::MIPROv2::AutoMode` for preconfigured optimizers:
```ruby
light  = DSPy::Teleprompt::MIPROv2::AutoMode.light(metric: metric)  # 6 trials, greedy
medium = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric) # 12 trials, adaptive
heavy  = DSPy::Teleprompt::MIPROv2::AutoMode.heavy(metric: metric)  # 18 trials, Bayesian
```
| Preset | Trials | Strategy | Use case |
|----------|--------|------------|-----------------------------------------------------|
| `light` | 6 | `:greedy` | Quick wins on small datasets or during prototyping. |
| `medium` | 12 | `:adaptive`| Balanced exploration vs. runtime for most pilots. |
| `heavy` | 18 | `:bayesian`| Highest accuracy targets or multi-stage programs. |
### Manual configuration with dry-configurable
`DSPy::Teleprompt::MIPROv2` includes `Dry::Configurable`. Configure at the class level (defaults for all instances) or instance level (overrides class defaults).
**Class-level defaults:**
```ruby
DSPy::Teleprompt::MIPROv2.configure do |config|
  config.optimization_strategy = :bayesian
  config.num_trials = 30
  config.bootstrap_sets = 10
end
```
**Instance-level overrides:**
```ruby
optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure do |config|
  config.num_trials = 15
  config.num_instruction_candidates = 6
  config.bootstrap_sets = 5
  config.max_bootstrapped_examples = 4
  config.max_labeled_examples = 16
  config.optimization_strategy = :adaptive # :greedy, :adaptive, :bayesian
  config.early_stopping_patience = 3
  config.init_temperature = 1.0
  config.final_temperature = 0.1
  config.minibatch_size = nil # nil = auto
  config.auto_seed = 42
end
```
The `optimization_strategy` setting accepts symbols (`:greedy`, `:adaptive`, `:bayesian`) and coerces them internally to `DSPy::Teleprompt::OptimizationStrategy` T::Enum values.
The old `config:` constructor parameter is removed. Passing `config:` raises `ArgumentError`.
### Auto presets via configure
Instead of `AutoMode`, set the preset through the configure block:
```ruby
optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure do |config|
  config.auto_preset = DSPy::Teleprompt::AutoPreset.deserialize("medium")
end
```
### Compile and inspect
```ruby
program = DSPy::Predict.new(MySignature)
result = optimizer.compile(
  program,
  trainset: train_examples,
  valset: val_examples
)
optimized_program = result.optimized_program
puts "Best score: #{result.best_score_value}"
```
The `result` object exposes:
- `optimized_program` -- ready-to-use predictor with updated instruction and demos.
- `optimization_trace[:trial_logs]` -- per-trial record of instructions, demos, and scores.
- `metadata[:optimizer]` -- `"MIPROv2"`, useful when persisting experiments from multiple optimizers.
### Multi-stage programs
MIPROv2 generates dataset summaries for each predictor and proposes per-stage instructions. For a ReAct agent with `thought_generator` and `observation_processor` predictors, the optimizer handles credit assignment internally. The metric only needs to evaluate the final output.
### Bootstrap sampling
During the bootstrap phase MIPROv2:
1. Generates dataset summaries from the training set.
2. Bootstraps few-shot demonstrations by running the baseline program.
3. Proposes candidate instructions grounded in the summaries and bootstrapped examples.
4. Evaluates each candidate on mini-batches drawn from the validation set.
Control the bootstrap phase with `bootstrap_sets`, `max_bootstrapped_examples`, and `max_labeled_examples`.
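The demo-selection step can be sketched in plain Ruby. This is an illustrative stand-in for intuition only, not DSPy.rb internals: `bootstrap_demos`, the hash-shaped examples, and the keyword baseline are all hypothetical.

```ruby
# Illustrative sketch of bootstrapping few-shot demos (not DSPy.rb internals):
# run a baseline over the trainset and keep examples it already answers
# correctly, capped at max_bootstrapped_examples.
def bootstrap_demos(trainset, max_bootstrapped_examples:)
  demos = []
  trainset.each do |example|
    break if demos.size >= max_bootstrapped_examples
    prediction = yield(example) # stand-in for running the baseline program
    demos << example if prediction == example[:label]
  end
  demos
end

trainset = [
  { text: "great movie", label: "pos" },
  { text: "terrible plot", label: "neg" },
  { text: "loved it", label: "pos" }
]

# Toy baseline: classify by keyword.
demos = bootstrap_demos(trainset, max_bootstrapped_examples: 2) do |ex|
  ex[:text].match?(/great|loved/) ? "pos" : "neg"
end
demos.size # => 2
```

Only examples the baseline already gets right become demonstrations, which is why a reasonable baseline program matters before optimizing.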
### Bayesian optimization
When `optimization_strategy` is `:bayesian` (or when using the `heavy` preset), MIPROv2 fits a Gaussian Process surrogate over past trial scores to select the next candidate. This replaces random search with informed exploration, reducing the number of trials needed to find high-scoring instructions.
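To see why informed selection beats random search, here is a toy upper-confidence-bound picker in plain Ruby. It is a simplification for intuition only, not the Gaussian Process surrogate MIPROv2 fits; all names are hypothetical.

```ruby
# Toy "informed exploration" over candidate instructions (illustrative only,
# not a real Gaussian Process). Each candidate tracks its observed trial
# scores; we pick the one maximizing mean + an exploration bonus, so
# rarely-tried candidates still get evaluated.
def pick_next(observations, exploration: 1.0)
  observations.max_by do |_name, scores|
    mean = scores.sum / scores.size.to_f
    bonus = exploration / Math.sqrt(scores.size) # shrinks as a candidate is tried more
    mean + bonus
  end.first
end

observations = {
  "instruction_a" => [0.80, 0.82, 0.81, 0.79], # well explored, good
  "instruction_b" => [0.70],                   # barely explored
  "instruction_c" => [0.55, 0.60]              # explored, weak
}
pick_next(observations) # => "instruction_b" (lower mean, but high uncertainty)
```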
---
## GEPA
GEPA (Genetic-Pareto) is a feedback-driven optimizer that evolves prompts through reflection. It runs the program on a small batch, collects scores and textual feedback, and asks a reflection LM to rewrite the instruction. Improved candidates are retained on a Pareto frontier.
### Installation
```ruby
# Gemfile
gem "dspy"
gem "dspy-gepa"
```
The `dspy-gepa` gem depends on the `gepa` core optimizer gem automatically.
### Metric contract
GEPA metrics return `DSPy::Prediction` with both a numeric score and a feedback string. Do not return a plain boolean.
```ruby
metric = lambda do |example, prediction|
  expected = example.expected_values[:label]
  predicted = prediction.label
  score = predicted == expected ? 1.0 : 0.0
  feedback = if score == 1.0
    "Correct (#{expected}) for: \"#{example.input_values[:text][0..60]}\""
  else
    "Misclassified (expected #{expected}, got #{predicted}) for: \"#{example.input_values[:text][0..60]}\""
  end
  DSPy::Prediction.new(score: score, feedback: feedback)
end
```
Keep the score in `[0, 1]`. Always include a short feedback message explaining what happened -- GEPA hands this text to the reflection model so it can reason about failures.
### Feedback maps
`feedback_map` targets individual predictors inside a composite module. Each entry receives keyword arguments and returns a `DSPy::Prediction`:
```ruby
feedback_map = {
  'self' => lambda do |predictor_output:, predictor_inputs:, module_inputs:, module_outputs:, captured_trace:|
    expected = module_inputs.expected_values[:label]
    predicted = predictor_output.label
    DSPy::Prediction.new(
      score: predicted == expected ? 1.0 : 0.0,
      feedback: "Classifier saw \"#{predictor_inputs[:text][0..80]}\" -> #{predicted} (expected #{expected})"
    )
  end
}
```
For single-predictor programs, key the map with `'self'`. For multi-predictor chains, add entries per component so the reflection LM sees localized context at each step. Omit `feedback_map` entirely if the top-level metric already covers the basics.
### Configuring the teleprompter
```ruby
teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: DSPy::ReflectionLM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY']),
  feedback_map: feedback_map,
  config: {
    max_metric_calls: 600,
    minibatch_size: 6,
    skip_perfect_score: false
  }
)
```
Key configuration knobs:
| Knob | Purpose |
|----------------------|-------------------------------------------------------------------------------------------|
| `max_metric_calls` | Hard budget on evaluation calls. Set to at least the validation set size plus a few minibatches. |
| `minibatch_size` | Examples per reflective replay batch. Smaller = cheaper iterations, noisier scores. |
| `skip_perfect_score` | Set `true` to stop early when a candidate reaches score `1.0`. |
### Minibatch sizing
| Goal | Suggested size | Rationale |
|-------------------------------------------------|----------------|------------------------------------------------------------|
| Explore many candidates within a tight budget | 3--6 | Cheap iterations, more prompt variants, noisier metrics. |
| Stable metrics when each rollout is costly | 8--12 | Smoother scores, fewer candidates unless budget is raised. |
| Investigate specific failure modes | 3--4 then 8+ | Start with breadth, increase once patterns emerge. |
### Compile and evaluate
```ruby
program = DSPy::Predict.new(MySignature)
result = teleprompter.compile(program, trainset: train, valset: val)
optimized_program = result.optimized_program
test_metrics = evaluate(optimized_program, test) # evaluate: your own helper, e.g. built on DSPy::Evals
```
The `result` object exposes:
- `optimized_program` -- predictor with updated instruction and few-shot examples.
- `best_score_value` -- validation score for the best candidate.
- `metadata` -- candidate counts, trace hashes, and telemetry IDs.
### Reflection LM
Swap `DSPy::ReflectionLM` for any callable object that accepts the reflection prompt hash and returns a string. The default reflection signature extracts the new instruction from triple backticks in the response.
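For example, a canned callable is enough to smoke-test the loop without an API key. This is a sketch: the `feedback` key and the prompt-hash shape are assumptions, not a documented contract.

```ruby
# A minimal custom reflection "LM": any object responding to #call that
# takes the reflection prompt hash and returns a string will do.
canned_reflection = lambda do |prompt|
  # A real implementation would read the failures from the prompt hash
  # (e.g. prompt[:feedback]); this stub always proposes the same rewritten
  # instruction wrapped in triple backticks for extraction.
  <<~RESPONSE
    The failures suggest the instruction should demand a single label.
    ```
    Classify the text as exactly one label from the allowed set.
    ```
  RESPONSE
end

canned_reflection.call({ feedback: "Misclassified 3 of 6 examples" })
```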
### Experiment tracking
Plug `GEPA::Logging::ExperimentTracker` into a persistence layer:
```ruby
tracker = GEPA::Logging::ExperimentTracker.new
tracker.with_subscriber { |event| MyModel.create!(payload: event) }
teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: reflection_lm,
  experiment_tracker: tracker,
  config: { max_metric_calls: 900 }
)
```
The tracker emits Pareto update events, merge decisions, and candidate evolution records as JSONL.
### Pareto frontier
GEPA maintains a diverse candidate pool and samples from the Pareto frontier instead of mutating only the top-scoring program. This balances exploration and prevents the search from collapsing onto a single lineage.
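The frontier idea can be illustrated with a plain-Ruby nondominated filter over per-example score vectors. This is a sketch for intuition, not GEPA's implementation; all names are hypothetical.

```ruby
# Sketch of a Pareto frontier over per-example scores (illustrative only).
# A candidate leaves the frontier only if some other candidate scores >= it
# on every example AND > it on at least one (i.e. it is dominated).
def pareto_frontier(candidates)
  candidates.reject do |name, scores|
    candidates.any? do |other_name, other|
      next false if other_name == name
      other.zip(scores).all? { |o, s| o >= s } &&
        other.zip(scores).any? { |o, s| o > s }
    end
  end.keys
end

candidates = {
  "v1" => [1.0, 0.0, 1.0], # best on examples 1 and 3
  "v2" => [0.0, 1.0, 0.0], # best on example 2 -- kept despite lower total
  "v3" => [0.0, 0.0, 1.0]  # dominated by v1
}
pareto_frontier(candidates) # => ["v1", "v2"]
```

Note that `v2` survives with a lower aggregate score because it is the only candidate that solves example 2; this is the diversity the section describes.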
Enable the merge proposer after multiple strong lineages emerge:
```ruby
config: {
  max_metric_calls: 900,
  enable_merge_proposer: true
}
```
Premature merges eat budget without meaningful gains. Gate merge on having several validated candidates first.
### Advanced options
- `acceptance_strategy:` -- plug in bespoke Pareto filters or early-stop heuristics.
- Telemetry spans emit via `GEPA::Telemetry`. Enable global observability with `DSPy.configure { |c| c.observability = true }` to stream spans to an OpenTelemetry exporter.
---
## Evaluation Framework
`DSPy::Evals` provides batch evaluation of predictors against test datasets with built-in and custom metrics.
### Basic usage
```ruby
metric = proc do |example, prediction|
  prediction.answer == example.expected_values[:answer]
end
evaluator = DSPy::Evals.new(predictor, metric: metric)
result = evaluator.evaluate(
  test_examples,
  display_table: true,
  display_progress: true
)
puts "Pass rate: #{(result.pass_rate * 100).round(1)}%"
puts "Passed: #{result.passed_examples}/#{result.total_examples}"
```
### DSPy::Example
Convert raw data into `DSPy::Example` instances before passing to optimizers or evaluators. Each example carries `input_values` and `expected_values`:
```ruby
examples = rows.map do |row|
  DSPy::Example.new(
    input_values: { text: row[:text] },
    expected_values: { label: row[:label] }
  )
end
train, val, test = split_examples(examples, train_ratio: 0.6, val_ratio: 0.2, seed: 42)
```
Hold back a test set from the optimization loop. Optimizers work on train/val; only the test set proves generalization.
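`split_examples` in the snippet above is your own helper, not part of DSPy.rb. One possible implementation, assuming a seeded shuffle and ratio-based slicing:

```ruby
# Hypothetical split_examples helper (not a DSPy.rb API): deterministic
# shuffle by seed, then slice by ratio; the remainder becomes the test set.
def split_examples(examples, train_ratio:, val_ratio:, seed:)
  shuffled = examples.shuffle(random: Random.new(seed))
  n_train = (shuffled.size * train_ratio).floor
  n_val = (shuffled.size * val_ratio).floor
  [
    shuffled[0...n_train],
    shuffled[n_train...(n_train + n_val)],
    shuffled[(n_train + n_val)..]
  ]
end

train, val, test = split_examples((1..10).to_a, train_ratio: 0.6, val_ratio: 0.2, seed: 42)
# sizes: 6 / 2 / 2
```

Fixing the seed keeps the split reproducible across optimization runs, so scores stay comparable.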
### Built-in metrics
```ruby
# Exact match -- prediction must exactly equal expected value
metric = DSPy::Metrics.exact_match(field: :answer, case_sensitive: true)
# Contains -- prediction must contain expected substring
metric = DSPy::Metrics.contains(field: :answer, case_sensitive: false)
# Numeric difference -- numeric output within tolerance
metric = DSPy::Metrics.numeric_difference(field: :answer, tolerance: 0.01)
# Composite AND -- all sub-metrics must pass
metric = DSPy::Metrics.composite_and(
  DSPy::Metrics.exact_match(field: :answer),
  DSPy::Metrics.contains(field: :reasoning)
)
```
### Custom metrics
```ruby
quality_metric = lambda do |example, prediction|
  return false unless prediction
  score = 0.0
  score += 0.5 if prediction.answer == example.expected_values[:answer]
  score += 0.3 if prediction.explanation && prediction.explanation.length > 50
  score += 0.2 if prediction.confidence && prediction.confidence > 0.8
  score >= 0.7
end
evaluator = DSPy::Evals.new(predictor, metric: quality_metric)
```
Access prediction fields with dot notation (`prediction.answer`), not hash notation.
### Observability hooks
Register callbacks without editing the evaluator:
```ruby
DSPy::Evals.before_example do |payload|
  example = payload[:example]
  DSPy.logger.info("Evaluating example #{example.id}") if example.respond_to?(:id)
end
DSPy::Evals.after_batch do |payload|
  result = payload[:result]
  Langfuse.event(
    name: 'eval.batch',
    metadata: {
      total: result.total_examples,
      passed: result.passed_examples,
      score: result.score
    }
  )
end
```
Available hooks: `before_example`, `after_example`, `before_batch`, `after_batch`.
### Langfuse score export
Enable `export_scores: true` to emit `score.create` events for each evaluated example and a batch score at the end:
```ruby
evaluator = DSPy::Evals.new(
  predictor,
  metric: metric,
  export_scores: true,
  score_name: 'qa_accuracy' # default: 'evaluation'
)
result = evaluator.evaluate(test_examples)
# Emits per-example scores + overall batch score via DSPy::Scores::Exporter
```
Scores attach to the current trace context automatically and flow to Langfuse asynchronously.
### Evaluation results
```ruby
result = evaluator.evaluate(test_examples)
result.score         # Overall score (0.0 to 1.0)
result.passed_count  # Examples that passed
result.failed_count  # Examples that failed
result.error_count   # Examples that errored
result.results.each do |r|
  r.passed  # Boolean
  r.score   # Numeric score
  r.error   # Error message if the example errored
end
```
### Integration with optimizers
```ruby
metric = proc do |example, prediction|
  expected = example.expected_values[:answer].to_s.strip.downcase
  predicted = prediction.answer.to_s.strip.downcase
  !expected.empty? && predicted.include?(expected)
end
optimizer = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)
result = optimizer.compile(
  DSPy::Predict.new(QASignature),
  trainset: train_examples,
  valset: val_examples
)
evaluator = DSPy::Evals.new(result.optimized_program, metric: metric)
test_result = evaluator.evaluate(test_examples, display_table: true)
puts "Test accuracy: #{(test_result.pass_rate * 100).round(2)}%"
```
---
## Storage System
`DSPy::Storage` persists optimization results, tracks history, and manages multiple versions of optimized programs.
### ProgramStorage (low-level)
```ruby
storage = DSPy::Storage::ProgramStorage.new(storage_path: "./dspy_storage")
# Save
saved = storage.save_program(
  result.optimized_program,
  result,
  metadata: {
    signature_class: 'ClassifyText',
    optimizer: 'MIPROv2',
    examples_count: examples.size
  }
)
puts "Stored with ID: #{saved.program_id}"
# Load
saved = storage.load_program(program_id)
predictor = saved.program
score = saved.optimization_result[:best_score_value]
# List
storage.list_programs.each do |p|
  puts "#{p[:program_id]} -- score: #{p[:best_score]} -- saved: #{p[:saved_at]}"
end
```
### StorageManager (recommended)
```ruby
manager = DSPy::Storage::StorageManager.new
# Save with tags
saved = manager.save_optimization_result(
  result,
  tags: ['production', 'sentiment-analysis'],
  description: 'Optimized sentiment classifier v2'
)
# Find programs
programs = manager.find_programs(
  optimizer: 'MIPROv2',
  min_score: 0.85,
  tags: ['production']
)
recent = manager.find_programs(
  max_age_days: 7,
  signature_class: 'ClassifyText'
)
# Get best program for a signature
best = manager.get_best_program('ClassifyText')
predictor = best.program
```
Global shorthand:
```ruby
DSPy::Storage::StorageManager.save(result, metadata: { version: '2.0' })
DSPy::Storage::StorageManager.load(program_id)
DSPy::Storage::StorageManager.best('ClassifyText')
```
### Checkpoints
Create and restore checkpoints during long-running optimizations:
```ruby
# Save a checkpoint
manager.create_checkpoint(
  current_result,
  'iteration_50',
  metadata: { iteration: 50, current_score: 0.87 }
)
# Restore
restored = manager.restore_checkpoint('iteration_50')
program = restored.program
# Auto-checkpoint every N iterations
if iteration % 10 == 0
  manager.create_checkpoint(current_result, "auto_checkpoint_#{iteration}")
end
```
### Import and export
Share programs between environments:
```ruby
storage = DSPy::Storage::ProgramStorage.new
# Export
storage.export_programs(['abc123', 'def456'], './export_backup.json')
# Import
imported = storage.import_programs('./export_backup.json')
puts "Imported #{imported.size} programs"
```
### Optimization history
```ruby
history = manager.get_optimization_history
history[:summary][:total_programs]
history[:summary][:avg_score]
history[:optimizer_stats].each do |optimizer, stats|
  puts "#{optimizer}: #{stats[:count]} programs, best: #{stats[:best_score]}"
end
history[:trends][:improvement_percentage]
```
### Program comparison
```ruby
comparison = manager.compare_programs(id_a, id_b)
comparison[:comparison][:score_difference]
comparison[:comparison][:better_program]
comparison[:comparison][:age_difference_hours]
```
### Storage configuration
```ruby
config = DSPy::Storage::StorageManager::StorageConfig.new
config.storage_path = Rails.root.join('dspy_storage')
config.auto_save = true
config.save_intermediate_results = false
config.max_stored_programs = 100
manager = DSPy::Storage::StorageManager.new(config: config)
```
### Cleanup
Remove old programs. Cleanup retains the best-performing and most recent programs using a weighted score (70% performance, 30% recency):
```ruby
deleted_count = manager.cleanup_old_programs
```
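The 70/30 weighting can be sketched as follows. This is illustrative only; the field names and the 30-day recency window are assumptions, not StorageManager internals.

```ruby
# Sketch of the 70% performance / 30% recency retention weighting
# (hypothetical field names, not StorageManager internals).
def retention_score(program, now: Time.now, max_age_days: 30.0)
  age_days = (now - program[:saved_at]) / 86_400.0
  recency = [1.0 - (age_days / max_age_days), 0.0].max # 1.0 = just saved
  0.7 * program[:best_score] + 0.3 * recency
end

now = Time.now
old_but_good = { best_score: 0.90, saved_at: now - (20 * 86_400) }
new_but_weak = { best_score: 0.60, saved_at: now }
retention_score(old_but_good, now: now) # ≈ 0.73
retention_score(new_but_weak, now: now) # ≈ 0.72
```

With this weighting, a strong 20-day-old program narrowly outranks a weak program saved today, so quality dominates but freshness still breaks ties.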
### Storage events
The storage system emits structured log events for monitoring:
- `dspy.storage.save_start`, `dspy.storage.save_complete`, `dspy.storage.save_error`
- `dspy.storage.load_start`, `dspy.storage.load_complete`, `dspy.storage.load_error`
- `dspy.storage.delete`, `dspy.storage.export`, `dspy.storage.import`, `dspy.storage.cleanup`
### File layout
```
dspy_storage/
  programs/
    abc123def456.json
    789xyz012345.json
  history.json
```
---
## API rules
- Call predictors with `.call()`, not `.forward()`.
- Access prediction fields with dot notation (`result.answer`), not hash notation (`result[:answer]`).
- GEPA metrics return `DSPy::Prediction.new(score:, feedback:)`, not a boolean.
- MIPROv2 metrics may return `true`/`false`, a numeric score, or `DSPy::Prediction`.