# DSPy.rb Optimization

## MIPROv2
MIPROv2 (Multi-prompt Instruction Proposal with Retrieval Optimization) is the primary instruction tuner in DSPy.rb. It proposes new instructions and few-shot demonstrations per predictor, evaluates them on mini-batches, and retains candidates that improve the metric. It ships as a separate gem to keep the Gaussian Process dependency tree out of apps that do not need it.
### Installation

```ruby
# Gemfile
gem "dspy"
gem "dspy-miprov2"
```

Bundler auto-requires `dspy/miprov2`. No additional `require` statement is needed.
### AutoMode presets

Use `DSPy::Teleprompt::MIPROv2::AutoMode` for preconfigured optimizers:

```ruby
light  = DSPy::Teleprompt::MIPROv2::AutoMode.light(metric: metric)  # 6 trials, greedy
medium = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric) # 12 trials, adaptive
heavy  = DSPy::Teleprompt::MIPROv2::AutoMode.heavy(metric: metric)  # 18 trials, Bayesian
```
| Preset | Trials | Strategy | Use case |
|---|---|---|---|
| `light` | 6 | `:greedy` | Quick wins on small datasets or during prototyping. |
| `medium` | 12 | `:adaptive` | Balanced exploration vs. runtime for most pilots. |
| `heavy` | 18 | `:bayesian` | Highest accuracy targets or multi-stage programs. |
### Manual configuration with dry-configurable

`DSPy::Teleprompt::MIPROv2` includes `Dry::Configurable`. Configure at the class level (defaults for all instances) or instance level (overrides class defaults).

Class-level defaults:

```ruby
DSPy::Teleprompt::MIPROv2.configure do |config|
  config.optimization_strategy = :bayesian
  config.num_trials = 30
  config.bootstrap_sets = 10
end
```
Instance-level overrides:

```ruby
optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure do |config|
  config.num_trials = 15
  config.num_instruction_candidates = 6
  config.bootstrap_sets = 5
  config.max_bootstrapped_examples = 4
  config.max_labeled_examples = 16
  config.optimization_strategy = :adaptive # :greedy, :adaptive, :bayesian
  config.early_stopping_patience = 3
  config.init_temperature = 1.0
  config.final_temperature = 0.1
  config.minibatch_size = nil # nil = auto
  config.auto_seed = 42
end
```
The `optimization_strategy` setting accepts symbols (`:greedy`, `:adaptive`, `:bayesian`) and coerces them internally to `DSPy::Teleprompt::OptimizationStrategy` `T::Enum` values.

The old `config:` constructor parameter is removed. Passing `config:` raises `ArgumentError`.
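For code written against the old constructor, the migration looks like this (a sketch; the `num_trials` value is just an example):

```ruby
# Before (now raises ArgumentError):
# optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric, config: my_config)

# After -- construct first, then configure:
optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure { |config| config.num_trials = 15 }
```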
### Auto presets via configure

Instead of `AutoMode`, set the preset through the `configure` block:

```ruby
optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure do |config|
  config.auto_preset = DSPy::Teleprompt::AutoPreset.deserialize("medium")
end
```
### Compile and inspect

```ruby
program = DSPy::Predict.new(MySignature)

result = optimizer.compile(
  program,
  trainset: train_examples,
  valset: val_examples
)

optimized_program = result.optimized_program
puts "Best score: #{result.best_score_value}"
```

The result object exposes:

- `optimized_program` -- ready-to-use predictor with updated instruction and demos.
- `optimization_trace[:trial_logs]` -- per-trial record of instructions, demos, and scores.
- `metadata[:optimizer]` -- `"MIPROv2"`, useful when persisting experiments from multiple optimizers.
### Multi-stage programs

MIPROv2 generates dataset summaries for each predictor and proposes per-stage instructions. For a ReAct agent with `thought_generator` and `observation_processor` predictors, the optimizer handles credit assignment internally. The metric only needs to evaluate the final output.
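A minimal sketch of such a final-output-only metric (the `answer` field name is an assumption for illustration):

```ruby
# Score only the agent's final answer; MIPROv2 handles per-stage credit
# assignment for thought_generator and observation_processor internally.
metric = lambda do |example, prediction|
  prediction.answer.to_s.strip.downcase ==
    example.expected_values[:answer].to_s.strip.downcase
end
```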
### Bootstrap sampling

During the bootstrap phase MIPROv2:

1. Generates dataset summaries from the training set.
2. Bootstraps few-shot demonstrations by running the baseline program.
3. Proposes candidate instructions grounded in the summaries and bootstrapped examples.
4. Evaluates each candidate on mini-batches drawn from the validation set.

Control the bootstrap phase with `bootstrap_sets`, `max_bootstrapped_examples`, and `max_labeled_examples`.
### Bayesian optimization

When `optimization_strategy` is `:bayesian` (or when using the heavy preset), MIPROv2 fits a Gaussian Process surrogate over past trial scores to select the next candidate. This replaces random search with informed exploration, reducing the number of trials needed to find high-scoring instructions.
## GEPA

GEPA (Genetic-Pareto Reflective Prompt Evolution) is a feedback-driven optimizer. It runs the program on a small batch, collects scores and textual feedback, and asks a reflection LM to rewrite the instruction. Improved candidates are retained on a Pareto frontier.

### Installation

```ruby
# Gemfile
gem "dspy"
gem "dspy-gepa"
```

The `dspy-gepa` gem depends on the `gepa` core optimizer gem automatically.
### Metric contract

GEPA metrics return `DSPy::Prediction` with both a numeric score and a feedback string. Do not return a plain boolean.

```ruby
metric = lambda do |example, prediction|
  expected = example.expected_values[:label]
  predicted = prediction.label
  score = predicted == expected ? 1.0 : 0.0
  feedback = if score == 1.0
    "Correct (#{expected}) for: \"#{example.input_values[:text][0..60]}\""
  else
    "Misclassified (expected #{expected}, got #{predicted}) for: \"#{example.input_values[:text][0..60]}\""
  end
  DSPy::Prediction.new(score: score, feedback: feedback)
end
```
Keep the score in [0, 1]. Always include a short feedback message explaining what happened -- GEPA hands this text to the reflection model so it can reason about failures.
### Feedback maps

`feedback_map` targets individual predictors inside a composite module. Each entry receives keyword arguments and returns a `DSPy::Prediction`:

```ruby
feedback_map = {
  'self' => lambda do |predictor_output:, predictor_inputs:, module_inputs:, module_outputs:, captured_trace:|
    expected = module_inputs.expected_values[:label]
    predicted = predictor_output.label
    DSPy::Prediction.new(
      score: predicted == expected ? 1.0 : 0.0,
      feedback: "Classifier saw \"#{predictor_inputs[:text][0..80]}\" -> #{predicted} (expected #{expected})"
    )
  end
}
```

For single-predictor programs, key the map with `'self'`. For multi-predictor chains, add entries per component so the reflection LM sees localized context at each step. Omit `feedback_map` entirely if the top-level metric already covers the basics.
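A sketch of such a per-component map for a two-step chain; the component keys (`'summarizer'`, `'classifier'`) and output fields are hypothetical and must match your module's predictor names:

```ruby
feedback_map = {
  'summarizer' => lambda do |predictor_output:, predictor_inputs:, module_inputs:, module_outputs:, captured_trace:|
    summary = predictor_output.summary.to_s
    DSPy::Prediction.new(
      score: summary.empty? ? 0.0 : 1.0,
      feedback: "Summary was #{summary.length} chars for input \"#{predictor_inputs[:text][0..60]}\""
    )
  end,
  'classifier' => lambda do |predictor_output:, predictor_inputs:, module_inputs:, module_outputs:, captured_trace:|
    expected = module_inputs.expected_values[:label]
    DSPy::Prediction.new(
      score: predictor_output.label == expected ? 1.0 : 0.0,
      feedback: "Predicted #{predictor_output.label} (expected #{expected})"
    )
  end
}
```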
### Configuring the teleprompter

```ruby
teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: DSPy::ReflectionLM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY']),
  feedback_map: feedback_map,
  config: {
    max_metric_calls: 600,
    minibatch_size: 6,
    skip_perfect_score: false
  }
)
```
Key configuration knobs:

| Knob | Purpose |
|---|---|
| `max_metric_calls` | Hard budget on evaluation calls. Set to at least the validation set size plus a few minibatches. |
| `minibatch_size` | Examples per reflective replay batch. Smaller = cheaper iterations, noisier scores. |
| `skip_perfect_score` | Set `true` to stop early when a candidate reaches score 1.0. |
### Minibatch sizing
| Goal | Suggested size | Rationale |
|---|---|---|
| Explore many candidates within a tight budget | 3--6 | Cheap iterations, more prompt variants, noisier metrics. |
| Stable metrics when each rollout is costly | 8--12 | Smoother scores, fewer candidates unless budget is raised. |
| Investigate specific failure modes | 3--4 then 8+ | Start with breadth, increase once patterns emerge. |
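For the failure-mode workflow in the last row, one approach is two passes with increasing minibatch sizes. This is a sketch under assumptions: re-running `compile` on an already optimized program is not a documented workflow, and the budgets are illustrative.

```ruby
# Pass 1: breadth -- small minibatches, many candidate variants.
explorer = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: reflection_lm,
  config: { max_metric_calls: 300, minibatch_size: 4 }
)
first = explorer.compile(program, trainset: train, valset: val)

# Pass 2: depth -- larger minibatches for more stable scores.
refiner = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: reflection_lm,
  config: { max_metric_calls: 600, minibatch_size: 10 }
)
final = refiner.compile(first.optimized_program, trainset: train, valset: val)
```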
### Compile and evaluate

```ruby
program = DSPy::Predict.new(MySignature)
result = teleprompter.compile(program, trainset: train, valset: val)

optimized_program = result.optimized_program
test_metrics = evaluate(optimized_program, test)
```

The result object exposes:

- `optimized_program` -- predictor with updated instruction and few-shot examples.
- `best_score_value` -- validation score for the best candidate.
- `metadata` -- candidate counts, trace hashes, and telemetry IDs.
### Reflection LM

Swap `DSPy::ReflectionLM` for any callable object that accepts the reflection prompt hash and returns a string. The default reflection signature extracts the new instruction from triple backticks in the response.
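A minimal sketch of such a callable, assuming the prompt arrives as a hash and the proposed instruction must sit inside triple backticks for the default extractor:

````ruby
# Toy reflection "LM" that always proposes the same instruction.
# A real replacement would inspect the prompt hash (scores, feedback,
# current instruction) before writing a revision.
class StaticReflector
  def call(_reflection_prompt)
    <<~RESPONSE
      ```
      Classify the text. Quote the decisive phrase in your reasoning.
      ```
    RESPONSE
  end
end

teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: StaticReflector.new,
  config: { max_metric_calls: 200 }
)
````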
### Experiment tracking

Plug `GEPA::Logging::ExperimentTracker` into a persistence layer:

```ruby
tracker = GEPA::Logging::ExperimentTracker.new
tracker.with_subscriber { |event| MyModel.create!(payload: event) }

teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: reflection_lm,
  experiment_tracker: tracker,
  config: { max_metric_calls: 900 }
)
```
The tracker emits Pareto update events, merge decisions, and candidate evolution records as JSONL.
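To capture that stream to disk instead of a model, a file-backed subscriber is a small sketch (assumes events serialize cleanly with `JSON.generate`):

```ruby
require 'json'

log = File.open('gepa_events.jsonl', 'a')
log.sync = true # flush each event immediately

tracker.with_subscriber { |event| log.puts(JSON.generate(event)) }
```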
### Pareto frontier

GEPA maintains a diverse candidate pool and samples from the Pareto frontier instead of mutating only the top-scoring program. This balances exploration and prevents the search from collapsing onto a single lineage.

Enable the merge proposer after multiple strong lineages emerge:

```ruby
config: {
  max_metric_calls: 900,
  enable_merge_proposer: true
}
```

Premature merges eat budget without meaningful gains. Gate merge on having several validated candidates first.
### Advanced options

- `acceptance_strategy:` -- plug in bespoke Pareto filters or early-stop heuristics.
- Telemetry spans emit via `GEPA::Telemetry`. Enable global observability with `DSPy.configure { |c| c.observability = true }` to stream spans to an OpenTelemetry exporter.
## Evaluation Framework

`DSPy::Evals` provides batch evaluation of predictors against test datasets with built-in and custom metrics.
### Basic usage

```ruby
metric = proc do |example, prediction|
  prediction.answer == example.expected_values[:answer]
end

evaluator = DSPy::Evals.new(predictor, metric: metric)
result = evaluator.evaluate(
  test_examples,
  display_table: true,
  display_progress: true
)

puts "Pass rate: #{(result.pass_rate * 100).round(1)}%"
puts "Passed: #{result.passed_examples}/#{result.total_examples}"
```
### DSPy::Example

Convert raw data into `DSPy::Example` instances before passing to optimizers or evaluators. Each example carries `input_values` and `expected_values`:

```ruby
examples = rows.map do |row|
  DSPy::Example.new(
    input_values: { text: row[:text] },
    expected_values: { label: row[:label] }
  )
end

train, val, test = split_examples(examples, train_ratio: 0.6, val_ratio: 0.2, seed: 42)
```
Hold back a test set from the optimization loop. Optimizers work on train/val; only the test set proves generalization.
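The `split_examples` helper above is not part of DSPy.rb; a seeded implementation might look like this:

```ruby
# Deterministic shuffle-and-slice split (hypothetical helper).
def split_examples(examples, train_ratio:, val_ratio:, seed:)
  shuffled = examples.shuffle(random: Random.new(seed))
  n_train = (shuffled.size * train_ratio).floor
  n_val = (shuffled.size * val_ratio).floor
  [
    shuffled[0...n_train],
    shuffled[n_train...(n_train + n_val)],
    shuffled[(n_train + n_val)..] # remainder becomes the test set
  ]
end
```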
### Built-in metrics

```ruby
# Exact match -- prediction must exactly equal expected value
metric = DSPy::Metrics.exact_match(field: :answer, case_sensitive: true)

# Contains -- prediction must contain expected substring
metric = DSPy::Metrics.contains(field: :answer, case_sensitive: false)

# Numeric difference -- numeric output within tolerance
metric = DSPy::Metrics.numeric_difference(field: :answer, tolerance: 0.01)

# Composite AND -- all sub-metrics must pass
metric = DSPy::Metrics.composite_and(
  DSPy::Metrics.exact_match(field: :answer),
  DSPy::Metrics.contains(field: :reasoning)
)
```
### Custom metrics

```ruby
quality_metric = lambda do |example, prediction|
  return false unless prediction

  score = 0.0
  score += 0.5 if prediction.answer == example.expected_values[:answer]
  score += 0.3 if prediction.explanation && prediction.explanation.length > 50
  score += 0.2 if prediction.confidence && prediction.confidence > 0.8
  score >= 0.7
end

evaluator = DSPy::Evals.new(predictor, metric: quality_metric)
```

Access prediction fields with dot notation (`prediction.answer`), not hash notation.
### Observability hooks

Register callbacks without editing the evaluator:

```ruby
DSPy::Evals.before_example do |payload|
  example = payload[:example]
  DSPy.logger.info("Evaluating example #{example.id}") if example.respond_to?(:id)
end

DSPy::Evals.after_batch do |payload|
  result = payload[:result]
  Langfuse.event(
    name: 'eval.batch',
    metadata: {
      total: result.total_examples,
      passed: result.passed_examples,
      score: result.score
    }
  )
end
```
Available hooks: `before_example`, `after_example`, `before_batch`, `after_batch`.
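For instance, an `after_example` failure logger. This is a sketch under an assumption: the payload is taken to carry the per-example result under `:result`, mirroring the `after_batch` example above.

```ruby
DSPy::Evals.after_example do |payload|
  result = payload[:result]
  next unless result && !result.passed

  DSPy.logger.warn("Example failed: #{result.error || 'metric returned false'}")
end
```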
### Langfuse score export

Enable `export_scores: true` to emit `score.create` events for each evaluated example and a batch score at the end:

```ruby
evaluator = DSPy::Evals.new(
  predictor,
  metric: metric,
  export_scores: true,
  score_name: 'qa_accuracy' # default: 'evaluation'
)
result = evaluator.evaluate(test_examples)
# Emits per-example scores + overall batch score via DSPy::Scores::Exporter
```

Scores attach to the current trace context automatically and flow to Langfuse asynchronously.
### Evaluation results

```ruby
result = evaluator.evaluate(test_examples)

result.score        # Overall score (0.0 to 1.0)
result.passed_count # Examples that passed
result.failed_count # Examples that failed
result.error_count  # Examples that errored

result.results.each do |r|
  r.passed # Boolean
  r.score  # Numeric score
  r.error  # Error message if the example errored
end
```
### Integration with optimizers

```ruby
metric = proc do |example, prediction|
  expected = example.expected_values[:answer].to_s.strip.downcase
  predicted = prediction.answer.to_s.strip.downcase
  !expected.empty? && predicted.include?(expected)
end

optimizer = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)
result = optimizer.compile(
  DSPy::Predict.new(QASignature),
  trainset: train_examples,
  valset: val_examples
)

evaluator = DSPy::Evals.new(result.optimized_program, metric: metric)
test_result = evaluator.evaluate(test_examples, display_table: true)
puts "Test accuracy: #{(test_result.pass_rate * 100).round(2)}%"
```
## Storage System

`DSPy::Storage` persists optimization results, tracks history, and manages multiple versions of optimized programs.
### ProgramStorage (low-level)

```ruby
storage = DSPy::Storage::ProgramStorage.new(storage_path: "./dspy_storage")

# Save
saved = storage.save_program(
  result.optimized_program,
  result,
  metadata: {
    signature_class: 'ClassifyText',
    optimizer: 'MIPROv2',
    examples_count: examples.size
  }
)
puts "Stored with ID: #{saved.program_id}"

# Load
saved = storage.load_program(program_id)
predictor = saved.program
score = saved.optimization_result[:best_score_value]

# List
storage.list_programs.each do |p|
  puts "#{p[:program_id]} -- score: #{p[:best_score]} -- saved: #{p[:saved_at]}"
end
```
### StorageManager (recommended)

```ruby
manager = DSPy::Storage::StorageManager.new

# Save with tags
saved = manager.save_optimization_result(
  result,
  tags: ['production', 'sentiment-analysis'],
  description: 'Optimized sentiment classifier v2'
)

# Find programs
programs = manager.find_programs(
  optimizer: 'MIPROv2',
  min_score: 0.85,
  tags: ['production']
)
recent = manager.find_programs(
  max_age_days: 7,
  signature_class: 'ClassifyText'
)

# Get best program for a signature
best = manager.get_best_program('ClassifyText')
predictor = best.program
```
Global shorthand:

```ruby
DSPy::Storage::StorageManager.save(result, metadata: { version: '2.0' })
DSPy::Storage::StorageManager.load(program_id)
DSPy::Storage::StorageManager.best('ClassifyText')
```
### Checkpoints

Create and restore checkpoints during long-running optimizations:

```ruby
# Save a checkpoint
manager.create_checkpoint(
  current_result,
  'iteration_50',
  metadata: { iteration: 50, current_score: 0.87 }
)

# Restore
restored = manager.restore_checkpoint('iteration_50')
program = restored.program

# Auto-checkpoint every N iterations
if iteration % 10 == 0
  manager.create_checkpoint(current_result, "auto_checkpoint_#{iteration}")
end
```
### Import and export

Share programs between environments:

```ruby
storage = DSPy::Storage::ProgramStorage.new

# Export
storage.export_programs(['abc123', 'def456'], './export_backup.json')

# Import
imported = storage.import_programs('./export_backup.json')
puts "Imported #{imported.size} programs"
```
### Optimization history

```ruby
history = manager.get_optimization_history

history[:summary][:total_programs]
history[:summary][:avg_score]
history[:optimizer_stats].each do |optimizer, stats|
  puts "#{optimizer}: #{stats[:count]} programs, best: #{stats[:best_score]}"
end
history[:trends][:improvement_percentage]
```
### Program comparison

```ruby
comparison = manager.compare_programs(id_a, id_b)

comparison[:comparison][:score_difference]
comparison[:comparison][:better_program]
comparison[:comparison][:age_difference_hours]
```
### Storage configuration

```ruby
config = DSPy::Storage::StorageManager::StorageConfig.new
config.storage_path = Rails.root.join('dspy_storage')
config.auto_save = true
config.save_intermediate_results = false
config.max_stored_programs = 100

manager = DSPy::Storage::StorageManager.new(config: config)
```
### Cleanup

Remove old programs. Cleanup retains the best performing and most recent programs using a weighted score (70% performance, 30% recency):

```ruby
deleted_count = manager.cleanup_old_programs
```
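To make the weighting concrete, here is a hypothetical reconstruction of such a retention score (not the library's internal code; the 90-day horizon is an assumption):

```ruby
# 70% performance, 30% recency, with recency decaying linearly to zero
# over an assumed 90-day horizon. saved_at is a Time.
def retention_score(best_score, saved_at, now: Time.now, horizon_days: 90)
  age_fraction = (now - saved_at) / (horizon_days * 86_400.0)
  recency = 1.0 - [age_fraction, 1.0].min
  0.7 * best_score + 0.3 * recency
end
```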
### Storage events

The storage system emits structured log events for monitoring:

- `dspy.storage.save_start`, `dspy.storage.save_complete`, `dspy.storage.save_error`
- `dspy.storage.load_start`, `dspy.storage.load_complete`, `dspy.storage.load_error`
- `dspy.storage.delete`, `dspy.storage.export`, `dspy.storage.import`, `dspy.storage.cleanup`
### File layout

```
dspy_storage/
  programs/
    abc123def456.json
    789xyz012345.json
  history.json
```
## API rules

- Call predictors with `.call()`, not `.forward()`.
- Access prediction fields with dot notation (`result.answer`), not hash notation (`result[:answer]`).
- GEPA metrics return `DSPy::Prediction.new(score:, feedback:)`, not a boolean.
- MIPROv2 metrics may return `true`/`false`, a numeric score, or `DSPy::Prediction`.