
DSPy.rb Optimization

MIPROv2

MIPROv2 (Multiprompt Instruction PRoposal Optimizer, version 2) is the primary instruction tuner in DSPy.rb. It proposes new instructions and few-shot demonstrations per predictor, evaluates them on mini-batches, and retains candidates that improve the metric. It ships as a separate gem to keep the Gaussian Process dependency tree out of apps that do not need it.

Installation

# Gemfile
gem "dspy"
gem "dspy-miprov2"

Bundler auto-requires dspy/miprov2. No additional require statement is needed.

AutoMode presets

Use DSPy::Teleprompt::MIPROv2::AutoMode for preconfigured optimizers:

light  = DSPy::Teleprompt::MIPROv2::AutoMode.light(metric: metric)   # 6 trials, greedy
medium = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)  # 12 trials, adaptive
heavy  = DSPy::Teleprompt::MIPROv2::AutoMode.heavy(metric: metric)   # 18 trials, Bayesian
Preset   Trials   Strategy    Use case
light    6        :greedy     Quick wins on small datasets or during prototyping.
medium   12       :adaptive   Balanced exploration vs. runtime for most pilots.
heavy    18       :bayesian   Highest accuracy targets or multi-stage programs.

Manual configuration with dry-configurable

DSPy::Teleprompt::MIPROv2 includes Dry::Configurable. Configure at the class level (defaults for all instances) or instance level (overrides class defaults).

Class-level defaults:

DSPy::Teleprompt::MIPROv2.configure do |config|
  config.optimization_strategy = :bayesian
  config.num_trials = 30
  config.bootstrap_sets = 10
end

Instance-level overrides:

optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure do |config|
  config.num_trials = 15
  config.num_instruction_candidates = 6
  config.bootstrap_sets = 5
  config.max_bootstrapped_examples = 4
  config.max_labeled_examples = 16
  config.optimization_strategy = :adaptive       # :greedy, :adaptive, :bayesian
  config.early_stopping_patience = 3
  config.init_temperature = 1.0
  config.final_temperature = 0.1
  config.minibatch_size = nil                     # nil = auto
  config.auto_seed = 42
end

The optimization_strategy setting accepts symbols (:greedy, :adaptive, :bayesian) and coerces them internally to DSPy::Teleprompt::OptimizationStrategy T::Enum values.

The old config: constructor parameter is removed. Passing config: raises ArgumentError.

Auto presets via configure

Instead of AutoMode, set the preset through the configure block:

optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure do |config|
  config.auto_preset = DSPy::Teleprompt::AutoPreset.deserialize("medium")
end

Compile and inspect

program = DSPy::Predict.new(MySignature)

result = optimizer.compile(
  program,
  trainset: train_examples,
  valset: val_examples
)

optimized_program = result.optimized_program
puts "Best score: #{result.best_score_value}"

The result object exposes:

  • optimized_program -- ready-to-use predictor with updated instruction and demos.
  • optimization_trace[:trial_logs] -- per-trial record of instructions, demos, and scores.
  • metadata[:optimizer] -- "MIPROv2", useful when persisting experiments from multiple optimizers.

Multi-stage programs

MIPROv2 generates dataset summaries for each predictor and proposes per-stage instructions. For a ReAct agent with thought_generator and observation_processor predictors, the optimizer handles credit assignment internally. The metric only needs to evaluate the final output.
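Because credit assignment is internal, the metric only inspects the final prediction. A minimal sketch, using plain Struct stand-ins for the example and prediction objects (in real code these are DSPy::Example instances and the prediction returned by .call):

```ruby
# Metric over the final output only; intermediate predictor outputs
# (thoughts, observations) are never inspected by the metric.
final_answer_metric = lambda do |example, prediction|
  expected  = example.expected_values[:answer].to_s.strip.downcase
  predicted = prediction.answer.to_s.strip.downcase
  predicted == expected
end

# Struct stand-ins illustrating the call shape.
ExampleStub    = Struct.new(:expected_values)
PredictionStub = Struct.new(:answer)

example    = ExampleStub.new({ answer: "Paris" })
prediction = PredictionStub.new("paris")
final_answer_metric.call(example, prediction)  # => true
```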

Bootstrap sampling

During the bootstrap phase MIPROv2:

  1. Generates dataset summaries from the training set.
  2. Bootstraps few-shot demonstrations by running the baseline program.
  3. Proposes candidate instructions grounded in the summaries and bootstrapped examples.
  4. Evaluates each candidate on mini-batches drawn from the validation set.

Control the bootstrap phase with bootstrap_sets, max_bootstrapped_examples, and max_labeled_examples.

Bayesian optimization

When optimization_strategy is :bayesian (or when using the heavy preset), MIPROv2 fits a Gaussian Process surrogate over past trial scores to select the next candidate. This replaces random search with informed exploration, reducing the number of trials needed to find high-scoring instructions.
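The intuition behind informed exploration can be sketched with a simpler bandit-style (UCB) selection rule. This illustrates the idea only; it is not the library's Gaussian Process code:

```ruby
# Score each instruction candidate by its mean observed metric plus an
# uncertainty bonus; untried candidates are explored first. This stands
# in for the GP surrogate's informed exploration.
def next_candidate(history, candidates, c: 1.0)
  total = [history.values.sum(&:size), 1].max
  candidates.max_by do |cand|
    trials = history.fetch(cand, [])
    next Float::INFINITY if trials.empty?   # always try unseen candidates
    mean = trials.sum / trials.size
    mean + c * Math.sqrt(Math.log(total) / trials.size)
  end
end

history = { "prompt_a" => [0.6, 0.7], "prompt_b" => [0.4] }
next_candidate(history, ["prompt_a", "prompt_b", "prompt_c"])
# => "prompt_c" (never tried, so it is explored first)
```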


GEPA

GEPA (Genetic-Pareto Reflective Prompt Evolution) is a feedback-driven optimizer. It runs the program on a small batch, collects scores and textual feedback, and asks a reflection LM to rewrite the instruction. Improved candidates are retained on a Pareto frontier.

Installation

# Gemfile
gem "dspy"
gem "dspy-gepa"

The dspy-gepa gem automatically pulls in the core gepa optimizer gem as a dependency.

Metric contract

GEPA metrics return DSPy::Prediction with both a numeric score and a feedback string. Do not return a plain boolean.

metric = lambda do |example, prediction|
  expected  = example.expected_values[:label]
  predicted = prediction.label

  score = predicted == expected ? 1.0 : 0.0
  feedback = if score == 1.0
    "Correct (#{expected}) for: \"#{example.input_values[:text][0..60]}\""
  else
    "Misclassified (expected #{expected}, got #{predicted}) for: \"#{example.input_values[:text][0..60]}\""
  end

  DSPy::Prediction.new(score: score, feedback: feedback)
end

Keep the score in [0, 1]. Always include a short feedback message explaining what happened -- GEPA hands this text to the reflection model so it can reason about failures.

Feedback maps

feedback_map targets individual predictors inside a composite module. Each entry receives keyword arguments and returns a DSPy::Prediction:

feedback_map = {
  'self' => lambda do |predictor_output:, predictor_inputs:, module_inputs:, module_outputs:, captured_trace:|
    expected  = module_inputs.expected_values[:label]
    predicted = predictor_output.label

    DSPy::Prediction.new(
      score: predicted == expected ? 1.0 : 0.0,
      feedback: "Classifier saw \"#{predictor_inputs[:text][0..80]}\" -> #{predicted} (expected #{expected})"
    )
  end
}

For single-predictor programs, key the map with 'self'. For multi-predictor chains, add entries per component so the reflection LM sees localized context at each step. Omit feedback_map entirely if the top-level metric already covers the basics.

Configuring the teleprompter

teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: DSPy::ReflectionLM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY']),
  feedback_map: feedback_map,
  config: {
    max_metric_calls: 600,
    minibatch_size: 6,
    skip_perfect_score: false
  }
)

Key configuration knobs:

Knob                 Purpose
max_metric_calls     Hard budget on evaluation calls. Set to at least the validation set size plus a few minibatches.
minibatch_size       Examples per reflective replay batch. Smaller = cheaper iterations, noisier scores.
skip_perfect_score   Set true to stop early when a candidate reaches score 1.0.

Minibatch sizing

Goal                                            Suggested size   Rationale
Explore many candidates within a tight budget   3--6             Cheap iterations, more prompt variants, noisier metrics.
Stable metrics when each rollout is costly      8--12            Smoother scores, fewer candidates unless budget is raised.
Investigate specific failure modes              3--4 then 8+     Start with breadth, increase once patterns emerge.

Compile and evaluate

program = DSPy::Predict.new(MySignature)

result = teleprompter.compile(program, trainset: train, valset: val)
optimized_program = result.optimized_program

test_metrics = evaluate(optimized_program, test)

The result object exposes:

  • optimized_program -- predictor with updated instruction and few-shot examples.
  • best_score_value -- validation score for the best candidate.
  • metadata -- candidate counts, trace hashes, and telemetry IDs.

Reflection LM

Swap DSPy::ReflectionLM for any callable object that accepts the reflection prompt hash and returns a string. The default reflection signature extracts the new instruction from triple backticks in the response.
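For example, a hypothetical wrapper that logs each reflection call; any object responding to call with the prompt hash and returning a string qualifies (LoggingReflectionLM and the stub lambda are illustrative, not part of DSPy.rb):

```ruby
# Wraps any reflection callable and logs response sizes to stderr.
class LoggingReflectionLM
  def initialize(inner)
    @inner = inner
  end

  def call(prompt_hash)
    response = @inner.call(prompt_hash)
    warn "reflection produced #{response.length} chars"
    response
  end
end

# Stub standing in for a real LM client. The instruction is wrapped in
# triple backticks because the default reflection signature extracts it
# from there.
fake_lm = ->(_prompt) { "```\nClassify the text precisely.\n```" }
reflection = LoggingReflectionLM.new(fake_lm)
reflection.call({})  # returns the wrapped response string
```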

Experiment tracking

Plug GEPA::Logging::ExperimentTracker into a persistence layer:

tracker = GEPA::Logging::ExperimentTracker.new
tracker.with_subscriber { |event| MyModel.create!(payload: event) }

teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: reflection_lm,
  experiment_tracker: tracker,
  config: { max_metric_calls: 900 }
)

The tracker emits Pareto update events, merge decisions, and candidate evolution records as JSONL.

Pareto frontier

GEPA maintains a diverse candidate pool and samples from the Pareto frontier instead of mutating only the top-scoring program. This balances exploration and prevents the search from collapsing onto a single lineage.
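The non-domination idea can be sketched in a few lines (illustrative, not GEPA's internal implementation); each candidate is represented by its vector of per-example scores:

```ruby
# Candidate a dominates b when a scores >= b on every example
# and strictly > b on at least one.
def dominates?(a, b)
  a.zip(b).all? { |x, y| x >= y } && a.zip(b).any? { |x, y| x > y }
end

# The frontier keeps every candidate no other candidate dominates,
# so specialists that excel on different examples all survive.
def pareto_frontier(candidates)
  candidates.reject do |scores|
    candidates.any? { |other| !other.equal?(scores) && dominates?(other, scores) }
  end
end

pareto_frontier([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.4, 0.4]])
# => [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]] ([0.4, 0.4] is dominated)
```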

Enable the merge proposer after multiple strong lineages emerge:

config: {
  max_metric_calls: 900,
  enable_merge_proposer: true
}

Premature merges eat budget without meaningful gains. Gate merging on having several validated candidates first.

Advanced options

  • acceptance_strategy: -- plug in bespoke Pareto filters or early-stop heuristics.
  • Telemetry spans emit via GEPA::Telemetry. Enable global observability with DSPy.configure { |c| c.observability = true } to stream spans to an OpenTelemetry exporter.

Evaluation Framework

DSPy::Evals provides batch evaluation of predictors against test datasets with built-in and custom metrics.

Basic usage

metric = proc do |example, prediction|
  prediction.answer == example.expected_values[:answer]
end

evaluator = DSPy::Evals.new(predictor, metric: metric)

result = evaluator.evaluate(
  test_examples,
  display_table: true,
  display_progress: true
)

puts "Pass rate: #{(result.pass_rate * 100).round(1)}%"
puts "Passed: #{result.passed_examples}/#{result.total_examples}"

DSPy::Example

Convert raw data into DSPy::Example instances before passing to optimizers or evaluators. Each example carries input_values and expected_values:

examples = rows.map do |row|
  DSPy::Example.new(
    input_values: { text: row[:text] },
    expected_values: { label: row[:label] }
  )
end

train, val, test = split_examples(examples, train_ratio: 0.6, val_ratio: 0.2, seed: 42)

Hold back a test set from the optimization loop. Optimizers work on train/val; only the test set proves generalization.
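Note that split_examples is not a DSPy.rb API; it is a helper you supply. One possible sketch with a deterministic, seeded shuffle:

```ruby
# Hypothetical split helper: seeded shuffle, then slice into
# train/val/test by ratio (test gets the remainder).
def split_examples(examples, train_ratio:, val_ratio:, seed:)
  shuffled   = examples.shuffle(random: Random.new(seed))
  train_size = (shuffled.size * train_ratio).round
  val_size   = (shuffled.size * val_ratio).round

  train = shuffled[0, train_size]
  val   = shuffled[train_size, val_size]
  test  = shuffled[(train_size + val_size)..] || []
  [train, val, test]
end

train, val, test = split_examples((1..10).to_a, train_ratio: 0.6, val_ratio: 0.2, seed: 42)
train.size  # => 6
```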

Built-in metrics

# Exact match -- prediction must exactly equal expected value
metric = DSPy::Metrics.exact_match(field: :answer, case_sensitive: true)

# Contains -- prediction must contain expected substring
metric = DSPy::Metrics.contains(field: :answer, case_sensitive: false)

# Numeric difference -- numeric output within tolerance
metric = DSPy::Metrics.numeric_difference(field: :answer, tolerance: 0.01)

# Composite AND -- all sub-metrics must pass
metric = DSPy::Metrics.composite_and(
  DSPy::Metrics.exact_match(field: :answer),
  DSPy::Metrics.contains(field: :reasoning)
)

Custom metrics

quality_metric = lambda do |example, prediction|
  return false unless prediction

  score = 0.0
  score += 0.5 if prediction.answer == example.expected_values[:answer]
  score += 0.3 if prediction.explanation && prediction.explanation.length > 50
  score += 0.2 if prediction.confidence && prediction.confidence > 0.8
  score >= 0.7
end

evaluator = DSPy::Evals.new(predictor, metric: quality_metric)

Access prediction fields with dot notation (prediction.answer), not hash notation.

Observability hooks

Register callbacks without editing the evaluator:

DSPy::Evals.before_example do |payload|
  example = payload[:example]
  DSPy.logger.info("Evaluating example #{example.id}") if example.respond_to?(:id)
end

DSPy::Evals.after_batch do |payload|
  result = payload[:result]
  Langfuse.event(
    name: 'eval.batch',
    metadata: {
      total: result.total_examples,
      passed: result.passed_examples,
      score: result.score
    }
  )
end

Available hooks: before_example, after_example, before_batch, after_batch.

Langfuse score export

Enable export_scores: true to emit score.create events for each evaluated example and a batch score at the end:

evaluator = DSPy::Evals.new(
  predictor,
  metric: metric,
  export_scores: true,
  score_name: 'qa_accuracy'   # default: 'evaluation'
)

result = evaluator.evaluate(test_examples)
# Emits per-example scores + overall batch score via DSPy::Scores::Exporter

Scores attach to the current trace context automatically and flow to Langfuse asynchronously.

Evaluation results

result = evaluator.evaluate(test_examples)

result.score            # Overall score (0.0 to 1.0)
result.passed_count     # Examples that passed
result.failed_count     # Examples that failed
result.error_count      # Examples that errored

result.results.each do |r|
  r.passed              # Boolean
  r.score               # Numeric score
  r.error               # Error message if the example errored
end

Integration with optimizers

metric = proc do |example, prediction|
  expected  = example.expected_values[:answer].to_s.strip.downcase
  predicted = prediction.answer.to_s.strip.downcase
  !expected.empty? && predicted.include?(expected)
end

optimizer = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)

result = optimizer.compile(
  DSPy::Predict.new(QASignature),
  trainset: train_examples,
  valset: val_examples
)

evaluator = DSPy::Evals.new(result.optimized_program, metric: metric)
test_result = evaluator.evaluate(test_examples, display_table: true)
puts "Test accuracy: #{(test_result.pass_rate * 100).round(2)}%"

Storage System

DSPy::Storage persists optimization results, tracks history, and manages multiple versions of optimized programs.

ProgramStorage (low-level)

storage = DSPy::Storage::ProgramStorage.new(storage_path: "./dspy_storage")

# Save
saved = storage.save_program(
  result.optimized_program,
  result,
  metadata: {
    signature_class: 'ClassifyText',
    optimizer: 'MIPROv2',
    examples_count: examples.size
  }
)
puts "Stored with ID: #{saved.program_id}"

# Load
saved = storage.load_program(program_id)
predictor = saved.program
score = saved.optimization_result[:best_score_value]

# List
storage.list_programs.each do |p|
  puts "#{p[:program_id]} -- score: #{p[:best_score]} -- saved: #{p[:saved_at]}"
end

StorageManager (high-level)

DSPy::Storage::StorageManager wraps ProgramStorage with tagging, search, checkpoints, and history.

manager = DSPy::Storage::StorageManager.new

# Save with tags
saved = manager.save_optimization_result(
  result,
  tags: ['production', 'sentiment-analysis'],
  description: 'Optimized sentiment classifier v2'
)

# Find programs
programs = manager.find_programs(
  optimizer: 'MIPROv2',
  min_score: 0.85,
  tags: ['production']
)

recent = manager.find_programs(
  max_age_days: 7,
  signature_class: 'ClassifyText'
)

# Get best program for a signature
best = manager.get_best_program('ClassifyText')
predictor = best.program

Global shorthand:

DSPy::Storage::StorageManager.save(result, metadata: { version: '2.0' })
DSPy::Storage::StorageManager.load(program_id)
DSPy::Storage::StorageManager.best('ClassifyText')

Checkpoints

Create and restore checkpoints during long-running optimizations:

# Save a checkpoint
manager.create_checkpoint(
  current_result,
  'iteration_50',
  metadata: { iteration: 50, current_score: 0.87 }
)

# Restore
restored = manager.restore_checkpoint('iteration_50')
program = restored.program

# Auto-checkpoint every N iterations
if iteration % 10 == 0
  manager.create_checkpoint(current_result, "auto_checkpoint_#{iteration}")
end

Import and export

Share programs between environments:

storage = DSPy::Storage::ProgramStorage.new

# Export
storage.export_programs(['abc123', 'def456'], './export_backup.json')

# Import
imported = storage.import_programs('./export_backup.json')
puts "Imported #{imported.size} programs"

Optimization history

history = manager.get_optimization_history

history[:summary][:total_programs]
history[:summary][:avg_score]

history[:optimizer_stats].each do |optimizer, stats|
  puts "#{optimizer}: #{stats[:count]} programs, best: #{stats[:best_score]}"
end

history[:trends][:improvement_percentage]

Program comparison

comparison = manager.compare_programs(id_a, id_b)
comparison[:comparison][:score_difference]
comparison[:comparison][:better_program]
comparison[:comparison][:age_difference_hours]

Storage configuration

config = DSPy::Storage::StorageManager::StorageConfig.new
config.storage_path = Rails.root.join('dspy_storage')
config.auto_save = true
config.save_intermediate_results = false
config.max_stored_programs = 100

manager = DSPy::Storage::StorageManager.new(config: config)

Cleanup

Remove old programs. Cleanup retains the best-performing and most recent programs using a weighted score (70% performance, 30% recency):

deleted_count = manager.cleanup_old_programs
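The weighting can be illustrated with a small sketch (a hypothetical helper, not the library's internal code; the 30-day recency horizon is an assumption):

```ruby
# Illustrative retention score: 70% raw performance plus 30% recency,
# where recency decays linearly to zero over horizon_days.
def retention_score(best_score, saved_at, now: Time.now, horizon_days: 30)
  age_days = (now - saved_at) / 86_400.0
  recency  = [1.0 - age_days / horizon_days, 0.0].max
  0.7 * best_score + 0.3 * recency
end

fresh_ok   = retention_score(0.80, Time.now)                # recent, decent score
stale_best = retention_score(0.95, Time.now - 45 * 86_400)  # old, high score
fresh_ok > stale_best  # => true: recency can outweigh a small score gap
```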

Storage events

The storage system emits structured log events for monitoring:

  • dspy.storage.save_start, dspy.storage.save_complete, dspy.storage.save_error
  • dspy.storage.load_start, dspy.storage.load_complete, dspy.storage.load_error
  • dspy.storage.delete, dspy.storage.export, dspy.storage.import, dspy.storage.cleanup

File layout

dspy_storage/
  programs/
    abc123def456.json
    789xyz012345.json
  history.json

API rules

  • Call predictors with .call(), not .forward().
  • Access prediction fields with dot notation (result.answer), not hash notation (result[:answer]).
  • GEPA metrics return DSPy::Prediction.new(score:, feedback:), not a boolean.
  • MIPROv2 metrics may return true/false, a numeric score, or DSPy::Prediction.