
DSPy.rb Optimization

MIPROv2

MIPROv2 (Multiprompt Instruction PRoposal Optimizer, version 2) is the primary instruction tuner in DSPy.rb. It proposes new instructions and few-shot demonstrations per predictor, evaluates them on mini-batches, and retains candidates that improve the metric. It ships as a separate gem to keep the Gaussian Process dependency tree out of apps that do not need it.

Installation

# Gemfile
gem "dspy"
gem "dspy-miprov2"

Bundler auto-requires dspy/miprov2. No additional require statement is needed.

AutoMode presets

Use DSPy::Teleprompt::MIPROv2::AutoMode for preconfigured optimizers:

light  = DSPy::Teleprompt::MIPROv2::AutoMode.light(metric: metric)   # 6 trials, greedy
medium = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)  # 12 trials, adaptive
heavy  = DSPy::Teleprompt::MIPROv2::AutoMode.heavy(metric: metric)   # 18 trials, Bayesian
Preset   Trials   Strategy    Use case
light    6        :greedy     Quick wins on small datasets or during prototyping.
medium   12       :adaptive   Balanced exploration vs. runtime for most pilots.
heavy    18       :bayesian   Highest accuracy targets or multi-stage programs.

Manual configuration with dry-configurable

DSPy::Teleprompt::MIPROv2 includes Dry::Configurable. Configure at the class level (defaults for all instances) or instance level (overrides class defaults).

Class-level defaults:

DSPy::Teleprompt::MIPROv2.configure do |config|
  config.optimization_strategy = :bayesian
  config.num_trials = 30
  config.bootstrap_sets = 10
end

Instance-level overrides:

optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure do |config|
  config.num_trials = 15
  config.num_instruction_candidates = 6
  config.bootstrap_sets = 5
  config.max_bootstrapped_examples = 4
  config.max_labeled_examples = 16
  config.optimization_strategy = :adaptive       # :greedy, :adaptive, :bayesian
  config.early_stopping_patience = 3
  config.init_temperature = 1.0
  config.final_temperature = 0.1
  config.minibatch_size = nil                     # nil = auto
  config.auto_seed = 42
end

The optimization_strategy setting accepts symbols (:greedy, :adaptive, :bayesian) and coerces them internally to DSPy::Teleprompt::OptimizationStrategy T::Enum values.

The old config: constructor parameter is removed. Passing config: raises ArgumentError.

Auto presets via configure

Instead of AutoMode, set the preset through the configure block:

optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure do |config|
  config.auto_preset = DSPy::Teleprompt::AutoPreset.deserialize("medium")
end

Compile and inspect

program = DSPy::Predict.new(MySignature)

result = optimizer.compile(
  program,
  trainset: train_examples,
  valset: val_examples
)

optimized_program = result.optimized_program
puts "Best score: #{result.best_score_value}"

The result object exposes:

  • optimized_program -- ready-to-use predictor with updated instruction and demos.
  • optimization_trace[:trial_logs] -- per-trial record of instructions, demos, and scores.
  • metadata[:optimizer] -- "MIPROv2", useful when persisting experiments from multiple optimizers.

Multi-stage programs

MIPROv2 generates dataset summaries for each predictor and proposes per-stage instructions. For a ReAct agent with thought_generator and observation_processor predictors, the optimizer handles credit assignment internally. The metric only needs to evaluate the final output.
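Because credit assignment is internal, the metric only inspects the final prediction. A minimal sketch, using plain Struct stand-ins for the example and prediction objects (in real code these are DSPy::Example instances and the prediction returned by .call):

```ruby
# Metric over the final output only; intermediate predictor outputs
# (thoughts, observations) are never inspected by the metric.
final_answer_metric = lambda do |example, prediction|
  expected  = example.expected_values[:answer].to_s.strip.downcase
  predicted = prediction.answer.to_s.strip.downcase
  predicted == expected
end

# Struct stand-ins illustrating the call shape.
ExampleStub    = Struct.new(:expected_values)
PredictionStub = Struct.new(:answer)

example    = ExampleStub.new({ answer: "Paris" })
prediction = PredictionStub.new("paris")
final_answer_metric.call(example, prediction)  # => true
```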

Bootstrap sampling

During the bootstrap phase MIPROv2:

  1. Generates dataset summaries from the training set.
  2. Bootstraps few-shot demonstrations by running the baseline program.
  3. Proposes candidate instructions grounded in the summaries and bootstrapped examples.
  4. Evaluates each candidate on mini-batches drawn from the validation set.

Control the bootstrap phase with bootstrap_sets, max_bootstrapped_examples, and max_labeled_examples.

Bayesian optimization

When optimization_strategy is :bayesian (or when using the heavy preset), MIPROv2 fits a Gaussian Process surrogate over past trial scores to select the next candidate. This replaces random search with informed exploration, reducing the number of trials needed to find high-scoring instructions.
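The intuition behind informed exploration can be sketched with a simpler bandit-style (UCB) selection rule. This illustrates the idea only; it is not the library's Gaussian Process code:

```ruby
# Score each instruction candidate by its mean observed metric plus an
# uncertainty bonus; untried candidates are explored first. This stands
# in for the GP surrogate's informed exploration.
def next_candidate(history, candidates, c: 1.0)
  total = [history.values.sum(&:size), 1].max
  candidates.max_by do |cand|
    trials = history.fetch(cand, [])
    next Float::INFINITY if trials.empty?   # always try unseen candidates
    mean = trials.sum / trials.size
    mean + c * Math.sqrt(Math.log(total) / trials.size)
  end
end

history = { "prompt_a" => [0.6, 0.7], "prompt_b" => [0.4] }
next_candidate(history, ["prompt_a", "prompt_b", "prompt_c"])
# => "prompt_c" (never tried, so it is explored first)
```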


GEPA

GEPA (Genetic-Pareto Reflective Prompt Evolution) is a feedback-driven optimizer. It runs the program on a small batch, collects scores and textual feedback, and asks a reflection LM to rewrite the instruction. Improved candidates are retained on a Pareto frontier.

Installation

# Gemfile
gem "dspy"
gem "dspy-gepa"

The dspy-gepa gem automatically pulls in the core gepa optimizer gem as a dependency.

Metric contract

GEPA metrics return DSPy::Prediction with both a numeric score and a feedback string. Do not return a plain boolean.

metric = lambda do |example, prediction|
  expected  = example.expected_values[:label]
  predicted = prediction.label

  score = predicted == expected ? 1.0 : 0.0
  feedback = if score == 1.0
    "Correct (#{expected}) for: \"#{example.input_values[:text][0..60]}\""
  else
    "Misclassified (expected #{expected}, got #{predicted}) for: \"#{example.input_values[:text][0..60]}\""
  end

  DSPy::Prediction.new(score: score, feedback: feedback)
end

Keep the score in [0, 1]. Always include a short feedback message explaining what happened -- GEPA hands this text to the reflection model so it can reason about failures.

Feedback maps

feedback_map targets individual predictors inside a composite module. Each entry receives keyword arguments and returns a DSPy::Prediction:

feedback_map = {
  'self' => lambda do |predictor_output:, predictor_inputs:, module_inputs:, module_outputs:, captured_trace:|
    expected  = module_inputs.expected_values[:label]
    predicted = predictor_output.label

    DSPy::Prediction.new(
      score: predicted == expected ? 1.0 : 0.0,
      feedback: "Classifier saw \"#{predictor_inputs[:text][0..80]}\" -> #{predicted} (expected #{expected})"
    )
  end
}

For single-predictor programs, key the map with 'self'. For multi-predictor chains, add entries per component so the reflection LM sees localized context at each step. Omit feedback_map entirely if the top-level metric already covers the basics.

Configuring the teleprompter

teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: DSPy::ReflectionLM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY']),
  feedback_map: feedback_map,
  config: {
    max_metric_calls: 600,
    minibatch_size: 6,
    skip_perfect_score: false
  }
)

Key configuration knobs:

Knob                 Purpose
max_metric_calls     Hard budget on evaluation calls. Set to at least the validation set size plus a few minibatches.
minibatch_size       Examples per reflective replay batch. Smaller = cheaper iterations, noisier scores.
skip_perfect_score   Set true to stop early when a candidate reaches score 1.0.

Minibatch sizing

Goal                                            Suggested size   Rationale
Explore many candidates within a tight budget   3--6             Cheap iterations, more prompt variants, noisier metrics.
Stable metrics when each rollout is costly      8--12            Smoother scores, fewer candidates unless budget is raised.
Investigate specific failure modes              3--4 then 8+     Start with breadth, increase once patterns emerge.

Compile and evaluate

program = DSPy::Predict.new(MySignature)

result = teleprompter.compile(program, trainset: train, valset: val)
optimized_program = result.optimized_program

test_metrics = evaluate(optimized_program, test)

The result object exposes:

  • optimized_program -- predictor with updated instruction and few-shot examples.
  • best_score_value -- validation score for the best candidate.
  • metadata -- candidate counts, trace hashes, and telemetry IDs.

Reflection LM

Swap DSPy::ReflectionLM for any callable object that accepts the reflection prompt hash and returns a string. The default reflection signature extracts the new instruction from triple backticks in the response.
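For example, a hypothetical wrapper that logs each reflection call; any object responding to call with the prompt hash and returning a string qualifies (LoggingReflectionLM and the stub lambda are illustrative, not part of DSPy.rb):

```ruby
# Wraps any reflection callable and logs response sizes to stderr.
class LoggingReflectionLM
  def initialize(inner)
    @inner = inner
  end

  def call(prompt_hash)
    response = @inner.call(prompt_hash)
    warn "reflection produced #{response.length} chars"
    response
  end
end

# Stub standing in for a real LM client. The instruction is wrapped in
# triple backticks because the default reflection signature extracts it
# from there.
fake_lm = ->(_prompt) { "```\nClassify the text precisely.\n```" }
reflection = LoggingReflectionLM.new(fake_lm)
reflection.call({})  # returns the wrapped response string
```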

Experiment tracking

Plug GEPA::Logging::ExperimentTracker into a persistence layer:

tracker = GEPA::Logging::ExperimentTracker.new
tracker.with_subscriber { |event| MyModel.create!(payload: event) }

teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: reflection_lm,
  experiment_tracker: tracker,
  config: { max_metric_calls: 900 }
)

The tracker emits Pareto update events, merge decisions, and candidate evolution records as JSONL.

Pareto frontier

GEPA maintains a diverse candidate pool and samples from the Pareto frontier instead of mutating only the top-scoring program. This balances exploration and prevents the search from collapsing onto a single lineage.
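The non-domination idea can be sketched in a few lines (illustrative, not GEPA's internal implementation); each candidate is represented by its vector of per-example scores:

```ruby
# Candidate a dominates b when a scores >= b on every example
# and strictly > b on at least one.
def dominates?(a, b)
  a.zip(b).all? { |x, y| x >= y } && a.zip(b).any? { |x, y| x > y }
end

# The frontier keeps every candidate no other candidate dominates,
# so specialists that excel on different examples all survive.
def pareto_frontier(candidates)
  candidates.reject do |scores|
    candidates.any? { |other| !other.equal?(scores) && dominates?(other, scores) }
  end
end

pareto_frontier([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.4, 0.4]])
# => [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]] ([0.4, 0.4] is dominated)
```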

Enable the merge proposer after multiple strong lineages emerge:

config: {
  max_metric_calls: 900,
  enable_merge_proposer: true
}

Premature merges eat budget without meaningful gains. Gate merging on having several validated candidates first.

Advanced options

  • acceptance_strategy: -- plug in bespoke Pareto filters or early-stop heuristics.
  • Telemetry spans emit via GEPA::Telemetry. Enable global observability with DSPy.configure { |c| c.observability = true } to stream spans to an OpenTelemetry exporter.

Evaluation Framework

DSPy::Evals provides batch evaluation of predictors against test datasets with built-in and custom metrics.

Basic usage

metric = proc do |example, prediction|
  prediction.answer == example.expected_values[:answer]
end

evaluator = DSPy::Evals.new(predictor, metric: metric)

result = evaluator.evaluate(
  test_examples,
  display_table: true,
  display_progress: true
)

puts "Pass rate: #{(result.pass_rate * 100).round(1)}%"
puts "Passed: #{result.passed_examples}/#{result.total_examples}"

DSPy::Example

Convert raw data into DSPy::Example instances before passing to optimizers or evaluators. Each example carries input_values and expected_values:

examples = rows.map do |row|
  DSPy::Example.new(
    input_values: { text: row[:text] },
    expected_values: { label: row[:label] }
  )
end

train, val, test = split_examples(examples, train_ratio: 0.6, val_ratio: 0.2, seed: 42)

Hold back a test set from the optimization loop. Optimizers work on train/val; only the test set proves generalization.
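Note that split_examples is not a DSPy.rb API; it is a helper you supply. One possible sketch with a deterministic, seeded shuffle:

```ruby
# Hypothetical split helper: seeded shuffle, then slice into
# train/val/test by ratio (test gets the remainder).
def split_examples(examples, train_ratio:, val_ratio:, seed:)
  shuffled   = examples.shuffle(random: Random.new(seed))
  train_size = (shuffled.size * train_ratio).round
  val_size   = (shuffled.size * val_ratio).round

  train = shuffled[0, train_size]
  val   = shuffled[train_size, val_size]
  test  = shuffled[(train_size + val_size)..] || []
  [train, val, test]
end

train, val, test = split_examples((1..10).to_a, train_ratio: 0.6, val_ratio: 0.2, seed: 42)
train.size  # => 6
```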

Built-in metrics

# Exact match -- prediction must exactly equal expected value
metric = DSPy::Metrics.exact_match(field: :answer, case_sensitive: true)

# Contains -- prediction must contain expected substring
metric = DSPy::Metrics.contains(field: :answer, case_sensitive: false)

# Numeric difference -- numeric output within tolerance
metric = DSPy::Metrics.numeric_difference(field: :answer, tolerance: 0.01)

# Composite AND -- all sub-metrics must pass
metric = DSPy::Metrics.composite_and(
  DSPy::Metrics.exact_match(field: :answer),
  DSPy::Metrics.contains(field: :reasoning)
)

Custom metrics

quality_metric = lambda do |example, prediction|
  return false unless prediction

  score = 0.0
  score += 0.5 if prediction.answer == example.expected_values[:answer]
  score += 0.3 if prediction.explanation && prediction.explanation.length > 50
  score += 0.2 if prediction.confidence && prediction.confidence > 0.8
  score >= 0.7
end

evaluator = DSPy::Evals.new(predictor, metric: quality_metric)

Access prediction fields with dot notation (prediction.answer), not hash notation.

Observability hooks

Register callbacks without editing the evaluator:

DSPy::Evals.before_example do |payload|
  example = payload[:example]
  DSPy.logger.info("Evaluating example #{example.id}") if example.respond_to?(:id)
end

DSPy::Evals.after_batch do |payload|
  result = payload[:result]
  Langfuse.event(
    name: 'eval.batch',
    metadata: {
      total: result.total_examples,
      passed: result.passed_examples,
      score: result.score
    }
  )
end

Available hooks: before_example, after_example, before_batch, after_batch.

Langfuse score export

Enable export_scores: true to emit score.create events for each evaluated example and a batch score at the end:

evaluator = DSPy::Evals.new(
  predictor,
  metric: metric,
  export_scores: true,
  score_name: 'qa_accuracy'   # default: 'evaluation'
)

result = evaluator.evaluate(test_examples)
# Emits per-example scores + overall batch score via DSPy::Scores::Exporter

Scores attach to the current trace context automatically and flow to Langfuse asynchronously.

Evaluation results

result = evaluator.evaluate(test_examples)

result.score            # Overall score (0.0 to 1.0)
result.passed_count     # Examples that passed
result.failed_count     # Examples that failed
result.error_count      # Examples that errored

result.results.each do |r|
  r.passed              # Boolean
  r.score               # Numeric score
  r.error               # Error message if the example errored
end

Integration with optimizers

metric = proc do |example, prediction|
  expected  = example.expected_values[:answer].to_s.strip.downcase
  predicted = prediction.answer.to_s.strip.downcase
  !expected.empty? && predicted.include?(expected)
end

optimizer = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)

result = optimizer.compile(
  DSPy::Predict.new(QASignature),
  trainset: train_examples,
  valset: val_examples
)

evaluator = DSPy::Evals.new(result.optimized_program, metric: metric)
test_result = evaluator.evaluate(test_examples, display_table: true)
puts "Test accuracy: #{(test_result.pass_rate * 100).round(2)}%"

Storage System

DSPy::Storage persists optimization results, tracks history, and manages multiple versions of optimized programs.

ProgramStorage (low-level)

storage = DSPy::Storage::ProgramStorage.new(storage_path: "./dspy_storage")

# Save
saved = storage.save_program(
  result.optimized_program,
  result,
  metadata: {
    signature_class: 'ClassifyText',
    optimizer: 'MIPROv2',
    examples_count: examples.size
  }
)
puts "Stored with ID: #{saved.program_id}"

# Load
saved = storage.load_program(program_id)
predictor = saved.program
score = saved.optimization_result[:best_score_value]

# List
storage.list_programs.each do |p|
  puts "#{p[:program_id]} -- score: #{p[:best_score]} -- saved: #{p[:saved_at]}"
end

StorageManager (high-level)

DSPy::Storage::StorageManager wraps ProgramStorage with tagging, search, checkpoints, and history.

manager = DSPy::Storage::StorageManager.new

# Save with tags
saved = manager.save_optimization_result(
  result,
  tags: ['production', 'sentiment-analysis'],
  description: 'Optimized sentiment classifier v2'
)

# Find programs
programs = manager.find_programs(
  optimizer: 'MIPROv2',
  min_score: 0.85,
  tags: ['production']
)

recent = manager.find_programs(
  max_age_days: 7,
  signature_class: 'ClassifyText'
)

# Get best program for a signature
best = manager.get_best_program('ClassifyText')
predictor = best.program

Global shorthand:

DSPy::Storage::StorageManager.save(result, metadata: { version: '2.0' })
DSPy::Storage::StorageManager.load(program_id)
DSPy::Storage::StorageManager.best('ClassifyText')

Checkpoints

Create and restore checkpoints during long-running optimizations:

# Save a checkpoint
manager.create_checkpoint(
  current_result,
  'iteration_50',
  metadata: { iteration: 50, current_score: 0.87 }
)

# Restore
restored = manager.restore_checkpoint('iteration_50')
program = restored.program

# Auto-checkpoint every N iterations
if iteration % 10 == 0
  manager.create_checkpoint(current_result, "auto_checkpoint_#{iteration}")
end

Import and export

Share programs between environments:

storage = DSPy::Storage::ProgramStorage.new

# Export
storage.export_programs(['abc123', 'def456'], './export_backup.json')

# Import
imported = storage.import_programs('./export_backup.json')
puts "Imported #{imported.size} programs"

Optimization history

history = manager.get_optimization_history

history[:summary][:total_programs]
history[:summary][:avg_score]

history[:optimizer_stats].each do |optimizer, stats|
  puts "#{optimizer}: #{stats[:count]} programs, best: #{stats[:best_score]}"
end

history[:trends][:improvement_percentage]

Program comparison

comparison = manager.compare_programs(id_a, id_b)
comparison[:comparison][:score_difference]
comparison[:comparison][:better_program]
comparison[:comparison][:age_difference_hours]

Storage configuration

config = DSPy::Storage::StorageManager::StorageConfig.new
config.storage_path = Rails.root.join('dspy_storage')
config.auto_save = true
config.save_intermediate_results = false
config.max_stored_programs = 100

manager = DSPy::Storage::StorageManager.new(config: config)

Cleanup

Remove old programs. Cleanup retains the best-performing and most recent programs using a weighted score (70% performance, 30% recency):

deleted_count = manager.cleanup_old_programs
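The weighting can be illustrated with a small sketch (a hypothetical helper, not the library's internal code; the 30-day recency horizon is an assumption):

```ruby
# Illustrative retention score: 70% raw performance plus 30% recency,
# where recency decays linearly to zero over horizon_days.
def retention_score(best_score, saved_at, now: Time.now, horizon_days: 30)
  age_days = (now - saved_at) / 86_400.0
  recency  = [1.0 - age_days / horizon_days, 0.0].max
  0.7 * best_score + 0.3 * recency
end

fresh_ok   = retention_score(0.80, Time.now)                # recent, decent score
stale_best = retention_score(0.95, Time.now - 45 * 86_400)  # old, high score
fresh_ok > stale_best  # => true: recency can outweigh a small score gap
```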

Storage events

The storage system emits structured log events for monitoring:

  • dspy.storage.save_start, dspy.storage.save_complete, dspy.storage.save_error
  • dspy.storage.load_start, dspy.storage.load_complete, dspy.storage.load_error
  • dspy.storage.delete, dspy.storage.export, dspy.storage.import, dspy.storage.cleanup

File layout

dspy_storage/
  programs/
    abc123def456.json
    789xyz012345.json
  history.json

API rules

  • Call predictors with .call(), not .forward().
  • Access prediction fields with dot notation (result.answer), not hash notation (result[:answer]).
  • GEPA metrics return DSPy::Prediction.new(score:, feedback:), not a boolean.
  • MIPROv2 metrics may return true/false, a numeric score, or DSPy::Prediction.