# DSPy.rb Optimization

## MIPROv2

MIPROv2 (Multi-prompt Instruction Proposal Optimizer, v2) is the primary instruction tuner in DSPy.rb. It proposes new instructions and few-shot demonstrations per predictor, evaluates them on mini-batches, and retains candidates that improve the metric. It ships as a separate gem to keep the Gaussian Process dependency tree out of apps that do not need it.

### Installation

```ruby
# Gemfile
gem "dspy"
gem "dspy-miprov2"
```

Bundler auto-requires `dspy/miprov2`. No additional `require` statement is needed.

### AutoMode presets

Use `DSPy::Teleprompt::MIPROv2::AutoMode` for preconfigured optimizers:

```ruby
light  = DSPy::Teleprompt::MIPROv2::AutoMode.light(metric: metric)  # 6 trials, greedy
medium = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric) # 12 trials, adaptive
heavy  = DSPy::Teleprompt::MIPROv2::AutoMode.heavy(metric: metric)  # 18 trials, Bayesian
```

| Preset   | Trials | Strategy    | Use case                                            |
|----------|--------|-------------|-----------------------------------------------------|
| `light`  | 6      | `:greedy`   | Quick wins on small datasets or during prototyping. |
| `medium` | 12     | `:adaptive` | Balanced exploration vs. runtime for most pilots.   |
| `heavy`  | 18     | `:bayesian` | Highest accuracy targets or multi-stage programs.   |

### Manual configuration with dry-configurable

`DSPy::Teleprompt::MIPROv2` includes `Dry::Configurable`. Configure at the class level (defaults for all instances) or at the instance level (overrides class defaults).

**Class-level defaults:**

```ruby
DSPy::Teleprompt::MIPROv2.configure do |config|
  config.optimization_strategy = :bayesian
  config.num_trials = 30
  config.bootstrap_sets = 10
end
```

**Instance-level overrides:**

```ruby
optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure do |config|
  config.num_trials = 15
  config.num_instruction_candidates = 6
  config.bootstrap_sets = 5
  config.max_bootstrapped_examples = 4
  config.max_labeled_examples = 16
  config.optimization_strategy = :adaptive # :greedy, :adaptive, :bayesian
  config.early_stopping_patience = 3
  config.init_temperature = 1.0
  config.final_temperature = 0.1
  config.minibatch_size = nil # nil = auto
  config.auto_seed = 42
end
```

The `optimization_strategy` setting accepts symbols (`:greedy`, `:adaptive`, `:bayesian`) and coerces them internally to `DSPy::Teleprompt::OptimizationStrategy` T::Enum values. The old `config:` constructor parameter has been removed; passing `config:` raises `ArgumentError`.

### Auto presets via configure

Instead of `AutoMode`, set the preset through the configure block:

```ruby
optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
optimizer.configure do |config|
  config.auto_preset = DSPy::Teleprompt::AutoPreset.deserialize("medium")
end
```

### Compile and inspect

```ruby
program = DSPy::Predict.new(MySignature)

result = optimizer.compile(
  program,
  trainset: train_examples,
  valset: val_examples
)

optimized_program = result.optimized_program
puts "Best score: #{result.best_score_value}"
```

The `result` object exposes:

- `optimized_program` -- ready-to-use predictor with updated instruction and demos.
- `optimization_trace[:trial_logs]` -- per-trial record of instructions, demos, and scores.
- `metadata[:optimizer]` -- `"MIPROv2"`, useful when persisting experiments from multiple optimizers.

### Multi-stage programs

MIPROv2 generates dataset summaries for each predictor and proposes per-stage instructions. For a ReAct agent with `thought_generator` and `observation_processor` predictors, the optimizer handles credit assignment internally; the metric only needs to evaluate the final output.
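As a rough sketch (the `build_research_agent` helper below is hypothetical; any module exposing those predictors works the same way), compiling a multi-stage program looks just like compiling a single predictor:

```ruby
# Hypothetical multi-stage agent exposing `thought_generator` and
# `observation_processor` predictors; construction details omitted.
agent = build_research_agent

# The metric inspects only the final answer. MIPROv2 proposes and scores
# per-stage instructions for both internal predictors on its own.
metric = proc do |example, prediction|
  prediction.answer.to_s.strip.downcase ==
    example.expected_values[:answer].to_s.strip.downcase
end

optimizer = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)
result = optimizer.compile(agent, trainset: train_examples, valset: val_examples)
optimized_agent = result.optimized_program
```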
### Bootstrap sampling

During the bootstrap phase MIPROv2:

1. Generates dataset summaries from the training set.
2. Bootstraps few-shot demonstrations by running the baseline program.
3. Proposes candidate instructions grounded in the summaries and bootstrapped examples.
4. Evaluates each candidate on mini-batches drawn from the validation set.

Control the bootstrap phase with `bootstrap_sets`, `max_bootstrapped_examples`, and `max_labeled_examples`.

### Bayesian optimization

When `optimization_strategy` is `:bayesian` (or when using the `heavy` preset), MIPROv2 fits a Gaussian Process surrogate over past trial scores to select the next candidate. This replaces random search with informed exploration, reducing the number of trials needed to find high-scoring instructions.

---

## GEPA

GEPA (Genetic-Pareto Reflective Prompt Evolution) is a feedback-driven optimizer. It runs the program on a small batch, collects scores and textual feedback, and asks a reflection LM to rewrite the instruction. Improved candidates are retained on a Pareto frontier.

### Installation

```ruby
# Gemfile
gem "dspy"
gem "dspy-gepa"
```

The `dspy-gepa` gem depends on the `gepa` core optimizer gem automatically.

### Metric contract

GEPA metrics return `DSPy::Prediction` with both a numeric score and a feedback string. Do not return a plain boolean.

```ruby
metric = lambda do |example, prediction|
  expected = example.expected_values[:label]
  predicted = prediction.label
  score = predicted == expected ? 1.0 : 0.0

  feedback = if score == 1.0
    "Correct (#{expected}) for: \"#{example.input_values[:text][0..60]}\""
  else
    "Misclassified (expected #{expected}, got #{predicted}) for: \"#{example.input_values[:text][0..60]}\""
  end

  DSPy::Prediction.new(score: score, feedback: feedback)
end
```

Keep the score in `[0, 1]`. Always include a short feedback message explaining what happened -- GEPA hands this text to the reflection model so it can reason about failures.

### Feedback maps

`feedback_map` targets individual predictors inside a composite module. Each entry receives keyword arguments and returns a `DSPy::Prediction`:

```ruby
feedback_map = {
  'self' => lambda do |predictor_output:, predictor_inputs:, module_inputs:, module_outputs:, captured_trace:|
    expected = module_inputs.expected_values[:label]
    predicted = predictor_output.label

    DSPy::Prediction.new(
      score: predicted == expected ? 1.0 : 0.0,
      feedback: "Classifier saw \"#{predictor_inputs[:text][0..80]}\" -> #{predicted} (expected #{expected})"
    )
  end
}
```

For single-predictor programs, key the map with `'self'`. For multi-predictor chains, add entries per component so the reflection LM sees localized context at each step. Omit `feedback_map` entirely if the top-level metric already covers the basics.

### Configuring the teleprompter

```ruby
teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: DSPy::ReflectionLM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY']),
  feedback_map: feedback_map,
  config: {
    max_metric_calls: 600,
    minibatch_size: 6,
    skip_perfect_score: false
  }
)
```

Key configuration knobs:

| Knob                 | Purpose                                                                                          |
|----------------------|--------------------------------------------------------------------------------------------------|
| `max_metric_calls`   | Hard budget on evaluation calls. Set to at least the validation set size plus a few minibatches. |
| `minibatch_size`     | Examples per reflective replay batch. Smaller = cheaper iterations, noisier scores.              |
| `skip_perfect_score` | Set `true` to stop early when a candidate reaches score `1.0`.                                   |

### Minibatch sizing

| Goal                                           | Suggested size | Rationale                                                  |
|------------------------------------------------|----------------|------------------------------------------------------------|
| Explore many candidates within a tight budget  | 3--6           | Cheap iterations, more prompt variants, noisier metrics.   |
| Stable metrics when each rollout is costly     | 8--12          | Smoother scores, fewer candidates unless budget is raised. |
| Investigate specific failure modes             | 3--4 then 8+   | Start with breadth, increase once patterns emerge.         |

### Compile and evaluate

```ruby
program = DSPy::Predict.new(MySignature)
result = teleprompter.compile(program, trainset: train, valset: val)

optimized_program = result.optimized_program
test_metrics = evaluate(optimized_program, test)
```

The `result` object exposes:

- `optimized_program` -- predictor with updated instruction and few-shot examples.
- `best_score_value` -- validation score for the best candidate.
- `metadata` -- candidate counts, trace hashes, and telemetry IDs.

### Reflection LM

Swap `DSPy::ReflectionLM` for any callable object that accepts the reflection prompt hash and returns a string. The default reflection signature extracts the new instruction from triple backticks in the response.
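As a minimal sketch (the inference client below is hypothetical), a hand-rolled reflection callable only needs to accept the prompt hash and return the model's raw text, with the proposed instruction wrapped in triple backticks:

```ruby
# Hypothetical custom reflection callable. GEPA passes it the reflection
# prompt hash; it must return a String. The default reflection signature
# then extracts the new instruction from the fenced block in the response.
reflection_lm = lambda do |prompt|
  response_text = MyInferenceClient.complete(prompt.to_s) # hypothetical client
  response_text # raw text; the rewritten instruction should sit inside triple backticks
end

teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: reflection_lm,
  config: { max_metric_calls: 300 }
)
```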
### Experiment tracking

Plug `GEPA::Logging::ExperimentTracker` into a persistence layer:

```ruby
tracker = GEPA::Logging::ExperimentTracker.new
tracker.with_subscriber { |event| MyModel.create!(payload: event) }

teleprompter = DSPy::Teleprompt::GEPA.new(
  metric: metric,
  reflection_lm: reflection_lm,
  experiment_tracker: tracker,
  config: { max_metric_calls: 900 }
)
```

The tracker emits Pareto update events, merge decisions, and candidate evolution records as JSONL.

### Pareto frontier

GEPA maintains a diverse candidate pool and samples from the Pareto frontier instead of mutating only the top-scoring program. This balances exploration and prevents the search from collapsing onto a single lineage.

Enable the merge proposer after multiple strong lineages emerge:

```ruby
config: {
  max_metric_calls: 900,
  enable_merge_proposer: true
}
```

Premature merges eat budget without meaningful gains; gate merging on having several validated candidates first.

### Advanced options

- `acceptance_strategy:` -- plug in bespoke Pareto filters or early-stop heuristics.
- Telemetry spans emit via `GEPA::Telemetry`. Enable global observability with `DSPy.configure { |c| c.observability = true }` to stream spans to an OpenTelemetry exporter.

---

## Evaluation Framework

`DSPy::Evals` provides batch evaluation of predictors against test datasets with built-in and custom metrics.

### Basic usage

```ruby
metric = proc do |example, prediction|
  prediction.answer == example.expected_values[:answer]
end

evaluator = DSPy::Evals.new(predictor, metric: metric)

result = evaluator.evaluate(
  test_examples,
  display_table: true,
  display_progress: true
)

puts "Pass rate: #{(result.pass_rate * 100).round(1)}%"
puts "Passed: #{result.passed_examples}/#{result.total_examples}"
```

### DSPy::Example

Convert raw data into `DSPy::Example` instances before passing to optimizers or evaluators. Each example carries `input_values` and `expected_values`:

```ruby
examples = rows.map do |row|
  DSPy::Example.new(
    input_values: { text: row[:text] },
    expected_values: { label: row[:label] }
  )
end

train, val, test = split_examples(examples, train_ratio: 0.6, val_ratio: 0.2, seed: 42)
```

Hold back a test set from the optimization loop. Optimizers work on train/val; only the test set proves generalization.
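The `split_examples` helper above is not part of the library; a minimal sketch of one possible implementation (shuffle with a fixed seed, then slice by ratio) might look like this:

```ruby
# Hypothetical helper -- not provided by DSPy.rb. Shuffles deterministically,
# then slices into train/val/test by the requested ratios.
def split_examples(examples, train_ratio:, val_ratio:, seed:)
  shuffled = examples.shuffle(random: Random.new(seed))

  train_size = (shuffled.size * train_ratio).floor
  val_size   = (shuffled.size * val_ratio).floor

  train = shuffled[0, train_size]
  val   = shuffled[train_size, val_size]
  test  = shuffled[(train_size + val_size)..] || []

  [train, val, test]
end
```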
### Built-in metrics

```ruby
# Exact match -- prediction must exactly equal expected value
metric = DSPy::Metrics.exact_match(field: :answer, case_sensitive: true)

# Contains -- prediction must contain expected substring
metric = DSPy::Metrics.contains(field: :answer, case_sensitive: false)

# Numeric difference -- numeric output within tolerance
metric = DSPy::Metrics.numeric_difference(field: :answer, tolerance: 0.01)

# Composite AND -- all sub-metrics must pass
metric = DSPy::Metrics.composite_and(
  DSPy::Metrics.exact_match(field: :answer),
  DSPy::Metrics.contains(field: :reasoning)
)
```

### Custom metrics

```ruby
quality_metric = lambda do |example, prediction|
  return false unless prediction

  score = 0.0
  score += 0.5 if prediction.answer == example.expected_values[:answer]
  score += 0.3 if prediction.explanation && prediction.explanation.length > 50
  score += 0.2 if prediction.confidence && prediction.confidence > 0.8

  score >= 0.7
end

evaluator = DSPy::Evals.new(predictor, metric: quality_metric)
```

Access prediction fields with dot notation (`prediction.answer`), not hash notation.

### Observability hooks

Register callbacks without editing the evaluator:

```ruby
DSPy::Evals.before_example do |payload|
  example = payload[:example]
  DSPy.logger.info("Evaluating example #{example.id}") if example.respond_to?(:id)
end

DSPy::Evals.after_batch do |payload|
  result = payload[:result]
  Langfuse.event(
    name: 'eval.batch',
    metadata: {
      total: result.total_examples,
      passed: result.passed_examples,
      score: result.score
    }
  )
end
```

Available hooks: `before_example`, `after_example`, `before_batch`, `after_batch`.

### Langfuse score export

Enable `export_scores: true` to emit `score.create` events for each evaluated example and a batch score at the end:

```ruby
evaluator = DSPy::Evals.new(
  predictor,
  metric: metric,
  export_scores: true,
  score_name: 'qa_accuracy' # default: 'evaluation'
)

result = evaluator.evaluate(test_examples)
# Emits per-example scores + overall batch score via DSPy::Scores::Exporter
```

Scores attach to the current trace context automatically and flow to Langfuse asynchronously.

### Evaluation results

```ruby
result = evaluator.evaluate(test_examples)

result.score        # Overall score (0.0 to 1.0)
result.passed_count # Examples that passed
result.failed_count # Examples that failed
result.error_count  # Examples that errored

result.results.each do |r|
  r.passed # Boolean
  r.score  # Numeric score
  r.error  # Error message if the example errored
end
```
### Integration with optimizers

```ruby
metric = proc do |example, prediction|
  expected = example.expected_values[:answer].to_s.strip.downcase
  predicted = prediction.answer.to_s.strip.downcase
  !expected.empty? && predicted.include?(expected)
end

optimizer = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)

result = optimizer.compile(
  DSPy::Predict.new(QASignature),
  trainset: train_examples,
  valset: val_examples
)

evaluator = DSPy::Evals.new(result.optimized_program, metric: metric)
test_result = evaluator.evaluate(test_examples, display_table: true)

puts "Test accuracy: #{(test_result.pass_rate * 100).round(2)}%"
```

---

## Storage System

`DSPy::Storage` persists optimization results, tracks history, and manages multiple versions of optimized programs.

### ProgramStorage (low-level)

```ruby
storage = DSPy::Storage::ProgramStorage.new(storage_path: "./dspy_storage")

# Save
saved = storage.save_program(
  result.optimized_program,
  result,
  metadata: {
    signature_class: 'ClassifyText',
    optimizer: 'MIPROv2',
    examples_count: examples.size
  }
)
puts "Stored with ID: #{saved.program_id}"

# Load
saved = storage.load_program(program_id)
predictor = saved.program
score = saved.optimization_result[:best_score_value]

# List
storage.list_programs.each do |p|
  puts "#{p[:program_id]} -- score: #{p[:best_score]} -- saved: #{p[:saved_at]}"
end
```

### StorageManager (recommended)

```ruby
manager = DSPy::Storage::StorageManager.new

# Save with tags
saved = manager.save_optimization_result(
  result,
  tags: ['production', 'sentiment-analysis'],
  description: 'Optimized sentiment classifier v2'
)

# Find programs
programs = manager.find_programs(
  optimizer: 'MIPROv2',
  min_score: 0.85,
  tags: ['production']
)

recent = manager.find_programs(
  max_age_days: 7,
  signature_class: 'ClassifyText'
)

# Get best program for a signature
best = manager.get_best_program('ClassifyText')
predictor = best.program
```

Global shorthand:

```ruby
DSPy::Storage::StorageManager.save(result, metadata: { version: '2.0' })
DSPy::Storage::StorageManager.load(program_id)
DSPy::Storage::StorageManager.best('ClassifyText')
```

### Checkpoints

Create and restore checkpoints during long-running optimizations:

```ruby
# Save a checkpoint
manager.create_checkpoint(
  current_result,
  'iteration_50',
  metadata: { iteration: 50, current_score: 0.87 }
)

# Restore
restored = manager.restore_checkpoint('iteration_50')
program = restored.program

# Auto-checkpoint every N iterations
if iteration % 10 == 0
  manager.create_checkpoint(current_result, "auto_checkpoint_#{iteration}")
end
```

### Import and export

Share programs between environments:

```ruby
storage = DSPy::Storage::ProgramStorage.new

# Export
storage.export_programs(['abc123', 'def456'], './export_backup.json')

# Import
imported = storage.import_programs('./export_backup.json')
puts "Imported #{imported.size} programs"
```

### Optimization history

```ruby
history = manager.get_optimization_history

history[:summary][:total_programs]
history[:summary][:avg_score]

history[:optimizer_stats].each do |optimizer, stats|
  puts "#{optimizer}: #{stats[:count]} programs, best: #{stats[:best_score]}"
end

history[:trends][:improvement_percentage]
```

### Program comparison

```ruby
comparison = manager.compare_programs(id_a, id_b)

comparison[:comparison][:score_difference]
comparison[:comparison][:better_program]
comparison[:comparison][:age_difference_hours]
```

### Storage configuration

```ruby
config = DSPy::Storage::StorageManager::StorageConfig.new
config.storage_path = Rails.root.join('dspy_storage')
config.auto_save = true
config.save_intermediate_results = false
config.max_stored_programs = 100

manager = DSPy::Storage::StorageManager.new(config: config)
```

### Cleanup

Remove old programs. Cleanup retains the best-performing and most recent programs using a weighted score (70% performance, 30% recency):

```ruby
deleted_count = manager.cleanup_old_programs
```
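The exact weighting lives inside the storage manager; purely as illustrative arithmetic (not the library's internal code), the retention priority behaves roughly like this:

```ruby
# Illustrative arithmetic only -- not the library's implementation.
performance = 0.91                                        # best validation score (0.0..1.0)
age_days    = 45.0
max_age     = 90.0
recency     = 1.0 - (age_days / max_age).clamp(0.0, 1.0)  # newer programs score higher

retention_priority = 0.7 * performance + 0.3 * recency
# Programs with the lowest priority are deleted first.
```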
### Storage events

The storage system emits structured log events for monitoring:

- `dspy.storage.save_start`, `dspy.storage.save_complete`, `dspy.storage.save_error`
- `dspy.storage.load_start`, `dspy.storage.load_complete`, `dspy.storage.load_error`
- `dspy.storage.delete`, `dspy.storage.export`, `dspy.storage.import`, `dspy.storage.cleanup`

### File layout

```
dspy_storage/
  programs/
    abc123def456.json
    789xyz012345.json
  history.json
```

---

## API rules

- Call predictors with `.call()`, not `.forward()`.
- Access prediction fields with dot notation (`result.answer`), not hash notation (`result[:answer]`).
- GEPA metrics return `DSPy::Prediction.new(score:, feedback:)`, not a boolean.
- MIPROv2 metrics may return `true`/`false`, a numeric score, or `DSPy::Prediction`.
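Taken together, and assuming `QASignature` declares a `question` input and an `answer` output, the calling convention looks like this minimal sketch:

```ruby
predictor = DSPy::Predict.new(QASignature)

# Call with .call and keyword inputs, then read outputs with dot notation.
result = predictor.call(question: "What is the capital of France?")
puts result.answer        # correct
# result[:answer]         # avoid: hash access
# predictor.forward(...)  # avoid: use .call instead
```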