# DSPy.rb Testing, Optimization & Observability

## Testing

DSPy.rb enables standard RSpec testing patterns for LLM logic, making your AI applications testable and maintainable.

### Basic Testing Setup

```ruby
require 'rspec'
require 'dspy'

RSpec.describe EmailClassifier do
  before do
    DSPy.configure do |c|
      c.lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
    end
  end

  describe '#classify' do
    it 'classifies technical support emails correctly' do
      classifier = EmailClassifier.new
      result = classifier.forward(
        email_subject: "Can't log in",
        email_body: "I'm unable to access my account"
      )

      expect(result[:category]).to eq('Technical')
      expect(%w[High Medium Low]).to include(result[:priority])
    end
  end
end
```

### Mocking LLM Responses

Test your modules without making actual API calls:

```ruby
RSpec.describe MyModule do
  it 'handles mock responses correctly' do
    # Create a mock predictor that returns predetermined results
    mock_predictor = instance_double(DSPy::Predict)
    allow(mock_predictor).to receive(:forward).and_return({
      category: 'Technical',
      priority: 'High',
      confidence: 0.95
    })

    # Inject mock into your module
    module_instance = MyModule.new
    module_instance.instance_variable_set(:@predictor, mock_predictor)

    result = module_instance.forward(input: 'test data')
    expect(result[:category]).to eq('Technical')
  end
end
```
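
Setting `@predictor` with `instance_variable_set` works, but it couples the spec to the module's internals. If you control the module, constructor injection gives the test a cleaner seam. A minimal sketch, assuming a variant of `MyModule` you define yourself (the `predictor:` keyword is our own convention here, not a DSPy.rb API):

```ruby
# Hypothetical variant of MyModule that accepts its predictor as a
# constructor argument, so specs can pass in a double directly.
class MyModule < DSPy::Module
  def initialize(predictor: DSPy::Predict.new(MySignature))
    super()
    @predictor = predictor
  end

  def forward(input)
    @predictor.forward(input)
  end
end

# In the spec:
# module_instance = MyModule.new(predictor: mock_predictor)
```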

### Testing Type Safety

Verify that signatures enforce type constraints:

```ruby
RSpec.describe EmailClassificationSignature do
  it 'validates output types' do
    predictor = DSPy::Predict.new(EmailClassificationSignature)

    # This should work
    result = predictor.forward(
      email_subject: 'Test',
      email_body: 'Test body'
    )
    expect(result[:category]).to be_a(String)

    # Test that invalid types are caught
    expect {
      # Simulate LLM returning invalid type
      predictor.send(:validate_output, { category: 123 })
    }.to raise_error(DSPy::ValidationError)
  end
end
```

### Testing Edge Cases

Always test boundary conditions and error scenarios:

```ruby
RSpec.describe EmailClassifier do
  it 'handles empty emails' do
    classifier = EmailClassifier.new
    result = classifier.forward(
      email_subject: '',
      email_body: ''
    )
    # Define expected behavior for the edge case
    expect(result[:category]).to eq('General')
  end

  it 'handles very long emails' do
    long_body = 'word ' * 10_000
    classifier = EmailClassifier.new

    expect {
      classifier.forward(
        email_subject: 'Test',
        email_body: long_body
      )
    }.not_to raise_error
  end

  it 'handles special characters' do
    classifier = EmailClassifier.new
    result = classifier.forward(
      email_subject: 'Test <script>alert("xss")</script>',
      email_body: 'Body with émojis 🎉 and spëcial çharacters'
    )

    expect(%w[Technical Billing General]).to include(result[:category])
  end
end
```

### Integration Testing

Test complete workflows end-to-end:

```ruby
RSpec.describe EmailProcessingPipeline do
  it 'processes email through complete pipeline' do
    pipeline = EmailProcessingPipeline.new

    result = pipeline.forward(
      email_subject: 'Billing question',
      email_body: 'How do I update my payment method?'
    )

    # Verify the complete pipeline output
    expect(result[:classification]).to eq('Billing')
    expect(result[:priority]).to eq('Medium')
    expect(result[:suggested_response]).to include('payment')
    expect(result[:assigned_team]).to eq('billing_support')
  end
end
```

### VCR for Deterministic Tests

Use VCR to record and replay API responses:

```ruby
require 'vcr'

VCR.configure do |config|
  config.cassette_library_dir = 'spec/vcr_cassettes'
  config.hook_into :webmock
  config.filter_sensitive_data('<OPENAI_API_KEY>') { ENV['OPENAI_API_KEY'] }
end

RSpec.describe EmailClassifier do
  it 'classifies emails consistently', :vcr do
    VCR.use_cassette('email_classification') do
      classifier = EmailClassifier.new
      result = classifier.forward(
        email_subject: 'Test subject',
        email_body: 'Test body'
      )

      expect(result[:category]).to eq('Technical')
    end
  end
end
```
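
The first run records a cassette under `spec/vcr_cassettes`; subsequent runs replay it, so the expectation above stays deterministic and costs no API calls. When a prompt or model changes, delete the cassette file or loosen VCR's standard record mode:

```ruby
# :new_episodes replays matched requests and records any request
# that has no match in the cassette (useful after prompt changes).
VCR.use_cassette('email_classification', record: :new_episodes) do
  # ... exercise the classifier as above ...
end
```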

## Optimization

DSPy.rb provides powerful optimization capabilities to automatically improve your prompts and modules.

### MIPROv2 Optimization

MIPROv2 is an advanced multi-prompt optimization technique that uses bootstrap sampling, instruction generation, and Bayesian optimization.

```ruby
require 'dspy/mipro'

# Define your module to optimize
class EmailClassifier < DSPy::Module
  def initialize
    super
    @predictor = DSPy::ChainOfThought.new(EmailClassificationSignature)
  end

  def forward(input)
    @predictor.forward(input)
  end
end

# Prepare training data
training_examples = [
  {
    input: { email_subject: "Can't log in", email_body: "Password reset not working" },
    expected_output: { category: 'Technical', priority: 'High' }
  },
  {
    input: { email_subject: "Billing question", email_body: "How much does premium cost?" },
    expected_output: { category: 'Billing', priority: 'Medium' }
  },
  # Add more examples...
]

# Define evaluation metric
def accuracy_metric(example, prediction)
  (example[:expected_output][:category] == prediction[:category]) ? 1.0 : 0.0
end

# Run optimization
optimizer = DSPy::MIPROv2.new(
  metric: method(:accuracy_metric),
  num_candidates: 10,
  num_threads: 4
)

optimized_module = optimizer.compile(
  EmailClassifier.new,
  trainset: training_examples
)

# Use optimized module
result = optimized_module.forward(
  email_subject: "New email",
  email_body: "New email content"
)
```
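
Optimizers can overfit their training set, so it is worth scoring the compiled module on examples it never saw. A minimal sketch in plain Ruby, reusing the `accuracy_metric` defined above; `holdout_examples` is an assumed variable with the same shape as `training_examples`:

```ruby
# Average a metric over a held-out set for any module.
def average_score(mod, examples)
  scores = examples.map do |example|
    prediction = mod.forward(example[:input])
    accuracy_metric(example, prediction)
  end
  scores.sum / scores.size
end

baseline_score  = average_score(EmailClassifier.new, holdout_examples)
optimized_score = average_score(optimized_module, holdout_examples)

puts "Baseline: #{baseline_score}, optimized: #{optimized_score}"
```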

### Bootstrap Few-Shot Learning

Automatically generate few-shot examples from your training data:

```ruby
require 'dspy/teleprompt'

# Create a teleprompter for few-shot optimization
teleprompter = DSPy::BootstrapFewShot.new(
  metric: method(:accuracy_metric),
  max_bootstrapped_demos: 5,
  max_labeled_demos: 3
)

# Compile the optimized module
optimized = teleprompter.compile(
  MyModule.new,
  trainset: training_examples
)
```
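
The compiled module keeps the same `forward` interface as the original, so it drops in wherever `MyModule` was used, with the bootstrapped demonstrations traveling along:

```ruby
# Use the compiled module exactly like the original.
result = optimized.forward(input: 'new data to classify')
```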

### Custom Optimization Metrics

Define custom metrics for your specific use case:

```ruby
def custom_metric(example, prediction)
  score = 0.0

  # Category accuracy (60% weight)
  score += 0.6 if example[:expected_output][:category] == prediction[:category]

  # Priority accuracy (40% weight)
  score += 0.4 if example[:expected_output][:priority] == prediction[:priority]

  score
end

# Use in optimization
optimizer = DSPy::MIPROv2.new(
  metric: method(:custom_metric),
  num_candidates: 10
)
```

### A/B Testing Different Approaches

Compare different module implementations:

```ruby
# Approach A: ChainOfThought
class ApproachA < DSPy::Module
  def initialize
    super
    @predictor = DSPy::ChainOfThought.new(EmailClassificationSignature)
  end

  def forward(input)
    @predictor.forward(input)
  end
end

# Approach B: ReAct with tools
class ApproachB < DSPy::Module
  def initialize
    super
    @predictor = DSPy::ReAct.new(
      EmailClassificationSignature,
      tools: [KnowledgeBaseTool.new]
    )
  end

  def forward(input)
    @predictor.forward(input)
  end
end

# Evaluate both approaches
def evaluate_approach(approach_class, test_set)
  approach = approach_class.new
  scores = test_set.map do |example|
    prediction = approach.forward(example[:input])
    accuracy_metric(example, prediction)
  end
  scores.sum / scores.size
end

approach_a_score = evaluate_approach(ApproachA, test_examples)
approach_b_score = evaluate_approach(ApproachB, test_examples)

puts "Approach A accuracy: #{approach_a_score}"
puts "Approach B accuracy: #{approach_b_score}"
```
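
Aggregate accuracy can mask where the two approaches actually disagree. A small follow-up in plain Ruby, reusing `accuracy_metric`, that prints the examples on which they diverge:

```ruby
# Compare the approaches example by example to see where they differ.
approach_a = ApproachA.new
approach_b = ApproachB.new

test_examples.each_with_index do |example, index|
  score_a = accuracy_metric(example, approach_a.forward(example[:input]))
  score_b = accuracy_metric(example, approach_b.forward(example[:input]))
  next if score_a == score_b

  puts "Example #{index}: A=#{score_a}, B=#{score_b}"
end
```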

## Observability

Track your LLM application's performance, token usage, and behavior in production.

### OpenTelemetry Integration

DSPy.rb automatically integrates with OpenTelemetry when configured:

```ruby
require 'opentelemetry/sdk'
require 'dspy'

# Configure OpenTelemetry
OpenTelemetry::SDK.configure do |c|
  c.service_name = 'my-dspy-app'
  c.use_all # Use all available instrumentation
end

# DSPy automatically creates traces for predictions
predictor = DSPy::Predict.new(MySignature)
result = predictor.forward(input: 'data')
# Traces are automatically sent to your OpenTelemetry collector
```
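
"Sent to your collector" presumes an exporter is wired up. One common setup uses the `opentelemetry-exporter-otlp` gem with a batch span processor; a sketch, assuming a collector listening on the standard OTLP/HTTP port (the endpoint value is a placeholder):

```ruby
require 'opentelemetry/sdk'
require 'opentelemetry-exporter-otlp'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'my-dspy-app'
  # Export spans in batches over OTLP/HTTP to the collector.
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(
        endpoint: ENV['OTEL_EXPORTER_OTLP_ENDPOINT'] || 'http://localhost:4318/v1/traces'
      )
    )
  )
end
```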

### Langfuse Integration

Track detailed LLM execution traces with Langfuse:

```ruby
require 'dspy/langfuse'

# Configure Langfuse
DSPy.configure do |c|
  c.lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
  c.langfuse = {
    public_key: ENV['LANGFUSE_PUBLIC_KEY'],
    secret_key: ENV['LANGFUSE_SECRET_KEY'],
    host: ENV['LANGFUSE_HOST'] || 'https://cloud.langfuse.com'
  }
end

# All predictions are automatically traced
predictor = DSPy::Predict.new(MySignature)
result = predictor.forward(input: 'data')
# View detailed traces in the Langfuse dashboard
```

### Manual Token Tracking

Track token usage without external services:

```ruby
class TokenTracker
  def initialize
    @total_tokens = 0
    @request_count = 0
  end

  def track_prediction(predictor, input)
    start_time = Time.now
    result = predictor.forward(input)
    duration = Time.now - start_time

    # Read token usage from the response metadata, defaulting to 0
    # when the provider returns no usage information
    tokens = result.metadata&.dig(:usage, :total_tokens) || 0
    @total_tokens += tokens
    @request_count += 1

    puts "Request ##{@request_count}: #{tokens} tokens in #{duration.round(3)}s"
    puts "Total tokens used: #{@total_tokens}"

    result
  end
end

# Usage
tracker = TokenTracker.new
predictor = DSPy::Predict.new(MySignature)

result = tracker.track_prediction(predictor, { input: 'data' })
```

### Custom Logging

Add detailed logging to your modules:

```ruby
require 'logger'

class EmailClassifier < DSPy::Module
  def initialize
    super
    @predictor = DSPy::ChainOfThought.new(EmailClassificationSignature)
    @logger = Logger.new(STDOUT)
  end

  def forward(input)
    @logger.info "Classifying email: #{input[:email_subject]}"

    start_time = Time.now
    result = @predictor.forward(input)
    duration = Time.now - start_time

    @logger.info "Classification: #{result[:category]} (#{duration.round(3)}s)"

    if result[:reasoning]
      @logger.debug "Reasoning: #{result[:reasoning]}"
    end

    result
  rescue => e
    @logger.error "Classification failed: #{e.message}"
    raise
  end
end
```

### Performance Monitoring

Monitor latency and performance metrics:

```ruby
class PerformanceMonitor
  def initialize
    @metrics = {
      total_requests: 0,
      total_duration: 0.0,
      errors: 0,
      success_count: 0
    }
  end

  def monitor_request
    start_time = Time.now
    @metrics[:total_requests] += 1

    begin
      result = yield
      @metrics[:success_count] += 1
      result
    rescue => e
      @metrics[:errors] += 1
      raise
    ensure
      duration = Time.now - start_time
      @metrics[:total_duration] += duration

      if @metrics[:total_requests] % 10 == 0
        print_stats
      end
    end
  end

  def print_stats
    avg_duration = @metrics[:total_duration] / @metrics[:total_requests]
    success_rate = @metrics[:success_count].to_f / @metrics[:total_requests]

    puts "\n=== Performance Stats ==="
    puts "Total requests: #{@metrics[:total_requests]}"
    puts "Average duration: #{avg_duration.round(3)}s"
    puts "Success rate: #{(success_rate * 100).round(2)}%"
    puts "Errors: #{@metrics[:errors]}"
    puts "========================\n"
  end
end

# Usage
monitor = PerformanceMonitor.new
predictor = DSPy::Predict.new(MySignature)

result = monitor.monitor_request do
  predictor.forward(input: 'data')
end
```

### Error Rate Tracking

Monitor and alert on error rates:

```ruby
class ErrorRateMonitor
  def initialize(alert_threshold: 0.1)
    @alert_threshold = alert_threshold
    @recent_results = []
    @window_size = 100
  end

  def track_result(success:)
    @recent_results << success
    @recent_results.shift if @recent_results.size > @window_size

    error_rate = calculate_error_rate
    alert_if_needed(error_rate)

    error_rate
  end

  private

  def calculate_error_rate
    failures = @recent_results.count(false)
    failures.to_f / @recent_results.size
  end

  def alert_if_needed(error_rate)
    if error_rate > @alert_threshold
      puts "⚠️ ALERT: Error rate #{(error_rate * 100).round(2)}% exceeds threshold!"
      # Send notification, page oncall, etc.
    end
  end
end
```
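
Wiring the monitor around a predictor is straightforward; the same pattern appears inside `ProductionModule` under Best Practices below:

```ruby
# Record success/failure of each prediction in the sliding window.
monitor = ErrorRateMonitor.new(alert_threshold: 0.05)
predictor = DSPy::Predict.new(MySignature)

begin
  result = predictor.forward(input: 'data')
  monitor.track_result(success: true)
rescue
  monitor.track_result(success: false)
  raise
end
```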

## Best Practices

### 1. Start with Tests

Write tests before optimizing:

```ruby
# Define test cases first
test_cases = [
  { input: {...}, expected: {...} },
  # More test cases...
]

# Ensure baseline functionality
classifier = EmailClassifier.new
test_cases.each do |tc|
  result = classifier.forward(tc[:input])
  raise 'baseline check failed' unless result[:category] == tc[:expected][:category]
end

# Then optimize
optimized = optimizer.compile(classifier, trainset: test_cases)
```

### 2. Use Meaningful Metrics

Define metrics that align with business goals:

```ruby
def business_aligned_metric(example, prediction)
  # High-priority errors are more costly
  if example[:expected_output][:priority] == 'High'
    prediction[:priority] == 'High' ? 1.0 : 0.0
  else
    prediction[:category] == example[:expected_output][:category] ? 0.8 : 0.0
  end
end
```

### 3. Monitor in Production

Always track production performance:

```ruby
class ProductionModule < DSPy::Module
  def initialize
    super
    @predictor = DSPy::ChainOfThought.new(MySignature)
    @monitor = PerformanceMonitor.new
    @error_tracker = ErrorRateMonitor.new
  end

  def forward(input)
    @monitor.monitor_request do
      result = @predictor.forward(input)
      @error_tracker.track_result(success: true)
      result
    rescue => e
      @error_tracker.track_result(success: false)
      raise
    end
  end
end
```

### 4. Version Your Modules

Track which version of your module is deployed:

```ruby
class EmailClassifierV2 < DSPy::Module
  VERSION = '2.1.0'

  def initialize
    super
    @predictor = DSPy::ChainOfThought.new(EmailClassificationSignature)
  end

  def forward(input)
    result = @predictor.forward(input)
    result.merge(model_version: VERSION)
  end
end
```