Judge Alignment: Teaching AI to Match Human Preferences
Transform Generic Judges into Domain Experts
Judge alignment is the process of refining LLM judges to match human evaluation standards. Through systematic learning from human feedback, judges evolve from generic evaluators to domain-specific experts that understand your unique quality criteria.
Why Alignment Matters
Even the most sophisticated LLMs need calibration to match your specific evaluation standards. What constitutes "good" customer service varies by industry. Medical accuracy requirements differ from general health advice. Alignment bridges this gap, teaching judges your specific quality standards through example.
Learn from Expert Feedback
Judges improve by learning from your domain experts' assessments, capturing nuanced quality criteria that generic prompts miss.
Consistent Standards at Scale
Once aligned, judges apply your exact quality standards consistently across millions of evaluations.
Continuous Improvement
As your standards evolve, judges can be re-aligned with new feedback, maintaining relevance over time.
Reduced Evaluation Errors
Aligned judges show 30-50% reduction in false positives/negatives compared to generic evaluation prompts.
How Judge Alignment Works
Alignment Lifecycle
Quick Start: Align Your First Judge
For alignment to work, each trace must have BOTH judge assessments AND human feedback with the same assessment name. The alignment process learns by comparing judge assessments with human feedback on the same traces.
The assessment name must exactly match the judge name - if your judge is named "product_quality", both the judge's assessments and human feedback must use the name "product_quality".
The order doesn't matter - humans can provide feedback before or after the judge evaluates.
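Before aligning, you can spot-check a trace to confirm both assessments are present under the matching name. A minimal sketch, assuming trace_id refers to a trace your application produced, the judge is named "product_quality" as in the quick start below, and judge assessments are recorded with the LLM_JUDGE source type:

import mlflow

# Confirm the trace carries both a judge assessment and human feedback
# recorded under the judge's name ("product_quality")
trace = mlflow.get_trace(trace_id)  # trace_id: a trace produced by your application
feedbacks = trace.search_assessments(type="feedback")

judge_entries = [
    f for f in feedbacks
    if f.name == "product_quality" and f.source.source_type == "LLM_JUDGE"
]
human_entries = [
    f for f in feedbacks
    if f.name == "product_quality" and f.source.source_type == "HUMAN"
]
print(f"Judge assessments: {len(judge_entries)}, human feedback: {len(human_entries)}")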
Note: Alignment is currently supported only for field-based evaluation using {{ inputs }} and {{ outputs }} templates. Support for Agent-as-a-Judge evaluation ({{ trace }}) and expectations ({{ expectations }}) is not yet available.
Step 1: Setup and Generate Traces
First, create your judge and generate traces with initial assessments:
from mlflow.genai.judges import make_judge
from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer
from mlflow.entities import AssessmentSource, AssessmentSourceType
import mlflow

# Create experiment and initial judge
experiment_id = mlflow.create_experiment("product-quality-alignment")
mlflow.set_experiment(experiment_id=experiment_id)

initial_judge = make_judge(
    name="product_quality",
    instructions=(
        "Evaluate if the product description in {{ outputs }} "
        "is accurate and helpful for the query in {{ inputs }}. "
        "Rate as: excellent, good, fair, or poor"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

# Generate traces from your application (minimum 10 required)
traces = []
for i in range(15):  # Generate 15 traces (more than the minimum of 10)
    with mlflow.start_span(f"product_description_{i}") as span:
        # Your application logic
        query = f"Product query {i}"
        description = f"Product description for query {i}"
        span.set_inputs({"query": query})
        span.set_outputs({"description": description})
        traces.append(span.trace_id)

# Run the judge on these traces to get initial assessments
for trace_id in traces:
    trace = mlflow.get_trace(trace_id)

    # Extract inputs and outputs from the trace for field-based evaluation
    inputs = trace.data.spans[0].inputs  # Get inputs from trace
    outputs = trace.data.spans[0].outputs  # Get outputs from trace

    # Judge evaluates using field-based approach (inputs/outputs)
    judge_result = initial_judge(inputs=inputs, outputs=outputs)
    # Judge's assessment is automatically logged when called
Step 2: Collect Human Feedback
After running your judge on traces, you need to collect human feedback. You can either:
- Use the MLflow UI (recommended): Review traces and add feedback through the intuitive interface
- Log programmatically: If you already have ground truth labels
For detailed instructions on collecting feedback, see Collecting Feedback for Alignment below.
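To run this quick start end-to-end without the UI, here is a minimal sketch that logs placeholder human labels for the traces generated in Step 1. The labels and the "quickstart_reviewer" source id are illustrative; in practice the ratings come from your domain experts:

import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Placeholder labels - replace with real expert ratings
human_labels = ["excellent", "good", "fair", "poor", "good"] * 3  # one per trace

for trace_id, label in zip(traces, human_labels):
    mlflow.log_feedback(
        trace_id=trace_id,
        name="product_quality",  # must match the judge name
        value=label,
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN, source_id="quickstart_reviewer"
        ),
    )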
Step 3: Align and Register
After collecting feedback, align your judge and register it:
- Default Optimizer (Recommended)
- Explicit Optimizer
# Retrieve traces with both judge and human assessments
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=15, return_type="list"
)

# Align the judge using human corrections (minimum 10 traces recommended)
if len(traces_for_alignment) >= 10:
    # No optimizer specified, so the default SIMBA optimizer is used automatically.
    # Alignment shows minimal progress by default:
    # INFO: Starting SIMBA optimization with 15 examples (set logging to DEBUG for detailed output)
    # INFO: SIMBA optimization completed
    aligned_judge = initial_judge.align(traces_for_alignment)

    # Register the aligned judge
    aligned_judge.register(experiment_id=experiment_id)
    print("Judge aligned successfully with human feedback")
else:
    print(f"Need at least 10 traces for alignment, have {len(traces_for_alignment)}")
from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

# Retrieve traces with both judge and human assessments
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=15, return_type="list"
)

# Align the judge using human corrections (minimum 10 traces recommended)
if len(traces_for_alignment) >= 10:
    # Explicitly specify SIMBA with custom model configuration
    optimizer = SIMBAAlignmentOptimizer(model="anthropic:/claude-opus-4-1-20250805")
    aligned_judge = initial_judge.align(traces_for_alignment, optimizer)

    # Register the aligned judge
    aligned_judge.register(experiment_id=experiment_id)
    print("Judge aligned successfully with human feedback")
else:
    print(f"Need at least 10 traces for alignment, have {len(traces_for_alignment)}")
The SIMBA Alignment Optimizer
MLflow provides a default alignment optimizer built on DSPy's implementation of SIMBA (Stochastic Introspective Mini-Batch Ascent). When you call align() without specifying an optimizer, the SIMBA optimizer is used automatically:
# Default: Uses SIMBA optimizer automatically
aligned_judge = initial_judge.align(traces_with_feedback)

# Explicit: Same as above but with custom model specification
from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

optimizer = SIMBAAlignmentOptimizer(
    model="anthropic:/claude-opus-4-1-20250805"  # Model used for optimization
)
aligned_judge = initial_judge.align(traces_with_feedback, optimizer)

# Requirements for alignment:
# - Minimum 10 traces with BOTH judge assessments and human feedback
# - Both assessments must use the same name (matching the judge name)
# - Order doesn't matter - humans can assess before or after judge
# - Mix of agreements and disagreements between judge and human recommended
When using align() without an optimizer parameter, MLflow automatically uses the SIMBA optimizer. This simplifies the alignment process while still allowing customization when needed.
Controlling Optimization Output
By default, alignment shows minimal progress information to keep logs clean. If you need to debug the optimization process or see detailed iteration progress, enable DEBUG logging:
import logging
# Enable detailed optimization output
logging.getLogger("mlflow.genai.judges.optimizers.simba").setLevel(logging.DEBUG)
# Now alignment will show:
# - Detailed iteration-by-iteration progress
# - Score improvements at each step
# - Strategy selection details
# - Full DSPy optimization output
aligned_judge = initial_judge.align(traces_with_feedback, optimizer)
# Reset to default (minimal output) after debugging
logging.getLogger("mlflow.genai.judges.optimizers.simba").setLevel(logging.INFO)
Enable DEBUG logging when:
- Optimization seems stuck or is taking too long
- You want to understand how the optimizer is improving instructions
- Debugging alignment failures or unexpected results
- Learning how SIMBA optimization works internally
Keep it at INFO (default) for production use to avoid verbose output.
Collecting Feedback for Alignment
The quality of alignment depends on the quality and quantity of feedback. Choose the approach that best fits your situation:
Feedback Collection Approaches
- MLflow UI (Recommended)
- Programmatic (Ground Truth)
When to use: You don't have existing ground truth labels and need to collect human feedback.
The MLflow UI provides an intuitive interface for reviewing traces and adding feedback:
- Navigate to the Traces tab in your experiment
- Click on individual traces to review inputs, outputs, and any existing judge assessments
- Add feedback by clicking the "Add Feedback" button
- Select the assessment name that matches your judge name (e.g., "product_quality")
- Provide your rating according to your evaluation criteria
Tips for effective feedback collection:
- If you're not a domain expert: Distribute traces among team members or domain experts for review
- If you are the domain expert: Create a rubric or guidelines document to ensure consistency
- For multiple reviewers: Organize feedback sessions where reviewers can work through batches together
- For consistency: Document your evaluation criteria clearly before starting
The UI automatically logs feedback in the correct format for alignment.

When to use: You have existing ground truth labels from your data.
If you already have labeled data, you can programmatically log it as feedback:
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Your existing ground truth dataset
ground_truth_data = [
    {"trace_id": "trace1", "label": "excellent", "query": "What is MLflow?"},
    {"trace_id": "trace2", "label": "poor", "query": "How to use tracking?"},
    {"trace_id": "trace3", "label": "good", "query": "How to log models?"},
]

# Log ground truth as feedback for alignment
for item in ground_truth_data:
    mlflow.log_feedback(
        trace_id=item["trace_id"],
        name="product_quality",  # Must match your judge name
        value=item["label"],
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN, source_id="ground_truth_dataset"
        ),
    )

print(f"Logged {len(ground_truth_data)} ground truth labels for alignment")
This approach is efficient when you have pre-labeled data from:
- Previous manual labeling efforts
- Expert annotations
- Production feedback systems
- Test datasets with known correct answers
Diverse Reviewers
Include feedback from multiple experts to capture different perspectives and reduce individual bias.
Balanced Examples
Include both positive and negative examples. Aim for at least 30% of each to help the judge learn boundaries.
Sufficient Volume
Collect at least 10 feedback examples (minimum for SIMBA), but 50-100 examples typically yield better results.
Consistent Standards
Ensure reviewers use consistent criteria. Provide guidelines or rubrics to standardize assessments.
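To check the volume and balance guidelines above before running alignment, here is a small sketch that tallies the human feedback values recorded on your traces (it assumes the "product_quality" assessment name and the experiment_id from the quick start):

from collections import Counter

import mlflow

# experiment_id: the experiment created in the quick start
traces = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=100, return_type="list"
)

label_counts = Counter()
for trace in traces:
    for feedback in trace.search_assessments(type="feedback"):
        if feedback.name == "product_quality" and feedback.source.source_type == "HUMAN":
            label_counts[feedback.value] += 1

total = sum(label_counts.values())
print(f"Human-labeled examples: {total}")
for label, count in label_counts.items():
    print(f"  {label}: {count} ({count / total:.0%})")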
Custom Alignment Optimizers
MLflow's alignment system is designed as a plugin architecture, allowing you to create custom optimizers for different alignment strategies. This extensibility enables you to implement domain-specific optimization approaches while leveraging MLflow's judge infrastructure.
Creating a Custom Optimizer
To create a custom alignment optimizer, extend the AlignmentOptimizer abstract base class:
from mlflow.genai.judges.base import AlignmentOptimizer, Judge
from mlflow.entities.trace import Trace


class MyCustomOptimizer(AlignmentOptimizer):
    """Custom optimizer implementation for judge alignment."""

    def __init__(self, model: str = None, **kwargs):
        """Initialize your optimizer with custom parameters."""
        self.model = model
        # Add any custom initialization logic

    def align(self, judge: Judge, traces: list[Trace]) -> Judge:
        """
        Implement your alignment algorithm.

        Args:
            judge: The judge to be optimized
            traces: List of traces containing human feedback

        Returns:
            A new Judge instance with improved alignment
        """
        # Your custom alignment logic here:
        # 1. Extract feedback from traces
        # 2. Analyze disagreements between judge and human
        # 3. Generate improved instructions
        # 4. Return a new judge with better alignment

        # Example: Return judge with modified instructions
        from mlflow.genai.judges import make_judge

        improved_instructions = self._optimize_instructions(judge.instructions, traces)
        return make_judge(
            name=judge.name,
            instructions=improved_instructions,
            model=judge.model,
        )

    def _optimize_instructions(self, instructions: str, traces: list[Trace]) -> str:
        """Your custom optimization logic."""
        # Implement your optimization strategy
        pass
Using Custom Optimizers
Once implemented, use your custom optimizer just like the built-in ones:
# Create your custom optimizer
custom_optimizer = MyCustomOptimizer(model="your-model")
# Use it for alignment
aligned_judge = initial_judge.align(traces_with_feedback, custom_optimizer)
Available Optimizers
MLflow currently provides:
- SIMBAAlignmentOptimizer (default): Uses DSPy's SIMBA (Stochastic Introspective Mini-Batch Ascent) optimizer for robust alignment
- Custom optimizers: Extend AlignmentOptimizer to implement your own strategies
The plugin architecture ensures that new optimization strategies can be added without modifying the core judge system, promoting extensibility and experimentation with different alignment approaches.
Testing Alignment Effectiveness
Validate that alignment improved your judge:
def test_alignment_improvement(
    original_judge, aligned_judge, test_traces: list
) -> dict:
    """Compare judge performance before and after alignment."""
    original_correct = 0
    aligned_correct = 0
    evaluated = 0

    for trace in test_traces:
        # Get human ground truth from trace assessments
        feedbacks = trace.search_assessments(type="feedback")
        human_feedback = next(
            (f for f in feedbacks if f.source.source_type == "HUMAN"), None
        )
        if not human_feedback:
            continue
        evaluated += 1

        # Extract inputs/outputs for field-based evaluation (as in Step 1)
        inputs = trace.data.spans[0].inputs
        outputs = trace.data.spans[0].outputs

        # Get judge evaluations
        original_eval = original_judge(inputs=inputs, outputs=outputs)
        aligned_eval = aligned_judge(inputs=inputs, outputs=outputs)

        # Check agreement with human
        if original_eval.value == human_feedback.value:
            original_correct += 1
        if aligned_eval.value == human_feedback.value:
            aligned_correct += 1

    # Only traces with human feedback count toward accuracy
    if evaluated == 0:
        return {"original_accuracy": 0.0, "aligned_accuracy": 0.0, "improvement": 0.0}

    return {
        "original_accuracy": original_correct / evaluated,
        "aligned_accuracy": aligned_correct / evaluated,
        "improvement": (aligned_correct - original_correct) / evaluated,
    }
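A minimal way to run this comparison, assuming you hold out labeled traces that were not used during alignment (the max_results value here is arbitrary):

# Held-out traces with human feedback, not used for alignment
holdout_traces = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=20, return_type="list"
)

results = test_alignment_improvement(initial_judge, aligned_judge, holdout_traces)
print(f"Original accuracy: {results['original_accuracy']:.0%}")
print(f"Aligned accuracy:  {results['aligned_accuracy']:.0%}")
print(f"Improvement:       {results['improvement']:+.0%}")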