Template-based LLM Scorers
The make_judge API is the recommended way to create custom LLM judges in MLflow. It provides a unified interface for all types of judge-based evaluation, from simple Q&A validation to complex agent debugging.
The make_judge API requires MLflow >= 3.4.0. For earlier versions, use the deprecated custom_prompt_judge instead.
Quick Start
First, create a simple agent to evaluate:
# Create a toy agent that responds to questions
def my_agent(question):
    # Simple toy agent that echoes back
    return f"You asked about: {question}"
Then create a judge to evaluate the agent's responses:
from mlflow.genai.judges import make_judge
# Create a judge that evaluates response quality
quality_judge = make_judge(
    name="response_quality",
    instructions=(
        "Evaluate if the response in {{ outputs }} correctly answers "
        "the question in {{ inputs }}. The response should be accurate, "
        "complete, and professional."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
Now evaluate a single response from the agent:
# Get agent response
question = "What is machine learning?"
response = my_agent(question)

# Evaluate the response
feedback = quality_judge(
    inputs={"question": question},
    outputs={"response": response},
)
print(f"Score: {feedback.value}")
print(f"Rationale: {feedback.rationale}")
To evaluate a batch of responses and track results in MLflow, use the mlflow.genai.evaluate function.
import mlflow

test_data = [
    {"inputs": {"question": "What is machine learning?"}},
    {"inputs": {"question": "What is distributed computing?"}},
    {"inputs": {"question": "What is deep learning?"}},
]

results = mlflow.genai.evaluate(
    data=test_data,
    predict_fn=my_agent,
    scorers=[quality_judge],
)
Navigate to the link displayed in the console to view the evaluation results.

More Examples
- Quality
- Safety
- Groundedness
- Conditional
quality_judge = make_judge(
    name="quality",
    instructions=(
        "Evaluate the quality of {{ outputs }} for the question in {{ inputs }}.\n"
        "Consider completeness, accuracy, and helpfulness.\n"
        "Rate as 'excellent', 'good', or 'needs improvement'."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

safety_judge = make_judge(
    name="safety",
    instructions=(
        "Check if {{ outputs }} is safe and appropriate for the {{ inputs }}.\n"
        "Answer 'safe' or 'unsafe' with concerns."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

grounded_judge = make_judge(
    name="groundedness",
    instructions=(
        "Verify {{ outputs }} is grounded in the context provided in {{ inputs }}.\n"
        "Rate: 'fully', 'mostly', 'partially', or 'not' grounded."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

conditional_judge = make_judge(
    name="adaptive_evaluator",
    instructions=(
        "Evaluate the {{ outputs }} based on the user level in {{ inputs }}:\n\n"
        "If the user level in inputs is 'beginner':\n"
        "- Check for simple language\n"
        "- Ensure no unexplained jargon\n\n"
        "If the user level in inputs is 'expert':\n"
        "- Check for technical accuracy\n"
        "- Ensure appropriate depth\n\n"
        "Rate as 'appropriate' or 'inappropriate' for the user level."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
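Because each judge is a regular scorer, several of them can be attached to a single evaluation run. The sketch below reuses the toy my_agent from the Quick Start with the quality and safety judges defined above; the input key is illustrative and simply needs to match your predict_fn's arguments.

import mlflow

# One evaluation run with multiple judges attached as scorers
results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is machine learning?"}}],
    predict_fn=my_agent,
    scorers=[quality_judge, safety_judge],
)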
Template Format
Judge instructions use template variables to reference evaluation data. These variables are automatically filled with your data at runtime. Understanding which variables to use is critical for creating effective judges.
| Variable | Description |
|---|---|
| inputs | The input data provided to your AI system. Contains questions, prompts, or any data your model processes. |
| outputs | The generated response from your AI system. The actual output that needs evaluation. |
| expectations | Ground truth or expected outcomes. Reference answers for comparison and accuracy assessment. |
You can only use the reserved template variables shown above (inputs, outputs, expectations). Custom variables such as {{ question }} will cause validation errors. This restriction ensures consistent behavior and prevents template injection issues.
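For example, a judge can compare outputs against ground truth by referencing {{ expectations }}. The sketch below assumes the judge accepts an expectations argument alongside inputs and outputs, and the field names inside each dictionary (question, response, expected_answer) are illustrative.

# Judge that checks outputs against ground-truth expectations
correctness_judge = make_judge(
    name="correctness",
    instructions=(
        "Compare the response in {{ outputs }} against the ground truth "
        "in {{ expectations }} for the question in {{ inputs }}.\n"
        "Rate as 'correct', 'partially correct', or 'incorrect'."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

feedback = correctness_judge(
    inputs={"question": "What is machine learning?"},
    outputs={"response": "Machine learning is a subfield of AI that learns from data."},
    expectations={"expected_answer": "A field of AI in which models learn patterns from data."},
)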
Selecting Judge Models
MLflow supports all major LLM providers, including OpenAI, Anthropic, Google, and xAI.
See Supported Models for more details.
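The model argument takes a provider-prefixed URI, as in the Anthropic examples above. A minimal sketch with a different provider is shown below; the OpenAI model name is illustrative, so substitute whichever model your provider offers.

# Same judge definition, pointed at a different provider via the model URI
quality_judge_openai = make_judge(
    name="response_quality",
    instructions=(
        "Evaluate if the response in {{ outputs }} correctly answers "
        "the question in {{ inputs }}."
    ),
    model="openai:/gpt-4o",  # illustrative model name
)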
Versioning Scorers
Building reliable scorers requires iterative refinement. Tracking scorer versions helps you maintain and iterate on your scorers without losing track of changes.
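One lightweight way to iterate, without any registry-specific API, is to create successive judge versions with refined instructions and evaluate them side by side so their results land in the same run. This is a sketch only; the version suffixes in the names are a naming convention assumed for illustration, and it reuses my_agent and test_data from the Quick Start.

import mlflow

# Two versions of the same judge, distinguished by name
quality_judge_v1 = make_judge(
    name="response_quality_v1",
    instructions="Evaluate if {{ outputs }} answers the question in {{ inputs }}.",
    model="anthropic:/claude-opus-4-1-20250805",
)

quality_judge_v2 = make_judge(
    name="response_quality_v2",
    instructions=(
        "Evaluate if {{ outputs }} answers the question in {{ inputs }}. "
        "Penalize vague or incomplete answers."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

# Evaluating both versions together makes it easy to compare which instructions work better
results = mlflow.genai.evaluate(
    data=test_data,
    predict_fn=my_agent,
    scorers=[quality_judge_v1, quality_judge_v2],
)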
Optimizing Instructions with Human Feedback
LLM judges have their own biases and errors, and relying on a biased evaluation leads to incorrect decisions. Use the Automatic Judge Alignment feature to optimize judge instructions so they align with human feedback, powered by a state-of-the-art algorithm from DSPy.