
Template-based LLM Scorers

The make_judge API is the recommended way to create custom LLM judges in MLflow. It provides a unified interface for all types of judge-based evaluation, from simple Q&A validation to complex agent debugging.

Version Requirements

The make_judge API requires MLflow >= 3.4.0. For earlier versions, use the deprecated custom_prompt_judge instead.

Quick Start

First, create a simple agent to evaluate:

# Create a toy agent that responds to questions
def my_agent(question):
    # Simple toy agent that echoes the question back
    return f"You asked about: {question}"

Then create a judge to evaluate the agent's responses:

from mlflow.genai.judges import make_judge

# Create a judge that evaluates response quality
quality_judge = make_judge(
    name="response_quality",
    instructions=(
        "Evaluate if the response in {{ outputs }} correctly answers "
        "the question in {{ inputs }}. The response should be accurate, "
        "complete, and professional."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

Now evaluate a single response from the agent:

# Get agent response
question = "What is machine learning?"
response = my_agent(question)

# Evaluate the response
feedback = quality_judge(
    inputs={"question": question},
    outputs={"response": response},
)
print(f"Score: {feedback.value}")
print(f"Rationale: {feedback.rationale}")

To evaluate a batch of responses and track results in MLflow, use the mlflow.genai.evaluate function.

import mlflow

test_data = [
    {"inputs": {"question": "What is machine learning?"}},
    {"inputs": {"question": "What is distributed computing?"}},
    {"inputs": {"question": "What is deep learning?"}},
]

results = mlflow.genai.evaluate(
    data=test_data,
    predict_fn=my_agent,
    scorers=[quality_judge],
)

Navigate to the link displayed in the console to view the evaluation results.


More Examples

quality_judge = make_judge(
    name="quality",
    instructions=(
        "Evaluate the quality of {{ outputs }} for the question in {{ inputs }}.\n"
        "Consider completeness, accuracy, and helpfulness.\n"
        "Rate as 'excellent', 'good', or 'needs improvement'."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

Template Format

Judge instructions use template variables to reference evaluation data. These variables are automatically filled with your data at runtime. Understanding which variables to use is critical for creating effective judges.

inputs: The input data provided to your AI system. Contains questions, prompts, or any data your model processes.
outputs: The generated response from your AI system. The actual output that needs evaluation.
expectations: Ground truth or expected outcomes. Reference answers for comparison and accuracy assessment.
Only Reserved Variables Allowed

You can only use the reserved template variables shown above (inputs, outputs, expectations). Custom variables like {{ question }} will cause validation errors. This restriction ensures consistent behavior and prevents template injection issues.
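
For example, when ground truth is available you can reference it through the expectations variable. The sketch below follows the same make_judge pattern shown in the Quick Start; the judge name, instructions, and example values are illustrative.

from mlflow.genai.judges import make_judge

# A correctness judge that compares the output against a reference answer
correctness_judge = make_judge(
    name="correctness",
    instructions=(
        "Compare the response in {{ outputs }} against the reference answer "
        "in {{ expectations }} for the question in {{ inputs }}. "
        "Rate as 'correct', 'partially_correct', or 'incorrect'."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

# Expectations are passed alongside inputs and outputs at evaluation time
feedback = correctness_judge(
    inputs={"question": "What is machine learning?"},
    outputs={"response": "Machine learning trains models to find patterns in data."},
    expectations={"expected_answer": "Machine learning is a branch of AI that learns patterns from data."},
)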

Selecting Judge Models

MLflow supports all major LLM providers, including OpenAI, Anthropic, Google, and xAI.

See Supported Models for more details.
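
The model argument is a URI of the form <provider>:/<model-name>, as in the Anthropic example above. Here is a minimal sketch pointing the same judge at a different provider; the OpenAI model name is illustrative, so use an identifier available in your account.

from mlflow.genai.judges import make_judge

# Same judge definition, backed by an OpenAI-hosted model instead of Anthropic
quality_judge_openai = make_judge(
    name="response_quality",
    instructions=(
        "Evaluate if the response in {{ outputs }} correctly answers "
        "the question in {{ inputs }}."
    ),
    model="openai:/gpt-4o",  # illustrative model identifier
)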

Versioning Scorers

Building reliable scorers requires iterative refinement. Tracking scorer versions lets you maintain and iterate on your scorers without losing track of changes.
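
See the scorer versioning documentation for the full workflow. As a minimal sketch of the idea, independent of any registry API, you can treat each revision of the instructions as a new judge and record the revision in the name (the _v2 naming here is illustrative):

# Each revision of the instructions becomes a new judge; the version suffix in
# the name (illustrative) makes it easy to compare runs scored by different revisions.
quality_judge_v2 = make_judge(
    name="response_quality_v2",
    instructions=(
        "Evaluate if the response in {{ outputs }} fully and accurately answers "
        "the question in {{ inputs }}. Penalize vague or incomplete answers."
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)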

Optimizing Instructions with Human Feedback

LLM judges can exhibit biases and make errors, and relying on a biased evaluation leads to incorrect decisions. Use the Automatic Judge Alignment feature to optimize your judge instructions against human feedback, powered by state-of-the-art optimization algorithms from DSPy.
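
A rough sketch of the workflow, assuming human feedback has already been logged as assessments on traces: the align call below is an assumption about the alignment API, so consult the Automatic Judge Alignment documentation for the exact method and arguments.

import mlflow

# Collect traces that carry human feedback for this experiment
traces = mlflow.search_traces(experiment_ids=["<your-experiment-id>"])

# Assumed API: refine the judge's instructions against the human feedback.
# The method name and signature here are assumptions; see the alignment docs.
aligned_judge = quality_judge.align(traces)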

Next Steps