Agent-based Scorer (a.k.a. Agent-as-a-Judge)

Agent-as-a-Judge represents a paradigm shift in LLM evaluation. Instead of simply assessing inputs and outputs, these judges act as autonomous agents equipped with tools to investigate your application's execution in depth.

How it works

Agent-as-a-Judge uses the following tools to investigate traces logged to the MLflow backend. These tools enable the judge to act like an experienced debugger, systematically exploring your application's execution.

Tool             | Description
GetTraceInfo     | Retrieves high-level information about a trace, including timing, status, and metadata.
ListSpans        | Lists all spans in a trace with their hierarchy, timing, and basic attributes.
GetSpan          | Fetches detailed information about a specific span, including inputs, outputs, and custom attributes.
SearchTraceRegex | Searches for patterns across all span data using regular expressions.
Why not pass the trace directly to the LLM?

While that works for simple cases, traces from real-world applications are often large and complex. Passing an entire trace to the LLM quickly runs into context window limits and degrades judge accuracy. The agentic approach instead uses tools to explore the trace structure and fetch only the details it needs, without consuming the context window.
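
For a rough sense of scale, you can measure how large a fully serialized trace is before deciding to put it into a prompt. A minimal sketch, assuming the Trace.to_json() serialization helper and an application that has already logged a trace:

import mlflow

# Fetch the most recently logged trace and check its serialized size.
# (Assumes Trace.to_json(); the actual size depends on your application.)
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())
print(f"Serialized trace size: {len(trace.to_json()):,} characters")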

Comparison with LLM-as-a-Judge

Understanding when to use each approach depends on where you are in your development lifecycle:

Aspect             | Agent-as-a-Judge                           | LLM-as-a-Judge
Ease of setup      | Simple: just describe what to investigate  | Requires careful prompt engineering and refinement
What they evaluate | Complete execution trace and trajectory    | Specific input and output fields
Performance        | Slower (explores the trace in detail)      | Fast execution
Cost               | Higher (more context and tool usage)       | Lower (less context)

When to use Agent-as-a-Judge?

Agent-as-a-Judge is well suited for bootstrapping the evaluation flywheel, particularly when you are:

  • Getting started with a new application
  • Revising and refining your agent
  • Identifying failure patterns
  • Understanding unexpected behavior

When to use LLM-as-a-Judge?

LLM-as-a-Judge is more efficient at evaluating a specific criterion, making it better suited for continuous evaluation and production use cases such as:

  • Production monitoring
  • Regression testing
  • Final validation before deployment
  • Meeting specific quality expectations

Quickstart

To create an Agent-as-a-Judge, call the make_judge API and pass instructions that include the {{ trace }} template variable:

import mlflow
from mlflow.genai.judges import make_judge
import time

performance_judge = make_judge(
    name="performance_analyzer",
    instructions=(
        "Analyze the {{ trace }} for performance issues.\n\n"
        "Check for:\n"
        "- Operations taking longer than 2 seconds\n"
        "- Redundant API calls or database queries\n"
        "- Inefficient data processing patterns\n"
        "- Proper use of caching mechanisms\n\n"
        "Rate as: 'optimal', 'acceptable', or 'needs_improvement'"
    ),
    model="openai:/gpt-5",
    # model="anthropic:/claude-opus-4-1-20250805",
)
note

The {{ trace }} template variable is what triggers agentic behavior. If the instructions do not contain {{ trace }}, MLflow treats the scorer as a regular LLM-as-a-Judge and does not use the MCP tools.
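
For contrast, here is a sketch of a regular LLM-as-a-Judge built with the same API but without {{ trace }} (this assumes the field-based {{ inputs }} and {{ outputs }} template variables):

# No {{ trace }} here, so MLflow evaluates only the provided fields
# and does not explore the trace with tools.
answer_quality_judge = make_judge(
    name="answer_quality",
    instructions=(
        "Evaluate whether the response in {{ outputs }} correctly and "
        "completely answers the question in {{ inputs }}.\n\n"
        "Rate as: 'correct', 'partially_correct', or 'incorrect'"
    ),
    model="openai:/gpt-5",
)

# Hypothetical invocation on explicit fields rather than a trace:
# feedback = answer_quality_judge(
#     inputs={"question": "What is MLflow?"},
#     outputs={"response": "MLflow is an open source MLOps platform."},
# )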

Then, generate a trace from your application and pass it to the scorer:

@mlflow.trace
def slow_data_processor(query: str):
    """Example application with performance bottlenecks."""
    with mlflow.start_span("fetch_data") as span:
        time.sleep(2.5)
        span.set_inputs({"query": query})
        span.set_outputs({"data": ["item1", "item2", "item3"]})

    with mlflow.start_span("process_data") as span:
        for i in range(3):
            with mlflow.start_span(f"redundant_api_call_{i}"):
                time.sleep(0.5)
        span.set_outputs({"processed": "results"})

    return "Processing complete"


result = slow_data_processor("SELECT * FROM users")
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)

feedback = performance_judge(trace=trace)

print(f"Performance Rating: {feedback.value}")
print(f"Analysis: {feedback.rationale}")
Performance Rating: needs_improvement
Analysis: Found critical performance issues:
1. The 'fetch_data' span took 2.5 seconds, exceeding the 2-second threshold
2. Detected 3 redundant API calls (redundant_api_call_0, redundant_api_call_1,
redundant_api_call_2) that appear to be duplicate operations
3. Total execution time of 4 seconds could be optimized by parallelizing
the redundant operations or implementing caching
[Image: Agent-as-a-Judge Evaluation Results]

Running the Judge against Batch Traces

To apply the scorer to a batch of traces, use the mlflow.genai.evaluate API.

import mlflow

# Retrieve traces from MLflow
traces = mlflow.search_traces(filter_string="timestamp > 1727174400000")

# Run evaluation with Agent-as-a-Judge
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[performance_judge],
)
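
If the filter matches a large number of traces, you may want to cap the batch size. A minimal sketch, assuming the max_results parameter of mlflow.search_traces:

# Evaluate only the 50 most recent traces that match the filter.
# (Assumes mlflow.search_traces accepts max_results.)
traces = mlflow.search_traces(
    filter_string="timestamp > 1727174400000",
    max_results=50,
)

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[performance_judge],
)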

Advanced Examples

The example below defines a judge that inspects tool usage patterns in an agent's trace and suggests optimizations:

tool_optimization_judge = make_judge(
    name="tool_optimizer",
    instructions=(
        "Analyze tool usage patterns in {{ trace }}.\n\n"
        "Check for:\n"
        "1. Unnecessary tool calls (could be answered without tools)\n"
        "2. Wrong tool selection (better tool available)\n"
        "3. Inefficient sequencing (could parallelize or reorder)\n"
        "4. Missing tool usage (should have used a tool)\n\n"
        "Provide specific optimization suggestions.\n"
        "Rate efficiency as: 'optimal', 'good', 'suboptimal', or 'poor'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
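
The judge is invoked the same way as the performance judge above, either on a single trace or across a batch:

# Score a single trace directly...
feedback = tool_optimization_judge(trace=trace)
print(f"Tool Efficiency: {feedback.value}")
print(f"Suggestions: {feedback.rationale}")

# ...or include it alongside other scorers in a batch evaluation.
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[performance_judge, tool_optimization_judge],
)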

Debugging Agent Judges

To see the actual MCP tool calls that the Agent-as-a-Judge makes while analyzing your trace, enable debug logging:

import logging

# Enable debug logging to see agent tool calls
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("mlflow.genai.judges")
logger.setLevel(logging.DEBUG)

# Now when you run the judge, you'll see detailed tool usage
feedback = performance_judge(trace=trace)

With debug logging enabled, you'll see output like:

DEBUG:mlflow.genai.judges:Calling tool: GetTraceInfo
DEBUG:mlflow.genai.judges:Tool response: {"trace_id": "abc123", "duration_ms": 4000, ...}
DEBUG:mlflow.genai.judges:Calling tool: ListSpans
DEBUG:mlflow.genai.judges:Tool response: [{"span_id": "def456", "name": "fetch_data", ...}]
DEBUG:mlflow.genai.judges:Calling tool: GetSpan with span_id=def456
DEBUG:mlflow.genai.judges:Tool response: {"duration_ms": 2500, "inputs": {"query": "SELECT * FROM users"}, ...}

Next Steps