Agent Trace Evaluation with TruLens Scorers in MLflow
MLflow's third-party scorer framework already supports LLM-as-a-judge evaluations from DeepEval, RAGAS, and Phoenix, an ecosystem with 32M+ monthly PyPI downloads. We're excited to announce the TruLens integration as we continue expanding support for third-party evaluation frameworks.
An agent doesn't just produce an answer. It makes a plan, picks tools, executes a multi-step workflow, and adapts when steps fail. A correct final answer can mask a flawed plan, redundant tool calls, or broken reasoning along the way. To catch those problems, you need to evaluate what happened inside the execution trace, not just what came out the other end.
The integration adds 10 scorers that bring the Agent GPA framework to MLflow. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs) to evaluate agent behavior. MLflow already supports trace-based judges and agentic metrics from DeepEval and RAGAS, but with the TruLens integration, MLflow now supports the structured three-dimensional lens (Goal, Plan, Action) developed by the TruLens team at Snowflake.
The Agent GPA Framework
GPA stands for Goal-Plan-Action, and it evaluates three alignment dimensions in an agent's execution:
Goal-Plan alignment asks: did the agent make a good strategy? An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
- PlanQuality checks whether the plan decomposes the goal into feasible subtasks.
- ToolSelection checks whether the agent picked the right tools for each subtask.
Plan-Action alignment asks: did the agent follow through? Did it skip steps, reorder things, or repeat work?
- PlanAdherence checks whether the agent's actual actions match its stated plan.
- ToolCalling checks whether function calls are valid, with correct parameters and complete inputs.
Holistic alignment looks at the trajectory as a whole.
- LogicalConsistency checks whether each step is coherent with prior context and reasoning.
- ExecutionEfficiency checks whether the agent reached the goal without redundant calls.
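The six agent scorers above group cleanly by dimension. As a quick reference, here is that mapping as a plain Python dict (the dimension keys are informal labels for this post, not API names):

```python
# The six GPA trace scorers grouped by alignment dimension.
# Dimension keys are descriptive labels, not MLflow/TruLens identifiers.
GPA_SCORERS = {
    "goal_plan": ["PlanQuality", "ToolSelection"],
    "plan_action": ["PlanAdherence", "ToolCalling"],
    "holistic": ["LogicalConsistency", "ExecutionEfficiency"],
}

total = sum(len(scorers) for scorers in GPA_SCORERS.values())
print(total)  # 6 agent trace scorers across 3 dimensions
```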
On the TRAIL benchmark, GPA judges identify 95% of human-labeled agent errors (267/281), compared to 55% for baseline trace-aware judges that also read the execution trace but lack the structured Goal-Plan-Action decomposition. That 40-percentage-point gap shows that reading the trace alone is not enough. How you structure the evaluation matters.
Pass a trace and the scorer handles the rest. Because agent traces often exceed LLM context windows, TruLens GPA scorers pre-process the trace to reduce its size while preserving key information including the agent plan and tool calls. Under the hood, the integration serializes your MLflow trace to JSON and passes the processed span tree to TruLens' provider, which evaluates each dimension with chain-of-thought reasoning. You get back a score and a rationale explaining what it found.
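To make the pre-processing idea concrete, here is a plain-Python sketch of compacting a span tree before serialization. Everything here is illustrative: `compact_spans`, the span dicts, and the truncation limit are hypothetical, and the actual TruLens/MLflow internals differ.

```python
import json

def compact_spans(spans, max_output_chars=200):
    """Keep plan and tool-call structure, truncate bulky outputs.

    Illustrative only: real MLflow spans are richer objects, and
    TruLens applies its own compaction logic.
    """
    compacted = []
    for span in spans:
        entry = {
            "name": span["name"],
            "type": span["type"],
            "inputs": span.get("inputs", {}),
        }
        output = str(span.get("output", ""))
        if len(output) > max_output_chars:
            output = output[:max_output_chars] + "...[truncated]"
        entry["output"] = output
        compacted.append(entry)
    return compacted

spans = [
    {"name": "plan", "type": "CHAIN",
     "output": "1) search flights 2) search hotels 3) book both"},
    {"name": "search_flights", "type": "TOOL",
     "inputs": {"from": "NYC", "to": "LAX"},
     "output": "x" * 5000},  # a bulky tool result
]

payload = json.dumps(compact_spans(spans))
print(len(payload) < 1000)  # True: far smaller, plan and tool calls intact
```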
How Trace Evaluation Catches What Output Evaluation Misses
Here's a concrete scenario. Say you have a travel-planning agent that should: (1) search for flights, (2) check hotel availability, (3) book both. The agent returns "Your trip is booked!" and it looks correct. But the trace tells a different story:
Span 1: search_flights("NYC", "LAX", "2026-04-01") -> 3 results
Span 2: search_flights("NYC", "LAX", "2026-04-01") -> 3 results <- duplicate
Span 3: book_flight(flight_id="FL123") -> confirmed
Span 4: search_hotels("LAX", "2026-04-01") -> 2 results
Span 5: book_hotel(hotel_id=None) -> error
Span 6: book_hotel(hotel_id="H456") -> confirmed
Output-only evaluation gives this a pass - the trip got booked. However, trace-level evaluation catches three problems:
- ExecutionEfficiency: redundant flight search (Span 2 duplicates Span 1)
- ToolCalling: book_hotel called with None before the retry (Span 5)
- PlanAdherence: the agent booked a flight (Span 3) before searching for hotels (Span 4)
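Two of those checks are easy to state mechanically, which helps build intuition for what the judges look for. A minimal rule-based sketch over a simplified span list (the tuples below are hypothetical; the GPA scorers use LLM chain-of-thought reasoning over the real trace, not rules):

```python
def lint_tool_calls(spans):
    """Flag duplicate tool calls and calls with a missing (None) argument.

    spans: list of (tool_name, args_dict) tuples in execution order.
    Illustrative only; the GPA judges reason over the full span tree.
    """
    issues = []
    seen = set()
    for i, (tool, args) in enumerate(spans, start=1):
        key = (tool, tuple(sorted(args.items())))
        if key in seen:
            issues.append(f"Span {i}: redundant call to {tool}")
        seen.add(key)
        if any(v is None for v in args.values()):
            issues.append(f"Span {i}: {tool} called with a None parameter")
    return issues

trace = [
    ("search_flights", {"from": "NYC", "to": "LAX", "date": "2026-04-01"}),
    ("search_flights", {"from": "NYC", "to": "LAX", "date": "2026-04-01"}),
    ("book_flight", {"flight_id": "FL123"}),
    ("search_hotels", {"city": "LAX", "date": "2026-04-01"}),
    ("book_hotel", {"hotel_id": None}),
    ("book_hotel", {"hotel_id": "H456"}),
]

for issue in lint_tool_calls(trace):
    print(issue)
# Span 2: redundant call to search_flights
# Span 5: book_hotel called with a None parameter
```

The out-of-order booking (PlanAdherence) is harder to capture with rules, since it requires comparing actions against the agent's stated plan, which is exactly where an LLM judge earns its keep.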
Combining Agent and RAG Evaluation
You can mix agent trace scorers, RAG scorers, and scorers from other frameworks in a single mlflow.genai.evaluate() call. The trace scorers read the span tree, while RAG scorers like Groundedness extract context from retrieval spans in the trace automatically. All scorers support a model parameter for choosing your LLM provider (OpenAI, Anthropic, or any LiteLLM-compatible provider).
import mlflow
from mlflow.genai.scorers.trulens import (
    Groundedness,
    PlanAdherence,
    ExecutionEfficiency,
)
from mlflow.genai.scorers.phoenix import Hallucination

traces = mlflow.search_traces(locations=["..."])

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        # Agent behavior (reads the full span tree)
        PlanAdherence(model="openai:/gpt-5.2"),
        ExecutionEfficiency(model="openai:/gpt-5.2"),
        # RAG quality (extracts context from retrieval spans)
        Groundedness(model="openai:/gpt-5.2"),
        # Content quality (Phoenix)
        Hallucination(model="openai:/gpt-5.2"),
    ],
)
Each scorer runs independently and writes results to the same experiment. Results land in the MLflow assessment table alongside any other evaluation results.
The trace detail view shows the full span tree on the left and TruLens GPA assessments on the right. Expand any assessment to see the chain-of-thought reasoning behind the score.
Getting Started
To get started, install MLflow and TruLens with the LiteLLM provider. For full API details, see the TruLens scorers documentation.
pip install "mlflow>=3.10.0" trulens trulens-providers-litellm
from mlflow.genai.scorers.trulens import PlanAdherence, Groundedness
# Agent trace scorer
scorer = PlanAdherence(model="openai:/gpt-5.2")
feedback = scorer(trace=my_agent_trace)
print(feedback.value) # "yes" or "no" based on threshold
print(feedback.rationale) # Chain-of-thought reasoning
# RAG scorer (extracts context from retrieval spans in trace)
scorer = Groundedness(model="openai:/gpt-5.2", threshold=0.6)
feedback = scorer(trace=my_rag_trace)
print(feedback.value) # "yes" or "no"
print(feedback.rationale) # Why it passed or failed
print(feedback.metadata["score"]) # 0.85
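Once you have feedback objects for a batch of traces, aggregating a pass rate is straightforward. A small sketch, using a stand-in dataclass since the real objects are MLflow Feedback instances with the same `value`/`rationale` fields shown above:

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    """Stand-in for a scorer result; real scorers return MLflow Feedback objects."""
    value: str      # "yes" or "no"
    rationale: str

def pass_rate(feedbacks):
    """Fraction of traces that passed the scorer's threshold."""
    passed = sum(1 for f in feedbacks if f.value == "yes")
    return passed / len(feedbacks)

results = [
    Feedback("yes", "actions match the stated plan"),
    Feedback("no", "hotel search step was skipped"),
    Feedback("yes", "actions match the stated plan"),
    Feedback("yes", "actions match the stated plan"),
]
print(pass_rate(results))  # 0.75
```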
Resources
- Third-Party Scorers Overview
- Trace-Based Judges
- Agent GPA Framework (Snowflake Engineering Blog)
- Trace-Aware Agent Evaluation for MLflow (Snowflake Engineering Blog)
- Agent GPA Paper (arXiv)
- TRAIL: Trace Reasoning and Agentic Issue Localization (arXiv)
- TruLens MLflow Integration Documentation
- Introducing DeepEval, RAGAS, and Phoenix Judges in MLflow
Provenance
I (Debu Sinha) contributed the TruLens integration (PR #19492) to MLflow's open-source third-party scorer framework, adding 10 scorers: 4 RAG metrics and 6 agent trace evaluators based on the Agent GPA framework. The integration went through four review rounds with Samraj Moorjani (Software Engineer at Databricks, MLflow maintainer), with final approval from Avesh C. Singh (Software Engineer at Databricks). It follows the scorer pattern Moorjani established in the DeepEval and RAGAS integrations and extends it to agent trace evaluation, a category that requires reading the full span tree rather than just inputs and outputs.
Josh Reini (TruLens maintainer, Snowflake) reviewed the integration's scorer semantics and validated the trace-aware evaluation behavior. Reini published a companion post on the Snowflake Engineering Blog covering the Agent GPA research and TRAIL benchmark results in depth. A cross-project documentation PR was also merged into the TruLens repository.
Related artifacts:
- Upstream MLflow TruLens PR #19492 (merged)
- TruLens documentation PR #2344 (merged, cross-project)
- Introducing DeepEval, RAGAS, and Phoenix Judges in MLflow (companion blog)

