Agent Trace Evaluation with TruLens Scorers in MLflow
MLflow's third-party scorer framework already supports LLM-as-a-judge evaluations from DeepEval, RAGAS, and Phoenix, an ecosystem with 32M+ monthly PyPI downloads. We're excited to announce the TruLens integration as we continue expanding support for third-party evaluation frameworks.
An agent doesn't just produce an answer. It makes a plan, picks tools, executes a multi-step workflow, and adapts when steps fail. A correct final answer can mask a flawed plan, redundant tool calls, or broken reasoning along the way. To catch those problems, you need to evaluate what happened inside the execution trace, not just what came out the other end.
The integration adds 10 scorers that bring the Agent GPA framework to MLflow. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs) to evaluate agent behavior. MLflow already supports trace-based judges and agentic metrics from DeepEval and RAGAS, but with the TruLens integration, MLflow now supports the structured three-dimensional lens (Goal, Plan, Action) developed by the TruLens team at Snowflake.
The Agent GPA Framework
GPA stands for Goal-Plan-Action, and it evaluates three alignment dimensions in an agent's execution:
Goal-Plan alignment asks: did the agent make a good strategy? An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
- PlanQuality checks whether the plan decomposes the goal into feasible subtasks.
- ToolSelection checks whether the agent picked the right tools for each subtask.
Plan-Action alignment asks: did the agent follow through? Did it skip steps, reorder things, or repeat work?
- PlanAdherence checks whether the agent's actual actions match its stated plan.
- ToolCalling checks whether function calls are valid, with correct parameters and complete inputs.
Holistic alignment looks at the trajectory as a whole.
- LogicalConsistency checks whether each step is coherent with prior context and reasoning.
- ExecutionEfficiency checks whether the agent reached the goal without redundant calls.
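The six agent scorers above group cleanly by dimension. As a quick reference, here is that mapping as a plain Python dict (the dimension keys are informal labels for this post, not API names):

```python
# The six GPA trace scorers grouped by alignment dimension.
# Dimension keys are descriptive labels, not MLflow/TruLens identifiers.
GPA_SCORERS = {
    "goal_plan": ["PlanQuality", "ToolSelection"],
    "plan_action": ["PlanAdherence", "ToolCalling"],
    "holistic": ["LogicalConsistency", "ExecutionEfficiency"],
}

total = sum(len(scorers) for scorers in GPA_SCORERS.values())
print(total)  # 6 agent trace scorers across 3 dimensions
```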
On the TRAIL benchmark, GPA judges identify 95% of human-labeled agent errors (267/281), compared to 55% for baseline trace-aware judges that also read the execution trace but lack the structured Goal-Plan-Action decomposition. That 40-percentage-point gap shows that reading the trace alone is not enough. How you structure the evaluation matters.
Pass a trace and the scorer handles the rest. Because agent traces often exceed LLM context windows, TruLens GPA scorers pre-process the trace to reduce its size while preserving key information including the agent plan and tool calls. Under the hood, the integration serializes your MLflow trace to JSON and passes the processed span tree to TruLens' provider, which evaluates each dimension with chain-of-thought reasoning. You get back a score and a rationale explaining what it found.
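To make the pre-processing idea concrete, here is a plain-Python sketch of compacting a span tree before serialization. Everything here is illustrative: `compact_spans`, the span dicts, and the truncation limit are hypothetical, and the actual TruLens/MLflow internals differ.

```python
import json

def compact_spans(spans, max_output_chars=200):
    """Keep plan and tool-call structure, truncate bulky outputs.

    Illustrative only: real MLflow spans are richer objects, and
    TruLens applies its own compaction logic.
    """
    compacted = []
    for span in spans:
        entry = {
            "name": span["name"],
            "type": span["type"],
            "inputs": span.get("inputs", {}),
        }
        output = str(span.get("output", ""))
        if len(output) > max_output_chars:
            output = output[:max_output_chars] + "...[truncated]"
        entry["output"] = output
        compacted.append(entry)
    return compacted

spans = [
    {"name": "plan", "type": "CHAIN",
     "output": "1) search flights 2) search hotels 3) book both"},
    {"name": "search_flights", "type": "TOOL",
     "inputs": {"from": "NYC", "to": "LAX"},
     "output": "x" * 5000},  # a bulky tool result
]

payload = json.dumps(compact_spans(spans))
print(len(payload) < 1000)  # True: far smaller, plan and tool calls intact
```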
How Trace Evaluation Catches What Output Evaluation Misses
Here's a concrete scenario. Say you have a travel-planning agent that should: (1) search for flights, (2) check hotel availability, (3) book both. The agent returns "Your trip is booked!" and it looks correct. But the trace tells a different story:
Span 1: search_flights("NYC", "LAX", "2026-04-01") -> 3 results
Span 2: search_flights("NYC", "LAX", "2026-04-01") -> 3 results <- duplicate
Span 3: book_flight(flight_id="FL123") -> confirmed
Span 4: search_hotels("LAX", "2026-04-01") -> 2 results
Span 5: book_hotel(hotel_id=None) -> error
Span 6: book_hotel(hotel_id="H456") -> confirmed
Output-only evaluation gives this a pass - the trip got booked. However, trace-level evaluation catches three problems:
- ExecutionEfficiency: redundant flight search (Span 2 duplicates Span 1)
- ToolCalling: book_hotel called with None before the retry (Span 5)
- PlanAdherence: the agent booked a flight (Span 3) before searching for hotels (Span 4)
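Two of those checks are easy to state mechanically, which helps build intuition for what the judges look for. A minimal rule-based sketch over a simplified span list (the tuples below are hypothetical; the GPA scorers use LLM chain-of-thought reasoning over the real trace, not rules):

```python
def lint_tool_calls(spans):
    """Flag duplicate tool calls and calls with a missing (None) argument.

    spans: list of (tool_name, args_dict) tuples in execution order.
    Illustrative only; the GPA judges reason over the full span tree.
    """
    issues = []
    seen = set()
    for i, (tool, args) in enumerate(spans, start=1):
        key = (tool, tuple(sorted(args.items())))
        if key in seen:
            issues.append(f"Span {i}: redundant call to {tool}")
        seen.add(key)
        if any(v is None for v in args.values()):
            issues.append(f"Span {i}: {tool} called with a None parameter")
    return issues

trace = [
    ("search_flights", {"from": "NYC", "to": "LAX", "date": "2026-04-01"}),
    ("search_flights", {"from": "NYC", "to": "LAX", "date": "2026-04-01"}),
    ("book_flight", {"flight_id": "FL123"}),
    ("search_hotels", {"city": "LAX", "date": "2026-04-01"}),
    ("book_hotel", {"hotel_id": None}),
    ("book_hotel", {"hotel_id": "H456"}),
]

for issue in lint_tool_calls(trace):
    print(issue)
# Span 2: redundant call to search_flights
# Span 5: book_hotel called with a None parameter
```

The out-of-order booking (PlanAdherence) is harder to capture with rules, since it requires comparing actions against the agent's stated plan, which is exactly where an LLM judge earns its keep.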
Combining Agent and RAG Evaluation
You can mix agent trace scorers, RAG scorers, and scorers from other frameworks in a single mlflow.genai.evaluate() call. The trace scorers read the span tree, while RAG scorers like Groundedness extract context from retrieval spans in the trace automatically. All scorers support a model parameter for choosing your LLM provider (OpenAI, Anthropic, or any LiteLLM-compatible provider).
import mlflow
from mlflow.genai.scorers.trulens import (
    Groundedness,
    PlanAdherence,
    ExecutionEfficiency,
)
from mlflow.genai.scorers.phoenix import Hallucination

traces = mlflow.search_traces(locations=["..."])

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        # Agent behavior (reads the full span tree)
        PlanAdherence(model="openai:/gpt-5.2"),
        ExecutionEfficiency(model="openai:/gpt-5.2"),
        # RAG quality (extracts context from retrieval spans)
        Groundedness(model="openai:/gpt-5.2"),
        # Content quality (Phoenix)
        Hallucination(model="openai:/gpt-5.2"),
    ],
)
Each scorer runs independently and writes results to the same experiment. Results land in the MLflow assessment table alongside any other evaluation results.
The trace detail view shows the full span tree on the left and TruLens GPA assessments on the right. Expand any assessment to see the chain-of-thought reasoning behind the score.
Getting Started
To get started, install MLflow and TruLens with the LiteLLM provider. For full API details, see the TruLens scorers documentation.
pip install "mlflow>=3.10.0" trulens trulens-providers-litellm
from mlflow.genai.scorers.trulens import PlanAdherence, Groundedness
# Agent trace scorer
scorer = PlanAdherence(model="openai:/gpt-5.2")
feedback = scorer(trace=my_agent_trace)
print(feedback.value) # "yes" or "no" based on threshold
print(feedback.rationale) # Chain-of-thought reasoning
# RAG scorer (extracts context from retrieval spans in trace)
scorer = Groundedness(model="openai:/gpt-5.2", threshold=0.6)
feedback = scorer(trace=my_rag_trace)
print(feedback.value) # "yes" or "no"
print(feedback.rationale) # Why it passed or failed
print(feedback.metadata["score"]) # 0.85
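Once you have feedback objects for a batch of traces, aggregating a pass rate is straightforward. A small sketch, using a stand-in dataclass since the real objects are MLflow Feedback instances with the same `value`/`rationale` fields shown above:

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    """Stand-in for a scorer result; real scorers return MLflow Feedback objects."""
    value: str      # "yes" or "no"
    rationale: str

def pass_rate(feedbacks):
    """Fraction of traces that passed the scorer's threshold."""
    passed = sum(1 for f in feedbacks if f.value == "yes")
    return passed / len(feedbacks)

results = [
    Feedback("yes", "actions match the stated plan"),
    Feedback("no", "hotel search step was skipped"),
    Feedback("yes", "actions match the stated plan"),
    Feedback("yes", "actions match the stated plan"),
]
print(pass_rate(results))  # 0.75
```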
Resources
- Third-Party Scorers Overview
- Trace-Based Judges
- Agent GPA Framework (Snowflake Engineering Blog)
- Trace-Aware Agent Evaluation for MLflow (Snowflake Engineering Blog)
- Agent GPA Paper (arXiv)
- TRAIL: Trace Reasoning and Agentic Issue Localization (arXiv)
- TruLens MLflow Integration Documentation
- Introducing DeepEval, RAGAS, and Phoenix Judges in MLflow
Provenance
I (Debu Sinha) contributed the TruLens integration (PR #19492) to MLflow's open-source third-party scorer framework, adding 10 scorers: 4 RAG metrics and 6 agent trace evaluators based on the Agent GPA framework. The integration went through four review rounds with Samraj Moorjani (Software Engineer at Databricks, MLflow maintainer), with final approval from Avesh C. Singh (Software Engineer at Databricks). It follows the scorer pattern Moorjani established in the DeepEval and RAGAS integrations and extends it to agent trace evaluation, a category that requires reading the full span tree rather than just inputs and outputs.
Josh Reini (TruLens maintainer, Snowflake) reviewed the integration's scorer semantics and validated the trace-aware evaluation behavior. Reini published a companion post on the Snowflake Engineering Blog covering the Agent GPA research and TRAIL benchmark results in depth. A cross-project documentation PR was also merged into the TruLens repository.
Related artifacts:
- Upstream MLflow TruLens PR #19492 (merged)
- TruLens documentation PR #2344 (merged, cross-project)
- Introducing DeepEval, RAGAS, and Phoenix Judges in MLflow (companion blog)

