DeepEval
DeepEval is a comprehensive evaluation framework for LLM applications that provides metrics for RAG systems, agents, conversational AI, and safety. MLflow's DeepEval integration lets you use most DeepEval metrics as MLflow scorers.
Prerequisites
DeepEval scorers require the deepeval package:
```bash
pip install deepeval
```
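Most DeepEval metrics are LLM-as-a-judge metrics, so the judge model you configure also needs credentials. For example, with an OpenAI-backed judge you would typically set OPENAI_API_KEY before running the scorers (shown here in Python for illustration; setting the variable in your shell works just as well):

```python
import os

# Credentials for the judge model used by LLM-based metrics.
# (Example for an OpenAI judge; other providers use their own variables.)
os.environ["OPENAI_API_KEY"] = "<your-api-key>"
```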
Quick Start
You can call DeepEval scorers directly:
```python
from mlflow.genai.scorers.deepeval import AnswerRelevancy

scorer = AnswerRelevancy(threshold=0.7, model="openai:/gpt-4")

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is an open-source platform for managing machine learning workflows.",
)

print(feedback.value)  # "yes" or "no"
print(feedback.metadata["score"])  # 0.85
```
Or use them in mlflow.genai.evaluate:
```python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy, Faithfulness

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing machine learning workflows.",
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        AnswerRelevancy(threshold=0.7, model="openai:/gpt-4"),
        Faithfulness(threshold=0.8, model="openai:/gpt-4"),
    ],
)
```
Available DeepEval Scorers
DeepEval scorers are organized into categories based on their evaluation focus:
RAG (Retrieval-Augmented Generation) Metrics
Evaluate retrieval quality and answer generation in RAG systems:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| AnswerRelevancy | Is the output relevant to the input query? | Link |
| Faithfulness | Is the output factually consistent with retrieval context? | Link |
| ContextualRecall | Does retrieval context contain all necessary information? | Link |
| ContextualPrecision | Are relevant nodes ranked higher than irrelevant ones? | Link |
| ContextualRelevancy | Is the retrieval context relevant to the query? | Link |
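Metrics such as Faithfulness and the contextual metrics also need the retrieved documents, not just the query and answer. In MLflow that context is typically recorded on traces (for example, by retriever spans), so one common pattern is to evaluate logged traces instead of a static dataset. The sketch below assumes your RAG app is already instrumented with MLflow Tracing and that the integration can pull the retrieval context from those traces:

```python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy, Faithfulness

# Traces previously logged by your instrumented RAG application
traces = mlflow.search_traces(experiment_ids=["<your-experiment-id>"])

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        AnswerRelevancy(threshold=0.7, model="openai:/gpt-4"),
        Faithfulness(threshold=0.8, model="openai:/gpt-4"),
    ],
)
```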
Agentic Metrics
Evaluate AI agent performance and behavior:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| TaskCompletion | Does the agent successfully complete its assigned task? | Link |
| ToolCorrectness | Does the agent use the correct tools? | Link |
| ArgumentCorrectness | Are tool arguments correct? | Link |
| StepEfficiency | Does the agent take an optimal path? | Link |
| PlanAdherence | Does the agent follow its plan? | Link |
| PlanQuality | Is the agent's plan well-structured? | Link |
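Agentic metrics generally need to see what the agent actually did (its intermediate steps and tool calls), which MLflow records on traces. As a rough sketch, assuming these scorers can work from a trace passed directly to the scorer call:

```python
import mlflow
from mlflow.genai.scorers.deepeval import TaskCompletion

# Fetch a trace produced by your instrumented agent
trace = mlflow.get_trace("<trace-id>")

task_completion = TaskCompletion(threshold=0.7, model="openai:/gpt-4")
feedback = task_completion(trace=trace)
print(feedback.value)
```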
Conversational Metrics
Evaluate multi-turn conversations and dialogue systems:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| TurnRelevancy | Is each turn relevant to the conversation? | Link |
| RoleAdherence | Does the assistant maintain its assigned role? | Link |
| KnowledgeRetention | Does the agent retain information across turns? | Link |
| ConversationCompleteness | Are all user questions addressed? | Link |
| GoalAccuracy | Does the conversation achieve its goal? | Link |
| ToolUse | Does the agent use tools appropriately in conversation? | Link |
| TopicAdherence | Does the conversation stay on topic? | Link |
Safety Metrics
Detect harmful content, bias, and policy violations:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| Bias | Does the output contain biased content? | Link |
| Toxicity | Does the output contain toxic language? | Link |
| NonAdvice | Does the model inappropriately provide advice in restricted domains? | Link |
| Misuse | Could the output be used for harmful purposes? | Link |
| PIILeakage | Does the output leak personally identifiable information? | Link |
| RoleViolation | Does the assistant break out of its assigned role? | Link |
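Safety scorers follow the same direct-call pattern as the Quick Start example and evaluate the input/output text itself. A minimal sketch using Toxicity (illustrative; the other safety scorers are used the same way, though some may not need the input):

```python
from mlflow.genai.scorers.deepeval import Toxicity

toxicity = Toxicity(threshold=0.5, model="openai:/gpt-4")

feedback = toxicity(
    inputs="How do I reset my password?",
    outputs="Just click the 'Forgot password' link on the sign-in page.",
)
print(feedback.value)  # passes when no toxic language is detected
```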
Other
Additional evaluation metrics for common use cases:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| Hallucination | Does the LLM fabricate information not in the context? | Link |
| Summarization | Is the summary accurate and complete? | Link |
| JsonCorrectness | Does JSON output match the expected schema? | Link |
| PromptAlignment | Does the output align with prompt instructions? | Link |
Non-LLM
Fast, rule-based metrics that don't require LLM calls:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| ExactMatch | Does output exactly match expected output? | Link |
| PatternMatch | Does output match a regex pattern? | Link |
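These scorers compare the output against a reference rather than calling a judge model. A sketch using ExactMatch, assuming the reference answer is supplied through the scorer's expectations argument (the expectations key shown here is an assumption; check the scorer's signature for the exact field name):

```python
from mlflow.genai.scorers.deepeval import ExactMatch

exact_match = ExactMatch()

feedback = exact_match(
    outputs="MLflow Tracking",
    # Hypothetical expectations key; adjust to the scorer's actual signature.
    expectations={"expected_output": "MLflow Tracking"},
)
print(feedback.value)
```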
Creating Scorers by Name
You can also create DeepEval scorers dynamically using get_scorer:
```python
from mlflow.genai.scorers.deepeval import get_scorer

# Create scorer by name
scorer = get_scorer(
    metric_name="AnswerRelevancy",
    threshold=0.7,
    model="openai:/gpt-4",
)

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
)
```
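This is convenient when the set of metrics is configuration-driven. For example, you can build several scorers from a list of names and pass them straight to mlflow.genai.evaluate (reusing the eval_dataset from the Quick Start above):

```python
import mlflow
from mlflow.genai.scorers.deepeval import get_scorer

metric_names = ["AnswerRelevancy", "Faithfulness", "Toxicity"]
scorers = [get_scorer(metric_name=name, model="openai:/gpt-4") for name in metric_names]

results = mlflow.genai.evaluate(data=eval_dataset, scorers=scorers)
```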
Configuration
DeepEval scorers accept all parameters supported by the underlying DeepEval metrics. Any additional keyword arguments are passed directly to the DeepEval metric constructor:
```python
from mlflow.genai.scorers.deepeval import AnswerRelevancy, TurnRelevancy

# Common parameters
scorer = AnswerRelevancy(
    model="openai:/gpt-4",  # Model URI (also supports "databricks", "databricks:/endpoint", etc.)
    threshold=0.7,  # Pass/fail threshold (0.0-1.0, scorer passes if score >= threshold)
    include_reason=True,  # Include detailed rationale in feedback
)

# Metric-specific parameters are passed through to DeepEval
conversational_scorer = TurnRelevancy(
    model="openai:/gpt-4o",
    threshold=0.8,
    window_size=3,  # DeepEval-specific: number of conversation turns to consider
    strict_mode=True,  # DeepEval-specific: enforce stricter evaluation criteria
)
```
Refer to the DeepEval documentation for metric-specific parameters.