DeepEval
DeepEval is a comprehensive evaluation framework for LLM applications that provides metrics for RAG systems, agents, conversational AI, and safety. MLflow's DeepEval integration lets you use most DeepEval metrics as MLflow scorers.
Prerequisites
DeepEval scorers require the deepeval package:
```bash
pip install deepeval
```
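Most DeepEval metrics are LLM-as-a-judge metrics, so the judge model you configure also needs credentials. For example, with an OpenAI-backed judge you would typically set OPENAI_API_KEY before running the scorers (shown here in Python for illustration; setting the variable in your shell works just as well):

```python
import os

# Credentials for the judge model used by LLM-based metrics.
# (Example for an OpenAI judge; other providers use their own variables.)
os.environ["OPENAI_API_KEY"] = "<your-api-key>"
```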
Quick Start
You can call DeepEval scorers directly:
```python
from mlflow.genai.scorers.deepeval import AnswerRelevancy

scorer = AnswerRelevancy(threshold=0.7, model="openai:/gpt-4")

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is an open-source platform for managing machine learning workflows.",
)

print(feedback.value)  # "yes" or "no"
print(feedback.metadata["score"])  # 0.85
```
Or use them in mlflow.genai.evaluate:
```python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy, Faithfulness

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing machine learning workflows.",
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        AnswerRelevancy(threshold=0.7, model="openai:/gpt-4"),
        Faithfulness(threshold=0.8, model="openai:/gpt-4"),
    ],
)
```
Available DeepEval Scorers
DeepEval scorers are organized into categories based on their evaluation focus:
RAG (Retrieval-Augmented Generation) Metrics
Evaluate retrieval quality and answer generation in RAG systems:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| AnswerRelevancy | Is the output relevant to the input query? | Link |
| Faithfulness | Is the output factually consistent with retrieval context? | Link |
| ContextualRecall | Does retrieval context contain all necessary information? | Link |
| ContextualPrecision | Are relevant nodes ranked higher than irrelevant ones? | Link |
| ContextualRelevancy | Is the retrieval context relevant to the query? | Link |
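Metrics such as Faithfulness and the contextual metrics also need the retrieved documents, not just the query and answer. In MLflow that context is typically recorded on traces (for example, by retriever spans), so one common pattern is to evaluate logged traces instead of a static dataset. The sketch below assumes your RAG app is already instrumented with MLflow Tracing and that the integration can pull the retrieval context from those traces:

```python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy, Faithfulness

# Traces previously logged by your instrumented RAG application
traces = mlflow.search_traces(experiment_ids=["<your-experiment-id>"])

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        AnswerRelevancy(threshold=0.7, model="openai:/gpt-4"),
        Faithfulness(threshold=0.8, model="openai:/gpt-4"),
    ],
)
```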
Agentic Metrics
Evaluate AI agent performance and behavior:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| TaskCompletion | Does the agent successfully complete its assigned task? | Link |
| ToolCorrectness | Does the agent use the correct tools? | Link |
| ArgumentCorrectness | Are tool arguments correct? | Link |
| StepEfficiency | Does the agent take an optimal path? | Link |
| PlanAdherence | Does the agent follow its plan? | Link |
| PlanQuality | Is the agent's plan well-structured? | Link |
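Agentic metrics generally need to see what the agent actually did (its intermediate steps and tool calls), which MLflow records on traces. As a rough sketch, assuming these scorers can work from a trace passed directly to the scorer call:

```python
import mlflow
from mlflow.genai.scorers.deepeval import TaskCompletion

# Fetch a trace produced by your instrumented agent
trace = mlflow.get_trace("<trace-id>")

task_completion = TaskCompletion(threshold=0.7, model="openai:/gpt-4")
feedback = task_completion(trace=trace)
print(feedback.value)
```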
Conversational Metrics
Evaluate multi-turn conversations and dialogue systems:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| TurnRelevancy | Is each turn relevant to the conversation? | Link |
| RoleAdherence | Does the assistant maintain its assigned role? | Link |
| KnowledgeRetention | Does the agent retain information across turns? | Link |
| ConversationCompleteness | Are all user questions addressed? | Link |
| GoalAccuracy | Does the conversation achieve its goal? | Link |
| ToolUse | Does the agent use tools appropriately in conversation? | Link |
| TopicAdherence | Does the conversation stay on topic? | Link |
Safety Metrics
Detect harmful content, bias, and policy violations:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| Bias | Does the output contain biased content? | Link |
| Toxicity | Does the output contain toxic language? | Link |
| NonAdvice | Does the model inappropriately provide advice in restricted domains? | Link |
| Misuse | Could the output be used for harmful purposes? | Link |
| PIILeakage | Does the output leak personally identifiable information? | Link |
| RoleViolation | Does the assistant break out of its assigned role? | Link |
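Safety scorers follow the same direct-call pattern as the Quick Start example and evaluate the input/output text itself. A minimal sketch using Toxicity (illustrative; the other safety scorers are used the same way, though some may not need the input):

```python
from mlflow.genai.scorers.deepeval import Toxicity

toxicity = Toxicity(threshold=0.5, model="openai:/gpt-4")

feedback = toxicity(
    inputs="How do I reset my password?",
    outputs="Just click the 'Forgot password' link on the sign-in page.",
)
print(feedback.value)  # passes when no toxic language is detected
```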
Other
Additional evaluation metrics for common use cases:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| Hallucination | Does the LLM fabricate information not in the context? | Link |
| Summarization | Is the summary accurate and complete? | Link |
| JsonCorrectness | Does JSON output match the expected schema? | Link |
| PromptAlignment | Does the output align with prompt instructions? | Link |
Non-LLM
Fast, rule-based metrics that don't require LLM calls:
| Scorer | What does it evaluate? | DeepEval Docs |
|---|---|---|
| ExactMatch | Does output exactly match expected output? | Link |
| PatternMatch | Does output match a regex pattern? | Link |
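These scorers compare the output against a reference rather than calling a judge model. A sketch using ExactMatch, assuming the reference answer is supplied through the scorer's expectations argument (the expectations key shown here is an assumption; check the scorer's signature for the exact field name):

```python
from mlflow.genai.scorers.deepeval import ExactMatch

exact_match = ExactMatch()

feedback = exact_match(
    outputs="MLflow Tracking",
    # Hypothetical expectations key; adjust to the scorer's actual signature.
    expectations={"expected_output": "MLflow Tracking"},
)
print(feedback.value)
```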
Creating Scorers by Name
You can also create DeepEval scorers dynamically using get_scorer:
```python
from mlflow.genai.scorers.deepeval import get_scorer

# Create scorer by name
scorer = get_scorer(
    metric_name="AnswerRelevancy",
    threshold=0.7,
    model="openai:/gpt-4",
)

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
)
```
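This is convenient when the set of metrics is configuration-driven. For example, you can build several scorers from a list of names and pass them straight to mlflow.genai.evaluate (reusing the eval_dataset from the Quick Start above):

```python
import mlflow
from mlflow.genai.scorers.deepeval import get_scorer

metric_names = ["AnswerRelevancy", "Faithfulness", "Toxicity"]
scorers = [get_scorer(metric_name=name, model="openai:/gpt-4") for name in metric_names]

results = mlflow.genai.evaluate(data=eval_dataset, scorers=scorers)
```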
Configuration
DeepEval scorers accept all parameters supported by the underlying DeepEval metrics. Any additional keyword arguments are passed directly to the DeepEval metric constructor:
```python
from mlflow.genai.scorers.deepeval import AnswerRelevancy, TurnRelevancy

# Common parameters
scorer = AnswerRelevancy(
    model="openai:/gpt-4",  # Model URI (also supports "databricks", "databricks:/endpoint", etc.)
    threshold=0.7,  # Pass/fail threshold (0.0-1.0, scorer passes if score >= threshold)
    include_reason=True,  # Include detailed rationale in feedback
)

# Metric-specific parameters are passed through to DeepEval
conversational_scorer = TurnRelevancy(
    model="openai:/gpt-4o",
    threshold=0.8,
    window_size=3,  # DeepEval-specific: number of conversation turns to consider
    strict_mode=True,  # DeepEval-specific: enforce stricter evaluation criteria
)
```
Refer to the DeepEval documentation for metric-specific parameters.