Skip to main content

Genie Evaluation with LLM Judges

· 3 min read

With traced Databricks Genie conversations from the Conversation Tracing Pipeline, you can now score each message to find out which ones have quality issues and why. This cookbook runs three types of checks:

  • Built-in judges check relevance, safety, and whether Genie's answers are grounded in retrieved data.
  • Custom judges check Genie-specific quality like response usefulness and SQL correctness.
  • Code-based scorers run deterministic checks with zero LLM cost.

Every scorer returns "yes" (pass) or "no" (fail). The Space Improvement Generator reads these results and generates fixes for the Genie conversations that failed.

Prerequisites

You need traces from the Conversation Tracing Pipeline logged to an MLflow experiment.

Step 1: Set Up the Experiment

Point to the same MLflow experiment where the tracing pipeline logged its traces.

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import (
Guidelines,
RelevanceToQuery,
RetrievalGroundedness,
Safety,
scorer,
)

EXPERIMENT_NAME = "/Users/your-user-name/genie_eval"
mlflow.set_experiment(EXPERIMENT_NAME)

Step 2: Define LLM Judges

These built-in scorers automatically extract inputs and outputs from traces. No labels required.

relevance = RelevanceToQuery()
safety = Safety()
groundedness = RetrievalGroundedness()
  • RelevanceToQuery -is the response directly relevant to the user's question?
  • Safety -is the content free from harmful material?
  • RetrievalGroundedness -is the response grounded in the retrieved data?

Step 3: Define Custom Judges

Guidelines lets you define pass/fail rules in plain English for Genie-specific quality.

response_quality = Guidelines(
name="genie_response_quality",
guidelines=[
"The response must directly address the user's data question "
"rather than giving a vague or generic reply.",
"If SQL was generated, the response must include a data-driven "
"answer, not just echo the SQL query back.",
"The response must not say 'I cannot answer' when the question "
"is about data that should be available in the tables.",
],
)

sql_quality = Guidelines(
name="genie_sql_quality",
guidelines=[
"If SQL is present, it must use appropriate aggregation "
"functions (SUM, COUNT, AVG) matching the user's intent.",
"The SQL must include appropriate WHERE clauses to filter "
"data as the user requested.",
"The SQL must not use SELECT * on large tables without a "
"LIMIT or specific filter.",
],
)

Step 4: Define Code-Based Scorers

These run deterministically with zero LLM cost.

@scorer
def has_response(outputs) -> Feedback:
"""Check if Genie returned a text response."""
resp = outputs.get("response") if isinstance(outputs, dict) else None
if resp and len(str(resp).strip()) > 0:
return Feedback(value="yes", rationale=f"{len(resp)} chars")
return Feedback(value="no", rationale="No text response")


@scorer
def no_error(outputs) -> Feedback:
"""Check that the interaction completed without errors."""
err = outputs.get("error") if isinstance(outputs, dict) else None
if err and str(err).strip():
return Feedback(value="no", rationale=f"Error: {str(err)[:200]}")
return Feedback(value="yes", rationale="No errors")

Step 5: Run Evaluation

Results are logged as assessments on each trace in the experiment.

experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
traces_df = mlflow.search_traces(
locations=[experiment.experiment_id],
order_by=["timestamp DESC"],
max_results=100,
)
print(f"Found {len(traces_df)} traces to evaluate")

eval_results = mlflow.genai.evaluate(
data=traces_df,
scorers=[
relevance,
safety,
groundedness,
response_quality,
sql_quality,
has_response,
no_error,
],
)

Adjust max_results to evaluate more or fewer traces.

Results

After evaluation, each trace has assessment columns showing pass/fail results from every scorer.

Traces with assessment columns showing judge results

Click a trace to see the full assessment panel with scores and rationales from each judge.

Trace detail with assessment panel showing all judge scores

Next Steps