MLflow Data Model
MLflow's data model organizes how you log, debug, and evaluate GenAI applications so that you can meet quality, cost, and latency goals. It addresses key challenges in reproducibility, quality assessment, and iterative development.
Overview
The MLflow data model consists of several interconnected entities that work together to support your GenAI application development workflow:
- Experiment: The root container for your GenAI application
- LoggedModel: A first-class entity representing your AI model or agent with integrated tracking
- Trace: A log of inputs, outputs, and intermediate steps from a single application execution
- Assessments: Quality judgments on a Trace, categorized as Feedback or Expectations
- Scorers: Definitions of automated evaluation functions that produce Feedback
- Evaluation Datasets: Curated sets of inputs (and optional Expectations) for offline testing
- Evaluation Runs: Results from running app versions against Evaluation Datasets, containing new, scored Traces
- Labeling Sessions: Collections of Traces organized for human review
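To make the relationships between these entities concrete, here is a minimal sketch that models three of them as plain Python dataclasses. The class and field names are hypothetical stand-ins chosen for illustration; they are not the MLflow entity classes and carry far fewer fields than the real ones.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Assessment:
    """A quality judgment attached to a trace (illustrative, not the MLflow class)."""

    name: str
    kind: str  # "feedback" or "expectation"
    value: Any
    rationale: Optional[str] = None


@dataclass
class Trace:
    """One execution of the application, together with its assessments."""

    trace_id: str
    inputs: dict
    outputs: dict
    assessments: list = field(default_factory=list)


@dataclass
class Experiment:
    """Root container that holds every trace for one application."""

    name: str
    traces: list = field(default_factory=list)


experiment = Experiment(name="my-genai-app")
trace = Trace(
    trace_id="trace-123",
    inputs={"question": "What is MLflow?"},
    outputs={"response": "MLflow is..."},
)
# A trace can carry both kinds of assessment at once.
trace.assessments.append(
    Assessment(name="relevance", kind="feedback", value=4, rationale="On topic")
)
trace.assessments.append(
    Assessment(name="expected_response", kind="expectation", value="MLflow is...")
)
experiment.traces.append(trace)

feedback_names = [a.name for a in trace.assessments if a.kind == "feedback"]
print(feedback_names)
```

The point of the sketch is the containment: an Experiment owns Traces, and each Trace owns its Assessments, which are either Feedback or Expectations.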
MLflow Experiment
An Experiment is the top-level container for each distinct application or use case. It contains all Traces from development and production alongside all other entities in the data model. We recommend creating a single Experiment for each application.
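Setting up an Experiment
import mlflow
# Set the active experiment, creating it if it does not exist
mlflow.set_experiment("my-genai-app")
# Or create it explicitly; this returns the new experiment ID
# and raises an error if the name is already in use
experiment_id = mlflow.create_experiment("my-genai-app")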
MLflow LoggedModel: Model Management
A LoggedModel is a first-class entity that represents your AI model, agent, or GenAI application within an Experiment. It provides unified tracking of model artifacts, execution traces, evaluation metrics, and metadata throughout the development lifecycle.
Key Features of LoggedModel
LoggedModel serves as the central hub that connects:
- Model artifacts and configuration parameters
- Execution traces from development and production
- Evaluation metrics and performance assessments
- Version history and deployment tracking
Creating LoggedModels
- Direct Model Logging
- Active Model Pattern
- External Model Reference
Create a LoggedModel by logging your model directly:
import mlflow
# Log a model with comprehensive metadata
logged_model = mlflow.langchain.log_model(
    lc_model=your_chain,
    name="customer_support_agent",
    params={"temperature": 0.1, "max_tokens": 2000},
    model_type="agent",
    input_example={"messages": "How can I help you?"},
)
print(f"Model ID: {logged_model.model_id}")
Use the active model pattern for automatic trace linking:
# Set active model for automatic trace association
mlflow.set_active_model(name="customer_support_agent")
# Enable autologging
mlflow.langchain.autolog()
# All traces will be automatically linked to the active model
response = your_model.invoke({"messages": "Hello!"})
Reference models stored outside MLflow:
# Create external model reference
external_model = mlflow.create_external_model(
    name="production_model_v2",
    model_type="agent",
    params={"version": "2.1", "endpoint": "api.example.com"},
)
LoggedModel Benefits
Unified Tracking: All model-related artifacts, traces, and metrics are organized under a single entity, providing complete visibility into model behavior and performance.
Automatic Trace Linking: When using the active model pattern, all execution traces are automatically associated with the LoggedModel, eliminating manual tracking overhead.
Version Management: LoggedModel supports systematic versioning and comparison across different model iterations, enabling data-driven model selection.
Evaluation Integration: Evaluation metrics and results are directly linked to LoggedModel, providing comprehensive performance assessment.
MLflow Traces: The Foundation
The foundational concept is the Trace: a single, complete execution of your GenAI application (e.g., a user request or API call).
How Traces are Generated
Traces are generated through:
- Automatic Instrumentation
- Decorators and Context Managers
Automatic tracing is enabled with a single line of code for 20+ popular LLM SDKs:
import mlflow
# Enable automatic tracing for OpenAI
mlflow.openai.autolog()
# Your existing code works unchanged
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": "Hello!"}]
)
Use the MLflow tracing APIs for fine-grained control:
import mlflow
@mlflow.trace
def my_custom_function(input_data):
    # Your custom logic here
    result = process_data(input_data)
    return result
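Conceptually, a tracing decorator wraps a function and records what went in, what came out, and how long the call took. The sketch below illustrates that idea in plain Python. It is not how MLflow implements `@mlflow.trace`; `simple_trace`, `RECORDED_TRACES`, and `normalize` are hypothetical names used only here.

```python
import functools
import time

# In-memory list standing in for a trace backend (hypothetical).
RECORDED_TRACES = []


def simple_trace(func):
    """Record the inputs, output, and duration of each successful call to func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        RECORDED_TRACES.append(
            {
                "name": func.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "outputs": result,
                "duration_s": time.perf_counter() - start,
            }
        )
        return result

    return wrapper


@simple_trace
def normalize(text):
    """Toy application step: trim whitespace and lowercase the text."""
    return text.strip().lower()


normalize("  Hello!  ")
print(RECORDED_TRACES[0]["name"], RECORDED_TRACES[0]["outputs"])
```

This sketch records only calls that return normally. A real tracing layer also captures exceptions and nests calls into spans, which is the part the MLflow APIs handle for you.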
Purpose of Traces
Traces enable:
- Observability: Gain insights into application performance
- Debugging: Understand execution flow to resolve issues
- Quality Evaluation: Assess response quality over time
- Human Review: Provide data for expert annotation
Assessments: Quality Judgments
Assessments are qualitative or quantitative judgments attached to Traces to understand and improve GenAI application quality. A Trace can have multiple Assessments, primarily Feedback or Expectations.
Feedback Assessments
Feedback captures evaluations of a Trace and includes:
- Name: Developer-defined category (e.g., relevance, correctness)
- Score: Evaluation (e.g., thumbs up/down, numerical rating)
- Rationale: Optional textual explanation for the score
# Log feedback programmatically
mlflow.log_feedback(
    trace_id="trace-123",
    name="relevance",
    value=4,
    rationale="Response was highly relevant to the question",
)
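Once several Feedback entries exist, a common next step is to summarize them per name. The following sketch averages numeric scores by Feedback name using plain dictionaries. The record layout and the second trace ID are assumptions made for illustration, not the shape MLflow returns.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical feedback records, reduced to the fields needed here.
feedback_records = [
    {"trace_id": "trace-123", "name": "relevance", "value": 4},
    {"trace_id": "trace-124", "name": "relevance", "value": 2},
    {"trace_id": "trace-123", "name": "correctness", "value": 5},
]


def mean_score_by_name(records):
    """Group numeric feedback values by name and return the mean of each group."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["name"]].append(record["value"])
    return {name: mean(values) for name, values in grouped.items()}


averages = mean_score_by_name(feedback_records)
print(averages)
```

Grouping by name matters because scores under different names (relevance versus correctness) are not comparable and should not be averaged together.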
Expectation Assessments
Expectations are ground truth labels for a Trace (e.g., expected_facts, expected_response). These are primarily used in offline evaluation to compare app output against known correct answers.
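As a rough illustration of how an Expectation is used, the sketch below checks which `expected_facts` are absent from a response. The helper name `missing_expected_facts` and the sample strings are hypothetical, and the case-insensitive substring match is a deliberate simplification.

```python
def missing_expected_facts(response, expected_facts):
    """Return the expected facts that do not appear in the response (case-insensitive)."""
    lowered = response.lower()
    return [fact for fact in expected_facts if fact.lower() not in lowered]


response = "MLflow is an open source platform for managing the ML lifecycle."
expected_facts = ["open source", "ML lifecycle", "experiment tracking"]

missing = missing_expected_facts(response, expected_facts)
print(missing)
```

Substring matching misses paraphrases, so it only suits facts with stable wording; for anything looser, an LLM judge is usually the better fit.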
Scorers: Automated Quality Measurement
Scorers are functions that programmatically assess Trace quality and produce Feedback. They can be code-based heuristics or LLM judges.
Code-based Heuristics
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer
@scorer
def check_response_length(outputs):
    """Custom scorer that checks whether the response has an appropriate length"""
    # Assumes the app returns a dict with a "response" key
    response = outputs.get("response", "")
    if 50 <= len(response) <= 500:
        return Feedback(value=1, rationale="Response length is appropriate")
    return Feedback(value=0, rationale="Response too short or too long")
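The length rule at the heart of this heuristic is plain Python, so it can be checked in isolation before it is wired into a scorer. The sketch below pulls the rule into a hypothetical helper, `length_score`, with the same 50 and 500 character bounds, and exercises the boundary values.

```python
def length_score(response, min_chars=50, max_chars=500):
    """Return 1 if the response length is within [min_chars, max_chars], else 0."""
    return 1 if min_chars <= len(response) <= max_chars else 0


# Probe both sides of each bound: 49 and 501 fall outside, 50 and 500 are inside.
boundary_lengths = (49, 50, 500, 501)
scores = [length_score("x" * n) for n in boundary_lengths]
print(dict(zip(boundary_lengths, scores)))
```

Both bounds are inclusive, which is easy to get wrong when the rule is rewritten; checking the four boundary lengths catches that.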
LLM Judges
import mlflow
from mlflow.genai.scorers import RelevanceToQuery
# Use a built-in LLM judge
relevance_metric = RelevanceToQuery()
# Evaluate the app with the judge as a scorer
results = mlflow.genai.evaluate(
    predict_fn=your_model, data=evaluation_data, scorers=[relevance_metric]
)
Evaluation Datasets and Runs
MLflow provides systematic offline testing through Evaluation Datasets and Evaluation Runs.
Evaluation Datasets
An Evaluation Dataset is a curated collection of example inputs used to evaluate and improve app performance:
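import mlflow
import pandas as pd
# Create an evaluation dataset with one row per example
# expected_outputs holds the Expectation for each input
eval_df = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "How do I log metrics?"],
        "expected_outputs": ["MLflow is...", "To log metrics..."],
    }
)
dataset = mlflow.data.from_pandas(eval_df, name="my-genai-app-eval")
# Record the dataset on the active run
mlflow.log_input(dataset, context="evaluation")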
Evaluation Runs
An Evaluation Run stores results from running a new app version against an Evaluation Dataset:
# Run evaluation
results = mlflow.genai.evaluate(
    predict_fn=your_model,
    data=evaluation_dataset,
    scorers=[relevance_metric, accuracy_metric],
)
The evaluation process:
- The new app version processes each input from the Evaluation Dataset
- MLflow generates a new Trace for each input
- The configured Scorers annotate each Trace with Feedback
- All annotated Traces are stored in the Evaluation Run
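The steps above can be sketched as a small loop in plain Python. This is an illustration of the flow only, not what `mlflow.genai.evaluate` does internally; `run_evaluation`, `echo_app`, and both scorer functions are hypothetical names, and traces are reduced to dictionaries.

```python
def run_evaluation(predict_fn, dataset, scorers):
    """Run predict_fn over the dataset and attach one feedback entry per scorer."""
    scored_traces = []
    for record in dataset:
        # Steps 1 and 2: the app processes the input, producing one trace per record.
        outputs = predict_fn(**record["inputs"])
        trace = {"inputs": record["inputs"], "outputs": outputs, "feedback": {}}
        # Step 3: every scorer annotates the trace with feedback.
        for scorer in scorers:
            trace["feedback"][scorer.__name__] = scorer(
                outputs=outputs, expectations=record.get("expectations", {})
            )
        # Step 4: the annotated trace is stored with the run.
        scored_traces.append(trace)
    return scored_traces


def echo_app(question):
    """Toy application standing in for a new app version."""
    return f"You asked: {question}"


def mentions_question(outputs, expectations):
    return "You asked" in outputs


def exact_match(outputs, expectations):
    return outputs == expectations.get("expected_response")


dataset = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "You asked: What is MLflow?"},
    },
    {"inputs": {"question": "How do I log metrics?"}},  # no expectations
]

evaluation_run = run_evaluation(echo_app, dataset, [mentions_question, exact_match])
for trace in evaluation_run:
    print(trace["feedback"])
```

The second record has no Expectations, so `exact_match` reports a failure for it. In practice you would either supply Expectations for every record or use scorers that do not need them.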
Labeling Sessions: Human Review
Labeling Sessions organize Traces for human review, typically through the MLflow UI. Domain experts can browse these Traces and attach Feedback as Assessments.
# Create a labeling session
session = mlflow.genai.create_labeling_session(name="quality-review-session")
Complete Data Model Structure
All components operate within an Experiment, which contains LoggedModels, Traces with their Assessments, Scorers, Evaluation Datasets, Evaluation Runs, and Labeling Sessions.
Getting Started
To begin using the MLflow data model:
- Set up an experiment for your GenAI application
- Create or set a LoggedModel to organize your model tracking
- Enable automatic tracing for your LLM library
- Define custom scorers for your quality metrics
- Create evaluation datasets from representative examples
- Run evaluations to compare different versions
- Review traces and add human feedback as needed
import mlflow
# 1. Set up experiment
mlflow.set_experiment("my-genai-app")
# 2. Create LoggedModel
logged_model = mlflow.langchain.log_model(
    lc_model=your_model, name="my_agent", params={"temperature": 0.1}
)
# 3. Enable tracing with active model
mlflow.set_active_model(name="my_agent")
mlflow.langchain.autolog()
# 4. Your app code runs normally
# Traces are automatically captured and linked to LoggedModel
# 5. Evaluate and iterate
results = mlflow.genai.evaluate(
    predict_fn=your_model, data=evaluation_data, scorers=[your_custom_scorers]
)
Next Steps
- LoggedModel Guide: Learn comprehensive model lifecycle management
- Automatic Tracing: Learn how to enable automatic tracing for your LLM library
- Custom Tracing: Add manual instrumentation to your application
- Evaluation Guide: Dive deeper into evaluation workflows
- MLflow UI: Explore traces and results in the web interface
MLflow's comprehensive data model empowers systematic observation, debugging, evaluation, and improvement of GenAI applications, providing the foundation for building high-quality, reliable, and maintainable GenAI systems.