mlflow.data

The mlflow.data module helps you record your model training and evaluation datasets to runs with MLflow Tracking, as well as retrieve dataset information from runs. It provides the following important interfaces:

  • Dataset: Represents a dataset used in model training or evaluation, including features, targets, and metadata such as the dataset’s name, digest (hash), schema, profile, and source. You can log this metadata to a run in MLflow Tracking using the mlflow.log_input() API. mlflow.data provides APIs for constructing Datasets from a variety of Python data objects, including Pandas DataFrames (mlflow.data.from_pandas()), NumPy arrays (mlflow.data.from_numpy()), Spark DataFrames (mlflow.data.from_spark() / mlflow.data.load_delta()), and more.

  • DatasetSource: Represents the source of a dataset. For example, this may be a directory of files stored in S3, a Delta Table, or a web URL. Each Dataset references the source from which it was derived. A Dataset’s features and targets may differ from the source if transformations and filtering were applied. You can get the DatasetSource of a dataset logged to a run in MLflow Tracking using the mlflow.data.get_source() API.

The following example demonstrates how to use mlflow.data to log a training dataset to a run, retrieve information about the dataset from the run, and load the dataset’s source.

import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset

# Construct a Pandas DataFrame using red wine quality data from a web URL
dataset_source_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
# The UCI wine quality CSV is semicolon-delimited
df = pd.read_csv(dataset_source_url, sep=";")
# Construct an MLflow PandasDataset from the Pandas DataFrame, and specify the web URL
# as the source
dataset: PandasDataset = mlflow.data.from_pandas(df, source=dataset_source_url)

with mlflow.start_run():
    # Log the dataset to the MLflow Run. Specify the "training" context to indicate that the
    # dataset is used for model training
    mlflow.log_input(dataset, context="training")

# Retrieve the run, including dataset information
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")

# Load the dataset's source, which downloads the content from the source URL to the local
# filesystem
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.load()

class mlflow.data.dataset.Dataset(source: mlflow.data.dataset_source.DatasetSource, name: Optional[str] = None, digest: Optional[str] = None)

Bases: object

Note

Experimental: This class may change or be removed in a future release without warning.

Represents a dataset for use with MLflow Tracking, including the name, digest (hash), schema, and profile of the dataset as well as source information (e.g. the S3 bucket or managed Delta table from which the dataset was derived). Most datasets expose features and targets for training and evaluation as well.

property digest

A unique hash or fingerprint of the dataset, e.g. "498c7496".

property name

The name of the dataset, e.g. "iris_data", "myschema.mycatalog.mytable@v1", etc.

abstract property profile

Optional summary statistics for the dataset, such as the number of rows in a table, the mean / median / std of each table column, etc.

abstract property schema

Optional dataset schema, such as an instance of mlflow.types.Schema representing the features and targets of the dataset.

property source

Information about the dataset’s source, represented as an instance of DatasetSource. For example, this may be the S3 location or the name of the managed Delta Table from which the dataset was derived.

to_json() → str

Obtains a JSON string representation of the Dataset.

Returns

A JSON string representation of the Dataset.

class mlflow.data.dataset_source.DatasetSource

Bases: object

Note

Experimental: This class may change or be removed in a future release without warning.

Represents the source of a dataset used in MLflow Tracking, providing information such as the cloud storage location, the Delta table name and version, etc.

from_json(cls, source_json: str) → DatasetSource

abstract load() → Any

Loads files / objects referred to by the DatasetSource. For example, depending on the type of DatasetSource, this may download source CSV files from S3 to the local filesystem, load a source Delta Table as a Spark DataFrame, etc.

Returns

The downloaded source, e.g. a local filesystem path, a Spark DataFrame, etc.

to_json() → str

Obtains a JSON string representation of the DatasetSource.

Returns

A JSON string representation of the DatasetSource.

mlflow.data.get_source(dataset: Union[mlflow.entities.Dataset, mlflow.entities.DatasetInput, mlflow.data.dataset.Dataset]) → mlflow.data.dataset_source.DatasetSource

Note

Experimental: This function may change or be removed in a future release without warning.

Obtains the source of the specified dataset or dataset input.

Parameters

dataset – An instance of mlflow.data.dataset.Dataset, mlflow.entities.Dataset, or mlflow.entities.DatasetInput.

Returns

An instance of DatasetSource.

pandas

mlflow.data.from_pandas(df: pandas.core.frame.DataFrame, source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.pandas_dataset.PandasDataset

Note

Experimental: This function may change or be removed in a future release without warning.

Constructs a PandasDataset instance from a Pandas DataFrame, optional targets, and source.

Parameters
  • df – A Pandas DataFrame.

  • source – The source from which the DataFrame was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, a Spark table, etc. source may be specified as a URI, a path-like string, or an instance of DatasetSource. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_pandas is being called.

  • targets – An optional target column name for supervised training. This column must be present in the dataframe (df).

  • name – The name of the dataset. If unspecified, a name is generated.

  • digest – The dataset digest (hash). If unspecified, a digest is computed automatically.

class mlflow.data.pandas_dataset.PandasDataset

Represents a Pandas DataFrame for use with MLflow Tracking.

property df

The underlying pandas DataFrame.

property profile

A profile of the dataset. May be None if a profile cannot be computed.

property schema

An instance of mlflow.types.Schema representing the tabular dataset. May be None if the schema cannot be inferred from the dataset.

property source

The source of the dataset.

property targets

The name of the target column. May be None if no target column is available.

NumPy

mlflow.data.from_numpy(features: Union[numpy.ndarray, Dict[str, numpy.ndarray]], source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets: Optional[Union[numpy.ndarray, Dict[str, numpy.ndarray]]] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.numpy_dataset.NumpyDataset

Note

Experimental: This function may change or be removed in a future release without warning.

Constructs a NumpyDataset object from NumPy features, optional targets, and source. If the source is a path-like string, a DatasetSource object is constructed from the source path. Otherwise, the source is assumed to be a DatasetSource object.

Parameters
  • features – NumPy features, represented as an np.ndarray or dictionary of named np.ndarrays.

  • source – The source from which the NumPy data was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, a Spark table, etc. source may be specified as a URI, a path-like string, or an instance of DatasetSource. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_numpy is being called.

  • targets – Optional NumPy targets, represented as an np.ndarray or dictionary of named np.ndarrays.

  • name – The name of the dataset. If unspecified, a name is generated.

  • digest – The dataset digest (hash). If unspecified, a digest is computed automatically.

class mlflow.data.numpy_dataset.NumpyDataset

Note

Experimental: This class may change or be removed in a future release without warning.

Represents a NumPy dataset for use with MLflow Tracking.

property features

The features of the dataset.

property profile

A profile of the dataset. May be None if a profile cannot be computed.

property schema

MLflow TensorSpec schema representing the dataset features and targets (optional).

property source

The source of the dataset.

property targets

The targets of the dataset. May be None if no targets are available.

Spark

mlflow.data.load_delta(path: Optional[str] = None, table_name: Optional[str] = None, version: Optional[str] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.spark_dataset.SparkDataset

Note

Experimental: This function may change or be removed in a future release without warning.

Loads a SparkDataset from a Delta table for use with MLflow Tracking.

Parameters
  • path – The path to the Delta table. Either path or table_name must be specified.

  • table_name – The name of the Delta table. Either path or table_name must be specified.

  • version – The Delta table version. If not specified, the version will be inferred.

  • targets – Optional. The name of the Delta table column containing targets (labels) for supervised learning.

  • name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.

  • digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.

Returns

An instance of SparkDataset.

mlflow.data.from_spark(df: pyspark.sql.DataFrame, path: Optional[str] = None, table_name: Optional[str] = None, version: Optional[str] = None, sql: Optional[str] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.spark_dataset.SparkDataset

Note

Experimental: This function may change or be removed in a future release without warning.

Given a Spark DataFrame, constructs a SparkDataset object for use with MLflow Tracking.

Parameters
  • df – The Spark DataFrame from which to construct a SparkDataset.

  • path – The path of the Spark or Delta source that the DataFrame originally came from. Note that the path does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.

  • table_name – The name of the Spark or Delta table that the DataFrame originally came from. Note that the table does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.

  • version – If the DataFrame originally came from a Delta table, specifies the version of the Delta table. This is used to reload the dataset upon request via SparkDataset.source.load(). version cannot be specified if sql is specified.

  • sql – The Spark SQL statement that was originally used to construct the DataFrame. Note that the Spark SQL statement does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.

  • targets – Optional. The name of the DataFrame column containing targets (labels) for supervised learning.

  • name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.

  • digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.

Returns

An instance of SparkDataset.

class mlflow.data.spark_dataset.SparkDataset

Note

Experimental: This class may change or be removed in a future release without warning.

Represents a Spark dataset (e.g. data derived from a Spark Table / file directory or Delta Table) for use with MLflow Tracking.

property df

The Spark DataFrame instance.

Returns

The Spark DataFrame instance.

property profile

A profile of the dataset. May be None if no profile is available.

property schema

The MLflow ColSpec schema of the Spark dataset.

property source

Spark dataset source information.

Returns

An instance of SparkDatasetSource or DeltaDatasetSource.

property targets

The name of the Spark DataFrame column containing targets (labels) for supervised learning.

Returns

The string name of the Spark DataFrame column containing targets.

Hugging Face

mlflow.data.huggingface_dataset.from_huggingface(ds, path: Optional[str] = None, targets: Optional[str] = None, data_dir: Optional[str] = None, data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None, revision=None, task=None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.huggingface_dataset.HuggingFaceDataset

Note

Experimental: This function may change or be removed in a future release without warning.

Given a Hugging Face datasets.Dataset, constructs an MLflow HuggingFaceDataset object for use with MLflow Tracking.

Parameters
  • ds – A Hugging Face dataset. Must be an instance of datasets.Dataset. Other types, such as datasets.DatasetDict, are not supported.

  • path – The path of the Hugging Face dataset used to construct the source. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load(). If no path is specified, a CodeDatasetSource is used, which will source information from the run context.

  • targets – The name of the Hugging Face datasets.Dataset column containing targets (labels) for supervised learning.

  • data_dir – The data_dir of the Hugging Face dataset configuration. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().

  • data_files – Paths to source data file(s) for the Hugging Face dataset configuration. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().

  • revision – Version of the dataset script to load. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().

  • task – The task to prepare the Hugging Face dataset for during training and evaluation. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().

  • name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.

  • digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.

class mlflow.data.huggingface_dataset.HuggingFaceDataset

Note

Experimental: This class may change or be removed in a future release without warning.

Represents a Hugging Face dataset for use with MLflow Tracking.

property ds

The Hugging Face datasets.Dataset instance.

Returns

The Hugging Face datasets.Dataset instance.

property profile

Summary statistics for the Hugging Face dataset, including the number of rows, size, and size in bytes.

property schema

The MLflow ColSpec schema of the Hugging Face dataset.

property source

Hugging Face dataset source information.

Returns

An mlflow.data.huggingface_dataset_source.HuggingFaceDatasetSource instance.

property targets

The name of the Hugging Face dataset column containing targets (labels) for supervised learning.

Returns

The string name of the Hugging Face dataset column containing targets.

to_evaluation_dataset(path=None, feature_names=None) → mlflow.models.evaluation.base.EvaluationDataset

Converts the dataset to an EvaluationDataset for model evaluation. Required for use with mlflow.evaluate().

TensorFlow

mlflow.data.tensorflow_dataset.from_tensorflow(features, source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets=None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.tensorflow_dataset.TensorFlowDataset

Constructs a TensorFlowDataset object from TensorFlow data, optional targets, and source. If the source is a path-like string, a DatasetSource object is constructed from the source path. Otherwise, the source is assumed to be a DatasetSource object.

Parameters
  • features – A TensorFlow dataset or tensor of features.

  • source – The source from which the data was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, a Spark table, etc. If source is not a path-like string, pass in a DatasetSource object directly. If no source is specified, a CodeDatasetSource is used, which will source information from the run context.

  • targets – A TensorFlow dataset or tensor of targets. Optional.

  • name – The name of the dataset. If unspecified, a name is generated.

  • digest – A dataset digest (hash). If unspecified, a digest is computed automatically.

class mlflow.data.tensorflow_dataset.TensorFlowDataset

Note

Experimental: This class may change or be removed in a future release without warning.

Represents a TensorFlow dataset for use with MLflow Tracking.

property data

The underlying TensorFlow data.

property profile

A profile of the dataset. May be None if no profile is available.

property schema

An MLflow TensorSpec schema representing the tensor dataset.

property source

The source of the dataset.

property targets

The targets of the dataset.

to_evaluation_dataset(path=None, feature_names=None) → mlflow.models.evaluation.base.EvaluationDataset

Converts the dataset to an EvaluationDataset for model evaluation. Only supported if the dataset is a Tensor. Required for use with mlflow.evaluate().

class mlflow.models.evaluation.base.EvaluationDataset

An input dataset for model evaluation. This is intended for use with the mlflow.models.evaluate() API.

NUM_SAMPLE_ROWS_FOR_HASH = 5

SPARK_DATAFRAME_LIMIT = 10000

property feature_names

property features_data

Returns the features data as a NumPy array or a pandas DataFrame.

property has_targets

Returns True if the dataset has targets, False otherwise.

property hash

Dataset hash, computed from the first NUM_SAMPLE_ROWS_FOR_HASH rows and the last NUM_SAMPLE_ROWS_FOR_HASH rows.

property labels_data

Returns the labels data as a NumPy array.

property name

The dataset name: the user-specified name, or the dataset hash if no name was specified.

property path

The dataset path.

property targets_name

Returns the name of the targets.

Dataset Sources

class mlflow.data.filesystem_dataset_source.FileSystemDatasetSource

Note

Experimental: This class may change or be removed in a future release without warning.

Represents the source of a dataset stored on a filesystem, e.g. a local UNIX filesystem, blob storage services like S3, etc.

abstract load(dst_path=None) → str

Downloads the dataset source to the local filesystem.

Parameters

dst_path – Path of the local filesystem destination directory to which to download the dataset source. If the directory does not exist, it is created. If unspecified, the dataset source is downloaded to a new uniquely-named directory on the local filesystem, unless the dataset source already exists on the local filesystem, in which case its local path is returned directly.

Returns

The path to the downloaded dataset source on the local filesystem.

abstract property uri

The URI referring to the dataset source filesystem location.

Returns

The URI referring to the dataset source filesystem location, e.g. “s3://mybucket/path/to/mydataset”, “/tmp/path/to/my/dataset”, etc.

class mlflow.data.http_dataset_source.HTTPDatasetSource

Represents the source of a dataset stored at a web location and referred to by an HTTP or HTTPS URL.

load(dst_path=None) → str

Downloads the dataset source to the local filesystem.

Parameters

dst_path – Path of the local filesystem destination directory to which to download the dataset source. If the directory does not exist, it is created. If unspecified, the dataset source is downloaded to a new uniquely-named directory on the local filesystem.

Returns

The path to the downloaded dataset source on the local filesystem.

property url

The HTTP/S URL referring to the dataset source location.

Returns

The HTTP/S URL referring to the dataset source location.

class mlflow.data.huggingface_dataset_source.HuggingFaceDatasetSource

Note

Experimental: This class may change or be removed in a future release without warning.

Represents the source of a Hugging Face dataset used in MLflow Tracking.

load(**kwargs)

Loads the dataset source as a Hugging Face Dataset.

Parameters

kwargs – Additional keyword arguments used for loading the dataset with the Hugging Face datasets.load_dataset() method. The following keyword arguments are used automatically from the dataset source but may be overridden by values passed in **kwargs: path, name, data_dir, data_files, split, revision, task.

Returns

An instance of datasets.Dataset.

class mlflow.data.delta_dataset_source.DeltaDatasetSource

Note

Experimental: This class may change or be removed in a future release without warning.

Represents the source of a dataset stored in a Delta table.

property delta_table_name

property delta_table_version

load(**kwargs)

Loads the dataset source as a Spark DataFrame.

Returns

An instance of pyspark.sql.DataFrame.

property path

class mlflow.data.spark_dataset_source.SparkDatasetSource

Note

Experimental: This class may change or be removed in a future release without warning.

Represents the source of a dataset stored in a Spark table.

load(**kwargs)

Loads the dataset source as a Spark DataFrame.

Returns

An instance of pyspark.sql.DataFrame.