mlflow.data
The mlflow.data module helps you record your model training and evaluation datasets to runs with MLflow Tracking, as well as retrieve dataset information from runs. It provides the following important interfaces:
- Dataset: Represents a dataset used in model training or evaluation, including features, targets, and metadata such as the dataset’s name, digest (hash), schema, profile, and source. You can log this metadata to a run in MLflow Tracking using the mlflow.log_input() API. mlflow.data provides APIs for constructing Datasets from a variety of Python data objects, including Pandas DataFrames (mlflow.data.from_pandas()), NumPy arrays (mlflow.data.from_numpy()), Spark DataFrames (mlflow.data.from_spark() / mlflow.data.load_delta()), and more.
- DatasetSource: Represents the source of a dataset. For example, this may be a directory of files stored in S3, a Delta Table, or a web URL. Each Dataset references the source from which it was derived. A Dataset’s features and targets may differ from the source if transformations and filtering were applied. You can get the DatasetSource of a dataset logged to a run in MLflow Tracking using the mlflow.data.get_source() API.
The following example demonstrates how to use mlflow.data to log a training dataset to a run, retrieve information about the dataset from the run, and load the dataset’s source.
import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset
# Construct a Pandas DataFrame using wine quality data from a web URL.
# The file is semicolon-delimited, so the separator is passed explicitly.
dataset_source_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(dataset_source_url, sep=";")
# Construct an MLflow PandasDataset from the Pandas DataFrame, and specify the web URL
# as the source
dataset: PandasDataset = mlflow.data.from_pandas(df, source=dataset_source_url)
with mlflow.start_run():
    # Log the dataset to the MLflow Run. Specify the "training" context to indicate that the
    # dataset is used for model training
    mlflow.log_input(dataset, context="training")
# Retrieve the run, including dataset information
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")
# Load the dataset's source, which downloads the content from the source URL to the local
# filesystem
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.load()
-
class mlflow.data.dataset.Dataset(source: mlflow.data.dataset_source.DatasetSource, name: Optional[str] = None, digest: Optional[str] = None)
Bases: object
Note
Experimental: This class may change or be removed in a future release without warning.
Represents a dataset for use with MLflow Tracking, including the name, digest (hash), schema, and profile of the dataset as well as source information (e.g. the S3 bucket or managed Delta table from which the dataset was derived). Most datasets expose features and targets for training and evaluation as well.
-
abstract property profile
Optional summary statistics for the dataset, such as the number of rows in a table, the mean / median / std of each table column, etc.
-
abstract property schema
Optional dataset schema, such as an instance of mlflow.types.Schema representing the features and targets of the dataset.
-
property source
Information about the dataset’s source, represented as an instance of DatasetSource. For example, this may be the S3 location or the name of the managed Delta Table from which the dataset was derived.
-
class mlflow.data.dataset_source.DatasetSource
Bases: object
Note
Experimental: This class may change or be removed in a future release without warning.
Represents the source of a dataset used in MLflow Tracking, providing information such as cloud storage location, delta table name / version, etc.
-
from_json(cls, source_json: str) → DatasetSource
-
abstract load() → Any
Loads files / objects referred to by the DatasetSource. For example, depending on the type of DatasetSource, this may download source CSV files from S3 to the local filesystem, load a source Delta Table as a Spark DataFrame, etc.
- Returns
The downloaded source, e.g. a local filesystem path, a Spark DataFrame, etc.
-
to_json() → str
Obtains a JSON string representation of the DatasetSource.
- Returns
A JSON string representation of the DatasetSource.
-
-
mlflow.data.get_source(dataset: Union[Dataset, DatasetInput, mlflow.data.dataset.Dataset]) → mlflow.data.dataset_source.DatasetSource
Note
Experimental: This function may change or be removed in a future release without warning.
Obtains the source of the specified dataset or dataset input.
- Parameters
dataset – An instance of mlflow.data.dataset.Dataset, mlflow.entities.Dataset, or mlflow.entities.DatasetInput.
- Returns
An instance of DatasetSource.
pandas
-
mlflow.data.from_pandas(df: pandas.core.frame.DataFrame, source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.pandas_dataset.PandasDataset
Note
Experimental: This function may change or be removed in a future release without warning.
Constructs a PandasDataset instance from a Pandas DataFrame, optional targets, and source.
- Parameters
df – A Pandas DataFrame.
source – The source from which the DataFrame was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, or a Spark table. source may be specified as a URI, a path-like string, or an instance of DatasetSource. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_pandas is being called.
targets – An optional target column name for supervised training. This column must be present in the DataFrame (df).
name – The name of the dataset. If unspecified, a name is generated.
digest – The dataset digest (hash). If unspecified, a digest is computed automatically.
-
class mlflow.data.pandas_dataset.PandasDataset
Represents a Pandas DataFrame for use with MLflow Tracking.
-
property schema
An instance of mlflow.types.Schema representing the tabular dataset. May be None if the schema cannot be inferred from the dataset.
NumPy
-
mlflow.data.from_numpy(features: Union[numpy.ndarray, Dict[str, numpy.ndarray]], source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets: Optional[Union[numpy.ndarray, Dict[str, numpy.ndarray]]] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.numpy_dataset.NumpyDataset
Note
Experimental: This function may change or be removed in a future release without warning.
Constructs a NumpyDataset object from NumPy features, optional targets, and source. If the source is path-like, then this will construct a DatasetSource object from the source path. Otherwise, the source is assumed to be a DatasetSource object.
- Parameters
features – NumPy features, represented as an np.ndarray or dictionary of named np.ndarrays.
source – The source from which the NumPy data was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, or a Spark table. source may be specified as a URI, a path-like string, or an instance of DatasetSource. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_numpy is being called.
targets – Optional NumPy targets, represented as an np.ndarray or dictionary of named np.ndarrays.
name – The name of the dataset. If unspecified, a name is generated.
digest – The dataset digest (hash). If unspecified, a digest is computed automatically.
-
class mlflow.data.numpy_dataset.NumpyDataset
Note
Experimental: This class may change or be removed in a future release without warning.
Represents a NumPy dataset for use with MLflow Tracking.
Spark
-
mlflow.data.load_delta(path: Optional[str] = None, table_name: Optional[str] = None, version: Optional[str] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.spark_dataset.SparkDataset
Note
Experimental: This function may change or be removed in a future release without warning.
Loads a SparkDataset from a Delta table for use with MLflow Tracking.
- Parameters
path – The path to the Delta table. Either path or table_name must be specified.
table_name – The name of the Delta table. Either path or table_name must be specified.
version – The Delta table version. If not specified, the version will be inferred.
targets – Optional. The name of the Delta table column containing targets (labels) for supervised learning.
name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
- Returns
An instance of SparkDataset.
-
mlflow.data.from_spark(df: pyspark.sql.DataFrame, path: Optional[str] = None, table_name: Optional[str] = None, version: Optional[str] = None, sql: Optional[str] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.spark_dataset.SparkDataset
Note
Experimental: This function may change or be removed in a future release without warning.
Given a Spark DataFrame, constructs a SparkDataset object for use with MLflow Tracking.
- Parameters
df – The Spark DataFrame from which to construct a SparkDataset.
path – The path of the Spark or Delta source that the DataFrame originally came from. Note that the path does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.
table_name – The name of the Spark or Delta table that the DataFrame originally came from. Note that the table does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.
version – If the DataFrame originally came from a Delta table, specifies the version of the Delta table. This is used to reload the dataset upon request via SparkDataset.source.load(). version cannot be specified if sql is specified.
sql – The Spark SQL statement that was originally used to construct the DataFrame. Note that the Spark SQL statement does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.
targets – Optional. The name of the DataFrame column containing targets (labels) for supervised learning.
name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
- Returns
An instance of SparkDataset.
-
class mlflow.data.spark_dataset.SparkDataset
Note
Experimental: This class may change or be removed in a future release without warning.
Represents a Spark dataset (e.g. data derived from a Spark Table / file directory or Delta Table) for use with MLflow Tracking.
-
property source
Spark dataset source information.
- Returns
An instance of SparkDatasetSource or DeltaDatasetSource.
Hugging Face
-
mlflow.data.huggingface_dataset.from_huggingface(ds, path: Optional[str] = None, targets: Optional[str] = None, data_dir: Optional[str] = None, data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None, revision=None, task=None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.huggingface_dataset.HuggingFaceDataset
Note
Experimental: This function may change or be removed in a future release without warning.
Given a Hugging Face datasets.Dataset, constructs an MLflow HuggingFaceDataset object for use with MLflow Tracking.
- Parameters
ds – A Hugging Face dataset. Must be an instance of datasets.Dataset. Other types, such as datasets.DatasetDict, are not supported.
path – The path of the Hugging Face dataset used to construct the source. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load(). If no path is specified, a CodeDatasetSource is used, which will source information from the run context.
targets – The name of the Hugging Face datasets.Dataset column containing targets (labels) for supervised learning.
data_dir – The data_dir of the Hugging Face dataset configuration. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().
data_files – Paths to source data file(s) for the Hugging Face dataset configuration. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().
revision – Version of the dataset script to load. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().
task – The task to prepare the Hugging Face dataset for during training and evaluation. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().
name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
-
class mlflow.data.huggingface_dataset.HuggingFaceDataset
Note
Experimental: This class may change or be removed in a future release without warning.
Represents a Hugging Face dataset for use with MLflow Tracking.
-
property ds
The Hugging Face datasets.Dataset instance.
- Returns
The Hugging Face datasets.Dataset instance.
-
property profile
Summary statistics for the Hugging Face dataset, including the number of rows, size, and size in bytes.
-
property source
Hugging Face dataset source information.
- Returns
A mlflow.data.huggingface_dataset_source.HuggingFaceDatasetSource instance.
-
property targets
The name of the Hugging Face dataset column containing targets (labels) for supervised learning.
- Returns
The string name of the Hugging Face dataset column containing targets.
-
to_evaluation_dataset(path=None, feature_names=None) → mlflow.models.evaluation.base.EvaluationDataset
Converts the dataset to an EvaluationDataset for model evaluation. Required for use with mlflow.evaluate().
TensorFlow
-
mlflow.data.tensorflow_dataset.from_tensorflow(features, source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets=None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.tensorflow_dataset.TensorFlowDataset
Constructs a TensorFlowDataset object from TensorFlow data, optional targets, and source. If the source is path-like, then this will construct a DatasetSource object from the source path. Otherwise, the source is assumed to be a DatasetSource object.
- Parameters
features – A TensorFlow dataset or tensor of features.
source – The source from which the data was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a delta table name with version, or spark table etc. If source is not a path like string, pass in a DatasetSource object directly. If no source is specified, a CodeDatasetSource is used, which will source information from the run context.
targets – A TensorFlow dataset or tensor of targets. Optional.
name – The name of the dataset. If unspecified, a name is generated.
digest – A dataset digest (hash). If unspecified, a digest is computed automatically.
-
class mlflow.data.tensorflow_dataset.TensorFlowDataset
Note
Experimental: This class may change or be removed in a future release without warning.
Represents a TensorFlow dataset for use with MLflow Tracking.
-
to_evaluation_dataset(path=None, feature_names=None) → mlflow.models.evaluation.base.EvaluationDataset
Converts the dataset to an EvaluationDataset for model evaluation. Only supported if the dataset is a Tensor. Required for use with mlflow.evaluate().
-
-
class mlflow.models.evaluation.base.EvaluationDataset
An input dataset for model evaluation. This is intended for use with the mlflow.models.evaluate() API.
Dataset Sources
-
class mlflow.data.filesystem_dataset_source.FileSystemDatasetSource
Note
Experimental: This class may change or be removed in a future release without warning.
Represents the source of a dataset stored on a filesystem, e.g. a local UNIX filesystem, blob storage services like S3, etc.
-
abstract load(dst_path=None) → str
Downloads the dataset source to the local filesystem.
- Parameters
dst_path – Path of the local filesystem destination directory to which to download the dataset source. If the directory does not exist, it is created. If unspecified, the dataset source is downloaded to a new uniquely-named directory on the local filesystem, unless the dataset source already exists on the local filesystem, in which case its local path is returned directly.
- Returns
The path to the downloaded dataset source on the local filesystem.
-
class mlflow.data.http_dataset_source.HTTPDatasetSource
Represents the source of a dataset stored at a web location and referred to by an HTTP or HTTPS URL.
-
load(dst_path=None) → str
Downloads the dataset source to the local filesystem.
- Parameters
dst_path – Path of the local filesystem destination directory to which to download the dataset source. If the directory does not exist, it is created. If unspecified, the dataset source is downloaded to a new uniquely-named directory on the local filesystem.
- Returns
The path to the downloaded dataset source on the local filesystem.
-
-
class mlflow.data.huggingface_dataset_source.HuggingFaceDatasetSource
Note
Experimental: This class may change or be removed in a future release without warning.
Represents the source of a Hugging Face dataset used in MLflow Tracking.
-
load(**kwargs)
Loads the dataset source as a Hugging Face Dataset.
- Parameters
kwargs – Additional keyword arguments used for loading the dataset with the Hugging Face datasets.load_dataset() method. The following keyword arguments are used automatically from the dataset source but may be overridden by values passed in **kwargs: path, name, data_dir, data_files, split, revision, task.
- Returns
An instance of datasets.Dataset.
-
-
class mlflow.data.delta_dataset_source.DeltaDatasetSource
Note
Experimental: This class may change or be removed in a future release without warning.
Represents the source of a dataset stored in a Delta table.
-
load(**kwargs)
Loads the dataset source as a Spark DataFrame.
- Returns
An instance of pyspark.sql.DataFrame.
-
-
class mlflow.data.spark_dataset_source.SparkDatasetSource
Note
Experimental: This class may change or be removed in a future release without warning.
Represents the source of a dataset stored in a Spark table.
-
load(**kwargs)
Loads the dataset source as a Spark DataFrame.
- Returns
An instance of pyspark.sql.DataFrame.
-