mlflow.pyspark.ml

class mlflow.pyspark.ml.AutologgingEstimatorMetadata(hierarchy, uid_to_indexed_name_map, param_search_estimators)

Bases: tuple

property hierarchy

Alias for field number 0

property param_search_estimators

Alias for field number 2

property uid_to_indexed_name_map

Alias for field number 1

mlflow.pyspark.ml.autolog(log_models=True, disable=False, exclusive=False, disable_for_unsupported_versions=False, silent=False)

Note

Experimental: This method may change or be removed in a future release without warning.

Note

Autologging is known to be compatible with the following package versions: 3.0.0 <= pyspark <= 3.1.2. Autologging may not succeed when used with package versions outside of this range.

Enables (or disables) and configures autologging for pyspark ml estimators. This method is not threadsafe. This API requires Spark 3.0 or above.

When is autologging performed?

Autologging is performed when you call Estimator.fit, except for estimators (featurizers) under pyspark.ml.feature.
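As a rough usage sketch, autologging can be enabled once before fitting an estimator. The helper name `fit_with_autolog` and the column names "features" and "label" are assumptions made for illustration; they are not part of MLflow.

```python
def fit_with_autolog(train_df):
    """Fit a LinearRegression with pyspark ML autologging enabled.

    `train_df` is assumed to be a Spark DataFrame with a vector column
    "features" and a numeric column "label" (illustrative names).
    """
    # Imports are deferred so that defining this helper does not itself
    # require pyspark or mlflow to be installed.
    import mlflow
    import mlflow.pyspark.ml
    from pyspark.ml.regression import LinearRegression

    # Enable autologging before calling fit(). This call is not threadsafe.
    mlflow.pyspark.ml.autolog()

    with mlflow.start_run():
        # Estimator.fit is the call that triggers autologging.
        estimator = LinearRegression(featuresCol="features", labelCol="label")
        model = estimator.fit(train_df)
    return model
```

Calling `fit_with_autolog(train_df)` with a suitable DataFrame should produce a run containing the logged information described in the following sections.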

Logged information
Parameters
  • Parameters obtained by estimator.params. If a param value is also an Estimator, then the params of the wrapped estimator are also logged, and the nested param key is {estimator_uid}.{param_name}.

Tags
  • An estimator class name (e.g. “LinearRegression”).

  • A fully qualified estimator class name (e.g. “pyspark.ml.regression.LinearRegression”).

Artifacts
  • An MLflow Model with the mlflow.spark flavor containing a fitted estimator (logged by mlflow.spark.log_model()). Note that large models may not be autologged for performance and storage space considerations, and autologging for Pipelines and hyperparameter tuning meta-estimators (e.g. CrossValidator) is not yet supported. See log_models param below for details.
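The nested param key format {estimator_uid}.{param_name} described under Parameters above can be illustrated with a small, Spark-free sketch. `FakeEstimator` and `flatten_params` are hypothetical stand-ins, not MLflow or pyspark APIs, and recording the wrapped estimator's uid as the value of the outer param is an assumption of this sketch.

```python
class FakeEstimator:
    """Hypothetical stand-in for a pyspark.ml Estimator: a uid plus param values."""

    def __init__(self, uid, params):
        self.uid = uid
        self.params = params


def flatten_params(estimator, prefix=""):
    """Flatten params, keying nested ones as '{estimator_uid}.{param_name}'."""
    flat = {}
    for name, value in estimator.params.items():
        if isinstance(value, FakeEstimator):
            # Assumption: the outer param records the uid of the wrapped estimator.
            flat[prefix + name] = value.uid
            # Params of the wrapped estimator are keyed by that estimator's uid.
            flat.update(flatten_params(value, prefix=value.uid + "."))
        else:
            flat[prefix + name] = value
    return flat


inner = FakeEstimator("LogisticRegression_2", {"maxIter": 10, "regParam": 0.01})
outer = FakeEstimator("OneVsRest_1", {"classifier": inner, "labelCol": "label"})
flat_params = flatten_params(outer)
print(flat_params)
```

Here the wrapped estimator's `maxIter` ends up under the key `LogisticRegression_2.maxIter`, while `labelCol` of the outer estimator keeps its plain name.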

How does autologging work for meta estimators?

When a meta estimator (e.g. Pipeline, CrossValidator, TrainValidationSplit, OneVsRest) calls fit(), it internally calls fit() on its child estimators. Autologging does NOT perform logging on these constituent fit() calls.

An “estimator_info.json” artifact is logged, which includes a hierarchy entry describing the hierarchy of the meta estimator. The hierarchy includes expanded entries for all nested stages, such as nested pipeline stages.

Parameter search

In addition to recording the information discussed above, autologging for parameter search meta estimators (CrossValidator and TrainValidationSplit) records child runs with metrics for each set of explored parameters, as well as artifacts and parameters for the best model and the best parameters (if available). For better readability, the “estimatorParamMaps” param of the parameter search estimator is recorded inside the “estimator_info.json” artifact. The following artifacts are recorded:

  • “estimator_info.json”: in addition to the “hierarchy” entry, it records two more items: “tuning_parameter_map_list”, a list of all parameter maps used in tuning, and “tuned_estimator_parameter_map”, the parameter map of the tuned estimator.

  • “best_parameters.json”: contains the best parameters found by the search.

  • “search_results.csv”: contains the search results as a table with two columns, “params” and “metric”.
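Once a “search_results.csv” artifact has been downloaded, its two columns can be inspected with the standard library. The sketch below uses made-up sample rows; the exact serialization of the “params” column (shown here as a JSON-like string) and the idea that a larger metric is better are assumptions that depend on the evaluator in use.

```python
import csv
import io

# Made-up stand-in for the contents of a "search_results.csv" artifact.
SAMPLE_CSV = '''params,metric
"{""regParam"": 0.01}",0.82
"{""regParam"": 0.1}",0.87
"{""regParam"": 1.0}",0.79
'''


def best_row(csv_text, larger_is_better=True):
    """Return the row with the best metric from a params/metric table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    pick = max if larger_is_better else min
    # The metric column is read as text, so convert it before comparing.
    return pick(rows, key=lambda row: float(row["metric"]))


best = best_row(SAMPLE_CSV)
print(best["params"], best["metric"])
```

For metrics where smaller is better (for example an error measure), pass `larger_is_better=False`.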

Parameters
  • log_models – If True, trained models that are in the allowlist are logged as MLflow model artifacts. If False, trained models are not logged. Note: the built-in allowlist excludes some models (e.g. ALS models) which can be large. To specify a custom allowlist, create a file containing a newline-delimited list of fully-qualified estimator classnames, and set the “spark.mlflow.pysparkml.autolog.logModelAllowlistFile” Spark config to the path of your allowlist file.

  • disable – If True, disables the pyspark ML autologging integration. If False, enables the pyspark ML autologging integration.

  • exclusive – If True, autologged content is not logged to user-created fluent runs. If False, autologged content is logged to the active fluent run, which may be user-created.

  • disable_for_unsupported_versions – If True, disable autologging for versions of pyspark that have not been tested against this version of the MLflow client or are incompatible.

  • silent – If True, suppress all event logs and warnings from MLflow during pyspark ML autologging. If False, show all events and warnings during pyspark ML autologging.

The default log model allowlist in MLflow
# classification
pyspark.ml.classification.LinearSVCModel
pyspark.ml.classification.DecisionTreeClassificationModel
pyspark.ml.classification.GBTClassificationModel
pyspark.ml.classification.LogisticRegressionModel
pyspark.ml.classification.RandomForestClassificationModel
pyspark.ml.classification.NaiveBayesModel

# clustering
pyspark.ml.clustering.BisectingKMeansModel
pyspark.ml.clustering.KMeansModel
pyspark.ml.clustering.GaussianMixtureModel

# Regression
pyspark.ml.regression.AFTSurvivalRegressionModel
pyspark.ml.regression.DecisionTreeRegressionModel
pyspark.ml.regression.GBTRegressionModel
pyspark.ml.regression.GeneralizedLinearRegressionModel
pyspark.ml.regression.LinearRegressionModel
pyspark.ml.regression.RandomForestRegressionModel

# Featurizer model
pyspark.ml.feature.BucketedRandomProjectionLSHModel
pyspark.ml.feature.ChiSqSelectorModel
pyspark.ml.feature.CountVectorizerModel
pyspark.ml.feature.IDFModel
pyspark.ml.feature.ImputerModel
pyspark.ml.feature.MaxAbsScalerModel
pyspark.ml.feature.MinHashLSHModel
pyspark.ml.feature.MinMaxScalerModel
pyspark.ml.feature.OneHotEncoderModel
pyspark.ml.feature.RobustScalerModel
pyspark.ml.feature.RFormulaModel
pyspark.ml.feature.StandardScalerModel
pyspark.ml.feature.StringIndexerModel
pyspark.ml.feature.VarianceThresholdSelectorModel
pyspark.ml.feature.VectorIndexerModel
pyspark.ml.feature.UnivariateFeatureSelectorModel

# composite model
pyspark.ml.classification.OneVsRestModel

# pipeline model
pyspark.ml.pipeline.PipelineModel

# Hyper-parameter tuning
pyspark.ml.tuning.CrossValidatorModel
pyspark.ml.tuning.TrainValidationSplitModel
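
A custom allowlist can be prepared as a newline-delimited file of fully-qualified class names and then passed through the “spark.mlflow.pysparkml.autolog.logModelAllowlistFile” Spark config. The sketch below is one possible way to do this; the helper names `write_allowlist` and `build_session`, the file location, and the choice of classes are assumptions for illustration, and setting the config on the session builder is only one way to set a Spark config.

```python
import os
import tempfile

# Illustrative choice of classes; ALSModel is added here only as an example
# of a model type that the built-in allowlist leaves out.
CUSTOM_ALLOWLIST = [
    "pyspark.ml.regression.LinearRegressionModel",
    "pyspark.ml.recommendation.ALSModel",
]


def write_allowlist(path, class_names):
    """Write one fully-qualified class name per line and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(class_names) + "\n")
    return path


def build_session(allowlist_file):
    """Create a SparkSession whose config points at the custom allowlist."""
    # Deferred import so the file-writing part works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .config("spark.mlflow.pysparkml.autolog.logModelAllowlistFile", allowlist_file)
        .getOrCreate()
    )


allowlist_path = write_allowlist(
    os.path.join(tempfile.mkdtemp(), "allowlist.txt"), CUSTOM_ALLOWLIST
)
```

With pyspark installed, `build_session(allowlist_path)` would be called before `mlflow.pyspark.ml.autolog()` so that the allowlist is in place when models are fitted.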