mlflow.pyspark.ml
- mlflow.pyspark.ml.autolog(log_models=True, log_datasets=True, disable=False, exclusive=False, disable_for_unsupported_versions=False, silent=False, log_post_training_metrics=True, registered_model_name=None, log_input_examples=False, log_model_signatures=True, log_model_allowlist=None, extra_tags=None)
- Note: Autologging is known to be compatible with the following package versions: 3.2.1 <= pyspark <= 3.5.5. Autologging may not succeed when used with package versions outside of this range.
- Enables (or disables) and configures autologging for pyspark ml estimators. This method is not threadsafe. This API requires Spark 3.0 or above.
- When is autologging performed?
- Autologging is performed when you call Estimator.fit, except for estimators (featurizers) under pyspark.ml.feature.
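A minimal sketch of the trigger described above, assuming MLflow and a supported pyspark version are installed and that the caller supplies an active SparkSession. The function name `fit_with_autolog` and the toy training data are hypothetical; only `mlflow.pyspark.ml.autolog`, `mlflow.start_run`, and the pyspark ML classes come from the libraries themselves.

```python
def fit_with_autolog(spark):
    """Fit a LinearRegression with autologging on; `spark` is a SparkSession."""
    # Imported here so this module loads even where Spark/MLflow are absent.
    import mlflow
    import mlflow.pyspark.ml
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import LinearRegression

    # Enable autologging with its default settings.
    mlflow.pyspark.ml.autolog()

    # Toy data, made up for illustration only.
    train_df = spark.createDataFrame(
        [
            (1.0, Vectors.dense(1.0)),
            (2.0, Vectors.dense(2.0)),
            (3.0, Vectors.dense(3.0)),
        ],
        ["label", "features"],
    )

    with mlflow.start_run() as run:
        # Estimator.fit is the call that triggers autologging.
        model = LinearRegression(maxIter=5).fit(train_df)
    return run.info.run_id, model
```

Calling `fit_with_autolog(spark)` should leave a run whose params and tags describe the `LinearRegression` estimator. Fitting a featurizer from `pyspark.ml.feature` the same way would not be autologged.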
- Logged information
- Parameters
- Parameters obtained by estimator.params. If a param value is also an Estimator, then the params of the wrapped estimator are also logged, using the nested param key {estimator_uid}.{param_name}.
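The nested key format can be illustrated with a small pure-Python sketch. It does not call MLflow or Spark. The function `flatten_params`, the dict-based stand-in for a nested estimator, and the choice to record the wrapped estimator's uid as the outer param value are all assumptions made for illustration.

```python
def flatten_params(params):
    """Flatten a param mapping, expanding nested estimators.

    `params` maps param names to values. A nested estimator is represented
    here (hypothetically) as a dict with "uid" and "params" keys.
    """
    flat = {}
    for name, value in params.items():
        if isinstance(value, dict) and "uid" in value and "params" in value:
            # Assumption: the outer param records the wrapped estimator's uid.
            flat[name] = value["uid"]
            # Nested params are keyed as "{estimator_uid}.{param_name}".
            for nested_name, nested_value in flatten_params(value["params"]).items():
                flat[f"{value['uid']}.{nested_name}"] = nested_value
        else:
            flat[name] = value
    return flat


# Made-up params resembling a OneVsRest wrapping a LogisticRegression.
one_vs_rest_params = {
    "labelCol": "label",
    "classifier": {
        "uid": "LogisticRegression_1a2b",
        "params": {"maxIter": 10, "regParam": 0.01},
    },
}
flat = flatten_params(one_vs_rest_params)
```

Here `flat` contains the keys `LogisticRegression_1a2b.maxIter` and `LogisticRegression_1a2b.regParam` alongside the top-level `labelCol` and `classifier`.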
 
- Tags
- An estimator class name (e.g. “LinearRegression”). 
- A fully qualified estimator class name (e.g. “pyspark.ml.regression.LinearRegression”). 
 
 - Post training metrics
- When users call evaluator APIs after model training, MLflow tries to capture the Evaluator.evaluate results and log them as MLflow metrics to the Run associated with the model. All pyspark ML evaluators are supported.
- For post training metrics autologging, the metric key format is: “{metric_name}[-{call_index}]_{dataset_name}”
- The metric name is the name returned by Evaluator.getMetricName().
- If multiple calls are made to the same pyspark ML evaluator metric, each subsequent call adds a “call_index” (starting from 2) to the metric key. 
- MLflow uses the prediction input dataset variable name as the “dataset_name” in the metric key. The “prediction input dataset variable” refers to the variable which was used as the dataset argument of model.transform call. Note: MLflow captures the “prediction input dataset” instance in the outermost call frame and fetches the variable name in the outermost call frame. If the “prediction input dataset” instance is an intermediate expression without a defined variable name, the dataset name is set to “unknown_dataset”. If multiple “prediction input dataset” instances have the same variable name, then subsequent ones will append an index (starting from 2) to the inspected dataset name. 
 - Limitations
- MLflow cannot find run information for other objects derived from a given prediction result (e.g. by doing some transformation on the prediction result dataset). 
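The metric key format above can be sketched in plain Python. The class `MetricKeyBuilder` is hypothetical and is not MLflow's implementation. In particular, counting calls per (metric name, dataset name) pair is an assumption, since the text only says that repeated calls to the same metric receive an index.

```python
from collections import defaultdict


class MetricKeyBuilder:
    """Builds keys of the form "{metric_name}[-{call_index}]_{dataset_name}"."""

    def __init__(self):
        # Number of calls seen so far for each (metric_name, dataset_name) pair.
        self._call_counts = defaultdict(int)

    def next_key(self, metric_name, dataset_name="unknown_dataset"):
        self._call_counts[(metric_name, dataset_name)] += 1
        call_index = self._call_counts[(metric_name, dataset_name)]
        if call_index == 1:
            # The first call carries no index.
            return f"{metric_name}_{dataset_name}"
        # Later calls get "-2", "-3", ... after the metric name.
        return f"{metric_name}-{call_index}_{dataset_name}"


builder = MetricKeyBuilder()
keys = [
    builder.next_key("rmse", "test_df"),  # first evaluation on test_df
    builder.next_key("rmse", "test_df"),  # second evaluation on test_df
    builder.next_key("rmse"),             # dataset variable name not found
]
```

With these three calls, `keys` is `["rmse_test_df", "rmse-2_test_df", "rmse_unknown_dataset"]`.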
 
 
- Artifacts
- An MLflow Model with the mlflow.spark flavor containing a fitted estimator (logged by mlflow.spark.log_model()). Note that large models may not be autologged for performance and storage space considerations, and autologging for Pipelines and hyperparameter tuning meta-estimators (e.g. CrossValidator) is not yet supported. See the log_models param below for details.
- For post training metrics API calls, a “metric_info.json” artifact is logged. This is a JSON object whose keys are MLflow post training metric names (see “Post training metrics” section for the key format) and whose values are the corresponding evaluator information, including evaluator class name and evaluator params. 
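As a rough picture of the “metric_info.json” artifact, the sketch below builds and reads back a JSON object keyed by metric key. The field names `evaluator_class` and `params`, and all values shown, are assumptions for illustration. The text above only states that the evaluator class name and evaluator params are included.

```python
import json

# Hypothetical content of "metric_info.json" after one RegressionEvaluator call.
# The field names "evaluator_class" and "params" are assumptions.
metric_info = {
    "rmse_test_df": {
        "evaluator_class": "pyspark.ml.evaluation.RegressionEvaluator",
        "params": {
            "metricName": "rmse",
            "labelCol": "label",
            "predictionCol": "prediction",
        },
    }
}

serialized = json.dumps(metric_info, indent=2)

# Reading it back: look up which evaluator produced a given metric key.
loaded = json.loads(serialized)
for metric_key, evaluator_info in loaded.items():
    print(metric_key, "<-", evaluator_info["evaluator_class"])
```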
 
 
- How does autologging work for meta estimators?
- When a meta estimator (e.g. Pipeline, CrossValidator, TrainValidationSplit, OneVsRest) calls fit(), it internally calls fit() on its child estimators. Autologging does NOT perform logging on these constituent fit() calls.
- An “estimator_info.json” artifact is logged, which includes a hierarchy entry describing the hierarchy of the meta estimator. The hierarchy includes expanded entries for all nested stages, such as nested pipeline stages.
- Parameter search
- In addition to recording the information discussed above, autologging for parameter search meta estimators (CrossValidator and TrainValidationSplit) records child runs with metrics for each set of explored parameters, as well as artifacts and parameters for the best model and the best parameters (if available). For better readability, the “estimatorParamMaps” param of the parameter search estimator is recorded inside the “estimator_info.json” artifact. In addition to the “hierarchy”, that artifact records two more items: “tuning_parameter_map_list”, a list of all parameter maps used in tuning, and “tuned_estimator_parameter_map”, the parameter map of the tuned estimator. A “best_parameters.json” artifact is also recorded, containing the best parameters found by the search, along with a “search_results.csv” artifact containing the search results as a table with 2 columns: “params” and “metric”.
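Once downloaded, a “search_results.csv” artifact could be inspected with the standard library, as in the sketch below. The sample rows are made up, and the way parameter maps are encoded in the “params” column is an assumption. Only the two column names come from the description above. Whether a larger metric is better depends on the evaluator, so the helper `best_row` (a hypothetical name) takes that as an argument.

```python
import csv
import io

# Made-up rows in the two-column layout described above; real files will differ.
sample_csv = '''params,metric
"{""regParam"": 0.01}",0.81
"{""regParam"": 0.1}",0.86
"{""regParam"": 1.0}",0.79
'''


def best_row(csv_text, larger_is_better=True):
    """Return the row with the best value in the "metric" column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    pick = max if larger_is_better else min
    return pick(rows, key=lambda row: float(row["metric"]))


best = best_row(sample_csv)
```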
 
 - Parameters
- log_models – If True, trained models that are in the allowlist are logged as MLflow model artifacts. If False, trained models are not logged. Note: the built-in allowlist excludes some models (e.g. ALS models) which can be large. To specify a custom allowlist, create a file containing a newline-delimited list of fully-qualified estimator classnames, and set the “spark.mlflow.pysparkml.autolog.logModelAllowlistFile” Spark config to the path of your allowlist file.
- log_datasets – If True, dataset information is logged to MLflow Tracking. If False, dataset information is not logged.
- disable – If True, disables the pyspark ML autologging integration. If False, enables the pyspark ML autologging integration.
- exclusive – If True, autologged content is not logged to user-created fluent runs. If False, autologged content is logged to the active fluent run, which may be user-created.
- disable_for_unsupported_versions – If True, disable autologging for versions of pyspark that have not been tested against this version of the MLflow client or are incompatible.
- silent – If True, suppress all event logs and warnings from MLflow during pyspark ML autologging. If False, show all events and warnings during pyspark ML autologging.
- log_post_training_metrics – If True, post training metrics are logged. Defaults to True. See the post training metrics section for more details.
- registered_model_name – If given, each time a model is trained, it is registered as a new model version of the registered model with this name. The registered model is created if it does not already exist. 
- log_input_examples – If True, input examples from training datasets are collected and logged along with pyspark ml model artifacts during training. If False, input examples are not logged.
- log_model_signatures – If True, ModelSignatures describing model inputs and outputs are collected and logged along with spark ml pipeline/estimator artifacts during training. If False, signatures are not logged.
- Warning: Currently, only scalar Spark data types are supported. If model inputs/outputs contain non-scalar Spark data types such as pyspark.ml.linalg.Vector, signatures are not logged.
- log_model_allowlist – If given, it overrides the default log model allowlist in mlflow. This takes precedence over the spark configuration of “spark.mlflow.pysparkml.autolog.logModelAllowlistFile”.
- The default log model allowlist in mlflow:
    # classification
    pyspark.ml.classification.LinearSVCModel
    pyspark.ml.classification.DecisionTreeClassificationModel
    pyspark.ml.classification.GBTClassificationModel
    pyspark.ml.classification.LogisticRegressionModel
    pyspark.ml.classification.RandomForestClassificationModel
    pyspark.ml.classification.NaiveBayesModel
    # clustering
    pyspark.ml.clustering.BisectingKMeansModel
    pyspark.ml.clustering.KMeansModel
    pyspark.ml.clustering.GaussianMixtureModel
    # Regression
    pyspark.ml.regression.AFTSurvivalRegressionModel
    pyspark.ml.regression.DecisionTreeRegressionModel
    pyspark.ml.regression.GBTRegressionModel
    pyspark.ml.regression.GeneralizedLinearRegressionModel
    pyspark.ml.regression.LinearRegressionModel
    pyspark.ml.regression.RandomForestRegressionModel
    # Featurizer model
    pyspark.ml.feature.BucketedRandomProjectionLSHModel
    pyspark.ml.feature.ChiSqSelectorModel
    pyspark.ml.feature.CountVectorizerModel
    pyspark.ml.feature.IDFModel
    pyspark.ml.feature.ImputerModel
    pyspark.ml.feature.MaxAbsScalerModel
    pyspark.ml.feature.MinHashLSHModel
    pyspark.ml.feature.MinMaxScalerModel
    pyspark.ml.feature.OneHotEncoderModel
    pyspark.ml.feature.RobustScalerModel
    pyspark.ml.feature.RFormulaModel
    pyspark.ml.feature.StandardScalerModel
    pyspark.ml.feature.StringIndexerModel
    pyspark.ml.feature.VarianceThresholdSelectorModel
    pyspark.ml.feature.VectorIndexerModel
    pyspark.ml.feature.UnivariateFeatureSelectorModel
    # composite model
    pyspark.ml.classification.OneVsRestModel
    # pipeline model
    pyspark.ml.pipeline.PipelineModel
    # Hyper-parameter tuning
    pyspark.ml.tuning.CrossValidatorModel
    pyspark.ml.tuning.TrainValidationSplitModel
    # SynapseML models
    synapse.ml.cognitive.*
    synapse.ml.exploratory.*
    synapse.ml.featurize.*
    synapse.ml.geospatial.*
    synapse.ml.image.*
    synapse.ml.io.*
    synapse.ml.isolationforest.*
    synapse.ml.lightgbm.*
    synapse.ml.nn.*
    synapse.ml.opencv.*
    synapse.ml.stages.*
    synapse.ml.vw.*
 
- extra_tags – A dictionary of extra tags to set on each managed run created by autologging.
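The custom allowlist file mentioned under log_models can be sketched without Spark. The helpers `read_allowlist` and `is_allowed` below are hypothetical and are not MLflow functions. They assume, based on the layout of the default allowlist, that lines starting with `#` are comments and that an entry ending in `*` matches any class name with that prefix.

```python
import os
import tempfile


def read_allowlist(path):
    """Read a newline-delimited allowlist, skipping blank and "#" lines."""
    entries = set()
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                entries.add(stripped)
    return entries


def is_allowed(class_name, allowlist):
    """Exact match, or prefix match for entries ending in "*" (assumed)."""
    if class_name in allowlist:
        return True
    return any(
        entry.endswith("*") and class_name.startswith(entry[:-1])
        for entry in allowlist
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small allowlist file: one fully-qualified class name per line.
    allowlist_path = os.path.join(tmp_dir, "allowlist.txt")
    with open(allowlist_path, "w") as f:
        f.write(
            "# regression\n"
            "pyspark.ml.regression.LinearRegressionModel\n"
            "# SynapseML models\n"
            "synapse.ml.lightgbm.*\n"
        )
    allowlist = read_allowlist(allowlist_path)

lr_allowed = is_allowed("pyspark.ml.regression.LinearRegressionModel", allowlist)
als_allowed = is_allowed("pyspark.ml.recommendation.ALSModel", allowlist)
lgbm_allowed = is_allowed("synapse.ml.lightgbm.LightGBMClassificationModel", allowlist)
```

In a real session, the path of such a file is what the “spark.mlflow.pysparkml.autolog.logModelAllowlistFile” Spark config would point to. The sketch uses a temporary directory only so that it cleans up after itself.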