Fine-Tuning Transformers with MLflow for Enhanced Model Management
Welcome to our in-depth tutorial on fine-tuning Transformers models with enhanced management using MLflow.
What You Will Learn in This Tutorial
- Understand the process of fine-tuning a Transformers model.
- Learn to effectively log and manage the training cycle using MLflow.
- Master logging the trained model separately in MLflow.
- Gain insights into using the trained model for practical inference tasks.
Our approach will provide a holistic understanding of model fine-tuning and management, ensuring that you're well-equipped to handle similar tasks in your projects.
Emphasizing Fine-Tuning
Fine-tuning pre-trained models is a common practice in machine learning, especially in the field of NLP. It involves adjusting a pre-trained model to make it more suitable for a specific task. This process is essential as it allows the leveraging of pre-existing knowledge in the model, significantly improving performance on specific datasets or tasks.
Role of MLflow in Model Lifecycle
Integrating MLflow in this process is crucial for:
- Training Cycle Logging: Keeping a detailed log of the training cycle, including parameters, metrics, and intermediate results.
- Model Logging and Management: Separately logging the trained model, tracking its versions, and managing its lifecycle post-training.
- Inference and Deployment: Using the logged model for inference, ensuring easy transition from training to deployment.
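To preview how these three roles translate into code later in the tutorial, here is a minimal, illustrative sketch; the run name, metric values, tuned_pipeline placeholder, and artifact path are assumptions for illustration only, not the tutorial's exact code.
import mlflow

# Illustrative sketch only: the names and values below are placeholders.
with mlflow.start_run(run_name="illustrative-run") as run:
    # 1. Training cycle logging: record parameters and metrics from the training loop.
    mlflow.log_param("num_train_epochs", 3)
    mlflow.log_metric("eval_accuracy", 0.95, step=1)

    # 2. Model logging: after fine-tuning, the tuned pipeline would be logged separately, e.g.
    # mlflow.transformers.log_model(transformers_model=tuned_pipeline, artifact_path="model")

# 3. Inference and deployment: the logged model can later be reloaded for predictions, e.g.
# loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
# loaded.predict(["Congratulations! You've won a free prize."])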
# Disable tokenizers warnings when constructing pipelines
%env TOKENIZERS_PARALLELISM=false
import warnings
# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)
env: TOKENIZERS_PARALLELISM=false
Preparing the Dataset and Environment for Fine-Tuning
Key Steps in this Section
- Loading the Dataset: Utilizing the sms_spam dataset for spam detection.
- Splitting the Dataset: Dividing the dataset into training and test sets with an 80/20 distribution.
- Importing Necessary Libraries: Including libraries like evaluate, mlflow, numpy, and essential components from the transformers library.
Before diving into the fine-tuning process, setting up our environment and preparing the dataset is crucial. This step involves loading the dataset, splitting it into training and testing sets, and initializing essential components of the Transformers library. These preparatory steps lay the groundwork for an efficient fine-tuning process.
This setup ensures that we have a solid foundation for fine-tuning our model, with all the necessary data and tools at our disposal. In the following Python code, we'll execute these steps to kickstart our model fine-tuning journey.
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)
import mlflow
# Load the "sms_spam" dataset.
sms_dataset = load_dataset("sms_spam")
# Split train/test by an 8/2 ratio.
sms_train_test = sms_dataset["train"].train_test_split(test_size=0.2)
train_dataset = sms_train_test["train"]
test_dataset = sms_train_test["test"]
Tokenization and Dataset Preparation
In the next code block, we tokenize our text data, preparing it for the fine-tuning process of our model.
With our dataset loaded and split, the next step is to prepare our text data for the model. This involves tokenizing the text, a crucial process in NLP where text is converted into a format that's understandable and usable by our model.
Tokenization Process
- Loading the Tokenizer: Using the AutoTokenizer from the transformers library for the distilbert-base-uncased model's tokenizer.
- Defining the Tokenization Function: Creating a function to tokenize text data, including padding and truncation.
- Applying Tokenization to the Dataset: Processing both the training and testing sets for model readiness.
Tokenization is a critical step in preparing text data for NLP tasks. It ensures that the data is in a format that the model can process, and by handling aspects like padding and truncation, it ensures consistency across our dataset, which is vital for training stability and model performance.
# Load the tokenizer for "distilbert-base-uncased" model.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(examples):
    # Pad/truncate each text to 128 tokens. Enforcing the same shape
    # across examples can make training faster.
    return tokenizer(
        examples["sms"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )
seed = 22
# Tokenize the train and test datasets
train_tokenized = train_dataset.map(tokenize_function)
train_tokenized = train_tokenized.remove_columns(["sms"]).shuffle(seed=seed)
test_tokenized = test_dataset.map(tokenize_function)
test_tokenized = test_tokenized.remove_columns(["sms"]).shuffle(seed=seed)
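As a quick, optional sanity check (an addition for illustration, not part of the original notebook), you can inspect a single tokenized record to confirm that padding and truncation produced the fixed 128-token shape:
# Optional sanity check: every tokenized example now has a fixed length of 128 tokens.
sample = train_tokenized[0]
print(len(sample["input_ids"]))  # 128
print(sorted(sample.keys()))     # typically ['attention_mask', 'input_ids', 'label']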
Model Initialization and Label Mapping
Next, we'll set up label mappings and initialize the model for our text classification task.
Having prepared our data, the next crucial step is to initialize our model and set up label mappings. This involves defining a clear relationship between the labels in our dataset and their corresponding representations in the model.
Setting Up Label Mappings
- Defining Label Mappings: Creating bi-directional mappings between integer labels and textual representations ("ham" and "spam").
Initializing the Model
- Model Selection: Choosing the distilbert-base-uncased model for its balance of performance and efficiency.
- Model Configuration: Configuring the model for sequence classification with the defined label mappings.
Proper model initialization and label mapping are key to ensuring that the model accurately understands and processes the task at hand. By explicitly defining these mappings and selecting an appropriate pre-trained model, we lay the groundwork for effective and efficient fine-tuning.
# Set the mapping between int label and its meaning.
id2label = {0: "ham", 1: "spam"}
label2id = {"ham": 0, "spam": 1}
# Acquire the model from the Hugging Face Hub, providing label and id mappings so that both we and the model can 'speak' the same language.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    label2id=label2id,
    id2label=id2label,
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Setting Up Evaluation Metrics
Next, we focus on defining and computing evaluation metrics to measure our model's performance accurately.
After initializing our model, the next critical step is to define how we'll evaluate its performance. Accurate evaluation is key to understanding how well our model is learning and performing on the task.
Choosing and Loading the Metric
- Metric Selection: Opting for 'accuracy' as the evaluation metric.
- Loading the Metric: Utilizing the evaluate library to load the 'accuracy' metric.
Defining the Metric Computation Function
- Function for Metric Computation: Creating a function, compute_metrics, for calculating accuracy during model evaluation.
- Processing Predictions: Handling logits and labels from predictions to compute accuracy.
Properly setting up evaluation metrics allows us to objectively measure the model's performance. By using standardized metrics, we can compare our model's performance against benchmarks or other models, ensuring that our fine-tuning process is effective and moving in the right direction.
# Define the target optimization metric
metric = evaluate.load("accuracy")
# Define a function for calculating our defined target optimization metric during training
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
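For intuition, here is a tiny illustrative check (with made-up logits and labels, not part of the original notebook) showing how compute_metrics turns raw model outputs into an accuracy score:
# Illustrative check with made-up values: argmax over the logits yields predictions [0, 1, 1],
# which match the reference labels exactly, so the reported accuracy is 1.0.
dummy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]])
dummy_labels = np.array([0, 1, 1])
print(compute_metrics((dummy_logits, dummy_labels)))  # {'accuracy': 1.0}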
Configuring the Training Environment
In this step, we configure our Trainer, supplying important training configurations via the TrainingArguments API.
With our model and metrics ready, the next important step is to configure the training environment. This involves setting up the training arguments and initializing the Trainer, a component that orchestrates the model training process.
Training Arguments Configuration
- Defining the Output Directory: We specify the training_output_dir where our model checkpoints will be saved during training. This helps in managing and storing model states at different stages of training.
- Specifying Training Arguments: We create an instance of TrainingArguments to define various parameters for training, such as the output directory, evaluation strategy, batch sizes for training and evaluation, logging frequency, and the number of training epochs. These parameters are critical for controlling how the model is trained and evaluated.
Initializing the Trainer
- Creating the Trainer Instance: We use the Trainer class from the Transformers library, providing it with our model, the previously defined training arguments, datasets for training and evaluation, and the function to compute metrics.
- Role of the Trainer: The Trainer handles all aspects of training and evaluating the model, including the execution of training loops, handling of data batching, and calling the compute metrics function. It simplifies the training process, making it more streamlined and efficient.
Importance of Proper Training Configuration
Setting up the training environment correctly is essential for effective model training. Proper configuration ensures that the model is trained under optimal conditions, leading to better performance and more reliable results.
In the following code block, we'll configure our training environment and initialize the Trainer, setting the stage for the actual training process.
# Checkpoints will be output to this `training_output_dir`.
training_output_dir = "/tmp/sms_trainer"
training_args = TrainingArguments(
    output_dir=training_output_dir,
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=8,
    num_train_epochs=3,
)
# Instantiate a `Trainer` instance that will be used to initiate a training run.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics,
)
# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.
# mlflow.set_tracking_uri("http://127.0.0.1:8080")
Integrating MLflow for Experiment Tracking
The final preparatory step before beginning the training process is to integrate MLflow for experiment tracking.
MLflow is a critical tool in our workflow, enabling us to log, monitor, and compare different runs of our model training.
Setting up the MLflow Experiment
- Naming the Experiment: We use mlflow.set_experiment to create a new experiment or assign the current run to an existing experiment. In this case, we name our experiment "Spam Classifier Training". This name should be descriptive and related to the task at hand, aiding in organizing and identifying experiments later.
- Role of MLflow in Training: By setting up an MLflow experiment, we can track various aspects of our model training, such as parameters, metrics, and outputs. This tracking is invaluable for comparing different models, tuning hyperparameters, and maintaining a record of our experiments.
Benefits of Experiment Tracking
Utilizing MLflow for experiment tracking offers several advantages:
- Organization: Keeps your training runs organized and easily accessible.
- Comparability: Allows for easy comparison of different training runs to understand the impact of changes in parameters or data.
- Reproducibility: Enhances the reproducibility of experiments by logging all necessary details.
With MLflow set up, we're now ready to begin the training process, keeping track of every important aspect along the way.
In the next code snippet, we'll set up our MLflow experiment for tracking the training of our spam classification model.
# Pick a name that you like and reflects the nature of the runs that you will be recording to the experiment.
mlflow.set_experiment("Spam Classifier Training")
<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/transformers/tutorials/fine-tuning/mlruns/258758267044147956', creation_time=1701291176206, experiment_id='258758267044147956', last_update_time=1701291176206, lifecycle_stage='active', name='Spam Classifier Training', tags={}>
Starting the Training Process with MLflow
In this step, we initiate the fine-tuning training run, utilizing the native auto-logging functionality to record the parameters used and loss metrics calculated during the training process.
With our model, training arguments, and MLflow experiment set up, we are now ready to start the actual training process. This step involves initiating an MLflow run, which will encapsulate all the training activities and metrics.
Initiating the MLflow Run
- Starting an MLflow Run: We use mlflow.start_run() to begin a new MLflow run. This function creates a new run context, under which all the training operations and logging will occur.
- Training the Model: Inside the MLflow run context, we call trainer.train() to start training our model. This function will run the training loop, processing the data in batches, updating model parameters, and evaluating the model.
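Putting these pieces together, the training step itself is short; a minimal sketch of the pattern described above looks like this:
# Open an MLflow run and train inside it so that the parameters and loss metrics
# produced during training are recorded against this run, as described above.
with mlflow.start_run() as run:
    trainer.train()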