In machine learning, it's important to understand the difference between the fit, transform, and fit_transform methods in Scikit-learn. Each of these methods has its own specific purpose in preparing data and training models.
In this blog post, I'll discuss the differences between these three methods, as well as the concepts of transformer and model objects in Scikit-learn. I will also show you how to use these methods in practice, with some code examples.
Transformers and Models in scikit-learn
When building a machine learning model, we typically follow these steps:
Data Preprocessing
Exploratory Data Analysis
Feature Engineering
Feature Scaling
Model Creation
Model Training
Hyperparameter Tuning
To support these steps, the Scikit-learn library provides us with two categories of objects: models and transformers.
Transformers and models are two important concepts in Scikit-learn. Transformers are objects that transform data in some way, while models are objects that make predictions based on data.
Transformers
Transformers are used to prepare data for modeling by doing feature transformations. They can be used to perform a variety of tasks, such as:
Scaling data
Normalizing data
Encoding categorical data
Imputing missing values
Transformers are typically used in a pipeline, which is a sequence of transformers that are applied to data in order. For example, you might have a pipeline that first scales the data, then normalizes it, and then encodes the categorical data.
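As a minimal sketch of this idea, the pipeline below routes numeric and categorical columns through different transformers using a ColumnTransformer; the column names "age", "income", and "city" are hypothetical and chosen purely for illustration.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Numeric columns: impute missing values, then scale
numeric_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Route numeric and categorical columns to the appropriate transformers
preprocess = ColumnTransformer(transformers=[
    ("num", numeric_steps, ["age", "income"]),  # hypothetical numeric columns
    ("cat", OneHotEncoder(), ["city"]),         # hypothetical categorical column
])
# preprocess.fit_transform(X) would learn and apply all of these transformations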
Models
Models are used to make predictions based on data. They are typically trained on a dataset of labeled data, and then they can be used to make predictions on new data.
There are many different types of models in Scikit-learn, such as:
Linear regression models
Logistic regression models
Decision tree models
Random forest models
Support vector machines
Models are typically used in conjunction with transformers. For example, you might use a transformer to prepare the data for a model, and then use the model to make predictions.
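As a rough sketch of this pattern (StandardScaler and LogisticRegression are chosen here only as examples), the transformer and the model can even be chained into a single Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# A pipeline whose last step is a model: scale the features, then classify
clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
# clf.fit(X_train, y_train)     # fits the scaler, transforms X_train, then fits the model
# y_pred = clf.predict(X_test)  # applies the fitted scaler to X_test, then predicts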
Fit, transform and fit_transform w.r.t. transformers and models
Depending on whether these methods are applied to transformers or to models, they perform different tasks.
Transformers
As we discussed earlier, transformers are used for feature transformation. This process comprises two steps: fit and transform (performed in that order).
During fit, we calculate the values needed by our transformer, and then use these values in the transform step to carry out the feature transformation.
.fit():
It is used to calculate the values that the transformer will need.
For example, if we are using StandardScaler (which rescales a feature to mean 0 and standard deviation 1), we need to find the mean and standard deviation of the feature we wish to transform. Note that fit does not change the feature values; it only computes the statistics that will be needed when the transformation formula is applied.
.transform()
It transforms the selected feature based on the transformer we are using and the values calculated in the previous step (for StandardScaler, subtracting the mean and dividing by the standard deviation).
.fit_transform()
Performs both the fit and transform steps in a single call.
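To make the three methods concrete, here is a small sketch with StandardScaler on a toy array (the numbers are made up purely for illustration):
import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[1.0], [2.0], [3.0]])   # a single toy feature
scaler = StandardScaler()
scaler.fit(X)                         # computes the statistics only; X is unchanged
print(scaler.mean_, scaler.scale_)    # the learned mean and standard deviation
X_scaled = scaler.transform(X)                       # applies (X - mean_) / scale_
X_scaled_again = StandardScaler().fit_transform(X)   # fit and transform in one call
print(np.allclose(X_scaled, X_scaled_again))         # True: both approaches are equivalent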
Model
A model object is used to perform two main tasks: training (creating) our model and fine-tuning its hyperparameters. The model object receives the data that was pre-processed in the previous steps as its input.
We perform two main operations on this Object:
fit: learning the model's parameters and weights from the training data
predict: generating predictions on the testing data
Now, it is possible that the testing data is not in the desired form. In that case, the testing data is transformed using the same transformation that was fitted during training (remember fit and transform for preprocessing?). We do not fit a new transformer on the test data; doing so would leak information from the test set into the preprocessing step and make the evaluation unreliable.
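A minimal sketch of the right and wrong way to handle this (using the Iris data purely for illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the parameters learned on the training data
# Do NOT call scaler.fit_transform(X_test): that would compute new statistics
# from the test set and leak information into the preprocessing step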
Practical implementation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a model object (Support Vector Machine Classifier)
model = SVC()
# Create a transformer object (StandardScaler)
scaler = StandardScaler()
# Demonstrating 'fit' and 'transform' for transformer object
scaler.fit(X_train) # Fit the transformer to the training data
X_train_scaled = scaler.transform(X_train) # Transform the training data
X_test_scaled = scaler.transform(X_test) # Transform the testing data using the same transformer
# Alternatively, you can use 'fit_transform' to combine the above steps into one
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)
# Demonstrating 'fit' and 'predict' for model object
model.fit(X_train_scaled, y_train) # Fit the model to the scaled training data
y_pred = model.predict(X_test_scaled) # Predict using the model on the scaled testing data
# Alternatively, you can use 'fit' and 'predict' together with 'fit_transform' on the transformer
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)
# model.fit(X_train_scaled, y_train)
# y_pred = model.predict(X_test_scaled)
# Print the accuracy of the model
accuracy = sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)
In this example, we first load the Iris dataset and split it into training and testing sets. Then, we create a StandardScaler object as a transformer and an SVC (Support Vector Machine) classifier as a model. We demonstrate how to use fit and transform separately for the transformer, and fit and predict for the model. Alternatively, we can use the fit_transform method on the transformer to perform both steps in one go. Finally, we fit the model to the transformed training data, predict on the transformed testing data, and calculate the accuracy of the model's predictions.
Conclusion
Understanding the differences between the fit, fit_transform, and transform methods in scikit-learn is essential for effectively using machine learning models and transformers.
The fit method is used to train a model or transformer on the input data, allowing it to learn from the training samples. On the other hand, the transform method is used to apply the learned transformation to new data, making it possible to preprocess or transform unseen samples based on the previously learned parameters.
When convenience and efficiency are paramount, the fit_transform method offers a streamlined approach by combining the fitting and transforming steps in one operation.
By mastering these methods and understanding their distinct purposes, data scientists and machine learning practitioners can leverage scikit-learn's powerful functionality to build robust and efficient pipelines for data preprocessing and modeling.
With the right application of these methods, one can significantly enhance the accuracy and performance of machine learning models, unlocking the full potential of their data-driven applications. So embrace the versatility of the fit, fit_transform, and transform methods in scikit-learn, and elevate your machine learning projects to new heights of success.
Happy coding and may your data explorations lead to valuable insights and discoveries!