The Ultimate Guide to fit, transform, and fit_transform in Scikit-learn

In machine learning, it's important to understand the difference between the fit, transform, and fit_transform methods in Scikit-learn. Each of these methods plays a distinct role in preparing data and training models, and each has its own specific purpose.

In this blog post, I'll discuss the differences between these three methods, as well as the concepts of transformers and Model objects in Scikit-learn. I will also show you how to use these methods in practice, with some code examples.

Transformers and Models in scikit-learn

When building a machine learning model, we typically follow these steps:

  • Data Preprocessing

    • Exploratory Data Analysis

    • Feature Engineering

    • Feature Scaling

  • Model Training

    • Model creation

    • Hyperparameter tuning

To support these processes, the Scikit-learn library provides two categories of objects: transformers and models.

Transformers and models are two important concepts in Scikit-learn. Transformers are objects that transform data in some way, while models are objects that make predictions based on data.

Transformers

Transformers are used to prepare data for modeling by doing feature transformations. They can be used to perform a variety of tasks, such as:

  • Scaling data

  • Normalizing data

  • Encoding categorical data

  • Imputing missing values

Transformers are typically used in a pipeline, which is a sequence of transformers applied to the data in order. For example, you might have a pipeline that first imputes missing values and then scales the resulting features.
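As a rough sketch of that idea (assuming purely numeric features, and choosing Pipeline, SimpleImputer, and StandardScaler purely for illustration), such a pipeline might look like this:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical preprocessing pipeline for numeric features:
# fill in missing values first, then scale the result.
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# X_numeric is assumed to be a 2-D array of numeric features.
# X_prepared = numeric_pipeline.fit_transform(X_numeric)

Each step in the pipeline is itself a transformer, and the pipeline applies them in the order listed.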

Models

Models are used to make predictions based on data. They are typically trained on a dataset of labeled data, and then they can be used to make predictions on new data.

There are many different types of models in Scikit-learn, such as:

  • Linear regression models

  • Logistic regression models

  • Decision tree models

  • Random forest models

  • Support vector machines

Models are typically used in conjunction with transformers. For example, you might use a transformer to prepare the data for a model, and then use the model to make predictions.
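To make this concrete, here is a minimal sketch (the choice of StandardScaler and LogisticRegression is purely illustrative) of a transformer and a model chained together in a single Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The transformer prepares the data; the final estimator makes predictions.
clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# Assuming X_train, y_train, and X_test already exist:
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)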

Fit, transform, and fit_transform for transformers and models

Depending on whether these methods are applied to transformers or models, they perform different tasks.

Transformers

As we discussed earlier, transformers are used for feature transformation. This process consists of two steps, performed in order: fit and transform.
During fit, the transformer calculates the values it needs; during transform, it uses those values to carry out the feature transformation.

.fit():
Calculates the values that the transformer will need.
For example, if we are using StandardScaler (which rescales a feature to mean 0 and standard deviation 1), fit computes the mean and standard deviation of the feature we wish to transform. Note that fit does not change the feature values; it only calculates the statistics that will be needed when the transformation formula is applied.
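A minimal sketch of this behaviour, using StandardScaler on a tiny made-up feature:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # a made-up single feature

scaler = StandardScaler()
scaler.fit(X)            # learns the statistics only; X itself is unchanged

print(scaler.mean_)      # [2.5]      -> the learned mean
print(scaler.scale_)     # [1.118...] -> the learned standard deviation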

.transform():
Transforms the selected feature using the chosen transformer and the values calculated in the previous step.

.fit_transform():
Performs both the fit and transform steps in a single call.
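Putting the three methods side by side (a sketch that repeats the made-up feature from above so it runs on its own):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # same made-up feature as above

scaler = StandardScaler()
scaler.fit(X)                       # step 1: learn mean_ and scale_
X_scaled = scaler.transform(X)      # step 2: apply (x - mean_) / scale_

X_scaled_2 = StandardScaler().fit_transform(X)  # both steps in one call

print(np.allclose(X_scaled, X_scaled_2))  # True -> the two routes give the same result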

Models

A model object is used for two main tasks: training (creating) our model and fine-tuning its hyperparameters. The model object receives, as input, the data that was preprocessed in the previous steps.
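For the hyperparameter-tuning side, a hedged sketch using GridSearchCV (the parameter grid and cv value below are illustrative assumptions, not recommendations) could look like this:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid only; these values are assumptions, not tuned recommendations.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=5)
# Assuming X_train_scaled and y_train exist (as in the practical example below):
# search.fit(X_train_scaled, y_train)   # runs the whole search on the training data
# best_model = search.best_estimator_   # the estimator refit with the best hyperparameters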

We perform two main operations on this object:

  • fit: adjusts the model's parameters and weights using the training data

  • predict: generates predictions on the testing data

Now, it is possible that the testing data is not in the desired form. In that case, the testing data is transformed using the same transformation that was fitted during training (remember fit and transform for preprocessing?). We do not fit a new transformation on the test set; reusing the training-time transformation keeps the test data consistent and prevents information from the test set leaking into the preprocessing step.

Practical implementation

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a model object (Support Vector Machine Classifier)
model = SVC()

# Create a transformer object (StandardScaler)
scaler = StandardScaler()

# Demonstrating 'fit' and 'transform' for transformer object
scaler.fit(X_train)  # Fit the transformer to the training data
X_train_scaled = scaler.transform(X_train)  # Transform the training data
X_test_scaled = scaler.transform(X_test)    # Transform the testing data using the same transformer

# Alternatively, you can use 'fit_transform' to combine the above steps into one
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

# Demonstrating 'fit' and 'predict' for model object
model.fit(X_train_scaled, y_train)  # Fit the model to the scaled training data
y_pred = model.predict(X_test_scaled)  # Predict using the model on the scaled testing data

# Print the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In this example, we first load the Iris dataset and split it into training and testing sets. Then, we create a StandardScaler object as a transformer and an SVC (Support Vector Machine) classifier as a model. We demonstrate how to use fit and transform separately for the transformer, and fit and predict for the model. Alternatively, we can use the fit_transform method on the transformer to perform both steps in one go. Finally, we fit the model to the transformed training data, predict on the transformed testing data, and calculate the accuracy of the model's predictions.

Conclusion

Understanding the differences between the fit, fit_transform, and transform methods in scikit-learn is essential for effectively using machine learning models and transformers.

The fit method is used to train a model or transformer on the input data, allowing it to learn from the training samples. The transform method, on the other hand, applies the learned transformation to new data, making it possible to preprocess unseen samples based on the previously learned parameters.

When convenience and efficiency are paramount, the fit_transform method offers a streamlined approach by combining the fitting and transforming steps in one operation.

By mastering these methods and understanding their distinct purposes, data scientists and machine learning practitioners can leverage scikit-learn's powerful functionalities to build robust and efficient pipelines for data preprocessing and modeling.

With the right application of these methods, one can significantly enhance the accuracy and performance of machine learning models, unlocking the full potential of their data-driven applications. So, embrace the versatility of fit, fit_transform, and transform methods in scikit-learn, and elevate your machine learning projects to new heights of success.

Happy coding and may your data explorations lead to valuable insights and discoveries!