XGBoost - Regressor

XGBoost is a commonly used ensemble learning algorithm. In one of my previous blogs, we discussed how to build an XGBoost classifier to solve classification problems. Today we will see how to build an XGBoost regressor to solve regression problems.

While most of the steps remain similar, there will be some changes because we will be dealing with continuous values.

We will take a sample training dataset consisting of three columns: Exp, Gap, and Salary. Our task is to predict a person's salary based on their experience in number of years and whether or not they have a gap. Here, Gap is a categorical feature and Salary is the target.

Step 1: Create a Base model.

Since the output feature is a continuous value, the output of this base model will be assumed to be the average salary, i.e. (40 + 42 + 52 + 60 + 62)/5 = 51.2, which we round to 51K for the rest of the walkthrough.

Step 2: Calculate the residuals.

We know that in the XGBoost algorithm, residuals are used to build the decision trees and to calculate their outputs.

The residual for a record is given by: Salary in that record - Average Salary. For our data, the residuals are -11, -9, +1, +9, and +11.
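To make Steps 1 and 2 concrete, here is a minimal Python sketch on the toy dataset (under the blog's simplifications; the salaries are the five values listed above, in thousands):

```python
# Step 1: the base model predicts the average salary for every record.
salaries = [40, 42, 52, 60, 62]
base_prediction = round(sum(salaries) / len(salaries))  # 51.2, rounded to 51

# Step 2: residual = actual salary - base model's prediction.
residuals = [s - base_prediction for s in salaries]
print(residuals)  # [-11, -9, 1, 9, 11]
```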

Step 3: Create the decision tree.

  • First, we will have to select the feature to be used as the root node of the decision tree for splitting. This will be done based on the Gain, which is computed from similarity weights as shown below.

  • Even though Exp is a continuous feature, we still need to do a binary split in our tree (that is, only two branches should be created). To deal with this, we will pick a candidate threshold from the records (e.g. 2.5 years) at which the split will happen. The threshold that gives the maximum Gain will be selected for splitting the root node.

Let us assume we begin by splitting the tree based on the first record. Then the left node will have the records with experience less than or equal to 2 years, with residuals {-11}, and the right node will have the records with experience greater than 2 years, with residuals {-9, +1, +9, +11}.

Now we will calculate the similarity weight. The similarity weight formula for the XGBoost regressor is slightly different from the one used in the XGBoost classifier:

$$\text{Similarity Weight} = \frac{\left(\sum \text{Residuals}\right)^2}{\text{Number of Residuals} + \lambda}$$

(Here λ is the regularization hyperparameter; we use λ = 1 in the calculations below, which is also XGBoost's default.)

  • For the left node: (-11)^2 / (1 + 1) = 60.5

  • For the right node: (-9 + 1 + 9 + 11)^2 / (4 + 1) = 144/5 = 28.8

  • For the root node: (-11 - 9 + 1 + 9 + 11)^2 / (5 + 1) = 1/6 ≈ 0.17

    Gain = left node S.W. + right node S.W. - root node S.W.
    = 60.5 + 28.8 - 0.17 = 89.13
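These calculations are easy to reproduce in code. Here is a small sketch, where `lam` stands for the λ in the similarity-weight formula (set to 1, matching the numbers above):

```python
def similarity_weight(residuals, lam=1.0):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right, lam=1.0):
    # Gain = left similarity + right similarity - root (parent) similarity,
    # where the root holds all residuals from both children.
    return (similarity_weight(left, lam)
            + similarity_weight(right, lam)
            - similarity_weight(left + right, lam))

left, right = [-11], [-9, 1, 9, 11]
print(similarity_weight(left))   # 60.5
print(similarity_weight(right))  # 28.8
print(gain(left, right))         # 89.13...
```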

Now, if we split based on the second record instead, we would get a different tree, and we would compute its Gain in exactly the same way.

Similarly, we will check the Gain for every candidate split; a sketch of this search appears below. For simplicity, we will assume that splitting based on the second record of the Experience feature gives us the maximum Gain. Next, we will further split the tree based on Gap.
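Here is a hedged sketch of that search, reusing `gain()` from the previous snippet. The exact Exp values are not reproduced in this text, so the ones below are assumptions for illustration only:

```python
exp = [2, 2.5, 4, 6, 7]          # assumed experience values, in years
residuals = [-11, -9, 1, 9, 11]  # residuals from Step 2

best_gain, best_threshold = float("-inf"), None
for threshold in exp:            # try a split at each record's Exp value
    left = [r for x, r in zip(exp, residuals) if x <= threshold]
    right = [r for x, r in zip(exp, residuals) if x > threshold]
    if not left or not right:    # skip splits that leave a node empty
        continue
    g = gain(left, right)        # gain() from the sketch above
    if g > best_gain:
        best_gain, best_threshold = g, threshold

print(best_threshold, best_gain)  # the split with the maximum Gain wins
```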

Since all the records in the left node of the Experience split belong to the same Gap category, we will not split that node further.

Now, whenever we get new data, we will pass it down the decision tree, and the prediction will be the output of the leaf node in which it lands. A leaf node may contain several residuals, so the output of the leaf node is taken to be the average of all these values.

We will then calculate new residuals against the updated predictions and repeat the process, depending on how many decision trees we want in our boosting algorithm; a sketch of one round follows.
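Under these simplifications, one boosting round looks like the sketch below; the split assumed here is the second-record split chosen above, which sends the first two records to the left leaf:

```python
def leaf_output(residuals):
    # A leaf predicts the average of the residuals that fall into it
    # (the blog's simplification of XGBoost's leaf-weight formula).
    return sum(residuals) / len(residuals)

salaries  = [40, 42, 52, 60, 62]
residuals = [-11, -9, 1, 9, 11]

# Leaf outputs for the assumed split: records 1-2 left, records 3-5 right.
outputs = [leaf_output(residuals[:2])] * 2 + [leaf_output(residuals[2:])] * 3

# Update the predictions with a learning rate of 0.5, then recompute the
# residuals; the next tree would be fitted on these new residuals.
predictions   = [51 + 0.5 * o for o in outputs]
new_residuals = [s - p for s, p in zip(salaries, predictions)]
print(predictions)    # [46.0, 46.0, 54.5, 54.5, 54.5]
print(new_residuals)  # [-6.0, -4.0, -2.5, 5.5, 7.5]
```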

The final output of this XGBoost regressor:

$$\text{Output} = \text{Base learner output} + \alpha_1 T_1(x) + \alpha_2 T_2(x) + \dots$$

(Here α₁, α₂, ... are learning-rate hyperparameters, and the base learner output was calculated in the first step.)

Assuming we have only one decision tree and α₁ = 0.5: the first record falls into the left leaf, whose residuals are -11 and -9, so the leaf output is their average, -10.

Output for the first record = 51 + 0.5 × (-10) = 46
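In practice, you would let the xgboost library do all of this for you. Here is a minimal sketch on the toy data; the Exp and Gap feature values are assumptions (the original table is not reproduced here), with Gap encoded as 0/1:

```python
import numpy as np
import xgboost as xgb

X = np.array([[2, 1], [2.5, 0], [4, 1], [6, 0], [7, 0]])  # [Exp, Gap], assumed
y = np.array([40, 42, 52, 60, 62])                        # Salary in thousands

model = xgb.XGBRegressor(
    n_estimators=1,     # a single tree, as in the walkthrough
    learning_rate=0.5,  # the alpha_1 in the formula above
    max_depth=2,
    base_score=51,      # start from the average salary
    reg_lambda=1,       # the lambda in the similarity-weight denominator
)
model.fit(X, y)
print(model.predict(X))
```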

That's it!! Hope this was helpful and gave you a better understanding of the XGBoost algorithm.