How Are Variables’ Splitting Values Decided in a Decision Tree?

Understand how a decision tree selects a variable at each step, and how the split value for that variable is decided during training for a regression problem

SAURABH RANA
5 min read · Mar 14, 2022
Photo by Fabrice Villard on Unsplash

A Quick intro to decision trees-

A decision tree is a supervised learning algorithm that uses a tree-like model of decisions, and it can be used for both classification and regression problems.

Another advantage of decision trees is that, unlike many other algorithms, they are very easy to interpret: their tree-like model often mimics human-level reasoning, which makes them one of the most popular supervised algorithms.

Creating a decision tree for regression or classification is an iterative process of splitting the data into subsets and then splitting those subsets further at each branch. The difference is that in classification we predict discrete class labels, whereas in regression we predict a continuous quantity.

The most intuitive way to explain any ML algorithm is with a sample dataset on which we can apply, step by step, what the algorithm does behind the scenes.

To explain regression with a decision tree, I am using a Kaggle dataset of US car prices (download using this link: Car Price Prediction).

Objective: To predict Car price

(4 Independent Variable + 1 Dependent Variable)

Open source: https://www.kaggle.com/hellbuoy/car-price-prediction

Sample Data:-


Let’s train the model using the following hyperparameters:

Max Depth=3 , Min Sample Leaf=5
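As a quick sketch, this is how such a model could be trained with scikit-learn, where these hyperparameters are called max_depth and min_samples_leaf (the toy data here is a stand-in for the Kaggle CSV, which you would load instead):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the car dataset: 4 features in roughly the article's
# ranges (carwidth, horsepower, peakrpm, citympg), price driven by one.
rng = np.random.default_rng(0)
X = rng.uniform([60, 48, 4150, 13], [72, 288, 6600, 49], size=(200, 4))
y = 300 * X[:, 1] + rng.normal(0, 2000, size=200)

# max_depth / min_samples_leaf correspond to Max Depth / Min Sample Leaf.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X, y)
print(tree.get_depth())  # at most 3
```

With these settings, the tree is guaranteed to stop at depth 3, and every leaf will contain at least 5 samples.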

Training Process of Decision tree for regression

The decision tree will fit each independent variable one by one to identify which of them predicts the target variable (car price in our case) most accurately.

While training an independent variable against the target variable, a decision tree, unlike linear regression, does not derive a slope; instead, it derives the split value for that variable which best predicts the target.

To find the best split value, the model will try splitting the independent variable at each possible value, derive the predicted value on each side as the mean price of the LHS and RHS subsets, and then calculate the MSE (mean squared error) between the predicted and actual values.

The split value at which we get the lowest MSE is selected as the optimal one.
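The per-split computation described above can be sketched in a few lines of NumPy (the toy numbers are illustrative, not from the dataset):

```python
import numpy as np

def split_mse(x, y, t):
    """MSE of predicting the mean of y on each side of the split x <= t."""
    left = x <= t
    pred = np.where(left, y[left].mean(), y[~left].mean())
    return np.mean((y - pred) ** 2)

# Toy example: a perfectly clean split at t = 15 gives zero error.
x = np.array([10, 12, 14, 20, 25, 30], dtype=float)
y = np.array([40000, 40000, 40000, 12000, 12000, 12000], dtype=float)
print(split_mse(x, y, 15))  # 0.0
```

Any other threshold mixes the two price groups on one side, so its MSE is strictly larger.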

Let’s visualize how the optimal split value is derived for variable — “citympg”


At split value = 15, the variable “citympg” predicts a car price of:

$40,786 if citympg ≤ 15

$12,536 if citympg > 15

MSE = 4.5573e+07


At split value = 20, the variable “citympg” predicts a car price of:

$21,416.8 if citympg ≤ 20

$10,115 if citympg > 20

MSE = 3.81882e+07

So if we compare just these two split values (i.e., 15 and 20), then split value = 20 is the better option, since its MSE is lower, which means the prediction is more accurate if we split the “citympg” variable at 20.

Just for visualization purposes we have considered only two split values, but in the backend the decision tree model computes the MSE for every candidate split value.

Derived MSE for all possible splits starting from 13 (lowest value in the “citympg”) to 49 (highest value in the “citympg”)

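This exhaustive scan can be sketched as follows, using midpoints between sorted unique values as candidate thresholds (a common convention, e.g. in scikit-learn; the toy data is illustrative):

```python
import numpy as np

def best_split(x, y):
    """Scan candidate thresholds for one variable, return the lowest-MSE one."""
    xs = np.unique(x)
    best_t, best_mse = None, np.inf
    for t in (xs[:-1] + xs[1:]) / 2:  # midpoints between adjacent values
        left = x <= t
        pred = np.where(left, y[left].mean(), y[~left].mean())
        mse = np.mean((y - pred) ** 2)
        if mse < best_mse:
            best_t, best_mse = t, mse
    return best_t, best_mse

# Toy mpg-like values with prices that drop sharply after the second point,
# so the best threshold falls between 15 and 18 (i.e., at 16.5).
x = np.array([13, 15, 18, 20, 23, 30, 49], dtype=float)
y = np.array([41000, 35000, 21000, 15000, 11000, 9000, 7000], dtype=float)
t, mse = best_split(x, y)
print(t)  # 16.5
```

Real implementations sort once and update the left/right sums incrementally, but the brute-force loop above is exactly the idea being visualized here.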

Let’s stack up the derived MSE for each split value on a graph-


From this graph, it can be visualised that we get the lowest MSE when the split value is in the range 22.1–22.5.

We will repeat the same steps for each variable to find out which variable is able to predict the output most accurately.

The optimal split value and its MSE for each variable:

Variable      Split Value   MSE
carwidth      69            35827155.7252
horsepower    118           29734915.6144
peakrpm       4778          56276165.7455
citympg       22.5          32011599.5319

So after visualizing the MSE vs. split value graph for each variable, we found that the variable “horsepower” has the lowest MSE, which means it can predict the target variable (price) more accurately than the other variables.

Therefore the model will split the root node on the variable “horsepower” using the condition horsepower ≤ 118: data entries that satisfy this condition become one node, and data entries that do not become the other.
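This partitioning step amounts to a boolean mask over the data (toy numbers, not the actual dataset):

```python
import numpy as np

# Partition toy data on the winning condition horsepower <= 118.
hp = np.array([70, 95, 110, 120, 150, 200], dtype=float)
price = np.array([8000, 9500, 11000, 18000, 25000, 39000], dtype=float)

mask = hp <= 118
left_prices, right_prices = price[mask], price[~mask]
print(left_prices.mean())   # the left node's prediction: 9500.0
print(right_prices.mean())  # the right node's prediction: mean of the rest
```

Each node then predicts the mean price of the entries that landed in it, until it is split further.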

Now we are left with one root node and two new sub-nodes:


Now, for each of the two sub-nodes, the decision tree will repeat all the steps we executed on the root node:

  • Select variables one by one
  • Train the variable by finding the split value where MSE is the lowest
  • Compare MSE values of all the variables
  • Select the variable with the lowest MSE
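Putting the steps above together, the variable selection at a node can be sketched as follows (the column names and toy data are illustrative, not taken from the dataset):

```python
import numpy as np

def best_split(x, y):
    """Lowest-MSE threshold for one variable (midpoints of unique values)."""
    xs = np.unique(x)
    best_t, best_mse = None, np.inf
    for t in (xs[:-1] + xs[1:]) / 2:
        left = x <= t
        pred = np.where(left, y[left].mean(), y[~left].mean())
        mse = np.mean((y - pred) ** 2)
        if mse < best_mse:
            best_t, best_mse = t, mse
    return best_t, best_mse

def choose_variable(X, y, names):
    """Compare every variable's best split and pick the overall lowest MSE."""
    scored = {n: best_split(X[:, i], y) for i, n in enumerate(names)}
    return min(scored.items(), key=lambda kv: kv[1][1])

# Toy data: price steps sharply at horsepower = 118, while peakrpm is noise,
# so horsepower should win the comparison, as it does in the article.
rng = np.random.default_rng(0)
hp = rng.uniform(48, 288, 50)
rpm = rng.uniform(4150, 6600, 50)
price = np.where(hp > 118, 30000.0, 10000.0) + rng.normal(0, 500, 50)
name, (t, mse) = choose_variable(np.column_stack([hp, rpm]),
                                 price, ["horsepower", "peakrpm"])
print(name)  # horsepower
```

Applying choose_variable recursively to each resulting subset, until a depth or leaf-size limit is hit, is exactly the training loop described here.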

As training moves on, these two sub-nodes will be split into at most four sub-nodes (depending on the Min Sample Leaf condition), which in turn will be split into at most eight leaf nodes; then training stops, because we used Max Depth = 3 as a hyperparameter.

Final Regression tree


This model might not be very accurate, but that was never the point of this article. I just wanted to explain how the algorithm runs in the backend and how it processes your training data to produce predictions, so that you don’t have to use it as a black-box model.
