Decision Tree Regression: What You Need to Know in 2024

Last updated: 14th Jun, 2023

To begin with, a regression model is a model that outputs a numeric value for a given set of input values. The inputs are usually numeric as well, or are encoded as numbers before training. A classification model, by contrast, assigns each input to one of the classes or groups defined in the problem statement.


The number of classes can be as small as 2 or as large as 1000 or more. There are many regression models, such as linear regression, multivariate regression, Ridge regression, and many more. Logistic regression, despite its name, is normally used for classification.

Decision tree regression models also belong to this pool of regression models. A decision tree is a predictive model that applies a sequence of simple rules (binary rules in many implementations) to arrive at its output: a class label in classification, or a numeric target value in regression.


The decision tree model, as the name suggests, is a tree-like model that has leaves, branches, and nodes.

Terminologies to Remember

Before we delve into the algorithm, here are some important terms to be aware of.

1. Root node: The topmost node, where the splitting begins.

2. Splitting: The process of subdividing a single node into multiple sub-nodes.

3. Terminal node or leaf node: Nodes that don’t split further are called terminal nodes.

4. Pruning: The process of removing sub-nodes from a tree, typically to reduce its size and limit overfitting.

5. Parent node: A node that splits further into sub-nodes.

6. Child node: The sub-nodes that emerge from a parent node.
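To make these terms concrete, the sketch below fits a small scikit-learn tree and prints its structure. The toy data and variable names are invented for illustration; scikit-learn builds binary trees, so every parent node here has exactly two child nodes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data (invented for illustration): one feature, numeric target.
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Node 0 is the root node. A node without children is a leaf (terminal) node;
# scikit-learn marks "no child" with -1 in children_left.
structure = tree.tree_
is_leaf = structure.children_left == -1

print("total nodes:", structure.node_count)
print("leaf nodes :", int(is_leaf.sum()))
print(export_text(tree, feature_names=["x"]))
```

In the printed text, each indented `|---` line is a child of the line above it, and lines that show a `value` are leaf nodes.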


How does it work?

The decision tree breaks down the data set into smaller and smaller subsets. A decision node splits into two or more branches, each representing a value of the attribute under examination. The topmost node in the decision tree corresponds to the best predictor and is called the root node. The tree described here is built with the ID3 algorithm, adapted for regression by using standard deviation reduction in place of information gain as the split measure.

The algorithm employs a top-down, greedy approach, and splits are made based on standard deviation. As a quick refresher, standard deviation measures how far a set of data points is spread out from its mean value.

Decision trees used for regression also have several notable properties:

Interpretability: Decision trees give a clear, easy-to-follow picture of the decision-making process.

Nonlinearity: Decision trees can capture nonlinear relationships between the input features and the target variable.

Missing data: Some decision tree implementations can handle missing values without imputation, although support depends on the library and version.

Feature importance: Decision trees can indicate the relative importance of each feature in predicting the target variable.

Outlier sensitivity: Decision trees are generally less sensitive to outliers in the input features than many other regression techniques.
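The nonlinearity and feature importance points can be illustrated with a short sketch. It assumes scikit-learn is used and relies on synthetic data made up for this example: the target depends non-linearly on the first column only, while the second column is noise.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))

# Synthetic target: a non-linear function of column 0; column 1 is irrelevant.
y = np.sin(X[:, 0])

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# feature_importances_ sums to 1; a larger value means the feature
# contributed more to reducing the split criterion.
importances = tree.feature_importances_
print(dict(zip(["informative", "noise"], importances.round(3))))
```

The informative column should receive nearly all of the importance, even though its relationship with the target is far from linear.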

Standard deviation quantifies the overall variability of a data distribution. The more dispersed the data points are around the mean, the higher the standard deviation. We use standard deviation to measure the homogeneity of the sample.

If the sample is completely homogeneous, its standard deviation is zero. Similarly, the greater the heterogeneity, the greater the standard deviation. The mean of the sample and the number of samples are required to calculate the standard deviation.
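As a small sketch of this calculation, the function below computes the population standard deviation from the mean and the number of samples. The salary figures are hypothetical.

```python
import numpy as np

def std_dev(values):
    """Population standard deviation: sqrt(mean((x - mean)^2))."""
    values = np.asarray(values, dtype=float)
    return float(np.sqrt(np.mean((values - values.mean()) ** 2)))

salaries = [25, 30, 46, 45, 52, 23, 43, 35, 38, 46]  # hypothetical values
constant = [40, 40, 40, 40]                          # fully homogeneous sample

print(std_dev(salaries))   # spread of a mixed sample
print(std_dev(constant))   # a homogeneous sample has zero spread
```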

We use a mathematical function, the coefficient of variation (CV), to decide when the splitting should stop. It is calculated by dividing the standard deviation by the mean of all the samples, and is usually expressed as a percentage.
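A minimal sketch of this stopping measure, standard deviation divided by the mean (commonly called the coefficient of variation), is shown below. The branch values and the 10% threshold are illustrative assumptions, and the function assumes a non-zero mean.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation divided by the mean, as a percentage.

    Assumes the mean is not zero.
    """
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / mean * 100

branch = [48, 50, 52, 50]          # hypothetical salaries in one branch
cv = coefficient_of_variation(branch)
stop_splitting = cv < 10           # illustrative threshold of 10%

print(round(cv, 2), stop_splitting)
```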

The final prediction is the average of the target values that fall in a leaf node. Say, for example, that November is a node holding the November salaries from previous years (up to 2021). The predicted salary for November 2022 would then be the average of all the salaries under the November node.
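The same behaviour can be checked with scikit-learn, as a rough sketch. The salary figures are hypothetical and the month is encoded as a number (10 for October, 11 for November) purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical salaries from earlier years; the only feature is the month.
X = np.array([[10], [10], [11], [11], [11]])
y = np.array([2800.0, 2900.0, 3000.0, 3200.0, 3100.0])

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# All November rows end up in the same leaf, so the prediction for a new
# November is the average of the November salaries seen during training.
prediction = tree.predict([[11]])[0]
november_mean = y[X[:, 0] == 11].mean()
print(prediction, november_mean)
```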

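Next comes the standard deviation for two attributes, the target and a predictor. In the example above, salary could be split by whether it is paid on an hourly or a monthly basis. The standard deviation after the split is the weighted average of the standard deviations of the branches, with each branch weighted by its share of the samples.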

To construct an accurate decision tree, the goal is to find the attribute that returns the highest standard deviation reduction. In simple words, the attribute that produces the most homogeneous branches.
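The standard deviation reduction for a single attribute can be sketched as follows. The helper name `sd_reduction` and the data are hypothetical; each branch is weighted by its share of the samples, and the population standard deviation is used.

```python
import numpy as np

def sd_reduction(target, attribute):
    """S(target) minus the size-weighted standard deviation of each branch."""
    target = np.asarray(target, dtype=float)
    attribute = np.asarray(attribute)
    weighted = 0.0
    for value in np.unique(attribute):
        branch = target[attribute == value]
        weighted += len(branch) / len(target) * branch.std()
    return float(target.std() - weighted)

# Hypothetical data: salary plus two candidate attributes.
salary = [20, 22, 21, 40, 42, 41]
pay_basis = ["hourly", "hourly", "hourly", "monthly", "monthly", "monthly"]
boss_mood = ["good", "bad", "good", "bad", "good", "bad"]

sdr_basis = sd_reduction(salary, pay_basis)
sdr_mood = sd_reduction(salary, boss_mood)
print(round(sdr_basis, 3), round(sdr_mood, 3))
```

Here `pay_basis` separates the salaries into two tight groups, so it yields a much larger reduction than `boss_mood` and would be preferred as the splitting attribute.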

The process of creating a decision tree for regression covers five important steps.

1. Firstly, we calculate the standard deviation of the target variable. Take the target variable to be salary, as in the previous examples. We therefore calculate the standard deviation of the set of salary values.

2. In step 2, the data set is split on each of the candidate attributes. Since the target value is salary, possible attributes include month, hours, the mood of the boss, designation, years in the company, and so on. The standard deviation of each branch is calculated using the formula above, and the branch values are combined into a weighted average according to the share of samples in each branch. This value is then subtracted from the standard deviation before the split. The result is called the standard deviation reduction.


3. Once the reduction has been calculated as described in the previous step, the best attribute is the one with the largest standard deviation reduction. In other words, the standard deviation before the split should be greater than the size-weighted standard deviation after the split, and the larger the gap, the better the attribute. When population standard deviations are used, this weighted value never exceeds the standard deviation before the split, so the reduction cannot be negative.

4. The dataset is divided according to the values of the selected attribute. This process is repeated recursively on every non-leaf branch until all the data has been processed. Suppose month is selected as the best splitting attribute based on its standard deviation reduction. We then get 12 branches, one for each month. Each of these branches is split further using the best attribute from the remaining set of attributes.

5. In practice, we need a stopping criterion. For this, we use the coefficient of variation (CV): when the CV of a branch falls below a chosen threshold, such as 10%, we stop splitting that branch. Because no further splitting happens, the value predicted for that branch is the average of all the target values under that node.
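The steps above can be put together in a compact sketch. This is an illustrative toy implementation for categorical attributes, not the code of any particular library; the function names, the dictionary-based tree layout, and the data are all assumptions made for this example.

```python
import numpy as np

def sd_reduction(rows, target, attr):
    """Standard deviation before the split minus the weighted value after it."""
    y = np.array([r[target] for r in rows], dtype=float)
    weighted = 0.0
    for value in {r[attr] for r in rows}:
        branch = np.array([r[target] for r in rows if r[attr] == value], dtype=float)
        weighted += len(branch) / len(y) * branch.std()
    return y.std() - weighted

def build_tree(rows, target, attrs, cv_threshold=10.0):
    y = np.array([r[target] for r in rows], dtype=float)
    mean = y.mean()
    # Simplification: treat a zero mean as "stop", since CV is undefined there.
    cv = y.std() / mean * 100 if mean != 0 else 0.0
    # Step 5: stop when the branch is homogeneous enough or no attribute is left.
    if cv < cv_threshold or not attrs:
        return float(mean)  # leaf: average of the target values
    # Steps 1-3: choose the attribute with the largest standard deviation reduction.
    best = max(attrs, key=lambda a: sd_reduction(rows, target, a))
    remaining = [a for a in attrs if a != best]
    # Step 4: one branch per value of the chosen attribute, built recursively.
    children = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        children[value] = build_tree(subset, target, remaining, cv_threshold)
    return {"attribute": best, "children": children, "fallback": float(mean)}

def predict(tree, row):
    # Walk down until a leaf (a plain number) is reached; unseen attribute
    # values fall back to the mean stored at the current node.
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attribute"]], tree["fallback"])
    return tree

# Hypothetical training data.
rows = [
    {"designation": "junior", "basis": "hourly", "salary": 20},
    {"designation": "junior", "basis": "monthly", "salary": 22},
    {"designation": "senior", "basis": "hourly", "salary": 40},
    {"designation": "senior", "basis": "monthly", "salary": 44},
    {"designation": "junior", "basis": "hourly", "salary": 21},
    {"designation": "senior", "basis": "monthly", "salary": 43},
]

tree = build_tree(rows, "salary", ["designation", "basis"])
junior_prediction = predict(tree, {"designation": "junior", "basis": "hourly"})
print(tree["attribute"], junior_prediction)
```

On this data the root splits on `designation`, and both resulting branches fall below the 10% CV threshold, so they become leaves that predict the branch average.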


Implementation

Decision tree regression can be implemented in Python with the scikit-learn library, using the sklearn.tree.DecisionTreeRegressor class.
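A minimal usage sketch, assuming scikit-learn is installed, might look like this. The diabetes dataset ships with scikit-learn; the split ratio and `max_depth=3` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)   # one numeric prediction per test row
r2 = model.score(X_test, y_test)      # coefficient of determination (R^2)
print(predictions[:5], round(r2, 3))
```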

Some of the important parameters are as follows:

1. criterion: The function used to measure the quality of a split. In recent scikit-learn releases the accepted values include "squared_error" (mean squared error), "friedman_mse", and "absolute_error" (mean absolute error); older releases named the first and last of these "mse" and "mae". The default value is "squared_error".

2. max_depth: The maximum depth of the tree. The default value is None, which lets the tree grow until its other stopping conditions are met.

3. max_features: The number of features to consider when looking for the best split. The default value is None, which means all features are considered.

4. splitter: The strategy used to choose the split at each node. Available values are "best" and "random". The default value is "best".
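As an illustration, the parameters above can be passed to the constructor like this. The sketch assumes a recent scikit-learn release, where the mean squared error criterion is spelled "squared_error" (older releases used "mse"); the chosen values are arbitrary.

```python
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(
    criterion="squared_error",  # split quality measure (recent releases)
    max_depth=4,                # limit the depth of the tree
    max_features=None,          # consider all features at each split
    splitter="best",            # pick the best split rather than a random one
    random_state=0,             # make results reproducible
)

params = model.get_params()
print(params["criterion"], params["max_depth"], params["splitter"])
```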

Methods to avoid overfitting in decision tree regression

Setting a Maximum Depth Limit: The max_depth parameter constrains the depth of the decision tree, which keeps it from becoming overly complex and overfitting the training data.

Pruning: After the decision tree has been constructed, pruning procedures can be used to remove branches or nodes that don’t substantially improve predictive performance. In scikit-learn, cost-complexity pruning is controlled by the ccp_alpha parameter.
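The sketch below compares an unconstrained tree with a depth-limited tree and a pruned tree, assuming scikit-learn. The pruning strength is taken from the middle of the values returned by `cost_complexity_pruning_path`, which is an arbitrary choice for illustration; in practice it would be tuned, for example with cross-validation.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows until the leaves are pure.
full = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Depth-limited tree.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

# Cost-complexity pruning: a larger ccp_alpha removes more of the tree.
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary, not tuned
pruned = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

for name, m in [("full", full), ("max_depth=3", shallow), ("pruned", pruned)]:
    print(
        name,
        "leaves:", m.get_n_leaves(),
        "train R^2:", round(m.score(X_train, y_train), 3),
        "test R^2:", round(m.score(X_test, y_test), 3),
    )
```

A large gap between the training and test scores of the full tree is the typical sign of overfitting that the other two variants try to reduce.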

Decision tree regression can also be combined with ensemble techniques to improve predictive accuracy. Random Forest and Gradient Boosting, two popular ensemble approaches, both build on many decision trees.
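As a rough comparison, the sketch below cross-validates a single tree against the two ensemble methods using scikit-learn. The dataset, the number of folds, and the default hyperparameters are illustrative assumptions; results on other data may differ.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# Mean R^2 over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```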

Example from sklearn documentation

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> regressor = DecisionTreeRegressor(random_state=0)
>>> cross_val_score(regressor, X, y, cv=10)  # doctest: +SKIP
array([-0.39…, -0.46…,  0.02…,  0.06…, -0.50…,
        0.16…,  0.11…, -0.73…, -0.30…, -0.00…])


Limitations of Decision Tree Regression

Overfitting: Decision trees are vulnerable to overfitting, particularly when they grow too deep or complex. This can lead to poor generalization to unseen data. Overfitting can be reduced with methods such as pruning, regularization, and setting a maximum depth.

Instability: Decision trees are unstable and sensitive to even minor changes in the training data. Adding or deleting a few data points can drastically alter a tree’s structure. Random forests and other ensemble techniques can help improve stability.

Linear relationships: Decision trees are not very good at capturing linear relationships between attributes and the target variable, because their predictions are piecewise constant. They work better for problems with complex or non-linear relationships.
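The limitation with linear relationships is easy to see when predicting outside the training range. The sketch below uses an exactly linear toy target, invented for this example, and compares a tree with a linear model from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy data: the target is exactly 3x + 2 for x in [0, 9.5].
X_train = np.arange(0, 10, 0.5).reshape(-1, 1)
y_train = 3.0 * X_train.ravel() + 2.0

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
line = LinearRegression().fit(X_train, y_train)

# x = 20 lies well outside the training range.
X_new = np.array([[20.0]])
tree_pred = tree.predict(X_new)[0]   # stuck at the value of the last leaf
line_pred = line.predict(X_new)[0]   # follows the linear trend

print(tree_pred, line_pred)
```

The tree returns the value of its right-most leaf (the largest target seen in training), while the linear model continues the trend.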


Decision tree regression can work with both categorical and numerical attributes. In many implementations, including scikit-learn, categorical variables must be converted into numerical form before training. One-hot encoding and label encoding are common encoding methods.
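A short sketch of one-hot encoding before fitting a tree is shown below, assuming scikit-learn. The column names and values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: one categorical attribute and one numerical attribute.
designation = np.array(
    [["junior"], ["senior"], ["junior"], ["manager"], ["senior"], ["manager"]]
)
years = np.array([[1], [5], [2], [8], [6], [10]])
salary = np.array([20.0, 40.0, 22.0, 60.0, 43.0, 65.0])

# One-hot encode the categorical column: one 0/1 column per category.
encoder = OneHotEncoder(handle_unknown="ignore")
designation_encoded = encoder.fit_transform(designation).toarray()

# Combine the encoded columns with the numerical column and fit the tree.
X = np.hstack([designation_encoded, years])
model = DecisionTreeRegressor(random_state=0).fit(X, salary)

# New rows must be encoded with the same fitted encoder.
new_row = np.hstack([encoder.transform([["senior"]]).toarray(), [[7]]])
prediction = model.predict(new_row)[0]
print(list(encoder.categories_[0]), prediction)
```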



Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.

Frequently Asked Questions (FAQs)

1. What is regression analysis in machine learning?

Regression is a set of statistical methods used in machine learning to predict a continuous outcome from the values of one or more predictor variables. It is a fundamental topic within supervised machine learning. It helps in understanding the relationships between variables, in particular how a change in one variable affects another. Both input features and output labels are used to train a regression algorithm.

2. What is meant by multicollinearity in machine learning?

Multicollinearity is a condition in which two or more independent variables in a dataset are highly correlated with one another. In a regression model, this means that one independent variable can be predicted from another independent variable. Multicollinearity can lead to wider confidence intervals for the coefficients, making estimates of each variable’s influence less reliable. It is best addressed before modelling, since it interferes with ranking the most influential variables.

3. What is meant by bagging in machine learning?

Bagging, short for bootstrap aggregation, is an ensemble learning strategy that lowers variance and is often used when the dataset is noisy. It draws random samples from the training set with replacement, so individual data points can be picked more than once, trains a model on each sample, and combines their predictions. In machine learning, the random forest algorithm is essentially an extension of bagging.
