ID3 Algorithm in Machine Learning
By Rahul Singh
Updated on Jun 22, 2026 | 10 min read | 3.3K+ views
The ID3 (Iterative Dichotomiser 3) algorithm is a widely used decision tree algorithm developed by Ross Quinlan in 1986 for classification problems. It builds a decision tree by repeatedly selecting the feature that provides the highest information gain, splitting the dataset into smaller and more organized groups. This top-down approach continues until the data is classified into distinct categories, or no further meaningful splits can be made.
This blog covers everything you need to know about the ID3 algorithm in machine learning, whether you are just starting out or want a deeper technical understanding. You will learn how ID3 works, the math behind it (explained simply), a step-by-step worked example, and how to implement it in Python.
The ID3 algorithm builds a decision tree by splitting data step by step. At each step, it picks the feature that gives the most information about the target label.
The process follows a top-down, greedy approach. "Greedy" here means it picks the best split at each node right now, without looking ahead to future splits. It does not backtrack or revise earlier choices.
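Here is how the algorithm runs from start to finish: it computes the entropy of the current dataset, computes the information gain of every remaining feature, splits on the feature with the highest gain, and repeats the process on each subset until every subset is pure or no features remain.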
Entropy measures disorder or uncertainty in a dataset. If a dataset has an equal mix of classes, entropy is high. If all examples belong to one class, entropy is zero.
The formula for entropy is:
Entropy(S) = -sum [ p(i) * log2(p(i)) ]
Where p(i) is the proportion of examples belonging to class i.
Example: If a dataset has 5 "Yes" and 5 "No" labels:
Entropy = -(0.5 * log2(0.5)) - (0.5 * log2(0.5)) = 1.0
For a two-class problem, an entropy of 1.0 is maximum uncertainty. An entropy of 0 means all samples belong to one class.
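The entropy formula can be checked with a few lines of Python. This is a minimal sketch, not a library API: the function name `entropy` and the three sample label lists are illustrative assumptions. It uses the equivalent form sum of p * log2(1/p), which gives the same value as the formula above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    # p * log2(1/p) summed over classes equals -sum(p * log2(p))
    return sum((c / n) * math.log2(n / c) for c in counts.values())

balanced = ["Yes"] * 5 + ["No"] * 5   # equal mix of two classes
pure = ["Yes"] * 10                   # a single class
skewed = ["Yes"] * 9 + ["No"] * 1     # mostly one class

print(entropy(balanced))          # 1.0
print(entropy(pure))              # 0.0
print(round(entropy(skewed), 3))  # 0.469
```

The skewed list lands between the two extremes, which matches the intuition that entropy falls as one class starts to dominate.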
Information gain tells you how much a feature reduces uncertainty. A higher information gain means the feature is more useful for splitting.
The formula is:
Information Gain = Entropy(Parent) - Weighted Average Entropy(Children)
ID3 always picks the feature with the highest information gain at each step.
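To make the "weighted average" in the formula concrete, here is a small sketch that works directly from class counts. The helper names and the example split (a 5 Yes / 5 No parent divided into two groups of five) are hypothetical and chosen only for illustration.

```python
import math

def entropy_from_counts(counts):
    """Entropy (base 2) from per-class counts, e.g. [5, 5]."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(
        (sum(child) / n) * entropy_from_counts(child)
        for child in children_counts
    )
    return entropy_from_counts(parent_counts) - weighted

# Hypothetical split: counts are [Yes, No] in each group
parent = [5, 5]
children = [[4, 1], [1, 4]]

print(round(information_gain(parent, children), 3))  # 0.278

# A split that leaves the class mix unchanged gains nothing
print(information_gain([4, 4], [[2, 2], [2, 2]]))    # 0.0
```

Each child is weighted by its share of the parent's examples, so a large impure child costs more than a small one.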
Let us walk through a classic ID3 example using a weather dataset to predict whether someone will play tennis.
| Outlook | Temperature | Humidity | Wind | Play Tennis |
| --- | --- | --- | --- | --- |
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
| Sunny | Cool | Normal | Weak | Yes |
| Rain | Mild | Normal | Weak | Yes |
| Sunny | Mild | Normal | Strong | Yes |
| Overcast | Mild | High | Strong | Yes |
| Overcast | Hot | Normal | Weak | Yes |
| Rain | Mild | High | Strong | No |
Total: 14 examples (9 Yes, 5 No).
Entropy(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14)
= -(0.643 * -0.637) - (0.357 * -1.485)
= 0.940
After splitting on each feature and calculating weighted child entropy, the information gains are approximately:
| Feature | Information Gain |
| --- | --- |
| Outlook | 0.247 |
| Temperature | 0.029 |
| Humidity | 0.152 |
| Wind | 0.048 |
Outlook has the highest information gain, so it becomes the root node.
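The gains for the root split can be reproduced from the 14 rows. The sketch below is one way to do it, assuming the table is pasted in as whitespace-separated text; the names `RAW`, `COLUMNS`, and `gains` are illustrative. Printed values should agree with the table to within rounding in the third decimal place.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play Tennis"]
RAW = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
data = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    total = entropy([r[target] for r in rows])
    weighted = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return total - weighted

gains = {f: information_gain(data, f, "Play Tennis") for f in COLUMNS[:-1]}
for feature, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{feature:<12}{gain:.4f}")
```

Sorting by gain puts Outlook first, followed by Humidity, Wind, and Temperature.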
The algorithm repeats the process for the Sunny and Rain subsets. For the Sunny subset, Humidity gives the highest information gain. For Rain, Wind does.
The final tree classifies all 14 training examples correctly.
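Written out as nested dictionaries, the finished tree looks roughly like the sketch below. The leaf labels are read off the table (for example, every Overcast row is Yes), and the `predict` helper is a hypothetical addition for illustration, since the walkthrough itself only describes how the tree is built.

```python
# Tree from the walkthrough: Outlook at the root, Humidity under Sunny, Wind under Rain
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}

def predict(node, example):
    """Follow branches until a leaf (a plain label string) is reached."""
    while isinstance(node, dict):
        feature = next(iter(node))            # each internal node has exactly one key
        node = node[feature][example[feature]]  # KeyError if the value was never seen
    return node

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play Tennis"]
RAW = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
examples = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]

correct = sum(predict(tree, ex) == ex["Play Tennis"] for ex in examples)
print(f"{correct} of {len(examples)} training examples classified correctly")
```

Note that Temperature never appears in the tree: once Outlook, Humidity, and Wind have been used, every subset is already pure.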
Now let us look at how to implement the ID3 algorithm in Python. We will keep things simple and practical.
You need Python 3 and a basic understanding of dictionaries and recursion.
import math
from collections import Counter

# Calculate entropy of a list of labels
def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0
    counts = Counter(labels)
    return -sum((count/n) * math.log2(count/n) for count in counts.values())

# Calculate information gain of a feature
def information_gain(data, feature, target):
    total_entropy = entropy([row[target] for row in data])
    # Get unique values for this feature
    values = set(row[feature] for row in data)
    # Weighted entropy after split
    weighted_entropy = 0
    for val in values:
        subset = [row for row in data if row[feature] == val]
        weight = len(subset) / len(data)
        weighted_entropy += weight * entropy([row[target] for row in subset])
    return total_entropy - weighted_entropy

# Build the ID3 decision tree
def id3(data, features, target):
    labels = [row[target] for row in data]
    # If all labels are same, return that label
    if len(set(labels)) == 1:
        return labels[0]
    # If no features left, return majority label
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Find best feature to split on
    best_feature = max(features, key=lambda f: information_gain(data, f, target))
    tree = {best_feature: {}}
    remaining_features = [f for f in features if f != best_feature]
    # Build subtrees for each value
    for val in set(row[best_feature] for row in data):
        subset = [row for row in data if row[best_feature] == val]
        if not subset:
            tree[best_feature][val] = Counter(labels).most_common(1)[0][0]
        else:
            tree[best_feature][val] = id3(subset, remaining_features, target)
    return tree

# Sample data (abbreviated)
dataset = [
    {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Overcast", "Humidity": "High", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Humidity": "Normal", "Wind": "Weak", "Play": "Yes"},
    # ... add all 14 rows
]
features = ["Outlook", "Humidity", "Wind"]
target = "Play"

tree = id3(dataset, features, target)
print(tree)
This implementation is short enough to read in one sitting and easy to extend. For production use, scikit-learn's DecisionTreeClassifier with criterion='entropy' applies the same entropy-based split criterion at scale.
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Encode categorical data
data = pd.DataFrame(dataset)
le = LabelEncoder()
for col in data.columns:
    data[col] = le.fit_transform(data[col])

X = data.drop("Play", axis=1)
y = data["Play"]

# ID3 uses entropy criterion
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X, y)
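The criterion='entropy' parameter tells scikit-learn to choose splits by information gain, as ID3 does. The tree itself is still built CART-style, with binary splits on numerically encoded features, so it will not match an ID3 tree node for node.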
Understanding where the ID3 algorithm shines and where it falls short helps you decide when to use it.
| Limitation | What It Means |
| --- | --- |
| Overfitting | ID3 builds trees that fit training data perfectly, often too perfectly |
| No pruning | It does not simplify the tree after building it |
| Numerical features | It handles only categorical data natively; continuous values need binning |
| Bias toward many values | Features with many unique values get higher information gain unfairly |
| No missing values | ID3 cannot handle missing data without preprocessing |
These limitations are why Quinlan later developed C4.5 and C5.0, which handle continuous features, missing values, and pruning.
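The bias toward many-valued features is easy to demonstrate. In the sketch below, `day` is a hypothetical row-ID column (D1 to D14) that is not part of the Play Tennis table; it is added only to show what happens when every row has a unique value. The labels and Outlook values are the 14 rows from the example, in order.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain from splitting labels by the parallel list of feature values."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

labels = "No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No".split()
outlook = ("Sunny Sunny Overcast Rain Rain Rain Overcast "
           "Sunny Sunny Rain Sunny Overcast Overcast Rain").split()
day = [f"D{i}" for i in range(1, 15)]  # hypothetical unique ID per row

gain_outlook = information_gain(outlook, labels)
gain_day = information_gain(day, labels)

print(f"Outlook: {gain_outlook:.3f}")  # about 0.247
print(f"Day:     {gain_day:.3f}")      # about 0.940, the full parent entropy
```

Every single-row subset is trivially pure, so the ID column removes all of the entropy and would be chosen as the root, even though it says nothing useful about an unseen day.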
ID3 was the starting point, but the field has moved on. Here is how it compares:
| Feature | ID3 | C4.5 | CART |
| --- | --- | --- | --- |
| Splitting criterion | Information Gain | Gain Ratio | Gini Impurity |
| Numerical features | No | Yes | Yes |
| Pruning | No | Yes | Yes |
| Missing values | No | Yes | Yes |
| Multi-way splits | Yes | Yes | Binary only |
| Used in scikit-learn | No (entropy criterion only) | No | Yes (optimized version) |
When to use ID3: For learning, coursework, and categorical-only datasets where interpretability matters most.
When to use CART or C4.5: For real-world projects where data is messy, features are numerical, or you need better generalization.
Despite its age, the ideas behind the ID3 algorithm are used in several domains.
The algorithm's strength is its transparency. In regulated industries like finance and healthcare, being able to explain a prediction is not optional. ID3 and its descendants make that possible.
The ID3 algorithm in machine learning is a foundational concept that every ML learner should understand. It teaches you the core idea of how machines make decisions: by asking the most informative questions first, using entropy and information gain as guides.
Starting from a simple dataset, ID3 builds a tree that mirrors how a human expert might reason through a problem. The math is straightforward, the logic is visual, and the Play Tennis example makes it concrete.
The ID3 algorithm, short for Iterative Dichotomiser 3, is a decision tree learning algorithm developed by Ross Quinlan in 1986. It builds a tree by selecting the feature with the highest information gain at each node, splitting the dataset step by step until all subsets are pure or no features remain.
Iterative Dichotomiser 3 refers to the method of iteratively splitting (dichotomising) a dataset into smaller groups based on feature values. The "3" indicates it was the third version Quinlan developed. Each iteration selects the most informative split using information gain.
Information gain measures how much a particular feature reduces uncertainty or entropy in the dataset after splitting. In the ID3 algorithm, the feature with the highest information gain is chosen as the splitting criterion at each node of the decision tree.
C4.5 is an improved version of ID3 built by the same researcher. While ID3 uses information gain and works only with categorical data, C4.5 uses gain ratio, handles continuous numerical features, manages missing values, and supports post-pruning to reduce overfitting. C4.5 is more practical for real-world datasets.
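Gain ratio is not defined in this article, so the sketch below assumes the standard definition: information gain divided by the "split information", which is the entropy of the branch sizes themselves. The function names are illustrative, and the counts are the [Yes, No] tallies for each branch of the Play Tennis data.

```python
import math

def entropy(counts):
    """Entropy (base 2) from a list of counts; zero counts are skipped."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def gain_ratio(children):
    """Information gain divided by split information (assumed C4.5-style definition)."""
    parent = [sum(col) for col in zip(*children)]   # class totals across branches
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    split_info = entropy([sum(ch) for ch in children])  # entropy of branch sizes
    return gain / split_info if split_info > 0 else 0.0

# [Yes, No] counts per branch
outlook = [[2, 3], [4, 0], [3, 2]]   # Sunny, Overcast, Rain
humidity = [[3, 4], [6, 1]]          # High, Normal

print(f"Outlook gain ratio:  {gain_ratio(outlook):.3f}")   # about 0.156
print(f"Humidity gain ratio: {gain_ratio(humidity):.3f}")  # about 0.152
```

Outlook's three-way split is penalised more than Humidity's two-way split, so the gap between them narrows sharply compared with plain information gain.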
The standard ID3 algorithm does not natively support continuous numerical features. You need to manually bin or discretize numerical values into categories before using ID3. C4.5 and CART both handle this automatically, which is one reason they are more widely used in practice.
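Manual binning can be as simple as the sketch below. The `make_binner` helper and the temperature thresholds (15 and 25 degrees Celsius) are illustrative assumptions, not values from the dataset; in practice the cut points would come from domain knowledge or from the data.

```python
from bisect import bisect_right

def make_binner(edges, names):
    """Return a function that maps a number to a category name.

    edges must be sorted ascending, with one more name than edges.
    A value equal to an edge falls into the higher bin.
    """
    if len(names) != len(edges) + 1:
        raise ValueError("need exactly one more name than edges")
    def binner(x):
        return names[bisect_right(edges, x)]
    return binner

# Hypothetical thresholds in degrees Celsius
bin_temperature = make_binner([15, 25], ["Cool", "Mild", "Hot"])

readings = [8.5, 14.9, 15.0, 22.3, 31.0]
print([bin_temperature(t) for t in readings])
# ['Cool', 'Cool', 'Mild', 'Mild', 'Hot']
```

After this step the Temperature column holds only Cool, Mild, or Hot, which is the form ID3 expects.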
ID3 keeps growing the tree until every subset is pure or no features remain, so it fits the training data as closely as the features allow. It has no pruning step to simplify the tree. This means it often creates branches that capture noise or rare cases in the training data rather than general patterns, which leads to poor performance on unseen data.
You can implement the ID3 algorithm in Python from scratch using recursion. The core steps are calculating entropy, computing information gain for each feature, picking the best feature, splitting the data, and repeating. You can also use scikit-learn's DecisionTreeClassifier with the parameter criterion='entropy' to get an entropy-based tree at scale, though it builds binary splits rather than reproducing ID3 exactly.
ID3 works best with categorical data, where each feature takes a limited set of discrete values. It is well-suited for small to medium-sized datasets where interpretability is important. For numerical, large, or mixed-type datasets, algorithms like CART or C4.5 perform better.
In production systems, ID3 has largely been replaced by its successors like C4.5, CART, and ensemble methods such as Random Forest and Gradient Boosting. However, ID3 remains highly relevant in education and research. Understanding it builds the conceptual foundation needed to learn more powerful algorithms.
The time complexity of the ID3 algorithm is roughly O(m * n * d), where m is the number of features, n is the number of training examples, and d is the depth of the tree. For a reasonably balanced tree d is about log n, which gives the commonly quoted O(m * n * log n). Building the tree requires evaluating information gain for every remaining feature at each node, which can become slow for very large datasets with many features.
ID3 uses information gain (based on entropy) as its splitting criterion and can produce multi-way splits based on the number of feature values. CART, which stands for Classification and Regression Trees, uses Gini impurity and always creates binary splits. CART also handles regression tasks and supports pruning, which ID3 does not.