Iris Dataset Classification Project Using Python
By Rohit Sharma
Updated on Jul 22, 2025 | 6 min read | 1.5K+ views
The Iris dataset is one of the most commonly used datasets among beginners entering data science and machine learning. It offers a clean, well-structured way to learn about classification, feature importance, and model accuracy.
In this project, you will build a machine learning model that classifies an iris flower into one of three species (Setosa, Versicolor, or Virginica) based on the length and width of its sepals and petals. The project is ideal for those new to supervised learning and provides a strong foundation in the main ML pipeline: loading data, preprocessing, model building, and evaluation.
It helps to have at least some background in:
For this project, the following tools and libraries will be used:
| Tool / Library | Purpose |
| Python | Programming language used for development |
| Google Colab | Cloud-based notebook environment for running Python code |
| Pandas | Data manipulation and analysis |
| NumPy | Numerical operations and array handling |
| Matplotlib | Basic plots and graphs for visualizing data |
| Seaborn | Enhanced statistical visualizations |
| Scikit-learn | Training classification models and evaluating performance |
Two simple but effective algorithms will be applied to classify iris flower species: Logistic Regression and K-Nearest Neighbors (KNN).
You can complete this Iris dataset classification project in about 1.5 to 2 hours. It’s a beginner-level machine learning project that offers a hands-on introduction to supervised classification.
Let’s start building the project from scratch. We will start by:
Without any further delay, let’s start!
To build the Iris flower classification model, we will use the popular Iris dataset available on Kaggle. It contains 150 samples divided equally among three species of iris flowers: Setosa, Versicolor, and Virginica.
Follow the steps mentioned below to download the dataset:
Now that you have downloaded the file, upload it to Google Colab using the code below:
from google.colab import files
# Upload the CSV file
uploaded = files.upload()
This will prompt you to choose a file from your system. Select the Iris.csv file you just downloaded.
Now, use the code below to load the dataset into a Pandas DataFrame and preview the first 5 rows:
import pandas as pd
# Load the CSV file
df = pd.read_csv('Iris.csv')
# Show the first five rows
df.head()
Doing so will help you verify that the dataset is loaded correctly.
Output
|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
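If downloading from Kaggle or uploading to Colab is not convenient, scikit-learn ships its own copy of the Iris data. The sketch below loads that copy and renames the columns to match the Kaggle CSV layout shown above. The helper name `load_iris_like_kaggle` is hypothetical, and a few measurements in scikit-learn's copy may differ slightly from the Kaggle file, so summary statistics may not match exactly.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_like_kaggle():
    """Build a DataFrame shaped like the Kaggle Iris.csv from scikit-learn's copy."""
    iris = load_iris(as_frame=True)
    df = iris.data.copy()
    # Rename to the column names used in the Kaggle file (same feature order)
    df.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
    # Map integer targets (0, 1, 2) to the 'Iris-<name>' strings used by Kaggle
    df['Species'] = ['Iris-' + iris.target_names[i] for i in iris.target]
    # Add a 1-based Id column as the first column
    df.insert(0, 'Id', list(range(1, len(df) + 1)))
    return df


df = load_iris_like_kaggle()
print(df.head())
```

The resulting `df` can stand in for the one read from `Iris.csv` in the rest of the walkthrough.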
As we can see, the Iris dataset is well structured, so we will just take a quick look at its shape and column names. We should also check for missing values to be sure the data is ready for modeling.
Use the code below to do so:
# Check the shape of the dataset
print("Dataset shape:", df.shape)
# Display column names
print("Column names:", df.columns.tolist())
# Check for any missing values
print("Missing values:\n", df.isnull().sum())
# Get a quick statistical summary
df.describe()
The output below confirms that all four features are numerical and that there are no missing values.
Output
Dataset shape: (150, 6)
Column names: ['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']
Missing values:
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
|       | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 75.500000 | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 1.000000 | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 38.250000 | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 75.500000 | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 112.750000 | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 150.000000 | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
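The summary above describes the numeric columns only, so it does not show how the rows are spread across species. A minimal sketch for checking class balance and per-species feature averages follows. The helper `summarize_species` is a hypothetical name, and the small `demo` frame holds made-up illustrative rows so the block runs on its own.

```python
import pandas as pd


def summarize_species(df, target_col='Species'):
    """Return per-class row counts and per-class feature means."""
    counts = df[target_col].value_counts()
    # Average each numeric feature within each species; Id is only a row label
    means = df.drop(columns=['Id'], errors='ignore').groupby(target_col).mean()
    return counts, means


# Tiny made-up frame so the sketch runs on its own; pass the real df instead
demo = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'PetalLengthCm': [1.4, 1.3, 4.7, 4.5],
    'Species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor'],
})
counts, means = summarize_species(demo)
print(counts)
print(means)
```

On the real data, `summarize_species(df)` should report an equal number of rows per species, in line with the dataset description.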
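Now we will prepare the data for model training by separating the features from the target, encoding the species labels as numbers, and splitting the data into training and test sets.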
Here is the code to do so:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Separate features and target
X = df.drop(['Id', 'Species'], axis=1)
y = df['Species']
# Encode target labels into numerical values
le = LabelEncoder()
y_encoded = le.fit_transform(y)
# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)
# Check the shape
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
Output
X_train shape: (120, 4)
X_test shape: (30, 4)
The data is now ready to be used by the classification models.
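The split above is purely random, so the three species are not guaranteed to appear in equal numbers in the test set. One common variation, sketched here as an option rather than as part of the original project, is to pass `stratify` to `train_test_split` so that both splits keep the original class ratio. scikit-learn's bundled copy of the data is used only to keep the block self-contained.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# scikit-learn's bundled copy stands in for the uploaded CSV here
X, y = load_iris(return_X_y=True)

# stratify=y keeps the equal class ratio in both the training and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train class counts:", Counter(y_train.tolist()))
print("Test class counts:", Counter(y_test.tolist()))
```

With the DataFrame from this project, the equivalent call would pass `stratify=y_encoded`.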
In this step, we will train two classifiers, Logistic Regression and K-Nearest Neighbors, and evaluate each one using accuracy and a classification report.
Let’s start with Logistic Regression.
Here’s the code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Initialize and train the model
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train, y_train)
# Make predictions
y_pred_lr = lr_model.predict(X_test)
# Evaluate
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr, target_names=le.classes_))
Output
Logistic Regression Accuracy: 1.0
Classification Report:
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00        11
       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30
What does this output convey?
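One way to inspect these results in more detail is a confusion matrix, which counts how many test samples of each true class were assigned to each predicted class. The sketch below repeats the same split and model settings on scikit-learn's bundled copy of the data (an assumption made to keep the block self-contained). A model with no misclassifications places every count on the diagonal.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Bundled copy of the data; the project itself uses the uploaded CSV
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

In the project notebook, the same call would be `confusion_matrix(y_test, y_pred_lr)`.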
Now let’s try KNN.
Here is the code:
from sklearn.neighbors import KNeighborsClassifier
# Initialize and train the model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
# Make predictions
y_pred_knn = knn_model.predict(X_test)
# Evaluate
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Classification Report:\n", classification_report(y_test, y_pred_knn, target_names=le.classes_))
Output
KNN Accuracy: 1.0
Classification Report:
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00        11
       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30
What does this output mean?
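The KNN model above fixes `n_neighbors=5`. A rough way to check whether another value would do better is to compare a few candidates with cross-validation. The sketch below does this on scikit-learn's bundled copy of the data; the candidate values and the 5-fold setting are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Bundled copy of the data; the project itself uses the uploaded CSV
X, y = load_iris(return_X_y=True)

scores_by_k = {}
for k in (1, 3, 5, 7, 9, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores_by_k[k] = cross_val_score(model, X, y, cv=5).mean()

for k, score in scores_by_k.items():
    print(f"k={k}: mean CV accuracy = {score:.3f}")

# Pick the candidate with the highest mean accuracy
best_k = max(scores_by_k, key=scores_by_k.get)
print("Best k among those tried:", best_k)
```

On a dataset this small, the differences between neighboring values of k are often within noise, so the result is a hint rather than a firm answer.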
Now that we have seen the performance of both models, let’s quickly compare their results and understand how they performed on the test data:
| Metric | Logistic Regression | K-Nearest Neighbors |
| Accuracy | 100% | 100% |
| Precision (avg) | 1.00 | 1.00 |
| Recall (avg) | 1.00 | 1.00 |
| F1-Score (avg) | 1.00 | 1.00 |
| Misclassifications | 0 | 0 |
| Suitability | Simple, fast, great for linearly separable data | Works well with small datasets, non-parametric |
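Both models score identically on a single 30-sample test set, which makes them hard to tell apart. A sketch of a more robust comparison, assuming 5-fold stratified cross-validation on scikit-learn's bundled copy of the data, is shown below; it reports the mean and spread of accuracy across folds for each model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Bundled copy of the data; the project itself uses the uploaded CSV
X, y = load_iris(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Shuffled, stratified folds so each fold has a similar class mix
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging over several folds usually gives a less optimistic figure than a single perfect test score.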
In this project, you built a classification model using the famous Iris dataset. You learned how to load and explore a dataset, preprocess features, and apply machine learning algorithms like Logistic Regression and KNN.
While the results look ideal here, real-world data is often messier. Still, this project gives you a solid foundation in solving multi-class classification problems using Python and scikit-learn.
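As a possible next step, a trained model can be applied to a new measurement and the numeric prediction decoded back into a species name with the label encoder. The sketch below trains on scikit-learn's bundled copy of the data to stay self-contained, and the measurements in `new_flower` are hypothetical values chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

feature_names = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

# Bundled copy of the data, renamed to match the Kaggle column names
iris = load_iris(as_frame=True)
X = iris.data.copy()
X.columns = feature_names
species = pd.Series(['Iris-' + iris.target_names[i] for i in iris.target])

# Encode species names as integers, as in the preprocessing step
le = LabelEncoder()
y_encoded = le.fit_transform(species)

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X, y_encoded)

# Hypothetical measurements (in cm) for one unseen flower
new_flower = pd.DataFrame([[5.0, 3.4, 1.5, 0.2]], columns=feature_names)
pred_code = knn_model.predict(new_flower)

# Convert the integer prediction back into the species name
pred_label = le.inverse_transform(pred_code)[0]
print("Predicted species:", pred_label)
```

The short petals in this example place it well inside the Setosa cluster, so the model should label it `Iris-setosa`.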
Reference link:
https://www.kaggle.com/datasets/uciml/iris/data.
Colab Link -
https://colab.research.google.com/drive/1Gc_lUuqqVyisFj27PL4PgojkntKJmLhX?usp=sharing