
Iris Dataset Classification Project Using Python

By Rohit Sharma

Updated on Jul 22, 2025 | 6 min read | 1.5K+ views


The Iris dataset is one of the most commonly used datasets among beginners entering the world of data science and machine learning. It offers a clean, well-structured way to learn about classification, feature importance, and model accuracy.

In this project, you will build a machine learning model that classifies an iris flower into one of three species (Setosa, Versicolor, or Virginica) based on the length and width of its sepals and petals. The project is ideal for those new to supervised learning and provides a strong foundation in the main ML pipeline: loading data, preprocessing, model building, and evaluation.


What Should You Know Beforehand?

It is better to have at least some of the background in:

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool / Library | Purpose
Python         | Programming language used for development
Google Colab   | Cloud-based notebook environment for running Python code
Pandas         | Data manipulation and analysis
NumPy          | Numerical operations and array handling
Matplotlib     | Basic plots and graphs for visualizing data
Seaborn        | Higher-level statistical visualizations
Scikit-learn   | Training classification models and evaluating performance

Models That Will Be Utilized for Learning

Two simple but effective algorithms will be applied to classify iris flower species:

  • Logistic Regression: A linear model suited to both binary and multiclass classification. It predicts the probability that a sample belongs to each class and works well when the classes can be separated by roughly linear boundaries in feature space.
  • K-Nearest Neighbors (KNN): A nonparametric model that classifies a data point by a majority vote among its nearest neighbors in the training set. It is intuitive and easy to apply, and it can capture nonlinear patterns, which makes it a common choice for problems like the Iris dataset.
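To make the nearest-neighbor idea concrete, here is a minimal pure-Python sketch of a majority vote among the k closest points. The function name `knn_predict` and the toy (petal length, petal width) values are assumptions chosen for illustration; the project itself uses scikit-learn's `KNeighborsClassifier` instead.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k nearest neighbours and take a majority vote
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy (petal length, petal width) pairs in cm, invented for illustration
train_points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (4.4, 1.3)]
train_labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

prediction = knn_predict(train_points, train_labels, query=(1.6, 0.25), k=3)
print(prediction)
```

Because the query point sits next to the three short-petal samples, all three votes go to the same class. With real data, the choice of k controls how much local noise influences the vote.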

Time Taken and Difficulty

You can complete this Iris dataset classification project in about 1.5 to 2 hours. It’s a beginner-level machine learning project that offers a hands-on introduction to supervised classification.

How to Build the Iris Dataset Classification Model

Let’s build the project from scratch. We will work through the following steps:

  1. Loading and exploring the dataset
  2. Preprocessing the features
  3. Visualizing relationships between flower measurements
  4. Training and evaluating classification models

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the Iris flower classification model, we will use the popular Iris dataset available on Kaggle. It contains 150 samples divided equally among three species of iris flowers: Setosa, Versicolor, and Virginica.

Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser.
  2. Go to https://www.kaggle.com/datasets/uciml/iris/data.
  3. On the Iris Species page, in the right pane, under the Data Explorer section, click Iris.csv.
  4. Click the download icon.

Step 2: Upload and Load the Dataset in Google Colab

Now that you have downloaded the file, upload it to Google Colab using the code below:

from google.colab import files

# Upload the CSV file
uploaded = files.upload()

This will prompt you to choose a file from your system. Select the Iris.csv file you just downloaded.

Now, use the code below to load the dataset into a Pandas DataFrame and preview the first five rows:

import pandas as pd

# Load the CSV file
df = pd.read_csv('Iris.csv')

# Show the first five rows
df.head()

Doing so will help you verify that the dataset is loaded correctly.

Output

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
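If downloading from Kaggle is not convenient, a similar DataFrame can be built from the copy of Iris bundled with scikit-learn. The sketch below is an alternative, not the route this tutorial follows: the name `df_alt`, the column renaming, and the `Iris-` prefix are assumptions made so the frame mirrors the Kaggle CSV layout. The bundled copy may differ from the Kaggle file in a couple of values, so summary statistics might not match exactly.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load scikit-learn's bundled copy of the Iris data as a DataFrame
iris = load_iris(as_frame=True)
df_alt = iris.frame.copy()

# Rename columns to match the Kaggle CSV layout used in this tutorial
df_alt.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'target']

# Map integer targets (0, 1, 2) to Kaggle-style species strings
df_alt['Species'] = df_alt['target'].map(lambda i: 'Iris-' + str(iris.target_names[i]))
df_alt = df_alt.drop(columns='target')

# Add a 1-based Id column so the frame has the same six columns
df_alt.insert(0, 'Id', range(1, len(df_alt) + 1))

print(df_alt.shape)
print(df_alt.columns.tolist())
```

Assigning `df = df_alt` afterwards would let the remaining steps run unchanged.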

 

 

Step 3: Explore the Dataset

As we can see, the Iris dataset is well-structured, so we will take just a quick look at its shape and column names. We should also check for missing values to be sure the data is ready for modeling.

Use the code below to do so:

# Check the shape of the dataset
print("Dataset shape:", df.shape)

# Display column names
print("Column names:", df.columns.tolist())

# Check for any missing values
print("Missing values:\n", df.isnull().sum())

# Get a quick statistical summary
df.describe()

The output below confirms that the four measurement columns are numerical (Species, the text label, is left out of the statistical summary) and that there are no missing values.

Output

Dataset shape: (150, 6)
Column names: ['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']
Missing values:
 Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000
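The summary table lists only five of the six columns because describe() skips text columns by default. The sketch below illustrates this on the five preview rows from Step 2; the names `preview`, `numeric_cols`, and `species_counts` are chosen here for illustration. On the full DataFrame, the same `value_counts()` call is one way to confirm that the species are evenly represented.

```python
import pandas as pd

# The five preview rows shown by df.head() in Step 2
preview = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5],
    'SepalLengthCm': [5.1, 4.9, 4.7, 4.6, 5.0],
    'SepalWidthCm': [3.5, 3.0, 3.2, 3.1, 3.6],
    'PetalLengthCm': [1.4, 1.4, 1.3, 1.5, 1.4],
    'PetalWidthCm': [0.2, 0.2, 0.2, 0.2, 0.2],
    'Species': ['Iris-setosa'] * 5,
})

# Columns pandas treats as numeric (Species is text, so it is excluded)
numeric_cols = preview.select_dtypes(include='number').columns.tolist()
print("Numeric columns:", numeric_cols)

# Number of rows per species; on the full dataset this shows class balance
species_counts = preview['Species'].value_counts()
print(species_counts)
```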

 

 

Step 4: Preprocess the Data

Now we will prepare the data for model training by:

  • Separating features (X) and target (y)
  • Encoding the target species
  • Splitting the data into training and test sets

Here is the code to do so:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Separate features and target
X = df.drop(['Id', 'Species'], axis=1)
y = df['Species']

# Encode target labels into numerical values
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Check the shape
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

Output

X_train shape: (120, 4)
X_test shape: (30, 4)

The data is now ready to be used in classification models.
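Two details of this step are easy to miss: which number LabelEncoder assigns to each species, and how many samples of each class end up in the test set. The sketch below uses stand-in arrays (`species`, `features`) rather than the real DataFrame, and it adds the optional `stratify` argument, which the code above does not use, to show one way of keeping class proportions equal across the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Stand-in labels: 50 samples per species, mirroring the balanced Iris dataset
species = np.repeat(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], 50)
features = np.arange(150).reshape(-1, 1)  # placeholder feature column

encoder = LabelEncoder()
labels = encoder.fit_transform(species)

# Classes are sorted alphabetically, so setosa -> 0, versicolor -> 1, virginica -> 2
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}
print(mapping)

# stratify keeps the class proportions identical in the train and test sets
_, _, _, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)
test_counts = np.bincount(labels_test)
print(test_counts)

# inverse_transform turns numeric predictions back into species names
decoded = encoder.inverse_transform([0, 2])
print(decoded)
```

Without `stratify`, the per-class counts in the test set depend on the random seed, which is why the classification reports in the next step show supports of 10, 9, and 11 rather than 10 each.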

Step 5: Train and Evaluate the Models

In this step, we will train two classifiers, Logistic Regression and K-Nearest Neighbors, and evaluate each one using its accuracy score and classification report.

Let’s start with Logistic Regression.

Model 1: Logistic Regression

Here’s the code:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train the model
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test)

# Evaluate
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr, target_names=le.classes_))

Output

Logistic Regression Accuracy: 1.0
Classification Report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00        11

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30

What does this output convey?

  • Logistic Regression was able to obtain 100% accuracy on the test set.
  • It was able to correctly classify all 30 test samples across the three iris species: setosa, versicolor, and virginica.
  • The precision was 1.00, and the same holds for recall and F1-score. No misclassifications occurred. 
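A confusion matrix is another way to see that no misclassifications occurred. The sketch below does not rerun the model: it rebuilds label arrays from the support column of the report (10, 9, and 11 samples), and the second case with a single error is hypothetical, included only to show what a mistake would look like.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# True labels matching the support column: 10 setosa, 9 versicolor, 11 virginica
y_true = np.array([0] * 10 + [1] * 9 + [2] * 11)

# A perfect classifier predicts exactly the true labels
y_pred_perfect = y_true.copy()
cm = confusion_matrix(y_true, y_pred_perfect)
print(cm)

# Hypothetical case: one virginica sample mistaken for versicolor
y_pred_one_error = y_true.copy()
y_pred_one_error[-1] = 1
cm_err = confusion_matrix(y_true, y_pred_one_error)
print(cm_err)
```

Rows are true classes and columns are predicted classes, so a perfect result has nonzero entries only on the diagonal. In the notebook, `confusion_matrix(y_test, y_pred_lr)` would give the matrix for the trained model.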

Now let’s try KNN.

Model 2: K-Nearest Neighbors (KNN)

Here is the code:

from sklearn.neighbors import KNeighborsClassifier

# Initialize and train the model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Make predictions
y_pred_knn = knn_model.predict(X_test)

# Evaluate
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Classification Report:\n", classification_report(y_test, y_pred_knn, target_names=le.classes_))

Output

KNN Accuracy: 1.0
Classification Report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00        11

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30

What does this output mean?

  • The KNN model was able to achieve 100% accuracy on the test set.
  • It perfectly predicted the correct class for all 30 test samples.
  • Each of the three iris species was classified with precision, recall, and F1-score of 1.00.
  • The chosen value k=5 (set explicitly with n_neighbors=5, which is also scikit-learn's default) worked well on this balanced dataset with well-separated classes.
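Since the value of k is a choice rather than something learned from the data, it can be worth checking a few alternatives. The sketch below is one possible approach, not part of the original notebook: it uses scikit-learn's bundled copy of Iris (an assumption, so the snippet runs on its own) and 5-fold cross-validation to compare several candidate values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# scikit-learn's bundled Iris copy is used here so the snippet runs on its own
X_all, y_all = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for a few candidate k values
k_scores = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    k_scores[k] = cross_val_score(model, X_all, y_all, cv=5).mean()

for k, score in k_scores.items():
    print(f"k={k}: mean accuracy {score:.3f}")
```

On a dataset this small, the differences between neighboring k values tend to be minor, so a mid-range value such as 5 is a reasonable starting point.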

Step 6: Compare Model Performances

Now that we have seen the performance of both models, let’s quickly compare their results and understand how they performed on the test data:

Metric             | Logistic Regression                             | K-Nearest Neighbors
Accuracy           | 100%                                            | 100%
Precision (avg)    | 1.00                                            | 1.00
Recall (avg)       | 1.00                                            | 1.00
F1-Score (avg)     | 1.00                                            | 1.00
Misclassifications | 0                                               | 0
Suitability        | Simple, fast, great for linearly separable data | Works well with small datasets, non-parametric
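A tie on a 30-sample test set says little about which model generalizes better, because a single split can be lucky. One hedged way to get a steadier estimate is k-fold cross-validation, sketched below with scikit-learn's bundled Iris copy (an assumption, used so the snippet is self-contained) and the same model settings as in Step 5.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)

# Same hyperparameters as the models trained in Step 5
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Mean and standard deviation of accuracy across 5 folds for each model
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_all, y_all, cv=5)
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging over several folds usually gives scores slightly below a perfect 1.0, which is a more realistic picture than one favorable split.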

Conclusion

In this project, you built a classification model using the famous Iris dataset. You learned how to load and explore a dataset, preprocess features, and apply machine learning algorithms like Logistic Regression and KNN. 

While the results look ideal here, real-world data is often messier. Still, this project gives you a solid foundation in solving multi-class classification problems using Python and scikit-learn.


Reference link:
https://www.kaggle.com/datasets/uciml/iris/data.

Colab Link -
https://colab.research.google.com/drive/1Gc_lUuqqVyisFj27PL4PgojkntKJmLhX?usp=sharing
