Breast Cancer Classification and Prediction with Logistic Regression
By Rohit Sharma
Updated on Aug 06, 2025 | 15 min read | 1.35K+ views
Breast cancer is one of the most common and life-threatening diseases affecting women globally. Breast Cancer Classification and Prediction helps in finding out if a tumour is cancerous (malignant) or not (benign) at an early stage. This project uses a logistic regression model to study health data and predict the chances of breast cancer.
This is a great way to learn how machine learning can support doctors and improve early detection of breast cancer.
Ready to enter the field of data science? upGrad provides Online Data Science Courses covering Python, Machine Learning, AI, SQL, and Tableau, all taught by industry experts. Enrol today!
Hey, if you're looking to dive deeper, check out this awesome collection of Python Data Science Projects! There's something for everyone, whether you're just starting out or you're a seasoned pro.
To work smoothly on the Breast Cancer Classification and Prediction project, make sure you're comfortable with the following:
- Basic Python programming (variables, functions, loops)
- Data handling with Pandas and NumPy
- The basics of classification and model evaluation
If you're new to Python, check out this free upGrad course to boost your skills: Learn Basic Python Programming.
To build and evaluate the breast cancer classification and prediction model, you’ll use popular Python tools for data analysis, visualisation, classification, and model evaluation:
Tool / Library | Purpose
Python | The main programming language used to build and run the project
Google Colab | Free online platform to run Python code with all libraries pre-installed
Pandas | Reads the dataset, handles missing values, and prepares data for analysis
NumPy | Supports numerical tasks like handling arrays and calculations
Matplotlib / Seaborn | Used for visualising patterns, distributions, and correlations in the data
scikit-learn | Helps in splitting data, encoding features, training models, and evaluation
LogisticRegression | A simple model used as a starting point for classification
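Google Colab ships with all of these pre-installed. If you're working locally in a Jupyter notebook instead, a minimal setup sketch (assuming a standard pip environment):
# One-time setup for a local notebook (skip on Colab, which has these pre-installed)
!pip install pandas numpy matplotlib seaborn scikit-learn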
Also Read - Decision Tree vs Random Forest: Use Cases & Performance Metrics
To predict the chances of breast cancer, we build a classification model using patient data. The model learns from past data and identifies patterns to predict whether a tumour is malignant or benign. Here's what we did:
- Explored the dataset's structure, column types, and summary statistics
- Cleaned the data and encoded the diagnosis column into binary labels
- Visualised the class distribution and feature correlations
- Split the data into training and testing sets and scaled the features
- Trained a Logistic Regression model and evaluated it with accuracy, a confusion matrix, and a classification report
- Used the trained model to predict the class of a new example
Also Read - Evaluation Metrics in Machine Learning: Top 10 Metrics You Should Know
You can complete this Breast Cancer Classification and Prediction project in about 4 to 5 hours. It’s designed for beginners and intermediate learners who are comfortable with basic Python, data handling, and classification tasks.
Here’s how you can build this project from scratch:
Let's get started!
Start by importing the essential Python libraries. Here's the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Also Read - Libraries in Python Explained: List of Important Libraries
Data Science Courses to upskill
Explore Data Science Courses for Career Progression
In this step, you explore the dataset to understand its structure, column types, and basic statistics. This helps you plan the next steps, like cleaning and feature selection.
# Load the dataset
df = pd.read_csv('data.csv')
# --- 1. Data Exploration ---
print("--- Data Exploration ---")
# View first 5 records
print("First 5 rows of the dataset:")
print(df.head())
# Check column types and missing values
print("\nDataset Info:")
df.info()
# View summary statistics
print("\nStatistical Summary:")
print(df.describe())
Output:
--- Data Exploration ---
First 5 rows of the dataset:
id diagnosis radius_mean texture_mean perimeter_mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... texture_worst perimeter_worst area_worst smoothness_worst \
0 ... 17.33 184.60 2019.0 0.1622
1 ... 23.41 158.80 1956.0 0.1238
2 ... 25.53 152.50 1709.0 0.1444
3 ... 26.50 98.87 567.7 0.2098
4 ... 16.67 152.20 1575.0 0.1374
compactness_worst concavity_worst concave points_worst symmetry_worst \
0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
3 0.8663 0.6869 0.2575 0.6638
4 0.2050 0.4000 0.1625 0.2364
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN
[5 rows x 33 columns]
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
32 Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Statistical Summary:
id radius_mean texture_mean perimeter_mean area_mean \
count 5.690000e+02 569.000000 569.000000 569.000000 569.000000
mean 3.037183e+07 14.127292 19.289649 91.969033 654.889104
std 1.250206e+08 3.524049 4.301036 24.298981 351.914129
min 8.670000e+03 6.981000 9.710000 43.790000 143.500000
25% 8.692180e+05 11.700000 16.170000 75.170000 420.300000
50% 9.060240e+05 13.370000 18.840000 86.240000 551.100000
75% 8.813129e+06 15.780000 21.800000 104.100000 782.700000
max 9.113205e+08 28.110000 39.280000 188.500000 2501.000000
smoothness_mean compactness_mean concavity_mean concave points_mean \
count 569.000000 569.000000 569.000000 569.000000
mean 0.096360 0.104341 0.088799 0.048919
std 0.014064 0.052813 0.079720 0.038803
min 0.052630 0.019380 0.000000 0.000000
25% 0.086370 0.064920 0.029560 0.020310
50% 0.095870 0.092630 0.061540 0.033500
75% 0.105300 0.130400 0.130700 0.074000
max 0.163400 0.345400 0.426800 0.201200
symmetry_mean ... texture_worst perimeter_worst area_worst \
count 569.000000 ... 569.000000 569.000000 569.000000
mean 0.181162 ... 25.677223 107.261213 880.583128
std 0.027414 ... 6.146258 33.602542 569.356993
min 0.106000 ... 12.020000 50.410000 185.200000
25% 0.161900 ... 21.080000 84.110000 515.300000
50% 0.179200 ... 25.410000 97.660000 686.500000
75% 0.195700 ... 29.720000 125.400000 1084.000000
max 0.304000 ... 49.540000 251.200000 4254.000000
smoothness_worst compactness_worst concavity_worst \
count 569.000000 569.000000 569.000000
mean 0.132369 0.254265 0.272188
std 0.022832 0.157336 0.208624
min 0.071170 0.027290 0.000000
25% 0.116600 0.147200 0.114500
50% 0.131300 0.211900 0.226700
75% 0.146000 0.339100 0.382900
max 0.222600 1.058000 1.252000
concave points_worst symmetry_worst fractal_dimension_worst \
count 569.000000 569.000000 569.000000
mean 0.114606 0.290076 0.083946
std 0.065732 0.061867 0.018061
min 0.000000 0.156500 0.055040
25% 0.064930 0.250400 0.071460
50% 0.099930 0.282200 0.080040
75% 0.161400 0.317900 0.092080
max 0.291000 0.663800 0.207500
Unnamed: 32
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
[8 rows x 32 columns]
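The info output above already shows that 'Unnamed: 32' is completely empty. If you want an explicit missing-value count before cleaning, a quick optional check:
# Count missing values per column; only 'Unnamed: 32' should be non-zero (569 NaNs)
print(df.isnull().sum().sort_values(ascending=False).head())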
Now, we clean up unnecessary columns and convert the target column diagnosis into a binary format. This is essential to prepare the data for classification and prediction.
print("\n--- Data Preprocessing ---")
# Drop the 'id' and 'Unnamed: 32' columns as they are not needed for prediction
df = df.drop(columns=['id', 'Unnamed: 32'], errors='ignore')
# Encode the 'diagnosis' column to numerical values (Malignant=1, Benign=0)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
print("Diagnosis column encoded.")
print(df['diagnosis'].value_counts())
Output:
--- Data Preprocessing ---
Diagnosis column encoded.
diagnosis
0 357
1 212
Name: count, dtype: int64
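The classes are somewhat imbalanced: 357 benign (0) versus 212 malignant (1) cases. To see the split as proportions, a quick optional sketch:
# Class proportions: benign ~ 0.63, malignant ~ 0.37
print(df['diagnosis'].value_counts(normalize=True))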
Also Read - Data Cleaning Techniques: 15 Simple & Effective Ways To Clean Data
In this step, you’ll understand how the diagnosis classes are distributed and how features relate to each other.
print("\n--- Data Visualization ---")
# Plot the distribution of the two classes
plt.figure(figsize=(6, 4))
sns.countplot(x='diagnosis', data=df)
plt.title('Distribution of Diagnosis (1: Malignant, 0: Benign)')
# Create a correlation heatmap to see the relationships between features
plt.figure(figsize=(20, 20))
# Select only numeric columns for the correlation matrix
numeric_df = df.select_dtypes(include=np.number)
corr_matrix = numeric_df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Features')
Output: a count plot of the two diagnosis classes (357 benign vs 212 malignant) and a correlation heatmap of all numeric features (figures available in the linked Colab notebook).
Also Read - Comprehensive Guide to Exploratory Data Analysis (EDA) in 2025: Tools, Types, and Best Practices
In this step, we separate the input features from the target column and split the dataset for training and testing.
The code for this step is as follows:
print("\n--- Feature Selection & Data Splitting ---")
# Define features (X) and target (y)
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Data split into training and testing sets.")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Output:
Data split into training and testing sets.
X_train shape: (455, 30)
X_test shape: (114, 30)
y_train shape: (455,)
y_test shape: (114,)
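Note: with a roughly 63/37 class split, you may also want to pass stratify=y so the train and test sets keep the same benign/malignant ratio. A hedged variant of the split above (the printed results in this article come from the unstratified split):
# Stratified split keeps the class ratio consistent across train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)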
Also Read - Feature Selection in Machine Learning: Techniques, Benefits, and More
In this step, we scale the features so that they have a mean of 0 and a standard deviation of 1. This helps improve the performance of classification algorithms.
print("\n--- Feature Scaling ---")
# Scale the features to have zero mean and unit variance
# This is important for algorithms like Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Features scaled using StandardScaler.")
Output:
Features scaled using StandardScaler.
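As a quick sanity check, the scaled training features should now have a mean close to 0 and a standard deviation close to 1 in every column; an optional sketch:
# Verify the scaling: per-column means ~0 and standard deviations ~1
print(np.round(X_train_scaled.mean(axis=0), 4))
print(np.round(X_train_scaled.std(axis=0), 4))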
Explore this project: Airline Passenger Traffic Analysis Project Using Python
We now train a Logistic Regression model and evaluate its performance on the test dataset.
Here is the code for training & evaluating model performance:
# --- 7. Model Training ---
print("\n--- Model Training ---")
# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print("Logistic Regression model trained.")
print("\n--- Model Training Results ---")
# The intercept is the bias term
print("Model Intercept:", model.intercept_)
print("\nModel Coefficients:")
# The coefficients show the weight of each feature in the prediction
coeffs = pd.DataFrame(model.coef_[0], index=X.columns, columns=['Coefficient'])
print(coeffs)
# --- 7.1 Model Evaluation ---
print("\n--- Model Evaluation ---")
# Make predictions on the test set
y_pred = model.predict(X_test_scaled)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)
# Plot the confusion matrix for better visualization
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png')
print("Saved confusion matrix plot to confusion_matrix.png")
# Generate the classification report with precision, recall, f1-score
class_report = classification_report(y_test, y_pred, target_names=['Benign', 'Malignant'])
print("\nClassification Report:")
print(class_report)
# Manually calculate Sensitivity and Specificity from the confusion matrix
tn, fp, fn, tp = conf_matrix.ravel()
sensitivity = tp / (tp + fn) # Same as recall for the positive class
specificity = tn / (tn + fp) # True Negative Rate
print(f"\nSensitivity (Recall or True Positive Rate): {sensitivity:.4f}")
print(f"Specificity (True Negative Rate): {specificity:.4f}")
Output:
--- Model Training ---
Logistic Regression model trained.
--- Model Training Results ---
Model Intercept: [-0.44558453]
Model Coefficients:
Coefficient
radius_mean 0.431904
texture_mean 0.387326
perimeter_mean 0.393432
area_mean 0.465210
smoothness_mean 0.071667
compactness_mean -0.540164
concavity_mean 0.801458
concave points_mean 1.119804
symmetry_mean -0.236119
fractal_dimension_mean -0.075921
radius_se 1.268178
texture_se -0.188877
perimeter_se 0.610583
area_se 0.907186
smoothness_se 0.313307
compactness_se -0.682491
concavity_se -0.175275
concave points_se 0.311300
symmetry_se -0.500425
fractal_dimension_se -0.616230
radius_worst 0.879840
texture_worst 1.350606
perimeter_worst 0.589453
area_worst 0.841846
smoothness_worst 0.544170
compactness_worst -0.016110
concavity_worst 0.943053
concave points_worst 0.778217
symmetry_worst 1.208200
fractal_dimension_worst 0.157414
--- Model Evaluation ---
Accuracy: 0.9737
Confusion Matrix:
[[70 1]
[ 2 41]]
Classification Report:
precision recall f1-score support
Benign 0.97 0.99 0.98 71
Malignant 0.98 0.95 0.96 43
accuracy 0.97 114
macro avg 0.97 0.97 0.97 114
weighted avg 0.97 0.97 0.97 114
Sensitivity (Recall or True Positive Rate): 0.9535
Specificity (True Negative Rate): 0.9859
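Beyond accuracy, sensitivity, and specificity, the ROC-AUC score summarises how well the model separates the two classes across all decision thresholds. It isn't part of the original evaluation above, so treat this as an optional extension:
# Optional: compute ROC-AUC from the predicted malignant-class probabilities
from sklearn.metrics import roc_auc_score
y_proba = model.predict_proba(X_test_scaled)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")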
Also Read - Top 6 Techniques Used in Feature Engineering [Machine Learning]
In this final step, we use the trained Logistic Regression model to predict the class of a new example.
# --- 8. Prediction on Example Data ---
print("\n--- Prediction on Example Data ---")
# Create a hypothetical data point for prediction.
# These values are from the first row of the dataset, which is known to be malignant.
example_data = np.array([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189]])
# IMPORTANT: Scale the new data using the same scaler fitted on the training data
example_data_scaled = scaler.transform(example_data)
# Predict the class for the example data
prediction = model.predict(example_data_scaled)
# Predict the probabilities for each class
prediction_proba = model.predict_proba(example_data_scaled)
print("Example Data Shape:", example_data.shape)
print("\nPrediction:")
if prediction[0] == 1:
    print("The model predicts the tumor is: Malignant")
else:
    print("The model predicts the tumor is: Benign")
# Print the probabilities for Benign (class 0) and Malignant (class 1)
print(f"\nPrediction Probabilities: Benign ({prediction_proba[0][0]:.4f}), Malignant ({prediction_proba[0][1]:.4f})")
Output:
--- Prediction on Example Data ---
Example Data Shape: (1, 30)
Prediction:
The model predicts the tumor is: Malignant
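Note: the scaler was fitted on a pandas DataFrame, so passing a raw NumPy array to scaler.transform() can trigger a feature-name warning in recent scikit-learn versions. One way to avoid it, sketched here using the X.columns defined earlier, is to wrap the example in a DataFrame first:
# Build the example as a DataFrame so its column names match the training data
example_df = pd.DataFrame(example_data, columns=X.columns)
example_data_scaled = scaler.transform(example_df)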
Key insight: the model classified this known malignant sample as malignant with high confidence, consistent with its strong performance on the test set.
In this project, we developed a breast cancer classification and prediction model using Logistic Regression. After preprocessing, visualising, and scaling the data, the model was trained and evaluated, reaching about 97% accuracy on the test set with strong sensitivity (0.95) and specificity (0.99). It also correctly predicted a malignant tumour from example data. This workflow demonstrates how machine learning can support early and accurate breast cancer diagnosis using real-world data.
Colab Link:
https://colab.research.google.com/drive/1XgiBtgHXg0SRlV59p90JL7zzNJ5BaBBM
Which dataset does the project use? The Breast Cancer Wisconsin (Diagnostic) Dataset, which contains features computed from digitised images of breast mass biopsies.
Why Logistic Regression? It is a reliable and interpretable classification algorithm, well suited to binary problems like predicting benign or malignant tumours.
Which features were used? All relevant numeric features except the target column (diagnosis). Correlation analysis helped in understanding feature relationships, but no manual feature elimination was applied.
How was the model evaluated? Using accuracy, a confusion matrix, and a classification report (precision, recall, F1-score), with sensitivity and specificity also calculated manually.
Did the prediction hold up? Yes. The model correctly identified a known malignant tumour from the dataset with high confidence, which validates its effectiveness.