Breast Cancer Classification and Prediction with Logistic Regression
By Rohit Sharma
Updated on Aug 06, 2025 | 15 min read | 1.35K+ views
Breast cancer is one of the most common and life-threatening diseases affecting women globally. Breast Cancer Classification and Prediction helps in finding out if a tumour is cancerous (malignant) or not (benign) at an early stage. This project uses a logistic regression model to study health data and predict the chances of breast cancer.
This is a great way to learn how machine learning can support doctors and improve early detection of breast cancer.
Ready to enter the field of data science? upGrad provides Online Data Science Courses covering Python, Machine Learning, AI, SQL, and Tableau, all taught by industry experts. Enrol today!
Hey, if you're looking to dive deeper, check out this awesome collection of Python Data Science Projects! There's something for everyone, whether you're just starting out or you're a seasoned pro.
To work smoothly on the Breast Cancer Classification and Prediction project, make sure you're comfortable with the following:
- Basic Python programming (variables, functions, loops)
- Data handling with Pandas and NumPy
- The basics of classification and model evaluation
If you're new to Python, check out this free upGrad course to boost your skills: Learn Basic Python Programming.
To build and evaluate the breast cancer classification and prediction model, you’ll use popular Python tools for data analysis, visualisation, classification, and model evaluation:
Tool / Library | Purpose
Python | The main programming language used to build and run the project
Google Colab | Free online platform to run Python code with all libraries pre-installed
Pandas | Reads the dataset, handles missing values, and prepares data for analysis
NumPy | Supports numerical tasks like handling arrays and calculations
Matplotlib / Seaborn | Used for visualising patterns, distributions, and correlations in the data
scikit-learn | Helps in splitting data, encoding features, training models, and evaluation
LogisticRegression | A simple model used as a starting point for classification
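Google Colab ships with all of these pre-installed. If you're working locally in a Jupyter notebook instead, a minimal setup sketch (assuming a standard pip environment):
# One-time setup for a local notebook (skip on Colab, which has these pre-installed)
!pip install pandas numpy matplotlib seaborn scikit-learn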
Also Read - Decision Tree vs Random Forest: Use Cases & Performance Metrics
To predict the chances of breast cancer, we build a classification model using patient data. The model learns from past data and identifies patterns to predict whether a tumour is malignant or benign. Here's what we did:
- Explored the dataset's structure, column types, and summary statistics
- Cleaned the data and encoded the diagnosis column into binary labels
- Visualised the class distribution and feature correlations
- Split the data into training and testing sets and scaled the features
- Trained a Logistic Regression model and evaluated it with accuracy, a confusion matrix, and a classification report
- Used the trained model to predict the class of a new example
Also Read - Evaluation Metrics in Machine Learning: Top 10 Metrics You Should Know
You can complete this Breast Cancer Classification and Prediction project in about 4 to 5 hours. It’s designed for beginners and intermediate learners who are comfortable with basic Python, data handling, and classification tasks.
Here’s how you can build this project from scratch:
Let's get started!
Start by importing the essential Python libraries. Here's the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Also Read - Libraries in Python Explained: List of Important Libraries
Data Science Courses to upskill
Explore Data Science Courses for Career Progression
In this step, you explore the dataset to understand its structure, column types, and basic statistics. This helps you plan the next steps, like cleaning and feature selection.
# Load the dataset
df = pd.read_csv('data.csv')
# --- 1. Data Exploration ---
print("--- Data Exploration ---")
# View first 5 records
print("First 5 rows of the dataset:")
print(df.head())
# Check column types and missing values
print("\nDataset Info:")
df.info()
# View summary statistics
print("\nStatistical Summary:")
print(df.describe())
Output:
--- Data Exploration ---
First 5 rows of the dataset:
id diagnosis radius_mean texture_mean perimeter_mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... texture_worst perimeter_worst area_worst smoothness_worst \
0 ... 17.33 184.60 2019.0 0.1622
1 ... 23.41 158.80 1956.0 0.1238
2 ... 25.53 152.50 1709.0 0.1444
3 ... 26.50 98.87 567.7 0.2098
4 ... 16.67 152.20 1575.0 0.1374
compactness_worst concavity_worst concave points_worst symmetry_worst \
0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
3 0.8663 0.6869 0.2575 0.6638
4 0.2050 0.4000 0.1625 0.2364
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN
[5 rows x 33 columns]
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
32 Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Statistical Summary:
id radius_mean texture_mean perimeter_mean area_mean \
count 5.690000e+02 569.000000 569.000000 569.000000 569.000000
mean 3.037183e+07 14.127292 19.289649 91.969033 654.889104
std 1.250206e+08 3.524049 4.301036 24.298981 351.914129
min 8.670000e+03 6.981000 9.710000 43.790000 143.500000
25% 8.692180e+05 11.700000 16.170000 75.170000 420.300000
50% 9.060240e+05 13.370000 18.840000 86.240000 551.100000
75% 8.813129e+06 15.780000 21.800000 104.100000 782.700000
max 9.113205e+08 28.110000 39.280000 188.500000 2501.000000
smoothness_mean compactness_mean concavity_mean concave points_mean \
count 569.000000 569.000000 569.000000 569.000000
mean 0.096360 0.104341 0.088799 0.048919
std 0.014064 0.052813 0.079720 0.038803
min 0.052630 0.019380 0.000000 0.000000
25% 0.086370 0.064920 0.029560 0.020310
50% 0.095870 0.092630 0.061540 0.033500
75% 0.105300 0.130400 0.130700 0.074000
max 0.163400 0.345400 0.426800 0.201200
symmetry_mean ... texture_worst perimeter_worst area_worst \
count 569.000000 ... 569.000000 569.000000 569.000000
mean 0.181162 ... 25.677223 107.261213 880.583128
std 0.027414 ... 6.146258 33.602542 569.356993
min 0.106000 ... 12.020000 50.410000 185.200000
25% 0.161900 ... 21.080000 84.110000 515.300000
50% 0.179200 ... 25.410000 97.660000 686.500000
75% 0.195700 ... 29.720000 125.400000 1084.000000
max 0.304000 ... 49.540000 251.200000 4254.000000
smoothness_worst compactness_worst concavity_worst \
count 569.000000 569.000000 569.000000
mean 0.132369 0.254265 0.272188
std 0.022832 0.157336 0.208624
min 0.071170 0.027290 0.000000
25% 0.116600 0.147200 0.114500
50% 0.131300 0.211900 0.226700
75% 0.146000 0.339100 0.382900
max 0.222600 1.058000 1.252000
concave points_worst symmetry_worst fractal_dimension_worst \
count 569.000000 569.000000 569.000000
mean 0.114606 0.290076 0.083946
std 0.065732 0.061867 0.018061
min 0.000000 0.156500 0.055040
25% 0.064930 0.250400 0.071460
50% 0.099930 0.282200 0.080040
75% 0.161400 0.317900 0.092080
max 0.291000 0.663800 0.207500
Unnamed: 32
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
[8 rows x 32 columns]
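The info output above already shows that 'Unnamed: 32' is completely empty. If you want an explicit missing-value count before cleaning, a quick optional check:
# Count missing values per column; only 'Unnamed: 32' should be non-zero (569 NaNs)
print(df.isnull().sum().sort_values(ascending=False).head())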
Now, we clean up unnecessary columns and convert the target column diagnosis into a binary format. This is essential to prepare the data for classification and prediction.
print("\n--- Data Preprocessing ---")
# Drop the 'id' and 'Unnamed: 32' columns as they are not needed for prediction
df = df.drop(columns=['id', 'Unnamed: 32'], errors='ignore')
# Encode the 'diagnosis' column to numerical values (Malignant=1, Benign=0)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
print("Diagnosis column encoded.")
print(df['diagnosis'].value_counts())
Output:
--- Data Preprocessing ---
Diagnosis column encoded.
diagnosis
0 357
1 212
Name: count, dtype: int64
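The classes are somewhat imbalanced: 357 benign (0) versus 212 malignant (1) cases. To see the split as proportions, a quick optional sketch:
# Class proportions: benign ~ 0.63, malignant ~ 0.37
print(df['diagnosis'].value_counts(normalize=True))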
Also Read - Data Cleaning Techniques: 15 Simple & Effective Ways To Clean Data
In this step, you’ll understand how the diagnosis classes are distributed and how features relate to each other.
print("\n--- Data Visualization ---")
# Plot the distribution of the two classes
plt.figure(figsize=(6, 4))
sns.countplot(x='diagnosis', data=df)
plt.title('Distribution of Diagnosis (1: Malignant, 0: Benign)')
# Create a correlation heatmap to see the relationships between features
plt.figure(figsize=(20, 20))
# Select only numeric columns for the correlation matrix
numeric_df = df.select_dtypes(include=np.number)
corr_matrix = numeric_df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Features')
Output: a count plot of the two diagnosis classes (357 benign vs 212 malignant) and a correlation heatmap of all numeric features (figures available in the linked Colab notebook).
Also Read - Comprehensive Guide to Exploratory Data Analysis (EDA) in 2025: Tools, Types, and Best Practices
In this step, we separate the input features from the target column and split the dataset for training and testing.
The code for this step is as follows:
print("\n--- Feature Selection & Data Splitting ---")
# Define features (X) and target (y)
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Data split into training and testing sets.")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Output:
Data split into training and testing sets.
X_train shape: (455, 30)
X_test shape: (114, 30)
y_train shape: (455,)
y_test shape: (114,)
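Note: with a roughly 63/37 class split, you may also want to pass stratify=y so the train and test sets keep the same benign/malignant ratio. A hedged variant of the split above (the printed results in this article come from the unstratified split):
# Stratified split keeps the class ratio consistent across train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)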
Also Read - Feature Selection in Machine Learning: Techniques, Benefits, and More
In this step, we scale the features so that they have a mean of 0 and a standard deviation of 1. This helps improve the performance of classification algorithms.
print("\n--- Feature Scaling ---")
# Scale the features to have zero mean and unit variance
# This is important for algorithms like Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Features scaled using StandardScaler.")
Output:
Features scaled using StandardScaler.
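As a quick sanity check, the scaled training features should now have a mean close to 0 and a standard deviation close to 1 in every column; an optional sketch:
# Verify the scaling: per-column means ~0 and standard deviations ~1
print(np.round(X_train_scaled.mean(axis=0), 4))
print(np.round(X_train_scaled.std(axis=0), 4))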
Explore this project: Airline Passenger Traffic Analysis Project Using Python
We now train a Logistic Regression model and evaluate its performance on the test dataset.
Here is the code for training & evaluating model performance:
# --- 7. Model Training ---
print("\n--- Model Training ---")
# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print("Logistic Regression model trained.")
print("\n--- Model Training Results ---")
# The intercept is the bias term
print("Model Intercept:", model.intercept_)
print("\nModel Coefficients:")
# The coefficients show the weight of each feature in the prediction
coeffs = pd.DataFrame(model.coef_[0], index=X.columns, columns=['Coefficient'])
print(coeffs)
# --- 7.1 Model Evaluation ---
print("\n--- Model Evaluation ---")
# Make predictions on the test set
y_pred = model.predict(X_test_scaled)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)
# Plot the confusion matrix for better visualization
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png')
print("Saved confusion matrix plot to confusion_matrix.png")
# Generate the classification report with precision, recall, f1-score
class_report = classification_report(y_test, y_pred, target_names=['Benign', 'Malignant'])
print("\nClassification Report:")
print(class_report)
# Manually calculate Sensitivity and Specificity from the confusion matrix
tn, fp, fn, tp = conf_matrix.ravel()
sensitivity = tp / (tp + fn) # Same as recall for the positive class
specificity = tn / (tn + fp) # True Negative Rate
print(f"\nSensitivity (Recall or True Positive Rate): {sensitivity:.4f}")
print(f"Specificity (True Negative Rate): {specificity:.4f}")
Output:
--- Model Training ---
Logistic Regression model trained.
--- Model Training Results ---
Model Intercept: [-0.44558453]
Model Coefficients:
Coefficient
radius_mean 0.431904
texture_mean 0.387326
perimeter_mean 0.393432
area_mean 0.465210
smoothness_mean 0.071667
compactness_mean -0.540164
concavity_mean 0.801458
concave points_mean 1.119804
symmetry_mean -0.236119
fractal_dimension_mean -0.075921
radius_se 1.268178
texture_se -0.188877
perimeter_se 0.610583
area_se 0.907186
smoothness_se 0.313307
compactness_se -0.682491
concavity_se -0.175275
concave points_se 0.311300
symmetry_se -0.500425
fractal_dimension_se -0.616230
radius_worst 0.879840
texture_worst 1.350606
perimeter_worst 0.589453
area_worst 0.841846
smoothness_worst 0.544170
compactness_worst -0.016110
concavity_worst 0.943053
concave points_worst 0.778217
symmetry_worst 1.208200
fractal_dimension_worst 0.157414
--- Model Evaluation ---
Accuracy: 0.9737
Confusion Matrix:
[[70 1]
[ 2 41]]
Classification Report:
precision recall f1-score support
Benign 0.97 0.99 0.98 71
Malignant 0.98 0.95 0.96 43
accuracy 0.97 114
macro avg 0.97 0.97 0.97 114
weighted avg 0.97 0.97 0.97 114
Sensitivity (Recall or True Positive Rate): 0.9535
Specificity (True Negative Rate): 0.9859
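Beyond accuracy, sensitivity, and specificity, the ROC-AUC score summarises how well the model separates the two classes across all decision thresholds. It isn't part of the original evaluation above, so treat this as an optional extension:
# Optional: compute ROC-AUC from the predicted malignant-class probabilities
from sklearn.metrics import roc_auc_score
y_proba = model.predict_proba(X_test_scaled)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")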
Also Read - Top 6 Techniques Used in Feature Engineering [Machine Learning]
In this final step, we use the trained Logistic Regression model to predict the class of a new example.
# --- 8. Prediction on Example Data ---
print("\n--- Prediction on Example Data ---")
# Create a hypothetical data point for prediction.
# These values are from the first row of the dataset, which is known to be malignant.
example_data = np.array([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189]])
# IMPORTANT: Scale the new data using the same scaler fitted on the training data
example_data_scaled = scaler.transform(example_data)
# Predict the class for the example data
prediction = model.predict(example_data_scaled)
# Predict the probabilities for each class
prediction_proba = model.predict_proba(example_data_scaled)
print("Example Data Shape:", example_data.shape)
print("\nPrediction:")
if prediction[0] == 1:
    print("The model predicts the tumor is: Malignant")
else:
    print("The model predicts the tumor is: Benign")
# Print the probabilities for Benign (class 0) and Malignant (class 1)
print(f"\nPrediction Probabilities: Benign ({prediction_proba[0][0]:.4f}), Malignant ({prediction_proba[0][1]:.4f})")
Output:
--- Prediction on Example Data ---
Example Data Shape: (1, 30)
Prediction:
The model predicts the tumor is: Malignant
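Note: the scaler was fitted on a pandas DataFrame, so passing a raw NumPy array to scaler.transform() can trigger a feature-name warning in recent scikit-learn versions. One way to avoid it, sketched here using the X.columns defined earlier, is to wrap the example in a DataFrame first:
# Build the example as a DataFrame so its column names match the training data
example_df = pd.DataFrame(example_data, columns=X.columns)
example_data_scaled = scaler.transform(example_df)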
Key insight: the model classified this known malignant sample as malignant with high confidence, consistent with its strong performance on the test set.
In this project, we developed a breast cancer classification and prediction model using Logistic Regression. After preprocessing, visualising, and scaling the data, the model was trained and evaluated, reaching about 97% accuracy on the test set with strong sensitivity (0.95) and specificity (0.99). It also correctly predicted a malignant tumour from example data. This workflow demonstrates how machine learning can support early and accurate breast cancer diagnosis using real-world data.
Colab Link:
https://colab.research.google.com/drive/1XgiBtgHXg0SRlV59p90JL7zzNJ5BaBBM
Which dataset does the project use? The Breast Cancer Wisconsin (Diagnostic) Dataset, which contains features computed from digitised images of breast mass biopsies.
Why Logistic Regression? It is a reliable and interpretable classification algorithm, well suited to binary problems like predicting benign or malignant tumours.
Which features were used? All relevant numeric features except the target column (diagnosis). Correlation analysis helped in understanding feature relationships, but no manual feature elimination was applied.
How was the model evaluated? Using accuracy, a confusion matrix, and a classification report (precision, recall, F1-score), with sensitivity and specificity also calculated manually.
Did the prediction hold up? Yes. The model correctly identified a known malignant tumour from the dataset with high confidence, which validates its effectiveness.