Breast Cancer Classification and Prediction with Logistic Regression

By Rohit Sharma

Updated on Aug 06, 2025 | 15 min read

Breast cancer is one of the most common and life-threatening diseases affecting women globally. Breast cancer classification and prediction helps determine, at an early stage, whether a tumour is malignant (cancerous) or benign. This project uses a logistic regression model to analyse diagnostic health data and predict the likelihood of breast cancer.

This is a great way to learn how machine learning can support doctors and improve early detection of breast cancer.

Ready to enter the field of data science? upGrad provides Online Data Science Courses covering Python, Machine Learning, AI, SQL, and Tableau. Taught by experts. Enrol today!

Hey, if you're looking to dive deeper, check out this awesome collection of Python Data Science Projects! There's something for everyone, whether you're just starting out or you're a seasoned pro.

Heads Up Before You Dive In!

To work smoothly on the Breast Cancer Classification and Prediction project, make sure you’re comfortable with the following:

  • Basic Python programming knowledge (You should know how to write simple scripts, use loops and conditions, and define functions.)
  • Experience with data manipulation using Pandas and NumPy (These help in reading the dataset, handling missing values, and preparing the data for analysis)
  • Understanding of data visualisation with Matplotlib and Seaborn (These tools help in drawing graphs like histograms, countplots, and heatmaps to better understand the data)
  • Knowledge of data preprocessing techniques (You should know how to clean the data, encode categorical columns, scale features, and split the dataset into training and test sets)
  • Familiarity with classification algorithms (Since we are predicting whether a tumour is cancerous or not, it's important to know how models like Logistic Regression work for binary classification problems)

If you're new to Python, check out this free upGrad course to boost your skills - Learn Basic Python Programming.

Unlock your data science potential with upGrad's premier courses, offering industry-led instruction, direct mentorship, and dedicated career guidance.

Breast Cancer Classification: Tools & Tech We Used

To build and evaluate the breast cancer classification and prediction model, you’ll use popular Python tools for data analysis, visualisation, classification, and model evaluation:

  • Python: The main programming language used to build and run the project
  • Google Colab: Free online platform to run Python code with all libraries pre-installed
  • Pandas: Reads the dataset, handles missing values, and prepares data for analysis
  • NumPy: Supports numerical tasks like handling arrays and calculations
  • Matplotlib / Seaborn: Used for visualising patterns, distributions, and correlations in the data
  • scikit-learn: Helps in splitting data, encoding features, training models, and evaluation
  • LogisticRegression: A simple model used as a starting point for classification

Also Read - Decision Tree vs Random Forest: Use Cases & Performance Metrics

Methodology for Breast Cancer Classification & Prediction

To predict the chances of breast cancer, we build a classification model using patient data. The model learns from past data and identifies patterns to predict whether a tumour is malignant or benign. Here's what we did:

  • Data preprocessing and cleaning
  • Exploratory Data Analysis (EDA)
  • Classification Algorithms (Logistic Regression)
  • Feature Importance Analysis
  • Model evaluation using accuracy, precision, recall, and F1-score

Also Read - Evaluation Metrics in Machine Learning: Top 10 Metrics You Should Know

How Much Time Do You Need?

You can complete this Breast Cancer Classification and Prediction project in about 4 to 5 hours. It’s designed for beginners and intermediate learners who are comfortable with basic Python, data handling, and classification tasks.

Building a Breast Cancer Classification and Prediction Model: A Step-by-Step Guide

Here’s how you can build this project from scratch:

  1. Load the Breast Cancer Dataset
    Load dataset: diagnostic features (mean radius, texture, perimeter, etc.) and target (benign/malignant tumour).
  2. Clean and Preprocess the Data
    Clean and scale data: drop irrelevant columns, handle missing values, encode categorical labels (B/M to numerical), and standardise features.
  3. Explore and Visualise the Data
    Use visual tools such as countplots, histograms, boxplots, and correlation heatmaps to understand the distribution of features and how they relate to the diagnosis outcome.
  4. Train a Classification Model
    Use Logistic Regression to classify tumours as benign or malignant based on the diagnostic features. This model is well suited to binary classification problems like this one (see the short sigmoid sketch after this list).
  5. Evaluate Model Performance
    Evaluate model performance using accuracy, precision, recall, F1-score, and confusion matrix to assess its ability to differentiate benign and malignant cases.
  6. Identify Top Features
    Feature importance scores identify key input variables for classification, highlighting significant medical indicators.
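
Before diving in, it helps to know what logistic regression actually computes: it takes a weighted sum of the input features and squashes it into a probability between 0 and 1 using the sigmoid function. Here is a tiny, self-contained NumPy illustration (illustrative only, not part of the project code):

import numpy as np

# Sigmoid: maps any real number to a probability in (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A strongly positive weighted sum maps to a probability near 1 (malignant)
print(sigmoid(2.5))   # ~0.9241
# A strongly negative weighted sum maps to a probability near 0 (benign)
print(sigmoid(-2.5))  # ~0.0759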

Let's get started!

Step 1: Import Required Libraries

Start by importing the essential Python libraries:

Here's the code for importing:

import pandas as pd                     # data loading and manipulation
import numpy as np                      # numerical operations and arrays
import matplotlib.pyplot as plt         # plotting
import seaborn as sns                   # statistical visualisation
from sklearn.model_selection import train_test_split   # train/test splitting
from sklearn.preprocessing import StandardScaler       # feature scaling
from sklearn.linear_model import LogisticRegression    # classification model
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # evaluation metrics

Also Read - Libraries in Python Explained: List of Important Libraries


Step 2: Explore the Breast Cancer Data

In this step, you explore the dataset to understand its structure, column types, and basic statistics. This helps you plan the next steps, like cleaning and feature selection.

# Load the dataset
df = pd.read_csv('data.csv')
# --- 1. Data Exploration ---
print("--- Data Exploration ---")

# View first 5 records
print("First 5 rows of the dataset:")
print(df.head())
# Check column types and missing values
print("\nDataset Info:")
df.info()

# View summary statistics
print("\nStatistical Summary:")
print(df.describe())

Output:

--- Data Exploration ---

First 5 rows of the dataset:
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0
1    842517         M        20.57         17.77          132.90     1326.0
2  84300903         M        19.69         21.25          130.00     1203.0
3  84348301         M        11.42         20.38           77.58      386.1
4  84358402         M        20.29         14.34          135.10     1297.0

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710
1          0.08474           0.07864          0.0869              0.07017
2          0.10960           0.15990          0.1974              0.12790
3          0.14250           0.28390          0.2414              0.10520
4          0.10030           0.13280          0.1980              0.10430

   ...  texture_worst  perimeter_worst  area_worst  smoothness_worst  \
0  ...          17.33           184.60      2019.0            0.1622
1  ...          23.41           158.80      1956.0            0.1238
2  ...          25.53           152.50      1709.0            0.1444
3  ...          26.50            98.87       567.7            0.2098
4  ...          16.67           152.20      1575.0            0.1374

   compactness_worst  concavity_worst  concave points_worst  symmetry_worst  \
0             0.6656           0.7119                0.2654          0.4601
1             0.1866           0.2416                0.1860          0.2750
2             0.4245           0.4504                0.2430          0.3613
3             0.8663           0.6869                0.2575          0.6638
4             0.2050           0.4000                0.1625          0.2364

   fractal_dimension_worst  Unnamed: 32
0                  0.11890          NaN
1                  0.08902          NaN
2                  0.08758          NaN
3                  0.17300          NaN
4                  0.07678          NaN

[5 rows x 33 columns]

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

Statistical Summary:
                 id  radius_mean  texture_mean  perimeter_mean    area_mean  \
count  5.690000e+02   569.000000    569.000000      569.000000   569.000000
mean   3.037183e+07    14.127292     19.289649       91.969033   654.889104
std    1.250206e+08     3.524049      4.301036       24.298981   351.914129
min    8.670000e+03     6.981000      9.710000       43.790000   143.500000
25%    8.692180e+05    11.700000     16.170000       75.170000   420.300000
50%    9.060240e+05    13.370000     18.840000       86.240000   551.100000
75%    8.813129e+06    15.780000     21.800000      104.100000   782.700000
max    9.113205e+08    28.110000     39.280000      188.500000  2501.000000

       smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
count       569.000000        569.000000      569.000000           569.000000
mean          0.096360          0.104341        0.088799             0.048919
std           0.014064          0.052813        0.079720             0.038803
min           0.052630          0.019380        0.000000             0.000000
25%           0.086370          0.064920        0.029560             0.020310
50%           0.095870          0.092630        0.061540             0.033500
75%           0.105300          0.130400        0.130700             0.074000
max           0.163400          0.345400        0.426800             0.201200

       symmetry_mean  ...  texture_worst  perimeter_worst   area_worst  \
count     569.000000  ...     569.000000       569.000000   569.000000
mean        0.181162  ...      25.677223       107.261213   880.583128
std         0.027414  ...       6.146258        33.602542   569.356993
min         0.106000  ...      12.020000        50.410000   185.200000
25%         0.161900  ...      21.080000        84.110000   515.300000
50%         0.179200  ...      25.410000        97.660000   686.500000
75%         0.195700  ...      29.720000       125.400000  1084.000000
max         0.304000  ...      49.540000       251.200000  4254.000000

       smoothness_worst  compactness_worst  concavity_worst  \
count        569.000000         569.000000       569.000000
mean           0.132369           0.254265         0.272188
std            0.022832           0.157336         0.208624
min            0.071170           0.027290         0.000000
25%            0.116600           0.147200         0.114500
50%            0.131300           0.211900         0.226700
75%            0.146000           0.339100         0.382900
max            0.222600           1.058000         1.252000

       concave points_worst  symmetry_worst  fractal_dimension_worst  \
count            569.000000      569.000000               569.000000
mean               0.114606        0.290076                 0.083946
std                0.065732        0.061867                 0.018061
min                0.000000        0.156500                 0.055040
25%                0.064930        0.250400                 0.071460
50%                0.099930        0.282200                 0.080040
75%                0.161400        0.317900                 0.092080
max                0.291000        0.663800                 0.207500

       Unnamed: 32
count          0.0
mean           NaN
std            NaN
min            NaN
25%            NaN
50%            NaN
75%            NaN
max            NaN

[8 rows x 32 columns]

Step 3: Data Preprocessing and Cleaning

Now, we clean up unnecessary columns and convert the target column diagnosis into a binary format. This is essential to prepare the data for classification and prediction.

print("\n--- Data Preprocessing ---")
# Drop the 'id' and 'Unnamed: 32' columns as they are not needed for prediction
df = df.drop(columns=['id', 'Unnamed: 32'], errors='ignore')
# Encode the 'diagnosis' column to numerical values (Malignant=1, Benign=0)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
print("Diagnosis column encoded.")
print(df['diagnosis'].value_counts())

Output:

--- Data Preprocessing ---

Diagnosis column encoded.
diagnosis
0    357
1    212

Name: count, dtype: int64
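
As a quick sanity check before moving on, you can confirm that nothing is missing and the target is now numeric (a minimal sketch; it assumes df is the DataFrame from the step above):

# Verify the cleanup: no missing values and a numeric target column
print("Remaining missing values:", df.isnull().sum().sum())  # expect 0
print("Diagnosis dtype:", df['diagnosis'].dtype)              # expect int64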

Also Read - Data Cleaning Techniques: 15 Simple & Effective Ways To Clean Data

Step 4: Visualise Class Distribution and Feature Correlation

In this step, you’ll understand how the diagnosis classes are distributed and how features relate to each other.

print("\n--- Data Visualization ---")
# Plot the distribution of the two classes
plt.figure(figsize=(6, 4))
sns.countplot(x='diagnosis', data=df)
plt.title('Distribution of Diagnosis (1: Malignant, 0: Benign)')

# Create a correlation heatmap to see the relationships between features
plt.figure(figsize=(20, 20))
# Select only numeric columns for the correlation matrix
numeric_df = df.select_dtypes(include=np.number)
corr_matrix = numeric_df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Features')

Output: a countplot of the two diagnosis classes and a 30x30 correlation heatmap of the numeric features (rendered as images in the notebook).

Also Read - Comprehensive Guide to Exploratory Data Analysis (EDA) in 2025: Tools, Types, and Best Practices

Step 5: Feature Selection and Data Splitting

In this step, we separate the input features from the target column and split the dataset for training and testing.

The code for this step is as follows:

print("\n--- Feature Selection & Data Splitting ---")
# Define features (X) and target (y)
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Data split into training and testing sets.")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

Output:

Data split into training and testing sets.
X_train shape: (455, 30)
X_test shape: (114, 30)
y_train shape: (455,)
y_test shape: (114,)
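
Since the classes are somewhat imbalanced (357 benign vs. 212 malignant), you can optionally pass stratify=y so that both splits preserve the same class ratio. A small variation on the split above (optional, not part of the original code):

# Stratified split: keeps the benign/malignant ratio identical in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)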

Also Read - Feature Selection in Machine Learning: Techniques, Benefits, and More

Step 6: Feature Scaling

In this step, we scale the features so that they have a mean of 0 and a standard deviation of 1. This helps improve the performance of classification algorithms.

print("\n--- Feature Scaling ---")
# Scale the features to have zero mean and unit variance
# This is important for algorithms like Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Features scaled using StandardScaler.")

Output:

Features scaled using StandardScaler.
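
If you want to verify that the scaling worked, each column of the transformed training set should now have a mean of roughly 0 and a standard deviation of roughly 1 (a quick optional check):

# Each feature should now have ~0 mean and ~1 standard deviation
print("Means:", X_train_scaled.mean(axis=0).round(2))
print("Stds: ", X_train_scaled.std(axis=0).round(2))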

Explore this project - Airline Passenger Traffic Analysis Project Using Python

Step 7: Model Training and Evaluation

We now train a Logistic Regression model and evaluate its performance on the test dataset.

Here is the code for training & evaluating model performance: 

# --- 7. Model Training ---
print("\n--- Model Training ---")
# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print("Logistic Regression model trained.")
print("\n--- Model Training Results ---")
# The intercept is the bias term
print("Model Intercept:", model.intercept_)
print("\nModel Coefficients:")
# The coefficients show the weight of each feature in the prediction
coeffs = pd.DataFrame(model.coef_[0], index=X.columns, columns=['Coefficient'])
print(coeffs)
# --- 7.1 Model Evaluation ---
print("\n--- Model Evaluation ---")
# Make predictions on the test set
y_pred = model.predict(X_test_scaled)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

# Plot the confusion matrix for better visualization
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png')
print("Saved confusion matrix plot to confusion_matrix.png")

# Generate the classification report with precision, recall, f1-score
class_report = classification_report(y_test, y_pred, target_names=['Benign', 'Malignant'])
print("\nClassification Report:")
print(class_report)

# Manually calculate Sensitivity and Specificity from the confusion matrix
tn, fp, fn, tp = conf_matrix.ravel()
sensitivity = tp / (tp + fn) # Same as recall for the positive class
specificity = tn / (tn + fp) # True Negative Rate
print(f"\nSensitivity (Recall or True Positive Rate): {sensitivity:.4f}")
print(f"Specificity (True Negative Rate): {specificity:.4f}")

Output: 

--- Model Training ---
Logistic Regression model trained.

--- Model Training Results ---
Model Intercept: [-0.44558453]

Model Coefficients:
                         Coefficient
radius_mean                 0.431904
texture_mean                0.387326
perimeter_mean              0.393432
area_mean                   0.465210
smoothness_mean             0.071667
compactness_mean           -0.540164
concavity_mean              0.801458
concave points_mean         1.119804
symmetry_mean              -0.236119
fractal_dimension_mean     -0.075921
radius_se                   1.268178
texture_se                 -0.188877
perimeter_se                0.610583
area_se                     0.907186
smoothness_se               0.313307
compactness_se             -0.682491
concavity_se               -0.175275
concave points_se           0.311300
symmetry_se                -0.500425
fractal_dimension_se       -0.616230
radius_worst                0.879840
texture_worst               1.350606
perimeter_worst             0.589453
area_worst                  0.841846
smoothness_worst            0.544170
compactness_worst          -0.016110
concavity_worst             0.943053
concave points_worst        0.778217
symmetry_worst              1.208200
fractal_dimension_worst     0.157414

--- Model Evaluation ---
Accuracy: 0.9737

Confusion Matrix:
[[70  1]
 [ 2 41]]

Classification Report:
              precision    recall  f1-score   support

      Benign       0.97      0.99      0.98        71
   Malignant       0.98      0.95      0.96        43

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Sensitivity (Recall or True Positive Rate): 0.9535
Specificity (True Negative Rate): 0.9859
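
The methodology above lists feature importance analysis as a separate step. With logistic regression on standardised features, the absolute size of each coefficient is a reasonable importance proxy; this short sketch (reusing the coeffs DataFrame built during training) ranks the top ten:

# Rank features by the absolute value of their standardised coefficients
top_features = coeffs['Coefficient'].abs().sort_values(ascending=False).head(10)
print("Top 10 most influential features:")
print(top_features)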

Also Read - Top 6 Techniques Used in Feature Engineering [Machine Learning]

Step 8: Prediction on Example Data

In this final step, we use the trained Logistic Regression model to predict the class of a new example.

# --- 8. Prediction on Example Data ---
print("\n--- Prediction on Example Data ---")
# Create a hypothetical data point for prediction.
# These values are from the first row of the dataset, which is known to be malignant.
example_data = np.array([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
                          1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
                          25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189]])
# IMPORTANT: Scale the new data using the same scaler fitted on the training data
example_data_scaled = scaler.transform(example_data)
# Predict the class for the example data
prediction = model.predict(example_data_scaled)
# Predict the probabilities for each class
prediction_proba = model.predict_proba(example_data_scaled)
print("Example Data Shape:", example_data.shape)
print("\nPrediction:")
if prediction[0] == 1:
    print("The model predicts the tumor is: Malignant")
else:
    print("The model predicts the tumor is: Benign")

# Print the probabilities for Benign (class 0) and Malignant (class 1)
print(f"\nPrediction Probabilities: Benign ({prediction_proba[0][0]:.4f}), Malignant ({prediction_proba[0][1]:.4f})")

Output:

--- Prediction on Example Data ---
Example Data Shape: (1, 30)

Prediction:
The model predicts the tumor is: Malignant

Key Insights from Prediction on Example Data

  • Input Shape: The example data has 30 features, matching the model's expected input shape: (1, 30).
  • Prediction Result: The model correctly predicted the tumor as Malignant, which aligns with the known label from the dataset. This confirms that the model can make accurate predictions on real cases.
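
One practical note: the scaler was fitted on a pandas DataFrame, so newer versions of scikit-learn may warn about missing feature names when you transform a bare NumPy array. Wrapping the example in a DataFrame with the original column names avoids the warning (a minor, optional variation on the code above):

# Give the example the same column names the scaler was fitted with
example_df = pd.DataFrame(example_data, columns=X.columns)
example_df_scaled = scaler.transform(example_df)
print(model.predict(example_df_scaled))  # [1] -> Malignant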


Conclusion

In this project, we developed a breast cancer classification and prediction model using Logistic Regression. After preprocessing, visualising, and scaling the data, the model was trained and evaluated with strong accuracy and reliable performance metrics. It successfully predicted a malignant tumour from example data, highlighting its effectiveness. This workflow demonstrates how machine learning can support early and accurate breast cancer diagnosis using real-world data.
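
If you want to extend the project, one worthwhile next step is to replace the single train/test split with k-fold cross-validation for a more stable accuracy estimate. The sketch below (an optional extension, not part of the walkthrough above) wraps scaling and the model in a pipeline so the scaler is refitted inside each fold, avoiding data leakage:

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Pipeline ensures the scaler only ever sees each fold's training data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.4f} (std {scores.std():.4f})")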


Colab Link:
https://colab.research.google.com/drive/1XgiBtgHXg0SRlV59p90JL7zzNJ5BaBBM

Frequently Asked Questions (FAQs)

1. What dataset was used in this project?

The project uses the Breast Cancer Wisconsin (Diagnostic) Dataset, which contains features computed from digitised images of fine needle aspirates of breast masses.

2. Why was Logistic Regression chosen for this task?

Logistic Regression is a reliable and interpretable classification algorithm suitable for binary classification problems like predicting benign or malignant tumours.

3. How were features selected for training the model?

All relevant numeric features except the target column (diagnosis) were used. Correlation analysis helped understand feature relationships, but no manual feature elimination was applied.

4. How was the model evaluated?

The model was evaluated using accuracy, a confusion matrix, and a classification report (precision, recall, F1-score); sensitivity and specificity were also calculated manually from the confusion matrix.

5. What does the example prediction show?

The model correctly identified a known malignant tumour from the dataset, showing high confidence in predicting it as malignant, which validates its effectiveness.

Rohit Sharma

834 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
