Heart Disease Prediction Project Using R

By Rohit Sharma

Updated on Aug 07, 2025 | 14 min read | 1.36K+ views

This Heart Disease Prediction Project in R is an easy-to-understand and beginner-friendly project. This blog will explain the process of building a classification model to predict the presence of heart disease. 

We'll be using Google Colab and R. The project covers essential steps like data cleaning, exploration, model training with logistic regression and random forest, and model evaluation. We will use simple R packages such as caret, randomForest, and caTools to make the workflow easy to follow.

Redefine Your Future with upGrad’s Data Science Courses. Join the next wave of AI and analytics leaders. Learn from the best, build real-world skills, and get globally recognized. Enrol today, your data-driven career starts here.

Must Read For Beginners: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

How Long This Project Takes, What Tools You'll Use, and Skills You'll Gain

The time, tools, and skills this heart disease prediction project requires are summarized in the table below.

Duration: 1.5 to 2 hours
Difficulty Level: Beginner
Programming Skills: Basic R programming, data frame handling, using Google Colab
ML Knowledge: Introductory understanding of classification models (logistic regression, random forest)
Tools Used: R, Google Colab (optionally the rpy2 magic command, %load_ext rpy2.ipython, if you stay on a Python runtime)
Libraries Used: tidyverse, caret, randomForest, e1071, caTools
Key Skills Learned: Data cleaning, data splitting, logistic regression, random forest, model evaluation, feature importance plotting
Outcome: Predict the presence of heart disease with high accuracy using basic ML techniques

From AI-powered finance to advanced data science, our globally recognized programs equip you to lead in tomorrow’s tech economy. Explore top data science and AI courses. Enrol now.

Stepwise Guide to Creating a Heart Disease Prediction Model Using R

The full breakdown of this project is discussed in this section, where each step is explained along with the output.

Step 1: Set Up Google Colab to Use R

To begin working with R in Google Colab, you'll first need to switch the notebook's default language from Python to R. This setup allows you to write and execute R code seamlessly within the Colab environment. 

Start by opening a new notebook in Google Colab. Then, navigate to the "Runtime" menu at the top and select "Change runtime type." In the window that appears, select "R" from the language dropdown. Once done, click "Save" to apply the changes. Your notebook is now ready to run R code.
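The tools table above also lists the rpy2 magic command as an alternative: if you prefer to keep the default Python runtime, you can load R through the rpy2 extension instead. Here is a minimal two-cell sketch of that workflow; the rest of this guide assumes the R runtime described above.

# Cell 1: load the rpy2 bridge in the default Python runtime
%load_ext rpy2.ipython

# Cell 2 (and any later cell): start the cell with the %%R magic so its contents run as R
%%R
R.version.string   # prints the R version to confirm R is available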

Step 2: Install All Required R Packages

Once the R environment is active in Google Colab, the next step is to install all the essential R packages you'll need throughout the project. These packages support data cleaning, visualization, machine learning model creation, and evaluation. This is a one-time setup step. The code to install the required packages is:

# Install packages (only needed once)
install.packages("tidyverse")     # Data wrangling and visualization
install.packages("caret")         # Machine learning and model evaluation
install.packages("e1071")         # Support Vector Machine (used by caret)
install.packages("randomForest")  # Random Forest classifier
install.packages("caTools")       # Splitting the dataset

The above code will install all necessary packages for this project. The output is given below.

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

 

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

 

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’

 

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

 

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

 

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

 

also installing the dependency ‘bitops’

Here are some fun R projects: Daily Temperature Forecast Analysis Using R

Step 3: Load the Installed Libraries into Your Session

After installing the required packages, you need to load them into your R session so their functions are available for use. This step makes sure you can access tools for data wrangling, visualization, model training, and evaluation throughout your analysis. Here’s the code to load the libraries:

# Load the installed libraries
library(tidyverse)      # For data manipulation and visualization
library(caret)          # For machine learning workflows and evaluation
library(e1071)          # Supports algorithms used by caret (like SVM)
library(randomForest)   # For building random forest models
library(caTools)        # For splitting the dataset into training/testing sets

Once the libraries load successfully, R prints output confirming it:

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──

 dplyr    1.1.4      readr    2.1.5

 forcats  1.0.0      stringr  1.5.1

 ggplot2  3.5.2      tibble   3.3.0

 lubridate 1.9.4      tidyr    1.3.1

 purrr    1.1.0     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

 dplyr::filter() masks stats::filter()

 dplyr::lag()    masks stats::lag()

Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading required package: lattice

 

Attaching package: ‘caret’

 

The following object is masked from ‘package:purrr’:

 

    lift

 

randomForest 4.7-1.2

 

Type rfNews() to see new features/changes/bug fixes.

 

Attaching package: ‘randomForest’

 

The following object is masked from ‘package:dplyr’:

 

    combine

 

The following object is masked from ‘package:ggplot2’:

 

    margin

Step 4: Load the Dataset into R

With the environment set up, it's time to load the heart disease dataset. After you've manually uploaded the CSV file to your Google Colab session, you can read it into R using read.csv(). This step stores the data in a variable called data, which you’ll use for analysis and model building. To confirm that everything loaded correctly, it's a good idea to preview the first few rows. Here’s the code:

# Read the CSV file
data <- read.csv("Heart Disease Prediction.csv")

# View the first few rows of the dataset
head(data)

The above code gives us a glimpse of the dataset we’ll work with:

 

  age   sex   cp    trestbps chol  fbs   restecg thalach exang oldpeak slope ca    thal  target
  <int> <int> <int> <int>    <int> <int> <int>   <int>   <int> <dbl>   <int> <int> <int> <int>
1 52    1     0     125      212   0     1       168     0     1.0     2     2     3     0
2 53    1     0     140      203   1     0       155     1     3.1     0     0     3     0
3 70    1     0     145      174   0     1       125     1     2.6     0     0     3     0
4 61    1     0     148      203   0     1       161     0     0.0     2     1     3     0
5 62    0     0     138      294   1     1       106     0     1.9     1     3     2     0
6 58    0     0     100      248   0     0       122     0     1.0     1     0     2     1
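If read.csv() instead complains that it cannot open the file, the CSV is probably not in Colab's working directory yet. A quick sanity check, assuming the file name matches the one used above:

# Verify the uploaded CSV is visible in the current working directory
file.exists("Heart Disease Prediction.csv")   # should return TRUE
list.files()                                  # shows all files uploaded to this session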

 

Step 5: Understand the Structure and Summary of the Dataset

Before starting with cleaning or modeling, it's important to understand the dataset we're working with. This step helps you inspect the types of variables (e.g., numeric, categorical), the number of features, and their basic statistical properties. Here’s the code:

# Check the structure of the dataset (column types and sample values)
str(data)

# Get summary statistics for each column
summary(data)

The above code gives us the structure and summary statistics of the dataset.

'data.frame': 1025 obs. of  14 variables:

 $ age     : int  52 53 70 61 62 58 58 55 46 54 ...

 $ sex     : int  1 1 1 1 0 0 1 1 1 1 ...

 $ cp      : int  0 0 0 0 0 0 0 0 0 0 ...

 $ trestbps: int  125 140 145 148 138 100 114 160 120 122 ...

 $ chol    : int  212 203 174 203 294 248 318 289 249 286 ...

 $ fbs     : int  0 1 0 0 1 0 0 0 0 0 ...

 $ restecg : int  1 0 1 1 1 0 2 0 0 0 ...

 $ thalach : int  168 155 125 161 106 122 140 145 144 116 ...

 $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...

 $ oldpeak : num  1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...

 $ slope   : int  2 0 0 2 1 1 0 1 2 1 ...

 $ ca      : int  2 0 0 1 3 0 3 1 0 2 ...

 $ thal    : int  3 3 3 3 2 2 1 3 3 2 ...

 $ target  : int  0 0 0 0 0 1 0 0 0 0 ...

 

     age             sex               cp            trestbps    

 Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  

 1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  

 Median :56.00   Median :1.0000   Median :1.0000   Median :130.0  

 Mean   :54.43   Mean   :0.6956   Mean   :0.9424   Mean   :131.6  

 3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  

 Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  

      chol          fbs            restecg          thalach     

 Min.   :126   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  

 1st Qu.:211   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:132.0  

 Median :240   Median :0.0000   Median :1.0000   Median :152.0  

 Mean   :246   Mean   :0.1493   Mean   :0.5298   Mean   :149.1  

 3rd Qu.:275   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  

 Max.   :564   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  

     exang           oldpeak          slope             ca        

 Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  

 1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  

 Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  

 Mean   :0.3366   Mean   :1.072   Mean   :1.385   Mean   :0.7541  

 3rd Qu.:1.0000   3rd Qu.:1.800   3rd Qu.:2.000   3rd Qu.:1.0000  

 Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  

      thal           target      

 Min.   :0.000   Min.   :0.0000  

 1st Qu.:2.000   1st Qu.:0.0000  

 Median :2.000   Median :1.0000  

 Mean   :2.324   Mean   :0.5132  

 3rd Qu.:3.000   3rd Qu.:1.0000  

 Max.   :3.000   Max.   :1.0000 
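Since tidyverse is already loaded, dplyr's glimpse() offers a compact, one-line-per-column alternative view of the same information; an optional quick sketch:

# Optional: compact overview of the dataset's size and column types
dim(data)       # 1025 rows, 14 columns
glimpse(data)   # one line per column: its type plus the first few values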

Step 6: Check and Handle Missing Values

Cleaning the dataset is a crucial step before any modeling. Missing values can lead to inaccurate or failed model training. This step checks how many missing (NA) values exist in each column and removes any rows containing them using na.omit(). Here’s the code to check and handle missing values:

# Check for missing values in each column
colSums(is.na(data))

# Remove rows with missing values (if any are present)
data <- na.omit(data)

The output for the above code is:

     age      sex       cp trestbps     chol      fbs  restecg  thalach 
       0        0        0        0        0        0        0        0 
   exang  oldpeak    slope       ca     thal   target 
       0        0        0        0        0        0 
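Every column reports zero missing values, so na.omit() leaves the data unchanged here. As an extra check, you can count rows containing any NA directly; a minimal sketch using base R:

# Count rows that contain at least one missing value (0 for this dataset)
sum(!complete.cases(data))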

Step 7: Convert the Target Variable to a Factor

Since this is a classification problem, the target variable (which indicates whether a person has heart disease or not) must be treated as a categorical variable. In R, we do this by converting it to a factor. This step also includes checking the class distribution to see how many observations fall into each category (e.g., 0 for no disease, 1 for disease). Here’s the code:

# Convert target to factor (for classification tasks)
data$target <- as.factor(data$target)

# View how many cases are in each class (0 = No disease, 1 = Disease)
table(data$target)

The above code gives us the output:

0   1 

499 526 

The above output means that the dataset has two classes in the target variable:

  • 499 rows where target = 0 → These represent people without heart disease.
  • 526 rows where target = 1 → These represent people with heart disease.
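The two classes are close to balanced, which you can confirm by looking at proportions rather than raw counts; a quick sketch:

# Class proportions: roughly 49% without disease vs 51% with disease
prop.table(table(data$target))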

Read More: Spam Filter Project Using R with Naive Bayes – With Code | Spotify Music Data Analysis Project in R

Step 8: Split the Dataset into Training and Testing Sets

To evaluate your machine learning model properly, it's important to divide the dataset into two parts: one for training the model and one for testing how well it performs on unseen data. Here, we use a 70-30 split, where 70% of the data goes into training, and the remaining 30% is reserved for testing. Here’s the code for this step:

# Set seed to ensure the same random split every time
set.seed(123)

# Split data: 70% for training, 30% for testing
split <- sample.split(data$target, SplitRatio = 0.7)

# Create the training set
train <- subset(data, split == TRUE)

# Create the testing set
test <- subset(data, split == FALSE)
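sample.split() from caTools stratifies on the target variable, so both sets keep roughly the same class ratio. A quick check of the split (with this seed, 717 training rows and 308 test rows):

# Check the size of each set (~70% / ~30% of the 1025 rows)
nrow(train)
nrow(test)

# Confirm both sets keep a similar class balance
prop.table(table(train$target))
prop.table(table(test$target))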

Step 9: Train a Logistic Regression Model

Now that the data is split, it’s time to train your first machine learning model. Logistic regression is a simple and widely used classification algorithm that works well for binary outcomes like predicting heart disease (yes or no). The code is:

# Train logistic regression model using all features
model_log <- glm(target ~ ., data = train, family = "binomial")

# Display the model summary to see feature significance and statistics
summary(model_log)

The output of the above step is:

Call:

glm(formula = target ~ ., family = "binomial", data = train)

 

Coefficients:

             Estimate Std. Error z value Pr(>|z|)    

(Intercept)  4.198600   1.732006   2.424  0.01535 *  

age         -0.010376   0.015420  -0.673  0.50100    

sex         -1.914140   0.313187  -6.112 9.85e-10 ***

cp           0.842150   0.119272   7.061 1.66e-12 ***

trestbps    -0.020664   0.006969  -2.965  0.00303 ** 

chol        -0.006319   0.002473  -2.556  0.01060 *  

fbs         -0.349104   0.348385  -1.002  0.31631    

restecg      0.342112   0.227805   1.502  0.13315    

thalach      0.026931   0.006820   3.949 7.86e-05 ***

exang       -1.053479   0.270717  -3.891 9.96e-05 ***

oldpeak     -0.577705   0.141253  -4.090 4.32e-05 ***

slope        0.400683   0.233948   1.713  0.08677 .  

ca          -0.727202   0.124163  -5.857 4.72e-09 ***

thal        -0.931300   0.185996  -5.007 5.53e-07 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for binomial family taken to be 1)

 

    Null deviance: 993.47  on 716  degrees of freedom

Residual deviance: 498.70  on 703  degrees of freedom

AIC: 526.7

 

Number of Fisher Scoring iterations: 6

The above output means that:

  • Important Features: Variables like sex, cp (chest pain type), thalach (maximum heart rate achieved), exang (exercise-induced angina), oldpeak, ca, and thal are statistically significant, meaning they carry strong evidence of an association with heart disease (the sketch below converts their coefficients to odds ratios for easier reading).
  • Significance Levels: Stars (***, **, *) next to p-values show the strength of the statistical evidence. More stars mean stronger evidence that the feature's effect is not zero, not necessarily a larger effect.
  • Model Fit: The residual deviance dropped from 993.47 to 498.70, showing the model fits the data much better than a model with no predictors.
  • AIC Value: The AIC score (526.7) helps compare models; lower values indicate a better trade-off between fit and complexity.
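A convenient way to read these coefficients, as noted in the first bullet, is to exponentiate them into odds ratios; a short sketch:

# Convert log-odds coefficients into odds ratios for easier interpretation
exp(coef(model_log))

# Example reading: each one-unit increase in cp multiplies the odds of heart
# disease by roughly exp(0.84), i.e. about 2.3, holding other features constant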

Step 10: Make Predictions with the Logistic Regression Model

In this step, we use the trained logistic regression model to predict probabilities on the test set and then convert those probabilities into class labels (0 or 1) using a 0.5 threshold. The code for this step is:

# Predict probabilities on the test set
pred_probs <- predict(model_log, newdata = test, type = "response")

# Convert predicted probabilities to class labels using 0.5 as the cutoff
predictions <- ifelse(pred_probs > 0.5, 1, 0)

# Convert to factor to match the format of actual labels
predictions <- as.factor(predictions)
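Before evaluating, it can help to glance at what the model produced; an optional quick check:

# Peek at the first few predicted probabilities and the predicted class counts
head(round(pred_probs, 3))
table(predictions)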

Here are some R projects you can try: Movie Rating Analysis Project in R | Forest Fire Project Using R - A Step-by-Step Guide

Step 11: Evaluate the Logistic Regression Model

Now that predictions have been made, it's time to evaluate how well the model performed. We’ll use a confusion matrix to compare the predicted values against the actual values in the test set. This will show the model's accuracy, sensitivity, specificity, and more. The code is:

# Evaluate predictions using a confusion matrix
confusionMatrix(predictions, test$target)

The output for the above step is:

Confusion Matrix and Statistics

 

          Reference

Prediction   0   1

         0 119  14

         1  31 144

                                          

               Accuracy : 0.8539          

                 95% CI : (0.8094, 0.8914)

    No Information Rate : 0.513           

    P-Value [Acc > NIR] : < 2e-16         

                                          

                  Kappa : 0.7068          

                                          

 Mcnemar's Test P-Value : 0.01707         

                                          

            Sensitivity : 0.7933          

            Specificity : 0.9114          

         Pos Pred Value : 0.8947          

         Neg Pred Value : 0.8229          

             Prevalence : 0.4870          

         Detection Rate : 0.3864          

   Detection Prevalence : 0.4318          

      Balanced Accuracy : 0.8524          

                                          

       'Positive' Class : 0               

The above output means that:

  • The model achieved an accuracy of 85.4%, showing strong performance.
  • Sensitivity is 79.3% and specificity is 91.1%; since caret treats class 0 (no disease) as the 'positive' class here, the model correctly identifies 79.3% of no-disease cases and 91.1% of disease cases.
  • A Kappa score of 0.71 suggests strong agreement beyond chance.
  • McNemar’s test p-value (0.017) signals an imbalance between the two error types: 31 no-disease cases were misclassified versus 14 disease cases. These figures can be re-derived from the confusion matrix counts, as the sketch below shows.
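As a sanity check, the headline metrics above can be recomputed by hand from the four counts in the confusion matrix, remembering that class 0 is the 'positive' class; a minimal sketch:

# Recompute accuracy, sensitivity, and specificity from the confusion matrix counts
tp <- 119; fn <- 31   # actual 0: predicted 0 / predicted 1
tn <- 144; fp <- 14   # actual 1: predicted 1 / predicted 0

(tp + tn) / (tp + tn + fp + fn)   # accuracy    = 263/308 ~ 0.854
tp / (tp + fn)                    # sensitivity = 119/150 ~ 0.793
tn / (tn + fp)                    # specificity = 144/158 ~ 0.911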

Step 12: Train and Evaluate a Random Forest Classifier

Now let's build a Random Forest model and assess its performance on the test data. Random Forest is an ensemble method that builds multiple decision trees and combines their outputs for better accuracy. Here’s the code:

# Train Random Forest model
model_rf <- randomForest(target ~ ., data = train, ntree = 100)

# Predict on test set
rf_predictions <- predict(model_rf, newdata = test)

# Evaluate the model
confusionMatrix(rf_predictions, test$target)

The above code gives us the output as:

Confusion Matrix and Statistics

 

          Reference

Prediction   0   1

         0 149   3

         1   1 155

                                          

               Accuracy : 0.987           

                 95% CI : (0.9671, 0.9965)

    No Information Rate : 0.513           

    P-Value [Acc > NIR] : <2e-16          

                                          

                  Kappa : 0.974           

                                          

 Mcnemar's Test P-Value : 0.6171          

                                          

            Sensitivity : 0.9933          

            Specificity : 0.9810          

         Pos Pred Value : 0.9803          

         Neg Pred Value : 0.9936          

             Prevalence : 0.4870          

         Detection Rate : 0.4838          

   Detection Prevalence : 0.4935          

      Balanced Accuracy : 0.9872          

                                          

       'Positive' Class : 0            

The above output means that:

  • High Accuracy: The model correctly predicted ~98.7% of the test cases, showing excellent performance.
  • Very Low Errors: Only 4 total misclassifications out of 308 test cases (1 false negative, 3 false positives).
  • High Sensitivity & Specificity: It correctly identified 99.3% of class 0 cases and 98.1% of class 1 cases.
  • Kappa Score of 0.974: This indicates almost perfect agreement between predicted and actual values beyond chance.
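Because both models were scored on the same test split, their results are directly comparable. Here is a small sketch that lines up the headline numbers, reusing the objects created in Steps 10 to 12:

# Put the headline metrics of both models side by side
log_cm <- confusionMatrix(predictions, test$target)
rf_cm  <- confusionMatrix(rf_predictions, test$target)

data.frame(
  Model    = c("Logistic Regression", "Random Forest"),
  Accuracy = c(log_cm$overall["Accuracy"], rf_cm$overall["Accuracy"]),
  Kappa    = c(log_cm$overall["Kappa"],    rf_cm$overall["Kappa"])
)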

Step 13: Plot Important Features Identified by Random Forest

This step helps you visualize which features had the most impact on predicting heart disease. Random Forest automatically ranks variables based on how much they improve the model’s accuracy. The code for this step is:

# Plot variable importance from Random Forest
varImpPlot(model_rf)

This gives us a graph of the variable importance from Random Forest:


The above graph shows that:

  • Top Predictors: The features thal, ca, and cp (chest pain type) are the most important for predicting heart disease; they have the highest impact on the model’s accuracy.
  • Measured by Gini: Importance is measured by Mean Decrease in Gini, which shows how much a variable helps the decision trees make clean splits. Higher values mean more influence (the exact values behind the plot are printed in the sketch below).
  • Least Important: Features like fbs (fasting blood sugar) and restecg (resting ECG results) contribute the least to prediction accuracy.
  • Practical Insight: Focusing on the top 4–5 features (e.g., thal, ca, cp, thalach) can give strong predictive power with less complexity.
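To see the exact numbers behind the plot, as noted in the Gini bullet above, you can print and sort the importance matrix itself; a minimal sketch:

# Numeric importance scores behind varImpPlot(), sorted from most to least important
imp <- importance(model_rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]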

Conclusion

In this Heart Disease Prediction project, we built logistic regression and Random Forest classification models using R in Google Colab to predict the likelihood of heart disease based on clinical features like chest pain type, thalassemia, and the number of major vessels.

After preprocessing the data, we split it into training and testing sets using a 70/30 ratio, trained both models, and evaluated their performance using confusion matrices and classification metrics.

The Random Forest model performed best, achieving an accuracy of 98.7% with strong sensitivity (99.3%) and specificity (98.1%), indicating excellent performance in identifying both heart disease and non-disease cases.

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!

Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Colab Link:
https://colab.research.google.com/drive/13mkc60XdZtbFVxk25gqK4xnbzMG6ZCdr#scrollTo=K9mda-O9p5Gf

Frequently Asked Questions (FAQs)

1. What is the main goal of a Heart Disease Prediction project in R?

2. Which tools and libraries are used in this Heart Disease Prediction project?

3. What other machine learning algorithms can be used to improve model accuracy?

4. How accurate is the model, and how is performance evaluated?

5. What are some other beginner-friendly machine learning projects in R?

Rohit Sharma

827 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
