Heart Disease Prediction Project Using R
By Rohit Sharma
Updated on Aug 07, 2025 | 14 min read | 1.36K+ views
This Heart Disease Prediction Project in R is easy to understand and beginner-friendly. This blog explains the process of building a classification model to predict the presence of heart disease.
We'll be using Google Colab and R. The project covers essential steps like data cleaning, exploration, model training with logistic regression and random forest, and model evaluation. We will use simple R packages such as caret, randomForest, and caTools to make the workflow easy to follow.
Redefine Your Future with upGrad’s Data Science Courses. Join the next wave of AI and analytics leaders. Learn from the best, build real-world skills, and get globally recognized. Enrol today, your data-driven career starts here.
Must Read For Beginners: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025
This heart disease prediction project requires certain skills, time, and tools, which are summarized in the table below.
Criteria | Details
Duration | 1.5 to 2 hours
Difficulty Level | Beginner
Programming Skills | Basic R programming, data frame handling, using Google Colab
ML Knowledge | Introductory understanding of classification models (logistic regression, random forest)
Tools Used | R, Google Colab (with the notebook runtime set to R)
Libraries Used | tidyverse, caret, randomForest, e1071, caTools
Key Skills Learned | Data cleaning, data splitting, logistic regression, random forest, model evaluation, feature importance plotting
Outcome | Predict the presence of heart disease with high accuracy using basic ML techniques
From AI-powered finance to advanced data science, our globally recognized programs equip you to lead in tomorrow’s tech economy. Explore top data science and AI courses. Enrol now.
The full breakdown of this project is discussed in this section, where each step is explained along with the output.
To begin working with R in Google Colab, you'll first need to switch the notebook's default language from Python to R. This setup allows you to write and execute R code seamlessly within the Colab environment.
Start by opening a new notebook in Google Colab. Then, navigate to the "Runtime" menu at the top and select "Change runtime type." In the window that appears, change the language setting to "R" from the dropdown menu. Once done, click "Save" to apply the changes. Your notebook is now ready to run R code.
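To confirm that the runtime switch worked, you can run a quick, optional sanity check in the first cell. If it prints an R version string, the notebook is indeed executing R code:
# Optional sanity check: confirm the notebook is running R
R.version.string   # e.g. "R version 4.x.x ..."
Sys.time()         # confirms the cell executes as R code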
Once the R environment is active in Google Colab, the next step is to install all the essential R packages you'll need throughout the project. These packages support data cleaning, visualization, machine learning model creation, and evaluation. This installation step only needs to be run once per session. The code to install the required packages is:
# Install packages (only needed once)
install.packages("tidyverse") # Data wrangling and visualization
install.packages("caret") # Machine learning and model evaluation
install.packages("e1071") # Support Vector Machine (used by caret)
install.packages("randomForest") # Random Forest classifier
install.packages("caTools") # Splitting the dataset
The above code will install all necessary packages for this project. The output is given below.
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
also installing the dependency ‘bitops’
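If you prefer, the same packages can be installed in a single call. This is just a more compact alternative to the separate install.packages() lines above:
# Alternative: install all required packages in one call (only needed once)
install.packages(c("tidyverse", "caret", "e1071", "randomForest", "caTools"))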
Here are some fun R projects: Daily Temperature Forecast Analysis Using R
After installing the required packages, you need to load them into your R session so their functions are available for use. This step makes sure you can access tools for data wrangling, visualization, model training, and evaluation throughout your analysis. Here’s the code to load the libraries:
# Load the installed libraries
library(tidyverse) # For data manipulation and visualization
library(caret) # For machine learning workflows and evaluation
library(e1071) # Supports algorithms used by caret (like SVM)
library(randomForest) # For building random forest models
library(caTools) # For splitting the dataset into training/testing sets
Once the libraries load successfully, the console output confirms it:
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift

randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin
With the environment set up, it's time to load the heart disease dataset. After you've manually uploaded the CSV file to your Google Colab session, you can read it into R using read.csv(). This step stores the data in a variable called data, which you’ll use for analysis and model building. To confirm that everything loaded correctly, it's a good idea to preview the first few rows. Here’s the code:
# Read the CSV file
data <- read.csv("Heart Disease Prediction.csv")
# View the first few rows of the dataset
head(data)
The above code gives us a glimpse of the dataset we’ll work with:
  age   sex   cp    trestbps chol  fbs   restecg thalach exang oldpeak slope ca    thal  target
  <int> <int> <int> <int>    <int> <int> <int>   <int>   <int> <dbl>   <int> <int> <int> <int>
1 52    1     0     125      212   0     1       168     0     1.0     2     2     3     0
2 53    1     0     140      203   1     0       155     1     3.1     0     0     3     0
3 70    1     0     145      174   0     1       125     1     2.6     0     0     3     0
4 61    1     0     148      203   0     1       161     0     0.0     2     1     3     0
5 62    0     0     138      294   1     1       106     0     1.9     1     3     2     0
6 58    0     0     100      248   0     0       122     0     1.0     1     0     2     1
Before starting with cleaning or modeling, it's important to understand the dataset we're working with. This step helps you inspect the types of variables (e.g., numeric, categorical), the number of features, and their basic statistical properties. Here’s the code:
# Check the structure of the dataset (column types and sample values)
str(data)
# Get summary statistics for each column
summary(data)
The above code gives us the structure of the dataset along with summary statistics for each column.
'data.frame':   1025 obs. of  14 variables:
 $ age     : int  52 53 70 61 62 58 58 55 46 54 ...
 $ sex     : int  1 1 1 1 0 0 1 1 1 1 ...
 $ cp      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ trestbps: int  125 140 145 148 138 100 114 160 120 122 ...
 $ chol    : int  212 203 174 203 294 248 318 289 249 286 ...
 $ fbs     : int  0 1 0 0 1 0 0 0 0 0 ...
 $ restecg : int  1 0 1 1 1 0 2 0 0 0 ...
 $ thalach : int  168 155 125 161 106 122 140 145 144 116 ...
 $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
 $ oldpeak : num  1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
 $ slope   : int  2 0 0 2 1 1 0 1 2 1 ...
 $ ca      : int  2 0 0 1 3 0 3 1 0 2 ...
 $ thal    : int  3 3 3 3 2 2 1 3 3 2 ...
 $ target  : int  0 0 0 0 0 1 0 0 0 0 ...
      age             sex               cp            trestbps
 Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0
 1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0
 Median :56.00   Median :1.0000   Median :1.0000   Median :130.0
 Mean   :54.43   Mean   :0.6956   Mean   :0.9424   Mean   :131.6
 3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0
 Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0
      chol           fbs             restecg          thalach
 Min.   :126    Min.   :0.0000   Min.   :0.0000   Min.   : 71.0
 1st Qu.:211    1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:132.0
 Median :240    Median :0.0000   Median :1.0000   Median :152.0
 Mean   :246    Mean   :0.1493   Mean   :0.5298   Mean   :149.1
 3rd Qu.:275    3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0
 Max.   :564    Max.   :1.0000   Max.   :2.0000   Max.   :202.0
     exang           oldpeak          slope             ca
 Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000
 Median :0.0000   Median :0.800   Median :1.000   Median :0.0000
 Mean   :0.3366   Mean   :1.072   Mean   :1.385   Mean   :0.7541
 3rd Qu.:1.0000   3rd Qu.:1.800   3rd Qu.:2.000   3rd Qu.:1.0000
 Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000
      thal           target
 Min.   :0.000   Min.   :0.0000
 1st Qu.:2.000   1st Qu.:0.0000
 Median :2.000   Median :1.0000
 Mean   :2.324   Mean   :0.5132
 3rd Qu.:3.000   3rd Qu.:1.0000
 Max.   :3.000   Max.   :1.0000
Cleaning the dataset is a crucial step before any modeling. Missing values can lead to inaccurate or failed model training. This step checks how many missing (NA) values exist in each column and removes any rows containing them using na.omit(). Here’s the code to check and handle missing values:
# Check for missing values in each column
colSums(is.na(data))
# Remove rows with missing values (if any are present)
data <- na.omit(data)
The output for the above code is:
     age      sex       cp trestbps     chol      fbs  restecg
       0        0        0        0        0        0        0
 thalach    exang  oldpeak    slope       ca     thal   target
       0        0        0        0        0        0        0
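As an extra, optional check, you can confirm that no missing values remain and see how many rows are left after cleaning (in this dataset, no rows are dropped because there are no NA values):
# Total number of missing values across the entire data frame (should be 0)
sum(is.na(data))
# Number of rows remaining after na.omit()
nrow(data)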
Since this is a classification problem, the target variable (which indicates whether a person has heart disease or not) must be treated as a categorical variable. In R, we do this by converting it to a factor. This step also includes checking the class distribution to see how many observations fall into each category (e.g., 0 for no disease, 1 for disease). Here’s the code:
# Convert target to factor (for classification tasks)
data$target <- as.factor(data$target)
# View how many cases are in each class (0 = No disease, 1 = Disease)
table(data$target)
The above code gives us the output:
  0   1
499 526
The above output means that the dataset has two classes in the target variable: 499 patients without heart disease (class 0) and 526 patients with heart disease (class 1). The classes are roughly balanced, which makes accuracy a reasonable headline metric for this project.
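If you want the class balance as proportions rather than raw counts, a small optional addition is:
# Optional: class distribution as proportions of the dataset
prop.table(table(data$target))
# Rounded to percentages for readability
round(100 * prop.table(table(data$target)), 1)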
Read More: Spam Filter Project Using R with Naive Bayes – With Code | Spotify Music Data Analysis Project in R
To evaluate your machine learning model properly, it's important to divide the dataset into two parts: one for training the model and one for testing how well it performs on unseen data. Here, we use a 70-30 split, where 70% of the data goes into training, and the remaining 30% is reserved for testing. Here’s the code for this step:
# Set seed to ensure the same random split every time
set.seed(123)
# Split data: 70% for training, 30% for testing
split <- sample.split(data$target, SplitRatio = 0.7)
# Create the training set
train <- subset(data, split == TRUE)
# Create the testing set
test <- subset(data, split == FALSE)
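Because sample.split() stratifies on the target variable, both sets should keep a similar class balance. You can verify this, along with the 70/30 sizes, using an optional check like the one below:
# Check the size of each split
nrow(train)   # roughly 70% of the rows
nrow(test)    # roughly 30% of the rows
# Confirm that the class balance is preserved in both sets
prop.table(table(train$target))
prop.table(table(test$target))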
Now that the data is split, it’s time to train your first machine learning model. Logistic regression is a simple and widely used classification algorithm that works well for binary outcomes like predicting heart disease (yes or no). The code is:
# Train logistic regression model using all features
model_log <- glm(target ~ ., data = train, family = "binomial")
# Display the model summary to see feature significance and statistics
summary(model_log)
The output of the above step is:
Call: glm(formula = target ~ ., family = "binomial", data = train)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.198600   1.732006   2.424  0.01535 *
age         -0.010376   0.015420  -0.673  0.50100
sex         -1.914140   0.313187  -6.112 9.85e-10 ***
cp           0.842150   0.119272   7.061 1.66e-12 ***
trestbps    -0.020664   0.006969  -2.965  0.00303 **
chol        -0.006319   0.002473  -2.556  0.01060 *
fbs         -0.349104   0.348385  -1.002  0.31631
restecg      0.342112   0.227805   1.502  0.13315
thalach      0.026931   0.006820   3.949 7.86e-05 ***
exang       -1.053479   0.270717  -3.891 9.96e-05 ***
oldpeak     -0.577705   0.141253  -4.090 4.32e-05 ***
slope        0.400683   0.233948   1.713  0.08677 .
ca          -0.727202   0.124163  -5.857 4.72e-09 ***
thal        -0.931300   0.185996  -5.007 5.53e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.47  on 716  degrees of freedom
Residual deviance: 498.70  on 703  degrees of freedom
AIC: 526.7

Number of Fisher Scoring iterations: 6
The above output means that most predictors have a statistically significant relationship with heart disease: sex, cp (chest pain type), trestbps, chol, thalach, exang, oldpeak, ca, and thal all have p-values below 0.05, while age, fbs, restecg, and slope do not reach that threshold. The drop from the null deviance (993.47) to the residual deviance (498.70) shows that the model fits the training data far better than an intercept-only model.
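If you'd like to pull the statistically significant predictors out of this summary programmatically, one possible sketch (using the standard coefficient table returned by summary()) is:
# Extract the coefficient table from the logistic regression summary
coefs <- summary(model_log)$coefficients
# Keep predictors with p-values below 0.05, dropping the intercept row
signif_vars <- setdiff(rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05], "(Intercept)")
signif_vars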
In this step, we will use the trained logistic regression model to predict probabilities on the test set. Then, convert those probabilities into class labels (0 or 1) using a 0.5 threshold. The code for this step is:
# Predict probabilities on the test set
pred_probs <- predict(model_log, newdata = test, type = "response")
# Convert predicted probabilities to class labels using 0.5 as the cutoff
predictions <- ifelse(pred_probs > 0.5, 1, 0)
# Convert to factor to match the format of actual labels
predictions <- as.factor(predictions)
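The 0.5 cutoff is a common default rather than a requirement. As an optional experiment, you can inspect the raw probabilities and see how a stricter threshold changes the predicted class counts:
# Peek at the first few predicted probabilities
head(round(pred_probs, 3))
# Illustrative only: a stricter cutoff of 0.6 for predicting class 1
predictions_strict <- as.factor(ifelse(pred_probs > 0.6, 1, 0))
table(predictions_strict)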
Here are some R projects you can try: Movie Rating Analysis Project in R | Forest Fire Project Using R - A Step-by-Step Guide
Now that predictions have been made, it's time to evaluate how well the model performed. We’ll use a confusion matrix to compare the predicted values against the actual values in the test set. This will show the model's accuracy, sensitivity, specificity, and more. The code is:
# Evaluate predictions using a confusion matrix
confusionMatrix(predictions, test$target)
The output for the above step is:
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 119  14
         1  31 144

               Accuracy : 0.8539
                 95% CI : (0.8094, 0.8914)
    No Information Rate : 0.513
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : 0.7068

 Mcnemar's Test P-Value : 0.01707

            Sensitivity : 0.7933
            Specificity : 0.9114
         Pos Pred Value : 0.8947
         Neg Pred Value : 0.8229
             Prevalence : 0.4870
         Detection Rate : 0.3864
   Detection Prevalence : 0.4318
      Balanced Accuracy : 0.8524

       'Positive' Class : 0
The above output means that the logistic regression model classified about 85.4% of the test cases correctly (95% CI: 0.8094 to 0.8914), well above the no-information rate of 51.3%. With class 0 (no disease) treated as the positive class, sensitivity is 79.3% and specificity is 91.1%, and the Kappa value of 0.71 indicates substantial agreement beyond chance.
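The confusionMatrix() result is also an object you can store and query, which is handy if you want to reuse individual metrics later. For example:
# Store the confusion matrix object and pull out specific metrics
cm_log <- confusionMatrix(predictions, test$target)
cm_log$overall["Accuracy"]                        # overall accuracy
cm_log$byClass[c("Sensitivity", "Specificity")]   # per-class metrics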
Now let's build a Random Forest model and assess its performance on the test data. Random Forest is an ensemble method that builds multiple decision trees and combines their outputs for better accuracy. Here’s the code:
# Train Random Forest model
model_rf <- randomForest(target ~ ., data = train, ntree = 100)
# Predict on test set
rf_predictions <- predict(model_rf, newdata = test)
# Evaluate the model
confusionMatrix(rf_predictions, test$target)
The above code gives us the output as:
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 149   3
         1   1 155

               Accuracy : 0.987
                 95% CI : (0.9671, 0.9965)
    No Information Rate : 0.513
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.974

 Mcnemar's Test P-Value : 0.6171

            Sensitivity : 0.9933
            Specificity : 0.9810
         Pos Pred Value : 0.9803
         Neg Pred Value : 0.9936
             Prevalence : 0.4870
         Detection Rate : 0.4838
   Detection Prevalence : 0.4935
      Balanced Accuracy : 0.9872

       'Positive' Class : 0
The above output means that the Random Forest model classified about 98.7% of the test cases correctly, misclassifying only 4 of the 308 test observations. Sensitivity (99.3%) and specificity (98.1%) are both very high, and the Kappa value of 0.974 points to near-perfect agreement between predicted and actual labels, a clear improvement over the logistic regression model.
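To put the two models side by side, you can compute their test-set accuracies and collect them in a small data frame (a quick sketch that reuses the prediction objects created earlier):
# Compare logistic regression and random forest on the same test set
acc_log <- confusionMatrix(predictions, test$target)$overall["Accuracy"]
acc_rf  <- confusionMatrix(rf_predictions, test$target)$overall["Accuracy"]
data.frame(Model    = c("Logistic Regression", "Random Forest"),
           Accuracy = round(c(acc_log, acc_rf), 4))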
This step helps you visualize which features had the most impact on predicting heart disease. Random Forest automatically ranks variables based on how much they improve the model’s accuracy. The code for this step is:
# Plot variable importance from Random Forest
varImpPlot(model_rf)
This gives us a graph of the variable importance from Random Forest:
The above graph shows that features such as cp (chest pain type), thal (thalassemia), and ca (number of major vessels) rank among the most important predictors of heart disease in this model, while several of the remaining variables contribute comparatively little.
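If you prefer the numbers behind the plot, the randomForest package also exposes them directly; sorting by mean decrease in Gini gives the same ranking shown in the chart:
# Numeric importance scores used by varImpPlot(), sorted from high to low
imp <- importance(model_rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]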
In this Heart Disease Prediction project, we built a Random Forest classification model using R in Google Colab to predict the likelihood of heart disease based on clinical features like chest pain type, thalassemia, and number of major vessels.
After preprocessing the data, we split it into training and testing sets using a 70/30 ratio, trained the model, and evaluated its performance using a confusion matrix and classification metrics.
The model achieved a high accuracy of 98.7%, with strong sensitivity (99.3%) and specificity (98.1%), indicating excellent performance in identifying both heart disease and non-disease cases.
Colab Link:
https://colab.research.google.com/drive/13mkc60XdZtbFVxk25gqK4xnbzMG6ZCdr#scrollTo=K9mda-O9p5Gf