Diabetes Prediction Analysis Project Using R
By Rohit Sharma
Updated on Aug 11, 2025 | 16 min read | 1.4K+ views
Diabetes is one of the most prevalent diseases today, with cases rising from 200 million in 1990 to 830 million in 2022, according to the WHO. This Diabetes Prediction Analysis Project in R focuses on building machine learning models to predict the likelihood of diabetes from health-related factors such as age, BMI, and blood glucose level.
Using R in Google Colab, this beginner-friendly project walks through data loading, cleaning, exploration, and modeling using Logistic Regression and Random Forest.
It helps users understand classification techniques, evaluate model performance, and visualize key trends in the data. This project is ideal for those looking to apply predictive analytics in healthcare using R.
Also Read: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025
Before starting the Diabetes Prediction Analysis Project in R, it's necessary to understand what you’ll be working with. This project is designed for beginners and uses simple tools and libraries in R. The table below shows the required tools, libraries, and skills you’ll need for this project.
Category | Details
Programming Tool | R (executed in Google Colab) |
Main Libraries | tidyverse, caret, caTools, e1071, randomForest, ggplot2 |
Project Type | Binary classification using supervised learning |
Machine Learning | Logistic Regression, Random Forest |
Data Handling | Data cleaning, exploration, and preprocessing |
Evaluation Skills | Accuracy, confusion matrix, performance comparison |
Data Visualization | Scatter plots, feature relationships using ggplot2 |
Skill Level | Beginner (no advanced ML or R skills required) |
Estimated Time | 1.5 to 2 hours |
In this section, we’ll see how to build this diabetes prediction model using R. The code for each step is given along with the output and explanation.
To begin working with R in Google Colab, it's important to switch the notebook’s default language from Python to R. This enables the execution of R code directly within the environment.
To do this, open a new Colab notebook and click on the Runtime menu at the top. Choose the Change runtime type option. In the dialog box that shows up, change the language setting to R using the dropdown menu and then click Save.
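Once the runtime is switched, a quick one-line sanity check confirms the notebook is actually executing R:

```r
# Confirm that the Colab runtime is now R, not Python
R.version.string   # prints something like "R version 4.x.x (...)"
```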
In this step, we prepare the environment by installing and loading the libraries needed for data handling, model training, data splitting, and evaluation. The packages only need to be installed once, but they must be loaded every time the notebook is run.
# Install essential libraries (run only once)
install.packages("tidyverse") # For data wrangling and visualization
install.packages("caret") # For machine learning algorithms and evaluation
install.packages("caTools") # For splitting the dataset into train/test
install.packages("e1071") # For SVM models and performance metrics
# Load the installed libraries
library(tidyverse) # Load tidyverse for data handling and plotting
library(caret) # Load caret for model training and evaluation
library(caTools) # Load caTools for splitting data
library(e1071) # Load e1071 for classification tools like confusion matrix
The above code installs and loads the essential libraries required for the Diabetes Prediction Analysis Project. The output confirms that the libraries are installed and loaded correctly.
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
also installing the dependency ‘bitops’
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: lattice
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
    lift
Check this R project: Loan Approval Classification Using Logistic Regression in R
In this step, we will load the diabetes dataset directly from the path where it was uploaded in your Colab environment. After loading the data into a variable, the first few rows are displayed to get a look at the structure and values in the dataset. The code for this step is:
# Load the dataset directly from the uploaded path
data <- read.csv("diabetes prediction dataset.csv")
# View the first few rows of the dataset
head(data) # Displays the top rows to understand the structure
The above code loads the dataset and also gives us a glimpse of the dataset we’re working with:
gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes
&lt;chr&gt; | &lt;dbl&gt; | &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;int&gt; | &lt;int&gt;
1 | Female | 80 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0
2 | Female | 54 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0
3 | Male | 28 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0
4 | Female | 36 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0
5 | Male | 76 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0
6 | Female | 20 | 0 | 0 | never | 27.32 | 6.6 | 85 | 0
Here’s an R project for you: Player Performance Analysis & Prediction Using R
This step helps us understand the dataset’s structure, types of variables, and whether there are any missing values. A statistical summary of each column is also generated to give an overview of the data distribution. Here’s the code to explore the dataset:
# Check structure of the dataset
str(data) # Shows data types and column structure
# Check if there are any missing values
colSums(is.na(data)) # Summarizes missing values per column
# Summary statistics
summary(data) # Gives min, max, mean, and quartiles for each column
The above code gives us a summary of the dataset for us to understand it better.
'data.frame':	100000 obs. of 9 variables:
 $ gender             : chr "Female" "Female" "Male" "Female" ...
 $ age                : num 80 54 28 36 76 20 44 79 42 32 ...
 $ hypertension       : int 0 0 0 0 1 0 0 0 0 0 ...
 $ heart_disease      : int 1 0 0 0 1 0 0 0 0 0 ...
 $ smoking_history    : chr "never" "No Info" "never" "current" ...
 $ bmi                : num 25.2 27.3 27.3 23.4 20.1 ...
 $ HbA1c_level        : num 6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
 $ blood_glucose_level: int 140 80 158 155 155 85 200 85 145 100 ...
 $ diabetes           : int 0 0 0 0 0 0 1 0 0 0 ...

             gender                 age        hypertension       heart_disease
                  0                   0                   0                   0
    smoking_history                 bmi         HbA1c_level blood_glucose_level
                  0                   0                   0                   0
           diabetes
                  0

 gender              age          hypertension      heart_disease
 Length:100000      Min.   : 0.08   Min.   :0.00000   Min.   :0.00000
 Class :character   1st Qu.:24.00   1st Qu.:0.00000   1st Qu.:0.00000
 Mode  :character   Median :43.00   Median :0.00000   Median :0.00000
                    Mean   :41.89   Mean   :0.07485   Mean   :0.03942
                    3rd Qu.:60.00   3rd Qu.:0.00000   3rd Qu.:0.00000
                    Max.   :80.00   Max.   :1.00000   Max.   :1.00000
 smoking_history         bmi         HbA1c_level    blood_glucose_level
 Length:100000      Min.   :10.01   Min.   :3.500   Min.   : 80.0
 Class :character   1st Qu.:23.63   1st Qu.:4.800   1st Qu.:100.0
 Mode  :character   Median :27.32   Median :5.800   Median :140.0
                    Mean   :27.32   Mean   :5.528   Mean   :138.1
                    3rd Qu.:29.58   3rd Qu.:6.200   3rd Qu.:159.0
                    Max.   :95.69   Max.   :9.000   Max.   :300.0
    diabetes
 Min.   :0.000
 1st Qu.:0.000
 Median :0.000
 Mean   :0.085
 3rd Qu.:0.000
 Max.   :1.000
In this section, we begin preparing the dataset for modeling. We reconfirm the structure and check for missing values. We also convert the target column diabetes into a factor to make it suitable for classification algorithms and examine the class distribution.
# View dataset structure
str(data) # Displays column names, data types, and sample data
# Check for missing values
colSums(is.na(data)) # Shows count of NA values per column
# Summary statistics
summary(data) # Provides descriptive statistics for numeric columns
# Convert the target column 'diabetes' to factor
data$diabetes <- as.factor(data$diabetes) # Ensures classification algorithms treat it as categorical
# Check class balance
table(data$diabetes) # Displays the count of each class (0 = No, 1 = Yes)
The above code checks for missing values and converts the diabetes target column for further classification steps.
'data.frame':	100000 obs. of 9 variables:
 $ gender             : chr "Female" "Female" "Male" "Female" ...
 $ age                : num 80 54 28 36 76 20 44 79 42 32 ...
 $ hypertension       : int 0 0 0 0 1 0 0 0 0 0 ...
 $ heart_disease      : int 1 0 0 0 1 0 0 0 0 0 ...
 $ smoking_history    : chr "never" "No Info" "never" "current" ...
 $ bmi                : num 25.2 27.3 27.3 23.4 20.1 ...
 $ HbA1c_level        : num 6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
 $ blood_glucose_level: int 140 80 158 155 155 85 200 85 145 100 ...
 $ diabetes           : int 0 0 0 0 0 0 1 0 0 0 ...

             gender                 age        hypertension       heart_disease
                  0                   0                   0                   0
    smoking_history                 bmi         HbA1c_level blood_glucose_level
                  0                   0                   0                   0
           diabetes
                  0

 gender              age          hypertension      heart_disease
 Length:100000      Min.   : 0.08   Min.   :0.00000   Min.   :0.00000
 Class :character   1st Qu.:24.00   1st Qu.:0.00000   1st Qu.:0.00000
 Mode  :character   Median :43.00   Median :0.00000   Median :0.00000
                    Mean   :41.89   Mean   :0.07485   Mean   :0.03942
                    3rd Qu.:60.00   3rd Qu.:0.00000   3rd Qu.:0.00000
                    Max.   :80.00   Max.   :1.00000   Max.   :1.00000
 smoking_history         bmi         HbA1c_level    blood_glucose_level
 Length:100000      Min.   :10.01   Min.   :3.500   Min.   : 80.0
 Class :character   1st Qu.:23.63   1st Qu.:4.800   1st Qu.:100.0
 Mode  :character   Median :27.32   Median :5.800   Median :140.0
                    Mean   :27.32   Mean   :5.528   Mean   :138.1
                    3rd Qu.:29.58   3rd Qu.:6.200   3rd Qu.:159.0
                    Max.   :95.69   Max.   :9.000   Max.   :300.0
    diabetes
 Min.   :0.000
 1st Qu.:0.000
 Median :0.000
 Mean   :0.085
 3rd Qu.:0.000
 Max.   :1.000

    0     1
91500  8500
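The class counts above can also be expressed as proportions, which makes the imbalance easier to read. Here is a minimal sketch using a small synthetic stand-in for data$diabetes (the full CSV is not bundled with this article); in the notebook itself you would call prop.table(table(data$diabetes)):

```r
# Synthetic stand-in for data$diabetes, mirroring the ~91.5% / 8.5% split
diabetes <- factor(c(rep(0, 915), rep(1, 85)))

counts <- table(diabetes)   # raw class counts
prop.table(counts)          # convert counts to proportions: 0 -> 0.915, 1 -> 0.085
```

A split this skewed is why accuracy alone can be misleading, and why the confusion matrices later in the project matter.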
Read More: Spotify Music Data Analysis Project in R
Before building our prediction model, we divide the dataset into two parts: training data (used to build the model) and testing data (used to evaluate it). This will ensure we can validate how well the model performs on unseen data. The code for this step is:
# Load library for splitting
library(caTools) # Used for creating a random split of the dataset
# Set seed so results are reproducible
set.seed(123) # Ensures you get the same split each time you run it
# Split the data: 80% training, 20% testing
split <- sample.split(data$diabetes, SplitRatio = 0.8)
train_data <- subset(data, split == TRUE) # Training dataset
test_data <- subset(data, split == FALSE) # Testing dataset
# Check sizes of train and test sets
nrow(train_data) # Number of rows in training set
nrow(test_data) # Number of rows in testing set
The output shows the number of rows the model is trained on and the number of rows it will be tested on:
80000
20000
We will now build our first prediction model using logistic regression, which is ideal for binary classification tasks like predicting diabetes (Yes/No). We'll train it on the training dataset using all available features.
# Train the logistic regression model
model <- glm(diabetes ~ ., data = train_data, family = "binomial") # Fit the model on training data
# See the summary of the model
summary(model) # Displays coefficients, significance levels, and performance stats
The output for the above step is:
Call: glm(formula = diabetes ~ ., family = "binomial", data = train_data)
Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                -26.954480   0.325317 -82.856  < 2e-16 ***
genderMale                   0.304376   0.040292   7.554 4.21e-14 ***
genderOther                 -9.889076 131.261959  -0.075    0.940
age                          0.045976   0.001256  36.600  < 2e-16 ***
hypertension                 0.732500   0.052698  13.900  < 2e-16 ***
heart_disease                0.741719   0.067819  10.937  < 2e-16 ***
smoking_historyever         -0.062610   0.102914  -0.608    0.543
smoking_historyformer       -0.114614   0.078173  -1.466    0.143
smoking_historynever        -0.166353   0.067704  -2.457    0.014 *
smoking_historyNo Info      -0.735032   0.074307  -9.892  < 2e-16 ***
smoking_historynot current  -0.150201   0.092239  -1.628    0.103
bmi                          0.086639   0.002831  30.609  < 2e-16 ***
HbA1c_level                  2.335872   0.039810  58.675  < 2e-16 ***
blood_glucose_level          0.033147   0.000538  61.615  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46530 on 79999 degrees of freedom
Residual deviance: 18166 on 79986 degrees of freedom
AIC: 18194

Number of Fisher Scoring iterations: 12
The above output implies that HbA1c_level, blood_glucose_level, age, bmi, hypertension, and heart_disease are all highly significant predictors (p < 2e-16), with HbA1c level carrying the largest positive coefficient. Being male is also associated with higher odds of diabetes, while most smoking-history categories are not statistically significant. The large drop from the null deviance (46530) to the residual deviance (18166) indicates the model explains a substantial share of the variation in the target.
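Because logistic regression works on the log-odds scale, exponentiating a coefficient turns it into an odds ratio, which is usually easier to interpret. The sketch below uses a small synthetic dataset purely for illustration; in the project itself you would simply run exp(coef(model)):

```r
# Illustration: coefficients -> odds ratios (synthetic data, not the project dataset)
set.seed(42)
toy <- data.frame(glucose = rnorm(200, mean = 140, sd = 30))
# Simulate an outcome where higher glucose raises the probability of diabetes
toy$outcome <- rbinom(200, 1, plogis(-5 + 0.03 * toy$glucose))

fit <- glm(outcome ~ glucose, data = toy, family = "binomial")
exp(coef(fit))["glucose"]   # odds ratio per one-unit rise in glucose (greater than 1 here)
```

An odds ratio above 1 means each unit increase in the predictor multiplies the odds of diabetes by that factor, holding everything else constant.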
Some R projects you can try: Forest Fire Project Using R - A Step-by-Step Guide | Movie Rating Analysis Project in R
In this step, we will use the trained logistic regression model to predict the probability of diabetes for each test case. Then, we will convert those probabilities into binary classes (0 or 1) using a threshold of 0.5.
# Predict probabilities on the test set
pred_prob <- predict(model, newdata = test_data, type = "response") # Get predicted probabilities
# Convert probabilities to binary predictions (threshold = 0.5)
pred_class <- ifelse(pred_prob > 0.5, 1, 0) # Assign class labels based on threshold
pred_class <- as.factor(pred_class) # Convert to factor for evaluation
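The thresholding step is just an element-wise comparison; a tiny illustration with hypothetical probabilities:

```r
# How the 0.5 cutoff turns predicted probabilities into class labels
probs <- c(0.12, 0.48, 0.51, 0.93)   # hypothetical model outputs
ifelse(probs > 0.5, 1, 0)            # -> 0 0 1 1
```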
Now we will evaluate how well our model performed on the test set. We’ll compare the predicted values against the actual ones using a confusion matrix to get metrics like accuracy, sensitivity, and specificity. Here’s the code:
# Load caret package for evaluation
library(caret)
# Convert actual values to factor to match prediction format
actual_class <- as.factor(test_data$diabetes)
# Generate confusion matrix
confusionMatrix(pred_class, actual_class)
The output for the above code is:
Confusion Matrix and Statistics
          Reference
Prediction     0     1
         0 18137   643
         1   163  1057

               Accuracy : 0.9597
                 95% CI : (0.9569, 0.9624)
    No Information Rate : 0.915
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7029

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.9911
            Specificity : 0.6218
         Pos Pred Value : 0.9658
         Neg Pred Value : 0.8664
             Prevalence : 0.9150
         Detection Rate : 0.9069
   Detection Prevalence : 0.9390
      Balanced Accuracy : 0.8064

       'Positive' Class : 0
The above output shows that the model reaches an overall accuracy of 95.97%, comfortably above the no-information rate of 91.5%. Sensitivity is 0.9911 for the 'positive' class (0 = non-diabetic), but specificity is only 0.6218, meaning more than a third of diabetic cases are misclassified, a typical consequence of the class imbalance in this dataset.
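As a cross-check, the headline metrics can be recomputed by hand from the four confusion-matrix counts reported above (remembering that caret treats class 0 as the positive class here):

```r
# Counts taken from the logistic-regression confusion matrix above
TP <- 18137; FN <- 163    # actual 0: predicted 0 / predicted 1
FP <- 643;  TN <- 1057    # actual 1: predicted 0 / predicted 1

accuracy    <- (TP + TN) / (TP + FP + FN + TN)   # overall correct rate
sensitivity <- TP / (TP + FN)                     # recall on the non-diabetic class
specificity <- TN / (TN + FP)                     # recall on the diabetic class

round(c(accuracy, sensitivity, specificity), 4)   # 0.9597 0.9911 0.6218
```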
Must Read: Natural Disaster Prediction Analysis Project in R
To explore how blood glucose level and BMI relate to diabetes, we'll use a scatter plot. By coloring points by diabetes status, we can visually observe how individuals with and without diabetes cluster based on these health metrics. Here’s the code:
# Load ggplot2 (comes with tidyverse)
library(ggplot2)
# Glucose vs BMI plot colored by diabetes
ggplot(data, aes(x = blood_glucose_level, y = bmi, color = diabetes)) +
geom_point(alpha = 0.5) +
labs(title = "BMI vs Blood Glucose Level by Diabetes Status",
x = "Blood Glucose Level",
y = "BMI") +
theme_minimal()
The above code gives us a graphical representation of the BMI vs Blood Glucose levels by diabetes status.
The above graph shows that individuals with diabetes tend to cluster at higher blood glucose levels, while non-diabetic individuals spread across the lower glucose range; BMI on its own separates the two groups far less clearly.
To improve accuracy and handle complex relationships in the data, we’ll now use the Random Forest algorithm. This method combines multiple decision trees to reduce overfitting and improve prediction performance. Here’s the code:
# Install randomForest package (only once)
install.packages("randomForest")
# Load the package
library(randomForest)
The output confirms the installation and loading of the Random Forest package.
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:dplyr’:
combine
The following object is masked from ‘package:ggplot2’:
margin |
Read This: How to Build an Uber Data Analysis Project in R
Now that the package is installed and loaded, we can train the Random Forest model. We'll use the randomForest() function and specify the number of trees to grow (ntree). Setting a seed ensures reproducibility of results. Here’s the code to train the model:
# Train the random forest model
set.seed(123) # For reproducibility
rf_model <- randomForest(diabetes ~ ., data = train_data, ntree = 100)
# View the model summary
print(rf_model)
The output of the above code is:
Call:
 randomForest(formula = diabetes ~ ., data = train_data, ntree = 100)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of error rate: 2.82%
Confusion matrix:
      0    1  class.error
0 73166   34 0.0004644809
1  2223 4577 0.3269117647
The above output means that the model's out-of-bag (OOB) error rate is just 2.82%. The class-wise errors show the imbalance again: the non-diabetic class is almost perfectly classified (0.05% error), while about 32.7% of diabetic training cases are misclassified.
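These class errors follow directly from the OOB confusion-matrix counts printed above; recomputing them by hand:

```r
# Per-class and overall OOB error rates from the counts in the model summary
err_0 <- 34 / (73166 + 34)       # non-diabetic class error
err_1 <- 2223 / (2223 + 4577)    # diabetic class error
oob   <- (34 + 2223) / 80000     # overall out-of-bag error rate

round(c(err_0, err_1, oob), 4)   # 0.0005 0.3269 0.0282
```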
Here’s a Fun R Project: Car Data Analysis Project Using R
Once the Random Forest model is trained, the next step is to use it for predicting diabetes on unseen test data. We'll use the predict() function to generate predictions and preview the first few results. Here’s the code:
# Predict on the test data
rf_pred <- predict(rf_model, newdata = test_data)
# View prediction results
head(rf_pred)
The output for this step is:
4: 0  5: 0  7: 0  9: 0  12: 0  17: 0
Levels: '0' '1'
This means that the first six test observations shown (rows 4, 5, 7, 9, 12, and 17) are all predicted as non-diabetic (0). The “Levels: '0' '1'” line confirms the prediction output is a factor with two levels: 0 (no diabetes) and 1 (diabetes).
To measure how well our Random Forest model performed on unseen data, we use a confusion matrix. This will show us how many instances were correctly or incorrectly classified into diabetic (1) and non-diabetic (0) categories. Here’s the code:
# Evaluate predictions using confusion matrix
confusionMatrix(rf_pred, as.factor(test_data$diabetes))
The above code gives us the output:
Confusion Matrix and Statistics
          Reference
Prediction     0     1
         0 18296   554
         1     4  1146

               Accuracy : 0.9721
                 95% CI : (0.9697, 0.9743)
    No Information Rate : 0.915
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7898

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.9998
            Specificity : 0.6741
         Pos Pred Value : 0.9706
         Neg Pred Value : 0.9965
             Prevalence : 0.9150
         Detection Rate : 0.9148
   Detection Prevalence : 0.9425
      Balanced Accuracy : 0.8369

       'Positive' Class : 0
The above output shows that the Random Forest reaches 97.21% accuracy, with near-perfect sensitivity (0.9998) on the non-diabetic class and a specificity of 0.6741, an improvement over logistic regression on both counts, though the minority diabetic class remains the harder one to detect.
You Can Also Build This R Project: Wine Quality Prediction Project in R
After building and testing both models, it’s important to compare their prediction accuracies to determine which performs better on the test data. Here’s the code to compare both models.
# Logistic Regression Accuracy
log_accuracy <- mean(pred_class == test_data$diabetes)
# Random Forest Accuracy
rf_accuracy <- mean(rf_pred == test_data$diabetes)
# Print both
cat("Logistic Regression Accuracy:", round(log_accuracy * 100, 2), "%\n")
cat("Random Forest Accuracy:", round(rf_accuracy * 100, 2), "%\n")
The output for this gives us the comparison of both models’ performance:
Logistic Regression Accuracy: 95.97 %
Random Forest Accuracy: 97.21 %
This shows that the Random Forest model achieved higher accuracy in diabetes prediction using the given dataset.
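For readability, the two reported accuracies can be collected into a small comparison table:

```r
# Assemble the accuracies reported above into a comparison data frame
results <- data.frame(
  Model    = c("Logistic Regression", "Random Forest"),
  Accuracy = c(95.97, 97.21)   # percentages from the two confusion matrices
)
results$Model[which.max(results$Accuracy)]   # -> "Random Forest"
```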
In this Diabetes Prediction project, we used R in Google Colab to build and compare two models: Logistic Regression and Random Forest.
After preprocessing the dataset, we split it into training and testing sets, then trained both models to classify individuals as diabetic or not.
Evaluation was done using accuracy scores and confusion matrices. The Random Forest model outperformed Logistic Regression, achieving a higher accuracy of 97.21% compared to 95.97%.
Reference:
https://www.who.int/news-room/fact-sheets/detail/diabetes
Colab Link:
https://colab.research.google.com/drive/1QSOP_QsfGcYGBXIiMWbWjnZGsbe1_kBj#scrollTo=nQZQPytZDzcM