Diabetes Prediction Analysis Project Using R
By Rohit Sharma
Updated on Aug 11, 2025 | 16 min read | 1.4K+ views
Diabetes is one of the most prevalent diseases today, with cases rising from 200 million in 1990 to 830 million in 2022, according to the WHO. This Diabetes Prediction Analysis Project in R focuses on building machine learning models to predict the likelihood of diabetes from health-related factors such as age, BMI, and blood glucose level.
Using R in Google Colab, this beginner-friendly project walks through data loading, cleaning, exploration, and modeling using Logistic Regression and Random Forest.
It helps users understand classification techniques, evaluate model performance, and visualize key trends in the data. This project is ideal for those looking to apply predictive analytics in healthcare using R.
Also Read: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025
Before starting the Diabetes Prediction Analysis Project in R, it's necessary to understand what you’ll be working with. This project is designed for beginners and uses simple tools and libraries in R. The table below shows the required tools, libraries, and skills you’ll need for this project.
Category | Details
Programming Tool | R (executed in Google Colab) |
Main Libraries | tidyverse, caret, caTools, e1071, randomForest, ggplot2 |
Project Type | Binary classification using supervised learning |
Machine Learning | Logistic Regression, Random Forest |
Data Handling | Data cleaning, exploration, and preprocessing |
Evaluation Skills | Accuracy, confusion matrix, performance comparison |
Data Visualization | Scatter plots, feature relationships using ggplot2 |
Skill Level | Beginner (no advanced ML or R skills required) |
Estimated Time | 1.5 to 2 hours |
In this section, we’ll see how to build this diabetes prediction model using R. The code for each step is given along with the output and explanation.
To begin working with R in Google Colab, it's important to switch the notebook’s default language from Python to R. This enables the execution of R code directly within the environment.
To do this, open a new Colab notebook and click on the Runtime menu at the top. Choose the Change runtime type option. In the dialog box that shows up, change the language setting to R using the dropdown menu and then click Save.
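Once the runtime is switched, a quick one-line sanity check confirms the notebook is actually executing R:

```r
# Confirm that the Colab runtime is now R, not Python
R.version.string   # prints something like "R version 4.x.x (...)"
```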
In this step, we prepare the environment by installing and loading the libraries needed for data handling, model training, data splitting, and evaluation. The packages only need to be installed once, but they must be loaded every time the notebook is run.
# Install essential libraries (run only once)
install.packages("tidyverse") # For data wrangling and visualization
install.packages("caret") # For machine learning algorithms and evaluation
install.packages("caTools") # For splitting the dataset into train/test
install.packages("e1071") # For SVM models and performance metrics
# Load the installed libraries
library(tidyverse) # Load tidyverse for data handling and plotting
library(caret) # Load caret for model training and evaluation
library(caTools) # Load caTools for splitting data
library(e1071) # Load e1071 for classification tools like confusion matrix
The above code installs and loads the essential libraries required for the Diabetes Prediction Analysis Project. The output confirms that the libraries are installed and loaded correctly.
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
also installing the dependency ‘bitops’
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: lattice
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
    lift
Check this R project: Loan Approval Classification Using Logistic Regression in R
In this step, we will load the diabetes dataset directly from the path where it was uploaded in your Colab environment. After loading the data into a variable, the first few rows are displayed to get a look at the structure and values in the dataset. The code for this step is:
# Load the dataset directly from the uploaded path
data <- read.csv("diabetes prediction dataset.csv")
# View the first few rows of the dataset
head(data) # Displays the top rows to understand the structure
The above code loads the dataset and also gives us a glimpse of the dataset we’re working with:
gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes
&lt;chr&gt; | &lt;dbl&gt; | &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;int&gt; | &lt;int&gt;
1 | Female | 80 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0
2 | Female | 54 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0
3 | Male | 28 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0
4 | Female | 36 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0
5 | Male | 76 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0
6 | Female | 20 | 0 | 0 | never | 27.32 | 6.6 | 85 | 0
Here’s an R project for you: Player Performance Analysis & Prediction Using R
This step helps us understand the dataset’s structure, types of variables, and whether there are any missing values. A statistical summary of each column is also generated to give an overview of the data distribution. Here’s the code to explore the dataset:
# Check structure of the dataset
str(data) # Shows data types and column structure
# Check if there are any missing values
colSums(is.na(data)) # Summarizes missing values per column
# Summary statistics
summary(data) # Gives min, max, mean, and quartiles for each column
The above code gives us a summary of the dataset for us to understand it better.
'data.frame':	100000 obs. of 9 variables:
 $ gender             : chr "Female" "Female" "Male" "Female" ...
 $ age                : num 80 54 28 36 76 20 44 79 42 32 ...
 $ hypertension       : int 0 0 0 0 1 0 0 0 0 0 ...
 $ heart_disease      : int 1 0 0 0 1 0 0 0 0 0 ...
 $ smoking_history    : chr "never" "No Info" "never" "current" ...
 $ bmi                : num 25.2 27.3 27.3 23.4 20.1 ...
 $ HbA1c_level        : num 6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
 $ blood_glucose_level: int 140 80 158 155 155 85 200 85 145 100 ...
 $ diabetes           : int 0 0 0 0 0 0 1 0 0 0 ...

             gender                 age        hypertension       heart_disease
                  0                   0                   0                   0
    smoking_history                 bmi         HbA1c_level blood_glucose_level
                  0                   0                   0                   0
           diabetes
                  0

 gender              age          hypertension      heart_disease
 Length:100000      Min.   : 0.08   Min.   :0.00000   Min.   :0.00000
 Class :character   1st Qu.:24.00   1st Qu.:0.00000   1st Qu.:0.00000
 Mode  :character   Median :43.00   Median :0.00000   Median :0.00000
                    Mean   :41.89   Mean   :0.07485   Mean   :0.03942
                    3rd Qu.:60.00   3rd Qu.:0.00000   3rd Qu.:0.00000
                    Max.   :80.00   Max.   :1.00000   Max.   :1.00000
 smoking_history         bmi         HbA1c_level    blood_glucose_level
 Length:100000      Min.   :10.01   Min.   :3.500   Min.   : 80.0
 Class :character   1st Qu.:23.63   1st Qu.:4.800   1st Qu.:100.0
 Mode  :character   Median :27.32   Median :5.800   Median :140.0
                    Mean   :27.32   Mean   :5.528   Mean   :138.1
                    3rd Qu.:29.58   3rd Qu.:6.200   3rd Qu.:159.0
                    Max.   :95.69   Max.   :9.000   Max.   :300.0
    diabetes
 Min.   :0.000
 1st Qu.:0.000
 Median :0.000
 Mean   :0.085
 3rd Qu.:0.000
 Max.   :1.000
In this section, we begin preparing the dataset for modeling. We reconfirm the structure and check for missing values. We also convert the target column diabetes into a factor to make it suitable for classification algorithms and examine the class distribution.
# View dataset structure
str(data) # Displays column names, data types, and sample data
# Check for missing values
colSums(is.na(data)) # Shows count of NA values per column
# Summary statistics
summary(data) # Provides descriptive statistics for numeric columns
# Convert the target column 'diabetes' to factor
data$diabetes <- as.factor(data$diabetes) # Ensures classification algorithms treat it as categorical
# Check class balance
table(data$diabetes) # Displays the count of each class (0 = No, 1 = Yes)
The above code checks for missing values and converts the diabetes target column for further classification steps.
'data.frame':	100000 obs. of 9 variables:
 $ gender             : chr "Female" "Female" "Male" "Female" ...
 $ age                : num 80 54 28 36 76 20 44 79 42 32 ...
 $ hypertension       : int 0 0 0 0 1 0 0 0 0 0 ...
 $ heart_disease      : int 1 0 0 0 1 0 0 0 0 0 ...
 $ smoking_history    : chr "never" "No Info" "never" "current" ...
 $ bmi                : num 25.2 27.3 27.3 23.4 20.1 ...
 $ HbA1c_level        : num 6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
 $ blood_glucose_level: int 140 80 158 155 155 85 200 85 145 100 ...
 $ diabetes           : int 0 0 0 0 0 0 1 0 0 0 ...

             gender                 age        hypertension       heart_disease
                  0                   0                   0                   0
    smoking_history                 bmi         HbA1c_level blood_glucose_level
                  0                   0                   0                   0
           diabetes
                  0

 gender              age          hypertension      heart_disease
 Length:100000      Min.   : 0.08   Min.   :0.00000   Min.   :0.00000
 Class :character   1st Qu.:24.00   1st Qu.:0.00000   1st Qu.:0.00000
 Mode  :character   Median :43.00   Median :0.00000   Median :0.00000
                    Mean   :41.89   Mean   :0.07485   Mean   :0.03942
                    3rd Qu.:60.00   3rd Qu.:0.00000   3rd Qu.:0.00000
                    Max.   :80.00   Max.   :1.00000   Max.   :1.00000
 smoking_history         bmi         HbA1c_level    blood_glucose_level
 Length:100000      Min.   :10.01   Min.   :3.500   Min.   : 80.0
 Class :character   1st Qu.:23.63   1st Qu.:4.800   1st Qu.:100.0
 Mode  :character   Median :27.32   Median :5.800   Median :140.0
                    Mean   :27.32   Mean   :5.528   Mean   :138.1
                    3rd Qu.:29.58   3rd Qu.:6.200   3rd Qu.:159.0
                    Max.   :95.69   Max.   :9.000   Max.   :300.0
    diabetes
 Min.   :0.000
 1st Qu.:0.000
 Median :0.000
 Mean   :0.085
 3rd Qu.:0.000
 Max.   :1.000

    0     1
91500  8500
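The class counts above can also be expressed as proportions, which makes the imbalance easier to read. Here is a minimal sketch using a small synthetic stand-in for data$diabetes (the full CSV is not bundled with this article); in the notebook itself you would call prop.table(table(data$diabetes)):

```r
# Synthetic stand-in for data$diabetes, mirroring the ~91.5% / 8.5% split
diabetes <- factor(c(rep(0, 915), rep(1, 85)))

counts <- table(diabetes)   # raw class counts
prop.table(counts)          # convert counts to proportions: 0 -> 0.915, 1 -> 0.085
```

A split this skewed is why accuracy alone can be misleading, and why the confusion matrices later in the project matter.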
Read More: Spotify Music Data Analysis Project in R
Before building our prediction model, we divide the dataset into two parts: training data (used to build the model) and testing data (used to evaluate it). This will ensure we can validate how well the model performs on unseen data. The code for this step is:
# Load library for splitting
library(caTools) # Used for creating a random split of the dataset
# Set seed so results are reproducible
set.seed(123) # Ensures you get the same split each time you run it
# Split the data: 80% training, 20% testing
split <- sample.split(data$diabetes, SplitRatio = 0.8)
train_data <- subset(data, split == TRUE) # Training dataset
test_data <- subset(data, split == FALSE) # Testing dataset
# Check sizes of train and test sets
nrow(train_data) # Number of rows in training set
nrow(test_data) # Number of rows in testing set
The output shows the number of rows the model is trained on and the number of rows it will be tested on:
80000
20000
We will now build our first prediction model using logistic regression, which is ideal for binary classification tasks like predicting diabetes (Yes/No). We'll train it on the training dataset using all available features.
# Train the logistic regression model
model <- glm(diabetes ~ ., data = train_data, family = "binomial") # Fit the model on training data
# See the summary of the model
summary(model) # Displays coefficients, significance levels, and performance stats
The output for the above step is:
Call: glm(formula = diabetes ~ ., family = "binomial", data = train_data)
Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                -26.954480   0.325317 -82.856  < 2e-16 ***
genderMale                   0.304376   0.040292   7.554 4.21e-14 ***
genderOther                 -9.889076 131.261959  -0.075    0.940
age                          0.045976   0.001256  36.600  < 2e-16 ***
hypertension                 0.732500   0.052698  13.900  < 2e-16 ***
heart_disease                0.741719   0.067819  10.937  < 2e-16 ***
smoking_historyever         -0.062610   0.102914  -0.608    0.543
smoking_historyformer       -0.114614   0.078173  -1.466    0.143
smoking_historynever        -0.166353   0.067704  -2.457    0.014 *
smoking_historyNo Info      -0.735032   0.074307  -9.892  < 2e-16 ***
smoking_historynot current  -0.150201   0.092239  -1.628    0.103
bmi                          0.086639   0.002831  30.609  < 2e-16 ***
HbA1c_level                  2.335872   0.039810  58.675  < 2e-16 ***
blood_glucose_level          0.033147   0.000538  61.615  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46530 on 79999 degrees of freedom
Residual deviance: 18166 on 79986 degrees of freedom
AIC: 18194

Number of Fisher Scoring iterations: 12
The above output implies that HbA1c_level, blood_glucose_level, age, bmi, hypertension, and heart_disease are all highly significant predictors (p < 2e-16), with HbA1c level carrying the largest positive coefficient. Being male is also associated with higher odds of diabetes, while most smoking-history categories are not statistically significant. The large drop from the null deviance (46530) to the residual deviance (18166) indicates the model explains a substantial share of the variation in the target.
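Because logistic regression works on the log-odds scale, exponentiating a coefficient turns it into an odds ratio, which is usually easier to interpret. The sketch below uses a small synthetic dataset purely for illustration; in the project itself you would simply run exp(coef(model)):

```r
# Illustration: coefficients -> odds ratios (synthetic data, not the project dataset)
set.seed(42)
toy <- data.frame(glucose = rnorm(200, mean = 140, sd = 30))
# Simulate an outcome where higher glucose raises the probability of diabetes
toy$outcome <- rbinom(200, 1, plogis(-5 + 0.03 * toy$glucose))

fit <- glm(outcome ~ glucose, data = toy, family = "binomial")
exp(coef(fit))["glucose"]   # odds ratio per one-unit rise in glucose (greater than 1 here)
```

An odds ratio above 1 means each unit increase in the predictor multiplies the odds of diabetes by that factor, holding everything else constant.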
Some R projects you can try: Forest Fire Project Using R - A Step-by-Step Guide | Movie Rating Analysis Project in R
In this step, we will use the trained logistic regression model to predict the probability of diabetes for each test case. Then, we will convert those probabilities into binary classes (0 or 1) using a threshold of 0.5.
# Predict probabilities on the test set
pred_prob <- predict(model, newdata = test_data, type = "response") # Get predicted probabilities
# Convert probabilities to binary predictions (threshold = 0.5)
pred_class <- ifelse(pred_prob > 0.5, 1, 0) # Assign class labels based on threshold
pred_class <- as.factor(pred_class) # Convert to factor for evaluation
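The thresholding step is just an element-wise comparison; a tiny illustration with hypothetical probabilities:

```r
# How the 0.5 cutoff turns predicted probabilities into class labels
probs <- c(0.12, 0.48, 0.51, 0.93)   # hypothetical model outputs
ifelse(probs > 0.5, 1, 0)            # -> 0 0 1 1
```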
Now we will evaluate how well our model performed on the test set. We’ll compare the predicted values against the actual ones using a confusion matrix to get metrics like accuracy, sensitivity, and specificity. Here’s the code:
# Load caret package for evaluation
library(caret)
# Convert actual values to factor to match prediction format
actual_class <- as.factor(test_data$diabetes)
# Generate confusion matrix
confusionMatrix(pred_class, actual_class)
The output for the above code is:
Confusion Matrix and Statistics
          Reference
Prediction     0     1
         0 18137   643
         1   163  1057

               Accuracy : 0.9597
                 95% CI : (0.9569, 0.9624)
    No Information Rate : 0.915
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7029

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.9911
            Specificity : 0.6218
         Pos Pred Value : 0.9658
         Neg Pred Value : 0.8664
             Prevalence : 0.9150
         Detection Rate : 0.9069
   Detection Prevalence : 0.9390
      Balanced Accuracy : 0.8064

       'Positive' Class : 0
The above output shows that the model reaches an overall accuracy of 95.97%, comfortably above the no-information rate of 91.5%. Sensitivity is 0.9911 for the 'positive' class (0 = non-diabetic), but specificity is only 0.6218, meaning more than a third of diabetic cases are misclassified, a typical consequence of the class imbalance in this dataset.
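As a cross-check, the headline metrics can be recomputed by hand from the four confusion-matrix counts reported above (remembering that caret treats class 0 as the positive class here):

```r
# Counts taken from the logistic-regression confusion matrix above
TP <- 18137; FN <- 163    # actual 0: predicted 0 / predicted 1
FP <- 643;  TN <- 1057    # actual 1: predicted 0 / predicted 1

accuracy    <- (TP + TN) / (TP + FP + FN + TN)   # overall correct rate
sensitivity <- TP / (TP + FN)                     # recall on the non-diabetic class
specificity <- TN / (TN + FP)                     # recall on the diabetic class

round(c(accuracy, sensitivity, specificity), 4)   # 0.9597 0.9911 0.6218
```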
Must Read: Natural Disaster Prediction Analysis Project in R
To explore how blood glucose level and BMI relate to diabetes, we'll use a scatter plot. By coloring points by diabetes status, we can visually observe how individuals with and without diabetes cluster based on these health metrics. Here’s the code:
# Load ggplot2 (comes with tidyverse)
library(ggplot2)
# Glucose vs BMI plot colored by diabetes
ggplot(data, aes(x = blood_glucose_level, y = bmi, color = diabetes)) +
geom_point(alpha = 0.5) +
labs(title = "BMI vs Blood Glucose Level by Diabetes Status",
x = "Blood Glucose Level",
y = "BMI") +
theme_minimal()
The above code gives us a graphical representation of the BMI vs Blood Glucose levels by diabetes status.
The above graph shows that individuals with diabetes tend to cluster at higher blood glucose levels, while non-diabetic individuals spread across the lower glucose range; BMI on its own separates the two groups far less clearly.
To improve accuracy and handle complex relationships in the data, we’ll now use the Random Forest algorithm. This method combines multiple decision trees to reduce overfitting and improve prediction performance. Here’s the code:
# Install randomForest package (only once)
install.packages("randomForest")
# Load the package
library(randomForest)
The output confirms the installation and loading of the Random Forest package.
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:dplyr’:
combine
The following object is masked from ‘package:ggplot2’:
margin |
Read This: How to Build an Uber Data Analysis Project in R
Now that the package is installed and loaded, we can train the Random Forest model. We'll use the randomForest() function and specify the number of trees to grow (ntree). Setting a seed ensures reproducibility of results. Here’s the code to train the model:
# Train the random forest model
set.seed(123) # For reproducibility
rf_model <- randomForest(diabetes ~ ., data = train_data, ntree = 100)
# View the model summary
print(rf_model)
The output of the above code is:
Call:
 randomForest(formula = diabetes ~ ., data = train_data, ntree = 100)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of error rate: 2.82%
Confusion matrix:
      0    1  class.error
0 73166   34 0.0004644809
1  2223 4577 0.3269117647
The above output means that the model's out-of-bag (OOB) error rate is just 2.82%. The class-wise errors show the imbalance again: the non-diabetic class is almost perfectly classified (0.05% error), while about 32.7% of diabetic training cases are misclassified.
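These class errors follow directly from the OOB confusion-matrix counts printed above; recomputing them by hand:

```r
# Per-class and overall OOB error rates from the counts in the model summary
err_0 <- 34 / (73166 + 34)       # non-diabetic class error
err_1 <- 2223 / (2223 + 4577)    # diabetic class error
oob   <- (34 + 2223) / 80000     # overall out-of-bag error rate

round(c(err_0, err_1, oob), 4)   # 0.0005 0.3269 0.0282
```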
Here’s a Fun R Project: Car Data Analysis Project Using R
Once the Random Forest model is trained, the next step is to use it for predicting diabetes on unseen test data. We'll use the predict() function to generate predictions and preview the first few results. Here’s the code:
# Predict on the test data
rf_pred <- predict(rf_model, newdata = test_data)
# View prediction results
head(rf_pred)
The output for this step is:
4: 0  5: 0  7: 0  9: 0  12: 0  17: 0
Levels: '0' '1'
This means that the first six test observations shown (rows 4, 5, 7, 9, 12, and 17) are all predicted as non-diabetic (0). The “Levels: '0' '1'” line confirms the prediction output is a factor with two levels: 0 (no diabetes) and 1 (diabetes).
To measure how well our Random Forest model performed on unseen data, we use a confusion matrix. This will show us how many instances were correctly or incorrectly classified into diabetic (1) and non-diabetic (0) categories. Here’s the code:
# Evaluate predictions using confusion matrix
confusionMatrix(rf_pred, as.factor(test_data$diabetes))
The above code gives us the output:
Confusion Matrix and Statistics
          Reference
Prediction     0     1
         0 18296   554
         1     4  1146

               Accuracy : 0.9721
                 95% CI : (0.9697, 0.9743)
    No Information Rate : 0.915
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7898

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.9998
            Specificity : 0.6741
         Pos Pred Value : 0.9706
         Neg Pred Value : 0.9965
             Prevalence : 0.9150
         Detection Rate : 0.9148
   Detection Prevalence : 0.9425
      Balanced Accuracy : 0.8369

       'Positive' Class : 0
The above output shows that the Random Forest reaches 97.21% accuracy, with near-perfect sensitivity (0.9998) on the non-diabetic class and a specificity of 0.6741, an improvement over logistic regression on both counts, though the minority diabetic class remains the harder one to detect.
You Can Also Build This R Project: Wine Quality Prediction Project in R
After building and testing both models, it’s important to compare their prediction accuracies to determine which performs better on the test data. Here’s the code to compare both models.
# Logistic Regression Accuracy
log_accuracy <- mean(pred_class == test_data$diabetes)
# Random Forest Accuracy
rf_accuracy <- mean(rf_pred == test_data$diabetes)
# Print both
cat("Logistic Regression Accuracy:", round(log_accuracy * 100, 2), "%\n")
cat("Random Forest Accuracy:", round(rf_accuracy * 100, 2), "%\n")
The output for this gives us the comparison of both models’ performance:
Logistic Regression Accuracy: 95.97 %
Random Forest Accuracy: 97.21 %
This shows that the Random Forest model achieved higher accuracy in diabetes prediction using the given dataset.
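For readability, the two reported accuracies can be collected into a small comparison table:

```r
# Assemble the accuracies reported above into a comparison data frame
results <- data.frame(
  Model    = c("Logistic Regression", "Random Forest"),
  Accuracy = c(95.97, 97.21)   # percentages from the two confusion matrices
)
results$Model[which.max(results$Accuracy)]   # -> "Random Forest"
```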
In this Diabetes Prediction project, we used R in Google Colab to build and compare two models: Logistic Regression and Random Forest.
After preprocessing the dataset, we split it into training and testing sets, then trained both models to classify individuals as diabetic or not.
Evaluation was done using accuracy scores and confusion matrices. The Random Forest model outperformed Logistic Regression, achieving a higher accuracy of 97.21% compared to 95.97%.
Reference:
https://www.who.int/news-room/fact-sheets/detail/diabetes
Colab Link:
https://colab.research.google.com/drive/1QSOP_QsfGcYGBXIiMWbWjnZGsbe1_kBj#scrollTo=nQZQPytZDzcM