Instagram Fake Profile Detection Using Machine Learning in R
By Rohit Sharma
Updated on Jul 31, 2025 | 14 min read | 1.25K+ views
This project focuses on detecting fake Instagram profiles using R in Google Colab. By analysing behavioural features like follower ratios, username structure, and account settings, we build a machine learning model to identify suspicious users.
Through data cleaning, visualization, and classification using Random Forest, the model effectively distinguishes fake profiles from real ones with high accuracy.
Level Your R Skills Up With These Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025
This is a beginner-level project that requires less time and effort than an advanced one. Here’s a quick overview of what’s required:
Aspect | Details
Estimated Duration | 3 to 4 hours
Difficulty Level | Easy to Moderate
Skill Level Required | Beginner (basic R knowledge recommended)
Before starting this Instagram fake profile detection project, you should have a basic understanding of R programming, data frames, and core machine learning ideas such as classification and the Random Forest algorithm.
To build a functional and reliable project, we’ll use certain libraries and tools that’ll help us ensure that the project runs smoothly.
Tool/Library | Purpose
Google Colab (R) | Cloud-based environment to write and execute R code
R | Primary language used for analysis and modeling
tidyverse | For data manipulation and visualization (dplyr, ggplot2, etc.)
janitor | To clean and standardize column names
skimr | To generate descriptive statistics quickly
ggplot2, GGally | To plot distributions and feature correlations
randomForest | To build a classification model for detecting fake profiles
caret | To split data and evaluate model performance (accuracy, confusion matrix)
This Could Be Your Next R Project: How to Build an Uber Data Analysis Project in R
This Instagram fake profile detection project involves several steps. In this section, we’ll break down each step along with the code for your understanding.
To work with R in Google Colab, you’ll need to switch the default language from Python to R. This enables R code execution directly within the notebook.
Here's how you can switch to R: in Colab, open the Runtime menu, choose Change runtime type, and select R from the runtime dropdown.
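To confirm the switch worked, you can run a quick check in the first cell:

# Confirm the notebook is running an R kernel
R.version.string  # Should print something like "R version 4.x.x ..."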
In this step, we’ll install and load all the necessary R packages used throughout the project. These libraries will help with data cleaning, visualization, modeling, and evaluation. The code for this section is given below:
# Install essential libraries (run only once)
install.packages("tidyverse") # Data manipulation and visualization (includes dplyr, ggplot2, etc.)
install.packages("skimr") # Quick summaries of data frames
install.packages("janitor") # Tools for cleaning messy column names
install.packages("GGally") # Advanced correlation and pair plots
install.packages("caret") # Tools for data splitting and model evaluation
install.packages("randomForest") # Random Forest algorithm for classification
# Load the libraries into the R environment
library(tidyverse)
library(skimr)
library(janitor)
library(GGally)
library(caret)
library(randomForest)
The output for the above code is given below, which confirms that the libraries are installed and loaded:
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
    chisq.test, fisher.test

Loading required package: lattice
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
    lift

randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:dplyr’:
    combine
The following object is masked from ‘package:ggplot2’:
    margin
Here’s a fun project in R: Car Data Analysis Project Using R
Now that the environment is ready, the next step is to load your dataset into R. We’ll read the CSV file, take a look at the first few rows, and check the structure to understand what kind of data we’re working with. Here’s the code to load and read the dataset:
# Directly load the CSV file (works in Colab if the file is already uploaded)
data <- read.csv("Insta Fake User Behavior Analysis.csv") # Load the dataset
# View the first few rows to get an idea of the data
head(data)
# Check the structure and data types of each column
str(data)
A preview of the first six rows (head(data)):

# | edge_followed_by | edge_follow | username_length | username_has_number | full_name_has_number | full_name_length | is_private | is_joined_recently | has_channel | is_business_account | has_guides | has_external_url | is_fake
1 | 0.001 | 0.257 | 13 | 1 | 1 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 1
2 | 0.000 | 0.958 | 9 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1
3 | 0.000 | 0.253 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
4 | 0.000 | 0.977 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
5 | 0.000 | 0.321 | 11 | 0 | 0 | 11 | 1 | 0 | 0 | 0 | 0 | 0 | 1
6 | 0.000 | 0.917 | 15 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1

And str(data) reports the structure and data types:

'data.frame': 785 obs. of 13 variables:
 $ edge_followed_by    : num 0.001 0 0 0 0 0 0 0 0.001 0 ...
 $ edge_follow         : num 0.257 0.958 0.253 0.977 0.321 0.917 0.076 0.72 0.731 0.999 ...
 $ username_length     : int 13 9 12 10 11 15 9 15 9 7 ...
 $ username_has_number : int 1 1 0 1 0 1 1 1 1 1 ...
 $ full_name_has_number: int 1 0 0 0 0 0 1 0 0 1 ...
 $ full_name_length    : int 13 0 0 0 11 0 9 0 11 9 ...
 $ is_private          : int 0 0 0 0 1 0 0 0 1 0 ...
 $ is_joined_recently  : int 0 1 0 0 0 1 1 0 0 0 ...
 $ has_channel         : int 0 0 0 0 0 0 0 0 0 0 ...
 $ is_business_account : int 0 0 0 0 0 0 0 0 0 0 ...
 $ has_guides          : int 0 0 0 0 0 0 0 0 0 0 ...
 $ has_external_url    : int 0 0 0 0 0 0 0 0 0 0 ...
 $ is_fake             : int 1 1 1 1 1 1 1 1 1 1 ...
The above table gives us a glimpse of what the data looks like.
To make the dataset easier to work with, we'll clean the column names by converting them to lowercase and replacing spaces with underscores. This helps avoid syntax issues during analysis. The code for this step is:
# Use janitor to standardize column names (lowercase, no spaces/symbols)
data <- janitor::clean_names(data)
# Preview the cleaned column names and a few rows
head(data)
The output confirms the column names are clean. Since this dataset’s columns were already lowercase with underscores, clean_names() leaves them unchanged and head(data) shows the same table as before.
Read More: Machine Learning with R: Everything You Need to Know
Let’s start our analysis by examining how many fake and real user profiles exist in the dataset. This helps us understand if the data is balanced or skewed toward one class. The code for this step is:
# Count how many users are fake vs real
table(data$is_fake)
# Visualize the distribution using a bar chart
data %>%
ggplot(aes(x = factor(is_fake), fill = factor(is_fake))) +
geom_bar() +
labs(title = "Distribution of Fake vs Real Users",
x = "Is Fake (1 = Fake, 0 = Real)",
y = "Count") +
scale_fill_manual(values = c("steelblue", "tomato")) + # Custom colors for clarity
theme_minimal() # Clean theme
The table() output (also visualized by the bar chart) shows the class counts:

  0   1
 93 692

The above output shows that the dataset contains 93 real profiles (0) and 692 fake profiles (1), so it is heavily skewed toward the fake class. This imbalance is worth keeping in mind when interpreting accuracy later.
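If you want the imbalance as proportions rather than raw counts, a one-line check on the same data object does the job:

# View the class split as proportions (about 12% real vs 88% fake)
prop.table(table(data$is_fake))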
Understanding how numerical features relate to each other helps identify redundancy, multicollinearity, or strong relationships that may impact the model. The code for this step is:
# Remove target column before correlation analysis
data_features <- select(data, -is_fake)
# Create a correlation matrix plot to visualize relationships between features
GGally::ggcorr(data_features, label = TRUE, label_size = 3, layout.exp = 2)
The output for this step is a correlation heatmap of the numeric features. Pairs with values close to +1 or -1 would indicate redundancy or multicollinearity worth addressing before modeling, while values near 0 suggest the features carry largely independent information.
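If you prefer exact numbers to a heatmap, you can print the underlying correlation matrix directly; this is a minimal sketch using the same data_features object created above:

# Print the numeric correlation matrix behind the plot, rounded for readability
round(cor(data_features), 2)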
Here’s an R Project: Customer Segmentation Project Using R: A Step-by-Step Guide
To understand user behavior patterns better, we can explore if certain features (like private profiles or numeric usernames) show any visual correlation with being fake. Stacked bar plots (proportion-based) help us compare real vs fake user distribution across binary features. The code for this step is:
# Plot: Does private account relate to being fake?
data %>%
ggplot(aes(x = factor(is_private), fill = factor(is_fake))) +
geom_bar(position = "fill") +
labs(title = "Private Account vs Fake Behavior",
x = "Is Private", y = "Proportion", fill = "Is Fake") +
theme_minimal()
# Plot: Does having number in username relate to being fake?
data %>%
ggplot(aes(x = factor(username_has_number), fill = factor(is_fake))) +
geom_bar(position = "fill") +
labs(title = "Username Has Number vs Fake Behavior",
x = "Username Has Number", y = "Proportion", fill = "Is Fake") +
theme_minimal()
We get two proportion-based bar charts as output: one comparing the share of fake profiles between private and public accounts, and one comparing it between usernames with and without digits. The same proportions can also be computed numerically, as in the sketch below.
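This is a small optional check; it assumes is_fake is still the numeric 0/1 column at this point, so mean() gives the proportion of fakes in each group:

# Fake-profile rate for private vs public accounts
data %>%
  group_by(is_private) %>%
  summarise(fake_rate = mean(is_fake))

# Fake-profile rate for usernames with vs without digits
data %>%
  group_by(username_has_number) %>%
  summarise(fake_rate = mean(is_fake))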
Before building our machine learning model, we split the dataset into training and testing subsets. This helps evaluate how well the model generalizes to new data. We'll use 80% of the data for training and the remaining 20% for testing. Here’s the code:
# Set seed for reproducibility (so results remain the same)
set.seed(123)
# Split data into training (80%) and testing (20%)
split_index <- createDataPartition(data$is_fake, p = 0.8, list = FALSE)
train_data <- data[split_index, ]
test_data <- data[-split_index, ]
# Check dimensions
cat("Training rows:", nrow(train_data), "\n")
cat("Testing rows:", nrow(test_data), "\n")
This split lets us train the model on one portion of the data and evaluate it on rows it has never seen.
Training rows: 628
Testing rows: 157
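Because createDataPartition samples in a way that preserves the outcome distribution, both splits should keep roughly the same fake/real ratio as the full dataset; a quick optional check:

# Verify that the class balance is similar in both splits
prop.table(table(train_data$is_fake))
prop.table(table(test_data$is_fake))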
Must Try: Forest Fire Project Using R - A Step-by-Step Guide
Now that we have split the data, we’ll train a Random Forest model to predict whether a user is fake or not. This algorithm works well with classification problems and handles both linear and non-linear patterns effectively. Here’s the code:
# Build a random forest model to predict 'is_fake'
model_rf <- randomForest(is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
# View the model summary
print(model_rf)
The output for this step is:
Call:
 randomForest(formula = is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 3

        OOB estimate of error rate: 5.89%
Confusion matrix:
   0   1 class.error
0 51  25  0.32894737
1 12 540  0.02173913
The above output means that the model’s out-of-bag (OOB) error rate is 5.89%, i.e. roughly 94% of training profiles are classified correctly without touching the test set. The class errors are uneven: real accounts (class 0) are misclassified about 33% of the time, while fake accounts (class 1) are misclassified only about 2% of the time, a direct consequence of the class imbalance.
The model above was trained with is_fake still stored as a number. For a classification task, the target variable should be a categorical factor rather than a numeric value; converting it ensures randomForest treats the problem as classification and reports class-based metrics. The code for this step is given below:
# Convert target variable to factor (for classification)
train_data$is_fake <- as.factor(train_data$is_fake)
test_data$is_fake <- as.factor(test_data$is_fake)
# Re-train the model
model_rf <- randomForest(is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
# View the updated model summary
print(model_rf)
The above code gives the output as:
Call:
 randomForest(formula = is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 3

        OOB estimate of error rate: 5.73%
Confusion matrix:
   0   1 class.error
0 51  25  0.32894737
1 11 541  0.01992754
The above output means that, after the factor conversion, the OOB error rate improves slightly to 5.73%: the error for real accounts (class 0) is unchanged at about 33%, while fake accounts (class 1) are now misclassified slightly less often (about 2.0%).
Must read: R For Data Science: Why Should You Choose R for Data Science?
In this step, we use the trained Random Forest model to predict whether profiles in the test data are fake or real. Then we evaluate how accurate the predictions are using a confusion matrix. The code for this step is given below:
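A minimal sketch of this step, assuming the model_rf and test_data objects from the previous steps (the cm object name is just for illustration):

# Predict the class of each test profile with the trained model
predictions <- predict(model_rf, newdata = test_data)

# Compare predictions against the true labels with caret's confusion matrix
cm <- confusionMatrix(predictions, test_data$is_fake)
print(cm)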
The output for this step is:
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  11   2
         1   6 138

               Accuracy : 0.949
                 95% CI : (0.9021, 0.9777)
    No Information Rate : 0.8917
    P-Value [Acc > NIR] : 0.009303

                  Kappa : 0.7057
 Mcnemar's Test P-Value : 0.288844

            Sensitivity : 0.64706
            Specificity : 0.98571
         Pos Pred Value : 0.84615
         Neg Pred Value : 0.95833
             Prevalence : 0.10828
         Detection Rate : 0.07006
   Detection Prevalence : 0.08280
      Balanced Accuracy : 0.81639

       'Positive' Class : 0
The above output means that the model reached 94.9% accuracy on the held-out test set, significantly better than the 89.17% no-information rate (p = 0.0093). With class 0 (real) treated as the positive class, sensitivity is 64.7% (11 of 17 real profiles identified) and specificity is 98.6% (138 of 140 fake profiles flagged), giving a balanced accuracy of 81.6%.
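If you need individual metrics programmatically rather than reading them off the printout, the confusionMatrix result exposes them as named vectors; this assumes the cm object from the sketch above:

# Extract headline metrics from the caret confusionMatrix object
cm$overall["Accuracy"]      # Overall accuracy (~0.949)
cm$byClass["Sensitivity"]   # Real accounts correctly identified (~0.647)
cm$byClass["Specificity"]   # Fake accounts correctly identified (~0.986)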
A Fun R Project For You: Wine Quality Prediction Project in R
In this step, we visualize which features had the biggest impact on predicting fake users using the Random Forest’s feature importance. This tells us which variables the model relied on most when making decisions. The code for this step is:
# Plot feature importance
importance_df <- as.data.frame(importance(model_rf))
importance_df$Feature <- rownames(importance_df)
# Arrange in descending order
importance_df <- importance_df %>%
arrange(desc(MeanDecreaseGini))
# Plot
ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_col(fill = "darkred") +
coord_flip() +
labs(title = "Feature Importance (Fake User Detection)",
x = "Feature", y = "Importance Score") +
theme_minimal()
The output for the above step is a horizontal bar chart ranking the features by their mean decrease in Gini impurity. The graph shows that the follower and following ratio features (edge_followed_by, edge_follow), along with full name patterns, carry the highest importance scores, meaning the model relied on them most when separating fake from real profiles.
In this Insta Fake User Behavior Analysis project, we built a Random Forest classification model using R in Google Colab to identify fake Instagram users based on profile attributes such as follower count, username structure, and account activity.
After cleaning and preparing the dataset, we explored key features and trained the model on 80% of the data. The model was then tested on the remaining 20%, achieving an accuracy of 94.9%, with very high specificity for fake accounts (98.6%) but more modest sensitivity for real ones (64.7%). Feature importance analysis revealed that follower/following ratios and full name patterns are key indicators of fake accounts.
Colab Link:
https://colab.research.google.com/drive/1m0Hkf04h6bCSH1ycEFmCTqnEt46JzNQf