Instagram Fake Profile Detection Using Machine Learning in R

By Rohit Sharma

Updated on Jul 31, 2025 | 14 min read | 1.25K+ views


This project focuses on detecting fake Instagram profiles using R in Google Colab. By analyzing behavioral features such as follower ratios, username structure, and account settings, we build a machine learning model to identify suspicious users.

Through data cleaning, visualization, and classification using Random Forest, the model effectively distinguishes fake profiles from real ones with high accuracy.


Level Your R Skills Up With These Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

How Much Time, Effort, and Skill Do You Need?

This is a beginner-level project, so it requires less time and effort than advanced projects. Here’s a quick overview of what’s required:

Aspect                  Details
Estimated Duration      3 to 4 hours
Difficulty Level        Easy to Moderate
Skill Level Required    Beginner (basic R knowledge recommended)


Key Concepts to Understand Before Starting the Instagram Fake Profile Detection Project

Before starting this Instagram fake profile detection project, you should have a basic understanding of the concepts listed below:

  • Basic Understanding of R: Familiarity with R syntax, functions, and using libraries like tidyverse, caret, and randomForest.
  • Machine Learning Basics: Know how classification models work, especially binary classification problems.
  • Data Preprocessing: Understanding how to clean and prepare data for analysis, including handling column names and data types.
  • Feature Engineering: Recognize how behavioral features such as follower ratios or account privacy settings can be used as inputs.
  • Model Evaluation: Be familiar with accuracy, sensitivity, specificity, and confusion matrices to assess model performance.
  • Random Forest Algorithm: A basic idea of how decision trees and ensemble models like Random Forest function.
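To make the feature engineering idea concrete, here is a tiny hedged illustration; the column names and numbers below are made up for demonstration only:

```r
# Toy data frame with hypothetical follower/following counts
profiles <- data.frame(
  followers = c(10, 5000),
  following = c(2000, 300)
)
# Fake accounts often follow many users while having few followers,
# so a low followers-to-following ratio can serve as a model input.
# pmax(..., 1) guards against division by zero.
profiles$follower_ratio <- profiles$followers / pmax(profiles$following, 1)
round(profiles$follower_ratio, 3)
```

A ratio far below 1 (like the first profile's 0.005) is the kind of behavioral signal a classifier can learn from.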

Key Tools and R Packages Powering This Project

To build a functional and reliable project, we’ll use the following libraries and tools:

Tool/Library        Purpose
Google Colab (R)    Cloud-based environment to write and execute R code
R                   Primary language used for analysis and modeling
tidyverse           Data manipulation and visualization (dplyr, ggplot2, etc.)
janitor             Clean and standardize column names
skimr               Generate descriptive statistics quickly
ggplot2, GGally     Plot distributions and feature correlations
randomForest        Build a classification model for detecting fake profiles
caret               Split data and evaluate model performance (accuracy, confusion matrix)

This Could Be Your Next R Project: How to Build an Uber Data Analysis Project in R

Breakdown Of This Instagram Fake Profile Detection Project 

This Instagram fake profile detection project involves several steps. In this section, we’ll break them down along with the code.

Step 1: Configure Google Colab for R Programming

To work with R in Google Colab, you’ll need to switch the default language from Python to R. This enables R code execution directly within the notebook.

Here's how you can switch to R:

  • Launch a new notebook on Google Colab
  • Go to the Runtime menu
  • Click on Change runtime type
  • In the Language dropdown, select R
  • Click Save to apply the changes

Step 2: Install and Load the Required R Libraries

In this step, we’ll install and load all the necessary R packages used throughout the project. These libraries will help with data cleaning, visualization, modeling, and evaluation. The code for this section is given below:

# Install essential libraries (run only once)
install.packages("tidyverse")    # Data manipulation and visualization (includes dplyr, ggplot2, etc.)
install.packages("skimr")        # Quick summaries of data frames
install.packages("janitor")      # Tools for cleaning messy column names
install.packages("GGally")       # Advanced correlation and pair plots
install.packages("caret")        # Tools for data splitting and model evaluation
install.packages("randomForest") # Random Forest algorithm for classification
# Load the libraries into the R environment
library(tidyverse)
library(skimr)
library(janitor)
library(GGally)
library(caret)
library(randomForest)

The output for the above code is given below, which confirms that the libraries are installed and loaded:

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
    chisq.test, fisher.test

Loading required package: lattice
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
    lift

randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:dplyr’:
    combine
The following object is masked from ‘package:ggplot2’:
    margin

Here’s a fun Project in R: Car Data Analysis Project Using R

Step 3: Load and Preview the Dataset

Now that the environment is ready, the next step is to load your dataset into R. We’ll read the CSV file, take a look at the first few rows, and check the structure to understand what kind of data we’re working with. Here’s the code to load and read the dataset:

# Directly load the CSV file (works in Colab if the file is already uploaded)
data <- read.csv("Insta Fake User Behavior Analysis.csv")  # Load the dataset
# View the first few rows to get an idea of the data
head(data)
# Check the structure and data types of each column
str(data)
 

  edge_followed_by edge_follow username_length username_has_number full_name_has_number
1            0.001       0.257              13                   1                    1
2            0.000       0.958               9                   1                    0
3            0.000       0.253              12                   0                    0
4            0.000       0.977              10                   1                    0
5            0.000       0.321              11                   0                    0
6            0.000       0.917              15                   1                    0
  full_name_length is_private is_joined_recently has_channel is_business_account
1               13          0                  0           0                   0
2                0          0                  1           0                   0
3                0          0                  0           0                   0
4                0          0                  0           0                   0
5               11          1                  0           0                   0
6                0          0                  1           0                   0
  has_guides has_external_url is_fake
1          0                0       1
2          0                0       1
3          0                0       1
4          0                0       1
5          0                0       1
6          0                0       1

'data.frame': 785 obs. of  13 variables:
 $ edge_followed_by    : num  0.001 0 0 0 0 0 0 0 0.001 0 ...
 $ edge_follow         : num  0.257 0.958 0.253 0.977 0.321 0.917 0.076 0.72 0.731 0.999 ...
 $ username_length     : int  13 9 12 10 11 15 9 15 9 7 ...
 $ username_has_number : int  1 1 0 1 0 1 1 1 1 1 ...
 $ full_name_has_number: int  1 0 0 0 0 0 1 0 0 1 ...
 $ full_name_length    : int  13 0 0 0 11 0 9 0 11 9 ...
 $ is_private          : int  0 0 0 0 1 0 0 0 1 0 ...
 $ is_joined_recently  : int  0 1 0 0 0 1 1 0 0 0 ...
 $ has_channel         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ is_business_account : int  0 0 0 0 0 0 0 0 0 0 ...
 $ has_guides          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ has_external_url    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ is_fake             : int  1 1 1 1 1 1 1 1 1 1 ...

The above table gives us a glimpse of what the data looks like.

Step 4: Clean the Column Names

To make the dataset easier to work with, we'll clean the column names by converting them to lowercase and replacing spaces with underscores. This helps avoid syntax issues during analysis. The code for this step is:

# Use janitor to standardize column names (lowercase, no spaces/symbols)
data <- janitor::clean_names(data)
# Preview the cleaned column names and a few rows
head(data)

The output for this step gives us a dataset that’s cleaned.

 

  edge_followed_by edge_follow username_length username_has_number full_name_has_number
1            0.001       0.257              13                   1                    1
2            0.000       0.958               9                   1                    0
3            0.000       0.253              12                   0                    0
4            0.000       0.977              10                   1                    0
5            0.000       0.321              11                   0                    0
6            0.000       0.917              15                   1                    0
  full_name_length is_private is_joined_recently has_channel is_business_account
1               13          0                  0           0                   0
2                0          0                  1           0                   0
3                0          0                  0           0                   0
4                0          0                  0           0                   0
5               11          1                  0           0                   0
6                0          0                  1           0                   0
  has_guides has_external_url is_fake
1          0                0       1
2          0                0       1
3          0                0       1
4          0                0       1
5          0                0       1
6          0                0       1

 

Read More: Machine Learning with R: Everything You Need to Know

Step 5: Explore the Target Variable (Fake vs Real Users)

Let’s start our analysis by examining how many fake and real user profiles exist in the dataset. This helps us understand if the data is balanced or skewed toward one class. The code for this step is:

# Count how many users are fake vs real
table(data$is_fake)
# Visualize the distribution using a bar chart
data %>%
  ggplot(aes(x = factor(is_fake), fill = factor(is_fake))) +
  geom_bar() +
  labs(title = "Distribution of Fake vs Real Users",
       x = "Is Fake (1 = Fake, 0 = Real)",
       y = "Count") +
  scale_fill_manual(values = c("steelblue", "tomato")) +  # Custom colors for clarity
  theme_minimal()  # Clean theme

The output includes the class counts below (0 = real, 1 = fake) along with the bar chart:

  0   1
 93 692

The above output shows that:

  • There are 692 fake users and only 93 real users in the dataset.
  • The dataset is imbalanced, with far more fake accounts than real ones.
  • This imbalance may affect model training and needs to be handled properly.
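The tutorial proceeds with the imbalanced data as-is; one common remedy, sketched here in base R with toy labels (caret's downSample() offers the same in a single call), is to downsample the majority class before training:

```r
# Toy label vector mirroring the dataset's 692 fake / 93 real split
set.seed(123)
labels <- c(rep(1, 692), rep(0, 93))
fake_idx <- which(labels == 1)
real_idx <- which(labels == 0)
# Keep every minority-class row plus an equal-sized random sample of the majority
balanced_idx <- c(real_idx, sample(fake_idx, length(real_idx)))
table(labels[balanced_idx])  # 93 rows of each class
```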

Step 6: Explore Feature Relationships Using a Correlation Matrix

Understanding how numerical features relate to each other helps identify redundancy, multicollinearity, or strong relationships that may impact the model. The code for this step is:

# Remove target column before correlation analysis
data_features <- select(data, -is_fake)
# Create a correlation matrix plot to visualize relationships between features
GGally::ggcorr(data_features, label = TRUE, label_size = 3, layout.exp = 2)

The output for this step is a graph that explains various correlations.


The above graph shows that:

  • Most features have weak correlations (close to 0), meaning they vary independently.
  • is_business_account shows moderate positive correlation with username_length and has_external_url.
  • username_has_number is slightly negatively correlated with is_business_account.

Here’s an R Project: Customer Segmentation Project Using R: A Step-by-Step Guide

Step 7: Check Relations with Fake Behavior (Visual Insights)

To understand user behavior patterns better, we can explore if certain features (like private profiles or numeric usernames) show any visual correlation with being fake. Stacked bar plots (proportion-based) help us compare real vs fake user distribution across binary features. The code for this step is:

# Plot: Does private account relate to being fake?
data %>%
  ggplot(aes(x = factor(is_private), fill = factor(is_fake))) +
  geom_bar(position = "fill") +
  labs(title = "Private Account vs Fake Behavior",
       x = "Is Private", y = "Proportion", fill = "Is Fake") +
  theme_minimal()
# Plot: Does having number in username relate to being fake?
data %>%
  ggplot(aes(x = factor(username_has_number), fill = factor(is_fake))) +
  geom_bar(position = "fill") +
  labs(title = "Username Has Number vs Fake Behavior",
       x = "Username Has Number", y = "Proportion", fill = "Is Fake") +
  theme_minimal()

We get two graphs as outputs that explain:

  • Private accounts: private accounts contain a higher proportion of fake users than public ones.
  • Usernames with numbers: usernames containing digits also show a greater share of fake users than those without.

Step 8: Split Data into Training and Testing Sets

Before building our machine learning model, we split the dataset into training and testing subsets. This helps evaluate how well the model generalizes to new data. We'll use 80% of the data for training and the remaining 20% for testing. Here’s the code:

# Set seed for reproducibility (so results remain the same)
set.seed(123)
# Split data into training (80%) and testing (20%)
split_index <- createDataPartition(data$is_fake, p = 0.8, list = FALSE)
train_data <- data[split_index, ]
test_data  <- data[-split_index, ]
# Check dimensions
cat("Training rows:", nrow(train_data), "\n")
cat("Testing rows:", nrow(test_data), "\n")

The above step splits the data to train and evaluate the model based on existing data from the dataset.

Training rows: 628 

Testing rows: 157 

Must Try: Forest Fire Project Using R - A Step-by-Step Guide

Step 9: Train a Random Forest Classifier on the Data

Now that we have split the data, we’ll train a Random Forest model to predict whether a user is fake or not. This algorithm works well with classification problems and handles both linear and non-linear patterns effectively. Here’s the code:

# Build a random forest model to predict 'is_fake'
model_rf <- randomForest(is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
# View the model summary
print(model_rf)

The output for this step is:

Call:
 randomForest(formula = is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 3

        OOB estimate of  error rate: 5.89%
Confusion matrix:
   0   1 class.error
0 51  25  0.32894737
1 12 540  0.02173913

The above output means that:

  • The model predicts fake profiles with high accuracy (only ~2% error for fake accounts).
  • It struggles more with real accounts (about 33% of real users were misclassified).
  • Overall error (OOB estimate) is low at 5.89%, which means the model performs quite well.
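As a quick sanity check, the OOB figure can be recomputed from the confusion-matrix counts above:

```r
# Misclassified rows (25 real + 12 fake) over all 628 training rows
oob_error <- (25 + 12) / (51 + 25 + 12 + 540)
round(100 * oob_error, 2)  # 5.89
```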

Step 10: Convert Target Variable to Factor and Retrain the Model

Before training a classification model, it's important to ensure the target variable is treated as a categorical factor, not a numeric value. This step improves model accuracy and ensures correct behavior for classification tasks. The code for this step is given below:

# Convert target variable to factor (for classification)
train_data$is_fake <- as.factor(train_data$is_fake)
test_data$is_fake  <- as.factor(test_data$is_fake)
# Re-train the model
model_rf <- randomForest(is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
# View the updated model summary
print(model_rf)

The above code gives the output as:

Call:
 randomForest(formula = is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 3

        OOB estimate of  error rate: 5.73%
Confusion matrix:
   0   1 class.error
0 51  25  0.32894737
1 11 541  0.01992754

The above output means that:

  • The model correctly identified most fake accounts, with a low error rate (about 2%).
  • It struggled more with real accounts, misclassifying about 33% of them as fake.
  • Overall, the model is about 94% accurate (OOB error of 5.73%), which means it performs well in detecting fake profiles.

Must read: R For Data Science: Why Should You Choose R for Data Science?

Step 11: Make Predictions and Evaluate the Model

In this step, we use the trained Random Forest model to predict whether profiles in the test data are fake or real, then evaluate the predictions using a confusion matrix.
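A minimal sketch for this step, assuming the model_rf and test_data objects from the earlier steps and the caret package loaded:

```r
# Predict class labels for the held-out test set
predictions <- predict(model_rf, newdata = test_data)
# Compare predictions against the true labels using caret's confusionMatrix
confusionMatrix(predictions, test_data$is_fake)
```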

The output for this step is:

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  11   2
         1   6 138

               Accuracy : 0.949
                 95% CI : (0.9021, 0.9777)
    No Information Rate : 0.8917
    P-Value [Acc > NIR] : 0.009303

                  Kappa : 0.7057

 Mcnemar's Test P-Value : 0.288844

            Sensitivity : 0.64706
            Specificity : 0.98571
         Pos Pred Value : 0.84615
         Neg Pred Value : 0.95833
             Prevalence : 0.10828
         Detection Rate : 0.07006
   Detection Prevalence : 0.08280
      Balanced Accuracy : 0.81639

       'Positive' Class : 0

The above output means that:

  • The model achieved 94.9% accuracy on the test set.
  • It identified fake users very well (specificity 0.99) but misclassified 6 of the 17 real users (sensitivity 0.65).
  • The Kappa score of 0.71 indicates good prediction consistency.
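These headline numbers follow directly from the confusion-matrix counts (with the 'Positive' class being 0, i.e. real users):

```r
# Recompute the key statistics from the test-set confusion matrix
sensitivity <- 11 / (11 + 6)       # real users correctly identified
specificity <- 138 / (138 + 2)     # fake users correctly identified
accuracy    <- (11 + 138) / 157    # all correct predictions
balanced_accuracy <- (sensitivity + specificity) / 2
round(c(sensitivity, specificity, accuracy, balanced_accuracy), 5)
```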

A Fun R Project For You: Wine Quality Prediction Project in R

Step 12: Analyze Feature Importance

In this step, we visualize which features had the biggest impact on predicting fake users using the Random Forest’s feature importance. This tells us which variables the model relied on most when making decisions. The code for this step is:

# Plot feature importance
importance_df <- as.data.frame(importance(model_rf))
importance_df$Feature <- rownames(importance_df)
# Arrange in descending order
importance_df <- importance_df %>%
  arrange(desc(MeanDecreaseGini))
# Plot
ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(title = "Feature Importance (Fake User Detection)",
       x = "Feature", y = "Importance Score") +
  theme_minimal()

The output for the above step is a graph that highlights the feature importance of this model:

The above graph shows that:

  • edge_follow (number of people followed) is the most important feature; fake users often follow many accounts to look real.
  • full_name_length and edge_followed_by (followers) are also influential; suspicious profiles may have odd name lengths or follower patterns.
  • Features like is_private, username_has_number, and has_external_url are moderately important; fake users often set private profiles or include links.

Conclusion

In this Insta Fake User Behavior Analysis project, we built a Random Forest classification model using R in Google Colab to identify fake Instagram users based on profile attributes such as follower count, username structure, and account activity.

After cleaning and preparing the dataset, we explored key features and trained the model on 80% of the data. The model was then tested on the remaining 20%, achieving an accuracy of 94.9%, with high specificity but noticeably lower sensitivity for the minority (real) class. Feature importance analysis revealed that follower/following activity and full-name patterns are key indicators of fake accounts.


Colab Link:
https://colab.research.google.com/drive/1m0Hkf04h6bCSH1ycEFmCTqnEt46JzNQf

