Instagram Fake Profile Detection Using Machine Learning in R

By Rohit Sharma

Updated on Jul 31, 2025 | 14 min read | 1.25K+ views


This project focuses on detecting fake Instagram profiles using R in Google Colab. By analyzing behavioral features such as follower ratios, username structure, and account settings, we build a machine learning model to identify suspicious users.

Through data cleaning, visualization, and classification using Random Forest, the model effectively distinguishes fake profiles from real ones with high accuracy.


Level Your R Skills Up With These Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

How Much Time, Effort, and Skill Do You Need?

This is a beginner-level project, so it requires less time and effort than advanced projects. Here’s a quick overview of what’s required:

Aspect                  Details
Estimated Duration      3 to 4 hours
Difficulty Level        Easy to Moderate
Skill Level Required    Beginner (basic R knowledge recommended)


Key Concepts to Understand Before Starting the Instagram Fake Profile Detection Project

Before starting this Instagram fake profile detection project, you should have a basic understanding of the concepts listed below:

  • Basic Understanding of R: Familiarity with R syntax, functions, and using libraries like tidyverse, caret, and randomForest.
  • Machine Learning Basics: Know how classification models work, especially binary classification problems.
  • Data Preprocessing: Understanding how to clean and prepare data for analysis, including handling column names and data types.
  • Feature Engineering: Recognize how behavioral features such as follower ratios or account privacy settings can be used as inputs.
  • Model Evaluation: Be familiar with accuracy, sensitivity, specificity, and confusion matrices to assess model performance.
  • Random Forest Algorithm: A basic idea of how decision trees and ensemble models like Random Forest function.
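To make the feature engineering idea concrete, here is a tiny hedged illustration; the column names and numbers below are made up for demonstration only:

```r
# Toy data frame with hypothetical follower/following counts
profiles <- data.frame(
  followers = c(10, 5000),
  following = c(2000, 300)
)
# Fake accounts often follow many users while having few followers,
# so a low followers-to-following ratio can serve as a model input.
# pmax(..., 1) guards against division by zero.
profiles$follower_ratio <- profiles$followers / pmax(profiles$following, 1)
round(profiles$follower_ratio, 3)
```

A ratio far below 1 (like the first profile's 0.005) is the kind of behavioral signal a classifier can learn from.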

Key Tools and R Packages Powering This Project

To build a functional and reliable project, we’ll use the following libraries and tools:

Tool/Library        Purpose
Google Colab (R)    Cloud-based environment to write and execute R code
R                   Primary language used for analysis and modeling
tidyverse           Data manipulation and visualization (dplyr, ggplot2, etc.)
janitor             Clean and standardize column names
skimr               Generate descriptive statistics quickly
ggplot2, GGally     Plot distributions and feature correlations
randomForest        Build a classification model for detecting fake profiles
caret               Split data and evaluate model performance (accuracy, confusion matrix)

This Could Be Your Next R Project: How to Build an Uber Data Analysis Project in R

Breakdown Of This Instagram Fake Profile Detection Project 

This Instagram fake profile detection project involves several steps. In this section, we’ll break them down along with the code.

Step 1: Configure Google Colab for R Programming

To work with R in Google Colab, you’ll need to switch the default language from Python to R. This enables R code execution directly within the notebook.

Here's how you can switch to R:

  • Launch a new notebook on Google Colab
  • Go to the Runtime menu
  • Click on Change runtime type
  • In the Language dropdown, select R
  • Click Save to apply the changes

Step 2: Install and Load the Required R Libraries

In this step, we’ll install and load all the necessary R packages used throughout the project. These libraries will help with data cleaning, visualization, modeling, and evaluation. The code for this section is given below:

# Install essential libraries (run only once)
install.packages("tidyverse")    # Data manipulation and visualization (includes dplyr, ggplot2, etc.)
install.packages("skimr")        # Quick summaries of data frames
install.packages("janitor")      # Tools for cleaning messy column names
install.packages("GGally")       # Advanced correlation and pair plots
install.packages("caret")        # Tools for data splitting and model evaluation
install.packages("randomForest") # Random Forest algorithm for classification
# Load the libraries into the R environment
library(tidyverse)
library(skimr)
library(janitor)
library(GGally)
library(caret)
library(randomForest)

The output for the above code is given below, which confirms that the libraries are installed and loaded:

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
    chisq.test, fisher.test

Loading required package: lattice
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
    lift

randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:dplyr’:
    combine
The following object is masked from ‘package:ggplot2’:
    margin

Here’s a fun Project in R: Car Data Analysis Project Using R

Step 3: Load and Preview the Dataset

Now that the environment is ready, the next step is to load your dataset into R. We’ll read the CSV file, take a look at the first few rows, and check the structure to understand what kind of data we’re working with. Here’s the code to load and read the dataset:

# Directly load the CSV file (works in Colab if the file is already uploaded)
data <- read.csv("Insta Fake User Behavior Analysis.csv")  # Load the dataset
# View the first few rows to get an idea of the data
head(data)
# Check the structure and data types of each column
str(data)
 

  edge_followed_by edge_follow username_length username_has_number full_name_has_number
1            0.001       0.257              13                   1                    1
2            0.000       0.958               9                   1                    0
3            0.000       0.253              12                   0                    0
4            0.000       0.977              10                   1                    0
5            0.000       0.321              11                   0                    0
6            0.000       0.917              15                   1                    0
  full_name_length is_private is_joined_recently has_channel is_business_account
1               13          0                  0           0                   0
2                0          0                  1           0                   0
3                0          0                  0           0                   0
4                0          0                  0           0                   0
5               11          1                  0           0                   0
6                0          0                  1           0                   0
  has_guides has_external_url is_fake
1          0                0       1
2          0                0       1
3          0                0       1
4          0                0       1
5          0                0       1
6          0                0       1

'data.frame': 785 obs. of  13 variables:
 $ edge_followed_by    : num  0.001 0 0 0 0 0 0 0 0.001 0 ...
 $ edge_follow         : num  0.257 0.958 0.253 0.977 0.321 0.917 0.076 0.72 0.731 0.999 ...
 $ username_length     : int  13 9 12 10 11 15 9 15 9 7 ...
 $ username_has_number : int  1 1 0 1 0 1 1 1 1 1 ...
 $ full_name_has_number: int  1 0 0 0 0 0 1 0 0 1 ...
 $ full_name_length    : int  13 0 0 0 11 0 9 0 11 9 ...
 $ is_private          : int  0 0 0 0 1 0 0 0 1 0 ...
 $ is_joined_recently  : int  0 1 0 0 0 1 1 0 0 0 ...
 $ has_channel         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ is_business_account : int  0 0 0 0 0 0 0 0 0 0 ...
 $ has_guides          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ has_external_url    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ is_fake             : int  1 1 1 1 1 1 1 1 1 1 ...

The above table gives us a glimpse of what the data looks like.

Step 4: Clean the Column Names

To make the dataset easier to work with, we'll clean the column names by converting them to lowercase and replacing spaces with underscores. This helps avoid syntax issues during analysis. The code for this step is:

# Use janitor to standardize column names (lowercase, no spaces/symbols)
data <- janitor::clean_names(data)
# Preview the cleaned column names and a few rows
head(data)

The output for this step gives us a dataset that’s cleaned.

 

  edge_followed_by edge_follow username_length username_has_number full_name_has_number
1            0.001       0.257              13                   1                    1
2            0.000       0.958               9                   1                    0
3            0.000       0.253              12                   0                    0
4            0.000       0.977              10                   1                    0
5            0.000       0.321              11                   0                    0
6            0.000       0.917              15                   1                    0
  full_name_length is_private is_joined_recently has_channel is_business_account
1               13          0                  0           0                   0
2                0          0                  1           0                   0
3                0          0                  0           0                   0
4                0          0                  0           0                   0
5               11          1                  0           0                   0
6                0          0                  1           0                   0
  has_guides has_external_url is_fake
1          0                0       1
2          0                0       1
3          0                0       1
4          0                0       1
5          0                0       1
6          0                0       1

 

Read More: Machine Learning with R: Everything You Need to Know

Step 5: Explore the Target Variable (Fake vs Real Users)

Let’s start our analysis by examining how many fake and real user profiles exist in the dataset. This helps us understand if the data is balanced or skewed toward one class. The code for this step is:

# Count how many users are fake vs real
table(data$is_fake)
# Visualize the distribution using a bar chart
data %>%
  ggplot(aes(x = factor(is_fake), fill = factor(is_fake))) +
  geom_bar() +
  labs(title = "Distribution of Fake vs Real Users",
       x = "Is Fake (1 = Fake, 0 = Real)",
       y = "Count") +
  scale_fill_manual(values = c("steelblue", "tomato")) +  # Custom colors for clarity
  theme_minimal()  # Clean theme

The output includes the class counts below (0 = real, 1 = fake) along with the bar chart:

  0   1
 93 692

The above output shows that:

  • There are 692 fake users and only 93 real users in the dataset.
  • The dataset is imbalanced, with far more fake accounts than real ones.
  • This imbalance may affect model training and needs to be handled properly.
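The tutorial proceeds with the imbalanced data as-is; one common remedy, sketched here in base R with toy labels (caret's downSample() offers the same in a single call), is to downsample the majority class before training:

```r
# Toy label vector mirroring the dataset's 692 fake / 93 real split
set.seed(123)
labels <- c(rep(1, 692), rep(0, 93))
fake_idx <- which(labels == 1)
real_idx <- which(labels == 0)
# Keep every minority-class row plus an equal-sized random sample of the majority
balanced_idx <- c(real_idx, sample(fake_idx, length(real_idx)))
table(labels[balanced_idx])  # 93 rows of each class
```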

Step 6: Explore Feature Relationships Using a Correlation Matrix

Understanding how numerical features relate to each other helps identify redundancy, multicollinearity, or strong relationships that may impact the model. The code for this step is:

# Remove target column before correlation analysis
data_features <- select(data, -is_fake)
# Create a correlation matrix plot to visualize relationships between features
GGally::ggcorr(data_features, label = TRUE, label_size = 3, layout.exp = 2)

The output for this step is a graph that explains various correlations.


The above graph shows that:

  • Most features have weak correlations (close to 0), meaning they vary independently.
  • is_business_account shows moderate positive correlation with username_length and has_external_url.
  • username_has_number is slightly negatively correlated with is_business_account.

Here’s an R Project: Customer Segmentation Project Using R: A Step-by-Step Guide

Step 7: Check Relations with Fake Behavior (Visual Insights)

To understand user behavior patterns better, we can explore if certain features (like private profiles or numeric usernames) show any visual correlation with being fake. Stacked bar plots (proportion-based) help us compare real vs fake user distribution across binary features. The code for this step is:

# Plot: Does private account relate to being fake?
data %>%
  ggplot(aes(x = factor(is_private), fill = factor(is_fake))) +
  geom_bar(position = "fill") +
  labs(title = "Private Account vs Fake Behavior",
       x = "Is Private", y = "Proportion", fill = "Is Fake") +
  theme_minimal()
# Plot: Does having number in username relate to being fake?
data %>%
  ggplot(aes(x = factor(username_has_number), fill = factor(is_fake))) +
  geom_bar(position = "fill") +
  labs(title = "Username Has Number vs Fake Behavior",
       x = "Username Has Number", y = "Proportion", fill = "Is Fake") +
  theme_minimal()

We get two graphs as outputs that explain:

  • Private accounts: private accounts contain a higher proportion of fake users than public ones.
  • Usernames with numbers: usernames containing digits also show a greater share of fake users than those without.

Step 8: Split Data into Training and Testing Sets

Before building our machine learning model, we split the dataset into training and testing subsets. This helps evaluate how well the model generalizes to new data. We'll use 80% of the data for training and the remaining 20% for testing. Here’s the code:

# Set seed for reproducibility (so results remain the same)
set.seed(123)
# Split data into training (80%) and testing (20%)
split_index <- createDataPartition(data$is_fake, p = 0.8, list = FALSE)
train_data <- data[split_index, ]
test_data  <- data[-split_index, ]
# Check dimensions
cat("Training rows:", nrow(train_data), "\n")
cat("Testing rows:", nrow(test_data), "\n")

The above step splits the data to train and evaluate the model based on existing data from the dataset.

Training rows: 628 

Testing rows: 157 

Must Try: Forest Fire Project Using R - A Step-by-Step Guide

Step 9: Train a Random Forest Classifier on the Data

Now that we have split the data, we’ll train a Random Forest model to predict whether a user is fake or not. This algorithm works well with classification problems and handles both linear and non-linear patterns effectively. Here’s the code:

# Build a random forest model to predict 'is_fake'
model_rf <- randomForest(is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
# View the model summary
print(model_rf)

The output for this step is:

Call:
 randomForest(formula = is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 3

        OOB estimate of  error rate: 5.89%
Confusion matrix:
   0   1 class.error
0 51  25  0.32894737
1 12 540  0.02173913

The above output means that:

  • The model predicts fake profiles with high accuracy (only ~2% error for fake accounts).
  • It struggles more with real accounts (about 33% of real users were misclassified).
  • Overall error (OOB estimate) is low at 5.89%, which means the model performs quite well.
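As a quick sanity check, the OOB figure can be recomputed from the confusion-matrix counts above:

```r
# Misclassified rows (25 real + 12 fake) over all 628 training rows
oob_error <- (25 + 12) / (51 + 25 + 12 + 540)
round(100 * oob_error, 2)  # 5.89
```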

Step 10: Convert Target Variable to Factor and Retrain the Model

Before training a classification model, it's important to ensure the target variable is treated as a categorical factor, not a numeric value. This step improves model accuracy and ensures correct behavior for classification tasks. The code for this step is given below:

# Convert target variable to factor (for classification)
train_data$is_fake <- as.factor(train_data$is_fake)
test_data$is_fake  <- as.factor(test_data$is_fake)
# Re-train the model
model_rf <- randomForest(is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
# View the updated model summary
print(model_rf)

The above code gives the output as:

Call:
 randomForest(formula = is_fake ~ ., data = train_data, importance = TRUE, ntree = 100)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 3

        OOB estimate of  error rate: 5.73%
Confusion matrix:
   0   1 class.error
0 51  25  0.32894737
1 11 541  0.01992754

The above output means that:

  • The model correctly identified most fake accounts, with a low error rate (about 2%).
  • It struggled more with real accounts, misclassifying about 33% of them as fake.
  • Overall, the model is about 94% accurate (OOB error of 5.73%), which means it performs well in detecting fake profiles.

Must read: R For Data Science: Why Should You Choose R for Data Science?

Step 11: Make Predictions and Evaluate the Model

In this step, we use the trained Random Forest model to predict whether profiles in the test data are fake or real, then evaluate the predictions using a confusion matrix.
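A minimal sketch for this step, assuming the model_rf and test_data objects from the earlier steps and the caret package loaded:

```r
# Predict class labels for the held-out test set
predictions <- predict(model_rf, newdata = test_data)
# Compare predictions against the true labels using caret's confusionMatrix
confusionMatrix(predictions, test_data$is_fake)
```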

The output for this step is:

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  11   2
         1   6 138

               Accuracy : 0.949
                 95% CI : (0.9021, 0.9777)
    No Information Rate : 0.8917
    P-Value [Acc > NIR] : 0.009303

                  Kappa : 0.7057

 Mcnemar's Test P-Value : 0.288844

            Sensitivity : 0.64706
            Specificity : 0.98571
         Pos Pred Value : 0.84615
         Neg Pred Value : 0.95833
             Prevalence : 0.10828
         Detection Rate : 0.07006
   Detection Prevalence : 0.08280
      Balanced Accuracy : 0.81639

       'Positive' Class : 0

The above output means that:

  • The model achieved 94.9% accuracy on the test set.
  • It identified fake users very well (specificity 0.99) but misclassified 6 of the 17 real users (sensitivity 0.65).
  • The Kappa score of 0.71 indicates good prediction consistency.
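These headline numbers follow directly from the confusion-matrix counts (with the 'Positive' class being 0, i.e. real users):

```r
# Recompute the key statistics from the test-set confusion matrix
sensitivity <- 11 / (11 + 6)       # real users correctly identified
specificity <- 138 / (138 + 2)     # fake users correctly identified
accuracy    <- (11 + 138) / 157    # all correct predictions
balanced_accuracy <- (sensitivity + specificity) / 2
round(c(sensitivity, specificity, accuracy, balanced_accuracy), 5)
```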

A Fun R Project For You: Wine Quality Prediction Project in R

Step 12: Analyze Feature Importance

In this step, we visualize which features had the biggest impact on predicting fake users using the Random Forest’s feature importance. This tells us which variables the model relied on most when making decisions. The code for this step is:

# Plot feature importance
importance_df <- as.data.frame(importance(model_rf))
importance_df$Feature <- rownames(importance_df)
# Arrange in descending order
importance_df <- importance_df %>%
  arrange(desc(MeanDecreaseGini))
# Plot
ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(title = "Feature Importance (Fake User Detection)",
       x = "Feature", y = "Importance Score") +
  theme_minimal()

The output for the above step is a graph that highlights the feature importance of this model:

The above graph shows that:

  • edge_follow (number of people followed) is the most important feature; fake users often follow many accounts to look real.
  • full_name_length and edge_followed_by (followers) are also influential; suspicious profiles may have odd name lengths or follower patterns.
  • Features like is_private, username_has_number, and has_external_url are moderately important; fake users often set private profiles or include links.

Conclusion

In this Insta Fake User Behavior Analysis project, we built a Random Forest classification model using R in Google Colab to identify fake Instagram users based on profile attributes such as follower count, username structure, and account activity.

After cleaning and preparing the dataset, we explored key features and trained the model on 80% of the data. The model was then tested on the remaining 20%, achieving an accuracy of 94.9%, with high specificity but noticeably lower sensitivity for the minority (real) class. Feature importance analysis revealed that follower/following activity and full-name patterns are key indicators of fake accounts.


Colab Link:
https://colab.research.google.com/drive/1m0Hkf04h6bCSH1ycEFmCTqnEt46JzNQf

