Black Friday Dataset Analysis for Sales Prediction

By Rohit Sharma

Updated on Aug 05, 2025 | 11 min read | 1.33K+ views


Black Friday is one of the biggest retail events of the year, not only for the massive discounts it offers but also for the sheer volume of customer data it generates. This project focuses on analyzing the Black Friday dataset to uncover insights into consumer purchasing behavior. Our aim is to identify what factors, such as age, gender, occupation, and city category, influence how much a customer spends. By exploring these patterns, we can build predictive models that help forecast future sales and support better business decisions. 

Upskill in data science with upGrad's Online Data Science Courses. Learn Python, ML, AI, SQL, and Tableau from experts, build real-world skills, and get job-ready.

Check our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog for more project ideas like this one.

Problem Statement of the Black Friday Dataset

Imagine you run a retail business and want to improve the recommendations you give to your customers. You have basic information about them, such as their city type, gender, age group, and the types of products they have browsed or purchased.

The key question is: can we predict how much a customer is likely to spend?

That’s exactly what this dataset lets us try. It includes:

  • Customer details like gender, age group, marital status, occupation, and city category
  • Product information, including product IDs and category levels
  • And most importantly, the purchase amount for each transaction

This project offers hands-on experience in data preprocessing, visualization, and machine learning modeling.

Advance your data science career with upGrad's top courses and industry mentors.

What You'll Need to Begin

Although it's not necessary to be an expert in machine learning to follow along, it will be helpful to have some basic knowledge. You should be comfortable with:

  • Python and the Pandas library
  • Basic data visualization (Matplotlib and Seaborn)
  • Fundamental machine learning concepts, such as regression

Time Taken and Difficulty Level

On average, this project takes about 2 to 3 hours to complete. The duration may vary depending on your familiarity with data preprocessing, visualization, and basic machine learning concepts.

Models We Will Be Using

Here are the machine learning models that we will be utilizing:

  • Linear Regression: To build a simple baseline model for predicting the purchase amount
  • Decision Tree Regressor: To capture nonlinear patterns and decision rules
  • Random Forest Regressor: To improve accuracy using an ensemble of decision trees
  • XGBoost Regressor: To leverage gradient boosting for better performance and efficiency

Now, let's start the main section of the project.

How to Build a Black Friday Prediction Model

Let’s start building the project from scratch. We will start by:

  1. Downloading the dataset
  2. Loading the dataset
  3. Exploring the dataset
  4. Visualizing the data
  5. Handling missing values (if any)
  6. Feature encoding
  7. Splitting the data
  8. Training the models

Without any further delay, let’s start!

Step 1: Download the Black Friday Dataset Using kagglehub

First, we will be downloading the Black Friday dataset from Kaggle. Use the code given below to do so:

import kagglehub
# Download latest version
path = kagglehub.dataset_download("prepinstaprime/black-friday-sales-data")
print("Path to dataset files:", path)

Output:

Downloading from https://www.kaggle.com/api/v1/datasets/download/prepinstaprime/black-friday-sales-data?dataset_version_number=1...

100%|██████████| 5.48M/5.48M [00:00<00:00, 67.5MB/s]Extracting files...

Path to dataset files: /root/.cache/kagglehub/datasets/prepinstaprime/black-friday-sales-data/versions/1
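
Before loading anything, it's worth confirming that the expected CSV actually exists in the download directory. A quick optional check (using only the standard os module):

import os
# List the files extracted from the Kaggle download
print(os.listdir(path))

You should see train.csv among the files.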

Step 2: Load the Black Friday Dataset 

In this step, we will load the dataset using Pandas. Here’s the code:

import pandas as pd
# Full path to the CSV file
file_path = "/root/.cache/kagglehub/datasets/prepinstaprime/black-friday-sales-data/versions/1/train.csv"
# Load the dataset
df = pd.read_csv(file_path)
# Display the first few rows
df.head()

Output:

 

   User_ID Product_ID Gender   Age  Occupation City_Category Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3  Purchase
0  1000001  P00069042      F  0-17          10             A                          2               0                   3                 NaN                 NaN      8370
1  1000001  P00248942      F  0-17          10             A                          2               0                   1                 6.0                14.0     15200
2  1000001  P00087842      F  0-17          10             A                          2               0                  12                 NaN                 NaN      1422
3  1000001  P00085442      F  0-17          10             A                          2               0                  12                14.0                 NaN      1057
4  1000002  P00285442      M   55+          16             C                         4+               0                   8                 NaN                 NaN      7969


The output shows that the Black Friday dataset contains customer and product-related columns, such as:

  • User_ID, Gender, Age, City_Category
  • Product_ID, Product_Category_1/2/3
  • Purchase amount
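
A quick note on the path: the file_path above points to the kagglehub cache on this particular machine. If you run the notebook in a different environment, a more portable option (a small sketch, assuming the path variable from Step 1 is still in scope) is to join the returned directory with the file name:

import os
import pandas as pd

# Build the CSV path from the directory kagglehub returned in Step 1
file_path = os.path.join(path, "train.csv")
df = pd.read_csv(file_path)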

Step 3: Explore the Dataset (EDA)

Understanding the type of data we are working with is crucial before beginning any modeling. In order to accomplish that, we will verify in this step:

  • Column information and data shape
  • Missing values, if any
  • Distinct values for every column

The code to accomplish this is as follows:

# Check the shape of the dataset
print("Dataset Shape:", df.shape)

# Get column-wise info
print("\nColumn Info:")
print(df.info())

# Count missing values
print("\nMissing Values Per Column:")
print(df.isnull().sum())

# View unique values in each column (first 5 columns for quick look)
print("\nUnique Values (First 5 Columns):")
for col in df.columns[:5]:
    print(f"{col}: {df[col].nunique()} unique values")

Output:

Dataset Shape: (550068, 12)

Column Info:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 550068 entries, 0 to 550067

Data columns (total 12 columns):

 #   Column                      Non-Null Count   Dtype
---  ------                      --------------   -----
 0   User_ID                     550068 non-null  int64
 1   Product_ID                  550068 non-null  object
 2   Gender                      550068 non-null  object
 3   Age                         550068 non-null  object
 4   Occupation                  550068 non-null  int64
 5   City_Category               550068 non-null  object
 6   Stay_In_Current_City_Years  550068 non-null  object
 7   Marital_Status              550068 non-null  int64
 8   Product_Category_1          550068 non-null  int64
 9   Product_Category_2          376430 non-null  float64
 10  Product_Category_3          166821 non-null  float64
 11  Purchase                    550068 non-null  int64

dtypes: float64(2), int64(5), object(5)

memory usage: 50.4+ MB

None

Missing Values Per Column:

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0

dtype: int64

Unique Values (First 5 Columns):

User_ID: 5891 unique values

Product_ID: 3631 unique values

Gender: 2 unique values

Age: 7 unique values

Occupation: 21 unique values

The output indicates that:

  • The Black Friday dataset consists of 12 columns and 550,068 rows. 
  • There are 173,638 missing values in Product Category 2, and 383,247 missing values in Product Category 3.
  • There are roughly 3.6k unique products and 5.9k unique users.
  • There are:
    • 2 categories for gender.
    • 7 different ranges of age.
    • 21 different kinds of occupations.
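
To round out the exploration, it also helps to look at summary statistics for the target column. An optional one-liner:

# Summary statistics for the purchase amount
print(df['Purchase'].describe())

This shows the mean, spread, and range of purchase amounts, which gives useful context for judging model errors later on.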

Step 4: Visualize the Data

In this step, we will visualize user trends, product purchases, and purchase distribution. Why? To find patterns and uncover hidden trends in the data. To achieve this, we will use Matplotlib and Seaborn. Use the code mentioned below to achieve this:

# Install seaborn if not installed
!pip install seaborn --quiet

# Import libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set(style="whitegrid")

# Set overall figure size
plt.figure(figsize=(20,16))

# Plot 1: Average Purchase by Gender
plt.subplot(2, 2, 1)
sns.barplot(data=df, x='Gender', y='Purchase', estimator='mean', palette='Set2')
plt.title("Average Purchase by Gender")
plt.ylabel("Average Purchase Amount")
plt.xlabel("Gender")

# Plot 2: Average Purchase by Age Group
plt.subplot(2, 2, 2)
sns.barplot(data=df, x='Age', y='Purchase', estimator='mean', palette='Set3')
plt.title("Average Purchase by Age Group")
plt.ylabel("Average Purchase Amount")
plt.xlabel("Age Group")

# Plot 3: Purchase Distribution Histogram
plt.subplot(2, 2, 3)
sns.histplot(df['Purchase'], bins=30, kde=True, color='salmon')
plt.title("Distribution of Purchase Amounts")
plt.xlabel("Purchase Amount")
plt.ylabel("Frequency")

# Plot 4: Purchase by City Category
plt.subplot(2, 2, 4)
sns.boxplot(data=df, x='City_Category', y='Purchase', palette='pastel')
plt.title("Purchase Amount by City Category")
plt.xlabel("City Category")
plt.ylabel("Purchase")

# Adjust layout
plt.tight_layout()
plt.show()

Output: a 2x2 grid of charts (average purchase by gender, average purchase by age group, purchase distribution, and purchase by city category).

What does each chart explain or depict? The charts (starting from top-left and moving clockwise) tell us that:

  • Average Purchase by Gender: The average purchase amount for men is marginally higher than that of women.
  • Average Purchase by Age Group: On average, users aged 51-55 spend the most, followed by the 46-50 and 26-35 age groups.
  • Purchase Amount by City Category: The median purchase amounts are fairly consistent across city categories A, B, and C.
  • Distribution of Purchase Amounts: The distribution is right-skewed with several peaks, showing a range of spending patterns.
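
You can verify these visual patterns numerically with a simple groupby. For example:

# Average purchase amount per age group, highest first
print(df.groupby('Age')['Purchase'].mean().sort_values(ascending=False))

This prints the same averages the bar charts display, which is handy when you need exact values.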

Step 5: Handle Missing Values & Clean the Data

In Step 3, we discovered that the Black Friday dataset contains missing values. In this step, let's deal with them and correct inconsistent data types. To do this, use the code below:

# Fill missing values in Product_Category_2 and Product_Category_3 with 0
# (assignment instead of inplace=True avoids the pandas chained-assignment warning)
df['Product_Category_2'] = df['Product_Category_2'].fillna(0)
df['Product_Category_3'] = df['Product_Category_3'].fillna(0)

# Convert data type from float to int for filled columns
df['Product_Category_2'] = df['Product_Category_2'].astype(int)
df['Product_Category_3'] = df['Product_Category_3'].astype(int)

# Optional: Convert categorical columns to string (if needed for encoding later)
df['User_ID'] = df['User_ID'].astype(str)
df['Product_ID'] = df['Product_ID'].astype(str)

# Confirm no missing values
print("Any missing values left?:")
print(df.isnull().sum())

Output:


Any missing values left?:
User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Product_Category_2            0
Product_Category_3            0
Purchase                      0
dtype: int64
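
A note on the fill value: the product category codes in this dataset are positive integers, so filling with 0 effectively creates a distinct "no category" level rather than overwriting a real code. If you wanted the models to see missingness explicitly, an optional variation (a sketch; it would have to run before the fillna calls above) is to add indicator columns first:

# Hypothetical extra features: flag rows where a category was originally missing
df['PC2_was_missing'] = df['Product_Category_2'].isna().astype(int)
df['PC3_was_missing'] = df['Product_Category_3'].isna().astype(int)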

Step 6: Feature Encoding

Before training our models, we must convert categorical columns such as Gender, Age, and City_Category into numerical format. Why? Most machine learning algorithms cannot process text directly; they expect numerical input. To achieve this, we will use scikit-learn's LabelEncoder.

The code is as follows:

from sklearn.preprocessing import LabelEncoder
# Create a LabelEncoder object
le = LabelEncoder()
# List of categorical columns to encode
cat_cols = ['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

# Apply Label Encoding to each
for col in cat_cols:
    df[col] = le.fit_transform(df[col])

# View the first few rows to confirm
print(df[cat_cols].head())

Output:

   Gender  Age  City_Category  Stay_In_Current_City_Years
0       0    0              0                           2
1       0    0              0                           2
2       0    0              0                           2
3       0    0              0                           2
4       1    6              2                           4
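
One caveat: because the loop above reuses a single LabelEncoder, only the last column's mapping survives in le afterwards. If you need to decode values later, a small variation (a sketch, functionally equivalent for training) keeps one encoder per column:

from sklearn.preprocessing import LabelEncoder

encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Example: inspect the learned mapping for Gender
enc = encoders['Gender']
print(dict(zip(enc.classes_, enc.transform(enc.classes_))))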

Step 7: Prepare Features and Split the Data

In this step, we will remove identifier columns that don't generalize, such as User_ID and Product_ID. Once this is done, we will split the data into training and testing sets using an 80:20 ratio.

Use the below-mentioned code to accomplish all this:

from sklearn.model_selection import train_test_split
# Prepare features and target
X = df.drop(['User_ID', 'Product_ID', 'Purchase'], axis=1)
y = df['Purchase']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Show shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

Output:

X_train shape: (440054, 9)
X_test shape: (110014, 9)
y_train shape: (440054,)
y_test shape: (110014,)

The output tells us that the Black Friday dataset has been successfully split:

  • Training set contains 440,054 rows with 9 features
  • Testing set contains 110,014 rows with 9 features
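
Because random_state=42 fixes the shuffle, this split is reproducible across runs. A quick optional sanity check of the ratio:

# Confirm the test set is roughly 20% of all rows
print(f"Test fraction: {len(X_test) / len(X):.2f}")

This should print 0.20.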

Step 8: Train and Evaluate Regression Models

In this step, we will train the following regression models to predict purchase amounts:

  • Linear Regression
  • Decision Tree Regressor
  • Random Forest Regressor
  • XGBoost Regressor

We will evaluate them using the R² score and Mean Squared Error (MSE). Here is the code:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error
# Dictionary to store models and results
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
}

# Train, predict, and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    
    print(f"{name}")
    print(f"R² Score: {r2:.4f}")
    print(f"Mean Squared Error: {mse:.2f}")
    print("-" * 30)

Output:

Linear Regression

R² Score: 0.1510

Mean Squared Error: 21332344.83

------------------------------

Decision Tree

R² Score: 0.5527

Mean Squared Error: 11238644.44

------------------------------

Random Forest

R² Score: 0.6268

Mean Squared Error: 9377075.57

------------------------------

XGBoost

R² Score: 0.6606

Mean Squared Error: 8528891.00

------------------------------
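
MSE values in the millions are hard to read at a glance. Taking the square root gives the RMSE, which is in the same units as the Purchase column. A quick optional computation using the MSE printed above:

import math
# RMSE for the best model (XGBoost)
print(f"XGBoost RMSE: {math.sqrt(8528891.00):.2f}")

This works out to roughly 2,920, i.e., the typical prediction error is about 2,900 purchase units.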

Conclusion

The XGBoost Regressor performed the best out of all the models tested. It obtained the lowest Mean Squared Error (about 8.5 million) and the highest R² score (0.6606), meaning it explains about 66% of the variation in the purchase amount.

Linear Regression performed poorly, explaining only about 15% of the variance, most likely because the relationships in the data are non-linear. The Decision Tree did better, but the ensemble models outperformed it: Random Forest improved accuracy through bagging, and XGBoost used gradient boosting to improve results further.


Colab Link:
https://colab.research.google.com/drive/15wUkQ8jsKjd0H2Aj3UI9ohRuyTUkbFjT

