Black Friday Dataset Analysis for Sales Prediction
By Rohit Sharma
Updated on Aug 05, 2025 | 11 min read | 1.33K+ views
Black Friday is one of the biggest retail events of the year, not only for the massive discounts it offers but also for the sheer volume of customer data it generates. This project focuses on analyzing the Black Friday dataset to uncover insights into consumer purchasing behavior. Our aim is to identify what factors, such as age, gender, occupation, and city category, influence how much a customer spends. By exploring these patterns, we can build predictive models that help forecast future sales and support better business decisions.
Imagine you run a retail business and want to improve the recommendations you offer your customers. You have basic information about them, such as their city type, gender, age group, and the types of products they have browsed or purchased.
The key question is: can we forecast how much a customer is likely to spend?
That’s exactly what this dataset lets us try. It includes:
- Customer demographics: gender, age group, occupation, and marital status
- Location details: city category and years spent in the current city
- Product information: up to three product category codes per purchase
- The purchase amount for each transaction, which is our prediction target
This project offers hands-on experience in data preprocessing, visualization, and machine learning modeling.
Although you don't need to be a machine learning expert to follow along, some basic knowledge will help:
- Python and Pandas fundamentals
- Basic data visualization with Matplotlib or Seaborn
- Familiarity with regression models and evaluation metrics
On average, the project takes about 2 to 3 hours to complete. The duration may vary depending on your familiarity with data preprocessing, visualization, and basic machine learning concepts.
Here are the machine learning models we will use:

| Model | Purpose |
| --- | --- |
| Linear Regression | To build a simple baseline model for predicting the purchase amount |
| Decision Tree Regressor | To capture nonlinear patterns and decision rules |
| Random Forest Regressor | To improve accuracy using an ensemble of decision trees |
| XGBoost Regressor | To leverage gradient boosting for better performance and efficiency |
Now, let’s start the main section of the project and build it from scratch. We will:
- Download the Black Friday dataset from Kaggle
- Load and explore the data
- Visualize purchase patterns
- Clean missing values and encode categorical features
- Split the data and train regression models
Without any further delay, let’s start!
First, we will be downloading the Black Friday dataset from Kaggle. Use the code given below to do so:
import kagglehub
# Download latest version
path = kagglehub.dataset_download("prepinstaprime/black-friday-sales-data")
print("Path to dataset files:", path)
Output:
Downloading from https://www.kaggle.com/api/v1/datasets/download/prepinstaprime/black-friday-sales-data?dataset_version_number=1...
100%|██████████| 5.48M/5.48M [00:00<00:00, 67.5MB/s]Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/prepinstaprime/black-friday-sales-data/versions/1
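Before loading anything, it can help to confirm which files the download actually contains; the exact file names depend on the dataset version. A minimal sketch using the path returned above:

import os

# List the files extracted by kagglehub
print(os.listdir(path))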
In this step, we will load the dataset using Pandas. Here’s the code:
import pandas as pd
# Full path to the CSV file
file_path = "/root/.cache/kagglehub/datasets/prepinstaprime/black-friday-sales-data/versions/1/train.csv"
# Load the dataset
df = pd.read_csv(file_path)
# Display the first few rows
df.head()
Output:
   User_ID Product_ID Gender   Age  Occupation City_Category Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3  Purchase
0  1000001  P00069042      F  0-17          10             A                          2               0                   3                 NaN                 NaN      8370
1  1000001  P00248942      F  0-17          10             A                          2               0                   1                 6.0                14.0     15200
2  1000001  P00087842      F  0-17          10             A                          2               0                  12                 NaN                 NaN      1422
3  1000001  P00085442      F  0-17          10             A                          2               0                  12                14.0                 NaN      1057
4  1000002  P00285442      M   55+          16             C                         4+               0                   8                 NaN                 NaN      7969
The output shows that the Black Friday dataset contains customer and product-related columns, such as:
- Identifiers: User_ID and Product_ID
- Customer demographics: Gender, Age, Occupation, and Marital_Status
- Location: City_Category and Stay_In_Current_City_Years
- Product categories: Product_Category_1, Product_Category_2, and Product_Category_3
- Purchase: the purchase amount, which is our target variable
Understanding the type of data we are working with is crucial before beginning any modeling. To accomplish that, in this step we will verify:
- The dataset's shape (number of rows and columns)
- The data type of each column
- The number of missing values per column
- The number of unique values in key columns
The code to accomplish this is as follows:
# Check the shape of the dataset
print("Dataset Shape:", df.shape)
# Get column-wise info
print("\nColumn Info:")
print(df.info())
# Count missing values
print("\nMissing Values Per Column:")
print(df.isnull().sum())
# View unique values in each column (first 5 columns for quick look)
print("\nUnique Values (First 5 Columns):")
for col in df.columns[:5]:
print(f"{col}: {df[col].nunique()} unique values")
Output:
Dataset Shape: (550068, 12)
Column Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 550068 non-null int64
1 Product_ID 550068 non-null object
2 Gender 550068 non-null object
3 Age 550068 non-null object
4 Occupation 550068 non-null int64
5 City_Category 550068 non-null object
6 Stay_In_Current_City_Years 550068 non-null object
7 Marital_Status 550068 non-null int64
8 Product_Category_1 550068 non-null int64
9 Product_Category_2 376430 non-null float64
10 Product_Category_3 166821 non-null float64
11 Purchase 550068 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB
None
Missing Values Per Column:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 173638
Product_Category_3 383247
Purchase 0
dtype: int64
Unique Values (First 5 Columns):
User_ID: 5891 unique values
Product_ID: 3631 unique values
Gender: 2 unique values
Age: 7 unique values
Occupation: 21 unique values
The output indicates that:
- The dataset has 550,068 rows and 12 columns.
- Product_Category_2 and Product_Category_3 are the only columns with missing values (173,638 and 383,247 respectively); every other column is complete.
- There are 5,891 unique users and 3,631 unique products.
- Columns such as Gender, Age, City_Category, and Stay_In_Current_City_Years are stored as object (text) types, so they will need encoding before modeling.
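To see how severe these gaps are, it helps to express the missing counts as percentages of all rows. A quick sketch using the df loaded above:

# Share of missing values per column, as a percentage of all rows
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_pct[missing_pct > 0])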
In this step, we will visualize user trends, product purchases, and purchase distribution. Why? To find patterns and uncover hidden trends in the data. To achieve this, we will use Matplotlib and Seaborn. Use the code mentioned below to achieve this:
# Install seaborn if not installed
!pip install seaborn --quiet
# Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Set plot style
sns.set(style="whitegrid")
# Set overall figure size
plt.figure(figsize=(20,16))
# Plot 1: Average Purchase by Gender
plt.subplot(2, 2, 1)
sns.barplot(data=df, x='Gender', y='Purchase', estimator='mean', palette='Set2')
plt.title("Average Purchase by Gender")
plt.ylabel("Average Purchase Amount")
plt.xlabel("Gender")
# Plot 2: Average Purchase by Age Group
plt.subplot(2, 2, 2)
sns.barplot(data=df, x='Age', y='Purchase', estimator='mean', palette='Set3')
plt.title("Average Purchase by Age Group")
plt.ylabel("Average Purchase Amount")
plt.xlabel("Age Group")
# Plot 3: Purchase Distribution Histogram
plt.subplot(2, 2, 3)
sns.histplot(df['Purchase'], bins=30, kde=True, color='salmon')
plt.title("Distribution of Purchase Amounts")
plt.xlabel("Purchase Amount")
plt.ylabel("Frequency")
# Plot 4: Purchase by City Category
plt.subplot(2, 2, 4)
sns.boxplot(data=df, x='City_Category', y='Purchase', palette='pastel')
plt.title("Purchase Amount by City Category")
plt.xlabel("City Category")
plt.ylabel("Purchase")
# Adjust layout
plt.tight_layout()
plt.show()
Output:
What does each chart explain or depict?
The charts (starting from top-left and moving clockwise) tell us that:
| Plot | What it shows |
| --- | --- |
| Average Purchase by Gender | The average purchase amount for men is marginally higher than for women. |
| Average Purchase by Age Group | Users aged 51-55 spend the most on average, followed by the 46-50 and 26-35 groups. |
| Purchase Amount by City Category | The median purchase amounts are fairly consistent across city categories A, B, and C. |
| Distribution of Purchase Amounts | The distribution is right-skewed with several peaks, indicating a range of spending patterns. |
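The chart readings above can also be confirmed numerically with a groupby, which is handy when exact figures matter more than the visual. A minimal sketch using the same DataFrame:

# Numeric check of the charts: average purchase by gender and by age group
print(df.groupby('Gender')['Purchase'].mean().round(2))
print(df.groupby('Age')['Purchase'].mean().sort_values(ascending=False).round(2))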
In step 3, we discovered that the Black Friday dataset contains missing values in Product_Category_2 and Product_Category_3. In this step, let's fill them and correct the resulting data types. Since the product category codes are positive integers, filling the gaps with 0 effectively creates a separate "not applicable" category. To do this, use the code listed below:
# Fill missing values in Product_Category_2 and Product_Category_3 with 0
# (assignment is used instead of inplace=True, which pandas now warns about)
df['Product_Category_2'] = df['Product_Category_2'].fillna(0)
df['Product_Category_3'] = df['Product_Category_3'].fillna(0)
# Convert data type from float to int for filled columns
df['Product_Category_2'] = df['Product_Category_2'].astype(int)
df['Product_Category_3'] = df['Product_Category_3'].astype(int)
# Optional: Convert categorical columns to string (if needed for encoding later)
df['User_ID'] = df['User_ID'].astype(str)
df['Product_ID'] = df['Product_ID'].astype(str)
# Confirm no missing values
print("Any missing values left?:")
print(df.isnull().sum())
Output:
Any missing values left?:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 0
Product_Category_3 0
Purchase 0
dtype: int64
Before training our models, we must convert categorical columns such as Gender, Age, and City_Category into numerical format, because machine learning algorithms work with numbers, not text. To achieve this, we will use scikit-learn's LabelEncoder.
The code is as follows:
from sklearn.preprocessing import LabelEncoder
# Create a LabelEncoder object
le = LabelEncoder()
# List of categorical columns to encode
cat_cols = ['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']
# Apply Label Encoding to each
for col in cat_cols:
    df[col] = le.fit_transform(df[col])
# View the first few rows to confirm
print(df[cat_cols].head())
Output:
Gender Age City_Category Stay_In_Current_City_Years
0 0 0 0 2
1 0 0 0 2
2 0 0 0 2
3 0 0 0 2
4 1 6 2 4
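One caveat: the loop above reuses a single LabelEncoder object, so only the mapping for the last column fitted is retained, and earlier columns cannot be mapped back to their original labels. If you need to reverse the encoding later, here is a variant of that loop that keeps one encoder per column (a sketch, not required for the rest of the project):

from sklearn.preprocessing import LabelEncoder

# Keep one fitted encoder per column so each mapping can be reversed later
encoders = {col: LabelEncoder() for col in cat_cols}
for col in cat_cols:
    df[col] = encoders[col].fit_transform(df[col])

# Example: map the encoded Age values in the first rows back to their labels
print(encoders['Age'].inverse_transform(df['Age'].head()))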
In this step, we will remove the non-informative identifier columns, User_ID and Product_ID. Once that is done, we will split the data into training and testing sets using an 80:20 ratio.
Use the code below to accomplish this:
from sklearn.model_selection import train_test_split
# Prepare features and target
X = df.drop(['User_ID', 'Product_ID', 'Purchase'], axis=1)
y = df['Purchase']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Show shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Output:
X_train shape: (440054, 9)
X_test shape: (110014, 9)
y_train shape: (440054,)
y_test shape: (110014,)
The output confirms that the Black Friday dataset has been successfully split: 440,054 rows for training and 110,014 rows for testing.
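As a quick sanity check that the random split is representative, you can compare the target's mean in the two sets; for an 80:20 random split they should be close:

# The average purchase amount should be similar in both splits
print("Train mean purchase:", round(y_train.mean(), 2))
print("Test mean purchase:", round(y_test.mean(), 2))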
In this step, we will train the following regression models to predict purchase amounts:
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor
- XGBoost Regressor
We will evaluate them using the R² Score and Mean Squared Error (MSE). Here is the code:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error
# Dictionary to store models and results
models = {
"Linear Regression": LinearRegression(),
"Decision Tree": DecisionTreeRegressor(),
"Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
"XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
}
# Train, predict, and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name}")
    print(f"R² Score: {r2:.4f}")
    print(f"Mean Squared Error: {mse:.2f}")
    print("-" * 30)
Output:
Linear Regression
R² Score: 0.1510
Mean Squared Error: 21332344.83
------------------------------
Decision Tree
R² Score: 0.5527
Mean Squared Error: 11238644.44
------------------------------
Random Forest
R² Score: 0.6268
Mean Squared Error: 9377075.57
------------------------------
XGBoost
R² Score: 0.6606
Mean Squared Error: 8528891.00
------------------------------
The XGBoost Regressor performed best among all the models tested, with the lowest Mean Squared Error (about 8.5 million) and the highest R² score (0.6606). This means it explains roughly 66% of the variation in purchase amounts.
Linear Regression performed poorly, explaining only about 15% of the variance, most likely because the relationships in the data are non-linear. The Decision Tree did better, but the ensemble models outperformed it: Random Forest improved accuracy through bagging, while XGBoost used gradient boosting to push results further.
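Because MSE is in squared units, it is easier to interpret after taking the square root: the RMSE is in the same units as the purchase amount itself. A short sketch that converts the scores reported above:

import math

# Convert the MSE values printed above into RMSE (same units as Purchase)
mse_scores = {
    "Linear Regression": 21332344.83,
    "Decision Tree": 11238644.44,
    "Random Forest": 9377075.57,
    "XGBoost": 8528891.00,
}
for name, mse in mse_scores.items():
    print(f"{name}: RMSE = {math.sqrt(mse):,.0f}")

On this scale, XGBoost's typical prediction error comes out to roughly 2,900 purchase units, versus about 4,600 for Linear Regression.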
Colab Link:
https://colab.research.google.com/drive/15wUkQ8jsKjd0H2Aj3UI9ohRuyTUkbFjT