Black Friday Dataset Analysis for Sales Prediction
By Rohit Sharma
Updated on Aug 05, 2025 | 11 min read | 1.33K+ views
Black Friday is one of the biggest retail events of the year, not only for the massive discounts it offers but also for the sheer volume of customer data it generates. This project focuses on analyzing the Black Friday dataset to uncover insights into consumer purchasing behavior. Our aim is to identify what factors, such as age, gender, occupation, and city category, influence how much a customer spends. By exploring these patterns, we can build predictive models that help forecast future sales and support better business decisions.
Imagine you run a retail business and want to improve the recommendations you offer your customers. You have basic information about them, such as their city type, gender, age group, and the types of products they have browsed or purchased.
The key question is: can we forecast how much a customer is likely to spend?
That’s exactly what this dataset lets us try. It includes:
- Customer demographics: gender, age group, occupation, and marital status
- Location details: city category and years spent in the current city
- Product information: up to three product category codes per purchase
- The purchase amount for each transaction, which is our prediction target
This project offers hands-on experience in data preprocessing, visualization, and machine learning modeling.
Although you don't need to be a machine learning expert to follow along, some basic knowledge will help:
- Python and Pandas fundamentals
- Basic data visualization with Matplotlib or Seaborn
- Familiarity with regression models and evaluation metrics
On average, the project takes about 2 to 3 hours to complete. The duration may vary depending on your familiarity with data preprocessing, visualization, and basic machine learning concepts.
Here are the machine learning models we will use:

| Model | Purpose |
| --- | --- |
| Linear Regression | To build a simple baseline model for predicting the purchase amount |
| Decision Tree Regressor | To capture nonlinear patterns and decision rules |
| Random Forest Regressor | To improve accuracy using an ensemble of decision trees |
| XGBoost Regressor | To leverage gradient boosting for better performance and efficiency |
Now, let’s start the main section of the project and build it from scratch. We will:
- Download the Black Friday dataset from Kaggle
- Load and explore the data
- Visualize purchase patterns
- Clean missing values and encode categorical features
- Split the data and train regression models
Without any further delay, let’s start!
First, we will be downloading the Black Friday dataset from Kaggle. Use the code given below to do so:
import kagglehub
# Download latest version
path = kagglehub.dataset_download("prepinstaprime/black-friday-sales-data")
print("Path to dataset files:", path)
Output:
Downloading from https://www.kaggle.com/api/v1/datasets/download/prepinstaprime/black-friday-sales-data?dataset_version_number=1...
100%|██████████| 5.48M/5.48M [00:00<00:00, 67.5MB/s]Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/prepinstaprime/black-friday-sales-data/versions/1
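Before loading anything, it can help to confirm which files the download actually contains; the exact file names depend on the dataset version. A minimal sketch using the path returned above:

import os

# List the files extracted by kagglehub
print(os.listdir(path))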
In this step, we will load the dataset using Pandas. Here’s the code:
import pandas as pd
# Full path to the CSV file
file_path = "/root/.cache/kagglehub/datasets/prepinstaprime/black-friday-sales-data/versions/1/train.csv"
# Load the dataset
df = pd.read_csv(file_path)
# Display the first few rows
df.head()
Output:
   User_ID Product_ID Gender   Age  Occupation City_Category Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3  Purchase
0  1000001  P00069042      F  0-17          10             A                          2               0                   3                 NaN                 NaN      8370
1  1000001  P00248942      F  0-17          10             A                          2               0                   1                 6.0                14.0     15200
2  1000001  P00087842      F  0-17          10             A                          2               0                  12                 NaN                 NaN      1422
3  1000001  P00085442      F  0-17          10             A                          2               0                  12                14.0                 NaN      1057
4  1000002  P00285442      M   55+          16             C                         4+               0                   8                 NaN                 NaN      7969
The output shows that the Black Friday dataset contains customer and product-related columns, such as:
- Identifiers: User_ID and Product_ID
- Customer demographics: Gender, Age, Occupation, and Marital_Status
- Location: City_Category and Stay_In_Current_City_Years
- Product categories: Product_Category_1, Product_Category_2, and Product_Category_3
- Purchase: the purchase amount, which is our target variable
Understanding the type of data we are working with is crucial before beginning any modeling. To accomplish that, in this step we will verify:
- The dataset's shape (number of rows and columns)
- The data type of each column
- The number of missing values per column
- The number of unique values in key columns
The code to accomplish this is as follows:
# Check the shape of the dataset
print("Dataset Shape:", df.shape)
# Get column-wise info
print("\nColumn Info:")
print(df.info())
# Count missing values
print("\nMissing Values Per Column:")
print(df.isnull().sum())
# View unique values in each column (first 5 columns for quick look)
print("\nUnique Values (First 5 Columns):")
for col in df.columns[:5]:
print(f"{col}: {df[col].nunique()} unique values")
Output:
Dataset Shape: (550068, 12)
Column Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 550068 non-null int64
1 Product_ID 550068 non-null object
2 Gender 550068 non-null object
3 Age 550068 non-null object
4 Occupation 550068 non-null int64
5 City_Category 550068 non-null object
6 Stay_In_Current_City_Years 550068 non-null object
7 Marital_Status 550068 non-null int64
8 Product_Category_1 550068 non-null int64
9 Product_Category_2 376430 non-null float64
10 Product_Category_3 166821 non-null float64
11 Purchase 550068 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB
None
Missing Values Per Column:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 173638
Product_Category_3 383247
Purchase 0
dtype: int64
Unique Values (First 5 Columns):
User_ID: 5891 unique values
Product_ID: 3631 unique values
Gender: 2 unique values
Age: 7 unique values
Occupation: 21 unique values
The output indicates that:
- The dataset has 550,068 rows and 12 columns.
- Product_Category_2 and Product_Category_3 are the only columns with missing values (173,638 and 383,247 respectively); every other column is complete.
- There are 5,891 unique users and 3,631 unique products.
- Columns such as Gender, Age, City_Category, and Stay_In_Current_City_Years are stored as object (text) types, so they will need encoding before modeling.
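To see how severe these gaps are, it helps to express the missing counts as percentages of all rows. A quick sketch using the df loaded above:

# Share of missing values per column, as a percentage of all rows
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_pct[missing_pct > 0])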
In this step, we will visualize user trends, product purchases, and purchase distribution. Why? To find patterns and uncover hidden trends in the data. To achieve this, we will use Matplotlib and Seaborn. Use the code mentioned below to achieve this:
# Install seaborn if not installed
!pip install seaborn --quiet
# Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Set plot style
sns.set(style="whitegrid")
# Set overall figure size
plt.figure(figsize=(20,16))
# Plot 1: Average Purchase by Gender
plt.subplot(2, 2, 1)
sns.barplot(data=df, x='Gender', y='Purchase', estimator='mean', palette='Set2')
plt.title("Average Purchase by Gender")
plt.ylabel("Average Purchase Amount")
plt.xlabel("Gender")
# Plot 2: Average Purchase by Age Group
plt.subplot(2, 2, 2)
sns.barplot(data=df, x='Age', y='Purchase', estimator='mean', palette='Set3')
plt.title("Average Purchase by Age Group")
plt.ylabel("Average Purchase Amount")
plt.xlabel("Age Group")
# Plot 3: Purchase Distribution Histogram
plt.subplot(2, 2, 3)
sns.histplot(df['Purchase'], bins=30, kde=True, color='salmon')
plt.title("Distribution of Purchase Amounts")
plt.xlabel("Purchase Amount")
plt.ylabel("Frequency")
# Plot 4: Purchase by City Category
plt.subplot(2, 2, 4)
sns.boxplot(data=df, x='City_Category', y='Purchase', palette='pastel')
plt.title("Purchase Amount by City Category")
plt.xlabel("City Category")
plt.ylabel("Purchase")
# Adjust layout
plt.tight_layout()
plt.show()
Output:
What does each chart explain or depict?
The charts (starting from top-left and moving clockwise) tell us that:
| Plot | What it shows |
| --- | --- |
| Average Purchase by Gender | The average purchase amount for men is marginally higher than for women. |
| Average Purchase by Age Group | Users aged 51-55 spend the most on average, followed by the 46-50 and 26-35 groups. |
| Purchase Amount by City Category | The median purchase amounts are fairly consistent across city categories A, B, and C. |
| Distribution of Purchase Amounts | The distribution is right-skewed with several peaks, indicating a range of spending patterns. |
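The chart readings above can also be confirmed numerically with a groupby, which is handy when exact figures matter more than the visual. A minimal sketch using the same DataFrame:

# Numeric check of the charts: average purchase by gender and by age group
print(df.groupby('Gender')['Purchase'].mean().round(2))
print(df.groupby('Age')['Purchase'].mean().sort_values(ascending=False).round(2))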
In step 3, we discovered that the Black Friday dataset contains missing values in Product_Category_2 and Product_Category_3. In this step, let's fill them and correct the resulting data types. Since the product category codes are positive integers, filling the gaps with 0 effectively creates a separate "not applicable" category. To do this, use the code listed below:
# Fill missing values in Product_Category_2 and Product_Category_3 with 0
# (assignment is used instead of inplace=True, which pandas now warns about)
df['Product_Category_2'] = df['Product_Category_2'].fillna(0)
df['Product_Category_3'] = df['Product_Category_3'].fillna(0)
# Convert data type from float to int for filled columns
df['Product_Category_2'] = df['Product_Category_2'].astype(int)
df['Product_Category_3'] = df['Product_Category_3'].astype(int)
# Optional: Convert categorical columns to string (if needed for encoding later)
df['User_ID'] = df['User_ID'].astype(str)
df['Product_ID'] = df['Product_ID'].astype(str)
# Confirm no missing values
print("Any missing values left?:")
print(df.isnull().sum())
Output:
Any missing values left?:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 0
Product_Category_3 0
Purchase 0
dtype: int64
Before training our models, we must convert categorical columns such as Gender, Age, and City_Category into numerical format, because machine learning algorithms work with numbers, not text. To achieve this, we will use scikit-learn's LabelEncoder.
The code is as follows:
from sklearn.preprocessing import LabelEncoder
# Create a LabelEncoder object
le = LabelEncoder()
# List of categorical columns to encode
cat_cols = ['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']
# Apply Label Encoding to each
for col in cat_cols:
    df[col] = le.fit_transform(df[col])
# View the first few rows to confirm
print(df[cat_cols].head())
Output:
Gender Age City_Category Stay_In_Current_City_Years
0 0 0 0 2
1 0 0 0 2
2 0 0 0 2
3 0 0 0 2
4 1 6 2 4
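One caveat: the loop above reuses a single LabelEncoder object, so only the mapping for the last column fitted is retained, and earlier columns cannot be mapped back to their original labels. If you need to reverse the encoding later, here is a variant of that loop that keeps one encoder per column (a sketch, not required for the rest of the project):

from sklearn.preprocessing import LabelEncoder

# Keep one fitted encoder per column so each mapping can be reversed later
encoders = {col: LabelEncoder() for col in cat_cols}
for col in cat_cols:
    df[col] = encoders[col].fit_transform(df[col])

# Example: map the encoded Age values in the first rows back to their labels
print(encoders['Age'].inverse_transform(df['Age'].head()))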
In this step, we will remove the non-informative identifier columns, User_ID and Product_ID. Once that is done, we will split the data into training and testing sets using an 80:20 ratio.
Use the code below to accomplish this:
from sklearn.model_selection import train_test_split
# Prepare features and target
X = df.drop(['User_ID', 'Product_ID', 'Purchase'], axis=1)
y = df['Purchase']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Show shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Output:
X_train shape: (440054, 9)
X_test shape: (110014, 9)
y_train shape: (440054,)
y_test shape: (110014,)
The output confirms that the Black Friday dataset has been successfully split: 440,054 rows for training and 110,014 rows for testing.
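As a quick sanity check that the random split is representative, you can compare the target's mean in the two sets; for an 80:20 random split they should be close:

# The average purchase amount should be similar in both splits
print("Train mean purchase:", round(y_train.mean(), 2))
print("Test mean purchase:", round(y_test.mean(), 2))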
In this step, we will train the following regression models to predict purchase amounts:
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor
- XGBoost Regressor
We will evaluate them using the R² Score and Mean Squared Error (MSE). Here is the code:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error
# Dictionary to store models and results
models = {
"Linear Regression": LinearRegression(),
"Decision Tree": DecisionTreeRegressor(),
"Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
"XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
}
# Train, predict, and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name}")
    print(f"R² Score: {r2:.4f}")
    print(f"Mean Squared Error: {mse:.2f}")
    print("-" * 30)
Output:
Linear Regression
R² Score: 0.1510
Mean Squared Error: 21332344.83
------------------------------
Decision Tree
R² Score: 0.5527
Mean Squared Error: 11238644.44
------------------------------
Random Forest
R² Score: 0.6268
Mean Squared Error: 9377075.57
------------------------------
XGBoost
R² Score: 0.6606
Mean Squared Error: 8528891.00
------------------------------
The XGBoost Regressor performed best among all the models tested, with the lowest Mean Squared Error (about 8.5 million) and the highest R² score (0.6606). This means it explains roughly 66% of the variation in purchase amounts.
Linear Regression performed poorly, explaining only about 15% of the variance, most likely because the relationships in the data are non-linear. The Decision Tree did better, but the ensemble models outperformed it: Random Forest improved accuracy through bagging, while XGBoost used gradient boosting to push results further.
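Because MSE is in squared units, it is easier to interpret after taking the square root: the RMSE is in the same units as the purchase amount itself. A short sketch that converts the scores reported above:

import math

# Convert the MSE values printed above into RMSE (same units as Purchase)
mse_scores = {
    "Linear Regression": 21332344.83,
    "Decision Tree": 11238644.44,
    "Random Forest": 9377075.57,
    "XGBoost": 8528891.00,
}
for name, mse in mse_scores.items():
    print(f"{name}: RMSE = {math.sqrt(mse):,.0f}")

On this scale, XGBoost's typical prediction error comes out to roughly 2,900 purchase units, versus about 4,600 for Linear Regression.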
Colab Link:
https://colab.research.google.com/drive/15wUkQ8jsKjd0H2Aj3UI9ohRuyTUkbFjT