Bigmart Sales Dataset Analysis and Prediction Using Machine Learning
By Rohit Sharma
Updated on Jul 31, 2025 | 9 min read | 1.68K+ views
Predicting product sales is a key challenge for any retail business. The BigMart sales dataset is a well-known machine learning dataset used to forecast item-level sales across multiple stores.
In this project, we aim to predict Item_Outlet_Sales using various product and outlet features. The dataset, sourced from Kaggle, includes sales data from 2013 and contains attributes such as Item_Type, Item_Fat_Content, Item_Visibility, Item_MRP, Outlet_Size, and Outlet_Location_Type.
The goal is to build a predictive model that helps BigMart understand which product and store features influence sales the most.
It is better to have at least some background in Python programming and basic machine learning concepts such as regression.
For this project, we will be using the following tools and libraries:
| Tool/Library | Purpose |
| --- | --- |
| Python | Core programming language |
| Pandas | For loading, cleaning, and analyzing structured data |
| NumPy | For numerical operations and handling arrays |
| Matplotlib | To create static visualizations, like bar plots and histograms |
| Seaborn | For advanced statistical plotting and relationship visualization |
| Scikit-learn | For preprocessing, regression modeling, and evaluating model performance |
| Google Colab | Cloud-based Jupyter environment to write and run Python code |
Here are the models that we will be using: Linear Regression, Decision Tree Regressor, and Random Forest Regressor.
You can complete this Bigmart sales dataset regression project in about 4 to 6 hours. It’s a beginner-to-intermediate level machine learning project. You will get hands-on experience with supervised regression, data preprocessing, and predicting retail sales.
Let’s build the project from scratch, starting with downloading the dataset, then exploring and cleaning it, and finally training and comparing the models. Without any further delay, let’s begin!
To build the model, we will use the popular BigMart sales dataset available on Kaggle. Search for the dataset on Kaggle, open its page, and download the CSV file to your system.
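If you prefer to fetch the file programmatically instead of through the browser, here is a minimal optional sketch using kagglehub (available on recent Colab runtimes, or installable with pip install kagglehub). The dataset handle below is a placeholder, so replace it with the real one from the dataset’s Kaggle page; if you go this route, you can skip the upload step that follows.
# Optional: programmatic download with kagglehub.
# "<owner>/<bigmart-dataset>" is a placeholder -- copy the actual handle
# from the dataset's URL on Kaggle (kaggle.com/datasets/<owner>/<name>).
import kagglehub

path = kagglehub.dataset_download("<owner>/<bigmart-dataset>")
print("Dataset files downloaded to:", path)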
Now that you have downloaded the file, upload it to Google Colab using the code below:
from google.colab import files
# Upload the CSV files
uploaded = files.upload()
This will prompt you to choose a file from your system. Select the bigmart.csv file you just downloaded.
Now, use the code below to load the dataset into a Pandas DataFrame and preview the first 5 rows:
import pandas as pd
# Load the dataset
data = pd.read_csv('bigmart.csv')
# Display the first few rows
data.head()
Doing so will help you verify that the dataset is loaded correctly.
Output:
| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | FDA15 | 9.30 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.1380 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.50 | Low Fat | 0.016760 | Meat | 141.6180 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.2700 |
| 3 | FDX07 | 19.20 | Regular | 0.000000 | Fruits and Vegetables | 182.0950 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.3800 |
| 4 | NCD19 | 8.93 | Low Fat | 0.000000 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |
The dataset is loaded. Now we will explore it to understand its structure. Doing so will also help us pinpoint any issues that may exist.
Here’s the code to do so:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Basic info
print("Shape of dataset:", data.shape)
print("\nColumns:\n", data.columns)
print("\nMissing values:\n", data.isnull().sum())
# Display first 5 rows
display(data.head())
# Plot sales distribution
plt.figure(figsize=(6,4))
sns.histplot(data['Item_Outlet_Sales'], bins=30, kde=True)
plt.title("Distribution of Item Outlet Sales")
plt.xlabel("Sales")
plt.ylabel("Frequency")
plt.show()
Output:
Shape of dataset: (8523, 12)
Columns:
Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
'Item_Type', 'Item_MRP', 'Outlet_Identifier',
'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
'Outlet_Type', 'Item_Outlet_Sales'],
dtype='object')
Missing values:
Item_Identifier 0
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64
(The display(data.head()) call renders the same five-row preview shown in the table above.)
What does this output convey?
The output tells us that:
- The dataset has 8,523 rows and 12 columns.
- Item_Weight has 1,463 missing values and Outlet_Size has 2,410; every other column is complete.
- The sales histogram is right-skewed: most items have low-to-moderate sales, with a long tail of high sellers.
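To quantify those gaps, here is a quick optional check (an addition to the original walkthrough) that converts the missing-value counts into percentages:
# Optional: percentage of missing values per column, shown only where gaps exist
missing_pct = data.isnull().mean() * 100
print(missing_pct[missing_pct > 0].round(2))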
As there are missing values, let’s clean the dataset by handling them. We will also convert the categorical columns to numeric using Label Encoding.
Here is the code to accomplish the same:
from sklearn.preprocessing import LabelEncoder
# Fill missing values (mean for the numeric column, mode for the categorical one)
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())
data['Outlet_Size'] = data['Outlet_Size'].fillna(data['Outlet_Size'].mode()[0])
# Encode categorical columns; apply() fits a fresh encoding for each column
le = LabelEncoder()
cat_cols = ['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
            'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
data[cat_cols] = data[cat_cols].apply(le.fit_transform)
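As a quick optional sanity check, not part of the original code, you can confirm that no missing values remain and that the features are now numeric (Item_Identifier stays a string; we drop it before modeling):
# Verify the cleaning: expect 0 remaining missing values.
# All columns except Item_Identifier should now show numeric dtypes.
print("Remaining missing values:", data.isnull().sum().sum())
print(data.dtypes)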
Now the dataset is ready for model training. Let’s move ahead.
In this step, we will explore visual patterns. We will use Seaborn and Matplotlib to visualize relationships between features and the target variable Item_Outlet_Sales.
Use the code below to do so:
import seaborn as sns
import matplotlib.pyplot as plt
# Set the style for better visuals
sns.set(style="whitegrid")
# 1. Distribution of Item Outlet Sales
plt.figure(figsize=(8, 4))
sns.histplot(data['Item_Outlet_Sales'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Item Outlet Sales')
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.show()
# 2. Item MRP vs Sales
plt.figure(figsize=(8, 4))
sns.scatterplot(x='Item_MRP', y='Item_Outlet_Sales', data=data, hue='Outlet_Type')
plt.title('Item MRP vs Item Outlet Sales')
plt.xlabel('Item MRP')
plt.ylabel('Sales')
plt.show()
# 3. Sales by Outlet Type
plt.figure(figsize=(8, 4))
sns.boxplot(x='Outlet_Type', y='Item_Outlet_Sales', data=data)
plt.title('Sales by Outlet Type')
plt.xticks(rotation=45)
plt.xlabel('Outlet Type')
plt.ylabel('Sales')
plt.show()
Output:
(The code renders three plots: a histogram of Item_Outlet_Sales, a scatter plot of Item_MRP vs. sales colored by outlet type, and a box plot of sales by outlet type.)
What does this output convey?
- Distribution of Item Outlet Sales: the distribution is right-skewed; most items record low-to-moderate sales, with a long tail of high sellers.
- Item MRP vs Item Outlet Sales (colored by Outlet Type): sales tend to rise with Item_MRP, and the different outlet types form visibly distinct bands.
- Sales by Outlet Type (box plot): supermarkets show clearly higher median sales than grocery stores, which cluster near the bottom.
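If you would like one consolidated view of how the (now numeric) features relate to sales, a correlation heatmap is a quick optional addition to the plots above:
# Optional: correlation heatmap across all numeric features
plt.figure(figsize=(10, 6))
sns.heatmap(data.drop(columns=['Item_Identifier']).corr(),
            annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()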
In this step, we will train the models and evaluate them. Let’s start with Linear Regression.
Here is the code to do so:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Drop 'Item_Identifier' as it's not useful for prediction
X = data.drop(columns=['Item_Identifier', 'Item_Outlet_Sales'])
y = data['Item_Outlet_Sales']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("R² Score:", r2)
print("Mean Squared Error:", mse)
Output:
R² Score: 0.5248926313247789
Mean Squared Error: 1291327.6064882863
What does this output convey?
- R² Score: 0.52: the model explains about 52% of the variance in sales, a reasonable baseline for a simple linear model.
- Mean Squared Error: 1,291,328: the average squared difference between predicted and actual sales. Since individual sales run into the thousands, squared errors of this magnitude are expected; lower is better.
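Because MSE is expressed in squared sales units, you may find the root mean squared error easier to read. This small optional addition converts the error back to the sales scale:
import numpy as np

# RMSE is in the same units as Item_Outlet_Sales, so it is easier to interpret
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)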
Next, let’s train a Decision Tree Regressor to see whether a non-linear model performs better. Here’s the code:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error
# Initialize and train the Decision Tree model
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)
# Make predictions
y_pred_tree = tree_model.predict(X_test)
# Evaluate the model
r2_tree = r2_score(y_test, y_pred_tree)
mse_tree = mean_squared_error(y_test, y_pred_tree)
print("Decision Tree Regressor Performance:")
print("R² Score:", r2_tree)
print("Mean Squared Error:", mse_tree)
Output:
Decision Tree Regressor Performance:
R² Score: 0.15745401301402862
Mean Squared Error: 2290014.772375913
What does this output convey?
- R² Score is 0.157: the tree explains only about 16% of the variance on the test set, far worse than Linear Regression. A fully grown decision tree memorizes the training data and generalizes poorly.
- Mean Squared Error (MSE) is 2,290,015: nearly double the linear model’s error, which confirms the overfitting.
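A common remedy, offered here as an optional sketch rather than part of the original walkthrough, is to cap the tree’s depth so it cannot memorize the training data. The max_depth=5 below is an illustrative choice, not a tuned value:
# A shallower tree usually generalizes better than a fully grown one
tree_pruned = DecisionTreeRegressor(max_depth=5, random_state=42)
tree_pruned.fit(X_train, y_train)
print("Pruned tree R² Score:", r2_score(y_test, tree_pruned.predict(X_test)))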
Finally, let’s train a Random Forest Regressor, an ensemble of decision trees that typically generalizes better than a single tree. Here is the code:
# Import the model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
# Initialize and train the model
model_rf = RandomForestRegressor(random_state=42)
model_rf.fit(X_train, y_train)
# Make predictions
y_pred_rf = model_rf.predict(X_test)
# Evaluate the model
r2 = r2_score(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)
print("Random Forest Regressor Performance:")
print("R² Score:", r2)
print("Mean Squared Error:", mse)
Output:
Random Forest Regressor Performance:
R² Score: 0.5702020620056147
Mean Squared Error: 1168177.9301623292
What does this output convey?
- R² Score: 0.57: the ensemble explains about 57% of the variance in sales, the best result among the three models.
- Mean Squared Error: 1,168,178: the lowest error of the three; averaging many decorrelated trees smooths out the overfitting we saw with a single tree.
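Since the project’s stated goal is to understand which product and store features influence sales the most, a quick optional look at the forest’s feature importances is a natural follow-up:
# Rank features by how much they contribute to the forest's splits
importances = pd.Series(model_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))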
Now that we have seen the performance of all three models, let’s quickly compare their results on the test data:

| Model | R² Score | Mean Squared Error |
| --- | --- | --- |
| Linear Regression | 0.52 | 1,291,328 |
| Decision Tree Regressor | 0.16 | 2,290,015 |
| Random Forest Regressor | 0.57 | 1,168,178 |
Compared to the other two, the Random Forest Regressor performed the best. It had the highest R² (0.57), meaning it explained about 57% of the variation in sales, and its Mean Squared Error is the lowest of the three, so it is also the most accurate.
The Decision Tree Regressor, by contrast, performed poorly due to overfitting, while Linear Regression did reasonably well but could not match the ensemble.
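One caveat: all of these scores come from a single train/test split. As an optional robustness check that goes beyond the original walkthrough, five-fold cross-validation gives a more stable estimate of the Random Forest’s performance:
from sklearn.model_selection import cross_val_score

# Evaluate R² across five different train/test partitions instead of one
cv_scores = cross_val_score(RandomForestRegressor(random_state=42), X, y,
                            cv=5, scoring='r2')
print(f"Cross-validated R²: {cv_scores.mean():.2f} ± {cv_scores.std():.2f}")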
Colab Link:
https://colab.research.google.com/drive/15zuGdLE0y7bsOHwe1ubpu-LJJq6YV2tW?usp=sharing#scrollTo=tadwXXOmXTTP