Bigmart Sales Dataset Analysis and Prediction Using Machine Learning
By Rohit Sharma
Updated on Jul 31, 2025 | 9 min read | 1.68K+ views
Predicting product sales is a key challenge for any retail business. The BigMart sales dataset is a well-known machine learning dataset used to forecast item-level sales across multiple stores.
In this project, we aim to predict Item_Outlet_Sales using various product and outlet features. The dataset, sourced from Kaggle, includes sales data from 2013 and contains attributes such as Item_Type, Item_Fat_Content, Item_Visibility, Item_MRP, Outlet_Size, and Outlet_Location_Type.
The goal is to build a predictive model that helps BigMart understand which product and store features influence sales the most.
It is better to have at least some background in Python programming and basic machine learning concepts such as regression.
For this project, we will be using the following tools and libraries:
| Tool/Library | Purpose |
| --- | --- |
| Python | Core programming language |
| Pandas | For loading, cleaning, and analyzing structured data |
| NumPy | For numerical operations and handling arrays |
| Matplotlib | To create static visualizations, like bar plots and histograms |
| Seaborn | For advanced statistical plotting and relationship visualization |
| Scikit-learn | For preprocessing, regression modeling, and evaluating model performance |
| Google Colab | Cloud-based Jupyter environment to write and run Python code |
Here are the models that we will be using: Linear Regression, Decision Tree Regressor, and Random Forest Regressor.
You can complete this Bigmart sales dataset regression project in about 4 to 6 hours. It’s a beginner-to-intermediate level machine learning project. You will get hands-on experience with supervised regression, data preprocessing, and predicting retail sales.
Let’s build the project from scratch, starting with downloading the dataset, then exploring and cleaning it, and finally training and comparing the models. Without any further delay, let’s begin!
To build the model, we will use the popular BigMart sales dataset available on Kaggle. Search for the dataset on Kaggle, open its page, and download the CSV file to your system.
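If you prefer to fetch the file programmatically instead of through the browser, here is a minimal optional sketch using kagglehub (available on recent Colab runtimes, or installable with pip install kagglehub). The dataset handle below is a placeholder, so replace it with the real one from the dataset’s Kaggle page; if you go this route, you can skip the upload step that follows.
# Optional: programmatic download with kagglehub.
# "<owner>/<bigmart-dataset>" is a placeholder -- copy the actual handle
# from the dataset's URL on Kaggle (kaggle.com/datasets/<owner>/<name>).
import kagglehub

path = kagglehub.dataset_download("<owner>/<bigmart-dataset>")
print("Dataset files downloaded to:", path)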
Now that you have downloaded the file, upload it to Google Colab using the code below:
from google.colab import files
# Upload the CSV files
uploaded = files.upload()
This will prompt you to choose a file from your system. Select the bigmart.csv file you just downloaded.
Now, use the code below to load the dataset into a Pandas DataFrame and preview the first 5 rows:
import pandas as pd
# Load the dataset
data = pd.read_csv('bigmart.csv')
# Display the first few rows
data.head()
Doing so will help you verify that the dataset is loaded correctly.
Output:
| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | FDA15 | 9.30 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.1380 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.50 | Low Fat | 0.016760 | Meat | 141.6180 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.2700 |
| 3 | FDX07 | 19.20 | Regular | 0.000000 | Fruits and Vegetables | 182.0950 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.3800 |
| 4 | NCD19 | 8.93 | Low Fat | 0.000000 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |
The dataset is loaded. Now we will explore it to understand its structure. Doing so will also help us pinpoint any issues that may exist.
Here’s the code to do so:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Basic info
print("Shape of dataset:", data.shape)
print("\nColumns:\n", data.columns)
print("\nMissing values:\n", data.isnull().sum())
# Display first 5 rows
display(data.head())
# Plot sales distribution
plt.figure(figsize=(6,4))
sns.histplot(data['Item_Outlet_Sales'], bins=30, kde=True)
plt.title("Distribution of Item Outlet Sales")
plt.xlabel("Sales")
plt.ylabel("Frequency")
plt.show()
Output:
Shape of dataset: (8523, 12)
Columns:
Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
'Item_Type', 'Item_MRP', 'Outlet_Identifier',
'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
'Outlet_Type', 'Item_Outlet_Sales'],
dtype='object')
Missing values:
Item_Identifier 0
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64
(The display(data.head()) call renders the same five-row preview shown in the table above.)
What does this output convey?
The output tells us that:
- The dataset has 8,523 rows and 12 columns.
- Item_Weight has 1,463 missing values and Outlet_Size has 2,410; every other column is complete.
- The sales histogram is right-skewed: most items have low-to-moderate sales, with a long tail of high sellers.
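To quantify those gaps, here is a quick optional check (an addition to the original walkthrough) that converts the missing-value counts into percentages:
# Optional: percentage of missing values per column, shown only where gaps exist
missing_pct = data.isnull().mean() * 100
print(missing_pct[missing_pct > 0].round(2))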
As there are missing values, let’s clean the dataset by handling them. We will also convert the categorical columns to numeric using Label Encoding.
Here is the code to accomplish the same:
from sklearn.preprocessing import LabelEncoder
# Fill missing values (mean for the numeric column, mode for the categorical one)
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())
data['Outlet_Size'] = data['Outlet_Size'].fillna(data['Outlet_Size'].mode()[0])
# Encode categorical columns; apply() fits a fresh encoding for each column
le = LabelEncoder()
cat_cols = ['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
            'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
data[cat_cols] = data[cat_cols].apply(le.fit_transform)
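As a quick optional sanity check, not part of the original code, you can confirm that no missing values remain and that the features are now numeric (Item_Identifier stays a string; we drop it before modeling):
# Verify the cleaning: expect 0 remaining missing values.
# All columns except Item_Identifier should now show numeric dtypes.
print("Remaining missing values:", data.isnull().sum().sum())
print(data.dtypes)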
Now the dataset is ready for model training. Let’s move ahead.
In this step, we will explore visual patterns. We will use Seaborn and Matplotlib to visualize relationships between features and the target variable Item_Outlet_Sales.
Use the code below to do so:
import seaborn as sns
import matplotlib.pyplot as plt
# Set the style for better visuals
sns.set(style="whitegrid")
# 1. Distribution of Item Outlet Sales
plt.figure(figsize=(8, 4))
sns.histplot(data['Item_Outlet_Sales'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Item Outlet Sales')
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.show()
# 2. Item MRP vs Sales
plt.figure(figsize=(8, 4))
sns.scatterplot(x='Item_MRP', y='Item_Outlet_Sales', data=data, hue='Outlet_Type')
plt.title('Item MRP vs Item Outlet Sales')
plt.xlabel('Item MRP')
plt.ylabel('Sales')
plt.show()
# 3. Sales by Outlet Type
plt.figure(figsize=(8, 4))
sns.boxplot(x='Outlet_Type', y='Item_Outlet_Sales', data=data)
plt.title('Sales by Outlet Type')
plt.xticks(rotation=45)
plt.xlabel('Outlet Type')
plt.ylabel('Sales')
plt.show()
Output:
(The code renders three plots: a histogram of Item_Outlet_Sales, a scatter plot of Item_MRP vs. sales colored by outlet type, and a box plot of sales by outlet type.)
What does this output convey?
- Distribution of Item Outlet Sales: the distribution is right-skewed; most items record low-to-moderate sales, with a long tail of high sellers.
- Item MRP vs Item Outlet Sales (colored by Outlet Type): sales tend to rise with Item_MRP, and the different outlet types form visibly distinct bands.
- Sales by Outlet Type (box plot): supermarkets show clearly higher median sales than grocery stores, which cluster near the bottom.
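If you would like one consolidated view of how the (now numeric) features relate to sales, a correlation heatmap is a quick optional addition to the plots above:
# Optional: correlation heatmap across all numeric features
plt.figure(figsize=(10, 6))
sns.heatmap(data.drop(columns=['Item_Identifier']).corr(),
            annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()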
In this step, we will train the models and evaluate them. Let’s start with Linear Regression.
Here is the code to do so:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Drop 'Item_Identifier' as it's not useful for prediction
X = data.drop(columns=['Item_Identifier', 'Item_Outlet_Sales'])
y = data['Item_Outlet_Sales']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("R² Score:", r2)
print("Mean Squared Error:", mse)
Output:
R² Score: 0.5248926313247789
Mean Squared Error: 1291327.6064882863
What does this output convey?
- R² Score: 0.52: the model explains about 52% of the variance in sales, a reasonable baseline for a simple linear model.
- Mean Squared Error: 1,291,328: the average squared difference between predicted and actual sales. Since individual sales run into the thousands, squared errors of this magnitude are expected; lower is better.
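Because MSE is expressed in squared sales units, you may find the root mean squared error easier to read. This small optional addition converts the error back to the sales scale:
import numpy as np

# RMSE is in the same units as Item_Outlet_Sales, so it is easier to interpret
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)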
Next, let’s train a Decision Tree Regressor to see whether a non-linear model performs better. Here’s the code:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error
# Initialize and train the Decision Tree model
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)
# Make predictions
y_pred_tree = tree_model.predict(X_test)
# Evaluate the model
r2_tree = r2_score(y_test, y_pred_tree)
mse_tree = mean_squared_error(y_test, y_pred_tree)
print("Decision Tree Regressor Performance:")
print("R² Score:", r2_tree)
print("Mean Squared Error:", mse_tree)
Output:
Decision Tree Regressor Performance:
R² Score: 0.15745401301402862
Mean Squared Error: 2290014.772375913
What does this output convey?
- R² Score is 0.157: the tree explains only about 16% of the variance on the test set, far worse than Linear Regression. A fully grown decision tree memorizes the training data and generalizes poorly.
- Mean Squared Error (MSE) is 2,290,015: nearly double the linear model’s error, which confirms the overfitting.
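A common remedy, offered here as an optional sketch rather than part of the original walkthrough, is to cap the tree’s depth so it cannot memorize the training data. The max_depth=5 below is an illustrative choice, not a tuned value:
# A shallower tree usually generalizes better than a fully grown one
tree_pruned = DecisionTreeRegressor(max_depth=5, random_state=42)
tree_pruned.fit(X_train, y_train)
print("Pruned tree R² Score:", r2_score(y_test, tree_pruned.predict(X_test)))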
Finally, let’s train a Random Forest Regressor, an ensemble of decision trees that typically generalizes better than a single tree. Here is the code:
# Import the model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
# Initialize and train the model
model_rf = RandomForestRegressor(random_state=42)
model_rf.fit(X_train, y_train)
# Make predictions
y_pred_rf = model_rf.predict(X_test)
# Evaluate the model
r2 = r2_score(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)
print("Random Forest Regressor Performance:")
print("R² Score:", r2)
print("Mean Squared Error:", mse)
Output:
Random Forest Regressor Performance:
R² Score: 0.5702020620056147
Mean Squared Error: 1168177.9301623292
What does this output convey?
- R² Score: 0.57: the ensemble explains about 57% of the variance in sales, the best result among the three models.
- Mean Squared Error: 1,168,178: the lowest error of the three; averaging many decorrelated trees smooths out the overfitting we saw with a single tree.
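Since the project’s stated goal is to understand which product and store features influence sales the most, a quick optional look at the forest’s feature importances is a natural follow-up:
# Rank features by how much they contribute to the forest's splits
importances = pd.Series(model_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))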
Now that we have seen the performance of all three models, let’s quickly compare their results on the test data:

| Model | R² Score | Mean Squared Error |
| --- | --- | --- |
| Linear Regression | 0.52 | 1,291,328 |
| Decision Tree Regressor | 0.16 | 2,290,015 |
| Random Forest Regressor | 0.57 | 1,168,178 |
Compared to the other two, the Random Forest Regressor performed the best. It had the highest R² (0.57), meaning it explained about 57% of the variation in sales, and its Mean Squared Error is the lowest of the three, so it is also the most accurate.
The Decision Tree Regressor, by contrast, performed poorly due to overfitting, while Linear Regression did reasonably well but could not match the ensemble.
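One caveat: all of these scores come from a single train/test split. As an optional robustness check that goes beyond the original walkthrough, five-fold cross-validation gives a more stable estimate of the Random Forest’s performance:
from sklearn.model_selection import cross_val_score

# Evaluate R² across five different train/test partitions instead of one
cv_scores = cross_val_score(RandomForestRegressor(random_state=42), X, y,
                            cv=5, scoring='r2')
print(f"Cross-validated R²: {cv_scores.mean():.2f} ± {cv_scores.std():.2f}")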
Colab Link:
https://colab.research.google.com/drive/15zuGdLE0y7bsOHwe1ubpu-LJJq6YV2tW?usp=sharing#scrollTo=tadwXXOmXTTP