Bigmart Sales Dataset Analysis and Prediction Using Machine Learning

By Rohit Sharma

Updated on Jul 31, 2025 | 9 min read | 1.68K+ views


Predicting product sales is a key challenge for any retail business. The Bigmart sales dataset is a well-known machine learning dataset that is used to forecast item-level sales across multiple stores.

In this project, we aim to predict Item_Outlet_Sales using various product and outlet features. The dataset, sourced from Kaggle, contains sales data from 2013 with attributes such as Item_Type, Item_Fat_Content, Item_Visibility, Item_MRP, Outlet_Size, and Outlet_Location_Type.

The goal is to build a predictive model that helps BigMart understand which product and store features influence sales the most.

What Should You Know Beforehand?

It is better to have at least some background in:

  • Python programming basics
  • Pandas and NumPy for data handling
  • Basic machine learning concepts, especially regression

Technologies and Libraries Used

For this project, we will be using the following tools and libraries:

| Tool/Library | Purpose |
| --- | --- |
| Python | Core programming language |
| Pandas | Loading, cleaning, and analyzing structured data |
| NumPy | Numerical operations and array handling |
| Matplotlib | Static visualizations such as bar plots and histograms |
| Seaborn | Statistical plotting and relationship visualization |
| Scikit-learn | Preprocessing, regression modeling, and model evaluation |
| Google Colab | Cloud-based Jupyter environment to write and run Python code |
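If you are running locally rather than on Colab, it is worth confirming these libraries are installed before starting. A quick version check (Colab ships with all of them preinstalled; the exact versions you see will vary):

# Confirm the required libraries are importable and print their versions
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import sklearn

for name, lib in [('pandas', pd), ('numpy', np), ('matplotlib', matplotlib),
                  ('seaborn', sns), ('scikit-learn', sklearn)]:
    print(f"{name}: {lib.__version__}")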

Models That Will Be Utilized for Learning

Here are the models that we will be utilizing: 

  • Linear Regression: assumes a linear relationship between the input features and the sales output. Useful as a baseline model.
  • Decision Tree Regressor: a tree-based model that splits the dataset on feature thresholds. It captures non-linear relationships and works well with mixed data types.
  • Random Forest Regressor: an ensemble of decision trees. It improves prediction accuracy and reduces overfitting by averaging the outputs of many trees.

Time Taken and Difficulty

You can complete this Bigmart sales dataset regression project in about 4 to 6 hours. It’s a beginner-to-intermediate level machine learning project. You will get hands-on experience with supervised regression, data preprocessing, and predicting retail sales.

How to Build the Bigmart Sales Dataset Project

Let’s start building the project from scratch. We will:

  1. Load and explore the dataset
  2. Handle missing values and encode categorical features
  3. Visualize relationships between product and outlet attributes
  4. Train and evaluate multiple regression models

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the model, we will use the popular BigMart sales dataset available on Kaggle. Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser.
  2. Go to https://www.kaggle.com/datasets/yasserh/bigmartsalesdataset.
  3. On the Bigmart Sales Dataset page, in the right pane, under the Data Explorer section, click bigmart.csv.
  4. Click the download icon.

Step 2: Upload and Load the Dataset in Google Colab

Now that you have downloaded the file, upload it to Google Colab using the code below:

from google.colab import files

# Upload the CSV files
uploaded = files.upload()

This will prompt you to choose a file from your system. Select the bigmart.csv file you just downloaded.
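If you would rather not re-upload the file every session, you can mount Google Drive instead. A minimal sketch, assuming you have copied bigmart.csv to the top level of My Drive (adjust the path to wherever you saved it, and pass csv_path to pd.read_csv below instead of the bare filename):

from google.colab import drive

# Mount Google Drive into the Colab filesystem (prompts for authorization)
drive.mount('/content/drive')

# Hypothetical path: assumes bigmart.csv sits directly in My Drive
csv_path = '/content/drive/MyDrive/bigmart.csv'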

Now, use the code below to load the dataset into a Pandas DataFrame and preview the first 5 rows:

import pandas as pd

# Load the dataset
data = pd.read_csv('bigmart.csv')

# Display the first few rows
data.head()

Doing so will help you verify that the dataset is loaded correctly.

Output:

 

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | FDA15 | 9.30 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.1380 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.50 | Low Fat | 0.016760 | Meat | 141.6180 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.2700 |
| 3 | FDX07 | 19.20 | Regular | 0.000000 | Fruits and Vegetables | 182.0950 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.3800 |
| 4 | NCD19 | 8.93 | Low Fat | 0.000000 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |

Step 3: Exploratory Data Analysis (EDA)

The dataset is loaded. Now we will explore it to understand its structure and spot any issues.

Here’s the code to do so:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Basic info
print("Shape of dataset:", data.shape)
print("\nColumns:\n", data.columns)
print("\nMissing values:\n", data.isnull().sum())

# Display first 5 rows
display(data.head())

# Plot sales distribution
plt.figure(figsize=(6,4))
sns.histplot(data['Item_Outlet_Sales'], bins=30, kde=True)
plt.title("Distribution of Item Outlet Sales")
plt.xlabel("Sales")
plt.ylabel("Frequency")
plt.show()

Output:

Shape of dataset: (8523, 12)

Columns:

Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Outlet_Sales'],
      dtype='object')

Missing values:

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

 

(The first five rows shown by display(data.head()) are the same as the preview in Step 2.)

What does this output convey?

The output tells us that:

  • The dataset contains 8,523 rows and 12 columns.
  • Two columns have missing values:
    • Item_Weight: 1,463 missing values
    • Outlet_Size: 2,410 missing values
  • The sales distribution plot is right-skewed: most products have low sales, while a few sell far more.
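Before preprocessing, it is also worth checking the categorical labels themselves. Classic copies of the BigMart data are known to mix spellings in Item_Fat_Content (for example 'LF' and 'low fat' alongside 'Low Fat'), and a quick value_counts() will show whether this copy does the same:

# Check the categorical labels for inconsistent spellings
print(data['Item_Fat_Content'].value_counts())

# Summary statistics for the numeric columns
print(data.describe())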

Step 4: Data Preprocessing

As there are missing values, let’s clean the dataset by handling them. We will also convert the categorical columns to numeric using label encoding.

Here is the code to accomplish the same:

from sklearn.preprocessing import LabelEncoder

# Fill missing values: mean for the numeric weight, mode for the categorical size
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())
data['Outlet_Size'] = data['Outlet_Size'].fillna(data['Outlet_Size'].mode()[0])

# Encode each categorical column as integers
le = LabelEncoder()
cat_cols = ['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
            'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']

data[cat_cols] = data[cat_cols].apply(le.fit_transform)
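Before moving on, a quick sanity check confirms that the missing values are gone and the encoded columns are now numeric:

# Sanity check: no missing values should remain,
# and the encoded columns should now be integer-typed
print("Remaining missing values:", data.isnull().sum().sum())
print(data[cat_cols].dtypes)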

Now the dataset is ready for model training. Let’s move ahead. 

Step 5: Visualize Key Relationships

In this step, we will explore visual patterns. We will use Seaborn and Matplotlib to visualize relationships between features and the target variable Item_Outlet_Sales.

Use the code below to do so:

import seaborn as sns
import matplotlib.pyplot as plt

# Set the style for better visuals
sns.set(style="whitegrid")

# 1. Distribution of Item Outlet Sales
plt.figure(figsize=(8, 4))
sns.histplot(data['Item_Outlet_Sales'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Item Outlet Sales')
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.show()

# 2. Item MRP vs Sales
plt.figure(figsize=(8, 4))
sns.scatterplot(x='Item_MRP', y='Item_Outlet_Sales', data=data, hue='Outlet_Type')
plt.title('Item MRP vs Item Outlet Sales')
plt.xlabel('Item MRP')
plt.ylabel('Sales')
plt.show()

# 3. Sales by Outlet Type
plt.figure(figsize=(8, 4))
sns.boxplot(x='Outlet_Type', y='Item_Outlet_Sales', data=data)
plt.title('Sales by Outlet Type')
plt.xticks(rotation=45)
plt.xlabel('Outlet Type')
plt.ylabel('Sales')
plt.show()

Output:

(Three plots are produced: the sales distribution histogram, the Item MRP vs. sales scatter plot colored by outlet type, and the box plot of sales by outlet type.)

What does this output convey?

Distribution of Item Outlet Sales:

  • The majority of items have sales below ₹3000–₹4000.
  • The distribution is right-skewed, meaning a few items have very high sales.
  • Most items don’t sell in large quantities, but some high-performing products significantly increase the average.

Item MRP vs Item Outlet Sales (Colored by Outlet Type)

  • Items with higher MRP generally show greater sales variation.
  • No clear linear relationship between MRP and sales, but clusters exist.
  • Outlet type influences how many items sell, but it’s not the only factor. There is overlap between outlet types.

Sales by Outlet Type (Box Plot)

  • Outlet type 3 (Supermarket Type3 after label encoding) tends to have higher median sales than the others.
  • All outlet types show outliers, meaning a few stores perform exceptionally well.
  • There’s noticeable variation in sales within each outlet type, indicating performance differs store-to-store.
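Because the categorical columns were label-encoded in Step 4, every feature is now numeric, so a correlation heatmap is a quick way to see which features track Item_Outlet_Sales. A short sketch (Item_Identifier is dropped since it is just an ID string):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over all numeric features
corr = data.drop(columns=['Item_Identifier']).corr()

plt.figure(figsize=(10, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()

In this dataset, Item_MRP typically shows the strongest positive correlation with sales, which matches the scatter plot above.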

Step 6: Build and Train the Model

In this step, we will train and evaluate each model, starting with Linear Regression.

Model 1: Linear Regression

Here is the code to do so:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Drop 'Item_Identifier' as it's not useful for prediction
X = data.drop(columns=['Item_Identifier', 'Item_Outlet_Sales'])
y = data['Item_Outlet_Sales']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print("R² Score:", r2)
print("Mean Squared Error:", mse)

Output:

R² Score: 0.5248926313247789
Mean Squared Error: 1291327.6064882863

What does this output convey?

R² Score: 0.52

  • The model explains about 52% of the variation in sales.
  • This indicates moderate prediction accuracy.

Mean Squared Error: 1,291,327

  • Shows the average squared difference between actual and predicted values.
  • A lower value is better, so this suggests there’s still some prediction error.
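Because MSE is in squared units, taking its square root (the RMSE) puts the error back in the same units as sales, which makes it easier to read. A quick computation using the mse variable from above:

import numpy as np

# RMSE is in the same units as Item_Outlet_Sales
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)  # roughly 1,136 for the MSE above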

Model 2: Decision Tree Regressor

Here’s the code:

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Initialize and train the Decision Tree model
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)

# Make predictions
y_pred_tree = tree_model.predict(X_test)

# Evaluate the model
r2_tree = r2_score(y_test, y_pred_tree)
mse_tree = mean_squared_error(y_test, y_pred_tree)

print("Decision Tree Regressor Performance:")
print("R² Score:", r2_tree)
print("Mean Squared Error:", mse_tree)

Output:

Decision Tree Regressor Performance:
R² Score: 0.15745401301402862
Mean Squared Error: 2290014.772375913

What does this output convey?

R² Score: 0.157

  • The model explains only 15.7% of the variation in sales, so it is not capturing the underlying patterns well.

Mean Squared Error: 2,290,015

  • This is the average squared error between actual and predicted sales; a higher number means larger prediction mistakes.
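A fully grown tree memorizes the training data, which is the likely cause of this poor score. Restricting the tree's depth usually helps; here is a minimal sketch, where max_depth=5 is an arbitrary starting value worth tuning:

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Limiting depth reduces overfitting; max_depth=5 is an illustrative choice
pruned_tree = DecisionTreeRegressor(max_depth=5, random_state=42)
pruned_tree.fit(X_train, y_train)

print("Pruned tree R²:", r2_score(y_test, pruned_tree.predict(X_test)))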

Model 3: Random Forest Regressor

Here is the code:

# Import the model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Initialize and train the model
model_rf = RandomForestRegressor(random_state=42)
model_rf.fit(X_train, y_train)

# Make predictions
y_pred_rf = model_rf.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)

print("Random Forest Regressor Performance:")
print("R² Score:", r2)
print("Mean Squared Error:", mse)

Output:

Random Forest Regressor Performance:
R² Score: 0.5702020620056147
Mean Squared Error: 1168177.9301623292

What does this output convey?

R² Score: 0.57:

  • This means the model explains 57% of the variation in sales data.
  • It's better than the Linear Regression (52%) and Decision Tree (15%) models.
  • Higher R² = more accurate predictions.

Mean Squared Error: 1,168,178

  • This shows the average squared difference between actual and predicted sales.
  • Lower value = better performance.
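Beyond accuracy, a trained random forest also reports how much each feature contributed to its predictions, which speaks directly to the project's goal of finding the most influential features. A quick look, using model_rf and X from above:

import pandas as pd

# Importances sum to 1; higher means the feature mattered more to the forest
importances = pd.Series(model_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

Item_MRP and Outlet_Type typically come out on top, in line with the patterns from Step 5.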

Step 7: Compare Model Performances

Now that we have seen all three models, let’s quickly compare their results on the test data:

| Model | R² Score | Mean Squared Error |
| --- | --- | --- |
| Linear Regression | 0.52 | 1,291,328 |
| Decision Tree Regressor | 0.15 | 2,290,015 |
| Random Forest Regressor | 0.57 | 1,168,178 |
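One caveat: a single 80/20 split can be sensitive to the random seed. Re-checking the ranking with k-fold cross-validation gives a more stable comparison; the sketch below uses 5 folds, a conventional but arbitrary choice:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}

# Mean R² across 5 folds smooths out the luck of any single split
for name, m in models.items():
    scores = cross_val_score(m, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R² = {scores.mean():.3f}")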

Conclusion

Among the three models, the Random Forest Regressor performed best. It had the highest R² (0.57), meaning it explained 57% of the variation in sales, and its Mean Squared Error was the lowest, so it stands first in accuracy.

The Decision Tree Regressor, by contrast, performed poorly, while Linear Regression did reasonably well but could not match the ensemble.


Colab Link:
https://colab.research.google.com/drive/15zuGdLE0y7bsOHwe1ubpu-LJJq6YV2tW?usp=sharing#scrollTo=tadwXXOmXTTP

Frequently Asked Questions (FAQs)

1. What is the Bigmart Sales Dataset?

A retail dataset available on Kaggle containing 2013 item-level sales for products sold across BigMart outlets: 8,523 rows and 12 columns covering product and store attributes.

2. What is the main goal of a Bigmart Sales Dataset project?

To predict Item_Outlet_Sales from product and outlet features, and to understand which of those features influence sales the most.

3. Which algorithms are suitable for this project?

Regression models such as Linear Regression (as a baseline), Decision Tree Regressor, and Random Forest Regressor. In this project, the Random Forest performed best.

4. What are the key steps in working with this dataset?

Load and explore the data, handle missing values, encode categorical features, visualize key relationships, then train and compare regression models.

5. What tools and libraries are recommended for this project?

Python with Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn, run in a notebook environment such as Google Colab.

Rohit Sharma

803 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
