Demand Forecasting for E-commerce Using Python (Machine Learning Project)

By Rohit Sharma

Updated on Jul 29, 2025 | 1.15K+ views


Demand Forecasting for E-commerce is a critical application of data science that helps online retailers to optimize stock levels, reduce wastage, and meet customer demand efficiently. 

In this project, you'll use Python to analyze historical sales data, identify trends and seasonality, and build a machine learning model to forecast future demand. This hands-on approach will help you master time series analysis and regression techniques tailored for real-world e-commerce scenarios.

Supercharge your data science career with upGrad’s top-rated Online Data Science Courses. Learn Python, Machine Learning, AI, SQL, Tableau, and more, taught by industry experts. Build real-world skills and get job-ready. Start learning today!

Turn ideas into action, explore our top Python Data Science Projects, and start building today.

What Should You Know to Build This Project Successfully?

It’s helpful to have some basic knowledge of the following before starting this project:

  • Python programming (variables, functions, loops, basic syntax)
  • Pandas and Numpy (data loading, cleaning, and numerical operations)
  • Matplotlib or Seaborn (data visualization)
  • Scikit‑learn basics (how to train a regression model, make predictions, and evaluate performance using metrics such as MAE, RMSE, and R²)
  • Intro to time series concepts (trend, seasonality, and autocorrelation; these ideas help when preparing data for forecasting models, and a short sketch follows this list)
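
If autocorrelation is new to you, here is a minimal, self-contained sketch (with made-up numbers) of how a rolling mean exposes a trend and how lag-7 autocorrelation hints at a weekly cycle:

import numpy as np
import pandas as pd

# Hypothetical daily sales: a gentle upward trend plus a weekly cycle
days = pd.date_range('2024-01-01', periods=60, freq='D')
sales = pd.Series(100 + 2 * np.arange(60) + 20 * np.sin(2 * np.pi * np.arange(60) / 7), index=days)

# A 7-day rolling mean smooths out the weekly cycle and reveals the trend
print(sales.rolling(window=7).mean().tail(3))

# A high autocorrelation at lag 7 suggests a weekly pattern worth encoding as a feature
print(f"Lag-7 autocorrelation: {sales.autocorr(lag=7):.2f}")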

Also Read- Python Tutorial: Learn Python from Scratch

Start your data science career journey with upGrad’s top-ranked courses and gain the opportunity to learn directly from experienced industry mentors.

Tools That Power the Forecast: Tech Stack and Libraries Explained

To build this e-commerce demand forecasting model, you'll work with a powerful set of Python tools and libraries designed for data analysis, modeling, and visualization:

Tool / Library | Purpose
Python | The main programming language used to build the model
Google Colab | Cloud-based environment for writing, running, and sharing code
Pandas | Reads data, cleans missing values, and wrangles time series
NumPy | Performs fast numerical operations and array manipulations
Matplotlib / Seaborn | Visualizes trends, patterns, and actual vs. predicted values
Scikit-learn | Trains the regression model and evaluates its performance
Datetime / statsmodels | Handles timestamps and enhances forecasting with statistical tools
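
The table lists statsmodels, although the walkthrough below sticks to pandas and scikit-learn. If you want a taste of what it adds, here is a small sketch (on synthetic data, not the project dataset) that splits a series into trend, seasonal, and residual parts with seasonal_decompose:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic daily series: level + trend + weekly seasonality
days = pd.date_range('2024-01-01', periods=56, freq='D')
demo = pd.Series(200 + np.arange(56) + 30 * np.sin(2 * np.pi * np.arange(56) / 7), index=days)

result = seasonal_decompose(demo, model='additive', period=7)
print(result.seasonal.head(7))        # the repeating weekly pattern
print(result.trend.dropna().head(3))  # the slow-moving level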

Also Read - Top 6 Python IDEs of 2025 That Will Change Your Workflow!

How Long Does It Take and What to Expect?

You can finish this E-commerce Demand Forecasting project in 3 to 4 hours. It's well suited to beginner and intermediate learners.

Smart Forecasting: Techniques That Drive E-commerce Demand Prediction

To build a reliable demand forecasting model for e-commerce, you'll apply key techniques that help uncover patterns in historical sales and predict future demand:

  • Linear Regression: Predicts future product demand from past sales data.
  • Time Series Analysis (Lag Features, Rolling Means): Uses lag values and rolling averages to capture trends and seasonality; a minimal illustration follows below.

These tools help build accurate, data-driven forecasts for smarter inventory planning.
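
To make those two ideas concrete, here is a tiny illustration (hypothetical numbers) of what lag and rolling-mean features look like before we build them at full scale in Step 5:

import pandas as pd

sales = pd.Series([12, 15, 14, 18, 20, 19, 22, 25], name='units_sold')
features = pd.DataFrame({
    'units_sold': sales,                               # today's demand (the target)
    'lag_1': sales.shift(1),                           # yesterday's demand
    'rolling_mean_3': sales.rolling(window=3).mean(),  # 3-day average smooths out noise
})
print(features)

A regression model can then learn how yesterday's demand and the recent average relate to today's.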

Boost your predictive modeling skills with this free Step-by-Step Linear Regression Course, perfect for mastering feature engineering, model evaluation, and building accurate forecasts.

How to Build an E-commerce Demand Forecasting Model

Let’s build this project from scratch with clear, step-by-step guidance:

  1. Download the Dataset
  2. Upload and Read the Dataset in Google Colab
  3. Data Cleaning and Preprocessing
  4. Feature Engineering and Time Series Creation
  5. Creating Time-Based Features for the Model
  6. Defining Features and Target for the Forecast
  7. Splitting the Data and Training the Model
  8. Model Prediction and Evaluation
  9. Visualizing Actual vs. Predicted Sales

Without any further delay, let’s get started!

Step 1: Download the Dataset

Download the dataset from Kaggle, extract the ZIP file, and keep the extracted CSV file handy; the code below reads it as Dataset.csv.

Now that you’ve downloaded the dataset, let’s move on to the next step, uploading and loading it into Google Colab.

Step 2: Upload and Read the Dataset in Google Colab

Now that you have downloaded the dataset, upload the CSV file to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()

Once uploaded, use the following Python code to read and check the data and import the required libraries:

# main.py
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# --- 1. Data Loading and Initial Exploration ---
try:
    df = pd.read_csv('Dataset.csv', encoding='ISO-8859-1')
except FileNotFoundError:
    print("Error: 'Dataset.csv' not found. Please ensure the file is in the correct directory.")
    exit()
print("--- Initial Data Overview ---")
print("First 5 rows of the dataset:")
print(df.head())
print("\nDataset Information:")
df.info()

Output:

--- Initial Data Overview ---

First 5 rows of the dataset:

  InvoiceNo StockCode                          Description  Quantity \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6
1    536365     71053                  WHITE METAL LANTERN         6
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6

        InvoiceDate  UnitPrice  CustomerID         Country
0  01-12-2010 08:26       2.55     17850.0  United Kingdom
1  01-12-2010 08:26       3.39     17850.0  United Kingdom
2  01-12-2010 08:26       2.75     17850.0  United Kingdom
3  01-12-2010 08:26       3.39     17850.0  United Kingdom
4  01-12-2010 08:26       3.39     17850.0  United Kingdom

Dataset Information:

RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   InvoiceNo    541909 non-null  object
 1   StockCode    541909 non-null  object
 2   Description  540455 non-null  object
 3   Quantity     541909 non-null  int64
 4   InvoiceDate  541909 non-null  object
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object
dtypes: float64(2), int64(1), object(5)

Step 3: Data Cleaning and Preprocessing

We’ll handle missing customer IDs, convert the invoice timestamps to proper datetimes, and remove returns and zero-priced entries so the model trains only on valid sales.

Here is the code for this step:

# Drop rows with missing CustomerID
df.dropna(subset=['CustomerID'], inplace=True)
print(f"\nDropped rows with missing CustomerID. New shape: {df.shape}")

# Convert InvoiceDate to datetime
# Note: these timestamps are in DD-MM-YYYY form; if pandas swaps day and month,
# pass dayfirst=True along with format='mixed'
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='mixed')

# Remove returns/cancellations (Quantity < 0)
df = df[df['Quantity'] > 0]
print(f"Removed returned items. New shape: {df.shape}")

# Remove zero or negative UnitPrice entries
df = df[df['UnitPrice'] > 0]
print(f"Removed items with zero or negative price. New shape: {df.shape}")

Output:

Dropped rows with missing CustomerID. New shape: (406829, 8)

Removed returned items. New shape: (397924, 8)

Removed items with zero or negative price. New shape: (397884, 8)
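
One optional extra, not part of the original walkthrough: transactional exports like this often contain exact duplicate rows, and dropping them before aggregation is a one-liner worth considering:

# Optional: remove exact duplicate rows, if any, before aggregating
before = df.shape[0]
df = df.drop_duplicates()
print(f"Dropped {before - df.shape[0]} duplicate rows. New shape: {df.shape}")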

Also Read - Data Cleaning Techniques: 15 Simple & Effective Ways To Clean Data

Step 4: Feature Engineering and Time Series Creation

To forecast sales, we need to convert raw transactional data into a structured time series format. This step includes calculating the total price and aggregating daily sales quantities. Note that resampling to a daily frequency inserts a zero-quantity row for any calendar day with no recorded transactions, which is why zeros appear in the output below.

Tasks:

  • Create a TotalPrice column
  • Aggregate daily quantity sold into a time series format

Here is the code for this step:

# Create a 'TotalPrice' column
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Aggregate total quantity sold per day
daily_sales = df.set_index('InvoiceDate').resample('D')['Quantity'].sum().reset_index()
daily_sales.rename(columns={'InvoiceDate': 'Date', 'Quantity': 'TotalQuantity'}, inplace=True)
print("\nAggregated daily sales data (first 5 rows):")
print(daily_sales.head())

Output: 

Aggregated daily sales data (first 5 rows):

        Date  TotalQuantity
0 2010-01-12          24215
1 2010-01-13              0
2 2010-01-14              0
3 2010-01-15              0
4 2010-01-16              0

Also Read- Feature Engineering for Machine Learning: Process, Techniques, and Examples

Step 5: Creating Time-Based Features for the Model

Now that we have a clean time series, the next step is to engineer features that help the model understand patterns in time, like weekdays, seasonality, and recent trends.

Here is the code for this step:

# Set Date as index for easy feature creation
daily_sales.set_index('Date', inplace=True)

# Calendar-based features
daily_sales['dayofweek'] = daily_sales.index.dayofweek
daily_sales['month'] = daily_sales.index.month
daily_sales['year'] = daily_sales.index.year
daily_sales['dayofyear'] = daily_sales.index.dayofyear

# Lag features
daily_sales['lag_1'] = daily_sales['TotalQuantity'].shift(1)
daily_sales['lag_7'] = daily_sales['TotalQuantity'].shift(7)

# Rolling window feature (7-day rolling mean)
# Caution: this window includes the current day's TotalQuantity (the prediction target);
# to avoid leakage, you could instead use: daily_sales['TotalQuantity'].shift(1).rolling(window=7).mean()
daily_sales['rolling_mean_7'] = daily_sales['TotalQuantity'].rolling(window=7).mean()

# Drop rows with NaN values caused by lags and rolling calculations
daily_sales.dropna(inplace=True)
print("\nCreated time-based, lag, and rolling features.")
print("Final features for the model (first 5 rows):")
print(daily_sales.head())

Output:

Created time-based, lag, and rolling features.

Final features for the model (first 5 rows):

            TotalQuantity  dayofweek  month  year  dayofyear  lag_1    lag_7  \
Date
2010-01-19              0          1      1  2010         19    0.0  24215.0
2010-01-20              0          2      1  2010         20    0.0      0.0
2010-01-21              0          3      1  2010         21    0.0      0.0
2010-01-22              0          4      1  2010         22    0.0      0.0
2010-01-23              0          5      1  2010         23    0.0      0.0

            rolling_mean_7
Date
2010-01-19             0.0
2010-01-20             0.0
2010-01-21             0.0
2010-01-22             0.0
2010-01-23             0.0

Step 6: Defining Features and Target for the Forecast

Now that we’ve created relevant time-based features, it’s time to separate them into input features (X) and the output we want to predict (y).

Here, TotalQuantity is the target variable, and all other columns help in making predictions.

Here is the code for this step:

# The target 'y' is the total daily quantity we want to predict.
y = daily_sales['TotalQuantity']

# The features 'X' are all the columns we created to help predict the target.
X = daily_sales.drop('TotalQuantity', axis=1)
print("\nShape of Feature Matrix (X):", X.shape)
print("Shape of Target Vector (y):", y.shape)

Output:

Shape of Feature Matrix (X): (691, 7)

Shape of Target Vector (y): (691,)

Step 7: Splitting the Data and Training the Model

For time series forecasting, it's important to maintain chronological order. We'll split the data based on time, training on earlier records and testing on more recent ones. Then we'll train a simple and effective Linear Regression model on the training data.

Here is the code:

# For time series data, we split chronologically to train on the past and test on the future.
split_ratio = 0.8
split_index = int(len(X) * split_ratio)
X_train = X[:split_index]
X_test = X[split_index:]
y_train = y[:split_index]
y_test = y[split_index:]
print(f"\nSplit data into {split_ratio*100:.0f}% training and {(1-split_ratio)*100:.0f}% testing.")
print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))

# --- 7. Model Training ---
print("\n--- Model Training ---")
# We'll use Linear Regression, a straightforward and effective model for regression tasks.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
print("Linear Regression model trained successfully.")

Output:

Split data into 80% training and 20% testing.

Training set size: 552

Testing set size: 139

Linear Regression model trained successfully.
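
As an optional aside: a single 80/20 cut is fine for a first pass, but scikit-learn's TimeSeriesSplit produces several chronological train/test folds for a more robust estimate. A minimal sketch, reusing the X defined in Step 6:

from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on an expanding window of past days and tests on the days right after it
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}: train on {len(train_idx)} days, test on {len(test_idx)} days")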

Step 8: Model Prediction and Evaluation

Once the model is trained, it's time to test how well it performs on unseen data.

We’ll make predictions using the test set and evaluate the results using standard regression metrics like MAE, RMSE, and R².

Here is the Code for this step:

print("\n--- Model Evaluation ---")
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Predict on test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Display results
print(f"Mean Absolute Error (MAE): {mae:,.2f} units")
print(f"Root Mean Squared Error (RMSE): {rmse:,.2f} units")
print(f"R-squared (R²): {r2:.2f}")

Output:

Mean Absolute Error (MAE): 8,059.80 units

Root Mean Squared Error (RMSE): 11,250.11 units

R-squared (R²): 0.19
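
An R² of 0.19 is weak, so a sanity check (not part of the original walkthrough) is worth running: compare the model against a naive baseline that simply predicts yesterday's sales, using the lag_1 column already in X_test:

from sklearn.metrics import mean_absolute_error  # already imported above; repeated for safety

# Naive baseline: tomorrow's demand = today's demand
naive_pred = X_test['lag_1']
naive_mae = mean_absolute_error(y_test, naive_pred)
print(f"Naive lag-1 baseline MAE: {naive_mae:,.2f} units")
# If the trained model's MAE isn't clearly below this, the regression adds little value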

Also Read - Evaluation Metrics in Machine Learning: Top 10 Metrics You Should Know

Step 9: Visualizing Actual vs Predicted Sales

After evaluation, it's helpful to visualize how well the model predictions align with actual sales.

This line chart lets you see patterns, gaps, and overall prediction accuracy over time.

# --- Visualizing Results ---
import matplotlib.pyplot as plt
print("\n--- Visualizing Results ---")
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(15, 7))

# Plot actual vs predicted sales
ax.plot(y_test.index, y_test.values, label='Actual Daily Sales', color='blue', linewidth=2)
ax.plot(y_test.index, y_pred, label='Predicted Daily Sales', color='red', linestyle='--', linewidth=2)

# Customize the chart
ax.set_title('Demand Forecast: Actual vs. Predicted Sales', fontsize=16)
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('Total Quantity Sold', fontsize=12)
ax.legend()
plt.xticks(rotation=45)
plt.tight_layout()

# Display the plot
print("Displaying the plot of actual vs. predicted sales.")
plt.show()

Output: A line chart comparing actual daily sales (solid blue line) with predicted daily sales (dashed red line) across the test period.

Also Read - Top 15 Types of Data Visualization: Benefits and How to Choose the Right Tool for Your Needs in 2025

Conclusion

This project demonstrated how to forecast daily product demand using linear regression. You cleaned the sales data, created time-based features, and built a time series model to predict future sales. The model's accuracy was modest (an R² of 0.19 on the test set), and the final chart made it easy to compare actual and predicted demand. This is a solid starting point; you can improve accuracy by exploring more advanced models or adding external factors, as in the sketch below.
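
As one possible next step, here is a sketch (not the project's prescribed method) that swaps in a tree-based model, reusing the train/test split from Step 7:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# A tree ensemble can capture non-linear demand patterns that a straight line misses
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
rf_mae = mean_absolute_error(y_test, rf.predict(X_test))
print(f"Random Forest MAE: {rf_mae:,.2f} units")  # compare with the Linear Regression MAE from Step 8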

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!


Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Reference:
https://colab.research.google.com/drive/1EES-nOpZTlAnGOwOGpr_cqC49Vd3SzWq?usp=sharing

Frequently Asked Questions (FAQs)

1. What is demand forecasting in e-commerce?

2. How can machine learning be used for demand forecasting?

3. Why did this project use linear regression?

4. What features improve demand forecasting accuracy?

5. Can this model be used in real e-commerce applications?

Rohit Sharma

796 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
