Demand Forecasting for E-commerce Using Python (Machine Learning Project)
By Rohit Sharma
Updated on Jul 29, 2025 | 1.15K+ views
Demand forecasting for e-commerce is a critical application of data science that helps online retailers optimize stock levels, reduce wastage, and meet customer demand efficiently.
In this project, you'll use Python to analyze historical sales data, identify trends and seasonality, and build a machine learning model to forecast future demand. This hands-on approach will help you master time series analysis and regression techniques tailored for real-world e-commerce scenarios.
It’s helpful to have some basic knowledge of the following before starting this project:
To build this e-commerce demand forecasting project, you'll work with a set of Python tools and libraries for data analysis, modeling, and visualization:
Tool / Library | Purpose |
Python | The main programming language used to build the model |
Google Colab | Cloud-based environment for writing, running, and sharing code |
Pandas | For reading data, cleaning missing values, and wrangling time series |
NumPy | Performs fast numerical operations and array manipulations |
Matplotlib / Seaborn | Helps visualize trends, patterns, and actual vs. predicted values |
Scikit-learn | Trains the regression model and evaluates its performance |
Datetime / statsmodels | Deals with timestamps and enhances forecasting with statistical tools |
You can finish this e-commerce demand forecasting project in 3 to 4 hours. It suits beginner to intermediate learners.
To build a reliable demand forecasting model for e-commerce, you'll apply key techniques that help uncover patterns in historical sales and predict future demand:
These tools help build accurate, data-driven forecasts for smarter inventory planning.
Let’s build this project from scratch with clear, step-by-step guidance:
Without any further delay, let’s get started!
Download the dataset from Kaggle, extract the ZIP file, and use the downloaded dataset file for the project.
Next, upload the extracted dataset file to Google Colab using the code below:
from google.colab import files
uploaded = files.upload()
Once the file is uploaded, use the following Python code to import the required libraries, then read and inspect the data:
# main.py
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# --- 1. Data Loading and Initial Exploration ---
try:
    df = pd.read_csv('Dataset.csv', encoding='ISO-8859-1')
except FileNotFoundError:
    print("Error: 'Dataset.csv' not found. Please ensure the file is in the correct directory.")
    exit()

print("--- Initial Data Overview ---")
print("First 5 rows of the dataset:")
print(df.head())
print("\nDataset Information:")
df.info()
Output:
--- Initial Data Overview ---
First 5 rows of the dataset:
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
InvoiceDate UnitPrice CustomerID Country
0 01-12-2010 08:26 2.55 17850.0 United Kingdom
1 01-12-2010 08:26 3.39 17850.0 United Kingdom
2 01-12-2010 08:26 2.75 17850.0 United Kingdom
3 01-12-2010 08:26 3.39 17850.0 United Kingdom
4 01-12-2010 08:26 3.39 17850.0 United Kingdom
Dataset Information:
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int64
4 InvoiceDate 541909 non-null object
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null float64
7 Country 541909 non-null object
dtypes: float64(2), int64(1), object(5)
We’ll clean the dataset by dropping rows with a missing CustomerID, converting InvoiceDate to a datetime type, and removing returns, cancellations, and entries with a zero or negative unit price.
Here is the code for this step:
# Drop rows with missing CustomerID
df.dropna(subset=['CustomerID'], inplace=True)
print(f"\nDropped rows with missing CustomerID. New shape: {df.shape}")
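# Convert InvoiceDate to datetime.
# Caution: the raw strings look day-first (01-12-2010 appears to mean 1 December 2010), but
# format='mixed' alone reads such ambiguous dates month-first; adding dayfirst=True avoids this.
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='mixed')
# Remove returns/cancellations and zero-quantity rows (Quantity <= 0)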
df = df[df['Quantity'] > 0]
print(f"Removed returned items. New shape: {df.shape}")
# Remove zero or negative UnitPrice entries
df = df[df['UnitPrice'] > 0]
print(f"Removed items with zero or negative price. New shape: {df.shape}")
Output:
Dropped rows with missing CustomerID. New shape: (406829, 8)
Removed returned items. New shape: (397924, 8)
Removed items with zero or negative price. New shape: (397884, 8)
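The sample rows printed earlier show timestamps such as 01-12-2010 08:26, which look day-first (1 December 2010). The snippet below is a minimal sketch, assuming every InvoiceDate value follows the DD-MM-YYYY HH:MM layout; the three strings are made up for illustration and are not read from the dataset.

```python
import pandas as pd

# Illustrative strings in the layout the InvoiceDate column appears to use
# (assumption: DD-MM-YYYY HH:MM for every row).
raw = pd.Series(["01-12-2010 08:26", "13-12-2010 09:15", "09-12-2011 12:50"])

# An explicit format states that the first field is the day,
# so nothing has to be guessed row by row.
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['2010-12-01', '2010-12-13', '2011-12-09']
```

If some rows use a different layout, this call raises an error instead of guessing, which makes the problem visible. After parsing the real column, comparing df['InvoiceDate'].min() and df['InvoiceDate'].max() with the period the data is supposed to cover is a quick sanity check.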
To forecast sales, we need to convert the raw transactional data into a structured time series. This step calculates the total price of each line item and aggregates the quantity sold per day.
Here is the code for this step:
# Create a 'TotalPrice' column
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
# Aggregate total quantity sold per day
daily_sales = df.set_index('InvoiceDate').resample('D')['Quantity'].sum().reset_index()
daily_sales.rename(columns={'InvoiceDate': 'Date', 'Quantity': 'TotalQuantity'}, inplace=True)
print("\nAggregated daily sales data (first 5 rows):")
print(daily_sales.head())
Output:
Aggregated daily sales data (first 5 rows):
Date TotalQuantity
0 2010-01-12 24215
1 2010-01-13 0
2 2010-01-14 0
3 2010-01-15 0
4 2010-01-16 0
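The zero rows in the output come from how resample('D') works: it creates one row for every calendar day between the first and last timestamp, and the sum of a day with no transactions is 0. Here is a small sketch on made-up transactions (the values are illustrative only):

```python
import pandas as pd

# Toy transactions: two sales on 1 Dec, none on 2 Dec, one on 3 Dec.
tx = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01 08:26", "2010-12-01 14:00", "2010-12-03 10:30"]
    ),
    "Quantity": [6, 4, 5],
})

# One row per calendar day; days without any transaction sum to 0.
daily = tx.set_index("InvoiceDate").resample("D")["Quantity"].sum()
print(daily.tolist())
# [10, 0, 5]
```

A long run of zero days can reflect genuine non-trading days, but it can also point to gaps or misparsed dates, so it is worth inspecting before modeling.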
Now that we have a clean time series, the next step is to engineer features that help the model understand patterns in time, like weekdays, seasonality, and recent trends.
Here is the code for this step:
# Set Date as index for easy feature creation
daily_sales.set_index('Date', inplace=True)
# Calendar-based features
daily_sales['dayofweek'] = daily_sales.index.dayofweek
daily_sales['month'] = daily_sales.index.month
daily_sales['year'] = daily_sales.index.year
daily_sales['dayofyear'] = daily_sales.index.dayofyear
# Lag features
daily_sales['lag_1'] = daily_sales['TotalQuantity'].shift(1)
daily_sales['lag_7'] = daily_sales['TotalQuantity'].shift(7)
# Rolling window feature (7-day rolling mean)
daily_sales['rolling_mean_7'] = daily_sales['TotalQuantity'].rolling(window=7).mean()
# Drop rows with NaN values caused by lags and rolling calculations
daily_sales.dropna(inplace=True)
print("\nCreated time-based, lag, and rolling features.")
print("Final features for the model (first 5 rows):")
print(daily_sales.head())
Output:
Created time-based, lag, and rolling features.
Final features for the model (first 5 rows):
Date TotalQuantity dayofweek month year dayofyear lag_1 lag_7 \
2010-01-19 0 1 1 2010 19 0.0 24215.0
2010-01-20 0 2 1 2010 20 0.0 0.0
2010-01-21 0 3 1 2010 21 0.0 0.0
2010-01-22 0 4 1 2010 22 0.0 0.0
2010-01-23 0 5 1 2010 23 0.0 0.0
rolling_mean_7
Date
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-23 0.0
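One detail to watch in the rolling feature: rolling(window=7) applied directly to TotalQuantity uses a window that ends on the current day, so the feature contains the value the model is asked to predict. A common way to avoid this, shown here as a sketch on a made-up series with a 3-day window, is to shift the series by one day before taking the rolling mean. The column names rolling_incl_today and rolling_prev_only are hypothetical.

```python
import pandas as pd

# Made-up daily quantities, purely for illustration.
qty = pd.Series(
    [10, 20, 30, 40, 50],
    index=pd.date_range("2010-12-01", periods=5, freq="D"),
    name="TotalQuantity",
)

features = pd.DataFrame({"TotalQuantity": qty})
# Window ends on the current day: the target itself is part of the average.
features["rolling_incl_today"] = qty.rolling(window=3).mean()
# Shift first: the window now covers only the three previous days.
features["rolling_prev_only"] = qty.shift(1).rolling(window=3).mean()

print(features)
```

On 4 December, rolling_incl_today is 30.0 (the mean of 20, 30, and 40), while rolling_prev_only is 20.0 (the mean of 10, 20, and 30), which is all that would be known before that day's sales arrive.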
Now that we’ve created relevant time-based features, it’s time to separate them into input features (X) and the output we want to predict (y).
Here, TotalQuantity is the target variable, and all other columns help in making predictions.
Here is the code for this step:
# The target 'y' is the total daily quantity we want to predict.
y = daily_sales['TotalQuantity']
# The features 'X' are all the columns we created to help predict the target.
X = daily_sales.drop('TotalQuantity', axis=1)
print("\nShape of Feature Matrix (X):", X.shape)
print("Shape of Target Vector (y):", y.shape)
Output:
Shape of Feature Matrix (X): (691, 7)
Shape of Target Vector (y): (691,)
For time series forecasting, it's important to maintain chronological order.
We’ll split the data based on time, training on earlier records, and testing on more recent ones.
Then we train a simple and effective Linear Regression model using the training data.
Here is the code:
# For time series data, we split chronologically to train on the past and test on the future.
split_ratio = 0.8
split_index = int(len(X) * split_ratio)
X_train = X[:split_index]
X_test = X[split_index:]
y_train = y[:split_index]
y_test = y[split_index:]
print(f"\nSplit data into {split_ratio:.0%} training and {1 - split_ratio:.0%} testing.")
print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))
# --- 7. Model Training ---
print("\n--- Model Training ---")
# We'll use Linear Regression, a straightforward baseline model for regression tasks.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
print("Linear Regression model trained successfully.")
Output:
Split data into 80% training and 20% testing.
Training set size: 552
Testing set size: 139
Linear Regression model trained successfully.
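A single 80/20 split gives one estimate of out-of-sample error. As an optional extension (not part of the original notebook), scikit-learn's TimeSeriesSplit repeats the same chronological idea over several folds. The sketch below uses synthetic data as a stand-in for X and y; the names X_demo, y_demo, and model_demo are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the feature matrix and target (not the project data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

tscv = TimeSeriesSplit(n_splits=4)
fold_mae = []
for train_idx, test_idx in tscv.split(X_demo):
    # Every training row comes before every test row in each fold.
    assert train_idx.max() < test_idx.min()
    model_demo = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    preds = model_demo.predict(X_demo[test_idx])
    fold_mae.append(mean_absolute_error(y_demo[test_idx], preds))

print("MAE per fold:", [round(m, 3) for m in fold_mae])
```

Each fold trains on a longer prefix of the series and tests on the block that follows it, so the spread of the per-fold errors hints at how stable the model is over time.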
Once the model is trained, it's time to test how well it performs on unseen data.
We’ll make predictions using the test set and evaluate the results using standard regression metrics like MAE, RMSE, and R².
Here is the code for this step:
print("\n--- Model Evaluation ---")
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Predict on test set
y_pred = model.predict(X_test)
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
# Display results
print(f"Mean Absolute Error (MAE): {mae:,.2f} units")
print(f"Root Mean Squared Error (RMSE): {rmse:,.2f} units")
print(f"R-squared (R²): {r2:.2f}")
Output:
Mean Absolute Error (MAE): 8,059.80 units
Root Mean Squared Error (RMSE): 11,250.11 units
R-squared (R²): 0.19
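To make the three metrics concrete, here is how they can be computed by hand with NumPy. The numbers are made up for illustration, and the variable names (y_true, y_hat, mae_demo, and so on) are hypothetical.

```python
import numpy as np

# Made-up actual and predicted values, purely for illustration.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 270.0])

errors = y_true - y_hat

# MAE: average size of the error, in the same units as the target.
mae_demo = np.mean(np.abs(errors))
# RMSE: square root of the average squared error; penalizes large misses more.
rmse_demo = np.sqrt(np.mean(errors ** 2))
# R²: 1 minus the ratio of the model's squared error to that of
# always predicting the mean of y_true.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(f"MAE: {mae_demo:.2f}, RMSE: {rmse_demo:.2f}, R²: {r2_demo:.3f}")
# MAE: 12.50, RMSE: 13.23, R²: 0.944
```

Read this way, the R² of 0.19 reported above means the model's squared error on the test set is only about 19% lower than that of always predicting the test-set mean.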
After evaluation, it's helpful to visualize how well the model predictions align with actual sales.
This line chart lets you see patterns, gaps, and overall prediction accuracy over time.
# --- Visualizing Results ---
import matplotlib.pyplot as plt
print("\n--- Visualizing Results ---")
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(15, 7))
# Plot actual vs predicted sales
ax.plot(y_test.index, y_test.values, label='Actual Daily Sales', color='blue', linewidth=2)
ax.plot(y_test.index, y_pred, label='Predicted Daily Sales', color='red', linestyle='--', linewidth=2)
# Customize the chart
ax.set_title('Demand Forecast: Actual vs. Predicted Sales', fontsize=16)
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('Total Quantity Sold', fontsize=12)
ax.legend()
plt.xticks(rotation=45)
plt.tight_layout()
# Display the plot
print("Displaying the plot of actual vs. predicted sales.")
plt.show()
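Output:
[Line chart: "Demand Forecast: Actual vs. Predicted Sales", showing actual daily sales (solid blue) and predicted daily sales (dashed red) over the test period, with Date on the x-axis and Total Quantity Sold on the y-axis]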
This project demonstrated how to forecast daily product demand using linear regression. You cleaned the sales data, aggregated it into a daily time series, engineered calendar, lag, and rolling features, and trained a model on a chronological split. On the test set the model reached an MAE of about 8,060 units and an R² of 0.19, so it explains only a small share of the day-to-day variation, and the actual vs. predicted chart helps show where it falls short. Treat it as a starting point: more advanced models or external factors may improve accuracy.
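Before reaching for a more advanced model, a simple baseline gives a useful reference point. One common choice for daily data is a seasonal naive forecast, which predicts each day's quantity as the value from seven days earlier. The sketch below is not part of the original notebook and uses a synthetic series (a repeating weekly pattern plus a steady upward trend); all names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a repeating weekly pattern plus a +1 per day trend.
idx = pd.date_range("2011-01-01", periods=28, freq="D")
weekly_pattern = np.tile([50, 60, 55, 70, 90, 20, 0], 4)
sales = pd.Series(weekly_pattern + np.arange(28), index=idx, dtype=float)

# Seasonal naive forecast: today's prediction is the value from 7 days ago.
naive_forecast = sales.shift(7)

# Score only the days that have a forecast (the first week has none).
valid = naive_forecast.notna()
mae_naive = (sales[valid] - naive_forecast[valid]).abs().mean()

print(f"Seasonal naive MAE: {mae_naive:.2f}")
# Seasonal naive MAE: 7.00
```

If a trained model cannot beat this kind of baseline on the same test period, the extra complexity is probably not paying off yet.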
Reference:
https://colab.research.google.com/drive/1EES-nOpZTlAnGOwOGpr_cqC49Vd3SzWq?usp=sharing
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...