Weather Forecasting Model Using Machine Learning and Time Series Analysis
By Rohit Sharma
Updated on Jul 30, 2025 | 8 min read | 1.21K+ views
Accurate weather prediction plays a major role in agriculture, travel, disaster planning, and daily life. In this project, you'll build a Weather Forecasting Model using machine learning techniques and time series data.
You'll apply regression models to predict continuous variables such as temperature, humidity, or pressure. You'll also learn how to prepare time-based data and evaluate your model's accuracy using standard metrics.
If you're looking to accelerate your data science journey, check out the Online Data Science Courses at upGrad. The programs help you learn Python, Machine Learning, AI, Tableau, SQL, and more from top-tier faculty. Enroll today!
Spark your next big idea. Browse our full collection of data science projects in Python.
Before starting this project, it helps to have basic knowledge of Python, pandas, and core machine learning concepts such as regression.
Also Read - Data Structures in Python
Start your journey of career advancement in data science with upGrad’s top-ranked courses and get a chance to learn from industry-established mentors.
For this Weather Forecasting Model project, the following tools and libraries will be used:
Tool / Library | Purpose
Python | Core programming language
Google Colab | Cloud-based notebook for coding and collaboration
Pandas | Data loading, cleaning, and manipulation
NumPy | Numerical computations and array operations
Matplotlib / Seaborn | Visualizing time series trends and correlations
Scikit-learn | Building and evaluating regression models
Datetime / statsmodels | Handling time-based data and forecasting
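All of these libraries come preinstalled in Google Colab. If you are working locally instead, a quick check like the sketch below confirms that everything imports and shows the versions you have:

# Quick environment check: every library used in this project, with its version.
# In Google Colab these are preinstalled; locally, install any missing one with pip.
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import sklearn
import statsmodels

for name, module in [("pandas", pd), ("numpy", np), ("matplotlib", matplotlib),
                     ("seaborn", sns), ("scikit-learn", sklearn), ("statsmodels", statsmodels)]:
    print(f"{name}: {module.__version__}")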
Also Read - Top 6 Python IDEs of 2025 That Will Change Your Workflow!
For our Weather Forecasting Model, we’ll use the following key techniques to build and evaluate predictive models: time-based feature engineering, one-hot encoding of categorical variables, a chronological train/test split, linear regression, and standard regression metrics (MAE, MSE, RMSE, and R²).
Check out this free Linear Regression – Step by Step Guide course to enhance your skills in predictive modeling, feature engineering, and model evaluation.
You can complete this Weather Forecasting Model project in about 3 to 4 hours. It’s ideal for beginners to intermediate users, offering hands-on experience in data cleaning, time-based feature engineering, regression modeling, and model evaluation.
Let’s build this project from scratch with clear, step-by-step guidance:
Without any further delay, let’s get started!
Download the dataset from Kaggle, extract the ZIP file, and use the downloaded dataset file for the project.
Now that you’ve downloaded the dataset, upload it to Google Colab using the code below:
# Upload weatherHistory.csv from your local machine into the Colab session
from google.colab import files
uploaded = files.upload()
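files.upload() is fine for a one-off upload. If you plan to rerun the notebook, two common alternatives are mounting Google Drive or unzipping the Kaggle archive inside Colab. The sketch below uses placeholder file names, so adjust the paths to wherever you stored the download:

# Alternative 1: mount Google Drive so the CSV persists across Colab sessions.
from google.colab import drive
drive.mount('/content/drive')
# After mounting, read the file from your Drive path (placeholder shown):
# df = pd.read_csv('/content/drive/MyDrive/weatherHistory.csv')

# Alternative 2: if you uploaded the Kaggle ZIP itself, extract it in place.
import zipfile
with zipfile.ZipFile('archive.zip') as zf:  # 'archive.zip' is a placeholder name
    zf.extractall('.')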
Once uploaded, use the following Python code to import the required libraries, then read and inspect the data:
# main.py
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
# --- 1. Data Loading and Initial Exploration ---
# Load the dataset from the uploaded CSV file.
try:
    df = pd.read_csv('weatherHistory.csv')
except FileNotFoundError:
    print("Error: 'weatherHistory.csv' not found. Please make sure the file is in the correct directory.")
    exit()
print("--- Initial Data Overview ---")
print("First 5 rows of the dataset:")
print(df.head())
print("\nDataset Information:")
df.info()
print("\nChecking for missing values:")
print(df.isnull().sum())
Output:
--- Initial Data Overview ---
First 5 rows of the dataset:
Formatted Date Summary Precip Type Temperature (C) \
0 2006-04-01 00:00:00.000 +0200 Partly Cloudy rain 9.472222
1 2006-04-01 01:00:00.000 +0200 Partly Cloudy rain 9.355556
2 2006-04-01 02:00:00.000 +0200 Mostly Cloudy rain 9.377778
3 2006-04-01 03:00:00.000 +0200 Partly Cloudy rain 8.288889
4 2006-04-01 04:00:00.000 +0200 Mostly Cloudy rain 8.755556
Apparent Temperature (C) Humidity Wind Speed (km/h) \
0 7.388889 0.89 14.1197
1 7.227778 0.86 14.2646
2 9.377778 0.89 3.9284
3 5.944444 0.83 14.1036
4 6.977778 0.83 11.0446
Wind Bearing (degrees) Visibility (km) Loud Cover Pressure (millibars) \
0 251.0 15.8263 0.0 1015.13
1 259.0 15.8263 0.0 1015.63
2 204.0 14.9569 0.0 1015.94
3 269.0 15.8263 0.0 1016.41
4 259.0 15.8263 0.0 1016.51
Daily Summary
0 Partly cloudy throughout the day.
1 Partly cloudy throughout the day.
2 Partly cloudy throughout the day.
3 Partly cloudy throughout the day.
4 Partly cloudy throughout the day.
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Formatted Date 96453 non-null object
1 Summary 96453 non-null object
2 Precip Type 95936 non-null object
3 Temperature (C) 96453 non-null float64
4 Apparent Temperature (C) 96453 non-null float64
5 Humidity 96453 non-null float64
6 Wind Speed (km/h) 96453 non-null float64
7 Wind Bearing (degrees) 96453 non-null float64
8 Visibility (km) 96453 non-null float64
9 Loud Cover 96453 non-null float64
10 Pressure (millibars) 96453 non-null float64
11 Daily Summary 96453 non-null object
dtypes: float64(8), object(4)
memory usage: 8.8+ MB
Checking for missing values:
Formatted Date 0
Summary 0
Precip Type 517
Temperature (C) 0
Apparent Temperature (C) 0
Humidity 0
Wind Speed (km/h) 0
Wind Bearing (degrees) 0
Visibility (km) 0
Loud Cover 0
Pressure (millibars) 0
Daily Summary 0
dtype: int64
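Before preprocessing, a quick exploratory plot helps justify the choices in the next step: Apparent Temperature (C) moves almost in lockstep with the target, and Loud Cover never changes. This optional sketch reuses the seaborn import from above to draw a correlation heatmap of the numeric columns:

# Optional EDA: confirm that 'Loud Cover' is constant and visualize correlations
# between the remaining numeric columns.
numeric_cols = df.select_dtypes(include='number')
print("Unique values in 'Loud Cover':", numeric_cols['Loud Cover'].unique())

plt.figure(figsize=(10, 6))
sns.heatmap(numeric_cols.drop(columns=['Loud Cover']).corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Between Numeric Weather Variables')
plt.tight_layout()
plt.show()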
We’ll clean the dataset, extract useful time-based features, handle missing values, and convert categorical variables into a numerical format that the machine learning model can understand.
Here is the code for this step:
# --- Step 3: Data Preprocessing and Feature Engineering ---
print("\n--- Data Preprocessing ---")
# Convert 'Formatted Date' to datetime format
df['Formatted Date'] = pd.to_datetime(df['Formatted Date'], utc=True)
# Set datetime as index and sort chronologically
df.set_index('Formatted Date', inplace=True)
df.sort_index(inplace=True)
# Fill missing values in 'Precip Type' with the most frequent value
precip_mode = df['Precip Type'].mode()[0]
df['Precip Type'] = df['Precip Type'].fillna(precip_mode)
print(f"Filled missing 'Precip Type' values with '{precip_mode}'.")
# Create new time-based features from the datetime index
df['year'] = df.index.year
df['month'] = df.index.month
df['day'] = df.index.day
df['hour'] = df.index.hour
print("Created time-based features: year, month, day, hour.")
# Drop columns that add no value or cause data leakage
df_processed = df.drop(['Loud Cover', 'Apparent Temperature (C)', 'Daily Summary'], axis=1)
# Convert categorical variables into numerical format using one-hot encoding
df_processed = pd.get_dummies(df_processed, columns=['Summary', 'Precip Type'], drop_first=True)
print("Converted categorical features to numeric using one-hot encoding.")
Output:
--- Data Preprocessing ---
Filled missing 'Precip Type' values with 'rain'.
Created time-based features: year, month, day, hour.
Converted categorical features to numeric using one-hot encoding.
Check out this tutorial on How to Work with datetime in Python to learn how to handle dates and times, parse timestamps, format outputs, and perform date-based operations.
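As a small, self-contained illustration of those ideas, here is how pandas parses a timestamp in the same format as the Formatted Date column and exposes the pieces we used as features:

# Mini datetime demo: parse one timestamp in the dataset's format,
# convert it to UTC, and pull out the components used as features above.
import pandas as pd

ts = pd.to_datetime('2006-04-01 00:00:00.000 +0200', utc=True)
print(ts)                               # 2006-03-31 22:00:00+00:00
print(ts.year, ts.month, ts.day, ts.hour)
print(ts.strftime('%Y-%m-%d %H:%M'))    # custom output formatting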
We now separate the target variable and features: Temperature (C) will be our prediction target, and the rest will serve as input features for the model.
Here is the code for this step:
# --- Step 4: Defining Features (X) and Target (y) ---
# The target variable 'y' is what we want to predict.
y = df_processed['Temperature (C)']
# The feature matrix 'X' contains all the variables used for prediction.
X = df_processed.drop('Temperature (C)', axis=1)
# Display shapes to confirm the structure
print("\nShape of Feature Matrix (X):", X.shape)
print("Shape of Target Vector (y):", y.shape)
Output:
Shape of Feature Matrix (X): (96453, 36)
Shape of Target Vector (y): (96453,)
Also Read - Feature Engineering for Machine Learning: Process, Techniques, and Examples
Since this is time series data, we avoid random splitting. Instead, we split the dataset chronologically, training on the past and testing on the future.
Here is the code for this step:
# --- Step 5: Splitting Data into Training and Testing Sets ---
# Define split ratio
split_ratio = 0.8
split_index = int(len(X) * split_ratio)
# Chronologically split the dataset
X_train = X[:split_index]
X_test = X[split_index:]
y_train = y[:split_index]
y_test = y[split_index:]
# Print the sizes of the splits
print(f"\nSplit data into {split_ratio*100}% training and {(1 - split_ratio)*100}% testing.")
print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))
Output:
Split data into 80% training and 20% testing.
Training set size: 77162
Testing set size: 19291
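The single chronological split above mirrors how the model would be used in practice, training on the past and predicting the future. If you want a more robust estimate of performance, scikit-learn's TimeSeriesSplit produces several ordered train/validation folds; a minimal sketch:

# Optional: rolling-origin cross-validation with TimeSeriesSplit.
# Each fold trains on an earlier slice of the data and validates on the slice
# that immediately follows it, so the time order is never violated.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}: train={len(train_idx)} rows, validation={len(val_idx)} rows")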
Now we train a Linear Regression model, which is a simple yet effective algorithm for predicting continuous variables like temperature. The model learns patterns from the training data to make future predictions.
Here is the code for this step:
# --- Step 6: Model Training ---
print("\n--- Model Training ---")
# Initialize the Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
print("Linear Regression model trained successfully.")
Output:
--- Model Training ---
Linear Regression model trained successfully.
Also Read - What is Regression: Regression Analysis Explained
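Once the model is fitted, it can be instructive to look at the learned coefficients. A quick sketch (keep in mind the features are not standardized, so coefficient sizes mix feature scale with feature influence):

# Inspect the fitted Linear Regression model's coefficients.
# Because the features are unscaled, large coefficients reflect feature scale
# as well as influence, so treat this as a rough indication only.
coefficients = pd.Series(model.coef_, index=X_train.columns)
print("Intercept:", round(model.intercept_, 3))
print("\nTop 10 coefficients by absolute value:")
print(coefficients.sort_values(key=abs, ascending=False).head(10))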
After training the model, we test its performance on unseen data. We use standard regression metrics to measure accuracy and understand how well the model predicts temperature.
Here is the code:
# --- Step 7: Model Prediction and Evaluation ---
print("\n--- Model Evaluation ---")
# Predict on the test set
y_pred = model.predict(X_test)
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
# Print the evaluation results
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R²): {r2:.2f}")
Output:
--- Model Evaluation ---
Mean Absolute Error (MAE): 5.22
Mean Squared Error (MSE): 38.56
Root Mean Squared Error (RMSE): 6.21
R-squared (R²): 0.53
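An R² of 0.53 leaves clear room for improvement. One optional experiment, not part of the original walkthrough, is to swap in a non-linear model on the same chronological split; the sketch below uses scikit-learn's RandomForestRegressor and will take noticeably longer to train:

# Optional extension: a non-linear baseline on the same train/test split.
# Random forests can capture interactions and non-linear effects that a
# linear model misses, at the cost of longer training time.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

print(f"Random Forest MAE:  {mean_absolute_error(y_test, rf_pred):.2f}")
print(f"Random Forest RMSE: {np.sqrt(mean_squared_error(y_test, rf_pred)):.2f}")
print(f"Random Forest R²:   {r2_score(y_test, rf_pred):.2f}")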
To better understand model performance, we visualize the predictions:
Here is the code for this step:
# --- Step 8: Visualizing the Results ---
print("\n--- Visualizing Results ---")
# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
# Line plot: Actual vs Predicted temperatures (sample of 500)
fig, ax = plt.subplots(figsize=(15, 7))
sample_size = 500
ax.plot(y_test.index[:sample_size], y_test.values[:sample_size], label='Actual Temperature', color='blue', linewidth=2)
ax.plot(y_test.index[:sample_size], y_pred[:sample_size], label='Predicted Temperature', color='red', linestyle='--', linewidth=2)
ax.set_title('Weather Forecast: Actual vs. Predicted Temperature (Sample of Test Data)', fontsize=16)
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('Temperature (C)', fontsize=12)
ax.legend()
plt.xticks(rotation=45)
plt.tight_layout()
print("Displaying the plot of actual vs. predicted temperatures.")
plt.show()
# Scatter plot: Predicted vs Actual values
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(y_test, y_pred, alpha=0.3)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label='Perfect Prediction')
ax.set_xlabel('Actual Temperature (C)', fontsize=12)
ax.set_ylabel('Predicted Temperature (C)', fontsize=12)
ax.set_title('Actual vs. Predicted Scatter Plot', fontsize=16)
ax.legend()
plt.tight_layout()
plt.show()
Output: two plots are displayed, a line chart comparing actual and predicted temperatures over a 500-hour sample of the test set, and a scatter plot of predicted vs. actual temperatures against the perfect-prediction line.
Also Read - Top 15 Types of Data Visualization: Benefits and How to Choose the Right Tool for Your Needs in 2025
In this project, we built a weather forecasting model using linear regression on historical weather data. After cleaning the dataset, creating time-based features, and encoding categorical variables, we trained the model and evaluated it using MAE, RMSE, and R². The results showed reasonable prediction accuracy, and visualizations confirmed the model's ability to capture key trends. This project reinforced essential skills in regression, time series handling, and model evaluation.
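If you want to take the time series angle further, statsmodels (listed in the tools table but not needed for the regression workflow above) provides classical forecasting models. A minimal sketch, assuming you first resample the hourly data to daily mean temperatures, might look like this:

# Possible next step: a classical ARIMA forecast on daily mean temperature.
# The (2, 1, 2) order is an arbitrary starting point, not a tuned choice.
from statsmodels.tsa.arima.model import ARIMA

# Drop the timezone and resample the hourly series to daily means.
daily_temp = df['Temperature (C)'].tz_convert(None).resample('D').mean().dropna()
train = daily_temp[:-30]                    # hold out the last 30 days

arima_model = ARIMA(train, order=(2, 1, 2)).fit()
forecast = arima_model.forecast(steps=30)   # predict the held-out month
print(forecast.head())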
Colab Link:
https://colab.research.google.com/drive/1KBrnnWm6Ka858VkrhyKDVBVHcwyIm3RG