Indian Rainfall Analysis and Prediction Using Linear Regression

By Rohit Sharma

Updated on Aug 11, 2025 | 1.81K+ views

Table of Contents

Heads Up Before You Dive In!
Indian Rainfall Analysis and Prediction: Methodology
Predicting Annual Rainfall: A Step-by-Step Guide Using Machine Learning
Final Conclusion

Rainfall plays a key role in agriculture, water supply, and climate balance across India. Seasonal variations impact crop yields, reservoir levels, and daily life.

This project on Indian Rainfall Analysis and Prediction studies historical rainfall data to identify trends, seasonal contributions, and high-rainfall regions. You’ll also build a Linear Regression model to predict annual rainfall using the first five months of data.

Heads Up Before You Dive In!

To work effectively on the Indian Rainfall Analysis and Prediction project, make sure you're comfortable with the following:

Basic Python programming knowledge (You should be able to write simple scripts, use loops and conditions, and define functions.)
Experience with data manipulation using Pandas and NumPy (These libraries are essential for reading the dataset, handling missing values, and preparing the data for analysis.)
Understanding of data visualisation with Matplotlib and Seaborn (You'll use these tools to draw the bar plots, line charts, and pie charts in this project.)
Knowledge of data preprocessing techniques (You should know how to clean data, encode categorical variables, scale features, and split the dataset into training and test sets.)
Familiarity with Regression Algorithms (Understanding models like Linear Regression is important, as it’s used here to predict annual rainfall from early-year data.)

Indian Rainfall Analysis and Prediction: Methodology

To predict annual rainfall, we used historical district-wise rainfall data and built a regression model that learns patterns from early-year monthly rainfall. Here's what we did:

Data Preprocessing and Cleaning
Feature Engineering
Train-Test Split
Regression Model (Linear Regression)
Model Evaluation (R² Score and RMSE)

Estimated Time to Complete: The Indian Rainfall Analysis and Prediction project is estimated to take 3 to 4 hours. The time may vary depending on your familiarity with Python, especially in data loading, exploratory data analysis, feature selection, regression modelling, and model evaluation.

Predicting Annual Rainfall: A Step-by-Step Guide Using Machine Learning

Here’s how you can build the Indian Rainfall Analysis and Prediction project from scratch using Python and machine learning:

Load the Rainfall Dataset: Import district-wise rainfall data with monthly values from January to December and annual totals.
Clean and Preprocess the Data: Remove extra spaces in column names, check for missing values, and structure the dataset for analysis.
Explore and Visualise the Data: Create bar plots for top rainfall states and districts, a line chart for monthly averages, and a pie chart for seasonal contributions.
Train a Regression Model: Use Linear Regression to predict annual rainfall based on early-year (January–May) rainfall data.
Evaluate Model Performance: Measure accuracy with R² score and RMSE to check how well the model predicts annual rainfall.

Step 1: Import Required Libraries

First, import all the necessary Python libraries for data handling, visualisation, model building, and evaluation.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import warnings

warnings.filterwarnings('ignore')

Step 2: Loading and Preparing Data

This step loads the rainfall dataset and performs initial exploration to understand its structure and contents.


print("---Loading and Preparing Data ---")
try:
    df = pd.read_csv('district wise rainfall normal.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    # Stop here: nothing below can run without the data
    raise SystemExit("Error: 'district wise rainfall normal.csv' not found. Please check the file path.")

# Initial Data Exploration
print("\n--- Initial Data Exploration ---")
print("DataFrame Head:")
print(df.head())
print("\nDataFrame Info:")
df.info()  # prints its report directly and returns None, so it is not wrapped
print("\nDataFrame Description:")
print(df.describe())

Output:

---Loading and Preparing Data ---
Dataset loaded successfully.

--- Initial Data Exploration ---

DataFrame Info:
RangeIndex: 641 entries, 0 to 640
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   STATE_UT_NAME  641 non-null    object
 1   DISTRICT       641 non-null    object
 2   JAN            641 non-null    float64
 3   FEB            641 non-null    float64
 4   MAR            641 non-null    float64
 5   APR            641 non-null    float64
 6   MAY            641 non-null    float64
 7   JUN            641 non-null    float64
 8   JUL            641 non-null    float64
 9   AUG            641 non-null    float64
 10  SEP            641 non-null    float64
 11  OCT            641 non-null    float64
 12  NOV            641 non-null    float64
 13  DEC            641 non-null    float64
 14  ANNUAL         641 non-null    float64
 15  Jan-Feb        641 non-null    float64
 16  Mar-May        641 non-null    float64
 17  Jun-Sep        641 non-null    float64
 18  Oct-Dec        641 non-null    float64
dtypes: float64(17), object(2)

Step 3: Cleaning Column Names

This step removes extra spaces from column names to ensure consistent data handling.

# Clean up column names
df.columns = df.columns.str.strip()
print("Cleaned DataFrame shape:", df.shape)

Output:

Cleaned DataFrame shape: (641, 19)
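
Step 3 strips whitespace from column names, and the step overview also mentions missing values, which the `df.info()` output in Step 2 shows are absent here (641 non-null entries in every column). As a minimal sketch of both checks, the snippet below applies them to a small made-up frame. The padded column names and the values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical frame with padded column names and one missing value
toy = pd.DataFrame({
    " DISTRICT ": ["A", "B", "C"],
    "JAN ": [10.0, None, 5.0],
    " ANNUAL": [900.0, 1200.0, 700.0],
})

# Same cleaning call as in Step 3
toy.columns = toy.columns.str.strip()

# Count missing values per column before deciding how to handle them
missing = toy.isna().sum()
print(list(toy.columns))
print(missing.to_dict())
```

If the counts were non-zero on the real data, you would need to decide whether to drop or fill those rows before modelling.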

Step 4: Exploratory Data Analysis (EDA)

This step visualises rainfall patterns across states, districts, months, and seasons.



sns.set_style("whitegrid")

# a. Top 10 States with Highest Annual Rainfall
plt.figure(figsize=(12, 7))
state_rainfall = df.groupby('STATE_UT_NAME')['ANNUAL'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=state_rainfall.values, y=state_rainfall.index, palette='Blues_r')
plt.title('Top 10 States by Average Annual Rainfall', fontsize=16)
plt.xlabel('Average Annual Rainfall (mm)', fontsize=12)
plt.ylabel('State / Union Territory', fontsize=12)
plt.savefig('top_10_states_rainfall.png', dpi=300, bbox_inches='tight')
print("Generated 'top_10_states_rainfall.png'")

# b. Top 10 Districts with Highest Annual Rainfall
# Note: district names may repeat across states; grouping by DISTRICT alone
# averages any same-named districts together.
plt.figure(figsize=(12, 7))
district_rainfall = df.groupby('DISTRICT')['ANNUAL'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=district_rainfall.values, y=district_rainfall.index, palette='Greens_r')
plt.title('Top 10 Districts by Average Annual Rainfall', fontsize=16)
plt.xlabel('Average Annual Rainfall (mm)', fontsize=12)
plt.ylabel('District', fontsize=12)
plt.savefig('top_10_districts_rainfall.png', dpi=300, bbox_inches='tight')
print("Generated 'top_10_districts_rainfall.png'")

# c. Monthly Rainfall Distribution (National Average)
months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
monthly_avg_rainfall = df[months].mean()
plt.figure(figsize=(14, 7))
sns.lineplot(x=monthly_avg_rainfall.index, y=monthly_avg_rainfall.values, marker='o', color='crimson', lw=2)
plt.title('Average Monthly Rainfall Across India', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Average Rainfall (mm)', fontsize=12)
plt.xticks(rotation=45)
plt.savefig('national_monthly_rainfall.png', dpi=300, bbox_inches='tight')
print("Generated 'national_monthly_rainfall.png'")

# d. Contribution of Different Seasons to Annual Rainfall
df['MONSOON'] = df['JUN'] + df['JUL'] + df['AUG'] + df['SEP']
df['PRE_MONSOON'] = df['MAR'] + df['APR'] + df['MAY']
df['POST_MONSOON'] = df['OCT'] + df['NOV'] + df['DEC']
df['WINTER'] = df['JAN'] + df['FEB']

seasonal_contribution = df[['PRE_MONSOON', 'MONSOON', 'POST_MONSOON', 'WINTER']].mean()
plt.figure(figsize=(10, 8))
plt.pie(seasonal_contribution, labels=seasonal_contribution.index, autopct='%1.1f%%',
        colors=sns.color_palette('pastel'), wedgeprops={'edgecolor': 'black'})
plt.title('Contribution of Seasons to Total Annual Rainfall', fontsize=16)
plt.savefig('seasonal_rainfall_contribution.png', dpi=300, bbox_inches='tight')
print("Generated 'seasonal_rainfall_contribution.png'")

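Output: four charts saved as PNG files (top 10 states, top 10 districts, national monthly average, and seasonal contribution).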
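
The pie chart in part (d) shows each season's share of the annual total. The same percentages can be computed directly, which is useful when you want the numbers rather than the picture. This is a minimal sketch in plain Python, using made-up monthly values for a single hypothetical district (not taken from the dataset):

```python
# Hypothetical monthly rainfall (mm) for one district; not taken from the dataset
monthly = {
    "JAN": 10, "FEB": 15, "MAR": 20, "APR": 30, "MAY": 50, "JUN": 200,
    "JUL": 300, "AUG": 250, "SEP": 150, "OCT": 60, "NOV": 20, "DEC": 5,
}

# Season definitions match the columns built in Step 4
seasons = {
    "WINTER": ["JAN", "FEB"],
    "PRE_MONSOON": ["MAR", "APR", "MAY"],
    "MONSOON": ["JUN", "JUL", "AUG", "SEP"],
    "POST_MONSOON": ["OCT", "NOV", "DEC"],
}

annual = sum(monthly.values())

# Each season's share of the annual total, in percent
shares = {
    name: 100 * sum(monthly[m] for m in season_months) / annual
    for name, season_months in seasons.items()
}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

On the real data, the equivalent is to divide each entry of `seasonal_contribution` by its sum.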

Step 5: Defining Features and Target Variable

This step selects the input features and target for the rainfall prediction model.

Here is the code for this step:

# --- 3. Annual Rainfall Prediction Model ---
print("\n--- 3. Annual Rainfall Prediction Model ---")

# Define features (X) and target (y)
# We will predict the ANNUAL rainfall based on the rainfall in the first 5 months.
features = ['JAN', 'FEB', 'MAR', 'APR', 'MAY']
target = 'ANNUAL'

X = df[features]
y = df[target]

Step 6: Splitting Data into Training and Testing Sets

This step divides the dataset into training and testing portions for model building and evaluation.

Here is the code for this step:

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

Output:

Training data shape: (512, 5)

Testing data shape: (129, 5)
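
These shapes follow from how `train_test_split` sizes the split: with a fractional `test_size`, scikit-learn rounds the test count up and gives the remaining rows to training. A quick check of that arithmetic for the 641 rows in this dataset, in plain Python:

```python
import math

n_samples = 641   # rows in the dataset, from the df.info() output
test_size = 0.2   # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # 128.2 rounds up to 129
n_train = n_samples - n_test               # the remaining rows go to training

print(n_train, n_test)
```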

Step 7: Training the Linear Regression Model

This step fits a Linear Regression model to predict annual rainfall using the first five months' rainfall data.

# Initialize and train a simple Linear Regression model
print("\nTraining the Annual Rainfall prediction model...")
model = LinearRegression()
model.fit(X_train, y_train)
print("Model training complete.")

# Print model coefficients and intercept.
# Each coefficient is the change in predicted annual rainfall (mm)
# per extra mm of rainfall in that month.
print("\nModel Coefficients (mm annual per mm monthly):")
for feature, coef in zip(features, model.coef_):
    print(f"  {feature}: {coef:.3f}")
print(f"Model Intercept (mm): {model.intercept_:.3f}")

Output:

Training the Annual Rainfall prediction model...
Model training complete.

Model Coefficients (mm annual per mm monthly):
  JAN: -1.867
  FEB: 3.907
  MAR: 1.062
  APR: -3.200
  MAY: 7.394
Model Intercept (mm): 809.245

Step 8: Model Evaluation

This step tests the trained model on unseen data and measures prediction accuracy.

# --- Model Evaluation ---
print("\n--- Model Evaluation ---")

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate and print evaluation metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Model R-squared (R²): {r2:.3f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f} mm")
print("An R² score close to 1.0 indicates the model can predict annual rainfall well based on early-year data.")

Output:

--- Model Evaluation ---

Model R-squared (R²): 0.683

Root Mean Squared Error (RMSE): 532.29 mm

An R² score close to 1.0 indicates the model can predict annual rainfall well based on early-year data.
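
To make the two metrics concrete, here is a minimal sketch that computes R² and RMSE from their definitions in plain Python. The `y_true` and `y_hat` lists are made-up numbers, not model output. The last lines use the identity RMSE = σ·√(1 − R²), where σ is the (population) standard deviation of the test targets, to back out the spread implied by the scores reported above.

```python
import math

def rmse(y_true, y_hat):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n)

def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up annual totals (mm) and predictions, for illustration only
y_true = [800.0, 1200.0, 1500.0, 3000.0]
y_hat = [900.0, 1100.0, 1700.0, 2600.0]

print(f"R2 = {r2(y_true, y_hat):.3f}, RMSE = {rmse(y_true, y_hat):.1f} mm")

# Always predicting the mean gives R2 = 0: the baseline R2 is measured against
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)
print(f"Mean-only baseline R2 = {r2(y_true, mean_pred):.3f}")

# Spread of the test targets implied by the reported R2 (0.683) and RMSE (532.29 mm)
implied_std = 532.29 / math.sqrt(1 - 0.683)
print(f"Implied std of test targets: {implied_std:.0f} mm")
```

Read this way, an RMSE of about 532 mm against an implied spread of roughly 945 mm means the model removes a bit under half of the typical error of a mean-only guess.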

Step 9: Example Prediction

This step uses the trained model to forecast annual rainfall from sample early-year data.

print("\n--- Example Prediction ---")

# Create a hypothetical data point for prediction
hypothetical_data = {
    'JAN': [20],
    'FEB': [30],
    'MAR': [50],
    'APR': [100],
    'MAY': [200]
}
example_df = pd.DataFrame(hypothetical_data)

print("Predicting Annual Rainfall for the following early-year data:")
print(example_df)

# Use the trained model to predict the annual rainfall
predicted_rainfall = model.predict(example_df)
print(f"\nPredicted Annual Rainfall: {predicted_rainfall[0]:.2f} mm")

Output:

--- Example Prediction ---

Predicting Annual Rainfall for the following early-year data:
   JAN  FEB  MAR  APR  MAY
0   20   30   50  100  200

Predicted Annual Rainfall: 2101.08 mm
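
Because the model is linear, this prediction can be reproduced by hand from the coefficients and intercept printed in Step 7. The sketch below uses those rounded values, so it matches the model output only to within rounding:

```python
# Coefficients and intercept as printed in Step 7 (rounded to 3 decimals)
coef = {"JAN": -1.867, "FEB": 3.907, "MAR": 1.062, "APR": -3.200, "MAY": 7.394}
intercept = 809.245

# Same hypothetical input as the example above (mm)
sample = {"JAN": 20, "FEB": 30, "MAR": 50, "APR": 100, "MAY": 200}

# Linear model: intercept plus the sum of coefficient * feature value
prediction = intercept + sum(coef[m] * sample[m] for m in coef)
print(f"Hand-computed prediction: {prediction:.2f} mm")
```

The result is about 2101 mm, in line with the 2101.08 mm reported above; the small gap comes from rounding the coefficients. The May term (7.394 × 200 ≈ 1479 mm) dominates the prediction.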

Final Conclusion

This project explored district-wise rainfall data, cleaned and prepared it for analysis, and visualised key seasonal and regional patterns. Using early-year rainfall data from January to May, a Linear Regression model was developed to predict the total annual rainfall. On the test set the model reached an R² of 0.683 with an RMSE of about 532 mm, so early-year rainfall explains roughly two-thirds of the variation in annual totals, but individual predictions can still be off by several hundred millimetres.

Colab Link:
https://colab.research.google.com/drive/1AfJH7QFC8yHFso3QKWm8f4zRn1SeTW2-?usp=sharing

Frequently Asked Questions (FAQs)

1. What is the main goal of this project?

The primary objective is to study rainfall trends in different districts and states of India. The project also develops a predictive model that estimates the total annual rainfall using data from just the first five months of the year.

2. What dataset was used?

The dataset contains district-wise rainfall records for India (641 rows), with monthly rainfall values from January to December and an annual total. It also includes seasonal aggregates for Jan-Feb, Mar-May, Jun-Sep, and Oct-Dec, which correspond to the winter, pre-monsoon, monsoon, and post-monsoon periods and help in understanding the distribution across the year.

3. Which machine learning algorithm was applied?

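A Linear Regression model was chosen for its simplicity and interpretability. It uses rainfall values from January to May as input features to predict the total annual rainfall. Each fitted coefficient shows how the predicted annual total changes per extra millimetre of rainfall in that month, although individual coefficients (including the negative ones for January and April) should be read with care if the monthly values are correlated with each other.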

4. How accurate is the prediction model?

The model achieved an R² score of 0.683 on the test set, with an RMSE of about 532 mm. This means that roughly two-thirds of the variability in annual rainfall can be explained by the first five months of data, while the remaining third is not captured by the model.

Rohit Sharma

849 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
