Natural Disaster Prediction Analysis Project in R
By Rohit Sharma
Updated on Jul 31, 2025 | 14 min read | 1.14K+ views
The Disaster Prediction Analysis Project Using R focuses on understanding and predicting natural disaster risks across countries using real-world data.
By analyzing key indicators such as exposure, vulnerability, and coping capacities, we aim to model and interpret the World Risk Index (WRI) using regression techniques like Linear Regression and Random Forest.
This project involves step-by-step data preprocessing, model training, evaluation, and visualization, all within the R environment on Google Colab.
Fuel Your Data Science Ambitions: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025
Here are a few things you should know before starting this Disaster Risk Analysis Project.
The project relies on the following tools and packages; installing them up front ensures everything runs smoothly.
| Category | Name | Purpose |
| --- | --- | --- |
| Environment | R | Programming language used for data analysis and modeling |
| Platform | Google Colab (R runtime) | Cloud-based notebook environment for running R code |
| Data Handling | tidyverse | Collection of R packages for data manipulation and cleaning |
| Data Summarization | skimr | Summarizes dataset structure and missing values |
| Preprocessing | caret | Used for data preprocessing, model training, and evaluation |
| Modeling | randomForest | Builds Random Forest regression models |
| Visualization | ggplot2 (part of tidyverse) | Creates insightful charts and plots for EDA |
Different projects require different levels of effort and skills. The average duration and difficulty level are given in the table below.
| Aspect | Details |
| --- | --- |
| Estimated Duration | 4–6 hours (including preprocessing, modeling, and evaluation) |
| Difficulty Level | Easy |
| Skill Level | Beginner |
This section walks through each step of the project, showing how the libraries and functions are used and what output to expect at every stage.
To begin working with R in Google Colab, you need to switch the runtime environment from Python to R. This step ensures your notebook can execute R code seamlessly.
Steps to follow:
- Open a new notebook at colab.research.google.com.
- Go to Runtime → Change runtime type.
- Select R as the runtime language and click Save.
- The notebook cells will now execute R code.
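Once the runtime is switched, a quick check (a minimal sketch using base R only) confirms that the notebook is really executing R:
# Confirm the notebook is running R and the kernel responds
R.version.string   # prints the installed R version
Sys.time()         # prints the current time, proving the cell executed in R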
Here’s Something For You: How to Build an Uber Data Analysis Project in R
In this step, we install the essential R packages needed for data manipulation, model training, and summarization. These libraries provide the tools we’ll use throughout the analysis.
# Install packages (only once)
install.packages(c("tidyverse", "caret", "randomForest", "skimr")) # Installs all necessary libraries
# Load the libraries
library(tidyverse) # For data cleaning, transformation, and visualization
library(caret) # For preprocessing, training, and evaluating models
library(randomForest) # For building Random Forest regression models
library(skimr) # For generating summary statistics of the dataset
The output for the above step is:
Installing packages into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
This confirms that the required libraries are now installed, and we can move on to the next step.
Now that your setup is ready, this step loads the dataset into R from your Google Colab file system so you can begin analyzing it. Here’s the code for this step:
# Read your uploaded dataset
data <- read.csv("/content/world_risk_index.csv", stringsAsFactors = FALSE) # Load the CSV file as a data frame
# View the first few rows
head(data) # Display the top 6 rows to get a quick look at the data
This step will give us a sneak peek of the dataset. The output for this is given below:
|   | Region | WRI | Exposure | Vulnerability | Susceptibility | Lack.of.Coping.Capabilities | Lack.of.Adaptive.Capacities | Year | Exposure.Category | WRI.Category | Vulnerability.Category | Susceptibility.Category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Vanuatu | 32.00 | 56.33 | 56.81 | 37.14 | 79.34 | 53.96 | 2011 | Very High | Very High | High | High |
| 2 | Tonga | 29.08 | 56.04 | 51.90 | 28.94 | 81.80 | 44.97 | 2011 | Very High | Very High | Medium | Medium |
| 3 | Philippinen | 24.32 | 45.09 | 53.93 | 34.99 | 82.78 | 44.01 | 2011 | Very High | Very High | High | High |
| 4 | Salomonen | 23.51 | 36.40 | 64.60 | 44.11 | 85.95 | 63.74 | 2011 | Very High | Very High | Very High | High |
| 5 | Guatemala | 20.88 | 38.42 | 54.35 | 35.36 | 77.83 | 49.87 | 2011 | Very High | Very High | High | High |
| 6 | Bangladesch | 17.45 | 27.52 | 63.41 | 44.96 | 86.49 | 58.77 | 2011 | Very High | Very High | Very High | High |

Note that country names appear in German in this dataset (for example, Philippinen is the Philippines, Salomonen is the Solomon Islands, and Bangladesch is Bangladesh); we keep them exactly as they are stored.
Before jumping into cleaning or modeling, it’s important to understand the shape and structure of your data. This step shows the types of variables and their formats. The code for this step is given below:
# Understand structure and summary
str(data) # Displays the structure of the dataset, including column names, data types, and sample values
The output for this section is:
'data.frame': 1917 obs. of 12 variables:
 $ Region                     : chr  "Vanuatu" "Tonga" "Philippinen" "Salomonen" ...
 $ WRI                        : num  32 29.1 24.3 23.5 20.9 ...
 $ Exposure                   : num  56.3 56 45.1 36.4 38.4 ...
 $ Vulnerability              : num  56.8 51.9 53.9 64.6 54.4 ...
 $ Susceptibility             : num  37.1 28.9 35 44.1 35.4 ...
 $ Lack.of.Coping.Capabilities: num  79.3 81.8 82.8 86 77.8 ...
 $ Lack.of.Adaptive.Capacities: num  54 45 44 63.7 49.9 ...
 $ Year                       : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ Exposure.Category          : chr  "Very High" "Very High" "Very High" "Very High" ...
 $ WRI.Category               : chr  "Very High" "Very High" "Very High" "Very High" ...
 $ Vulnerability.Category     : chr  "High" "Medium" "High" "Very High" ...
 $ Susceptibility.Category    : chr  "High" "Medium" "High" "High" ...
Output Insight:
The dataset contains 1,917 observations and 12 variables. It includes a mix of numeric indicator columns (WRI, Exposure, Vulnerability, Susceptibility, Lack.of.Coping.Capabilities, Lack.of.Adaptive.Capacities), an integer Year column, and character category columns (Exposure.Category, WRI.Category, Vulnerability.Category, Susceptibility.Category).
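Since skimr was installed and loaded earlier, you can optionally get a richer summary than str() provides; this is a minimal sketch, and the exact statistics will depend on your copy of the dataset:
# Optional: per-column summary with types, missing-value counts, and basic distribution stats
skim(data)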
Here’s An Interesting R Project: Customer Segmentation Project Using R: A Step-by-Step Guide
To prepare the dataset for modeling, we remove columns that are not helpful for prediction, such as region names, categorical labels, and the year. This ensures the model works only with numeric inputs. The code for this section is:
# Remove non-numeric columns: Region, Year, and categorical labels
data_clean <- data %>%
select(where(is.numeric)) %>% # keep only numeric columns
select(-Year) # remove 'Year'
# Check the cleaned data
head(data_clean) # Display the first few rows of the cleaned, numeric-only dataset
The output for this section is given below:
|   | WRI | Exposure | Vulnerability | Susceptibility | Lack.of.Coping.Capabilities | Lack.of.Adaptive.Capacities |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 32.00 | 56.33 | 56.81 | 37.14 | 79.34 | 53.96 |
| 2 | 29.08 | 56.04 | 51.90 | 28.94 | 81.80 | 44.97 |
| 3 | 24.32 | 45.09 | 53.93 | 34.99 | 82.78 | 44.01 |
| 4 | 23.51 | 36.40 | 64.60 | 44.11 | 85.95 | 63.74 |
| 5 | 20.88 | 38.42 | 54.35 | 35.36 | 77.83 | 49.87 |
| 6 | 17.45 | 27.52 | 63.41 | 44.96 | 86.49 | 58.77 |
The above table shows that the cleaned dataset now contains only the six numeric columns (WRI plus its five component indicators); Region, Year, and the category labels have been dropped.
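Before imputing anything, it can help to see how much is actually missing. The sketch below is optional and uses only base R on the data_clean object from the previous step:
# Optional: count missing values before imputation
colSums(is.na(data_clean))         # number of NAs in each numeric column
sum(!complete.cases(data_clean))   # number of rows with at least one NA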
This step prepares the numeric data for modeling by handling missing values and standardizing the features. The code for this step is:
# Preprocess: Impute missing values + standardize (center & scale)
preprocess <- preProcess(data_clean, method = c("medianImpute", "center", "scale")) # Fill missing values with median and normalize data
# Apply the preprocessing to the dataset
data_ready <- predict(preprocess, data_clean) # Use the preprocessing rules to transform the dataset
# View the cleaned and processed data
head(data_ready) # Display the first few rows of the preprocessed dataset
The output for the above step is:
|   | WRI | Exposure | Vulnerability | Susceptibility | Lack.of.Coping.Capabilities | Lack.of.Adaptive.Capacities |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 4.402504 | 3.998157 | 0.6312845 | 0.4085375 | 0.5919142 | 0.80210787 |
| 2 | 3.876687 | 3.969837 | 0.2764045 | -0.1148547 | 0.7554905 | 0.13869581 |
| 3 | 3.019532 | 2.900515 | 0.4231268 | 0.2713066 | 0.8206550 | 0.06785314 |
| 4 | 2.873672 | 2.051893 | 1.1943221 | 0.8534208 | 1.0314423 | 1.52381753 |
| 5 | 2.400076 | 2.249156 | 0.4534831 | 0.2949231 | 0.4915076 | 0.50028859 |
| 6 | 1.782420 | 1.184717 | 1.1083125 | 0.9076748 | 1.0673493 | 1.15705914 |

All values are now z-scores, i.e., the number of standard deviations each observation sits from its column's mean.
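If you want to verify that centering and scaling worked as expected, a quick optional check is to confirm that each column of data_ready now has a mean close to 0 and a standard deviation close to 1:
# Optional: sanity-check the preprocessing
round(colMeans(data_ready), 3)       # column means should be approximately 0
round(apply(data_ready, 2, sd), 3)   # column standard deviations should be approximately 1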
You Must Build This Fun R Project: Wine Quality Prediction Project in R
To evaluate model performance accurately, we split the dataset into a training set (to build the model) and a testing set (to validate it). This ensures we test the model on unseen data. The code for this step is:
# Set seed for repeatable results
set.seed(123) # Ensures the split is the same every time you run the code
# Split: 80% for training, 20% for testing
split_index <- createDataPartition(data_ready$WRI, p = 0.8, list = FALSE) # Creates index for splitting
# Create training and testing sets
train_data <- data_ready[split_index, ] # 80% of the data
test_data <- data_ready[-split_index, ] # Remaining 20% of the data
# Check number of rows in each
nrow(train_data) # Number of training samples
nrow(test_data) # Number of testing samples
The output for the above step is:
1536
381
The dataset has been successfully split into 1,536 rows for training and 381 rows for testing.
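As an optional sanity check, you can compare the target variable across the two splits; createDataPartition stratifies on WRI, so the distributions should look broadly similar:
# Optional: confirm the WRI distribution is similar in both splits
summary(train_data$WRI)   # five-number summary plus mean for the training set
summary(test_data$WRI)    # same for the test set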
In this step, we train two regression models, Linear Regression and Random Forest, to predict the World Risk Index (WRI). We use cross-validation to ensure the models are reliable and not overfitting. Here’s the code for this section:
# Set up cross-validation method
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3) # 5-fold cross-validation repeated 3 times
# Train Linear Regression Model
lm_model <- train(WRI ~ ., data = train_data, method = "lm", trControl = control) # Linear regression model
# Train Random Forest Model
rf_model <- train(WRI ~ ., data = train_data, method = "rf", trControl = control, tuneLength = 5) # Random Forest with hyperparameter tuning
# View summaries of both models
lm_model # Shows performance metrics for linear regression
rf_model # Shows performance and tuning info for Random Forest
The output for this step is:
note: only 4 unique complexity parameters in the default grid. Truncating the grid to 4.

Linear Regression

1536 samples
   5 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 1229, 1229, 1228, 1229, 1229, 1228, ...
Resampling results:

  RMSE       Rsquared   MAE
  0.7214486  0.8050491  0.1503502

Tuning parameter 'intercept' was held constant at a value of TRUE

Random Forest

1536 samples
   5 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 1230, 1228, 1229, 1229, 1228, 1230, ...
Resampling results across tuning parameters:

  mtry  RMSE        Rsquared   MAE
  2     0.15166475  0.9811035  0.05591219
  3     0.09795349  0.9911595  0.03434792
  4     0.08634327  0.9927228  0.03009917
  5     0.09483177  0.9914011  0.03238711

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 4.
The above output means that:

Linear Regression Results: the cross-validated RMSE is about 0.72, R-squared is about 0.81, and MAE is about 0.15 (all on the standardized scale).

Random Forest Results: caret tuned mtry over the values 2 to 5 and selected mtry = 4 as the best setting, giving a cross-validated RMSE of about 0.086, R-squared of about 0.993, and MAE of about 0.030.

This means that the Random Forest model fits the cross-validation folds far more closely than Linear Regression, which suggests there are interactions or non-linear relationships between WRI and its components that a purely linear model cannot capture.
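caret also lets you compare the two trained models directly on their cross-validation resamples. The sketch below uses caret's resamples() helper; for a strictly paired comparison you would fix the resampling indices in trainControl, but it still gives a useful side-by-side view here:
# Optional: compare cross-validation metrics of both models side by side
cv_results <- resamples(list(Linear = lm_model, RandomForest = rf_model))
summary(cv_results)   # RMSE, R-squared, and MAE distributions for each model
bwplot(cv_results)    # lattice boxplots of the resampling metrics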
Here’s a R Project For Beginners: Spam Filter Project Using R with Naive Bayes – With Code
Now that both models are trained, this step uses them to make predictions on the test dataset and evaluate how well they perform using common metrics like RMSE and R². The code for this step is given below:
# Predict WRI on test data
lm_pred <- predict(lm_model, newdata = test_data)
rf_pred <- predict(rf_model, newdata = test_data)
# Evaluate performance
lm_result <- postResample(lm_pred, test_data$WRI)
rf_result <- postResample(rf_pred, test_data$WRI)
# Print results
print("Linear Regression Results:")
print(lm_result)
print("Random Forest Results:")
print(rf_result)
The output for the above step is:
[1] "Linear Regression Results:" RMSE Rsquared MAE 0.1668324 0.9730143 0.1028355 [1] "Random Forest Results:" RMSE Rsquared MAE 0.05025981 0.99776721 0.02075273 |
This means that on the unseen test set, Linear Regression reaches an RMSE of about 0.167 with an R-squared of about 0.973, while Random Forest reaches an RMSE of about 0.050 with an R-squared of about 0.998. The Random Forest's test error is roughly a third of the linear model's, so it not only fits the training data better but also generalizes better.
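To see this difference rather than just read it from the metrics, you can line the predictions up against the actual test values and plot them. This is an optional sketch built on the objects created above (test_data, lm_pred, rf_pred); note that WRI is still on the standardized scale:
# Optional: inspect predictions next to the actual (scaled) WRI values
comparison <- data.frame(
  Actual = test_data$WRI,
  Linear = lm_pred,
  RandomForest = rf_pred
)
head(comparison)

# Predicted vs actual for the Random Forest; points near the dashed line indicate accurate predictions
ggplot(comparison, aes(x = Actual, y = RandomForest)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Random Forest: Predicted vs Actual WRI (Test Set)",
       x = "Actual WRI (scaled)", y = "Predicted WRI (scaled)") +
  theme_minimal()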
This step helps you understand which variables had the biggest impact on predicting disaster risk. Random Forest automatically ranks features based on how useful they are in making accurate predictions. The code for this step is:
# Get feature importance from Random Forest
importance <- varImp(rf_model) # Calculate how important each variable was in the model
# View importance scores
print(importance) # Print the scores to see which features contributed most
# Plot the top variables
plot(importance, top = 10, main = "Top 10 Important Features (Random Forest)") # Visualize the top features
The output for this section is:
rf variable importance

                            Overall
Exposure                    100.000
Vulnerability                 6.611
Lack.of.Coping.Capabilities   6.256
Susceptibility                4.962
Lack.of.Adaptive.Capacities   0.000
The importance scores (and the accompanying plot) show that Exposure dominates the model, with a scaled importance of 100, while Vulnerability, Lack of Coping Capabilities, and Susceptibility contribute only marginally and Lack of Adaptive Capacities ranks lowest.
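If you prefer the randomForest package's own view of importance, you can also plot it from the underlying model object. This is an optional sketch; with caret's default settings only the node-purity measure is available:
# Optional: importance plot from the underlying randomForest model
varImpPlot(rf_model$finalModel, main = "Random Forest Variable Importance")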
New to R? Here’s A Cool Project: Car Data Analysis Project Using R
This step helps you understand how the World Risk Index (WRI) values are spread across countries. A histogram gives a clear picture of how frequently different WRI levels occur. The following code is used in this step:
ggplot(data, aes(x = WRI)) +
geom_histogram(fill = "steelblue", color = "white", bins = 30) + # Creates a histogram with 30 bars
labs(title = "Distribution of World Risk Index",
x = "WRI", y = "Number of Countries") + # Adds labels
theme_minimal() # Uses a clean theme for better visual appeal
Running this code produces a histogram of the World Risk Index across countries. The distribution is right-skewed: most countries fall in the low-to-moderate WRI range, while a small tail of countries, such as Vanuatu and Tonga seen earlier, record very high risk scores.
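If you want to keep the chart, for example to include it in a report, ggplot2's ggsave() writes the most recent plot to a file. A minimal sketch follows; the file name is just an example:
# Optional: save the last ggplot to the Colab working directory
ggsave("wri_histogram.png", width = 7, height = 4, dpi = 150)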
This scatter plot helps us learn how Exposure and Vulnerability relate to each other, and how their combination affects the World Risk Index (WRI). The points are colored by WRI levels to highlight risk. The code for this step is:
ggplot(data, aes(x = Exposure, y = Vulnerability, color = WRI)) +
geom_point(size = 2, alpha = 0.7) + # Smaller, semi-transparent points
scale_color_gradient(low = "blue", high = "red") + # Blue = low WRI, Red = high WRI
labs(title = "Exposure vs Vulnerability (Colored by WRI)",
x = "Exposure", y = "Vulnerability") +
theme_minimal()
Running this code produces a scatter plot of Exposure against Vulnerability, with each point colored by WRI. The points shift from blue to red toward the upper-right corner, showing that countries combining high Exposure with high Vulnerability receive the highest risk scores, with Exposure driving the most extreme values, which is consistent with the feature-importance results above.
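To put numbers on what the scatter plot suggests, you can compute the correlation matrix of the numeric columns. This optional sketch reuses the data_clean object from earlier:
# Optional: correlations between WRI and its component indicators
round(cor(data_clean, use = "pairwise.complete.obs"), 2)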
In this Disaster Prediction Analysis project, we used R in Google Colab to predict the World Risk Index (WRI) from key factors such as Exposure, Vulnerability, and Coping Capacities.
After preprocessing the dataset (removing non-numeric columns, handling missing values, and scaling), we split the data into training and test sets.
We then trained both Linear Regression and Random Forest models. The Random Forest model performed significantly better, with an RMSE of 0.0502 and an R-squared value of 0.9977, indicating it explained over 99% of the variation in disaster risk scores.
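If you want to reuse the trained model outside this notebook, one option (a sketch, assuming the objects above are still in memory; the file names are just examples) is to save both the model and the preprocessing object with saveRDS(), since new data must be transformed the same way before prediction:
# Optional: persist the model and the preprocessing recipe for later reuse
saveRDS(rf_model, "rf_wri_model.rds")        # trained caret Random Forest model
saveRDS(preprocess, "wri_preprocess.rds")    # preProcess object used to scale the inputs

# In a later session:
# rf_model   <- readRDS("rf_wri_model.rds")
# preprocess <- readRDS("wri_preprocess.rds")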
Colab Link:
https://colab.research.google.com/drive/17qLyoshVly8Vqa-fWxXk_ugDb5mMeM_H#scrollTo=HC6KPoLgtbkZ