Natural Disaster Prediction Analysis Project in R
By Rohit Sharma
Updated on Jul 31, 2025 | 14 min read | 1.14K+ views
The Disaster Prediction Analysis Project Using R focuses on understanding and predicting natural disaster risks across countries using real-world data.
By analyzing key indicators such as exposure, vulnerability, and coping capacities, we aim to model and interpret the World Risk Index (WRI) using regression techniques like Linear Regression and Random Forest.
This project involves step-by-step data preprocessing, model training, evaluation, and visualization, all within the R environment on Google Colab.
Fuel Your Data Science Ambitions: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025
Here are a few things you should know before starting this Disaster Risk Analysis Project.
The project relies on the following tools and packages; installing them up front ensures everything runs smoothly.
| Category | Name | Purpose |
| --- | --- | --- |
| Environment | R | Programming language used for data analysis and modeling |
| Platform | Google Colab (R runtime) | Cloud-based notebook environment for running R code |
| Data Handling | tidyverse | Collection of R packages for data manipulation and cleaning |
| Data Summarization | skimr | Summarizes dataset structure and missing values |
| Preprocessing | caret | Used for data preprocessing, model training, and evaluation |
| Modeling | randomForest | Builds Random Forest regression models |
| Visualization | ggplot2 (part of tidyverse) | Creates insightful charts and plots for EDA |
Different projects require different levels of effort and skills. The average duration and difficulty level are given in the table below.
| Aspect | Details |
| --- | --- |
| Estimated Duration | 4–6 hours (including preprocessing, modeling, and evaluation) |
| Difficulty Level | Easy |
| Skill Level | Beginner |
This section walks through each step of the project, showing how the libraries and functions are used and what output to expect at every stage.
To begin working with R in Google Colab, you need to switch the runtime environment from Python to R. This step ensures your notebook can execute R code seamlessly.
Steps to follow:
- Open a new notebook at colab.research.google.com.
- Go to Runtime → Change runtime type.
- Select R as the runtime language and click Save.
- The notebook cells will now execute R code.
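Once the runtime is switched, a quick check (a minimal sketch using base R only) confirms that the notebook is really executing R:
# Confirm the notebook is running R and the kernel responds
R.version.string   # prints the installed R version
Sys.time()         # prints the current time, proving the cell executed in R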
Here’s Something For You: How to Build an Uber Data Analysis Project in R
In this step, we install the essential R packages needed for data manipulation, model training, and summarization. These libraries provide the tools we’ll use throughout the analysis.
# Install packages (only once)
install.packages(c("tidyverse", "caret", "randomForest", "skimr")) # Installs all necessary libraries
# Load the libraries
library(tidyverse) # For data cleaning, transformation, and visualization
library(caret) # For preprocessing, training, and evaluating models
library(randomForest) # For building Random Forest regression models
library(skimr) # For generating summary statistics of the dataset
The output for the above step is:
Installing packages into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
This confirms that the required libraries are now installed, and we can move on to the next step.
Now that your setup is ready, this step loads the dataset into R from your Google Colab file system so you can begin analyzing it. Here’s the code for this step:
# Read your uploaded dataset
data <- read.csv("/content/world_risk_index.csv", stringsAsFactors = FALSE) # Load the CSV file as a data frame
# View the first few rows
head(data) # Display the top 6 rows to get a quick look at the data
This step will give us a sneak peek of the dataset. The output for this is given below:
|   | Region | WRI | Exposure | Vulnerability | Susceptibility | Lack.of.Coping.Capabilities | Lack.of.Adaptive.Capacities | Year | Exposure.Category | WRI.Category | Vulnerability.Category | Susceptibility.Category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Vanuatu | 32.00 | 56.33 | 56.81 | 37.14 | 79.34 | 53.96 | 2011 | Very High | Very High | High | High |
| 2 | Tonga | 29.08 | 56.04 | 51.90 | 28.94 | 81.80 | 44.97 | 2011 | Very High | Very High | Medium | Medium |
| 3 | Philippinen | 24.32 | 45.09 | 53.93 | 34.99 | 82.78 | 44.01 | 2011 | Very High | Very High | High | High |
| 4 | Salomonen | 23.51 | 36.40 | 64.60 | 44.11 | 85.95 | 63.74 | 2011 | Very High | Very High | Very High | High |
| 5 | Guatemala | 20.88 | 38.42 | 54.35 | 35.36 | 77.83 | 49.87 | 2011 | Very High | Very High | High | High |
| 6 | Bangladesch | 17.45 | 27.52 | 63.41 | 44.96 | 86.49 | 58.77 | 2011 | Very High | Very High | Very High | High |

Note that country names appear in German in this dataset (for example, Philippinen is the Philippines, Salomonen is the Solomon Islands, and Bangladesch is Bangladesh); we keep them exactly as they are stored.
Before jumping into cleaning or modeling, it’s important to understand the shape and structure of your data. This step shows the types of variables and their formats. The code for this step is given below:
# Understand structure and summary
str(data) # Displays the structure of the dataset, including column names, data types, and sample values
The output for this section is:
'data.frame': 1917 obs. of 12 variables:
 $ Region                     : chr  "Vanuatu" "Tonga" "Philippinen" "Salomonen" ...
 $ WRI                        : num  32 29.1 24.3 23.5 20.9 ...
 $ Exposure                   : num  56.3 56 45.1 36.4 38.4 ...
 $ Vulnerability              : num  56.8 51.9 53.9 64.6 54.4 ...
 $ Susceptibility             : num  37.1 28.9 35 44.1 35.4 ...
 $ Lack.of.Coping.Capabilities: num  79.3 81.8 82.8 86 77.8 ...
 $ Lack.of.Adaptive.Capacities: num  54 45 44 63.7 49.9 ...
 $ Year                       : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ Exposure.Category          : chr  "Very High" "Very High" "Very High" "Very High" ...
 $ WRI.Category               : chr  "Very High" "Very High" "Very High" "Very High" ...
 $ Vulnerability.Category     : chr  "High" "Medium" "High" "Very High" ...
 $ Susceptibility.Category    : chr  "High" "Medium" "High" "High" ...
Output Insight:
The dataset contains 1,917 observations and 12 variables. It includes a mix of numeric indicator columns (WRI, Exposure, Vulnerability, Susceptibility, Lack.of.Coping.Capabilities, Lack.of.Adaptive.Capacities), an integer Year column, and character category columns (Exposure.Category, WRI.Category, Vulnerability.Category, Susceptibility.Category).
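Since skimr was installed and loaded earlier, you can optionally get a richer summary than str() provides; this is a minimal sketch, and the exact statistics will depend on your copy of the dataset:
# Optional: per-column summary with types, missing-value counts, and basic distribution stats
skim(data)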
Here’s An Interesting R Project: Customer Segmentation Project Using R: A Step-by-Step Guide
To prepare the dataset for modeling, we remove columns that are not helpful for prediction, such as region names, categorical labels, and the year. This ensures the model works only with numeric inputs. The code for this section is:
# Remove non-numeric columns: Region, Year, and categorical labels
data_clean <- data %>%
select(where(is.numeric)) %>% # keep only numeric columns
select(-Year) # remove 'Year'
# Check the cleaned data
head(data_clean) # Display the first few rows of the cleaned, numeric-only dataset
The output for this section is given below:
|   | WRI | Exposure | Vulnerability | Susceptibility | Lack.of.Coping.Capabilities | Lack.of.Adaptive.Capacities |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 32.00 | 56.33 | 56.81 | 37.14 | 79.34 | 53.96 |
| 2 | 29.08 | 56.04 | 51.90 | 28.94 | 81.80 | 44.97 |
| 3 | 24.32 | 45.09 | 53.93 | 34.99 | 82.78 | 44.01 |
| 4 | 23.51 | 36.40 | 64.60 | 44.11 | 85.95 | 63.74 |
| 5 | 20.88 | 38.42 | 54.35 | 35.36 | 77.83 | 49.87 |
| 6 | 17.45 | 27.52 | 63.41 | 44.96 | 86.49 | 58.77 |
The above table shows that the cleaned dataset now contains only the six numeric columns (WRI plus its five component indicators); Region, Year, and the category labels have been dropped.
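Before imputing anything, it can help to see how much is actually missing. The sketch below is optional and uses only base R on the data_clean object from the previous step:
# Optional: count missing values before imputation
colSums(is.na(data_clean))         # number of NAs in each numeric column
sum(!complete.cases(data_clean))   # number of rows with at least one NA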
This step prepares the numeric data for modeling by handling missing values and standardizing the features. The code for this step is:
# Preprocess: Impute missing values + standardize (center & scale)
preprocess <- preProcess(data_clean, method = c("medianImpute", "center", "scale")) # Fill missing values with median and normalize data
# Apply the preprocessing to the dataset
data_ready <- predict(preprocess, data_clean) # Use the preprocessing rules to transform the dataset
# View the cleaned and processed data
head(data_ready) # Display the first few rows of the preprocessed dataset
The output for the above step is:
|   | WRI | Exposure | Vulnerability | Susceptibility | Lack.of.Coping.Capabilities | Lack.of.Adaptive.Capacities |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 4.402504 | 3.998157 | 0.6312845 | 0.4085375 | 0.5919142 | 0.80210787 |
| 2 | 3.876687 | 3.969837 | 0.2764045 | -0.1148547 | 0.7554905 | 0.13869581 |
| 3 | 3.019532 | 2.900515 | 0.4231268 | 0.2713066 | 0.8206550 | 0.06785314 |
| 4 | 2.873672 | 2.051893 | 1.1943221 | 0.8534208 | 1.0314423 | 1.52381753 |
| 5 | 2.400076 | 2.249156 | 0.4534831 | 0.2949231 | 0.4915076 | 0.50028859 |
| 6 | 1.782420 | 1.184717 | 1.1083125 | 0.9076748 | 1.0673493 | 1.15705914 |

All values are now z-scores, i.e., the number of standard deviations each observation sits from its column's mean.
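If you want to verify that centering and scaling worked as expected, a quick optional check is to confirm that each column of data_ready now has a mean close to 0 and a standard deviation close to 1:
# Optional: sanity-check the preprocessing
round(colMeans(data_ready), 3)       # column means should be approximately 0
round(apply(data_ready, 2, sd), 3)   # column standard deviations should be approximately 1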
You Must Build This Fun R Project: Wine Quality Prediction Project in R
To evaluate model performance accurately, we split the dataset into a training set (to build the model) and a testing set (to validate it). This ensures we test the model on unseen data. The code for this step is:
# Set seed for repeatable results
set.seed(123) # Ensures the split is the same every time you run the code
# Split: 80% for training, 20% for testing
split_index <- createDataPartition(data_ready$WRI, p = 0.8, list = FALSE) # Creates index for splitting
# Create training and testing sets
train_data <- data_ready[split_index, ] # 80% of the data
test_data <- data_ready[-split_index, ] # Remaining 20% of the data
# Check number of rows in each
nrow(train_data) # Number of training samples
nrow(test_data) # Number of testing samples
The output for the above step is:
1536
381
The dataset has been successfully split into 1,536 rows for training and 381 rows for testing.
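As an optional sanity check, you can compare the target variable across the two splits; createDataPartition stratifies on WRI, so the distributions should look broadly similar:
# Optional: confirm the WRI distribution is similar in both splits
summary(train_data$WRI)   # five-number summary plus mean for the training set
summary(test_data$WRI)    # same for the test set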
In this step, we train two regression models, Linear Regression and Random Forest, to predict the World Risk Index (WRI). We use cross-validation to ensure the models are reliable and not overfitting. Here’s the code for this section:
# Set up cross-validation method
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3) # 5-fold cross-validation repeated 3 times
# Train Linear Regression Model
lm_model <- train(WRI ~ ., data = train_data, method = "lm", trControl = control) # Linear regression model
# Train Random Forest Model
rf_model <- train(WRI ~ ., data = train_data, method = "rf", trControl = control, tuneLength = 5) # Random Forest with hyperparameter tuning
# View summaries of both models
lm_model # Shows performance metrics for linear regression
rf_model # Shows performance and tuning info for Random Forest
The output for this step is:
note: only 4 unique complexity parameters in the default grid. Truncating the grid to 4.

Linear Regression

1536 samples
   5 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 1229, 1229, 1228, 1229, 1229, 1228, ...
Resampling results:

  RMSE       Rsquared   MAE
  0.7214486  0.8050491  0.1503502

Tuning parameter 'intercept' was held constant at a value of TRUE

Random Forest

1536 samples
   5 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 1230, 1228, 1229, 1229, 1228, 1230, ...
Resampling results across tuning parameters:

  mtry  RMSE        Rsquared   MAE
  2     0.15166475  0.9811035  0.05591219
  3     0.09795349  0.9911595  0.03434792
  4     0.08634327  0.9927228  0.03009917
  5     0.09483177  0.9914011  0.03238711

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 4.
The above output means that:

Linear Regression Results: the cross-validated RMSE is about 0.72, R-squared is about 0.81, and MAE is about 0.15 (all on the standardized scale).

Random Forest Results: caret tuned mtry over the values 2 to 5 and selected mtry = 4 as the best setting, giving a cross-validated RMSE of about 0.086, R-squared of about 0.993, and MAE of about 0.030.

This means that the Random Forest model fits the cross-validation folds far more closely than Linear Regression, which suggests there are interactions or non-linear relationships between WRI and its components that a purely linear model cannot capture.
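caret also lets you compare the two trained models directly on their cross-validation resamples. The sketch below uses caret's resamples() helper; for a strictly paired comparison you would fix the resampling indices in trainControl, but it still gives a useful side-by-side view here:
# Optional: compare cross-validation metrics of both models side by side
cv_results <- resamples(list(Linear = lm_model, RandomForest = rf_model))
summary(cv_results)   # RMSE, R-squared, and MAE distributions for each model
bwplot(cv_results)    # lattice boxplots of the resampling metrics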
Here’s a R Project For Beginners: Spam Filter Project Using R with Naive Bayes – With Code
Now that both models are trained, this step uses them to make predictions on the test dataset and evaluate how well they perform using common metrics like RMSE and R². The code for this step is given below:
# Predict WRI on test data
lm_pred <- predict(lm_model, newdata = test_data)
rf_pred <- predict(rf_model, newdata = test_data)
# Evaluate performance
lm_result <- postResample(lm_pred, test_data$WRI)
rf_result <- postResample(rf_pred, test_data$WRI)
# Print results
print("Linear Regression Results:")
print(lm_result)
print("Random Forest Results:")
print(rf_result)
The output for the above step is:
[1] "Linear Regression Results:" RMSE Rsquared MAE 0.1668324 0.9730143 0.1028355 [1] "Random Forest Results:" RMSE Rsquared MAE 0.05025981 0.99776721 0.02075273 |
This means that on the unseen test set, Linear Regression reaches an RMSE of about 0.167 with an R-squared of about 0.973, while Random Forest reaches an RMSE of about 0.050 with an R-squared of about 0.998. The Random Forest's test error is roughly a third of the linear model's, so it not only fits the training data better but also generalizes better.
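To see this difference rather than just read it from the metrics, you can line the predictions up against the actual test values and plot them. This is an optional sketch built on the objects created above (test_data, lm_pred, rf_pred); note that WRI is still on the standardized scale:
# Optional: inspect predictions next to the actual (scaled) WRI values
comparison <- data.frame(
  Actual = test_data$WRI,
  Linear = lm_pred,
  RandomForest = rf_pred
)
head(comparison)

# Predicted vs actual for the Random Forest; points near the dashed line indicate accurate predictions
ggplot(comparison, aes(x = Actual, y = RandomForest)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Random Forest: Predicted vs Actual WRI (Test Set)",
       x = "Actual WRI (scaled)", y = "Predicted WRI (scaled)") +
  theme_minimal()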
This step helps you understand which variables had the biggest impact on predicting disaster risk. Random Forest automatically ranks features based on how useful they are in making accurate predictions. The code for this step is:
# Get feature importance from Random Forest
importance <- varImp(rf_model) # Calculate how important each variable was in the model
# View importance scores
print(importance) # Print the scores to see which features contributed most
# Plot the top variables
plot(importance, top = 10, main = "Top 10 Important Features (Random Forest)") # Visualize the top features
The output for this section is:
rf variable importance

                            Overall
Exposure                    100.000
Vulnerability                 6.611
Lack.of.Coping.Capabilities   6.256
Susceptibility                4.962
Lack.of.Adaptive.Capacities   0.000
The importance scores (and the accompanying plot) show that Exposure dominates the model, with a scaled importance of 100, while Vulnerability, Lack of Coping Capabilities, and Susceptibility contribute only marginally and Lack of Adaptive Capacities ranks lowest.
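If you prefer the randomForest package's own view of importance, you can also plot it from the underlying model object. This is an optional sketch; with caret's default settings only the node-purity measure is available:
# Optional: importance plot from the underlying randomForest model
varImpPlot(rf_model$finalModel, main = "Random Forest Variable Importance")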
New to R? Here’s A Cool Project: Car Data Analysis Project Using R
This step helps you understand how the World Risk Index (WRI) values are spread across countries. A histogram gives a clear picture of how frequently different WRI levels occur. The following code is used in this step:
ggplot(data, aes(x = WRI)) +
geom_histogram(fill = "steelblue", color = "white", bins = 30) + # Creates a histogram with 30 bars
labs(title = "Distribution of World Risk Index",
x = "WRI", y = "Number of Countries") + # Adds labels
theme_minimal() # Uses a clean theme for better visual appeal
Running this code produces a histogram of the World Risk Index across countries. The distribution is right-skewed: most countries fall in the low-to-moderate WRI range, while a small tail of countries, such as Vanuatu and Tonga seen earlier, record very high risk scores.
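If you want to keep the chart, for example to include it in a report, ggplot2's ggsave() writes the most recent plot to a file. A minimal sketch follows; the file name is just an example:
# Optional: save the last ggplot to the Colab working directory
ggsave("wri_histogram.png", width = 7, height = 4, dpi = 150)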
This scatter plot helps us learn how Exposure and Vulnerability relate to each other, and how their combination affects the World Risk Index (WRI). The points are colored by WRI levels to highlight risk. The code for this step is:
ggplot(data, aes(x = Exposure, y = Vulnerability, color = WRI)) +
geom_point(size = 2, alpha = 0.7) + # Smaller, semi-transparent points
scale_color_gradient(low = "blue", high = "red") + # Blue = low WRI, Red = high WRI
labs(title = "Exposure vs Vulnerability (Colored by WRI)",
x = "Exposure", y = "Vulnerability") +
theme_minimal()
Running this code produces a scatter plot of Exposure against Vulnerability, with each point colored by WRI. The points shift from blue to red toward the upper-right corner, showing that countries combining high Exposure with high Vulnerability receive the highest risk scores, with Exposure driving the most extreme values, which is consistent with the feature-importance results above.
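To put numbers on what the scatter plot suggests, you can compute the correlation matrix of the numeric columns. This optional sketch reuses the data_clean object from earlier:
# Optional: correlations between WRI and its component indicators
round(cor(data_clean, use = "pairwise.complete.obs"), 2)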
In this Disaster Prediction Analysis project, we used R in Google Colab to predict the World Risk Index (WRI) from key factors such as Exposure, Vulnerability, and Coping Capacities.
After preprocessing the dataset (removing non-numeric columns, handling missing values, and scaling), we split the data into training and test sets.
We then trained both Linear Regression and Random Forest models. The Random Forest model performed significantly better, with an RMSE of 0.0502 and an R-squared value of 0.9977, indicating it explained over 99% of the variation in disaster risk scores.
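If you want to reuse the trained model outside this notebook, one option (a sketch, assuming the objects above are still in memory; the file names are just examples) is to save both the model and the preprocessing object with saveRDS(), since new data must be transformed the same way before prediction:
# Optional: persist the model and the preprocessing recipe for later reuse
saveRDS(rf_model, "rf_wri_model.rds")        # trained caret Random Forest model
saveRDS(preprocess, "wri_preprocess.rds")    # preProcess object used to scale the inputs

# In a later session:
# rf_model   <- readRDS("rf_wri_model.rds")
# preprocess <- readRDS("wri_preprocess.rds")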
Colab Link:
https://colab.research.google.com/drive/17qLyoshVly8Vqa-fWxXk_ugDb5mMeM_H#scrollTo=HC6KPoLgtbkZ