Natural Disaster Prediction Analysis Project in R

By Rohit Sharma

Updated on Jul 31, 2025 | 14 min read | 1.14K+ views


The Disaster Prediction Analysis Project Using R focuses on understanding and predicting natural disaster risks across countries using real-world data. 

By analyzing key indicators such as exposure, vulnerability, and coping capacities, we aim to model and interpret the World Risk Index (WRI) using regression techniques like Linear Regression and Random Forest. 

This project involves step-by-step data preprocessing, model training, evaluation, and visualization, all within the R environment on Google Colab. 



Things to Know Before You Start This Disaster Risk Analysis Project

Here are a few things you should know before starting this Disaster Risk Analysis Project:

  • Basic R programming: You should know how to run code, load libraries, and read data in R.
  • Understanding of data frames: You need to be familiar with rows, columns, and how to filter/select data.
  • Concept of target vs predictors: You need to understand what you're trying to predict (WRI) and which variables influence it.
  • Regression basics: You must know what regression means and the difference between linear and non-linear models.
  • Preprocessing knowledge: You need to understand why we handle missing data, scale features, and split datasets.
  • Model evaluation metrics: You should be aware of RMSE and R² to compare model performance (a short sketch of how these are computed appears right after this list).
  • Visualization logic: You should also understand why plots are used to explore trends and support insights.
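If RMSE and R² are new to you, here is a minimal sketch of how they are computed from a vector of actual values and a vector of predictions. The two vectors below are made-up illustrations, not project data, and note that caret's postResample() reports R² as a squared correlation, which can differ slightly from the formula shown here.

# Toy vectors for illustration only (not taken from the project dataset)
actual    <- c(10, 12, 15, 20)
predicted <- c(11, 11, 16, 18)

# RMSE: square root of the mean squared prediction error
rmse <- sqrt(mean((actual - predicted)^2))   # ~1.32 for these toy numbers

# R-squared: share of the variance in 'actual' explained by the predictions
r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)   # ~0.88 here

rmse
r2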


What Tools and R Packages Will You Be Working With?

The project uses the following tools and R packages; installing them up front ensures everything runs smoothly.

  • Environment: R, the programming language used for data analysis and modeling
  • Platform: Google Colab (R runtime), a cloud-based notebook environment for running R code
  • Data handling: tidyverse, a collection of R packages for data manipulation and cleaning
  • Data summarization: skimr, which summarizes dataset structure and missing values
  • Preprocessing: caret, used for data preprocessing, model training, and evaluation
  • Modeling: randomForest, which builds Random Forest regression models
  • Visualization: ggplot2 (part of the tidyverse), which creates charts and plots for EDA

Project Duration, Difficulty, and Skill Level Required

Different projects require different levels of effort and skills. The average duration and difficulty level are given in the table below.

  • Estimated Duration: 4–6 hours (including preprocessing, modeling, and evaluation)
  • Difficulty Level: Easy
  • Skill Level: Beginner

Step-by-Step Breakdown of This Disaster Risk Analysis Project

This section will explain the various steps involved in this project, giving you a better overview of how various libraries and functions work. Let’s look at each step and what output we will get.

Step 1: Configure Google Colab to Run R

To begin working with R in Google Colab, you need to switch the runtime environment from Python to R. This step ensures your notebook can execute R code seamlessly.

Steps to follow:

  1. Launch a new notebook on Google Colab
  2. Go to the Runtime tab in the menu bar
  3. Choose Change runtime type
  4. Set the language option to R from the dropdown
  5. Click Save to apply the configuration
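Once the runtime is saved, an optional sanity check is to run a single line of R in the first cell; if it prints an R version string, the notebook is configured correctly (the exact version depends on what Colab currently ships):

# Optional check: confirm the notebook is running R, not Python
R.version.string   # Prints something like "R version 4.x.x (...)"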


Step 2: Install and Load Required R Packages

In this step, we install the essential R packages needed for data manipulation, model training, and summarization. These libraries provide the tools we’ll use throughout the analysis.

# Install packages (only once)
install.packages(c("tidyverse", "caret", "randomForest", "skimr"))  # Installs all necessary libraries

# Load the libraries
library(tidyverse)      # For data cleaning, transformation, and visualization
library(caret)          # For preprocessing, training, and evaluating models
library(randomForest)   # For building Random Forest regression models
library(skimr)          # For generating summary statistics of the dataset

The output for the above step is:

Installing packages into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

This confirms that the required libraries are now installed, and we can move on with the next step.

Step 3: Load the Dataset into Your R Environment

Now that your setup is ready, this step loads the dataset into R from your Google Colab file system so you can begin analyzing it. Here’s the code for this step:

# Read your uploaded dataset
data <- read.csv("/content/world_risk_index.csv", stringsAsFactors = FALSE)  # Load the CSV file as a data frame

# View the first few rows
head(data)  # Display the top 6 rows to get a quick look at the data

This step will give us a sneak peek of the dataset. The output for this is given below:

 

  Region       WRI   Exposure Vulnerability Susceptibility Lack.of.Coping.Capabilities Lack.of.Adaptive.Capacities Year Exposure.Category WRI.Category Vulnerability.Category Susceptibility.Category
1 Vanuatu      32.00 56.33    56.81         37.14          79.34                       53.96                       2011 Very High         Very High    High                   High
2 Tonga        29.08 56.04    51.90         28.94          81.80                       44.97                       2011 Very High         Very High    Medium                 Medium
3 Philippinen  24.32 45.09    53.93         34.99          82.78                       44.01                       2011 Very High         Very High    High                   High
4 Salomonen    23.51 36.40    64.60         44.11          85.95                       63.74                       2011 Very High         Very High    Very High              High
5 Guatemala    20.88 38.42    54.35         35.36          77.83                       49.87                       2011 Very High         Very High    High                   High
6 Bangladesch  17.45 27.52    63.41         44.96          86.49                       58.77                       2011 Very High         Very High    Very High              High

Note that country names appear in German in the source dataset (for example, Philippinen is the Philippines, Salomonen is the Solomon Islands, and Bangladesch is Bangladesh).

Step 4: Examine the Structure of the Dataset

Before jumping into cleaning or modeling, it’s important to understand the shape and structure of your data. This step shows the types of variables and their formats. The code for this step is given below:

# Understand structure and summary
str(data)  # Displays the structure of the dataset, including column names, data types, and sample values

The output for this section is:

'data.frame': 1917 obs. of  12 variables:
 $ Region                     : chr  "Vanuatu" "Tonga" "Philippinen" "Salomonen" ...
 $ WRI                        : num  32 29.1 24.3 23.5 20.9 ...
 $ Exposure                   : num  56.3 56 45.1 36.4 38.4 ...
 $ Vulnerability              : num  56.8 51.9 53.9 64.6 54.4 ...
 $ Susceptibility             : num  37.1 28.9 35 44.1 35.4 ...
 $ Lack.of.Coping.Capabilities: num  79.3 81.8 82.8 86 77.8 ...
 $ Lack.of.Adaptive.Capacities: num  54 45 44 63.7 49.9 ...
 $ Year                       : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ Exposure.Category          : chr  "Very High" "Very High" "Very High" "Very High" ...
 $ WRI.Category               : chr  "Very High" "Very High" "Very High" "Very High" ...
 $ Vulnerability.Category     : chr  "High" "Medium" "High" "Very High" ...
 $ Susceptibility.Category    : chr  "High" "Medium" "High" "High" ...

Output Insight:
The dataset contains 1917 observations and 12 variables. It includes a mix of:

  • Character columns (like Region, WRI.Category)
  • Numeric columns (like WRI, Exposure, Vulnerability)
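Since skimr was already loaded in Step 2, an optional way to get a richer overview (column types, completeness, and basic statistics in one place) is skim(). This is a supplementary check, not a required part of the pipeline:

# Optional: detailed per-column summary, including missing-value counts
skim(data)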


Step 5: Clean the Data by Keeping Only Numeric Features

To prepare the dataset for modeling, we remove columns that are not helpful for prediction, such as region names, categorical labels, and the year. This ensures the model works only with numeric inputs. The code for this step is:

# Remove non-numeric columns: Region, Year, and categorical labels
data_clean <- data %>%
  select(where(is.numeric)) %>%     # keep only numeric columns
  select(-Year)                     # remove 'Year'

# Check the cleaned data
head(data_clean)  # Display the first few rows of the cleaned, numeric-only dataset

The output for this section is given below:

 

  WRI   Exposure Vulnerability Susceptibility Lack.of.Coping.Capabilities Lack.of.Adaptive.Capacities
1 32.00 56.33    56.81         37.14          79.34                       53.96
2 29.08 56.04    51.90         28.94          81.80                       44.97
3 24.32 45.09    53.93         34.99          82.78                       44.01
4 23.51 36.40    64.60         44.11          85.95                       63.74
5 20.88 38.42    54.35         35.36          77.83                       49.87
6 17.45 27.52    63.41         44.96          86.49                       58.77

The above table shows that:

  • Removed non-numeric columns (Region, Year, and all categorical labels), which are not suitable for regression modeling.
  • Kept only the numeric variables needed for prediction.
  • The final dataset now includes 6 numeric columns: the target (WRI) and 5 predictors (a quick missing-value check on these columns follows below).
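Before imputing values in the next step, it can help to see how many missing values each numeric column actually contains. A minimal optional check:

# Optional: count missing values per column before imputation
colSums(is.na(data_clean))   # Non-zero counts indicate columns that need imputation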

Step 6: Preprocess the Data (Imputation and Scaling)

This step prepares the numeric data for modeling by handling missing values and standardizing the features. The code for this step is:

# Preprocess: Impute missing values + standardize (center & scale)
preprocess <- preProcess(data_clean, method = c("medianImpute", "center", "scale"))  # Fill missing values with median and normalize data

# Apply the preprocessing to the dataset
data_ready <- predict(preprocess, data_clean)  # Use the preprocessing rules to transform the dataset

# View the cleaned and processed data
head(data_ready)  # Display the first few rows of the preprocessed dataset

The output for the above step is:

 

  WRI      Exposure Vulnerability Susceptibility Lack.of.Coping.Capabilities Lack.of.Adaptive.Capacities
1 4.402504 3.998157  0.6312845     0.4085375     0.5919142                   0.80210787
2 3.876687 3.969837  0.2764045    -0.1148547     0.7554905                   0.13869581
3 3.019532 2.900515  0.4231268     0.2713066     0.8206550                   0.06785314
4 2.873672 2.051893  1.1943221     0.8534208     1.0314423                   1.52381753
5 2.400076 2.249156  0.4534831     0.2949231     0.4915076                   0.50028859
6 1.782420 1.184717  1.1083125     0.9076748     1.0673493                   1.15705914

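As an optional sanity check after preprocessing, the centered and scaled columns should now have means close to 0 and standard deviations close to 1. A minimal verification sketch:

# Optional: confirm that centering and scaling worked
round(colMeans(data_ready), 3)      # Column means should be approximately 0
round(sapply(data_ready, sd), 3)    # Standard deviations should be approximately 1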

Step 7: Split the Data into Training and Testing Sets

To evaluate model performance accurately, we split the dataset into a training set (to build the model) and a testing set (to validate it). This ensures we test the model on unseen data. The code for this step is:

# Set seed for repeatable results
set.seed(123)  # Ensures the split is the same every time you run the code

# Split: 80% for training, 20% for testing
split_index <- createDataPartition(data_ready$WRI, p = 0.8, list = FALSE)  # Creates index for splitting

# Create training and testing sets
train_data <- data_ready[split_index, ]  # 80% of the data
test_data  <- data_ready[-split_index, ] # Remaining 20% of the data

# Check number of rows in each
nrow(train_data)  # Number of training samples
nrow(test_data)   # Number of testing samples

The output for the above step is:

1536

381

The dataset has been successfully split into 1,536 rows for training and 381 rows for testing.
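For context, createDataPartition() samples so that the distribution of WRI stays similar in the training and testing sets. A plain random 80/20 split in base R would look like the sketch below; it is shown only for comparison, and the project itself uses the caret version above:

# For comparison only: a simple random 80/20 split in base R
set.seed(123)
idx <- sample(seq_len(nrow(data_ready)), size = floor(0.8 * nrow(data_ready)))
train_alt <- data_ready[idx, ]    # 80% of rows, chosen at random
test_alt  <- data_ready[-idx, ]   # Remaining 20%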

Step 8: Train Linear and Random Forest Regression Models

In this step, we train two regression models, Linear Regression and Random Forest, to predict the World Risk Index (WRI). We use cross-validation to ensure the models are reliable and not overfitting. Here’s the code for this section:

# Set up cross-validation method
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)  # 5-fold cross-validation repeated 3 times

# Train Linear Regression Model
lm_model <- train(WRI ~ ., data = train_data, method = "lm", trControl = control)  # Linear regression model

# Train Random Forest Model
rf_model <- train(WRI ~ ., data = train_data, method = "rf", trControl = control, tuneLength = 5)  # Random Forest with hyperparameter tuning

# View summaries of both models
lm_model   # Shows performance metrics for linear regression
rf_model   # Shows performance and tuning info for Random Forest

The output for this step is:

note: only 4 unique complexity parameters in the default grid. Truncating the grid to 4 .

Linear Regression 

1536 samples

   5 predictor

No pre-processing

Resampling: Cross-Validated (5 fold, repeated 3 times) 

Summary of sample sizes: 1229, 1229, 1228, 1229, 1229, 1228, ... 

Resampling results:

  RMSE       Rsquared   MAE      

  0.7214486  0.8050491  0.1503502

Tuning parameter 'intercept' was held constant at a value of TRUE

Random Forest 

1536 samples

   5 predictor

No pre-processing

Resampling: Cross-Validated (5 fold, repeated 3 times) 

Summary of sample sizes: 1230, 1228, 1229, 1229, 1228, 1230, ... 

Resampling results across tuning parameters:

  mtry  RMSE        Rsquared   MAE       

  2     0.15166475  0.9811035  0.05591219

  3     0.09795349  0.9911595  0.03434792

  4     0.08634327  0.9927228  0.03009917

  5     0.09483177  0.9914011  0.03238711

RMSE was used to select the optimal model using the smallest value.

The final value used for the model was mtry = 4.

The above output means that:

Linear Regression Results:

  • Used 5 features to predict WRI.
  • Accuracy is decent:
    • RMSE: 0.72, average error is moderate.
    • R²: 0.80, explains 80% of the variation in WRI.
    • MAE: 0.15, average absolute error.

Random Forest Results:

  • Tried different settings (mtry) to improve accuracy.
  • Best result was when mtry = 4:
    • RMSE: 0.086, very low error.
    • R²: 0.99, explains 99% of the variation (very accurate).
    • MAE: 0.03, very small average error.

This means that:

  • Random Forest is much more accurate than Linear Regression.
  • It captures complex patterns in the data better.
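Because both models were trained with the same trainControl() setup, caret's resamples() can line up their cross-validation results side by side. This optional sketch assumes the two models' resampling runs are comparable, which is the case here since they share the same control object:

# Optional: compare cross-validation results of the two models side by side
cv_results <- resamples(list(Linear = lm_model, RandomForest = rf_model))
summary(cv_results)   # RMSE, R-squared, and MAE distributions for each model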


Step 9: Make Predictions and Evaluate the Models

Now that both models are trained, this step uses them to make predictions on the test dataset and evaluate how well they perform using common metrics like RMSE and R². The code for this step is given below:

# Predict WRI on test data
lm_pred <- predict(lm_model, newdata = test_data)
rf_pred <- predict(rf_model, newdata = test_data)

# Evaluate performance
lm_result <- postResample(lm_pred, test_data$WRI)
rf_result <- postResample(rf_pred, test_data$WRI)

# Print results
print("Linear Regression Results:")
print(lm_result)

print("Random Forest Results:")
print(rf_result)

The output for the above step is:

[1] "Linear Regression Results:"

     RMSE  Rsquared       MAE 

0.1668324 0.9730143 0.1028355 

[1] "Random Forest Results:"

      RMSE   Rsquared        MAE 

0.05025981 0.99776721 0.02075273 

This means that:

  • Random Forest performed much better than Linear Regression on the test data, with a much lower RMSE (0.05 vs. 0.16) and higher R² (0.99 vs. 0.97).
  • Random Forest's MAE (0.02) shows it made very small errors on average, compared to Linear Regression’s (0.10).
  • This means Random Forest gave more accurate and reliable predictions for disaster risk (WRI).
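One caveat: these metrics are on the standardized scale, because WRI was centered and scaled in Step 6. If you want predictions back in original WRI units, the preProcess object keeps the means and standard deviations it used. A minimal sketch, assuming preprocess$mean and preprocess$std contain a "WRI" entry (which is how caret stores them for the center and scale methods):

# Convert standardized Random Forest predictions back to the original WRI scale
wri_mean <- preprocess$mean["WRI"]   # Mean used when centering WRI
wri_sd   <- preprocess$std["WRI"]    # Standard deviation used when scaling WRI

rf_pred_original <- rf_pred * wri_sd + wri_mean   # Undo scaling, then undo centering
head(rf_pred_original)   # Predictions expressed in original WRI units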

Step 10: Identify Important Features Using Random Forest

This step helps you understand which variables had the biggest impact on predicting disaster risk. Random Forest automatically ranks features based on how useful they are in making accurate predictions. The code for this step is:

# Get feature importance from Random Forest
importance <- varImp(rf_model)  # Calculate how important each variable was in the model

# View importance scores
print(importance)  # Print the scores to see which features contributed most

# Plot the top variables
plot(importance, top = 10, main = "Top 10 Important Features (Random Forest)")  # Visualize the top features

The output for this section is:

rf variable importance

                            Overall
Exposure                    100.000
Vulnerability                 6.611
Lack.of.Coping.Capabilities   6.256
Susceptibility                4.962
Lack.of.Adaptive.Capacities   0.000

The above output shows that:

  • Exposure is by far the most important predictor; it has the highest score (100), meaning the model relies heavily on it to predict disaster risk (WRI).
  • Other variables like Vulnerability, Lack of Coping Capabilities, and Susceptibility contribute slightly but much less.
  • Lack of Adaptive Capacities had no measurable importance in the Random Forest model; it didn’t help improve predictions.
  • This tells us that Exposure plays the biggest role in disaster risk according to this model.
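If you prefer the raw importance scores from the underlying randomForest fit (rather than caret's rescaled 0 to 100 values shown above), they can be pulled from the final model stored inside rf_model. An optional sketch:

# Optional: raw importance scores from the fitted randomForest object
randomForest::importance(rf_model$finalModel)   # IncNodePurity (mean decrease in node impurity) per feature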


Step 11: Visualize the Distribution of the World Risk Index

This step helps you understand how the World Risk Index (WRI) values are spread across countries. A histogram gives a clear picture of how frequently different WRI levels occur. The following code is used in this step:

ggplot(data, aes(x = WRI)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +  # Creates a histogram with 30 bars
  labs(title = "Distribution of World Risk Index",
       x = "WRI", y = "Number of Countries") +  # Adds labels
  theme_minimal()  # Uses a clean theme for better visual appeal

This code produces a histogram showing the distribution of the World Risk Index (WRI) across all records in the dataset.

The above graph shows that:

  1. Skewed to the Right (Positively Skewed)
    Most countries have a low WRI score, clustered between 0 and 15. As WRI increases, the number of countries drops sharply.
  2. High Concentration Around 5–10
    The highest number of countries (almost 400) fall between the 5 and 10 WRI range, meaning moderate risk is most common globally.
  3. Few High-Risk Countries
    Only a small number of countries have very high WRI values (above 20), showing that extremely high disaster risk is rare.
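As an optional follow-up that is not part of the original notebook, a boxplot of WRI by exposure category makes the same skew visible within each group. This sketch assumes the Exposure.Category column shown in Step 3:

# Optional: WRI distribution within each exposure category
ggplot(data, aes(x = Exposure.Category, y = WRI)) +
  geom_boxplot(fill = "steelblue") +                 # One box per exposure category
  labs(title = "WRI by Exposure Category",
       x = "Exposure Category", y = "WRI") +
  theme_minimal()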

Step 12: Visualize Relationship Between Exposure and Vulnerability (Colored by WRI)

This scatter plot helps us learn how Exposure and Vulnerability relate to each other, and how their combination affects the World Risk Index (WRI). The points are colored by WRI levels to highlight risk. The code for this step is:

ggplot(data, aes(x = Exposure, y = Vulnerability, color = WRI)) +
  geom_point(size = 2, alpha = 0.7) +  # Smaller, semi-transparent points
  scale_color_gradient(low = "blue", high = "red") +  # Blue = low WRI, Red = high WRI
  labs(title = "Exposure vs Vulnerability (Colored by WRI)",
       x = "Exposure", y = "Vulnerability") +
  theme_minimal()

This code produces a scatter plot of Exposure versus Vulnerability, with each point colored by its WRI value.

The above graph shows that:

  • Countries with high exposure and vulnerability (top right) have the highest disaster risk, shown in red.
  • Most countries have low exposure, so they appear on the left and are colored blue or purple (low WRI).
  • This shows that exposure has a strong effect on increasing disaster risk.
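An optional way to back this observation up numerically is a correlation matrix of the numeric columns, which shows how strongly each indicator moves with WRI:

# Optional: correlations between WRI and the other numeric indicators
round(cor(data_clean), 2)   # Values near 1 or -1 indicate strong linear relationships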

Conclusion

In this Disaster Prediction Analysis project, we built a Random Forest regression model using R in Google Colab to predict the World Risk Index (WRI) based on key factors like Exposure, Vulnerability, and Coping Capacities.

After preprocessing the dataset (removing non-numeric columns, handling missing values, and scaling), we split the data into training and test sets. 

We then trained both Linear Regression and Random Forest models. The Random Forest model performed significantly better, with an RMSE of 0.0502 and an R-squared value of 0.9977, indicating it explained over 99% of the variation in disaster risk scores.


Colab Link:
https://colab.research.google.com/drive/17qLyoshVly8Vqa-fWxXk_ugDb5mMeM_H#scrollTo=HC6KPoLgtbkZ


