Trend Analysis Project on COVID-19 using R
By Rohit Sharma
Updated on Jul 24, 2025 | 9 min read | 1.58K+ views
Who can forget the dreaded time of COVID-19? It was a period when data became crucial, and proper analysis of that data helped curb the spread and fight back.
This project on COVID-19 using R is based on time series analysis. We will analyze trends in confirmed cases, deaths, and vaccinations globally, working with a real dataset: cleaning and transforming the time series data, and then applying statistical forecasting models such as ARIMA and ETS.
We'll also use various powerful R libraries such as tidyverse, lubridate, forecast, and ggplot2 to build an analytical pipeline. The main goal of this project on COVID-19 using R is to develop a predictive model that visualizes historical COVID-19 data and forecasts future trends.
Before starting this project on COVID-19 using R, it’s essential to have a foundational understanding of R programming, basic statistics, and time series concepts.
This project on COVID-19 using R also requires a variety of tools and technologies to load the data, execute the code, and run it successfully:
| Category | Tool / Library | Purpose / Use Case |
|---|---|---|
| IDE / Platform | RStudio / Google Colab | Development environment for writing and executing R code |
| Programming Language | R | Primary language for data manipulation, visualization, and forecasting |
| Data Handling | readr, dplyr, tidyr | Reading, cleaning, and transforming COVID-19 datasets |
| Time Series Processing | zoo, xts, lubridate | Handling time series data and managing date-time formats |
| Visualization | ggplot2, plotly | Creating static and interactive data visualizations |
| Forecasting Models | forecast, fable, tseries | Implementing and tuning models such as ARIMA and ETS |
| Statistical Testing | tseries, urca | Conducting stationarity tests (ADF, KPSS) and checking time series assumptions |
In this project on COVID-19 using R, you will work with forecasting models and techniques widely used in statistical modeling and predictive analytics, such as ARIMA and ETS.
The following steps will be followed to build the trend analysis project on COVID-19 using R.
Before we start the analysis, we need a detailed and frequently updated COVID-19 dataset. Reliable public sources include the Johns Hopkins University CSSE repository, whose column layout this dataset follows. Find the dataset you want to work with and download it as a CSV file.
To begin the trend analysis project on COVID-19 using R, the first essential step is setting up your R environment with the necessary libraries. These packages will help us in data manipulation, visualization, data handling, and forecasting.
# Install the required packages (run only once)
install.packages("ggplot2")
install.packages("forecast")
install.packages("dplyr")
install.packages("lubridate")
# Load the libraries into your R session
library(ggplot2)
library(forecast)
library(dplyr)
library(lubridate)
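If you rerun the notebook often, reinstalling packages every time is wasteful. As an optional convenience (not part of the original steps), the following sketch installs only the packages that are missing and then loads them all in one pass:

```r
# Install only the packages that are not already present
pkgs <- c("ggplot2", "forecast", "dplyr", "lubridate")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)

# Load all of them in one pass
invisible(lapply(pkgs, library, character.only = TRUE))
```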
After setting up the environment, the next step is to upload the COVID-19 dataset into R. Use the following code to import your CSV file, preview the data, and understand its structure:
# Load the dataset from the specified path
covid_raw <- read.csv("/content/coronavirus_dataset.csv", stringsAsFactors = FALSE)
# Preview the first few rows of the dataset
head(covid_raw)
# Examine the structure and data types of each column
str(covid_raw)
You’ll get an overview of the dataset here:
Output:

| | Province.State | Country.Region | Lat | Long | date | cases | type |
|---|---|---|---|---|---|---|---|
| | &lt;chr&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;chr&gt; |
| 1 | | Afghanistan | 33 | 65 | 2020-01-22 | 0 | confirmed |
| 2 | | Afghanistan | 33 | 65 | 2020-01-23 | 0 | confirmed |
| 3 | | Afghanistan | 33 | 65 | 2020-01-24 | 0 | confirmed |
| 4 | | Afghanistan | 33 | 65 | 2020-01-25 | 0 | confirmed |
| 5 | | Afghanistan | 33 | 65 | 2020-01-26 | 0 | confirmed |
| 6 | | Afghanistan | 33 | 65 | 2020-01-27 | 0 | confirmed |
In this step, we prepare the dataset by converting the date column into a proper Date format and aggregating the daily confirmed COVID-19 cases globally.
library(dplyr)
library(lubridate)
# Convert the 'date' column to Date format for accurate time series handling
covid <- covid_raw %>%
mutate(date = ymd(date))
# Aggregate daily global confirmed cases by summing cases for each date
global_confirmed <- covid %>%
filter(type == "confirmed") %>%
group_by(date) %>%
summarise(daily_cases = sum(cases, na.rm = TRUE))
# Preview the aggregated data
head(global_confirmed)
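To see exactly what the `group_by()`/`summarise()` step does, here is a small self-contained example with made-up numbers for two dates and two countries (toy values, not from the real dataset):

```r
library(dplyr)

toy <- data.frame(
  date    = as.Date(c("2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23")),
  country = c("A", "B", "A", "B"),
  cases   = c(10, 5, 20, 8)
)

# Collapse the per-country rows into one global total per date
toy_daily <- toy %>%
  group_by(date) %>%
  summarise(daily_cases = sum(cases, na.rm = TRUE))

print(toy_daily)
# date        daily_cases
# 2020-01-22           15
# 2020-01-23           28
```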
We get a table like this:

| date | daily_cases |
|---|---|
| &lt;date&gt; | &lt;int&gt; |
| 2020-01-22 | 555 |
| 2020-01-23 | 98 |
| 2020-01-24 | 288 |
| 2020-01-25 | 493 |
| 2020-01-26 | 684 |
| 2020-01-27 | 809 |
Now that we have aggregated the daily confirmed cases globally, we visualize the trend over time using ggplot2.
library(ggplot2)
ggplot(global_confirmed, aes(x = date, y = daily_cases)) +
geom_line(color = "steelblue") +
labs(title = "Daily Confirmed COVID-19 Cases Worldwide",
x = "Date", y = "Daily Cases") +
theme_minimal()
This visualization provides a clear overview of the pandemic’s progression globally, serving as a foundation for further analysis and forecasting.
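Because daily counts range from a handful of cases early on to millions at the peaks, a linear axis can flatten the early phase of the curve. An optional variant (not in the original walkthrough) plots the same `global_confirmed` data on a log scale:

```r
library(ggplot2)

# Same plot as above, but with a log10 y-axis so early growth stays visible
ggplot(global_confirmed, aes(x = date, y = daily_cases)) +
  geom_line(color = "steelblue") +
  scale_y_log10() +
  labs(title = "Daily Confirmed COVID-19 Cases Worldwide (log scale)",
       x = "Date", y = "Daily Cases (log10)") +
  theme_minimal()
```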
In this step, we calculate both the daily new confirmed cases and the cumulative total of confirmed cases globally. Because the cases column in this dataset records daily increments rather than running totals (the aggregated counts above rise and fall from day to day), we sum the daily counts per date and derive the cumulative total with a running sum:
# Aggregate daily cases per date and compute the cumulative total
global_confirmed <- covid %>%
  filter(type == "confirmed") %>%
  group_by(date) %>%
  summarise(daily_cases = sum(cases, na.rm = TRUE)) %>%
  mutate(total_cases = cumsum(daily_cases))
This step generates a time series of new daily cases, which is crucial for detecting spikes and understanding how the pandemic spread.
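The relationship between cumulative and daily series is easy to verify on a toy vector: `cumsum()` turns daily counts into a running total, and differencing the running total recovers the daily counts (the values below are borrowed from the table above, for illustration only):

```r
daily      <- c(555, 98, 288, 493)                # daily new cases (toy values)
cumulative <- cumsum(daily)                       # running total
recovered  <- c(cumulative[1], diff(cumulative))  # back to daily counts

print(cumulative)  # 555 653 941 1434
print(recovered)   # 555 98 288 493
```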
With the time series data prepared, we now fit an ARIMA model to forecast future COVID-19 case counts. ARIMA models are effective at capturing patterns in time series data, including trend and seasonality.
# Convert the daily cases into a time series object
covid_ts <- ts(global_confirmed$daily_cases, frequency = 365)
# Fit an ARIMA model automatically based on the data
fit_arima <- auto.arima(covid_ts)
# Forecast the next 30 days of COVID-19 cases
forecast_arima <- forecast(fit_arima, h = 30)
# Visualize the forecast with confidence intervals
autoplot(forecast_arima) +
  labs(title = "30-Day COVID-19 Forecast using ARIMA",
       x = "Time", y = "Forecasted Cases")
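If you want to sanity-check `auto.arima()` and `forecast()` before applying them to the COVID data, the same pattern runs on any built-in time series. A quick self-contained check on the classic `AirPassengers` dataset (assuming the forecast package is installed):

```r
library(forecast)

# Fit an ARIMA model automatically and forecast 12 months ahead
fit <- auto.arima(AirPassengers)
fc  <- forecast(fit, h = 12)

length(fc$mean)  # 12 forecasted points
summary(fit)     # chosen model order and fit statistics
```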
This forecast allows public health planners and other stakeholders to anticipate potential future trends and prepare accordingly.
To evaluate our model’s performance, we split the time series data into training and testing sets. This approach helps validate the accuracy of forecasts against actual observed values.
# Convert daily cases to a time series object with yearly frequency
covid_ts <- ts(global_confirmed$daily_cases, frequency = 365)
# Total length of the time series
n <- length(covid_ts)
# Create training set (all but last 30 days)
train_ts <- ts(covid_ts[1:(n - 30)], frequency = 365)
# Create test set (last 30 days)
test_ts <- ts(covid_ts[(n - 29):n], frequency = 365)
# Fit ARIMA model on training data
model_train <- auto.arima(train_ts)
# Forecast the next 30 days using the trained model
forecast_test <- forecast(model_train, h = 30)
# Evaluate forecast accuracy against actual test data
accuracy(forecast_test, test_ts)
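The headline metrics reported by `accuracy()` can be reproduced by hand, which helps when interpreting the output table. A toy sketch (with made-up actual and predicted values) computing RMSE and MAE directly:

```r
actual    <- c(100, 120, 130, 150)  # toy observed values
predicted <- c(110, 115, 140, 145)  # toy forecasts
err <- actual - predicted

rmse <- sqrt(mean(err^2))  # root mean squared error
mae  <- mean(abs(err))     # mean absolute error

print(rmse)  # 7.905694
print(mae)   # 7.5
```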
This method ensures that the forecasting model is robust and reliable before deploying it for future predictions.
Output:

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil's U |
|---|---|---|---|---|---|---|---|---|
| Training set | 63.6 | 3666.128 | 1810.56 | 100 | 100 | NaN | -0.3738800 | NA |
| Test set | 1364.8 | 7384.268 | 5661.20 | 100 | 100 | NaN | -0.6447488 | 1.014011 |
To assess the performance of the ARIMA model visually, we plot the forecasted daily cases alongside the actual observed data for the test period.
autoplot(forecast_test) +
autolayer(test_ts, series = "Actual") +
labs(title = "ARIMA Forecast vs Actual",
y = "Daily Cases", x = "Days")
This visualization provides an immediate sense of how closely the ARIMA model predictions track real-world COVID-19 case trends.
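The introduction mentions ETS (exponential smoothing) alongside ARIMA, though the walkthrough only fits ARIMA. As an optional extension, the same fit-and-forecast pattern works with `ets()` from the forecast package. It is sketched here on a built-in dataset, since `ets()` restricts seasonal periods to 24 or fewer and would treat the frequency-365 COVID series as non-seasonal:

```r
library(forecast)

fit_ets <- ets(AirPassengers)         # automatic ETS model selection
fc_ets  <- forecast(fit_ets, h = 12)  # 12-step-ahead forecast

autoplot(fc_ets) +
  labs(title = "ETS Forecast (illustrative, on AirPassengers)")
```

Comparing the `accuracy()` output of the ETS and ARIMA fits on the same train/test split is a natural next step.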
In this trend analysis project on COVID-19 using R, we cleaned, visualized, and forecasted daily confirmed cases using statistical models like ARIMA. We started with global data, preprocessed the dataset, aggregated daily cases, and then applied time series techniques to model trends and patterns.
We also divided the data into training and testing sets to evaluate forecast accuracy, using metrics that validate the model’s predictive capabilities. The ARIMA model provided a clear visualization of both historical and forecasted COVID-19 cases.
Colab Link:
https://colab.research.google.com/drive/1LJYJTkh6BGtJgDZj2iZA_s3WvaGQufiX
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...