
Trend Analysis Project on COVID-19 using R

By Rohit Sharma

Updated on Jul 24, 2025 | 9 min read | 1.58K+ views


During the COVID-19 pandemic, data became crucial: analysis of reliable data helped track the spread of the virus and guide the response.

This project applies time series analysis in R to COVID-19 data. We clean and transform a case-level dataset, aggregate it into a global daily series of confirmed cases, and fit a statistical forecasting model (ARIMA). The same workflow extends to other series, such as deaths, and to other models, such as ETS.

We'll use the R packages dplyr, lubridate, forecast, and ggplot2 to build the analytical pipeline. The main goal of this project on COVID-19 using R is to develop a predictive model that visualizes historical COVID-19 data and forecasts near-term trends.


What Should You Know Before Getting Started with COVID-19 Data Analysis?

Before starting this project on COVID-19 using R, it helps to have a foundational understanding of the following concepts and tools:

  • Basic R Programming
    Familiarity with R syntax, data frames, and functions for preprocessing, visualization, and modeling.
  • Time Series Concepts
    An understanding of trend, seasonality, and stationarity, which helps in interpreting pandemic patterns and applying forecasting models effectively.
  • Data Manipulation with dplyr and tidyr
    These tidyverse packages streamline data cleaning and transformation tasks, especially when reshaping large datasets.
  • Data Visualization with ggplot2
    Proficiency in plotting techniques, which is crucial for understanding the spread and impact of COVID-19 over time.
  • Statistical Modeling Basics
    Knowledge of statistical forecasting models, such as ARIMA and ETS (exponential smoothing).
  • Working with Dates and Time in R
    Experience with packages like lubridate, which make it easier to manage daily, weekly, or monthly time series.
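As a quick illustration of the date-handling prerequisite, the sketch below parses date strings with lubridate and rolls daily counts up to weekly totals. The data frame and its values are made up for the example and do not come from the project dataset.

```r
library(lubridate)

# Hypothetical daily counts, for illustration only
daily <- data.frame(
  date  = c("2020-03-02", "2020-03-03", "2020-03-08", "2020-03-09"),
  cases = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Parse ISO-formatted strings into Date objects
daily$date <- ymd(daily$date)

# Label each row with the Monday that starts its week
daily$week_start <- floor_date(daily$date, unit = "week", week_start = 1)

# Sum cases within each week
weekly <- aggregate(cases ~ week_start, data = daily, FUN = sum)
print(weekly)
```

Working with real Date objects rather than strings is what makes grouping by week or month reliable later on.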


What Tools, Technologies, and R Libraries Are Required?

This project relies on the following tools, technologies, and R libraries to load the data, run the analysis, and build the forecasts:

| Category | Tool / Library | Purpose / Use Case |
|---|---|---|
| IDE / Platform | RStudio / Google Colab | Development environment for writing and executing R code |
| Programming Language | R | Primary language for data manipulation, visualization, and forecasting |
| Data Handling | readr, dplyr, tidyr | Reading, cleaning, and transforming COVID-19 datasets |
| Time Series Processing | zoo, xts, lubridate | Handling time series data and managing date-time formats |
| Visualization | ggplot2, plotly | Creating static and interactive data visualizations |
| Forecasting Models | forecast, fable, tseries | Implementing and tuning models like ARIMA, ETS, and others |
| Statistical Testing | tseries, urca | Conducting stationarity tests (ADF, KPSS) and evaluating time series assumptions |
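The stationarity tests listed in the table can be tried on a small simulated example. The sketch below uses adf.test() and kpss.test() from the tseries package on a random walk and on its first difference. The series is simulated and is not COVID-19 data, so treat it only as a way to see how the two tests are called and read.

```r
library(tseries)

set.seed(42)
# Simulated series: a random walk is non-stationary, while its
# first difference is white noise and therefore stationary
random_walk <- cumsum(rnorm(300))
differenced <- diff(random_walk)

# ADF test: the null hypothesis is a unit root (non-stationary).
# tseries may warn when the p-value falls outside its lookup table.
adf_level <- adf.test(random_walk)
adf_diff  <- adf.test(differenced)

# KPSS test: the null hypothesis is stationarity (the reverse of ADF)
kpss_level <- kpss.test(random_walk)
kpss_diff  <- kpss.test(differenced)

# Collect the p-values side by side
c(adf_level  = adf_level$p.value,
  adf_diff   = adf_diff$p.value,
  kpss_level = kpss_level$p.value,
  kpss_diff  = kpss_diff$p.value)
```

Because the two tests have opposite null hypotheses, a small ADF p-value and a large KPSS p-value point in the same direction: the series looks stationary.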

Forecasting Models You Will Learn and Apply

This project on COVID-19 using R draws on the following forecasting models and techniques, which are widely used in statistical modeling and predictive analytics:

  • ARIMA (AutoRegressive Integrated Moving Average)
    ARIMA models a non-seasonal time series by combining autoregressive terms, differencing (to remove trend), and moving-average terms.
  • SARIMA (Seasonal ARIMA)
    It is an extension of ARIMA that handles seasonality explicitly, which suits recurring patterns such as periodic spikes.
  • ETS (Error, Trend, Seasonality)
    ETS is a family of exponential smoothing models that describe a series through its error, trend, and seasonal components; the form can be selected automatically.
  • STL Decomposition
    STL breaks the time series down into trend, seasonal, and residual components for better interpretation.
  • Baseline Models (Naïve, Mean Forecast)
    These models are useful for benchmarking: they show how much a more advanced forecasting model actually improves on a simple rule.
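To make the list above concrete, here is a minimal sketch that runs each technique on a simulated daily series with a weekly cycle. The data are generated for the example, and the choice of frequency = 7 is an assumption that fits this toy series rather than a setting taken from the project.

```r
library(forecast)

set.seed(1)
# Simulated daily series (not real data): upward trend + weekly cycle + noise
n_days <- 140
y <- 100 + 0.5 * seq_len(n_days) +
  10 * sin(2 * pi * seq_len(n_days) / 7) + rnorm(n_days, sd = 3)
y_ts <- ts(y, frequency = 7)   # 7 observations per weekly cycle

# STL: split the series into seasonal, trend, and remainder components
decomp <- stl(y_ts, s.window = "periodic")
head(decomp$time.series)

# Baselines: last observed value (naive) and historical mean
fc_naive <- naive(y_ts, h = 14)
fc_mean  <- meanf(y_ts, h = 14)

# ETS: the error/trend/seasonal form is selected automatically
fit_ets <- ets(y_ts)
fc_ets  <- forecast(fit_ets, h = 14)

# SARIMA: auto.arima() considers seasonal terms when frequency > 1
fit_sarima <- auto.arima(y_ts)
fc_sarima  <- forecast(fit_sarima, h = 14)
```

Each forecast object stores its point forecasts in the mean element, so the models can be compared on the same horizon.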

Estimated Time Required and Complexity

Before you begin, here’s an estimation of the time and effort required to complete this project:

  • Estimated Duration:
    6–8 hours for intermediate R users; 10–12 hours if you're new to time series concepts.
  • Difficulty Level:
    Intermediate. This project requires basic programming skills and a conceptual understanding of statistical models.

Step-by-Step Guide to Building The COVID-19 Trends Analysis Project

Follow these steps to build the trend analysis project on COVID-19 using R.

Step 1: Source and Download COVID-19 Datasets

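Before we start the analysis, we need a detailed and frequently updated COVID-19 dataset. Reliable public sources include the Johns Hopkins CSSE repository and Our World in Data. Choose the dataset you want to work with and download it as a CSV file. The steps below assume a long-format file with one row per location, date, and case type.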

Step 2: Install and Load Required R Packages

Next, set up your R environment with the necessary libraries. These packages handle data manipulation, date parsing, visualization, and forecasting.

# Install the required packages (run only once)
install.packages("ggplot2")
install.packages("forecast")
install.packages("dplyr")
install.packages("lubridate")

# Load the libraries into your R session
library(ggplot2)
library(forecast)
library(dplyr)
library(lubridate)

Here is what each package does:

  • ggplot2: Creates advanced and customizable data visualizations.
  • forecast: Provides tools to build and evaluate forecasting models like ARIMA.
  • dplyr: Simplifies data wrangling tasks such as filtering and summarizing.
  • lubridate: Facilitates working with dates and times in R.
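If you rerun the notebook often, one option is to install only the packages that are missing instead of reinstalling everything. The sketch below is one way to do that with base R functions; the variable names are chosen for this example.

```r
required_pkgs <- c("ggplot2", "forecast", "dplyr", "lubridate")

# Identify packages that are not yet installed
is_installed <- vapply(required_pkgs, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

# Install only what is missing, then attach everything
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
invisible(lapply(required_pkgs, library, character.only = TRUE))
```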

Step 3: Load and Inspect the COVID-19 Dataset

After setting up the environment, the next step is to load the COVID-19 dataset into R. Use the following code to import your CSV file, preview the data, and understand its structure. The path below is a Google Colab path; adjust it to wherever you saved the file.

# Load the dataset from the specified path
covid_raw <- read.csv("/content/coronavirus_dataset.csv", stringsAsFactors = FALSE)

# Preview the first few rows of the dataset
head(covid_raw)

# Examine the structure and data types of each column
str(covid_raw)

You’ll get an overview of the dataset here:

Output:

|   | Province.State | Country.Region | Lat | Long | date | cases | type |
|---|---|---|---|---|---|---|---|
|   | <chr> | <chr> | <dbl> | <dbl> | <chr> | <int> | <chr> |
| 1 |   | Afghanistan | 33 | 65 | 2020-01-22 | 0 | confirmed |
| 2 |   | Afghanistan | 33 | 65 | 2020-01-23 | 0 | confirmed |
| 3 |   | Afghanistan | 33 | 65 | 2020-01-24 | 0 | confirmed |
| 4 |   | Afghanistan | 33 | 65 | 2020-01-25 | 0 | confirmed |
| 5 |   | Afghanistan | 33 | 65 | 2020-01-26 | 0 | confirmed |
| 6 |   | Afghanistan | 33 | 65 | 2020-01-27 | 0 | confirmed |
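Before moving on, it can help to confirm that the file has the columns the later steps rely on and to count missing values. The sketch below uses a hypothetical helper, check_covid_data(), and a small toy data frame that stands in for covid_raw. Neither is part of the original notebook.

```r
# Columns the later steps rely on, taken from the str() output
expected_cols <- c("Province.State", "Country.Region", "Lat", "Long",
                   "date", "cases", "type")

# Hypothetical helper: stops early if a column is missing and
# otherwise returns the number of NA values per column
check_covid_data <- function(df, required = expected_cols) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  colSums(is.na(df[required]))
}

# Toy data frame standing in for covid_raw (values are made up)
toy <- data.frame(
  Province.State = c("", ""),
  Country.Region = c("Afghanistan", "Afghanistan"),
  Lat = c(33, 33), Long = c(65, 65),
  date = c("2020-01-22", "2020-01-23"),
  cases = c(0L, NA),
  type = c("confirmed", "confirmed"),
  stringsAsFactors = FALSE
)

na_counts <- check_covid_data(toy)
print(na_counts)

# Which case types are present in the file
table(toy$type)
```

On the real data, the same call would be check_covid_data(covid_raw).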

 


Step 4: Clean Data and Aggregate Global Daily Confirmed Cases

In this step, we prepare the dataset by converting the date column into a proper Date format and aggregating the daily confirmed COVID-19 cases globally.

library(dplyr)
library(lubridate)

# Convert the 'date' column to Date format for accurate time series handling
covid <- covid_raw %>%
  mutate(date = ymd(date))

# Aggregate daily global confirmed cases by summing cases for each date
global_confirmed <- covid %>%
  filter(type == "confirmed") %>%
  group_by(date) %>%
  summarise(daily_cases = sum(cases, na.rm = TRUE))

# Preview the aggregated data
head(global_confirmed)

We get a table like this:

| date | daily_cases |
|---|---|
| <date> | <int> |
| 2020-01-22 | 555 |
| 2020-01-23 | 98 |
| 2020-01-24 | 288 |
| 2020-01-25 | 493 |
| 2020-01-26 | 684 |
| 2020-01-27 | 809 |
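To see exactly what the filter, group, and sum pipeline does, here is the same pattern on a tiny made-up data frame. The country labels and counts are invented for the example.

```r
library(dplyr)

# Toy rows in the same layout as the dataset (values are made up)
toy <- data.frame(
  Country.Region = c("A", "B", "A", "B", "A"),
  date  = as.Date(c("2020-01-22", "2020-01-22", "2020-01-23",
                    "2020-01-23", "2020-01-23")),
  cases = c(5L, 7L, 3L, NA, 2L),
  type  = c("confirmed", "confirmed", "confirmed", "confirmed", "death")
)

toy_global <- toy %>%
  filter(type == "confirmed") %>%      # drop rows of other case types
  group_by(date) %>%                   # one group per calendar day
  summarise(daily_cases = sum(cases, na.rm = TRUE), .groups = "drop")

toy_global
```

The first day sums to 12 (5 + 7). The second day sums to 3, because the missing value is skipped by na.rm = TRUE and the "death" row is filtered out.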

Step 5: Visualize Daily Confirmed COVID-19 Cases Worldwide

Now that we have aggregated the daily confirmed cases globally, we visualize the trend over time using ggplot2.

library(ggplot2)

ggplot(global_confirmed, aes(x = date, y = daily_cases)) +
  geom_line(color = "steelblue") +
  labs(title = "Daily Confirmed COVID-19 Cases Worldwide",
       x = "Date", y = "Daily Cases") +
  theme_minimal()
  • geom_line() plots the time series line chart to highlight trends and fluctuations in daily confirmed cases.

This visualization provides a clear overview of the pandemic’s progression globally, serving as a foundation for further analysis and forecasting.
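Daily counts are often noisy because of reporting patterns, so a common addition is a 7-day rolling average drawn over the raw line. The sketch below uses rollmean() from the zoo package on made-up data; the column name avg_7d is chosen for this example.

```r
library(zoo)
library(ggplot2)

# Toy daily counts with a weekly reporting dip (made-up values)
toy <- data.frame(
  date = seq(as.Date("2020-03-01"), by = "day", length.out = 14),
  daily_cases = rep(c(70, 140, 140, 140, 140, 140, 70), times = 2)
)

# Trailing 7-day mean; the first six days have no full window, so they are NA
toy$avg_7d <- rollmean(toy$daily_cases, k = 7, fill = NA, align = "right")

p <- ggplot(toy, aes(x = date)) +
  geom_line(aes(y = daily_cases), color = "grey60") +
  geom_line(aes(y = avg_7d), color = "steelblue", na.rm = TRUE) +
  labs(title = "Daily Cases with 7-Day Average",
       x = "Date", y = "Cases") +
  theme_minimal()
print(p)
```

The same two lines of smoothing code can be applied to global_confirmed$daily_cases before plotting.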


Step 6: Calculate Cumulative and Daily New Confirmed Cases

In this step, we add the cumulative total of confirmed cases alongside the daily new cases. The cases column already holds daily new counts (the preview in Step 4 rises and falls from day to day, which a running total cannot do), so the cumulative series is the running sum of the daily values.

global_confirmed <- covid %>%
  filter(type == "confirmed") %>%
  group_by(date) %>%
  summarise(daily_cases = sum(cases, na.rm = TRUE)) %>%
  arrange(date) %>%
  mutate(total_cases = cumsum(daily_cases))

This step produces a time series with both daily new cases, which reveal spikes and trends, and cumulative cases, which show the overall scale of the pandemic.
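The relationship between daily and cumulative counts can be checked with base R alone. The sketch below uses the six daily values shown in the Step 4 preview, converts them to a running total, and then recovers the daily values again.

```r
# Daily counts taken from the aggregated preview in Step 4
daily <- c(555, 98, 288, 493, 684, 809)

# Daily -> cumulative: running total
cumulative <- cumsum(daily)
cumulative
# 555  653  941 1434 2118 2927

# Cumulative -> daily: difference from the previous day,
# keeping the first value as-is
recovered <- c(cumulative[1], diff(cumulative))
identical(recovered, daily)   # TRUE
```

Which direction you need depends on the source: some datasets publish running totals, others publish daily new counts.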

Step 7: Apply ARIMA Model and Forecast the Next 30 Days

With the time series data prepared, we now fit an ARIMA model to forecast future COVID-19 case counts. ARIMA models capture autocorrelation and trend, and their seasonal extension also handles seasonality. The model needs a ts object, so we create one first.

# Convert daily cases to a time series object (daily data, 365 observations per year)
covid_ts <- ts(global_confirmed$daily_cases, frequency = 365)

# Let auto.arima() select the model orders from the data
fit_arima <- auto.arima(covid_ts)

# Forecast the next 30 days of COVID-19 cases
forecast_arima <- forecast(fit_arima, h = 30)

# Visualize the forecast with prediction intervals
autoplot(forecast_arima) +
  labs(title = "30-Day COVID-19 Forecast using ARIMA",
       x = "Time", y = "Forecasted Cases")

A forecast like this helps health authorities and planners anticipate potential near-term trends and prepare accordingly.
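Beyond the plot, the forecast object itself can be inspected, and the model residuals can be checked for leftover autocorrelation. The sketch below does both on a simulated series that stands in for covid_ts; the simulated values are not real case counts.

```r
library(forecast)

set.seed(123)
# Simulated stand-in for covid_ts (not real case counts)
sim_ts <- ts(200 + cumsum(rnorm(120, mean = 2, sd = 10)))

fit <- auto.arima(sim_ts)
fc  <- forecast(fit, h = 30, level = c(80, 95))

# Point forecasts and interval bounds are stored on the forecast object
head(fc$mean)
head(fc$lower)   # columns: 80% and 95% lower bounds
head(fc$upper)   # columns: 80% and 95% upper bounds

# Ljung-Box test on the residuals: a large p-value suggests
# no remaining autocorrelation
Box.test(residuals(fit), lag = 10, type = "Ljung-Box")
```

If the residuals still show structure, the selected model is likely missing a pattern in the data, and its intervals should be read with caution.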

Step 8: Prepare Training and Test Sets, Fit ARIMA Model, and Evaluate Forecast Accuracy

To evaluate our model’s performance, we split the time series data into training and testing sets. This approach helps validate the accuracy of forecasts against actual observed values.

# Convert daily cases to a time series object (daily data, 365 observations per year)
covid_ts <- ts(global_confirmed$daily_cases, frequency = 365)

# Total length of the time series
n <- length(covid_ts)

# Create training set (all but last 30 days); subset() keeps the time index
train_ts <- subset(covid_ts, end = n - 30)

# Create test set (last 30 days), aligned with the forecast horizon
test_ts <- subset(covid_ts, start = n - 29)

# Fit ARIMA model on training data
model_train <- auto.arima(train_ts)

# Forecast the next 30 days using the trained model
forecast_test <- forecast(model_train, h = 30)

# Evaluate forecast accuracy against actual test data
accuracy(forecast_test, test_ts)

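A holdout evaluation like this estimates how the model performs on data it has not seen, before you rely on it for future predictions. The accuracy() output looks like this:

|   | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil's U |
|---|---|---|---|---|---|---|---|---|
| Training set | 63.6 | 3666.128 | 1810.56 | 100 | 100 | NaN | -0.3738800 | NA |
| Test set | 1364.8 | 7384.268 | 5661.20 | 100 | 100 | NaN | -0.6447488 | 1.014011 |

Test-set RMSE and MAE are roughly two to three times the training-set values, and a Theil's U close to 1 indicates performance similar to a naive forecast.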
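Error metrics are easier to judge against a simple benchmark. The sketch below compares an ARIMA forecast with a naive forecast on the same holdout, using a simulated series and a small rmse() helper written for this example. It uses subset() from the forecast package as one way to split a ts object while keeping its time index.

```r
library(forecast)

set.seed(7)
# Simulated daily series standing in for the real data
y <- ts(500 + cumsum(rnorm(150, mean = 1, sd = 15)))
n <- length(y)
h <- 30

# subset() keeps the time index, so forecasts and test data line up
train <- subset(y, end = n - h)
test  <- subset(y, start = n - h + 1)

fc_arima <- forecast(auto.arima(train), h = h)
fc_naive <- naive(train, h = h)

# Helper for this example: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

c(arima = rmse(test, fc_arima$mean),
  naive = rmse(test, fc_naive$mean))
```

If the ARIMA error is not clearly lower than the naive error, the extra model complexity is not yet paying off.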

 


Step 9: Visualize ARIMA Forecast Compared to Actual COVID-19 Cases

To assess the performance of the ARIMA model visually, we plot the forecasted daily cases alongside the actual observed data for the test period.

autoplot(forecast_test) +
  autolayer(test_ts, series = "Actual") +
  labs(title = "ARIMA Forecast vs Actual",
       y = "Daily Cases", x = "Days")

  • autoplot(forecast_test): Plots the ARIMA forecast with its prediction intervals.
  • autolayer(test_ts, series = "Actual"): Overlays the actual daily cases on the same plot for direct comparison.
  • The title and axis labels make it easy to tell predicted values from observed ones.

This visualization provides an immediate sense of how closely the ARIMA model predictions track real-world COVID-19 case trends.
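For more control over the chart, the forecast and the actual values can be placed in a plain data frame and plotted with ggplot2 directly. The sketch below does this on a simulated series; the data frame name comparison and its columns are chosen for this example.

```r
library(forecast)
library(ggplot2)

set.seed(11)
# Simulated series standing in for the real daily cases
y <- ts(300 + cumsum(rnorm(100, mean = 1, sd = 8)))
train <- subset(y, end = 70)
test  <- subset(y, start = 71)
fc <- forecast(auto.arima(train), h = length(test), level = 95)

# One row per forecast day: actual value, point forecast, 95% bounds
comparison <- data.frame(
  day      = as.numeric(time(test)),
  actual   = as.numeric(test),
  forecast = as.numeric(fc$mean),
  lower    = as.numeric(fc$lower[, 1]),
  upper    = as.numeric(fc$upper[, 1])
)
comparison$error <- comparison$actual - comparison$forecast

p <- ggplot(comparison, aes(x = day)) +
  geom_ribbon(aes(ymin = lower, ymax = upper),
              fill = "steelblue", alpha = 0.2) +
  geom_line(aes(y = forecast, color = "Forecast")) +
  geom_line(aes(y = actual, color = "Actual")) +
  labs(title = "Forecast vs Actual (simulated data)",
       x = "Day", y = "Value", color = NULL) +
  theme_minimal()
print(p)
```

The error column also makes it simple to see on which days the forecast missed by the widest margin.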

Conclusion

In this trend analysis project on COVID-19 using R, we cleaned, visualized, and forecasted daily confirmed cases using an ARIMA model. We started with global data, preprocessed the dataset, aggregated daily cases, and then applied time series techniques to model trends and patterns.

We also divided the data into training and testing sets and evaluated forecast accuracy with metrics such as RMSE and MAE. Plotting the ARIMA forecast against the held-out data showed how closely the predictions tracked the actual COVID-19 cases.


Colab Link:
https://colab.research.google.com/drive/1LJYJTkh6BGtJgDZj2iZA_s3WvaGQufiX

