
Data Preprocessing in R: Ultimate Tutorial [2024]

Last updated: 28th Feb, 2022

In this data preprocessing in R tutorial, you’ll learn the fundamentals of data preprocessing: handling missing values, encoding categorical data, splitting the dataset, and scaling features. The tutorial assumes you are familiar with the basics of R and programming.

1. Step: Finding and Fixing Issues

We’ll start our data preprocessing in R tutorial by importing the data set first. After all, you can’t preprocess the data if you don’t have the data in the first place.

In our case, the data is stored in the Data.csv file in the working directory. You can set the working directory with the command setwd("desired location").

Here’s how you’ll start the process:

dataset <- read.csv("Data.csv")

Here’s our dataset:

##    Country Age Salary Purchased
## 1   France  44  72000        No
## 2    Spain  27  48000       Yes
## 3  Germany  30  54000        No
## 4    Spain  38  61000        No
## 5  Germany  40     NA       Yes
## 6   France  35  58000       Yes
## 7    Spain  NA  52000        No
## 8   France  48  79000       Yes
## 9  Germany  50  83000        No
## 10  France  37  67000       Yes

As you can see, there are missing values (NA) in the Age and Salary columns of our dataset. Now that we have identified the issue, we can start fixing it.
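On a larger dataset you cannot spot missing values by eye, so it can help to count them. The sketch below rebuilds the ten rows shown above as an inline data frame (an assumption made so that the snippet runs without Data.csv) and counts the NA values per column with base R.

```r
# Rebuild the ten rows printed above so the example runs without Data.csv
# (assumption: these values mirror the contents of the file)
dataset <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"),
  Age       = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary    = c(72000, 48000, 54000, 61000, NA,
                58000, 52000, 79000, 83000, 67000),
  Purchased = c("No", "Yes", "No", "No", "Yes",
                "Yes", "No", "Yes", "No", "Yes")
)

# Count the missing values in each column
na_counts <- colSums(is.na(dataset))
print(na_counts)

# Show only the rows that contain at least one missing value
incomplete_rows <- dataset[!complete.cases(dataset), ]
print(incomplete_rows)
```

Here na_counts reports one missing value each for Age and Salary, and incomplete_rows holds rows 5 and 7.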

No other issues seem to be present in our dataset so we only have to handle the missing values. We can fix this problem by replacing the NA values with the average values of the respective columns. Here’s how:

# Replace missing Age values with the column mean (NA values excluded)
dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset$Age)

# Do the same for Salary
dataset$Salary <- ifelse(is.na(dataset$Salary),
                         ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Salary)

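Notice how we used the ave() function here. Called without a grouping variable, it applies FUN to the whole column and returns a vector of the same length, so every position holds the column mean, calculated with the NA values excluded (na.rm = TRUE). ifelse() then works element by element: if the value in dataset$Age is NA, it takes the column mean; else, it takes whatever is present in dataset$Age.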
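To make the interplay of ave() and ifelse() concrete, here is a minimal sketch on a three-element vector. The vector and the variable names are illustrative assumptions, not part of Data.csv. The last lines show a more direct way to write the same imputation.

```r
# A small vector with one missing value (illustrative data)
age <- c(20, NA, 40)

# Without a grouping variable, ave() returns the overall mean at every position
col_mean <- ave(age, FUN = function(x) mean(x, na.rm = TRUE))
print(col_mean)

# ifelse() picks the mean where age is NA and the original value elsewhere
age_filled <- ifelse(is.na(age), col_mean, age)
print(age_filled)

# An equivalent, more direct form: index the NA positions and assign the mean
age_direct <- age
age_direct[is.na(age_direct)] <- mean(age_direct, na.rm = TRUE)
print(age_direct)
```

Both approaches give the same result; the indexing form simply avoids building the intermediate vector of repeated means.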

To see what na.rm does, here’s the mean() function on a small vector:

# defining x = 1 2 3
x <- 1:3

# introducing a missing value
x[1] <- NA

# mean = NA
mean(x)
## [1] NA

# mean = mean excluding the NA value
mean(x, na.rm = TRUE)
## [1] 2.5

After identifying and fixing the problem, our dataset looks like this:

##    Country      Age   Salary Purchased
## 1   France 44.00000 72000.00        No
## 2    Spain 27.00000 48000.00       Yes
## 3  Germany 30.00000 54000.00        No
## 4    Spain 38.00000 61000.00        No
## 5  Germany 40.00000 63777.78       Yes
## 6   France 35.00000 58000.00       Yes
## 7    Spain 38.77778 52000.00        No
## 8   France 48.00000 79000.00       Yes
## 9  Germany 50.00000 83000.00        No
## 10  France 37.00000 67000.00       Yes


2. Step: Categorical Data 

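Categorical data is non-numeric data that belongs to particular categories. The Country column in our dataset is categorical data. In versions of R before 4.0.0, read.csv() converted all string columns to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE, so string columns are read as character vectors. Either way, it is better to encode the columns you need explicitly, so that you control the levels and their labels.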

Here’s how you can convert specific variables to factor variables:

dataset$Country <- factor(dataset$Country,
                          levels = c("France", "Spain", "Germany"),
                          labels = c(1, 2, 3))

dataset$Purchased <- factor(dataset$Purchased,
                            levels = c("No", "Yes"),
                            labels = c(0, 1))
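One detail that is easy to miss: the labels passed to factor() become level names, so the encoded column is still a factor and not a numeric vector. The sketch below illustrates this on a short vector taken from the Country column; the variable names are assumptions chosen for the example.

```r
# Illustrative vector (values taken from the Country column above)
country <- c("France", "Spain", "Germany", "Spain")

encoded <- factor(country,
                  levels = c("France", "Spain", "Germany"),
                  labels = c(1, 2, 3))

# The labels are stored as character level names, not as numbers
print(levels(encoded))
print(is.numeric(encoded))

# Convert via character to get the label values back as numbers
codes <- as.numeric(as.character(encoded))
print(codes)
```

If a later step needs real numbers (for example, arithmetic on the codes), converting through as.character() first keeps the label values rather than the internal level positions.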

3. Step: Splitting Data

Now, we have to split our dataset into two separate datasets: one for training our machine learning model and the other for testing it.

To do so, we’ll first install the caTools package (if not available) and load it. Afterwards, we’ll call set.seed() so that the random split is reproducible: running the code again with the same seed gives the same split. Use the following code:

# install.packages("caTools")  # run once if the package is not installed
library(caTools)

set.seed(123)  # fix the random seed so the split is reproducible
split <- sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

We kept the split ratio at 80:20, which is a common choice for training and test sets. sample.split() takes the Purchased column and returns a logical vector of randomly assigned TRUE and FALSE values in the given ratio, preserving the relative proportions of the labels in that column. Rows marked TRUE go to the training set and the rest go to the test set.
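To illustrate what a label-preserving split and a fixed seed look like, here is a base-R sketch. stratified_split() is a hypothetical helper written for this example; it is not how caTools implements sample.split(), only a minimal illustration of the same idea.

```r
# Labels taken from the Purchased column above (5 "No", 5 "Yes")
purchased <- c("No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes")

# Hypothetical helper: a base-R sketch of a stratified split.
# It returns a logical vector (TRUE = training row) and samples each class
# separately so that the class proportions are preserved.
stratified_split <- function(labels, ratio) {
  in_training <- logical(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)
    n_train <- round(ratio * length(idx))
    chosen <- idx[sample.int(length(idx), n_train)]
    in_training[chosen] <- TRUE
  }
  in_training
}

# Setting the same seed before each call reproduces the same split
set.seed(123)
split_a <- stratified_split(purchased, 0.8)
set.seed(123)
split_b <- stratified_split(purchased, 0.8)

print(identical(split_a, split_b))
print(table(purchased, split_a))
```

With five rows per class and a ratio of 0.8, each class contributes four rows to the training set and one to the test set.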


4. Step: Feature Scaling

Feature scaling is required when different features in your dataset have different ranges. In our case, the Age and Salary columns have different ranges, which can cause problems in training our ML model. 

When one feature has a significantly larger range than another, it dominates the Euclidean distance between observations, so distance-based models effectively ignore the smaller feature and can give misleading results.
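A quick calculation shows the effect. The sketch below takes the first two rows of the dataset (Age and Salary only) and computes the Euclidean distance on the raw values; the variable names are assumptions for illustration.

```r
# Two customers from the dataset above: (Age, Salary)
a <- c(Age = 44, Salary = 72000)
b <- c(Age = 27, Salary = 48000)

# Euclidean distance on the raw, unscaled values
raw_distance <- sqrt(sum((a - b)^2))
print(raw_distance)

# The salary gap alone is 24000, so the 17-year age gap barely registers
salary_gap <- abs(a[["Salary"]] - b[["Salary"]])
print(raw_distance - salary_gap)
```

The distance is almost exactly the salary difference: the age difference changes it by less than 0.01.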

Note that some modelling functions in R scale features for you (for example, svm() in the e1071 package scales its inputs by default), but it’s important to know how to do it yourself. Do the following:

training_set[,2:3] = scale(training_set[,2:3])

test_set[,2:3] = scale(test_set[,2:3])

After scaling, the Age and Salary columns in each set have a mean of 0 and a standard deviation of 1, so neither feature dominates distance calculations during model training.
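The code above scales the test set with its own mean and standard deviation. A commonly recommended alternative is to reuse the training set's parameters for the test set, so that both sets are transformed in exactly the same way. The sketch below shows one way to do that with scale(); the small data frames and variable names are assumptions for illustration.

```r
# Illustrative numeric data (a few Age/Salary rows from the table above)
train <- data.frame(Age    = c(44, 27, 30, 38),
                    Salary = c(72000, 48000, 54000, 61000))
test  <- data.frame(Age    = c(35, 48),
                    Salary = c(58000, 79000))

# Scale the training data; scale() records the means and standard deviations
# it used in the "scaled:center" and "scaled:scale" attributes
train_scaled <- scale(train)
train_center <- attr(train_scaled, "scaled:center")
train_sd     <- attr(train_scaled, "scaled:scale")

# Apply the training set's parameters to the test data
test_scaled <- scale(test, center = train_center, scale = train_sd)

print(round(colMeans(train_scaled), 10))  # column means of the scaled training data
print(apply(train_scaled, 2, sd))         # column standard deviations
print(test_scaled)
```

Note that scale() returns a matrix, so wrap the result in as.data.frame() if a later step expects a data frame.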


Conclusion

We hope that you found our data preprocessing in R tutorial helpful. Make sure you understand each step before you try it out on your own data; knowing why a step is needed matters more than memorising the code.

What are your thoughts on our data preprocessing in R tutorial? Share them in the comments below.

If you are curious to learn about R and data science, check out our Executive PG in Data Science, which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 sessions with industry mentors, 400+ hours of learning, and job assistance with top firms.

Blog Author: Rohit Sharma

Rohit Sharma is the Program Director for the UpGrad-IIIT Bangalore PG Diploma Data Analytics Program.

