Apart from staff and infrastructure, data is the new building block of any company. From large corporations to small scale industries, data is the fuel that drives their businesses. This data is associated with their daily business transactions, customer purchase data, sales data, financial charts, business statistics, marketing campaigns and much more. That is why Tim O’Reilly, founder of O’Reilly Media said that we are entering a situation where data is going to be more important than software.
But what to do with so much data? Companies use this data to derive valuable insights into their business performance. They hire data scientists who perform data manipulation in R to make sense out of this data. For example, understanding the sales and marketing data for the past year will give them an idea about where they stand. A recent study showed that the data analytics market is expected to be worth $77.6 billion by 2023.
Data scientists are hired to make sense out of this data by a process called data manipulation.
What is data manipulation?
Data manipulation is the process of organizing data so that it is easier to read and understand. For example, company officials may obtain customer data from their systems and logbooks. Mostly, this data will be stored in CRM (Customer Relationship Management) software and Excel sheets, but it may not be organized properly. Data manipulation covers the ways of organizing all this data, for example by sorting it alphabetically.
The data can be sorted according to date, time, serial number or any other field. People in the accounts department of a company use the data to determine sales trends, user preferences, market statistics and product prices. Financial analysts use data to understand how the stock market is performing, what the trends are and which stocks are the best to invest in.
Furthermore, web server data can be used for understanding how much traffic the website has. In this technological era, IoT is an example of a technology where data is sourced from sensors attached to machines. This data is used for determining the performance of the machine, and if it has any defects. Data manipulation is crucial in IoT as the market will be worth $81.67 billion by 2025.
Data manipulation is popularly performed using a programming language called R. Let us get to know the language a little better.
What is R?
To understand data manipulation in R, you have to know the basics of R. It is a modern programming language that is used for data analytics, statistical computing and artificial intelligence. The language was created in 1993 by Ross Ihaka and Robert Gentleman. Nowadays, researchers, data analysts, scientists and statisticians use R to analyse, clean and visualize data.
R has a huge catalogue of graphical and statistical methods that support machine learning, linear regression, statistical inference and time series analysis. Under the GNU General Public License, the language is freely available for operating systems such as Windows, macOS and Linux. It is cross-platform, which means that R code written on one platform can usually be run on another.
R is now considered one of the main programming languages for data science. It is also a comprehensive language: you can use it for software development as well as for complicated tasks such as statistical modelling. You can develop web applications using its Shiny package.
It is such a powerful language that some of the world’s best companies such as Google and Facebook are using it.
Let us check out some of the most important features of R:
- It has CRAN (Comprehensive R Archive Network), a repository of more than 10,000 R packages that cover the functionality required for working with data
- It is an open-source programming language. This means that you can download it for free and even contribute towards its development, update its features and customize its existing functionalities
- You can create high-quality visualizations from the data at hand from R’s useful graphical libraries such as ggplot2 and plotly
- R is an interpreted programming language. There is no separate compilation step, so you can run a script or a single line straight away, which makes interactive analysis quick. Interpreted code is not inherently faster than compiled code, but R's built-in vectorized operations are implemented in compiled code and run efficiently
- R can perform a variety of complicated calculations on vectors, arrays and data frames in a jiffy. There are many operators for performing these calculations
- It handles structured and unstructured data. Extensions for Big Data and SQL are available for handling all types of data
- R has a continuously growing community that has the brightest minds. These people are constantly contributing towards the programming language by developing R libraries and updates
- You can easily integrate R with other programming languages such as Python, Java and C++. You can also combine it with Hadoop for distributed computing
Now that you have gathered the basics of the R programming language, let us dive into the exciting stuff!
Variables in R
While programming in R or performing any data manipulation in R, you have to deal with variables. Variables are used for storing data that may be in the form of strings, integers, floating-point numbers or just Boolean values. A variable reserves space in memory for its contents. Unlike in many traditional programming languages, a variable in R is not declared with a type; it is simply a name assigned to an R object.
A variable does not have a data type of its own, but takes the type of the R object assigned to it. The most popular R objects are:
- Vectors
- Lists
- Arrays
- Matrices
- Factors
- Data frames
These data structures are extremely important for data manipulation in R and data analysis. Let us look at them in a little more detail to understand basic data manipulation:
Vectors are the most basic data structures and are used for one-dimensional data. The types of atomic vectors are logical, integer, double (numeric), complex, character and raw.
When you create a value in R, it becomes a single-element vector of length 1. For example,
print("ABC")  # single-element vector of type character
print(10.5)   # single-element vector of type double
Elements in vectors are accessed using their index numbers. Index positions in vectors start from 1. For example,
t <- c("Mon", "Tue", "Wed", "Sat")
u <- t[c(1, 2, 3)]
print(u)
The result will be "Mon" "Tue" "Wed"
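As a small sketch of the ideas above, the following lines check the type of a few single values and index a vector. The variable names are illustrative and do not come from the original example.

```r
# Each literal below is already a vector of length 1.
x_chr <- "ABC"   # character
x_dbl <- 10.5    # double
x_int <- 2L      # integer (the L suffix requests an integer)
x_lgl <- TRUE    # logical

# typeof() reports the underlying atomic type of each value.
types <- c(typeof(x_chr), typeof(x_dbl), typeof(x_int), typeof(x_lgl))
print(types)

# Indexing is 1-based; a negative index drops that position.
days <- c("Mon", "Tue", "Wed", "Sat")
first_three  <- days[c(1, 2, 3)]
without_last <- days[-4]
print(first_three)
```

Both `first_three` and `without_last` hold the same three values, which shows two ways of reaching the same subset.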
Lists are objects in R that can hold elements of different types. These can be integers, strings, vectors and even other lists. If the data cannot be held in a data frame or an array, a list is the best option. Lists can also hold a matrix. You can create lists using the list() function.
Use the following code to create a list:
list_data <- list("Black", "Green", c(11, 4, 14), TRUE, 31.22, 120.5)
List elements can be accessed using list indices.
print(list_data[1])  # prints the first element of the list (as a one-element list)
Example of data manipulation with lists:
list_data[4] <- NULL  # removes the fourth element, which is the last one if the list has 4 elements
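The list operations described above can be sketched as follows. The snippet assumes the six-element list from the earlier example and contrasts the two indexing operators; the helper variable names are illustrative.

```r
list_data <- list("Black", "Green", c(11, 4, 14), TRUE, 31.22, 120.5)

# [[ ]] extracts the element itself; [ ] returns a shorter list.
first_value   <- list_data[[1]]   # the string "Black"
first_as_list <- list_data[1]     # a list of length 1

# Assigning NULL to a position removes that element from the list.
list_data[4] <- NULL
print(length(list_data))          # one element fewer than before
```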
Arrays are objects that can be used for storing only a single data type. Data of more than two dimensions can be stored in arrays. For this, you have to use the array() function that takes the vectors as input. It uses the value in the dim parameter for creating the array.
For example, look at the following code:
vector_result <- array(c(vectorA,vectorB),dim = c(3,3,2))
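The line above needs two input vectors to run. A minimal sketch, assuming two small numeric vectors (their values are made up for illustration):

```r
# Hypothetical input vectors; together they hold 9 values.
vectorA <- c(1, 2, 3)
vectorB <- c(4, 5, 6, 7, 8, 9)

# dim = c(3, 3, 2) asks for two 3 x 3 matrices (18 cells in total).
# The 9 input values are recycled, so both matrices have the same content.
vector_result <- array(c(vectorA, vectorB), dim = c(3, 3, 2))
print(dim(vector_result))

# Elements are addressed as [row, column, matrix].
one_cell <- vector_result[2, 3, 1]
```

Arrays are filled column by column, so row 2, column 3 of the first matrix holds the eighth input value.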
In matrices, the elements are organised in a 2-dimensional layout. Matrices hold elements of the same atomic type. These are beneficial when the elements belong to a single class. Matrices having numeric elements are created for mathematical calculations. You can create matrices using the matrix() function.
The basic syntax to create a matrix is given below:
matrix(data, nrow, ncol, byrow, dimnames)
- data – This is the input vector whose elements become the data of the matrix
- nrow – This is the number of rows you want to create
- ncol – This is the number of columns you want to create
- byrow – This is a logical flag. If its value is TRUE, the vector elements are arranged by row; otherwise they are arranged by column
- dimnames – A list holding the names given to the rows and columns
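A short sketch of the matrix() arguments listed above. The row and column names are arbitrary labels chosen for the example.

```r
# Fill a 2 x 3 matrix row by row and label both dimensions.
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
print(m)

# With byrow = TRUE the first row is 1 2 3 and the second is 4 5 6.
value    <- m["r2", "a"]   # 4
col_sums <- colSums(m)     # 5 7 9

# With the default byrow = FALSE the values run down the columns instead.
m_by_col <- matrix(1:6, nrow = 2, ncol = 3)
```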
Factors are R objects used for categorizing data and storing it as levels. They are good for statistical modelling and data analysis. Both integers and strings can be stored in factors. You can use the factor() function for creating a factor by providing a vector as an input to the function.
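As an illustration, the sketch below turns a character vector into a factor. The size labels and the explicit level order are assumptions made for the example.

```r
sizes <- c("small", "large", "medium", "small", "large", "small")

# Without the levels argument, the levels would be sorted alphabetically.
size_factor <- factor(sizes, levels = c("small", "medium", "large"))

print(levels(size_factor))
codes <- as.integer(size_factor)   # 1 3 2 1 3 1: each value's position among the levels
print(table(size_factor))          # how many values fall into each level
```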
A data frame has a two-dimensional, table-like structure with rows and columns. Each column contains the values of one variable, and each row contains one set of values, one from each column. Data frames are used for representing data from spreadsheets. They can store data of factor, numeric or character type.
A data frame has the following features:
- Row names need to be unique
- Column names must be non-empty
- The number of data items in each column must be the same
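The features above can be seen in a small sketch. The employee data is invented purely for illustration.

```r
# Three columns of equal length, each holding one variable.
employees <- data.frame(
  name   = c("Asha", "Ben", "Chen"),
  dept   = factor(c("Sales", "IT", "Sales")),
  salary = c(52000, 61000, 58000)
)
str(employees)

# Columns of incompatible lengths are rejected with an error.
mismatch <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error"
)
```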
Data manipulation in R
During data manipulation in R, the first step is often to create small samples of data from a huge dataset. This is done when the entire data set cannot be analyzed at one time. Usually, data analysts create a representative subset of the dataset. This helps them to identify the trends and patterns in the larger data set. Selecting part of a data set in this way is called subsetting.
The different ways to create a subset in R are as follows:
- $ – This selects a single element by name, such as one column of a data frame, which is returned as a vector
- [[ – This subsetting operator also returns a single element, and you can refer to the element by its position or its name
- [ – This operator is used for returning multiple elements of data
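The three operators can be compared on the built-in iris data frame. This is only a sketch; the variable names are illustrative.

```r
species_col <- iris$Species              # one column, returned as a vector (here a factor)
petal_col   <- iris[["Petal.Length"]]    # one column selected by name
first_col   <- iris[[1]]                 # one column selected by position

# [ keeps the data frame structure and can return several columns or rows.
two_cols   <- iris[c("Sepal.Length", "Species")]
first_rows <- iris[1:5, ]                # rows 1 to 5, all columns
```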
Some of the basic functions for data manipulation in R are:
As the name suggests, the sample() function is used for creating data samples from a larger data set. Along with this command, you mention the number of samples you wish to draw from the dataset or a vector. The basic syntax is as follows:
sample(x, size, replace = FALSE, prob = NULL)
- x – This can be a vector or a dataset of multiple elements from which the sample has to be chosen
- size – This is a positive integer that denotes the number of items to select
- replace – This can be TRUE or FALSE, depending on whether you want sampling with or without replacement
- prob – This is an optional vector of probability weights for the elements of the vector that is being sampled
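A minimal sketch of these arguments in use. The seed value and the weights are arbitrary choices, and the last two lines show one common way, assumed here, of sampling rows of a data frame by drawing row numbers.

```r
set.seed(42)  # arbitrary seed so that the draws are repeatable

# Five distinct values from 1 to 100 (replace defaults to FALSE).
picked <- sample(1:100, size = 5)

# Ten weighted draws with replacement: "H" is more likely than "T".
coin <- sample(c("H", "T"), size = 10, replace = TRUE, prob = c(0.7, 0.3))

# Sample 10 rows of iris by sampling row numbers first.
row_ids     <- sample(nrow(iris), size = 10)
iris_sample <- iris[row_ids, ]
```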
The table() function creates a frequency table that counts how often each unique value of a particular variable occurs. For example, with the built-in iris data set, calling table(iris$Species) creates a table showing the species in the data set and the number of observations of each.
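A runnable sketch of a frequency table on two built-in data sets (the mtcars example is an addition for comparison):

```r
# Count the observations per species in iris.
species_counts <- table(iris$Species)
print(species_counts)

# The same idea on a numeric column: cars per number of cylinders in mtcars.
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
```

The names of the resulting table are the unique values, and the entries are their counts.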
The duplicated() function is used for identifying duplicate values in a data set. It takes a vector or data frame as an argument and returns TRUE for each element that repeats an earlier one and FALSE otherwise. To remove the duplicates, you keep only the elements for which the result is FALSE.
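A short sketch with made-up values, showing both the flags and one way of dropping the repeated entries:

```r
ids <- c(3, 7, 3, 9, 7, 3)

# TRUE marks every value that has already appeared earlier in the vector.
dup_flags  <- duplicated(ids)    # FALSE FALSE TRUE FALSE TRUE TRUE
unique_ids <- ids[!dup_flags]    # 3 7 9

# For a data frame, whole rows are compared.
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))
df_unique <- df[!duplicated(df), ]
```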
Data manipulation in R using the dplyr package
dplyr is a simple and easy-to-use R package for data manipulation, installed from CRAN. The package has built-in functions for data manipulation, exploration and transformation. Let us check out some of the most important functions of this package:
The select() function is one of the basic functions for data manipulation in R. It is used for selecting columns of a data frame, either by column name or by position. The columns can also be selected based on certain conditions. Suppose we want to select the 3rd and 4th columns of a data frame called myData; the code will be select(myData, 3:4).
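Since myData is not defined here, the sketch below uses the built-in mtcars data set instead and assumes the dplyr package is installed.

```r
library(dplyr)

by_position <- select(mtcars, 3:4)                # 3rd and 4th columns
by_name     <- select(mtcars, mpg, cyl)           # columns picked by name
by_helper   <- select(mtcars, starts_with("d"))   # columns whose names start with "d"

print(names(by_position))
```

In mtcars the 3rd and 4th columns are disp and hp, and the helper call returns disp and drat.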
The filter() function is used for keeping the rows of a dataset that match specific criteria. It is called like select(): you pass the data frame first and then one or more conditions, separated by commas.
For example, if you want to keep only the rows for cars that are red in colour, you pass the data set followed by a condition on its colour column.
As a result, the matching rows will be displayed.
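A sketch of such a filter, assuming dplyr is installed. The data frame `cars_df` and its `colour` and `price` columns are hypothetical and exist only for this example.

```r
library(dplyr)

# Hypothetical data set of cars.
cars_df <- data.frame(
  model  = c("A", "B", "C", "D"),
  colour = c("Red", "Blue", "Red", "Green"),
  price  = c(20000, 25000, 18000, 30000)
)

# Keep the rows where the colour is red.
red_cars <- filter(cars_df, colour == "Red")

# Several conditions separated by commas must all hold.
cheap_red <- filter(cars_df, colour == "Red", price < 19000)
print(red_cars)
```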
You can use the mutate() function to create new columns in a dataset while preserving the old ones. Each new column is defined by an expression, which can refer to the existing columns. For example,
mutate(mtcars, mtcars_new_col = mpg / cyl)
This command creates a new column called mtcars_new_col in the mtcars dataset, containing the values of the mpg column divided by those of the cyl column.
The arrange() function is used for sorting rows in ascending or descending order, using one or more variables. Rows are sorted in ascending order by default. Instead of applying the desc() function, you can add a minus (-) symbol before a numeric sorting variable to indicate descending order.
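The sorting options can be sketched on the built-in mtcars data set, assuming dplyr is installed:

```r
library(dplyr)

asc    <- arrange(mtcars, mpg)              # ascending by mpg (the default)
desc_a <- arrange(mtcars, desc(mpg))        # descending with desc()
desc_b <- arrange(mtcars, -mpg)             # descending with a minus sign
multi  <- arrange(mtcars, cyl, desc(mpg))   # by cyl, then by mpg within each cyl

print(head(desc_a$mpg))
```

The minus sign only works for numeric variables, whereas desc() also handles character columns and factors.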
The group_by() function is used for grouping observations in a dataset by one or multiple variables.
The summarise() function is beneficial for determining data insights such as mean, median and mode. It is used along with grouped data created by group_by(). summarise() reduces the multiple values in each group to a single value.
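A minimal sketch of grouping and summarising together, using mtcars and assuming dplyr is installed. The names of the summary columns are illustrative.

```r
library(dplyr)

# Group the cars by number of cylinders.
by_cyl <- group_by(mtcars, cyl)

# Collapse each group into one row of summary values.
cyl_summary <- summarise(
  by_cyl,
  n_cars    = n(),
  mean_mpg  = mean(mpg),
  median_hp = median(hp)
)
print(cyl_summary)
```

The result has one row per group, here one for each distinct value of cyl.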
The merge() function, which is part of base R, combines data sets. This is useful for bringing multiple sources of input data together.
The function offers you 4 ways to merge datasets. They are mentioned below:
- Natural join – This keeps only the rows whose key values match in both data frames (the default)
- Full outer join – This merges and stores all the rows from both of the data frames (all = TRUE)
- Left outer join – This stores all rows of data frame A, and those in B that match (all.x = TRUE)
- Right outer join – This stores all rows of data frame B, and those in A that match (all.y = TRUE)
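The four join types can be sketched with base R's merge() and two small, made-up data frames that share an `id` column:

```r
# Hypothetical input data: ids 2 and 3 appear in both data frames.
customers <- data.frame(id = c(1, 2, 3), name = c("Asha", "Ben", "Chen"))
orders    <- data.frame(id = c(2, 3, 4), amount = c(250, 120, 90))

inner <- merge(customers, orders, by = "id")                 # natural join: ids 2, 3
full  <- merge(customers, orders, by = "id", all = TRUE)     # full outer join: ids 1 to 4
left  <- merge(customers, orders, by = "id", all.x = TRUE)   # left outer join: ids 1, 2, 3
right <- merge(customers, orders, by = "id", all.y = TRUE)   # right outer join: ids 2, 3, 4

print(full)
```

Rows without a match in the other data frame get NA in the columns that come from it.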
rename_if() is a function that you can use for renaming the columns of a data frame that satisfy a specified condition.
rename_all() is used for renaming all the columns of a data frame without specifying any condition.
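Assuming the two functions described here are dplyr's rename_if() and rename_all(), a sketch on the iris data set could look like this. Both are superseded in dplyr 1.0.0 and later by rename_with(), which is shown last.

```r
library(dplyr)

# Rename every column by applying toupper() to its name.
upper_all <- rename_all(iris, toupper)

# Rename only the columns that satisfy a condition (here: numeric columns).
upper_num <- rename_if(iris, is.numeric, toupper)

# The newer equivalent of the conditional rename.
upper_with <- rename_with(iris, toupper, where(is.numeric))

print(names(upper_num))
```

In `upper_num` the four numeric columns are upper-cased, while the Species factor keeps its name.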
The pipe operator is available in packages such as magrittr and dplyr for simplifying your overall code. The operator lets you chain multiple functions together, passing the result of one function as the first argument of the next. Denoted by the %>% symbol, it can be used with popular functions such as summarise(), filter(), select() and group_by() during data manipulation in R.
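A small sketch of a pipeline on mtcars, assuming dplyr is installed. The threshold of 100 horsepower is an arbitrary choice for the example.

```r
library(dplyr)

# Read top to bottom: filter rows, group them, summarise, then sort.
result <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))

# The same steps written as nested calls, which are harder to read.
nested <- arrange(
  summarise(group_by(filter(mtcars, hp > 100), cyl), mean_mpg = mean(mpg)),
  desc(mean_mpg)
)
print(result)
```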
Besides dplyr, there are many other packages in CRAN for data manipulation in R. In fact, you will find more than 7000 packages for reducing your coding and also your errors. Many of these packages are created by expert developers, so you are in safe hands.
If you are a beginner in data manipulation in R, you might go for the built-in base functions available in R. These include functions such as with(), within(), duplicated(), cut(), table(), sample() and sort(). But using them alone can be time consuming and repetitive, so it is not always the most efficient option.
Thus, the best way forward is to use the huge number of packages in CRAN such as dplyr. These are super useful and make your programs more efficient.