Today, we are going to learn about the DataFrame in PySpark, the Python API for Apache Spark. PySpark was one of the top data science tools in 2020. A DataFrame is a distributed collection of rows organized under named columns. It is very similar to a table in a relational database or to a sheet with column headers in Excel. A PySpark DataFrame also shares some characteristics with the RDD:
Distributed: Both DataFrames and RDDs are distributed in nature.
Lazy Evaluation: No task is executed until an action is performed.
Immutable: Like an RDD, a DataFrame cannot be changed once it is created. However, one can apply a transformation to it, which produces a new RDD / DataFrame.
Advantages of DataFrames
1. DataFrames are supported in several languages, such as Java, Scala, R, and Python, so the same API concepts carry over from one programming language to another.
2. DataFrames support a wide range of data sources and formats, which makes it convenient to work with data from different sources.
3. A DataFrame can handle anything from small datasets up to petabytes of data spread across a cluster.
4. Observations in a Spark DataFrame are organized under named columns, so Spark knows the schema of the DataFrame. This helps Spark optimize the execution plan of queries.
5. DataFrames are designed to process large volumes of structured and semi-structured data quickly.
Apache Spark Setup
Apache Spark should be set up on the machine before we can use it for DataFrame operations. A DataFrame supports a wide range of operations for working with data. Here we are going to discuss some common DataFrame operations.
The creation of a SparkContext is the first step in programming with Apache Spark. A SparkContext is required to execute operations on a cluster. It tells Spark how to access the cluster and where to find it.
The SparkContext then establishes the connection to the Spark cluster. If one is using the Spark shell, the SparkContext is already created. Otherwise, one can import SparkContext, optionally provide configuration settings, and initialize it.
One can use this code for the creation:
from pyspark import SparkContext
sc = SparkContext()
DataFrame Creation from CSV file
To read a CSV file, an additional library has to be specified in the Python shell. The first step is to download the latest version of the Spark-CSV package and extract it in Spark's home directory. After that, we open the PySpark shell and include the package.
$ ./bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0
Now we read the data from the CSV file and create the DataFrame.
train = sqlContext.load(source="com.databricks.spark.csv", path='PATH/train.csv', header=True, inferSchema=True)
test = sqlContext.load(source="com.databricks.spark.csv", path='PATH/test-comb.csv', header=True, inferSchema=True)
The train and test CSV files are located in the folder called PATH. Set header = True if the CSV file has a header row. The inferSchema = True option tells the SQL context to detect the data type of each column of the DataFrame automatically. If we do not set inferSchema to True, all the columns will be read as strings.
Manipulation of DataFrame
Now we are going to see how to manipulate the DataFrame:
- Know the data type of each column
We use printSchema to see each column and its data type. Applying printSchema() to the train DataFrame prints the schema in a tree format.
|-- User_ID: integer (nullable = true)
|-- Product_ID: string (nullable = true)
|-- Gender: string (nullable = true)
|-- Age: string (nullable = true)
|-- Occupation: integer (nullable = true)
|-- City_Category: string (nullable = true)
|-- Stay_In_Current_City_Years: string (nullable = true)
|-- Marital_Status: integer (nullable = true)
|-- Product_Category_1: integer (nullable = true)
|-- Product_Category_2: integer (nullable = true)
|-- Product_Category_3: integer (nullable = true)
|-- Purchase: integer (nullable = true)
After reading the CSV file, we can see that we got the correct data type, or schema, for each column in the DataFrame.
- Showing the first n observations
To see the first n observations, one can use the head operation. PySpark's head operation is similar to the head operation in pandas, except that it returns a list of Row objects.
[Row(User_ID=1000001, Product_ID=u'P00069042', Gender=u'F', Age=u'0-17', Occupation=10, City_Category=u'A', Stay_In_Current_City_Years=u'2', Marital_Status=0, Product_Category_1=3, Product_Category_2=None, Product_Category_3=None, Purchase=8370),
Row(User_ID=1000001, Product_ID=u'P00248942', Gender=u'F', Age=u'0-17', Occupation=10, City_Category=u'A', Stay_In_Current_City_Years=u'2', Marital_Status=0, Product_Category_1=1, Product_Category_2=6, Product_Category_3=14, Purchase=15200),
Row(User_ID=1000001, Product_ID=u'P00087842', Gender=u'F', Age=u'0-17', Occupation=10, City_Category=u'A', Stay_In_Current_City_Years=u'2', Marital_Status=0, Product_Category_1=12, Product_Category_2=None, Product_Category_3=None, Purchase=1422),
Row(User_ID=1000001, Product_ID=u'P00085442', Gender=u'F', Age=u'0-17', Occupation=10, City_Category=u'A', Stay_In_Current_City_Years=u'2', Marital_Status=0, Product_Category_1=12, Product_Category_2=14, Product_Category_3=None, Purchase=1057),
Row(User_ID=1000002, Product_ID=u'P00285442', Gender=u'M', Age=u'55+', Occupation=16, City_Category=u'C', Stay_In_Current_City_Years=u'4+', Marital_Status=0, Product_Category_1=8, Product_Category_2=None, Product_Category_3=None, Purchase=7969)]
The head operation returns the results as Row objects, so we will use the show operation to see the result in a more readable tabular form. We can also truncate long values in the result by using the argument truncate = True.
|1000001| P00069042| F|0-17| 10| A| 2| 0| 3| null| null| 8370|
|1000001| P00248942| F|0-17| 10| A| 2| 0| 1| 6| 14| 15200|
only showing top 2 rows
- Counting the number of rows in a DataFrame
To count the number of rows in the DataFrame, we can use the count operation. Now we will count the number of rows in the test and train files by applying the count operation.
We have 233598 and 550069 rows in test and train, respectively.
- Getting the number of columns and the names of the columns from the test and train files
Similar to the columns attribute of a pandas DataFrame, we will use the columns operation to get the names of the columns. First we will print the number of columns and the names of the columns for the train file.
12 ['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Purchase']
Now we do the same for the test file.
13 ['', 'User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Comb']
From the above output, we can see that there are 12 columns in the train file and 13 columns in the test file. The test file has a column "Comb" that is not in the train file, and it has no "Purchase" column. The test file also has one more column that does not have any name.
- Getting the summary statistics such as count, mean, standard deviation, min, and max of the DataFrame's numerical columns
To get summary statistics of a DataFrame, we will use the describe operation. It calculates a statistical summary of the numerical columns. If no column name is specified, the summary statistics are calculated for all the numerical columns in the DataFrame.
|summary| User_ID| Occupation| Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3| Purchase|
| count| 550068| 550068| 550068| 550068| 376430| 166821| 550068|
| mean|1003028.8424013031|8.076706879876669|0.40965298835780306| 5.404270017525106| 9.842329251122386|12.668243206790512| 9263.968712959126|
| stddev|1727.5915855308265|6.522660487341778| 0.4917701263173273|3.9362113692014082| 5.086589648693526| 4.125337631575267|5023.0653938206015|
| min| 1000001| 0| 0| 1| 2| 3| 12|
| max| 1006040| 20| 1| 20| 18| 18| 23961|
This is what we get from the describe operation when the name of a string or categorical column is specified.
| count| 550068|
| mean| null|
| stddev| null|
| min| P00000142|
| max| P0099942|
For a string column, the min and max values are calculated based on the ASCII order of the values, while mean and stddev are null. So the describe operation also works on string type columns.
- Selection of DataFrame columns
To select columns, we pass their names to the select operation, separated by commas. Now we are going to see how "Age" and "User_ID" are selected from the train file.
|User_ID| Age|
|1000002| 55+|
- Finding the number of distinct products in the test and train files
To calculate the number of distinct rows in a DataFrame, we will use the distinct operation. Here we apply the distinct operation to calculate the number of distinct products in the test and train files.
We have 3492 and 3633 distinct products in the test and train files, respectively. From this output, we learn that the train file has more distinct values than the test file. Now we will use the subtract operation to find the Product_ID categories that are present in the test file but not in the train file. One can do the same for all the categorical features.
diff_cat_in_train_test = test.select('Product_ID').subtract(train.select('Product_ID'))
diff_cat_in_train_test.distinct().count()  # For distinct count
From the above result, we know that there are 47 categories that are present in the test file but not in the train file. For these categories, we can either skip the corresponding rows of the test file or collect more data.
- Calculating the pairwise frequency of categorical columns
We can calculate the pairwise frequency of two columns in a DataFrame by using the crosstab operation. Now let us apply the crosstab operation to the "Age" and "Gender" columns of the train DataFrame.
|Age_Gender| F| M|
| 0-17| 5083| 10019|
| 46-50|13199| 32502|
| 18-25|24628| 75032|
| 36-45|27170| 82843|
| 55+| 5083| 16421|
| 51-55| 9894| 28607|
In the above result, the distinct values of Gender are the column names and the distinct values of Age are the row names. If a pair has not occurred, its count in the table is zero.
- How to get a DataFrame with unique rows?
To find the unique rows and leave out duplicates, we will use the dropDuplicates operation. It returns a DataFrame without any duplicate rows. Here is the result of applying dropDuplicates to get all the unique rows for the "Age" and "Gender" columns.
| 55+| F|
| 55+| M|
| 0-17| F|
| 0-17| M|
- How to drop rows with null values?
If one wants to drop all the rows that have a null value, one can use the dropna operation. It takes three options that control which rows are dropped from the DataFrame.
- subset: the list of column names to consider when checking for null values.
- thresh: drops the rows that have fewer than thresh non-null values. By default, nothing is specified for this option.
- how: takes one of two values, any or all. With any, a row is dropped if any value in the row is null. With all, a row is dropped only if all the values in the row are null.
Now we are going to drop the rows with null values by using the default options: any for how, None for thresh, and None for subset.
- How to fill the DataFrame's null values with a constant?
To fill the null values with a constant, we will use the fillna operation. The fillna operation takes two parameters.
- subset: the columns to consider when filling values.
- value: the value that replaces the nulls, which can be of a data type such as int, float, or string. Only columns whose type matches the value are filled.
Here we are going to fill '-1' in place of the null values in the train DataFrame.
|1000001| P00069042| F|0-17| 10| A| 2| 0| 3| -1| -1| 8370|
|1000001| P00248942| F|0-17| 10| A| 2| 0| 1| 6| 14| 15200|
only showing top 2 rows
PySpark is gaining momentum in the world of Artificial Intelligence and Machine Learning, and it is used to solve real-world machine learning problems. You can create RDDs from different data sources, both external and existing, and apply all types of transformations to them. We hope this article has been informative and has given you a deeper insight into PySpark DataFrames.
Is PySpark more efficient than Pandas?
Yes, on large datasets PySpark is usually quicker than pandas. In basic terms, pandas performs operations on a single machine, whereas PySpark distributes operations across several machines. If you're working on a machine learning application with a huge dataset that does not fit on one machine, PySpark is the better option. For small datasets, pandas is often quicker, because Spark adds scheduling and distribution overhead. Scala, which runs on the JVM like Spark itself, can be faster than Python for data processing, especially when Python code has to exchange data with the JVM, as with Python UDFs and RDD operations. DataFrame operations called from Python are executed by Spark's engine, so they perform much the same as in Scala.
What are some of the drawbacks of using Apache PySpark?
Spark doesn't have its own file management system, so it relies on external storage such as Hadoop's HDFS or cloud storage. Because Spark computes in memory, it needs a lot of RAM, which can make it costly to operate. When using Apache Spark with Hadoop, developers run into the small files problem. In Spark Streaming, data is divided into small batches at a predetermined time interval, and each batch is planned and processed independently. As a result, Spark Streaming does not support record-based window criteria. It instead provides time-based window criteria.
How are Datasets, DataFrame and RDD different from each other?
An RDD is a collection of data items partitioned across multiple machines in a cluster. RDDs represent data as a collection of Java or Scala objects. A DataFrame is a distributed collection of data structured into named columns. It is conceptually equivalent to a table in a relational database. A Dataset is an extension of the DataFrame API that offers the type-safe, object-oriented programming interface of the RDD API. Like an RDD, a DataFrame is distributed and immutable, but unlike an RDD, its data is organized into named columns. For simple tasks like grouping data, RDDs are slower than both DataFrames and Datasets. The DataFrame has a simple API for aggregation tasks, and it can aggregate data faster than both RDDs and Datasets.