Anyone who is familiar with Apache Spark knows why it is becoming one of the most preferred Big Data tools today – it allows for super-fast computation.
The fact that Spark supports speedy Big Data processing is making it a hit with companies worldwide. From big names like Amazon, Alibaba, eBay, and Yahoo, to small firms in the industry, Spark has gained an enormous fan following. Thanks to this, companies are continually looking for skilled Big Data professionals with domain expertise in Spark.
If you want to land a job with a Big Data (Spark) profile, you must first crack the Spark interview. Here is something that can get you a step closer to your goal: the 15 most commonly asked Apache Spark interview questions!
- What is Spark?
Spark is an open-source, cluster computing Big Data framework that allows real-time processing. It is a general-purpose data processing engine that is capable of handling different workloads like batch, interactive, iterative, and streaming. Spark executes in-memory computations that help boost the speed of data processing. It can run standalone, or on Hadoop, or in the cloud.
- What is RDD?
RDD, or Resilient Distributed Dataset, is the primary data structure of Spark. It is an essential abstraction in Spark that represents the input data as a collection of objects. An RDD is a read-only (immutable) collection of objects that is split into partitions, and these partitions can be computed on different nodes of a cluster to enable independent data processing.
- Differentiate between Apache Spark and Hadoop MapReduce.
The key differentiators between Apache Spark and Hadoop MapReduce are:
- Spark is easier to program because it ships with high-level APIs, so it needs no extra abstraction layer. MapReduce jobs are typically written in Java, are harder to program, and usually rely on abstractions built on top.
- Spark has an interactive mode, whereas MapReduce lacks it. However, tools like Pig and Hive make it easier to work with MapReduce.
- Spark allows for batch processing, streaming, and machine learning within the same cluster. MapReduce is best-suited for batch processing.
- Spark can process data in real-time via Spark Streaming. There’s no such real-time provision in MapReduce – you can only process a batch of stored data.
- Spark facilitates low-latency computations by caching partial results in memory, which requires more memory space. In contrast, MapReduce is disk-oriented, which allows for persistent storage.
- Since Spark can execute processing tasks in-memory, it can process data much faster than MapReduce.
- What is a Sparse Vector?
A sparse vector consists of two parallel arrays, one for indices and the other for values. Only the non-zero entries are stored, which saves memory space.
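The two-array layout can be sketched in plain Python. This is a conceptual model rather than Spark's own class; the names `SparseVector`, `from_dense`, `to_dense`, and `dot` are chosen here for illustration. PySpark's `Vectors.sparse(size, indices, values)` takes the same three pieces of information.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SparseVector:
    """Toy sparse vector: parallel arrays of indices and non-zero values."""
    size: int
    indices: List[int]
    values: List[float]

    def to_dense(self) -> List[float]:
        # Start from all zeros and fill in only the stored entries.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense

    def dot(self, other: List[float]) -> float:
        # Only the non-zero entries contribute to the dot product.
        return sum(v * other[i] for i, v in zip(self.indices, self.values))


def from_dense(dense: List[float]) -> SparseVector:
    """Keep only the positions whose value is non-zero."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return SparseVector(len(dense), indices, values)


sv = from_dense([0.0, 3.0, 0.0, 0.0, 5.0])
```

Here `sv` stores two indices and two values instead of five numbers; the saving grows with the share of zeros in the vector.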
- What is Partitioning in Spark?
Partitioning splits data into smaller, logical units so that it can be processed in parallel. In Spark, every RDD is partitioned. Partitions let Spark distribute data processing across the executors in the system while keeping the network traffic needed to move data between them to a minimum.
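To make the idea concrete, here is a minimal sketch of hash partitioning in plain Python. It is an illustration only, not Spark's partitioner: `hash_partition` and `process_partitions` are hypothetical helper names, and integer keys are assumed so that the assignment is easy to follow.

```python
from typing import Callable, List, Tuple

Record = Tuple[int, str]


def hash_partition(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Assign each (key, value) record to a partition based on its key."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Records with the same key always land in the same partition.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def process_partitions(partitions: List[List[Record]],
                       fn: Callable[[List[Record]], int]) -> List[int]:
    """Apply fn to every partition independently (sequentially in this sketch)."""
    return [fn(part) for part in partitions]


records = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
parts = hash_partition(records, 2)
sizes = process_partitions(parts, len)
```

Because each partition is handled on its own, a real cluster can hand every partition to a different executor instead of looping over them.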
- Define Transformation and Action.
Both Transformation and Action are operations executed on an RDD.
When a Transformation function is applied to an RDD, it creates another RDD. Two examples of transformations are map() and filter(): map() applies the function passed to it to each element of the RDD and creates another RDD, while filter() creates a new RDD by selecting the elements of the present RDD for which the function argument returns true. A transformation is evaluated lazily: it is executed only when an Action occurs.
An Action retrieves the data from an RDD to the local machine. It triggers the execution by using the lineage graph to load the data into the original RDD, perform all intermediate transformations, and return the final results to the Driver program or write them out to the file system.
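The difference between recording a transformation and running an action can be sketched with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in for an RDD, not part of Spark; only the method names `map`, `filter`, and `collect` mirror the RDD API.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Toy model of an RDD: transformations are recorded, actions run them."""

    def __init__(self, source: Iterable[Any],
                 ops: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = source
        self._ops = ops

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate: Callable) -> "LazyDataset":
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def collect(self) -> List[Any]:
        # Action: replay every recorded transformation and return the result.
        data = iter(self._source)
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # record every real invocation
    return x * 2


pipeline = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: no action has run yet
result = pipeline.collect()        # the action triggers the computation
```

`double` is not called while the pipeline is being built; it runs only once `collect()` is invoked.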
- What is a Lineage Graph?
In Spark, RDDs depend on one another. The graphical representation of these dependencies among the RDDs is called a lineage graph. With information from the lineage graph, each RDD can be computed on demand – if a chunk of a persisted RDD is ever lost, the lost data can be recovered using the lineage graph information.
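A rough sketch of lineage-based recovery, in plain Python: each node remembers its parent and the function that derives it, so a lost partition can be rebuilt by walking back to the source. `LineageNode` and its fields are hypothetical names for this illustration and do not reflect Spark's internal classes.

```python
from typing import Callable, Dict, List, Optional


class LineageNode:
    """Toy dataset that knows how it was derived from its parent."""

    def __init__(self, name: str,
                 parent: Optional["LineageNode"] = None,
                 fn: Optional[Callable[[int], int]] = None,
                 source_partitions: Optional[List[List[int]]] = None):
        self.name = name
        self.parent = parent
        self.fn = fn
        self.source_partitions = source_partitions
        self.cache: Dict[int, List[int]] = {}  # computed partitions

    def lineage(self) -> List[str]:
        # Walk the parent links back to the source, then reverse the chain.
        chain = []
        node: Optional["LineageNode"] = self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def partition(self, index: int) -> List[int]:
        # Serve from the cache if present, otherwise recompute from the parent.
        if index in self.cache:
            return self.cache[index]
        if self.parent is None:
            data = list(self.source_partitions[index])
        else:
            data = [self.fn(x) for x in self.parent.partition(index)]
        self.cache[index] = data
        return data


source = LineageNode("source", source_partitions=[[1, 2], [3, 4]])
squared = LineageNode("squared", parent=source, fn=lambda x: x * x)
plus_one = LineageNode("plus_one", parent=squared, fn=lambda x: x + 1)

first = plus_one.partition(1)
del plus_one.cache[1]   # simulate losing a computed partition
del squared.cache[1]    # ...and the intermediate result it came from
recovered = plus_one.partition(1)  # rebuilt from the lineage
```

Only the lost partition is recomputed; partition 0 is never touched in this example.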
- What is the purpose of Spark Core?
Spark Core is the base engine of Spark. It performs a host of vital functions such as fault tolerance, memory management, job monitoring, job scheduling, and interaction with storage systems.
- Name the major libraries of the Spark Ecosystem.
The major libraries in the Spark Ecosystem are:
- Spark Streaming – It is used to enable real-time data streaming.
- Spark MLlib – It is Spark’s machine learning library, which provides commonly used learning algorithms such as classification, regression, and clustering.
- Spark SQL – It executes SQL-like queries on Spark data and can be used from standard visualization or business intelligence tools.
- Spark GraphX – It is a Spark API for graph processing to develop and transform interactive graphs.
- What is YARN? Is it required to install Spark on all nodes of a YARN cluster?
YARN is the central resource management platform of Hadoop. It enables the delivery of scalable operations across the cluster. While Spark is the data processing tool, YARN is the distributed container manager. Just as Hadoop MapReduce can run on YARN, Spark too can run on YARN.
It is not necessary to install Spark on all nodes of a YARN cluster because Spark executes on top of YARN – it runs independently of where it is installed. Spark also provides several options for running on YARN, such as master, queue, deploy-mode, driver-memory, executor-memory, and executor-cores.
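As an illustration, the options listed above map onto `spark-submit` flags of the same names. The sketch below only assembles the command line in Python and does not launch anything; `build_spark_submit` is a hypothetical helper, and the application file `my_app.py`, the queue name, and the memory and core values are placeholder assumptions.

```python
import shlex
from typing import List


def build_spark_submit(app: str, *, queue: str = "default",
                       deploy_mode: str = "cluster",
                       driver_memory: str = "2g",
                       executor_memory: str = "4g",
                       executor_cores: int = 2) -> List[str]:
    """Assemble a spark-submit argument list for a YARN cluster."""
    return [
        "spark-submit",
        "--master", "yarn",               # run on YARN
        "--deploy-mode", deploy_mode,     # "client" or "cluster"
        "--queue", queue,                 # YARN queue to submit to
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app,
    ]


command = build_spark_submit("my_app.py")
command_line = shlex.join(command)  # printable form of the same command
```

The resulting list could be passed to `subprocess.run` on a machine where Spark is installed and configured for the cluster.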
- What is the Catalyst Framework?
The Catalyst framework is the query optimization framework in Spark SQL. Its main purpose is to let Spark automatically transform SQL queries into more efficient forms, and to make it easy to add new optimizations, so that queries are processed faster.
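The kind of automatic rewrite an optimizer performs can be sketched with a tiny expression tree in plain Python. Constant folding is used here as one simple example of an optimization rule; the classes `Lit`, `Col`, and `Add` and the function `fold_constants` are hypothetical and are not Catalyst's actual API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lit:
    value: int          # a literal constant


@dataclass(frozen=True)
class Col:
    name: str           # a reference to a column


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite the tree bottom-up, replacing Lit + Lit with a single Lit."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr  # literals and columns are already as simple as possible


# x + (1 + 2) is rewritten to x + 3 before any row is processed.
optimized = fold_constants(Add(Col("x"), Add(Lit(1), Lit(2))))
```

The rewrite happens once on the query plan, so the addition of the two constants is not repeated for every row.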
- What are the different types of cluster managers in Spark?
The Spark framework supports three types of cluster managers:
- Standalone – A simple cluster manager that ships with Spark and makes it easy to set up a cluster.
- Apache Mesos – A general-purpose cluster manager that can run Hadoop MapReduce and other applications as well.
- YARN – The cluster manager for handling resource management in Hadoop.
- What is a Worker Node?
A Worker Node is the “slave node” to the Master Node. It refers to any node that can run the application code in a cluster. The master node assigns work to the worker nodes, which perform the assigned tasks. Worker nodes process the data stored on them and then report back to the master node.
- What is a Spark Executor?
A Spark Executor is a process that runs computations and stores data on a worker node. When the SparkContext connects to a cluster manager, it acquires executors on the nodes within the cluster. These executors run the tasks that the SparkContext assigns to them.
- What is a Parquet file?
A Parquet file is a columnar format file that Spark SQL supports for both read and write operations. Using the Parquet (columnar) format has many advantages:
- Column storage format consumes less space.
- Column storage format limits I/O operations.
- It allows you to access specific columns with ease.
- It follows type-specific encoding and delivers better-summarized data.
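The advantages above can be illustrated with a toy columnar layout in plain Python. This is a conceptual sketch and not Parquet's actual on-disk format; `to_columns` and `run_length_encode` are hypothetical helpers, and run-length encoding stands in here as one simple example of a column-level encoding.

```python
from typing import Any, Dict, List, Tuple


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of row dictionaries into one list per column."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    encoded: List[List[Any]] = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [(value, count) for value, count in encoded]


rows = [
    {"country": "IN", "amount": 10},
    {"country": "IN", "amount": 25},
    {"country": "IN", "amount": 7},
    {"country": "US", "amount": 40},
    {"country": "US", "amount": 3},
]
columns = to_columns(rows)
total = sum(columns["amount"])                    # reads a single column only
compact = run_length_encode(columns["country"])   # repeated values compress well
```

Summing `amount` never touches the `country` values, and the five country entries shrink to two pairs, which is the intuition behind the reduced I/O and storage of a columnar format.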
There – we have eased you into Spark. These 15 fundamental concepts in Spark will help you get started with Spark.