With data analytics gaining momentum, there has been a surge in demand for professionals who can handle Big Data. From data analysts to data scientists, Big Data is creating an array of job profiles today, and the first thing you are expected to be hands-on with is Hadoop.
No matter the job role or profile, you will probably be working with Hadoop in one way or another, so you can expect interviewers to send a few Hadoop questions your way.
With that in mind, let us look at the top Hadoop interview questions you can expect in any interview you sit for.
What is Hadoop? What are the primary components of Hadoop?
Hadoop is a framework equipped with the tools and services required to store and process Big Data. It addresses many common Big Data challenges, and it also helps organizations analyze Big Data and make better business decisions.
The primary components of Hadoop are:
- Hadoop MapReduce
- Hadoop Common
- Pig and Hive – The Data Access Components
- HBase – For Data Storage
- Ambari, Oozie and ZooKeeper – Data Management and Monitoring Components
- Thrift and Avro – Data Serialization components
- Apache Flume, Sqoop, Chukwa – The Data Integration Components
- Apache Mahout and Drill – Data Intelligence Components
What are the core concepts of the Hadoop framework?
Hadoop is fundamentally based on two core concepts. They are:
- HDFS: HDFS, or Hadoop Distributed File System, is a Java-based, fault-tolerant file system used for storing vast datasets in blocks. It is built on a Master-Slave architecture.
- MapReduce: MapReduce is a programming model for processing large datasets. It has two parts – 'map' segregates the input data into intermediate (key, value) tuples, and 'reduce' aggregates the tuples produced by map into a smaller, final set of results.
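The map/reduce split described above can be sketched in plain Python (a simplified word-count illustration, not the actual Hadoop API):

```python
from collections import defaultdict

def map_phase(line):
    # 'map' segregates the input into (key, value) tuples
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # 'reduce' aggregates the tuples emitted by map
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

print(reduce_phase(map_phase("big data big hadoop")))
# {'big': 2, 'data': 1, 'hadoop': 1}
```

In a real cluster, many map tasks run in parallel over different splits of the input, and the framework routes their tuples to the reduce tasks.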
Name the most common input formats in Hadoop?
There are three common input formats in Hadoop:
- Text Input Format: This is the default input format in Hadoop. Each line is one record, with the line's byte offset as the key and its contents as the value.
- Sequence File Input Format: This input format is used to read Hadoop sequence files, which are binary files storing key-value pairs.
- Key Value Input Format: This one is used to read plain text files, splitting each line into a key and a value at the first separator (a tab by default).
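A hypothetical sketch of how the two text-based formats would present the same input line (the byte offset and tab-splitting behaviour mirror the defaults described above):

```python
line = "apple\t42"

# Text Input Format (sketch): key = byte offset of the line, value = whole line
text_record = (0, line)

# Key Value Input Format (sketch): the line is split at the first tab
key, _, value = line.partition("\t")
kv_record = (key, value)

print(text_record)  # (0, 'apple\t42')
print(kv_record)    # ('apple', '42')
```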
What is YARN?
YARN is the abbreviation of Yet Another Resource Negotiator. It is Hadoop's resource management layer: it allocates cluster resources such as CPU and memory to applications and schedules their processing.
What is “Rack Awareness”?
“Rack Awareness” is the policy the NameNode uses to decide where data blocks and their replicas are placed within the Hadoop cluster. With the help of rack definitions, replicas are placed so that most read and write traffic stays within a rack, while at least one replica sits on a different rack for fault tolerance. This reduces network congestion between racks.
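A simplified sketch of the default three-replica placement idea, with hypothetical rack and node names: one replica on the writer's rack, and the other two together on a single remote rack.

```python
import random

def place_replicas(nodes_by_rack, writer_rack, seed=0):
    # First replica on the writer's rack; the remaining two on one remote
    # rack, mirroring HDFS's default rack-aware placement for 3 replicas.
    rng = random.Random(seed)
    first = rng.choice(nodes_by_rack[writer_rack])
    remote_rack = rng.choice([r for r in nodes_by_rack if r != writer_rack])
    second, third = rng.sample(nodes_by_rack[remote_rack], 2)
    return [first, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(racks, "rack1"))
```

Keeping two replicas on the same remote rack limits cross-rack write traffic while still surviving the loss of an entire rack.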
What are Active and Passive NameNodes?
A high-availability Hadoop system usually contains two NameNodes – Active NameNode and Passive NameNode.
The NameNode that actively serves client requests and runs the Hadoop cluster is called the Active NameNode; the standby NameNode, which keeps a synchronized copy of the Active NameNode's metadata, is the Passive NameNode.
The purpose of having two NameNodes is that if the Active NameNode crashes, the Passive NameNode can take over. Thus, a NameNode is always running in the cluster, and the cluster remains available.
What are the different schedulers in the Hadoop framework?
There are three different schedulers in the Hadoop framework:
- COSHH – COSHH makes scheduling decisions by considering the heterogeneity of both the cluster and the workload.
- FIFO Scheduler – FIFO lines up jobs in a queue in order of arrival, without considering heterogeneity or priority.
- Fair Sharing – Fair Sharing creates a pool of map and reduce slots for each user, so that cluster capacity is shared fairly among users' jobs.
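The contrast between FIFO and Fair Sharing can be sketched as two toy scheduling functions (a simplification with hypothetical job names; real schedulers also handle priorities, capacities, and preemption):

```python
def fifo_schedule(jobs):
    # FIFO: run strictly in order of arrival, ignoring job size or user
    return [j["name"] for j in sorted(jobs, key=lambda j: j["arrival"])]

def fair_schedule(jobs_by_user):
    # Fair Sharing: give each user's pool a slot in turn
    order = []
    while any(jobs_by_user.values()):
        for jobs in jobs_by_user.values():
            if jobs:
                order.append(jobs.pop(0))
    return order

print(fifo_schedule([{"name": "B", "arrival": 2}, {"name": "A", "arrival": 1}]))
# ['A', 'B']
print(fair_schedule({"user1": ["j1", "j2"], "user2": ["j3"]}))
# ['j1', 'j3', 'j2']
```

Note how under Fair Sharing, user2's single job runs before user1's second job, instead of waiting behind user1's whole queue.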
What is Speculative Execution?
Often in a Hadoop cluster, some nodes run slower than the rest, which holds back the entire job. To overcome this, Hadoop detects, or 'speculates', when a task is running slower than expected and launches an equivalent backup copy of that task on another node. Both copies run simultaneously; whichever finishes first is accepted, and the other is killed. This backup feature of Hadoop is known as Speculative Execution.
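The "accept whichever finishes first" idea can be sketched with Python threads (a simplified model with hypothetical task functions; Hadoop actually schedules the backup attempt on a different node and kills the loser):

```python
import concurrent.futures
import time

def run_with_speculation(task, backup_task):
    # Launch the original task and a speculative backup; accept whichever
    # finishes first and discard the other's result.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(task), pool.submit(backup_task)]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

def straggler():   # a task stuck on a slow node
    time.sleep(0.5)
    return "straggler result"

def backup():      # the speculative copy on a healthy node
    return "backup result"

print(run_with_speculation(straggler, backup))  # backup result
```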
Name the main components of Apache HBase?
Apache HBase has three main components:
- Region Server: A table is divided into multiple regions; Region Servers host these regions and serve clients' read and write requests for them.
- HMaster: This is the master process that coordinates and manages the Region Servers, for example by assigning regions and balancing load.
- ZooKeeper: ZooKeeper is the coordinator within the HBase distributed environment. It helps maintain server state inside the cluster through communication in sessions.
What is “Checkpointing”? What is its benefit?
Checkpointing refers to the procedure by which an FsImage and the edit log are merged to form a new FsImage. Thus, instead of replaying the entire edit log, the NameNode can load its final in-memory state directly from the FsImage. The Secondary NameNode is responsible for this process.
The benefit that Checkpointing offers is that it minimizes the startup time of the NameNode, thereby making the entire process more efficient.
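The merge itself can be sketched as replaying a list of edits over an old namespace image to produce a new one (a toy model with hypothetical paths and only create/delete operations):

```python
def checkpoint(fsimage, edit_log):
    # Replay the edit log on top of the old FsImage once, producing a new
    # FsImage the NameNode can load directly at startup.
    new_image = dict(fsimage)
    for op, path in edit_log:
        if op == "create":
            new_image[path] = {}
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

old_image = {"/data": {}}
edits = [("create", "/data/logs"), ("delete", "/data")]
print(checkpoint(old_image, edits))  # {'/data/logs': {}}
```

After the checkpoint, a restarting NameNode loads the new image in one step instead of replaying every logged edit.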
How do you debug Hadoop code?
To debug Hadoop code, first check the list of MapReduce jobs that are currently running. Then check whether any orphaned jobs are running alongside them; if so, you need to find the location of the ResourceManager logs by following these simple steps:
Run "ps -ef | grep -i ResourceManager" and, in the displayed result, look for an error related to the specific job id.
Now, identify the worker node that was used to execute the task. Log in to that node and run "ps -ef | grep -i NodeManager".
Finally, examine the NodeManager log. Most errors come from the user-level logs for each MapReduce job.
What is the purpose of RecordReader in Hadoop?
Hadoop splits data into blocks. RecordReader reassembles these blocks into complete, readable records for the mapper. For example, if the input data is split into two blocks –
Row 1 – Welcome to
Row 2 – UpGrad
RecordReader will read this as “Welcome to UpGrad.”
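The stitching behaviour can be sketched as a generator that buffers data across block boundaries and only yields whole records (a simplification; real record readers work with byte offsets and input splits):

```python
def read_records(blocks, sep="\n"):
    # Buffer data across block boundaries and yield only whole records.
    buffer = ""
    for block in blocks:
        buffer += block
        while sep in buffer:
            record, buffer = buffer.split(sep, 1)
            yield record
    if buffer:
        yield buffer  # last record has no trailing separator

# "Welcome to UpGrad" is split mid-record across two blocks:
print(list(read_records(["Welcome to ", "UpGrad"])))  # ['Welcome to UpGrad']
```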
What are the modes in which Hadoop can run?
The modes in which Hadoop can run are:
- Standalone mode – This is the default mode of Hadoop, used for debugging purposes. It does not support HDFS; everything runs in a single process on the local file system.
- Pseudo-distributed mode – This mode requires the configuration of the mapred-site.xml, core-site.xml, and hdfs-site.xml files. All daemons run on one machine, so the Master and Slave nodes are the same here.
- Fully-distributed mode – Fully-distributed mode is Hadoop's production stage, in which data is distributed across various nodes of a Hadoop cluster. Here, the Master and the Slave nodes are allotted separately.
Name some practical applications of Hadoop.
Here are some real-life instances where Hadoop is making a difference:
- Managing street traffic
- Detecting and preventing fraud
- Analysing customer data in real time to improve customer service
- Accessing unstructured medical data from physicians, HCPs, etc., to improve healthcare services
What are the vital Hadoop tools that can enhance the performance of Big Data?
The ecosystem tools discussed above, such as Hive, Pig, HBase, ZooKeeper, Oozie, Flume, and Sqoop, are the Hadoop tools that boost Big Data performance significantly.
These Hadoop interview questions should be of great help to you in your next interview. While interviewers sometimes twist Hadoop interview questions, that should not be an issue for you if you have your basics sorted.
If you are interested in knowing more about Big Data, check out our Advanced Certificate Programme in Big Data from IIIT Bangalore.
What are Pig and Hive?
Pig is an analysis tool developed by Yahoo for working with large datasets of all kinds. Pig Latin, its dataflow language, is used to analyse data in Hadoop. Pig operates on the client side, has no dedicated database for storing metadata, does not support JDBC or ODBC drivers, and provides built-in operators for join, filter, and similar operations.

Hive, developed by Facebook, was built on the Hadoop ecosystem as a data warehouse over the Hadoop Distributed File System (HDFS). It uses a declarative, SQL-like language (HiveQL), operates on the server side of the cluster, and is suitable for OLAP (Online Analytical Processing) workloads.
What is meant by Master-Slave architecture in Hadoop?
The Master-Slave architecture is a design in which a centralised or privileged node coordinates the work, while the other nodes are slaves that carry out the tasks assigned to them by the master. In Hadoop, the master is the NameNode: it monitors the DataNodes, holds the filesystem metadata, and receives heartbeat signals from the DataNodes. The slave nodes, or DataNodes, store the actual data and perform operations on it. The Secondary NameNode performs checkpointing periodically and thus assists the main NameNode in its operations.
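The heartbeat relationship above can be sketched as the master checking when each slave last reported in (a toy model with hypothetical node names and timestamps):

```python
def live_datanodes(last_heartbeat, now, timeout=10):
    # The master marks a slave dead when no heartbeat arrives in time.
    return {node for node, t in last_heartbeat.items() if now - t <= timeout}

beats = {"datanode1": 95, "datanode2": 80}  # last heartbeat times, seconds
print(live_datanodes(beats, now=100))  # {'datanode1'}
```

When a node is marked dead, the master can re-replicate its blocks to the surviving slaves.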
How does MapReduce work?
MapReduce is used for massively parallel data processing. It consists of four phases: mapper, combiner, shuffle-and-sort, and reducer. The mapper processes splits of the input dataset and emits intermediate key-value pairs; in word count, for example, the key is the word and the value is a count of 1. The combiner is an optional optimisation in which a partial reduction happens at the node level before data crosses the network. Shuffle-and-sort then groups the pairs by key and sorts them; it reorganises the data rather than removing duplicates. Finally, the reducer aggregates the values for each key into the final output. This is how MapReduce works.
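All four phases can be traced end-to-end in a small word-count sketch (plain Python, not the Hadoop API; each split stands in for one mapper's input):

```python
from itertools import groupby

def mapper(line):
    # emit an intermediate (word, 1) pair per word
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # optional node-level reduction before data crosses the network
    pairs = sorted(pairs)
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(pairs, key=lambda p: p[0])]

def shuffle_sort(all_pairs):
    # group values by key across all mappers (sorting, not deduplication)
    all_pairs = sorted(all_pairs)
    return {k: [v for _, v in grp]
            for k, grp in groupby(all_pairs, key=lambda p: p[0])}

def reducer(grouped):
    # aggregate each key's values into the final result
    return {k: sum(vs) for k, vs in grouped.items()}

splits = ["big data big", "data big"]
combined = [pair for s in splits for pair in combiner(mapper(s))]
print(reducer(shuffle_sort(combined)))  # {'big': 3, 'data': 2}
```

The combiner shrinks the first split's three pairs to two before the shuffle, which is exactly the network saving it exists to provide.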