Big Data interviews may be conducted on general lines (wherein you must have a general idea about the popular Big Data frameworks and tools) or they may be focused on a particular framework or tool. Today, we are going to focus on one widely used Big Data framework – Apache Hive.
We have created this list of Apache Hive interview questions to help you get a better idea about the kind of questions that employers usually ask during Hadoop interviews pertaining to Hive.
So, if you are someone who wishes to nail a Hive interview, keep reading till the end!
- What is Apache Hive?
Apache Hive is a data warehousing framework built on top of Hadoop. It is primarily used for analyzing structured and semi-structured data. Hive projects structure onto the data and executes queries written in HQL (Hive Query Language), which is similar to SQL. The Hive compiler then transforms these queries into MapReduce jobs.
- What kind of applications can Hive support?
Hive supports client applications written in Python, Java, C++, Ruby, and PHP, which connect to it through its Thrift server.
- What do you mean by a Metastore? Why does Hive not store the metadata in HDFS?
The Metastore is a repository in Hive that stores metadata. It does so by using an RDBMS along with an open-source ORM (Object-Relational Mapping) layer called DataNucleus, which converts the object representation into the relational schema and vice versa.
Hive stores metadata in an RDBMS and not in HDFS because read/write operations on HDFS are time-consuming. An RDBMS has the advantage here since it offers low-latency access.
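To make the idea of keeping metadata in a relational database concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the metastore RDBMS. The `tbls` and `partitions` tables and their columns are simplified, hypothetical names chosen for illustration; they are not the real metastore schema.

```python
import sqlite3

def build_toy_metastore():
    """Create an in-memory database holding table and partition metadata."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema: one row per table, one row per partition.
    conn.executescript(
        """
        CREATE TABLE tbls (tbl_id INTEGER PRIMARY KEY, name TEXT UNIQUE, location TEXT);
        CREATE TABLE partitions (
            tbl_id INTEGER REFERENCES tbls(tbl_id),
            spec TEXT,
            location TEXT
        );
        """
    )
    conn.execute("INSERT INTO tbls VALUES (1, 'sales', '/warehouse/sales')")
    conn.executemany(
        "INSERT INTO partitions VALUES (1, ?, ?)",
        [
            ("year=2023", "/warehouse/sales/year=2023"),
            ("year=2024", "/warehouse/sales/year=2024"),
        ],
    )
    conn.commit()
    return conn

def partition_locations(conn, table_name):
    """Look up a table's partition directories with a single small query."""
    rows = conn.execute(
        "SELECT p.location FROM partitions p "
        "JOIN tbls t ON t.tbl_id = p.tbl_id "
        "WHERE t.name = ? ORDER BY p.spec",
        (table_name,),
    )
    return [location for (location,) in rows]

metastore = build_toy_metastore()
print(partition_locations(metastore, "sales"))
```

The point of the sketch is that a metadata lookup is a small, selective query, which is the kind of access a relational database handles with low latency.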
- Differentiate between Local and Remote Metastore.
A local metastore runs in the same JVM in which the Hive service runs. It connects to a database running in a separate JVM, either on the same machine or on a remote machine. In contrast, a remote metastore runs in its own JVM, separate from the one where the Hive service runs.
- What do you mean by a Partition in Hive? What is its importance?
In Hive, tables are organized into partitions that group similar data together based on the value of a column, called the partition key. Each partition is a sub-directory in the table directory. A table may have more than one partition key.
Through partitioning, you can achieve granularity in a Hive table. This reduces query latency because a query scans only the relevant partitions instead of the whole dataset.
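The effect of partitioning on a scan can be sketched in a few lines of Python. This is a toy model, not Hive code: the `TABLE_PARTITIONS` layout, the `country` partition key, and the `scan` function are all assumptions made for illustration.

```python
# Hypothetical table layout: each partition is a sub-directory named key=value.
TABLE_PARTITIONS = {
    "country=IN": [("alice", 120), ("bob", 75)],
    "country=US": [("carol", 200)],
    "country=DE": [("dave", 50), ("erin", 90)],
}

def scan(partitions, country=None):
    """Return matching rows and the list of partition directories that were read."""
    rows, dirs_read = [], []
    for directory, data in partitions.items():
        _, value = directory.split("=", 1)
        # Partition pruning: skip whole directories that cannot match the filter.
        if country is not None and value != country:
            continue
        dirs_read.append(directory)
        rows.extend(data)
    return rows, dirs_read

rows, dirs_read = scan(TABLE_PARTITIONS, country="US")
print(rows, dirs_read)
```

With a filter on the partition key, only one of the three directories is read; without it, every directory is scanned.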
- What is a Hive Variable?
A Hive variable is a variable created in the Hive environment that Hive scripts can reference. It passes values to Hive queries when the query starts executing, for example in a script run with the source command.
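Hive scripts reference such variables with the `${hivevar:name}` syntax. The following Python sketch models the substitution step only; the `substitute` function and the sample query are hypothetical and do not reflect how Hive implements it internally.

```python
import re

# Matches references of the form ${hivevar:name}.
HIVEVAR_PATTERN = re.compile(r"\$\{hivevar:([A-Za-z_][A-Za-z0-9_]*)\}")

def substitute(query, variables):
    """Replace each ${hivevar:name} reference with its value before execution."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined Hive variable: {name}")
        return variables[name]
    return HIVEVAR_PATTERN.sub(lookup, query)

query = "SELECT * FROM sales WHERE year = ${hivevar:target_year}"
print(substitute(query, {"target_year": "2024"}))
```

The same query text can then be reused with different values, which is the main reason to parameterize scripts this way.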
- What kind of data warehouse applications is Hive suitable for?
The design constraints of Hadoop and HDFS put certain limitations on Hive’s abilities. Also, it doesn’t have the features required for OLTP (Online Transaction Processing). Hive is best suited for data warehouse applications on massive data sets where:
- Relatively static data is analyzed.
- Fast response times are not required.
- The data does not change rapidly.
- What is a Hive Index?
A Hive index is a Hive query optimization method. It is used to speed up access to a specific column or set of columns in a Hive database. With a Hive index, the database system does not need to read all the rows in a table to find the requested data.
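The general idea behind an index, looking up row positions by column value instead of scanning every row, can be sketched as follows. This is a generic illustration in Python with made-up data; `build_index` and `lookup` are hypothetical helpers and do not describe Hive's index storage.

```python
from collections import defaultdict

ROWS = [
    {"id": 1, "city": "Pune"},
    {"id": 2, "city": "Delhi"},
    {"id": 3, "city": "Pune"},
    {"id": 4, "city": "Mumbai"},
]

def build_index(rows, column):
    """Map each distinct column value to the positions of the rows that hold it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index

def lookup(rows, index, value):
    """Fetch matching rows via the index instead of scanning every row."""
    return [rows[position] for position in index.get(value, [])]

city_index = build_index(ROWS, "city")
print(lookup(ROWS, city_index, "Pune"))
```

Building the index costs one pass over the data, after which each lookup touches only the matching rows.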
- Why do you need HCatalog?
HCatalog is required for sharing data structures with external systems. It provides access to the Hive metastore, so that other tools can read data from and write data to the Hive data warehouse.
- Name the components of a Hive query processor.
The components of a Hive query processor are:
- Logical Plan Generation.
- Physical Plan Generation.
- Execution Engine.
- UDFs and UDAFs.
- Semantic Analyzer.
- Type Checking.
- How do ORC format tables help Hive to enhance the performance?
The ORC (Optimized Row Columnar) file format stores Hive data efficiently and overcomes numerous limitations of the other Hive file formats. Using ORC files improves performance when Hive reads, writes, and processes data.
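One reason a columnar layout helps is that a query over a single column does not have to read the other columns. The Python sketch below is a toy model of that idea only; it does not reproduce the actual ORC file layout, and the data and function names are assumptions.

```python
ROWS = [
    ("alice", "IN", 120),
    ("bob", "US", 75),
    ("carol", "US", 200),
]
COLUMNS = ("name", "country", "amount")

def to_columnar(rows, columns):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}

def total_from_rows(rows, columns, column):
    """Row layout: every field of every row is read to sum one column."""
    i = columns.index(column)
    values_read = sum(len(row) for row in rows)
    return sum(row[i] for row in rows), values_read

def total_from_columns(columnar, column):
    """Columnar layout: only the requested column is read."""
    values = columnar[column]
    return sum(values), len(values)

columnar = to_columnar(ROWS, COLUMNS)
print(total_from_rows(ROWS, COLUMNS, "amount"))
print(total_from_columns(columnar, "amount"))
```

Both functions return the same total, but the columnar version reads a third as many values in this example.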
- What is the function of the Object-Inspector?
In Hive, the Object-Inspector helps to analyze the internal structure of a row object and individual structure of columns. Furthermore, it also offers ways to access complex objects that can be stored in different formats in memory.
- What’s the difference between Hive and HBase?
The key differentiating points between Hive and HBase are:
- Hive is a data warehouse framework whereas HBase is a NoSQL database.
- While Hive can run most SQL-style queries through HQL, HBase does not natively support SQL queries.
- Hive doesn’t support record-level insert, update, and delete operations on a table, but HBase supports these functions.
- Hive runs on top of MapReduce, but HBase runs on top of HDFS.
- What is a Managed Table and an External Table?
For a managed table, both the metadata and the table data are deleted from the Hive warehouse directory when you drop the table. For an external table, however, dropping the table deletes only the metadata associated with it, while the table data is retained in HDFS.
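The difference in drop behavior can be modeled with a small Python sketch that uses temporary local directories in place of HDFS. `ToyCatalog` and its methods are hypothetical names for illustration, not a Hive API.

```python
import os
import shutil
import tempfile

class ToyCatalog:
    """Hypothetical stand-in for the metastore: table name -> (kind, data directory)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, kind, location):
        os.makedirs(location, exist_ok=True)
        self.tables[name] = (kind, location)

    def drop_table(self, name):
        kind, location = self.tables.pop(name)  # metadata is always removed
        if kind == "managed":
            shutil.rmtree(location)  # managed: the data is deleted too
        # external: the data directory is left untouched

root = tempfile.mkdtemp()
catalog = ToyCatalog()
managed_dir = os.path.join(root, "warehouse", "orders")
external_dir = os.path.join(root, "landing", "clicks")
catalog.create_table("orders", "managed", managed_dir)
catalog.create_table("clicks", "external", external_dir)

catalog.drop_table("orders")
catalog.drop_table("clicks")
managed_exists = os.path.exists(managed_dir)
external_exists = os.path.exists(external_dir)
print(managed_exists, external_exists)
shutil.rmtree(root)  # clean up the temporary directories
```

After both drops the catalog is empty, but only the external table's directory is still on disk.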
- Name the different components of a Hive architecture.
There are five components of a Hive architecture:
- User Interface – It allows the user to submit queries and other operations to the Hive system. The user interface supports the Hive web UI, the Hive command line, and Hive HDInsight.
- Driver – It creates a session handle for the queries and then sends the queries to the compiler, which creates an execution plan for them.
- Metastore – It stores the structural information about the tables and partitions in the warehouse, including their attributes. On receiving a metadata request, it sends the metadata to the compiler so that the queries can be compiled.
- Compiler – It parses the queries, performs semantic analysis on the different query blocks and query expressions, and generates the execution plan.
- Execution Engine – While the compiler makes the execution plan, the execution engine executes it. It manages the dependencies between the various stages of the plan.