Apache HBase is an excellent big data solution when you want your application to push or pull data in real time. It is best known for its flexible schema and high read/write speed. This article aims to give you the answers to some of the top HBase interview questions. Interviewers want to test candidates’ technical as well as general awareness, so your effort should be to communicate the concepts precisely and thoroughly.
Many leading companies around the world use HBase, including Adobe, HubSpot, Facebook, Twitter, Yahoo!, OpenLogic, and StumbleUpon. For aspiring web developers looking to build scalable websites, mastering tools like Hadoop and HBase can prove immensely useful.
Top HBase Interview Questions & Answers
1. What is HBase?
It is a column-oriented database developed by the Apache Software Foundation. Running on top of a Hadoop cluster, HBase is used to store semi-structured and unstructured data, so it does not have a rigid schema like a traditional relational database. Nor does it support SQL syntax. HBase stores and operates on data through a master node that regulates the cluster and region servers that serve the data.
2. What are the reasons for using HBase?
HBase offers a high-capacity storage system with random read and write operations. It can handle large datasets, performing many operations per second. Its distributed, horizontally scalable design makes HBase a popular choice for real-time applications.
3. Explain the key components of HBase.
The working parts of HBase include Zookeeper, HBase Master, RegionServer, Region, and Catalog Tables. The purpose of each element can be described as follows:
- Zookeeper coordinates between the client and the HBase Master
- HBase Master monitors the RegionServer and takes care of the admin functions
- RegionServer supervises the Region
- Region contains the MemStore and HFile
- Catalog Tables comprise ROOT and META
Basically, HBase consists of a set of tables, with each table having rows, columns, and a primary key. A column in HBase denotes an attribute of an object.
4. What are the different types of operational commands in HBase?
There are five crucial operational commands in HBase: Get, Delete, Put, Increment, and Scan.
Get is used to read the table. Executed via HTable.get, it returns data or attributes of a specific row from the table. Delete removes rows from a table, whereas Put adds or updates rows. Increment enables increment operations on a single row. Finally, Scan is used to iterate over multiple rows for certain attributes.
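The semantics of these commands can be illustrated with a toy in-memory model. This is a plain-Python sketch, not HBase's real client API (which is Java: `Table.get`, `Table.put`, and so on); the `MiniTable` class and its method names are invented for illustration.

```python
# Toy model of HBase's Get, Put, Delete, and Scan semantics.
class MiniTable:
    def __init__(self):
        self.rows = {}  # row key -> {column: value}

    def put(self, row, column, value):
        # Put adds a new row or updates an existing one.
        self.rows.setdefault(row, {})[column] = value

    def get(self, row):
        # Get returns the data/attributes of a specific row.
        return self.rows.get(row, {})

    def delete(self, row):
        # Delete removes a row from the table.
        self.rows.pop(row, None)

    def scan(self, start_row="", stop_row=None):
        # Scan iterates over multiple rows in lexicographic row-key order.
        for key in sorted(self.rows):
            if key >= start_row and (stop_row is None or key < stop_row):
                yield key, self.rows[key]

t = MiniTable()
t.put("row1", "cf:name", "alice")
t.put("row2", "cf:name", "bob")
print(t.get("row1"))   # {'cf:name': 'alice'}
t.delete("row2")
print(list(t.scan()))  # [('row1', {'cf:name': 'alice'})]
```

Note how Scan, like a real HBase scan, walks rows in sorted row-key order rather than insertion order.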
5. What do you understand by WAL and Hlog?
- WAL stands for Write Ahead Log and is quite similar to the binary log (binlog) in MySQL. It records all changes to the data before they are applied.
- HLog is the file that physically stores the WAL: a Hadoop sequence file on disk whose entries are HLogKeys recording the edits.
WAL and HLog serve as lifelines in the event of server failure and data loss. If a RegionServer crashes or becomes unavailable before its in-memory data is flushed, the WAL files are replayed so that the data changes are not lost.
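The log-then-apply-then-replay cycle can be sketched in a few lines. This is a conceptual model of the write-ahead-log idea, not HBase's actual implementation; the `Store` class and its names are invented for illustration.

```python
# Sketch of the write-ahead-log idea: every mutation is appended to a
# durable log BEFORE it touches the in-memory store, so a crash can be
# recovered by replaying the log.
class Store:
    def __init__(self):
        self.wal = []       # durable log of (row, column, value) edits
        self.memstore = {}  # in-memory state, lost on a crash

    def put(self, row, column, value):
        self.wal.append((row, column, value))  # 1. log first
        self.memstore[(row, column)] = value   # 2. then apply

    def crash(self):
        self.memstore = {}  # simulate losing the in-memory state

    def replay(self):
        # Re-apply every logged edit to rebuild the in-memory state.
        for row, column, value in self.wal:
            self.memstore[(row, column)] = value

s = Store()
s.put("r1", "cf:a", "x")
s.crash()
print(s.memstore)  # {} -- in-memory data gone
s.replay()
print(s.memstore)  # {('r1', 'cf:a'): 'x'} -- recovered from the WAL
```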
6. Describe some situations wherein you would use HBase.
It is suitable to use HBase when:
- The size of your data is vast, requiring you to operate on millions of records.
- You can accommodate a complete redesign, since moving from a conventional RDBMS to HBase is not a simple migration.
- You have the resources to undertake infrastructure investment in clusters.
- You can do without RDBMS features that HBase lacks, such as transactions, typed columns, inner joins, etc.
7. What do you mean by column families and row keys?
Column families constitute the basic storage units in HBase. These are defined during table creation and stored together on the disk, later allowing for the application of features like compression.
A row key enables the logical grouping of cells. It prefixes every cell’s coordinates, letting the application define the sort order through how it constructs the key. In this way, all the cells with the same row key are stored on the same server.
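Because HBase sorts rows lexicographically by row key, the application must design keys with that byte-wise ordering in mind. A small illustration (the `user-` key scheme here is an invented example):

```python
# HBase orders rows lexicographically by row key, so unpadded numbers
# sort "wrong" for humans:
keys = ["user-2", "user-10", "user-1"]
print(sorted(keys))   # ['user-1', 'user-10', 'user-2'] -- 10 before 2!

# Zero-padding the numeric part restores the intended scan order:
padded = [f"user-{n:04d}" for n in (2, 10, 1)]
print(sorted(padded)) # ['user-0001', 'user-0002', 'user-0010']
```

This is why row-key design (padding, reversing timestamps, salting) matters so much for scan performance in HBase.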
8. How does HBase differ from a relational database?
HBase is different from a relational database as it is a schema-less, column-oriented data store containing sparsely populated tables. A relational database is schema-based, row-oriented, and stores normalized data in thin tables. Moreover, HBase has the advantage of automated partitioning, whereas there is no such built-in support in RDBMS.
9. What constitutes a cell in HBase?
Cells are the smallest units of HBase tables, holding the data in the form of tuples. A tuple is a data structure having multiple parts. In HBase, it consists of {row, column, version}.
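The {row, column, version} addressing can be modeled directly. In this sketch (pure Python, not HBase's API), the version is a timestamp and a read returns the newest version by default, as HBase does:

```python
from collections import namedtuple

# A cell is addressed by {row, column, version}; the version is usually
# a timestamp, and reads return the newest version by default.
Cell = namedtuple("Cell", ["row", "column", "version", "value"])

cells = [
    Cell("row1", "cf:city", 1000, "paris"),
    Cell("row1", "cf:city", 2000, "london"),  # newer version of the same cell
]

def latest(cells, row, column):
    # Pick the matching cell with the highest version (newest timestamp).
    matching = [c for c in cells if c.row == row and c.column == column]
    return max(matching, key=lambda c: c.version).value

print(latest(cells, "row1", "cf:city"))  # london
```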
10. Define compaction in HBase.
Compaction is the process of merging several HFiles into a single larger file, after which the old files are removed from the database. A minor compaction merges a few recent files, while a major compaction rewrites all the files in a store into one.
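The merge itself can be sketched as follows. This is a toy model of the idea, not HBase's actual compaction code: several immutable "files" are merged into one, keeping only the newest version of each cell.

```python
# Toy model of compaction: merge immutable files into one, keeping only
# the newest version of each cell; the old files are then dropped.
def compact(hfiles):
    merged = {}
    for hfile in hfiles:
        for key, (version, value) in hfile.items():
            # Keep an entry only if it is newer than what we already have.
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    return merged

hfile1 = {("row1", "cf:a"): (100, "old")}
hfile2 = {("row1", "cf:a"): (200, "new"), ("row2", "cf:a"): (150, "x")}
print(compact([hfile1, hfile2]))
# {('row1', 'cf:a'): (200, 'new'), ('row2', 'cf:a'): (150, 'x')}
```

Note that the result is independent of file order, because versions decide which value wins.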
11. Can you access HFile directly without using HBase?
Yes, an HFile can be read directly without going through HBase. The HFile.main method can be used for this purpose.
12. Discuss deletion and tombstone markers in HBase.
In HBase, a normal deletion process results in a tombstone marker. The deleted cells become invisible immediately, but the data they represent is only physically removed during a major compaction. HBase has three types of tombstone markers:
- Version delete marker: It marks a single version of a column for deletion
- Column delete marker: It marks all versions of a column
- Family delete marker: It marks all columns of a column family for deletion
Here, it needs to be noted that a row in HBase would be entirely deleted after major compaction. Therefore, when you delete and add more data, the Gets may be masked by tombstone markers, and you may not see the inserted values until after the compactions.
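The masking behavior described above can be modeled in a few lines. This is a toy sketch of the tombstone idea (the function names are invented; HBase's real implementation differs): a delete writes a marker rather than removing data, gets are masked by the marker, and a major compaction physically drops both the marker and the data it covers.

```python
# Toy model of delete tombstones in a versioned store.
TOMBSTONE = object()
store = {}  # (row, column) -> list of (version, value-or-TOMBSTONE)

def put(row, col, version, value):
    store.setdefault((row, col), []).append((version, value))

def delete(row, col, version):
    put(row, col, version, TOMBSTONE)  # just another (newer) entry

def get(row, col):
    entries = store.get((row, col), [])
    if not entries:
        return None
    version, value = max(entries, key=lambda e: e[0])  # newest wins
    return None if value is TOMBSTONE else value

def major_compact():
    # Physically remove cells whose newest entry is a tombstone.
    for key in list(store):
        if get(*key) is None:
            del store[key]

put("r1", "cf:a", 100, "v1")
delete("r1", "cf:a", 200)
print(get("r1", "cf:a"))  # None -- the value is masked by the tombstone
major_compact()
print(store)              # {} -- data and marker physically removed
```

A put with a version older than an existing tombstone stays masked the same way, which is exactly the "inserted values not visible until after compaction" effect noted above.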
13. What happens when you alter the block size of a column family?
If your table already contains data and you alter your column family’s block size in HBase, the old data is not rewritten immediately. The old and new data behave like this:
- Existing HFiles keep the old block size and continue to be read correctly.
- New HFiles are written with the new block size.
In this way, all data is converted to the desired block size by the next major compaction, which rewrites the old files.
14. Define the different modes that HBase can run.
HBase can run either in standalone mode or in distributed mode. Standalone is the default mode of HBase and uses the local file system instead of HDFS. Distributed mode can be further subdivided into:
- Pseudo-distributed mode: All daemons run on a single node
- Fully-distributed mode: Daemons run across all nodes in the cluster
15. How would you implement joins in HBase?
HBase does not support joins directly. Join queries are instead implemented in the application or in MapReduce jobs, which retrieve data from the relevant HBase tables and combine it; MapReduce lets this approach scale to terabytes of data.
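A common client-side pattern is a hash join: build an in-memory map on the join key from the smaller table, then stream the larger table through it. A minimal sketch with invented example data (in practice both dicts would come from HBase scans):

```python
# Client-side hash join over two "scanned tables" (toy data).
users  = {"u1": {"name": "alice"}, "u2": {"name": "bob"}}
orders = {"o1": {"user": "u1", "total": 30},
          "o2": {"user": "u1", "total": 12}}

# 1. Build a hash map on the join key from the smaller table.
by_user = users
# 2. Stream the larger table through it, emitting matched rows.
joined = [
    {"order": oid, "name": by_user[row["user"]]["name"], "total": row["total"]}
    for oid, row in sorted(orders.items())
    if row["user"] in by_user
]
print(joined)
```

In a MapReduce job, the same logic is distributed: mappers emit records keyed by the join key, and reducers combine the records that share a key.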
16. Discuss the purpose of filters in HBase.
Filters were introduced in Apache HBase 0.92 to help users access HBase over Shell or Thrift. So, they take care of your server-side filtering needs. There are also decorating filters that extend the uses of filters to gain additional control over returned data. Here are some examples of filters in HBase:
- Bloom Filter: Typically used for real-time queries, it is a space-efficient way of checking whether an HFile may contain a specific row or cell; it can return false positives but never false negatives
- Page Filter: Accepting the page size as a parameter, the Page Filter limits the number of rows returned, which can optimize the scan of individual HRegions
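The Bloom filter idea itself is simple enough to sketch: a bit array plus k hash functions, answering "definitely not present" or "possibly present". This is a minimal pure-Python illustration of the data structure, not HBase's implementation:

```python
import hashlib

# Minimal Bloom filter: k hash positions are set per key; membership
# checks can yield false positives but never false negatives.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-123")
print(bf.might_contain("row-123"))  # True (guaranteed: no false negatives)
```

This is why HBase can skip reading an entire HFile when the file’s Bloom filter reports the requested row as definitely absent.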
17. Compare HBase with (i) Cassandra (ii) Hive.
(i) HBase and Cassandra: Both Cassandra and HBase are NoSQL databases designed to manage large datasets. However, the syntax of Cassandra Query Language (CQL) is modeled after SQL. In both data stores, the row key forms the primary index. Cassandra can additionally create secondary indexes on column values, which can improve access to columns with high levels of repetition. HBase lacks this built-in provision but offers other mechanisms to achieve secondary-index functionality.
(ii) HBase and Hive: Both of them are Hadoop-based technologies. As discussed above, HBase is a NoSQL key/value database. On the other hand, Hive is an SQL-like engine capable of running sophisticated MapReduce jobs. You can perform read and write data operations from Hive to HBase and vice-versa. While Hive is more suitable for analytical tasks, HBase is an excellent solution for real-time querying.
Conclusion
These HBase interview questions and use cases bring us to the end of this article. We attempted to cover different topics to cater to basic, intermediate, and advanced levels. So, keep on revising to make a stellar impression on your recruiter!