Table of Contents
What is Hive?
The Apache hive is an open-source data warehousing tool developed by Facebook for distributed processing and data analytics. It is developed on top of the Hadoop Distributed File System (HDFS). A mechanism for projecting structure onto the data in Hadoop is provided by Hive. A SQL-like language called HiveQL (HQL) is used to query that data. There is a similarity between the tables in Hive and tables in a relational database. Hive queries can be easily written by whoever is familiar with SQL.
A few features of Hive are:
- Storage of schema information into a database and the processed data into HDFS.
- Designed for OLAP.
- The querying language is HiveQL or HQL, which is similar to SQL.
- It is fast, familiar, scalable, and extensible.
Uses of Hive
- It is the Apache Hive distributed storage.
- Tools are provided that enable the users to easily extract, transform, and load data.
- A variety of data formats are offered for providing the structure.
- Files stored in Hadoop Distributed File System (HDFS) can be accessed by Hive.
Commands of Hive
The hive commands are:
- Data Definition Language (DDL): The tables and other objects in the database are built and modified through these commands.
- CREATE: It is used to create a table or Database.
- SHOW: It is used to show Database, Table, Properties, etc.
- ALTER: It is used to make changes to the existing table.
- DESCRIBE: It describes the table columns.
- TRUNCATE: Used to permanently truncate and delete the rows of tables.
- DELETE: Deletes the table data, but can be restored.
- Data Manipulation Language (DML): used to retrieve, store, modify, delete, insert, and update data in the database.
- Syntax for LOAD, INSERT Statements
LOAD data <LOCAL> inpath <file path> into table [tablename]
- After loading of the data the data manipulation commands are used to retrieve the data.
- Count aggregate function is used to count the total number of the records in a table.
- “create external” keyword is used to create a table and provides a location where the table will be created. An EXTERNAL table points to any HDFS location for its storage.
- Insert commands are used to load the data Hive table. The “insert overwrite” is used to overwrite the existing data and “insert into” is used to append the data into an existing data.
- A table is divided into partitions by the “partitioned by” command and divided into buckets by “clustered by” command.
- Insertion of data throws errors as the dynamic partition is not enabled. Therefore, the following parameters are to be set in the Hive shell.
To enable dynamic partitions, by default, it’s false
- ‘Drop Table’ command deletes the data and metadata for a table
- Aggregation: Syntax:
Select count (DISTINCT category) from tablename;
The command will count different categories of ‘cate’ tables.
- Grouping: Syntax:
Select category, sum( amount) from txt records group by category
The result set will be grouped into one or more columns.
- Join Operation: perform to combine fields from two tables by using values common to each column.
- Left outer join: For table A and B, left outer join is to contain all records of the “left” table (A), even if the join-condition does not find any matching record in the “right” table (B).
- Right Outer Join: Every row from the “right” table (B) will appear in the joined table at least once.
- Full join: The joined table will contain all records from both tables The joined table will contain all records from both tables.
The apache hive architecture is shown in Figure 1.
List of Major Components
The major components of the hive architecture are:
1. Hive client
Different applications written in languages like Java, Python, C++, etc. are communicated through the use of different drivers provided by Hive. It can be written in any language as per choice. The Clients and servers in turn communicate with the Hive server in the Hive services.
Mostly they are categorized into three types:
- Thrift Client: It is based on Apache Thrift to serve a request from a Thrift client. The Thrift client will be used for communication for the Thrift-based applications.
- JDBC Client: JDBC is provided for Java-related applications. Java applications are connected to the Hive using the JDBC driver. It further uses the Thrift to communicate with the Hive server.
- ODBC Client: The applications based on the ODBC protocol are allowed to connect to the Hive through the ODBC drivers. Similar to JDBC, it uses Thrift to communicate to the Hive server.
2. Hive Services
Hive services provide means for the interactions of Hive with the Clients. Any query-related operations that have to be performed by the Client will have to be communicated through the Hire services. For Data Definition Language (DDL) operations, CLI acts as the Hive service.
All the drivers have to communicate with the Hive server and then to the main driver in the Hive services. Drivers in the Hive services represent the main driver which communicates with the Client specific applications and all types of JDBC, ODBC, etc. The requests from different applications are processed by the driver to the metastore and field systems which will be further processed.
Services offered by Hive are:
- Beeline: The Beeline is a command shell where a user can submit its queries to the system. It is supported by HiveServer2. It is a JDBC client that is based on SQLLINE CLI.
- Hive Server 2: Clients are allowed to execute the queries against the hive. A successor of HiveServer1, it allows the execution of multiple queries from multiple clients. It provides the best support for open API clients like JDBC and ODBC.
- Hive Driver: The user submits the HiveQL statements to the Hive driver through the command shell. It sends the query to the compiler and creates session handles for the query.
- Hive compiler: The Hive compiler is used for passing the query. Using the metadata stored in the metastore, the Hive compiler performs semantic analysis and type checking on the different query blocks and expressions. An execution plan is then generated by the compiler which is the DAG (Directed Acyclic Graph). Each stage of the DAG is a metadata operation, operation on HDFS, or is a map/reduce job.
- Optimizer: The main role of the optimizer is to perform transformation operations on the execution plan. It increases efficiency and scalability by splitting the tasks.
- Execution Engine: After the completion of the compilation and optimization steps, it is the role of the execution engine that executes the execution plan created by the compiler. The plan is executed using Hadoop in order of their dependencies.
- Metastore: Metastore is generally a relational database that stores the metadata information related to the structure of the tables and partitions. It is a central repository that also includes storing information of column and column types. Information related to serializer and deserializer, are also stored in Metastore which is required for reading/write operations along with HDFS files which store data. A Thrift interface is provided by Metastore for querying and manipulating Hive metadata.
Metastore can be configured in two modes:
- Remote: This mode is useful for non-Java applications and in the remote mode the metastore is a Thrift service.
- Embedded: In this mode, the client can directly interact with the metastore through the JDBC.
- HCatalog: The table and storage management layer for Hadoop is the HCatalog. Different data processing tools for reading and writing data on the grid are available like Pig, MapReduce, etc. Built on the top of Hive metastore, the tabular data of Hive metastore is exposed to other data processing tools.
- WebHCat: WebHCat is an HTTP interface and REST API for HCatalog. It performs Hive metadata operations and offers a service of running Hadoop MapReduce (or YARN), Pig, Hive jobs.
3. Processing and Resource Management
The execution of the queries is carried out by an internal MapReduce framework.
The MapReduce framework is a software framework for processing large amounts of data on large clusters of commodity hardware. The data is split into chunks and then processed by map-reduce tasks.
4. Distributed Storage
The Hive services communicate with the Hive storage for performing the following actions:
- The Hive “Meta storage database” holds the metadata information of tables created in Hive.
- The Hadoop cluster on HDFS will store Query results and the data loaded onto the tables.
Different Modes of Hive
Depending on the size of the data, Hive can operate in two modes.
- Local Mode
The local mode of Hive is used when
- The Hadoop installed has one data node and is installed under pseudo mode.
- The data size of a single local machine is smaller.
- Fast processing on local machines due to the smaller data sets present.
- Map reduce mode
The Map reduce mode of Hive is used when
- Hadoop has multiple data nodes with distributed data across the different nodes.
- The data size is larger and parallel execution of the query is required.
- Large data sets can be processed with better performance.
Characteristics of Hive
- Data is loaded into the tables after the tables and the databases are created.
- Only structured data stored in tables can be managed and queried by Hive.
- The Hive framework has features of optimization and usability while dealing with the structured data which is not present in Map Reduce.
- For ease of use, Hive SQL-inspired language is a simpler approach compared to the complex programming language of Map Reduce. Familiar concepts of tables, rows, columns, etc. are used in Hive.
- For increasing the performance of the queries, Hive can partition the data using a directory structure.
- Hive contains an important component called the “Metastore” which resides in a relational database and stores schema information. Two methods can be used to interact with Hive: Web GUI and Java Database Connectivity (JDBC) interface.
- A command-line interface (CLI) is used for most of the interactions. The CLI is used for writing the Hive queries using the Hive Query Language (HQL).
- The HQL syntax is similar to that of the SQL syntax.
- Four file formats are supported by Hive; TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar File).
Apache Hive is an open-source data warehousing tool consisting of major components like Hive clients, Hive services, Processing framework and Resource Management, and Distributed Storage.
It is built on top of the Hadoop ecosystem for the processing of structures and semi-structured data. The user interface provided by Hive enables the user to submit their queries in Hive Query Language (HQL). This is passed to the compiler for generating an execution plan. The plan is finally executed by the execution engine.
If you are interested to know more about Big Data, check out our PG Diploma in Software Development Specialization in Big Data program which is designed for working professionals and provides 7+ case studies & projects, covers 14 programming languages & tools, practical hands-on workshops, more than 400 hours of rigorous learning & job placement assistance with top firms.
Check our other Software Engineering Courses at upGrad.