
Apache Hive Ultimate Tutorial For Beginners: Learn Hive from Scratch

Last updated:
20th Mar, 2020
Read Time: 6 Mins

Apache Hive is an open-source data warehouse system built on top of Hadoop. It is used for querying and analyzing large datasets stored in Hadoop files. This Apache Hive tutorial will help you understand its basics, characteristics, and usage. 

In the digital age, about 2.5 quintillion bytes of data are generated every day. We need innovative technologies to contain this data explosion. And Hive is one such tool that processes structured and semi-structured data in the industry-leading Hadoop ecosystem. As more and more employers seek to leverage the capabilities of big data, they are looking for people who are well-versed with Hadoop tools. Therefore, a Hadoop Hive tutorial is an essential component of any big data course for beginners. 

What is Hive? Explain in simple terms.

Apache Hive allows developers to summarize data, run queries, and analyze large data sets. Built on top of the Hadoop Distributed File System (HDFS), it brings more structure to the data by organizing it into tables. Also, Hive uses its HiveQL or HQL language to perform SQL-like queries on the data.

While SQL runs against traditional databases, Hive automatically translates HQL queries into MapReduce jobs. Hive abstracts the complexity of Hadoop by converting SQL-like queries into a series of jobs to be executed on the Hadoop cluster. So, to master Apache Hive, you need a basic familiarity with SQL. But there is no need to learn Java. 
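To make this concrete, here is a sketch of what a HiveQL query looks like. The sales table and its columns are hypothetical, used only for illustration:

```sql
-- Hypothetical 'sales' table. Hive compiles this GROUP BY
-- into one or more MapReduce jobs behind the scenes.
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE sale_year = 2019
GROUP BY region;
```

Anyone who has written ANSI SQL will find this syntax familiar; the difference is entirely in how it executes.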


Moving on in our Apache Hive tutorial, let’s discuss its uses in modern workplace environments.

Why do you need to use Hive?

Traditional database systems are not equipped to handle the large amount of data generated by big data applications today. And Hadoop is a framework that solves this problem. Various tools aid the Hadoop modules, Hive being one of them. With Apache Hive, you can perform the following tasks:

  • Tables can be partitioned and bucketed, making it feasible to process data stored in the Hadoop Distributed File System (HDFS). Tables are defined directly in HDFS
  • JDBC/ODBC drivers are available for integration with traditional technologies
  • Provides schema flexibility and evolution along with data summarization, facilitating easier analyses
  • Saves you from writing complex Hadoop MapReduce jobs
  • The partition and bucket concept enables fast data retrieval
  • Very easy to learn and implement for SQL developers
  • Fast and scalable system
  • Hive supports different file formats, such as text file, sequence file, RC file, ORC file, Parquet file, and Avro file
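Several of the features listed above come together in table DDL. The sketch below creates a partitioned, bucketed table stored as ORC; the page_views table and its columns are hypothetical:

```sql
-- Illustrative DDL: a table partitioned by date, bucketed by user_id,
-- and stored in the ORC columnar format mentioned above.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```

Queries that filter on view_date then read only the matching partition directories in HDFS, which is what makes partition-based retrieval fast.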


What are the major components of the Hive architecture?

1. User interface: Lets users submit queries and monitor their progress. The Command Line Interface (CLI) and web UI allow external users to connect with Hive. 

2. Metastore: As the name suggests, the metastore holds the metadata of the database. It contains information about the schema and location of tables, and it also stores the partition metadata. Kept in a traditional relational database, it allows you to monitor the distributed data in the cluster. It tracks the metadata, replicates it, and provides backup.

3. Driver: It is that part of the process engine that receives HiveQL statements. The driver creates sessions to execute the statement and monitors its lifecycle. It also stores the metadata generated during the statement’s execution. 

4. Compiler: This part of the HiveQL process engine parses the query into an Abstract Syntax Tree (AST) and then compiles it into an execution plan: a Directed Acyclic Graph (DAG) of MapReduce stages.

5. Optimizer: This component of the Hive architecture performs transformations in the execution plan to provide an optimized DAG. It splits the tasks for better performance. 

6. Executor: It schedules or pipelines the tasks to complete the execution process. For this, it interacts with the Hadoop job tracker.
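A practical way to see the metastore at work is to ask Hive for a table's metadata. A sketch, assuming a hypothetical page_views table already exists:

```sql
-- Both commands read from the metastore, not from the data itself.
DESCRIBE FORMATTED page_views;   -- schema, HDFS location, table properties
SHOW PARTITIONS page_views;      -- partition metadata held in the metastore
```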

Read: Hadoop Tutorial for Beginners

This Apache Hive tutorial cannot be complete without discussing how these Hive components interact with one another to carry out queries. So, we have listed the steps below.

Step 1: User enters a query into the CLI or Web UI, which forwards the query to the driver.

Step 2: The driver passes the query to the compiler for checking. The compiler ensures the accuracy of the syntax.

Step 3: The compiler requests the Metastore for the required metadata in order to proceed further.

Step 4: After receiving the metadata, the compiler sends the execution plan back to the driver.

Step 5: The driver forwards this plan to the execution engine.

Step 6: The execution engine carries out the final stages. It sends the task to the JobTracker (Name node) within Hadoop’s MapReduce module.

Step 7: The JobTracker further assigns the task to the TaskTracker (Data node).

Step 8: The query is executed, and the results are sent back to the executor. 

Step 9: The executor sends the results to the driver. 

Step 10: The driver forwards results to the user interface of Hive. 
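You can watch the compiler and optimizer at work with Hive's EXPLAIN command, which prints the execution plan (the stage DAG) without actually running the query. A sketch against a hypothetical sales table:

```sql
-- EXPLAIN shows the plan produced in Steps 2-4 above:
-- the stages, their dependencies, and the map/reduce operators.
EXPLAIN
SELECT region, COUNT(*)
FROM sales
GROUP BY region;
```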


Read: Hadoop Developer Salary in India

What do you know about Hive Shell?

Hive Shell allows users to run HQL queries. It is Hive’s command-line interface. You can run Hive Shell in two modes:

  • Non-interactive: Specify the location of the file containing HQL queries with the -f option. For example, hive -f my-script.q
  • Interactive: Enter the Hive Shell directly and submit queries manually to get the results. For example, run $ bin/hive to open the shell
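The two modes above can be sketched as a terminal session; my-script.q and the queries shown are illustrative:

```shell
# Non-interactive mode: run a script of HQL statements and exit
hive -f my-script.q

# A one-off query with -e is also non-interactive
hive -e 'SELECT COUNT(*) FROM sales;'

# Interactive mode: open the Hive Shell, then type queries at the prompt
$ bin/hive
hive> SHOW TABLES;
hive> quit;
```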

List some limitations of Hive

  • It offers limited subquery support
  • Hive queries have high latency
  • Materialized views are not allowed in Apache Hive
  • It does not support real-time queries or row-level update and delete operations
  • Apache Hive is not suitable for online transaction processing (OLTP)
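For example, on a classic (non-ACID) Hive table, a row-level UPDATE is rejected, and the usual workaround is to rewrite the affected partition wholesale. A sketch, assuming a hypothetical sales table with columns (region, amount) partitioned by sale_year:

```sql
-- On a non-ACID table, this statement fails:
UPDATE sales SET amount = 0 WHERE region = 'APAC';

-- The batch-oriented workaround: rewrite the whole partition.
INSERT OVERWRITE TABLE sales PARTITION (sale_year = 2019)
SELECT region,
       CASE WHEN region = 'APAC' THEN 0 ELSE amount END
FROM sales
WHERE sale_year = 2019;
```

This rewrite-the-partition pattern is why Hive suits batch analytics rather than transactional workloads.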


Summing up


In this Hadoop Hive tutorial, we covered different aspects of Hive, its usage, and architecture. We also delved into its working and discussed its limitations. All this information will help you start your Hive learning journey. After all, it is one of the most widely used and trusted big data frameworks! 

If you are interested to know more about Big Data, check out our Advanced Certificate Programme in Big Data from IIIT Bangalore.

Learn Software Development Courses online from the World’s top Universities. Earn Executive PG Programs, Advanced Certificate Programs or Masters Programs to fast-track your career.


Utkarsh Singh

Blog Author

Frequently Asked Questions (FAQs)

1. What is a DAG, and what are its uses?

DAG stands for Directed Acyclic Graph. In compilers, it is constructed from three-address code: nodes represent parts of an expression, and edges represent the relationships between the nodes, so a DAG shows how various code segments depend on one another. Leaf nodes store identifiers and constants, while interior nodes store operators. DAGs help in the code optimisation phase by eliminating common subexpressions and dead code and by applying transformations to basic blocks, and a topological sort can be performed on them. In Hive, the compiler produces a DAG of MapReduce stages as the query's execution plan.

2. What is meant by OLTP, and why is Hive not suitable for it?

OLTP stands for Online Transaction Processing. It is a method of processing a large number of short transactions quickly. Online banking, order entry, and ticket booking are typical examples. An OLTP setup is often described as an operational environment (business processes, tools, and end-users) feeding an information environment (data warehouses and data marts, the smaller subunits of data). OLTP systems maintain integrity constraints, allow concurrent read-write operations, perform CRUD operations, and support incremental backups; throughput is their key performance metric, and they ensure consistency even over large databases. Hive, by contrast, is built for batch analytics: its queries compile into high-latency MapReduce jobs, and classic Hive lacks row-level updates and deletes, which makes it unsuitable for transactional workloads.

3. What are the salaries of Apache Hive developer jobs in India?

An Apache Hive developer is expected to have hands-on experience in Hadoop, an understanding of Spark pipelines, the ability to write complex Hive queries, and should have some experience in NoSQL and Cloud. Apache Hive developers design and implement solutions that help in data processing at a large scale. They also design, build, maintain, and reuse code. An individual's salary depends on location, organisation, skill set, work experience, etc. The average salary for Apache Hive developer jobs in India is around INR 7.5 LPA. For developers with 3-5 years of experience, the compensation could range from INR 15 LPA to 25 LPA.
