Bucketing in Hive: Create Bucketed Table in Hive

Working with a big dataset can be challenging. There’s a lot to keep track of and one small error can disturb your entire workflow. One of the most prominent tools for managing large datasets is bucketing. 

This article will show you how to perform bucketing in Hive. We’ll explore multiple implementations of this technique through examples.

What is Bucketing in Hive?

Bucketing is a data organization technique. While partitioning and bucketing in Hive are similar concepts, bucketing divides a large dataset into a fixed number of smaller, more manageable sets called buckets. 

With bucketing in Hive, you can decompose a table’s data set into smaller parts, making them easier to handle. Bucketing allows you to group similar data and write it to one single file, which improves performance when joining tables or reading data. This is a big reason why bucketing is used together with partitioning most of the time. 

When Do We Use Bucketing? 

Bucketing is a very useful functionality. If you haven’t used it before, you should keep the following points in mind to determine when to use this function:

  • When a column has high cardinality, we can’t practically partition on it. A very high number of partitions generates too many Hadoop files, which increases the load on the node: the node has to keep the metadata of every partition, and that degrades its performance.
  • You should use bucketing if your queries involve several map-side joins. A map-side join is a process where you join two tables using only the map phase, without a reduce phase. 
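The mechanics of a map-side join can be sketched outside Hive: the smaller table is held in memory as a hash map, and each record of the larger table is joined during the map phase, with no shuffle or reduce step. Here is a minimal Python sketch; the table contents are made up for illustration:

```python
# Map-side (broadcast) join sketch: the small table fits in memory,
# so every mapper can join against it with no reduce/shuffle phase.
small_table = {  # a small dimension table, keyed on the join column
    "SG": "Singapore",
    "DO": "Dominican Republic",
}

big_table = [  # a large fact table, streamed record by record
    ("SG", 1001),
    ("DO", 1002),
    ("SG", 1005),
]

# The "map" step: look up each record's key in the in-memory table.
joined = [
    (code, empid, small_table[code])
    for code, empid in big_table
    if code in small_table
]
print(joined)
```

Hive can perform this conversion automatically when the smaller table is small enough to broadcast.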

Highlights of Bucketing in Hive

Bucketing is based on a hashing function, which gives it the following characteristics:

  • The hash_function depends on the data type of the bucketing column.
  • Records with the same value in the bucketed column are stored in the same bucket. 
  • Bucketing requires the CLUSTERED BY clause to divide a table into buckets. 
  • In the table directory, each bucket is stored as its own file (named 000000_0, 000001_0, and so on).
  • Bucketing is a standalone function. This means you can perform bucketing without performing partitioning on a table.
  • A bucketed table creates nearly equally distributed data file sections. 
  • Note that bucketing doesn’t ensure your table is properly populated, so you’ll have to manage loading the data into the buckets yourself, which can be cumbersome. 
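The hashing behaviour above can be illustrated with a small simulation. Here Python’s zlib.crc32 stands in for Hive’s hash_function (Hive’s real hash differs, so the exact bucket contents are illustrative, not what Hive would produce):

```python
import zlib

NUM_BUCKETS = 4

def bucket_of(value: str) -> int:
    # Stand-in for Hive's hash_function; Hive's actual hash differs.
    return zlib.crc32(value.encode()) % NUM_BUCKETS

names = ["Emerry", "Zephr", "Autumn", "Kasimir", "Mufutau",
         "Ayanna", "Selma", "Berk", "Imogene", "Colorado"]

buckets = {b: [] for b in range(NUM_BUCKETS)}
for name in names:
    buckets[bucket_of(name)].append(name)

# Every record lands in exactly one bucket, and records with the same
# bucketing-column value always land in the same bucket.
assert sum(len(v) for v in buckets.values()) == len(names)
assert bucket_of("Zephr") == bucket_of("Zephr")
```

Because the hash spreads values out, the resulting buckets tend to be of roughly similar size, which is what makes bucketed files nearly equally distributed.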

Read: Hive Vs Spark

Bucketing in Hive: Example #1

It’d be best to understand bucketing in Hive by using an example. We’ll use the following data for our example:

EMPID | FIRSTNAME | LASTNAME   | SPORTS     | CITY         | COUNTRY
1001  | Emerry    | Blair      | Basketball | Qutubullapur | San Marino
1002  | Zephr     | Stephenson | Cricket    | Neerharen    | Dominican Republic
1003  | Autumn    | Bean       | Basketball | Neerharen    | Dominican Republic
1004  | Kasimir   | Vance      | Badminton  | Neerharen    | Dominican Republic
1005  | Mufutau   | Flores     |            | Qutubullapur | San Marino
1006  | Ayanna    | Banks      | Football   | Neerharen    | Dominican Republic
1007  | Selma     | Ball       | Tennis     | Qutubullapur | San Marino
1008  | Berk      | Fuller     | Badminton  | Neerharen    | Dominican Republic
1009  | Imogene   | Terrell    |            | Qutubullapur | San Marino
1010  | Colorado  | Hutchinson | Tennis     | Qutubullapur | San Marino

 

Our sample data contains employee information for sports teams. However, some of the employees are not part of any team. 

Here’s the sample data you can copy-paste to follow along with this example:

id,FirstName,LastName,Sports,City,Country
1001,Emerry,Blair,Basketball,Qutubullapur,San Marino
1002,Zephr,Stephenson,Cricket,Neerharen,Dominican Republic
1003,Autumn,Bean,Basketball,Neerharen,Dominican Republic
1004,Kasimir,Vance,Badminton,Neerharen,Dominican Republic
1005,Mufutau,Flores,,Qutubullapur,San Marino
1006,Ayanna,Banks,Football,Neerharen,Dominican Republic
1007,Selma,Ball,Tennis,Qutubullapur,San Marino
1008,Berk,Fuller,Badminton,Neerharen,Dominican Republic
1009,Imogene,Terrell,,Qutubullapur,San Marino
1010,Colorado,Hutchinson,Tennis,Qutubullapur,San Marino

We already know that bucketing allows us to cluster datasets into smaller sections for optimization. Let’s now discuss how one completes this process:

Creating the Base Table

First, we’ll create a table called employee_base:

CREATE TABLE db_bdpbase.employee_base (
    empid      INT,
    firstname  STRING,
    lastname   STRING,
    sports     STRING,
    city       STRING,
    country    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");

Our sample data has a header row, which is not needed for bucketing, so we remove it by setting the skip.header.line.count table property.

Loading the Data Into the Base Table

We’ve placed our sample data at ‘/user/bdp/hive/sample_data.csv’ and will use the following command to load it into the table:

LOAD DATA INPATH '/user/bdp/hive/sample_data.csv' INTO TABLE db_bdpbase.employee_base;

Creating the Bucketed Table

In this section, we’ll create a bucketed table. We can create it either with or without a partition. 

Bucketed Table With Partition

In this case, country is the partition column, and we have bucketed the table on the empid column, sorted in ascending order:

CREATE TABLE db_bdpbase.bucketed_partition_tbl (
    empid      INT,
    firstname  STRING,
    lastname   STRING,
    sports     STRING,
    city       STRING
)
PARTITIONED BY (country STRING)
CLUSTERED BY (empid)
SORTED BY (empid ASC) INTO 4 BUCKETS;

Bucketed Table Without Partition

Alternatively, we can create a bucketed table without partition: 

CREATE TABLE db_bdpbase.bucketed_tbl_only (
    empid      INT,
    firstname  STRING,
    lastname   STRING,
    sports     STRING,
    city       STRING,
    country    STRING
)
CLUSTERED BY (empid)
SORTED BY (empid ASC) INTO 4 BUCKETS;

Here, we have bucketed the table on the same column empid. 

Setting the Property

Bucketing enforcement is disabled by default in older Hive versions, so we enable it by setting hive.enforce.bucketing to true. With this property set, Hive chooses the number of reducers to match the number of buckets defined for the table:

SET hive.enforce.bucketing=true; -- not needed in Hive 2.x onward

Loading Data Into the Bucketed Table

So far, we have created two bucketed tables and a base table with our sample data. Now we’ll load the data from the base table into the bucketed table with partition by using the following command:

INSERT OVERWRITE TABLE db_bdpbase.bucketed_partition_tbl PARTITION (country) SELECT * FROM db_bdpbase.employee_base;

To load data into the bucketed table without any partition, we’ll use the following command:

INSERT OVERWRITE TABLE db_bdpbase.bucketed_tbl_only SELECT * FROM db_bdpbase.employee_base;

Checking the Bucketed Table Data

After loading the data into the bucketed table, we will check how it is stored in the HDFS. We’ll use the following code to check the bucketed table with partition:

hadoop fs -ls hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/db_bdpbase.db/bucketed_partition_tbl

Data Storage in Bucketed Tables

Every record gets mapped to a specific bucket according to the following formula:

hash_function(bucket_column) mod num_buckets

Now, consider the first table, which we partitioned based on country. Our sample data gets divided into the following sections:

EMPID | FIRSTNAME | LASTNAME   | SPORTS     | CITY      | COUNTRY
1002  | Zephr     | Stephenson | Cricket    | Neerharen | Dominican Republic
1003  | Autumn    | Bean       | Basketball | Neerharen | Dominican Republic
1004  | Kasimir   | Vance      | Badminton  | Neerharen | Dominican Republic
1006  | Ayanna    | Banks      | Football   | Neerharen | Dominican Republic
1008  | Berk      | Fuller     | Badminton  | Neerharen | Dominican Republic

 

EMPID | FIRSTNAME | LASTNAME   | SPORTS     | CITY         | COUNTRY
1001  | Emerry    | Blair      | Basketball | Qutubullapur | San Marino
1005  | Mufutau   | Flores     |            | Qutubullapur | San Marino
1007  | Selma     | Ball       | Tennis     | Qutubullapur | San Marino
1009  | Imogene   | Terrell    |            | Qutubullapur | San Marino
1010  | Colorado  | Hutchinson | Tennis     | Qutubullapur | San Marino

 

For the Dominican Republic, every row gets stored in one of the four buckets:

hash_function(1002) mod 4 = 2 (the bucket index)
hash_function(1003) mod 4 = 3
hash_function(1004) mod 4 = 0
hash_function(1006) mod 4 = 2
hash_function(1008) mod 4 = 0

Note that for an INT value, the hash_function returns that same value, so an empid always maps to the same bucket. You can check the data in every file at the HDFS location and, if you want, repeat this process for the other countries present in the database. 
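Since the hash of an integer is the value itself, the bucket indices above can be reproduced with ordinary arithmetic (a sketch assuming the 4 buckets from the table definition):

```python
NUM_BUCKETS = 4

# For an INT column, hash_function(value) is simply the value,
# so the bucket index is value mod num_buckets.
empids = [1002, 1003, 1004, 1006, 1008]
indices = {empid: empid % NUM_BUCKETS for empid in empids}
print(indices)  # {1002: 2, 1003: 3, 1004: 0, 1006: 2, 1008: 0}
```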

Bucketing in Hive: Example #2

As we have already covered the steps involved in implementing this function, we can try it out easily. The following is a simple example of bucketing in Hive. Here, we have simply bucketed the available data into different parts so we can manage it more easily: 

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> create table monthly_taxi_fleet6  

. . . . . . . . . . . . . . . . . . . . . . .> (month char(7),fleet smallint,company varchar(50))  

. . . . . . . . . . . . . . . . . . . . . . .> clustered by (company) into 3 buckets  

. . . . . . . . . . . . . . . . . . . . . . .> stored as avro;

This example uses Apache Hive version 1.1.0-cdh5.13.1, where hive.enforce.bucketing is false by default.  

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> insert into monthly_taxi_fleet6  

. . . . . . . . . . . . . . . . . . . . . . .> select month,fleet,company from monthly_taxi_fleet;

[upgrad@cdh-vm ~]$ hdfs dfs -ls -R /user/hive/warehouse/monthly_taxi_fleet6  

-rwxrwxrwt   1 upgrad hive  25483 2017-12-26 10:40 /user/hive/warehouse/monthly_taxi_fleet6/000000_0

-- hive.enforce.bucketing: Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced.
-- Default value: Hive 0.x: false; Hive 1.x: false; Hive 2.x: removed, which effectively makes it always true (HIVE-12331)

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> set hive.enforce.bucketing=true;

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> insert into monthly_taxi_fleet6  

. . . . . . . . . . . . . . . . . . . . . . .> select month,fleet,company from monthly_taxi_fleet;

[upgrad@cdh-vm ~]$ hdfs dfs -ls -R /user/hive/warehouse/monthly_taxi_fleet6  

-rwxrwxrwt   1 upgrad hive  13611 2017-12-26 10:43 /user/hive/warehouse/monthly_taxi_fleet6/000000_0  

-rwxrwxrwt   1 upgrad hive   6077 2017-12-26 10:43 /user/hive/warehouse/monthly_taxi_fleet6/000001_0  

-rwxrwxrwt   1 upgrad hive   6589 2017-12-26 10:43 /user/hive/warehouse/monthly_taxi_fleet6/000002_0

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> describe extended monthly_taxi_fleet6;  

+-----------------------------+----------------------------------------------------+----------+--+  

|      col_name       |                 data_type                  | comment  |  

+-----------------------------+----------------------------------------------------+----------+--+  

| month                   | char(7)                                        |      |  

| fleet                   | int                                            |      |  

| company                 | varchar(50)                                    |      |  

|                         | NULL                                           | NULL |  

| Detailed Table Information  | Table(tableName:monthly_taxi_fleet6, dbName:default, owner:upgrad, createTime:1514256031, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:month, type:char(7), comment:null), FieldSchema(name:fleet, type:smallint, comment:null), FieldSchema(name:company, type:varchar(50), comment:null)], location:hdfs://cdh-vm.dbaglobe.com:8020/user/hive/warehouse/monthly_taxi_fleet6, inputFormat:org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat, compressed:false, numBuckets:3, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.avro.AvroSerDe, parameters:{serialization.format=1}), bucketCols:[company], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{totalSize=26277, numRows=1128, rawDataSize=0, COLUMN_STATS_ACCURATE=true, numFiles=3, transient_lastDdlTime=1514256192}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE) |      |  

+-----------------------------+----------------------------------------------------+----------+--+  

5 rows selected (0.075 seconds)

Check out: Basic Hive Interview Questions

Bucketing in Hive: Example #3

Below is a slightly more advanced example of bucketing in Hive. Here, we have combined bucketing with partitioning and used the SORTED BY clause to make the data more accessible. This is among the biggest advantages of bucketing: you can combine it with other features to manage large datasets more efficiently and effectively. 

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> create table monthly_taxi_fleet7  

. . . . . . . . . . . . . . . . . . . . . . .> (month char(7),fleet smallint)  

. . . . . . . . . . . . . . . . . . . . . . .> partitioned by (company varchar(50))  

. . . . . . . . . . . . . . . . . . . . . . .> clustered by (month) sorted by (month) into 3 buckets

. . . . . . . . . . . . . . . . . . . . . . .> stored as avro;

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> insert into monthly_taxi_fleet7  

. . . . . . . . . . . . . . . . . . . . . . .> partition (company)  

. . . . . . . . . . . . . . . . . . . . . . .> select month,fleet,company from monthly_taxi_fleet;

[upgrad@cdh-vm ~]$ hdfs dfs -ls -R /user/hive/warehouse/monthly_taxi_fleet7  

drwxrwxrwt   - upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=CityCab  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=CityCab/000000_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=CityCab/000001_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=CityCab/000002_0  

drwxrwxrwt   - upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Comfort  

-rwxrwxrwt   1 upgrad hive    913 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Comfort/000000_0  

-rwxrwxrwt   1 upgrad hive    913 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Comfort/000001_0  

-rwxrwxrwt   1 upgrad hive    913 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Comfort/000002_0  

drwxrwxrwt   - upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Individual Yellow- Top  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Individual Yellow- Top/000000_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Individual Yellow- Top/000001_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Individual Yellow- Top/000002_0  

drwxrwxrwt   - upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Premier  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Premier/000000_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Premier/000001_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Premier/000002_0  

drwxrwxrwt   - upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Prime  

-rwxrwxrwt   1 upgrad hive    765 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Prime/000000_0  

-rwxrwxrwt   1 upgrad hive    765 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Prime/000001_0  

-rwxrwxrwt   1 upgrad hive    766 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Prime/000002_0  

drwxrwxrwt   - upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=SMRT  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=SMRT/000000_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=SMRT/000001_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=SMRT/000002_0  

drwxrwxrwt   - upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Smart  

-rwxrwxrwt   1 upgrad hive    720 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Smart/000000_0  

-rwxrwxrwt   1 upgrad hive    719 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Smart/000001_0  

-rwxrwxrwt   1 upgrad hive    719 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Smart/000002_0  

drwxrwxrwt   - upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=TransCab  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=TransCab/000000_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=TransCab/000001_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=TransCab/000002_0  

drwxrwxrwt   - upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=YTC  

-rwxrwxrwt   1 upgrad hive    432 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=YTC/000000_0  

-rwxrwxrwt   1 upgrad hive    432 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=YTC/000001_0  

-rwxrwxrwt   1 upgrad hive    432 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=YTC/000002_0
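The directory layout above (one directory per partition value, three bucket files per directory, rows sorted within each file) can be mimicked in a few lines of Python. The rows and the crc32 stand-in hash are made up for illustration, so the bucket assignments differ from Hive's:

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 3

# (month, fleet, company) rows, made up for illustration.
rows = [
    ("2017-01", 450, "CityCab"), ("2017-02", 470, "CityCab"),
    ("2017-01", 300, "Comfort"), ("2017-03", 310, "Comfort"),
    ("2017-02", 460, "CityCab"),
]

# PARTITIONED BY (company): one directory per company value.
# CLUSTERED BY (month) INTO 3 BUCKETS: hash the month into a bucket file.
# SORTED BY (month): rows within each bucket file stay in month order.
layout = defaultdict(lambda: defaultdict(list))
for month, fleet, company in rows:
    bucket = zlib.crc32(month.encode()) % NUM_BUCKETS  # stand-in hash
    layout[company][bucket].append((month, fleet))

for company, buckets in layout.items():
    for bucket in buckets:
        buckets[bucket].sort()  # SORTED BY (month)
```

Each nested dictionary here corresponds to one partition directory, and each list to one sorted bucket file inside it.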

Learn More About Partitioning and Bucketing in Hive

In the examples we shared before, we performed partitioning and bucketing in Hive in multiple ways and learned about how you can implement them in Hive. However, Apache Hive has many other functionalities and learning about all of them can be quite daunting. 

That’s why we recommend taking a data engineering course. It would allow you to learn from industry experts who have spent years in the field. A course provides you with a structured curriculum where you learn everything step by step. At upGrad, we offer dedicated data engineering courses.

With our courses, you get access to upGrad’s Student Success Corner where you get personalized resume feedback, interview preparation, career counselling, and many other advantages. 

After the course completion, you’ll be a skilled data engineering professional. 

Conclusion

Bucketing in Hive is simple and easy to perform, and it is certainly a useful feature for large datasets. When you combine partitioning and bucketing in Hive, though, you can manage even enormous datasets with ease. 

If you are interested in knowing more about big data, check out our PG Diploma in Software Development Specialization in Big Data program, which is designed for working professionals and provides 7+ case studies & projects, covers 14 programming languages & tools, and includes practical hands-on workshops, more than 400 hours of rigorous learning, and job placement assistance with top firms.

If you have any questions or thoughts regarding bucketing, do share them in the comments below. We’d love to hear from you. 
