In my previous article, we discussed what exactly is Big Data – and why it’s such a big deal.
We also saw how any domain or industry (you just name it!) could improve its operations by putting Big Data to good use. Organisations are realising this and are trying to onboard the right set of people, arm them with the correct set of tools and technologies, and make sense of their Big Data.
As more and more organisations wake up to this fact, the Data Science market is growing rapidly alongside them. Everyone wants a piece of this pie, which has resulted in massive growth in big data tools and technologies.
In this article, we’ll talk about the right tools and technologies you should have in your toolkit as you jump on the big data bandwagon. Familiarity with these tools will also help you in any upcoming interviews you might face.
You can’t possibly talk about Big Data without mentioning the elephant in the room (pun intended!): Hadoop. The name is often expanded as “High-availability distributed object-oriented platform”, but that is a backronym; its creator, Doug Cutting, named it after his son’s toy elephant. Hadoop is essentially an open-source framework for storing and processing large datasets across clusters of machines, with fault tolerance built in. Over the years, the name has come to cover an entire ecosystem of related tools. Not only that, many commercial Big Data solutions are based on Hadoop.
A typical Hadoop platform stack consists of HDFS, MapReduce, Hive, HBase, and Pig.
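HDFS stands for Hadoop Distributed File System. It can be thought of as the file storage system for Hadoop. HDFS splits large datasets into blocks, distributes those blocks across the machines of a cluster, and keeps replicated copies so that the data survives the failure of a node.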
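To make the idea of distribution concrete, here is a minimal sketch of how a file might be cut into blocks and each block copied to several nodes. The function `place_blocks` and the node names are hypothetical, and the round-robin placement is a deliberate simplification (real HDFS placement is rack-aware). The 128 MB block size and replication factor of 3 mirror common Hadoop defaults.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Toy model: split a file into blocks and pick nodes for each replica.

    Placement is simple round-robin, which is an assumption made for
    illustration; it is not how HDFS actually chooses nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Each replica of a block lands on a different node.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


MB = 1024 * 1024
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
layout = place_blocks(300 * MB, 128 * MB, nodes)

for block_id, replicas in layout.items():
    print(f"block {block_id} -> {replicas}")
```

A 300 MB file needs three blocks here, and losing any single node still leaves two copies of every block.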
MapReduce allows massive datasets to be processed rapidly in parallel. It follows a simple idea: to deal with a lot of data in very little time, simply employ more workers for the job. A typical MapReduce job is processed in two phases, Map and Reduce. In the “Map” phase, the nodes of a Hadoop cluster each process their share of the input and emit intermediate key-value pairs; in the “Reduce” phase, those intermediate results are grouped by key and aggregated into the final output. MapReduce takes care of scheduling jobs, monitoring them, and re-executing failed tasks.
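The classic illustration of the two phases is a word count. The sketch below runs in a single Python process, so it only imitates the programming model, not the distribution across a cluster; the function names and sample chunks are made up for this example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all intermediate values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # In a real cluster, each chunk would be mapped on a different node.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)
```

Because each call to `map_phase` depends only on its own chunk, the map work can be handed to as many workers as there are chunks.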
Hive is a data warehousing tool that converts SQL-like queries into MapReduce jobs. It was initiated by Facebook. The best part about using Hive is that developers can use their existing SQL knowledge, since Hive uses HQL (Hive Query Language, also called HiveQL), whose syntax is similar to classic SQL.
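To give a feel for how familiar such a query looks, here is a small aggregation written in plain SQL of the kind HQL also accepts. This sketch does not talk to Hive at all: Python's built-in `sqlite3` module stands in as the engine, and the `page_views` table and its rows are invented for the example.

```python
import sqlite3

# sqlite3 is only a stand-in engine here; Hive would compile a query
# like this one into jobs that run on the cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "IN"), ("u2", "IN"), ("u3", "US")],  # made-up sample rows
)

query = """
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country
ORDER BY views DESC
"""

rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The point is the query text itself: anyone who can write this `GROUP BY` already knows most of what they need to start querying with Hive.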
HBase is a column-oriented, non-relational database that runs on top of Hadoop and deals with sparse, loosely structured data in real time. HBase has no native SQL interface, since it isn’t a relational store; Java is the preferred language for working with it. HBase is extremely efficient at reading and writing large datasets in real time.
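A rough way to picture a column-oriented store of this kind is a sorted map from row key to a set of `family:qualifier` columns, where every row may carry different columns. The class below is a toy model written for this article, not the HBase client API; all names in it are hypothetical.

```python
from collections import defaultdict


class ToyWideColumnTable:
    """Toy model of a wide-column table: row key -> {"family:qualifier": value}."""

    def __init__(self, column_families):
        # Column families are fixed up front; qualifiers are free-form.
        self.column_families = set(column_families)
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        family, _, _qualifier = column.partition(":")
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key):
        return dict(self.rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield rows with start <= key < stop, in sorted key order."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, dict(self.rows[key])


table = ToyWideColumnTable(["info", "stats"])
table.put("user#001", "info:name", "Asha")
table.put("user#001", "stats:logins", 42)
table.put("user#002", "info:name", "Ravi")
table.put("user#002", "info:city", "Pune")  # a column user#001 never had

print(table.get("user#001"))
print(list(table.scan("user#001", "user#002")))
```

Lookups go through the row key rather than through a query language, which is why key design matters so much in stores like this.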
Pig is a high-level platform for writing data-processing programs in a procedural language called Pig Latin. It was initiated by Yahoo! and became open source in 2007. As strange as it may sound, it’s called Pig because it can handle any type of data you throw at it!
Apache Spark deserves a special mention on this list as it is one of the fastest engines for Big Data processing. It’s put to use by major players including Amazon, Yahoo!, eBay, and Flipkart. Take a look at all the organisations that are powered by Spark, and you will be blown away!
Spark has in many ways overtaken Hadoop’s MapReduce: by the project’s own benchmarks, it lets you run programs up to a hundred times faster in memory, and ten times faster on disk.
It complements the intentions with which Hadoop was introduced. When dealing with large datasets, one of the major concerns is processing speed, so there was a need to reduce the waiting time between the execution of each query. Spark does exactly that by keeping intermediate results in memory, and it comes with built-in modules for streaming, graph processing, machine learning, and SQL support. It also supports the most common programming languages: Java, Python, and Scala.
The main motive behind introducing Spark was to speed up the computational processes of Hadoop. However, it should not be seen as an extension of the latter. Spark can use Hadoop in two ways, for storage and for processing, but since it has its own cluster management, it typically relies on Hadoop for storage only. Other than that, it’s a pretty standalone tool.
Traditional databases (RDBMS) store information in a structured way by defining rows and columns. That is possible there because the data being stored is itself structured. But when we talk about dealing with Big Data, we’re talking about largely unstructured or semi-structured datasets. For such datasets, querying using SQL won’t work well, because the S (Structured) doesn’t apply here. So, to deal with that, we have NoSQL databases.
NoSQL databases are built to specialise in storing unstructured data and to provide quick data retrieval. However, many of them relax the strict consistency guarantees of traditional databases in exchange for scalability. You can’t blame them for that, blame the data!
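As a small illustration of what storing data without a fixed schema means, the sketch below keeps records of different shapes side by side and queries them by whatever fields happen to exist. It is a plain-Python toy, not the API of any particular NoSQL product; `find`, `find_with_field`, and the sample documents are invented for the example.

```python
# Three "documents" in one collection, each with a different shape.
documents = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ravi", "tags": ["admin", "beta"]},
    {"_id": 3, "name": "Mei", "address": {"city": "Pune"}},
]


def find(collection, **criteria):
    """Return documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def find_with_field(collection, field):
    """Return documents that carry the given field at all."""
    return [doc for doc in collection if field in doc]


print(find(documents, name="Ravi"))
print(find_with_field(documents, "address"))
```

In a relational table, the `tags` and `address` fields would each need a column (or a separate table) defined in advance; here a record simply carries the fields it has.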
The most popular NoSQL databases include MongoDB, Cassandra, Redis, and Couchbase. Even Oracle and IBM, the leading RDBMS vendors, now offer NoSQL databases after seeing the rapid growth in their usage.
Data Lakes have seen a continuous rise in their usage over the past couple of years. However, a lot of people still think Data Lakes are just Data Warehouses revisited, and that’s not true. The only similarity between the two is that they’re both data storage repositories. Frankly, that’s it.
A Data Lake can be defined as a storage repository which holds a huge amount of raw data from a variety of sources, in a variety of formats, until it is needed. You may be aware that data warehouses store data in a hierarchical structure of files and folders, but that’s not the case with Data Lakes. Data Lakes use a flat architecture to save the datasets.
Many enterprises are switching to Data Lakes to simplify the process of accessing their Big Data. A Data Lake stores the collected data in its natural state, unlike a data warehouse, which processes the data before storing it. That’s why the “lake” and “warehouse” metaphor is apt. If you see data as water, a data lake can be thought of as a lake, storing water unfiltered and in its natural form, and a data warehouse can be thought of as water stored in bottles and kept on the shelf.
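The difference between storing data raw and processing it before storing can be sketched in a few lines. In this toy example the "lake" keeps every incoming record exactly as it arrived, while the "warehouse" only accepts records that fit a fixed shape. The event strings, the `to_warehouse_row` function, and the two-field schema are all assumptions made for illustration.

```python
import json

# Incoming records in mixed formats (made-up sample data).
raw_events = [
    '{"user": "u1", "amount": "19.99"}',
    "user=u2;amount=5",          # not JSON at all
    '{"user": "u3"}',            # JSON, but missing a field
]

# Data lake: keep everything in its natural state, interpret it later.
lake = list(raw_events)


def to_warehouse_row(raw):
    """Process a record into the fixed shape the warehouse expects."""
    record = json.loads(raw)
    return {"user": record["user"], "amount": float(record["amount"])}


# Data warehouse: only cleaned, conforming rows get stored.
warehouse, rejected = [], []
for raw in raw_events:
    try:
        warehouse.append(to_warehouse_row(raw))
    except (ValueError, KeyError):
        rejected.append(raw)

print(len(lake), "records in the lake")
print(warehouse)
print(len(rejected), "records did not fit the warehouse schema")
```

The lake loses nothing but defers the cleaning work to whoever reads the data; the warehouse is tidy but only holds what fit its schema at load time.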
In any computer system, RAM (Random Access Memory) is far faster to read and write than disk, which is why it speeds up processing. In-memory databases were developed using a similar philosophy: they keep the data in main memory, instead of fetching it from disk on every access. What that essentially means is that if you store data in memory, it’ll cut down the processing time by quite a margin. Data retrieval won’t be a pain anymore, as all the data will already be in memory.
But practically, if you’re handling a really large dataset, it’s not possible to get it all in memory. However, you can keep a part of it in memory, process it, and then bring in another part for further processing. To help with that, the Hadoop ecosystem provides several tools that combine on-disk and in-memory storage to speed up the processing.
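The part-by-part approach described above can be sketched as a chunked computation: only one slice of the data is held in memory at a time, and a small running summary is carried between slices. The helper names `in_chunks` and `chunked_mean` are made up for this example, and a `range` stands in for a dataset that would really be streamed from disk.

```python
from itertools import islice


def in_chunks(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, chunk_size):
    """Compute a mean while holding only one chunk in memory at a time."""
    total, count = 0, 0
    for chunk in in_chunks(values, chunk_size):
        total += sum(chunk)      # process the part that is in memory
        count += len(chunk)      # then let it go and fetch the next part
    return total / count if count else None


readings = range(1, 1_000_001)   # stand-in for a dataset too big to load at once
mean = chunked_mean(readings, chunk_size=10_000)
print(mean)
```

This works because a mean can be rebuilt from a running total and a count; computations that need to see all the data at once (a median, for instance) need a different strategy.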
The list provided in this article is by no means a “comprehensive list of Big Data tools and technologies”. Instead, it focuses on the “must-know” Big Data tools and technologies. The field of Big Data is constantly evolving, and new technologies are replacing older ones very quickly. There are many more technologies beyond the Hadoop-Spark stack, like Finch, Kafka, NiFi, Samza, and more. Each of these has its specific use cases, but before you get working on any of them, it’s important to be aware of the ones we mentioned in this article.