Everything You Need to Know about Apache Storm

Last updated:
19th Feb, 2018

The ever-increasing growth in the production and analytics of Big Data keeps presenting new challenges, and data scientists and programmers take them in their stride by constantly improving the applications they develop. One such challenge was real-time streaming. Real-time data holds extremely high value for businesses, but only within a time window, after which it loses that value: an expiry date, if you will. If the value of this data is not realised within the window, little usable information can be extracted from it. Real-time data comes in quickly and continuously, hence the term “streaming”.

Analytics on real-time data can keep you updated on what’s happening right now, such as the number of people reading your blog post or visiting your Facebook page. Although it might sound like just a “nice-to-have” feature, in practice it is essential. Imagine you’re part of an ad agency performing real-time analytics on ad campaigns that the client paid heavily for. Real-time analytics can keep you posted on how your ad is performing in the market, how users are responding to it, and other things of that nature. Quite an essential tool if you think of it this way, right?

Seeing the value that real-time data holds, organisations started coming up with various real-time data analytics tools. In this article, we’ll talk about one of those: Apache Storm. We’ll look at what it is, the architecture of a typical Storm application, its core components (also known as abstractions), and its real-life use cases.

Let’s go!

What is Apache Storm?

Apache Storm, open-sourced by Twitter, is a distributed open-source framework for processing data in real time. Storm does for real-time data what Hadoop does for batch processing. (Batch processing is the opposite of real-time processing: data is divided into batches, and each batch is processed as a unit some time after the data has arrived.)
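
To make the batch versus real-time distinction concrete, here is a minimal sketch in plain Python. It uses neither Storm nor Hadoop; the function names and the sample readings are made up purely for illustration.

```python
from typing import Iterable, Iterator, List


def process_batch(events: List[int]) -> int:
    """Batch style: wait until the whole batch is collected, then process it once."""
    return sum(events)


def process_stream(events: Iterable[int]) -> Iterator[int]:
    """Streaming style: update the result as each event arrives."""
    running_total = 0
    for event in events:
        running_total += event
        yield running_total  # a fresh result is available after every event


if __name__ == "__main__":
    readings = [3, 1, 4, 1, 5]  # hypothetical page views per second
    print("batch result:", process_batch(readings))
    for total in process_stream(readings):
        print("running total:", total)
```

The batch function produces one answer at the end, while the streaming function produces an up-to-date answer after every event, which is the property real-time analytics depends on.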

 

Storm’s daemons do not manage state themselves. Instead, Storm relies heavily on Apache ZooKeeper (a centralised service for coordination and configuration management in distributed applications) to hold its cluster state: things like task assignments, heartbeats, and processing statuses. Storm applications are designed as directed acyclic graphs. Storm has been benchmarked at over one million tuples processed per second per node; it is also highly scalable and guarantees that your data will be processed. Storm was written largely in Clojure, a Lisp-like, functional-first programming language.
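
If the idea of an application as a directed acyclic graph feels abstract, the sketch below may help. It is plain Python built on the standard-library `graphlib` module (Python 3.9+), not Storm code, and the component names are hypothetical. Each component lists the components it receives data from, and the helper functions check that the graph has no cycles and return an order in which data can flow.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical topology: each key maps to the set of components that feed it.
topology = {
    "sentence-spout": set(),
    "split-bolt": {"sentence-spout"},
    "count-bolt": {"split-bolt"},
    "report-bolt": {"count-bolt"},
}


def processing_order(graph):
    """Return the components so that every producer comes before its consumers."""
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(graph):
    """True if the graph is a valid DAG, False if it contains a cycle."""
    try:
        processing_order(graph)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(processing_order(topology))
    print(is_acyclic({"a": {"b"}, "b": {"a"}}))  # two components feeding each other
```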

At the heart of Apache Storm is a Thrift definition for defining and submitting logic graphs (known as topologies). Since Thrift can be used from any language, topologies can be defined and submitted from any language too. This lets Storm support a multitude of languages, making it all the more developer-friendly.

Storm can run on YARN and integrates well with the Hadoop ecosystem. It is a true real-time data processing framework with no batch mode in its core API: it processes each incoming event as it arrives instead of breaking the stream into a series of small batches. Hence, it is best suited to data that must be handled one event at a time.

Let’s have a look at the general architecture of a Storm application. It’ll give you more insight into how Storm works!

Apache Storm: General Architecture and Important Components

There are essentially two types of nodes involved in any Storm application.

  • Master Node (Nimbus Service)

If you’re aware of the inner workings of Hadoop, you must know what a JobTracker is. It’s a daemon that runs on the master node of Hadoop and is responsible for distributing tasks among nodes. Nimbus is a similar kind of service for Storm. It runs on the master node of a Storm cluster and is responsible for distributing tasks among the worker nodes.

Nimbus is an Apache Thrift service, which allows you to submit your code in the programming language of your choice. This helps you write your application without having to learn a new language specifically for Storm.

As we discussed earlier, Storm does not manage cluster state itself. The Nimbus service relies on ZooKeeper to monitor the status reported by the worker nodes while they process their tasks. All the worker nodes record their task status in ZooKeeper for Nimbus to see and monitor.

  • Worker Node (Supervisor Service)

These are the nodes responsible for performing the tasks. Worker nodes in Storm run a service called Supervisor. The Supervisor is responsible for receiving the work assigned to its machine by the Nimbus service. As the name suggests, the Supervisor supervises the worker processes and helps them complete their assigned tasks. Each of these worker processes executes a subset of the complete topology.
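
The division of labour between the two node types can be sketched as a toy model in plain Python. This is an illustration only, not how Storm is implemented: the class names mirror the services described above, but `CoordinationStore` is a hypothetical in-memory stand-in for ZooKeeper, and the round-robin assignment is an assumption made to keep the example short.

```python
from collections import defaultdict
from typing import Dict, List


class CoordinationStore:
    """Stand-in for ZooKeeper: a shared place where cluster state is recorded."""

    def __init__(self) -> None:
        self.assignments: Dict[str, List[str]] = defaultdict(list)
        self.status: Dict[str, str] = {}


class Nimbus:
    """Master service: decides which worker node runs which task."""

    def __init__(self, store: CoordinationStore) -> None:
        self.store = store

    def distribute(self, tasks: List[str], workers: List[str]) -> None:
        # Round-robin is an assumption of this sketch, not Storm's scheduler.
        for index, task in enumerate(tasks):
            worker = workers[index % len(workers)]
            self.store.assignments[worker].append(task)
            self.store.status[task] = "assigned"

    def pending(self) -> List[str]:
        # Nimbus learns about progress only by reading the shared store.
        return [task for task, state in self.store.status.items() if state != "done"]


class Supervisor:
    """Worker-node service: picks up its assignments and reports task status."""

    def __init__(self, name: str, store: CoordinationStore) -> None:
        self.name = name
        self.store = store

    def run_assigned(self) -> List[str]:
        finished = []
        for task in self.store.assignments[self.name]:
            self.store.status[task] = "done"  # the status update goes to the store
            finished.append(task)
        return finished


if __name__ == "__main__":
    store = CoordinationStore()
    nimbus = Nimbus(store)
    nimbus.distribute(["t1", "t2", "t3"], ["node-a", "node-b"])
    Supervisor("node-a", store).run_assigned()
    print("still pending:", nimbus.pending())
```

Note that Nimbus and the Supervisors never talk to each other directly in this sketch. All state lives in the shared store, which is why the services themselves can stay stateless.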

A Storm application has essentially four components/abstractions that are responsible for performing the tasks at hand. These are:

  • Topology

    The logic for any real-time application is packaged in the form of a topology, which is essentially a network of spouts and bolts. To understand it better, you can compare it to a MapReduce job (read our article on MapReduce if you’re unaware of what that is!). One key difference is that a MapReduce job finishes when its execution is complete, whereas a Storm topology runs forever (unless you explicitly kill it). The network consists of nodes that hold the processing logic, and links (the streams) that show how data is passed between them.

  • Stream

    You need to understand what tuples are before you can understand streams. Tuples are the main data structure in a Storm cluster. They are named lists of values, where the values can be anything from integers, longs, shorts, bytes, doubles, floats, strings, and booleans to byte arrays. Streams, in turn, are sequences of tuples that are created and processed in real time in a distributed environment. They form the core abstraction of a Storm cluster.

  • Spout

    A spout is the source of streams in a Storm topology. It is responsible for connecting to the actual data source, receiving data continuously, transforming that data into a stream of tuples, and finally sending them to the bolts to be processed. A spout can be either reliable or unreliable. A reliable spout will replay a tuple if Storm failed to process it; an unreliable spout, on the other hand, forgets about a tuple as soon as it has emitted it.

  • Bolt

    Bolts are responsible for performing all the processing in a topology. They form the processing logic unit of a Storm application. A bolt can perform many essential operations, such as filtering, functions, joins, aggregations, connecting to databases, and many more.
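
The four abstractions above can be tied together in a small word-count sketch. This is a toy model in plain Python, not Storm’s API: the class and function names are hypothetical (the method names `next_tuple`, `ack`, `fail`, and `execute` are only loosely modelled on Storm’s spout and bolt interfaces), tuples are represented as dictionaries of named values, and the topology drains a finite list instead of running forever.

```python
from collections import Counter, deque


class SentenceSpout:
    """Toy reliable spout: remembers each emitted tuple until it is acknowledged."""

    def __init__(self, sentences):
        # Each tuple is a named list of values, modelled here as a dict.
        self.queue = deque({"id": i, "sentence": s} for i, s in enumerate(sentences))
        self.pending = {}

    def next_tuple(self):
        if not self.queue:
            return None
        tup = self.queue.popleft()
        self.pending[tup["id"]] = tup  # kept until ack() or fail() is called
        return tup

    def ack(self, tuple_id):
        self.pending.pop(tuple_id, None)

    def fail(self, tuple_id):
        # Replay: put the failed tuple back so that it is emitted again.
        self.queue.append(self.pending.pop(tuple_id))


def split_bolt(tup):
    """Bolt 1: turn one sentence tuple into a stream of word tuples."""
    for word in tup["sentence"].lower().split():
        yield {"word": word}


class CountBolt:
    """Bolt 2: aggregate word tuples into running counts."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        self.counts[tup["word"]] += 1


def run_topology(spout, counter, fail_once=frozenset()):
    """Wire spout -> split_bolt -> CountBolt and drain the spout.

    Tuple ids listed in fail_once fail on their first attempt, to show the replay.
    """
    already_failed = set()
    while (tup := spout.next_tuple()) is not None:
        if tup["id"] in fail_once and tup["id"] not in already_failed:
            already_failed.add(tup["id"])
            spout.fail(tup["id"])  # simulated processing failure
            continue
        for word_tuple in split_bolt(tup):
            counter.execute(word_tuple)
        spout.ack(tup["id"])
    return dict(counter.counts)


if __name__ == "__main__":
    spout = SentenceSpout(["the cat", "the dog"])
    print(run_topology(spout, CountBolt(), fail_once={0}))
```

Because the spout keeps every tuple in `pending` until it is acknowledged, the sentence that fails on its first attempt is still counted after the replay. An unreliable spout would simply skip the `pending` bookkeeping, and the failed sentence would be lost.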

Who Uses Storm?

Although a number of powerful and easy-to-use tools are available in the Big Data market, Storm holds a unique place among them because of its ability to work with any programming language you throw at it. Many organisations put Storm to use.

Let’s look at a couple of big players that use Apache Storm and how!

  • Twitter

Twitter uses Storm to power a variety of its systems, from feed personalisation and revenue optimisation to improving search results and other such processes. Because Storm was developed at Twitter (and later donated to the Apache Software Foundation, where it became Apache Storm), it integrates seamlessly with the rest of Twitter’s infrastructure: the database systems (Cassandra, Memcached, etc.), the messaging infrastructure, Mesos, and the monitoring systems.

  • Spotify

Spotify is known for streaming music to over 50 million active users and 10 million subscribers. It provides a wide range of real-time features like music recommendation, monitoring, analytics, ad targeting, and playlist creation. To achieve this feat, Spotify utilises Apache Storm.
Stacked with Kafka, Memcached, and a netty-zmtp based messaging environment, Apache Storm enables Spotify to build low-latency, fault-tolerant distributed systems easily.

To Wrap Up…

If you wish to establish your career as a Big Data analyst, streaming is the way to go. If you’re able to master the art of dealing with real-time data, you’ll be the number one preference for companies hiring for an analyst role. There couldn’t be a better time to dive into real-time data analytics because that is the need of the hour in the truest sense!

If you are interested in learning more about Big Data, check out our Advanced Certificate Programme in Big Data from IIIT Bangalore.

Mohit Soni

Blog Author
Mohit Soni is working as the Program Manager for the BITS Pilani Big Data Engineering Program. He has been working with the Big Data Industry and BITS Pilani for the creation of this program. He is also an alumnus of IIT Delhi.

Frequently Asked Questions (FAQs)

1. How does Apache ZooKeeper assist in Apache Storm operations?

Storm uses ZooKeeper primarily to coordinate the cluster. ZooKeeper is not used for message passing, so the load Storm places on it is quite low. A single-node ZooKeeper is sufficient in most cases. However, when large Storm clusters are deployed, or when failover is needed, a larger ZooKeeper cluster may be required. There are a couple of things to keep in mind when deploying it. First, ZooKeeper should always be run under supervision, because it is fail-fast and will exit the process as soon as it encounters an error. Second, setting up a cron job to compact ZooKeeper’s data and transaction logs is essential. Without it, ZooKeeper will quickly run out of disk space.

2. How can Apache Storm be used for financial services?

Apache Storm can be an effective tool for preventing securities fraud. With real-time anomaly detection, it becomes easier to detect activities and patterns that could indicate fraud. Apache Storm can also help with order routing, the process through which an end user’s order reaches an exchange. Pricing and compliance violations are other areas of financial services where Storm could prove beneficial.

3. Where can Apache Storm be used effectively?

Apache Storm is used for stream processing, that is, processing messages in real time and updating databases as the data arrives. The processing has to keep up with the rate at which data is fed in. Storm is also used for distributed RPC, where an expensive computation is parallelised on the fly, and for continuous computation, where a query runs continuously over data streams and the results are streamed to clients as they are produced. Finally, Storm is used for real-time analytics, where results must be available quickly enough to respond to events as they happen.
