
Apache Kafka Tutorial: Introduction, Concepts, Workflow, Tools, Applications

Last updated: 10th Mar, 2020

Introduction 

With the increasing popularity of Kafka as a messaging system, many companies demand professionals with a sound knowledge of Kafka skills, and that’s where an Apache Kafka Tutorial comes in handy. The realm of Big Data involves enormous amounts of data that need a messaging system for collection and analysis.

Kafka is an efficient replacement for the conventional message broker, with improved throughput, inherent partitioning and replication, and built-in fault tolerance, making it suitable for large-scale message processing applications. If you have been looking for an Apache Kafka Tutorial, this is the right article for you.

Key takeaways of this Apache Kafka Tutorial 

  • Concept of messaging systems 
  • A brief introduction to Apache Kafka
  • Concepts related to Kafka cluster and Kafka architecture
  • Brief description of Kafka messaging workflow
  • Overview of important Kafka tools
  • Use cases and applications of Apache Kafka

Also learn about: Apache Spark Streaming Tutorial For Beginners


A brief overview of messaging systems 

The main function of a messaging system is to transfer data from one application to another; the system ensures that the applications can focus on the data itself without being held up by the mechanics of sharing and transmitting it. There are two kinds of messaging systems:

1. Point to point messaging system

In this system, the producers of the messages are called senders, and the ones who consume the messages are receivers. The messages are exchanged via a destination known as a queue: senders produce messages to the queue, and receivers consume messages from the queue. The defining property of this domain is that each message is delivered to, and consumed by, only one receiver.
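
To make these semantics concrete, here is a minimal, hypothetical sketch in plain Java (not Kafka-specific; all names are invented for illustration). A BlockingQueue plays the role of the queue, and two competing receivers each consume a disjoint subset of the messages:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class PointToPointDemo {
    public static void main(String[] args) {
        // The queue is the single destination shared by the sender and all receivers.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);

        // Sender: produces five messages to the queue.
        Thread sender = new Thread(() -> {
            for (int i = 1; i <= 5; i++) {
                queue.offer("message-" + i);
            }
        });

        // Two receivers compete for the same queue; each message is consumed exactly once.
        Runnable receiver = () -> {
            try {
                String msg;
                while ((msg = queue.poll(1, TimeUnit.SECONDS)) != null) {
                    System.out.println(Thread.currentThread().getName() + " consumed " + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        sender.start();
        new Thread(receiver, "receiver-1").start();
        new Thread(receiver, "receiver-2").start();
    }
}
```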


2. Publish-subscribe messaging system

In this system, the producers of the messages are called publishers, and the ones who consume them are subscribers. Here, the messages are exchanged through a destination known as a topic: a publisher produces messages to a topic, and subscribers, having subscribed to that topic, consume messages from it. This system allows broadcasting of messages: a topic can have more than one subscriber, and each subscriber gets its own copy of every message published to that topic.
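
By contrast, here is a minimal, hypothetical sketch of broadcasting, again in plain Java with invented names: the topic keeps one queue per subscriber, so every subscriber receives its own copy of each published message.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class PubSubDemo {
    // A minimal "topic": it keeps one queue per subscriber and copies
    // every published message into each subscriber's queue.
    static class Topic {
        private final List<Queue<String>> subscribers = new CopyOnWriteArrayList<>();

        Queue<String> subscribe() {
            Queue<String> q = new ConcurrentLinkedQueue<>();
            subscribers.add(q);
            return q;
        }

        void publish(String message) {
            for (Queue<String> q : subscribers) {
                q.add(message);  // every subscriber gets its own copy
            }
        }
    }

    public static void main(String[] args) {
        Topic topic = new Topic();
        Queue<String> sub1 = topic.subscribe();
        Queue<String> sub2 = topic.subscribe();

        topic.publish("hello");
        topic.publish("world");

        System.out.println("sub1 got: " + sub1);  // [hello, world]
        System.out.println("sub2 got: " + sub2);  // [hello, world]
    }
}
```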



Apache Kafka – an introduction

Apache Kafka is based on a publish-subscribe (pub-sub) messaging system. In the pub-sub messaging system, publishers are the producers of the messages and subscribers are the consumers; the consumers can consume all the messages of the topic(s) they have subscribed to.

In addition, Apache Kafka uses the concept of distributed messaging, whereby messages are queued asynchronously between the messaging system and the applications. With a robust queue capable of handling a large volume of data, Kafka allows you to transmit messages from one endpoint to another and is suited to both online and offline consumption of messages. Combining reliability, scalability, durability and high-throughput performance, Apache Kafka is ideal for integration and communication between units of large-scale real-world data systems.

Also read: Big Data Project Ideas


Concept of Apache Kafka clusters 


  1. Kafka ZooKeeper: The brokers in a cluster are coordinated and managed by ZooKeeper. ZooKeeper notifies producers and consumers about the presence of a new broker or the failure of a broker in the Kafka system, as well as notifying consumers about offset values. On receiving such a notification, producers and consumers coordinate their work with a different broker.
  2. Kafka broker: Kafka brokers are systems responsible for maintaining the published data in Kafka clusters with the help of ZooKeeper. A broker may host zero or more partitions of each topic.
  3. Kafka producer: The producer publishes messages on one or more Kafka topics and pushes them to brokers, without waiting for broker acknowledgement (see the producer sketch after this list).
  4. Kafka consumer: Consumers pull data from the brokers, consuming already published messages from one or more topics. A consumer issues an asynchronous pull request to the broker to keep a ready-to-consume buffer of bytes, and it can supply an offset value to rewind or skip to any point in a partition.
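
As a concrete illustration of the producer’s role, here is a minimal sketch using Kafka’s standard Java client; the broker address (localhost:9092) and the topic name (demo-topic) are placeholder assumptions, as is the presence of a running broker:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Messages with the same key land in the same partition.
                producer.send(new ProducerRecord<>("demo-topic",
                        "key-" + (i % 3), "message-" + i));
            }
        } // close() flushes any pending records
    }
}
```

Note that send() is asynchronous: it hands the record to the client’s internal buffer and returns immediately, which is why the producer does not wait for broker acknowledgement unless explicitly configured to.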


Fundamental concepts of Kafka architecture 

  1. Topics: A topic is a logical channel to which producers publish messages and from which consumers receive them. Topics can be replicated (copied) as well as partitioned (divided). A particular kind of message is published on a specific topic, and each topic is identifiable by its unique name.
  2. Topic partitions: In the Kafka cluster, topics are divided into partitions and replicated across brokers (see the topic-creation sketch after this list). A producer can add a key to a published message, and messages with the same key end up in the same partition. An incremental ID called an offset is assigned to each message in a partition; these IDs are valid only within the partition and have no meaning across partitions of a topic.
  3. Leader and replica: Every Kafka broker hosts a few partitions, each of which is either the leader or a replica (backup) of that topic partition. The leader is responsible not only for reads and writes to the topic but also for updating the replicas with new data. If the leader fails, a replica can take over as the new leader.
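
To tie these concepts together, here is a short sketch using Kafka’s Java AdminClient to create a topic with three partitions and a replication factor of two; the topic name and broker address are placeholders, and a replication factor of two assumes the cluster has at least two brokers:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread the topic's messages across brokers;
            // replication factor 2 keeps one leader and one replica per partition.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```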

Architecture of Apache Kafka 



A Kafka system with more than one broker is called a Kafka cluster. Four of Kafka’s core APIs will be discussed in this Apache Kafka Tutorial:

  1. Producer API: The producer API allows an application to publish a stream of records to one or more Kafka topics.
  2. Consumer API: The consumer API allows an application to subscribe to one or more topics and process the continuous flow of records produced to them.
  3. Streams API: The streams API allows an application to consume an input stream from one or more topics and generate an output stream to one or more output topics, thus acting as a stream processor that transforms input streams into output streams (see the sketch after this list).
  4. Connector API: The connector API allows the creation and running of reusable producers and consumers, enabling a connection between Kafka topics and existing data systems or applications.
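
As a sketch of the Streams API (the topic names and application ID are placeholder assumptions): the application consumes records from an input topic, upper-cases each value, and produces the results to an output topic.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume from the input topic, transform each value, produce to the output topic.
        builder.stream("input-topic")
               .mapValues(value -> value.toString().toUpperCase())
               .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```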

Workflow of the publisher-subscriber messaging domain

  1. Kafka producers send messages to a topic at regular intervals.
  2. Kafka brokers store all messages in the partitions configured for that particular topic, ensuring equal distribution of messages between partitions.
  3. Kafka consumers subscribe to a specific topic. Once a consumer has subscribed, the current offset of the topic is provided to the consumer, and the offset is saved in the ZooKeeper ensemble.
  4. The consumer requests new messages from Kafka at regular intervals (see the consumer sketch after this list).
  5. Kafka forwards messages to consumers immediately on receipt from producers.
  6. The consumer receives the message and processes it.
  7. The Kafka broker receives an acknowledgement as soon as the message is processed.
  8. On receipt of the acknowledgement, the offset is updated to the new value.
  9. The flow repeats until the consumer stops the requests.
  10. The consumer can rewind or skip to an offset at any time and read subsequent messages as convenient.
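
A consumer-side sketch of this workflow using the standard Java client (the broker address, group ID and topic name are placeholders): the consumer subscribes, polls for new messages at regular intervals, processes them, and commits the offset so that the broker records how far it has read.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("group.id", "demo-group");                // placeholder
        props.put("enable.auto.commit", "false");           // commit manually after processing
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("demo-topic"));
            while (true) {
                // Poll the broker for new messages at regular intervals.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync();  // acknowledge: advance the committed offset
            }
        }
    }
}
```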

Workflow of the queue messaging system

In a queue messaging system, several consumers with the same group ID can subscribe to a topic. They are considered a single group and share the messages. The workflow of the system is:

  1. Kafka producers send messages to a topic at regular intervals.
  2. Kafka brokers store all messages in the partitions configured for that particular topic, ensuring equal distribution of messages between partitions.
  3. A single consumer subscribes to a specific topic.
  4. Kafka interacts with that single consumer until a new consumer subscribes to the same topic.
  5. When new consumers arrive, the data is shared between the consumers. The sharing is repeated until the number of consumers equals the number of partitions configured for that topic.
  6. Once the number of consumers exceeds the number of configured partitions, a new consumer will not receive further messages. This arises because each partition is assigned to at most one consumer in the group; once every partition is taken, additional consumers must wait until one leaves (see the group configuration sketch after this list).
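
In code, queue-style sharing comes down to giving every consumer the same group.id. A minimal configuration sketch (all names are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerConfig {
    // Builds a consumer that joins the shared group "order-processors" (placeholder).
    // Every instance created with this configuration shares the topic's partitions:
    // each partition is assigned to exactly one member, so each message is processed
    // by only one consumer in the group, and members beyond the partition count idle.
    static KafkaConsumer<String, String> newGroupMember() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("group.id", "order-processors");         // same ID => one shared group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return new KafkaConsumer<>(props);
    }
}
```

Every KafkaConsumer built this way joins the same group, and Kafka assigns each partition of the subscribed topic to at most one member, which is exactly the sharing behaviour described above.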


2 important tools in Apache Kafka 

Next, in this Apache Kafka Tutorial, we will discuss the Kafka tools packaged under "org.apache.kafka.tools.*".

1. Replication Tools

Replication tools are high-level design tools that provide higher availability and stronger durability.

  • Create Topic tool: This tool creates a topic with a default number of partitions and a replication factor, using Kafka’s default scheme for replica assignment.
  • List Topic tool: This tool lists the information for a given list of topics, displaying fields such as topic name, partition, leader, replicas and ISR (in-sync replicas).
  • Add Partition tool: This tool adds more partitions to a particular topic and also performs manual assignment of replicas for the added partitions.

2. System tools

The run class script can be used to run system tools in Kafka; the general syntax is bin/kafka-run-class.sh <full.class.name> <options>. Important system tools include:

  • Mirror Maker: This tool mirrors one Kafka cluster to another.
  • Kafka Migration tool: This tool helps migrate a Kafka broker from one version to another.
  • Consumer Offset Checker: This tool displays the Kafka topic, log size, offset, partitions, consumer group and owner for a particular set of topics.

Also Read: Apache Pig Tutorial

Top 4 use cases of Apache Kafka 

Let us discuss some important use cases of Apache Kafka in this Apache Kafka Tutorial:

  1. Stream processing: Kafka’s strong durability makes it well suited to stream processing, where data is read from a topic, processed, and the processed data written to a new topic, where it becomes available for applications and users.
  2. Metrics: Kafka is frequently used for operational monitoring data; statistics are aggregated from distributed applications to produce a centralised feed of operational data.
  3. Tracking website activity: Kafka is employed for tracking activities on websites, with the tracked data often fed into data warehouses such as Google BigQuery. Site activities like searches, page views and other user actions are published to central topics and made available for real-time processing, offline analysis and dashboards.
  4. Log aggregation: Using Kafka, logs can be collected from many services and made available in a standardised format to many consumers.

 

Top 5 Applications of Apache Kafka 


Some of the best industrial applications supported by Kafka include:

  1. Uber: The cab-hailing app requires immense real-time processing and handles huge data volumes. Important processes like auditing, ETA calculation and matching drivers with customers are modelled on Kafka Streams.
  2. Netflix: The on-demand internet streaming platform uses Kafka for event processing and real-time monitoring of metrics.
  3. LinkedIn: LinkedIn manages 7 trillion messages every day, with 100,000 topics, 7 million partitions and over 4,000 brokers. Apache Kafka is used at LinkedIn for user activity tracking and operational monitoring.
  4. Tinder: This popular dating app uses Kafka Streams for several processes, including content moderation, recommendations, updating the user time zone, notifications and user activation.
  5. Pinterest: With billions of pins and ideas searched monthly, Pinterest has leveraged Kafka for many processes. Kafka Streams is utilised for content indexing, spam detection, recommendations and calculating real-time ad budgets.

Conclusion

In this Apache Kafka Tutorial, we have discussed the fundamental concepts of Apache Kafka, its architecture and clusters, the Kafka workflow, Kafka tools and some applications of Kafka. Apache Kafka offers durability, scalability, fault tolerance, reliability, extensibility, replication and high throughput, which is why it powers some of the best industrial applications, as exemplified in this Apache Kafka Tutorial.

If you are interested in knowing more about Big Data, check out our Advanced Certificate Programme in Big Data from IIIT Bangalore.

Learn Software Development Courses online from the World’s top Universities. Earn Executive PG Programs, Advanced Certificate Programs or Masters Programs to fast-track your career.

Utkarsh Singh
Blog Author
Frequently Asked Questions (FAQs)

1. What exactly is Kafka?

Kafka is an open-source distributed messaging system with durable, comprehensive storage; it retains messages on disk and keeps track of their order and timing. Kafka eliminates slow data transmission between a sender and a receiver, and its operations are robust enough that it does not lose messages in the long run. Another reason to use it is its compatibility, which has made it acceptable worldwide. Some businesses use Kafka to check large amounts of data regularly: professional social networks like LinkedIn monitor data and operational metrics with it, and Twitter uses it as part of its stream-processing infrastructure.

2. What is the concept of Apache Kafka, and what is its workflow?

Kafka's workflow includes producers sending messages at regular intervals, with the flow repeating until the consumer stops requesting. Kafka brokers ensure that messages are distributed evenly by storing them in partitions dedicated to a specific topic. Several components make up the Kafka concept: ZooKeeper notifies producers and consumers when a new broker joins or an existing broker in the Kafka system fails, and it assists the brokers in the upkeep of published data. Consumers use the partition offset to keep track of how many messages they have consumed.

3. What are the Kafka tools, and what are the various Kafka applications?

There are two types of Kafka tools: system tools and replication tools. System tools are run from the command line using the run class script and include the Kafka Migration Tool, Mirror Maker and Consumer Offset Checker. Replication tools, on the other hand, are high-level design tools and include the Create Topic, List Topic and Add Partition tools. Kafka is used by applications such as Twitter, which provides a platform for users to send and receive tweets; Netflix, the on-demand streaming platform, which uses it for real-time monitoring; and LinkedIn, which streams and monitors data using Kafka.
