Apache Kafka Tutorial: Introduction, Concepts, Workflow, Tools, Applications

Introduction 

With the increasing popularity of Kafka as a messaging system, many companies demand professionals with sound Kafka skills, and that’s where an Apache Kafka tutorial comes in handy. The enormous volumes of data handled in the realm of Big Data need a messaging system for collection and analysis.

Kafka is an efficient replacement for the conventional message broker, with improved throughput, inherent partitioning and replication, and built-in fault tolerance, making it suitable for large-scale message processing applications. If you have been looking for an Apache Kafka tutorial, this is the right article for you.

Key takeaways of this Apache Kafka Tutorial 

  • Concept of messaging systems 
  • A brief introduction to Apache Kafka
  • Concepts related to Kafka cluster and Kafka architecture
  • Brief description of Kafka messaging workflow
  • Overview of important Kafka tools
  • Use cases and applications of Apache Kafka

Also learn about: Apache Spark Streaming Tutorial For Beginners

A brief overview of messaging systems 

The main function of a messaging system is to allow data transfer from one application to another; the system ensures that the applications focus only on the data without getting stalled during the process of data sharing and transmission. There are two kinds of messaging systems:

1. Point to point messaging system

In this system, the producers of the messages are called senders and the ones who consume the messages are receivers. In this domain, the messages are exchanged via a destination known as a queue; the senders or the producers produce the messages to the queue, and the messages are consumed by the receivers from the queue.
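The queue semantics described above can be sketched in a few lines of plain Python (this is an illustration of the point-to-point model, not Kafka's API): every message placed on the queue is delivered to exactly one receiver, never duplicated.

```python
import queue

# A shared queue: senders put messages on it, receivers take them off.
q = queue.Queue()

# Sender side: produce three messages to the queue.
for i in range(3):
    q.put(f"order-{i}")

# Receiver side: two receivers draining the same queue. Each message
# is consumed exactly once; the receivers share the work.
receiver_a = [q.get(), q.get()]
receiver_b = [q.get()]
```

After this runs, `receiver_a` holds the first two messages and `receiver_b` the third; no message appears in both, which is the defining property of point-to-point messaging.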


2. Publish-subscribe messaging system

In this system, the producers of the messages are called publishers and the ones who consume the messages are subscribers. However, in this domain, the messages are exchanged through a destination known as a topic. A publisher produces the messages to a topic and having subscribed to a topic, the subscribers consume the messages from the topic. This system allows broadcasting of messages (having more than one subscriber and each gets a copy of the messages published to a particular topic).
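By contrast, the broadcast behaviour of publish-subscribe can be sketched like this (again plain Python, purely illustrative; the topic name is made up): every subscriber of a topic receives its own copy of each published message.

```python
from collections import defaultdict

# topic -> list of subscriber inboxes
subscribers = defaultdict(list)
inbox_1, inbox_2 = [], []
subscribers["payments"].extend([inbox_1, inbox_2])

def publish(topic, message):
    # Broadcast: append a copy of the message to every subscriber's inbox.
    for inbox in subscribers[topic]:
        inbox.append(message)

publish("payments", "txn-42")
```

Both inboxes end up containing `"txn-42"`, unlike the queue example where each message reaches only one receiver.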


Apache Kafka – an introduction

Apache Kafka is based on a publish-subscribe (pub-sub) messaging system. In the pub-sub messaging system, publishers are the producers of the messages, and subscribers are the consumers of the messages. In this system, the consumers can consume all the messages of the subscribed topic(s). This principle of the pub-sub messaging system is employed in Apache Kafka.

In addition, Apache Kafka uses the concept of distributed messaging, whereby messages are queued non-synchronously between the messaging system and the applications. With a robust queue capable of handling large volumes of data, Kafka lets you transmit messages from one end-point to another and suits both online and offline consumption of messages. Combining reliability, scalability, durability and high throughput, Apache Kafka is ideal for integration and communication between units of large-scale real-world data systems.

Also read: Big Data Project Ideas


Concept of Apache Kafka clusters 


  1. Kafka zookeeper: The brokers in a cluster are coordinated and managed by ZooKeeper. ZooKeeper notifies producers and consumers about the presence of a new broker or the failure of a broker in the Kafka system, and also notifies consumers about offset values. On receiving such a notification from ZooKeeper, producers and consumers coordinate their activities with another broker.
  2. Kafka broker: Kafka brokers are systems responsible for maintaining the published data in Kafka clusters with the help of zookeepers. A broker may have zero or more partitions for each topic.
  3. Kafka producer: The producer publishes messages on one or more Kafka topics and pushes them to brokers, without awaiting broker acknowledgement.
  4. Kafka consumer: Consumers pull data from the brokers and consume already published messages from one or more topics. They issue a non-synchronous pull request to the broker to have a ready-to-consume buffer of bytes, and supply an offset value to rewind or skip to any point in a partition.

Fundamental concepts of Kafka architecture 

  1. Topics: A topic is a logical channel to which producers publish messages and from which consumers receive them. Topics can be replicated (copied) as well as partitioned (divided). A particular kind of message is published on a specific topic, and each topic is identifiable by its unique name.
  2. Topic partitions: In the Kafka cluster, topics are divided into partitions and replicated across brokers. A producer can add a key to a published message, and all messages with the same key end up in the same partition. Each message in a partition is assigned an incremental ID called an offset; offsets are meaningful only within their partition and carry no meaning across the partitions of a topic.
  3. Leader and replica: Every Kafka broker holds a few partitions, each of which is either the leader or a replica (backup) of a topic's partition. The leader is responsible not only for reads and writes to a topic but also for updating the replicas with new data. If the leader fails, a replica can take over as the new leader.
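The two partitioning rules above (same key, same partition; offsets local to a partition) can be sketched in plain Python. This is a conceptual model, not the Kafka client: Kafka's default partitioner actually uses a murmur2 hash of the key, so the built-in `hash` here is just a stand-in.

```python
NUM_PARTITIONS = 3
# Each partition is an append-only log of (offset, key, value) records.
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    # Same key always hashes to the same partition number.
    return hash(key) % NUM_PARTITIONS

def produce(key, value):
    p = partition_for(key)
    offset = len(partitions[p])          # next incremental offset in p
    partitions[p].append((offset, key, value))
    return p, offset

p1, _ = produce("user-7", "login")
p2, _ = produce("user-7", "click")       # same key -> same partition,
                                         # so ordering per key is preserved
```

Because both messages share the key `"user-7"`, they land in the same partition with offsets 0 and 1, which is how Kafka guarantees per-key ordering.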

Architecture of Apache Kafka 



A Kafka system with more than one broker is called a Kafka cluster. Four core APIs will be discussed in this Apache Kafka Tutorial:

  1. Producer API: The Kafka producer API allows a stream of records to be published by an application to one or several Kafka topics.
  2. Consumer API: The consumer API allows an application to process the continuous flow of records produced to one or more topics.
  3. Streams API: The streams API allows an application to consume an input stream from one or several topics and generate an output stream to one or several output topics, thus permitting the application to act as a stream processor that efficiently transforms input streams into output streams.
  4. Connector API: The connector API allows the creation and running of reusable producers and consumers, thus enabling a connection between Kafka topics and existing data systems or applications.
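What the Streams API does conceptually can be shown with a tiny plain-Python sketch (lists stand in for topics; this is not the actual Streams API, which is a Java library): records are consumed from an input topic, transformed, and the results produced to an output topic.

```python
# Lists standing in for Kafka topics.
input_topic = ["3", "1", "4"]
output_topic = []

def stream_processor(records):
    # Transform each input record into an output record
    # (here: parse the string and scale it by 10).
    for r in records:
        yield int(r) * 10

# Consume from the input, transform, produce to the output.
output_topic.extend(stream_processor(input_topic))
```

The real Streams API adds what this sketch lacks: distributed execution across partitions, fault tolerance, and stateful operations such as joins and windowed aggregations.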

Workflow of the publisher-subscriber messaging domain

  1. Kafka producers send messages to a topic at regular intervals.
  2. Kafka brokers ensure equal distribution of messages within the partitions by storing them in the partitions configured for a particular topic.
  3. Kafka consumers subscribe to a specific topic. Once a consumer has subscribed to a topic, the current offset of the topic is offered to the consumer, and the offset is saved in the zookeeper ensemble.
  4. The consumer requests Kafka for new messages at regular intervals.
  5. Kafka forwards the messages to consumers immediately on receipt from producers.
  6. The consumer receives the message and processes it.
  7. The Kafka broker receives an acknowledgement as soon as the message is processed.
  8. On receipt of the acknowledgement, the offset is updated to the new value.
  9. The flow repeats until the consumer stops the request.
  10. The consumer can skip or rewind an offset at any time and read subsequent messages as per convenience.
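Steps 4 through 8 above form a poll-process-commit loop, which can be sketched in plain Python (illustrative only; the real consumer talks to a broker over the network): the consumer polls from its committed offset, processes the batch, and only then commits the new offset.

```python
# Messages stored in one partition of a topic.
log = ["m0", "m1", "m2", "m3"]
committed_offset = 0
processed = []

def poll(offset, max_records=2):
    # Return up to max_records messages starting at the given offset
    # (step 4: the consumer requests new messages).
    return log[offset:offset + max_records]

while committed_offset < len(log):
    batch = poll(committed_offset)
    processed.extend(batch)            # step 6: process the messages
    committed_offset += len(batch)     # step 8: commit the new offset
```

Committing only after processing gives at-least-once delivery: if the consumer crashes mid-batch, it re-reads from the last committed offset rather than losing messages.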

Workflow of the queue messaging system

In a queue messaging system, several consumers with the same group ID can subscribe to a topic. They are considered a single group and share the messages. The workflow of the system is:

  1. Kafka producers send messages to a topic at regular intervals.
  2. Kafka brokers ensure equal distribution of messages within the partitions by storing them in the partitions configured for a particular topic.
  3. A single consumer subscribes to a specific topic.
  4. Until a new consumer subscribes to the same topic, Kafka interacts with the single consumer.
  5. With the arrival of the new consumers, the data is shared between two consumers. The sharing is repeated until the number of configured partitions for that topic equals the number of consumers.
  6. A new consumer will not receive any messages once the number of consumers exceeds the number of configured partitions. This happens because each partition is assigned to at most one consumer in the group, so if no partition is free, the new consumer has to wait.
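The sharing rule in steps 5 and 6 can be sketched as a simple round-robin assignment in plain Python (Kafka's actual group coordinator supports several assignment strategies; this just illustrates the "extras sit idle" outcome):

```python
def assign(partitions, consumers):
    # Each partition goes to exactly one consumer in the group.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions shared among a group of 4 consumers.
a = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
# c1, c2 and c3 each get one partition; c4 gets none and sits idle.
```

This is why the number of partitions puts a ceiling on the useful parallelism of a consumer group.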

2 important tools in Apache Kafka 

Next, in this Apache Kafka Tutorial, we will discuss the Kafka tools packaged under “org.apache.kafka.tools.*”.

1. Replication Tools

These are high-level design tools that impart higher availability and greater durability.

  • Create Topic tool: This tool creates a topic with a replication factor and a default number of partitions, and uses Kafka's default scheme for replica assignment.
  • List Topic tool: The information for a given list of topics is listed by this tool. Fields such as partition, topic name, leader, replicas and isr are displayed by this tool.
  • Add Partition tool: More partitions for a particular topic can be added by this tool. It also performs manual assignment of replicas of the added partitions.

2. System tools

System tools in Kafka can be run using the run class script. Important system tools include:

  • Mirror Maker: The use of this tool is to mirror one Kafka cluster to another.
  • Kafka Migration tool: This tool helps in migrating a Kafka broker from one version to another.
  • Consumer Offset Checker: This tool displays Kafka topic, log size, offset, partitions, consumer group and owner for the particular set of topics.
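A hedged sketch of how such tools are launched via the run class script (the exact tool class names vary between Kafka versions, so treat the class below as an example rather than a guarantee):

```shell
# General form: run a tool class bundled with the Kafka distribution.
bin/kafka-run-class.sh <fully.qualified.ToolClassName> [options]

# For example, older Kafka versions shipped MirrorMaker as:
bin/kafka-run-class.sh kafka.tools.MirrorMaker --help
```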

Also Read: Apache Pig Tutorial

Top 4 use cases of Apache Kafka 

Let us discuss some important use cases of Apache Kafka in this Apache Kafka Tutorial:

  1. Stream processing: Kafka's strong durability makes it well suited to stream processing. In this case, data is read from a topic, processed, and the processed data is then written to a new topic to make it available to applications and users.
  2. Metrics: Kafka is frequently used for operational monitoring of data. Statistics are aggregated from distributed applications to make a centralised feed of operational data. 
  3. Tracking website activity: Kafka is employed to track activities on websites, often feeding data warehouses such as Google BigQuery. Site activities like searches, page views or other user actions are published to central topics and made accessible for real-time processing, offline analysis and dashboards.
  4. Log aggregation: Using Kafka, logs can be collected from many services and made available in a standardised format to many consumers.   

 

Top 5 Applications of Apache Kafka 

Some of the best industrial applications supported by Kafka include:

  1. Uber: The ride-hailing app needs immense real-time processing and handles huge data volumes. Important processes like auditing, ETA calculation and matching drivers with customers are modelled on Kafka Streams.
  2. Netflix: The on-demand internet streaming platform Netflix uses Kafka metrics for processing of events and real-time monitoring.
  3. LinkedIn: LinkedIn manages 7 trillion messages every day, with 100,000 topics, 7 million partitions and over 4,000 brokers. Apache Kafka is used at LinkedIn for user activity tracking and monitoring.
  4. Tinder: This popular dating app uses Kafka Streams for several processes that include content moderation, recommendations, updating the user time zone, notifications and user activation, among others.
  5. Pinterest: With billions of pins and ideas searched monthly, Pinterest has leveraged Kafka for many processes. Kafka Streams is utilised for indexing content, detecting spam, generating recommendations and calculating real-time ad budgets.

Conclusion

In this Apache Kafka Tutorial, we have discussed the fundamental concepts of Apache Kafka, its cluster and architecture, the Kafka workflow, Kafka tools and some applications of Kafka. Apache Kafka offers durability, scalability, fault tolerance, reliability, extensibility, replication and high throughput, which make it the backbone of some of the best industrial applications, as exemplified in this Apache Kafka Tutorial.

If you are interested to know more about Big Data, check out our PG Diploma in Software Development Specialization in Big Data program which is designed for working professionals and provides 7+ case studies & projects, covers 14 programming languages & tools, practical hands-on workshops, more than 400 hours of rigorous learning & job placement assistance with top firms.

