We are currently living in a world where a vast amount of data is generated every second at a rapid rate. This data can provide meaningful and useful results if it is accurately analysed. It can also offer solutions to many industries at the right time.
These are very helpful in the industries such as in Travel Services, Retail, Media, Finance and Health Care. Many other top companies have adopted Data Analysis such as Tracking of Customer interaction with different kinds of products done by Amazon on its platform or Viewers receiving personalized recommendations at real-time, which is provided by Netflix.
It can be used by any business which uses a large amount of data, and they can analyse it for their benefit to improve the overall process in their business and to increase customer satisfaction and user experiences. Better User experiences and customer satisfaction provides benefit to the organization, in the long run, to expand the business and make a profit.
What is Streaming?
Streaming of data is a method in which information is transferred as a continuous and a steady stream. As the Internet is growing, technologies of streaming are also increasing.
What is Spark Streaming?
When data continuously arrives in a sequence of unbound, then it is called a data stream. Input data is flowing steadily, and it is divided by streaming. Further processing of data is done after it is divided into discrete units. The analysing of data and processing data at low latency is called stream processing.
In 2013, Apache Spark was added with Spark Streaming. There are many sources from which the Data ingestion can happen such as TCP Sockets, Amazon Kinesis, Apache Flume and Kafka. With the help of sophisticated algorithms, processing of data is done. A high-level function such as window, join, reduce and map are used to express the processing. Live Dashboards, Databases and file systems are used to push the processed data to file systems.
Working of Stream
Following are the internal working. Spark streaming divides the live input data streams into batches. Spark Engine is used to process these batches to generate final stream batches as a result.
Data in the stream is divided into small batches and is represented by Apache Spark Discretized Stream (Spark DStream). Spark RDDs is used to build DStreams, and this is the core data abstraction of Spark. Any components of Apache Spark such as Spark SQL and Spark MLib can be easily integrated with the Spark Streaming seamlessly.
Spark Streaming helps in scaling the live data streams. It is one of the extensions of the core Spark API. It also enables processing of fault-tolerant stream and high-throughput. The use of Spark Streaming does Real-time processing and streaming of live data. Major Top Companies in the world are using the service of Spark Streaming such as Pinterest, Netflix and Uber.
Spark Streaming also provides an analysis of data in real-time. Live and Fast processing of data are performed on the single platform of Spark Streaming.
Also read Apache Spark Architecture
Why Spark Streaming?
Spark Streaming can be used to stream real-time data from different sources, such as Facebook, Stock Market, and Geographical Systems, and conduct powerful analytics to encourage businesses.
There are five significant aspects of Spark Streaming which makes it so unique, and they are:
Advanced Libraries like graph processing, machine learning, SQL can be easily integrated with it.
The data which is getting streamed can be done in conjunction with interactive queries and also static datasets.
3. Load Balancing
Spark Streaming has a perfect balancing of load, which makes it very special.
4. Resource usage
Spark Streaming use the available resource in a very optimum way.
5. Recovery from stragglers and failures
Spark Streaming can quickly recover from any kinds of failures or straggler.
Need for Streaming in Apache Spark
Continuous operator model is used while designing the system for processing streams traditionally to process the data. The working of the system is as follows:
- Data sources are used to stream the data. The different kinds of Data sources are IoT device, system telemetry data, live logs and many more. These streaming data are ingested into data ingestion systems such as Amazon Kinesis, Apache Kafka and many more.
- On a cluster, parallel processing is done on the data.
- Downstream systems such as Kafka, Cassandra, HBase are used to pass the results.
A set of worker nodes runs some continuous operators. The processing of records of streamed data is done one at a time. The documents are then forwarded to the next operators in the pipeline.
Source Operators are used to receiving Data from ingestion systems. Sink Operators are used to giving output to the downstream system.
Some operators are continuous. These are a natural and straightforward model. When it comes to Analytics of complex data at real-time, which is done at a large scale, traditional architecture faces some challenges in the modern world, and they are:
Fast failure recovery
In today’s system failures are quickly accompanied and accommodated by recovering lost information by computing the missing info in parallel nodes. Thus, it makes the recovery even faster compared to traditional systems.
Load balancer helps to allocate resource and data among the node in a more efficient manner so that no resource is waiting or doing nothing but the data is evenly distributed throughout the nodes.
Unification of Interactive, Batch and Streaming Workloads
One can also interact with streaming data by making queries to the streaming data. It can also be combined with static datasets. One cannot do ad-hoc queries using new operators because it is not designed for continuous operators. Interactive, streaming and batch queries can be combined by using a single-engine.
SQL queries and analytics with ML
Developing systems with common database command made developer life easy to work in collaboration with other systems. The community widely accepts SQL queries. Where the system provides module and libraries for machine learning that can be used for advance analytical purpose.
Spark Streaming Overview
Spark Streaming uses a set of RDDs which is used to process the real-time data. Hence, Spark Streaming is generally used commonly for treating real-time data stream. Spark Streaming provides fault-tolerant and high throughput processing of live streams of data. It is an extra feature that comes with core spark API.
Spark Streaming Features
- Business Analysis: With the use of Spark Streaming, one can also learn the behaviour of the audience. These learning can later be used in the decision-making of businesses.
- Integration: Real-time and Batch processing is integrated with Spark
- Fault Tolerance – The unique ability of the Spark is that it can recover from the failure efficiently.
- Speed: Low Latency is achieved by Spark
- Scaling: Nodes can be scaled easily up to hundreds by Spark.
Spark Streaming Fundamentals
1. Streaming Context
In Spark the data stream is consumed and managed by Streaming Context. It creates an object of Receiver which is produced by registering an Input streaming. Thus it is the main Spark functionality that becomes a critical entry point to the system as it provides many contexts that provide a default workflow for different sources like Akka Actor, Twitter and ZeroMQ.
A spark context object represents the connection with a spark cluster. Where the Spark Streaming object is created by a StreamingContext object, accumulators, RDDs and broadcast variables can also be created a SparkContex object.
2. Checkpoints, Broadcast Variables and Accumulators
Checkpoint works similar to Checkpoints which stores the state of the systems the same as in the games. Where, in this case, Checkpoints helps in reducing the loss of resources and make the system more resilient to system breakdown. A checkpoint methodology is a better way to keep track of and save the states of the system so that at the time of recovery, it can be easily pulled back.
Instead of providing the complete copy of tasks to the network Nodes, it always catches a read-only variable which is responsible for acknowledging the nodes of different task present and thus reducing transfer and computation cost by individual nodes. So it can provide a significant input set more efficiently. It also uses advanced algorithms to distribute the broadcast variable to different nodes in the network; thus, the communication cost is reduced.
Accumulators are variables which can be customized for different purposes. But there also exist already defined Accumulators like counter and sum Accumulators. There is also tracking Accumulators that keeps track of each node, and some extra features can also be added into it. Numeric Accumulators support many digital functions which are also supported by Spark. A custom-defined Accumulators can also be created demanded by the user.
DStream means Discretized Stream. Spark Streaming offers the necessary abstraction, which is called Discretized Stream (DStream). DStream is a data which streams continuously. From a source of data, DStream is received. It may also be obtained from a stream of processed data. Transformation of input stream generates processed data stream.
After a specified interval, data is contained in an RDD. Endless series of RDDs represents a DStream.
Developers can use DStream to cache the stream’s data in memory. This is useful if the data is computed multiple times in the DStream. It can be achieved by using the persist() method on a DStream.
Duplication of data is done to ensure the safety of having a resilient system that can resist and failure in the system thus having an ability to tolerate faults in the system (such as Kafka, Sockets, Flume etc.)
Spark Streaming Advantage & Architecture
Processing of one data stream at a time can be cumbersome at times; hence Spark Streaming discretize the data into small sub batches which are easily manageable. That’s because Spark workers get buffers of data in parallel accepted by Spark Streaming receiver. And hence the whole system runs the batches in parallel and then accumulates the final results. Then these short tasks are processed in batches by Spark engine, and the results are provided to other systems.
In Spark Streaming architecture, the computation is not statically allocated and loaded to a node but based on the data locality and availability of the resources. It is thus reducing loading time as compared to previous traditional systems. Hence the use of data locality principle, it is also easier for fault detection and its recovery.
Data nodes in Spark are usually represented by RDD that is Resilient Distribution Dataset.
Goals of Spark Streaming
Following are the Goals achieved by Spark architecture.
1. Dynamic load balancing
This is one of the essential features of Spark Streaming where data streams are dynamically allocated by the load balancer, which is responsible for allocation data and computation of resources using specific rules defined in it. The main goal of load balancing is to balance the workload efficiently across the workers and put everything in a parallel way such that there is no wastage of resources available. And also responsible for dynamically allocating resource to the worker nodes in the system.
2. Failure and Recovery
As in the traditional system, when there occurs an operation failure, the whole system has to recompute that part to get the lost information back. But the problem comes when one node is handling all this recovery and making the entire system to wait for its completion. Whereas in Spark the lost information is computed by other free nodes and bring back the system to track without any extra waiting like in the traditional methods.
And also the failed task is distributed evenly on all the nodes in the system to recompute and bring back it from failure faster than the traditional method.
3. Batches and Interactive query
Set of RDDs in Spark are called to be DStream in Spark that provides a relation between Streaming workloads and batches. These batches are stored in Spark’s memory, which provides an efficient way to query the data present in it.
The best part of Spark is that it includes a wide variety of libraries that can be used when required by the spark system. Few names of the libraries are MLlib for machine learning, SQL for data query, GraphX and Data Frame whereas Dataframe and questions can be converted to equivalent SQL statements by DStreams.
As the spark system uses parallel distributions of the task that improve its throughput capacity and thus leveraging the sparks engine that capable of achieving low latency as low as up to few 100 milliseconds.
How do Spark Streaming works?
The data in the stream is divided into small batches which are called DStreams in the Spark Streaming. It is a sequence of RDDs internally. Spark APIs are used by RDDS to process the data and shipments are returned as a result. The API of Spark Streaming is available in Python, Java and Scala. Many features are lacking in the recently introduced Python API in Spark 1.2.
Stateful computations are called a state that is maintained by the Spark Streaming based on the incoming data in the stream. The data that flows in the stream is processed within a time frame. This time frame is to be specified by the developer, and it is to be allowed by Spark Streaming. The time window is the time frame within which the work should be completed. The time window is updated within a time interval which is also known as the sliding interval in the window.
Spark Streaming Sources
Receiver object which is related with an input DStream, stores data received, in Sparks Memory for processing.
Built-in streaming has two categories:
1. Basic source
Sources available in Streaming API, e.g. Socket Connection and File System.
2. Advanced source
Advanced level of sources is Kinesis, Flume & Kafka etc.
There are two types of operations which are supported by Spark RDDS, and they are:-
1. Output Operations in Apache Spark
Output Operations are used to push out the data of the DStream into an external system such as a file system or a database. Output Operations allows transformed data to be consumed by the external systems. All the DStreams Transformation are actually executed by the triggering, which is done by the external systems.
These are the current Output operations:
foreachRDD(func), [suffix]), saveAsHadoopFiles(prefix, [suffix]), saveAsObjectFiles(prefix, [suffix])”prefix-TIME_IN_MS[.suffix]”, saveAsTextFiles(prefix, print()
RDDs lazily execute output Operations. Inside the DStream Operations of Output, RDD Actions are taken forcefully to be processed of the received data. The execution of Output Operations is done one-at-a-time. Spark applications define the order of the performance of the output operations.
2. Spark Transformation
Spark transformation also changes the data from the DStream as RDDs support it in Spark. Just as Spark RDD’s, many alterations are supported by DStream.
Following are the most common Transformation operations:
Window(), updateStateByKey(), transform(), [numTasks]), cogroup(otherStream, [numTasks]), join(otherStream, reduceByKey(func, [numTasks]), countByValue(), reduce(), union(otherStream), count(), repartition(numPartitions), filter(), flatMap(), map().
In today’s data-driven world tools to store and analyse data has proved to be the key factor in business analytics and growth. Big Data and the associated tools and technologies have proven to be on a rising demand. As such Apache Spark has a great market and offers great features to customers and businesses.
If you are curious to learn about apache spark streaming, data science, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.