Apache Spark has emerged as a much more accessible and compelling replacement for Hadoop, the original choice for managing Big Data. Apache Spark, like other sophisticated Big Data tools, is extremely powerful and well-equipped for tackling huge datasets efficiently.
Through this blog post, let’s help you clarify the finer points of Apache Spark.
What is Apache Spark?
Spark, in very simple terms, is a general-purpose data handling and the processing engine that is fit for use in a variety of circumstances. Data scientists make use of Apache Spark to improve their querying, analyses and well as the transformation of data. Tasks most frequently accomplished using Spark include interactive queries across large data sets, analysis, and processing of streaming data from sensors and other sources, as well as machine learning tasks.
Spark was introduced back in 2009 at the University of California, Berkeley. It found its way to the Apache Software Foundation’s incubator back in 2014 and was promoted in 2014 to one of the Foundation’s highest-level projects. Currently, Spark is one of the most highly rated projects of the foundation. The community that has grown up around the project includes both prolific individual contributors as well as well-funded corporate backers.
From the time it was incepted, it was made sure that most of the tasks happen in-memory. Therefore, it was always going to be faster and much more optimised than other approaches like Hadoop’s MapReduce, which writes data to and from hard drives between each stage of processing. It is claimed that the in-memory capability of Spark gives it 100x speed than Hadoop’s MapReduce. This comparison, however true, isn’t fair. Because Spark was designed keeping speed in mind, whereas Hadoop was ideally developed for batch processing (which doesn’t require as much speed as stream processing).
What Does Spark Do?
Spark is capable of handling petabytes of data at a time. This data is distributed across a cluster of thousands of cooperating servers – physical or virtual. Apache spark comes with an extensive set of libraries and API which support all the commonly used languages like Python, R, and Scala. Spark is often used with HDFS (Hadoop Distributed File System – Hadoop’s data storage system) but can be integrated equally well with other data storage systems.
Some typical use cases of Apache Spark include:
- Spark streaming and processing: Today, managing “streams” of data is a challenge for any data professional. This data arrives steady, often from multiple sources, and all at one time. While one way could be to store this data in disks and analyse it retrospectively, this would cost businesses a lost. Streams of financial data, for example, can be processed in real-time to identify—and refuse—potentially fraudulent transactions. Apache Spark helps with precisely this.
- Machine learning: With the increasing volume of data, ML approaches too are becoming much more feasible and accurate. Today, the software can be trained to identify and act upon triggers and then apply the same solutions to new and unknown data. Apache Spark’s standout feature of storing data in-memory helps in quicker querying and thus makes it an excellent choice for training ML algorithms.
- Interactive streaming analytics: Business analysts and data scientists want to explore their data by asking a question. They no longer want to work with pre-defined queries to create static dashboards of sales, production-line productivity, or stock prices. This interactive query process requires systems such as Spark that is able to respond quickly.
- Data integration: Data is produced by a variety of sources and is seldom clean. ETL (Extract, transform, load) processes are often performed to pull data from different systems, clean it, standardise it, and then store it into a separate system for analysis. Spark is increasingly being used to reduce the cost and time required for this.
Companies using Apache Spark
A wide range of organisations has been quick to support and join hands with Apache Spark. They realised that Spark delivers real value, such as interactive querying and machine learning.
Famous companies like IBM and Huawei have already invested quite a significant sum in this technology, and many growing startups are building their products in and around Spark. For instance, the Berkeley team responsible for creating spark founded Databricks in 2013. Databricks provides a hosted end-to-end data platform powered by Spark.
All the major Hadoop vendors are beginning to support Spark alongside their existing products. Web-oriented organisations like Baidu, e-commerce operation Alibaba Taobao, and social networking company Tencent all use Spark-based operations at scale. To give you some perspective of the power of Apache Spark, Tencent has 800 million active users that generate over 800 TB of data per day for processing.
In addition to these web-based giants, pharmaceutical companies like Novartis also depend upon Spark. Using Spark Streaming, they’ve reduced the time required to get modelling data into the hands of researchers.
What Sets Spark Apart?
Let’s look at the key reasons why Apache Spark has quickly become a data scientist’s favourite:
- Flexibility and accessibility: Having such a rich set of APIs, Spark has ensured that all of its capabilities are incredibly accessible. All these APIs are designed for interacting quickly and efficiently with data at scale, thus making Apache Spark extremely flexible. There is thorough documentation for these APIs, and it is written in an extraordinarily lucid and straightforward manner.
- Speed: Speed is what Spark is designed for. Both in-memory or on disk. A team of Databricks used Spark for the 100TB Benchmark challenge. This challenge involves processing a huge but static data set. The team was able to process 100TBs of data stored on an SSD in just 23 minutes using Spark. The previous winner did it in 72 minutes using Hadoop. What is even better is that Spark performs well when supporting interactive queries of data stored in memory. In these situations, Apache Spark is claimed to be 100 times faster than MapR.
- Support: Like we said earlier, Apache Spark supports most of the famous programming languages including Java, Python, Scala, and R. Spark also includes support for tight integration with a number of storage systems except just HDFS. Furthermore, the community behind Apache Spark is huge, active, and international.
With that, we come to the end of this blog post. We hope you enjoyed getting into the details of Apache Spark. If large sets of data make your adrenaline rush, we recommend you get hands-on with Apache Spark and make yourself an asset!