Apache Spark is an open-source distributed general-purpose cluster computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark was initially developed in 2009 at UC Berkeley's AMPLab and open-sourced in 2010; it was donated to the Apache Software Foundation in 2013 and became a top-level Apache project in 2014. It quickly emerged as a preferred big data platform for businesses, and industry surveys suggest that a majority of enterprises dealing with large-scale data have adopted Spark.
Some key reasons for Spark's popularity are its ease of use, speed and unified architecture. Spark allows developers to quickly write applications in Java, Scala, Python or R and easily build data pipelines. It also runs workloads up to 100x faster than Hadoop MapReduce in memory or 10x faster on disk. Spark offers over 80 high-level operators to make parallel jobs easy.
In this comprehensive tutorial, we will understand what Spark is, its features, architecture and other related concepts in detail.
Spark has a well-defined layered architecture comprising the Spark Core and built-in libraries including SQL, DataFrames, MLlib, GraphX, and Spark Streaming.
The key components of the Spark architecture are:
- Driver program: runs the application's main() function, creates the SparkSession, and schedules tasks
- Cluster manager: allocates resources across the cluster (Spark's standalone manager, YARN, Mesos, or Kubernetes)
- Executors: worker processes that execute tasks and cache data on the cluster nodes
This separation of responsibilities allows each component to focus on a single function, leading to a modular and versatile framework.
Spark SQL is a Spark module used for structured data processing and relational queries. It provides a DataFrame API to write SQL-like queries in Python, Java, or Scala.
Spark SQL also includes a SQL parser and optimizer that allow running regular ANSI SQL statements. It can work with various data sources like Hive, Avro, Parquet, ORC, JSON, and JDBC.
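For instance, loading a JSON or a Parquet file takes a single line each through the unified reader API; the file paths below are placeholders:

# Reading different formats through the same reader interface (paths are hypothetical)
books_json_df = spark.read.json("data/books.json")
books_parquet_df = spark.read.parquet("data/books.parquet")
books_json_df.printSchema()  # Spark infers the schema automatically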
Some key benefits of Spark SQL:
- A single, unified interface for querying many data sources
- Support for standard ANSI SQL alongside the DataFrame API
- Automatic query optimization through its built-in optimizer
- Seamless mixing of SQL queries with Python, Java, or Scala code
Let's look at a few examples:
1. We can create a DataFrame from a list of JSON-style records:

from pyspark.sql import SparkSession

# Entry point for DataFrame operations (already available as `spark` in the Spark shell)
spark = SparkSession.builder.appName("SparkSQLExamples").getOrCreate()

data = [{"Category": "Fiction", "Title": "To Kill a Mockingbird", "Author": "Harper Lee", "Price": 7.99},
        {"Category": "Fiction", "Title": "1984", "Author": "George Orwell", "Price": 9.99}]
df = spark.createDataFrame(data)
df.show()
Output: a two-row table containing both book records.
2. We can run SQL queries on DataFrames interactively:
df.createOrReplaceTempView("books")
spark.sql("SELECT * FROM books WHERE Price < 8").show()
Output: a single row for To Kill a Mockingbird, the only book priced under 8.
3. We can also create permanent tables, insert data and run SQL queries:
spark.sql("CREATE TABLE books (Category STRING, Title STRING, Author STRING, Price FLOAT)")
spark.sql("INSERT INTO books VALUES ('Fiction', 'To Kill a Mockingbird', 'Harper Lee', 7.99)")Â
spark.sql("SELECT * FROM books").show()
Output: the books table with the one inserted row.
4. Spark SQL supports SQL join operations between DataFrames:
authors_df = spark.createDataFrame([
    ("1001", "Harper Lee", "To Kill a Mockingbird"),
    ("1002", "George Orwell", "1984")
], ["id", "name", "book"])
authors_df.show()

book_sales_df = spark.createDataFrame([
    ("To Kill a Mockingbird", 10),
    ("1984", 5)
], ["book", "quantity"])
book_sales_df.show()

# Inner join on the book title
joined_df = authors_df.join(book_sales_df, authors_df.book == book_sales_df.book, "inner")
joined_df.show()
Output: the two source tables, followed by the joined table pairing each author and book with its sales quantity.
Apache Spark is a lightning-fast unified analytics engine for big data and machine learning. It was built to overcome the limitations of Hadoop MapReduce, which was slow for iterative algorithms that pass over the same dataset multiple times.
Some key capabilities of Spark include the following:
- Batch processing of large datasets
- Interactive queries over structured data with Spark SQL
- Real-time stream processing with Spark Streaming
- Scalable machine learning with MLlib
- Graph processing with GraphX
Simply put, Apache Spark is a framework designed to process and analyze vast amounts of data. It offers speed, ease of use, and flexibility in working with large datasets, and it addresses the limitations of MapReduce by running workloads faster through in-memory computation.
This broad range of capabilities has made Spark an essential tool for data engineers, data scientists, analysts and more.
Let's briefly look at how Apache Spark evolved:
- 2009: Started as a research project at UC Berkeley's AMPLab
- 2010: Open-sourced under a BSD license
- 2013: Donated to the Apache Software Foundation
- 2014: Became a top-level Apache project; Spark 1.0 released
- 2016: Spark 2.0 released, unifying the DataFrame and Dataset APIs
- 2020: Spark 3.0 released, adding adaptive query execution and other optimizations
Spark offers many features that make it an ideal platform for processing big data workloads. The key features are described below.
Spark utilizes in-memory computation and optimized execution for performance. It can run workloads up to 100x faster than Hadoop MapReduce. Spark achieves this speed through the following:
- In-memory caching: intermediate data is kept in RAM rather than written to disk between stages (sketched below)
- Lazy evaluation: transformations build up a DAG that Spark optimizes as a whole before executing anything
- Optimized execution: the Catalyst optimizer and Tungsten engine generate efficient physical plans and code
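For instance, caching keeps a DataFrame in memory so repeated actions avoid re-reading from disk. A minimal sketch, assuming a JSON log file at a placeholder path:

# Lazy: nothing executes yet; Spark only records the transformations
logs_df = spark.read.json("data/logs.json").filter("status = 200")

# Mark the DataFrame for in-memory caching
logs_df.cache()

print(logs_df.count())  # first action reads the file and populates the cache
print(logs_df.count())  # second action is served from memory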
Spark offers over 80 high-level operators to parallelize data operations. This makes it easy to create parallel apps without writing boilerplate code.
Some of the key APIs are:
- RDD API: low-level distributed collections with operators like map, filter, and reduceByKey
- DataFrame and Dataset APIs: optimized abstractions for structured data
- Spark SQL: relational queries over DataFrames and tables
- MLlib: distributed machine learning algorithms
- GraphX: graph-parallel computation
- Spark Streaming and Structured Streaming: processing of live data streams
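As an example of how far a few high-level operators go, here is the classic word count, a minimal sketch assuming a plain-text file at a placeholder path:

# Word count in three operators; parallelism is handled by Spark
lines = spark.sparkContext.textFile("data/sample.txt")
counts = (lines.flatMap(lambda line: line.split())  # split each line into words
               .map(lambda word: (word, 1))         # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))    # sum the counts per word
print(counts.take(5))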
Spark provides a unified engine that supports a broad range of data analytics tasks, including batch, interactive, iterative, and streaming workloads. This eliminates the need to use separate tools.
Some of the workloads Spark handles well:
- Batch processing and ETL pipelines
- Interactive SQL queries and BI reporting
- Real-time stream processing
- Iterative machine learning and graph algorithms
Spark runs in a variety of environments, from on-premise data centers to public cloud platforms, using cluster managers such as its standalone scheduler, YARN, Mesos, or Kubernetes. Spark can access data from many sources:
- Distributed file systems such as HDFS
- Cloud object stores such as Amazon S3
- Hive tables and relational databases via JDBC
- NoSQL stores such as HBase and Cassandra
- File formats such as Parquet, ORC, Avro, JSON, and CSV
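Whatever the source, the reader API stays the same; a minimal sketch in which all paths and connection details are placeholders:

# Same DataFrame API across storage systems (all locations below are hypothetical)
hdfs_df = spark.read.parquet("hdfs://namenode:9000/warehouse/events")
s3_df = spark.read.csv("s3a://my-bucket/exports/", header=True, inferSchema=True)
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://dbhost:5432/sales")
           .option("dbtable", "orders")
           .option("user", "reporting")
           .option("password", "secret")
           .load())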
Spark is a general-purpose framework for data analytics that offers ease of development. Let's look at some examples of how Spark is commonly used:
Spark is utilized for large batch processing workloads such as ETL pipelines, data warehousing, and log processing. Batch jobs are well suited to Spark, which handles large amounts of data efficiently through distributed processing.
For example, an e-commerce company processes clickstream logs of 100 GB per day. Using Spark, this data can be loaded and cleaned using Spark SQL. We can then analyze this data with Spark MLlib to generate user behavior insights.
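A minimal sketch of such a pipeline, where the paths and column names are placeholders:

# Load raw clickstream logs, clean them, and write a curated Parquet table
clicks = spark.read.json("s3a://logs/clickstream/2024-01-01/")
cleaned = (clicks.dropDuplicates(["session_id", "timestamp"])  # hypothetical columns
                 .filter("user_id IS NOT NULL"))
cleaned.write.mode("overwrite").parquet("s3a://warehouse/clickstream_clean/")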
Spark allows querying data interactively through Spark SQL or directly from the Spark shell. Analysts can connect BI tools like Tableau to Spark to generate reports and dashboards from the latest data.
For example, a retail bank loads HDFS data into Spark hourly. Analysts then query this data interactively to analyze customer trends and generate daily reports.
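Such an interactive session might look like the following sketch, where the table and column names are hypothetical:

# Register the hourly load as a view, then explore it ad hoc
transactions = spark.read.parquet("hdfs:///bank/transactions/latest")
transactions.createOrReplaceTempView("transactions")
spark.sql("""
    SELECT customer_segment, COUNT(*) AS txns, AVG(amount) AS avg_amount
    FROM transactions
    GROUP BY customer_segment
    ORDER BY txns DESC
""").show()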
Spark MLlib provides commonly used machine learning algorithms that can be applied to large datasets. Tasks like fraud detection, product recommendations, predictive maintenance, etc., can leverage Spark's distributed processing.
For example, an insurance firm builds Spark ML models to predict fraudulent claims by analyzing past claim data. These models are retrained nightly as new claims come in.
We can use Spark MLlib to build a logistic regression classifier:
Code:
from pyspark.ml.classification import LogisticRegression

# Load training data in LIBSVM format
training = spark.read.format("libsvm") \
    .load("data/mllib/sample_libsvm_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
model = lr.fit(training)

# Print coefficients and intercept
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))
Output:
Coefficients: [0.5144343718982989,-0.22079497603262063,-0.4756967403873508,1.0316237691599358,-0.3059612781510368,-0.013700513924638212,-0.41385474153683273,-0.2784643572432388]
Intercept: -1.768966151113195
Spark Streaming processes live data streams in mini-batches, enabling real-time analytics for digital ad clicks, IoT sensors, financial trades, and more.
For example, a ride-sharing company uses Spark Streaming to process real-time trip data. The stream data is analyzed to identify frequent locations and surge pricing opportunities.
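In the same spirit, here is a minimal Structured Streaming sketch that counts events per key as they arrive; the socket source, host, and port are placeholders for a real feed such as Kafka:

# Read a live stream (socket source used for simplicity)
trips = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())  # yields one 'value' column per event

# Running count per value (e.g., per pickup location)
location_counts = trips.groupBy("value").count()

query = (location_counts.writeStream
         .outputMode("complete")  # emit the full updated counts each trigger
         .format("console")
         .start())
query.awaitTermination()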
Spark GraphX enables graph-parallel computation for tasks like social network analysis, fraud detection, and recommendations. It provides a graph abstraction and a library of common graph algorithms, with optimization handled automatically.
For example, a bank uses GraphX to analyze money transfer graphs and identify potentially fraudulent transactions and accounts in real time.
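GraphX exposes Scala APIs; from Python, graph workloads are commonly handled with the separate GraphFrames package instead. A minimal sketch, assuming graphframes is installed, that ranks accounts in a toy transfer graph with PageRank:

from graphframes import GraphFrame  # third-party package, not bundled with Spark

# Accounts as vertices, transfers as edges (toy data)
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", 500.0), ("b", "c", 450.0), ("c", "a", 400.0)],
    ["src", "dst", "amount"])

g = GraphFrame(vertices, edges)
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.orderBy("pagerank", ascending=False).show()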
In summary, Spark is a versatile framework used for batch processing, interactive analysis, machine learning, streaming, and graph workloads on big data. The unified API makes it easy to combine these capabilities.
In this comprehensive Spark tutorial for beginners, we covered the following:
- What Apache Spark is and how it evolved
- Spark's layered architecture and key components
- Spark SQL, with hands-on DataFrame and SQL examples
- Key features: speed, easy-to-use APIs, a unified engine, and broad data source access
- Common use cases: batch ETL, interactive analysis, machine learning, streaming, and graph processing
Spark is transforming big data analytics by enabling fast distributed processing on a versatile platform. Its adoption continues to grow as data volumes increase, making it a valuable skill for data professionals looking to advance their careers.
1. What languages does Apache Spark support?
Spark applications can be written in Java, Scala, Python, or R. Scala and Python are the most commonly used.
2. How does Spark achieve speed?
Spark uses in-memory computing, lazy evaluation and code optimization to run workloads up to 100x faster than MapReduce.
3. What are some Spark use cases?
Spark is used for ETL, data analysis, machine learning, fraud detection, and recommendations on large datasets.