Pranjal Yadav

2+ articles published

Experienced Mentor / Insightful Adviser / Creative Thinker

Domain:

upGrad

Current role in the industry:

Engineering Manager - AI/ML at Razorpay

Educational Qualification:

Dual degree from the Indian Institute of Technology, Kharagpur (2010 - 2015)

Expertise:

Engineering Management

Budgeting

Statistics

Agile Methodologies

Recommender Systems

Certifications:

Google Summer of Code 2015

Coursera: Introduction to Recommender Systems

About

A data scientist and deep learning researcher at Amazon, Pranjal's areas of interest are cognitive computing, AI, parallelisation and perpetual system designs for advanced analytics. Pranjal is a fast.ai student and an active contributor to AI library development. With a dual degree from IIT Kharagpur, he is a passionate blogger in the field of deep learning and enjoys trekking in his free time.

Published

Most Popular

15+ Apache Spark Interview Questions & Answers 2024
Blogs | 5428 views

Anyone familiar with Apache Spark knows why it is becoming one of the most preferred Big Data tools today – it allows for super-fast computation. The fact that Spark supports speedy Big Data processing is making it a hit with companies worldwide. From big names like Amazon, Alibaba, eBay, and Yahoo to small firms, Spark has gained an enormous following. Thanks to this, companies are continually looking for skilled Big Data professionals with domain expertise in Spark. If you wish to land a job with a Big Data (Spark) profile, you must first crack the Spark interview. Here is something that can get you a step closer to your goal – the 15 most commonly asked Apache Spark interview questions!

What is Spark?

Spark is an open-source, cluster-computing Big Data framework that allows real-time processing. It is a general-purpose data processing engine capable of handling different workloads: batch, interactive, iterative, and streaming. Spark executes in-memory computations that boost the speed of data processing. It can run standalone, on Hadoop, or in the cloud.

What is an RDD?

RDD, or Resilient Distributed Dataset, is the primary data structure of Spark. It is an essential abstraction that represents input data as objects. An RDD is a read-only, immutable collection of objects, partitioned into smaller chunks that can be computed on different nodes of a cluster to enable independent, parallel data processing.

Differentiate between Apache Spark and Hadoop MapReduce.

The key differentiators between Apache Spark and Hadoop MapReduce are:

Spark is easier to program and does not require extra abstractions. MapReduce is written in Java and is harder to program; it needs abstractions.

Spark has an interactive mode, whereas MapReduce lacks one (though tools like Pig and Hive make it easier to work with MapReduce).
Spark allows for batch processing, streaming, and machine learning within the same cluster, while MapReduce is best suited for batch processing.

Spark can process data in real time via Spark Streaming. There is no such real-time provision in MapReduce – you can only process a batch of stored data.

Spark facilitates low-latency computations by caching partial results in memory, which requires more memory space. MapReduce, by contrast, is disk-oriented, which allows for permanent storage.

Since Spark can execute processing tasks in memory, it can process data much faster than MapReduce.

What is a Sparse Vector?

A sparse vector comprises two parallel arrays, one for indices and the other for values. Sparse vectors store only the non-zero entries, saving memory space.

What is Partitioning in Spark?

Partitioning is used to create smaller, logical units of data to help speed up data processing. In Spark, everything is a partitioned RDD. Partitions parallelise distributed data processing while minimising the network traffic needed to send data to the various executors in the system.

Define Transformation and Action.

Both Transformations and Actions are operations executed on an RDD. Applying a Transformation to an RDD creates another RDD. Two examples of transformations are map() and filter(): map() applies the function passed to it to each element of the RDD and creates another RDD, while filter() creates a new RDD by selecting the elements of the current RDD for which the function argument returns true. Transformations are lazy – they execute only when an Action occurs.

An Action retrieves the data from the RDD to the local machine. It triggers execution by using the lineage graph to load the data into the original RDD, perform all intermediate transformations, and return the final results to the Driver program or write them out to the file system.
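Transformations in Spark are lazy: they only record what should happen, and nothing runs until an action forces evaluation. As a rough illustration of that model – plain Python, not the actual Spark API – a transformation can simply build a new object that remembers its parent and the deferred function, while an action walks back to the source and computes:

```python
# Toy illustration of lazy transformations vs. eager actions.
# This is NOT the Spark API -- just a sketch of the evaluation model.

class ToyRDD:
    def __init__(self, data, parent=None, op=None):
        self._data = data        # only set for the source RDD
        self._parent = parent    # lineage: where this RDD came from
        self._op = op            # deferred function to apply later

    # Transformations: return a new ToyRDD, compute nothing yet.
    def map(self, fn):
        return ToyRDD(None, parent=self, op=lambda xs: [fn(x) for x in xs])

    def filter(self, pred):
        return ToyRDD(None, parent=self, op=lambda xs: [x for x in xs if pred(x)])

    # Action: walk the lineage back to the source and evaluate everything.
    def collect(self):
        if self._parent is None:
            return list(self._data)
        return self._op(self._parent.collect())

rdd = ToyRDD(range(1, 6))
result = rdd.map(lambda x: x * 10).filter(lambda x: x > 20).collect()
print(result)  # [30, 40, 50]
```

Until collect() is called, the map() and filter() calls above do no work at all – they only extend the chain of deferred operations.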
What is a Lineage Graph?

In Spark, RDDs depend on one another. The graphical representation of these dependencies among the RDDs is called a lineage graph. With the information in the lineage graph, each RDD can be computed on demand – if a chunk of a persisted RDD is ever lost, the lost data can be recovered using the lineage information.

What is the purpose of Spark Core?

Spark Core is the base engine of Spark. It performs a host of vital functions, such as fault tolerance, memory management, job monitoring, job scheduling, and interaction with storage systems.

Name the major libraries of the Spark ecosystem.

The major libraries in the Spark ecosystem are:

Spark Streaming – enables real-time data streaming.

Spark MLlib – Spark's machine learning library, providing commonly used learning algorithms for classification, regression, clustering, etc.

Spark SQL – helps execute SQL-like queries on Spark data and integrates with standard visualisation and business intelligence tools.

Spark GraphX – the Spark API for graph processing, used to develop and transform interactive graphs.
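MLlib represents feature vectors in exactly the sparse layout described earlier: two parallel arrays of indices and values, holding only the non-zero entries. A plain-Python sketch of the idea (not the actual MLlib SparseVector API):

```python
# Sketch of a sparse vector: two parallel arrays (indices, values)
# storing only the non-zero entries of a logically dense vector.

class ToySparseVector:
    def __init__(self, size, indices, values):
        assert len(indices) == len(values)
        self.size = size
        self.indices = list(indices)   # positions of the non-zero entries
        self.values = list(values)     # the non-zero entries themselves

    def to_dense(self):
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense

# A 10-element vector with only three non-zero entries:
# we store 6 numbers instead of 10.
v = ToySparseVector(10, indices=[0, 4, 9], values=[1.0, 5.0, 2.0])
print(v.to_dense())  # [1.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 2.0]
```

The memory saving grows with sparsity: a million-element vector with a handful of non-zero entries needs only a handful of stored pairs.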
What is YARN? Is it required to install Spark on all nodes of a YARN cluster?

YARN is a central resource-management platform that enables the delivery of scalable operations across a Spark cluster. While Spark is the data processing tool, YARN is the distributed container manager. Just as Hadoop MapReduce can run on YARN, so can Spark.

It is not necessary to install Spark on all nodes of a YARN cluster, because Spark runs on top of YARN independently of where it is installed. Spark also offers several configurations for running on YARN, such as master, queue, deploy-mode, driver-memory, executor-memory, and executor-cores.

What is the Catalyst framework?

Catalyst is the optimisation framework in Spark SQL. Its main purpose is to let Spark automatically transform SQL queries by applying optimisations that yield a faster processing system.

What are the different types of cluster managers in Spark?

The Spark framework supports three types of cluster managers:

Standalone – Spark's built-in cluster manager, used to configure a basic cluster.

Apache Mesos – a generalised cluster manager that can run Hadoop MapReduce and other applications as well.

YARN – the cluster manager that handles resource management in Hadoop.

What is a Worker Node?

A Worker Node is the "slave node" to the Master Node – any node that can run application code in a cluster. The master node assigns work to the worker nodes, which perform the assigned tasks, process the data stored on them, and report back to the master node.

What is a Spark Executor?

A Spark Executor is a process that runs computations and stores data on a worker node. Whenever the SparkContext connects to a cluster manager, it acquires Executors on the nodes of the cluster. These Executors run the tasks assigned to them by the SparkContext.

What is a Parquet file?

Parquet is a columnar file format that Spark SQL supports for both read and write operations. The columnar format has several advantages: it consumes less space, keeps I/O operations in check, makes it easy to access specific columns, and uses type-specific encoding to deliver better-summarised data.

There – we have eased you into Spark. These 15 fundamental concepts will help you get started with Spark. If you are interested in knowing more about Big Data, check out our Advanced Certificate Programme in Big Data from IIIT Bangalore, and our other Software Engineering Courses at upGrad.
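The benefit of a columnar layout like Parquet's can be seen with a toy contrast (plain Python, no Parquet library involved): in a row-oriented store, reading one field still touches every full record, whereas in a column-oriented store each field lives in its own contiguous array and can be read on its own:

```python
# Toy contrast of row-oriented vs. column-oriented (Parquet-style) storage.

rows = [
    {"name": "a", "age": 30, "city": "Pune"},
    {"name": "b", "age": 25, "city": "Delhi"},
    {"name": "c", "age": 41, "city": "Goa"},
]

# Row store: extracting all ages means scanning every full record.
ages_from_rows = [r["age"] for r in rows]

# Column store: each column is its own contiguous array, so a single
# column can be read (and compressed with type-specific encoding)
# without touching the others.
columns = {
    "name": ["a", "b", "c"],
    "age": [30, 25, 41],
    "city": ["Pune", "Delhi", "Goa"],
}
ages_from_columns = columns["age"]

print(ages_from_rows == ages_from_columns)  # True
```

This is why analytical queries that project a few columns out of wide tables run markedly faster against Parquet than against row-oriented formats.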

by Pranjal Yadav

08 Jan 2021

They Say Data is the New Oil – Is it Really True?
Blogs | 5973 views

You must be wondering – data professionals are in high demand and growing more popular each day, but what makes them so special? What are the perks of a data-specific career, and how can you get there? Let's understand it all.

What exactly is Big Data?

On a normal day, each of us individually generates about 10 GB of data through our calls, social media usage, pictures, location traces, shopping bills and much more. Accumulating that for every individual with access to technology, we are looking at billions of GB of data generated across the world. If that sounds big enough to you, we can simply call it "Big Data".

Ok, there is lots of data. So what?

Researchers and tech giants understood the importance of this data and started hunting for people who can handle, explore and utilise it. This gave birth to three new and wildly in-demand titles: Data Engineer, Data Analyst and Data Scientist.

Who is a Data Scientist, a Data Analyst and a Data Engineer?

Data Scientist
Role: clean, organise and generate insights from (big) data
Mindset: create AI using big data
Skills: machine learning, distributed computing and data visualisation
Tools and languages: Python, R, SQL, Spark, MapReduce
Average salary: $125,000

Data Engineer
Role: manage, protect, centralise and integrate data systems and sources
Mindset: design data architecture with ease
Skills: data warehousing, database architectures, Extract-Transform-Load (ETL) jobs and system management
Tools and languages: Hadoop, Spark, Hive, Pig, SQL
Average salary: $110,000

Data Analyst
Role: collect, process and perform statistical analysis on data
Mindset: derive insights from data
Skills: communication and visualisation, spreadsheet tools and business-driven intelligence
Tools and languages: SAS, VBA, R, Excel, Tableau
Average salary: $90,000

Cool, we understand what they do. But why the hype?
Let's take a look at some success stories.

Xerox: After adopting a Big Data solution, the company shifted its approach, reorganising its hiring paradigm and lowering support-personnel attrition rates by 20%, saving the company millions of dollars in the long term.

IBM: By acquiring The Weather Company and harnessing the Big Data collected from more than 100,000 weather-monitoring sensors, specialised aircraft, apps in gadgets and various other devices, IBM Watson benefits from more than 2.2 billion unique data-gathering points, enabling constant weather monitoring. Weather-related losses and damages account for nearly $500,000,000 annually in the US alone.

PayPal: Developed an automated fraud-detection system that analyses billions of records and achieves an industry-leading loss rate of 0.5%.

Tesla: By analysing the enormous data generated by the onboard computers in each car, engineers can predict part failures and potential safety issues, and trigger automated controls and emergency lockdowns.

You may think these are giants that can afford billions of dollars! What about small-scale businesses – is a Big Data investment worth it for them? There are quite a few ready-made solutions available in the market, as well as services from cloud providers like AWS, Google Cloud and Azure, all of which make Big Data analytics tools quite affordable. Regular mid-sized startups benefit from these services; some popular Indian companies utilising cloud services are Freshdesk, SigTuple and ParallelDots. Startups turned tech giants like Google, Facebook, Apple and Microsoft have set up AI labs solely to advance research on handling Big Data and creating wonders out of it.
Big Data Applications That Surround You

Why are professionals in data so special?

Let's dive deeper into the mindset of a data professional. The most challenging and unique skill that separates a great data professional from other DSA (Data Science and Analytics) roles is the ability to use big data. With such huge amounts of data flowing in every second, it becomes hard to process and extract meaningful information on the fly. Data professionals solve this big-data problem using a combination of advanced algorithms and technology, popularly called machine learning or deep learning, which is increasingly pacing its way into our daily lives. Google Assistant, Siri, Alexa, Cortana, Prisma, Snapchat filters, and Facebook's location and face tagging are all examples of machine learning. At the core of all these technologies lies big data, providing the fuel for deep learning and Artificial Intelligence.
What makes Big Data the new oil?

Big data, in its enormity, is a vast ocean of untapped opportunity. Tech giants are investing billions of dollars to drill down and extract the vital information hidden deep within it. In direct comparison to a very similar vital resource of the modern economy – crude oil – big data is indeed the new oil, the fuel of the future.

"Artificial Intelligence is the new electricity", says Andrew Ng (AI scientist), explaining the power Artificial Intelligence brings to modern-day solutions. He further affirms the importance of big data, placing its value parallel to oil and making it the "new oil" of the 21st century.

Data professionals see big data as oil (for its value) and develop the expertise to extract it, process it and convert it into insights and solutions that cater not only to companies but to everyone!

Recently, Google processed vast amounts of telescope data and found two new exoplanets hidden from the eyes of interstellar researchers. Various startups have developed imaging solutions that detect cancer better than the best radiologists in the world. Autonomous vehicles are getting popular, and Google has launched earphones that live-translate any language into one you understand. All these breakthroughs were made possible by big data.

Now, let's answer the golden question:
How can you transform into a data professional?

To become a better data professional – be it an analyst, data scientist or data engineer – one needs to understand the power and potential of big data. It is advantageous for an aspiring data professional to get their hands dirty with big-data techniques, and a deep understanding of machine learning algorithms is a must! Feel free to review the following courses, precisely aimed at advancing knowledge in the data domain:

PG Program in Big Data Engineering with BITS Pilani

PG Diploma in Data Science with IIIT-Bangalore

PG Diploma in Machine Learning and Artificial Intelligence

Got it! What are my options?

Get acquainted with big data technologies and familiarise yourself with tools like Spark, Hive, Hadoop, YARN, HBase and MapReduce. You can then decide whether to proceed as a data scientist, data analyst or data engineer. You will find endless opportunities and high-paying jobs across the globe with tech giants and other MNCs harnessing the power of big data.

Data is being called the new oil. It is changing the meaning of analytics and advancing the Artificial Intelligence revolution every day. Data is fuelling the future as we speak, and getting on board this long-sailing ship is a good idea.

by Pranjal Yadav

05 Jan 2018
