
The Ultimate Data Science Cheat Sheet Every Data Scientist Should Have

Last updated:
29th Jan, 2021

For budding professionals and newcomers who are thinking of diving into the booming world of data science, we have compiled a quick cheat sheet to help you brush up on the basics and methodologies that underlie this field.

Data Science-The Basics

The data generated in our world is raw: numbers, codes, words, sentences, and so on. Data science processes this raw data with scientific methods and transforms it into meaningful forms that yield knowledge and insights.

Data

Before we dive into the tenets of data science, let’s talk a bit about data, its types, and data processing.

Types of Data

Structured – Data stored in a tabular format in databases. It can be numeric or text.

Unstructured – Data that has no definite structure and cannot be tabulated, such as free text, images, or audio.

Semi-structured – Mixed data with traits of both structured and unstructured data, such as JSON or XML documents.

Quantitative – Data with definite numeric values that can be measured or counted.

Big Data – Data sets so large that they are stored and processed across multiple computers or server farms. Biometric data and social media data are common examples. Big Data is characterised by the 4 V’s: volume, velocity, variety, and veracity.


Data Preprocessing

Data Classification – The process of categorizing or labeling data into classes such as numerical, text, image, or video.

Data Cleansing – Weeding out missing, inconsistent, or incompatible data, or replacing it using one of the following methods:

  1. Interpolation
  2. Heuristic
  3. Random Assignment
  4. Nearest Neighbour
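As an illustration of the first method, here is a minimal sketch of filling missing values by linear interpolation. The function name `fill_by_interpolation` and the sample readings are made up for this example, and missing values are assumed to be marked with `None`.

```python
def fill_by_interpolation(values):
    """Replace None entries with linearly interpolated values.

    Gaps at the start or end (no known neighbour on one side) are
    filled by copying the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one known value")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            # Linear interpolation between the two nearest known points.
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


# Hypothetical sensor readings with gaps.
readings = [10.0, None, 14.0, None, None]
print(fill_by_interpolation(readings))  # [10.0, 12.0, 14.0, 14.0, 14.0]
```

Interpolation suits ordered data such as time series. For unordered records, a nearest-neighbour or heuristic fill is usually a better fit.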

Data Masking – Hiding or masking confidential data to maintain the privacy of sensitive information while still being able to process it.
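Two common ways to mask a field can be sketched as follows. Both helper names and the sample values are hypothetical: the first hides all but the last digits of a number, and the second swaps a value for a salted hash so that records can still be grouped or joined without exposing the original.

```python
import hashlib


def mask_card_number(card_number, visible=4):
    """Hide every digit except the last `visible` ones."""
    digits = card_number.replace(" ", "")
    return "*" * (len(digits) - visible) + digits[-visible:]


def pseudonymize(value, salt):
    """Replace a value with a short, stable token derived from a salted SHA-256 hash."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


print(mask_card_number("4111 1111 1111 1234"))  # ************1234
# The same input and salt always give the same token, so joins still work.
print(pseudonymize("alice@example.com", salt="demo-salt"))
```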


What is Data Science Made of?

Concepts of Statistics

Regression

Linear Regression

Linear Regression is used to establish a relationship between two variables, such as supply and demand or price and consumption. It models the dependent variable Y as a linear function of the independent variable x as follows:

Y = f(x), or Y = mx + c, where m = coefficient (slope) and c = intercept
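The coefficient m and the constant c are usually estimated with ordinary least squares. The sketch below assumes that method (the article does not name one) and uses made-up data; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Estimate m and c in Y = m*x + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Toy data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, c = fit_line(xs, ys)
print(m, c)        # 2.0 1.0
print(m * 5 + c)   # prediction for x = 5
```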

Logistic regression

Logistic regression models the probability of a binary outcome rather than a linear relationship between variables. The model outputs a probability p between 0 and 1 that follows an S-shaped (sigmoid) curve, and the predicted class is obtained by thresholding p.

If p < 0.5, the predicted class is 0; otherwise it is 1.

Formula:

Y = e^(b0 + b1x) / (1 + e^(b0 + b1x))

where b0 = bias (intercept) and b1 = coefficient
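A minimal sketch of the formula and the 0.5 threshold in code. The values b0 = -4 and b1 = 2 are hypothetical and not fitted to any data; in practice they would be learned from training examples.

```python
import math


def predict_probability(x, b0, b1):
    """Logistic function: e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    # 1 / (1 + e^-z) is algebraically identical to e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, b0, b1, threshold=0.5):
    """Return 0 if the probability is below the threshold, else 1."""
    return 0 if predict_probability(x, b0, b1) < threshold else 1


b0, b1 = -4.0, 2.0  # assumed coefficients for illustration only
for x in (1.0, 2.0, 3.0):
    print(x, round(predict_probability(x, b0, b1), 3), predict_class(x, b0, b1))
```

With these coefficients the decision boundary sits at x = 2, where b0 + b1x = 0 and the probability is exactly 0.5.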

Probability

Probability helps to predict the likelihood of an event occurring. Some terminology:

Sample Space: The set of all possible outcomes

Event: A subset of the sample space

Random Variable: A function that maps each outcome in the sample space to a number

Probability Distributions

Discrete Distributions: Give the probability of each value of a random variable that can only take discrete (countable) values, such as integers

P[X=x] = p(x)
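As a small worked example, take two fair dice. The sample space is the 36 ordered pairs, the random variable X is the sum of the two faces, and p(x) is the share of pairs that produce each sum. The dice scenario is our own illustration, not part of the original text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely (die1, die2) outcomes.
sample_space = list(product(range(1, 7), repeat=2))

# Random variable X maps each outcome to the sum of the two faces.
counts = Counter(a + b for a, b in sample_space)

# Probability mass function p(x) = P[X = x].
pmf = {total: Fraction(n, len(sample_space)) for total, n in counts.items()}

print(pmf[7])                                      # 1/6
print(sum(pmf.values()))                           # 1
# Event "sum is at least 10" is a subset of the sample space.
print(sum(p for x, p in pmf.items() if x >= 10))   # 1/6
```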


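Continuous Distributions: Give the probability that a random variable falls within an interval of continuous values, rather than at individual discrete points. Formula:

$P[a \le X \le b] = \int_a^b f(x)\,dx$, where f(x) is the probability density function and a, b are the endpoints of the interval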
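The integral can be approximated numerically. The sketch below assumes a standard normal density (our choice of example distribution) and uses the midpoint rule; the closed-form check relies on `math.erf` from the standard library.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density f(x) of a normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def probability_between(pdf, a, b, steps=10_000):
    """Approximate the integral of pdf from a to b with the midpoint rule."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width


approx = probability_between(normal_pdf, -1.0, 1.0)
exact = math.erf(1 / math.sqrt(2))  # closed form for P[-1 <= X <= 1]
print(round(approx, 4), round(exact, 4))  # both about 0.6827
```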


Correlation and Covariance

Standard Deviation: A measure of how much the values in a dataset deviate from their mean

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$

Covariance 

Covariance measures how two random variables X and Y vary together around their respective means.

$\operatorname{Cov}(X, Y) = \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y$

Correlation

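Correlation measures the strength of the linear relationship between two variables together with its direction, positive or negative. It is the covariance divided by the product of the two standard deviations, so it always lies between -1 and +1.

$\rho_{XY} = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$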
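The three quantities can be computed directly from their definitions. This sketch uses the sample versions, dividing by N - 1 in both the standard deviation and the covariance so that the two are consistent; the function names and data are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_std(values):
    """Square root of the summed squared deviations divided by N - 1."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def sample_cov(xs, ys):
    """Average product of deviations from each mean, divided by N - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance scaled by both standard deviations."""
    return sample_cov(xs, ys) / (sample_std(xs) * sample_std(ys))


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # perfectly linear in xs
print(sample_std(xs), sample_cov(xs, ys), correlation(xs, ys))
```

Because ys is an exact multiple of xs, the correlation comes out at 1 (up to floating-point rounding). Reversing ys would give -1.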


Artificial Intelligence

The ability of machines to acquire knowledge and make decisions based on inputs is called Artificial Intelligence or simply AI.

Types

  1. Reactive Machines: Reactive machine AI reacts to predefined scenarios by narrowing down to the fastest and best options. These systems lack memory and are best for tasks with a defined set of parameters, where they are highly reliable and consistent.
  2. Limited Memory: This AI has some real-world observational and legacy data fed to it. It can learn and make decisions based on the given data but cannot gain new experiences.
  3. Theory of Mind: It is an interactive AI that can make decisions based on the behaviour of the surrounding entities.
  4. Self Awareness: This AI is aware of its existence and functioning apart from the surroundings. It can develop cognitive abilities and understand and evaluate the impacts of its own actions on the surroundings.


AI terms

Neural Networks

Neural Networks are networks of interconnected nodes that relay data and information through a system. NNs are modeled to mimic the neurons in our brains and can make decisions by learning and predicting.

Heuristics

Heuristics are rules of thumb for making quick predictions from approximations, estimates, and prior experience when the available information is patchy. They are fast but not guaranteed to be accurate or precise.

Case-Based Reasoning

The ability to learn from previous problem-solving cases and apply them to current situations to arrive at an acceptable solution.

Natural Language Processing

The ability of a machine to understand and interact directly in human speech or text. For example, voice commands in a car.

Machine Learning

Machine Learning is an application of AI in which models and algorithms learn from data to make predictions and solve problems.

Types

Supervised 

This method relies on input data that is paired with known output data. The machine is given a set of input variables X together with the target variable Y, and it learns to map X to Y under the supervision of an optimization algorithm that reduces the prediction error. Algorithms commonly used for supervised learning include Neural Networks, Random Forest, Deep Learning, and Support Vector Machines.

Unsupervised

In this method, input variables have no labeling or association, and algorithms work to find patterns and clusters resulting in new knowledge and insights.

Reinforced

Reinforced learning, more commonly called reinforcement learning, focuses on improving behaviour through trial and error. It is a reward-based method in which the machine gradually improves its actions to maximise a target reward.

Modeling Methods

Regression

Regression models output numbers: they predict continuous values by interpolating or extrapolating from the data they were trained on.

Classification 

Classification models output a class or label and are suited to predicting discrete outcomes such as ‘what kind’.

Both regression and classification are supervised models.

Clustering

Clustering is an unsupervised model that identifies clusters based on traits, attributes, features, etc.
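One widely used clustering algorithm is k-means, which the article does not name, so treat it here as one possible instance. This minimal one-dimensional sketch alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster. The initialisation (first k points) is deliberately naive and the data is made up.

```python
def k_means(points, k, iterations=10):
    """Cluster 1-D points into k groups; returns (centroids, clusters)."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 9.0, 2.0, 10.0, 1.5, 9.5]
centroids, clusters = k_means(points, k=2)
print(centroids)  # [1.5, 9.5]
print(clusters)   # [[1.0, 2.0, 1.5], [9.0, 10.0, 9.5]]
```

No labels are supplied at any point: the two groups emerge purely from how close the values are to one another.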

ML Algorithms

Decision Trees

Decision trees arrive at a solution by asking successive questions, one at each node. In the common binary form, every question has two possible answers, such as ‘Yes’ or ‘No’, and each answer leads either to another question or to a final outcome. Decision trees are simple to implement and interpret.
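The simplest possible tree has a single question, sometimes called a decision stump. The sketch below searches for the threshold on one numeric feature that misclassifies the fewest training points; a full tree would repeat this search on each resulting branch. The feature (hours studied), labels, and function name are hypothetical.

```python
def best_split(xs, labels):
    """Find the threshold t for the rule 'predict 1 if x > t else 0'
    that gives the fewest misclassified points."""
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


hours_studied = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]
threshold, errors = best_split(hours_studied, passed)
print(threshold, errors)  # 3 0  -> question: "studied more than 3 hours?"
```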

Random Forest or Bagging

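Random Forest is an ensemble of decision trees built with bagging: each tree is trained on a random sample of the data, which is what makes the structure dense and complex like a forest. Every tree produces its own prediction, and the forest combines them by majority vote (classification) or averaging (regression), which generally gives more accurate and stable results than a single tree.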

K- Nearest Neighbour (KNN)

kNN uses the proximity of existing data points to a new data point to predict which category the new point falls in. The new data point is assigned to the category that is most common among its k nearest neighbours.

k = number of nearest neighbours
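A minimal sketch of kNN classification with Euclidean distance and a majority vote. The training points and labels are made up, and `knn_predict` is an illustrative name.

```python
import math
from collections import Counter


def knn_predict(train, new_point, k=3):
    """train is a list of (point, label) pairs; points are tuples of numbers."""
    # Sort training points by Euclidean distance to the new point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
print(knn_predict(train, (2, 2)))  # A
print(knn_predict(train, (6, 5)))  # B
```

An odd k avoids ties when there are two classes. Larger values of k smooth the decision boundary at the cost of blurring small clusters.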

Naïve Bayes

Naïve Bayes rests on two pillars: first, the ‘naïve’ assumption that the features of a data point are independent of one another given its class, and second, Bayes’ theorem, which gives the probability of an outcome given an observed condition or hypothesis.

Bayes Theorem:

P(X|Y) = {P(Y|X) * P(X)} / P(Y)

Where P(X|Y) = Conditional probability of X given occurrence of Y

P(Y|X) = Conditional probability of Y given occurrence of X

P(X), P(Y) = Probability of X and Y individually
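A worked example with made-up numbers: let X be "the email is spam" and Y be "the email contains the word 'offer'". Assume P(X) = 0.2, P(Y|X) = 0.6, and P(Y|not X) = 0.05. P(Y) then follows from the law of total probability, and Bayes' theorem gives P(X|Y).

```python
def bayes_posterior(p_y_given_x, p_x, p_y):
    """P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y


# Hypothetical figures for illustration only.
p_spam = 0.2
p_word_given_spam = 0.6
p_word_given_ham = 0.05

# Total probability of seeing the word in any email.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = bayes_posterior(p_word_given_spam, p_spam, p_word)
print(round(p_word, 2), round(p_spam_given_word, 2))  # 0.16 0.75
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75.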

Support Vector Machines

This algorithm tries to separate the classes of data in space with a boundary, which can be a line, a plane, or their higher-dimensional equivalent. This boundary is called a ‘hyperplane’ and is defined by the nearest data points of each class, which in turn are called ‘support vectors’. The distance between the hyperplane and the support vectors on either side is called the margin, and the algorithm chooses the hyperplane that maximizes it.
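To make the geometry concrete, the sketch below measures how far each point lies from a given hyperplane w·x + b = 0 and reports the smallest distance, i.e. the gap between the boundary and its nearest points. The hyperplane here is hand-picked for illustration rather than learned; a real SVM would search for the w and b that make this gap as large as possible.

```python
import math


def distance_to_hyperplane(point, w, b):
    """Perpendicular distance from a point to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))


# Hand-picked boundary x1 + x2 - 4 = 0 (an assumption, not a fitted model).
w, b = (1.0, 1.0), -4.0
class_a = [(1, 1), (0, 2), (0, 0)]
class_b = [(3, 3), (4, 3), (5, 5)]

distances = {p: distance_to_hyperplane(p, w, b) for p in class_a + class_b}
margin = min(distances.values())
# Points lying exactly at the margin play the role of support vectors.
support_vectors = [p for p, d in distances.items() if math.isclose(d, margin)]
print(round(margin, 3), support_vectors)
```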

Neural Networks

Perceptron

The perceptron is the most fundamental neural network. It computes a weighted sum of its inputs and outputs 1 if that sum exceeds a threshold value, and 0 otherwise.
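A minimal sketch of a perceptron trained with the classic perceptron learning rule, using the logical AND function as toy data. The threshold is expressed as a bias term, which is a common equivalent formulation; the learning rate and epoch count are arbitrary choices.

```python
def predict(weights, bias, inputs):
    """Output 1 if the weighted sum plus bias is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule: nudge weights by lr * error * input."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Truth table of logical AND, which is linearly separable.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND)
print([predict(weights, bias, inputs) for inputs, _ in AND])  # [0, 0, 0, 1]
```

A single perceptron can only learn linearly separable functions. XOR, for instance, needs at least one hidden layer.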

Feed Forward Neural Network 

FFN is the simplest type of network: data flows in only one direction, from input to output. It may or may not have hidden layers.

Convolutional Neural Networks  

CNN uses convolution layers, which slide small filters over local regions of the input data to extract features, followed by pooling layers that downsample those features before the output is produced.
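The two operations can be sketched in one dimension. The filter values here are hand-picked to detect a rising edge; in a real CNN they would be learned. As in most deep learning libraries, the "convolution" slides the filter without flipping it.

```python
def convolve1d(signal, kernel):
    """Slide the kernel over the signal and take a weighted sum at each position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(values, size=2):
    """Keep only the largest value in each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


signal = [0, 0, 1, 1, 0, 0]
edge_filter = [-1, 1]            # responds where the signal steps up
features = convolve1d(signal, edge_filter)
print(features)                  # [0, 1, 0, -1, 0]
print(max_pool1d(features))      # [1, 0]
```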

Recurrent Neural Networks

RNN contains recurrent layers between the I/O layers that can store ‘historic’ data in a hidden state. Besides flowing forward, the output of a recurrent layer at one step is fed back into it at the next step, so earlier inputs can influence later predictions.
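A single recurrent unit can be sketched with scalars: the hidden state h is updated at every step from the current input and the previous state. The tanh activation and the weight values below are assumptions chosen for illustration, not trained parameters.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run h_t = tanh(w_x * x_t + w_h * h_(t-1) + b) over a sequence."""
    h = 0.0            # initial hidden state: no history yet
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# A single pulse followed by silence: the state decays but is not forgotten at once.
print([round(h, 3) for h in rnn_forward([1.0, 0.0, 0.0])])
```

The second and third states stay above zero even though their inputs are zero, which is the "memory" of the first input carried through the hidden state.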

Deep Neural Networks and Deep Learning

DNN is a network with multiple hidden layers between I/O layers. The hidden layers apply successive transformations to the data before sending it to the output layer.
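The successive transformations can be sketched as a forward pass through two hidden layers and an output layer. All weights and biases are hard-coded, hypothetical values; the ReLU and sigmoid activations are common choices, not something the text prescribes.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per neuron, then activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x):
    """Input -> hidden layer 1 -> hidden layer 2 -> single output in (0, 1)."""
    h1 = dense(x, [[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1], relu)
    h2 = dense(h1, [[1.0, -1.0], [0.3, 0.8]], [0.0, 0.0], relu)
    return dense(h2, [[0.7, -0.5]], [0.0], sigmoid)[0]


print(forward([1.0, 2.0]))
```

Training would adjust these weights with backpropagation, which is left out of this sketch.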

‘Deep Learning’ is facilitated through DNNs. It can handle huge amounts of complex data and achieve high accuracy because of the multiple hidden layers.


Conclusion

Data science is a vast field that runs through different streams but comes across as a revolution and a revelation for us. Data science is booming and will change how our systems work and feel in the future.

If you are curious to learn about data science, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.

Profile

Rohit Sharma

Blog Author
Rohit Sharma is the Program Director for the UpGrad-IIIT Bangalore, PG Diploma Data Analytics Program.

Frequently Asked Questions (FAQs)

1. Which programming language is best suited for Data Science and why?

3. This language is by far one of the most convenient and easy to write languages with a clean syntax which improves its readability.

2. What are the concepts that make data science complete?

Machine Learning
Machine Learning is another crucial component of Data Science that deals with teaching machines to predict future outcomes based on the provided data. Machine learning has three prominent modelling methods- Clustering, regression, and Classification.

3. Describe the types of Machine Learning.

Machine Learning, or simply ML, has three major types based on their working methods. These types are as follows:
1. Supervised Learning
This is the most primitive type of ML where the input data is labelled. The machine is provided with a smaller set of data that gives the machine an insight into the problem and is trained over it.
2. Unsupervised Learning
The biggest advantage of this type is that the data is unlabelled here and the human labour is almost negligible. This opens the gate for much larger datasets to be introduced to the model.
3. Reinforced Learning
This is the most advanced type of ML and is inspired by the way human beings learn. Desired outputs are reinforced while useless outputs are discouraged.
