The Ultimate Data Science Cheat Sheet Every Data Scientist Should Have

For budding professionals and newcomers who are thinking of diving into the booming world of data science, we have compiled a quick cheat sheet to brush you up on the basics and methodologies that underlie this field.

Data Science: The Basics

The data generated in our world starts out in raw form, i.e., numbers, codes, words, sentences, etc. Data science processes this raw data using scientific methods, transforming it into meaningful forms from which we can gain knowledge and insights.

Data

Before we dive into the tenets of data science, let’s talk a bit about data, its types, and data processing.

Types of Data

Structured – Data that is stored in a tabulated format in databases. It can be either numeric or text.

Unstructured – Data that cannot be tabulated and has no definitive structure to speak of.

Semi-structured – Mixed data with traits of both structured and unstructured data.

Quantitative – Data with definite numeric values that can be quantified.

Big Data – Data stored in huge databases spanning multiple computers or server farms. Biometric data, social media data, etc. are considered Big Data. Big Data is commonly characterised by four V’s: volume, velocity, variety, and veracity.

Data Preprocessing

Data Classification – It’s the process of categorizing or labeling data into classes such as numerical, textual, image, or video.

Data Cleansing – It consists of weeding out missing, inconsistent, or incompatible data, or replacing such data using one of the following methods:

  1. Interpolation
  2. Heuristic
  3. Random Assignment
  4. Nearest Neighbour
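As a rough illustration of the first method, the sketch below fills gaps by linear interpolation between the nearest known values. The article does not say which kind of interpolation is meant, so the linear variant, the function name interpolate_missing, and the sample readings are all assumptions made for this example.

```python
from typing import List, Optional


def interpolate_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill None gaps by linear interpolation between the nearest known values.

    Gaps at the very start or end are left as None, because they have a
    known neighbour on one side only.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of consecutive known positions and fill the gap.
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / gap
            filled[i] = values[left] + fraction * (values[right] - values[left])
    return filled


# Hypothetical sensor readings with missing entries.
readings = [10.0, None, None, 16.0, None, 20.0]
print(interpolate_missing(readings))
```

Leaving edge gaps untouched is a deliberate choice here: filling them would be extrapolation, which is a different and riskier assumption.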

Data Masking – Hiding or masking out confidential data to maintain the privacy of sensitive information while still being able to process it.
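Two common ways to mask a field can be sketched as follows: hiding all but the last few characters, and replacing a value with a salted hash so records can still be joined without exposing the original. The function names, the salt, and the sample values are hypothetical, and a real system would need a properly managed secret rather than a hard-coded salt.

```python
import hashlib


def mask_card_number(card_number: str) -> str:
    """Keep only the last four digits visible; replace the rest with '*'."""
    digits = [c for c in card_number if c.isdigit()]
    hidden = max(len(digits) - 4, 0)
    return "*" * hidden + "".join(digits[-4:])


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with a short salted SHA-256 digest.

    The same input always maps to the same token, so masked records can
    still be grouped or joined. This is pseudonymization, not encryption.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


print(mask_card_number("4111 1111 1111 1234"))
print(pseudonymize("jane@example.com", salt="demo-salt"))
```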

What is Data Science Made of?

Concepts of Statistics

Regression

Linear Regression

Linear Regression is used to establish a relationship between two variables such as supply and demand, price and consumption, etc. It models one variable Y as a linear function of another variable x as follows:

Y = f(x) or Y = mx + c, where m = coefficient (slope) and c = intercept
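One way to estimate m and c from data is ordinary least squares, sketched below in plain Python. The article does not name a fitting method, so least squares is an assumption here, as are the function name fit_line and the sample points.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (m, c) for the least-squares line Y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical points that lie exactly on Y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, c = fit_line(xs, ys)
print(m, c)  # 2.0 1.0
```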

Logistic regression

Logistic regression establishes a probabilistic relationship rather than a linear one between variables. The predicted answer is either 0 or 1: the model outputs a probability p that follows an S-shaped curve, and that probability is then compared with a threshold.

If p < 0.5, then it’s 0, else 1

Formula:

Y = e^(b0 + b1x) / (1 + e^(b0 + b1x))

where Y is the probability p, b0 = bias (intercept) and b1 = coefficient
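The formula and the 0.5 threshold can be sketched in a few lines. The code uses the algebraically equivalent form 1 / (1 + e^-(b0 + b1x)); the function names and the coefficient values are made up for illustration and are not fitted to any data.

```python
import math


def predict_probability(x: float, b0: float, b1: float) -> float:
    """Logistic function: equals e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x: float, b0: float, b1: float, threshold: float = 0.5) -> int:
    """Return 0 if the probability is below the threshold, else 1."""
    return 0 if predict_probability(x, b0, b1) < threshold else 1


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -4.0, 2.0
for x in (0.0, 2.0, 4.0):
    print(x, predict_probability(x, b0, b1), predict_class(x, b0, b1))
```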

Probability

Probability helps to predict the likelihood of an event occurring. Some terminology:

Sample Space: The set of all possible outcomes

Event: A subset of the sample space

Random Variable: A function that maps each outcome in the sample space to a number, so that outcomes can be quantified

Probability Distributions

Discrete Distributions: Give the probability of each value in a set of discrete (countable) values, such as integers

P[X=x] = p(x)

Continuous Distributions: Give the probability over continuous intervals instead of at discrete values, using a probability density function f(x). Formula:

$$P[a \le X \le b] = \int_a^b f(x)\,dx$$, where a and b are the endpoints of the interval
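Both kinds of distribution can be illustrated with a short sketch: a fair six-sided die as the discrete case, and a numerically integrated standard normal density as the continuous case. The choice of examples, the trapezoidal rule, and all names below are assumptions made for illustration.

```python
import math
from fractions import Fraction

# Discrete: P[X = x] = p(x) for a fair six-sided die.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
p_even = sum(p for face, p in die_pmf.items() if face % 2 == 0)
print(p_even)  # 1/2


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density f(x) of a normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the integral of f from a to b with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


# Continuous: P[-1 <= X <= 1] for a standard normal, roughly 0.68.
p_within_one_sigma = integrate(normal_pdf, -1.0, 1.0)
print(p_within_one_sigma)
```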

Correlation and Covariance

Standard Deviation: The spread of the values in a dataset around its mean value

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$, where $\bar{x}$ is the mean of the N values

Covariance 

It measures how two random variables X and Y vary together around their respective means.

$$\mathrm{Cov}(X, Y) = \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y$$

Correlation

Correlation defines the extent of a linear relationship between variables along with its direction, positive or negative.

$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$
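The three quantities in this section can be computed directly from their definitions. The sketch below uses the sample versions (dividing by N - 1, matching the standard deviation formula), which is an assumption: the covariance formula is written with expectations, and its population form would divide by N. Function names and data are illustrative.

```python
import math
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def sample_std(values: Sequence[float]) -> float:
    """Standard deviation with the N - 1 denominator."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def sample_covariance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Average product of deviations from the means (N - 1 denominator)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Covariance scaled by both standard deviations; lies in [-1, 1]."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical data where ys is exactly twice xs, so correlation is about 1.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(sample_std(xs), sample_covariance(xs, ys), correlation(xs, ys))
```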

Artificial Intelligence

The ability of machines to acquire knowledge and make decisions based on inputs is called Artificial Intelligence or simply AI.

Types

  1. Reactive Machines: Reactive machine AI works by learning to react to predefined scenarios by narrowing down to the fastest and best options. These machines lack memory and are best for tasks with a defined set of parameters, where they are highly reliable and consistent.
  2. Limited Memory: This AI has some real-world observational and legacy data fed to it. It can learn and make decisions based on the given data but cannot gain new experiences.
  3. Theory of Mind: It is an interactive AI that can make decisions based on the behaviour of the surrounding entities.
  4. Self Awareness: This AI is aware of its existence and functioning apart from the surroundings. It can develop cognitive abilities and understand and evaluate the impacts of its own actions on the surroundings.

AI terms

Neural Networks

Neural Networks are networks of interconnected nodes that relay data and information in a system. NNs are modeled to mimic the neurons in our brains and can make decisions by learning and predicting.

Heuristics

Heuristics is the ability to make quick predictions from approximations and estimates, using prior experience in situations where the available information is patchy. It’s quick, but not guaranteed to be accurate or precise.

Case-Based Reasoning

The ability to learn from previous problem-solving cases and apply them in current situations to arrive at an acceptable solution.

Natural Language Processing

It’s simply the ability of a machine to understand and interact directly in human speech or text, for example, voice commands in a car.

Machine Learning

Machine Learning is an application of AI in which models and algorithms learn from data to predict and solve problems.

Types

Supervised 

This method relies on input data that is associated with known output data. The machine is given a target variable Y and has to learn to arrive at it from a set of input variables X, under the supervision of an optimization algorithm that reduces the prediction error. Examples of supervised learning algorithms are Neural Networks, Random Forest, Deep Learning, Support Vector Machines, etc.

Unsupervised

In this method, input variables have no labeling or association, and algorithms work to find patterns and clusters resulting in new knowledge and insights.

Reinforcement

Reinforcement learning focuses on improving the learning behaviour through trial and error. It is a reward-based method where the machine gradually improves its techniques to win a target reward.

Modeling Methods

Regression

Regression models give a number as output by interpolating or extrapolating from continuous data, and are better at predicting outcomes like ‘how much’.

Classification 

Classification models come up with outputs as a class or label and are better at predicting discrete outcomes like ‘what kind’.

Both regression and classification are supervised models.

Clustering

Clustering is an unsupervised model that identifies clusters based on traits, attributes, features, etc.
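As one concrete example of clustering, the sketch below implements a bare-bones k-means on 2D points. The article does not name a specific clustering algorithm, so k-means is an assumption, as are the naive initialisation (the first k points become the starting centroids) and the sample data.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points: List[Point], k: int, iterations: int = 20):
    """Return (centroids, assignments) after a fixed number of iterations."""
    centroids = list(points[:k])  # naive initialisation, assumed for brevity
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# Hypothetical data: two visibly separate groups of unlabeled points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, assignments = k_means(points, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

No labels are passed in at any point, which is what makes the method unsupervised: the groups emerge from distances alone.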

ML Algorithms

Decision Trees

Decision trees arrive at a solution by asking successive questions, each with two possible answers such as ‘Yes’ or ‘No’, and following the matching branch at each stage until an outcome is reached. Decision trees are simple to implement and interpret.

Random Forest (Bagging)

Random Forest is an ensemble extension of decision trees built on bagging (bootstrap aggregating). It trains a large number of decision trees, each on a random sample of the data, which makes the structure dense and complex like a forest. It then combines the outcomes of the individual trees, by majority vote or averaging, which usually leads to more accurate results than a single tree.

K-Nearest Neighbour (KNN)

KNN makes use of the proximity of the nearest data points on a plot relative to a new data point to predict which category it falls in. The new data point gets assigned to the category that is most common among its k nearest neighbours.

k = number of nearest neighbours
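A minimal sketch of this voting procedure, assuming Euclidean distance and a simple majority vote (the article specifies neither). The function name knn_predict and the labelled points are made up for the example.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

LabelledPoint = Tuple[Sequence[float], str]


def knn_predict(training: List[LabelledPoint], query: Sequence[float], k: int = 3) -> str:
    """Label the query with the most common label among its k nearest points."""
    # Sort the training points by Euclidean distance to the query.
    neighbours = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points forming two groups, "A" and "B".
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]
print(knn_predict(training, (2.0, 2.0)))  # A
print(knn_predict(training, (6.0, 6.5)))  # B
```

An odd k is a common choice for two-class problems because it avoids tied votes.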

Naïve Bayes

Naïve Bayes rests on two pillars: first, the ‘naïve’ assumption that the features of a data point are independent of each other given its class, and second, Bayes’ theorem, which predicts the probability of an outcome based on a given condition or hypothesis.

Bayes Theorem:

P(X|Y) = {P(Y|X) * P(X)} / P(Y)

Where P(X|Y) = Conditional probability of X given occurrence of Y

P(Y|X) = Conditional probability of Y given occurrence of X

P(X), P(Y) = Probability of X and Y individually
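A worked example may make the theorem concrete. Here X is "has the condition" and Y is "tests positive"; all of the probabilities are invented for illustration, and P(Y) is obtained with the law of total probability.

```python
def bayes_posterior(prior_x: float, likelihood_y_given_x: float, evidence_y: float) -> float:
    """P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    return likelihood_y_given_x * prior_x / evidence_y


# Hypothetical numbers, chosen only for this example.
p_condition = 0.01              # P(X): 1% of people have the condition
p_pos_given_condition = 0.95    # P(Y|X): test detects 95% of true cases
p_pos_given_healthy = 0.05      # false positive rate among healthy people

# P(Y): overall chance of a positive test, from both groups.
p_positive = (
    p_pos_given_condition * p_condition
    + p_pos_given_healthy * (1 - p_condition)
)

posterior = bayes_posterior(p_condition, p_pos_given_condition, p_positive)
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low in this example because the condition itself is rare, which is exactly the kind of effect the P(X) term captures.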

Support Vector Machines

This algorithm tries to separate classes of data in space with a boundary, which can be a line, a plane, or their higher-dimensional equivalent. This boundary is called a ‘hyperplane’ and is defined by the nearest data points of each class, which in turn are called ‘support vectors’. The distance between the hyperplane and the support vectors on either side is called the margin, and the algorithm picks the hyperplane that makes this margin as large as possible.

Neural Networks

Perceptron

The perceptron is the fundamental neural network unit. It takes weighted inputs, sums them, and outputs 1 or 0 depending on whether the sum crosses a threshold value.
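A perceptron with the classic error-driven update rule can be sketched as follows. The training rule, the learning rate, and the choice of the logical AND function as a toy problem are assumptions for illustration; the threshold is folded into a bias term.

```python
from typing import List, Sequence, Tuple


def step(value: float) -> int:
    """Threshold activation: fire (1) when the weighted sum is non-negative."""
    return 1 if value >= 0 else 0


def predict(weights: Sequence[float], bias: float, inputs: Sequence[float]) -> int:
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(
    samples: List[Tuple[Sequence[float], int]],
    learning_rate: float = 1.0,
    epochs: int = 20,
) -> Tuple[List[float], float]:
    """Nudge weights and bias towards the target whenever a sample is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Toy problem: learn logical AND, which is linearly separable.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
print([predict(weights, bias, x) for x, _ in and_samples])  # [0, 0, 0, 1]
```

A single perceptron can only learn linearly separable functions, which is why AND works here while a function such as XOR would not.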

Feed Forward Neural Network 

FFN is the simplest type of network, transmitting data in only one direction, from input to output. It may or may not have hidden layers.

Convolutional Neural Networks  

CNN uses a convolution layer to process small local regions of the input data one patch at a time, followed by a pooling layer that downsamples the result before it is passed on towards the output.
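The two layer types can be sketched on a tiny grayscale image using plain nested lists. This is only the forward arithmetic of one convolution and one max-pooling step under simplifying assumptions (no padding, stride 1 for the convolution, a hand-picked kernel); real CNNs learn their kernels and stack many such layers.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image and sum the element-wise products.

    No padding, stride 1. As in most deep learning libraries, the kernel
    is not flipped (strictly speaking this is cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map: Matrix, size: int = 2) -> Matrix:
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright transitions

feature_map = convolve2d(image, edge_kernel)
print(feature_map)              # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
print(max_pool2d(feature_map))  # [[2]]
```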

Recurrent Neural Networks

RNN consists of a few recurrent layers between the I/O layers that can store ‘historic’ data. The output of a recurrent layer is fed back into it along with the next input, so that earlier data can improve later predictions.

Deep Neural Networks and Deep Learning

DNN is a network with multiple hidden layers between I/O layers. The hidden layers apply successive transformations to the data before sending it to the output layer.

Deep Learning is facilitated through DNN and can handle huge amounts of complex data, achieving high accuracy because of its multiple hidden layers.
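The idea of hidden layers applying successive transformations can be sketched as a forward pass through a small fully connected network. The ReLU activation, the layer sizes, and the weight values are all assumptions chosen so the arithmetic is easy to follow; a real network would learn its weights during training.

```python
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def dense(inputs: Sequence[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """One fully connected layer: a weighted sum plus bias per output node."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values: Sequence[float]) -> List[float]:
    """Non-linear activation applied after each hidden layer."""
    return [max(0.0, v) for v in values]


def forward(inputs: Sequence[float], layers: List[Layer]) -> List[float]:
    """Pass the inputs through every layer in turn, from input to output."""
    activations = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:  # hidden layers get the activation
            activations = relu(activations)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 1.0]], [0.5]),                     # output layer
]
print(forward([2.0, 1.0], layers))  # [3.0]
```

Adding more (weights, biases) pairs to the list makes the network deeper without changing the forward function.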

Conclusion

Data science is a vast field that runs through different streams but comes across as a revolution and a revelation for us. Data science is booming and will change how our systems work and feel in the future.

If you are curious to learn about data science, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.
