For budding professionals and newcomers alike who are thinking of diving into the booming world of data science, we have compiled a quick cheat sheet to help you brush up on the basics and methodologies that underlie this field.
Data Science – The Basics
The data generated in our world starts out in raw form, i.e., numbers, codes, words, sentences, etc. Data Science processes this raw data using scientific methods and transforms it into meaningful forms that yield knowledge and insights.
Data
Before we dive into the tenets of data science, let’s talk a bit about data, its types, and data processing.
Types of Data
Structured – Data that is stored in a tabular format in databases. It can be either numeric or text
Unstructured – Data that has no definite structure and therefore cannot be tabulated
Semi-structured – Mixed data with traits of both structured and unstructured data
Quantitative – Data with definite numeric values that can be quantified
Big Data – Data sets so large that they are stored in huge databases spanning multiple computers or server farms. Biometric data, social media data, etc. are considered Big Data. Big Data is characterised by the 4 V’s: volume, velocity, variety and veracity
Data Preprocessing
Data Classification – The process of categorizing or labeling data into classes such as numerical, text, image, video, etc.
Data Cleansing – Weeding out missing, inconsistent or incompatible data, or replacing it using one of the following methods:
- Interpolation
- Heuristic
- Random Assignment
- Nearest Neighbour
Data Masking – Hiding or masking confidential data to protect sensitive information while still allowing the data to be processed.
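As a rough illustration of the interpolation option listed under Data Cleansing, here is a minimal sketch that fills missing readings by drawing a straight line between the nearest known values. The function name `interpolate_missing` and the sample readings are assumptions made for this example, and gaps at either end are simply filled with the nearest known value.

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours.

    Gaps at the start or end are filled with the nearest known value.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one known value")
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            # Position of the gap between its two known neighbours (0..1).
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


# Hypothetical sensor readings with missing entries.
readings = [10.0, None, None, 16.0, None]
cleaned = interpolate_missing(readings)
print(cleaned)
```

The other listed methods (heuristic, random assignment, nearest neighbour) would replace only the rule used to compute the filled value.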
What is Data Science Made of?
Concepts of Statistics
Regression
Linear Regression
Linear Regression is used to establish a relationship between two variables such as supply and demand, price and consumption, etc. It expresses one variable Y as a linear function of another variable x as follows
Y = f(x) or Y = mx + c, where m = coefficient (slope) and c = intercept
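To make Y = mx + c concrete, the sketch below estimates m and c with ordinary least squares, which is one common way to fit the line (the article does not name a fitting method, so treat this as an assumption). The price and demand figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data: demand falls by 2 units for every unit increase in price.
prices = [1.0, 2.0, 3.0, 4.0]
demand = [9.0, 7.0, 5.0, 3.0]
m, c = fit_line(prices, demand)
print(f"demand = {m} * price + {c}")
```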
Logistic regression
Logistic regression models the probability of a binary outcome rather than a linear relationship between variables. The predicted probability p lies between 0 and 1 and follows an S-shaped curve, and the final answer is either 0 or 1.
If p < 0.5, then the answer is 0, else 1
Formula:
Y = e^(b0 + b1x) / (1 + e^(b0 + b1x))
where Y = the probability p, b0 = bias (intercept) and b1 = coefficient
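The formula and the 0.5 cut-off can be sketched in a few lines. The code uses the algebraically equivalent form 1 / (1 + e^-(b0 + b1x)); the coefficient values in the usage lines are arbitrary assumptions, not fitted parameters.

```python
import math


def logistic(x, b0, b1):
    """Probability from the logistic formula e^z / (1 + e^z), with z = b0 + b1*x."""
    z = b0 + b1 * x
    # Equivalent form of e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-z))


def classify(x, b0, b1, threshold=0.5):
    """Return 0 if the probability is below the threshold, else 1."""
    return 0 if logistic(x, b0, b1) < threshold else 1


# Arbitrary illustrative coefficients.
b0, b1 = -1.0, 1.0
for x in (0.0, 1.0, 2.0):
    print(x, round(logistic(x, b0, b1), 3), classify(x, b0, b1))
```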
Probability
Probability helps to predict the likeliness of occurrence of an event. Some terminologies:
Sample space: The set of all possible outcomes
Event: A subset of the sample space
Random Variable: A function that maps each outcome in the sample space to a number
Probability Distributions
Discrete Distributions: Give the probability of each value in a countable set of discrete values (for example, integers)
P[X=x] = p(x)
Continuous Distributions: Give the probability over continuous intervals instead of discrete values, by integrating a probability density function f(x). Formula:
$P[a \le X \le b] = \int_a^b f(x)\,dx$, where a and b are the end points of the interval
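The two kinds of distribution can be contrasted with a small sketch. The fair die and the uniform density on [0, 10] are examples chosen here for illustration, and the integral is approximated with the trapezoidal rule rather than solved exactly.

```python
from fractions import Fraction

# Discrete: P[X = x] = p(x) for a fair six-sided die.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
p_even = sum(p for face, p in die_pmf.items() if face % 2 == 0)


def integrate(f, a, b, steps=1000):
    """Trapezoidal approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


def uniform_pdf(x, low=0.0, high=10.0):
    """Density of a uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


# Continuous: P[2 <= X <= 5] is the area under the density between 2 and 5.
p_interval = integrate(uniform_pdf, 2.0, 5.0)
print(p_even, round(p_interval, 6))
```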
Correlation and Covariance
Standard Deviation: The spread of a given dataset around its mean value
$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$
Covariance
It measures the extent to which two random variables X and Y vary together around their respective means.
$\mathrm{Cov}(X,Y) = \sigma_{XY} = E[(X-\mu_X)(Y-\mu_Y)] = E[XY] - \mu_X \mu_Y$
Correlation
Correlation measures the strength of the linear relationship between two variables along with its direction, +ve or -ve
$\rho_{XY} = \dfrac{\sigma_{XY}}{\sigma_X \, \sigma_Y}$
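A short sketch of the three quantities follows. It uses the sample versions with N - 1 in the denominator for both the standard deviation and the covariance, which is an assumption made here for consistency; the data is invented.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (divides by N - 1)."""
    mu = mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))


def sample_cov(xs, ys):
    """Sample covariance of two equally long sequences (divides by N - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance scaled by both standard deviations; lies in [-1, 1]."""
    return sample_cov(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical data: ys rises with xs, zs falls with xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
zs = [8.0, 6.0, 4.0, 2.0]
print(sample_cov(xs, ys), correlation(xs, ys), correlation(xs, zs))
```

A correlation near +1 or -1 indicates a strong linear relationship in the positive or negative direction, while a value near 0 indicates little linear relationship.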
Artificial Intelligence
The ability of machines to acquire knowledge and make decisions based on inputs is called Artificial Intelligence or simply AI.
Types
- Reactive Machines: Reactive machine AI reacts to predefined scenarios by narrowing down to the fastest and best options. These systems lack memory and are best for tasks with a defined set of parameters. They are highly reliable and consistent.
- Limited Memory: This AI has some real-world observational and legacy data fed to it. It can learn and make decisions based on the given data but cannot gain new experiences.
- Theory of Mind: An interactive AI that can make decisions based on the behaviour of the surrounding entities.
- Self Awareness: This AI is aware of its existence and functioning apart from the surroundings. It can develop cognitive abilities and understand and evaluate the impacts of its own actions on the surroundings.
AI terms
Neural Networks
Neural Networks are networks of interconnected nodes that relay data and information in a system. NNs are modeled on the neurons in our brains and can make decisions by learning and predicting.
Heuristics
Heuristics is the ability to make quick predictions from approximations and estimates, using prior experience, in situations where the available information is patchy. It is fast but not guaranteed to be accurate or precise.
Case-Based Reasoning
The ability to learn from previous problem-solving cases and apply them to current situations to arrive at an acceptable solution.
Natural Language Processing
The ability of a machine to understand and interact directly in human speech or text. For example, voice commands in a car.
Machine Learning
Machine Learning is an application of AI that uses various models and algorithms to learn from data in order to predict and solve problems.
Types
Supervised
This method relies on input data that is paired with known output data. The machine is provided with a set of input variables X together with the target variables Y, and it has to learn a mapping from X to Y under the supervision of an optimization algorithm that penalizes wrong predictions. Examples of algorithms commonly used for supervised learning are Neural Networks, Random Forest, Deep Learning, Support Vector Machines, etc.
Unsupervised
In this method, the input data has no labels or associated outputs, and algorithms work to find patterns and clusters, resulting in new knowledge and insights.
Reinforcement
Reinforcement learning focuses on improving the learning behaviour through trial and error. It is a reward-based method where the machine gradually improves its strategy to maximize a target reward.
Modeling Methods
Regression
Regression models give numbers as output, through interpolation or extrapolation of continuous data.
Classification
Classification models output a class or label and are better at predicting discrete outcomes like ‘what kind’.
Both regression and classification are supervised models.
Clustering
Clustering is an unsupervised method that groups data points into clusters based on shared traits, attributes, features, etc.
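As one possible illustration of clustering, here is a minimal one-dimensional k-means sketch. The article does not name a clustering algorithm, so k-means is an assumption chosen for this example, as are the data points and the starting centroids.

```python
def k_means_1d(points, centroids, iterations=10):
    """Group 1-D points around centroids by repeated assign-and-update steps."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No labels are supplied at any point; the groups emerge only from how close the points are to each other.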
ML Algorithms
Decision Trees
Decision trees arrive at a solution by asking successive questions, one at each stage, where each question typically has one of two possible answers like ‘Yes’ or ‘No’. Decision trees are simple to implement and interpret.
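The successive yes/no questions can be sketched as a nested structure that is walked from the root to a leaf. The tree here is written by hand rather than learned from data, and the loan-approval features (`income`, `has_debt`) are hypothetical.

```python
# Each internal node asks a yes/no question; each leaf holds the final answer.
tree = {
    "question": lambda row: row["income"] >= 50000,
    "yes": {
        "question": lambda row: row["has_debt"],
        "yes": "No",
        "no": "Yes",
    },
    "no": "No",
}


def predict(node, row):
    """Follow the answers down the tree until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["question"](row) else node["no"]
    return node


# Hypothetical applicants.
print(predict(tree, {"income": 60000, "has_debt": False}))
print(predict(tree, {"income": 30000, "has_debt": False}))
```

A training algorithm would choose the questions automatically from labeled data; only the prediction step is shown here.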
Random Forest or Bagging
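Random Forest is an ensemble algorithm built on decision trees. It trains a large number of decision trees, typically each on a random sample of the data (bagging), which makes the structure dense and complex like a forest. It combines the outcomes of the individual trees, for example by majority vote or averaging, which usually leads to more accurate results than a single tree.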
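Two ingredients of bagging can be sketched independently of any particular tree implementation: drawing a bootstrap sample (sampling with replacement) and combining the predictions of several trees by majority vote. The helper names and the example votes are assumptions for illustration; training the individual trees is left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as used to train each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))
sample = bootstrap_sample(rows, rng)

# Hypothetical outputs of five trees for one new data point.
votes = ["Yes", "No", "Yes", "Yes", "No"]
print(sample, majority_vote(votes))
```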
K-Nearest Neighbour (kNN)
kNN uses the proximity of a new data point to the nearest existing data points to predict which category it falls in. The new data point is assigned to the category that has the highest number of those neighbours.
k = number of nearest neighbours
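A minimal kNN sketch, assuming Euclidean distance as the measure of proximity and a simple majority among the k nearest points. The training points and colour labels are invented for the example.

```python
import math
from collections import Counter


def knn_predict(training, new_point, k=3):
    """Predict a label for new_point from its k nearest labeled neighbours.

    training is a list of (coordinates, label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled points on a 2-D plot.
training = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.5), "red"),
    ((6.0, 6.0), "blue"), ((6.5, 7.0), "blue"), ((7.0, 6.5), "blue"),
]
print(knn_predict(training, (2.0, 2.0)))
print(knn_predict(training, (6.0, 7.0)))
```

An odd k is a common choice for two classes because it avoids tied votes.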
Naïve Bayes
Naïve Bayes rests on two pillars: first, the assumption that the features of a data point are independent of each other given the class, and second, the Bayes theorem, which gives the probability of an outcome given an observed condition or evidence.
Bayes Theorem:
P(X|Y) = {P(Y|X) * P(X)} / P(Y)
Where P(X|Y) = Conditional probability of X given occurrence of Y
P(Y|X) = Conditional probability of Y given occurrence of X
P(X), P(Y) = Probability of X and Y individually
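A worked example of the theorem, with X = "the email is spam" and Y = "the email contains the word offer". All probabilities are made-up numbers for illustration, and P(Y) is obtained with the law of total probability.

```python
def bayes(p_y_given_x, p_x, p_y):
    """P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y


# Hypothetical probabilities.
p_spam = 0.2                # P(X)
p_offer_given_spam = 0.5    # P(Y|X)
p_offer_given_ham = 0.05    # P(Y | not X)

# P(Y) summed over both cases: spam and not spam.
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * (1 - p_spam)

p_spam_given_offer = bayes(p_offer_given_spam, p_spam, p_offer)
print(round(p_spam_given_offer, 3))
```

With these numbers, seeing the word raises the probability of spam from the prior 0.2 to roughly 0.71.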
Support Vector Machines
This algorithm tries to separate classes of data in space with a boundary, which can be either a line or a plane. This boundary is called a ‘hyperplane’ and is defined by the nearest data points of each class, which in turn are called ‘support vectors’. The distance between the hyperplane and the support vectors on either side is called the margin, and the algorithm chooses the hyperplane that maximizes it.
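The geometry can be sketched by measuring how far each point lies from a given hyperplane w·x + b = 0. This only evaluates a hyperplane that is supplied by hand; it does not train an SVM, and the weights and points are assumptions for the example.

```python
import math


def signed_distance(w, b, point):
    """Signed distance from point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm


def margin(w, b, points):
    """Distance from the hyperplane to its closest point."""
    return min(abs(signed_distance(w, b, p)) for p in points)


# Hypothetical boundary: the vertical line x = 3 in a 2-D space.
w, b = (1.0, 0.0), -3.0
points = [(1.0, 0.0), (5.0, 2.0), (0.0, 1.0)]

sides = [1 if signed_distance(w, b, p) > 0 else -1 for p in points]
print(sides, margin(w, b, points))
```

The sign of the distance tells which side of the boundary a point falls on, and the points that attain the minimum distance play the role of support vectors.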
Neural Networks
Perceptron
The most fundamental neural network. It computes a weighted sum of its inputs and produces an output based on a threshold value.
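A single perceptron can be sketched as a weighted sum followed by a threshold test. The weights and bias are set by hand here so that the unit behaves like a logical AND gate; they are illustrative values, not learned ones.

```python
def perceptron_output(inputs, weights, bias, threshold=0.0):
    """Return 1 if the weighted sum plus bias exceeds the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > threshold else 0


# Hand-picked parameters that make the perceptron act as an AND gate.
weights, bias = [1.0, 1.0], -1.5
for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, perceptron_output(pair, weights, bias))
```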
Feed Forward Neural Network
FFN is the simplest type of network, transmitting data in only one direction, from input to output. It may or may not have hidden layers.
Convolutional Neural Networks
CNN uses a convolution layer that slides a small filter over local regions of the input data to extract features, followed by a pooling layer that downsamples the result before the output.
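The two layer types can be sketched in one dimension: a small filter slides over the input, then max pooling keeps the strongest response in each window. The signal and the edge-detecting filter are assumptions for the example, and, as in most deep learning libraries, the filter is applied without flipping.

```python
def convolve_1d(signal, kernel):
    """Slide the kernel over the signal and take a weighted sum at each position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool_1d(values, size=2):
    """Keep the maximum of each non-overlapping window of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical input with a raised region, and a filter that responds to rising edges.
signal = [0, 0, 1, 1, 1, 0, 0]
kernel = [-1, 1]

features = convolve_1d(signal, kernel)
pooled = max_pool_1d(features)
print(features, pooled)
```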
Recurrent Neural Networks
RNN consists of one or more recurrent layers between I/O layers that can store ‘historic’ data. The output of a recurrent layer is fed back into it along with the next input, so earlier inputs can improve later predictions.
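The way a recurrent layer carries ‘historic’ data can be sketched with a single hidden value that is updated at every time step. The scalar weights and the tanh activation are assumptions chosen to keep the example small.

```python
import math


def rnn_forward(inputs, w_in, w_rec, bias):
    """Run a one-unit recurrent layer over a sequence and return every hidden state."""
    hidden = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        hidden = math.tanh(w_in * x + w_rec * hidden + bias)
        states.append(hidden)
    return states


# Hypothetical sequence: a single pulse followed by silence.
states = rnn_forward([1.0, 0.0, 0.0], w_in=1.0, w_rec=0.5, bias=0.0)
print([round(s, 3) for s in states])
```

The later states stay above zero even though the later inputs are zero, which shows the first input being remembered.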
Deep Neural Networks and Deep Learning
DNN is a network with multiple hidden layers between I/O layers. The hidden layers apply successive transformations to the data before sending it to the output layer.
‘Deep Learning’ is facilitated through DNNs, which can handle huge amounts of complex data and achieve high accuracy because of their multiple hidden layers.
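The successive transformations in a DNN can be sketched as a forward pass through a stack of fully connected layers. The layer sizes, the weights and the ReLU activation are assumptions for illustration; a real network would learn its weights from data.

```python
def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per output node plus its bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the inputs through every layer, applying ReLU on hidden layers only."""
    activation = inputs
    for index, (weights, biases) in enumerate(layers):
        activation = dense(activation, weights, biases)
        if index < len(layers) - 1:
            activation = relu(activation)
    return activation


# Hypothetical network: 2 inputs, two hidden layers of 2 nodes, 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]),
    ([[2.0, 3.0]], [0.5]),
]
output = forward([2.0, 1.0], layers)
print(output)
```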
Conclusion
Data science is a vast field that runs through different streams, from statistics to AI and machine learning. It is booming and will change how our systems work in the future.