Shruti Mittal

Blog Author

Shruti Mittal is a content strategist for Data Science, Machine Learning, and Artificial Intelligence programs at UpGrad. She looks after the design and delivery of content.

POSTS BY Shruti Mittal

Basic Fundamentals of Statistics for Data Science
If you’re an aspiring Data Scientist, you should be familiar with the core concepts of Statistics for Data Science. You need not be a Ph.D. in Statistics to excel at Data Science, but you need to know enough to perhaps describe a couple of basic algorithms at a dinner party. Going forward, we’ll walk you through some of the prerequisites in the basics of Statistics for Data Science.

If you’ve just entered the world of Data Science, you might have come across people stating “Maths” as a prerequisite to Data Science. In all honesty, it’s not Maths, per se, but you do have to learn Statistics for Data Science. These days, libraries like TensorFlow hide almost all the complex Mathematics away from the user. That is good for us, but it still helps to have a basic understanding of the underlying principles on which these things work. Having a good understanding of data analytics can help you understand everything better.

This article will help arm you with some theorems, concepts, and equations that will not only help your cause as a Data Scientist but will also make you sound like you aced the course on Advanced Statistical Computing big time.

Basics of Statistics for Data Science

Statistical Significance: Differentiating between Random and Meaningful Results

Data analysis requires the ability to distinguish random fluctuations in data from meaningful patterns or effects. This is where the idea of statistical significance comes into play. Statistical significance helps researchers and data scientists determine whether observed results are due to chance or indicate a true relationship or effect.

Hypothesis testing is the statistical approach most commonly used to determine statistical significance. The procedure begins with the formulation of a null hypothesis, which asserts that there is no significant effect or relationship in the data. In contrast, the alternative hypothesis asserts that a real effect or relationship is present.
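To make the null-versus-alternative setup concrete, here is a minimal sketch of an exact permutation test on the difference between two group means. This is a swapped-in technique chosen because it needs only the Python standard library; it is not the t-test or chi-square test named in this article. The function name and the measurements are hypothetical. The idea: if the null hypothesis is true, the group labels are interchangeable, so we count how many relabelings of the pooled data produce a difference at least as large as the one observed.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    total = 0
    # Under the null hypothesis every split of the pooled data into two
    # groups of these sizes is equally likely, so enumerate all of them.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Small tolerance guards against floating-point round-off.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical measurements, for illustration only.
treated = [12.1, 11.8, 12.5, 12.9, 12.3, 12.7]
control = [10.2, 10.8, 10.5, 11.0, 10.1, 10.9]

p_value = permutation_p_value(treated, control)
print(f"p-value = {p_value:.4f}")
```

The returned fraction plays the role of a p-value: a small value says that a difference this large would rarely arise from relabeling alone. Full enumeration is only practical for small samples; larger ones would typically use random shuffles instead.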
Statistical tests, such as the t-test or chi-square test, are used to determine statistical significance, depending on the type of data and the research question. These tests produce a p-value, which expresses the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. A result is considered statistically significant when its p-value is less than a chosen threshold, which is often 0.05.

It is critical to remember that statistical significance does not always correspond to practical relevance or to the magnitude of the effect. Depending on the scenario and the purpose of the analysis, a statistically significant result may have little practical value, while a non-significant finding may still matter.

Understanding and using statistical significance is critical for obtaining trustworthy results from data analysis. By distinguishing between random fluctuations and meaningful patterns, researchers can make informed judgments, support hypotheses, and discover results that contribute to knowledge and decision-making.

Data Visualization: Communicating Insights through Graphs and Charts

Data visualization plays a crucial role in data analysis by transforming raw data into intuitive and visually appealing representations. Graphs and charts provide a powerful means of communicating insights, patterns, and trends hidden within complex datasets.

Bar Charts: Bar charts are commonly used to display categorical data and compare values across different categories. They are effective in visualizing frequency counts, market shares, and other discrete data. Bar charts allow for quick comparisons and help identify the most significant categories or trends within the data.

Line Charts: Line charts are ideal for visualizing trends and changes over time. They are commonly used to display time series data, such as stock prices or temperature fluctuations.
Line charts allow data scientists to observe patterns, seasonality, and long-term trends, making them invaluable for forecasting and monitoring purposes.

Scatter Plots: Scatter plots are useful for examining relationships between two continuous variables. By plotting data points on a Cartesian plane, data scientists can identify correlations, clusters, and outliers. Scatter plots help in understanding the nature of the relationship between variables and can aid in decision-making processes.

Pie Charts: Pie charts are effective for illustrating the proportions or percentages of a whole. They are commonly used to display market shares, survey responses, or the distribution of categorical data. Pie charts provide a visual snapshot of relative proportions and make it easy to compare different categories at a glance.

Heatmaps: Heatmaps are graphical representations of data where the values are encoded using colors. They are particularly useful for displaying large matrices or tabular data. Heatmaps help identify patterns, clusters, and relationships within datasets, making them valuable for tasks like correlation analysis and gene expression analysis.

Statistical Distributions

This is probably one of the most important things you need to know while arming yourself with prerequisite Statistics for Data Science.

Poisson Distribution

The Poisson distribution is one of the most essential tools in statistics. It gives the probability that a given number of events occurs in a fixed time interval. For instance, it can tell you how likely it is that a certain number of phone calls arrive in any particular period of time.

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

The funny looking symbol in this equation (λ) is known as lambda. It represents the average number of events occurring per time interval, and k is the number of events whose probability we want. Another good example where the Poisson distribution finds use is calculating the loss in manufacturing. Suppose a machine produces sheets of metal with an average of X flaws per yard.
Suppose, for instance, the flaw rate was 2 per yard of the sheet. Then, using the Poisson distribution, we can calculate the probability that exactly two flaws will occur in a yard: with λ = 2 and two events, it is 2² · e⁻² / 2! ≈ 0.27.

Binomial Distribution

If you’ve ever encountered basic Statistics, you might have come across the Binomial Distribution. Let’s say you had an experiment of flipping an unbiased coin thrice. Can you tell the probability of the coin showing heads on all three flips? First, from basic combinatorics, we can find out that there are eight possible combinations of results when flipping a coin thrice. Only one of them is all heads, so the answer is 1/8. Now, we can plot the probabilities of having 0, 1, 2, or 3 heads (1/8, 3/8, 3/8, and 1/8). That plot will give us our required binomial distribution for this problem. When graphed, you’ll notice that it is symmetric and bell-like, and as the number of flips grows it looks more and more like a typical normal distribution curve. The difference is that the Binomial Distribution is for discrete values (a count of heads in a limited number of coin flips), while the Normal Distribution takes care of continuous values.

There are a number of distributions other than the ones we talked about above. If you’re an interested soul and also want to arm yourself better with the needed Statistics for Data Science, we suggest you read up on the following distributions as well: the Geometric, Hypergeometric, Discrete Uniform, and Negative Binomial distributions.

Some Theorems and Algorithms

When we talk about Statistics for Data Science, we just can’t ignore the basic theorems and algorithms that are the foundation of many libraries that you’ll be working with as a Data Scientist.
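Before turning to those theorems, the two distribution examples above can be checked numerically. The sketch below uses the standard Poisson and binomial probability formulas; the function names `poisson_pmf` and `binomial_pmf` are illustrative choices, not part of any library.

```python
from math import comb, exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam ** k * exp(-lam) / factorial(k)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Metal sheet example: average of 2 flaws per yard, exactly 2 flaws.
flaw_prob = poisson_pmf(2, 2.0)
print(f"P(exactly 2 flaws) = {flaw_prob:.4f}")

# Coin example: 0, 1, 2, or 3 heads in three flips of a fair coin.
heads_probs = [binomial_pmf(k, 3, 0.5) for k in range(4)]
print("P(0..3 heads) =", heads_probs)
```

The last entry of `heads_probs` is the probability of three heads in three flips, and the four values together are the points you would plot to draw the binomial distribution for this experiment.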
There are a number of classification algorithms, clustering algorithms, neural network algorithms, decision trees, and so on. In this section, we’ll talk about a few basic theorems that you should know. They will also help you understand other, more complex theorems with ease.

Bayes Theorem

This is one of the common theorems that you’ll come across if you’ve had any formal education in Computer Science. Numerous books over the years have discussed Bayes Theorem and its concepts at length. Bayes Theorem greatly simplifies complex concepts. It explains a lot of statistical facts using a few simple variables. It is built on the concept of “conditional probability”, that is, the probability of B given that A has occurred. In symbols, P(A|B) = P(B|A) · P(A) / P(B). The most appreciable thing about it is that you can update the probability of a hypothesis using just the given data points. Bayes can help you estimate the probability of someone having cancer given their age. It can also help you decide whether an email is spam based on the words it contains. In essence, this theorem is used to reduce uncertainty as evidence comes in.

Fun fact: Bayesian reasoning helped predict the locations of U-boats as well as the configuration of the Enigma machine used to encrypt German messages in WW2. Even in modern Data Science, Bayes finds extensive applications in many algorithms.

K-Nearest Neighbor Algorithm

This is a very easy algorithm both in terms of understanding and implementation. It is often called a “lazy” algorithm because it does no real work at training time and defers all computation until a prediction is requested. Its simplicity lies in the fact that it’s based on logical deduction rather than on any fundamental of statistics, per se. In layman’s terms, this algorithm labels a new data point by looking at the data points closest to it. K-NN typically measures closeness using Euclidean Distance. It finds a specified number of the nearest labeled points and assigns the label that is most common among them. That number is represented by “k”.
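A minimal sketch of this idea, assuming the common majority-vote form of K-NN for classification: measure the Euclidean distance from the query to every labeled point, keep the k closest, and return the most frequent label among them. The function name `knn_predict`, the data, and the labels are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest neighbors.

    `train` is a list of (point, label) pairs; points are coordinate tuples.
    """
    # Sort the labeled points by Euclidean distance to the query.
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature data set with two classes.
training_data = [
    ((1.0, 1.2), "low"), ((0.8, 0.9), "low"), ((1.3, 1.0), "low"),
    ((4.0, 4.2), "high"), ((4.5, 3.9), "high"), ((3.8, 4.4), "high"),
]

prediction = knn_predict(training_data, (1.1, 1.0), k=3)
print(prediction)
```

Note that nothing is "trained" here: the function simply stores the data and does all its work when asked for a prediction, which is why the algorithm is described as lazy.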
There are many approaches to finding out how large the value of “k” should be, as this is a user-decided value. This concept is great for simple classification tasks, basic market segmentation, and seeking out outliers from a group of data entries. With a suitable library, K-NN can be used in just a few lines of code in most modern programming languages.

Bagging (Bootstrap aggregating)

Bagging essentially refers to creating more than one model of a single algorithm, such as a decision tree. Each of the models is trained on a different sample drawn with replacement from the data (this is called a bootstrap sample). Therefore, each decision tree is built from different sample data, which reduces the problem of overfitting to any one sample. Grouping decision trees like this helps reduce the total error, as the overall variance tends to decrease as more trees are combined. A bag of such decision trees, with a random subset of features considered at each split, is known as a random forest.

ROC Curve Analysis

The term ROC stands for Receiver Operating Characteristic. ROC curve analysis finds extensive use in Data Science. It shows how well a test is likely to perform by plotting its sensitivity (true positive rate) against its fall-out (false positive rate). ROC analysis is extremely important when determining the viability of any model.

How does it work? Your machine learning model might give you some inaccurate predictions. Some of them are inaccurate because a particular value should’ve been “true” but is instead set to “false”, or vice versa. What is the probability of you being correct then?
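One way to make this question concrete is to compute the points of an ROC curve by hand. The sketch below assumes a binary classifier that outputs a score per example, where a higher score means "more likely positive"; it sweeps a threshold over the scores, records the false positive rate and true positive rate at each step, and measures the area under the resulting curve with the trapezoidal rule. The function names, labels, and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ground truth (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

curve = roc_points(labels, scores)
auc = area_under_curve(curve)
print(f"AUC = {auc:.3f}")
```

Each point on `curve` corresponds to one possible threshold, so choosing a threshold amounts to choosing a point on this curve. A model that ranks every positive above every negative reaches an area of 1.0.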
Using the ROC curve, you can see how accurate your prediction is. By looking at the two score distributions, one for the actual positives and one for the actual negatives, you can also figure out where to put your threshold value. The threshold is where you decide if the binary classification is positive or negative, true or false.

As the two distributions get closer to each other and overlap more, the area under the curve will tend to 0.5, which is what random guessing would achieve. This essentially means that your model is tending to inaccuracy. The greater the area, the better your model separates the two classes. This is one of the first checks used when testing any model, as it helps detect problems early on by showing whether or not the model discriminates at all.

A real-life example of ROC curves: they are used to depict, in a graphical way, the trade-off between clinical sensitivity and specificity for every possible cut-off of a particular test or a combination of tests. In addition, the area under the ROC curve gives a fair idea of the benefit of using the tests in question. Hence, ROC curves find extensive use in Biochemistry for choosing an appropriate cut-off. Ideally, the best cut-off is the one that combines the lowest false positive rate with the highest true positive rate.

Importance of Statistics in Data Science

Now that you are aware of the basic concepts and fundamentals of Statistics from the above discussion, let’s talk about why it is important to learn Statistics for Data Science. Statistics provides the crucial tools and techniques to organize data, analyze and quantify it, and find deep insights in it. We have given you an overview of the basic concepts of Statistics and the impact of Statistics on data exploration, analysis, modelling, and representation.
We have also pointed to the problems that can arise when the basics of Statistics are neglected. If you are interested in joining the fastest growing industry, come straight to our website at UpGrad to follow our Statistics for Data Science tutorial, as we provide both online and offline courses in the same. Once you up your game in at least the fundamentals and basics of Statistics, you will be job ready.

In Conclusion…

The above list of topics is by no means a comprehensive list of everything you need to know in Statistics. This list is just to give you a flavor of what you might encounter in your journey of Data Science, and how you can be prepared for it. All in all, this article introduces you to some of the core concepts of Statistics for Data Science. A deep understanding of the concepts explained here will help you understand the other concepts easily. If you would like to explore more and master data science, find our best online data science courses.

by Shruti Mittal

15 Jun 2023

5 Breakthrough Applications of Machine Learning
Machine Learning is the latest buzzword floating around, and quite rightly so. It’s one of the most interesting and fastest growing subfields of Computer Science. To put it simply, Machine Learning is what makes your Artificial Intelligence intelligent. Most people find the inner workings of Machine Learning mysterious, but that’s far from the truth. If you’re just beginning to understand Machine Learning, let us make it easier by using an analogy.

You’re trying to throw a paper ball into a dustbin. After one attempt, you’ll get a fair idea of the amount of force you need to put in. You apply the required force in your second attempt, but the angle seems to be wrong. What is essentially happening here is that with each throw you’re learning something and bringing your outcome closer to the desired result. That is because we humans are inherently programmed to learn and grow from our experiences.

Suppose you replace yourself with a machine. Now, we have two ways of going forward:

Non-Machine Learning Approach

A generic, non-machine learning approach would be to measure the angle and distance and then use a formula to calculate the optimal force required.
Now, suppose we add another variable: a fan that adds some wind force. Our non-ML program will almost certainly fail owing to the added variable. If we’re to get it to work, we need to reprogram it, changing the formula to take the wind factor into account.

Machine Learning Approach

Now, if we were to devise a Machine Learning based approach for the same problem, it’d also begin with a standard formula, but after every experience it’d update and refine the formula. The formula will improve continuously with more experiences (known as “data points” in the world of Machine Learning), and this will lead to improvements in the outcome as well. You experience these things on a daily basis in the form of your Facebook newsfeed, custom curated YouTube suggestions, or other things of this sort. You get the gist.

What is Machine Learning?

The above analogy should make it clear that Machine Learning is simply using algorithms and processes to train your system to get better with experience. However, for the sake of a technical definition, a system is said to learn from experience with respect to a set of tasks if its performance at those tasks improves with time and experience. What this essentially means is that in Machine Learning, the system improves its performance with experience. This is precisely what we noticed in our analogy as well.

Types of Machine Learning

Depending on your problem statement, you can use any of three techniques to train your system:

Supervised Learning

Supervised Machine Learning should be applied to datasets where the label/class of each data point is known. Let us imagine we want to teach our system how to distinguish between the images of a dog and a human. Suppose we have a collection of pictures that are labeled as either human or dog (labeling is done by human annotators to ensure better quality of data). Now, we can use this data set and its classes to train our algorithm to learn the right way.
Once our algorithm learns how to classify images, we can use it on different data sets to predict the label of any new data point.

Unsupervised Learning

As you can guess from the name, unsupervised Machine Learning is devoid of any supervising classes or labels. We just provide our system with a large amount of data and the characteristics of each data piece. For example, suppose in our earlier example we just fed a number of images (of humans and dogs) to our system, giving each image a set of characteristics. Clearly, the characteristics of humans will be similar to each other and different from those of dogs. Using these characteristics, we can train our system to group the data into two categories. The unsupervised version of “classification” is called “clustering”. In clustering, we don’t have any labels. We group the data on the basis of common characteristics.

Reinforcement Learning

In reinforcement learning, there are no classes or characteristics; there’s just an end-point: pass or fail. To understand this better, consider the example of learning to play chess. After every game, the system is informed of the win/loss status. In such a case, our system does not have every move labeled as “right” or “wrong”, but only has the end result. As our algorithm plays more games during training, it’ll keep giving bigger “weights” (importance) to the combinations of moves that resulted in a win.

Breakthrough Applications in the field of Machine Learning

From our above discussion, it’s clear that Machine Learning can indeed solve a lot of problems that traditionally programmed computers just cannot. Let’s look at some of the applications of Machine Learning that have changed the world as we know it:

1. Fighting Webspam

Google is using “deep learning”, its neural network, to fight spam both online and offline. Deep Learning uses data from the users and applies natural-language processing to draw conclusions about the emails it encounters.
Not only does it help web users, it also helps the SEO companies trying to get legitimate websites to rank higher using white-hat techniques.

2. Imitation Learning

Imitation learning is very similar to observational learning, something we do as infants. It is extensively used in field robotics and in industries like agriculture, search, construction, rescue, military, and others. In all such situations, it’s extremely difficult to manually program the robots. To help with that, programming by demonstration, also known as collaborative methods, is used together with Machine Learning. A video published by Arizona State shows a humanoid robot learning to grasp different objects.

3. Assistive and Medical Tech

Assistive robots are robots that are capable of processing sensory information and performing actions in times of need. The Smart Tissue Autonomous Robot (STAR) was created using this type of machine learning and real-world collaboration. STAR uses ML and 3D sensing and can reportedly stitch together pig intestines (used for testing) better than human surgeons. While STAR wasn’t developed to replace surgeons, it does offer a collaborative solution for delicate steps in medical procedures.

Machine Learning also finds applications in the form of predictive measures. Just as a colleague can look at a doctor’s prescription and find out what they might have missed, an artificially intelligent system can find the missing links in a prescription if trained well. Not only this, but AI can also look for patterns that point to possible heart failure. This can prove extremely helpful to doctors, as they can collaborate with the AI to better diagnose a fatal heart condition before it strikes. The extra pair of eyes (and intelligence) can do more good than harm. Studies so far are also promising for future applications of this technology.

4. Automatic Translation/Recognition

Although it looks like a simple concept, ML can also be used to translate text (even from images) into any language. Neural networks help extract the text from an image, which can then be translated into the required language before being put back into the picture. Other than this, ML is also used in every application that deals with any kind of recognition: voice, images, text, you name it!

5. Playing Video Games Automatically

This is one of the cooler applications of Machine Learning, although it might not have as much social utility as the others mentioned in the list. Machine Learning can be used to train Neural Networks to analyse the pixels on a screen and play a video game accordingly. One of the initial attempts at this came from Google’s DeepMind.

In Conclusion…

Having said that, Machine Learning isn’t the solution to all your problems. You don’t need machine learning to figure out a person’s age from their DOB, but you certainly need ML to figure out a person’s age from their music preferences. For example, you’ll find that fans of Johnny Cash and the Doors are mostly 35+ in age, whereas most of the Selena Gomez fans are under 20. Machine Learning *can* be used for any problem around you, but should it?
Not really. Never use machine learning as a solution to your problems without being sure that you really need your machine to learn. Otherwise, it’d be like killing mosquitoes using machine guns – they might get killed, they might not, but at the end of the day, was it worth it?
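For readers who want to see the paper-ball analogy from the start of this article in code, here is a minimal sketch of a learner that corrects its throwing force after every attempt. The "physics" in `landing_distance` is a made-up linear rule, and the function names, learning rate, and numbers are all hypothetical. The point is only that the learner never sees the formula or the wind; it just sees how far off each throw was and adjusts.

```python
def landing_distance(force, wind=0.0):
    """Hypothetical environment: distance grows with force, wind shifts it."""
    return 2.0 * force + wind


def learn_force(target, wind=0.0, attempts=20, learning_rate=0.25):
    """Improve the throwing force from feedback alone, one throw at a time."""
    force = 1.0  # initial guess
    for _ in range(attempts):
        # Feedback from this throw: positive means the ball fell short.
        miss = target - landing_distance(force, wind)
        # Nudge the force in the direction that reduces the miss.
        force += learning_rate * miss
    return force


# Same learner, with and without the fan switched on.
calm_force = learn_force(target=10.0)
windy_force = learn_force(target=10.0, wind=-1.5)
print(f"calm: {calm_force:.3f}, windy: {windy_force:.3f}")
```

Adding the fan required no change to `learn_force`, which is the contrast the analogy draws with the fixed-formula approach. It is also a reminder of the closing advice: when the formula is known and stable, computing the answer directly is simpler than learning it.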
by Shruti Mittal

26 Feb 2018
