The words Data, Science, or Data Science are not enough to incite fear or dread in readers. To be honest, they’re too cute to be even off-putting, let alone horrid, unlike tessellation, k-means, k-nearest neighbors, Euclidean Minimum Spanning Tree, and other terms of this sort that you’ll encounter on your Data Science journey.
While “Data Science” doesn’t inspire fear, it also doesn’t explain anything about the field. Everybody knows what data is, at least in a layman’s sense. Data is essentially just raw bits of information. Science, on the other hand, can be used to mean any group of activities following a scientific method. So, going by this logic, we can conclude that Data Science is a field that uses scientific methods on large chunks of data. But to what end? And what exactly is Data Science?
That’s our topic for discussion today. After reading this article, you’ll be able to answer the following questions:
- What is Data Science?
- What are the different phases of a Data Science pipeline?
- Where can I see Data Science at work?
What is Data Science?
Wikipedia, the mother of all encyclopedias, defines Data Science as a field focused on extracting knowledge and insights from data by using scientific methods. However, what it doesn’t tell you is that we humans are born data scientists. How? Let’s see.
You’re observing the world around you no matter what you’re doing. At every waking moment, you’re taking in details from your surroundings and feeding them to your brain. You then process these observations into data and use that data to understand the things around you, finding meaning in them and predicting what is likely to happen next.
When you’re an hour late leaving for work, you call in to tell them you’ll be working from home. Your past observations of traffic and stoppages on the way lead you to conclude that you’re likely to lose more time stuck in traffic than you’d gain by being in the office. When you come into your room and see chocolate wrappers lying around, a casual analysis will tell you that someone’s been eating your chocolates in your absence.
In either of these cases, if you do the calculations and predictions in your mind, without noting anything down, you’re a normal human being. On the other hand, suppose you go ahead and record these data points (in a machine-readable format, of course) and then devise an algorithm (or a set of procedures) and write computer programs to run it. If the output of this “hypothetical” system is that “the traffic is going to suck”, or “your roommates ate your chocolates”, then bingo! You’re a data scientist.
It’s just as simple (in theory) as the above analogy makes it sound. At the end of the day, you have data, procedures, algorithms, and tools. You just need to extract knowledge from the data. To do that efficiently, there’s a workflow/pipeline you must follow. Let’s see what is included in a typical Data Science Pipeline.
Data Science Pipeline
The Data Science pipeline describes the flow of the entire process – from obtaining the desired data to making accurate calculations and predictions. Let’s have a look at the elements of this pipeline:
Obtain Your Data
This is by default the first thing you need to do to practice Data Science – get the data! Just a little heads-up – there are some things you must take into consideration while obtaining your data. You must first identify all of your datasets (which can come from the internet or from internal/external databases). You should then extract the data into a usable format (CSV, XML, JSON, etc.).
- Database Management: Either SQL or NoSQL, depending on your needs and requirements.
- Querying these databases
- Retrieving unstructured data in the form of videos, audios, texts, documents, etc.
- Distributed storage and processing: Hadoop, Apache Spark, or Apache Flink.
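To make the "query, then extract into a usable format" step concrete, here is a minimal sketch. It assumes a made-up `commutes` table living in an in-memory SQLite database; the table, its columns, and the helper names `fetch_rows` and `to_csv` are hypothetical, chosen only so the example runs anywhere with Python's standard library. A real project would point at its own database instead.

```python
import csv
import io
import json
import sqlite3


def fetch_rows(conn, min_minutes):
    """Query commute records at or above a delay threshold (hypothetical schema)."""
    cur = conn.execute(
        "SELECT day, delay_minutes FROM commutes "
        "WHERE delay_minutes >= ? ORDER BY day",
        (min_minutes,),
    )
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def to_csv(rows):
    """Serialise a list of dicts to CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Build a throwaway in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commutes (day TEXT, delay_minutes INTEGER)")
conn.executemany(
    "INSERT INTO commutes VALUES (?, ?)",
    [("Mon", 35), ("Tue", 10), ("Wed", 50)],
)

rows = fetch_rows(conn, 30)      # query the database
csv_text = to_csv(rows)          # extract to CSV
json_text = json.dumps(rows)     # extract to JSON
print(csv_text)
print(json_text)
```

The same rows end up in two of the formats mentioned above, ready to hand over to the cleaning phase.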
Scrubbing / Cleaning Your Data
Cleaning the data should be given the utmost importance because the final output of your system is only as good as the data you put into it. Cleaning refers to removing anomalies, filling in empty/missing values, checking that the data is consistent, and other things of this nature.
- Scripting language: Python, R, SAS
- Data wrangling tools: Python Pandas, R
- Distributed processing: Hadoop MapReduce, Spark
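The three cleaning chores named above (removing anomalies, filling in missing values, and making the data consistent) can be sketched in a few lines. This is only an illustration on made-up commute records: the `clean_delays` helper, the 180-minute cut-off, and the choice of the median as a fill value are all assumptions, and in practice you would likely reach for a wrangling library such as Pandas.

```python
from statistics import median


def clean_delays(records, max_valid=180):
    """Return cleaned copies of the records.

    Steps: normalise day labels, drop out-of-range readings, and fill
    missing values with the median of the remaining valid readings.
    """
    normalised = []
    for rec in records:
        day = rec["day"].strip().title()      # consistency: " mon " -> "Mon"
        delay = rec["delay_minutes"]
        # Treat negative or implausibly large delays as anomalies.
        if delay is not None and not (0 <= delay <= max_valid):
            continue
        normalised.append({"day": day, "delay_minutes": delay})

    known = [r["delay_minutes"] for r in normalised if r["delay_minutes"] is not None]
    fill = median(known) if known else None
    for r in normalised:
        if r["delay_minutes"] is None:
            r["delay_minutes"] = fill         # fill in the missing value
    return normalised


raw = [
    {"day": " mon ", "delay_minutes": 35},
    {"day": "TUE", "delay_minutes": None},    # missing value
    {"day": "Wed", "delay_minutes": 900},     # anomaly
    {"day": "thu", "delay_minutes": 15},
]
cleaned = clean_delays(raw)
for row in cleaned:
    print(row)
```

Whether to drop or repair an anomaly, and what to fill gaps with, depends on the problem at hand; the median is just one common, outlier-resistant choice.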
Exploring (Exploratory Data Analysis)
Now that the data is clean, you can begin to understand what patterns it holds. Different types of visualisation and statistical modeling come into use in this phase. Basically, this phase aims to derive the hidden meaning from your data.
There’s a lot that goes on in the field of Exploratory Data Analysis. If you feel it’s something you’d enjoy, don’t forget to read our article on the topic.
To perform better in this phase, you need to have your “spidey senses” tingling. Go crazy and spot weird patterns or trends – always be on the lookout for something out of the box. However, while doing that, don’t forget the problem you’re aiming to solve. Don’t go too much out of the box. Exploratory data analysis is an art, and an artist should always keep the audience in mind.
- Python libraries: NumPy, Matplotlib, Pandas, SciPy
- R libraries: ggplot2, dplyr
- Inferential statistics
- Data Visualisation
- Experimental design
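A first pass at exploration often means nothing more than summary statistics and a check on how two variables move together. The sketch below does both on made-up numbers (how late you left versus how long you were stuck in traffic). The data and the `pearson` helper are illustrative assumptions; NumPy or Pandas offer the same calculations out of the box.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up observations: minutes late leaving home vs. minutes of traffic delay.
minutes_late = [0, 10, 20, 30, 40, 60]
delay = [5, 12, 18, 33, 41, 58]

summary = {
    "mean_delay": mean(delay),
    "stdev_delay": stdev(delay),
    "min_delay": min(delay),
    "max_delay": max(delay),
}
r = pearson(minutes_late, delay)

print(summary)
print(f"correlation = {r:.3f}")
```

A coefficient close to 1 would suggest that leaving later goes hand in hand with longer delays, which is the kind of pattern worth carrying into the modeling phase.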
Modeling (Machine Learning)
This is the fun part. Models are simply general rules in a statistical sense, and a machine learning model is just another tool in your toolkit. You have access to so many algorithms with different use-cases and objectives that a little research will lead you to one that fits your business needs.
After cleaning the data and finding out the essential features (in the EDA phase), using a statistical model as a predictive tool will enhance your overall decision making. Instead of looking back to see “what happened?”, predictive analytics aims to answer “what next?” and “how should we go about it?”.
- Machine Learning: Supervised/Unsupervised/Reinforcement learning algorithms
- Evaluation methods
- Machine Learning Libraries: Python (scikit-learn) / R (caret)
- Linear algebra & Multivariate Calculus
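As one small example of a supervised algorithm plus an evaluation method, here is a from-scratch k-nearest neighbors classifier scored by accuracy on held-out examples. The features, labels, and function names are made up for illustration, and a library such as scikit-learn would normally do this work; the point is only to show the shape of "train, predict, evaluate".

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of held-out examples that the classifier labels correctly."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)


# Made-up features: (minutes late leaving, rain flag) -> commute decision.
train = [
    ((0, 0), "go"), ((5, 0), "go"), ((10, 1), "go"),
    ((45, 0), "stay"), ((60, 1), "stay"), ((50, 1), "stay"),
]
test = [((8, 0), "go"), ((55, 0), "stay")]

prediction = knn_predict(train, (58, 1))   # "what next?" for a new morning
score = accuracy(train, test)              # how well does it do on unseen data?
print(prediction, score)
```

Keeping the test examples out of the training set matters: a model scored on data it has already seen will look better than it really is.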
Interpreting (Data Storytelling)
This is one of the more challenging tasks in the pipeline. Here, you aim to explain your findings through communication. At the end of the day, it’s all about connecting with your audience – and that is what makes storytelling key.
Your findings are hardly useful if you are not able to convey their significance to the non-tech bunch at your office, or even your boss, for that matter. A good way to get things under control is to rehearse a lot. Try framing a story around your findings and telling it to a layman (preferably a kid). If they understand it, so will your boss. And if they don’t, well, you know the line often attributed to Einstein:
“If you can’t explain it to a six-year-old, you don’t understand it yourself.”
This phase aims to derive true business insights. Your main challenge here is to visualize your findings and display them in a beautiful and understandable way.
- Knowledge of your business domain
- Data Visualisation tools: Tableau, D3.js, Matplotlib, ggplot2, Seaborn, etc.
- Communication: Presentation skills – both verbal and written.
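Even without a charting tool, the idea of "one clear takeaway plus a simple picture" can be sketched. The snippet below turns made-up average delays into a one-sentence headline and a plain-text bar chart. The helper names and the numbers are hypothetical, and a real report would more likely use one of the visualisation tools listed above.

```python
def text_bar_chart(values, width=20):
    """Render a dict of label -> number as aligned horizontal bars."""
    longest = max(len(label) for label in values)
    peak = max(values.values())
    lines = []
    for label, value in values.items():
        # Scale each bar relative to the largest value.
        bar = "#" * round(width * value / peak)
        lines.append(f"{label.ljust(longest)} | {bar} {value}")
    return "\n".join(lines)


def headline(values, unit):
    """One plain-language sentence naming the largest value."""
    worst = max(values, key=values.get)
    return f"{worst} has the longest average delay: {values[worst]} {unit}."


# Made-up findings from the earlier phases.
avg_delay = {"Mon": 35, "Tue": 10, "Wed": 50}
chart = text_bar_chart(avg_delay)
summary = headline(avg_delay, "minutes")

print(summary)
print(chart)
```

Leading with the sentence and following with the chart mirrors the storytelling advice: say what matters first, then show the evidence.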
This isn’t the end of our pipeline. If you’re to truly bring the best out of your system, you need to make sure you’re updating your model as and when the needs arise. In Data Science, one size does not fit all, and you’ll need to keep revisiting and updating your model.
Applications of Data Science
As is clear by now, Data Science is a broad term, and so are its applications. Almost every application on your smartphone thrives on data. So, it’s only fair to say that it’s practically impossible to list all the applications of Data Science because of its sheer omnipresence.
Let’s have a look at the broad fields that are using the magic of Data Science:
- Internet Search: How does Google return such *accurate* search results within a fraction of a second? Data Science!
- Recommendation Systems: From “people you may know” on Facebook or LinkedIn to “people who’ve bought this product also liked…” on Amazon to your daily curated playlists on Spotify to even “suggested videos” on YouTube, everything is fueled by Data Science.
- Image/Speech/Character Recognition: This pretty much goes without saying. What do you think is the brain behind “Siri”, if not Data Science? Also, how do you think Facebook recognizes your friend when you upload a photo with them? It’s not magic; it’s science – Data Science.
- Gaming: EA Sports, Sony, Nintendo, Zynga, and other giants in this domain have taken it upon themselves to take your gaming experience to an altogether new level. Games are now developed and improved using Machine Learning algorithms so that they can adapt as you move up to higher levels.
- Price Comparison Websites: These websites are fueled by data. For them, the more the merrier. The data is fetched from the relevant websites using APIs. PriceGrabber, PriceRunner, Junglee, and Shopzilla are some such websites.
If you’re from a tech background and have a little something for data, then Data Science is your true calling. The best part? There’s so much to do and explore in and around Data Science. It’s an umbrella term that covers a number of tools and technologies – mastering any one of which will make you an asset in the ever-increasing market of Data Science. UpGrad offers various courses on Data Science to keep you ahead of the curve. Don’t forget to check them out!
If you have any comments, concerns, or doubts about what you read, do let us know in the comments below!