In October 2012, Thomas Davenport and DJ Patil made a landmark claim in that month’s issue of Harvard Business Review. They boldly declared the Data Scientist to hold the ‘sexiest job of the 21st century.’ While this claim is certainly debatable, there is no denying the surge of interest the nascent field has sparked in recent years.
Many major companies have started hiring Data Scientists and forming dedicated Data and Analytics Teams. A shortage of good Data Scientists, combined with high demand for them, has led some companies (such as Airbnb) to set up their own internal Data Science Universities.
The consensus is clear: Data is the currency of the 21st century. Companies that leverage data in their favor to create superior products will survive. The rest will perish. In such a scenario, it is easy to see why the Data Scientist is more important now than ever.
But who is a Data Scientist? The skeptics say it is just a fancy name for a Statistician. Others claim it is a Computer Scientist extremely competent in statistical modeling. My favorite definition happens to be the following:
Data Scientists are people who know more statistics than Computer Programmers and more programming than Statisticians.
In other words, it is a field that brings together tools from Computer Science, Statistics, and the particular domain that the data belongs to. Under such circumstances, it is easy to see why finding good data scientists is hard, to say the least. There simply aren’t enough people who are competent at all of these skills simultaneously.
This is one of the major reasons why beginners find the prospect of learning Data Science so overwhelming. Do I have to know calculus? How hard is the math? Should I learn how to program first? What if I’m not very good at building software?
In this article, I will attempt to offer one path towards learning Data Science: that of the Python Programming Language. While this is in no way going to make you a star data scientist, it will put you en route towards that very goal.
Most data science projects (assuming you already have the raw data) involve the following components:
Data Cleaning and Wrangling
Exploratory Data Analysis and Visualization
Building and Deploying Machine Learning Models
We will look at these steps one by one, glancing at the tools available to us and at good resources for learning those tools.
We have already emphasised that Statistics and Computer Science are integral components of Data Science. As a prerequisite, it is also important for you to have a basic knowledge of linear algebra and programming.
This learning path will assume you are coding in the Python Programming Language. Therefore, it is important that you know how to code in Python. The good news is that Python is extremely easy to learn, especially for people who have never programmed before. Its syntax is intuitive and readable, and the learning curve is gentle.
Python, being an interpreted language, is typically much slower than lower-level languages such as C and C++. To offset this handicap, we will be using powerful Scientific Libraries whose core routines are written in C and C++. On top of that, we will apply techniques such as vectorisation, which replaces explicit Python loops with operations on whole arrays, to speed up the computation process.
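To make the idea of vectorisation concrete, here is a minimal sketch using NumPy, one of the scientific libraries in question. The array of values and the variable names are made up for illustration, and the timings you see will depend on your machine.

```python
import time

import numpy as np

# 200,000 made-up values, purely for illustration.
values = np.arange(200_000, dtype=np.float64)

# Pure-Python loop: the interpreter handles every element one at a time.
start = time.perf_counter()
loop_total = 0.0
for v in values:
    loop_total += v * v
loop_seconds = time.perf_counter() - start

# Vectorised version: a single call whose loop runs inside NumPy's compiled code.
start = time.perf_counter()
vector_total = float(np.dot(values, values))
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorised: {vector_seconds:.4f}s")
```

Both versions compute the same sum of squares; the vectorised one simply hands the whole array to compiled code instead of looping in Python.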
The aforementioned libraries don’t come bundled with Python. However, they can be downloaded all at once as a distribution (Python included) through Anaconda, offered by Continuum Analytics (since renamed Anaconda, Inc.). This will give you all the tools you need to follow this path. Go ahead and download it here: Anaconda Download.
The Python Programming Language
As I have already mentioned, Python is an extremely easy language to learn. There are plenty of amazing resources out there to learn the language.
Here are my top favorites:
Keep in mind that you do not have to be an expert in the language. For now, learning the basics of programming and the Python syntax will do. Going through any of the above tutorials or books should suffice.
In order to understand the logic and algorithms in Machine Learning, it is important that you have a good understanding of Linear Algebra. Khan Academy’s course on the subject is an excellent and sufficient resource.
Real-world data rarely arrives in a form suitable for analysis or computation. Data Cleaning and Wrangling, simply put, is the process of transforming unclean and malformed data into a form that is suitable for a particular piece of analysis.
The data wrangling tool of choice in Python is the Pandas library. Pandas gives us access to extremely powerful data structures called DataFrames, which make the data wrangling and analysis process substantially faster and simpler. It is an open secret that data scientists spend more than 70% of their time collecting and wrangling data. Becoming proficient in Pandas, therefore, is well worth the investment.
My recommended resources for learning Pandas are:
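As a rough illustration of what wrangling with a DataFrame looks like, the sketch below cleans a tiny made-up table. The column names and values are hypothetical; real data will need its own set of steps.

```python
import pandas as pd

# Made-up raw records with typical problems: inconsistent text,
# numbers stored as strings, a missing value and a duplicate row.
raw = pd.DataFrame({
    "city": [" london", "Paris ", "paris", "Berlin", "Berlin"],
    "price": ["100", "250", None, "80", "80"],
})

clean = (
    raw
    .assign(
        # Trim whitespace and normalise capitalisation.
        city=raw["city"].str.strip().str.title(),
        # Convert text to numbers; anything unparseable becomes missing.
        price=pd.to_numeric(raw["price"], errors="coerce"),
    )
    .dropna(subset=["price"])   # drop rows without a price
    .drop_duplicates()          # drop exact repeats
    .reset_index(drop=True)
)

# With clean data, analysis becomes a one-liner.
mean_price_by_city = clean.groupby("city")["price"].mean()

print(clean)
print(mean_price_by_city)
```

Each step is a single method call on the DataFrame, which is a large part of why Pandas makes this stage faster than writing the equivalent loops by hand.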
The power of the data scientist lies in the ability to extract information from data. And often, the best way to get that information and gain insights is to visualise the data well.
Visualisation is also the most important step when it comes to communicating your story and results to non-technical people. Good visuals and graphs make a much more compelling case than dry numbers.
Python’s de facto visualisation library is Matplotlib. However, Matplotlib is notorious for being difficult to use. To address this criticism, the Seaborn library was created on top of Matplotlib, and it makes creating graphs and visuals far simpler.
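Here is a small sketch of how the two libraries work together, assuming both are installed (they ship with Anaconda). The data, the column names and the output filename `scores.png` are all made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up measurements, purely for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score": [52, 55, 61, 64, 70, 74, 79, 85],
    "group": ["A", "B", "A", "B", "A", "B", "A", "B"],
})

fig, ax = plt.subplots(figsize=(6, 4))

# One Seaborn call draws the points, colours them by group,
# labels the axes and adds a legend.
sns.scatterplot(data=df, x="hours_studied", y="exam_score", hue="group", ax=ax)

# The result is still a Matplotlib Axes, so Matplotlib methods work on it.
ax.set_title("Exam score against hours studied")
fig.savefig("scores.png")
```

Seaborn takes the DataFrame and the column names directly, while the figure itself remains an ordinary Matplotlib object that you can customise further.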
Some resources to learn about the aforementioned libraries:
The final and the most glamorous part of data science is predictive modeling and machine learning. This is the part that actually makes data-driven systems ‘intelligent’.
Machine Learning can be a complex subject with a steep learning curve. However, Python’s Scikit-Learn library hides most of the implementation details of the major Machine Learning algorithms and makes training a model as easy as typing out a couple of lines of code.
That said, I believe it is very important to know the basic logic underlying the algorithm you are using, so that you pick the right algorithm for the problem and give it the right parameters.
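To show how short the training code can be, here is a minimal sketch that fits a k-nearest-neighbours classifier on the Iris dataset bundled with Scikit-Learn. The choice of algorithm, the `n_neighbors=5` setting and the 75/25 split are arbitrary assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small example dataset that ships with Scikit-Learn.
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the rows to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Training really is a couple of lines: create the model, then fit it.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Fraction of held-back samples classified correctly.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

The `n_neighbors` argument is exactly the kind of parameter that is hard to choose sensibly without knowing how the algorithm works.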
These are my favorite resources for learning machine learning in Python:
- SciPy Machine Learning Tutorial: Part 1
- SciPy Machine Learning Tutorial: Part 2
- Python Machine Learning
With this, you are now in a good position to get your hands dirty with real life Data Science Projects!
One strongly recommended next step is Kaggle Competitions. You can make submissions to Kaggle Contests for Beginners such as Titanic: Machine Learning from Disaster and Predicting Housing Prices to get started.
Hopefully, this article has reduced, if not eliminated, some of your confusion about how to get started with Data Science. The road ahead might be challenging, but it is also incredibly exciting. So, go ahead. There has never been a better time to be a data scientist, the ‘sexiest’ role of the century.