This blog was originally published on Medium by Aiswarya Ramachandran – an alumnus of UpGrad’s Data Science program with IIIT-Bangalore.
In one of my previous posts on Medium, I wrote about how to scrape search results for a particular query string from Medium. In this post, we will analyze the data scraped for the search term “Data Science”, group posts into different levels of popularity based on their number of claps and responses, and try to understand what makes these posts popular.
The data scraped from Medium search results was a JSON file with extensive data about each search result. To explore the structure of the JSON file, I used Notepad++ with a JSON plugin. The JSON file had data about the posts, the author of each post and the publisher associated with that post (if any). Here’s the JSON data structure for a Medium post:
The code to extract data from the JSON file can be found here. In addition to extracting data from the JSON file, I also added a field with the date when the post was scraped.
Exploratory Analysis of Posts Related to “Data Science”
Scraping results for the search term “Data Science” returned 831 posts, of which 31 were responses to a post and were excluded from the analysis. The scraped data covers March 2013 to April 2018. Here is the number of posts published per year:
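The filtering and yearly count described above can be sketched roughly as follows. This is only an illustration: the field names `is_response` and `first_published_ms` are hypothetical stand-ins rather than Medium's actual JSON keys, the sample records are made up, and the publication date is assumed to be stored as epoch milliseconds.

```python
from collections import Counter
from datetime import datetime, timezone

def posts_per_year(posts):
    """Count original posts (responses excluded) by year of first publication."""
    originals = [p for p in posts if not p["is_response"]]
    years = (
        datetime.fromtimestamp(p["first_published_ms"] / 1000, tz=timezone.utc).year
        for p in originals
    )
    # Sort by year so the result reads chronologically.
    return dict(sorted(Counter(years).items()))

# Made-up records for illustration only.
sample_posts = [
    {"is_response": False, "first_published_ms": 1363132800000},  # March 2013
    {"is_response": True, "first_published_ms": 1500000000000},   # a response, dropped
    {"is_response": False, "first_published_ms": 1522540800000},  # April 2018
    {"is_response": False, "first_published_ms": 1523000000000},  # April 2018
]

yearly_counts = posts_per_year(sample_posts)
print(yearly_counts)
```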
All the date fields, such as Created Date, First Published Date and Last Updated Date, were in milliseconds elapsed since January 1, 1970 (the Unix epoch). They were converted into a human-readable date format using the function below:
# Function to convert an epoch date (in milliseconds) to a human-readable format
from datetime import datetime, timedelta

def convertToDateString(date):
    return (datetime(1970, 1, 1) + timedelta(milliseconds=date)).strftime("%Y-%m-%d %H:%M:%S")
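For readers who prefer to make the time zone explicit, here is a small variant of the same conversion along with a usage example. It is a sketch, not the code used for the analysis: the function name `epoch_ms_to_string` and the record keys are illustrative assumptions, and the output is expressed in UTC.

```python
from datetime import datetime, timezone

def epoch_ms_to_string(ms):
    """Convert milliseconds since the Unix epoch to a 'YYYY-MM-DD HH:MM:SS' string in UTC."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Hypothetical record; the keys are illustrative, not Medium's actual field names.
record = {"createdAt": 1522540800000, "firstPublishedAt": 1522544400000}

# Convert every date field of the record in one pass.
readable = {field: epoch_ms_to_string(ms) for field, ms in record.items()}
print(readable)
```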
The next step was to look at which words occurred most often in the titles of these posts. As you can see from the word cloud below, Data Science, Big Data, AI, Analytics, Machine Learning, Python and self-driven (about self-driving cars) are some of the most frequently occurring words.
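The counts behind a word cloud like this can be approximated with a simple frequency table. The sketch below is an assumption about how one might do it, not the method used here: it counts single words only (so "Data Science" shows up as "data" and "science"), the stop-word list is a tiny hand-picked one, and the sample titles are made up.

```python
import re
from collections import Counter

# A small, hand-picked stop-word list for this sketch; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}

def title_word_counts(titles):
    """Count lowercase words across all titles, skipping common stop words."""
    counts = Counter()
    for title in titles:
        # Keep alphabetic words, allowing internal hyphens such as "self-driving".
        words = re.findall(r"[a-z][a-z\-]*", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

# Made-up titles for illustration only.
sample_titles = [
    "Data Science for Beginners",
    "Machine Learning in Python",
    "The Future of Data Science and AI",
]

word_counts = title_word_counts(sample_titles)
print(word_counts.most_common(5))
```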
The distributions of the number of claps and the number of responses are highly skewed: 708 of the 800 posts have fewer than 500 claps. This shows that only a few posts become popular. Here’s the distribution of claps:
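One quick way to check for this kind of skew is to compare the mean with the median and to compute the share of posts below a threshold. The sketch below uses made-up clap counts, and `skew_summary` is a hypothetical helper name, so treat it as an illustration rather than the analysis itself.

```python
from statistics import mean, median

def skew_summary(claps, threshold=500):
    """Summarize a list of clap counts: share below the threshold, mean and median."""
    below = sum(1 for c in claps if c < threshold)
    return {
        "share_below_threshold": below / len(claps),
        "mean": mean(claps),
        "median": median(claps),
    }

# Made-up clap counts: most posts are small, a couple are very large.
sample_claps = [3, 10, 25, 40, 80, 150, 300, 450, 2500, 12000]

summary = skew_summary(sample_claps)
print(summary)
```

When a few very large values pull the mean far above the median, as in this sample, the distribution is right-skewed.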
The Reading Time of most articles is between 1 and 3 minutes.
On Medium, each post can have a maximum of 5 tags. Tags help readers find content more easily: the more relevant the tags, the easier the post is to find. As we can see in the image, Data Science is the most frequently used tag, followed by Machine Learning, Big Data and Artificial Intelligence. Here are the top 10 tags related to data science:
Creating Clusters Based on User Responses
There are three metrics that measure how popular a post is on Medium: #Claps, #Responses and #Recommends. To make a fair comparison, I also included the number of days between the First Published date and the data collection date as a feature. On this feature set, I applied k-means clustering and identified three clusters. As we can see from the image below, there is a huge difference in the three metrics across clusters (Popularity Groups). We can also see that the less popular posts have very low engagement even though their median number of days between publishing and scraping is the highest. Here are the metrics across clusters (Popularity Groups):
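To make the clustering step concrete, here is a minimal, standard-library-only sketch of k-means on the four features. Several things in it are assumptions of the sketch and not necessarily what was done for this analysis: the features are standardized before clustering, the centroids are seeded with a deterministic farthest-point rule, and the sample rows are made up. In practice, a library implementation such as scikit-learn's `KMeans` would be a typical choice.

```python
import math

def standardize(rows):
    """Rescale every column to zero mean and unit variance (an assumption of this sketch)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0  # avoid dividing by zero
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then keep adding the point
    that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up rows: [claps, responses, recommends, days since first published]
sample_features = [
    [5, 0, 2, 900], [12, 1, 4, 950], [8, 0, 3, 920],
    [900, 10, 150, 400], [1100, 14, 180, 380], [1000, 12, 160, 420],
    [20000, 150, 3000, 200], [21000, 155, 3100, 190], [20500, 152, 3050, 195],
]

labels = kmeans(standardize(sample_features), k=3)
print(labels)
```

Standardizing matters here because claps are on a much larger scale than responses or days, and unscaled Euclidean distance would be dominated by claps alone.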
Understanding What Makes a Data Science Post Popular
As we can see from the image below, the medians for high and medium popularity articles are 9 and 7. They also have more links compared to less popular articles. This suggests that popular posts refer to other posts and other sources of information, adding more value to the content.
Difference between Popular and Non-Popular Posts
From the image above, we can also see that posts with medium popularity are closer to the highly popular group than to the less popular group.
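A per-group comparison like this boils down to computing a median for each popularity group. The sketch below shows one way to do that; the helper name `median_by_group`, the record keys `popularity` and `links`, and the sample values are all hypothetical and are not the figures from the analysis.

```python
from collections import defaultdict
from statistics import median

def median_by_group(records, group_key, value_key):
    """Return the median of `value_key` for each distinct value of `group_key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {group: median(values) for group, values in groups.items()}

# Made-up records for illustration only.
sample_records = [
    {"popularity": "high", "links": 12},
    {"popularity": "high", "links": 15},
    {"popularity": "high", "links": 9},
    {"popularity": "medium", "links": 6},
    {"popularity": "medium", "links": 8},
    {"popularity": "medium", "links": 10},
    {"popularity": "low", "links": 1},
    {"popularity": "low", "links": 2},
    {"popularity": "low", "links": 0},
]

link_medians = median_by_group(sample_records, "popularity", "links")
print(link_medians)
```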
With a simple k-means, we were able to identify popular and non-popular posts on Medium related to Data Science.