This is the first of a two-part series.
Part One — Building A Data Warehouse
Nowadays, everyone wants to build a data warehouse. But does one really need it? Even if you need it, how do you know you’re building the right thing and when are you really going to start to reaping early benefits from it?
But first things first, what is a data warehouse? Simply put, it’s a single place where you can store data from all sources. It helps one answer the questions that require complex analysis involving data from multiple sources. You can also build a data warehouse in a fashion that you get your most frequent data requirements taken care of quickly.
A year ago, we were struggling with this question at UpGrad — to build or not to build a data warehouse?
In order to answer this, and many other such questions, we talked to a lot of other people who had done it before. The first thing that we noticed was that to build a data warehouse (or DW), you need the right team of data engineers, architects, analysts and product managers. The first question we asked was — is it really worth that much investment?
To find the right answer, we need to ask ourselves the right set of questions. These questions might take a good deal of time and energy, but once you are done with these, you will be far more confident about whether to move ahead with DW or not. Here, we’ll provide the answers we got from our own exercise to enhance your understanding, and hopefully aid you in this process of deciding whether or not to set up your own data warehouse.
Question #1: What answers do you want to get from analytics/data? And at what frequency?
As you must have noted already, this is the most important question of all. You must involve other teams (Sales, Marketing, Business) while answering these questions to make sure you don’t miss anything.
What this meant for us: We wanted 3 important answers from analytics/data:
a. Which channels in marketing are performing well i.e. multi-channel attribution?
UpGrad’s marketing team uses different channels, both online and offline, for user acquisition. We conduct offline workshops and events for professionals seeking a career upgrade. We also use online channels like Facebook and Google to attract these professionals. So it becomes very important for us to know which channels are performing well, in order to craft our marketing strategy on a weekly, or even daily, basis. Further, we also want to know whether re-marketing or offline efforts have any effect on converting these users into paid students.
b. What does our conversion funnel look like?
Our funnel looks much larger than most companies. First visit — signup — application start — application submit — test-taken/exempted — shortlist — paid. It is critical to know what the funnel looks like based on multiple different features like city, age group, acquisition channel etc.
c. Can we predict whether a user will end up paying or not, i.e. lead scoring?
Lead scoring can be based on two things — fit and interest. The fit is determined by user attributes like years of experience, GRE/GMAT/CAT score etc. Interest is based on how active the user has been on the website, or how responsive the user is to calls or emails.
Apart from these, we wanted to:
d. Track every student performance in a course or program so that we can help them at the right time.
e. Monitor student’s ratings and reviews of the course content.
We got many more such questions from different teams… but you get the idea.
Question #2: Which of these answers are already provided by the current setup, or would require only minimal tweaks?
Asking this question will give you a good sense of current database capabilities. Make sure you have the right engineers in the room when you ask this (hint: most of these would be backend engineers in a startup who look after the transactional database).
What this meant for us:
a. Multi-channel attribution
Before making a purchase, visitors makes many visits through different channels. Sometimes they simply find you on Google and come to your website, and sometimes they come to attend an offline promotional event. So when a visitor finally buys the product, we want to be able to attribute which of the channels have been most effective. To do so, we have to merge both online and offline data* in one place and run different attribution models.
b. Conversion funnels
Our funnel again includes some offline components, like shortlists and tests which are uploaded manually into Salesforce by the counselling team. The funnel requires merging webstream data to Salesforce data.
c. Lead Scoring
Most of the lead scoring tools are basic. For example, you can score on the basis of events streamed in Pardot (by Salesforce). We needed a system which could merge data from Salesforce, web analytics, and emails to give a final score based on fit and interest.
d. Student performance
Since this data is stored in a transactional database, we could find a visualisation tool like BIME or Tableau to pull the data and create these tracking dashboards.
e. Student’s ratings and reviews
Same as (d) above.
So, we started building a data warehouse schema, keeping in mind a, b, and c. Many startups don’t require lead scoring and have only one source of data for conversion funnels and attribution. For those startups, a Business Intelligence (BI) tool is more effective than actually building a data warehouse.
Question #3: Will things look different as you scale over the next 1–2 years?
At scale, your transactional database might get very large and queries could get slower or start failing. You should plan for such situations as well, while designing the warehouse.
What this meant for us:
Our student activities database table will grow very fast as we add more courses and students. The queries have already started slowing down. It made sense to keep this in mind while designing the schema.
Question #4: Is there anywhere else you want to send the data that you want in your data warehouse?
The data stored in the warehouse might have many different use cases, apart from the principal one. These use cases help you think through the schema, and include additional fields, if needed, while building the schema.
What this meant for us:
The lead score is used by the counselling team, so we have to send this to Salesforce. The fit score of the lead scoring can also be used by a particular course team to auto-exempt them from the course. The attribution model is used by the marketing team, so we have to send it to the BI tool in a particular format.
Finally, Question #5: Do you have the right team to make decisions like:
- Which analytics database should you use, based on the scale and analytics use-cases?
- What should be the schema/data model for the current use cases? Is this schema scalable?
- What kind of ETL would be required for creating the analytics database? How much time would the ETL take?
- What would be the update frequency of different tables? How should you handle real-time use cases, like the one for recommendation engines?
You will need a data engineer, a senior engineer who has already worked with data for 3–5 years at least, and a data scientist to make many of these decisions.
After thinking through these 5 questions, a startup can decide whether or not to build a data warehouse. Here’s a simple list of pros and cons of a data warehouse, to help you evaluate even further:
- You will have full control over your data, and switch over to third-party tools easily as and when they get more expensive for you or don’t meet your requirements.
- You can build data science products! Recommendations, search, sentiment analysis, spam vs ham etc. Be careful and check beforehand whether you will require real-time data for these products, or they will need to be updated hourly/daily.
- As pointed out earlier, you can save a lot of time and troubles for analysts. The queries will be faster, and data will be reliable.
- You need to invest in engineering and data storage resources heavily, long before you can start reaping benefits.
- Chances are your first build will be far from perfect. If you are an early-to-mid-stage company, a lot of processes are still evolving. You can’t cover the cases that are going to come up in the next 3–6 months. Facing questions like why didn’t we think of that, etc might end up disheartening you. You will need to brush off these small setbacks, and keep your eyes on the long-term goal.
- Most organisations don’t have the right research and patience to build the Data Warehouse solution for their needs. You will need to invest a lot of time before starting it all up.
Once you have completed this exercise, I am pretty sure you will be ready to embark upon the data analytics journey for your startup and will avoid costly mistakes. Comment below and let us know if you liked this post or found it useful. Stay tuned for the next one!
*If we had only online channels, we could have used google analytics multi-channel attribution. We also have offline events data, which can be uploaded to google analytics. Problem solved? Alas! GA forbids you from sending any personally identifiable information. In absence of email information, it’s hard to link this data to other data sources, unless you map google analytics’ ID to emails in your own database, look up these IDs and upload offline data with these IDs into GA.