(This article was originally published on Viget.com.)
A challenge that I’ve been wrestling with is the lack of a widely adopted framework or systematic approach to solving data science problems. In our analytics work at Viget, we use a framework inspired by Avinash Kaushik’s Digital Marketing and Measurement Model, and we apply it on almost every project we undertake. I believe data science could use a similar framework that organizes and structures the data science process.
As a start, I want to share the questions we like to ask when solving a data science problem. Even though some of the questions are not specific to the data science domain, they help us efficiently and effectively solve problems with data science.
What is the problem we are trying to solve?
That’s the most logical first step to solving any question, right? We have to be able to articulate exactly what the issue is. Start by writing down the problem without going into the specifics, such as how the data is structured or which algorithm we think could effectively solve the problem.
Then try explaining the problem to your niece or nephew, who is a freshman in high school. It is easier than explaining the problem to a third-grader, but you still can’t dive into statistical uncertainty or convolutional versus recurrent neural networks. The act of explaining the problem at a high school stats and computer science level makes your problem, and the solution, accessible to everyone within your or your client’s organization, from the junior data scientists to the Chief Legal Officer.
Clearly defining our business problem showcases how data science is used to solve real-world problems. This high-level thinking provides us with a foundation for solving the problem. Here are a few other questions we should think about when defining the business problem.
Who are the stakeholders for this project?
Have we solved similar problems before?
Has someone else documented solutions to similar problems?
Can we reframe the problem in any way?
And don’t be fooled by these deceptively simple questions. Sometimes more generalized questions can be very difficult to answer. But we believe answering these framing questions is the first, and possibly most important, step in the process, because it makes the rest of the effort actionable.
Say we work at a video game company — let’s call the company Rocinante. Our business is built on customers subscribing to our massive online multiplayer game. Users are billed monthly. We have data about users who have cancelled their subscription and those who have continued to renew month after month. Our management team wants us to analyze our customer data.
What is the problem we are trying to solve?
Well, as a company, Rocinante wants to be able to predict whether or not customers will cancel their subscription. We want to be able to predict which customers will churn, in order to address the core reasons why customers unsubscribe. Additionally, we need a plan to target specific customers with more proactive retention strategies.
Churn is the turnover of customers, also referred to as customer death. In a contractual setting, such as when a user signs a contract to join a gym, a customer “dies” when they cancel their membership. In a non-contractual setting, customer death is not observed and is more difficult to model. For example, Amazon does not know when you have decided never to purchase Adidas products again. Your customer death as an Amazon or Adidas customer is implied.
What are the approaches we can use to solve this problem?
There are many instances when we shouldn’t be using machine learning to solve a problem. Remember, data science is one of many tools in the toolbox. There could be a simpler, and maybe cheaper, solution out there. Maybe we could answer a question by looking at descriptive statistics around web analytics data from Google Analytics. Maybe we could solve the problem with user interviews and hear what the users think in their own words. This question aims to see if spinning up EC2 instances on Amazon Web Services is worth it. If the answer to “Is there a simple solution?” is “No,” then we can ask, “Can we use data science to solve this problem?” This yes-or-no question brings about two follow-up questions:
“Is the data available to solve this problem?” A data scientist without data is not a very helpful individual. Many of the data science techniques that are highlighted in media today, such as deep learning with artificial neural networks, require a massive amount of data. A hundred data points are unlikely to be enough to train and test a model. If the answer to this question is no, then we can consider acquiring more data and pipelining that data to warehouses, where it can be accessed at a later date.
“Who are the team members we need in order to solve this problem?” Your initial answer to this question may be, “The data scientist, of course!” But the vast majority of the problems we face at Viget can’t or shouldn’t be solved by a lone data scientist, because we are solving business problems. Our data scientists team up with UXers, designers, developers, project managers, and hardware developers to develop digital strategies, and solving data science problems is one part of that strategy. Siloing your problem and siloing your data scientists isn’t helpful for anyone.
We want to predict when a customer will unsubscribe from Rocinante’s flagship game. One simple approach to solving this problem would be to take the average customer lifetime (how long a gamer remains subscribed) and predict that all customers will churn after that amount of time. Say our data showed that, on average, customers churned after 72 months of subscription. Then we could predict that a new customer would churn after 72 months of subscription. We test this hypothesis on new data and learn that it is wildly inaccurate. The average customer lifetime for our previous data was 72 months, but our new batch of data had an average customer lifetime of 2 months. Users in the second batch of data churned much faster than those in the first batch. Our prediction of 72 months didn’t generalize well. Let’s try a more sophisticated approach using data science.
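To make the weakness of the average-lifetime baseline concrete, here is a minimal sketch in Python. The lifetimes are made-up numbers chosen only to reproduce the 72-month and 2-month averages from the example, and the function names are hypothetical.

```python
from statistics import mean


def fit_mean_lifetime(lifetimes_months):
    """Return the average customer lifetime, used as a one-number 'model'."""
    return mean(lifetimes_months)


def mean_absolute_error(prediction, actual_lifetimes):
    """Average gap, in months, between the single prediction and what happened."""
    return mean(abs(prediction - actual) for actual in actual_lifetimes)


# Hypothetical lifetimes in months; illustrative values, not real Rocinante data.
first_batch = [60, 72, 84, 66, 78]  # averages 72 months
second_batch = [1, 2, 3, 2, 2]      # averages 2 months

baseline = fit_mean_lifetime(first_batch)
error_on_new_data = mean_absolute_error(baseline, second_batch)

print(f"Predicted lifetime for every customer: {baseline} months")
print(f"Mean absolute error on the new batch: {error_on_new_data} months")
```

A single average ignores everything we know about individual customers, so it can only be as good as the assumption that the next batch looks like the last one.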
Is the data available to solve this problem? The dataset contains 12,043 rows of data and 49 features. We determine that this sample of data is large enough for our use-case. We don’t need to deploy Rocinante’s data engineering team for this project.
Who are the team members we need in order to solve this problem? Let’s talk with Rocinante’s data engineering team to learn more about their data collection process. We could learn about biases in the data from the data collectors themselves. Let’s also chat with the customer retention and acquisition team and hear about their tactics to reduce churn. Our job is to analyze data that will ultimately impact their work. Our project team will consist of the data scientist to lead the analysis, a project manager to keep the project team on task, and a UX designer to help facilitate research efforts we plan to conduct before and after the data analysis.
How do we know if we have successfully solved the problem?
At Viget, we aim to be data-informed, which means we aren’t blindly driven by our data, but we are still focused on quantifiable measures of success. Our data science problems are held to the same standard. What are the ways in which this project could be a success? What are the ways in which this project could be a complete and utter failure? We often have specific success metrics and Key Performance Indicators (KPIs) that help us answer these questions.
Our UX coworker has interviewed some of the other stakeholders at Rocinante and some of the gamers who play our game. Our team believes if our analysis is inconclusive, and we continue the status quo, the project would be a failure. The project would be a success if we are able to predict a churn risk score for each subscriber. A churn risk score, coupled with our monthly churn rate (the rate at which customers leave the subscription service per month), will be useful information. The customer acquisition team will have a better idea of how many new users they need to acquire in order to keep the number of customers the same, and how many new users they need in order to grow the customer base.
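The acquisition arithmetic described above can be sketched in a few lines of Python. This is an illustration under simple assumptions: the churn rate is cancellations divided by subscribers at the start of the month, and all counts and function names are hypothetical.

```python
import math


def monthly_churn_rate(customers_at_start, customers_lost):
    """Share of the customers at the start of the month who cancelled during it."""
    return customers_lost / customers_at_start


def new_users_needed(customers_at_start, churn_rate, target_growth_rate=0.0):
    """New users required to replace expected churn and reach the growth target."""
    expected_losses = customers_at_start * churn_rate
    desired_net_gain = customers_at_start * target_growth_rate
    # Round away floating-point noise before rounding up to whole users.
    return math.ceil(round(expected_losses + desired_net_gain, 6))


# Hypothetical month: 10,000 subscribers at the start, 400 cancellations.
rate = monthly_churn_rate(10_000, 400)
print(f"Monthly churn rate: {rate:.1%}")
print("New users to stay flat:", new_users_needed(10_000, rate))
print("New users to grow 2%:", new_users_needed(10_000, rate, target_growth_rate=0.02))
```

Per-subscriber churn risk scores would let the team replace the single historical rate with an expected number of cancellations for the specific customers they have today.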
What do we need to learn about the data and what analysis do we need to conduct?
At the heart of solving a data science problem are hundreds of questions. I attempted to ask these and similar questions last year in a blog post, Data Science Workflow. Below are some of the most crucial — they’re not the only questions you could face when solving a data science problem, but are ones that our team at Viget thinks about on nearly every data problem.
What do we need to learn about the data?
What type of exploratory data analysis do we need to conduct?
Where is our data coming from?
What is the current state of our data?
Is this a supervised or unsupervised learning problem?
Is this a regression, classification, or clustering problem?
What biases could our data contain?
What type of data cleaning do we need to do?
What type of feature engineering could be useful?
What algorithms or types of models have been proven to solve similar problems well?
What evaluation metric are we using for our model?
What is our training and testing plan?
How can we tweak the model to make it more accurate, increase the ROC AUC, decrease log-loss, etc.?
Have we optimized the various parameters of the algorithm? Try grid search here.
Is this ethical?
That last question raises the conversation about ethics in data science. Unfortunately, there is no Hippocratic oath for data scientists, but that doesn’t excuse unethical behavior in the data science industry. We should apply ethical considerations to our standard data science workflow. Ethics in data science deserves more than a paragraph in this article, but I wanted to highlight that we should be cognizant of it and practice only ethical data science.
Let’s get started with the analysis. It’s time to answer the data science questions. Because this is an example, the answers to these data science questions are entirely hypothetical.
We need to learn more about the time series nature of our data, as well as the format.
We should look into average customer lifetime durations and summary statistics around some of the features we believe could be important.
Our data came from login data and customer data, compiled by Rocinante’s data engineering team.
The data needs to be cleaned, but it is conveniently in a PostgreSQL database.
This is a supervised learning problem because we know which customers have churned.
This is a binary classification problem.
After conducting exploratory data analysis and speaking with the data engineering team, we do not see any biases in the data.
We need to reformat some of the data and use missing data imputation for features we believe are important but have some missing data points.
With 49 good features, we don’t believe we need to do any feature engineering.
We have used random forests, XGBoost, and standard logistic regression to solve similar classification problems.
We will use ROC-AUC score as our evaluation metric.
We are going to use a training-test split (80% training, 20% test) to evaluate our model.
Let’s remove features that are statistically insignificant from our model to improve the ROC-AUC score.
Let’s optimize the parameters within our random forest model to improve the ROC-AUC score.
Our team believes we are acting ethically.
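Two of the answers above, the 80/20 training-test split and the ROC-AUC metric, are easy to illustrate. The sketch below uses only the Python standard library so that the mechanics are visible; in practice we would more likely reach for a library such as scikit-learn. The labels, scores, and function names are hypothetical.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into training and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def roc_auc(labels, scores):
    """Probability that a random churner is scored above a random non-churner.

    Labels are 1 for churned and 0 for retained; tied scores count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one churner and one non-churner")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical customer ids, split 80% training and 20% test.
train_ids, test_ids = split_train_test(range(10))
print(len(train_ids), "training rows,", len(test_ids), "test rows")

# Hypothetical true labels and model scores for six held-out customers.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
print(f"ROC AUC: {roc_auc(labels, scores):.3f}")
```

A score of 0.5 means the model ranks churners no better than chance, and 1.0 means every churner is ranked above every retained customer.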
This process may look deceptively linear, but data science is often a nonlinear practice. After doing all of the work in our example above, we could still end up with a model that doesn’t generalize well. It could be bad at predicting churn in new customers. Maybe we shouldn’t have assumed this problem was a binary classification problem and instead used survival regression to solve the problem. This part of the project will be filled with experimentation, and that’s totally normal.
What is the best way to communicate and circulate our results?
Our job is typically to bring our findings to the client, explain how the process was a success or failure, and explain why. Explaining technical details to non-technical audiences is important because not all of our clients have degrees in statistics. There are four ways in which communication of technical details can be advantageous:
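To show why a survival framing differs from binary classification, here is a minimal sketch of the Kaplan-Meier estimator. It is a simpler relative of survival regression: it uses no customer features, but it shows the key idea of treating still-subscribed customers as censored observations rather than as customers who will never churn. The durations and the function name are hypothetical.

```python
def kaplan_meier(durations, churned):
    """Return [(month, survival_probability)] for each month with a churn event.

    durations: months each customer has been observed.
    churned:   True if the customer cancelled at that month, False if they
               were still subscribed when observation ended (censored).
    """
    event_times = sorted({t for t, c in zip(durations, churned) if c})
    survival = 1.0
    curve = []
    for t in event_times:
        # Customers still subscribed and under observation just before month t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, c in zip(durations, churned) if d == t and c)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: six customers, three of whom are still subscribed.
durations = [1, 2, 2, 3, 4, 5]
churned = [True, True, False, True, False, False]

for month, probability in kaplan_meier(durations, churned):
    print(f"Month {month}: estimated share still subscribed = {probability:.3f}")
```

A binary classifier would have to label the three still-subscribed customers as non-churners, while the survival view only claims that they have not churned yet.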
It can be used to inspire confidence that the work is thorough and multiple options have been considered.
It can highlight technical considerations or caveats that stakeholders and decision-makers should be aware of.
It can offer resources to learn more about specific techniques applied.
It can provide supplemental materials to allow the findings to be replicated where possible.
We often use blog posts and articles to circulate our work. They help spread our knowledge and the lessons we learned while working on a project to peers. I encourage every data scientist to engage with the data science community by attending and speaking at meetups and conferences, publishing their work online, and extending a helping hand to other curious data scientists and analysts.
Our method of binary classification was in fact incorrect, so we ended up using survival regression to determine there are four features that impact churn: gaming platform, geographical region, days since last update, and season. Our team aggregates all of our findings into one report, detailing the specific techniques we used, caveats about the analysis, and the multiple recommendations from our team to the customer retention and acquisition team. This report is full of the nitty-gritty details that the more technical folks, such as the data engineering team, may appreciate. Our team also creates a slide deck for the less-technical audience. This deck glosses over many of the technical details of the project and focuses on recommendations for the customer retention and acquisition team.
We give a talk at a local data science meetup, going over the trials, tribulations, and triumphs of the project and sharing them with the data science community at large.
Why are we doing all of this?
I ask myself this question daily — and not in the metaphysical sense, but in the value-driven sense. Is there value in the work we have done and in the end result? I hope the answer is yes. But, let’s be honest, this is business. We don’t have three years to put together a PhD thesis-like paper. We have to move quickly and cost-effectively. Critically evaluating the value ultimately created will help you refine your approach to the next project. And, if you didn’t produce the value you’d originally hoped, then at the very least, I hope you were able to learn something and sharpen your data science skills.
Rocinante has a better idea of how long our users will remain active on the platform based on user characteristics, and can now launch preemptive strikes in order to retain those users who look like they are about to churn. Our team eventually develops a system that alerts the customer retention and acquisition team when a user may be about to churn, and they know to reach out to that user, via email, encouraging them to try out a new feature we recently launched. Rocinante is making better data-informed decisions based on this work, and that’s great!
I hope this article will help guide your next data science project and get the wheels turning in your own mind. Maybe you will be the creator of a data science framework the world adopts! Let me know what you think about the questions, or whether I’m missing anything, in the comments below.