Project Management in Data Science and Parkinson’s Law
In a Biology of Infectious Diseases course at the University of Virginia, Janis Antonovics made a short remark about Parkinson’s Law while answering questions about an upcoming final exam. It was ironic that I, an economics major, was learning about an economic adage from a biology professor. Nonetheless, Professor Antonovics piqued my curiosity and I quickly typed the phrase into Google. The Economist’s 1955 article states that “work expands so as to fill the time available for its completion.” If students were given two weeks to write a paper on herd immunity, they would take the full two weeks to complete the task. If a team of data scientists was given two months to analyze the Global Terrorism Database maintained by the University of Maryland, they would take the full two months to develop a thorough report. Both of these hypothetical cases adhere to Parkinson’s Law.
Data science has many rabbit holes, and you may find yourself at the bottom of one after a three-day data munging bender. These rabbit holes and random theories can be costly when it comes to actually producing a product. You cannot come to your boss and say the report isn’t finished because you were trying out a new pipeline system to speed up your process… and it didn’t work. That’s not acceptable.
So what can we learn from Parkinson’s Law? Your project is going to take the full amount of time you give it. I like to give myself checkpoints when designing data science projects, and I think they may be valuable to you, especially if you are the type of person to fall down a multitude of rabbit holes (as am I). I often work out the timeline backward from the deadline for each checkpoint. That way I am never severely crunched for time.
1) Checkpoint for exploratory data analysis (EDA). You can spend a lifetime learning about a dataset. Many subject matter experts will spend large portions of their lives learning about the nuances of certain data. Terrorism experts will know the Global Terrorism Database forward and backward. Data scientists who have been with the same company for a while will know their company’s data in depth. It is very unlikely that you will be able to understand the dataset like these people. Spend some time getting the lay of the land with the data and move on. You cannot spend your entire time learning about the different nuances of the dataset. You have a product to deploy!
2) Checkpoint for data cleaning, munging, and feature engineering. You will get data in many different forms. Sometimes the data is very clean and other times you will have to spend a large amount of time cleaning and munging. But remember that the data will never be perfectly clean. A perfect dataset doesn’t exist. Eventually, you will have to call it quits on the munging part and move on. I like to focus on the most important features and work backward from there.
3) Checkpoint for modeling. Grid searching every single parameter in your model is time-consuming and computationally expensive. And let’s be honest, you could tweak your model for the next year and at the end of it, you would still think there were improvements you could make. I often simply try any model first. My first model doesn’t need to be good. It just needs to work. Once I have a viable model, I can begin the tuning part. But building a model, any model, is the first and most important step when building a client-ready data science product.
4) Checkpoint for client products. This may be the most important part of your project and many new data scientists give it little thought. Your audience is most likely going to be non-technical. They are not going to care about how you used XGBoost to… and designed this brilliant feature by… This doesn’t matter to them. What matters is the product or analysis. You need to spend some time thinking about the sales pitch for your product. Making a good slide deck or writing up an executive summary of the project may be a good idea. Don’t spend all of your time building the project and forget about how people outside of the data science community will use it.
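To make the idea of working the timeline backward concrete, here is a minimal sketch in Python. It assumes a fixed start date and deadline and splits the time between the four checkpoints above. The function name plan_backward and the percentage shares are hypothetical choices for illustration, not a recommendation; adjust them to your own project.

```python
from datetime import date, timedelta

# Hypothetical share of the project given to each checkpoint (sums to 1.0).
CHECKPOINT_SHARES = [
    ("Exploratory data analysis", 0.15),
    ("Cleaning, munging, and feature engineering", 0.30),
    ("Modeling", 0.30),
    ("Client product", 0.25),
]


def plan_backward(start, deadline, shares=CHECKPOINT_SHARES):
    """Return (checkpoint name, due date) pairs, working backward from the deadline."""
    total_days = (deadline - start).days
    if total_days <= 0:
        raise ValueError("deadline must be after start")

    plan = []
    due = deadline
    # Start from the last checkpoint, which is due on the deadline itself,
    # and step back by each checkpoint's share of the total time.
    for name, share in reversed(shares):
        plan.append((name, due))
        due -= timedelta(days=round(total_days * share))

    plan.reverse()  # Present the checkpoints in chronological order.
    return plan


if __name__ == "__main__":
    for name, due in plan_backward(date(2024, 1, 1), date(2024, 3, 1)):
        print(f"{due.isoformat()}  {name}")
```

With a 60-day project running from January 1 to March 1, 2024, this puts the EDA checkpoint on January 10 and the client product on the deadline itself. The exact dates matter less than having them written down before you start.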
I want to leave you with a few additional points. Just because I have put a time cap on the above activities does not mean they are unimportant. You have to understand your data before starting a project. You must clean your data in order to get things to work. My point is that you need to get a product delivered and you cannot spend your entire project time on EDA. Additionally, I want to apologize if this blog post sounded preachy or accusatory. I didn’t intend for that. I am simply speaking to my younger self, attempting to help him create good data science products within specific time constraints. Parkinson’s Law states that it will take you the allotted time anyway, so you need to be conscious of time and project management. Good luck on your next data science project!