Lessons from Chicago’s West Nile Problem
For our fourth project in General Assembly’s Data Science Immersive program, three other data scientists and I entered Kaggle’s West Nile virus competition. The competition launched in 2015 and asked entrants to predict the presence of West Nile virus in Chicago mosquitos. We were given weather data, insecticide spray data, and mosquito trap data from the City of Chicago and the Chicago Department of Public Health. All of the code and models my team built for the competition can be found on my public GitHub. Instead of walking through the project, I want to go through three key lessons I learned over the course of it. Specifically, I want to talk about feature engineering, the data pipeline, and subject matter expertise.
Feature Engineering:
My team earned the highest score in our cohort, and I attribute our success primarily to feature engineering. We needed to find creative ways to predict the presence of West Nile virus beyond what the raw weather and trap data offered. One of my team members, Matt Terribile, engineered cumulative weather features based on a hypothesis he developed from years of camping experience. Terribile believed cumulative precipitation and heat would have a larger effect on mosquito presence than the weather recorded on the days the traps were checked. His hypothesis was supported when our random forest model assigned high feature importance to his engineered variables. I believe these engineered features were a major factor in our model’s success.
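To make the idea of cumulative weather features concrete, here is a minimal sketch using pandas rolling windows. It is not the code my team wrote: the column names (`date`, `precip`, `temp_avg`), the helper `add_cumulative_weather`, and the window lengths are all assumptions chosen for illustration, and the sample values are made up.

```python
import pandas as pd


def add_cumulative_weather(weather: pd.DataFrame, windows=(7, 14)) -> pd.DataFrame:
    """Add trailing-window precipitation totals and mean temperatures.

    Assumes one row per day with hypothetical columns
    'date', 'precip', and 'temp_avg'.
    """
    out = weather.sort_values("date").reset_index(drop=True).copy()
    for days in windows:
        # Trailing window that ends on (and includes) the current day.
        # min_periods=1 lets the first few rows use whatever history exists.
        out[f"precip_sum_{days}d"] = out["precip"].rolling(days, min_periods=1).sum()
        out[f"temp_mean_{days}d"] = out["temp_avg"].rolling(days, min_periods=1).mean()
    return out


# Tiny made-up example, one row per day.
weather = pd.DataFrame({
    "date": pd.date_range("2013-07-01", periods=5, freq="D"),
    "precip": [0.0, 0.5, 0.0, 1.0, 0.25],
    "temp_avg": [70.0, 74.0, 78.0, 80.0, 76.0],
})
features = add_cumulative_weather(weather, windows=(3,))
print(features[["date", "precip_sum_3d", "temp_mean_3d"]])
```

Features like these could then be joined to the trap records by date, so each trap check carries the weather that built up beforehand rather than only that day’s reading.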
Data Pipelines:
The data we received from Kaggle was relatively clean, and my teammate Chaim Gluck quickly accounted for missing and trace values. Raw data in the real world tends to be in worse shape than Kaggle data. Overall, this experience taught me about the data pipeline. Getting data from the source (the chief job of many data engineers) comes first. Then the data needs to be preprocessed and cleaned. Data munging is definitely not among my favorite aspects of data science, but this step is critical. My team found itself coming back to the munging and wrangling step many times, and I have no doubt this is an exercise we will all repeat in industry. After we munged enough data, we moved on to feature engineering (since I covered it in the previous section, I will skip the details here). Modeling and machine learning are my favorite parts of data science, but by the time we reached this step, the vast majority of our work was finished. I tweaked various features and tried a wide variety of modeling techniques to maximize our ROC-AUC score on Kaggle. This project taught me a lot about the data pipeline and the workflow of a major project involving several people. It was a great experience in teamwork.
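As a rough illustration of the cleaning step, the sketch below converts a text precipitation column into numbers. It assumes, hypothetically, that trace readings are coded as "T" and missing readings as "M"; the helper name `clean_precip` and the stand-in value for a trace amount are also my own choices for this example, not necessarily what Chaim did.

```python
import pandas as pd


def clean_precip(raw: pd.Series, trace_value: float = 0.005) -> pd.Series:
    """Turn a text precipitation column into floats.

    Assumptions: "T" marks a trace amount and "M" marks a missing reading.
    Trace readings become `trace_value`; missing readings become NaN.
    """
    text = raw.astype(str).str.strip()
    is_trace = text.eq("T")
    # Anything that is not a number (including "T" and "M") becomes NaN here.
    values = pd.to_numeric(text, errors="coerce")
    # Put a small positive stand-in back where the reading was a trace.
    return values.mask(is_trace, trace_value)


raw = pd.Series(["0.12", "  T", "M", "0.00"])
cleaned = clean_precip(raw)
print(cleaned)
```

Missing readings are left as NaN on purpose, so that the choice of how to fill them (a median, a neighboring day, a nearby station) stays explicit rather than hidden inside the cleaning function.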
Subject Matter Expertise:
On my team, Nellie Carr took the lead in learning all about mosquitos, West Nile, and the geography of Chicago. This research was critical in framing the problem statement; without some insight into how West Nile virus propagates through mosquito populations, we would have been completely lost. The most important lesson I learned in this entire project is that we need subject matter expertise in order to develop meaningful analysis. As data scientists, we must become best friends with our subject matter experts and constantly check back with them to direct our analysis.
Overall, this project taught me about performing data science with a team. This included everything from dividing and conquering different parts of the project to learning how to use Git effectively. I am grateful for my amazing teammates Nellie, Chaim, and Matt, and for the experience I gained from this project.