_Understanding Movie Quality from Plot Summaries using Natural Language Processing
For my capstone project at General Assembly’s Data Science Immersive program, I built a model that took plot summaries as inputs and predicted whether a plot would result in a good movie. Specifically, I wanted to use Natural Language Processing (NLP) on Wikipedia plot summaries to predict Metacritic Metascores. I started by scraping Metacritic.com with BeautifulSoup, pulling cast member names, director names, release year, plot summaries, and the compiled Metacritic Metascore. I compared the Metascores to 100,000 plot summaries and titles from Wikipedia. I used spaCy and NLTK (Natural Language Toolkit) to preprocess the Wikipedia data: I stemmed, lemmatized, and removed stop words and punctuation (the basic NLP steps). I then used the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer from scikit-learn to process the plot data. This left me with a simple binary classification problem: the response variable was whether the movie was “Good” or not, and the plot of each movie was converted into features. I used a support vector machine, a random forest classifier, and XGBoost to classify the data. Overall, I ended up with really poor ROC-AUC scores. The difference between a good crime movie and a bad crime movie may be more nuanced than my classification setup could pick up on. I think good and bad movies were too similar, from a plot perspective, for my first model to separate. This project will be further refined in the future. It can be found on my public GitHub account, in a repository titled "Capstone".
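To make the TF-IDF step concrete, here is a minimal sketch that computes the weights by hand with the textbook formula tf * log(N / df). It is not the project's code: the project used scikit-learn's TF-IDF vectorizer, which applies additional smoothing and normalization by default, and the three toy plots and the function name `tfidf` are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy plots, not taken from the project data.
plots = [
    "detective hunts killer",
    "detective falls in love",
    "robot hunts detective",
]
weights = tfidf(plots)
print(weights[0])
```

A word that appears in every plot ("detective" here) gets a weight of zero, while rarer words get larger weights. That is one way to see why plots from the same genre can look alike to a classifier: the vocabulary they share carries little weight, and the vocabulary that remains may not track quality.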
_2001 SAT Analysis
I performed exploratory data analysis on College Board SAT data from 2001. I analyzed average math scores, average verbal scores, and SAT participation rates for all 50 states and the District of Columbia. I looked at which states performed well on the SAT and whether participation rate was correlated with performance. A time series analysis that included more recent data would offer a clearer picture of SAT participation rates across the country. The majority of the visualizations in this project were built with Tableau. Analysis and simple visualizations were constructed with Python (NumPy, SciPy, Matplotlib, Seaborn). This project can be found on my public GitHub account, in a repository titled "Portfolio".
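As a rough illustration of the correlation check, the sketch below computes a Pearson correlation coefficient by hand. The participation rates and scores are made-up numbers, not the 2001 College Board figures, and the function name `pearson_r` is hypothetical; in the project itself a library routine from NumPy or SciPy would be the natural choice.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for four states (participation in percent, mean math score).
participation = [10, 20, 30, 40]
math_score = [600, 580, 540, 520]
r = pearson_r(participation, math_score)
print(round(r, 3))
```

A value near -1 or +1 indicates a strong linear relationship, and a value near 0 indicates little or none.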
_Modeling Iowa Liquor Sales
The goal of this analysis was to determine the best geographic location in Iowa in which to build a new liquor store, where "best" meant the location that maximized sales. The full dataset contained upwards of 2.7 million transactions. It contained transaction-level data for all stores holding a Class E liquor license in 2015 and Q1 of 2016. The location model was a regression. Correlation matrices were used to perform feature selection for the model. This analysis resulted in a model that included location dummy variables, state bottle retail (the price per bottle), and the number of bottles sold as features. Lasso regression (L1 regularization) was used to reduce the error in the model. This model concluded that Dallas County was the best county in the state of Iowa in which to open a new liquor store.
A secondary model was run to determine which features most strongly affected sales in Dallas County. This secondary model looked at how state bottle retail, the number of bottles sold, and the volume of liquor sold (in liters) affected the sales of a Dallas County liquor store. This model concluded that the best way to maximize sales in that county was to increase the number of bottles sold. Further analysis of the top-performing stores in Dallas County and the data associated with those stores would be necessary in order to develop a more robust business strategy. This project can be found on my public GitHub account, in a repository titled "Portfolio".
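For readers unfamiliar with L1 regularization, the following sketch shows how Lasso can shrink an uninformative coefficient to exactly zero. It is a small coordinate-descent implementation written for illustration, not the code used in the project (where a library implementation such as scikit-learn's `Lasso` would be the usual choice). The data, the `alpha` value, and the function names are all assumptions, and the features are assumed to be centered with no intercept.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimize (1/2n) * ||y - Xw||^2 + alpha * ||w||_1 for centered data.

    Assumes no feature column is all zeros.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            # Correlate feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Hypothetical centered data: the first column drives y, the second is unrelated.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-6.0, -3.0, 0.0, 3.0, 6.0]
w = lasso_coordinate_descent(X, y, alpha=0.5)
print(w)
```

The informative feature keeps a (shrunken) nonzero weight while the unrelated one is set to zero, which is why Lasso doubles as a feature selection tool.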
_Web Scraping Indeed Job Listings
I built a web scraper that used BeautifulSoup to parse data science job listings in twenty different cities across the US. The scraper pulled 5,000 job postings per location. This data was used to find out which factors most directly increased salaries for data scientists. Job listings were categorized as either above or below the mean salary. Salary predictions were produced with a random forest model and, separately, with support vector machines. L1 regularization was employed. The web scraper and all models were built with Python (BeautifulSoup, Scikit-Learn, NLTK, Pandas). This project can be found on my public GitHub account, in a repository titled "Portfolio".
_Chicago's West Nile Virus Problem
Chaim Gluck, Nellie Carr, Matt Terribile and I entered the Kaggle competition titled "West Nile Virus Prediction". The goal of this project was to predict where the West Nile Virus would be present among mosquito populations in Chicago. We began our analysis of Chicago's West Nile Virus problem with simple exploratory data analysis, developing a basic understanding of the mosquito trap analysis currently undertaken by Chicago's Department of Public Health. We developed a model using features engineered from weather data as well as other time and temperature factors. Our hypothesis, based on EDA and research on mosquito breeding, was that warmth, moisture, and a delay effect (time for larvae to mature) would most greatly affect mosquito populations. We used random forests, AdaBoost, and XGBoost to model this problem. Ultimately, we were able to predict the presence of West Nile Virus with an ROC-AUC score of 0.77. Our final analysis encompassed the model as well as a cost-benefit analysis of spraying the city of Chicago. All of the parts of this project (Jupyter Notebooks, Cost-Benefit Analysis, Presentation) can be found at our public GitHub repository. Click the button below.
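One simple way to encode a delay effect is a lagged rolling average: for each day, average the weather over a window that ended some days earlier. The sketch below is an assumed illustration of that idea, not the feature code from our notebooks; the window size, the lag, the temperature values, and the function name `lagged_rolling_mean` are all hypothetical.

```python
def lagged_rolling_mean(values, window, lag):
    """Mean of the `window` values that end `lag` steps before each position.

    Returns None where there is not enough history to fill the window.
    """
    result = []
    for t in range(len(values)):
        end = t - lag + 1      # exclusive end index of the window
        start = end - window
        if start < 0:
            result.append(None)
        else:
            result.append(sum(values[start:end]) / window)
    return result


# Hypothetical daily temperatures.
daily_temps = [10, 20, 30, 40, 50]
lagged = lagged_rolling_mean(daily_temps, window=2, lag=1)
print(lagged)  # [None, None, 15.0, 25.0, 35.0]
```

Each trap observation can then be joined to the lagged value for its date, so the model sees the conditions from the period when that day's mosquitoes would have been maturing.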
_Predicting Terrorism with Bayesian Inference
The objective of this project was to gain an understanding of terrorism using Bayesian inference. My analysis specifically looked at the Philippines in Southeast Asia. My working knowledge of terrorism in Southeast Asia was limited, but this line of inquiry was sparked by the chaotic situation in Marawi City, Philippines, where Philippine Government security forces fought (and continue to fight, depending on when this is read) ISIS-affiliated militants known as Abu Sayyaf and Maute.
I utilized the Global Terrorism Database, which contained over 170,000 cases of terrorism from around the globe. The National Consortium for the Study of Terrorism and Responses to Terrorism (START) maintains this database. START is headquartered at the University of Maryland. At the time of the submission of this project, terror attacks from 1970 to 2016 were accounted for. The database contained over one hundred and thirty variables, including geographic location, type of attack, perpetrators, targets, outcomes, number of fatalities, and motivation of perpetrators. This was a very robust dataset, but it did contain a fair amount of missing information. In particular, 1993 data was lost. One additional goal of this analysis was to impute the number of bombings that occurred in 1993.
The full breakdown of the Markov Chain Monte Carlo (MCMC) method can be found in the Jupyter Notebook associated with this project. The MCMC method revealed a clear difference between pre-2010 bombings in the Philippines and post-2010 bombings. The credible intervals of the two groups did not overlap, and the difference in means was credibly different from zero. Overall, this result isn’t very surprising because of the global trend of increased terrorism starting in the early 2000s. An industry expert with prior knowledge of terrorism could more readily utilize MCMC methods in order to derive key insights into the world of terrorism analysis.
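To give a flavor of the comparison without opening the notebook, here is a minimal random-walk Metropolis sketch that estimates a yearly bombing rate for two groups of years and compares their 95% credible intervals. Everything in it is an assumption for illustration: the counts are invented rather than drawn from the Global Terrorism Database, and the Poisson likelihood, Exponential prior, step size, and function names are my choices here, not necessarily the model used in the project.

```python
import math
import random


def log_posterior(rate, counts, prior_rate=0.1):
    """Unnormalized log posterior: Poisson likelihood, Exponential(prior_rate) prior."""
    if rate <= 0:
        return float("-inf")
    log_likelihood = sum(k * math.log(rate) - rate for k in counts)
    return log_likelihood - prior_rate * rate


def metropolis(counts, n_samples=5000, burn_in=1000, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a single Poisson rate."""
    rng = random.Random(seed)
    current = 1.0
    current_lp = log_posterior(current, counts)
    samples = []
    for i in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, counts)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            samples.append(current)
    return samples


def credible_interval(samples, mass=0.95):
    """Equal-tailed credible interval read off the sorted samples."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    low = ordered[int(tail * len(ordered))]
    high = ordered[int((1.0 - tail) * len(ordered)) - 1]
    return low, high


# Hypothetical yearly bombing counts, invented for this example.
pre_counts = [2, 3, 1, 2, 2]
post_counts = [12, 14, 11, 15, 13]

pre_low, pre_high = credible_interval(metropolis(pre_counts))
post_low, post_high = credible_interval(metropolis(post_counts))
print("pre: ", pre_low, pre_high)
print("post:", post_low, post_high)
print("intervals overlap:", post_low <= pre_high)
```

When the lower end of the later interval sits above the upper end of the earlier one, as it does for these invented counts, the two rates are credibly different.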
It seems that the number of terror bombings has increased across the globe since 2010 and increased markedly in the Philippines. Overall, it is with the help of START that we can learn more about terrorist activities and develop proper strategies in order to counteract their threat.