Jul 28

Web Scraping Indeed Data Science Positions

Abstract: This analysis estimates the salary range for data scientists across the United States and examines which skills, titles, and locations are associated with higher salaries. I scraped the data from Indeed.com and found the median data science salary in my dataset to be $86,711. My models showed that titles and descriptions containing phrases like “data scientist” and “machine learning” tended to have salaries above the median, while phrases like “research analyst” and “data analyst” tended to have salaries below it. Location also had a strong effect on salary: my models showed that jobs in Washington, DC tended to pay more than the median.

Introduction: This analysis pulled data from Indeed.com, the job-search site. I built a web scraper using Python’s Beautiful Soup library. The scraper searched for “data science” in the Indeed search field across more than 70 US cities and pulled over 180,000 job listings. The results included roles like data analyst, research analyst, statistical analyst, and data engineer, along with variants of data scientist. Major metropolitan areas like New York City, San Francisco, and Washington, DC returned more results than other cities. Only jobs that listed a yearly salary were included in the model-building process; hourly jobs and jobs without salary information were dropped. The final dataset consisted of 409 jobs with complete information. Though this was fewer than I had originally hoped for, I moved forward with model building.
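To make the filtering step concrete, here is a minimal sketch of how yearly salaries could be kept and hourly ones dropped. It is not the code from my project: the salary strings, the "salary_text" key, and the function names are all hypothetical, and collapsing a salary range to its midpoint is an assumption made for illustration.

```python
import re
from statistics import mean

# Assumed salary formats such as "$80,000 - $100,000 a year" or "$25 an hour".
# The wording used on real listings may differ.
YEARLY = re.compile(r"a year|per year|/ ?year", re.IGNORECASE)
AMOUNT = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)")


def parse_yearly_salary(text):
    """Return the yearly salary as a float, or None if missing or not yearly."""
    if not text or not YEARLY.search(text):
        return None
    amounts = [float(a.replace(",", "")) for a in AMOUNT.findall(text)]
    if not amounts:
        return None
    # Assumption: a range is collapsed to its midpoint.
    return mean(amounts)


def keep_yearly(jobs):
    """Keep only jobs with a parseable yearly salary, adding a 'salary' field."""
    kept = []
    for job in jobs:
        salary = parse_yearly_salary(job.get("salary_text"))
        if salary is not None:
            kept.append({**job, "salary": salary})
    return kept


# Hypothetical scraped records, for illustration only.
jobs = [
    {"title": "Data Scientist", "salary_text": "$80,000 - $100,000 a year"},
    {"title": "Data Analyst", "salary_text": "$25 an hour"},
    {"title": "Data Engineer", "salary_text": None},
]
yearly_jobs = keep_yearly(jobs)
```

In this toy example only the first record survives, which mirrors how a large scrape can shrink to a much smaller modeling dataset.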

Methods: I framed the problem as binary classification. The median salary of $86,711 was used to split jobs into two classes: above the median salary or below it. I modeled the data with Random Forests, Logistic Regression, Support Vector Machines, and XGBoost; Random Forests and XGBoost performed the best. The details of these models are beyond the scope of this blog post, but the code for this project can be found on my GitHub. The models analyzed how the title, location, and description affected which salary class a job fell into. They were able to correctly predict whether a job was above or below the median salary with an accuracy of 70%.
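The median split and the accuracy metric can be sketched in a few lines. This is an illustration under stated assumptions, not my project code: the salaries and predictions below are made up, the function names are hypothetical, and treating a salary exactly equal to the median as "below" is an arbitrary choice.

```python
from statistics import median


def label_by_median(salaries):
    """Return (threshold, labels), where a label is 1 if the salary is above the median, else 0."""
    threshold = median(salaries)
    # Assumption: a salary equal to the median is labeled 0 (not above).
    labels = [1 if s > threshold else 0 for s in salaries]
    return threshold, labels


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up salaries, for illustration only.
salaries = [60000, 75000, 86711, 95000, 120000]
threshold, y_true = label_by_median(salaries)

# Made-up predictions standing in for a classifier's output.
y_pred = [0, 1, 0, 1, 1]
score = accuracy(y_true, y_pred)
```

The resulting labels would then serve as the target for a classifier, for example scikit-learn's RandomForestClassifier, trained on features built from the title, location, and description.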

Results: My analysis found that titles and descriptions containing phrases like “data scientist” and “machine learning” tended to have higher than median salaries. “Research analyst”, “data analyst”, and “statistical analyst” were titles or phrases associated with below median salaries. The location of the job also had a strong effect on salary. Jobs located in Arizona tended to have below median salaries, while jobs in the states of Washington, Illinois, and Massachusetts tended to have higher than median salaries. Jobs in Washington, DC also tended to pay more than the median of $86,711.

Recommendations: If I were to redo this project, I would run the web scraper bi-weekly for a few months in order to aggregate enough salary data. With only 409 usable jobs, more data would likely lead to more accurate models and better explanations. Overall, this project was a great first experience with web scraping and building Random Forest and XGBoost models.