Web Scraping with BeautifulSoup
(This article was originally published on Viget.com.)
Python’s BeautifulSoup library makes scraping web data a breeze. With a basic understanding of HTML and Python, you can pull all the data you need from web pages. In this article, I go through an example of web scraping by pulling text data from Viget.com.
Warning
Before you begin scraping a site, make sure not to violate the site’s Terms of Service. Don’t violate the rules in the site’s robots.txt, and don’t use an overly aggressive crawl rate. Read Benoit Bernard’s blog post about the legality behind web scraping before you start, and consult your legal team before scraping or crawling a site.
End of warning
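As a minimal sketch of the robots.txt check mentioned in the warning, Python's standard-library urllib.robotparser can report whether a path is allowed for a given user agent. The rules, the user agent name, and the URLs below are made up for illustration. For a real site you would load its actual robots.txt with set_url() and read() instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without network access.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# 'my-scraper' is a placeholder user agent name.
can_fetch_articles = parser.can_fetch('my-scraper', 'https://example.com/articles/')
can_fetch_admin = parser.can_fetch('my-scraper', 'https://example.com/admin/users')

# crawl_delay() returns the Crawl-delay value for the user agent, or None if unset.
delay_seconds = parser.crawl_delay('my-scraper')
```

A scraper could skip any URL for which can_fetch() is false and pause for delay_seconds between requests. Passing a robots.txt check does not replace reading the Terms of Service.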
The easiest sites to scrape are those with a consistent HTML structure. Let’s look at Viget.com as an example. If I wanted to pull the author name from each article, I could search each document for the class name ‘credit__author-name’, and I would find the author’s name every time. If a site used different class names for the same element in different contexts, I would need to be aware of those variants and account for them.
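To make the class-name lookup concrete, here is a small sketch that runs against an inline HTML string instead of a live page. The markup and the author name are invented for illustration; only the class names come from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; a real article page contains far more than this.
EXAMPLE_HTML = """
<article>
  <h1 class="hero__title">An Example Title</h1>
  <span class="credit__author-name">Jane Doe</span>
</article>
"""

# 'html.parser' ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(EXAMPLE_HTML, 'html.parser')

author_tag = soup.find(class_='credit__author-name')
# find() returns None when nothing matches, so guard before reading .text.
author = author_tag.text.strip() if author_tag is not None else None

# A class name that is not on the page simply yields None.
missing_tag = soup.find(class_='no-such-class')
```

The None check matters on sites with less consistent markup, where a missing element would otherwise raise an AttributeError.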
Below is a Python function I wrote to scrape Viget.com’s article pages. My goal was to get the title, author name, hashtags, date, and text of each article on Viget.com.
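    import requests
    from bs4 import BeautifulSoup


    def get_article(url):
        # Defaults, returned as-is when the request fails or an element is missing
        title = 'None'
        author = 'None'
        date = 'None'
        hashtags = 'None'
        article_body = 'None'

        # Replace 'test' with your own user agent string
        site = requests.get(url, headers={'User-Agent': 'test'})

        # Only parse the page when the request succeeded
        if site.status_code == 200:
            html = BeautifulSoup(site.text, 'lxml')

            # find() returns None when nothing matches, so .text raises AttributeError
            try:
                title = html.find(class_='hero__title').text
            except AttributeError:
                pass
            try:
                author = html.find(class_='credit__author-name').text
            except AttributeError:
                pass
            try:
                date = html.time.attrs['datetime']
            except (AttributeError, KeyError):
                pass

            # find_all() returns an empty list when nothing matches
            hashtags = [item.text for item in html.find_all(class_='category-tag')]
            article_body = [each.text for each in html.find_all(class_='text')]

        article_dict = {'Title': title,
                        'Author': author,
                        'Date': date,
                        'Hashtags': hashtags,
                        'Article': article_body}
        return article_dict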
First, I used the Requests package in Python to send an HTTP GET request for the given URL. The response was saved in the variable ‘site’. When sending the HTTP request, make sure to set the User-Agent header. This header identifies your client, typically your browser and operating system, to the web server of the site you are requesting. I have removed my user agent, but you can replace the ‘test’ string with your own. You can find your own user agent by searching “What is my user agent?” in Google.
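If you make many requests, one option is to set the header once on a requests.Session so that every request reuses it. This is a sketch of that alternative, not what the function above does. The user agent string and the URL are placeholders, and the request is only prepared, never sent.

```python
import requests

# Placeholder value: substitute your own user agent string.
USER_AGENT = 'my-user-agent-string'

session = requests.Session()
session.headers.update({'User-Agent': USER_AGENT})

# Prepare (but do not send) a request to inspect the headers that would go out.
request = requests.Request('GET', 'https://example.com/')
prepared = session.prepare_request(request)
sent_user_agent = prepared.headers['User-Agent']
```

Calling session.get(url) would then send the same header without repeating the headers argument on every call.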
I only wanted to pull the data if the response to the HTTP GET request had a status code of 200. A 200 status code for our GET request “means that the data was successfully fetched and transmitted.” If we received any other status code, the function would return ‘None’ for the title, author name, hashtags, etc. The simple step of checking that ‘site.status_code’ equals 200 ensures we are only pulling data from successful GET requests. If we get an unsuccessful response, such as a 404, the data will not be pulled. Keep in mind that Requests follows redirects by default, so a redirected page reports the status code of its final destination.
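If redirected pages should be skipped as well, one possible approach is to also inspect the response's history attribute, which Requests fills with any intermediate redirect responses. The helper name below is hypothetical, and the two stand-in objects only mimic the two attributes the helper reads so the example runs without network access.

```python
from types import SimpleNamespace


def fetched_without_redirect(response):
    """Return True for a 200 response that did not pass through any redirects."""
    # response.history holds the intermediate redirect responses;
    # it is empty when the URL was served directly.
    return response.status_code == 200 and not response.history


# Stand-ins for requests.Response objects (illustration only).
direct = SimpleNamespace(status_code=200, history=[])
redirected = SimpleNamespace(status_code=200,
                             history=[SimpleNamespace(status_code=301)])
not_found = SimpleNamespace(status_code=404, history=[])

results = [fetched_without_redirect(r) for r in (direct, redirected, not_found)]
```

With a real request, the helper would be called on the object returned by requests.get(). Another option is to pass allow_redirects=False to requests.get(), in which case a redirect shows up as a 3xx status code instead of being followed.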
I then used BeautifulSoup to parse the HTML. BeautifulSoup’s documentation is useful if you are new to using the package, or if you are attempting to pull data that is inaccessible with the standard find() methods. Lastly, I chose to store the data in a dictionary because a dictionary, or a list of dictionaries, can be converted into a Pandas dataframe easily.
If you are attempting to pull information from multiple pages, you can use the URL structure of the site to your advantage. On viget.com/articles, you can see the titles, authors, dates, and hashtags of the ten most recent articles. When you view the next ten articles, the URL changes to ‘viget.com/articles/?page=2’. You can pull the data from all subsequent pages by looping over the page numbers from 2 to n. Between BeautifulSoup and string manipulation, a lot is possible.
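The page loop can be sketched with plain string formatting. The ‘https://www.’ prefix and the helper name are assumptions for illustration; only the ‘?page=2’ pattern comes from the site as described above. The function builds the URLs without requesting them.

```python
# Assumed base URL; only the '/articles/?page=N' pattern comes from the article.
BASE_URL = 'https://www.viget.com/articles/'


def listing_urls(last_page):
    """Build the listing-page URLs for pages 1 through last_page."""
    # The first page has no query string; later pages append ?page=N.
    urls = [BASE_URL]
    for page in range(2, last_page + 1):
        urls.append(f'{BASE_URL}?page={page}')
    return urls


first_three = listing_urls(3)
```

Each of these URLs could then be requested in turn, ideally with a pause between requests to keep the crawl rate gentle.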
I wanted to combine the scraped data with Google Analytics pageview data, so I pulled a list of article URLs from GA, saving that list as ‘url_list.’ Then, I looped through ‘url_list’ and applied the function ‘get_article’ created above. The data was saved as a big list of dictionaries and ultimately converted into a Pandas dataframe.
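    # url_list is the list of article URLs pulled from Google Analytics
    full_list = []
    for url in url_list:
        print(url)  # show progress
        article = get_article(url)
        # Skip anything that did not come back as a dictionary
        if article is not None:
            full_list.append(article)
    print(full_list)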
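The final conversion step might look like the sketch below. The two records are made up and stand in for ‘full_list’; they use the same keys that ‘get_article’ returns. The date format is a guess, since the article does not show what the ‘datetime’ attribute contains.

```python
import pandas as pd

# Made-up records with the same keys that get_article returns.
example_list = [
    {'Title': 'First Post', 'Author': 'Jane Doe', 'Date': '2018-01-01',
     'Hashtags': ['#data'], 'Article': ['Some text.']},
    {'Title': 'Second Post', 'Author': 'John Roe', 'Date': '2018-02-01',
     'Hashtags': ['#code', '#design'], 'Article': ['More text.', 'Even more.']},
]

# Each dictionary becomes one row; the dictionary keys become the columns.
articles_df = pd.DataFrame(example_list)
row_count = len(articles_df)
column_names = set(articles_df.columns)
```

From here the dataframe could be joined to the pageview data on a shared column such as the article URL, which would need to be added to each dictionary first.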