Hello. In this post I’ll show how to scrape semi-structured data from a target webpage with Python’s BeautifulSoup module. BeautifulSoup is indeed beautiful: it is the go-to package for scraping data and working with HTML. We’ll also use the requests module to fetch the HTML from the target URL. The page in my example is a good starting point because both its HTML structure and its URL scheme are consistent.
Assume for some odd reason you want a list of quotes from the TV sitcom The Big Bang Theory. How would you go about doing this? A quick Google search for “big bang theory quotes” turns up a promising first result. Let’s choose BBT’s most interesting character, Sheldon Cooper. Great, there is a list of quotes (with ratings, too!). But they are scattered across 154 pages. There’s no way we’re going to click through all 154 pages, copy, paste, and strip out all the junk between the quotes by hand. The first thing to do is inspect the page source and identify where the data is located. In Chrome or Firefox (see also the Firebug plugin, a great tool for viewing page structure), right-click the data you want and select Inspect. Here’s what you should see:
Notice how the line on the right is selected – the p element, or paragraph tag. Click it to see what is contained within. All elements indented within this paragraph element are called its children (source). Here is a link to a helpful website for understanding HTML further. The quickest way to figure out how to parse the HTML and extract the correct data (but not other, irrelevant data) is to open your Python interpreter (enter “python” or “python3” at the command line), import the BeautifulSoup and requests modules, and begin experimenting with BeautifulSoup’s functions. BeautifulSoup also has solid documentation. Here’s the code I ended up with:
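The original code appeared as a screenshot, so here is a minimal sketch of the approach. The base URL is a placeholder (substitute the real site’s URL scheme), and the “quotesBody” class name is the one described in the breakdown below:

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical base URL -- substitute the real quote site's URL scheme.
BASE_URL = "http://example.com/quotes/sheldon-cooper/page/"


def extract_quotes(html):
    """Pull the text of every div with class 'quotesBody' out of one page."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text().strip()
            for div in soup.find_all("div", class_="quotesBody")]


def scrape(pages=154, outfile="quotes.txt"):
    with open(outfile, "w") as f:
        for k in range(1, pages + 1):
            # Build each page's link by casting the page number to a string
            page = requests.get(BASE_URL + str(k))
            for quote in extract_quotes(page.text):
                f.write(quote + "\n")  # newline characters separate the quotes
            time.sleep(5)              # be polite to the server between requests


# scrape()  # uncomment to run the full crawl
```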
Here is a breakdown of the core functions:
- requests.get() – retrieves the HTML code from the page
- BeautifulSoup() – creates a BeautifulSoup object from the HTML, which represents the data as a nested structure. Try print(soup.prettify()) to view the structure
- find_all() – called on the “soup” object to find all elements that satisfy the parameters passed in (in this case, the class name). This method is extremely flexible; see the documentation for other ways to use it. The result is a list.
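These are easy to try out in the interpreter on a small HTML snippet (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet for experimenting
html = """
<div class="quotesBody"><p>First quote.</p></div>
<div class="quotesBody"><p>Second quote.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())                 # view the nested structure

divs = soup.find_all("div", class_="quotesBody")
print(len(divs))                       # 2 -- find_all returns a list
print(divs[0].get_text().strip())      # First quote.
```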
It turns out that the p element has a parent div element with the class “quotesBody”, and finding all instances of it with find_all gave me a list of 15 quotes per page. The code simply loops through those list items, grabs the text from each, writes it to a file, sleeps for 5 seconds, then loads the next page and repeats. Notice how the link is rebuilt on each loop by casting the page number k to a string and concatenating it with the base URL. I write newline characters (“\n”) to separate the quotes.
But what about the ratings, shouldn’t we grab those as well? Yes, we should. Even if all we need right now are the quotes, spending a few more minutes grabbing additional information can’t hurt. There may come a time when you only want the funniest quotes, and having the rating data will let you filter on that attribute. Or at some point the site may be taken down, and the rating data will be forever lost because it was never harvested. I could also have grabbed the episode title information. Here is the revised code that also collects ratings:
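As with the first version, the revised code was shown as a screenshot; this sketch assumes the same hypothetical base URL plus the “rating” class described below:

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical base URL -- substitute the real quote site's URL scheme.
BASE_URL = "http://example.com/quotes/sheldon-cooper/page/"


def extract_quotes_and_ratings(html):
    """Return parallel lists of quote text and rating text from one page."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = [d.get_text().strip()
              for d in soup.find_all("div", class_="quotesBody")]
    ratings = [r.get_text().strip()
               for r in soup.find_all(class_="rating")]
    return quotes, ratings


def scrape(pages=154, outfile="quotes.txt"):
    with open(outfile, "w") as f:
        for k in range(1, pages + 1):
            page = requests.get(BASE_URL + str(k))
            quotes, ratings = extract_quotes_and_ratings(page.text)
            # One rating per quote, so a single index j selects both
            for j in range(len(quotes)):
                f.write(quotes[j] + "\n")
                f.write(ratings[j] + "\n")
            time.sleep(5)  # be polite to the server between requests


# scrape()  # uncomment to run the full crawl
```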
It turns out that scraping the ratings is also very simple: another find_all, this time on the “rating” class. Since there is a rating for every quote, we can simply loop over the length of the quotes list. The same index (j) selects the corresponding rating from the list of ratings, which is written to the file after its quote. Here is what the output looks like:
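The output screenshot isn’t reproduced here, but with the writes above the file simply alternates quote and rating lines, along these lines (placeholder text, not real quotes or ratings):

```text
<quote text from page 1, item 1>
<rating for item 1>
<quote text from page 1, item 2>
<rating for item 2>
```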
Nice and neat. I used this format because it is human-friendly and readable, yet still structured enough to transform into .csv or .json format if and when needed (except for the very first quote, where for some reason Sheldon’s part ends up on the same line). This concludes my concise tutorial. Scraping solutions aren’t always this simple; sometimes finding the data you need involves following several links. For more complex scraping tasks there is a framework called Scrapy, which I have also written a tutorial on. In any case, when you are scraping data from a website you should always keep the target website in mind. Some websites expressly forbid scraping and take active measures to prevent it. Always use delays instead of hammering the website’s servers with requests (sleeping for 5 seconds is usually more than enough, and will also keep you from getting banned). Feel free to post any questions in the comments and I’ll see if I can help.
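As a sketch of that .csv transformation, assuming the alternating quote/rating lines written by the scraper above (file names here are just placeholders):

```python
import csv


def to_csv(infile="quotes.txt", outfile="quotes.csv"):
    """Pair up alternating quote/rating lines into CSV rows."""
    with open(infile) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["quote", "rating"])
        # Lines alternate: quote, rating, quote, rating, ...
        for quote, rating in zip(lines[0::2], lines[1::2]):
            writer.writerow([quote, rating])
```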