Building a Web Scraper

Datafiniti
Knowledge from Data: The Datafiniti Blog
Sep 3, 2014 · 5 min read


We briefly touched on how to build a web scraper in our last post on web crawling. In this post, I’ll go into more detail about how to do this. When I use the term “web scraper,” I’m referring to a very specific type of web crawler — one that looks at a specific website and extracts data from it. We do a lot of similar scraping for Datafiniti.

This post is going to cover a lot of ground, including:

  1. Document Object Model (DOM): An object representation of HTML
  2. jQuery: A JavaScript library that will help you manipulate the DOM
  3. Setting up our environment
  4. Building the Scraper: Building out the scraper attribute-by-attribute
  5. Running the Scraper: Using 80legs to run the scraper

The Document Object Model

Before we dive into building a scraper, you’ll need to understand a very important concept: the Document Object Model, aka the DOM. The DOM is how all modern web browsers look at the HTML that makes up a web page. The browser reads in the HTML and converts it to a more formalized data structure that helps it render the content you actually see on the site. You can think of the DOM as a nested collection of HTML data, and you can even see it in your browser. In Chrome, right-click and choose “Inspect Element”:
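
You can also poke at the DOM directly from JavaScript. As a minimal sketch, try this in your browser’s console on any page:

var body = document.body;             // the <body> node of the DOM tree
console.log(body.children.length);    // how many direct child elements it nests
console.log(body.querySelector('a')); // the first link nested anywhere inside it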

jQuery

Because the DOM is such an accepted, standardized way of working with HTML, there are a lot of tools available for manipulating it. One of the most widely used is jQuery, a library that enhances JavaScript with a ton of DOM-manipulation functionality.

As an example, let’s say we wanted to capture the most deeply nested elements in this HTML list (item-1, item-2, and item-3):

<ul class="level-1">
  <li class="item-i">I</li>
  <li class="item-ii">II
    <ul class="level-2">
      <li class="item-a">A</li>
      <li class="item-b">B
        <ul class="level-3">
          <li class="item-1">1</li>
          <li class="item-2">2</li>
          <li class="item-3">3</li>
        </ul>
      </li>
      <li class="item-c">C</li>
    </ul>
  </li>
  <li class="item-iii">III</li>
</ul>

With jQuery, we would just need to do something like this:

var innerList = $html.find('ul.level-3 li');
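
The innerList variable now holds a jQuery collection of the matched elements, so pulling out each item’s text is just a loop. A quick sketch:

var items = [];
innerList.each(function() {
  items.push($(this).text()); // inside each(), "this" is the matched <li>
});
// items is now ["1", "2", "3"]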

As you’ll see, using jQuery with the DOM greatly simplifies the web scraping process.

Setting Up Our Development Environment

Now that we understand some of the basic concepts, we’re almost ready to start building our scraper. Before we can get to the fun stuff, however, we need to set up a development environment. If you do this, you’ll be able to follow along and build the scraper as you read the article. Here are the steps you need to take:

  1. Install Git.
  2. Clone the EightyApps repo.
  3. Install the EightyApp tester for Chrome. Instructions are on the EightyApps repo page.
  4. Register on 80legs.

Building the Web Scraper

Now we’re ready to get started! In a text editor, open the BlankScraper.js file from the repo you just cloned. In your browser, open http://www.citysearch.com/profile/10192700/lockhart_tx/black_s_barbecue.html, which we’ll use as an example.

For the purposes of this tutorial, we’ll say we’re interested in collecting the following attributes:

  • Name
  • Address
  • City
  • State
  • Postal code

Let’s start with address. If you right-click on the web page in your browser and select “View Source”, you’ll see the full HTML for the page. Find where the address (“215 N Main St”) appears in that HTML. A quicker way is to open the Inspect Element panel, click the magnifying glass in its upper-left corner, and then click the address where it’s displayed on the web page.
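
In the page source, the markup around the address looks something like this (simplified):

<span itemprop="streetAddress">215 N Main St</span>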

Note that the address value is stored within a span tag, which has an itemprop value of “streetAddress”. In JQuery, we can easily capture this value with this:

object.address = $html.find('span[itemprop="streetAddress"]').text();

We can do similar things for city, state, and zip code:

object.city = $html.find('span[itemprop="addressLocality"]').text();
object.state = $html.find('span[itemprop="addressRegion"]').text();
object.postalcode = $html.find('span[itemprop="postalCode"]').text();
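
The name attribute from our list works the same way. The exact markup depends on the page, but if Citysearch tags the business name with an itemprop of “name” like it does the address fields, something along these lines would capture it:

// Assumes the business name carries itemprop="name"; check the page
// source to confirm the actual element and selector.
object.name = $html.find('[itemprop="name"]').first().text();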

Once we’ve built everything out, here’s what the code for the scraper looks like:

https://gist.github.com/shiondev/c569e72fec0c3d8bcf34

You can use the 80app Tester at http://80apptester.80legs.com/ to test the code you’ve written. Just copy and paste the code in, then try different URLs to see what it grabs from each one.

You may be wondering where the rest of the web crawling logic is. Because we’re going to use 80legs to run this scraper, we don’t need to worry about anything except processDocument and parseLinks; 80legs will handle the rest for us. We just define what to do with each URL the crawl hits, which really cuts down the amount of code we have to write for the web scraper.
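
To make that concrete, here’s a condensed sketch of how those two hooks fit together. The processDocument/parseLinks structure comes from the EightyApps template, but the exact signatures and helper names below are assumptions on my part; BlankScraper.js in the repo has the real ones:

var EightyApp = function() {
  // Called once per crawled URL; returns the data we want to keep.
  this.processDocument = function(html, url) {
    var $html = this.parseHtml(html); // assumed template helper: HTML string -> jQuery object
    var object = {};
    object.address = $html.find('span[itemprop="streetAddress"]').text();
    object.city = $html.find('span[itemprop="addressLocality"]').text();
    object.state = $html.find('span[itemprop="addressRegion"]').text();
    object.postalcode = $html.find('span[itemprop="postalCode"]').text();
    return JSON.stringify(object);
  };

  // Called once per crawled URL; returns the links 80legs should crawl next.
  this.parseLinks = function(html, url) {
    var $html = this.parseHtml(html);
    var links = [];
    $html.find('a[href]').each(function() {
      links.push(this.getAttribute('href')); // inside each(), "this" is the raw <a> element
    });
    return links;
  };
};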

Running the Scraper

With our scraping code complete, we head over to 80legs, log in, and upload the code. We’ll also want to upload a URL list so our crawler has at least one URL to start from. For this example, http://www.citysearch.com/listings/lockhart-tx/barbecue_restaurants/72035_1724 is a good start.

With our code and URL list available in our 80legs account, all that’s left is to run the crawl. We can use the Create a Crawl form to select all the necessary settings, and we’re off!

The crawl will take some time to run. Once it’s done, we’ll get one or more result files containing our scraped data. And that’s it!

Wrapping Up

If you found this post useful, please let us know! If anything was confusing, please comment, and we’ll do our best to clarify.

More posts are on the way, so check back regularly. You can also review our previous posts for more background on web crawlers and scrapers.

