Typical Uses For Web Crawlers

Datafiniti
Knowledge from Data: The Datafiniti Blog
5 min read · Aug 15, 2014


In our last post, we provided an introduction to the structure and basic operations of a web crawler. In this post, we'll go into more detail on specific use cases for web crawlers. As we do, we'll provide some insight into how you could design web crawlers to support each of these use cases.

The One You Use But Don’t Realize It — Search Engines

How terrible would the Internet be without search engines? Search engines make the Internet accessible to everyone, and web crawlers play a critical part in making that happen. Unfortunately, many people confuse the two, thinking web crawlers are search engines, and vice versa. In fact, a web crawler is just the first part of the process that makes a search engine do what it does.

Here’s the whole process:

When you search for something in Google, Google does not run a web crawler right then and there to find all the web pages containing your search keywords. Instead, Google has already run millions of web crawls, scraped all the content, stored it, and scored it, so it can display search results instantly.
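To make that concrete, here's a toy illustration (not Google's actual implementation) of why crawling ahead of time makes search instant: if the crawl output has already been turned into an index mapping keywords to URLs, answering a query is just a lookup.

// Toy inverted index: keyword -> list of URLs, built ahead of time from crawl output.
// The entries here are hypothetical placeholders.
const index = {
  flooring: ['https://example.com/some-page', 'https://example.com/another-page'],
  doylestown: ['https://example.com/another-page']
};

// Serving a search is then just a lookup, which is why results come back instantly.
function search(keyword) {
  return index[keyword.toLowerCase()] || [];
}

console.log(search('Flooring')); // -> ['https://example.com/some-page', 'https://example.com/another-page']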

So how do those millions of web crawls run by Google work? They're pretty simple, actually. Google starts with a small set of URLs it already knows about and stores these as a URL list. It sets up a crawl that goes over this list and extracts the keywords and links on each URL it crawls. As each link is found, the URL it points to is crawled as well, and the crawl keeps going until some stopping condition is met.

In our previous post, we described a web crawler that extracted links from each URL crawled to feed back into the crawl. The same thing is happening here, but now the “Link Extraction App” is replaced with a “Link and Keyword Extraction App”. The log file will now contain a list of URLs crawled, along with a list of keywords on each of those URLs.
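Here's a minimal sketch of that loop in plain JavaScript (Node.js 18+). It is not how Google or the 80legs platform implements crawling; the seed URL, the page limit, and the link and keyword extraction are all simplified stand-ins, just to make the loop concrete.

// A toy crawl loop: start from a seed URL list, pull links and keywords out of each
// page fetched, feed the new links back into the queue, and stop after a fixed number
// of pages.
const seedUrls = ['https://example.com/'];   // hypothetical starting URL list
const maxPages = 100;                        // simple stopping condition

async function crawl() {
  const queue = [...seedUrls];
  const visited = new Set();
  const log = [];                            // stands in for the crawl's log file

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    let html;
    try {
      html = await (await fetch(url)).text();
    } catch (err) {
      continue;                              // skip pages that fail to load
    }

    // Naive link extraction; a real crawler would also resolve relative URLs.
    const links = [...html.matchAll(/href="(https?:\/\/[^"]+)"/g)].map(m => m[1]);

    // Naive "keyword" extraction: distinct longer words, standing in for real keyword logic.
    const words = html.replace(/<[^>]+>/g, ' ').toLowerCase().match(/[a-z]{4,}/g) || [];
    const keywords = [...new Set(words)].slice(0, 10);

    log.push({ url, keywords });             // one log entry per URL crawled
    queue.push(...links);                    // found links feed back into the crawl
  }

  return log;
}

crawl().then(log => console.log(JSON.stringify(log, null, 2)));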

If you wanted to do this same thing on 80legs, you would just need to use the “LinksAndKeywords” 80app with your crawl. Source code for this app is available here.

The process of storing the links and keywords in a database and scoring their relevance so search results can be returned is beyond the scope of this post, but if you're interested, it's worth reading up on search indexing and ranking.

The One Developers Love — Scraping Data

If we focus our crawling on a specific website, we can build out a web crawler that scrapes content or data from that website. This can be useful for pulling structured data from a website, which can then be used for all sorts of interesting analysis.

When building a crawler that scrapes data from a single website, we can provide very exact specifications. We do this by telling our web crawler app specifically where to look for the data we want. Let’s look at an example.

Let's say we want to get some data from a business directory listing for a local flooring company.

We want to get the address of this business (and any other business listed on this site). If we look at the HTML for this listing, we'll find the address wrapped in a tag we can target.
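The relevant markup looks roughly like the sketch below. This is a reconstruction that assumes the site uses standard schema.org microdata; the exact tags and nesting on the real page may differ.

<div itemscope itemtype="http://schema.org/LocalBusiness">
  <span itemprop="name">Buckingham Floor Company</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">415 East Butler Avenue</span>
    <span itemprop="addressLocality">Doylestown</span>,
    <span itemprop="addressRegion">PA</span>
    <span itemprop="postalCode">18914</span>
  </div>
  <span itemprop="telephone">(215) 230-5399</span>
</div>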

Notice the <span itemprop="streetAddress"> tag. This is the HTML element that contains the address. If we looked at the other listings on this site, we'd see that the address is always captured in this tag. So what we want to do is configure our web crawler app to capture the text inside this element.

You can do this capturing in a lot of different ways. The apps you use with 80legs are developed in JavaScript, which means you can use jQuery to access the HTML as if it were one big data "object" (called the "DOM"). In a later post, we'll go into more detail on the DOM so you can get more familiar with it. In this case, we would just do a simple command like:

// Grab the text inside the element marked with itemprop="streetAddress"
object.address = $html.find('span[itemprop="streetAddress"]').text();

We can do similar commands for all the other bits of data we'd want to scrape on this web page, and on all of the other pages on the website. Once we do that, we'd get an object for each page like this appearing in our log file:

{
  "name": "Buckingham Floor Company",
  "address": "415 East Butler Avenue",
  "locality": "Doylestown",
  "region": "PA",
  "postalcode": "18914",
  "phone": "(215) 230-5399",
  "website": "http://www.buckinghamfloor.com"
}
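For reference, the set of commands that would fill in an object like that might look something like the sketch below. The other itemprop names (name, addressLocality, and so on) are assumptions based on common schema.org markup, not values pulled from the actual site.

// Sketch only: assumes each field is marked up with a schema.org itemprop attribute,
// and that $html is the jQuery-wrapped page, as in the streetAddress example above.
var object = {};
object.name       = $html.find('span[itemprop="name"]').text();
object.address    = $html.find('span[itemprop="streetAddress"]').text();
object.locality   = $html.find('span[itemprop="addressLocality"]').text();
object.region     = $html.find('span[itemprop="addressRegion"]').text();
object.postalcode = $html.find('span[itemprop="postalCode"]').text();
object.phone      = $html.find('span[itemprop="telephone"]').text();
object.website    = $html.find('a[itemprop="url"]').attr('href');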

After we generated this log file and downloaded it to our own database or application, we could start analyzing the data contained within.

Any other sort of data scraping will work the same way. The process will always be:

  1. Identify the HTML elements containing the data you want.
  2. Build out a web crawler app that captures those elements (80legs makes this easy).
  3. Run your crawl with this app and generate a log file containing the data.

We’ll go into more detail on building a full scraper in a future post, but if you want to give it a go now, check out our support page to see how you can do this with 80legs.

As a final note, if you’re interested in business data, we already make this available through Datafiniti. If you don’t want to bother with scraping data yourself, we already do it for you!

You can connect with us to learn more about our business, people, product, and property APIs and datasets.
