Your Biggest Untapped Source of Data

Quick – can you tell me what your company’s most valuable source of data is? Is it your company’s (a) CRM software, (b) POS system, or (c) ERP platform? If you chose any of these as your answer, you answered incorrectly. The correct answer is (d) The Web.

While your company’s internal data systems collect a lot of valuable information, they can never match the sheer potential that exists on the Web. Doug Laney at Gartner said it best:

Your company’s biggest database isn’t your transaction, CRM, ERP or other internal database. Rather it’s the Web itself and the world of exogenous data now available from syndicated and open data sources.

Of course, the challenge with using data from the web is that it’s incredibly unstructured and scattered across a vast number of websites. As a business, you likely need to leverage at least one of the following:

  • Sales lead generation and optimization
  • Competitive analysis and monitoring
  • Product pricing and assortment
  • Brand and sentiment analysis
  • Marketing automation and research

and likely much more. Traditionally, acquiring web data for these applications has been incredibly difficult. Dan Woods, a contributor for Forbes, highlights the problem. Acquiring web data means:

  1. Making it easy to identify the information on a web page or collection of web pages and assemble that information into a useful structure.
  2. Allowing the data to still be harvested correctly, even if the page changes in some way.
  3. Recognizing when new information has arrived.
  4. Harvesting data on a regular schedule.
  5. Managing and performing quality control on thousands of agents.
  6. Handling complex ways of creating pages, such as responsive design.
  7. Integrating harvested data into a data warehouse or other repository.

He even goes on to mention a few solutions, but none of them go far enough. Web scraping is typically just the first step of a long process when it comes to consuming web data. In our own experience, data from traditional web scraping still suffers from inaccuracies, poor coverage, and inaccessibility. You still need to go through a sanitization, bundling, and distribution process before your business can actually consume web data.

Our goal at Datafiniti is to overcome the onerous challenge of consuming quality web data. Rather than piecing together tools, companies should expect a complete solution for web data. That level of accessibility becomes possible only by looking beyond traditional web scraping applications to web data solutions like Datafiniti.


“Experts” vs the Crowd: Where Analysis of Crowd-Sourced Reviews Gets It Wrong

Last year, FiveThirtyEight published a series of posts with the ostensible purpose of finding the country's best burrito. Long story short, the burrito picked by experts before anything began was (surprise!) the winner. The primary author goes on to imply that the crowd's choice (as evidenced by Yelp reviews) failed to live up to the experts' choice. In reality, we should almost never expect it to, and there's a good reason why.

Before I go any further, let me say that I'm a HUGE FiveThirtyEight fan. Watching Nate Silver perfectly predict the 2012 election was as exciting for me as watching Tracy McGrady score 13 points in 33 seconds for my beloved Rockets. But for a publication that relies on data to strip away bias, this series of posts seemed to favor bias over data.

The recap post in the burrito series centers around this graphic:

[Image: barry-jester-vorb-chart]

The title assumes that the crowdsourced opinion is the wrong one, as if it needs to live up to the expert opinion. This would be a more balanced title:

[Image: barry-jester-vorb-chart-newtitle]

If you’re an avid reader of food blogs (and avid eater of food) like me, you’ll see the same issue perpetuated throughout the food industry. “Expert” opinions are typically valued over a “regular” person’s tastes. While there’s a place for expert opinion, I’m not sure it’s the burrito – an everyman’s food if there ever was one.

The problem with expert opinion is that it's formed by experts – folks whose job it is to know food and taste. They know what's been done and what's never been done. An expert opinion will tell you what has pushed the boundary of human achievement, but it won't tell you what you want for lunch today. That opinion belongs to you, and chances are your tastes are more similar to your neighbor's than to Anthony Bourdain's. Crowd-sourced reviews are still a great resource for finding a decent or even good restaurant. They're not a resource for finding redefined cuisine, nor should they be. Crowd-sourced reviews and expert reviews will most likely never match up, and we shouldn't expect them to; they inherently survey two very different groups of people.

Let me say again – there’s no issue with FiveThirtyEight’s data here. They developed a metric to score the expert opinion and stuck to it. It’s a good attempt to quantify something subjective. The only issue is with putting the expert opinion above the crowd-sourced opinion.

Last year my wife and I traveled to London and went to Dinner by Heston, which is fifth among the world’s 50 best restaurants. It was good food – great even – but I’ll still take Franklin’s BBQ over it any day of the week. My mind recognizes the genius of Heston Blumenthal, but on a Sunday afternoon in Austin, TX, nothing beats this:

[Photo: 20141008_112723]


400,000+ Hotel Reviews Across 50 States: What We Were Surprised to Learn

Nevada knows what it’s doing, California may not, and Texas hits the sweet spot.

Love it or hate it?  5 Star or 1 Star?  Reviews are one of the most significant inputs in a customer’s buying decision.  This is especially true when it comes to booking hotels.  Based on a relatively small set of reviews, we can quickly tell whether a particular hotel is stay-worthy or should be completely avoided.  However, can they tell us something more than that?  This was the question that intrigued us, and here’s what we found after combing through 400,000+ hotel reviews.

Hotel reviews are an interesting case study of what can be discovered from web data.  A lot of people review hotels online, and they do so across many different websites.  We know that more positive reviews for a hotel mean more business for that hotel.  In fact, a recent study from Cornell University's School of Hotel Administration showed that customers are twice as likely to book a hotel with positive reviews as they are a hotel with negative reviews.  A second study showed that revenue is strongly correlated with reviews.  However, we wondered what would happen if we zoomed out and looked at the data state-by-state.  Are some states better at generating reviews?  Does a higher tourism budget result in more reviews?

To answer these questions, we studied and analyzed 437,787 hotel reviews, collected from 60 unique review sites.  The review data includes the name and geography of the hotel, along with review text, date, and ratings.  We also included demographic and economic data, such as population and tourism budget (collected from third-party sources), in our analysis.

A Word About Us

Before we get started, a quick word about us – we love data at Datafiniti.  After all, we’re aspiring to provide open access to all web data.  We crawl billions of URLs and convert all the data you need on 50 million worldwide businesses and 30 million online products into an easy, friendly, and searchable database.  Yes, it’s terabytes of data, but who’s counting?  What I’m saying is we have a lot of data.

One of our resolutions this year is to start showing the different kinds of insight that can be pulled out of this data.  What happens when you start treating the Internet like a single database at your fingertips?  What insights are hiding in plain sight?  We’re about to find out.

The Data

So let’s take a look at the data.  When aggregating the data, we counted total # of reviews by state, average rating, the ratio of population to # of reviews, and the ratio of tourism budget to # of reviews.  The map below displays how each state ranks across these different metrics.

The raw data used for this chart and the article is available on Datafiniti’s Github repository.  For the full data set of reviews, please contact us.
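
To make the aggregation concrete, here's a minimal Javascript sketch of how metrics like these can be computed from raw review records. The field names (state, rating, population, tourismBudget) are assumptions for illustration only; the actual schema is in the summary data linked above.

// Sketch of the state-level aggregation described above.
// Field names are illustrative assumptions; see the summary data for the real schema.
function aggregateByState(reviews, stateInfo) {
  var byState = {};

  reviews.forEach(function (review) {
    var s = byState[review.state] || (byState[review.state] = { count: 0, ratingSum: 0 });
    s.count += 1;
    s.ratingSum += review.rating;
  });

  return Object.keys(byState).map(function (state) {
    var s = byState[state];
    var info = stateInfo[state] || {};
    return {
      state: state,
      totalReviews: s.count,
      averageRating: s.ratingSum / s.count,
      reviewsPerCapita: info.population ? s.count / info.population : null,
      reviewsPerTourismDollar: info.tourismBudget ? s.count / info.tourismBudget : null
    };
  });
}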

Winners and Losers

Oklahoma & Idaho are surprising winners while the DC-VA area suffers.

Top 10 States            Bottom 10 States
Hawaii (4.54)            Washington D.C. (2.77)
Nevada (4.27)            Virginia (3.10)
Oklahoma (4.23)          Mississippi (3.45)
Idaho (4.23)             Maine (3.57)
New Jersey (4.21)        Washington (3.60)
New Hampshire (4.18)     Tennessee (3.61)
Massachusetts (4.15)     Connecticut (3.64)
Arizona (4.14)           Delaware (3.67)
Alabama (4.13)           New Mexico (3.68)

A good place to start looking for insights is to zero in on the top 10 states by average rating. Here, we see some expected and some not-so-expected results. Hawaii’s hotels are basking in the warm, sunny glow of a chart-topping 4.54 star rating, while Oklahoma & Idaho tie at a surprising 3rd place, each with a 4.23 rating.

Looking at the bottom 10 states, DC and VA come in at the very depressing averages of 2.77 and 3.10, respectively.  This is way outside the norm, which could hint that something is off with the hotel industry in this area.  It also hints at opportunity, though.  If hotels aren’t meeting customer expectations in this area, then there’s clearly a market need.  Larger hotel franchises may want to consider what they can do to leapfrog their competition here.

Another interesting insight is that most states' average ratings fall between 3.8 and 4.2.  If we consider 3 stars to be average on a 5-star scale, this suggests one of two scenarios: most states (and hotels) are outperforming expectations, or most reviewers have a bias toward 4 stars.  In most other settings, receiving 3 stars out of 5 might be a good thing, but according to our hotel review data, a 3-star average is actually pretty bad.  Businesses may need to re-assess their prior assumptions with these ratings in mind.

Populated States Have More Reviews

Nevada may have more tourists posting reviews, while visitors to Michigan are silent.

Let's start with something simple: how does the number of reviews scale with each state's population?  Conventional wisdom dictates that the more populous the state, the more hotel reviews you're going to see.

[Chart: Population (2010) vs. No. of Reviews (2010 - 2013)]

The chart above supports this hypothesis, but some states sit well outside the trend.  States like Nevada, New York, and Texas generate an abnormally high number of reviews for their population.  Other states, like Michigan, Washington, and Pennsylvania, do very poorly at generating hotel reviews relative to their population.  In general, the states that outperform are those typically considered more "touristy."  Given the importance of reviews on consumer behavior, it may make sense for some states to actively encourage travelers to post reviews.

Higher Tourism Budgets Correlate with More Reviews

California looks good until you look more closely.  Is Hawaii an inefficient spender?

Let's try a different metric.  How does a state's tourism budget relate to its hotel reviews?  In some sense, we can use this as a crude proxy for how well each state's tourism department is helping generate business for hotels.

[Chart: Tourism Budget (2013) vs. No. of Reviews (2010 - 2013)]

We again see some states doing incredibly well and some not so well.  Several states like Nevada and New York are getting a lot of bang for their buck, whereas other states like Hawaii, Illinois, and Michigan might want to look at better ways of spending that money.

One interesting note: California looks like it's doing really well, producing 960 reviews per million dollars of tourism budget.  That's better than most, but other states (e.g., Georgia, at 2,036 reviews per million dollars) are doing much, much better.  For a state with such a large tourism budget, it would probably be worth California's time to study how those states make use of their money.

Data Is Wonderful

We find data like this incredibly exciting.  By analyzing hotel reviews in aggregate, across 60 websites, we can spot some important trends.  These are important insights for businesses and policy makers alike, and it’s all out there on the web, just waiting to be transformed into actionable data.

Even though 400,000+ hotel reviews is a large data set, it's still small in relation to the total amount of data available on the web.  Yet even with just this data, we found how review volume moves with population and state spending.  We also learned which states are outliers on these metrics.

Interested in learning more?


Download the summary data | Contact Us To Get Full Access to the Data

We’ll be analyzing more data like this throughout the year, so be sure to check back regularly.  We can’t wait to see what we find!


To Trust or Not to Trust Online Reviews, That is the Question

Online shopping is one of the greatest inventions ever.  At least, that’s my opinion.  I’m a confessed Amazon power user.  I’ve got a prime account, I use subscriptions to auto-manage most of my regular purchases, and I always read online reviews before making a purchase.  Reviews are so ingrained in my shopping behavior that I check them even when I find myself in a physical store (a rare event nowadays).

Earlier this week, I came across this YouGov article that demonstrated an incredible finding: Most people read online reviews, yet few actually trust those reviews.  In fact, only 13% of those surveyed in the YouGov study said that reviews were very trustworthy.

The article goes on to outline some very interesting points both for and against the effectiveness of online reviews:

Why Reviews Are Valuable
  • Of Americans who read online reviews, 90% find written ratings to be important and 41% find them very important as an aid to decision making.
  • 78% check out the review section before making a purchase.
  • Price doesn't dictate how often online reviews are used – one in five (19%) only use online reviews for products over $100, but a similar number (19%) use reviews for purchases of less than $10.

Why Review Value Could Be Better
  • 21% of reviewers wrote reviews for products or services they had never actually purchased or tried.
  • 89% believe that businesses write negative reviews of competitors.
  • 91% believe businesses write their own positive reviews.

So What Can Businesses Do?

The conflicting story produced by the survey makes for a bit of a head-scratcher.  Clearly, businesses need detailed, up-to-date reviews on their shopping sites and for their online products.  Unfortunately, the effectiveness of these reviews is blunted by a very healthy dose of skepticism from the typical online shopper.

Some possible suggestions:

  1. Actively filter out reviews that sound promotional or disingenuous.  Semantic analysis of the review text can help with this (see the sketch after this list).
  2. Implement some sort of customer verification process to confirm that every reviewer actually purchased or used the product.
  3. Allow users to vote reviews up or down in order to add social trust to the reviews.
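
For the first suggestion, here's a deliberately naive Javascript sketch of how a promotional-language filter might start out. The phrase list is purely illustrative; real semantic analysis goes well beyond keyword matching.

// A naive starting point for flagging promotional-sounding reviews.
// The phrase list is an illustrative assumption, not a recommended production list.
var promotionalPhrases = ['best product ever', 'buy now', 'highly recommend to everyone', 'life-changing'];

function looksPromotional(reviewText) {
  var text = reviewText.toLowerCase();
  return promotionalPhrases.some(function (phrase) {
    return text.indexOf(phrase) !== -1;
  });
}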

Ultimately, a business wants to encourage shoppers to buy the product they’re viewing.  Reviews are a huge part of this, but it’s clear that more can be done to improve their value.  At scale, moving the needle just a small amount toward a purchase can have a multi-million dollar effect.

 


Voltron V2 – “Mauler”

As most of you probably know, today we are rolling out version two of Voltron, or, as it's known around here, MAULER. The name is a fitting evolution of the overall project, which started as an experiment with PhantomJS and ScraperJS, two Javascript libraries made to help with web crawling, and grew into talented developers on our team building their own library to make web requests and crawl websites.

More Reliable Web Crawling

Version one of Voltron already let you crawl an impressive number of URLs per second. With this new – and improved (a cliché, we know, but true) – web scraper, you'll not only be able to crawl more URLs, you'll also be able to do so more reliably. The most important feature of MAULER, and the one that enables the increased speed and reliability, is the internal structure of the crawler. Previously, crawls were run through external nodes, which could be viewing web pages from international IP addresses – something that can, and often did, skew the results. Now every URL is accessed by an internal HTTP request, which gives us much more control over what the web scraper sees when it crawls the raw HTML. It also means that what we, and those of you who use Voltron, see in the tester (a Chrome extension we provide) should almost always exactly match what the web crawler actually sees.

In addition to writing our own HTTP class to make web requests, we've also created our own extended version of Cheerio, a lighter-weight take on jQuery that implements only core jQuery features in order to perform DOM traversal more quickly (up to 8x the speed of JSDOM, according to the Cheerio Github page). Our extended version, aptly named Captain Crunch, keeps all of Cheerio's functionality and adds certain jQuery functions we commonly use in web scraping, including .each, .not, .makeArray, .filter, and .prop. The code for Captain Crunch will be available soon on our Github page.
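
If you haven't used Cheerio before, here's a minimal example of the jQuery-style traversal it provides. This uses standard Cheerio calls, not Captain Crunch, since that code isn't public yet.

// Standard Cheerio usage: load HTML and traverse it with jQuery-style selectors.
var cheerio = require('cheerio');

var $ = cheerio.load('<ul><li class="a">one</li><li class="b">two</li></ul>');

$('li').each(function (i, el) {
  console.log($(el).attr('class'), $(el).text()); // "a one", then "b two"
});

var filtered = $('li').filter('.b').text(); // "two"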

As mentioned before, MAULER began as a test of whether PhantomJS and ScraperJS would give us greater web crawling capabilities. However, we decided to leave these two libraries behind and create our own HTTP class and extended Cheerio library in order to get a dramatic increase in consistency and speed. While experimenting with the two libraries, we found that they required the creation of a window object, something that can take up to five valuable seconds. By creating our own HTTP class and using Captain Crunch, we avoid this cost and, thus, achieve the faster and more reliable results we wanted.

Faster Web Crawling

Another step up in MAULER is a faster, more dependable transporter. This component of Voltron composes the results files for each crawl, and it used to be something of a bottleneck, since only one Redis key could write to it at a time. The new transporter allows multiple keys to push data to Redis at the same time and only starts creating results files once the Redis list has reached 10 MB of data. This lets crawls move through the entire Voltron system much faster, since other components, such as the crawlers themselves and the URL distributor, no longer have to wait as long for the transporter to finish its work.
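
To make the transporter idea concrete, here's a rough Javascript sketch of the buffer-then-flush pattern described above. This is not Voltron's actual code; the key name, threshold handling, and file output are simplified assumptions.

// Simplified sketch: crawlers push records onto a Redis list, and the transporter
// only writes a results file once roughly 10 MB of data has accumulated.
var redis = require('redis');
var fs = require('fs');

var client = redis.createClient();
var RESULTS_KEY = 'crawl:example:results'; // hypothetical key name
var FLUSH_THRESHOLD = 10 * 1024 * 1024;    // ~10 MB, as described above
var bufferedBytes = 0;

function pushResult(record) {
  var payload = JSON.stringify(record);
  client.rpush(RESULTS_KEY, payload, function (err) {
    if (err) return console.error(err);
    bufferedBytes += Buffer.byteLength(payload);
    if (bufferedBytes >= FLUSH_THRESHOLD) flushToResultsFile();
  });
}

function flushToResultsFile() {
  bufferedBytes = 0;
  client.lrange(RESULTS_KEY, 0, -1, function (err, records) {
    if (err) return console.error(err);
    client.del(RESULTS_KEY); // simplified: ignores records pushed between LRANGE and DEL
    fs.appendFileSync('results.json', records.join('\n') + '\n');
  });
}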

Faster Bug Fixes

The third and final major new feature of MAULER is the QA environment. This is something our whole team at Datafiniti – and, we're sure, you as well – has been eager to have. Because crawls now run internally and the new web scraper has much more custom error handling, it will be much easier to find bugs (and fix them!), much faster for us to roll out new features, and much simpler to test crawls and investigate issues.

We’re beyond excited to be bringing the improved Voltron web crawler to you and hope you are excited to try it out. Please reach out to us at contact@datafiniti.net if you have any questions regarding the new crawler and its capabilities or would like a demonstration of our new 80legs product.


Upcoming Downtime

80legs will be unavailable on Monday, September 22, from 10 am to 4 pm central US time (GMT-6). This downtime will help us deploy a major update to our back-end infrastructure, which will significantly improve crawling performance.

Please note that all crawls still running at 10 am (GMT-6) September 22 will be canceled. You will be able to run new crawls after 4 pm (GMT-6) September 22.

The update will provide the following benefits:

  1. More consistent crawling speeds – no more slow periods.
  2. More reliable crawling performance – URLs will be more consistently crawled.
  3. Better internal visibility – we’re deploying an internal QA infrastructure that will give us more tools to debug and improve 80legs.

Building a Web Scraper

We briefly touched on how to build a web scraper in our last post on web crawling.  In this post, I’ll go into more detail about how to do this.  When I use the term “web scraper,” I’m referring to a very specific type of web crawler – one that looks at a specific website and extracts data from it.  We do a lot of similar scraping for Datafiniti.

This post is going to cover a lot of ground, including:

  1. Document Object Model (DOM):  An object representation of HTML
  2. JQuery:  A Javascript library that helps you manipulate the DOM
  3. Setting up our environment
  4. Building the Scraper:  Building out the scraper attribute-by-attribute
  5. Running the Scraper:  Using 80legs to run the scraper

The Document Object Model

Before we dive into building a scraper, you'll need to understand a very important concept – the Document Object Model, aka the DOM.  The DOM is how all modern web browsers look at the HTML that makes up a web page.  The HTML is read in by the browser and converted to a more formalized data structure, which helps the browser render the content you actually see on the site.  You can think of the DOM as a nested collection of HTML data, and you can even see it in your browser.  In Chrome, you get this by right-clicking and choosing "Inspect Element":

[Image: buildscraper-1]

JQuery

Because the DOM is such an accepted, standardized way of working with HTML, there are a lot of tools available for manipulating it.  One of the most widely used tools is JQuery, a library that enhances Javascript by giving it a ton of DOM-manipulation functionality.

As an example, let’s say we wanted to capture all the most-nested elements in this HTML list (item-1, item-2, and item-3):

<ul class="level-1">
 <li class="item-i">I</li>
 <li class="item-ii">II
  <ul class="level-2">
   <li class="item-a">A</li>
   <li class="item-b">B
    <ul class="level-3">
     <li class="item-1">1</li>
     <li class="item-2">2</li>
     <li class="item-3">3</li>
    </ul>
   </li>
   <li class="item-c">C</li>
  </ul>
 </li>
 <li class="item-iii">III</li>
</ul>

With JQuery, we would just need to do something like this:

var innerList = $html.find('ul.level-3 li');

As you’ll see, using JQuery with the DOM greatly simplifies the web scraping process.
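
Continuing that example, here's how you'd typically pull the text out of each matched element:

// Extract the text of each matched <li> into a plain array.
var items = [];
innerList.each(function () {
  items.push($(this).text().trim());
});
// items is now ["1", "2", "3"]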

Setting Up Our Development Environment

Now that we understand some of the basic concepts, we're almost ready to start building our scraper.  Before we can get to the fun stuff, however, we need to set up a development environment.  If you do this, you'll be able to follow along and build the scraper as you read the article.  Here are the steps you need to take:

  1. Install Git.
  2. Clone the EightyApps repo.
  3. Install the EightyApp tester for Chrome.  Instructions are on the EightyApps repo page.
  4. Register on 80legs.

Building the Web Scraper

Now we’re ready to get started!  Open the BlankScraper.js file, which should be in the repo you just cloned, in a text editor.  In your browser, open http://www.houzz.com/pro/jeff-halper/exterior-worlds-landscaping-and-design, which we’ll use as an example.

For the purposes of this tutorial, we’ll say we’re interested in collecting the following attributes:

  • Name
  • Address
  • City
  • State
  • Postal code
  • Contact

Let's start with address.  If you right-click on the web page in your browser and select "Inspect Element", you'll see the full HTML for the page.  Find where the address ("1717 Oak Tree Drive") appears in the HTML.  You can do this quickly by clicking on the magnifying glass in the upper-left corner of the inspect-element panel and then clicking on the address where it's displayed on the web page.

Note that the address value is stored within a span tag, which has an itemprop value of “streetAddress”.  In JQuery, we can easily capture this value with this:

object.address = $html.find('span[itemprop="streetAddress"]').text();

We can do similar things for city, state, and zip code:

object.city = $html.find('span[itemprop="addressLocality"]').text();
object.state = $html.find('span[itemprop="addressRegion"]').text();
object.postalcode = $html.find('span[itemprop="postalCode"]').text();

Some attributes may be a little harder to get at than others.

Take a look at how the contact for this business ("Jeffrey Halper") is stored in the HTML.  There isn't really a unique HTML tag for it.  It's using a non-unique <dt class="value"> tag.  Fortunately, JQuery still gives us the tools to find this tag:

object.contact = $html.find('dt:contains("Contact:")').next().text();

This code finds the <dt> element containing the text "Contact:", traverses to the next HTML tag, and captures the text in that tag.

Once we’ve built everything out, here’s what the code for the scraper looks like:
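
The original embedded code isn't reproduced here, so below is a reconstructed sketch based on the snippets above. The wrapper and method signatures are assumptions on my part; BlankScraper.js in the EightyApps repo has the exact template, and the name selector is only a placeholder.

// Reconstructed sketch of the scraper. Treat the structure as illustrative;
// follow BlankScraper.js for the exact 80app template.
var EightyApp = function () {

  // Extracts the attributes we care about from a single profile page.
  this.processDocument = function (html, url, $) {
    var $html = $(html); // the real template provides a helper for this; shown inline for clarity
    var object = {};
    object.name       = $html.find('a.pro-title').text(); // selector assumed for illustration
    object.address    = $html.find('span[itemprop="streetAddress"]').text();
    object.city       = $html.find('span[itemprop="addressLocality"]').text();
    object.state      = $html.find('span[itemprop="addressRegion"]').text();
    object.postalcode = $html.find('span[itemprop="postalCode"]').text();
    object.contact    = $html.find('dt:contains("Contact:")').next().text();
    return JSON.stringify(object);
  };

  // Returns the next set of links to crawl, keeping the crawl focused on
  // professional profile and directory pages.
  this.parseLinks = function (html, url, $) {
    var $html = $(html);
    var links = [];
    $html.find('a').each(function () {
      var href = $(this).attr('href');
      if (href && (href.indexOf('/pro/') !== -1 || href.indexOf('/professionals/') !== -1)) {
        links.push(href);
      }
    });
    return links;
  };
};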

You'll notice that there are only two methods in this code.  The first is processDocument, which contains all the logic needed to extract data or content from the web page.  The second is parseLinks, which grabs the next set of links to crawl from the page the crawler is currently on.  I've filled out parseLinks to make the crawl more efficient.  While we could let the code return every link found, what I've provided here focuses the crawl on URLs that actually have the data we want to scrape.

You can use the EightyAppTester Extension in Chrome to test the code you’ve written.  Just copy and paste the code in, and paste in different URLs to test what it grabs from these specific URLs.

You may be wondering where the rest of the web crawling logic is.  Because we're going to use 80legs to run this scraper, we don't need to worry about anything except processDocument and parseLinks.  80legs will handle the rest for us.  We just specify what to do on each URL the crawl hits.  This really simplifies the amount of code we have to write for the web scraper.

Running the Scraper

With our scraping code complete, we head over to 80legs, log in, and upload the code.  We'll also want to upload a URL list so our crawler has at least 1 URL to start from.  For this example, http://www.houzz.com/professionals/ is a good start.

With our code and URL list available in our 80legs account, all that's left is to run the crawl.  We can use the Create a Crawl form to select all the necessary settings, and we're off!

[Image: buildscraper-2]

The crawl will take some time to run.  Once it’s done, we’ll get one or more result files containing our scraped data.  And that’s it!

Wrapping Up

If you found this post useful, please let us know!  If anything was confusing, please comment, and we’ll do our best to clarify.

More posts will be coming in, so check back regularly.  You can also review our previous posts to get more background information on web crawlers and scrapers.



Typical Uses For Web Crawlers

In our last post, we provided an introduction to the structure and basic operations of a web crawler.  In this post, we'll go into more detail on specific use cases for web crawlers.  As we do, we'll provide some insight into how you could design web crawlers for each of these use cases.

The One You Use But Don’t Realize It – Search Engines

How terrible would the Internet be without search engines?  Search engines make the Internet accessible to everyone, and web crawlers play a critical part in making that happen.  Unfortunately, many people confuse the two, thinking web crawlers are search engines, and vice versa.  In fact, a web crawler is just the first part of the process that makes a search engine do what it does.

Here’s the whole process:

 

[Diagram: How Search Engines Work]

 

When you search for something in Google, Google does not run a web crawler right then and there to find all the web pages containing your search keywords.  Instead, Google has already run millions of web crawls and already scraped all the content, stored it, and scored it, so it can display search results instantly.

So how do those millions of web crawls run by Google work?  They're pretty simple, actually.  Google starts with a small set of URLs it already knows about and stores these as a URL list.  It sets up a crawl to go over this list and extract the keywords and links on each URL it crawls.  As each link is found, those URLs are crawled as well, and the crawl keeps going until some stopping condition is met.

 

[Diagram: How Web Crawlers Work]

 

In our previous post, we described a web crawler that extracted links from each URL crawled to feed back into the crawl.  The same thing is happening here, but now the “Link Extraction App” is replaced with a “Link and Keyword Extraction App”.  The log file will now contain a list of URLs crawled, along with a list of keywords on each of those URLs.

If you wanted to do this same thing on 80legs, you would just need to use the “LinksAndKeywords” 80app with your crawl.  Source code for this app is available here.
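
If you just want the gist of what a link-and-keyword app does, here's a rough sketch using the same jQuery-style traversal as our other examples. The linked source code is the authoritative version; the keyword tokenization below is deliberately simplistic.

// Rough sketch: collect every link on the page plus a de-duplicated list of words.
function extractLinksAndKeywords($html) {
  var links = [];
  $html.find('a').each(function () {
    var href = $(this).attr('href');
    if (href) links.push(href);
  });

  var seen = {};
  var keywords = [];
  $html.text().toLowerCase().split(/\W+/).forEach(function (word) {
    if (word.length > 2 && !seen[word]) {
      seen[word] = true;
      keywords.push(word);
    }
  });

  return { links: links, keywords: keywords };
}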

The process of storing the links and keywords in a database and scoring their relevancy so search results can be returned is beyond the scope of this post.

The One Developers Love – Scraping Data

If we focus our crawling on a specific website, we can build out a web crawler that scrapes content or data from that website.  This can be useful for pulling structured data from a website, which can then be used for all sorts of interesting analysis.

When building a crawler that scrapes data from a single website, we can provide very exact specifications.  We do this by telling our web crawler app specifically where to look for the data we want.  Let’s look at an example.

Let’s say we want to get some data from this website:

[Screenshot: business listing for Buckingham Floor Company, Doylestown, PA, US 18914]

We want to get the address of this business (and any other business listed on this site).  If we look at the HTML for this listing, it looks like this:

[Image: html-scraping]

Notice the <span itemprop="streetAddress"> tag.  This is the HTML element that contains the address.  If we looked at the other listings on this site, we'd see that the address is always captured in this tag.  So what we want to do is configure our web crawler app to capture the text inside this element.

You can do this capturing in a lot of different ways.  The apps you use with 80legs are developed in Javascript, which means you can use JQuery to access the HTML as if it were one big data “object” (called the “DOM”).  In a later post, we’ll go into more detail on the DOM so you can get more familiar with it.  In this case, we would just do a simple command like:

object.address = $html.find('span[itemprop="streetAddress"]').text();

We can do similar commands for all the other bits of data we'd want to scrape on this web page, and on all the other pages on the website.  Once we do that, we'd get an object like this for each page in our log file:

{
  "name": "Buckingham Floor Company",
  "address": "415 East Butler Avenue",
  "locality": "Doylestown",
  "region": "PA",
  "postalcode": "18914",
  "phone": "(215) 230-5399",
  "website": "http://www.buckinghamfloor.com"
}

After we generated this log file and downloaded it to our own database or application, we could start analyzing the data contained within.

Any other sort of data scraping will work the same way.  The process will always be:

  1. Identify the HTML elements containing the data you want.
  2. Build out a web crawler app that captures those elements (80legs makes this easy).
  3. Run your crawl with this app and generate a log file containing the data.

We’ll go into more detail on building a full scraper in a future post, but if you want to give it a go now, check out our support page to see how you can do this with 80legs.

As a final note, if you’re interested in business data, we already make this available through Datafiniti.  If you don’t want to bother with scraping data yourself, we already do it for you!


What is Web Crawling?

Introduction

Web crawling can be a very complicated and technical subject to understand.  Every web page on the Internet is different from the next, which means every web crawler is different (at least in some way) from the next.

We do a lot of web crawling to collect the data you see in Datafiniti.  In order to help our users get a better understanding of how this process works, we’re embarking on an extensive series of posts to provide better insight into what a web crawler is, how it works, how it can be used, and the challenges involved.

Here are the posts we have planned:

  1. What is Web Crawling?
  2. Typical use cases for web crawlers
  3. Building a web scraper
  4. Different data formats for storing data from a web crawl: CSV, JSON, and Databases
  5. How to use JSON data
  6. Challenges with scraping data
  7. Web crawling use cases: collecting pricing data
  8. Web crawling use cases: collecting business reviews
  9. Web crawling use cases: collecting product reviews
  10. Comparison of different web crawlers

So let’s get started!

The Web Page, Deconstructed

We actually need to define what a web page is before we can really understand how a web crawler works.  A lot of people think of a web page as what they see in their browser window, which is right, but that's not what a web crawler sees when it looks at the same page.  So let's look at a web page the way a web crawler does.

When you see http://www.cnn.com, you see something like this:

 

[Screenshot: CNN.com home page]

 

In fact, what you are seeing is the combination of many different "resources", which your web browser combines to show you the page you see.  Here's an abridged version of what happens:

  1. You type in “http://www.cnn.com”.
  2. Your browser says ok, let me GET “http://www.cnn.com”.
  3. CNN's server says, hey browser, here's the content for that page.  At this point, the server is only returning the HTML source code of "http://www.cnn.com", which looks something like this:
    [Image: html_source]
  4. Your browser looks through this code and notices a few things.  It notices there are a few style resources needed.  It also notices there are several image resources needed.
  5. The browser now says, I need to GET all of these resources as well.
  6. Once all the resources for the page are received, it combines them all and displays the page you see.

This is what your browser does.  A web crawler can get all the same resources, but if you tell it to GET "http://www.cnn.com", it will only fetch the HTML source code.  That's all it knows about the page until you tell it to do something else (possibly with the information in the HTML).  By the way, "GET" is the actual technical term for the type of HTTP request being made by both the crawler and your browser.

A Very Basic Web Crawler

Alright, so now that we understand that requesting “http://www.cnn.com” will only return HTML source code, let’s see what we can do with that.

Let’s imagine our web crawler as a little app.  When you start this app, it asks you for what web page you want to crawl.  That’s its only input: a list of URLs, or in this case, a list containing 1 URL.

You enter “http://www.cnn.com”.  At this point, the web crawler gets the HTML source code of this URL.  The HTML is like a very long piece of semi-structured text.  It’s going to write that text to a separate file.  Just to make it easy on us, the web crawler will also write which URL belongs to this source code.

The whole thing can be visualized like this:

What is Web Crawling Illustration 1
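
Here's roughly what that fetch-and-log behavior looks like as a few lines of Javascript. This is a conceptual sketch, not 80legs code; redirects and retries aren't handled.

// Fetch one URL's HTML with a GET request and append it to a log file alongside the URL.
var http = require('http');
var fs = require('fs');

function crawl(url) {
  http.get(url, function (res) {
    var html = '';
    res.on('data', function (chunk) { html += chunk; });
    res.on('end', function () {
      fs.appendFileSync('crawl-log.txt', url + '\n' + html + '\n\n');
    });
  }).on('error', console.error);
}

crawl('http://www.cnn.com');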

A Slightly More Complicated Web Crawler

So the web crawler can’t do much right now, but it can do the basic thing any web crawler needs to do, which is to get content from a URL.  Now we need to expand it to get more than 1 URL.

There are two ways we can do this.  First, we can supply more than 1 URL in our URL list as input.  The web crawler would then iterate through each URL in this list, and write all the data to the same log file, like so:

What is Web Crawling Illustration 2

Another way would be to use the HTML source code from each URL as a way to find the next set of URLs to crawl.  If you look at the HTML source code for any page, you'll find several references to anchor tags, which look like <a href="">some text</a>.  These are the links you see on a web page, and they can tell the web crawler where other URLs are.

So all we need to do now is extract the URLs of those links and then feed those in as a new URL list to the app, like so:

What is Web Crawling Illustration 3

In fact, this is how web crawlers for search engines typically work.  They start with a list of “top-level domains” (e.g., cnn.com, facebook.com, etc.) as their URL list, step through that list, and then crawl to all the links found on the pages they crawl.
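
The link-extraction step can be sketched crudely with a regular expression. Real crawlers, including 80legs apps, use a proper HTML parser instead, but this shows the idea.

// Naive link extraction: pull href values out of anchor tags to build the next URL list.
function extractLinks(html) {
  var links = [];
  var anchor = /<a\s[^>]*href="([^"]+)"/gi;
  var match;
  while ((match = anchor.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}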

So What’s the Purpose of the Web Crawler?

We now have a conceptual understanding of what a typical web crawler does, but it may not be clear what its real purpose is.

The ultimate purpose of any web crawler is to collect content or data from the web.  "Content" or "data" can mean a wide variety of things, ranging from the full HTML source code of every URL requested down to a simple yes/no indicating whether a specific keyword exists on a page.  In our next blog post, we'll cover some common use cases and expand on how the conceptual "web crawling app" we've described here could be adapted to fit them.
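
As a tiny example of the "yes/no for a keyword" case, the entire output per page could be as small as this:

// Reduce an entire page to a single yes/no answer for one keyword.
function containsKeyword(html, keyword) {
  return html.toLowerCase().indexOf(keyword.toLowerCase()) !== -1;
}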

Want to Try Web Crawling Yourself?

If you're interested in trying to run your own web crawls, we recommend using 80legs.  It's the same platform we use to run crawls for Datafiniti.

 


New User Agent for 80legs

On Thursday, July 17th, we'll be changing the user-agent for the 80legs crawler from "008" to "voltron".

We recognize that changing the user-agent for our web crawler could potentially be controversial, but in this case we feel it’s strongly warranted.  Over 4 months ago, we launched a completely new back-end for 80legs.  Although we still call the system “80legs”, in reality it’s a completely different web crawler.  One of the biggest features of the new crawler is that it’s considerably better about crawling websites respectfully.  In fact, we haven’t received a single complaint from webmasters since we launched the new crawler.

With this change, the 80legs crawler will now only obey robots.txt directives for the “voltron” user-agent.  It will ignore directives for the “008” user-agent.  We feel this change in behavior is appropriate, as it gives our users the chance to crawl websites inaccessible to the old crawler while still giving webmasters the opportunity to control traffic coming from the new crawler.