Why Some Websites Can’t Be Included In Datafiniti

Datafiniti
Knowledge from Data: The Datafiniti Blog
3 min read · Jul 19, 2016


In an ideal world, data would be open and accessible to everyone, making it easy to build powerful applications on top of it. Unfortunately, we’re not quite there yet. Datafiniti goes a long way toward making this vision a reality by making most of the data available on the web accessible, but even we encounter roadblocks. From time to time, we come across websites that can’t be crawled or scraped and therefore can’t be included as data sources for us or our customers.

Top Reasons Why a Website Can’t Be Crawled:

1. The website uses robots.txt to block web crawlers

Websites can provide a file called robots.txt at their top-level path (e.g., https://datafiniti.co/robots.txt). The robots.txt file tells web crawlers what URLs can and cannot be accessed. A full description of this file is available on Wikipedia. Some websites are very restrictive and may disallow all access by web crawlers.

Datafiniti’s web crawlers obey these restrictions.
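For illustration, here is a short Python sketch using the standard library’s urllib.robotparser to show how a compliant crawler consults robots.txt before fetching a page. The user-agent string “DatafinitiBot” is a hypothetical placeholder, not our crawler’s actual identifier.

```python
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt file.
rp = RobotFileParser()
rp.set_url("https://datafiniti.co/robots.txt")
rp.read()

# A robots.txt containing "User-agent: *" followed by "Disallow: /"
# disallows every URL for every compliant crawler, so can_fetch()
# would return False here.
if rp.can_fetch("DatafinitiBot", "https://datafiniti.co/some-page"):
    print("Allowed to crawl this URL")
else:
    print("Blocked by robots.txt")
```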

2. The website blocks crawlers by their IP address

Many web scrapers don’t follow standard protocols for crawling websites. Because of this, website administrators have resorted to more advanced techniques for blocking scrapers. By blocking any request originating from a particular IP address, a website can prevent scraping activity coming from the computer at that address. Websites that do this typically have pre-defined rules in place to automatically block IP addresses that make too many requests in a short period of time.

While Datafiniti’s web crawlers have their own rules in place to avoid triggering these blocks, some websites employ very aggressive rules and may block request rates as slow as one request per second. In these cases, our crawler may get blocked or slowed down to the point where the website becomes useless as a data source.
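As a sketch of the throttling involved, the Python snippet below spaces out requests to stay under a self-imposed rate limit and backs off when the server signals it is being hit too fast. The target URLs and the two-second delay are assumptions for illustration, not Datafiniti’s actual crawl settings.

```python
import time
import requests

# Self-imposed minimum delay between requests (assumed value; tune per site).
MIN_DELAY = 2.0  # seconds

urls = [
    "https://example.com/page/1",  # hypothetical target pages
    "https://example.com/page/2",
]

last_request = 0.0
for url in urls:
    # Sleep just long enough to keep consecutive requests below the limit.
    wait = MIN_DELAY - (time.time() - last_request)
    if wait > 0:
        time.sleep(wait)
    last_request = time.time()

    response = requests.get(url)
    if response.status_code == 429:  # "Too Many Requests": back off further
        time.sleep(MIN_DELAY * 5)
```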

3. The website uses cookies/sessions to serve information

Some websites require visitors to accept cookies or session data before they will serve relevant data. Cookies are used by websites to identify individual visitors and track their behavior as they navigate the site.

Datafiniti’s web crawlers do not currently support cookies, so they cannot access websites that require them.
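For context, the sketch below shows what cookie support looks like in practice, using Python’s requests library: a Session object stores any cookies the server sets and replays them on later requests. The URLs are hypothetical placeholders.

```python
import requests

# requests.Session keeps cookies the server sets and sends them back
# automatically on subsequent requests.
session = requests.Session()

# First visit: the server may respond with a Set-Cookie header that
# identifies this visitor (e.g., a session ID).
session.get("https://example.com/")
print(session.cookies.get_dict())  # cookies the server asked us to keep

# Later requests include those cookies. A site that requires them will
# refuse to serve real content to a cookie-less crawler.
page = session.get("https://example.com/listings")
print(page.status_code)
```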

4. The website requires a user to log in

If a website requires a user login before serving any relevant data, then Datafiniti cannot crawl the website. Beyond any technical limitations, our research has shown that crawling data behind logins typically results in the user account being banned.

5. The website uses form submission or POST requests to show data

While it may not be immediately apparent in all cases, some websites use invisible form submissions (a.k.a. “POST” requests) to show data. Behind the scenes, the website submits specific data (e.g., a category listing in a directory) in order to display results.

Datafiniti doesn’t currently support this functionality.
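As a hedged sketch of what this looks like, the snippet below sends a form’s fields in the body of a POST request, the way a browser does when a page submits a hidden form. The endpoint and form field names are invented for illustration.

```python
import requests

# The browser sends the form's fields in the body of a POST request,
# and the server returns results only for that specific POST.
response = requests.post(
    "https://example.com/directory/search",          # assumed endpoint
    data={"category": "restaurants", "page": "1"},   # assumed form fields
)
print(response.status_code)

# A crawler that only follows links with GET requests never submits this
# form, so it never sees the data the POST response contains.
```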

What To Do When a Website Can’t Be Crawled

The great thing about the web is that there’s a ton of data out there. Compared to the total volume of available data, the chance that any one website contains truly unique information is very low. For example, if a review site like Yelp can’t be crawled, several other websites also have rich review data for local businesses. In many cases, a single website can be replaced with one or more others to produce equal or better data.

You can connect with us to learn more about our business, people, product, and property APIs and datasets.

