The Money Pit: Why You Shouldn’t Build Your Own Web Scrapers

Have you heard of the Oak Island Money Pit? Some have said it’s the last great unsolved mystery — a hidden treasure buried by pirates on a remote island that generations of families have tried in vain to uncover.

Image courtesy http://www.oakislandmoneypit.com/

The story goes like this: In 1795, a young boy named Daniel McGinnis saw some curious lights coming from an island off the coast near his parents’ house. When he later got to the island, he discovered a small hole, signs of removed oak trees, and a block and tackle hanging from a severed tree limb. Daniel and his friends began digging for the buried treasure that certainly lay beneath. Remarkably, they came across some wooden slats buried in the ground, but never the treasure itself. In fact, excavators have found a number of fascinating objects, including an ancient stone with a strange inscription, but no one has ever found actual treasure. Excavators worked on the Oak Island Money Pit from 1795 to 2003.

The tale reminds me of the perils of building your own web scraper. That sounds silly, I know, but there are several similarities:

  • There’s a promise of a great treasure (valuable data).
  • You’re not sure where to look and there’s no guarantee you will find anything once you start digging.
  • You end up spending a lot of money for the initial excavation (implementation).
  • Once you feel your work is done, the hole you’ve dug (scraper you’ve built) collapses in on itself (the website changes its layout).
  • Keeping the hole open (maintaining the scraper) becomes more expensive than the initial work.

The challenges of building your own web scraper(s)

Let’s break down the potential costs and pitfalls when building your own web scraper:

Initial setup costs end up higher than expected. Web scraping can seem like a simple process at first, and it can be — if everything goes right. Unfortunately, it rarely does. As you spend time making incremental improvements and fixing bugs, you end up realizing you’ve spent a lot more time and money than you expected.

Web scrapers require regular maintenance. Websites are constantly changing their layouts, which means web scrapers have to be updated fairly often. You’ll pay developers on a regular basis to keep the scrapers working.
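To make the maintenance problem concrete, here is a minimal sketch of how a small layout change can silently break a scraper. It uses only Python's standard library. The HTML snippets, the class names "price" and "amount", and the helper extract_price are all hypothetical and made up for illustration; they don't come from any real website or from our own crawlers.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text of the first element whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._capturing = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing, so fall back to an empty string.
        classes = (dict(attrs).get("class") or "").split()
        if self.price is None and self.target_class in classes:
            self._capturing = True

    def handle_data(self, data):
        # Grab the first text node after the matching start tag.
        if self._capturing:
            self.price = data.strip()
            self._capturing = False


def extract_price(html, target_class="price"):
    """Return the price text, or None if the expected class is not on the page."""
    parser = PriceParser(target_class)
    parser.feed(html)
    return parser.price


# Hypothetical markup before and after a site redesign.
old_layout = '<div class="product"><span class="price">$19.99</span></div>'
new_layout = '<div class="product"><span class="amount">$19.99</span></div>'

print(extract_price(old_layout))  # $19.99
print(extract_price(new_layout))  # None
```

The price is still on the page after the redesign, but the scraper no longer finds it, and nothing fails loudly. Someone has to notice the missing data and update the code.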

Scraping data from a single website may not be enough. In many cases, websites offer incomplete information. E-commerce sites may not list manufacturer part numbers, business review sites may not have phone numbers, and so on. You’ll typically need more than one website to build a complete picture of your data set.

Merging data from multiple websites presents its own challenges. If you require data from multiple websites, you’ll need to merge records at some point. That means normalizing formats such as names and phone numbers and deciding when two records describe the same thing, which is yet another unaccounted-for cost.
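As a rough illustration of what merging involves, the sketch below normalizes two records for the same business from two hypothetical sites and combines them. The records, field names, and matching rule (same normalized name and phone number) are assumptions made up for this example. Real record matching usually needs fuzzier comparisons and many more edge cases.

```python
import re


def normalize_phone(raw):
    """Keep digits only so '(512) 555-0100' and '512.555.0100' compare equal."""
    return re.sub(r"\D", "", raw or "")


def normalize_name(raw):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", raw.lower())
    return " ".join(cleaned.split())


def merge_records(primary, secondary):
    """Fill gaps in primary with values from secondary; primary wins on conflicts."""
    merged = dict(secondary)
    merged.update({key: value for key, value in primary.items() if value})
    return merged


# Hypothetical records for the same business scraped from two sites.
site_a = {"name": "Joe's Pizza, Inc.", "phone": "(512) 555-0100", "website": ""}
site_b = {"name": "joes pizza inc", "phone": "512.555.0100", "website": "example.com"}

# Assumed matching rule: same normalized name and same normalized phone.
same_business = (
    normalize_name(site_a["name"]) == normalize_name(site_b["name"])
    and normalize_phone(site_a["phone"]) == normalize_phone(site_b["phone"])
)

merged = merge_records(site_a, site_b) if same_business else None
print(merged)
```

Even this toy version has to make judgment calls, such as which source wins when both have a value. Those decisions multiply with every field and every new website.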

The true cost of a web scraper

It can be difficult to determine the true cost of a web scraper. Here are some items you’d have to budget for:

  1. Developer time to build one or more web scrapers.
  2. Server cost to continually run your web scrapers.
  3. Data storage cost to store data collected.
  4. Developer time to maintain web scrapers.
  5. Developer time to normalize, merge, and process scraped data.

Let’s try to estimate what these costs would be.

At Datafiniti, our cost for (1) above is about $500 per web scraper, but that’s with a very fine-tuned process and crawling platform. If you’re starting from scratch, your implementation cost will be much higher.

Our server costs are in the $10K-$30K/month range, but that’s with running hundreds of web crawls constantly. This cost covers (2) and (3) above. Your server costs are likely to be lower, but you can still expect anywhere between $500 and $2,000/month for any real scale in your data volume.

If you’re scraping data from 5 or more websites, expect 1 of those websites to require a complete overhaul each month. That’s another $500/month of developer time, at the minimum. This covers (4) above.

Finally, expect significant investment if you want to cover (5) above. If we’re extremely optimistic, we need 1 month of a senior developer’s time for setting up data processing, which will cost $10K or more.

So let’s summarize the most optimistic scenario:

  1. Developer time for web scraper implementation: $500/web scraper
  2. Server cost to run scrapers and store data: $500/month
  3. Developer time to maintain web scrapers: $500/month
  4. Developer time to implement data processing: $10,000

Assuming you need to scrape 5 websites, you’re looking at an implementation cost of $12,500 (5 scrapers at $500 each, plus $10,000 for data processing) and a recurring cost of $1,000/month. Each additional website adds to the total.
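If you want to plug in your own numbers, the arithmetic behind this estimate can be sketched in a few lines. The default values are the optimistic figures from the summary above. The function name scraping_costs and the first-year total are additions for illustration, derived from those same figures.

```python
def scraping_costs(num_sites, build_per_site=500, processing_setup=10_000,
                   server_per_month=500, maintenance_per_month=500):
    """Return (one-time implementation cost, recurring monthly cost) in dollars."""
    upfront = num_sites * build_per_site + processing_setup
    monthly = server_per_month + maintenance_per_month
    return upfront, monthly


upfront, monthly = scraping_costs(5)
first_year = upfront + 12 * monthly  # derived figure: setup plus 12 months of upkeep

print(upfront, monthly, first_year)  # 12500 1000 24500
```

Under these optimistic assumptions, the recurring costs in the first year come close to the entire implementation cost.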

These are very rough estimates, but hopefully they help paint a picture of the true cost of running a web scraping operation. What seems like a simple dig next to a tree can quickly turn into an all-consuming, costly operation.

Don’t let a treasure hunt turn into a money pit

The Oak Island Money Pit has consumed the lives (sometimes literally) of dozens of people and cost millions of dollars over the course of more than 200 years. To this day, no one has found anything of real value. As people eventually discovered, the biggest challenge facing any treasure hunter on Oak Island was a series of underground caves connected to the ocean. Any attempt to dig a hole was essentially an attempt to plug up the sea itself.

Web scraping faces a similar challenge — the web is a wild and constantly changing place. Any data collection at scale requires a massive infrastructure. This is exactly the reason we built Datafiniti — to save our customers from the frustration of plugging up the ocean.

Unlike Oak Island, there is real treasure on the web, but it can be difficult to find. We’re here to help!

For more information on the story of Oak Island, I recommend http://www.oakislandmoneypit.com/.
