The Potential of Instant Access to Web Data

In my last post, I asked the question “Is Web Data Possible?” At first, the answer may seem obvious, but closer inspection of the challenges in making web data consumable makes it apparent how difficult the problem really is. It also highlights why it hasn’t happened yet, despite many attempts to do so.

At Datafiniti, we’re making web data truly available for the first time, and we’re fascinated by the possibilities that opening up such a data source represents. What would you do if you could get instant access to all web data? It’s a question that touches on the possibility of accessing almost all human experience and knowledge instantly. It’s incredibly exciting, but also difficult, to think about its impact.

Where We Are Now

The concept of instant access to all web data is still in its infancy. Businesses are already realizing the fruits of data-driven processes and decision-making. Most of this has occurred by using information that’s already available from internal systems – CRMs, SCMs, ERPs, etc. But as more phases of the customer’s journey go online, more of that customer’s data is native to the web. This has rapidly resulted in the web becoming the largest repository of customer preferences, interactions, and comments, causing Doug Laney of Gartner to comment that the web is the largest database for any company.

“Web scraping” is how most people refer to accessing web data, but this method is incredibly incomplete and error-prone. It doesn’t produce web data in any usable sense. It just produces a simple copy-and-paste log file. Without any refinement through sanitization, aggregation, and other data enrichment techniques, it provides a poor representation of the web data needed by most organizations. Yet despite the poor data quality it produces, it does provide some value and is a popular choice for acquiring web content.

The Next Phase

So, if current approaches of using a small sliver of web data are already providing some utility, what could the potential of instantly accessing ALL web data hold? Right now, it’s difficult to forecast its impact, but we know it will be huge.

The most immediate effect is obvious: web data will significantly improve any business’ ability to react to market changes.

Businesses that thrive are those that are nimble, efficient and responsive to the market. However, all of that is only possible if businesses can access comprehensive information on their customers’ motivations, competitors’ offerings, and overall market ecosystem. Unfortunately, this data, when available, is often incomplete and not current. One way to supplement this critical data set is to leverage web data. The large aggregation of consumer and competitor web data will provide insights that internal company data collection methods would be hard-pressed to deliver. Web data fills the large data gap that exists today for almost every business. Filling that gap means better insight into customers, competitors, and the market as a whole.

Like I said, all of the above is the immediate effect of web data. What comes next has the potential to change how our society as a whole behaves.

We’ve already seen how enabling instant access to single points of web content has revolutionized our society. Google has effectively made the web an extension of every person’s own knowledge. Now apply this same concept to businesses. What happens when the web is an extension of every business’ own database? There is a next generation of applications and analytics waiting to be imagined and released once web data is a reality.

How You Can Learn More

I’ll be sharing some possible ideas and prototypes for this analysis during the upcoming NewCo Austin event. Register here to attend our presentation on May 29th, 2015 at 9:30 am at our downtown offices. We’d love to have you over!

Bounding the Infinite: Is Web Data Possible?

The infinite.

We can understand the concept but never truly appreciate the scope. Yet we as a people have created something infinite: the Internet. At over 45 billion known web pages, with an exponential growth rate, the Internet is infinite. It contains almost every conceivable piece of information, an endless supply of content, and – potentially – an infinite source of data. Data on businesses, products, real estate, people, and much more all exist on the Internet. Applications that could leverage this data would provide an enormous amount of value to individuals, businesses, and society.

Unfortunately, leveraging web data has so far been unsuccessful. Although a tremendous amount of value lies within web data, it can’t be used because it needs a consistent structure to make it consumable. The source code representation of a product listing on Amazon has almost no structure in common with the same product listing on Walmart. The value of web data will manifest once you can tap into both listings, and millions of others, without requiring any additional translation from raw source to consumable information.
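
To make the structural mismatch concrete, here is a minimal Python sketch of mapping two differently shaped listings into one common record. Everything in it is an assumption made for illustration: the field names, the sample values, and the `normalize_source_a` / `normalize_source_b` helpers are hypothetical and do not reflect the real Amazon or Walmart page structures or Datafiniti’s schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """Common record shape shared by every source (hypothetical schema)."""
    name: str
    price_usd: float
    source: str


def normalize_source_a(raw: dict) -> Product:
    # Hypothetical source A: price is a string such as "$19.99".
    return Product(
        name=raw["title"].strip(),
        price_usd=float(raw["price"].lstrip("$")),
        source="source_a",
    )


def normalize_source_b(raw: dict) -> Product:
    # Hypothetical source B: price is stored as integer cents.
    return Product(
        name=raw["productName"].strip(),
        price_usd=raw["priceCents"] / 100,
        source="source_b",
    )


# Made-up listings for the same product from two sources.
listing_a = {"title": " Acme Kettle ", "price": "$19.99"}
listing_b = {"productName": "Acme Kettle", "priceCents": 1849}

products = [normalize_source_a(listing_a), normalize_source_b(listing_b)]
cheapest = min(products, key=lambda p: p.price_usd)
```

Once both listings share one shape, a comparison such as `cheapest` needs no source-specific logic, which is the point of a consistent structure.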


This is exactly what we’ve done at Datafiniti. By providing a single database of web data, we’ve enabled businesses to leverage information from across the Internet in a standardized, easy way. With a single API call, you can access over 50 million records on businesses, products, and properties sourced from hundreds of websites. We continue to increase the size of our data, the variety of sources, and the types of available data on a daily basis.

During the upcoming NewCo Austin event, I’ll be speaking about how we make this possible and what applications we’re enabling by making web data easily consumable. If you’d like to attend our presentation, please register here or contact us. We’d love to share our vision for the possibilities of web data with you!

Web Scraping Away $5 Billion

Is web scraping good? Or is it bad?

Before I wrestle with the answer, here’s what happened on Monday:

NASDAQ posted Twitter’s 1st quarter earnings on the company’s investor relations web page for 45 seconds at ~3:00 p.m. EST. This information was scraped by Selerity, a data analysis firm that provides financial information and analyses to other financial companies. Selerity then tweeted this information at 3:07 p.m.

By the time the market closed at 4:00 p.m. EST, Twitter’s stock had lost nearly 18% of its value and nearly $5B in market cap.


So was web scraping good or bad in this case? The comments and tone of most news sources suggested “Web Scraping” or “Data Scraping” was a nefarious act. Or a hack. In fact, it’s neither.

In the case of Twitter, the earnings information was posted on a public site. Selerity has an event-detection algorithm that picked up new content in less than a second. Someone at Selerity reviewed the information and, since it was public information, saw it was fit to post. This was a simple case of taking public information and sharing it with... the public. So, web scraping of a public website, done in a responsible way, is a good thing.

However, if the data had been procured from websites that specifically forbid scraping or have a gated access to their data, that would not only be unethical, but also illegal.

But this was not the case with Twitter’s earnings leak. This was the result of incompetence, not of malfeasance. And web scraping, in this case, will actually help both Nasdaq and Twitter to review their earnings release policies.

From Datafiniti’s point of view, web scraping should ALWAYS be a technology for good. By aspiration. Through design. Which is why we adhere to the Charter for Responsible and Ethical Acquisition of Data (CREAD). This helps us ensure that all the web data we acquire, before sanitization and organization, is collected ethically and only from publicly available sources. Several data provider companies are endorsing this charter as they realize CREAD helps the data ecosystem provide transparency and sustainable open access to all web data.

If you’re a data provider, please contact us about joining CREAD. If you’re a consumer of web data, be sure to ask your provider if they follow CREAD’s principles.

Houston’s Economy in 20 Years: Houston 2035

Why does a city thrive? How does one sustain city growth?

These questions have perplexed city planners & politicians for as long as there were cities to plan. While cities may not be destined to follow a natural rise and fall, there aren’t many examples of cities that have followed a steady progression throughout their history. Today’s mega-cities like Singapore, New York City, and Tokyo may seem like modern-day marvels, but change can come quickly. Just ask any senior citizen from Detroit.

On May 29th, I’ll be speaking at Xconomy’s Houston 2035. The discussion at this day-long event will center around what factors will contribute to developing Houston’s high-tech economy. Houston is a fascinating study in city growth. The economy is driven by the energy market, which has propelled its population and GDP to 4th-largest in the US, but has also brought severe recessions along the way. The trend is upwards, but can this be sustained?

Source: US Bureau of Economic Analysis

Houston 2035 will no doubt highlight the need for diversifying beyond energy and try to answer how this can be done. A critical part of diversification is fostering an environment for entrepreneurship and technology. Does Houston have the ecosystem in place to create this environment? Is that culture already developing? At Datafiniti, we’re always interested in how web data can provide answers to questions like these. Perhaps the city’s current concentration of different industries can provide insight. Some of the data we’ll likely be looking at from our own database:

  • Does Houston have the same concentration of investment firms as other high-tech cities?
  • Does it have the necessary number of service professionals to support a startup ecosystem?
  • What technologies are most-represented by Houston businesses?

These questions may evolve as we dive into the data, but I expect we’ll have some interesting insights in time for Houston 2035. Be sure to attend if you’d like to hear them first-hand! If you’re interested in attending, you can receive a $145 discount on registration by using the code “Data”.

The Need for Ethical Data Acquisition

Web Data. Customers. Transparency. Trust.

These words individually mean something but when combined together drive one of Datafiniti’s key operating principles. The issues of Data, Transparency and Trust were recently highlighted in an excellent Harvard Business Review article titled “Customer Data: Designing for Transparency and Trust”. As the authors point out:

“Companies that are transparent about the information they gather, give customers control of their personal data, and offer fair value in return for it will be trusted and will earn ongoing and even expanded access. Those that conceal how they use personal data and fail to provide value for it stand to lose customers’ goodwill—and their business”

We could not agree more.

Datafiniti’s core values stem from the need to make web data accessible to businesses. Among the many challenges in doing this is the imperative to gather web data in an ethical and responsible manner. As the authors note, several companies have strong incentives to profit from data that is collected surreptitiously or unethically. This makes them behave in a manner that can only be detrimental in the long run.

There is a similar trend in the Data Provider Ecosystem. Web data acquisition has become a vital necessity for any data-driven company. Often, such a company is willing to cut ethical corners, or pretends to absolve itself by outsourcing these tasks to companies that will. Acquiring web data by false means abuses a privilege that has been granted willingly and, in the long term, will make the data ecosystem less open and less trustworthy.

The article spells out three enlightened data principles and captures the essence of building transparency and trust amongst its customers.

Datafiniti, along with several progressive partners and customers, has embarked on a similar initiative called the Charter for Responsible and Ethical Acquisition of Data (CREAD). Every customer or partner we have interacted with has welcomed this initiative and enthusiastically supported it. Simple to understand and straightforward to implement, the 5 principles are:

  1. Self-Identify the web agent that is acquiring the data
  2. Obey the robots file
  3. Use a self-regulated rate limiter
  4. Restrict data gathering to public data only
  5. Facilitate open access to web data
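
As a rough illustration of how the first three principles might look in a crawler, here is a minimal Python sketch. It is not CREAD’s or Datafiniti’s implementation: the `PolitePolicy` class, the `ExampleDataBot` user agent string, and the default delay are all hypothetical. It uses the standard library’s `urllib.robotparser` to honor a robots file. Principles 4 and 5 are policy choices and are not modeled here.

```python
import time
from urllib import robotparser

# Hypothetical agent name; a real crawler would publish its own contact URL.
USER_AGENT = "ExampleDataBot/1.0 (+https://example.com/bot)"


class PolitePolicy:
    """Bundles self-identification, robots rules, and a simple rate limit."""

    def __init__(self, robots_txt, user_agent=USER_AGENT,
                 min_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay      # seconds between requests (assumed value)
        self._clock = clock
        self._last_request = None
        self._robots = robotparser.RobotFileParser()
        self._robots.parse(robots_txt.splitlines())

    def headers(self):
        # Principle 1: self-identify the web agent.
        return {"User-Agent": self.user_agent}

    def allowed(self, url):
        # Principle 2: obey the robots file.
        return self._robots.can_fetch(self.user_agent, url)

    def seconds_to_wait(self):
        # Principle 3: self-regulated rate limit.
        if self._last_request is None:
            return 0.0
        elapsed = self._clock() - self._last_request
        return max(0.0, self.min_delay - elapsed)

    def record_request(self):
        self._last_request = self._clock()
```

A fetch loop would check `allowed(url)`, sleep for `seconds_to_wait()`, send the request with `headers()`, and then call `record_request()`.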

By adhering to these principles in spirit and in practice, customers and partners can be assured of avoiding the negative consequences of unethical data acquisition. More importantly, they send a strong signal to their customers and the ecosystem they serve about where they stand on data, transparency and trust.

And that is a good thing. For everybody.

You can read more about CREAD here. Expect to see more about this initiative very soon.

Moving from “Big Data” to “Rich Data”

The Gartner Hype Cycle is real. This is especially true in the web data market, which I believe has been entrenched in the “trough of disillusionment” for the last 5 years.

Gartner’s 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business

Web scraping has been around for decades, and companies providing web scraping services have been around for the last 8-10 years. Unfortunately, during that time, no one has really cracked the nut as far as making web data truly consumable. That’s because web data is still just “big data”. It’s not yet become “rich data”.

Techradar posted a great article today about the differences between rich data and big data. Here are some key insights:

“It’s like the difference between crude and refined oil,” says Dr. Rado Kotorov, Chief Innovation Officer at Information Builders. “Combining data provides new context and new use cases for the data. For example, combining social media data with transactional data can provide insight into purchases and thus lead to product innovation.”

“Rich data can be used to answer different kinds of questions that would previously have been difficult,” says Southard Jones at cloud business intelligence and analytics company Birst. “Linking up multiple sources of information can help see things in new ways or across the whole process, rather than just one team’s responsibility.”

Essentially, the key is combining data from different sources to provide more insight and context. This is why web scraping alone is not good enough to make web data valuable. Web scraping provides raw big data from single sources. A business needs to provide more structure and combine data from multiple sources together to make it more valuable to the business’ goals.
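
As a small illustration of combining sources, the Python sketch below joins review data with transaction data on a shared product key. The records, field names, and the `combine` helper are hypothetical and exist only to show the idea.

```python
from collections import defaultdict

# Hypothetical records from two separate sources.
reviews = [
    {"product_id": "p1", "rating": 5},
    {"product_id": "p1", "rating": 3},
    {"product_id": "p2", "rating": 2},
]
transactions = [
    {"product_id": "p1", "units": 10},
    {"product_id": "p2", "units": 4},
    {"product_id": "p2", "units": 6},
]


def combine(reviews, transactions):
    """Join both sources on product_id into one enriched record per product."""
    ratings = defaultdict(list)
    for r in reviews:
        ratings[r["product_id"]].append(r["rating"])

    units = defaultdict(int)
    for t in transactions:
        units[t["product_id"]] += t["units"]

    combined = {}
    for pid in sorted(set(ratings) | set(units)):
        scores = ratings.get(pid, [])
        combined[pid] = {
            "avg_rating": sum(scores) / len(scores) if scores else None,
            "units_sold": units.get(pid, 0),
        }
    return combined


rich = combine(reviews, transactions)
```

Neither source alone can show that a poorly rated product still sells as well as a well rated one; the joined record can.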

As businesses begin to realize the limitations of big data pulled from raw web scraping, they’ll see that it’s actually rich web data that they’re after.

Your Biggest Untapped Source of Data

Quick – can you tell me what your company’s most valuable source of data is? Is it your company’s (a) CRM software, (b) POS system, or (c) ERP platform? If you chose any of these as your answer, you answered incorrectly. The correct answer is (d) The Web.

While your company’s internal data systems collect a lot of valuable information, they can never match the sheer potential that exists on the Web. Doug Laney at Gartner said it best:

Your company’s biggest database isn’t your transaction, CRM, ERP or other internal database. Rather it’s the Web itself and the world of exogenous data now available from syndicated and open data sources.

Of course, the challenge with using data from the web is that it’s incredibly unstructured and scattered across a vast number of websites. As a business, you likely need to leverage web data for at least one of the following:

  • Sales lead generation and optimization
  • Competitive analysis and monitoring
  • Product pricing and assortment
  • Brand and sentiment analysis
  • Marketing automation and research

and likely much more. Traditionally, acquiring web data for these applications has been incredibly difficult. Dan Woods, a contributor for Forbes, highlights the problem. Acquiring web data means:

  1. Making it easy to identify the information on a web page or collection of web pages and assemble that information into a useful structure.
  2. Allowing the data to still be harvested correctly, even if the page changes in some way.
  3. Recognizing when new information has arrived.
  4. Harvesting data on a regular schedule.
  5. Managing and performing quality control on thousands of agents.
  6. Handling complex ways of creating pages such as responsive design.
  7. Integrating harvested data into a data warehouse or other repository.
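
To give a feel for just one item on this list, recognizing when new information has arrived, here is a minimal Python sketch that fingerprints page content and reports when it changes. The `ChangeDetector` class is hypothetical, and a production system would also have to ignore irrelevant page changes such as ads or timestamps.

```python
import hashlib


class ChangeDetector:
    """Remembers a content fingerprint per URL and reports when it changes."""

    def __init__(self):
        self._seen = {}

    def has_changed(self, url: str, content: str) -> bool:
        # A URL seen for the first time counts as changed (new information).
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        changed = self._seen.get(url) != digest
        self._seen[url] = digest
        return changed
```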

He even goes on to mention a few solutions, but none of them go far enough. Web scraping is typically just the first step of a long process when it comes to consuming web data. In our own experience, we’ve found that data from traditional web scraping still suffers from inaccuracy, poor coverage, and inaccessibility. You still need to go through a sanitization, bundling, and distribution process before your business can actually consume web data.

Our goal at Datafiniti is to overcome the onerous challenge of consuming quality web data. Rather than using a tool, companies should expect to have a solution for web data. This level of accessibility is possible by looking beyond traditional web scraping applications and instead looking at web data solutions like Datafiniti.

“Experts” vs the Crowd: Where Analysis of Crowd-Sourced Reviews Gets It Wrong

Last year, FiveThirtyEight published a series of posts with the ostensible purpose of finding the country’s best burrito. Long story short, the burrito picked by experts before anything began was (surprise!) the winner. The primary author goes on to imply that the crowd’s choice (as evidenced by Yelp reviews) failed to live up to the experts’ choice. In reality, we should almost never expect the two to agree, and there’s a good reason why.

Before I go any further, let me say that I’m a HUGE FiveThirtyEight fan. Watching Nate Silver perfectly predict the 2012 election was as exciting for me as watching Tracy McGrady score 13 points in 35 seconds for my beloved Rockets. But for a publication that relies on data to strip away bias, this series of posts seemed to favor bias over data.

The recap post in the burrito series centers around this graphic:


The title assumes that the crowdsourced opinion is the wrong one, as if it needs to live up to the expert opinion. This would be a more balanced title:


If you’re an avid reader of food blogs (and avid eater of food) like me, you’ll see the same issue perpetuated throughout the food industry. “Expert” opinions are typically valued over a “regular” person’s tastes. While there’s a place for expert opinion, I’m not sure it’s the burrito – an everyman’s food if there ever was one.

The problem with expert opinion is that it’s formed by experts. Folks whose job it is to know food and taste. They know what’s been done and what’s never been done. An expert opinion will tell you what has pushed the boundary of human achievement, but it won’t tell you what you want for lunch today. That opinion belongs to you, and chances are your tastes are more similar to the tastes of your neighbor than they are to Anthony Bourdain’s. Crowd-sourced reviews are still a great resource for finding a decent or even good restaurant. They’re not a resource for finding redefined cuisine, nor should they be. Crowd-sourced reviews and expert reviews will most likely never match up, and we shouldn’t expect them to. They’re inherently surveying two very different groups of people.

Let me say again – there’s no issue with FiveThirtyEight’s data here. They developed a metric to score the expert opinion and stuck to it. It’s a good attempt to quantify something subjective. The only issue is with putting the expert opinion above the crowd-sourced opinion.

Last year my wife and I traveled to London and went to Dinner by Heston, which is fifth among the world’s 50 best restaurants. It was good food – great even – but I’ll still take Franklin’s BBQ over it any day of the week. My mind recognizes the genius of Heston Blumenthal, but on a Sunday afternoon in Austin, TX, nothing beats this:


400,000+ Hotel Reviews Across 50 States: What We Were Surprised to Learn

Nevada knows what it’s doing, California may not, and Texas hits the sweet spot.

Love it or hate it?  5 Star or 1 Star?  Reviews are one of the most significant inputs in a customer’s buying decision.  This is especially true when it comes to booking hotels.  Based on a relatively small set of reviews, we can quickly tell whether a particular hotel is stay-worthy or should be completely avoided.  However, can they tell us something more than that?  This was the question that intrigued us, and here’s what we found after combing through 400,000+ hotel reviews.

Hotel reviews are an interesting case study of what can be discovered from web data.  A lot of people review hotels online, and they do so across many different websites.  We know that more positive reviews for a hotel means more business for that hotel.  In fact, a recent study from Cornell University’s School of Hotel Administration showed that customers are twice as likely to book a hotel with positive reviews as they are a hotel with negative reviews.   A second study showed that revenue is strongly correlated with reviews.  However, we wondered what would happen if we zoomed out and looked at the data state-by-state.  Are some states better at generating reviews?  Does a higher tourism budget result in more reviews?

To answer these questions, we studied and analyzed 437,787 hotel reviews, collected from 60 unique review sites.  The review data includes the name and geography of the hotel, along with review text, date, and ratings.  We also included demographic and economic data, such as population and tourism budget (collected from third-party sources), in our analysis.

A Word About Us

Before we get started, a quick word about us – we love data at Datafiniti.  After all, we’re aspiring to provide open access to all web data.  We crawl billions of URLs and convert all the data you need on 50 million worldwide businesses and 30 million online products into an easy, friendly, and searchable database.  Yes, it’s terabytes of data, but who’s counting?  What I’m saying is we have a lot of data.

One of our resolutions this year is to start showing the different kinds of insight that can be pulled out of this data.  What happens when you start treating the Internet like a single database at your fingertips?  What insights are hiding in plain sight?  We’re about to find out.

The Data

So let’s take a look at the data.  When aggregating the data, we counted total # of reviews by state, average rating, the ratio of population to # of reviews, and the ratio of tourism budget to # of reviews.  The map below displays how each state ranks across these different metrics.
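
For readers who want to reproduce this kind of aggregation, here is a minimal Python sketch of the four metrics. The `summarize_by_state` helper, the field names, and the sample numbers are assumptions for illustration only; they are not the study’s code or data.

```python
from collections import defaultdict


def summarize_by_state(reviews, population, tourism_budget):
    """Compute per-state review count, average rating, and two ratios."""
    ratings = defaultdict(list)
    for review in reviews:
        ratings[review["state"]].append(review["rating"])

    summary = {}
    for state, scores in ratings.items():
        count = len(scores)
        summary[state] = {
            "reviews": count,
            "avg_rating": sum(scores) / count,
            "population_per_review": population[state] / count,
            "budget_per_review": tourism_budget[state] / count,
        }
    return summary


# Illustrative inputs only, not figures from the study.
sample_reviews = [
    {"state": "TX", "rating": 4},
    {"state": "TX", "rating": 5},
    {"state": "NV", "rating": 3},
]
sample_population = {"TX": 1000, "NV": 300}
sample_budget = {"TX": 200.0, "NV": 90.0}

summary = summarize_by_state(sample_reviews, sample_population, sample_budget)
```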

The raw data used for this chart and the article is available on Datafiniti’s Github repository.  For the full data set of reviews, please contact us.

Winners and Losers

Oklahoma & Idaho are surprising winners while the DC-VA area suffers.

Top 10 States | Bottom 10 States
Hawaii (4.54) | Washington D.C. (2.77)
Nevada (4.27) | Virginia (3.10)
Oklahoma (4.23) | Mississippi (3.45)
Idaho (4.23) | Maine (3.57)
New Jersey (4.21) | Washington (3.60)
New Hampshire (4.18) | Tennessee (3.61)
Massachusetts (4.15) | Connecticut (3.64)
Arizona (4.14) | Delaware (3.67)
Alabama (4.13) | New Mexico (3.68)

A good place to start looking for insights is to zero in on the top 10 states by average rating. Here, we see some expected and some not-so-expected results. Hawaii’s hotels are basking in the warm, sunny glow of a chart-topping 4.54 star rating, while Oklahoma & Idaho tie at a surprising 3rd place, each with a 4.23 rating.

Looking at the bottom 10 states, DC and VA come in at the very depressing averages of 2.77 and 3.10, respectively.  This is way outside the norm, which could hint that something is off with the hotel industry in this area.  It also hints at opportunity, though.  If hotels aren’t meeting customer expectations in this area, then there’s clearly a market need.  Larger hotel franchises may want to consider what they can do to leapfrog their competition here.

Another interesting insight is that most states’ average ratings are between 3.8 and 4.2. If we consider 3 stars to be average on a 5-star scale, then this suggests one of two possible scenarios: most states (and hotels) are outperforming expectations OR most reviewers have a bias toward 4 stars. In any other setting, receiving 3 stars out of 5 may be a good thing, but according to our hotel review data, receiving 3 stars in this context is actually pretty bad. Businesses may need to re-assess their prior findings when taking these ratings into account.

Populated States Have More Reviews

Nevada may have more tourists posting reviews, while visitors to Michigan are silent.

Let’s start with something simple: how does the number of reviews scale with each state’s population? Conventional wisdom would dictate that the more populous the state, the more hotel reviews you’re going to see.

Population (2010) vs. No. of  Reviews (2010 - 2013)

The chart above supports this hypothesis, but some states sit way outside this trend.  States like Nevada, New York, and Texas generate an abnormally high number of reviews based on their population.  On the other hand, other states, like Michigan, Washington and Pennsylvania, are doing very poorly when it comes to generating hotel reviews for their population.  In general, you’ll see states that outperform are those states that are typically considered more “touristy.”  Given the importance of reviews on consumer behavior, it may make sense for some states to actively encourage travelers to post reviews.
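
One simple way to quantify “sitting outside the trend” is to fit a straight line to reviews versus population and look at how far each state falls from it. The Python sketch below does this with ordinary least squares. It is an assumed approach, not necessarily how the chart was analyzed, and the four fictional states and their numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(data):
    """Return {state: actual reviews minus reviews predicted from population}."""
    states = list(data)
    xs = [data[s][0] for s in states]
    ys = [data[s][1] for s in states]
    slope, intercept = fit_line(xs, ys)
    return {s: y - (slope * x + intercept) for s, x, y in zip(states, xs, ys)}


# Made-up (population in millions, review count) pairs for four fictional states.
sample = {"A": (1, 100), "B": (2, 200), "C": (3, 900), "D": (4, 400)}
gaps = residuals(sample)
most_over = max(gaps, key=gaps.get)    # generates more reviews than the trend predicts
most_under = min(gaps, key=gaps.get)   # generates fewer reviews than the trend predicts
```

A large positive gap corresponds to a state that outperforms its population, and a large negative gap to one that underperforms.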

Higher Tourism Budgets Correlate with More Reviews

California looks good until you look more closely.  Is Hawaii an inefficient spender?

Let’s try a different metric. How does a state’s tourism budget relate to its hotel reviews? In some sense, we can use this as a crude proxy to see how well each state’s tourism department is helping generate business for hotels.

Tourism Budget (2013) vs. No. of Reviews (2010 - 2013)

We again see some states doing incredibly well and some not so well.  Several states like Nevada and New York are getting a lot of bang for their buck, whereas other states like Hawaii, Illinois, and Michigan might want to look at better ways of spending that money.

One interesting note: California looks like it’s doing really well.   It’s producing 960 reviews per million dollars.  It’s doing better than most, but other states (e.g., Georgia producing 2,036 reviews per million dollars) are doing much, much better.  For a state with such a large tourism budget, it would probably be worth California’s time to study how these states are making use of their budget.

Data Is Wonderful

We find data like this incredibly exciting.  By analyzing hotel reviews in aggregate, across 60 websites, we can spot some important trends.  These are important insights for businesses and policy makers alike, and it’s all out there on the web, just waiting to be transformed into actionable data.

Even though 400,000+ hotel reviews are a large data set, it’s still small in relation to the amount of total data available on the web.  Yet even with just this data, we found how review volume moves with population and state spending.  We also learned which states are outliers on these metrics.

Interested in learning more?

Download the summary data | Contact Us To Get Full Access to the Data

We’ll be analyzing more data like this throughout the year, so be sure to check back regularly.  We can’t wait to see what we find!

To Trust or Not to Trust Online Reviews, That is the Question

Online shopping is one of the greatest inventions ever. At least, that’s my opinion. I’m a confessed Amazon power user. I’ve got a Prime account, I use subscriptions to auto-manage most of my regular purchases, and I always read online reviews before making a purchase. Reviews are so ingrained in my shopping behavior that I check them even when I find myself in a physical store (a rare event nowadays).

Earlier this week, I came across this YouGov article that demonstrated an incredible finding: Most people read online reviews, yet few actually trust those reviews.  In fact, only 13% of those surveyed in the YouGov study said that reviews were very trustworthy.

The article goes on to outline some very interesting points both for and against the effectiveness of online reviews:

Why Reviews Are Valuable | Why Review Value Could Be Better
Of Americans who read online reviews, 90% find written ratings to be important and 41% find them very important as an aid to decision making. | 21% of reviewers wrote reviews for products or services they had never actually purchased or tried.
78% check out the review section before making a purchase. | 89% believe that businesses write negative reviews of competitors.
Price doesn’t dictate how often you use online reviews – although one in five (19%) only use online reviews for products over $100, a similar number (19%) use reviews for purchases of less than $10. | 91% believe businesses write their own positive reviews.

So What Can Businesses Do?

The conflicting story produced by the survey makes for a bit of a head-scratcher. Clearly, businesses need to have detailed and up-to-date reviews on their shopping site or for their online products. Unfortunately, the effectiveness of these reviews is impacted by a very healthy dose of skepticism from the typical online shopper.

Some possible suggestions:

  1. Businesses can actively filter out reviews that sound promotional or disingenuous.  Semantic analysis of the review text can help with this.
  2. Implement some sort of customer verification process to verify every reviewer is a real customer of the product.
  3. Allow users to vote the review up or down in order to add social trust to the reviews.
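
Suggestions 2 and 3 can be combined when deciding which reviews to show first. The Python sketch below is one possible approach, not a recommendation from the survey: it puts verified purchases ahead of unverified ones and then orders by the lower bound of the Wilson score interval, a standard way to avoid over-trusting reviews with only a handful of votes. The `rank_reviews` helper, the field names, and the sample reviews are hypothetical.

```python
import math


def wilson_lower_bound(up: int, down: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the share of up-votes."""
    n = up + down
    if n == 0:
        return 0.0
    p = up / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)


def rank_reviews(reviews):
    """Verified purchases first, then by vote-based confidence (highest first)."""
    return sorted(
        reviews,
        key=lambda r: (r["verified"], wilson_lower_bound(r["up"], r["down"])),
        reverse=True,
    )


# Hypothetical reviews with a verification flag and vote counts.
sample = [
    {"id": "a", "verified": False, "up": 50, "down": 0},
    {"id": "b", "verified": True, "up": 1, "down": 0},
    {"id": "c", "verified": True, "up": 40, "down": 10},
]
ordered = [r["id"] for r in rank_reviews(sample)]
```

With this ordering, a verified review with 40 of 50 helpful votes outranks a verified review with a single vote, and both outrank an unverified review regardless of its vote count.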

Ultimately, a business wants to encourage shoppers to buy the product they’re viewing.  Reviews are a huge part of this, but it’s clear that more can be done to improve their value.  At scale, moving the needle just a small amount toward a purchase can have a multi-million dollar effect.