How Do We Move Past Data Wrangling?

“Data Wrangling” needs to be a thing of the past. The business world has recognized the value of data (and big data) for several years now. Unfortunately, it’s still stuck in the quagmire of messy data. The New York Times recently published an article “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”. This is a depressing title, but it’s dead-on. Some key points from the article:

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

But if the value comes from combining different data sets, so does the headache. Data from sensors, documents, the web and conventional databases all come in different formats.

The second quote really stands out to me. It specifically mentions the frustration that comes with combining data sets sourced from the web. Of course, at Datafiniti, we’re working to remove this burden from the business analyst or data scientist. Beyond our own company, though, data providers need to embrace the challenge of providing clean data, and so far we haven’t seen that happen within the field of web data. Traditional web scraping companies have been content to license complicated software to their customers and let them or a partner company handle the sanitization that inevitably follows. As the table stakes rise in the data world, data wrangling will be absorbed up the data value chain, and businesses will expect clean data to be provided upfront.

At Datafiniti, we’re excited to not only see this evolution, but be a driving force for it. Let’s go from 80% data wrangling / 20% insights to 0% data wrangling / 100% insights. Let’s make data wrangling a thing of the past.

How Web Data Can Help Your Job Search

Last week some friends of ours visited Austin. Like so many people, they were moving to this amazing city for new jobs. One of them was on the verge of accepting a job at Whole Foods, and the other was beginning his search. Since his wife’s office would be downtown, he wanted to find other companies located downtown; and since his background was in software development, he wanted a company in technology.

Surprisingly, getting a list of technology companies in a specific geographic area is not easy, at least with traditional tools like Google, LinkedIn, etc. My friend had already exhausted all the normal options for this information, and he knew his list of companies was woefully small.

As he described his problem to me, I knew I could help. After all, Datafiniti customers face the same problems before they come to us, but on a larger scale. They know the information is on the web, but they don’t know how to compile it quickly and easily. With a smile on my face, I told him I could get his list right away.

I opened up my laptop and issued a request to Datafiniti:

Screen capture from my API client

Within a few minutes, I had over 1,000 businesses in a file for him, ready to sort through. Since the data included zip codes, he could easily filter out the companies located in downtown Austin. Armed with the websites, he could start going through the list and finding career pages, job openings, and more. Here’s a screenshot of that file:


It was incredibly gratifying to use Datafiniti to help a friend with his job search. It just goes to show how many potential applications there are for clean, organized web data.
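The zip-code filtering step described above can be sketched in a few lines of Python. The column names here ("name", "website", "postalCode") and the two-row sample are illustrative, not the actual export format, and 78701 stands in for the downtown Austin zip codes:

```python
import csv
import io

# Hypothetical two-row sample standing in for the exported file;
# the real export's column names may differ.
sample = """name,website,postalCode
Acme Software,acmesoftware.example,78701
Suburb Tech,suburbtech.example,78750
"""

DOWNTOWN_ZIPS = {"78701"}  # 78701 covers most of downtown Austin

# Keep only the companies whose zip code falls downtown.
downtown = [
    row for row in csv.DictReader(io.StringIO(sample))
    if row["postalCode"] in DOWNTOWN_ZIPS
]

print([row["name"] for row in downtown])  # prints ['Acme Software']
```

Because the data arrives with a consistent zip-code column, the "filter to downtown" step is a one-liner rather than a manual slog through a thousand rows.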

Fighting Human Trafficking with Web Data

One of the central tenets of our work at Datafiniti is that web data has the potential for tremendous positive change in our world. We recently came across an article that demonstrates this perfectly, entitled “The escort database that combats human trafficking”.

The article talks about DIG (Domain-specific Insight Graphs), which crawls the entire web and converts content into data (sound familiar?). In this case, the data is a collection of markers to help identify human trafficking activity on the web and track down missing persons.

The creators of DIG highlight many of the challenges with and benefits of web data. Specifically:

The internet contains seemingly limitless information, but we’re constrained by our ability to search that information and come up with meaningful results.

The UK’s Human Trafficking Centre identified 2,255 potential victims of human trafficking in 2012, and the Missing Persons Advocacy Network estimated 200,000 US children are at high risk for trafficking into the sex industry. Better tools to address the unwieldy problem of police scouring the entire web for clues are an obvious priority.

These observations mirror the issues we tackle every day at Datafiniti. Although the data vertical is different, the challenges and approaches are incredibly similar.

It’s very exciting to see others recognize the potential of web data and use it for such tremendous social good as this. We’re sure to see many other applications benefiting society as web data becomes more and more accessible.

How Web Data Will Make Business Intelligence Smarter Than Anything We’ve Seen So Far

In my past two blog posts, I touched on the nature of web data and the immense potential it can unleash if businesses leverage it appropriately. In this post, I’ll try to offer a glimpse into the future of business intelligence applications that quality web data makes possible: the applications beginning to take flight, as well as their evolution in the near future.

Always-On Pricing Intelligence

What’s Happening Now

Visibility into online prices has always been murky. With its quality web data, Datafiniti is finally shedding some light on the vast universe of online product listings. Armed with this data, customers have been able to audit their brand merchandising, get instant access to online product assortment, and compare their online reviews to those of their competitors.

What We’ll See Next

As the window opens on online product data, retailers are going to discover unseen opportunities. By accessing online pricing and reviews from multiple websites almost instantly, a retailer can find gaps in competitors’ offerings or spot where new markets are surfacing. Brands will also be able to react instantly to positive or negative reviews, whenever and wherever they appear, resulting in a better brand experience and more effective customer service for consumers.

Reputation Management Analytics

What’s Happening Now

One of the early use cases we’re seeing is reputation management for businesses. Franchises like Starbucks, Subway, and others have thousands of locations around the world. Every day, reviews are posted online for these locations, and analysts at these firms need visibility into that activity. Traditionally this visibility has been anything but real-time. Now, with Datafiniti providing regular updates to reviews across multiple sites, our customers are approaching instant visibility.

What We’ll See Next

As the review data provided by Datafiniti becomes layered with sentiment analysis, businesses will get a real-time pulse on customer moods. We’ve seen this happen with access to Twitter data, but we haven’t seen it across multiple websites or even online data sources. In other words, we’ll go beyond tunnel-vision and move to a wide-angle lens of customer sentiment across the Internet.
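As a toy illustration of what layering sentiment on review data can mean, here is a naive lexicon-based scorer. The word lists and reviews are invented for the sketch; a production system would use a trained model rather than hand-picked vocabularies:

```python
# Tiny hand-picked lexicons for illustration only; real sentiment
# analysis would rely on a trained model, not word lists like these.
POSITIVE = {"great", "love", "excellent", "friendly"}
NEGATIVE = {"slow", "rude", "dirty", "terrible"}

def sentiment(review: str) -> int:
    """Score a review as (# positive words) - (# negative words)."""
    words = review.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "Great coffee and friendly staff",
    "Service was slow and the table was dirty",
]
scores = [sentiment(r) for r in reviews]
print(scores)  # prints [2, -2]
```

The interesting part is not the scoring itself but running it continuously over review streams from many sites at once, which is what turns it into a real-time pulse on customer mood.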

Complete Snapshots of Sales Prospects

What’s Happening Now

Supplementing sales leads with external data has been around for a while now. Instant access to web data takes it to another level. Datafiniti customers are getting access to investment activity, news articles, and more for companies ranging from Fortune 500s to startups. By getting timely information like this, sales personnel can obviously make more informed decisions and better segment their sales targets.

What We’ll See Next

With each data point added to a sales lead, the sales process becomes more efficient. Web data will make hyper-targeted sales leads a reality, with sales personnel knowing everything from which person to contact to what products the company already uses. As the sales process reaches this kind of hyper-efficiency, the business sales cycle will become incredibly tight.

Bounding the infinite web is a challenging and inspiring task. Using the latest technology to help businesses leverage quality web data will change the nature and the future of business. The above examples of business intelligence applications are just a small glimpse into what becomes possible with quality web data. What new applications will arise out of instant access to web data? What impact will it have on business and society? As providers of the data that will power this future, we are extremely excited to be part of this progress and eager to see what others will develop.

If you haven’t already, register for our presentation during NewCo ATX. I’ll be showing some examples of the above applications, along with other great illustrations of the power of web data.

The Potential of Instant Access to Web Data

In my last post, I asked the question “Is Web Data Possible?” At first, this question may seem obvious, but closer inspection of the challenges in making web data consumable makes it apparent how difficult the problem really is. It also highlights why it hasn’t happened yet, despite many attempts to do so.

At Datafiniti, we’re making web data truly available for the first time, and we’re fascinated by the possibilities that opening up such a data source represents. What would you do if you could get instant access to all web data? It’s a question that touches on the possibility of accessing almost all human experience and knowledge instantly. It’s incredibly exciting, but also difficult, to think about its impact.

Where We Are Now

The concept of instant access to all web data is still in its infancy. Businesses are already realizing the fruits of data-driven processes and decision-making. Most of this has occurred by using information that’s already available from internal systems – CRMs, SCMs, ERPs, etc. But as more phases of the customer’s journey go online, more of that customer’s data is native to the web. The web has rapidly become the largest repository of customer preferences, interactions, and comments, leading Doug Laney of Gartner to comment that the web is the largest database for any company.

“Web scraping” is how most people refer to accessing web data, but this method is incomplete and error-prone. It doesn’t produce web data in any usable sense; it produces little more than a raw copy of each page’s source. Without refinement through sanitization, aggregation, and other data enrichment techniques, it provides a poor representation of the web data most organizations need. Yet despite the poor data quality it produces, it does provide some value and remains a popular choice for acquiring web content.
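To make the raw-versus-refined distinction concrete, here is a minimal sketch: a raw scrape is just the page source as a string, and even a single refinement step means parsing it into a structured record. The HTML snippet and field names below are made up for illustration; real pages are far messier:

```python
from html.parser import HTMLParser

# A made-up product snippet standing in for a raw scrape; a real page
# would bury this in thousands of lines of unrelated markup.
raw_scrape = '<div class="product"><h1>Acme Widget</h1><span class="price">$19.99</span></div>'

class ProductParser(HTMLParser):
    """Pull a product name and price out of the raw page source."""

    def __init__(self):
        super().__init__()
        self._field = None
        self.record = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._field = "name"
        elif attrs.get("class") == "price":
            self._field = "price"

    def handle_data(self, data):
        if self._field:
            self.record[self._field] = data
            self._field = None

parser = ProductParser()
parser.feed(raw_scrape)
print(parser.record)  # prints {'name': 'Acme Widget', 'price': '$19.99'}
```

Even this tiny parser is brittle: it is welded to one site’s markup, which is exactly why scraping alone does not scale into usable web data.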

The Next Phase

So, if current approaches of using a small sliver of web data are already providing some utility, what could the potential of instantly accessing ALL web data hold? Right now, it’s difficult to forecast its impact, but we know it will be huge.

The most immediate effect is obvious: web data will significantly improve any business’ ability to react to market changes.

Businesses that thrive are those that are nimble, efficient and responsive to the market. However, all of that is only possible if businesses can access comprehensive information on their customers’ motivations, competitors’ offerings, and overall market ecosystem. Unfortunately, this data, when available, is often incomplete and not current. One way to supplement this critical data set is to leverage web data. The large aggregation of consumer and competitor web data will provide insights that internal company data collection methods would be hard-pressed to deliver. Web data fills the large data gap that exists today for almost every business. Filling that gap means better insight into customers, competitors, and the market as a whole.

Like I said, all of the above is the immediate effect of web data. What comes next has the potential to change how our society as a whole behaves.

We’ve already seen how enabling instant access to single points of web content has revolutionized our society. Google has effectively made the web an extension of every person’s own knowledge. Now apply this same concept to businesses. What happens when the web is an extension of every business’ own database? There is a next generation of applications and analytics waiting to be imagined and released once web data is a reality.

How You Can Learn More

I’ll be sharing some possible ideas and prototypes for this analysis during the upcoming NewCo Austin event. Register here to attend our presentation on May 29th, 2015 at 9:30 am at our downtown offices. We’d love to have you over!

Bounding the Infinite: Is Web Data Possible?

The infinite.

We can understand the concept but never truly appreciate the scope. Yet we as a people have created something infinite: the Internet. At over 45 billion known web pages and growing exponentially, the Internet is effectively infinite. It contains almost every conceivable piece of information, an endless supply of content, and – potentially – an infinite source of data. Data on businesses, products, real estate, people, and much more all exist on the Internet. Applications that could leverage this data would provide an enormous amount of value to individuals, businesses, and society.

Unfortunately, leveraging web data has so far been unsuccessful. Although a tremendous amount of value lies within web data, it can’t be used because it needs a consistent structure to make it consumable. The source code representation of a product listing on Amazon has almost no overlapping structure with the same product listing on Walmart. The value of web data will manifest once you can tap into both listings, and millions of others, without requiring any additional translation from raw source to consumable information.


This is exactly what we’ve done at Datafiniti. By providing a single database of web data, we’ve enabled businesses to leverage information from across the Internet in a standardized, easy way. With a single API call, you can access over 50 million records on businesses, products, and properties sourced from hundreds of websites. We continue to increase the size of our data, the variety of sources, and the types of available data on a daily basis.
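As a sketch of what "a single API call" means in practice, the snippet below builds one query and shows the shape a standardized record might take. The parameter names, fields, and values are all hypothetical, not the actual Datafiniti schema:

```python
import json

# Hypothetical query: every technology business in Austin, TX.
# Parameter names are illustrative; the real API's differ.
request = {
    "query": "categories:technology AND city:Austin AND province:TX",
    "records": 1000,
}

# Every returned record shares one schema regardless of which website
# it was sourced from -- that is the "standardized" part.
sample_record = {
    "name": "Acme Software",
    "city": "Austin",
    "province": "TX",
    "postalCode": "78701",
    "categories": ["technology", "software"],
    "sourceURLs": ["https://example-directory.com/acme-software"],
}

payload = json.dumps(request)
print(json.loads(payload)["records"])  # prints 1000
```

The point of the sketch is the contract: one query in, uniformly structured records out, with no per-site translation left for the caller.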

During the upcoming NewCo Austin event, I’ll be speaking about how we make this possible and what applications we’re enabling by making web data easily consumable. If you’d like to attend our presentation, please register here or contact us. We’d love to share our vision for the possibilities of web data with you!

Web Scraping Away $5 Billion

Is web scraping good? Or is it bad?

Before I wrestle with the answer, here’s what happened on Monday:

NASDAQ posted Twitter’s first-quarter earnings on Twitter’s investor relations web page for about 45 seconds at around 3:00 p.m. EST. This information was scraped by Selerity, a data analysis firm that provides financial information and analysis to other financial companies. Selerity then tweeted the numbers at 3:07 p.m.

By the time the market closed at 4:00 p.m. EST, Twitter stock had lost nearly 18% of its value and nearly $5B in market cap.


So was web scraping good or bad in this case? The comments and tone of most news sources suggested “web scraping” or “data scraping” was a nefarious act. Or a hack. In fact, it’s neither.

In the case of Twitter, the earnings information was posted on a public site. Selerity has an event-detection algorithm that picked up the new content in less than a second. Someone at Selerity reviewed the information and, since it was public, saw fit to post it. This was a simple case of taking public information and sharing it with… the public. So responsible web scraping of a public website is a good thing.

However, if the data had been procured from websites that specifically forbid scraping or gate access to their data, that would have been not only unethical, but also illegal.

But this was not the case with Twitter’s earnings leak. This was the result of incompetence, not of malfeasance. And web scraping, in this case, will actually help both NASDAQ and Twitter review their earnings release policies.

From Datafiniti’s point of view, web scraping should ALWAYS be a technology for good. By aspiration. Through design. Which is why we adhere to the Charter for Responsible and Ethical Acquisition of Data (CREAD). This helps us ensure that all the web data we acquire, before sanitization and organization, is collected ethically and only from publicly available sources. Several data provider companies are endorsing this charter as they realize CREAD helps the data ecosystem provide transparency and sustainable open access to all web data.

If you’re a data provider, please contact us about joining CREAD. If you’re a consumer of web data, be sure to ask your provider if they follow CREAD’s principles.

Houston’s Economy in 20 Years: Houston 2035

Why does a city thrive? How does one sustain city growth?

These questions have perplexed city planners and politicians for as long as there have been cities to plan. While cities may not be destined to follow a natural rise and fall, there aren’t many examples of cities that have followed a steady progression throughout their history. Today’s mega-cities like Singapore, New York City, and Tokyo may seem like modern-day marvels, but change can come quickly. Just ask any senior citizen from Detroit.

On May 29th, I’ll be speaking at Xconomy’s Houston 2035. The discussion at this day-long event will center on what factors will contribute to developing Houston’s high-tech economy. Houston is a fascinating study in city growth. Its economy is driven by the energy market, which has propelled its population and GDP to 4th-largest in the US, but has also brought severe recessions along the way. The trend is upward, but can it be sustained?

Source: US Bureau of Economic Analysis

Houston 2035 will no doubt highlight the need for diversifying beyond energy and try to answer how this can be done. A critical part of diversification is fostering an environment for entrepreneurship and technology. Does Houston have the ecosystem in place to create this environment? Is that culture already developing? At Datafiniti, we’re always interested in how web data can provide answers to questions like these. Perhaps the city’s current concentration of different industries can provide insight. Some of the data we’ll likely be looking at from our own database:

  • Does Houston have the same concentration of investment firms as other high-tech cities?
  • Does it have the necessary number of service professionals to support a startup ecosystem?
  • What technologies are most-represented by Houston businesses?

These questions may evolve as we dive into the data, but I expect we’ll have some interesting insights in time for Houston 2035. Be sure to attend if you’d like to hear them first-hand! If you’re interested in attending, you can receive a $145 discount on registration by using the code “Data”.

The Need for Ethical Data Acquisition

Web Data. Customers. Transparency. Trust.

These words individually mean something but when combined together drive one of Datafiniti’s key operating principles. The issues of Data, Transparency and Trust were recently highlighted in an excellent Harvard Business Review article titled “Customer Data: Designing for Transparency and Trust”. As the authors point out:

“Companies that are transparent about the information they gather, give customers control of their personal data, and offer fair value in return for it will be trusted and will earn ongoing and even expanded access. Those that conceal how they use personal data and fail to provide value for it stand to lose customers’ goodwill—and their business.”

We could not agree more.

Datafiniti’s core values stem from the need to make web data accessible to businesses. Among the many challenges in doing this is the imperative to gather web data in an ethical and responsible manner. As the authors note, several companies have strong incentives to profit from data that is collected surreptitiously or unethically. This behavior can only be detrimental in the long run.

There is a similar trend in the data provider ecosystem. Web data acquisition has become a vital necessity for any data-driven company. Often, such a company is willing to cut ethical corners, or to absolve itself by outsourcing these tasks to companies that will. Acquiring web data by false means abuses a privilege that has been granted willingly and, in the long term, will make the data ecosystem less open and less trustworthy.

The article spells out three enlightened data principles that capture the essence of building transparency and trust with customers.

Datafiniti, along with several progressive partners and customers, has embarked on a similar initiative called the Charter for Responsible and Ethical Acquisition of Data (CREAD). Every customer and partner we have interacted with has welcomed this initiative and supported it enthusiastically. Simple to understand and straightforward to implement, the five principles are:

  1. Self-identify the web agent that is acquiring the data
  2. Obey the site’s robots.txt file
  3. Self-regulate the rate of requests
  4. Restrict data gathering to public data only
  5. Facilitate open access to web data
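The first three principles can be sketched in code. This is a minimal illustration under assumptions, not Datafiniti's implementation: a made-up bot identity, Python's standard urllib.robotparser for principle 2, and a simple self-imposed delay for principle 3:

```python
import time
import urllib.robotparser

# Principle 1: self-identify. The bot name and URL here are made up.
USER_AGENT = "ExampleDataBot/1.0 (+https://example.com/bot)"

class PoliteFetcher:
    def __init__(self, robots_txt: str, min_delay: float = 1.0):
        # Principle 2: obey the site's robots.txt rules.
        self.rules = urllib.robotparser.RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        # Principle 3: a self-regulated rate limiter.
        self.min_delay = min_delay
        self._last_request = 0.0

    def allowed(self, url: str) -> bool:
        """Check whether robots.txt permits our agent to fetch this URL."""
        return self.rules.can_fetch(USER_AGENT, url)

    def wait_turn(self) -> None:
        """Sleep until at least min_delay seconds since the last request."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_request = time.monotonic()

# Example robots.txt that closes off /private/ to all agents.
robots = "User-agent: *\nDisallow: /private/\n"
fetcher = PoliteFetcher(robots)
print(fetcher.allowed("https://example.com/page"))       # prints True
print(fetcher.allowed("https://example.com/private/x"))  # prints False
```

Principles 4 and 5 are policy rather than code: they govern which data an agent like this is pointed at and how the results are shared afterward.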

By adhering to these principles in spirit and in practice, customers and partners can be assured of avoiding the negative consequences of unethical data acquisition. More importantly, they send a strong signal to their customers and the eco-system they serve on where they stand on data, transparency and trust.

And that is a good thing. For everybody.

You can read more about CREAD here. Expect to see more about this initiative very soon.

Moving from “Big Data” to “Rich Data”

The Gartner Hype Cycle is real. This is especially true in the web data market, which I believe has been entrenched in the “trough of disillusionment” for the last 5 years.

Gartner’s 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business

Web scraping has been around for decades, and companies providing web scraping services have been around for the last 8-10 years. Unfortunately, during that time, no one has really cracked the nut of making web data truly consumable. That’s because web data is still just “big data”. It hasn’t yet become “rich data”.

Techradar posted a great article today about the differences between rich data and big data. Here are some key insights:

“It’s like the difference between crude and refined oil,” says Dr. Rado Kotorov, Chief Innovation Officer at Information Builders. “Combining data provides new context and new use cases for the data. For example, combining social media data with transactional data can provide insight into purchases and thus lead to product innovation.”

“Rich data can be used to answer different kinds of questions that would previously have been difficult,” says Southard Jones at cloud business intelligence and analytics company Birst. “Linking up multiple sources of information can help see things in new ways or across the whole process, rather than just one team’s responsibility.”

Essentially, the key is combining data from different sources to provide more insight and context. This is why web scraping alone is not enough to make web data valuable: it provides raw big data from single sources. A business needs to add structure and combine data from multiple sources to make that data valuable to its goals.
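The "structure and combine" step can be sketched as a normalization pass: records scraped from different sites arrive with different field names and formats, and refinement maps them into one shared schema. The two raw records and every field name below are invented for illustration:

```python
# Invented raw records as they might come out of two different site
# scrapes; field names and price formats differ per source.
site_a_raw = {"title": "Acme Widget, 2-Pack", "price": "$19.99"}
site_b_raw = {"product_name": "Acme Widget 2 Pack", "sale_price": 18.49}

def normalize(record: dict, source: str) -> dict:
    """Map site-specific fields into one shared schema."""
    if source == "site_a":
        return {"name": record["title"],
                "price": float(record["price"].lstrip("$")),
                "source": source}
    if source == "site_b":
        return {"name": record["product_name"],
                "price": float(record["sale_price"]),
                "source": source}
    raise ValueError(f"unknown source: {source}")

combined = [normalize(site_a_raw, "site_a"), normalize(site_b_raw, "site_b")]

# With one schema, cross-source questions become one-liners.
best_price = min(r["price"] for r in combined)
print(best_price)  # prints 18.49
```

The raw records are "big data"; the combined, uniformly structured list is the "rich data" that cross-source questions like price comparison actually require.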

As businesses begin to realize the limitations of big data pulled from raw web scraping, they’ll see that it’s actually rich web data that they’re after.