50 Million Unique Products and Counting

I’m very excited to announce that we’ve passed an important milestone: Datafiniti now has over 50 million unique products available through our Product Data. Over the last few months, we’ve seen a dramatic increase in the number of product records we’re collecting, thanks to growth from existing sources and a faster rate of adding new sources.

Here’s what our internal metrics dashboard shows (pardon the black background, it’s an ops thing):

[Screenshot: Descartes dashboard, product data metrics]

As you can see, we’ve increased our coverage by 20% this month alone, and as a Datafiniti customer, you can expect this kind of growth to continue. That means more product data coverage for our customers, more possible insights, and more amazing applications being built on top of Datafiniti. Onward and upward!

The One Thing that Matters with Data: Thoughts from the Wolfram Data Summit

[Photo from the 2015 Wolfram Data Summit]

Earlier this month I had the pleasure of speaking at the 2015 Wolfram Data Summit. This boutique conference is an intimate gathering of some of the world’s top data minds, including quant trading analysts, NASA engineers, and of course, Stephen Wolfram himself. During the two-day event, I had the chance to hear first-hand the challenges facing the data industry as a whole. Here are my notes on the presentations I found most interesting.

Opening Keynote

Stephen Wolfram, Wolfram

As expected, Dr. Wolfram’s opening keynote focused on the automation of computation, which is a core focus for Wolfram the company. The goal at Wolfram is to build in as much knowledge as possible. This creates a very “dense” language, which stands in contrast to a “light” language. Wolfram’s Mathematica language has a ton of functionality built into it, and doesn’t require third-party libraries to supplement that functionality.

For me, it’s not clear whether this is the best approach, as it requires something of a central authority (Wolfram) to advance the language. That said, it’s produced some incredible results, such as Emerald Cloud Lab, which automatically generates biochemistry experiments with repeatable results.

Using Big Data to Predict Online Trends

Peter Sirota, Quantcast

Peter’s talk focused on the “visit graph” Quantcast has built to model consumer behavior across the web. It’s a graph of the websites someone has visited, the ads shown during those visits, where they saw those ads, and more. It lets Quantcast make demographic inferences about each person, and about people similar to them, in order to serve more targeted advertising. Some interesting insights they’ve pulled out include a higher male-to-female ratio among eHarmony users in California versus the rest of the West Coast, and the ability to identify when certain professions should be shown travel ads based on industry conferences (e.g., dentists traveling to San Francisco). My ears perked up when Peter mentioned the following:

“A mediocre algorithm running on a large, high-quality data set is better than a great model running on poor data.”
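To make the idea of a visit graph a little more concrete, here is a rough, hypothetical sketch of what a single record in such a graph might look like. The field names and values are my own illustrations, not Quantcast’s actual data model.

```python
# A hypothetical sketch of one record in a "visit graph"; field names and
# values are illustrative guesses, not Quantcast's actual schema.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Visit:
    url: str                                             # site the person visited
    ads_shown: List[str] = field(default_factory=list)   # ads served on that visit


@dataclass
class Visitor:
    visitor_id: str                                       # anonymous identifier
    visits: List[Visit] = field(default_factory=list)
    inferred_demographics: Dict[str, str] = field(default_factory=dict)


# Demographic inferences are built up from the visit history and from people
# with similar histories, then used to decide which ads to serve next.
visitor = Visitor(
    visitor_id="anon-48151623",
    visits=[Visit(url="example-travel-site.com", ads_shown=["conference-hotel-promo"])],
    inferred_demographics={"region": "CA", "likely_profession": "dentist"},
)
print(visitor)
```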

Best States for Data Innovation

Daniel Castro, Center for Data Innovation

This talk was an interesting survey of how national and local governments are making data publicly accessible. Daniel quickly ran through how each country stacks up in terms of the number of available data sets: the UK is the farthest ahead on open data, Russia is the farthest behind, and Canada and the US are tied for third. He also highlighted the rise of “data poverty” among local communities, meaning that certain communities are disproportionately lacking when it comes to available open data.

I was happy to hear that Texas, our home state, is one of the leaders in open data, and surprised to hear that rural states tend to do better, mostly due to smart meters being used in agriculture.

Leveraging Public Data to Explore Urban Life

Ben Wellington, Two Sigma

This was probably the most engaging talk of the conference. Ben runs a great blog called IQuantNY, which highlights his work using open data to expose curiosities and inefficiencies in New York City. He talked about some of his favorites, including:

  • Density of parking tickets by geo coordinates exposing unmarked parking zones
  • Odd distribution of health code scores suggesting leniency by health inspectors

The great thing about this talk was how Ben illustrated that access to data can create positive, real-world change. The city government has enacted actual policy changes based on his analysis. This is data helping citizens. Ben made it clear that the analysis wasn’t the hard part; getting access to the data was the real hurdle.

Data is the Only Marketplace

Russell Foltz-Smith, TrueCar

If you’re not familiar with TrueCar, check it out. It’s a great website that relies on data to provide consumers more insight into car pricing. Russell talked about how important this data is, and how TrueCar’s business doesn’t exist without access to it. In fact, he went on to propose that every industry is based on data exchange at its core. An interesting thought to be sure.

Working with Dirty Data

Nicholas Marko, Geisinger Health System

This talk was a fascinating dive into all the different ways data can affect healthcare. Perhaps most interesting (or troubling) was Nicholas’ claim:

“Any way you can screw up data, we’ve done it in healthcare.”

He went on to express how this creates an opportunity for any health system to become a leader in healthcare just by using cleaner data. For them, data quality is a premium feature that is worth an incredible amount.

The Web Data Blindspot

Shion Deysarkar, Datafiniti

Of course, I also got the chance to present, and you can see that presentation below. My talk likewise highlighted the power of data (web data in particular) and how it creates new possibilities.

My takeaway from these talks was that data availability and data quality are huge issues for every industry. It’s fascinating just how critical data is in all aspects of our world, personal or business. Data can help us avoid unwanted parking fees. It can help us find a car. It can even help our doctors treat us correctly. We see the need for high-quality data from our customers, and this need extends beyond web data to every type of data.

Psst… Fashion Retail, Your Pricing Slip is Showing

Fashion Week 2015 in NYC created a fantasy land of colors, fabrics and garments to make people feel their most powerful. As the glamorous event drew to a close, designers got a chance to display their innovation, imagination and understanding of customers. But behind the glitz, glam and tents, teams of marketing executives and buyers were working diligently to make these designs a commercial success. Their work includes important decisions such as choosing the designs that fit the season’s needs, calculating production volume, timing and assortment, and, most important of all, setting prices.

The pricing of fashion retail brands is a multi-dimensional data problem and presents a tantalizing opportunity for Datafiniti. While fashion retail pricing involves many considerations, such as brand perception, retail locations, and target segment perceptions and aspirations, finding the right price is one of the most challenging. Here are two simple reasons why:

1. The Keystone Markup: A widely used pricing approach among brands and retailers. A sample keystone markup chain works as shown below:

[Chart: a sample keystone markup through the pricing chain]

This shows clearly that pricing assumptions made early in the pricing chain are amplified by the time the consumer sees the final price. In a highly competitive environment, getting the right data to make these assumptions is critical.
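As a rough illustration (the three stages and the $20 starting cost below are assumptions made for the sake of the example, not figures from the chart above), a keystone markup simply doubles the price at each handoff in the chain:

```python
# A minimal sketch of keystone (2x) markup compounding through a pricing chain.
# The stages and the $20 starting cost are illustrative assumptions.

def keystone(price: float) -> float:
    """Keystone markup: the selling price is double the cost."""
    return price * 2.0

cost = 20.00  # assumed manufacturing cost
price = cost
for stage in ["manufacturer -> brand", "brand -> retailer", "retailer -> consumer"]:
    price = keystone(price)
    print(f"{stage}: ${price:.2f}")

# With three keystone steps, a $20 cost reaches the shelf at $160, so a $1
# error in the cost assumption becomes an $8 error in the consumer price.
```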

2. The Channel Challenge: For most product categories and price bands, brick-and-mortar stores continue to struggle against online retail stores. A simple application of the keystone markup shows the economic challenge:


The large cost differential puts additional onus on brands and retailers to understand the competitive pricing environment and fine-tune their pricing accordingly, or risk losing out to the online retail channel.
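To sketch the kind of differential this creates, suppose (purely as an assumption, since the underlying chart isn’t reproduced here) that an online retailer can work with one fewer keystone step than a brick-and-mortar chain:

```python
# A rough, assumption-laden sketch of the channel cost differential: the
# brick-and-mortar chain applies three keystone (2x) steps, while the online
# channel is assumed to apply only two. All figures are illustrative.

def shelf_price(cost: float, keystone_steps: int) -> float:
    """Final price after applying a 2x keystone markup the given number of times."""
    return cost * (2.0 ** keystone_steps)

cost = 20.00  # assumed manufacturing cost
brick_and_mortar = shelf_price(cost, keystone_steps=3)
online = shelf_price(cost, keystone_steps=2)

print(f"Brick-and-mortar shelf price: ${brick_and_mortar:.2f}")
print(f"Online shelf price:           ${online:.2f}")
print(f"Differential:                 ${brick_and_mortar - online:.2f}")
```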

Understanding the pricing environment, as well as how competing products are priced, gives brands and retailers a competitive edge. At Datafiniti, we believe that web data can deliver the prices and the insights that retailers and brands have been craving.

We have over 46,000,000 products with pricing data available today. As a quick experiment, we took a look at the pricing of luxury shoes (26,155 of them, to be exact) to see how luxury shoe brands price their products and what kinds of pricing strategies luxury shoe retailers employ. We were surprised and thrilled by the insights we could derive, such as how the top fashion brands use three prominent pricing strategies across all their models.

[Chart: luxury brand shoe price distribution]
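For readers curious about the mechanics, here is a tiny, hypothetical sketch of the kind of first pass behind a chart like the one above: bucket listing prices into bands and look for the handful of peaks that suggest distinct pricing strategies. The sample prices are made up, not drawn from our data.

```python
# A hypothetical first pass at a price-distribution analysis: bucket listing
# prices into $100 bands and print a crude histogram. Sample prices are
# made up for illustration.

from collections import Counter

prices = [495, 495, 525, 650, 675, 695, 695, 850, 875, 895, 1195, 1250, 1295]

band_counts = Counter(int(p // 100) * 100 for p in prices)
for low in sorted(band_counts):
    print(f"${low}-${low + 99}: {'#' * band_counts[low]}")
```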

Watch this space – we will share these interesting insights in the coming weeks.

3 Reasons Why Businesses Should (Re)Consider Using Web Data

Businesses, large and small, are skeptical of using web data.

And I don’t blame them.

If a business needs useful web data, the process of acquiring it is tedious, messy and expensive. Modern business processes are highly efficient and have little slack time, so businesses need to get to the right web data as quickly as possible. Once they find it, they need to ingest and integrate it quickly into their BI systems. And that web data needs to be accurate, comprehensive, and up-to-date.

In most situations where a company uses web data, one of these requirements is not met, thus limiting the impact of web data on the business.

It doesn’t need to be like this.


There are at least 3 reasons why data-driven businesses should consider using web data in a more diligent and pervasive manner:

Reason #1: Monitor Dynamic Market Trends Continuously

Traditional methods of monitoring market trends (quarterly or monthly updates, analyst reports, sales channels, etc.) are too slow to provide the data needed for emerging business models. Companies need to access key market events as and when they happen. The web can become the primary source for this key data.

Reason #2: Gather Comprehensive Competitive Intelligence Swiftly

As markets and consumers become ultra-segmented, a comprehensive approach to competitive intelligence is critical to a functional BI process. By understanding competitors’ pricing, consumer sentiment, or inventory levels, a business can consider an appropriate range of responses. Few sources are more comprehensive than the web.

Reason #3: Respond Rapidly, Intelligently and Profitably

Businesses can react rapidly with a deeper understanding of a given situation, be it a competitor’s price move, uncertainty in the supply chain, or an impending consumer crisis. By incorporating web data into their critical information flows, businesses can build more robust BI systems and make better-informed management decisions.

These are but some of the many reasons why enterprises, large and small, should use web data. This is a fascinating topic for us at Datafiniti. At the upcoming Wolfram Data Summit, I will be presenting some interesting examples of how businesses can leverage web data to have a positive impact on themselves and their ecosystems.

Look me up at the conference or watch this space for another update.

How I Stopped Being Afraid of AI

Reducing the burden of human labor with mechanical devices such as the wheel, the lever, the sail and the steam engine is not a new idea. In most cases, these devices helped societies vastly improve their quality of life and increase opportunities for their citizens. However, using intelligent, independently thinking machines to help, enhance or substitute for human labor, and more importantly human thought, is a new phenomenon.

The 2014 short documentary by C.G.P. Grey, Humans Need Not Apply, thoughtfully discusses the impact of automation on humans and paints a rather bleak future of work.

There is an inherent unease about the kind of tasks intelligent machines are now performing while replacing human workers. This view is also shared by some rather influential figures in technology and science such as Ray Kurzweil and Elon Musk.

But, then, a glass half-empty is also half-full.

There is a different, more optimistic perspective on Artificial Intelligence: that it can have a vast, untapped, positive impact on humans and on the nature of work. AI is another tool, albeit a more powerful and more impactful one, but a tool nevertheless, whose power is waiting to be harnessed by businesses. In fact, at Datafiniti, we leverage AI to perform a significant number of the tasks that help us create a better and more robust product.

There are also other perspectives on AI. Geoff Colvin of Fortune recently argued in his book, Humans Are Underrated, that people fearful their jobs are at risk may be asking the wrong question when they ask what kind of work a computer will never be able to do.
Instead, Mr. Colvin proposes that we ask: what activities will we humans, driven by our deepest nature or by the realities of daily life, simply insist be performed by other humans, even if computers could do them?

We think the subject of AI and its application areas are so exciting that we have proposed a panel at the upcoming 2016 SXSW Interactive, titled “How I Stopped Being Afraid of AI.” We think it will bring a fresh perspective to this raging debate. Please click here to learn more and vote for us.

We are more likely to fear what we do not understand. Get to know AI and what it can do for your business.

Introducing Property Data from Datafiniti


Today is a BIG day at Datafiniti! We are very excited to announce the release of Datafiniti Property Data.

Datafiniti Property Data gives businesses instant access to the cleanest, most comprehensive, and most up-to-date web data on things like rental prices and rental inventory in the sharing economy, property investment values, and overall real estate market trends. At launch, Datafiniti Property Data will have over 1,000,000 listings, and it will cover the entire market of roughly 10,000,000 online listings by Q4 2015. You can learn more about the product here.

Datafiniti continues to rapidly increase our data coverage and currently has over 72,000,000 unique web data records, including:

  • 40,000,000+ unique product records, including product pricing and reviews
  • 32,000,000+ unique business records, including locations and reviews

Our Property Data helps our current customers in several ways, including strategic planning, marketing initiatives, and sales & business development. Datafiniti is making a significant impact on our customers’ businesses. If your business operates in the Real Estate ecosystem, let us know how we can help you.

Need for Speed: Solving the Real Estate Sector’s Need for Fresh Data

“There is gold in them thar hills,” said Mulberry Sellers, a Mark Twain character lured by the California gold rush. Today, Mulberry would say, “There is gold in them thar web.”

There is indeed a huge amount of publicly available data filled with critical information for your business. A recent TechCrunch article highlights the use of such data in the Real Estate sector:

.. up-to-date information is crucial for ensuring consumer trust. Sites like Trulia need data to be in near real-time. If, for instance, a consumer wants to view the value of homes in his or her neighborhood, Trulia can display the recent sale prices of homes rather than values from two or three years ago.

It is critical for the Real Estate sector to have the latest data, especially in the accommodation market. The sharing economy has disrupted the hospitality industry. With companies like HomeAway, Booking.com, and AirBnB offering attractive rentals at affordable prices, incumbent players like Marriott and Hilton need to know the latest pricing and inventory levels of these rental spaces to stay competitive. Real estate portals like MyRentToOwn.com and other emerging businesses in the Real Estate ecosystem need access to the latest listing data to serve their customers effectively.

As the supply of homes dwindles (see chart below) and consumers’ lifestyles become more connected, the need for the latest information is all the more critical. The data needs of these businesses vary, but the data needs to be current and comprehensive.

Having the freshest online data has remained a challenge for the Real Estate sector, with no satisfactory solution. Until now.

Using publicly available data, Datafiniti will soon introduce a solution that meets the need for clean, current and comprehensive real estate data. Providing instant access to this data is going to start a quiet revolution in the way the Real Estate sector uses web data.

There is surely gold in them thar web.

How Do We Move Past Data Wrangling?

“Data Wrangling” needs to be a thing of the past. The business world has recognized the value of data (and big data) for several years now. Unfortunately, it’s still stuck in the quagmire of messy data. The New York Times recently published an article “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”. This is a depressing title, but it’s dead-on. Some key points from the article:

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

But if the value comes from combining different data sets, so does the headache. Data from sensors, documents, the web and conventional databases all come in different formats.

The second quote really stands out to me. It specifically mentions the frustration that comes with combining data sets sourced from the web. At Datafiniti, we’re working to remove this burden from the business analyst or data scientist. Beyond our company, though, data providers as a whole need to embrace the challenge of providing clean data, and we haven’t yet seen that happen in the field of web data. Traditional web scraping companies have been content to license complicated software to their customers and let them, or a partner company, handle the sanitization that inevitably comes next. As the table stakes rise in the data world, data wrangling will be absorbed up the data value chain, and businesses will expect clean data to be provided upfront.
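To make that “janitor work” concrete, here is a tiny, hypothetical example of the kind of normalization a data provider can absorb so the analyst doesn’t have to. The field names and records are invented for illustration.

```python
# A minimal sketch of web-data "janitor work": the same product arrives from
# two sources in different shapes and must be normalized into one schema
# before any analysis can happen. Records and field names are invented.

raw_records = [
    {"name": "Acme Running Shoe", "price": "$89.99", "source": "store-a"},
    {"productName": "ACME RUNNING SHOE", "salePrice": 89.99, "source": "store-b"},
]

def normalize(record: dict) -> dict:
    name = record.get("name") or record.get("productName", "")
    price = record.get("price") if "price" in record else record.get("salePrice")
    if isinstance(price, str):
        price = float(price.replace("$", "").replace(",", ""))
    return {"name": name.strip().title(), "price": price, "source": record["source"]}

clean = [normalize(r) for r in raw_records]
print(clean)
```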

At Datafiniti, we’re excited to not only see this evolution, but be a driving force for it. Let’s go from 80% data wrangling / 20% insights to 0% data wrangling / 100% insights. Let’s make data wrangling a thing of the past.

How Web Data Can Help Your Job Search

Last week some friends of ours visited Austin. Like so many people, they were moving to this amazing city for new jobs. One of them was on the verge of accepting a job at Whole Foods, and the other was just beginning his search. Since his wife’s office would be downtown, he wanted to find other companies located downtown, and since his background was in software development, he needed a company in technology.

Surprisingly, getting a list of technology companies in a specific geographic area is not an easy thing to do, at least with traditional tools like Google, LinkedIn, etc. My friend had already exhausted the normal options for this information, and he knew his list of companies was woefully small.

As he described his problem to me, I knew I could help. After all, Datafiniti customers face the same problems before they come to us, but on a larger scale: they know the information is on the web, but they don’t know how to compile it quickly and easily. With a smile on my face, I told him I could get his list right away.

I opened up my laptop and issued a request to Datafiniti:

[Screen capture from my API client]
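Since the screenshot doesn’t reproduce well here, the sketch below shows roughly what that kind of request looks like in Python. The endpoint, query syntax, and token are hypothetical stand-ins, not the actual Datafiniti API call from the screenshot.

```python
# A rough sketch of the kind of request in the screenshot above: search for
# technology businesses in Austin, TX and save the results to a file. The
# endpoint, query syntax, and token are hypothetical placeholders.

import requests

API_TOKEN = "YOUR_API_TOKEN"  # placeholder

response = requests.post(
    "https://api.example.com/businesses/search",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"query": "categories:technology AND city:Austin AND province:TX",
          "format": "CSV"},
)
response.raise_for_status()

with open("austin_tech_companies.csv", "wb") as out_file:
    out_file.write(response.content)
```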

Within a few minutes, I had over 1,000 businesses in a file for him, ready to sort through. Since the data included zip codes, he could easily filter the list down to companies located in downtown Austin. Armed with the websites, he could start going through the list, finding career pages, job openings, and more. Here’s a screenshot of that file:


It was incredibly gratifying to use Datafiniti to help a friend with his job search. It just goes to show how many potential applications there are for clean, organized web data.

Fighting Human Trafficking with Web Data

One of the central tenets of our work at Datafiniti is that web data has the potential for tremendous positive change in our world. We recently came across an article that demonstrates this perfectly, entitled “The escort database that combats human trafficking”.

The article talks about DIG (Domain-specific Insight Graphs), which crawls the entire web and converts content into data (sound familiar?). In this case, the data is a collection of markers to help identify human trafficking activity on the web and track down missing persons.

The creators of DIG highlight many of the challenges with and benefits of web data. Specifically:

The internet contains seemingly limitless information, but we’re constrained by our ability to search that information and come up with meaningful results.

The UK’s Human Trafficking Centre identified 2,255 potential victims of human trafficking in 2012, and the Missing Persons Advocacy Network estimated 200,000 US children are at high risk for trafficking into the sex industry. Better tools to address the unwieldy problem of police scouring the entire web for clues are an obvious priority.

These observations mirror the issues we tackle every day at Datafiniti. Although the data vertical is different, the challenges and approaches are incredibly similar.

It’s very exciting to see others recognize the potential of web data and use it for such tremendous social good as this. We’re sure to see many other applications benefiting society as web data becomes more and more accessible.