Active vs Passive Data Variety

The "3 V's" (Volume, Velocity, and Variety) are typically portrayed as problems to be solved by Big Data and Data Science. The trite knee-jerk response is that they're not problems but opportunities.

But what about taking that trite statement to heart? If we did that, we would seek out additional data, rather than be content with the Big Data that happens to be sitting in our Hadoop cluster or that happens to be streaming in through Spark Streaming!

Increasing "Volume" is straightforward -- doing more of the same. Increasing "Velocity" means incorporating Data Streaming, which I've been blogging about here for the past eight months or more.

But what about "Variety"? What does it mean to increase data variety?

An infographic from Kapow Software lists nine varieties of data. I will use those nine, add some of my own, and group them as follows:

On-Demand Data (Internal)

Within organizations, there are often internal web services that provide data. A prime example is user account info. This can be used to augment a clickstream data stream on the fly.
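As a minimal sketch of that on-the-fly augmentation, the Python below joins each click to an account record. Everything named here is hypothetical: the event fields, the `fetch_account` callable (which stands in for a call to the internal web service), and the sample accounts. A small in-memory cache is assumed so that a busy user does not trigger one service call per click.

```python
from typing import Callable, Dict, Iterable, Iterator


def enrich_clicks(
    clicks: Iterable[Dict],
    fetch_account: Callable[[str], Dict],
) -> Iterator[Dict]:
    """Attach account info to each click, calling the service once per user."""
    cache: Dict[str, Dict] = {}
    for click in clicks:
        user_id = click["user_id"]
        if user_id not in cache:
            # In a real system this would be an HTTP call to the internal service.
            cache[user_id] = fetch_account(user_id)
        yield {**click, "account": cache[user_id]}


# Hypothetical stand-in for the internal account web service.
_FAKE_ACCOUNTS = {
    "u1": {"plan": "premium", "signup_year": 2012},
    "u2": {"plan": "free", "signup_year": 2014},
}


def fake_fetch_account(user_id: str) -> Dict:
    return _FAKE_ACCOUNTS.get(user_id, {})


clicks = [
    {"user_id": "u1", "page": "/pricing"},
    {"user_id": "u2", "page": "/home"},
    {"user_id": "u1", "page": "/checkout"},
]
enriched = list(enrich_clicks(clicks, fake_fetch_account))
```

Passing the lookup in as a function keeps the enrichment step independent of how the service is actually reached, which also makes it easy to test without the service.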

The Kapow infographic lists Data Storage as a source, by which I suppose is meant a direct JDBC connection to an RDBMS. But having an extra analytics application come in and access a line-of-business database directly is (or should be) rarely done today. The reasons are maintaining a consistent interface and limiting service load. A web service can put a facade over schema changes, and it can also ensure that heavy demand from the analytics application doesn't interfere with the business application.
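To illustrate the facade point only (not the load point), here is a small Python sketch. The two row layouts and all field names are invented for illustration: the assumption is that the underlying table once stored a single `name` column and was later split into `first_name` and `last_name`. The service maps either layout to the same public shape, so the analytics consumer never notices the change.

```python
from typing import Dict


def account_from_v1_row(row: Dict) -> Dict:
    """Map the (hypothetical) old schema, with one name column, to the public shape."""
    return {"user_id": row["id"], "display_name": row["name"]}


def account_from_v2_row(row: Dict) -> Dict:
    """Map the (hypothetical) new schema, with split name columns, to the same shape."""
    full_name = f'{row["first_name"]} {row["last_name"]}'
    return {"user_id": row["id"], "display_name": full_name}


old_row = {"id": "u1", "name": "Alice Smith"}
new_row = {"id": "u1", "first_name": "Alice", "last_name": "Smith"}

# Both layouts produce the same response for the analytics application.
before_migration = account_from_v1_row(old_row)
after_migration = account_from_v2_row(new_row)
```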

The Kapow infographic also lists Archives (scanned paper documents), Documents (word processor), Media, and Business Apps. With the exception of some narrow applications of Media, such as facial and voice recognition, we are still at a point in Big Data and Data Science where all of these are high-hanging fruit, to be pursued after everything else in this blog post has been implemented.

Also listed is Machine Log Data. Now, many Big Data projects start out with consuming server logs because that is what is most easily available, so it might seem odd to look to machine log data as a way to increase variety on an existing Big Data system. There is a small opportunity here: increasing the log level (that is, the verbosity). If the logs can be associated with users, then the extra detail in the logs may yield additional information about those users. This is a prime example of taking what was considered information overload -- system administrators are constantly complaining about culling logs and rotating logs -- and turning it into a greater source of Big Data.
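Here is a rough Python sketch of associating log lines with users. The log format is an assumption made up for this example (timestamp, level, then a `user=` field); real server logs will look different and need their own pattern. The idea is simply that once DEBUG lines are turned on and tied to a user, they become per-user features.

```python
import re
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable

# Assumed log layout: "<timestamp> <LEVEL> user=<id> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) user=(?P<user>\S+) (?P<msg>.*)$"
)


def per_user_level_counts(lines: Iterable[str]) -> DefaultDict[str, Counter]:
    """Count log lines per user and per log level, skipping unparseable lines."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # no user can be associated with this line
        counts[match.group("user")][match.group("level")] += 1
    return counts


# Made-up sample lines, including the extra DEBUG output a higher log level adds.
sample_lines = [
    "2015-03-01T10:00:00 INFO user=alice page rendered",
    "2015-03-01T10:00:01 DEBUG user=alice cache miss for /pricing",
    "2015-03-01T10:00:02 DEBUG user=bob search query took 840ms",
    "malformed line without a user",
]
counts = per_user_level_counts(sample_lines)
```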

On-Demand Data (External)

Kapow mentions public web such as government data sources, but the greatest value often comes from commercial pay web services. Examples include geo-IP lookup and postal mailing address scrubbing.

Streaming Data (External)

A typical data streaming architecture consists of a single internal data stream such as a clickstream, which may be augmented by one of the static on-demand data sources listed above. Alternatively, another typical data streaming architecture hooks up to a Twitter firehose to perform real-time sentiment analysis.

But what about merging these two -- e.g. having an internal clickstream merged with an external Twitter firehose? If a user logged into your website is also tweeting about their experience, the two streams can be merged to provide real-time system alerts or marketing offer interventions.
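A toy version of that merge might look like the Python below. It is a sketch under several assumptions that are not from any real system: both streams arrive sorted by a numeric `ts` timestamp, a `handle_to_user` mapping from Twitter handle to site user already exists, and a tweet "belongs" to a session if it follows the user's latest click within `window_seconds`. A production system would do this in a streaming framework rather than over in-memory lists.

```python
import heapq
from typing import Dict, Iterable, Iterator


def merge_streams(clicks: Iterable[Dict], tweets: Iterable[Dict]) -> Iterator[Dict]:
    """Interleave two time-sorted streams into one time-sorted stream."""
    return heapq.merge(clicks, tweets, key=lambda event: event["ts"])


def tweets_during_sessions(
    events: Iterable[Dict],
    handle_to_user: Dict[str, str],
    window_seconds: int = 300,
) -> Iterator[Dict]:
    """Yield tweets posted shortly after the same user's most recent click."""
    last_click: Dict[str, Dict] = {}
    for event in events:
        if event["kind"] == "click":
            last_click[event["user_id"]] = event
        elif event["kind"] == "tweet":
            user_id = handle_to_user.get(event["handle"])
            click = last_click.get(user_id)
            if click is not None and event["ts"] - click["ts"] <= window_seconds:
                yield {"user_id": user_id, "page": click["page"], "tweet": event["text"]}


# Made-up sample data; timestamps are seconds from an arbitrary start.
clicks = [
    {"kind": "click", "ts": 100, "user_id": "u1", "page": "/checkout"},
    {"kind": "click", "ts": 160, "user_id": "u2", "page": "/home"},
]
tweets = [
    {"kind": "tweet", "ts": 130, "handle": "@alice", "text": "Checkout keeps timing out"},
    {"kind": "tweet", "ts": 900, "handle": "@bob", "text": "Lunch time"},
    {"kind": "tweet", "ts": 950, "handle": "@stranger", "text": "Hello world"},
]
handle_to_user = {"@alice": "u1", "@bob": "u2"}

matches = list(tweets_during_sessions(merge_streams(clicks, tweets), handle_to_user))
```

Here only the first tweet is matched: it comes from a known user 30 seconds after that user's click on the checkout page, which is the kind of event that could drive an alert or an offer.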

Instrument Software

While websites may already be instrumented by nature, perhaps you have iOS or Android apps that could be instrumented. Also, if your regular website for desktops is a single-page app, a lot may be missed by capturing only server-side logs. For single-page apps, it may be necessary to add JavaScript code that sends events to the server for user interactions that would otherwise be handled strictly locally.

And while increasing the logging level for existing logs was suggested above, custom server software that was written in-house may have to be modified to produce the extra log output desired.

Instrument the Real World

Kapow does mention sensors, but again, we're not talking about just passively accepting data coming from sensors. There may be additional sensors we can place to provide valuable data.

For example, a fitness gym may want to provide its patrons with heart and other physiological monitors and fine-grained location trackers within the gym, in order to recommend appropriate fitness classes or personal trainers, and perhaps even to predict future churn more accurately than attendance data alone would provide.

Competitive Advantage

If everyone is already doing Big Data and Data Science, then the way to get competitive advantage is to seek out Bigger Data, and bigger means Variety, assuming you've already got your Volume and Velocity up.