Big Data is Stupid Data

The "Big Data" marketing hype obscures the fact that more actionable, valuable insights are likely to be found in the right smaller "Smart Data" sets in contrast to large data sets.

While the term "Big Data" is properly defined as data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time - the marketing hype promises technology to collect, store and crunch huge amounts of data to get value and provide advantage.

As many organizations are now learning, it is very difficult to get any value out of large data sets without clear goals, sophisticated data science techniques (e.g., machine learning algorithms), and the right data-crunching and analytical technologies. Getting value from data requires methods that separate signal from noise to extract meaning, along with substantial computing power.

While large data sets may provide great value in specific situations, the savvy professional data scientist knows that the right combination and variety of "Smart Data" is usually more important than "Big Data" and is more likely to add significant value. One of the most important roles of the data scientist is to select the appropriate variety of small data sets for a specific goal, rather than collecting and storing huge volumes of data.

There are a number of reasons for prioritizing the selection and collection of the right data, and different varieties of "Smart Data", over "Big Data". One major reason is the curse of big data: simply put, you will find more "statistically significant" relationships in larger data sets. Statistical significance is an assessment of whether an observed pattern reflects something real rather than mere chance, and a significant result may or may not be meaningful. Because a larger data set contains more variables, and therefore far more possible comparisons, more relationships will pass significance tests by chance alone, creating greater opportunity to mistake noise for signal.

Thus, "Big Data" produces more correlations and patterns between data - yet also produces much more noise than signal. The number of false positives will rise significantly. In other words, more correlations without causation leading to an illusion of reality.

Big data makes it harder to find the needle (actionable, valuable insights) in a larger and larger haystack. The danger is that we will increasingly be fooled by the randomness found in big data and, believing noise is signal, make bad decisions as a result.

I suggest valuing the right "Smart Data" over "Big Data" and focusing on carefully selecting a variety of data sets relevant to a specific goal to maximize the probability of obtaining meaning and value from data.