Data Fusion is Data Destruction

Data Fusion originated as a technique in the 1980s defense industry, where the goal of digital intelligence was to arrive at a single, unified truth -- e.g., combining information from multiple sources into the belief that enemy tank serial number A123 is at location X.

Then in the 1990s and beyond, marketers took up the approach to come up with a single, unified view of a potential customer. For example, under a simplistic data fusion approach, if two sources claimed a person was married and a third claimed single, the unified truth was that the potential customer was considered married for marketing purposes.

This was all done to simplify processing. Having a single version of the truth allowed for smaller data and simpler algorithms.

But now we are in the era of Big Data, and high-powered machine learning algorithms are free, pre-packaged, and ready to download. The constraints and assumptions of the 1980s no longer hold; instead, as I blogged last week, data should never be deleted.

Machine learning algorithms can take advantage of multiple sources of data. For example, a machine learning algorithm may figure out that credit scoring agency A is more accurate for Florida residents and that credit scoring agency B is more accurate for Alabama residents. Well, you may say, once that's figured out, why not just discard the corresponding weaker data for each resident? That would be data fusion thinking. A machine learning algorithm can adapt over time as more data becomes available; it can detect shifts in accuracy, such as if credit scoring agency B improves its accuracy for Florida.
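To make the idea concrete, here is a minimal sketch of keeping both agencies' scores as columns and learning a per-state reliability weight for each, rather than discarding the "weaker" one. The agencies, states, and records below are all invented for illustration; a real system would use an off-the-shelf model rather than this hand-rolled weighted vote.

```python
# Hypothetical sketch: learn per-state reliability of two credit
# scoring sources instead of discarding either column.
# All records below are made up for illustration.

from collections import defaultdict

# Historical records: (state, source_a_says_good, source_b_says_good, actually_good)
history = [
    ("FL", True,  False, True),
    ("FL", True,  True,  True),
    ("FL", False, True,  False),
    ("AL", True,  False, False),
    ("AL", False, True,  True),
    ("AL", True,  True,  True),
]

def per_state_accuracy(records):
    """Return {state: (accuracy_of_A, accuracy_of_B)} measured on history."""
    tallies = defaultdict(lambda: [0, 0, 0])  # [a_correct, b_correct, total]
    for state, a, b, truth in records:
        tallies[state][0] += (a == truth)
        tallies[state][1] += (b == truth)
        tallies[state][2] += 1
    return {s: (a / n, b / n) for s, (a, b, n) in tallies.items()}

def predict(state, a, b, weights):
    """Weighted vote: trust each source in proportion to its measured
    accuracy for this state -- both columns stay in play."""
    wa, wb = weights[state]
    return wa * a + wb * b >= (wa + wb) / 2

weights = per_state_accuracy(history)
```

Because the weights are recomputed from history, a shift in either agency's accuracy shows up automatically the next time the model is refit; nothing had to be thrown away to get there.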

In short, in the era of Big Data and NoSQL, additional data should simply take up additional columns; let the machine learning algorithms deal with it.

Now, even though we shouldn't practice Data Fusion, there are still some things we can learn from it. For example, a technique called Fuzzy Data Fusion exploits redundancy in sources of information to score the reliability of each source. Applying this notion to the context of Big Data and machine learning, sources that are scored as reliable could be used as a training set reference in a boosting scenario.
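As a rough illustration of that reliability-scoring idea, the sketch below rates each source by how often it agrees with the majority of the other sources (exploiting the redundancy), producing a score that could then feed into sample weights for training. The source names and claims are invented, and majority agreement is only one simple proxy for reliability.

```python
# Hypothetical sketch: score each source's reliability by agreement
# with the majority vote across redundant sources. Invented data.

from collections import Counter

claims = {
    # source -> claimed marital status per customer id
    "source_a": {1: "married", 2: "single", 3: "married"},
    "source_b": {1: "married", 2: "single", 3: "single"},
    "source_c": {1: "married", 2: "married", 3: "married"},
}

def reliability(claims):
    """Fraction of each source's claims that match the majority vote."""
    customers = set().union(*(src.keys() for src in claims.values()))
    majority = {}
    for cid in customers:
        votes = Counter(src[cid] for src in claims.values() if cid in src)
        majority[cid] = votes.most_common(1)[0][0]
    return {
        name: sum(v == majority[cid] for cid, v in src.items()) / len(src)
        for name, src in claims.items()
    }

scores = reliability(claims)
```

A high-scoring source's records could then be treated as the trusted reference set (or given larger sample weights) when boosting, while the low-scoring sources are down-weighted rather than deleted.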

Remember: fusion equals destruction. Don't delete that data; use it.