Although the term has been around Hadoop circles for a few years, 2014 saw its rise to prominence. And while it's fine for an organization mature in Hadoop to adopt multi-tenancy, the technology is still too immature to fire up in an organization new to Hadoop.
Diagram notes: (1) Yellow documents are map outputs. (2) Not shown: Hadoop spools map outputs to disk before the reduce task reads them, whereas Spark keeps the map outputs in RDDs.
In thermodynamics, entropy measures disorder, or the amount of dispersed heat that can no longer be put to use.
In information theory, entropy measures how difficult it is to ZIP a file, i.e. how poorly a lossless compressor performs. This is because Claude Shannon based his definition of entropy upon the probabilities of seeing various outcomes such as various strings of bits.
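To make the link between entropy and compressibility concrete, here is a minimal Python sketch. It assumes a simple byte-frequency ("order-0") estimate of Shannon entropy, H = sum of p * log2(1/p) over the byte values seen, and uses the standard library's zlib as a stand-in for ZIP. The function names and sample inputs are made up for illustration. A real compressor also exploits repeated patterns, so the byte-frequency estimate is only a rough guide to how well a file will compress.

```python
import math
import os
import zlib
from collections import Counter


def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Estimate entropy from byte frequencies: H = sum(p * log2(1/p))."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical sample inputs: one highly ordered, one essentially random.
low = b"a" * 10_000
high = os.urandom(10_000)

for name, sample in (("repeated byte", low), ("random bytes", high)):
    h = shannon_entropy_bits_per_byte(sample)
    r = compressed_ratio(sample)
    print(f"{name}: {h:.2f} bits/byte, compressed to {r:.1%} of original size")
```

The repeated-byte input has zero entropy and shrinks to a tiny fraction of its size, while the random input sits near the 8 bits/byte maximum and does not shrink at all.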
The "3 V's" are typically portrayed as a problem to be solved by Big Data and Data Science. The trite knee-jerk response is that they're not problems but opportunities.
But what about taking that trite statement to heart? If we did, we would seek out additional data rather than be content with the Big Data that happens to be sitting in our Hadoop cluster or streaming in through Spark Streaming!
Data Fusion originated as a technique in the 1980s defense industry, where the goal in digital intelligence work was to come up with a single, unified truth. For example, by combining information from multiple sources, analysts arrive at the belief that enemy tank serial number A123 is at location X.