Insights of four Colorado Big Data experts
I just returned from the January 2014 meetup of the Boulder/Denver Big Data Meetup group. They stumbled into a great format: when they had to postpone their original speaker for a few months, they arranged for four execs, engineers, and data scientists from Boulder and Denver companies involved in Big Data and Data Science to each give a 20-minute presentation, followed by a 20-minute roundtable. Each talk was more focused than the typical 45-60 minute talk, yet more substantive than 10-minute "lightning talks" (in part because these were polished speakers, in contrast to lightning talk sessions that often cater to novices).
Below I super-condense and pick out some highlights:
Part of Aaron's talk addressed the intentionally controversial question of how to define Big Data. His contention was that the high-performance computing (HPC) world has had the "three V's" (velocity, volume, and variety) for decades, so those are not new. Aaron took a descriptive, observational approach to defining Big Data and noted that what Big Data projects have in common are the tools and processes that have risen up around them.
Personally, I had always liked the definition of "data too big to fit on a single machine," but in light of Aaron's observation that supercomputers have already been doing that, and that we wouldn't call supercomputers "Big Data," I'm now rethinking it. My thought now is that a better definition might be "systems that horizontally scale on commodity hardware." It's a counterintuitive definition: note that I'm defining systems rather than data! But whether people consciously recognize it or not, I think that's what they really mean when they say "Big Data."
This part was very technical, but I was heartened to hear that others have the problem of versioning Avro over a Kafka stream between the producer side and the consumer side. Rally has developed a tool called Marshmallow that wraps Kafka and Avro and hides all the versioning complexity. I've asked Jonathan to e-mail me a link to the GitHub repo and will post it here if I get it.
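To make the problem concrete, here is a minimal sketch of the schema-versioning pattern a wrapper like Marshmallow presumably handles (the actual tool wasn't shown, and these function and schema names are my own invention; `json` stands in for Avro's binary encoding to keep the sketch dependency-free). Each message carries a schema version in a header, so a consumer on one schema version can still read a producer's messages written with another:

```python
import json
import struct

# Two hypothetical schema versions of a contact record; v2 added a field.
SCHEMAS = {
    1: ["email"],                 # v1: email only
    2: ["email", "full_name"],    # v2: added full_name (default: "")
}

def encode(version: int, record: dict) -> bytes:
    """Prefix the payload with a 4-byte schema version, like a wire header."""
    payload = json.dumps({f: record.get(f, "") for f in SCHEMAS[version]}).encode()
    return struct.pack(">I", version) + payload

def decode(message: bytes, reader_version: int) -> dict:
    """Read using the writer's schema version, then project onto the reader's fields."""
    writer_version = struct.unpack(">I", message[:4])[0]
    record = json.loads(message[4:].decode())
    # Fields the reader expects but the writer didn't send get a default,
    # mirroring Avro's schema-resolution rules.
    return {f: record.get(f, "") for f in SCHEMAS[reader_version]}

# A v1 producer and a v2 consumer can still interoperate:
msg = encode(1, {"email": "a@example.com"})
print(decode(msg, 2))  # {'email': 'a@example.com', 'full_name': ''}
```

The point of hiding this behind a wrapper is that neither the producer nor the consumer code has to know the other side's version; the header plus a shared schema registry resolves it.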
I had already heard a lot about Full Contact's architecture and technology from the November 2013 Boulder/Denver Storm Meetup, but the information that surprised me tonight was that they plan to move away from Storm and onto something like RabbitMQ, so that they can have complete control over performance, especially over processes that are I/O bound.
The best insight that Shawn shared actually came during the Q&A roundtable afterward, when his colleague posed a question to the table. A lesson learned at ReturnPath is to not let each data analyst/data scientist come up with their own summarized data, because each one, for example, may use a slightly differently defined denominator in their ratios, leading to confusion when comparing reports from different analysts. So they plan to systematize and standardize the summarization of data.
This then led into the discussion of unstructured vs. structured data. The consensus was that Big Data systems should be used to capture the raw, unadulterated, unstructured data, and that the data should be kept there immutable. That data should then be summarized into some kind of data warehouse, i.e., something highly structured and probably an RDBMS, since querying Big Data stores directly is problematic and slow. This is a practice that has been solidifying for the past two or more years, but it was good to hear it stated so concretely.
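The two-layer pattern the roundtable converged on can be sketched in a few lines. This is my own minimal illustration using SQLite as a stand-in for both layers (table and column names are hypothetical): the raw layer is append-only and never mutated, and the structured summary is derived from it, so it can always be rebuilt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, ts TEXT)")
conn.execute("CREATE TABLE daily_summary (day TEXT, event TEXT, n INTEGER)")

# Raw layer: append only -- no UPDATE or DELETE, ever.
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [("u1", "click", "2014-01-16"),
     ("u2", "click", "2014-01-16"),
     ("u1", "view",  "2014-01-16")],
)

# Warehouse layer: pre-aggregated for fast querying, and rebuildable
# at any time from the immutable raw data.
conn.execute("""
    INSERT INTO daily_summary
    SELECT ts, event, COUNT(*) FROM raw_events GROUP BY ts, event
""")
print(conn.execute("SELECT * FROM daily_summary ORDER BY event").fetchall())
# [('2014-01-16', 'click', 2), ('2014-01-16', 'view', 1)]
```

In practice the raw layer would be HDFS or similar and the summary an actual warehouse, but the contract is the same: reports query the structured layer, and the immutable raw layer remains the source of truth.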