Mordac the Preventer of Information Services Leads to Unnecessary Hadoopness
Today's Wall Street Journal article The Joys and Hype of Software Called Hadoop (free mirror), in which Michael Walker, president of the Data Science Association, was quoted, reports "frustration" among adopters of Hadoop. The article cites several reasons, from performance on large data sets to the difficulty of merging Hadoop data with data warehouses.
I can personally think of several other reasons, too:
- repeated re-adaptation (including re-architecting) to the ever-changing landscape of tools in the Hadoop ecosystem
- attempting to apply Agile to a research project (which is OK when done well) when the organization's only Agile experience is with plain old three-tier systems
- critical data fields missing from the collected data -- fields that would otherwise, say, end up at the top of a machine-learned decision tree (see the sketch after this list)
- attempting to provision Hadoop (or worse, Spark) on standard data center servers, or worse (and shockingly even more common) on data center VMs
- undersized Hadoop clusters that can only retain a short lookback period of data
- hiring purple unicorns instead of growing developers from within to learn Hadoop, even though in-house developers already have the hard-won domain knowledge
- ill-defined funding for the Hadoop project (as with webmasters of the 1990s: does it come out of the marketing budget, the IT budget, or perhaps the budget of some other department, such as fraud detection/information security, that would make use of it?)
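To make the missing-fields point concrete, here is a minimal sketch in R using the rpart package, on hypothetical fraud data (all field names are made up): when the critical field is present it should land at the root of the tree, and when it is dropped the remaining predictors carry so little signal that the tree may not usefully split at all.

```r
# Minimal sketch, hypothetical data: how one missing critical field
# degrades a machine-learned decision tree.
library(rpart)

set.seed(42)
n <- 10000
d <- data.frame(
  account_age_days = rexp(n, rate = 1/400),   # the "critical field"
  txn_amount       = rlnorm(n, meanlog = 4),
  region           = factor(sample(c("AMER", "EMEA", "APAC"), n, replace = TRUE))
)
# Fraud here is driven mostly by account age.
d$fraud <- factor(ifelse(d$account_age_days < 180 & d$txn_amount > 50,
                         "yes", "no"))

with_field    <- rpart(fraud ~ ., data = d)
without_field <- rpart(fraud ~ txn_amount + region, data = d)

print(with_field)     # account_age_days should appear at the root split
print(without_field)  # without it, rpart may find no worthwhile split at all
```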
Then there's the reason I hinted at last week, which is that Hadoop may not even be necessary. As I wrote then, a $15k R workstation (256GB RAM, 16TB HDD, Tesla GPU) with access to corporate data stores can produce data insights when the data is "not huge" (less than 150TB in total, of which less than 200GB needs to be analyzed at any one time).
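That workstation workflow, sketched in R below (the DSN, table, and column names are hypothetical; assumes the DBI, odbc, and data.table packages): the database does the filtering, only the working set crosses the wire, and everything after that is ordinary in-memory analysis.

```r
# Minimal sketch of the single-workstation workflow: pull only the slice
# that needs analysis into RAM, then work locally in R -- no cluster.
library(DBI)
library(odbc)
library(data.table)

# "corp_dw" is a hypothetical ODBC DSN for the corporate warehouse.
con <- dbConnect(odbc(), dsn = "corp_dw")

# Push filtering to the database so only the working set (well under
# the workstation's 256GB of RAM) comes across the network.
txns <- as.data.table(dbGetQuery(con, "
  SELECT customer_id, txn_date, amount, channel
  FROM transactions
  WHERE txn_date >= DATE '2014-01-01'
"))
dbDisconnect(con)

# Ordinary in-memory R from here on.
daily <- txns[, .(total = sum(amount), n = .N), by = .(txn_date, channel)]
fit   <- lm(total ~ n + channel, data = daily)
summary(fit)
```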
But you will rarely, if ever, see such a beast of a (useful) workstation in a corporate environment. Why not?
- No manager wants the "data scientist" to have a more powerful computer than the manager has
- IT doesn't want to support it
- With everyone in a cube farm -- or even bullpen -- these days, there is insufficient physical security for such an expensive capital asset (containing possibly sensitive data to boot!)
- Correspondingly, no manager wants to allocate a precious physical office that could provide said physical security, especially if said manager doesn't even have an office
- IT departments feel more comfortable giving under-powered laptops to data scientists and telling them to access Hive running on a pretend Hadoop cluster composed of VMs
- Getting approval for such an unusual piece of equipment for a single person would require sign-offs possibly all the way to the top. It's easier to just beg the data center for some more VMs instead.
- The office's Internet connection may only be tens of megabits per second, when gigabit is what would make such a setup useful (see the arithmetic sketch after this list)
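On that last point, the arithmetic is stark. A quick back-of-the-envelope in R, assuming the ~200GB working set from above has to cross the office link:

```r
# Back-of-the-envelope: time to move a 200GB working set over the office link.
working_set_bits <- 200e9 * 8          # 200 GB expressed in bits

hours <- function(bits_per_sec) working_set_bits / bits_per_sec / 3600

hours(50e6)   # ~8.9 hours at 50 Mbit/s -- an overnight job at best
hours(1e9)    # ~0.44 hours (about 27 minutes) at gigabit -- same-morning work
```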
The current paradigm of wimpy laptops accessing the massive data center is like 1960s-era mainframe computing. We need to enter the 1970s, when people could actually run VisiCalc on their Apple ][, or the modern equivalent thereof.