The End of Data Science As We Know It
Much of this is simply Data Science coasting into the "Trough of Disillusionment". Gartner's 2013 Big Data Hype Cycle has Data Science just starting to climb the "Peak of Inflated Expectations," but I say from the evidence this past week that it has definitely already passed that peak and has started slipping into that "Trough of Disillusionment".
The Gartner chart posits that Data Science won't reach the "Plateau of Productivity" for more than another ten years. I think that's about right.
But there is truth in both Miko Matsumura's Slashdot post and Vincent Granville's Data Science Central post. There is a truth in there that they only just touched on, and it is part of a contrarian opinion I've been developing over the past few weeks that many disagree with me on, including other DSA board members.
With Coursera pumping out experts in Machine Learning, Statistics, Data Analytics, R, and Bayesian Networks, and with R and IPython Notebook having available these advanced algorithms pre-canned and ready to use, self-proclaimed "data scientists" -- or at least data analytics professionals -- will be a commodity, as Vincent Granville has already observed.
There is another big piece that is changing, and that is the paradigm shift away from the "Data Lake" (aka "Hadump") toward real-time ingest. In the Data Lake scenario, data is messy and requires a data scientist to intelligently deal with missing values and design experiments and quasi-experiments. It's the proof of concept phase in any new Big Data endeavor.
In real-time ingest scenario, data is clean -- or at least cleaner -- by design. Disparate data sets from multiple ingest points may be fused together, but it is by advance arrangement and engineering. The task of data cleansing, which is always 90% of the work involved in data science, thus is transferred from data scientists in first generation Big Data and Data Science systems over to software developers.
And that type of software development isn't the easy type -- it's clustered, parallel, and performance-driven. It's not the sort that's easily off-shored, especially since it involves integration with existing corporate data sources.
The reason for the shift from 1st generation to 2nd generation is that once insights are gleaned from 1st generation systems, consumers of those insights start demanding more up-to-date tracking, meaning real-time. The timeline depicted above of course is overly simplistic. Not only are different companies at different stages of adoption of Big Data and Data Science, but there will always be a demand for fishing expeditions within the Data Lake, but to a much lesser extent compared to the deluge we are in now that was sparked by the realization that storage is cheap enough now to not have to throw away data.
In the current era of second generation systems, where R, IPython Notebook, and machine learning software libraries are available, reliance on top-notch data scientists isn't actually taken to zero as I have depicted above. They're still needed to set direction, validate results, and perform the occasional fishing expedition as new data sources become available and come online. But they're a smaller portion of the team going forward.
Overall, the demand for top-notch Data Scientists I predict will still continue to grow. As we continue toward the Plateau of Productivity, every organization will need Data Science in order to remain competitive. But due to automation and due to clean-data-by-design of second generation systems, their numbers will be dwarfed by the number of software developers required. The number of teams requiring Data Science expertise will grow as we reach the Plateau of Productivity, but the portion of Data Scientists within those teams will shrink. And they may need to add Social Sciences to their expertise to remain competitive.