Yesterday, Brazil lost to Germany 7-1 in the World Cup semifinals. But Germany had been preparing for the Brazilian team for two years using Data Science. Fifty students from a Cologne university compiled a database on player behaviors, resulting in tactical changes for the German team.
Data Science is more than just statistics and machine learning on numbers. A lot of data is "unstructured," which means text (or worse, both text and numbers). While natural language processing has been around for half a century, its importance in the fields of Big Data and Data Science is growing and can no longer be ignored if one is to maintain competitive advantage.
There is a planet full of tools, and herein I describe one grain of sand out of that planet: Semantic Similarity Metrics.
This past week, Intel announced a future Xeon would have an FPGA integrated on the chip, and still plug into a standard CPU socket.
This was reported across various blogs and news outlets, but little attention was paid to what it could actually be used for. In the popular press, the FPGA seems to be thought of as an odd cousin to the GPU, sometimes useful for Bitcoin mining and cracking encryption.
When we practice data science, even if we've done everything correctly and in an unbiased manner, how do we know that our message has been correctly and fully received?
Every human communication goes through a "noisy channel" as illustrated below (image is from idealliance.org).
A couple of months ago, I blogged about Peak Hard Drive: the observation that hard drive capacities were leveling off, and how this would impact the footprints of data centers in the era of Big Data. Since then, there have been two major announcements about SSDs that indicate they may come to the rescue:
UPDATE 2014-05-20: Matei Zaharia commented on the issue of combining map() and lookup(), stating that it's not within the current design to allow nested RDD functions, and that it's a feature he'd like to see added in the future. In the same Jira ticket, I had posted a workaround using join() which works fine.
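To make the join() idea concrete, here is a minimal sketch in plain Python rather than Spark itself. Lists of (key, value) pairs stand in for RDDs, the `join` helper is a hypothetical stand-in that only mimics the shape of an inner join on keys, and the `orders` and `customers` data are made up for illustration. It is not the code from the Jira ticket.

```python
from collections import defaultdict

def join(left, right):
    """Inner join of two lists of (key, value) pairs.

    Hypothetical helper: returns one (key, (left_value, right_value))
    tuple for every pair of records that share a key.
    """
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]

# Made-up data: orders keyed by customer id, and a customer-name table.
orders = [(1, "book"), (2, "lamp"), (1, "pen"), (3, "desk")]
customers = [(1, "Ana"), (2, "Bo")]

# Nested pattern (the one that is not supported): for each order, look up
# the customer in the other dataset from inside the map function.
# Workaround: join the two datasets on the key first, then map the result.
joined = join(orders, customers)
named_orders = [(name, item) for _, (item, name) in joined]
print(named_orders)
```

The point of the design is that the lookup moves out of the per-record function and becomes a single operation over both datasets. Note that an inner join silently drops records with no match (the order for customer 3 here), which may or may not be what a lookup-based version would have done.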