Facebook data scientists recently conducted an online experiment on 689,003 unknowing Facebook users - likely including children under the age of 18 - to see whether user emotions could be manipulated. One group had positive words like "love" and "nice" filtered out of their News Feeds. Another group had negative words like "hurt" and "nasty" filtered out.
Data Science is more than just statistics and machine learning on numbers. A lot of data is "unstructured," which means text (or worse, both text and numbers). While natural language processing has been around for half a century, its importance in the fields of Big Data and Data Science is growing and can no longer be ignored if one is to maintain competitive advantage.
There is a planet full of tools, and herein I describe one grain of sand out of that planet: Semantic Similarity Metrics.
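To give a flavor of that one grain of sand, here is a minimal sketch (pure Python, no NLP libraries; the tokenizer and example sentences are illustrative, not from any particular toolkit) of cosine similarity over bag-of-words vectors - one of the simplest semantic similarity metrics:

```python
import math
from collections import Counter

def bag_of_words(text):
    # Naive tokenizer: lowercase and split on whitespace (illustrative only;
    # real pipelines would stem, remove stop words, etc.)
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Cosine of the angle between two word-count vectors:
    # dot(a, b) / (|a| * |b|).
    # 1.0 = identical word distributions, 0.0 = no shared words.
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Similar sentences score close to 1; unrelated ones close to 0.
print(cosine_similarity("the cat sat on the mat", "the cat sat on the rug"))
print(cosine_similarity("love and kindness", "hurt and anger"))
```

This surface-level metric only sees word overlap; richer semantic similarity metrics compare meanings (synonyms, word senses, embeddings), which is where the real depth of the topic lies.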
This past week, Intel announced a future Xeon would have an FPGA integrated on the chip, and still plug into a standard CPU socket.
This was reported around the various blogs and news outlets, but little attention was paid to what it could actually be used for. In the popular press, FPGAs seem to be thought of as an odd cousin to GPUs, sometimes useful for Bitcoin mining and cracking encryption.
Data scientists must always remember that data sets are not objective - they are selected, collected, filtered, structured and analyzed by human design. Naked and hidden biases in selecting, collecting, structuring and analyzing data present serious risks.
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, using Apache Avro as the data serialization format.
When we practice data science, even if we've done everything correctly and in an unbiased manner, how do we know that our message has been correctly and fully received?
Every human communication goes through a "noisy channel" as illustrated below (image is from idealliance.org).
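To make Shannon's noisy-channel metaphor concrete, here is a toy simulation (hypothetical code, not from the original article) in which each character of a message has some probability of being corrupted in transit - a stand-in for the distortions that creep into any human communication:

```python
import random

def noisy_channel(message, error_rate=0.1, seed=None):
    # Toy model of a noisy channel: each letter independently has a chance
    # of being replaced by a random letter. Real channels distort meaning,
    # not just characters, but the effect is the same: what is received
    # is not exactly what was sent.
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in message:
        if ch.isalpha() and rng.random() < error_rate:
            out.append(rng.choice(alphabet))
        else:
            out.append(ch)
    return "".join(out)

sent = "correlation is not causation"
received = noisy_channel(sent, error_rate=0.2, seed=42)
print(sent)
print(received)  # some letters arrive garbled
```

The data science lesson is the same as the information-theory one: without feedback or redundancy (asking the audience questions, restating the finding several ways), we cannot know how much of the message survived the channel.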
The easiest person in the world to fool is yourself. Data scientists sometimes fool themselves - in matters trivial and important. Thus, I strongly suggest that we acknowledge real or subconscious biases in ourselves, in the data, in the analysis, and in groupthink. It is prudent for data science teams to have both internal and external checks and balances to expose potential biases and better understand objective reality.