Ten (10) categories of functionality:
Data access, filtering and manipulation: This refers to the product's ability to access and integrate data from disparate sources and types, and to transform and prepare data for modeling.
Stream Library is a Java library for summarizing data in streams for which it is infeasible to store all events. More specifically, there are classes for estimating cardinality (i.e., counting distinct things), set membership, top-k elements, and frequency. One particularly useful feature is that cardinality estimators with compatible configurations may be safely merged.
These classes may be used directly in a JVM project or with the provided shell scripts and good old Unix IO redirection.
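To make the "safely merged" point concrete, here is a minimal Python sketch of a HyperLogLog-style cardinality estimator. It is not Stream Library's code or API: the class name `HyperLogLog`, the parameter `p`, and the SHA-1 based hash are all assumptions chosen for illustration. The idea it shows is that two sketches built with the same configuration can be combined register by register, and the result is the same as if a single sketch had seen both streams.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog-style estimator (illustrative, not Stream Library's API)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits (assumed default)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def _hash(self, item):
        # 64-bit hash taken from the first 8 bytes of a SHA-1 digest.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash(item)
        width = 64 - self.p
        index = x >> width              # top p bits pick a register
        rest = x & ((1 << width) - 1)   # remaining bits
        # Rank is the position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Only sketches with the same configuration are compatible.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different p")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Two streams that share 5,000 of their ids.
a, b = HyperLogLog(), HyperLogLog()
for i in range(10_000):
    a.add(f"user-{i}")
for i in range(5_000, 15_000):
    b.add(f"user-{i}")

merged = a.merge(b)
print(round(merged.cardinality()))  # an estimate near the true 15,000 distinct ids
```

Because merging just takes the maximum of each pair of registers, the overlap between the two streams is not double counted, which is what makes this kind of estimator handy when partial counts are computed on separate machines.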
United States President Barack Obama recently introduced DJ Patil as the new Chief Data Scientist of the United States Government. See video: Data Science: Where are We Going?
Apache Kafka is a high-throughput, publish-subscribe messaging system rethought as a distributed commit log. The new Kafka 0.8.2.0 release introduces many new features, improvements and fixes, including:
- A new Java producer for ease of implementation and enhanced performance.
- A Kafka-based offset storage.
- Delete topic support.
- Per-topic configuration of preference for consistency over availability.
- Scala 2.11 support and dropping support for Scala 2.8.
- LZ4 Compression.
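For readers new to the "commit log" framing, the following is a toy, in-memory Python sketch of the idea, including offsets that are stored in a log of their own. It does not use Kafka or its client API. `ToyBroker`, its methods, and the `__offsets` topic name are hypothetical names made up for this illustration, and everything that makes Kafka distributed (partitions, replication, persistence) is left out.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a commit log (hypothetical, not Kafka's API)."""

    OFFSETS_TOPIC = "__offsets"  # assumed name for the internal offsets log

    def __init__(self):
        # Each topic is an append-only list; a record's offset is its index.
        self.logs = defaultdict(list)

    def publish(self, topic, message):
        log = self.logs[topic]
        log.append(message)
        return len(log) - 1          # offset of the record just written

    def commit(self, group, topic, next_offset):
        # Offsets are themselves appended to a log instead of a separate store.
        self.publish(self.OFFSETS_TOPIC, (group, topic, next_offset))

    def committed(self, group, topic):
        # The most recent commit for this group and topic wins.
        for g, t, offset in reversed(self.logs[self.OFFSETS_TOPIC]):
            if (g, t) == (group, topic):
                return offset
        return 0                     # nothing committed yet: start at the beginning

    def poll(self, group, topic, max_records=10):
        start = self.committed(group, topic)
        return start, self.logs[topic][start:start + max_records]


broker = ToyBroker()
for event in ["a", "b", "c"]:
    broker.publish("clicks", event)

# The "billing" group reads two records, then commits its position.
start, batch = broker.poll("billing", "clicks", max_records=2)
broker.commit("billing", "clicks", start + len(batch))
```

Since the log keeps every record and each consumer group only tracks a position, a second group can read the same topic from the start without affecting the first one.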
I often receive phone calls from organizations, aspiring data scientists and reporters about whether data science would be a good career choice for women. My response is absolutely yes, for the following reasons: