Real-time data science

Looking back on 2013, the world of Hadoop emerged from the era of batch processing and into streaming processing. In the context of "crisp and actionable," actionable often comes with an expiration date. If you don't take action soon enough, it's too late.

As I learned at the February 2013 Strata conference, Electronic Arts instrumented their online games to continuously stream data into their Big Data system. And because much of their revenue now comes from in-game purchases (in games that are often free to play), they need to know immediately if, e.g., in-game purchases suddenly drop off for whatever reason. They need to take action immediately, not next week.

Here I look at four streaming technologies, and then conclude with some pointers to general streaming computation techniques.

Part of the Berkeley Data Analytics Stack, Spark Streaming is a layer on top of Spark. Spark is a distributed, redundant RAM-based data storage and computation system that was accepted as an Apache Incubator project in 2013. Whereas Hadoop distributes data across the hard drives in a cluster of computers, Spark distributes data across the RAMs in a cluster of computers.

Spark Streaming breaks incoming stream into batches of, for example, 2-second batches or 10-second batches, or however long you set the batch window to be. One criticism of Spark Streaming is that it is not "true real time" since it is up to a batch window size behind. Another limitation is that all processing for that batch must complete before the next batch comes in; otherwise, the behavior is undefined. That can leave a lot of compute capacity unused if the system is tuned to handle the worst-case batch, but normal-size batches leave the processors idle for much of the batch time window.

Storm, another Apache project originally from Twitter, is the other big name in streaming. Its biggest and most well-known shortcoming is that it doesn't guarantee that it won't duplicate events. It guarantees each event will delivered at minimum once. Since this can wreak havoc with real-time metric computation, an optional layer called Trident can de-dupe, but at a high performance penalty.

Storm has a great advantage over Spark Streaming in that it has an "ack" system built-in. If processing an event throws an exception or otherwise fails to "ack" its parent in the directed acyclic graph (DAG) of event flow processing, the event can be saved by the parent for later processing when the problem has been fixed.

Storm also has a web-based admin GUI for starting/stopping/monitoring running DAGs.

Since I don't have any first-hand experience with S4, I have to rely on others' evaluations. At a presentation by Ted Dunning of MapR, he indicated that S4 was designed for topologies and situations that are more complex than usually encountered in the real world. A Yahoo! researcher blogged an excellent comparison between S4 and Storm, and boils it down to guaranteed delivery (Storm) vs. state recovery in case of fault (S4).

Druid is different than the rest in that it's more than just streaming; it also provides a data store and querying system. It's an all-in-one solution whereas the others are usually married to Hadoop.

Druid breaks the storage into two parts, one for real-time (typically 30 minutes worth of data), and the other for permanent "deep storage". A single querying facade can query both stores transparently.

The bad news is that the querying is currently limited. With no SQL, queries must be formed using a stilted JSON syntax, and the ability to create complex queries is limited.

But querying is fast. In fact, it's Druid's hallmark. In response to the common problem of cubing data with high dimensionality, the Druid folks decided instead to just be able to query really fast by storing the data in a distributed RAM-based cluster. With fast querying, there's no need to pre-compute and pre-aggregate.

LinkedIn contributed Samza to Apache in September, 2013. It's new and I haven't tried it yet, but it appears to integrate Kafka and YARN more deeply than the others, and it maintains state similar to how Spark Streaming allows with its RDDs (resilient distributed dataset) and its updateStateByKey() method.

While all the others are for self-hosted infrastructure, Amazon Kinesis is strictly for the could. Again, since I haven't used it myself, I rely on another blogger who informs that Kinesis is restricted to a single procedure, and so isn't suited for complex processing. There is also no rollback capability.

Stream Computation Techniques

It's evident fairly quickly that the mean can be computed in an incremental manner. If mn is the mean of x1 ... xn, then mn+1 = (n*mn + xn+1)/(n+1).

Much less obvious is that it's possible to calculate the standard deviation incrementally. Donald Knuth credits a 1962 paper by Welford with a clever technique.

Even more clever is calculating cardinality incrementally, i.e. the number of distinct items in a set, i.e. SELECT COUNT DISTINCT(*) -- and doing so without having to keep the entire dataset in memory. The trick is to use a hash function that uniformly distributes to the range [0.0,1.0) and then keep track of the most number of leading zeroes. E.g. if the set as 100 items, probabilistically we would expect the smallest hash value to end up being around 0.01, or one leading zero. For a set of 1000 items, some hash values by chance would probably end up being less than that, and the smallest overall would probably end up being around 0.001, or two leading zeroes.

Finally, classic machine learning algorithms such as K-means and decision tree construction have been adapted to incremental versions for streaming data. Joao Gama, a Portuguese professor, has assembled them into an excellent book:

Applications

Basic metrics permit real-time monitoring for:

  • Fraud detection
  • Fault detection
  • Capacity overload detection

Streaming machine learning adds:

  • Intelligent automated adjustment of thresholds to the above
  • Ability to detect opportunities, not just problems. E.g. product affinity of noticing products long-time product A and brand-new product B could alert management to craft a marketing campaign for product B to those who have bought A in the past.

Conclusion

While "real-time data science" doesn't mean taking conventional, manual, deep data exploration and analysis and applying it to real-time data, what it does mean for data scientists is that they are increasingly feeling the pressure to have some results automated and real-time, and data scientists need to be comfortable with these streaming technologies and algorithms.