Apache Spark 1.0 almost here. Is it ready with 16 (*) "unresolved blockers" in Jira? (* UPDATED x2)
UPDATE 2014-05-20: Matei Zaharia commented on the issue of combining map() and lookup(), stating that nested RDD operations are not within the current design and that it's a feature he'd like to see added in the future. In the same Jira ticket, I had posted a workaround using join(), which works fine.
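For illustration only, here is a minimal sketch of that join()-based approach, using made-up data (the RDD names and values are mine, not taken from the ticket), entered in the Spark Shell where sc and the pair-RDD implicits are already available:

```scala
// Hypothetical pair RDDs keyed by a customer id.
val orders = sc.parallelize(Seq((1, "widget"), (2, "gadget")))  // (id, item)
val names  = sc.parallelize(Seq((1, "Alice"),  (2, "Bob")))     // (id, name)

// Instead of calling names.lookup(id) inside orders.map(...),
// join the two RDDs on the key and reshape the pairs with map().
val labeled = orders.join(names).map { case (_, (item, name)) => (name, item) }
labeled.collect().foreach(println)   // (Alice,widget), (Bob,gadget)
```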
UPDATE 2014-05-19: The Spark team would like two points clarified:
1. All but two of the "blockers" are targeted at future releases (e.g. Spark 1.1), not at 1.0.
2. Submitters can set their own priority level, and some may have marked the issues they submitted as "blocker" without coordinating with anyone else.
These are important clarifications to my off-the-cuff blog headline, but in my opinion the deeper point still stands. One Spark contributor noted that with the surge in interest in Spark, there hasn't been enough time to even triage the incoming reports (which is actually consistent with point #2 above from the Spark team). I elaborated further on the psychological importance of "1.0" directly on the Spark dev mailing list.
I just want to reiterate how enthusiastic I continue to be about Spark, as I have been for the past 15 months. One of my concerns is that corporate VPs will perceive "1.0" as a green light to implement, then experience issues they weren't expecting, possibly leading to a loss of reputation for Spark.
Apache Spark 1.0 is to be released any day now; currently "release candidate 6 (rc6)" is being evaluated and will be voted upon imminently. But is it ready?
There are currently 16 issues marked as "unresolved blockers" in Jira for Spark, at least one of which is known to produce erroneous data results.
Then there is the state of the REPL, the interactive Spark Shell recently lauded for making Spark accessible to data scientists rather than just hard-core software developers. Because the Spark Shell wraps every user-entered command and class to do its interactive magic, some basic Spark functions fail to work: lookup(), for example, and anything that relies on equals() for a compound key (i.e., a custom Scala class as the key rather than just a String or Int) in groupByKey() and the other combineByKey() derivatives. It even affects map(), the most fundamental of all functional programming operations.
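To make the compound-key symptom concrete, here is a minimal sketch with a made-up case class (StationDay is my example, not one taken from the bug reports), entered at the spark-shell prompt:

```scala
// Entered in the Spark Shell, where `sc` is already defined.
case class StationDay(station: String, day: Int)   // a compound key

val readings = sc.parallelize(Seq(
  (StationDay("KNYC", 1), 61.0),
  (StationDay("KNYC", 1), 59.0),
  (StationDay("KSFO", 1), 55.0)))

// In a compiled program this produces one group per distinct key.
// In the shell, the REPL's wrapping of the case class can break the
// equals()/hashCode() behavior the shuffle relies on, so records with
// supposedly equal keys may land in separate groups.
readings.groupByKey().collect().foreach(println)
```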
Even putting the REPL aside and writing full-fledged programs in Scala, Spark's native language, simple combinations such as map() and lookup() throw exceptions.
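As a concrete (and again hypothetical) example of the kind of combination I mean, the following compiles cleanly as a standalone program but throws at run time, because lookup() is a driver-side action being invoked inside a closure that executes on the workers; the join() workaround in the update above avoids the problem:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.0

object NestedLookupExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("nested-lookup").setMaster("local[2]"))

    val orders = sc.parallelize(Seq((1, "widget"), (2, "gadget")))  // (id, item)
    val names  = sc.parallelize(Seq((1, "Alice"),  (2, "Bob")))     // (id, name)

    // Compiles, but fails at run time: the closure passed to map() runs on
    // the executors, where the nested reference to the `names` RDD is not usable.
    val labeled = orders.map { case (id, item) => (names.lookup(id).head, item) }
    labeled.collect()

    sc.stop()
  }
}
```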
Don't get me wrong. Spark is a great platform, and is where it should be after two years of open source development. It's the "1.0" badge that I object to. It feels more like a 0.9.2 release.