This week Databricks announced GraphFrames, a graph processing library posted to spark-packages.org that is built on Spark SQL DataFrames rather than on RDDs (as GraphX is). GraphFrames is still a work in progress -- it is currently at version 0.1 -- so it provides interoperability with GraphX (graphs can be converted back and forth).
Last night, Tathagata Das resolved SPARK-11290, "Implement trackStateByKey for improved state management", which will bring a 7x performance improvement to Spark Streaming when Spark 1.6 is released in December 2015.
trackStateByKey() offers three benefits over updateStateByKey(), which has served as the workhorse of Spark Streaming since its inception in 2012:
Apache Spark itself
Spark originally came out of the Berkeley AMPLab, and even today AMPLab projects, though they are not part of the Apache Software Foundation, enjoy a status a bit above that of your everyday GitHub project.
Spark's own MLlib forms the bottom layer of the three-layer MLbase, with MLI as the middle layer and ML Optimizer as the most abstract layer.
The data scientist is dead. Long live data science!
Well, not dead, but certainly dying. Up until late 2012, the Google search popularity for "data scientist" tracked that for "data science", but it has sagged since.
This trend is even confirmed, though to a lesser degree, in Indeed.com job postings:
Why is this? I can think of three possible reasons:
Spark 1.5 was released today. Of the 1,516 Jira tickets that make up the 1.5 release, I have highlighted a few important ones below, broken down by major Spark component.
The first major phase of Project Tungsten (aside from a small portion that went into 1.4)