The Ideal Data Science Programming Language

This past week, I participated in a panel at the Rocky Mountain High Performance Computing Symposium discussing the future of HPC software.

But really, the HPC and Big Data & Data Science worlds are converging; as I blogged, that was the emerging consensus at the international SC13 conference.

So the question is, what would be the ideal programming language to address HPC and Data Science problems? The ideal language would have the following qualities:

  • Exploit data locality in a cluster, as Hadoop Map/Reduce (Java) and Spark/Scala do
  • Utilize RAM in a cluster, as Spark/Scala does, and not just disk, as Hadoop Map/Reduce does (see the first sketch after this list)
  • Be object/functional like Spark/Scala (and, to a lesser extent, like Javascript and Python)
  • Have a Mathematica-style notebook like IPython Notebook and Databricks cloud
  • Use the same language for visualization as for number crunching. IPython Notebook has this if one is willing to live with Matplotlib, but a second language, Javascript, is required if d3.js is needed for custom visualizations. Databricks cloud demoed some nice visualizations, but since it is still in closed beta, it is not known how customizable they are in Scala.
  • Have a large machine learning library available, as R does and as Spark/Scala does with MLlib (see the MLlib sketch after this list).
  • Be supported directly in a majority of browsers (like Javascript is) so that the same visualization code can be immediately deployed to the world's clients
  • Be compilable, so that the same number-crunching (and machine learning) code can be immediately deployed to servers
  • Have resiliency, to keep on computing even if some nodes in a cluster go down. This is a must for exascale computing. Spark/Scala has this to a degree already with RDDs (resilient distributed datasets).
  • Support metaprogramming like Julia and LISP, to enable expressions and bits of code to be stored as data. Exposing code as data is something LISP had down over half a century ago, and it has since been mimicked in less-than-first-class ways by other languages (until Julia), such as business rule languages and the ASM Java bytecode manipulation library that Spark/Scala uses (see the code-as-data sketch after this list).
  • Use a simple reference-counting garbage collector instead of the mark-sweep approach of the JVM, which pauses processing periodically and effectively limits JVM heaps to around 100GB of RAM. This would avoid kludges like Tachyon for Spark/Scala. Simple reference counting won't catch cyclically-referenced structures, leading to memory leaks if code is buggy, but that is the price to be paid for predictable performance and for handling today's larger memory sizes.
  • Support GPUs. This has been discussed in the context of Spark/Scala but not implemented.
  • Support FPGAs. Velocidata has a drag-and-drop dataflow programming system, but it is very expensive and not publicly extensible.
  • Present high-level operators to the programmer, and let a global optimizer handle optimization, data placement, and shuffling. That is, avoid the current HPC paradigm of programmers manually declaring parallelism with OpenMP and declaring node assignment and message passing with MPI, and avoid the less-than-transparent parallelism over a cluster that R and Python offer as bolt-ons. In theory, Spark/Scala offers this and has the potential to plug in better optimizers. In the ideal world, we would have a situation similar to that in the SQL world: if the optimizer doesn't do a great job, the programmer can do an EXPLAIN PLAN and then, in response, add hints to the query, restructure the query, or add indexes (see the lineage sketch after this list for a rough Spark analog).
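
To make the in-memory and functional points concrete, here is a minimal Spark/Scala sketch; the HDFS path and field layout are made up for illustration. The data is partitioned across the cluster with tasks scheduled near their blocks, cached in RAM rather than re-read from disk, and transformed with high-level functional operators; if an executor is lost, Spark recomputes the missing partitions from the RDD lineage.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD operations such as reduceByKey

    object InMemorySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("InMemorySketch"))

        // Hypothetical HDFS file; tasks are scheduled near its blocks (data locality).
        val lines = sc.textFile("hdfs:///data/measurements.csv")

        // High-level functional operators instead of explicit loops or message passing.
        val readings = lines.map(_.split(","))
                            .map(fields => (fields(0), fields(1).toDouble))
                            .cache()   // keep the parsed pairs in cluster RAM

        // Reused across multiple actions without re-reading from disk;
        // lost partitions are recomputed from lineage if a node goes down.
        println(readings.count())
        readings.reduceByKey((a, b) => math.max(a, b))
                .collect()
                .take(5)
                .foreach(println)

        sc.stop()
      }
    }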
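
For the machine learning point, Spark already ships MLlib; a hedged sketch of clustering with it follows, where the input path, feature format, and parameter values are invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("KMeansSketch"))

        // Hypothetical file of space-separated numeric features, one point per line.
        val points = sc.textFile("hdfs:///data/points.txt")
                       .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
                       .cache()

        // Cluster into 10 groups over 20 iterations; the numbers are arbitrary.
        val model = KMeans.train(points, 10, 20)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }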
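
Scala is neither LISP nor Julia, but as a rough sketch of the code-as-data idea, its reflection toolbox (assuming scala-reflect and scala-compiler are on the classpath) can hold an expression as a tree, inspect it, and evaluate it later, much like a stored business rule:

    import scala.reflect.runtime.currentMirror
    import scala.reflect.runtime.universe.showRaw
    import scala.tools.reflect.ToolBox

    object CodeAsDataSketch {
      def main(args: Array[String]): Unit = {
        val tb = currentMirror.mkToolBox()

        // The expression is held as data (an abstract syntax tree), not yet executed.
        val tree = tb.parse("(1 to 10).map(x => x * x).sum")

        // The tree can be inspected or rewritten before it is ever run...
        println(showRaw(tree))

        // ...and evaluated on demand.
        println(tb.eval(tree))   // 385
      }
    }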
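
There is no EXPLAIN PLAN for Spark today, but as a rough analog, an RDD's toDebugString prints the lineage of stages the scheduler will run, which is a starting point for the inspect-then-restructure workflow described above (the corpus path here is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD operations such as reduceByKey

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LineageSketch"))

        val wordCounts = sc.textFile("hdfs:///data/corpus.txt")
                           .flatMap(_.split("\\s+"))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)

        // Roughly analogous to EXPLAIN PLAN: dump the chain of RDDs/stages Spark
        // will execute before deciding whether to repartition or restructure.
        println(wordCounts.toDebugString)

        sc.stop()
      }
    }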

Spark/Scala satisfies most of the above, with the three major caveats of being tied to the JVM garbage collector, not being able to run natively in browsers, and FPGA support being on nobody's radar. The surmountable hurdles are the notebook interface (just have to wait for the Databricks cloud to exit closed beta status) and global performance optimization (not an easy problem, but not intractable given the open source status of Spark).