Pandas Creator: Probably Never Any Python-Centric Big Data Solutions

At the January 10 Data Day Texas 2015, Wes McKinney, creator of Pandas and author of the book Python for Data Analysis, concluded his presentation with a slide that said:

The time for a "dark horse" Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.

The reasons given include that Big Data is presently JVM-centric -- Java for Hadoop and Scala for Spark -- and Python is not a JVM language. Of course, IPython Notebook will continue to be popular and powerful, and as I've written several times before, with the proper beefy workstation, over 100TB of data can be handled without having to resort to cluster computing.

In a further bit of irony, IPython Notebook serves as the foundation for Spark Notebook, which is a fork of Scala Notebook.

So unless something changes drastically, when it comes to Big Data, Python will just be a bolt-on technology.


I've found the most innovative environments to be those that rely on a mix of technologies. Linux is written in C, Hadoop in Java. MapReduce in Java is the most native, but also the most miserable to write. We can generate it using Pig, or stream it through anything, say Python or R.
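To make "stream it through Python" concrete, here is a minimal word-count sketch in the map/sort/reduce shape that Hadoop Streaming expects. The function names (map_words, reduce_counts, word_count) are made up for illustration, and the job runs locally on a list of lines rather than on a cluster; on Hadoop the mapper and reducer would be separate scripts reading stdin and writing tab-separated lines to stdout.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_counts(sorted_pairs):
    """Reducer: sum the counts for each word.

    The pairs must arrive sorted by key, which is what the shuffle
    phase between map and reduce provides on a real cluster.
    """
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Local stand-in for the whole job: map, sort (the 'shuffle'), reduce."""
    return dict(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    sample = ["the quick fox", "The lazy dog"]
    for word, total in word_count(sample).items():
        print(f"{word}\t{total}")
```

The sorted() call is the part Hadoop does for you between the two phases; everything else is ordinary Python, which is the appeal of streaming over writing the job in Java.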

If we want to improve the speed we can go to Spark (written in Scala), probably embedded in more Scala or Python. We can get more speed out of Python by using its scientific and math libraries, or by compiling it.

If we want even more speed we can go with Impala on Hadoop (written in C++), which is right now the fastest solution on Hadoop. We're restricted to SQL, which is probably best for quite a bit of reporting but isn't flexible enough to handle much machine learning. Still, for some work like classification you can generate blazing-fast UDFs from Python to C, and you can pull the results of either the SQL or the UDFs into Python, R, or whatever.
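Pulling SQL results into Python usually looks the same whatever the engine, because Python database clients generally follow the DB-API 2.0 cursor interface (PEP 249). The sketch below uses the standard library's sqlite3 purely as a stand-in for an Impala connection, which is an assumption made so the example runs anywhere; the events table, its columns, and the fetch_rows and demo helpers are all hypothetical.

```python
import sqlite3


def fetch_rows(conn, query):
    """Run a query through a DB-API cursor and return the rows as dicts."""
    cur = conn.cursor()
    cur.execute(query)
    # cursor.description holds one tuple per result column; item 0 is the name.
    columns = [col[0] for col in cur.description]
    rows = [dict(zip(columns, row)) for row in cur.fetchall()]
    cur.close()
    return rows


def demo():
    """Build a tiny in-memory table and aggregate it with SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for a cluster connection
    conn.execute("CREATE TABLE events (label TEXT, score REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [("spam", 0.9), ("ham", 0.2), ("spam", 0.7)],
    )
    rows = fetch_rows(
        conn,
        "SELECT label, COUNT(*) AS n, AVG(score) AS avg_score "
        "FROM events GROUP BY label ORDER BY label",
    )
    conn.close()
    return rows


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The heavy lifting (grouping and averaging) stays in SQL, and Python only receives the small aggregated result, which is the division of labor described above.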

I like this flexibility. I've worked with shops that were very JVM-focused, and they seldom delivered much innovation. Too many blinders, too much focus on their own ecosystem, too much focus on applications and not enough on analysis. The combination of R, Python, Julia and Spark sounds a lot healthier to me.