Using RDBMS for Data Science?
On 8 Dec, 2014 By michaelmalak 0 Comments
Can one use an RDBMS for Data Science? There are some situations where it makes sense, if all of the following conditions hold:
- Your data starts its life in an RDBMS anyway. If, on the other hand, your data is instead in flat files or coming from a data stream, you probably should build for the future by building a Hadoop/Spark system. Or just leave it in flat files and process it on a beefy workstation (see below). An RDBMS is going to need to index the data, and if the data is not already in there, it's a very long process because most RDBMS's don't utilize more than one processor core at a time to create an index, and thus the process can take days.
- You have a beefy workstation to do the analysis. A blog earlier this year compared Impala vs. PostgreSQL performance, and it showed Impala being more than 200x as fast as PostgreSQL at doing a COUNT DISTINCT. I dislike this comparison for two reasons:
- First, the data size was only 10GB. With consumer desktop computers commonly sporting 32GB, or even 64GB, it would be reasonable to outfit an analysis computer with 256GB.
- Second, in the particular example of COUNT DISTINCT, a hash-based cardinality algorithm would perform much faster
- You plan doing the machine learning on said beefy workstation. While some RDBMS's have basic machine learning libraries available, such as data mining, most data scientists will expect to have the standard full complement of algorithms that are available from R or IPython Notebook.
- You're not streaming data. An RDBMS can accept 2000 INSERTs/second. A Kafka queue (itself a distributed cluster), in contrast, can accept 2 million insertions per second, and these in turn can be consumed in a distributed fashion by either a Storm or Spark Streaming cluster. I estimate that the clickstream from a top-100 website would exceed the 2000 events/second limit of an RDBMS. But then we're not all running a top-100 website. Which is the point of this blog post -- there are some situations where an RDBMS is OK. But if your goal is to be a top-100 website, maybe you'd better plan ahead :-).
- Your total data size is less than 144 TB. An RDBMS is for non-big data; i.e., without having to resort to sharding it needs to fit on a single machine. 24 6TB drives comes to 144TB.
- The size of your data subsets can be efficiently processed by the cores and/or GPUs in your beefy workstation. While the total data size may be 144TB, you're typically going to be working with data subsets that are much smaller. Not only should those data subsets be able to fit in the, say 256GB, RAM of your beefy workstation, they should also be small enough that the number of CPU cores (perhaps 10 cores times two processors for a total of 20 cores) can efficiently perform machine learning on them. Or you have a GPGPU and have identified machine learning software that can make effective use of it.
Advantages of an RDBMS
- SQL. SQL is wonderful. Just look at the Hadoop world. The holy grail has always been SQL-on-Hadoop.
- Performant way to subset the data. One of the most common tasks in data science is selecting a subset of the data to work on. This is painfully slow in Hive. If you have an indexed RDMBS, it can be very quick to read out the subset of the data into a beefy R workstation.
The Non-RDBMS Alternative
Again, just to be clear, the typical non-RDBMS alternative is a Hadoop/Spark cluster. A Hadoop/Spark cluster is distributed and intrinsically has a high-degree of parallelism and use of RAM. The "beefy workstation" mentioned in at least three of the bullets above attempts to compensate for the lack of a Spark cluster with prodigious amounts of RAM, number of cores, and machine learning software packages.
But for non-big data, and especially if your data is already nicely indexed on an RDBMS, Hadoop and Spark could be complicating factors rather than solutions.