There’s much confusion about Cloudera’s SQL plans and beliefs, and the company has mainly itself to blame. That said, here’s what I think is going on.
- Hive is good at some tasks and terrible at others.
  - Hive is good at batch data transformation.
  - Hive is bad at ad-hoc query, unless you really, really need Hive’s scale and low license cost. One example, per Eli Collins: Facebook has a 500 petabyte Hive warehouse, but jokes that on a good day an analyst can run 6 queries against it.
- Impala is meant to be good at what Hive is bad at – i.e., fast-response query. (Cloudera mentioned reliable 100 millisecond response times for at least one user.)
- Impala is also meant to be good at what Hive is good at, and from Cloudera’s standpoint will someday completely supersede Hive. But Cloudera is in no hurry for that day to arrive. Hive is more mature. Hive still has more SQL coverage than Impala. There’s a lot of legacy investment in Hive. Cloudera gets little business advantage if a customer sunsets Hive.
- Impala is already decent at some tasks analytic RDBMS are commonly used for. Cloudera insists that some queries run very quickly on Impala. I believe them.
- Impala is terrible at others, including some of the ones most closely associated with the concept of “data warehousing”. Data modeling is a big zero right now. Impala’s workload management, concurrency and all that are very immature.
- There are some use cases for which SQL-on-Hadoop blows away analytic RDBMS, for example ones involving data transformations – perhaps on multi-structured data – that are impractical in RDBMS.
And of course, as vendors so often do, Cloudera generally overrates both the relative maturity of Impala and the relative importance of the use cases in which its offerings – Impala or otherwise – shine.
Spark is on the rise, to an even greater degree than I thought last month.
- Numerous clients and other companies I talk with have adopted Spark, plan to adopt Spark, or at least think it’s likely they will. In particular:
  - A number of analytic-stack companies are joining ClearStory in using Spark. Most of the specifics are confidential, but I hope some will be announced soon.
  - MapR has joined Cloudera in supporting Spark, and indeed, unlike Cloudera, is supporting the full Spark stack.
- Mike Olson of Cloudera is on record as predicting that Spark will be the replacement for Hadoop MapReduce. Just about everybody seems to agree, except perhaps for Hortonworks folks betting on the more limited and less mature Tez. Spark’s biggest technical advantages as a general data processing engine are probably:
  - The Directed Acyclic Graph processing model. (Any serious MapReduce-replacement contender will probably echo that aspect.)
  - A rich set of programming primitives in connection with that model.
  - Support also for highly-iterative processing, of the kind found in machine learning.
  - Flexible in-memory data structures, namely the RDDs (Resilient Distributed Datasets).
  - A clever approach to fault-tolerance.
- Spark is a major contender in streaming.
- There’s some cool machine-learning innovation using Spark.
- Spark 1.0 will drop by mid-May, Apache voters willin’ an’ the creek don’ rise. Publicity will likely ensue, with strong evidence of industry support.*
*Yes, my fingerprints are showing again.
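To make the DAG, RDD and fault-tolerance points in the list above a bit more concrete, here is a toy sketch in plain Python. It is emphatically not Spark's API: the class `ToyDataset` and its methods `map`, `filter`, `collect` and `lose_cache` are hypothetical names made up for illustration. What it tries to show is the general idea that each dataset remembers its lineage (its parent and the operation that derives it), that nothing is computed until a result is requested, and that a lost in-memory copy can be rebuilt from lineage rather than restored from a replica.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for an RDD: one node in a DAG of transformations."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyDataset"] = None,
                 op: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source    # set only on root nodes
        self.parent = parent    # lineage: which dataset this one derives from
        self.op = op            # lineage: how to derive it from the parent
        self.cache: Optional[List[int]] = None  # in-memory copy, may be lost
        self.computations = 0   # how many times this node was (re)computed

    def map(self, f: Callable[[int], int]) -> "ToyDataset":
        # Lazy: only records the transformation, computes nothing yet.
        return ToyDataset(parent=self, op=lambda rows: [f(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        # Lazy: only records the transformation, computes nothing yet.
        return ToyDataset(parent=self,
                          op=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List[int]:
        # Return the in-memory copy if present, else rebuild from lineage.
        if self.cache is not None:
            return self.cache
        self.computations += 1
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.op(self.parent.collect())
        self.cache = rows
        return rows

    def lose_cache(self) -> None:
        # Simulate losing the in-memory data, e.g. because a node failed.
        self.cache = None


numbers = ToyDataset(source=[1, 2, 3, 4, 5, 6])
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

first = even_squares.collect()    # walks the DAG and computes each node once
even_squares.lose_cache()         # simulated failure
second = even_squares.collect()   # recomputed from the still-cached parent
```

In this toy, only the lost dataset is recomputed after the simulated failure; its parent's in-memory copy is reused. A real engine does this per partition across a cluster, which the sketch makes no attempt to model.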
The most official description of what Spark now contains is probably the “Spark ecosystem” diagram from Databricks. However, at the time of this writing it is slightly out of date, per an email from Databricks CEO Ion Stoica (quoted with permission):
… but if I were to redraw it, SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB.
With this change, all the modules on top of Spark (i.e., SparkStreaming, SparkSQL, GraphX, and MLlib) are part of the Spark distribution. You can think of these modules as libraries that come with Spark.
In an unfortunate non-development, Tachyon is not (yet?) part of Spark, and so it is hard for a Spark job’s data to be shared with other jobs (Spark or otherwise) or processes. That said:
- The tight integration of data structures and processes gives similar performance benefits to those of in-process vs. out-of-process in-database analytic functions. (It also of course raises similar stability concerns, but those seem less important in the case of Spark than of a true DBMS.)
- From a Hadoop vendor’s standpoint, Tachyon’s benefit of not requiring HDFS (Hadoop Distributed File System) isn’t important, and Tachyon somewhat conflicts with a newish effort called HDFS Caching.
A couple of Spark machine learning stories are very cool, in that they involve intra-day retraining of models. The better-known one is Yahoo’s, which in a prototype built in 120 lines of code trains a new model for recommendation of each candidate top-page news story. When I challenged that anecdote, Ion told me about his own former company Conviva, which retrains models every minute to decide which particular source of streaming video each client system will be connected to.
I am generally skeptical of immature SQL efforts, and SparkSQL is no exception. That said, it seems to be going in sensible directions, which should be welcome to those folks who used or were planning to use Shark anyway.
- SparkSQL actually has its own optimizer, rather than using the inappropriate Hive one. As with many new optimizers, it’s starting out rule-based, but is planned to become cost-based down the road.
- SparkSQL can run queries against data that’s either inside Spark or outside-but-accessible.
- SparkSQL can be accessed via Python and other APIs.
- Spark works with the Hive metastore, née HCatalog.
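For readers who haven't met the rule-based vs. cost-based distinction mentioned in the list above, here is a minimal sketch of what "rule-based" means, in plain Python. It is not SparkSQL's optimizer or anything derived from it. The plan node types (`Scan`, `Project`, `Filter`), the single rule `push_filter_below_project`, and the driver functions `rewrite` and `optimize` are all hypothetical, and predicates are simplified to "column > value". The rule fires whenever its pattern matches, without consulting statistics or estimating cost.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Filter:
    column: str   # hypothetical simplification: predicate is "column > value"
    value: int
    child: "Plan"


Plan = Union[Scan, Project, Filter]


def push_filter_below_project(plan: Plan) -> Plan:
    """Rule: Filter(Project(x)) becomes Project(Filter(x))."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        # Safe in this toy because Project only selects columns;
        # it never renames them or computes new ones.
        return Project(proj.columns,
                       Filter(plan.column, plan.value, proj.child))
    return plan


def rewrite(plan: Plan, rules: List[Callable[[Plan], Plan]]) -> Plan:
    # One bottom-up pass: rewrite the child first, then try each rule here.
    if isinstance(plan, Project):
        plan = Project(plan.columns, rewrite(plan.child, rules))
    elif isinstance(plan, Filter):
        plan = Filter(plan.column, plan.value, rewrite(plan.child, rules))
    for rule in rules:
        plan = rule(plan)
    return plan


def optimize(plan: Plan, rules: List[Callable[[Plan], Plan]]) -> Plan:
    # Repeat passes until no rule changes the plan (a fixed point).
    while True:
        new_plan = rewrite(plan, rules)
        if new_plan == plan:
            return plan
        plan = new_plan


query = Filter("age", 30, Project(("name", "age"), Scan("people")))
optimized = optimize(query, [push_filter_below_project])
```

After optimization the filter sits directly on top of the scan, so fewer rows reach the projection. A cost-based optimizer would instead estimate the cost of alternative plans, typically from table statistics, and pick the cheapest; nothing in this sketch does that.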
And finally, there’s no public news as to what Databricks’ own business is. I think that’s a bit silly, but in fairness:
- The Spark 1.0 launch will consume every bit of marketing bandwidth they have.
- They don’t yet want to commit to a delivery date for their first offering.