In a companion introduction to Kafka post, I observed that Kafka at its core is remarkably simple. Confluent offers a marchitecture diagram that illustrates what else is on offer, about which I’ll note:
- The red boxes — “Ops Dashboard” and “Data Flow Audit” — are the initial closed-source part. No surprise that they sound like management tools; that’s the traditional place for closed source add-ons to start.
- “Schema Management”
- Is used to define fields and so on.
- Is not equivalent to what is ordinarily meant by schema validation, in that …
- … it allows schemas to change, but puts constraints on which changes are allowed.
- Is done in plug-ins that live with the producer or consumer of data.
- Is based on the Hadoop-oriented file format Avro.
Kafka offers little in the way of analytic data transformation and the like. Hence, it’s commonly used with companion products.
- Per Confluent/Kafka honcho Jay Kreps, the companion is generally Spark Streaming, Storm or Samza, in declining order of popularity, with Samza running a distant third.
- Jay estimates that there’s such a companion product at around 50% of Kafka installations.
- Conversely, Jay estimates that around 80% of Spark Streaming, Storm or Samza users also use Kafka. On the one hand, that sounds high to me; on the other, I can’t quickly name a counterexample, unless Storm originator Twitter is one such.
- Jay’s views on the Storm/Spark comparison include:
- Storm is more mature than Spark Streaming, which makes sense given their histories.
- Storm’s distributed processing capabilities are more questionable than Spark Streaming’s.
- Spark Streaming is generally used by folks in the heavily overlapping categories of:
- Spark users.
- Analytics types.
- People who need to share stuff between the batch and stream processing worlds.
- Storm is generally used by people coding up more operational apps.
If we recognize that Jay’s interests are obviously streaming-centric, this distinction maps pretty well to the three use cases Cloudera recently called out.
Complicating this discussion further is Confluent 2.1, which is expected late this quarter. Confluent 2.1 will include, among other things, a stream processing layer that works differently from any of the alternatives I cited, in that:
- It’s a library running in client applications that can interrogate the core Kafka server, rather than …
- … a separate thing running on a separate cluster.
The library will do joins, aggregations and so on, and while relying on core Kafka for information about process health and the like. Jay sees this as more of a competitor to Storm in operational use cases than to Spark Streaming in analytic ones.
We didn’t discuss other Confluent 2.1 features much, and frankly they all sounded to me like items from the “You mean you didn’t have that already??” list any young product has.
- My October, 2014 post on Streaming for Hadoop is a sort of predecessor to this two-post series.