I talked with a couple of Cloudera folks about HBase last week. Let me frame things by saying:
- The closest thing to an HBase company, à la MongoDB/MongoDB or DataStax/Cassandra, is Cloudera.
- Cloudera still uses a figure of 20% of its customers being HBase-centric.
- HBaseCon and so on notwithstanding, that figure isn’t really reflected in Cloudera’s marketing efforts. Cloudera’s marketing commitment to HBase has never risen to nearly the level of MongoDB’s or DataStax’s push behind their respective core products.
- With Cloudera’s move to “zero/one/many” pricing, Cloudera salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers.
- Cloudera no longer dominates HBase development, if it ever did.
- Cloudera is the single biggest contributor to HBase, by its count, but doesn’t make a majority of the contributions on its own.
- Cloudera sees Hortonworks as having become a strong HBase contributor.
- Intel is also a strong contributor, as are end user organizations such as Chinese telcos. Not coincidentally, Intel was a major Hadoop provider in China before the Intel/Cloudera deal.
- As far as Cloudera is concerned, HBase is just one data storage technology of several, focused on high-volume, high-concurrency, low-latency short-request processing. Cloudera thinks this is OK because of HBase’s strong integration with the rest of the Hadoop stack.
- Others who may be inclined to disagree are in several cases doing projects on top of HBase to extend its reach. (In particular, please see the discussion below about Apache Phoenix and Trafodion, both of which want to offer relational-like functionality.)
Cloudera’s views on HBase history — in response to the priorities I brought to the conversation — include:
- HBase initially favored consistency over performance/availability, while Cassandra initially favored the opposite choice. Both products, however, have subsequently become more tunable in those tradeoffs.
- Cloudera’s initial contributions to HBase focused on replication, disaster recovery and so on. I guess that could be summarized as “scaling”.
- Hortonworks’ early HBase contributions included (but were not necessarily limited to):
- Making recovery much faster (tens of seconds or less, rather than minutes or more).
- Some of that consistency vs. availability tuning.
- “Coprocessors” were added to HBase ~3 years ago, to add extensibility, with the first use being in security/permissions.
- With more typical marketing-oriented version numbers:
- HBase 0.90, the first release that did a good job on durability, could have been 1.0.
- HBase 0.92 and 0.94, which introduced coprocessors, could have been 2.0.
- HBase 0.96 and 0.98 could have been 3.0.
- The recent HBase 1.0 could have been 4.0.
The HBase roadmap includes:
- A kind of BLOB/CLOB (Binary/Character Large OBject) support.
- Intel is heavily involved in this feature.
- The initial limit is 10 megabytes or so, due to some limitations in the API (I didn’t ask why that made sense). This happens to be all the motivating Chinese customer needs for the traffic photographs it wants to store.
- Various kinds of “multi-tenancy” support (multi-tenancy is one of those terms whose meaning is getting stretched beyond recognition), including:
- Mixed workload support (short-request and analytic) on the same nodes.
- Mixed workload support on different nodes in the same cluster.
- Security between different apps in the same cluster.
- (Still in the design phase) Bottleneck Whack-A-Mole, with goals including but not limited to:
- Scale-out beyond the current assumed limit of ~1200 nodes.
- More predictable performance, based on smaller partition sizes.
- (Possibly) Multi-data-center fail-over.
Not on the HBase roadmap per se are global/secondary indexes. Rather, we talked about projects on top of HBase which are meant to provide those. One is Apache Phoenix, which supposedly:
- Makes it simple to manage compound keys. (E.g., City/State/ZipCode)
- Provides global secondary indexes (but not in a fully ACID way).
- Offers some very basic JOIN support.
- Provides a JDBC interface.
- Offers efficiencies in storage utilization, scan optimizations, and aggregate calculations.
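To make the compound-key point concrete, here is a minimal Python sketch of how several fields can be packed into one sortable row key, so that a prefix scan finds everything sharing the leading fields. This is not Phoenix's actual encoding. The separator byte, the function names and the sample values are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical separator; assumes it never occurs inside a field value.
SEP = b"\x00"


def encode_row_key(*fields: str) -> bytes:
    """Join the fields, most significant first, into one byte-sortable key."""
    return SEP.join(field.encode("utf-8") for field in fields)


def decode_row_key(key: bytes) -> Tuple[str, ...]:
    """Split a compound key back into its original fields."""
    return tuple(part.decode("utf-8") for part in key.split(SEP))


def prefix_scan(keys: Iterable[bytes], *leading_fields: str) -> List[bytes]:
    """Return, in sorted order, the keys whose leading fields match."""
    prefix = encode_row_key(*leading_fields) + SEP
    return [key for key in sorted(keys) if key.startswith(prefix)]


keys = [
    encode_row_key("MA", "Boston", "02108"),
    encode_row_key("MA", "Acton", "01720"),
    encode_row_key("NY", "Albany", "12207"),
]
massachusetts = prefix_scan(keys, "MA")
```

The field order decides which scans are cheap. With the state first, "everything in MA" is one contiguous range, while "every ZIP code starting with 0" is not.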
Another such project is Trafodion — supposedly the Welsh word for “transaction” — open sourced by HP. This seems to be based on NonStop SQL and Neoview code, which counter-intuitively have always been joined at the hip.
There was a lot more to the conversation, but I’ll stop here for two reasons:
- This post is pretty long already.
- I’m reserving some of the discussion until after I’ve chatted with vendors of other NoSQL systems.
My July 2011 post on HBase offers context, as do the comments on it.
I found yesterday’s news quite unpleasant.
- A guy I knew and had a brief rivalry with in high school died of colon cancer, a disease that I’m at high risk for myself.
- GigaOm, in my opinion the best tech publication — at least for my interests — shut down.
- The sex discrimination trial around Kleiner Perkins is undermining some people I thought well of.
So I want to unclutter my mind a bit. Here goes.
1. There are a couple of stories involving Sam Simon and me that are too juvenile to tell on myself, even now. But I’ll say that I ran for senior class president, in a high school where the main way to campaign was via a single large poster, against a guy with enough cartoon-drawing talent to be one of the creators of The Simpsons. Oops.
2. If one suffers from ulcerative colitis as my mother did, one is at high risk of getting colon cancer, as she also did. Mine isn’t as bad as hers was, due to better tolerance for medication controlling the disease. Still, I’ve already had a double-digit number of colonoscopies in my life. They’re not fun. I need another one soon; in fact, I canceled one due to the blizzards.
Pro-tip — never, ever have a colonoscopy without some kind of anesthesia or sedation. Besides the unpleasantness, the lack of meds increases the risk that the colonoscopy will tear you open and make things worse. I learned that the hard way in New York in the early 1980s.
3. Five years ago I wrote optimistically about the evolution of the information ecosystem, specifically using the example of the IT sector. One could argue that I was right. After all:
- Gartner still seems to be going strong.
- O’Reilly, Gartner and vendors probably combine to produce enough good conferences.
- A few traditional journalists still do good work (in the areas covered by this blog, Doug Henschen comes to mind).
- A few vendor folks are talented and responsible enough to add to the discussion. A few small-operation folks — e.g. me — are still around.
Still, the GigaOm news is not encouraging.
4. As TechCrunch and Pando reported, plaintiff Ellen Pao took the stand and sounded convincing in her sexual harassment suit against Kleiner Perkins (but of course she hadn’t been cross-examined yet). Apparently there was a major men-only party hosted by partner Al Gore, a candidate I first supported in 1988. And partner Ray Lane, somebody who at Oracle showed tremendous management effectiveness, evidently didn’t do much to deal with Pao’s situation.
At some point I want to write about a few women who were prominent in my part of the tech industry in the 1980s — at least Ann Winblad, Esther Dyson, and Sandy Kurtzig, maybe analyst/investment banker folks Cristina Morgan and Ruthann Quindlen as well. We’ve come a long way since those days (when, in particular, I could briefly list a significant fraction of the important women in the industry). There seems to be a lot further yet to go.
5. All that said, I’m indeed working on some cool stuff. Some is evident from recent posts. Others may be reflected in an upcoming set of posts that focus on NoSQL, business intelligence, and (I hope) the intersection of the two areas.
6. Speaking of recent posts, I did one on marketing for young companies that brings a lot of advice and tips together. I think it’s close to being a must-read.
- Continuuity toured in 2012 and touted its “app server for Hadoop” technology.
- Continuuity recently changed its name to Cask and went open source.
- Cask’s product is now called CDAP (Cask Data Application Platform). It’s still basically an app server for Hadoop and other “big data” — ouch do I hate that phrase — data stores.
- Cask and Cloudera partnered.
- I got a more technical Cask briefing this week.
- App servers are a notoriously amorphous technology. The focus of how they’re used can change greatly every couple of years.
- Partly for that reason, I was unimpressed by Continuuity’s original hype-filled positioning.
So far as I can tell:
- Cask’s current focus is to orchestrate job flows, with lots of data mappings.
- This is supposed to provide lots of developer benefits, for fairly obvious reasons. Those are pitched in terms of an integration story, more in a “free you from the mess of a many-part stack” sense than strictly in terms of data integration.
- CDAP already has a GUI to monitor what’s going on. A GUI to specify workflows is coming very soon.
- CDAP doesn’t consume a lot of cycles itself, and hence isn’t a real risk for unpleasant overhead, if “overhead” is narrowly defined. Rather, performance drags could come from …
- … sub-optimal choices in data mapping, database design or workflow composition.
I didn’t push the competition point hard (the call was generally a bit rushed due to a hard stop on my side), but:
- Cask thinks it doesn’t have much in the way of exact or head-to-head competitors, but cites Spring and WibiData/Kiji as coming closest.
- I’d think that data integration vendors who use Hadoop as an execution engine (Informatica, Syncsort and many more) would be in the mix as well.
- Cask disclaimed competition with Teradata Revelytix, on the theory that Cask is focused on operational/“real-time” use cases, while Revelytix Loom is focused on data science/investigative analytics.
To reiterate part of that last bullet — like much else we’re hearing about these days, CDAP is focused on operational apps, perhaps with a streaming aspect.
To some extent CDAP can be viewed as restoring the programmer/DBA distinction to the non-SQL and streaming worlds. That is:
- Somebody creates a data mapping “pattern”.
- Programmers (including perhaps the creator) write to that pattern.
- Somebody (perhaps the creator) tweaks the mapping to optimize performance, or to reflect changes in the underlying data management.
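The division of labor above can be sketched in a few lines of Python. To be clear, this is not CDAP's API. The `Store` interface, the two stand-in stores and the `UserEmailPattern` class are all hypothetical names I made up to illustrate the idea that application code writes to the pattern, while the mapping underneath can be swapped.

```python
from typing import Dict, Optional, Protocol


class Store(Protocol):
    """Anything a pattern can be mapped onto (hypothetical interface)."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class FlatStore:
    """Stand-in for one data store: a single flat key space."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self.data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self.data.get(key)


class NestedStore:
    """Stand-in for a differently organized store: rows holding columns."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = {}

    def put(self, key: str, value: str) -> None:
        row, _, column = key.partition(":")
        self.rows.setdefault(row, {})[column] = value

    def get(self, key: str) -> Optional[str]:
        row, _, column = key.partition(":")
        return self.rows.get(row, {}).get(column)


class UserEmailPattern:
    """The mapping 'pattern': the only thing application code sees."""

    def __init__(self, store: Store) -> None:
        self._store = store

    def save(self, user_id: str, email: str) -> None:
        self._store.put(f"{user_id}:email", email)

    def load(self, user_id: str) -> Optional[str]:
        return self._store.get(f"{user_id}:email")


def register_user(pattern: UserEmailPattern, user_id: str, email: str) -> Optional[str]:
    """Application code, written against the pattern only."""
    pattern.save(user_id, email)
    return pattern.load(user_id)


flat_result = register_user(UserEmailPattern(FlatStore()), "u1", "a@example.com")
nested = NestedStore()
nested_result = register_user(UserEmailPattern(nested), "u1", "a@example.com")
```

`register_user` is the programmer's part and never changes. Choosing `FlatStore` or `NestedStore`, or tuning either one, is the DBA-like part.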
Further notes on CDAP data access include:
- Cask is proud that a pattern can literally be remapped from one data store to another, although I wonder how often that is likely to happen in practice.
- Also, a single “row” can reference multiple data stores.
- Cask’s demo focused on imposing a schema on a log file, something you might do incrementally as you decide to extract another field of information. This is similar to major use cases for schema-on-need and for Splunk.
- For most SQL-like access and operations, CDAP relies on Hive, even to external data stores or non-tabular data. Cask is working with Cloudera on Impala access.
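The log-file demo idea, imposing a schema incrementally as you decide to extract another field, can be illustrated with a small Python sketch. This is my own toy version rather than anything Cask showed. The `LogSchema` class, the log line format and the regular expressions are assumptions.

```python
import re
from typing import Dict, Optional


class LogSchema:
    """A schema applied at read time; fields can be added whenever needed."""

    def __init__(self) -> None:
        self._fields: Dict[str, "re.Pattern[str]"] = {}

    def add_field(self, name: str, pattern: str) -> None:
        """Register a field; the first capture group of `pattern` is its value."""
        self._fields[name] = re.compile(pattern)

    def apply(self, line: str) -> Dict[str, Optional[str]]:
        """Extract every known field from one raw log line."""
        row: Dict[str, Optional[str]] = {}
        for name, pattern in self._fields.items():
            match = pattern.search(line)
            row[name] = match.group(1) if match else None
        return row


line = "2015-03-10 12:00:01 GET /index.html 200"

schema = LogSchema()
schema.add_field("timestamp", r"^(\S+ \S+)")
first_pass = schema.apply(line)

# Later, somebody decides the HTTP status matters too. The raw log is
# untouched; only the schema grows.
schema.add_field("status", r" (\d{3})$")
second_pass = schema.apply(line)
```

Nothing is reloaded or rewritten when the second field is added, which is the whole appeal of the schema-on-need approach.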
Examples of things that Cask supposedly makes easy include:
- Chunking streaming data by time (e.g. 1 minute buckets).
- Generating database stats (histograms and so on).
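For a sense of what those two examples involve, here is a minimal Python sketch that chunks timestamped events into 1 minute buckets and then derives a simple histogram of events per bucket. It says nothing about how CDAP does this internally. The function names and the `(timestamp, payload)` event shape are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_start(timestamp: float, width_seconds: int = 60) -> int:
    """Round a timestamp (in seconds) down to the start of its bucket."""
    return int(timestamp // width_seconds) * width_seconds


def chunk_by_time(
    events: Iterable[Tuple[float, str]], width_seconds: int = 60
) -> Dict[int, List[str]]:
    """Group (timestamp, payload) events by the bucket they fall into."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        buckets[bucket_start(timestamp, width_seconds)].append(payload)
    return dict(buckets)


def histogram(buckets: Dict[int, List[str]]) -> Dict[int, int]:
    """Count the events in each bucket."""
    return {start: len(payloads) for start, payloads in buckets.items()}


events = [(0.0, "a"), (59.9, "b"), (60.0, "c"), (125.0, "d")]
buckets = chunk_by_time(events)
counts = histogram(buckets)
```

A real streaming system also has to decide when a bucket is complete and what to do with late arrivals. This sketch ignores both.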
Tidbits as to how Cask perceives or CDAP plays with other technologies include:
- Kafka is hot.
- Spark Streaming is hot enough to be on the CDAP roadmap.
- Cask believes that its administrative tools don’t conflict with Cloudera Manager or Ambari, because they’re more specific to an application, job or dataset.
- CDAP is built on Twill, which is a thread-like abstraction over YARN that Cask contributed to Apache. Mesos is in the picture as well, as a YARN alternative.
- Cask is seeing some interest in Flink. (Flink is basically a Spark alternative out of Germany, which I’ve been dismissing as unneeded.)
Cask has ~40 people, multiple millions of dollars in trailing revenue, and — naturally — high expectations for future growth. I neglected, however, to ask how that revenue was split between subscription, professional services and miscellaneous. Cask expects to finish 2015 with a healthy two-digit number of customers.
Cask’s customers seem concentrated in usual-suspect internet-related sectors, although Cask gave it a bit of an enterprise-y spin by specifically citing SaaS (Software as a Service) and telecom. When I asked who else seems to be a user or interested based on mailing list activity, Cask mentioned a lot of financial services and some health care as well.
- Cask doesn’t have the obvious .com URL.
I’m on record as believing that:
- Hadoop needs a memory-centric storage grid.
- Tachyon is a strong candidate to fill the role.
- It’s an open secret that there will be a Tachyon company. However, …
- … no details have been publicized. Indeed, the open secret itself is still officially secret.
- Tachyon technology, which just hit 0.6 a couple of days ago, still lacks many features I regard as essential.
- As a practical matter, most Tachyon interest to date has been associated with Spark. This makes perfect sense given Tachyon’s origin and initial technical focus.
- Tachyon was in 50 or more sites last year. Most of these sites were probably just experimenting with it. However …
- … there are production Tachyon clusters with >100 nodes.
As a reminder of Tachyon basics:
- You do I/O with Tachyon in memory.
- Tachyon data can optionally be persisted.
- That “tiered storage” capability — including SSDs — was just introduced in 0.6. So in particular …
- … it’s very primitive and limited at the moment.
- I’ve heard it said that Intel was a big contributor to tiered storage/SSD (Solid-State Drive) support.
- Tachyon has some ability to understand “lineage” in the Spark sense of the term. (In essence, that amounts to knowing what operations created a set of data, and potentially replaying them.)
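The lineage idea can be shown with a toy Python sketch: each dataset remembers the operation and the parents that produced it, so lost in-memory data can be recreated by replaying those operations. This is an illustration of the concept only, not Tachyon's or Spark's implementation. The `LineageDataset` class and its methods are hypothetical.

```python
from typing import Any, Callable, Sequence


class LineageDataset:
    """In-memory data plus a record of how to recompute it."""

    def __init__(
        self,
        compute: Callable[..., Any],
        parents: Sequence["LineageDataset"] = (),
    ) -> None:
        self._compute = compute        # the operation that creates the data
        self._parents = tuple(parents)  # the datasets it was derived from
        self._cached: Any = None
        self._in_memory = False
        self.compute_count = 0          # how often the operation has run

    def get(self) -> Any:
        """Return the data, replaying the lineage if it is not in memory."""
        if not self._in_memory:
            inputs = [parent.get() for parent in self._parents]
            self._cached = self._compute(*inputs)
            self._in_memory = True
            self.compute_count += 1
        return self._cached

    def evict(self) -> None:
        """Simulate losing the in-memory copy."""
        self._cached = None
        self._in_memory = False


source = LineageDataset(lambda: [1, 2, 3])
doubled = LineageDataset(lambda xs: [2 * x for x in xs], parents=[source])

before_loss = doubled.get()
source.evict()
doubled.evict()
after_loss = doubled.get()  # recomputed from lineage, parents first
```

The catch is cost: replay is only attractive when recomputing is cheaper than keeping a persistent copy, and when the operations are deterministic.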
Beyond that, I get the impression that:
- Synchronous write-through from Tachyon to persistent storage is extremely primitive right now — but even so I am told it is being used in production by multiple companies already.
- Asynchronous write-through, relying on lineage tracking to recreate any data that gets lost, is slightly further along.
- One benefit of adding Tachyon to your Spark installation is a reduction in garbage collection issues.
And with that I have little more to say than my bottom lines:
- If you’re writing your own caching layer for some project you should seriously consider adapting Tachyon instead.
- If you’re using Spark you should seriously consider using Tachyon as well.
- I think Tachyon will be a big deal, but it’s far too early to be sure.