Well-resourced Silicon Valley start-ups typically announce their existence multiple times. Company formation, angel funding, Series A funding, Series B funding, company launch, product beta, and product general availability may not be 7 different “news events”, but they’re apt to be at least 3-4. Platfora, no exception to this rule, is hitting general availability today, and in connection with that I learned a bit more about what they are up to.
In simplest terms, Platfora offers exploratory business intelligence against Hadoop-based data. As per last weekend’s post about exploratory BI, a key requirement is speed; and so far as I can tell, any technological innovation Platfora offers relates to the need for speed. Specifically, I drilled into Platfora’s performance architecture on the query processing side (and associated data movement); Platfora also brags of rendering 100s of 1000s of “marks” quickly in HTML5 visualizations, but I haven’t a clue as to whether that’s much of an accomplishment in itself.
Platfora’s marketing suggests it obviates the need for a data warehouse at all; for most enterprises, of course, that is a great exaggeration. But another dubious aspect of Platfora marketing actually serves to understate the product’s merits — Platfora claims to have an “in-memory” product, when what’s really the case is that Platfora’s memory-centric technology uses both RAM and disk to manage larger data marts than could reasonably be fit into RAM alone. Expanding on what I wrote about Platfora when it de-stealthed:
- Platfora incrementally batch-loads data from Hadoop into its own bare-bones SQL data store, and does BI against that. That data store:
- Of course wants to run in-memory whenever possible …
- … but also has a significant disk-based aspect.
- Is true-columnar on disk and in memory alike.
- Stores all columns from a given row on the same nodes.
- Specifically, Platfora builds star-schema data marts, called “lenses”. To avoid data bloat on the Platfora servers:
- Two lenses with the same data often only store it once.
- The data for a given lens can be “evicted” if it won’t be needed for a while. (But the specifications for the lens are of course kept in case you want to rebuild it later.)
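To make that spec-versus-data separation concrete, here is a toy sketch in Python; the class, the hashing scheme, and the rebuild hook are all my own hypothetical constructions, not Platfora’s actual design.

```python
import hashlib
import json

class LensStore:
    """Toy model: lens *specifications* are kept forever, while the expensive
    materialized data is deduplicated by a content key and can be evicted and
    rebuilt on demand. Names and structure are hypothetical."""

    def __init__(self, build_fn):
        self.specs = {}           # lens name -> spec (always retained)
        self.data = {}            # content key -> materialized data (evictable)
        self.build_fn = build_fn  # how to (re)build a lens from its spec

    def _key(self, spec):
        # Two lenses defined over the same data get the same key, so the
        # materialized data is stored only once.
        return hashlib.sha1(json.dumps(spec, sort_keys=True).encode()).hexdigest()

    def define(self, name, spec):
        self.specs[name] = spec

    def get(self, name):
        key = self._key(self.specs[name])
        if key not in self.data:                              # never built, or evicted
            self.data[key] = self.build_fn(self.specs[name])  # rebuild from the spec
        return self.data[key]

    def evict(self, name):
        self.data.pop(self._key(self.specs[name]), None)      # the spec stays put

store = LensStore(build_fn=lambda spec: f"data mart built from {spec}")
store.define("weekly_sales", {"fact": "sales", "dims": ["week", "region"]})
print(store.get("weekly_sales"))   # built
store.evict("weekly_sales")
print(store.get("weekly_sales"))   # rebuilt from the retained spec
```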
Notes on Platfora’s Hadoop ETL (Extract/Transform/Load) include:
- The basic idea is that you periodically re-run a job to pick up incremental changes since the last load.
- Right now that’s just a cron job or something. Platfora plans to add scheduling features imminently.*
- Platfora is sensitive to Hive partitioning.
- Platfora can run filters and so on to extract non-Hive data (the more common case).
*But in a sad comment on Hadoop’s workload management capabilities, Platfora doesn’t expect these features to be much used, at least at first.
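For concreteness, here is a minimal sketch of that cron-style incremental reload, assuming Hive-style date partitions; the paths, file names, and functions are hypothetical illustrations rather than anything Platfora actually ships.

```python
import json
import os

STATE_FILE = "loaded_partitions.json"   # hypothetical bookkeeping file

def list_partitions(table_root):
    """Hive-style layout: one directory per partition, e.g. .../dt=2013-03-04."""
    return sorted(d for d in os.listdir(table_root) if d.startswith("dt="))

def load_partition(path):
    print(f"extracting, filtering, and loading {path} ...")  # stand-in for the real work

def incremental_load(table_root):
    done = set(json.load(open(STATE_FILE))) if os.path.exists(STATE_FILE) else set()
    for part in list_partitions(table_root):
        if part not in done:                                  # only pick up what's new
            load_partition(os.path.join(table_root, part))
            done.add(part)
    json.dump(sorted(done), open(STATE_FILE, "w"))

if __name__ == "__main__":
    incremental_load("/data/hive/warehouse/events")           # run from cron, e.g. nightly
```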
Platfora’s aggregation story goes something like this:
- If an aggregate can be updated incrementally — for example a count or sum — Platfora probably will maintain it for you and update it on load.
- Ditto if it can be maintained almost incrementally — for example an average.
- Platfora also does Distinct calculations, even though those have to be worked through on its own servers.
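Here is a small sketch of the distinction, in plain Python rather than anything Platfora-specific: counts and sums fold in each new batch directly, an average is derived from a maintained (sum, count) pair, and an exact distinct count is the odd man out because it needs the full set of values seen so far.

```python
class RunningAggregates:
    """Counts and sums update incrementally; an average is 'almost incremental'
    because it is derived from a maintained (sum, count) pair; exact COUNT
    DISTINCT needs the whole set of values seen so far."""

    def __init__(self):
        self.count = 0
        self.sum = 0.0
        self.seen = set()          # the expensive part: needed only for DISTINCT

    def load_batch(self, values):
        self.count += len(values)  # incremental
        self.sum += sum(values)    # incremental
        self.seen.update(values)   # not incremental in bounded space

    @property
    def average(self):             # derived, hence "almost incremental"
        return self.sum / self.count if self.count else None

    @property
    def distinct(self):
        return len(self.seen)

agg = RunningAggregates()
agg.load_batch([1, 2, 2, 3])
agg.load_batch([3, 4])
print(agg.count, agg.sum, agg.average, agg.distinct)   # 6 15.0 2.5 4
```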
As you would expect, Version 1 of the Platfora data store has various limitations, such as:
- Platfora Version 1 can’t do much with arrays or (other) nested data structures — it just transforms them into JSON strings (illustrated below).
- Platfora’s SQL support is limited.
- The Platfora data store has a “fat head” master (but at least that head is multi-node).
Naturally, Platfora hopes to fix these issues down the road.
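As a hypothetical illustration of the nested-data limitation above: flattening an array-valued field into a JSON string keeps loading simple, but leaves you unable to query inside the structure without re-parsing it.

```python
import json

record = {"user_id": 42, "purchases": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}

# What a Version-1-style loader might do: the nested column becomes an opaque
# string, so you can display it but not, say, aggregate over qty without re-parsing.
flattened = {"user_id": record["user_id"], "purchases": json.dumps(record["purchases"])}
print(flattened["purchases"])   # '[{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]'
```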
Finally, a few company notes:
- Platfora has had 20 beta users, mainly but not entirely among online businesses.
- Platfora has close to 50 people.
- Platfora is currently focused on US direct sales, relying on inbound leads.
- The trend to clustered computing is sustainable.
- The trend to appliances is also sustainable.
- The “single” enterprise cluster is almost as much of a pipe dream as the single enterprise database.
I shall explain.
Arguments for hosting applications on some kind of cluster include:
- If the workload requires more than one server — well, you’re in cluster territory!
- If the workload requires less than one server — throw it into the virtualization pool.
- If the workload is uneven — throw it into the virtualization pool.
Arguments specific to the public cloud include:
- A large fraction of new third-party applications are SaaS (Software as a Service). Those naturally live in the cloud.
- Cloud providers have efficiencies that you don’t.
That’s all pretty compelling. However, these are not persuasive reasons to put everything on a SINGLE cluster or cloud. They could as easily lead you to have your VMware cluster and your Exadata rack and your Hadoop cluster and your NoSQL cluster and your object storage OpenStack cluster — among others — all while participating in several different public clouds as well.
Why would you not move work into a cluster at all? First, if it ain’t broke, you might not want to fix it. Some of the cluster options make it easy for you to consolidate existing workloads — that’s a central goal of VMware and Exadata — but others only make sense to adopt in connection with new application projects. Second, you might just want device locality. I have a gaming-class PC next to my desk; it drives a couple of monitors; I like that arrangement. Away from home I carry a laptop computer instead. Arguments can be made for small remote-office servers as well.
To put all that more simply:
- Moving existing applications to new platforms often isn’t worth the trouble.
- Many needs can be best met by single, physically local devices.
Appliances are a natural form factor for single-purpose computing. It is reasonable to characterize as “appliances” — in the computing sense of the term — medical equipment, vehicles, cash machines, cash registers, enterprise security devices, home entertainment, exercise machines and, yes, refrigerators; computers, in some form, can be found almost anywhere. But appliances also are a convenient way to package enterprise systems — configurations will be correct, installation will be simpler, and fortunate software-centric appliance vendors may capture margins on hardware sales and support. And the idea of SaaS-like continuous updates to your enterprise systems seems much more reasonable in the case of a locked-down appliance-like configuration.
Circling back to the beginning, I’d say there are multiple reasons not to expect all your computing to be done on a single cluster:
- You might want to use appliances that don’t fit into that cluster.
- You might want to use SaaS offerings that don’t fit into that cluster.
- The efficiency gains from using a single cluster aren’t that much greater than the gains from using a few of them.
- You might want different parts of your computing work to be done in-house and in the public cloud.
- You might want different parts of your data to be kept in different countries.
- Different kinds of work might fit better onto differently-configured nodes, and current cloud/cluster technology doesn’t do a wonderful job with heterogeneity.
- A lot of computing is so inherently small and local that it shouldn’t be clustered at all.
Ceteris paribus, fewer clusters are better than more of them. But all things are not equal, and it’s not reasonable to try to reduce your clusters to one — not even if that one is administered with splendid efficiency by low-cost workers, in a low-cost building, drawing low-cost electric power, in a low-cost part of the world.
If I had my way, the business intelligence part of investigative analytics — i.e., the class of business intelligence tools exemplified by QlikView and Tableau — would continue to be called “data exploration”. Exploration is what’s actually going on, and the term also carries connotations of the “fun” that users report having with the products. By way of contrast, I don’t know what “data discovery” means; the problem these tools solve is that the data has been insufficiently explored, not that it hasn’t been discovered at all. Still, “data discovery” seems to be the term that’s winning.
Confusingly, the Teradata Aster library of functions is now called “Discovery” as well, although thankfully without the “data” modifier. Further marketing uses of the term “discovery” will surely follow.
Enough terminology. What sets exploration/discovery business intelligence tools apart? I think these products have two essential kinds of feature:
- Query modification.
- Query result revisualization.*
Here’s what I mean.
*I’d wanted to call this re-presentation. But that would have been … pun-ishing.
The canonical form of query modification is:
- There’s a scatter plot or other graphical data visualization.
- You select a rectangular area on the graph.
- A new visualization is drawn.
That capability is much more useful in systems that allow you to change how the data is visualized, both:
- Before you select a subset of the results (so you can choose which visualization is easiest to select from).
- After you’ve made the selection (it would be silly to stay in a monthly bar chart if you’ve just selected a single month).
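A bare-bones sketch of that select-and-redraw loop, using pandas and matplotlib as stand-ins (this illustrates the general pattern, not any particular vendor’s implementation):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"month": range(1, 13),
                   "revenue": [9, 11, 10, 14, 13, 18, 17, 21, 20, 24, 23, 30]})

# Step 1: initial visualization (a bar chart by month).
df.plot.bar(x="month", y="revenue")

# Step 2: the user "selects a rectangle" -- here simply a range of x and y values.
x_lo, x_hi, y_lo, y_hi = 6, 9, 15, 25
subset = df[df.month.between(x_lo, x_hi) & df.revenue.between(y_lo, y_hi)]

# Step 3: redraw just the subset, switching chart type now that the selection is narrow.
subset.plot.line(x="month", y="revenue", marker="o")
plt.show()
```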
Other forms of query modification, such as faceted drill-down or parameterization, don’t depend as heavily on flexible revisualization. Perhaps not coincidentally, they’ve been around longer in some form or other than have the QlikView/Tableau/Spotfire kinds of interfaces. But at today’s leading edge, query modification and query result revisualization are joined at the hip.
What else is important for these tools?
- Good UI design, of course.
- Speed — split seconds matter.
- Most of the same features that matter for business intelligence tools with other kinds of UI.
Please note that speed is a necessary condition for exploratory BI, not a sufficient one; a limited UI that responds really fast is still a limited UI.
As for how the speed is achieved — three consistent themes are columnar storage, compression, and RAM. Beyond that, the details vary significantly from product to product, and I won’t try to generalize at this time.
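As a toy illustration of why those three themes help, here is a dictionary-encoded column held in RAM being scanned without touching the rest of each row; the real products are of course far more sophisticated.

```python
# Toy column store: one Python list per column, with dictionary encoding for a
# low-cardinality string column. Scans touch only the columns a query needs.
regions = ["east", "west", "west", "east", "north", "west"]
sales   = [100, 250, 175, 90, 310, 120]

# Dictionary-encode the string column: small integer codes compress well and
# compare faster than strings.
dictionary = {v: i for i, v in enumerate(dict.fromkeys(regions))}
encoded_regions = [dictionary[v] for v in regions]

# "SELECT SUM(sales) WHERE region = 'west'" reads two columns and no row data.
west = dictionary["west"]
total = sum(s for code, s in zip(encoded_regions, sales) if code == west)
print(total)   # 545
```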
Related links:
- The importance of data exploration flexibility (July, 2012)
- QlikView architecture (June, 2010)
- A cool QlikView feature that isn’t particularly tied to data exploration (November, 2011)
- Endeca’s underlying technology (April, 2011)
The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
Reasons the rules hold include:
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact.
But note that the HadoopDB prototype — on which Hadapt was based — was completed and the paper presented in 2009.
MarkLogic …
… has been around long enough to make a good DBMS. It used to make a solid XML DBMS. Now SQL and JSON are also in the mix. The SQL part is a reversal of MarkLogic’s long-time stance. The JSON part gets MarkLogic out of the usually-losing side of the XML/JSON debate.
RDBMS-oriented Hadoop file formats are confusing
I’ve recently tried asking both Cloudera and Hortonworks about the “columnar” file formats beneath their respective better-Hive efforts, each time getting the response “Let me set you up with a call with the right person.” Cloudera also emailed over a link to Parquet, evidently the latest such project.
Specific areas about which I’m confused (and the same questions apply to any of these projects, as they seem similarly-intended) include but are not limited to:
- Is it truly columnar (doesn’t seem so, based on the verbiage), or more PAX-like, or something else entirely?
- What’s the nested data structure story? (It seems there is one.)
- What’s the compression story?
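For what it’s worth, here is a toy sketch of the distinction behind my first question: purely columnar storage lays each column out end-to-end for the whole file, while a PAX-style layout groups rows into blocks and keeps columns contiguous only within each block. (My illustration; I don’t know which of these Parquet actually is.)

```python
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "west", "sales": 175},
    {"id": 4, "region": "east", "sales": 90},
]

# Purely columnar: each column stored end-to-end for the whole table (or file).
columnar = {col: [r[col] for r in rows] for col in ("id", "region", "sales")}

# PAX-like: rows are grouped into blocks (here, blocks of 2 rows), and within
# each block the values are organized column by column.
block_size = 2
pax = [
    {col: [r[col] for r in rows[i:i + block_size]] for col in ("id", "region", "sales")}
    for i in range(0, len(rows), block_size)
]

print(columnar["sales"])              # [100, 250, 175, 90] -- one contiguous run
print([blk["sales"] for blk in pax])  # [[100, 250], [175, 90]] -- per-block runs
```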
Come to think of it, the name “Parquet” suggests that either:
- Rows and columns are mixed together.
- Somebody has the good taste to be a Celtics fan.
Whither analytic platforms?
I’ve been a big advocate of analytic platform technology, but interest hasn’t increased as much as I expected. Teradata Aster seems to be doing well, but not so extremely well that IBM Netezza, Sybase IQ, et al. feel the need to be aggressive in their responses. Vendors have, for the most part, put decent capabilities in place; but the energy I’d looked for isn’t there.
I think that problems include:
- Analytic platforms are marketed too purely as a development play. Selling six-to-seven figure application development deals is hard.
- But selling analytic performance — the other main benefit — is harder than it used to be. Good enough is often good enough. In particular …
- … a lot of analytic work is being conceded, rightly or wrongly, to Hadoop.
- More generally, selling advanced analytic tools is commonly a tough, niche-oriented business.
Also, some of the investigative analytics energy has been absorbed by business intelligence tools, specifically ones with “discovery” interfaces — Tableau, QlikView, and so on.
I coined a new term, dataset management, for my clients at Revelytix, which they indeed adopted to describe what they do. It would also apply to the recently released Cloudera Navigator. To a first approximation, you may think of dataset management as either or both:
- Metadata management in a structured-file context.
- Lineage/provenance, auditing, and similar stuff.
Why not just say “metadata management”? First, the Revelytix guys have long been in variants of that business, and they’re tired of the responses they get when they use the term. Second, “metadata” could apply either to data about the file or to data about the data structures in the file or perhaps to data about data in the file, making “metadata” an even more confusing term in this context than in others.
My idea for the term dataset is to connote more grandeur than would be implied by the term “table”, but less than one might assume for a whole “database”. I.e.:
- A dataset contains all the information about something. This makes it a bigger deal than a mere table, which could be meaningless outside the context of a database.
- But the totality of information in a “dataset” could be less comprehensive than what we’d expect in a whole “database”.
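As a purely hypothetical illustration of the kind of bookkeeping dataset management implies, covering file metadata, structural metadata, and lineage alike:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetRecord:
    """Hypothetical sketch of dataset-management bookkeeping: metadata about the
    file(s), metadata about the data structures inside them, and lineage."""
    name: str
    files: List[str]                     # where the bytes live (e.g. HDFS paths)
    schema: dict                         # structure of the data in those files
    derived_from: List[str] = field(default_factory=list)  # lineage/provenance
    produced_by: str = ""                # the job or query that built it
    audit_log: List[str] = field(default_factory=list)     # who read/wrote it, when

clicks_by_day = DatasetRecord(
    name="clicks_by_day",
    files=["hdfs:///warehouse/clicks_by_day/part-00000"],
    schema={"day": "date", "clicks": "bigint"},
    derived_from=["raw_click_logs"],
    produced_by="nightly_rollup.pig",
)
```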
As for the specific products, both of which you might want to check out:
- Cloudera Navigator:
- Is one product from a leading Hadoop company.
- Assumes you use Cloudera’s flavor of Hadoop.
- Is generally available.
- Starts with auditing (lineage coming soon).
- Revelytix Loom:
- Is the main product of a small metadata management company.
- Is distro-agnostic.
- Is in beta.
- Already does lineage.
Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:
- Yahoo is a big YARN user.
- There are other — paying — YARN users.
- YARN general availability is now targeted for well before the end of 2013.
Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:
With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
This is similar to the approach of BDAS Spark:
Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.
although Tez won’t match Spark’s richer list of primitive operations.
More specifically, there will be six primitive Tez operations:
- HDFS (Hadoop Distributed File System) input and output.
- Sorting on input and output (I’m not sure why that’s two operations rather than one).
- Shuffling of input and output (ditto).
A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
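Here is a toy model of that composition idea in Python; the step names and the HDFS-touch counting are my own illustration, not Tez’s actual API.

```python
# Toy model of the six Tez primitives as named steps; Map and Reduce become
# particular compositions, and a DAG can omit the HDFS materialization between
# stages. This is my illustration, not the Tez API.
HDFS_IN, HDFS_OUT = "hdfs_input", "hdfs_output"
SORT_IN, SORT_OUT = "input_sort", "output_sort"
SHUF_IN, SHUF_OUT = "input_shuffle", "output_shuffle"

MAP    = [HDFS_IN, SORT_OUT, SHUF_OUT]   # classic Map step
REDUCE = [SORT_IN, SHUF_IN, HDFS_OUT]    # classic Reduce step

def two_mapreduce_jobs():
    # MapReduce forces an HDFS round trip between the two jobs.
    return MAP + REDUCE + MAP + REDUCE

def one_tez_job():
    # Tez-style DAG: the intermediate stage passes data along without writing
    # to and re-reading from HDFS.
    intermediate = [SORT_IN, SHUF_IN, SORT_OUT, SHUF_OUT]
    return MAP + intermediate + REDUCE

def hdfs_touches(steps):
    return steps.count(HDFS_IN) + steps.count(HDFS_OUT)

print(hdfs_touches(two_mapreduce_jobs()))  # 4: the intermediate result hits HDFS
print(hdfs_touches(one_tez_job()))         # 2: only the initial read and final write
```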
I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:
- The requirement to materialize (onto disk) intermediate results you don’t want to keep is gone. (Yay!)
- Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
- Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)
From time to time I advise a software vendor on how, whether, or to what extent it should offer its technology in open source. In summary, I believe:
- The formal differences between “open source” and “closed source” strategies are of secondary importance.
- The attitudinal and emotional differences between “open source” and “closed source” approaches can be large.
- A pure closed source strategy can make sense.
- A closed source strategy with important open source aspects can make sense.
- A pure open source strategy will only rarely win.
An “open source software” business model and strategy might include:
- Software given away for free.
- Demand generation to encourage people to use the free version of the software.
- Subscription pricing for additional proprietary software and support.
- Direct sales, and further marketing, to encourage users of the free stuff to upgrade to a paid version.
A “closed source software” business model and strategy might include:
- Demand generation.
- Free-download versions of the software.
- Subscription pricing for software (increasingly common) and support (always).
- Direct sales, and associated marketing.
Those look pretty similar to me.
Of course, there can still be differences between open and closed source. In particular:
- Open source can help with sales to enterprises that don’t trust a new vendor to keep progressing.
- Open source can hurt with sales to enterprises that jump at the opportunity to do what they want, themselves, for “free” and — which in some cases is important to them — in secret.
- Open source has fewer pricing options than closed.
Summing up the story so far, then, closed source is a superior strategy to open, except and to the extent that you are forced down the open route. More precisely, any advantages to an open source strategy can also be captured by having a hybrid open/closed strategy that emphasizes the closed part.
So what part of the story haven’t I told yet? Mainly, it’s open source marketing. Open source can seem virtuous and/or cool — to users, influencers, or even your own engineers. But while that’s true of people, it’s less true of companies, which are unlikely to spend a lot of money on the basis of coolness or virtue. Rather, the strictest believers in acquiring open source software do so precisely because it’s something for which they don’t have to pay, or pay much.
Further, some people think pro bono is a business strategy, because if you build up enough users, monetization can eventually follow. In the case of more-or-less explicit advertising, pro bono really does work. I give away the content of this blog; in return, people contact me from time to time and offer to buy my services — with “sales cycles” so short as to be unworthy of the name. Fun ensues, and profit. The connection is even clearer in the case of traditional mass media, or of internet services such as Twitter and Facebook. But when what you’re selling and giving away are both technology, the pro bono story has to be something like “We’ll get you hooked on the free stuff, then charge you for the rest.”
That may be great for games, but how does it work for professional software? There are some special cases, mainly:
- Your product can be used by awesomely impressive internet companies that, while refusing to pay for software themselves, validate it for adoption by lesser organizations that indeed are willing to pay. This has worked for multiple projects started by those companies themselves, such as Hadoop and memcached, but only one I can think of that wasn’t — MySQL.
- You can let users gain attachment to your free stuff, then sell your whole company to somebody who now wants to sell them other stuff, presumably closed source (or hardware), or who just is impressed by the awesomeness of your technology. This strategy has produced a very small number of great exits — XenSource, arguably Nicira (although Nicira itself disagrees), maybe a couple of others.
But in most cases, the strategy loops back to what I described at the top of this post:
- A free core product, which may be genuinely valuable to some/most users, and which certainly offers them a great opportunity to test the technology, plus …
- … a chargeable/proprietary add-on, which is required for the most serious work, …
- … or else just support.
There aren’t actually a lot of major examples in the “just support” camp* — the main ones who come to mind are Red Hat, 10gen, and Hortonworks, and two of those three are for products that were open source projects long before the respective companies were founded. And so we’re right back to an Enterprise Edition/Community Edition split.
*Or “mainly just support” — as per my recent post on Hadoop distributions, almost everybody offers SOMETHING proprietary.
This all still leaves an attitudinal distinction among (in decreasing order of open source rah-rah virtue):
- Build and promote a great free product. One of these years, get around to building and promoting a great chargeable one as well.
- Build and promote both a great free product and a great chargeable one.
- Build and promote a great chargeable product, and give a subset of it away for free. That subset should be good too.
- Build and promote a great chargeable product, and give a crappy subset away for free.
I think #3 makes the most sense. #4 is bad because I don’t believe in promoting or distributing crappy products even for free. #2 is too big a challenge to tackle, in technology and marketing alike. And #1 is only for the most patient vendors with the deepest of pockets.
There’s also the possibility of open sourcing software and then making your main revenue from being the best hosting company for it. But to date that has worked mainly for Automattic.
Finally — what about open source as a development strategy? Well, there are indeed some projects with multiple sets of major contributors — Linux, R, Hadoop, Postgres and so on. But for projects that originate with a single sponsoring vendor, my general observation still stands:
- Open source software commonly gets community contributions for connectors, adapters, and (national) language translations.
- But useful contributions in other areas are much rarer.
A couple of final notes:
- The open/closed source distinction is central to only a few of the issues on our strategy and execution worksheets, mainly the ones influenced by pricing. However, it is at least slightly relevant to a considerable fraction of them.
- I glossed over the free-like-speech/free-like-beer distinction a bit; hopefully my usage was clear in context.
One elephant went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Two elephants went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Three elephants went out to play
– Popular children’s song
It’s Strata week, with much Hadoop news, some of which I’ve been briefed on and some of which I haven’t. Rather than delve into fine competitive details, let’s step back and consider some generalities. First, about Hadoop distributions and distro providers:
- Conceptually, the starting point for a “Hadoop distribution” is some version of Apache Hadoop.
- Hortonworks is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does like HCatalog.
- Cloudera straddles Hadoop 1 and Hadoop 2, shipping aspects of Hadoop 2 but not recommending them for production use.
- Some of the newer distros seem to be based on Hadoop 2, if the markitecture slides are to be believed.
- Optionally, the version numbers of different parts of Hadoop in a distribution could be a little mismatched, if the distro provider takes responsibility for testing them together.
- Cloudera seems more willing to do that than Hortonworks.
- Different distro providers may choose different sets of Apache Hadoop subprojects to include.
- Cloudera seems particularly expansive in what it is apt to include. Perhaps not coincidentally, Cloudera folks started various Hadoop subprojects.
- Optionally, distro providers’ additional proprietary code can be included, to be used either in addition to or instead of Apache Hadoop code. (In the latter case, marketing can then ensue about whether this is REALLY a Hadoop distribution.)
- Hortonworks markets from a “more open source than thou” stance, even though:
- It is not a purist in that regard.
- That marketing message is often communicated by Hortonworks’ very closed-source partners.
- Several distro providers, notably Cloudera, offer management suites as a big part of their proprietary value-add. Hortonworks, however, is focused on making open-source Ambari into a competitive management tool.
- Performance is another big area for proprietary code, especially from vendors who look at HDFS (Hadoop Distributed File System) and believe they can improve on it.
- I conjecture packaging/installation code is often proprietary, but that’s a minor issue that doesn’t get mentioned much.
- Optionally, third parties’ code can be provided, open or closed source as the case may be.
Most of the same observations could apply to Hadoop appliance vendors.
Besides code, Hadoop distribution providers commonly offer support. The Hadoop support situation is confused, largely because:
- Marketing around Hadoop support capabilities and experience is sparse …
- … except for the Hortonworks vs. Cloudera General Hadoop Expertise Urinary Olympics.
- I don’t hear a lot of complaints about anybody’s Hadoop support.
- One should distinguish between, say, Tier 1 and Tier 3 support.
- Since most serious Hadoop development is done by Cloudera and Hortonworks, those two vendors are by far the best qualified to do Tier 3+ support.
- Since Cloudera has the most Hadoop market share to date, it also has the most Hadoop support experience (any and all tiers).
- Some of the other contenders are huge companies that presumably know how to support enterprise customers. This includes both distro providers and others (e.g. Oracle, which sells a Cloudera-based appliance and handles Tier 1 support for that itself).
And finally, reasons that come to mind for choosing particular distributions include:
- Cloudera Manager is (relatively speaking) mature.
- Cloudera Navigator seems promising.
- Cloudera has the most experienced Hadoop services operation.
- Cloudera has the development “axe” in some parts of Hadoop and is second only to Hortonworks in the others.
- Cloudera has lots of partner support.
- Cloudera is the best-funded company whose main business is Hadoop.
- With the arguable exception of Cloudera, Hortonworks has much more Hadoop expertise than any other outfit, including the development “axe” in a variety of areas.
- Hortonworks has lots of partner support.
- Hortonworks is the second-best-funded company whose main business is Hadoop.
- Because of its low reliance on proprietary code, Hortonworks has great “escapability”, and correspondingly weak pricing power vs. its customers.
- Intel’s Hadoop performance hacks may be legit.
- Intel was evidently early in supporting Chinese Hadoop users.
- If you want to use the Greenplum DBMS, using the Pivotal/Greenplum Hadoop distribution too would seem to be thematic.
- At one point MapR seemed to have a performance advantage. I don’t know whether that’s still the case.
- Some believe that IBM removes obstacles, and grants blessings of prosperity and wisdom.
My former friends at Greenplum no longer talk to me, so in particular I wasn’t briefed on Pivotal HD and Greenplum HAWQ. Pivotal HD seems to be yet another Hadoop distribution, with the idea that you use Greenplum’s management tools. Greenplum HAWQ seems to be Greenplum tied to HDFS.
The basic idea seems to be much like what I mentioned a few days ago — the low-level file store for Greenplum can now be something else one has heard of before, namely HDFS (Hadoop Distributed File System, which is also an option for, say, NuoDB). Beyond that, two interesting quotes in a Greenplum blog post are:
When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine.
In addition, it has native support for HBase, supporting HBase predicate pushdown, hive[sic] connectivity, and offering a ton of intelligent features to retrieve HBase data.
The first sounds like the invisible loading that Daniel Abadi wrote about last September on Hadapt’s blog. (Edit: Actually, see Daniel’s comment below.) The second sounds like a good idea that, again, would also be a natural direction for vendors such as Hadapt.
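As a rough sketch of what invisible loading amounts to (my paraphrase of the general idea, not Hadapt’s or Greenplum’s actual mechanism): the first query that touches a piece of HDFS data reads it the slow way and quietly keeps a local copy, so later queries don’t have to go back to HDFS.

```python
class InvisibleLoader:
    """Toy sketch of the invisible-loading idea: queries are answered from the
    local store when possible; anything still living only in HDFS is read once,
    answered from, and quietly kept locally for next time."""

    def __init__(self, hdfs_reader):
        self.local = {}                 # partition -> locally stored data
        self.hdfs_reader = hdfs_reader  # function: partition -> rows (slow path)

    def query(self, partitions, predicate):
        rows = []
        for p in partitions:
            if p not in self.local:                    # first touch: read HDFS ...
                self.local[p] = self.hdfs_reader(p)    # ... and load as a side effect
            rows.extend(r for r in self.local[p] if predicate(r))
        return rows
```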