If I had my way, the business intelligence part of investigative analytics — i.e., the class of business intelligence tools exemplified by QlikView and Tableau — would continue to be called “data exploration”. Exploration is what’s actually going on, and the term also carries connotations of the “fun” that users report having with the products. By way of contrast, I don’t know what “data discovery” means; the problem these tools solve is that the data has been insufficiently explored, not that it hasn’t been discovered at all. Still, “data discovery” seems to be the term that’s winning.
Confusingly, the Teradata Aster library of functions is now called “Discovery” as well, although thankfully without the “data” modifier. Further marketing uses of the term “discovery” will surely follow.
Enough terminology. What sets exploration/discovery business intelligence tools apart? I think these products have two essential kinds of feature:
- Query modification.
- Query result revisualization.*
Here’s what I mean.
*I’d wanted to call this re-presentation. But that would have been … pun-ishing.
The canonical form of query modification is:
- There’s a scatter plot or other graphical data visualization.
- You select a rectangular area on the graph.
- A new visualization is drawn.
That capability is much more useful in systems that allow you to change how the data is visualized, both:
- Before you select a subset of the results (so you can choose which visualization is easiest to select from).
- After you’ve made the selection (it would be silly to stay in a monthly bar chart if you’ve just selected a single month).
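A minimal sketch of the select-to-requery idea, in Python. The function names and data layout here are purely illustrative, not any product’s API: a rectangular selection on a scatter plot becomes a filter predicate, and the filtered subset is re-aggregated for a different visualization.

```python
# Sketch of query modification via rectangle selection: the selected
# (x, y) ranges become a filter, and the surviving points are
# re-aggregated for a new chart. Illustrative names only.

def select_rectangle(points, x_range, y_range):
    """Keep only points whose (x, y) fall inside the selected rectangle."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [p for p in points
            if x_lo <= p["x"] <= x_hi and y_lo <= p["y"] <= y_hi]

def revisualize(points, group_key):
    """Re-aggregate the selected subset, e.g. for a new bar chart."""
    totals = {}
    for p in points:
        totals[p[group_key]] = totals.get(p[group_key], 0) + p["y"]
    return totals

sales = [
    {"month": "Jan", "region": "East", "x": 1, "y": 100},
    {"month": "Jan", "region": "West", "x": 1, "y": 80},
    {"month": "Feb", "region": "East", "x": 2, "y": 120},
]
subset = select_rectangle(sales, x_range=(1, 1), y_range=(0, 200))
print(revisualize(subset, "region"))  # {'East': 100, 'West': 80}
```

The point of the sketch is that the selection and the redrawing are separable steps, which is exactly why it pays to be able to change the visualization both before and after selecting.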
Other forms of query modification, such as faceted drill-down or parameterization, don’t depend as heavily on flexible revisualization. Perhaps not coincidentally, they’ve been around longer in some form or other than have the QlikView/Tableau/Spotfire kinds of interfaces. But at today’s leading edge, query modification and query result revisualization are joined at the hip.
What else is important for these tools?
- Good UI design, of course.
- Speed — split seconds matter.
- Most of the same features that matter for business intelligence tools with other kinds of UI.
Please note that speed is a necessary condition for exploratory BI, not a sufficient one; a limited UI that responds really fast is still a limited UI.
As for how the speed is achieved — three consistent themes are columnar storage, compression, and RAM. Beyond that, the details vary significantly from product to product, and I won’t try to generalize at this time.
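To illustrate two of those three themes in miniature (the data and layout below are toy examples, not any product’s internals): storing each column contiguously means an aggregate over one measure scans only that column’s values, and a sorted, repetitive column run-length encodes well.

```python
# Toy illustration of columnar storage and compression.
# Each field is stored as its own contiguous array, so summing
# "amount" never touches "region"; the repetitive "region" column
# compresses into a handful of (value, count) runs.

columns = {
    "region": ["East", "East", "East", "West", "West"],
    "amount": [100, 120, 90, 80, 110],
}

print(sum(columns["amount"]))  # 500 -- scans one array only

def run_length_encode(col):
    """Compress consecutive repeats into [value, count] pairs."""
    runs = []
    for v in col:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

print(run_length_encode(columns["region"]))  # [['East', 3], ['West', 2]]
```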
- The importance of data exploration flexibility (July, 2012)
- QlikView architecture (June, 2010)
- A cool QlikView feature that isn’t particularly tied to data exploration (November, 2011)
- Endeca’s underlying technology (April, 2011)
Oracle Universal Content Management 10gR3 was released in May 2007. Since that time, Oracle WebCenter Content 11g has been released, and Oracle WebCenter 12c is on the horizon. For 10gR3 customers, the next step down the WebCenter path is to upgrade to 11g. However, some customers don’t know where to begin in terms of an upgrade – not when their current version is supporting numerous business processes, contains thousands of high-value content items, and has been customized numerous times to meet business requirements.
Join Jason Lamon, Senior Marketing Associate, and Alan Mackenthun, Technical Program Manager at Fishbowl Solutions as they discuss Fishbowl’s path, package and promise for WebCenter Content 11g upgrades. They are also privileged to be joined by Mike Kohorst – IT Application Manager at Ryan Companies, who will discuss their recent 11g upgrade success, as well as their future plans for the system. We hope you will be able to join us!
Date: Thursday, March 21st
Time: 1 pm EST, Noon CST
The post Fishbowl Webinar – A Path, Package, and Promise for WebCenter Content 11g Upgrades appeared first on C4 Blog by Fishbowl Solutions.
The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1. In particular:
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact.
But note that the HadoopDB prototype — on which Hadapt was based — was completed and the paper presented in 2009.
MarkLogic …
… has been around long enough to make a good DBMS. It used to make a solid XML DBMS. Now SQL and JSON are also in the mix. The SQL part is a reversal of MarkLogic’s long-time stance. The JSON part gets MarkLogic out of the usually-losing side of the XML/JSON debate.
RDBMS-oriented Hadoop file formats are confusing
I’ve recently tried asking both Cloudera and Hortonworks about the “columnar” file formats beneath their respective better-Hive efforts, each time getting the response “Let me set you up with a call with the right person.” Cloudera also emailed over a link to Parquet, evidently the latest such project.
Specific areas about which I’m confused (and the same questions apply to any of these projects, as they seem similarly-intended) include but are not limited to:
- Is it truly columnar (doesn’t seem so, based on the verbiage), or more PAX-like, or something else entirely?
- What’s the nested data structure story? (It seems there is one.)
- What’s the compression story?
Come to think of it, the name “Parquet” suggests that either:
- Rows and columns are mixed together.
- Somebody has the good taste to be a Celtics fan.
Whither analytic platforms?
I’ve been a big advocate of analytic platform technology, but interest hasn’t increased as much as I expected. Teradata Aster seems to be doing well, but not so extremely well that IBM Netezza, Sybase IQ, et al. feel the need to be aggressive in their responses. Vendors have, for the most part, put decent capabilities in place; but the energy I’d looked for isn’t there.
I think that problems include:
- Analytic platforms are marketed too purely as a development play. Selling six-to-seven figure application development deals is hard.
- But selling analytic performance — the other main benefit — is harder than it used to be. Good enough is often good enough. In particular …
- … a lot of analytic work is being conceded, rightly or wrongly, to Hadoop.
- More generally, selling advanced analytic tools is commonly a tough, niche-oriented business.
Also, some of the investigative analytics energy has been absorbed by business intelligence tools, specifically ones with “discovery” interfaces — Tableau, QlikView, and so on.
I coined a new term, dataset management, for my clients at Revelytix, which they indeed adopted to describe what they do. It would also apply to the recently released Cloudera Navigator. To a first approximation, you may think of dataset management as either or both:
- Metadata management in a structured-file context.
- Lineage/provenance, auditing, and similar stuff.
Why not just say “metadata management”? First, the Revelytix guys have long been in variants of that business, and they’re tired of the responses they get when they use the term. Second, “metadata” could apply either to data about the file or to data about the data structures in the file or perhaps to data about data in the file, making “metadata” an even more confusing term in this context than in others.
My idea for the term dataset is to connote more grandeur than would be implied by the term “table”, but less than one might assume for a whole “database”. I.e.:
- A dataset contains all the information about something. This makes it a bigger deal than a mere table, which could be meaningless outside the context of a database.
- But the totality of information in a “dataset” could be less comprehensive than what we’d expect in a whole “database”.
As for the specific products, both of which you might want to check out:
- Cloudera Navigator:
- Is one product from a leading Hadoop company.
- Assumes you use Cloudera’s flavor of Hadoop.
- Is generally available.
- Starts with auditing (lineage coming soon).
- Revelytix Loom:
- Is the main product of a small metadata management company.
- Is distro-agnostic.
- Is in beta.
- Already does lineage.
Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:
- Yahoo is a big YARN user.
- There are other — paying — YARN users.
- YARN general availability is now targeted for well before the end of 2013.
Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:
With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
This is similar to the approach of BDAS Spark:
Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.
although Tez won’t match Spark’s richer list of primitive operations.
More specifically, there will be six primitive Tez operations:
- HDFS (Hadoop Distributed File System) input and output.
- Sorting on input and output (I’m not sure why that’s two operations rather than one).
- Shuffling of input and output (ditto).
A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
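As a data-structure sketch of that composition (purely illustrative; this is not the Tez API), the point is that Map and Reduce stop being monolithic steps and become bundles of finer-grained primitives, which a DAG engine can rearrange or drop. One immediate payoff: two chained MapReduce jobs materialize an intermediate result to HDFS between them, and a DAG of primitives can elide that round trip.

```python
# Illustrative composition only -- not the actual Tez API.
# The six primitives, and the classic steps as bundles of them:

MAP_STEP    = ["hdfs_input", "output_sort", "output_shuffle"]
REDUCE_STEP = ["input_sort", "input_shuffle", "hdfs_output"]

def elide_hdfs_round_trips(pipeline):
    """Drop any hdfs_output immediately followed by hdfs_input --
    the intermediate materialization a DAG engine can skip."""
    out = []
    for p in pipeline:
        if out and out[-1] == "hdfs_output" and p == "hdfs_input":
            out.pop()
            continue
        out.append(p)
    return out

two_jobs = MAP_STEP + REDUCE_STEP + MAP_STEP + REDUCE_STEP
print(elide_hdfs_round_trips(two_jobs))
# One HDFS read at the start, one write at the end, nothing in between.
```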
I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:
- The requirement for materializing (onto disk) intermediate results that you don’t want to is gone. (Yay!)
- Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
- Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)
Free Application Showcases WebCenter Mobile Access Capabilities for Sales Enablement and Offline Content Access
Fishbowl Solutions is excited to announce the release of our Fishbowl Mobile Library Android application on the Google Play Store. This free version of Fishbowl’s Mobile Tablet Application for Oracle WebCenter Content allows customers to experience mobile WebCenter functionality first hand. By enabling mobile access to Oracle WebCenter Content, users can find, store, view, and organize content from the Oracle WebCenter repository directly on their tablets.
Features of the Fishbowl Mobile Library app include access to Oracle WebCenter Content with the ability to view content including PDFs, HTML, images, and video, as well as the option to keep local copies of content items for offline use. These copies are synced with WebCenter when reconnected. Customized Folder View options enable the user to organize local content into a personalized folder structure and the Content Sharing feature provides the option to add items to an email cart which emails links to download items from a secure temporary download site.
The Android application is built on Fishbowl’s Mobility API and serves as a reference for mobile applications integrated with Oracle WebCenter Content. This mobile content library application framework is currently in production on both Android Tablets and the Apple iPad.
The app is also available for iPad; the free Fishbowl ECM Mobile App can be downloaded from the iTunes Store.
The post “Fishbowl Mobile Library” Android Tablet App Now Available on Google Play appeared first on C4 Blog by Fishbowl Solutions.
From time to time I advise a software vendor on how, whether, or to what extent it should offer its technology in open source. In summary, I believe:
- The formal differences between “open source” and “closed source” strategies are of secondary importance.
- The attitudinal and emotional differences between “open source” and “closed source” approaches can be large.
- A pure closed source strategy can make sense.
- A closed source strategy with important open source aspects can make sense.
- A pure open source strategy will only rarely win.
An “open source software” business model and strategy might include:
- Software given away for free.
- Demand generation to encourage people to use the free version of the software.
- Subscription pricing for additional proprietary software and support.
- Direct sales, and further marketing, to encourage users of the free stuff to upgrade to a paid version.
A “closed source software” business model and strategy might include:
- Demand generation.
- Free-download versions of the software.
- Subscription pricing for software (increasingly common) and support (always).
- Direct sales, and associated marketing.
Those look pretty similar to me.
Of course, there can still be differences between open and closed source. In particular:
- Open source can help with sales to enterprises that don’t trust a new vendor to keep progressing.
- Open source can hurt with sales to enterprises that jump at the opportunity to do what they want, themselves, for “free” and — which in some cases is important to them — in secret.
- Open source has fewer pricing options than closed.
Summing up the story so far, then, closed source is a superior strategy to open, except and to the extent that you are forced down the open route. More precisely, any advantages to an open source strategy can also be captured by having a hybrid open/closed strategy that emphasizes the closed part.
So what part of the story haven’t I told yet? Mainly, it’s open source marketing. Open source can seem virtuous and/or cool — to users, influencers, or even your own engineers. But while that’s true of people, it’s less true of companies, which are unlikely to spend a lot of money on the basis of coolness or virtue. Rather, the strictest believers in acquiring open source software do so precisely because it’s something for which they don’t have to pay, or pay much.
Further, some people think pro bono is a business strategy, because if you build up enough users, monetization can eventually follow. In the cases of more-or-less explicit advertising, pro bono really does work. I give away the content of this blog; in return, people contact me from time to time and offer to buy my services — with “sales cycles” so short as to be unworthy of the name. Fun ensues, and profit. The connection is even clearer in the case of traditional mass media, or of internet services such as Twitter and Facebook. But when what you’re selling and giving away are both technology, the pro bono story has to be something like “We’ll get you hooked on the free stuff, then charge you for the rest.”
That may be great for games, but how does it work for professional software? There are some special cases, mainly:
- Your product can be used by awesomely impressive internet companies that, while refusing to pay for software themselves, validate it for adoption by lesser organizations that indeed are willing to pay. This has worked for multiple projects started by those companies themselves, such as Hadoop and memcached, but only one I can think of that wasn’t — MySQL.
- You can let users gain attachment to your free stuff, then sell your whole company to somebody who now wants to sell them other stuff, presumably closed source (or hardware), or who just is impressed by the awesomeness of your technology. This strategy has produced a very small number of great exits — XenSource, arguably Nicira (although Nicira itself disagrees), maybe a couple of others.
But in most cases, the strategy loops back to what I described at the top of this post:
- A free core product, which may be genuinely valuable to some/most users, and which certainly offers them a great opportunity to test the technology, plus …
- … a chargeable/proprietary add-on, which is required for the most serious work, …
- … or else just support.
There aren’t actually a lot of major examples in the “just support” camp* — the main ones who come to mind are Red Hat, 10gen, and Hortonworks, and two of those three are for products that were open source projects long before the respective companies were founded. And so we’re right back to an Enterprise Edition/Community Edition split.
*Or “mainly just support” — as per my recent post on Hadoop distributions, almost everybody offers SOMETHING proprietary.
This all still leaves an attitudinal distinction among (in decreasing order of open source rah-rah virtue):
- Build and promote a great free product. One of these years, get around to building and promoting a great chargeable one as well.
- Build and promote both a great free product and a great chargeable one.
- Build and promote a great chargeable product, and give a subset of it away for free. That subset should be good too.
- Build and promote a great chargeable product, and give a crappy subset away for free.
I think #3 makes the most sense. #4 is bad because I don’t believe in promoting or distributing crappy products even for free. #2 is too big a challenge to tackle, in technology and marketing alike. And #1 is only for the most patient vendors with the deepest of pockets.
There’s also the possibility of open sourcing software and then making your main revenue from being the best hosting company for it. But to date that has worked mainly for Automattic.
Finally — what about open source as a development strategy? Well, there are indeed some projects with multiple sets of major contributors — Linux, R, Hadoop, Postgres and so on. But for projects that originate with a single sponsoring vendor, my general observation still stands:
- Open source software commonly gets community contributions for connectors, adapters, and (national) language translations.
- But useful contributions in other areas are much rarer.
- The open/closed source distinction is central to only a few of the issues on our strategy and execution worksheets, mainly the ones influenced by pricing. However, it is at least slightly relevant to a considerable fraction of them.
- I glossed over the free-like-speech/free-like-beer distinction a bit; hopefully my usage was clear in context.
One elephant went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Two elephants went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Three elephants went out to play
– Popular children’s song
It’s Strata week, with much Hadoop news, some of which I’ve been briefed on and some of which I haven’t. Rather than delve into fine competitive details, let’s step back and consider some generalities. First, about Hadoop distributions and distro providers:
- Conceptually, the starting point for a “Hadoop distribution” is some version of Apache Hadoop.
- Hortonworks is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does like HCatalog.
- Cloudera straddles Hadoop 1 and Hadoop 2, shipping aspects of Hadoop 2 but not recommending them for production use.
- Some of the newer distros seem to be based on Hadoop 2, if the markitecture slides are to be believed.
- Optionally, the version numbers of different parts of Hadoop in a distribution could be a little mismatched, if the distro provider takes responsibility for testing them together.
- Cloudera seems more willing to do that than Hortonworks.
- Different distro providers may choose different sets of Apache Hadoop subprojects to include.
- Cloudera seems particularly expansive in what it is apt to include. Perhaps not coincidentally, Cloudera folks started various Hadoop subprojects.
- Optionally, distro providers’ additional proprietary code can be included, to be used either in addition to or instead of Apache Hadoop code. (In the latter case, marketing can then ensue about whether this is REALLY a Hadoop distribution.)
- Hortonworks markets from a “more open source than thou” stance, even though:
- It is not a purist in that regard.
- That marketing message is often communicated by Hortonworks’ very closed-source partners.
- Several distro providers, notably Cloudera, offer management suites as a big part of their proprietary value-add. Hortonworks, however, is focused on making open-source Ambari into a competitive management tool.
- Performance is another big area for proprietary code, especially from vendors who look at HDFS (Hadoop Distributed File System) and believe they can improve on it.
- I conjecture packaging/installation code is often proprietary, but that’s a minor issue that doesn’t get mentioned much.
- Optionally, third parties’ code can be provided, open or closed source as the case may be.
Most of the same observations could apply to Hadoop appliance vendors.
Besides code, Hadoop distribution providers commonly offer support. The Hadoop support situation is confused, largely because:
- Marketing around Hadoop support capabilities and experience is sparse …
- … except for the Hortonworks vs. Cloudera General Hadoop Expertise Urinary Olympics.
- I don’t hear a lot of complaints about anybody’s Hadoop support.
- One should distinguish between, say, Tier 1 and Tier 3 support.
- Since most serious Hadoop development is done by Cloudera and Hortonworks, those two vendors are by far the best qualified to do Tier 3+ support.
- Since Cloudera has the most Hadoop market share to date, it also has the most Hadoop support experience (any and all tiers).
- Some of the other contenders are huge companies that presumably know how to support enterprise customers. This includes both distro providers and others (e.g. Oracle, which sells a Cloudera-based appliance and handles Tier 1 support for that itself).
And finally, reasons that come to mind for choosing particular distributions include:
- Cloudera Manager is (relatively speaking) mature.
- Cloudera Navigator seems promising.
- Cloudera has the most experienced Hadoop services operation.
- Cloudera has the development “axe” in some parts of Hadoop and is second only to Hortonworks in the others.
- Cloudera has lots of partner support.
- Cloudera is the best-funded company whose main business is Hadoop.
- With the arguable exception of Cloudera, Hortonworks has much more Hadoop expertise than any other outfit, including the development “axe” in a variety of areas.
- Hortonworks has lots of partner support.
- Hortonworks is the second-best-funded company whose main business is Hadoop.
- Because of its low reliance on proprietary code, Hortonworks has great “escapability”, and correspondingly weak pricing power vs. its customers.
- Intel’s Hadoop performance hacks may be legit.
- Intel was evidently early in supporting Chinese Hadoop users.
- If you want to use the Greenplum DBMS, using the Pivotal/Greenplum Hadoop distribution too would seem to be thematic.
- At one point MapR seemed to have a performance advantage. I don’t know whether that’s still the case.
- Some believe that IBM removes obstacles, and grants blessings of prosperity and wisdom.
My former friends at Greenplum no longer talk to me, so in particular I wasn’t briefed on Pivotal HD and Greenplum HAWQ. Pivotal HD seems to be yet another Hadoop distribution, with the idea that you use Greenplum’s management tools. Greenplum HAWQ seems to be Greenplum tied to HDFS.
The basic idea seems to be much like what I mentioned a few days ago — the low-level file store for Greenplum can now be something else one has heard of before, namely HDFS (Hadoop Distributed File System, which is also an option for, say, NuoDB). Beyond that, two interesting quotes in a Greenplum blog post are:
When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine.
In addition, it has native support for HBase, supporting HBase predicate pushdown, hive[sic] connectivity, and offering a ton of intelligent features to retrieve HBase data.
The first sounds like the invisible loading that Daniel Abadi wrote about last September on Hadapt’s blog. (Edit: Actually, see Daniel’s comment below.) The second sounds like a good idea that, again, would also be a natural direction for vendors such as Hadapt.
WibiData is essentially on the trajectory:
- Started with platform-ish technology.
- Selling analytic application subsystems, focused for now on personalization.
- Hopeful of selling complete analytic applications in the future.
The same, it turns out, is true of Causata.* Talking with them both the same day led me to write this post.
*Differences between the companies include:
- WibiData started out with some serious HBase/Hadoop technology, whereas …
- … Causata just changed its underpinnings to HBase/Hadoop …
- … after hiring new, application-oriented leadership.
I know WibiData (client since they had <10 employees) much better than Causata (one conversation ever).
The problem for those vendors and other analytic application aspirants is that it is very hard to offer a complete analytic application. In particular:
- Suppose they want to offer a great solution for, say, website personalization.* It’s hard to do that without offering something that creates complete websites — specifically, complete unique websites. Whoops.
- OK, let’s suppose they solve that problem, drawing a clean line between the personalization and creative parts. Then is it really enough for them to just personalize websites? Shouldn’t they also personalize email? Mobile ads? In-store offers? Shouldn’t that all be tied to campaign design? And by the way, they need the capacity to incorporate almost any kind of data you can imagine, while applying any kind of modeling algorithm that can offer differentiated results.
- On the other hand, suppose they only deliver the common analytic subsystems for various functions? How do they sell that? How do they even demo it? Are they at the mercy of “last mile functionality” partners?
*There are various semantic issues as to whether the correct word is “personalization”, “customization”, etc. In this post, I’m ignoring them.
My proposed answer starts:
- Even though it’s impractical to offer across-the-board, full-featured, full-suite, highly competitive analytic applications …
- … offer something that purports to be a complete analytic app anyway.
Maybe the “complete” app is, from the customer’s standpoint, at least a “good start”. Maybe you really can deliver an awesome application for a narrow area of functionality — and the customer adopts it with confidence, knowing that she can integrate the core technology into a broader suite if she wants to.
As I’m telling the story, the real differentiation is apt to be in the subsystem, not in the finished app. So for a sanity check, let’s consider when that might be the case. Examples that come to mind include:
- Small-/mid-market, vertical-market BI. The best example of this may be Google Analytics, for website owners and administrators — but that’s most famous as a free product. Perhaps there are also examples in more conventional enterprise-adoption scenarios. (PivotLink for retailers? I’m not sure how mature their application functionality really is.)
- Any of the four scenarios I outlined in my post on third-party analytics. One notable example is stock quote services such as Bloomberg. But that’s really an information-selling business much more than an analytic-functionality one.
- Price-setting analytics — Zilliant, Vendavo, and so on. Those outfits indeed seem to focus on application fit-and-finish as much as on price optimization expertise. But I’d guess that the most successful companies in that market are still in the 10s of millions of annual revenues; for example, Zilliant recently boasted of its 100th customer.
I don’t think any of those cases are sufficient to undermine my conclusions, namely:
- Making a big business from “complete” analytic applications will in most cases require some heretofore undiscovered insights or conceptual breakthroughs (business model or technology as the case may be).
- Analytic application subsystems are where most of the near-term opportunity lies.
- It will likely be wise to offer “complete” analytic applications even so.
Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times.
Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:
- Every level of storage (disk, RAM, etc.).
- Indexes, aggregates and raw data alike.
To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to:
- Big blocks of data compress better, and can also be faster to retrieve than a number of smaller blocks holding the same amount of data. Small blocks of data can be less wasteful to write. And different kinds of storage have different minimum block sizes.
- Operating on compressed data offers multiple significant efficiencies. But you have to spend cycles (de)compressing it, and it’s only practical for some compression schemes.
- Fixed-length tabular records can let you compute addresses rather than looking them up in indexes. Yay! But they also waste space.
- Tokenization can help with the fixed-/variable-length tradeoff.
- Pointers are wonderfully efficient for some queries, at least if you’re not using spinning disk. But they can create considerable overhead to write and update.
- Indexes, materialized views, etc. speed query performance, but can be costly to write and maintain.
- Storing something as a BLOB (Binary Large OBject), key-value payload, etc. is super-fast — but if you want to look at it, you usually have to pay for retrieving the whole thing.
What’s more, different data layouts can have different implications for logging, locking, replication, backup and more.
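The fixed-length point above is easy to see in miniature: if every record is the same size, a row's byte address is pure arithmetic, with no index lookup at all. Here's a minimal Python sketch; the two-field record layout is invented purely for illustration.

```python
import struct

# Hypothetical fixed-length record layout: (id: int32, balance: int64),
# little-endian, no padding.
RECORD_FMT = "<iq"
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # 12 bytes per record

def write_records(buf, records):
    # Records are simply appended; row N always starts at N * RECORD_SIZE.
    for rec in records:
        buf += struct.pack(RECORD_FMT, *rec)
    return buf

def read_record(buf, row_number):
    # Address computation instead of an index lookup -- the "Yay!" part.
    # The cost is that every record pads out to the maximum field widths.
    offset = row_number * RECORD_SIZE
    return struct.unpack_from(RECORD_FMT, buf, offset)

data = write_records(bytearray(), [(1, 100), (2, 250), (3, -40)])
assert read_record(data, 2) == (3, -40)
```

With variable-length records, that offset computation is impossible, which is exactly where indexes (or tokenization into fixed-width codes) come back into the picture.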
So what would happen if somebody tried to bundle all conceivable functionality into a single DBMS, with a plan to optimize the layout of any particular part of the database as appropriate? I think the outcome would be tears – for the development effort would be huge, while the benefits would be scanty. The most optimistic cost estimates could run in the 100s of millions of dollars, with more realistic ones adding a further order of magnitude. But no matter what the investment, the architects would be on the horns of a nasty dilemma:
- If there’s much commonality among the component DBMS, each one would be sub-optimal.
- If there’s little commonality among them, then there’s also little benefit to the combination.
Adding insult to injury, all the generality would make it hard to select optimum hardware for this glorious DBMS — unless, of course, a whole other level of development effort made it work well across very heterogeneous clusters.
Less megalomaniacally, there have been many attempts to combine two or more alternate data layouts in a single DBMS, with varying degrees of success. In the relational-first world:
- Analytic DBMS have combined row and column data models so fluidly that I’ve made fun of Oracle for not being able to pull it off. SAP HANA sort of does the same thing, but perhaps with a columnar bias, and not just for analytics.
- Relational DBMS can also have a variety of index types, suitable for different relational use cases. This is especially true for analytic uses of general-purpose RDBMS.
- Oracle, DB2, PostgreSQL, and Informix have had full extensibility architectures since the 1990s. That said:
- Almost all the extensions come from the DBMS vendors themselves.
- Extensions that resemble (or are) a tabular datatype — for example geospatial or financial-date — are often technically well-regarded.
- Others are usually not so strong technically, but in a few cases sell well anyway (e.g. Oracle Text).
- While Microsoft never went through the trouble of offering full extensibility, otherwise the SQL Server story is similar.
- Sybase’s extensibility projects went badly in the 1990s, and Sybase doesn’t seem to have tried hard in that area since.
- IBM DB2, Microsoft SQL Server, and Oracle added XML capabilities around the middle of the last decade.
- Analytic platforms can wind up with all sorts of temporary data structures.
- Analytic DBMS have various ways to reach out and touch Hadoop.
- Non-relational DBMS commonly have indexes that at least support relational-like SELECTs. JOINs can be more problematic, but MarkLogic finally has them. Tokutek even offers a 3rd-party indexing option for MongoDB.
- Hadoop is growing into what is in effect a family of DBMS and other data stores — generic HDFS, HBase, generic Hive, Impala, and so on. At the moment, however, none of them is very mature. BDAS/Spark/Shark ups the ante further, but of course that’s less mature yet.
- Hadapt combines Hadoop and PostgreSQL.
- DataStax combines Cassandra, Hadoop, and Solr.
- Akiban fondly thinks its data layouts are well-suited for relational tables, JSON, and XML alike. (But business at Akiban may be in flux.)
- GenieDB (Version 1 only) and NuoDB are both implemented over key-value stores. GenieDB Version 2 is implemented over Berkeley DB or MySQL.
- Membase/Couchbase was first implemented over SQLite, then over (a forked version of) CouchDB.
1. It boggles my mind that some database technology companies still don’t view compression as a major issue. Compression directly affects storage and bandwidth usage alike — for all kinds of storage (potentially including RAM) and for all kinds of bandwidth (network, I/O, and potentially on-server).
Trading off less-than-maximal compression so as to minimize CPU impact can make sense. Having no compression at all, however, is an admission of defeat.
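The less-than-maximal-compression tradeoff shows up with any general-purpose compressor. A quick Python sketch using zlib, whose level parameter trades CPU for ratio (the payload here is made up and highly repetitive; real columnar compressors do far better on real data):

```python
import zlib

# A synthetic, repetitive payload standing in for machine-generated data.
payload = b"timestamp=2013-01-01 status=200 path=/index.html " * 1000

fast = zlib.compress(payload, 1)  # minimal CPU, weaker compression
best = zlib.compress(payload, 9)  # more CPU, stronger compression

# Either setting beats storing the raw bytes -- "no compression at all
# is an admission of defeat."
assert len(fast) < len(payload)
assert len(best) < len(payload)
assert zlib.decompress(best) == payload
```

The right level depends on where the bottleneck is: storage and bandwidth argue for higher levels, CPU-bound workloads for lower ones.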
2. People tend to misjudge Hadoop’s development pace in either of two directions. An overly expansive view is to note that some people working on Hadoop are trying to make it all things to all people, and to somehow imagine those goals will soon be achieved. An overly narrow view is to note an important missing feature in Hadoop, and think there’s a big business to be made out of offering it alone.
At this point, I’d guess that Cloudera and Hortonworks have 500ish employees combined, many of whom are engineers. That allows for a low double-digit number of 5+ person engineering teams, along with a number of smaller projects. The most urgently needed features are indeed being built. On the other hand, a complete monument to computing will not soon emerge.
3. Schooner’s acquisition by SanDisk has led to the discontinuation of Schooner’s SQL DBMS SchoonerSQL. Schooner’s flash-optimized key-value store Membrain continues. I don’t have details, but the Membrain web page suggests both data store and cache use cases.
4. There’s considerable personnel movement at Boston-area database technology companies right now. Please ping me directly if you care.
5. I talked recently with Ashish Thusoo of Qubole. Qubole’s initial offering is a Hive-in-the-cloud, started by the guys who invented Hive. Qubole’s coolest new technical feature vs. generic Hive seems to be a disk-based columnar cache that lives with the servers, to help “smooth over the jitters” between Amazon EC2 and S3. Qubole company basics include:
- Founded last year.
- 15 early adopters, generally from mid-sized internet companies. Some of the adopters are already paying.
- 12 employees.
6. In my recent When I am a VC Overlord post, I wrote:
4. I will not fund any software whose primary feature is that it is implemented in the “cloud” or via “SaaS”. A me-too product on a different platform is still a me-too product.
5. I will not fund any pitch that emphasizes the word “elastic”. Elastic is an important feature of underwear and pajamas, but even in those domains it does not provide differentiation.
Cloud/SaaS deployments give you a chance at providing superior ease of use/installation/administration, without compromising functionality — but they don’t automatically guarantee it. It’s hard work to make your customers’ lives easier.*
*This is the second consecutive post in which I’ve used a similar line. I’ll try to stop now. What’s really scary is that I was inspired by the old Frank Perdue ad “It takes a tough man to make a tender chicken.”
7. Ofir Manor of EMC is skeptical about Oracle’s claims for Hybrid Columnar Compression. But he didn’t really dig up that much dirt, except that he seems to think 10X compression is more of a ceiling than the floor that Oracle marketing suggests it is. The money quote is:
Oracle used to provide 3x compression, now it provides 10x compression, so no wonder the best references customers are seeing about 3.4x savings…
That 3X is from Oracle’s Basic Compression, which seems to be a block-level dictionary scheme.
Code generation is most beneficial for queries that execute simple expressions, where the interpretation overhead is most pronounced. For example, a query that is doing a regular expression match over each row is not going to benefit much from code generation, because the interpretation overhead is low compared to the regex processing time.
Code generation may end up like compression — an architectural feature that DBMS just obviously should have.
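The interpretation-overhead point can be made concrete with a toy sketch. "Code generation" below just means compiling the expression to Python bytecode once instead of walking an expression tree for every row; real systems emit LLVM IR or machine code, but the tradeoff is the same shape.

```python
# Interpreted path: walk a tiny expression tree for every single row.
def interpret(expr, row):
    op, left, right = expr
    l = row[left] if isinstance(left, str) else left
    r = row[right] if isinstance(right, str) else right
    return l + r if op == "+" else l * r

# "Code generation": translate the expression to source text, compile it
# once, then evaluate the compiled code per row with no tree-walking.
def generate(expr):
    op, left, right = expr
    l = f"row['{left}']" if isinstance(left, str) else str(left)
    r = f"row['{right}']" if isinstance(right, str) else str(right)
    code = compile(f"{l} {op} {r}", "<generated>", "eval")
    return lambda row: eval(code, {}, {"row": row})

rows = [{"a": i, "b": 2 * i} for i in range(5)]
expr = ("+", "a", "b")
compiled = generate(expr)
assert [interpret(expr, r) for r in rows] == [compiled(r) for r in rows]
```

For a simple `a + b`, the per-row dispatch in `interpret` dominates; replace the addition with an expensive regex match and the dispatch cost becomes noise, which is exactly the quoted point.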
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- “Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start:
- There are many possibilities for the “right” way to manage analytic data. Generally, these are not the same as the “right” way to write the data, as that choice needs to be optimized for user experience (including performance), reliability, and of course cost.
- I.e., it is usually best to move data from where you write it to where you (at least in part) analyze it.
- Vendors who suggest they have a complete solution for getting data ready to be analyzed are … optimists.
- This specifically includes “magic data stores”, such as fast analytic RDBMS (on which I’m very bullish) or in-memory analytic DBMS (about which I’m more skeptical). They’re great starting points, but they’re not the whole enchilada.
- There are many ways to help with preparing data for analysis. Some of them are well-served by the industry. Some, however, are not.
1. There are many terms for all this. I once titled a post “Data that is derived, augmented, enhanced, adjusted, or cooked”. “Data munging” and “data wrangling” are in the mix too. And I’ve heard the term data preparation used several different ways.
2. Microsoft told me last week that the leading paid-for data products in their data-for-sale business are for data cleaning. (I.e., authoritative data to help with the matching/cleaning of both physical and email addresses.) Salesforce.com/data.com told me something similar a while back. This underscores the importance of data cleaning/data quality, and more generally of master data management.
Yes, I just said that data cleaning is part of master data management. Not coincidentally, I buy into the view that MDM is an attitude and a process, not just a specific technology.
3. Everybody knows that Hadoop usage involves long-ish workflows, in which data keeps getting massaged and written back to the data store. But that point is not as central to how people think about Hadoop as it probably should be.
4. One thing people have no trouble recalling is that Hadoop is a great place to dump stuff and get it out later. Depending on exactly what you have in mind, there are various metaphors for this, most of which have something to do with liquids. Most famous is “big bit bucket”, but “data refinery”, “data lake”, and “data reservoir” have also been used.
5. For years, DBMS and Hadoop vendors have bundled low-end text analytics capabilities rather than costlier state-of-the-art ones. I think that may be changing, however, mainly in the form of Attensity partnerships.
Truth be told, I’m not wholly current on text mining vendors — but when I last was, Attensity was indeed the best choice for such partnerships. And I’m not aware of any subsequent developments that would change that conclusion.
- Merv Adrian’s contrast between Hadoop and data integration tours some of the components of ETL suites. (February, 2013)
- Part of why analytic applications are usually incomplete is the set of issues discussed in this post.
- De-anonymization is an important — albeit privacy-threatening — way of making data more analyzable. (January, 2011)
- I updated my thoughts on Gartner’s Logical Data Warehouse concept earlier this month.
I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:
- How big is your database? How big is your budget?
- How do you feel about appliances?
- How do you feel about the cloud?
- What are the size and shape of your workload?
- How fresh does the data need to be?
Let’s drill down.
How big is your database? How big is your budget?
Taken together, these questions tell you which choices are even feasible. Does your database fit into RAM, at a price you can afford? Does it fit onto a single, perhaps large, server? If both answers are “No”, then you need a real scale-out system, querying disk or flash (which itself could be hard to afford). Otherwise, you have more options.
Note that database compression has a big influence on what fits where.
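The feasibility math above is just back-of-envelope arithmetic: divide the raw database size by the expected compression ratio and compare it to the memory or storage you can afford. A small sketch, where the 50% RAM headroom and all the example figures are illustrative assumptions, not benchmarks:

```python
def fits_in_ram(raw_tb, compression_ratio, node_ram_gb, node_count,
                headroom=0.5):
    # Leave room for the DBMS, OS, and working memory (assumed 50% here).
    usable_gb = node_ram_gb * node_count * (1 - headroom)
    compressed_gb = raw_tb * 1024 / compression_ratio
    return compressed_gb <= usable_gb

# 10 TB raw at 5x compression on a 16-node cluster with 256 GB RAM each:
assert fits_in_ram(10, 5.0, 256, 16)
# The same database with no compression does not fit:
assert not fits_in_ram(10, 1.0, 256, 16)
```

The same arithmetic, with different capacity and cost figures, applies to flash and disk tiers; the point is simply that the compression ratio moves the answer by a large multiple.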
How do you feel about appliances?
Depending on considerations such as database size, the choice of Oracle, Teradata, IBM Netezza, or Microsoft SQL Server may mandate or at least strongly suggest an appliance form factor. For most other analytic DBMS, an appliance is more optional. Are appliances good for you? Bad? Indifferent? Trade-offs include:
- Appliances often involve paying a premium for hardware purchase and/or support.
- Appliances often are easy(ier) to install and manage.
- Appliances are easier to upgrade in some ways (everything’s integrated), but harder in others (less ability to upgrade bottlenecked parts).
- Appliances often don’t play well in the cloud.
How do you feel about the cloud?
Analytic DBMS run better on good hardware and predictable bandwidth (hence all those appliances). These can be hard to find in the cloud. So, not coincidentally, can analytic DBMS cloud references, although most vendors can muster a few.
If you feel you need to run your analytic RDBMS in the cloud now, check references carefully. If you are only concerned about the cloud as some indefinite future possibility, then you might want to rule out a few appliance-only vendors, but otherwise you probably shouldn’t worry. Cloud hardware and networking are getting better, and RDBMS software vendors are gaining experience in cloud deployments.
What are the size and shape of your workload?
Different analytic databases can have very different kinds of workloads. Tasks include:
- Complex, long-running queries.
- Repetitive reports of varying degrees of complexity.
- Simple queries.
- Large, scheduled loads.
- Continuous or near-continuous/micro-batch loads.
The big issue is — how many of each kind of task need to be performed concurrently, and in what combinations? If you’re refreshing 10,000 dashboards, several hundred of which might be getting drill-down queries at once, while trying to do a few scan-heavy queries and some 15-way joins in the background, most analytic DBMS might disappoint you. (Indeed, I’d ask whether you might want to split up that work among two or more systems.) Different DBMS — and different hardware/storage/networking configurations — shine in different scenarios.
How fresh does the data need to be?
Any serious analytic DBMS can be loaded daily or hourly, edge cases perhaps excepted. In most cases 15-minute load intervals work as well, or even 5-minute ones, but check whether those load latencies would interfere with any performance optimizations. But if you want sub-second data freshness — or even just several-second — well, that has to be a top-tier architectural issue.
If your analytics are simple enough, it’s appealing to do the immediate-response ones straight from your transactional database. If not, you may need some kind of streaming-replication setup. Usually, I wind up recommending replication approaches that don’t yet have a lot of maturity or references. Tread carefully here.
The following blog post comes from Fishbowl Senior Software Consultant Alan Mackenthun. Alan is Fishbowl’s resident records management expert and has been architecting such systems for over nine years. In working with a WebCenter customer, Alan was able to propose a solution that will enable the customer to configure WebCenter so that a group of users can be dynamically assigned to review dispositions. This isn’t a well-documented feature so we wanted to share it with the rest of the WebCenter community.
At its core, records management is the management of the destruction of content when it’s no longer needed. Usually, business processes dictate that someone review the content and approve destruction before the content is permanently deleted. Out of the box, you could either assign a specific user as a reviewer on a retention step or allow the default system reviewer alias to review dispositions, but there was no way to assign a group of users or to dynamically assign users.
Assigning a specific user may work in smaller organizations but even then, if a specific user is assigned and then they go on vacation or leave the company, all related disposition rules would have to be found and updated. It was very difficult to make this work in a larger organization where document owners could be spread among separate business units or departments.
With the enhancement documented in the TKB referenced below, you can easily reference an alias in disposition rules. To do so simply enter:
as the reviewer where “<my alias>” is the name of the alias you’d like to reference. The real benefit here is that if you have Departmental Record Coordinators (DRCs) who review content in certain categories scheduled for destruction (disposition), you can assign the alias rather than named users. Then if the DRC changes, the client only needs to update the alias, rather than all categories where that DRC was referenced.
Additionally, leveraging the ability to reference a script function gives you much more power. Some categories of content, such as correspondence or memos, span all business units and departments, so there isn’t one person or group in an organization who should be approving the destruction of this content. Instead, this feature allows you to reference a script function that takes the value of a business unit and/or department metadata field and maps that organizational unit to the user or alias who should be assigned as the reviewer. To do so simply enter:
as the reviewer where “<myscript>” is the name of the custom IdocScript function you’d like to reference (of course we at Fishbowl would be happy to help implement such a function if needed).
Oracle Support Document 1470906.1 (How to Request Approval Notification for a Group of People for a Disposition Action) can be found at: https://support.oracle.com/epmos/faces/DocumentDisplay?id=1470906.1
The post How to Assign a Group of People to a Disposition Action using Oracle WebCenter appeared first on C4 Blog by Fishbowl Solutions.
Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations
To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.
Gartner seems confused about Kognitio’s products and history alike.
- Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
- Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
- Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
- Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
- Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.
Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.
In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management’s facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica:
- I think HP’s troubles are less relevant to HP Vertica than Gartner does.
- In particular, Vertica’s lack of integration with Autonomy isn’t a big deal. Many relational DBMS vendors don’t even own a text search engine to not-integrate with, and the number of vendors with seriously effective analytic RDBMS/text search integration strategies is zero.
- Gartner is correct to note that Vertica’s integration with the rest of HP, for example the hardware side, has been slow — but again, so what?
- Gartner correctly praises Vertica’s analytic platform capabilities, but then seems to criticize Vertica’s capabilities in user-defined functions — notwithstanding that Vertica’s analytic platform capabilities are implemented via UDFs.
- Gartner seems to criticize Vertica’s “volume credentials”, even though Vertica’s number of petabyte-scale analytic RDBMS customers may be second only to Teradata’s.
That said, I defer to Gartner’s opinion that HP Vertica’s sales momentum has disappointed, even if against higher expectations than one might have for vendors with 1/10 of Vertica’s installed base.
Two years ago, I simply said “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” This year’s Gartner Magic Quadrant for Data Warehouse Database Management isn’t that bad on the subject of Actian,* but it’s not great either. Writing mainly about Actian’s VectorWise, Gartner dings it for both features and bugginess, and correctly notes that VectorWise is only suitable for fairly small data warehouses. But Gartner gives VectorWise higher marks than Exasol even so. Gartner also writes that VectorWise has a “long tradition of having loyal supporters”, notwithstanding that VectorWise’s initial release was less than 3 years ago.
*Ingres’ new name, in honor of a 2011 pivot that seems to already have been deprecated
What the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management says about ParAccel isn’t too different on the facts from what I wrote in December, but Gartner is more enthused than I am. Basically:
- ParAccel is playing catch-up in features and company stability, and Gartner somehow sees that as a strength.
- Gartner dings ParAccel for a variety of product weaknesses.
- Gartner is breathless about ParAccel being used on a MicroStrategy SaaS site. (Hint: Deals like that go to vendors willing to accept very low prices.)
- Gartner is also impressed with ParAccel’s Amazon involvement. (Deals like that aren’t lucrative either, except insofar as Amazon bought some ParAccel stock.)
That Gartner ranks ParAccel ahead of HP Vertica baffles me. Perhaps Gartner views benchmarks as more significant than I do, or is otherwise judging ParAccel to have an important advantage in performance.
I also am in substantial agreement with the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management about Infobright facts, but this time I’m the one with the more favorable interpretation. I agree that Infobright is a bit limited in features, in the areas Gartner cites and in analytic platform capabilities as well. Even so, Infobright is far ahead of VectorWise in — also low-priced — sales, and ahead in product stability and features too. Yet Gartner gives Infobright vastly lower marks than Actian. I suspect that the essence of our disagreement is that Gartner sees Infobright’s focus on machine-generated data as something that “limits market expansion”, while I see machine-generated data as something that is by every measure* growing to be a majority of the whole.
*By raw volume that’s been true for a while. But if we adjust for value and so on, the crossover is arguably still a way off.
I’ll run through the other vendors cited in the 2012 Gartner Magic Quadrant for Data Warehouse Database Management more quickly, in approximate declining order of Gartner’s rankings.
- Gartner loves Teradata, but has some concerns over TCO (Total Cost of Ownership). Makes sense to me.
- Gartner is more impressed with Oracle’s technology than I am. I find it hard to take seriously a data warehouse RDBMS vendor that can’t deliver a true columnar storage option.
- Gartner’s write-up of IBM gets lost in IBM’s vast sea of products. I sympathize.
- Gartner’s discussion of SAP/Sybase IQ was overly brief. But given my own difficulties staying up to speed on my clients over there, I sympathize. Please stay tuned.
- Gartner’s discussion of Microsoft gets overly caught up in “logical data warehouse” foofarah, but basically it makes sense. Pending some promised briefings, I’m more optimistic about Microsoft’s analytic DBMS offerings than I’ve been for a long time. Please stay tuned.
- Gartner notes difficulties validating EMC Greenplum’s customer claims. I sympathize. Gartner also notes a bunch of product issues that make me wonder why EMC Greenplum’s overall rating isn’t even lower.
- Gartner’s view of Exasol seems similar to mine.
- I’d evaluate 1010data on the basis of its spreadsheet-like analytic tools, not its DBMS technology.
- Gartner seems to have difficulty finding non-trivial “Strengths” for Calpont. I sympathize.
- Gartner notes difficulties contacting SAND. I sympathize, since SAND’s senior management resigned en masse during the July, 2012 quarter. (Page 30 of that link.)
The 2012 Gartner Magic Quadrant for Data Warehouse Database Management Systems is out. I’ll split my comments into two posts — this one on concepts, and a companion on specific vendor evaluations.
- Maintaining working links to Gartner Magic Quadrants is an adventure. But as of early February, 2013, this link seems live.
- I also commented on the 2011, 2010, 2009, 2008, 2007, and 2006 Gartner Magic Quadrants for Data Warehouse DBMS.
Let’s start by again noting that I regard Gartner Magic Quadrants as a bad use of good research. On the facts:
- Gartner collects a lot of input from traditional enterprises. I envy that resource.
- Gartner also does a good job of rounding up vendor claims about user base sizes and the like. If nothing else, you should skim the MQ report for that reason.
- Gartner observations about product feature sets are usually correct, although not so consistently that they should be relied on.
When it comes to evaluations, however, the Gartner Data Warehouse DBMS Magic Quadrant doesn’t do as well. My concerns (which overlap) start:
- The Gartner MQ conflates many different use cases into one ranking (inevitable in this kind of work, but still regrettable).
- A number of the MQ vendor evaluations seem hard to defend. So do some of Gartner’s specific comments.
- Some of Gartner’s criteria seemingly amount to “parrots back our opinions to us”.
- Gartner thinks, as do I, that a vendor’s business and financial strength are important. But Gartner overdoes the matter, drilling down into picky issues it can’t hope to judge, such as assessing a vendor’s “ability to generate and develop leads.” *
- The 2012 Gartner Data Warehouse DBMS Magic Quadrant is closer to being a 1-dimensional ranking than 2-dimensional, in that entries are clustered along the line x=y. This suggests strong correlation among the results on various specific evaluation criteria.
*I may focus more on marketing communications strategy than the whole Gartner database research team combined — but the only way I’d know whether Teradata’s lead gen is better than HP Vertica’s or vice-versa would be if both vendors happened to raise the matter during consulting sessions.
Specific product feature areas Gartner seems to emphasize include:
- Alignment with a “logical data warehouse” strategy.
- Analytic platform features.
- Administrative tools, including workload management.
- “Self-tuning” performance.
- Scale-out capabilities.
Most of this makes sense. But Gartner has been talking about the “logical data warehouse” for a long time without ever seeming to firm up what it is, as evidenced for example by some dueling summaries of the concept. So let’s drill down on the LDW.
I think “logical data warehouse” will wind up like “master data management” — i.e., it will be a goal and a business process, aided but not subsumed by some characteristic software. Beyond that, I’d say that generic, functional, high-performance data federation* software is a pipedream — building it would be as hard as building the mythical single DBMS that gives great functionality and performance, in all use cases, for all kinds of data. Just as DBMS need to be at least somewhat specialized in purpose, data federation software needs to be as well.
*While I disapprove, data virtualization seems to be the term that will win for describing data federation.
When Gartner refers to the “logical data warehouse” capabilities of analytic RDBMS — and the first sentence of the MQ report indeed specifies that the subject is “relational database management systems” — it seems to be looking for two things:
- Built-in data federation/query routing capabilities; i.e., specific features that help the DBMS interoperate with other data stores. But there seems to be little reference to relational federation/external tables (which many vendors support) or text federation (which vendors with built-in search could support — although that would mainly be Oracle, and its search is slow). Rather, this part of LDW is currently all about Hadoop interoperability, with bonus points for mentioning HCatalog.
- Management of multi-structured data. But with limited exceptions, nobody’s doing that well in an analytic RDBMS. And even when they do, that’s pretty much the opposite of the federation that the rest of the logical data warehouse concept seems to be about.
For those and other reasons, referring to the “logical data warehouse” features of an analytic RDBMS is problematic. I imagine Gartner will keep working at the “logical data warehouse” concept until it is more successfully fleshed out. But little weight should be placed on Gartner’s LDW-feature-evaluations of analytic RDBMS at this time.
In typical debates, the extremists on both sides are wrong. “SQL vs. NoSQL” is an example of that rule. For many traditional categories of database or application, it is reasonable to say:
- Relational databases are usually still a good default assumption …
- … but increasingly often, the default should be overridden with a more useful alternative.
Reasons to abandon SQL in any given area usually start:
- Creating a traditional relational schema is possible …
- … but it’s tedious or difficult …
- … especially since schema design is supposed to be done before you start coding.
Some would further say that NoSQL is cheaper, scales better, is cooler or whatever, but given the range of NewSQL alternatives, those claims are often overstated.
Sectors where these reasons kick in include but are not limited to:
- Retailing, especially online. Different kinds of products have different kinds of attributes, making a Grand Cosmic Schema rather complex. Examples I’ve blogged about include:
- Amazon relied on an in-memory object-oriented DBMS for its used books inventory lookup back in 2005.
- A Microsoft customer managed book and DVD inventory in XML the same year.
- More recently, 10gen spoke of a wireless telco offering cell phones and service plans in the same product catalog, built over MongoDB.
- Human resources. Employee-centric applications are naturally full of hierarchies, which can be annoying to flatten. Non-relational approaches I’ve blogged about include Workday’s object model and Neo4j’s graph-based contribution.
- Web log analysis. Web logs can be particularly hard to flatten, as per my post on (that sense of) nested data structures.
- More generally, marketing and other applications that maintain detailed profiles of customers or prospects. The information in these profiles is often based on a large variety of marketing campaigns, third-party databases, and analytic exercises. As the inputs pile up, the schemas get ever hairier.
- Electronic medical records. Medical records are one area where non-relational approaches may actually have majority share. I blogged about one example in 2008.
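One common middle ground for catalogs like the telco example, sketched here with Python's stdlib `sqlite3` and `json` and entirely invented table and column names, is to keep a thin relational spine (SKU, product kind) and push the type-specific attributes into a JSON payload, so cell phones and service plans can share one table without a Grand Cosmic Schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (sku TEXT PRIMARY KEY, kind TEXT, attrs TEXT)")

# Different kinds of products, different kinds of attributes.
rows = [
    ("PH-1", "phone", json.dumps({"maker": "ExampleCo", "storage_gb": 64})),
    ("PL-1", "plan", json.dumps({"minutes": 500, "contract_months": 12})),
]
conn.executemany("INSERT INTO catalog VALUES (?, ?, ?)", rows)

# The relational spine still supports ordinary SQL filtering...
cur = conn.execute("SELECT sku, attrs FROM catalog WHERE kind = 'plan'")
# ...while the JSON payload is decoded per row in the application.
plans = {sku: json.loads(attrs) for sku, attrs in cur}
print(plans["PL-1"]["minutes"])  # → 500
```

This buys schema flexibility at the cost of the DBMS no longer understanding (or indexing) the attributes inside the payload.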
Or to quote a 2008 post,
Conor O’Mahony, marketing manager for IBM’s DB2 pureXML, talks a lot about one of my favorite hobbyhorses — schema flexibility* — as a reason to use an XML data model. In a number of industries he sees use cases based around ongoing change in the information being managed:
- Tax authorities change their rules and forms every year, but don’t want to do total rewrites of their electronic submission and processing software.
- The financial services industry keeps inventing new products, which don’t just have different terms and conditions, but may also have different kinds of terms and conditions.
- The same, to some extent, goes for the travel industry, which also keeps adding different kinds of offers and destinations.
- The energy industry keeps adding new kinds of highly complex equipment it has to manage.
Conor also thinks market evidence shows that XML’s schema flexibility is important for data interchange. For example, hospitals (especially in the US) have disparate medical records and billing systems, which can make information interchange a chore.
*I now call that dynamic schemas.
So, for fear of Frankenschemas, should we flee from RDBMS altogether? Hardly. For social proof, please note:
- Every application area I’ve cited can be and often is handled via relational techniques.
- Some of the non-relational alternatives I’ve mentioned, such as XML or object-oriented DBMS, haven’t enjoyed a lot of traction.
- Even the most successful NoSQL vendors are tiny when compared to the relational behemoths.
More conceptually, I’d say that the advantages of a relational DBMS start:
- In theory and practice alike, with normalization and joins.
- In theory and practice alike, with loose coupling between your database design and your application. (I think that’s a cleaner way of saying it than to focus on “reusing” the database, but it amounts to the same thing.)
- In practice, with performance and functionality in anything that uses indexes, even if joins aren’t involved.
- In practice, with maturity and functionality in general.
Those aren’t chopped liver.
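To make the normalization-and-joins point concrete, here is a tiny sketch (invented tables, stdlib `sqlite3`) in which each customer's name lives in exactly one place, so an order query picks it up by join rather than by duplicating it on every order row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    -- An index on the join key: the "in practice" performance point.
    CREATE INDEX orders_by_customer ON orders (customer_id);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 250.0), (11, 1, 99.5)])

# The name is stored once; every order finds it via the join.
row = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchone()
print(row)  # → ('Acme Corp', 349.5)
```

Renaming the customer is one UPDATE to one row, and any application querying these tables sees the change; that is the loose coupling in miniature.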
I’ve hacked both the PHP and CSS that drive this website. But if I had to write PHP or CSS from scratch, I literally wouldn’t know how to begin.
Something similar, I suspect, is broadly true of “business analysts.” I don’t know how somebody can be a competent business analyst without being able to generate, read, and edit SQL. (Or some comparable language; e.g., there surely are business analysts who only know MDX.) I would hope they could write basic SELECT statements as well.
But does that mean business analysts are comfortable with the fancy-schmancy extended SQL that the analytic platform vendors offer them? I would assume that many are but many others are not. And thus I advised such a vendor recently to offer sample code, and lots of it: dozens or hundreds of isolated SQL statements, each of which does a specific task.* A business analyst could reasonably be expected to edit any of those to point them at his own actual databases, even though he can’t necessarily be expected to easily write such statements from scratch.
*Actually, the vendor is Teradata Aster. After I showed them a draft of this post, they indicated that it’s OK to use their name in the post, and they fondly think they’re already doing what I suggest in their current product.
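As a hedged illustration of what one entry in such a sample-code library might look like (every table and column name here is invented, and this is plain SQL, not any vendor's extended syntax), the snippet does exactly one task and flags what the analyst is expected to swap out:

```python
import sqlite3

# One entry from a hypothetical sample-code library: a self-contained
# statement plus a note on which identifiers to replace.
SAMPLE = """
-- Task: top spenders. Replace my_orders / customer_id / amount
-- with your own table and column names.
SELECT customer_id, SUM(amount) AS total
FROM my_orders
GROUP BY customer_id
ORDER BY total DESC
LIMIT 3
"""

# Smoke-test the sample against a toy table, as a vendor might
# before shipping it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO my_orders VALUES (?, ?)",
                 [(1, 5.0), (2, 20.0), (1, 10.0)])
top = conn.execute(SAMPLE).fetchall()
print(top)  # → [(2, 20.0), (1, 15.0)]
```

The comment header is doing real work: it turns a statement the analyst couldn't write from scratch into one he can safely edit.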
Similar thoughts apply to other software domains. If one of your selling points is some variant on “ease of development”, yet it’s difficult for you to supply generous amounts of sample code, then probably either:
- You’re really doing a great job at visual programming, point-and-click, or some other code-free paradigm. Congratulations!
- Or your product isn’t as easy to program for as you hope.
- Or you’re so confused as to what your product is used for that you can’t imagine what kinds of sample code to whip up.
Please note that these are not exclusive ORs.
I’m not suggesting “app stores where users can post and sell — or give away — their own apps”. Those may be good ideas (although probably not as good as you think), but they miss the point. You need to do the basic work yourself. Or, if it’s a big expensive deal for you to do the work, then you should make your product more usable. For if it’s hard for YOU to program in your technology, why would somebody else pay you so that they may have the privilege of doing so?