Curt Monash

Choices in data management and analysis

Introduction to Deep Information Sciences and DeepDB

Sat, 2013-04-13 22:33

I talked Friday with Deep Information Sciences, makers of DeepDB. Much like TokuDB — albeit with different technical strategies — DeepDB is a single-server DBMS in the form of a MySQL engine, whose technology is concentrated around writing indexes quickly. That said:

  • DeepDB’s indexes can help you with analytic queries; hence, DeepDB is marketed as supporting OLTP (OnLine Transaction Processing) and analytics in the same system.
  • DeepDB is marketed as “designed for big data and the cloud”, with reference to “Volume, Velocity, and Variety”. What I could discern in support of that is mainly:
    • DeepDB has been tested at up to 3 terabytes at customer sites and up to 1 billion rows internally.
    • Like most other NewSQL and NoSQL DBMS, DeepDB is append-only, and hence could be said to “stream” data to disk.
    • DeepDB’s indexes could at some point in the future be made to work well with non-tabular data.*
    • The Deep guys have plans and designs for scale-out — transparent sharding and so on.

*For reasons that do not seem closely related to product reality, DeepDB is marketed as if it supports “unstructured” data today.
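
To make the append-only point concrete, here is a minimal sketch of a store that never overwrites anything and simply appends new versions to a log. It is a toy under stated assumptions, not DeepDB's design: the class name `AppendOnlyStore`, the in-memory list standing in for a disk file, and the sequence-number scheme are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AppendOnlyStore:
    """Toy append-only key-value store (hypothetical; not DeepDB's actual design)."""

    # Each entry is (sequence number, key, value); a value of None marks a delete.
    log: List[Tuple[int, str, Optional[str]]] = field(default_factory=list)

    def put(self, key: str, value: Optional[str]) -> int:
        """Append a new version of `key` and return its sequence number."""
        seq = len(self.log) + 1
        self.log.append((seq, key, value))
        return seq

    def get(self, key: str, as_of: Optional[int] = None) -> Optional[str]:
        """Return the newest value of `key` whose sequence number is <= as_of."""
        limit = len(self.log) if as_of is None else as_of
        for _seq, k, v in reversed(self.log[:limit]):
            if k == key:
                return v
        return None


store = AppendOnlyStore()
first = store.put("row:1", "draft")
store.put("row:1", "final")
current = store.get("row:1")                # newest version
earlier = store.get("row:1", as_of=first)   # older version is still readable
```

Because old versions are never destroyed, reading "as of" an earlier sequence number comes almost for free; the price is that the log grows until something compacts it.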

Other NewSQL DBMS seem “designed for big data and the cloud” to at least the same extent DeepDB is. However, if we’re interpreting “big data” to include multi-structured data support — well, only half or so of the NewSQL products and companies I know of share Deep’s interest in branching out. In particular:

Edit: MySQL has some sort of an optional NoSQL interface, and hence so presumably do MySQL-compatible TokuDB, GenieDB, Clustrix, and MemSQL.

Also, some of those products do not today have the transparent scale-out that Deep plans to offer in the future.

Among the 10 people listed as part of Deep Information Sciences’ team, I noticed 2 who arguably had DBMS industry experience, in that they worked at virtualization vendor Virtual Iron, and stayed on for a while after Virtual Iron was bought by Oracle. One of them, Chief Scientist & Architect Tom Hazel, also was at Akiban for a few months, where he did actually work on a DBMS. Other Deep Information Sciences notes include:

  • Deep has 25 or so people in all.
  • Deep had a recent $10 million funding round.
  • Deep Information Sciences is the former Cloudtree, which as of February, 2011 was pursuing quite a different strategy. (Evidently there was a pivot.) Deep was founded in 2010.
  • There are 2 paying customers for DeepDB, even though it’s still in beta, and 8 trials. A similar number of trials and strategic partners are queued up.
  • DeepDB general availability is expected later this quarter.

Although our call was blessedly technical, we didn’t have a chance to go through the DeepDB architecture in great detail. That said, DeepDB seems to store data in all of 3 ways:

  • An in-memory row store.
  • An on-disk row store with a very different architecture.
  • Indexes, which can also serve as a column store.

Notes on that include:

  • DeepDB’s in-memory row store is designed to manage single rows as much as possible, rather than pages. Indeed, there are “aspects of tries”, although we didn’t drill down into what exactly that meant.
  • Indexes are streamed to disk at least once every 15 seconds, by default, and perhaps with latency as low as 10 milliseconds.
  • Perhaps the most important point I didn’t grasp is “segments”. The data and indexes on disk are stored in segments, which can be of different sizes, and which may each carry some summary data/metadata/whatever. Somehow, this is central to DeepDB’s design.
  • In what is evidently a design focus, DeepDB tries to get the benefit of “in-memory data” that isn’t actually taking up RAM. B-trees can point at rows that aren’t actually in memory. Segments evicted from cache can leave some metadata or summary data behind.
  • DeepDB’s compression story seems to be a work in progress.
    • There’s prefix compression already, at least in the indexes, which Deep just calls “compaction”.
    • Other compression is working in the lab, but not scheduled for Version 1.0.
      • Block compression seems to be in play.
      • Delta compression was mentioned once.
      • Dictionary compression wasn’t mentioned at all.
    • DeepDB apparently will keep compressed data in cache, then decompress it to operate on it.
    • Different segments can be compressed/uncompressed differently.
  • DeepDB’s on-disk row store is append-only. Time-travel is being worked on. While I forgot to ask, it seems likely that DeepDB has MVCC (Multi-Version Concurrency Control). :)
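
For readers unfamiliar with prefix compression of index keys, the general idea can be sketched as follows. This illustrates the generic technique only; how Deep's "compaction" actually works was not covered, and the function names and sample keys are made up for illustration.

```python
from typing import List, Tuple


def prefix_compress(sorted_keys: List[str]) -> List[Tuple[int, str]]:
    """Encode each key as (length of prefix shared with the previous key, suffix)."""
    out: List[Tuple[int, str]] = []
    prev = ""
    for key in sorted_keys:
        n = 0
        limit = min(len(prev), len(key))
        while n < limit and prev[n] == key[n]:
            n += 1
        out.append((n, key[n:]))
        prev = key
    return out


def prefix_decompress(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the original keys by reusing the prefix of the previous key."""
    keys: List[str] = []
    prev = ""
    for n, suffix in encoded:
        key = prev[:n] + suffix
        keys.append(key)
        prev = key
    return keys


keys = ["customer:1001", "customer:1002", "customer:1100", "order:17"]
encoded = prefix_compress(keys)
# encoded == [(0, "customer:1001"), (12, "2"), (10, "100"), (0, "order:17")]
restored = prefix_decompress(encoded)
```

The scheme pays off in indexes precisely because index keys are kept sorted, so neighbouring keys tend to share long prefixes.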

And finally: DeepDB in its current form is a “drop-in” InnoDB replacement, but not necessarily bug-compatible.

Appliances, clusters and clouds

Sat, 2013-03-23 23:05

I believe:

  • The trend to clustered computing is sustainable.
  • The trend to appliances is also sustainable.
  • The “single” enterprise cluster is almost as much of a pipe dream as the single enterprise database.

I shall explain.

Arguments for hosting applications on some kind of cluster include:

  • If the workload requires more than one server — well, you’re in cluster territory!
  • If the workload requires less than one server — throw it into the virtualization pool.
  • If the workload is uneven — throw it into the virtualization pool.

Arguments specific to the public cloud include:

  • A large fraction of new third-party applications are SaaS (Software as a Service). Those naturally live in the cloud.
  • Cloud providers have efficiencies that you don’t.

That’s all pretty compelling. However, these are not persuasive reasons to put everything on a SINGLE cluster or cloud. They could as easily lead you to have your VMware cluster and your Exadata rack and your Hadoop cluster and your NoSQL cluster and your object storage OpenStack cluster — among others — all while participating in several different public clouds as well.

Why would you not move work into a cluster at all? First, if it ain’t broken, you might not want to fix it. Some of the cluster options make it easy for you to consolidate existing workloads (that’s a central goal of VMware and Exadata), but others only make sense to adopt in connection with new application projects. Second, you might just want device locality. I have a gaming-class PC next to my desk; it drives a couple of monitors; I like that arrangement. Away from home I carry a laptop computer instead. Arguments can be made for small remote-office servers as well.

To put all that more simply:

  • Moving existing applications to new platforms often isn’t worth the trouble.
  • Many needs can be best met by single, physically local devices.

Appliances are a natural form factor for single-purpose computing. It is reasonable to characterize as “appliances” — in the computing sense of the term — medical equipment, vehicles, cash machines, cash registers, enterprise security devices, home entertainment, exercise machines and, yes, refrigerators; computers, in some form, can be found almost anywhere. But appliances also are a convenient way to package enterprise systems — configurations will be correct, installation will be simpler, and fortunate software-centric appliance vendors may capture margins on hardware sales and support. And the idea of SaaS-like continuous updates to your enterprise systems seems much more reasonable in the case of a locked-down appliance-like configuration.

Circling back to the beginning, I’d say there are multiple reasons not to expect all your computing to be done on a single cluster:

  • You might want to use appliances that don’t fit into that cluster.
  • You might want to use SaaS offerings that don’t fit into that cluster.
  • The efficiency gains from using a single cluster aren’t that much greater than the gains from using a few of them.
  • You might want different parts of your computing work to be done in-house and in the public cloud.
  • You might want different parts of your data to be kept in different countries.
  • Different kinds of work might fit better onto differently-configured nodes, and current cloud/cluster technology doesn’t do a wonderful job with heterogeneity.
  • A lot of computing is so inherently small and local that it shouldn’t be clustered at all. :)

Ceteris paribus, fewer clusters are better than more of them. But all things are not equal, and it’s not reasonable to try to reduce your clusters to one — not even if that one is administered with splendid efficiency by low-cost workers, in a low-cost building, drawing low-cost electric power, in a low-cost part of the world.

One database to rule them all?

Wed, 2013-02-20 23:52

Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times. :)

Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:

  • Every level of storage (disk, RAM, etc.).
  • Indexes, aggregates and raw data alike.

To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to:

  • Big blocks of data compress better, and can also be faster to retrieve than a number of smaller blocks holding the same amount of data. Small blocks of data can be less wasteful to write. And different kinds of storage have different minimum block sizes.
  • Operating on compressed data offers multiple significant efficiencies. But you have to spend cycles (de)compressing it, and it’s only practical for some compression schemes.
  • Fixed-length tabular records can let you compute addresses rather than looking them up in indexes. Yay! But they also waste space.
  • Tokenization can help with the fixed-/variable-length tradeoff.
  • Pointers are wonderfully efficient for some queries, at least if you’re not using spinning disk. But they can create considerable overhead to write and update.
  • Indexes, materialized views, etc. speed query performance, but can be costly to write and maintain.
  • Storing something as a BLOB (Binary Large OBject), key-value payload, etc. is super-fast — but if you want to look at it, you usually have to pay for retrieving the whole thing.
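
Two of the tradeoffs above, fixed-length records and tokenization, can be sketched together. The record layout, the `TokenDictionary` class, and the sample rows are all hypothetical; the point is only that a fixed record size turns "find row n" into arithmetic, and that tokens let variable-length strings fit into fixed-width slots.

```python
import struct
from typing import Dict, Iterable, List, Tuple

# Hypothetical layout: 4-byte id, 4-byte city token, 8-byte float amount (16 bytes total).
RECORD = struct.Struct("<IId")


class TokenDictionary:
    """Maps variable-length strings to fixed-width integer tokens."""

    def __init__(self) -> None:
        self._token_of: Dict[str, int] = {}
        self._value_of: List[str] = []

    def encode(self, value: str) -> int:
        if value not in self._token_of:
            self._token_of[value] = len(self._value_of)
            self._value_of.append(value)
        return self._token_of[value]

    def decode(self, token: int) -> str:
        return self._value_of[token]


def write_rows(rows: Iterable[Tuple[int, str, float]], cities: TokenDictionary) -> bytes:
    """Pack every row into the same fixed-width binary layout."""
    return b"".join(
        RECORD.pack(rid, cities.encode(city), amount) for rid, city, amount in rows
    )


def read_row(buf: bytes, n: int, cities: TokenDictionary) -> Tuple[int, str, float]:
    """Fetch row n; its address is computed (n * record size), not looked up in an index."""
    rid, token, amount = RECORD.unpack_from(buf, n * RECORD.size)
    return rid, cities.decode(token), amount


cities = TokenDictionary()
buf = write_rows([(1, "Boston", 12.5), (2, "Acton", 3.0), (3, "Boston", 7.25)], cities)
third = read_row(buf, 2, cities)
```

The repeated city name is stored once in the dictionary and as a 4-byte token in each record, at the cost of an extra lookup on read and a dictionary that must be maintained on write.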

What’s more, different data layouts can have different implications for logging, locking, replication, backup and more.

So what would happen if somebody tried to bundle all conceivable functionality into a single DBMS, with a plan to optimize the layout of any particular part of the database as appropriate? I think the outcome would be tears – for the development effort would be huge, while the benefits would be scanty. The most optimistic cost estimates could run in the 100s of millions of dollars, with more realistic ones adding a further order of magnitude. But no matter what the investment, the architects would be on the horns of a nasty dilemma:

  • If there’s much commonality among the component DBMS, each one would be sub-optimal.
  • If there’s little commonality among them, then there’s also little benefit to the combination.

Adding insult to injury, all the generality would make it hard to select optimum hardware for this glorious DBMS — unless, of course, a whole other level of development effort made it work well across very heterogeneous clusters.

Less megalomaniacally, there have been many attempts to combine two or more alternate data layouts in a single DBMS, with varying degrees of success. In the relational-first world:

  • Analytic DBMS have combined row and column data models so fluidly that I’ve made fun of Oracle for not being able to pull it off. SAP HANA sort of does the same thing, but perhaps with a columnar bias, and not just for analytics.
  • Relational DBMS can also have a variety of index types, suitable for different relational use cases. This is especially true for analytic uses of general-purpose RDBMS.
  • Oracle, DB2, PostgreSQL, and Informix have had full extensibility architectures since the 1990s. That said:
    • Almost all the extensions come from the DBMS vendors themselves.
    • Extensions that resemble (or are) a tabular datatype — for example geospatial or financial-date — are often technically well-regarded.
    • Others are usually not so strong technically, but in a few cases sell well anyway (e.g. Oracle Text).
    • While Microsoft never went through the trouble of offering full extensibility, otherwise the SQL Server story is similar.
    • Sybase’s extensibility projects went badly in the 1990s, and Sybase doesn’t seem to have tried hard in that area since.
  • IBM DB2, Microsoft SQL Server, and Oracle added XML capabilities around the middle of the last decade.
  • Analytic platforms can wind up with all sorts of temporary data structures.
  • Analytic DBMS have various ways to reach out and touch Hadoop.

Further:

  • Non-relational DBMS commonly have indexes that at least support relational-like SELECTs. JOINs can be more problematic, but MarkLogic finally has them. Tokutek even offers a 3rd-party indexing option for MongoDB.
  • Hadoop is growing into what in effect is a family of DBMS and other data stores: generic HDFS, HBase, generic Hive, Impala, and so on. At the moment, however, none of them is very mature. BDAS/Spark/Shark ups the ante further, but of course that’s less mature yet.
  • Hadapt combines Hadoop and PostgreSQL.
  • DataStax combines Cassandra, Hadoop, and Solr.
  • Akiban fondly thinks its data layouts are well-suited for relational tables, JSON, and XML alike. (But business at Akiban may be in flux.)
  • GenieDB (Version 1 only) and NuoDB are both implemented over key-value stores. GenieDB Version 2 is implemented over Berkeley DB or MySQL.
  • Membase/Couchbase was first implemented over SQLite, then over (a forked version of) CouchDB.

Notes and links, February 17, 2013

Sun, 2013-02-17 21:54

1. It boggles my mind that some database technology companies still don’t view compression as a major issue. Compression directly affects storage and bandwidth usage alike — for all kinds of storage (potentially including RAM) and for all kinds of bandwidth (network, I/O, and potentially on-server).

Trading off less-than-maximal compression so as to minimize CPU impact can make sense. Having no compression at all, however, is an admission of defeat.
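
The CPU-versus-ratio tradeoff is easy to observe with a general-purpose compressor. The sketch below uses Python's standard `zlib` module on synthetic, log-like data; the sample data and the `compare_levels` helper are invented for illustration, and real DBMS compression schemes are usually more specialized than this.

```python
import time
import zlib
from typing import Dict, Iterable, Tuple


def compare_levels(data: bytes, levels: Iterable[int] = (1, 6, 9)) -> Dict[int, Tuple[int, float]]:
    """Return {level: (compressed size in bytes, seconds spent compressing)}."""
    results: Dict[int, Tuple[int, float]] = {}
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        if zlib.decompress(packed) != data:
            raise RuntimeError("round trip failed")
        results[level] = (len(packed), elapsed)
    return results


# Synthetic machine-generated-style data: repetitive log lines (hypothetical format).
sample = b"".join(
    b"2013-02-17 21:54:%02d GET /item/%d 200\n" % (i % 60, i % 500)
    for i in range(20000)
)

for level, (size, seconds) in compare_levels(sample).items():
    print(f"level {level}: {size} bytes ({size / len(sample):.1%} of raw), {seconds:.4f}s")
```

Timings vary by machine, so no specific numbers are claimed here; the usual pattern is that lower levels give up some compression ratio in exchange for less CPU time.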

2. People tend to misjudge Hadoop’s development pace in either of two directions. An overly expansive view is to note that some people working on Hadoop are trying to make it be all things for all people, and to somehow imagine those goals will soon be achieved. An overly narrow view is to note an important missing feature in Hadoop, and think there’s a big business to be made out of offering it alone.

At this point, I’d guess that Cloudera and Hortonworks have 500ish employees combined, many of whom are engineers. That allows for a low double-digit number of 5+ person engineering teams, along with a number of smaller projects. The most urgently needed features are indeed being built. On the other hand, a complete monument to computing will not soon emerge.

3. Schooner’s acquisition by SanDisk has led to the discontinuation of Schooner’s SQL DBMS SchoonerSQL. Schooner’s flash-optimized key-value store Membrain continues. I don’t have details, but the Membrain web page suggests both data store and cache use cases.

4. There’s considerable personnel movement at Boston-area database technology companies right now. Please ping me directly if you care.

5. I talked recently with Ashish Thusoo of Qubole. Qubole’s initial offering is a Hive-in-the-cloud, started by the guys who invented Hive. Qubole’s coolest new technical feature vs. generic Hive seems to be a disk-based columnar cache that lives with the servers, to help “smooth over the jitters” between Amazon EC2 and S3. Qubole company basics include:

  • Founded last year.
  • 15 early adopters, generally from mid-sized internet companies. Some of the adopters are already paying.
  • 12 employees.

6. In my recent When I am a VC Overlord post, I wrote:

4. I will not fund any software whose primary feature is that it is implemented in the “cloud” or via “SaaS”. A me-too product on a different platform is still a me-too product.

5. I will not fund any pitch that emphasizes the word “elastic”. Elastic is an important feature of underwear and pajamas, but even in those domains it does not provide differentiation.

Cloud/SaaS deployments give you a chance at providing superior ease of use/installation/administration, without compromising functionality — but they don’t automatically guarantee it. It’s hard work to make your customers’ lives easier.*

*This is the second consecutive post in which I’ve used a similar line. I’ll try to stop now. What’s really scary is that I was inspired by the old Frank Perdue ad “It takes a tough man to make a tender chicken.” :)

7. Ofir Manor of EMC is skeptical about Oracle’s claims for Hybrid Columnar Compression. But he didn’t really dig up that much dirt, except that he seems to think 10X compression is more of a ceiling than the floor that Oracle marketing suggests it is. The money quote is:

Oracle used to provide 3x compression, now it provides 10x compression, so no wonder the best references customers are seeing about 3.4x savings…

That 3X is from Oracle’s Basic Compression, which seems to be a block-level dictionary scheme.

8. Nong Li of Cloudera wrote in praise of the code generation option in Impala. 3x performance is mentioned. What interested me was a nice observation that goes beyond Impala:

Code generation is most beneficial for queries that execute simple expressions and the interpretation overhead is most pronounced. For example, a query that is doing a regular expression match over each row is not going to benefit from code generation much because the interpretation overhead is low compared to the regex processing time.

Code generation may end up like compression — an architectural feature that DBMS just obviously should have.
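
The interpretation-versus-code-generation distinction can be illustrated in miniature. This is a toy sketch, not Impala's implementation: it generates Python source rather than machine code, and the tuple-based expression format, `interpret`, `to_source`, and `compile_expr` are all hypothetical names.

```python
from typing import Any, Callable, Dict, Tuple

Expr = Tuple[Any, ...]
Row = Dict[str, float]

OPS = {"add": "+", "mul": "*", "gt": ">"}


def interpret(expr: Expr, row: Row) -> Any:
    """Walk the expression tree again for every row (the interpretation overhead)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    if kind not in OPS:
        raise ValueError(f"unknown node: {kind}")
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if kind == "add":
        return left + right
    if kind == "mul":
        return left * right
    return left > right


def to_source(expr: Expr) -> str:
    """Translate the tree into a Python expression string, once per query."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    if kind not in OPS:
        raise ValueError(f"unknown node: {kind}")
    return f"({to_source(expr[1])} {OPS[kind]} {to_source(expr[2])})"


def compile_expr(expr: Expr) -> Callable[[Row], Any]:
    """Build a function specialized to this expression; no tree walk at run time."""
    return eval("lambda row: " + to_source(expr))


# Hypothetical predicate: price * qty > 100
predicate = ("gt", ("mul", ("col", "price"), ("col", "qty")), ("const", 100))
rows = [{"price": 30, "qty": 4}, {"price": 10, "qty": 5}]

compiled = compile_expr(predicate)
interpreted_results = [interpret(predicate, r) for r in rows]
compiled_results = [compiled(r) for r in rows]
```

Both paths give the same answers. The compiled version pays the tree-walking cost once per query instead of once per row, which matters most when, as in the quote above, the expression itself is cheap.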

Key questions when selecting an analytic RDBMS

Wed, 2013-02-06 10:32

I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:

  • How big is your database? How big is your budget?
  • How do you feel about appliances?
  • How do you feel about the cloud?
  • What are the size and shape of your workload?
  • How fresh does the data need to be?

Let’s drill down.

How big is your database? How big is your budget?

Taken together, these questions tell you which choices are even feasible. Does your database fit into RAM, at a price you can afford? Does it fit onto a single, perhaps large, server? If both answers are “No”, then you need a real scale-out system, querying disk or flash (which itself could be hard to afford). Otherwise, you have more options.

Note that database compression has a big influence on what fits where.
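
As a back-of-the-envelope aid, the feasibility questions above can be written down as a small function. Everything here is an assumption for illustration: the function name, the tier labels, and the sample numbers are hypothetical, and a real sizing exercise would also account for indexes, temp space, growth, and price.

```python
from typing import List, Tuple


def deployment_options(
    raw_tb: float,
    compression_ratio: float,
    affordable_ram_tb: float,
    largest_server_tb: float,
) -> Tuple[float, List[str]]:
    """Return (stored size in TB, feasible deployment styles) for a database."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    stored_tb = raw_tb / compression_ratio
    options: List[str] = []
    if stored_tb <= affordable_ram_tb:
        options.append("in-memory")
    if stored_tb <= largest_server_tb:
        options.append("single server")
    # Scale-out is always possible; it is mandatory only when nothing else fits.
    options.append("scale-out")
    return stored_tb, options


# Hypothetical example: 40 TB raw, 4x compression, 2 TB of affordable RAM, 20 TB per server.
size_tb, choices = deployment_options(40, 4, 2, 20)
```

With these made-up numbers the compressed database is 10 TB, which rules out RAM but still fits a single large server; at 1x compression the same data would force scale-out.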

How do you feel about appliances?

Depending on considerations such as database size, the choice of Oracle, Teradata, IBM Netezza, or Microsoft SQL Server may mandate or at least strongly suggest an appliance form factor. For most other analytic DBMS, an appliance is more optional. Are appliances good for you? Bad? Indifferent? Trade-offs include:

  • Appliances often involve paying a premium for hardware purchase and/or support.
  • Appliances often are easy(ier) to install and manage.
  • Appliances are easier to upgrade in some ways (everything’s integrated), but harder in others (less ability to upgrade bottlenecked parts).
  • Appliances often don’t play well in the cloud.

How do you feel about the cloud?

Analytic DBMS run better on good hardware and predictable bandwidth (hence all those appliances). These can be hard to find in the cloud. So, not coincidentally, can analytic DBMS references, although most vendors can muster a few.

If you feel you need to run your analytic RDBMS in the cloud now, check references carefully. If you only are concerned about the cloud as some indefinite future, then you might want to rule out a few appliance-only vendors, but otherwise you probably shouldn’t worry. Cloud hardware and networking are getting better, and RDBMS software vendors are gaining experience in cloud deployments.

What are the size and shape of your workload?

Different analytic databases can have very different kinds of workloads. Tasks include:

  • Complex, long-running queries.
  • Repetitive reports of varying degrees of complexity.
  • Simple queries.
  • Large, scheduled loads.
  • Continuous or near-continuous/micro-batch loads.

The big issue is: how many of each kind of task need to be performed concurrently, and in what combinations? If you’re refreshing 10,000 dashboards, several hundred of which might be getting drill-down queries at once, while trying to do a few scan-heavy queries in the background and some 15-way joins, most analytic DBMS might disappoint you. (Indeed, I’d ask whether you might want to split up that work among two or more systems.) Different DBMS (and different hardware/storage/networking configurations) shine in different scenarios.

How fresh does the data need to be?

Any serious analytic DBMS can be loaded daily or hourly, edge cases perhaps excepted. In most cases 15 minute intervals work as well, or even 5, but check whether those load latencies would interfere with any performance optimizations. But if you want sub-second data freshness, or even several-second — well, that has to be a top-tier architectural issue.

If your analytics are simple enough, it’s appealing to do the immediate-response ones straight from your transactional database. If not, you may need some kind of streaming-replication setup. Usually, I wind up recommending replication approaches that don’t yet have a lot of maturity or references. Tread carefully here.

Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations

Tue, 2013-02-05 07:25

To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.

Gartner seems confused about Kognitio’s products and history alike.

  • Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
  • Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
  • Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
  • Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
  • Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.

Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.

* non-existent

In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management’s facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica:

  • I think HP’s troubles are less relevant to HP Vertica than Gartner does.
  • In particular, Vertica’s lack of integration with Autonomy isn’t a big deal. Many relational DBMS vendors don’t even own a text search engine to not-integrate with, and the number of vendors with seriously effective analytic RDBMS/text search integration strategies is zero.
  • Gartner is correct to note that Vertica’s integration with the rest of HP, for example the hardware side, has been slow — but again, so what?
  • Gartner correctly praises Vertica’s analytic platform capabilities, but then seems to criticize Vertica’s capabilities in user-defined functions — notwithstanding that Vertica’s analytic platform capabilities are implemented via UDFs.
  • Gartner seems to criticize Vertica’s “volume credentials”, even though Vertica’s number of petabyte-scale analytic RDBMS customers may be second only to Teradata’s.

That said, I defer to Gartner’s opinion that HP Vertica’s sales momentum has disappointed, even if against higher expectations than one might have for vendors with 1/10 of Vertica’s installed base.

2 years ago, I simply said “What Gartner said in connection with Ingres is too inaccurate to deserve detailed attention.” This year’s Gartner Magic Quadrant for Data Warehouse Database Management isn’t that bad on the subject of Actian,* but it’s not great either. Writing mainly about Actian’s VectorWise, Gartner dings it for both features and bugginess, and correctly notes that VectorWise is only suitable for fairly small data warehouses. But Gartner gives VectorWise higher marks than Exasol even so. Gartner also writes that VectorWise has a “long tradition of having loyal supporters”, notwithstanding that VectorWise’s initial release was less than 3 years ago.

*Ingres’ new name, in honor of a 2011 pivot that seems to already have been deprecated

What the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management says about ParAccel isn’t too different on the facts from what I wrote in December, but Gartner is more enthused than I am. Basically:

  • ParAccel is playing catch-up in features and company stability, and Gartner somehow sees that as a strength.
  • Gartner dings ParAccel for a variety of product weaknesses.
  • Gartner is breathless about ParAccel being used on a MicroStrategy SaaS site. (Hint: Deals like that go to vendors willing to accept very low prices.)
  • Gartner is also impressed with ParAccel’s Amazon involvement. (Deals like that aren’t lucrative either, except insofar as Amazon bought some ParAccel stock.)

That Gartner ranks ParAccel ahead of HP Vertica baffles me. Perhaps Gartner views benchmarks as more significant than I do, or is otherwise judging ParAccel to have an important advantage in performance.

I also am in substantial agreement with the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management about Infobright facts, but this time I’m the one with the more favorable interpretation. I agree that Infobright is a bit limited in features, in the areas Gartner cites and in analytic platform capabilities as well. Even so, Infobright is far ahead of VectorWise in — also low-priced — sales, and ahead in product stability and features too. Yet Gartner gives Infobright vastly lower marks than Actian. I suspect that the essence of our disagreement is that Gartner sees Infobright’s focus on machine-generated data as something that “limits market expansion”, while I see machine-generated data as something that is by every measure* growing to be a majority of the whole.

*By raw volume that’s been true for a while. But if we adjust for value and so on, the crossover is arguably still a way off.

I’ll run through the other vendors cited in the 2012 Gartner Magic Quadrant for Data Warehouse Database Management more quickly, in approximate declining order of Gartner’s rankings.

  • Gartner loves Teradata, but has some concerns over TCO (Total Cost of Ownership). Makes sense to me.
  • Gartner is more impressed with Oracle’s technology than I am. I find it hard to take seriously a data warehouse RDBMS vendor that can’t deliver a true columnar storage option.
  • Gartner’s write-up of IBM gets lost in IBM’s vast sea of products. I sympathize.
  • Gartner’s discussion of SAP/Sybase IQ was overly brief. But given my own difficulties staying up to speed on my clients over there, I sympathize. Please stay tuned.
  • Gartner’s discussion of Microsoft gets overly caught up in “logical data warehouse” foofarah, but basically it makes sense. Pending some promised briefings, I’m more optimistic about Microsoft’s analytic DBMS offerings than I’ve been for a long time. Please stay tuned.
  • Gartner notes difficulties validating EMC Greenplum’s customer claims. I sympathize. Gartner also notes a bunch of product issues that make me wonder why EMC Greenplum’s overall rating isn’t even lower.
  • Gartner’s view of Exasol seems similar to mine.
  • I’d evaluate 1010data on the basis of its spreadsheet-like analytic tools, not its DBMS technology.
  • Gartner seems to have difficulty finding non-trivial “Strengths” for Calpont. I sympathize.
  • Gartner notes difficulties contacting SAND. I sympathize, since SAND’s senior management resigned en masse during the July, 2012 quarter. (Page 30 of that link.)

Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — concepts

Tue, 2013-02-05 07:24

The 2012 Gartner Magic Quadrant for Data Warehouse Database Management Systems is out. I’ll split my comments into two posts — this one on concepts, and a companion on specific vendor evaluations.

Links:

  • Maintaining working links to Gartner Magic Quadrants is an adventure. But as of early February, 2013, this link seems live.
  • I also commented on the 2011, 2010, 2009, 2008, 2007, and 2006 Gartner Magic Quadrants for Data Warehouse DBMS.

Let’s start by again noting that I regard Gartner Magic Quadrants as a bad use of good research. On the facts:

  • Gartner collects a lot of input from traditional enterprises. I envy that resource.
  • Gartner also does a good job of rounding up vendor claims about user base sizes and the like. If nothing else, you should skim the MQ report for that reason.
  • Gartner observations about product feature sets are usually correct, although not so consistently that they should be relied on.

When it comes to evaluations, however, the Gartner Data Warehouse DBMS Magic Quadrant doesn’t do as well. My concerns (which overlap) start:

  • The Gartner MQ conflates many different use cases into one ranking (inevitable in this kind of work, but still regrettable).
  • A number of the MQ vendor evaluations seem hard to defend. So do some of Gartner’s specific comments.
  • Some of Gartner’s criteria seemingly amount to “parrots back our opinions to us”.
  • As do I, Gartner thinks a vendor’s business and financial strength are important. But Gartner overdoes the matter, drilling down into picky issues it can’t hope to judge, such as assessing a vendor’s “ability to generate and develop leads.” *
  • The 2012 Gartner Data Warehouse DBMS Magic Quadrant is closer to being a 1-dimensional ranking than 2-dimensional, in that entries are clustered along the line x=y. This suggests strong correlation among the results on various specific evaluation criteria.

*I may focus more on marketing communications strategy than the whole Gartner database research team combined — but the only way I’d know whether Teradata’s lead gen is better than HP Vertica’s or vice-versa would be if both vendors happened to raise the matter during consulting sessions.

Specific product feature areas Gartner seems to emphasize include:

  • Alignment with a “logical data warehouse” strategy.
  • Analytic platform features.
  • Compression.
  • Administrative tools, including workload management.
  • “Self-tuning” performance.
  • Scale-out capabilities.

Most of this makes sense. But Gartner has been talking about the “logical data warehouse” for a long time without ever seeming to firm up what it is, as evidenced for example by some dueling summaries of the concept. So let’s drill down on the LDW.

I think “logical data warehouse” will wind up like “master data management” — i.e., it will be a goal and a business process, aided but not subsumed by some characteristic software. Beyond that, I’d say that generic, functional, high-performance data federation* software is a pipedream — building it would be as hard as building the mythical single DBMS that gives great functionality and performance, in all use cases, for all kinds of data. Just as DBMS need to be at least somewhat specialized in purpose, data federation software needs to be as well.

*While I disapprove, data virtualization seems to be the term that will win for describing data federation.

When Gartner refers to the “logical data warehouse” capabilities of analytic RDBMS — and the first sentence of the MQ report indeed specifies that the subject is “relational database management systems” — it seems to be looking for two things:

  • Built-in data federation/query routing capabilities; i.e., specific features that help the DBMS interoperate with other data stores. But there seems to be little reference to relational federation/external tables (which many vendors support) or text federation (which vendors with built-in search support, although that would mainly be Oracle, and its search is slow). Rather, this part of LDW is currently all about Hadoop interoperability, with bonus points for mentioning HCatalog.
  • Management of multi-structured data. But with limited exceptions, nobody’s doing that well in an analytic RDBMS. And even when they do, that’s pretty much the opposite of the federation that the rest of the logical data warehouse concept seems to be about.

For those and other reasons, referring to the “logical data warehouse” features of an analytic RDBMS is problematic. I imagine Gartner will keep working at the “logical data warehouse” concept until it is more successfully fleshed out. But little weight should be placed on Gartner’s LDW-feature-evaluations of analytic RDBMS at this time.