DBMS2

Choices in data management and analysis

Vertica update

Wed, 2008-05-07 23:39

Another TDWI conference approaches. Not coincidentally, I had another Vertica briefing. Primary subjects included some embargoed stuff, plus (at my instigation) outsourced data marts. But I also had the opportunity to follow up on a couple of points from February’s briefing, namely:

Vertica has about 35 paying customers. That doesn’t sound like a lot more than they had a quarter ago, but first quarters can be slow.

Vertica’s list price is $150K/terabyte of user data. That sounds very high versus the competition. On the other hand, if you do the math versus what they told me a few months ago — average initial selling price $250K or less, multi-terabyte sites — it’s obvious that discounting is rampant, so I wouldn’t actually assume that Vertica is a high-priced alternative.
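To make "do the math" concrete, here is a minimal sketch of the implied discount. The $150K/TB list price and the $250K selling price come from the figures above; the site sizes are hypothetical, since "multi-terabyte" is all that was disclosed, and $250K is treated as an upper bound on the deal size.

```python
LIST_PRICE_PER_TB = 150_000   # list price per TB of user data (from the post)
AVG_INITIAL_SALE = 250_000    # "average initial selling price $250K or less"


def implied_discount(user_data_tb: float) -> float:
    """Fraction off list implied if a site of this size sold for AVG_INITIAL_SALE."""
    list_total = LIST_PRICE_PER_TB * user_data_tb
    return 1 - AVG_INITIAL_SALE / list_total


# Hypothetical site sizes; the post says only "multi-terabyte".
for tb in (2, 3, 5):
    list_total = LIST_PRICE_PER_TB * tb
    print(f"{tb} TB: list ${list_total:,}, implied discount {implied_discount(tb):.0%}")
```

Because $250K is a ceiling rather than an exact figure, each number is a lower bound on the discount at that size.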

Vertica does stress several reasons for thinking their TCO is competitive. First, with all that compression and performance, they think their hardware costs are very modest. Second, with the self-tuning, they think their DBA costs are modest too. Finally, they charge only for deployed data; the software that stores copies of data for development and test is free.

Categories: Other

Database blades are not what they used to be

Wed, 2008-05-07 23:27

In which we bring you another instantiation of Monash’s First Law of Commercial Semantics: Bad jargon drives out good.

When EnterpriseDB announced a partnership with Truviso for a “blade,” I naturally assumed they were using the term in a more-or-less standard way, and hence believed that it was more than a “Barney” press release.* Silly me. Rather than referring to something closely akin to a “datablade,” EnterpriseDB’s “blade” program turns out to be just a catchall set of partnerships.

*A “Barney” announcement is one whose entire content boils down to “I love you; you love me.”

According to EnterpriseDB CTO Bob Zurek, the main features of the “blade” program include:

  • Accreditation

  • Joint distribution, including distribution by the blade partner of Postgres Plus

  • Interface between the blade partner and EnterpriseDB’s field organization

Of the 16 blade partnerships announced in the initial press release, only one much resembles the datablade concept. That would be HyperBac, which is offering compression and encryption, as part of high-performance backup. (Bob says HyperBac’s compression reduces exported file size by around 90%, and it’s also extremely fast.) From where I sit, that’s a modified data access method, and hence worthy of the term “blade.”

Bob said that the next closest thing EnterpriseDB has to a true datablade at this time, and getting closer, actually is none of the other 15 partnerships. It’s Oracle compatibility. That makes sense; Oracle compatibility starts in the parser, and might have data access method and hence optimization implications as well. However, in saying this Bob presumably was not counting support for datatypes such as text and geospatial. Unless I’m very wrong about how they’re implemented, those are about as genuine as datablades ever get.

Categories: Other

Outsourced data marts

Wed, 2008-05-07 23:14

Call me slow on the uptake if you like, but it’s finally dawned on me that outsourced data marts are a nontrivial segment of the analytics business. For example:

  • I was just briefed by Vertica, and got the impression that data mart outsourcers may be Vertica’s #3 vertical market, after financial services and telecom. Certainly it seems like they are Vertica’s #3 market if you bundle together data mart outsourcers and more conventional OEMs.
  • When Netezza started out, a bunch of its early customers were credit data-based analytics outsourcers like Acxiom.
  • After nagging DATAllegro for a production reference, I finally got a good one: TEOCO. TEOCO specializes in figuring out whether inter-carrier telecom bills are correct. While there’s certainly a transactional invoice-processing aspect to this, the business seems to hinge mainly on doing calculations to figure out correct charges.
  • I was talking with Pervasive about Pervasive Datarush, a beta product that lets you do super-fast analytics on data even if you never load it into a DBMS in the first place. I challenged them for use cases. One user turns out to be an insurance claims rule-checking outsourcer.
  • One of Infobright’s references is a French CRM analytics outsourcer, 1024 Degres.
  • 1010data has built up a client base of 50-60, including a number of financial and retail blue-chippers, with a soup-to-nuts BI/analysis/columnar database stack.
  • I haven’t heard much about Verix in a while, but their niche was combining internal sales figures with external point-of-sale/prescription data to assess retail (especially pharma) microtrends.

To a first approximation, here’s what I think is going on.

Privacy laws force some outsourcing. It’s often OK to use credit data to decide what you’ll market at whom, even when it’s not OK to actually see the credit data itself. What’s more, in some cases data can’t leave a country, so if you don’t have critical business mass in that particular country, it’s natural to use an outsourcer who does.

Privacy even aside, owners of proprietary data are natural analytics outsourcers. Either you ship your data to your customers to do with as they please — and impose on them the expense of managing it — or you manage it for them.

Analytic “secret sauce” software providers also are natural outsourcers. Most proprietary analytic rules are pretty simple-minded. Outsourcing preserves mystique and pricing power.

The usual benefits of SaaS apply. Fast set-up, no fixed costs, etc. are all goodness, just as they are in the transactional world.

With that as background, the big change in the analytics outsourcing market is the same as the one sweeping the rest of the analytics world: interactive access to detail data is finally becoming affordable. If you just run weekly or monthly reports, there may be no reason to distinguish between analytic and transactional processing. But if you want to allow ad-hoc query, unlimited drilldown, or live dashboards, then you’re talking about a serious data mart technology stack.

And I do mean “data mart”. Outsourcing an enterprise data warehouse, with all of your proprietary transactional data, doesn’t make much sense unless you’re a complete SaaS shop already outsourcing that data in the first place.

Categories: Other

Truviso and EnterpriseDB blend event processing with ordinary database management

Tue, 2008-04-29 21:52

Truviso and EnterpriseDB announced today that there’s a Truviso “blade” for Postgres Plus. By email, EnterpriseDB CTO Bob Zurek endorsed my tentative summary of what this means technically, namely:

  • There’s data being managed transactionally by EnterpriseDB.

  • Truviso’s DML has all along included ways to talk to a persistent Postgres data store.

  • If, in addition, one wants to do stream processing things on the same data, that’s now possible, using Truviso’s usual DML.

Note: Extended-relational DBMS like Postgres, Oracle, DB2, and Informix/Illustra have long offered the ability to add blades/cartridges. It’s easy to understand what these do when they simply add native management for a new datatype, and extend the parser, optimizer, and data access methods accordingly. But blades are used in other ways as well, and I’ve always found that somewhat confusing. A little bit of that appears to be going on in this case.

Bob added that there have been a lot of inquiries about the announcement today, without specifying from whom. Truviso marketing chief Roman Bukary, late of SAP, sent over some generic use cases, which pretty much boil down to my first two bullet points above. (More precisely, they agree if you replace “transactionally” with “persistently”; Roman also foresees data warehousing uses.)

I like this announcement. With one probable exception, it’s a good fit for every major use of event processing; the exception is super-low-latency apps, where no extraneous overhead is tolerable. (Those are found mainly in algorithmic trading, but could arise in security and network management as well.) But then, Truviso is being positioned away from its initial currency trading focus anyway.

Super-low-latency aside, the other big current use case for event processing is data reduction. I.e., you have a lot of incoming data – e.g., via satellite telemetry or intelligence intercepts or network monitoring sensors, or monitoring character movement in an MMO (Massively Multiplayer Online) game. You try to grab all the “interesting” stuff, while disregarding or even throwing away the rest. But the “throwing away” part is a little worrisome. So if instead you can seamlessly persist everything, even for a short period of time (e.g., measured in days), that’s goodness. Even if you can’t keep it all even for a short while – well, if the point of data reduction is to retain only a fraction of the incoming data, this scheme could make it easier to persist the keepers.
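The data-reduction pattern described above can be sketched as follows. This is a minimal illustration, not Truviso’s or EnterpriseDB’s API: the class name, the `is_interesting` predicate, and the in-memory lists standing in for a persistent store are all assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class ReducingStream:
    """Keep every event for a short window; keep 'interesting' ones long-term."""

    def __init__(self, is_interesting, retention_seconds):
        self.is_interesting = is_interesting
        self.retention_seconds = retention_seconds
        self.recent = deque()   # (timestamp, event): everything, short-lived
        self.keepers = []       # (timestamp, event): the reduced stream

    def on_event(self, timestamp, event):
        # Persist everything first, so nothing is thrown away immediately.
        self.recent.append((timestamp, event))
        if self.is_interesting(event):
            self.keepers.append((timestamp, event))
        # Expire raw events that have aged out of the retention window.
        cutoff = timestamp - self.retention_seconds
        while self.recent and self.recent[0][0] < cutoff:
            self.recent.popleft()


# Hypothetical sensor readings; "interesting" means temperature above 100.
stream = ReducingStream(lambda e: e["temp"] > 100, retention_seconds=60)
stream.on_event(0, {"sensor": "a", "temp": 50})
stream.on_event(30, {"sensor": "a", "temp": 150})
stream.on_event(100, {"sensor": "a", "temp": 60})
```

In a real deployment the two containers would be database tables, which is where tight integration between the stream processor and the persistent store pays off.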

Another current use case for event processing is rules engines. Progress Apama has a rules paradigm all the way down, while Coral8 tells happily of a customer who uses event processing for all kinds of rules-based real-time CRM. But the Coral8 example is closely integrated with conventional persistent data stores, and the same is likely for other similar applications. Business activity monitoring (BAM) would be a special case of this.

As you know, my ultimate dream for business intelligence/analytic uses of event processing goes beyond BAM. I think many individuals in an enterprise should each track many different (but related) KPIs (Key Performance Indicators). Current query loads for reporting, dashboards, ad hoc query, etc. could easily go up by 2-3 orders of magnitude. When that happens, you want to consider different ways of doing things, specifically memory-centric ones. Normal memory-centric data processing might get the job done, but I have a suspicion that the right architecture will wind up looking a lot like event processing.

Once again, that’s a use for event processing that naturally integrates tightly with a persistent database.


Categories: Other

The Mark Logic story in XML database management

Tue, 2008-04-29 06:35

Mark Logic* has an interesting, complex story. They sell a technology stack based on an XML DBMS with text search designed in from the get go. They usually want to be known as a “content” technology provider rather than a DBMS vendor, but not quite always.

*Note: Product name = MarkLogic, company name = Mark Logic.

I’ve agreed to do a white paper and webcast for Mark Logic (sponsored, of course). But before I start serious work on those, I want to blog based on what I know. As always, feedback is warmly encouraged.

Some of the big differences between MarkLogic and other DBMS are:

  • MarkLogic’s primary DML/DDL (Data Manipulation/Description Language) is XQuery. Indeed, Mark Logic is in many ways the chief standard-bearer for pure XQuery, as opposed to SQL/XQuery hybrids.

  • MarkLogic’s XML processing is much faster than many alternatives. A client told me last year that – in an application that had nothing to do with MarkLogic’s traditional strength of text search – MarkLogic’s performance beat IBM DB2/Viper’s by “an order of magnitude.” And I think they were using the phrase correctly (i.e., 10X or so).

  • MarkLogic indexes all kinds of entities and facts, automagically, without any schema-prebuilding. (Nor, I gather, do they depend on individual documents carrying proper DTDs.) So there actually isn’t a lot of DDL. (Mark Logic claims in one test MarkLogic had more or less 0 DDL, vs. 20,000 lines in DB2/Viper.) What MarkLogic indexes includes, as Mark Logic puts it:

    • Every word

    • Every piece of structure

    • Every parent-child relationship

    • Every value.

  • As opposed to most extended-relational DBMS, MarkLogic indexes all kinds of information in a single, tightly integrated index. Mark Logic claims this is part of the reason for MarkLogic’s good performance, and asserts that competitors’ lack of full integration often causes overhead and/or gets in the way of optimal query plans. (For example, Mark Logic claims that Microsoft SQL Server’s optimizer is so FUBARed that it always does the text part of a search first.) Interestingly, Intersystems’ object-oriented Cache’ does pretty much the same thing.

  • MarkLogic is proud of its text search extensions to XQuery. I’ve neglected to ask how that relates to the XQuery standards process. (For example, text search wasn’t integrated into the SQL standard until SQL3.)
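As a rough illustration of what “index everything in one index, with no schema” can mean, here is a minimal sketch using Python’s standard XML parser. It is not MarkLogic’s implementation; the key layout, the function name, and the sample documents are assumptions made for the example, and text that trails a child element is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_document(doc_id, xml_text, index):
    """Add words, element names, parent-child pairs, and values to one shared index."""
    root = ET.fromstring(xml_text)

    def visit(element, parent_tag):
        index[("element", element.tag)].add(doc_id)            # every piece of structure
        if parent_tag is not None:
            index[("parent-child", parent_tag, element.tag)].add(doc_id)
        text = (element.text or "").strip()
        if text:
            index[("value", element.tag, text)].add(doc_id)    # every value
            for word in text.lower().split():
                index[("word", word)].add(doc_id)               # every word
        for child in element:
            visit(child, element.tag)

    visit(root, None)


index = defaultdict(set)
index_document(1, "<article><title>Columnar databases</title><author>Ann</author></article>", index)
index_document(2, "<memo><title>Databases and search</title></memo>", index)

# Structure and text conditions are answered from the same index by set intersection.
hits = index[("parent-child", "article", "title")] & index[("word", "databases")]
```

Because both kinds of condition resolve to sets of document IDs in one structure, neither the text part nor the structural part has to run first.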

Other architectural highlights include:

  • MarkLogic uses timestamps and appends for updates, rather than updates-in-place, much like Netezza or Illustra. Cleanup is done in the background. As long as your volume of changes (as opposed to inserts or reads) is sufficiently low, this can be more efficient than traditional approaches. Timestamping also makes it easy to write certain application functionality in publishing (“go live” times for content is a current use) and compliance (a possible future).

  • MarkLogic is ACID-compliant. Thus, you can read data as soon as it’s inserted, without a separate re-indexing step. Other native XML systems may not have that property (e.g., Mark Logic asserts that DB2/Viper doesn’t).

  • Mark Logic claims MarkLogic has relatively efficient (optional) range indexes. (This was in response to a question; details are secret.) Inverted-list DBMS like ADABAS and Model 204 have been doing decently efficient range queries for 30 years, so this claim is both credible and not terribly important.
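The timestamp-and-append update style mentioned above can be sketched as follows. This is a toy model under stated assumptions, not MarkLogic’s design: the class, its method names, and the integer logical clock are all hypothetical.

```python
import itertools


class AppendOnlyStore:
    """Updates append a new timestamped version instead of overwriting in place."""

    def __init__(self):
        self._versions = []               # (timestamp, key, value), in timestamp order
        self._clock = itertools.count(1)  # logical clock standing in for real time

    def put(self, key, value):
        ts = next(self._clock)
        self._versions.append((ts, key, value))
        return ts

    def get(self, key, as_of=None):
        """Latest value for key, optionally as of an earlier timestamp."""
        result = None
        for ts, k, value in self._versions:
            if k == key and (as_of is None or ts <= as_of):
                result = value
        return result

    def vacuum(self, horizon):
        """Background cleanup: drop versions superseded at or before horizon."""
        latest_old = {}
        for ts, key, _ in self._versions:
            if ts <= horizon:
                latest_old[key] = ts
        self._versions = [
            v for v in self._versions
            if v[0] > horizon or latest_old.get(v[1]) == v[0]
        ]


store = AppendOnlyStore()
draft_ts = store.put("doc", "draft")
final_ts = store.put("doc", "final")
```

Reading with `as_of` is what makes features like “go live” times cheap to build: the old and new versions coexist until cleanup runs.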


Categories: Other

ParAccel pricing

Fri, 2008-04-25 08:33

I made a round of queries about data warehouse software or appliance pricing, and am posting the results as I get them. Earlier installments featured Teradata and Netezza. Now ParAccel is up.

ParAccel’s software license fees are actually very simple — $50K per server or $100K per terabyte, whichever is less. (If you’re wondering how the per-TB fee can ever be the smaller one, please recall that ParAccel offers a memory-centric approach to sub-TB databases.)
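The “whichever is less” rule is easy to express directly. A minimal sketch, using the two rates quoted above; the function name and the example configurations are hypothetical.

```python
PER_SERVER = 50_000   # dollars per server (from the post)
PER_TB = 100_000      # dollars per terabyte (from the post)


def paraccel_license_fee(servers: int, terabytes: float) -> float:
    """License fee under the 'per server or per terabyte, whichever is less' rule."""
    return min(PER_SERVER * servers, PER_TB * terabytes)


# Hypothetical memory-centric setup: 4 servers holding half a terabyte.
small_in_memory = paraccel_license_fee(servers=4, terabytes=0.5)
# Hypothetical disk-based setup: 2 servers holding 10 terabytes.
larger_on_disk = paraccel_license_fee(servers=2, terabytes=10)
```

The first case is the one where the per-terabyte fee wins: many servers, little data.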

Details about how much data fits on a node are hard to come by, as is clarity about maintenance costs. Even so, pricing turns out to be one of the rare subjects on which ParAccel is more forthcoming than most competitors.

Categories: Other

Yet another data warehouse database and appliance overview

Fri, 2008-04-25 01:34

For a recent project, it seemed best to recapitulate my thoughts on the overall data warehouse specialty DBMS and appliance marketplace. While what resulted is highly redundant with what I’ve posted in this blog before, I’m sharing anyway, in case somebody finds this integrated presentation more useful. The original is excerpted to remove confidential parts.

… This is a crowded market, with a lot of subsegments, and blurry, shifting borders among the subsegments.

… Everybody starts out selling consumer marketing and telecom call-detail-record apps. …

Oracle and similar products are optimized for updates above everything else. That is, short rows of data are banged into tables. The main indexing scheme is the “b-tree,” which is optimized for finding specific rows of data as needed, and also for being updated quickly in lockstep with updates to the data itself.

By way of contrast, an analytic DBMS is optimized for some or all of:

  • Small numbers of bulk updates, not large numbers of single-row updates.

  • Queries that may involve examining or returning lots of data, rather than finding single records on a pinpoint basis.

  • Doing arithmetic calculations – commonly simple arithmetic, sorts, etc. – on the data.

Database and/or DBMS design techniques that have been applied to analytic uses include:

  • “Denormalizing” the database, by pre-joining tables. This makes queries cheaper, but updates more costly. It’s implicit in single-fact-table designs.

  • “Star indexes”, which capture the benefits of denormalization. But they are large, and costly to update.

  • “Materialized views”, which precompute query results (joins and/or aggregations). These obviously accelerate queries that use those results, but you have to pay the cost of continually updating them as data changes.

  • “Range partitioning”, in which data in (say) certain date ranges is clustered together on disk for more efficient processing.

  • “Hypercubes”, aka “MOLAP” (Multi-Dimensional OnLine Analytic Processing). The costs and benefits are extreme forms of those I’ve already cited. At least, the costs are; the benefits aren’t seeming so extreme any more, causing the technology to be increasingly outmoded.

  • “Bit-mapped indexes.” This is another approach to indexing that is fast on queries, at the cost of making updates slower. In its pure form, it’s well-suited for columns with low “cardinality” – i.e., a small number of values. (E.g., colors, sizes, etc.) But it can be extended to cover higher-cardinality cases.

  • Database administration tools to help with the complex choices involved in writing SQL, selecting indexes, etc.

  • Recommended hardware configurations, because the right mix of disks, processors, etc. might otherwise be non-obvious.
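Of the techniques in this list, bit-mapped indexes are perhaps the easiest to illustrate in a few lines. The sketch below is a toy version under stated assumptions: Python integers stand in for bit vectors, the column data is made up, and real products add compression and other refinements.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def rows_from_bitmap(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    rows, row_id = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Two low-cardinality columns of a hypothetical six-row table.
colors = ["red", "blue", "red", "green", "blue", "red"]
sizes = ["S", "M", "L", "S", "M", "M"]
color_idx = build_bitmap_index(colors)
size_idx = build_bitmap_index(sizes)

# WHERE color = 'red' AND size = 'M' becomes a bitwise AND.
red_and_m = rows_from_bitmap(color_idx["red"] & size_idx["M"])
# WHERE color = 'blue' OR size = 'S' becomes a bitwise OR.
blue_or_s = rows_from_bitmap(color_idx["blue"] | size_idx["S"])
```

The sketch also shows why updates cost more: changing one row’s color means touching two bitmaps, and a column with many distinct values needs many bitmaps.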

That’s pretty much the list of techniques used in general-purpose DBMS products such as Oracle and Microsoft SQL Server. But if you put them all together, you’re still left with the problems:

  • The techniques that greatly accelerate queries also greatly slow down updates.

  • You use a lot of extra disk space for all those indexes.

  • There’s a tremendous amount of labor involved in getting it all right.

  • Because of these drawbacks, you’re likely to optimize only for certain subsets of the queries you’d really like to run. Indeed, you may not make all of your data available for analytic querying.

Specialty analytic DBMS can do a lot better than general-purpose DBMS because:

  • They can run on “shared-nothing” MPP (Massively Parallel Processing) architectures. Most vendors make this choice, because:

    • Using larger numbers of smaller parts is fundamentally cheaper, if you don’t have a lot of MPP overhead. Most of the vendors have figured out clever ways to avoid that overhead.

    • For larger databases, I/O becomes an absolute bottleneck. But in a shared-nothing DBMS, you can do I/O truly in parallel.

  • If you simplify your software sufficiently, you may be able to get great compression, which has myriad benefits – most obviously to disk costs and I/O, but it can go further than that. Most contenders post-Netezza are good to great at compression. Netezza is playing catch-up. Teradata isn’t really better than Oracle, et al.

  • Disks spin slowly. The fastest disk drive you can buy spins at 15,000 RPM, vs. the 1,200 RPM at which hard disk technology was introduced in 1956. (Most systems use 7,200 or 10,000 RPM.) So random-access disk reads have become the single greatest bottleneck to analytic processing. One solution is to optimize your DBMS for table scans or other sequential reads, i.e., read more bytes of data, but at a much higher per-byte rate. To varying degrees, the analytic DBMS with row-based architectures are optimized for sequential reads. I published two white papers focusing on this point in 2007, sponsored by DATAllegro; see http://www.monash.com/whitepapers.html

  • You also can break rows apart, and organize data by columns. Columnar architectures have tremendous advantages if you only ever want to retrieve a small fraction of a row. They also can help with compression and general query speed. They are hard to update, however. Vertica has some very clever techniques to beat the update speed problem. ParAccel argues that this cleverness isn’t needed, and more straightforward techniques suffice.

  • You can have specialized hardware designs or optimizations, even beyond the shared-nothing MPP. Netezza has an FPGA, which is almost a custom chip. Kickfire has some kind of custom chip. Calpont keeps trying and failing with a custom chip. Teradata is a lot like standard hardware, but they have their own switching system. DATAllegro and other vendors do use standard hardware, but rely on more inter-node communication than might otherwise be there. Columnar vendors, however, tend to be fairly hardware-agnostic.
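The row-versus-column trade-off in the list above can be made concrete with a small sketch. Everything here is illustrative: the table, the column names, and the “values touched” counts are assumptions standing in for real I/O, and run-length encoding is just one simple compression scheme that columnar layouts make easy.

```python
def to_columns(rows):
    """Reorganize a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_touched_row_store(rows):
    """A row store reads whole rows, even if the query wants one column."""
    return sum(len(row) for row in rows)


def values_touched_column_store(columns, wanted):
    """A column store reads only the columns the query names."""
    return sum(len(columns[name]) for name in wanted)


def run_length_encode(column):
    """Collapse runs of repeated values into (value, count) pairs."""
    encoded = []
    for value in column:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


# Hypothetical four-row, four-column table.
rows = [
    {"customer_id": 1, "state": "MA", "age": 34, "spend": 120.0},
    {"customer_id": 2, "state": "MA", "age": 51, "spend": 80.0},
    {"customer_id": 3, "state": "NY", "age": 29, "spend": 200.0},
    {"customer_id": 4, "state": "NY", "age": 42, "spend": 50.0},
]
columns = to_columns(rows)

total_spend = sum(columns["spend"])                             # SELECT SUM(spend)
row_cost = values_touched_row_store(rows)
column_cost = values_touched_column_store(columns, ["spend"])
state_rle = run_length_encode(columns["state"])
```

The same layout explains the update problem: inserting one row means appending to every column list, and fetching one whole row means visiting every column.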

Beyond raw database size, characteristics of the database and workload that affect which analytic DBMS works best include:

  • Do you have to do any significant volume of low-latency updates at all? If so, how low? (15 minute latency is a common but still minority data warehousing requirement, both in cases where there’s a legitimate business benefit and in cases where there is not. Most products meet that requirement, some more gracefully than others.)

  • Are your result sets likely to be huge? (E.g., inputs into SAS data mining software). Fairly large? Single-row? Columnar systems are bad at single-row result sets.

  • How many queries are likely to be running at once? The ability to handle concurrency well is a function of product maturity even more than basic architecture. Each time Netezza or DATAllegro has a major release, they tell me that now their concurrency is great and confess it wasn’t so hot in the prior version. Very high concurrency means serving a call center or feeding a website’s personalization. Medium concurrency means reporting and dashboards for a large but not huge enterprise. Low concurrency means serving a few specialized data analysts in a department.

  • What absolute response time do you need? (Are you serving a call center? A personalized web site? A user who doesn’t mind tapping her fingers for a few minutes, but doesn’t want to wait a few hours? A user who wants a response within a few seconds?) Different DBMS are optimized a bit differently. But frankly, if a system has great price/performance, it usually will be good in any scenario.

  • How much are you doing in the way of arithmetic calculations? An application very light on data volume and heavy on arithmetic is sometimes a genuine excuse for using MOLAP. Otherwise, it’s nice to have good flexibility with a feature called “user defined functions”.

  • When you bring back a row, do you typically want the whole row, or are many of the columns of that row just wasted I/O? If it’s the latter, columnar systems shine. This is particularly common in consumer marketing/targeting types of applications, where you may start with 1000 or more columns of data.

  • Are you basically querying a single large “fact table” across many “dimensions”, or a small group of closely-related fact tables? Or is the database schema significantly more complicated than that? Vertica only allows one fact table. At the other extreme, Teradata has for decades been optimized for any kind of schema. Most systems let you use any kind of schema you want, but that doesn’t mean they perform well in all scenarios.

Categories: Other

Optimizing WordPress database usage

Thu, 2008-04-24 03:42

There’s an amazingly long comment thread on Coding Horror about WordPress optimization. Key points and debates include:

  • WordPress makes scads of database calls on every page. (20 is the supposed default number. That sounds a little high to me, but not wholly incredible.)
  • Therefore one should use a caching plug-in. WP-Cache is the preferred one. WP-Super-Cache gets some votes as perhaps being even better.
  • In theory the database cache should handle most of the problem. (After all, many of those database queries are the same for every page.) In practice, it often doesn’t, even if you use dedicated (as opposed to shared) web hosting.
  • LAMP vs. Microsoft stack (uh-oh).
  • Drupal vs. WordPress vs. Movable Type vs. Joomla vs. do-it-yourself (uh-oh too).
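For readers who haven’t looked inside a caching plug-in, the basic idea can be sketched in a few lines. This is a generic time-to-live page cache, not the code of WP-Cache or WP-Super-Cache; the class, the `render_page` stand-in, and the 300-second lifetime are assumptions, and the 20 calls per page is the figure quoted in the thread.

```python
import time


class TTLCache:
    """Cache rendered pages for a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}   # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]               # cache hit: no database work at all
        value = compute()                 # cache miss: render the page
        self._entries[key] = (now + self._ttl, value)
        return value


stats = {"db_calls": 0}


def render_page(slug):
    """Stand-in for page rendering; counts the database calls it would make."""
    stats["db_calls"] += 20
    return f"<html>{slug}</html>"


cache = TTLCache(ttl_seconds=300)
for _ in range(100):
    page = cache.get_or_compute("/about", lambda: render_page("/about"))
```

With the cache in place, 100 views of the same page cost one render rather than 100, which is the whole argument for the plug-in.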

Another theme is, well, WordPress “theme” design. Do you really need all those calls? The most dramatic example I can think of is one I experienced soon after I started this blog. Some themes have the cool feature that, in the category list on the sidebar, there’s a count of the number of posts in the category. Each category. I love that feature, but its performance consequences are not pretty.
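The per-category count is a classic case of one query per item where a single grouped query would do. The sketch below uses SQLite with a made-up two-table schema; it is not WordPress’s actual schema or theme code, just an illustration of why the query count grows with the number of categories.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, category_id INTEGER, title TEXT);
INSERT INTO categories VALUES (1, 'Columnar'), (2, 'Appliances'), (3, 'Pricing');
INSERT INTO posts VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
""")


def counts_one_query_per_category(conn):
    """One query for the category list, then one COUNT query per category."""
    queries = 1
    counts = {}
    categories = conn.execute("SELECT id, name FROM categories ORDER BY id").fetchall()
    for cat_id, name in categories:
        (n,) = conn.execute(
            "SELECT COUNT(*) FROM posts WHERE category_id = ?", (cat_id,)
        ).fetchone()
        counts[name] = n
        queries += 1
    return counts, queries


def counts_single_query(conn):
    """The same result from a single grouped join."""
    rows = conn.execute("""
        SELECT c.name, COUNT(p.id)
        FROM categories c LEFT JOIN posts p ON p.category_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.id
    """).fetchall()
    return dict(rows), 1


slow_counts, slow_queries = counts_one_query_per_category(conn)
fast_counts, fast_queries = counts_single_query(conn)
```

Both functions return the same counts; the difference is that the first one’s query count is the number of categories plus one, on every page view.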

As previously noted, we’ll be doing an emergency site upgrade ASAP. Once we’re upgraded to WordPress 2.5, I hope to deploy a rich set of back-end plug-ins. One of the caching ones will be among them.


Categories: Other

DATAllegro finally has a blog

Mon, 2008-04-21 13:16

It took a lot of patient nagging, but DATAllegro finally has a blog. Based on the first post, I predict:

  • DATAllegro’s blog will live up to CEO Stuart Frost’s talent for clear, interesting writing.
  • Like a number of other vendor blogs — e.g., Netezza’s — DATAllegro’s will have infrequent but usually long posts.

The crunchiest part of the first post is probably

Another very important aspect of performance is ensuring sequential reads under a complex workload. Traditional databases do not do a good job in this area - even though some of the management tools might tell you that they are! What we typically see is that the combination of RAID arrays and intervening storage infrastructure conspires to break even large reads by the database into very small reads against each disk. The end result is that most large DW installations have very large arrays of expensive, high-speed disks behind them - and still suffer from poor performance.

I’ve pounded the table about sequential reads multiple times — including in a (DATAllegro-sponsored) white paper — but the point about misleading management tools is new to me.

Now if I could just get a production DATAllegro reference, I’d be completely happy …

Categories: Other

Netezza pricing

Mon, 2008-04-21 09:38

In connection with the announcement of the Teradata 2500, I asked some Teradata competitors about pricing. Netezza’s response amounted to “We don’t disclose list pricing, but our cheapest system handles about 3 1/4 TB and sells for under $200K.” So Netezza’s actual pricing is well below the list price of the Teradata 2500.

Categories: Other

Teradata introduces lower-cost appliances

Mon, 2008-04-21 09:27

After months of leaks, Teradata has unveiled its new lines of data warehouse appliances, raising the total number either from 1 to 3 (my view) or 0 to 2 (what you believe if you think Teradata wasn’t previously an appliance vendor). Most significant is the new Teradata 2500 series, meant to compete directly with the smaller data warehouse specialists. Highlights include:

  • An oddly precise estimated capacity of “6.12 terabytes”/node (user data). This estimate is based on 30% compression, which is low by industry standards, and surely explains part of the price umbrella the Teradata 2500 is offering other vendors.

  • $125K/TB of user data. Obviously, list pricing and actual pricing aren’t the same thing, and many vendors don’t even bother to disclose official price lists. But the Teradata 2500 seems more expensive than most smaller-vendor alternatives.

  • Scalability up to 24 nodes (>140 TB).

  • Full Teradata application-facing functionality. Some of Teradata’s rivals are still working on getting all of their certifications with tier-1 and tier-2 business intelligence tools. Teradata has a rich application ecosystem.

  • What will be controversial performance, until customer-benchmark trends clearly emerge.
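A quick worked check of the numbers above, as a sketch. The Teradata figures are the ones in this list; the Netezza figures come from the “Netezza pricing” post (about 3 1/4 TB for under $200K). Note that this sets a Teradata list price against a Netezza actual-price ceiling, so it is an apples-to-oranges comparison and only roughly indicative.

```python
TB_PER_NODE = 6.12             # Teradata 2500 user data per node
MAX_NODES = 24                 # Teradata 2500 maximum configuration
LIST_PRICE_PER_TB = 125_000    # Teradata 2500 list price per TB of user data

max_capacity_tb = TB_PER_NODE * MAX_NODES            # consistent with ">140 TB"
single_node_list = TB_PER_NODE * LIST_PRICE_PER_TB   # list price of one full node

# Netezza: cheapest system, about 3.25 TB for under $200K (actual, not list).
netezza_ceiling_per_tb = 200_000 / 3.25

print(f"Teradata 2500 max capacity: {max_capacity_tb:.2f} TB")
print(f"Teradata 2500 single-node list price: ${single_node_list:,.0f}")
print(f"Netezza per-TB price ceiling: ${netezza_ceiling_per_tb:,.0f}")
```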

The Teradata 2500 is coming out of the chute with two customers – a new-customer retailer buying a single cabinet (i.e., 6.12 TB), and an existing customer for whom fewer details seem available. So far as I can tell, the sales force has had the product since late January, although the first leaks I got incorrectly suggested the system would only scale to a limited number of nodes.

Other products in the announcement included:

  • The Teradata 5550, a routine annual upgrade to the Teradata 5500.

  • The Teradata 550. This is a low-end, single-server SMP box introduced 9 or so months ago, originally meant for application development and testing. But some customers have been using it for deployment, and Teradata is now officially acknowledging that. It only scales to 2-3 TB of user data.

The Teradata 2500’s performance should be below the Teradata 5550’s for three reasons:

  • More disk per node.

  • Less CPU per node (2 cores vs. 4).

  • The removal of some “workload management” performance features found in the 5500 series.

The same considerations apply to a comparison between the Teradata 2500 and the older Teradata 5000, but in that case they’re offset by a year of Moore’s Law benefit.

Teradata’s performance claims for the 2500, in essence, are:

  • The 2500 is focused on decision-support applications, where all that workload-management stuff doesn’t matter as much.

  • Although we can do additional things well our competitors can’t, we also rival them in performance in their sweet area, namely sequential/table-scan-oriented decision support.

  • In fact, we beat them on lots of customer benchmarks.

  • By the way, even the simplified workload management capability gives good concurrency when compared with what the little guys offer.

Teradata competitors’ stories are along the lines of:

  • We clobber Teradata in customer benchmarks.

  • Now they’re offering a system a lot slower than the ones we already beat.

DATAllegro offers a detailed critique of the Teradata 2500 based on pre-release information, both on functionality and the numbers. (E.g., they argue that 6.12 TB of user data counted the Teradata way isn’t as much as it sounds like; I’m checking on that.)

So what does this all mean? If the Teradata 2500 were as aggressively priced as I originally thought (my bad – I simply misheard their per-terabyte prices for absolute figures), this announcement would be a huge event. As matters stand – well, DBMS and other enterprise vendors’ “crippled” products don’t have a stellar history. I wouldn’t be surprised if, a year from now, we saw an upgraded Teradata 2500 series, with more aggressive pricing and features.

Alternatively: In the initial release, Teradata has chosen not to have any interoperability between the 5500, 2500, and 550 series. I think that should and perhaps will change, with the 55xx and 25xx working together in a hub/spoke manner. Otherwise, missing-features arguments like the one DATAllegro makes will be too compelling. For that matter, I wouldn’t be surprised if Teradata bought a smaller rival, in which case heterogeneous hub/spoke synchronization would be a really good idea as soon as they could implement it.

If hub/spoke integration is one feature I’d recommend Teradata get cracking on, the other – and even bigger – one is compression. All CPU/disk trade-offs notwithstanding, better compression is an obvious and big price/performance win.


Categories: Other

Kickfire kicks off

Thu, 2008-04-17 22:35

I chatted with Raj Cherabuddi and others on the Kickfire (formerly C2) team for over an hour on Monday, and now have a better sense of their story. There are some very basic questions I still don’t have answers to; I’ll fill those in when I can.

Highlights of what I have and haven’t figured out so far include:

  • Kickfire’s technology has two main parts: A SQL co-processor chip and a MySQL storage engine.

  • Kickfire makes a Type 0 appliance. If I understood correctly, it contains the chip, a couple of standard CPU cores, and 64 gigs of RAM. Or else it contains just the chip, and is meant to be hooked up to a 2U box with 64 gigs of RAM. I’m confused.

  • The Kickfire box can handle up to 3 terabytes of user data. The disk required for that is 4-5 terabytes without redundancy, 2X with. Based on that formulation and other clues, I’m guessing Kickfire — unlike other appliance vendors — doesn’t build in storage itself.

  • I don’t know whether the Kickfire chip is true custom silicon or an FPGA emulation.

  • The essential idea of the chip is dataflow programming for SQL, with pipelining between operations. This eliminates the overhead of registers and context switching. I don’t know what the trade-offs are, if any.

  • Kickfire’s database software is columnar, operating on compressed data even in RAM. In that, Kickfire’s story is most similar to Vertica’s, although I’m guessing Exasol may do something similar as well. Like Vertica, Kickfire uses multiple compression methods (they’re reluctant to give detail, but agreed it would be fair to say they use both something like dictionary/token and something like delta compression).

  • Kickfire’s software is ACID-compliant. You can do incremental loads or trickle feeds. Bulk load speed is 100 GB/hour. Kickfire’s solution for the traditional problem of updating column stores is called “snapshots.” Without giving details, they position that as similar to the Vertica solution.

  • Like other MySQL storage engines, Kickfire inherits whatever data connectivity, stored procedure capabilities, user-defined functions ability, etc. that MySQL has.

  • Kickfire has no paying customers, but does have a slide showing many logos of “prospects and beta customers.”

  • Kickfire has no MPP capabilities at this time, but says adding those is “on the roadmap” and will be “easy.”

  • Kickfire submitted a 100 GB TPC-H result, in which it beat the previous leaders – Exasol, ParAccel, and Microsoft – on price-performance, and lagged only Exasol and ParAccel on absolute performance. Kickfire is extremely proud of this. Indeed, I don’t recall another vendor ascribing that much weight to a single benchmark result in the entire history of TPCs.* Kickfire seems unfazed by the fact that its result is for a system listed with a ship date 6 months in the future (I’m guessing that’s the latest the TPC will allow), while the other results are for systems available today.

*Somebody – perhaps adman extraordinaire Rick Bennett? – may want to check my memory on this, but I think Oracle’s famed “Gentlemen, start your snails” ad in the early 1990s was about PC World tests, not TPCs. Oracle also had an ad about WW1-style planes nosediving, but I don’t think that one referenced TPCs either.
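For readers who want a concrete picture of the two compression families mentioned in the list above (dictionary/token and delta), here is a minimal Python sketch. It is a generic illustration only; Kickfire has not disclosed its actual methods, and every function name below is made up for the example.

```python
def dictionary_encode(values):
    """Replace each value with a small integer token.

    Returns (tokens, dictionary), where dictionary[token] is the original value.
    """
    dictionary = {}
    tokens = []
    for value in values:
        # A new value gets the next unused token; a repeat reuses its token.
        tokens.append(dictionary.setdefault(value, len(dictionary)))
    return tokens, list(dictionary)  # dicts keep insertion order, so index == token


def delta_encode(numbers):
    """Keep the first number, then only the differences between neighbours."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


def delta_decode(deltas):
    """Rebuild the original numbers by accumulating the differences."""
    numbers = []
    total = 0
    for delta in deltas:
        total += delta
        numbers.append(total)
    return numbers
```

The point of "operating on compressed data" is that a predicate such as state = 'CA' can be evaluated against the tokens directly: look up the token for 'CA' once, then count or filter matching tokens without decoding the column.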

Categories: Other

Relational purists should root for ScaleDB

Sun, 2008-04-13 08:10

I just put up a long post about a small development-stage company, ScaleDB. The punchline is that ScaleDB has a data access method — an extension of Patricia tries — that gives referential integrity and updatable views for free.

People who think current “relational” DBMS aren’t relational enough often suggest that’s the kind of foundation DBMS should have. And unlike Required Technologies’ TransRelational (TM) shtick, ScaleDB’s really is an OLTP-oriented approach.


Categories: Other

ScaleDB presents The Revenge of the Pointer

Sun, 2008-04-13 08:03

The MySQL user conference is upon us, and hence so are MySQL-related product announcements, including storage engines. One such is Kickfire. ScaleDB — smaller and earlier-stage — is another.

In a nutshell, ScaleDB’s proposition is:

  • Innovative approach to indexing relational DBMS, providing performance advantages.

  • Shared-everything scale-up that ScaleDB believes will leapfrog the MySQL engine competition already in Release 1. (In my opinion, this is the least plausible part of the ScaleDB story.)

  • State-of-the-art me-too facilities for locking, logging, replication/fail-over, etc., also already in Release 1.

Like many software companies with non-US roots, ScaleDB seems to have started with a single custom project, using a Patricia trie indexing system. Then they decided Patricia tries might be really useful for relational OLTP as well. The ScaleDB team now features four developers, plus half-time or so “Chief Architect” involvement from Vern Watts. Watts seems to pretty much have been Mr. IMS for the past four decades, and thus surely knows a whole lot about pointer-based database management systems; presumably, he’s responsible for the generic DBMS design features that are being added to the innovative indexing scheme. On ScaleDB’s advisory board is PeopleSoft veteran Rick Berquist, about whom I’ve had fond thoughts ever since he talked me into focusing on consulting as the core of my business.*

*More precisely, Rick pretty much tricked me into doing a day of consulting for $15K, then revealed that’s what he’d done, expressing the thought that he’d very much gotten his money’s worth. But I digress …

ScaleDB has no customers to date, but hopes to be in beta by the end of this year. Angels and a small VC firm have provided bridge loans; otherwise, ScaleDB has no outside investment. ScaleDB’s business model thoughts include:

  • $1,000/server/year license fee, or something in that range.

  • Early focus on Web 2.0 kinds of customers (e.g., social networking companies may enjoy the join performance ScaleDB plans to offer).

  • Early focus on MySQL OLTP (but, like proud parents everywhere, they think the technology is so wonderful that it could eventually be pretty much all things to all people).

The company is based in Menlo Park, CA.

Probably I should explain what Patricia tries actually are, and how they can help relational DBMS. An ordinary trie* is a way of indexing data that looks a lot like – unsurprisingly – a tree. For example, suppose you need to index a lot of character strings, each consisting of lower-case Latin letters. From the root node you point to the 26 possibilities for starting letter. From those you point to the next possible letter, and so on. Combinatorial explosion is averted because you only have edges if there’s actually a string with that letter combination. Thus, when indexing a corpus of classic novels, there might be a path i-t-i-s-a-t-r-u-t-h-u-n-… and so on, but none that starts i-a-u-z-z-z.

*“Trie” is sometimes pronounced like “tree”, sometimes like “try.”

Patricia tries add a now-obvious compression technique. Namely, if there’s only one branch from a node, just collapse it. Thus, the example I gave above would become something more like i-t-i-s-a-truth-universally-acknowledged-…, or perhaps something even more compact.
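A minimal Python sketch of both ideas, assuming strings of lower-case letters and using "$" as a hypothetical end-of-string marker. It builds an ordinary character-per-edge trie, then collapses single-branch chains in the way described above. This is a path-compressed trie in the spirit of the description, not the exact bit-level PATRICIA algorithm, and certainly not ScaleDB’s implementation.

```python
def build_trie(words):
    """Plain trie: one edge per character; '$' marks the end of a stored string."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}  # terminator, so that "it" and "item" can both be stored
    return root


def collapse(node):
    """Merge every single-branch chain into one edge with a longer label."""
    compact = {}
    for label, child in node.items():
        # Keep absorbing the child while it has exactly one outgoing edge
        # and that edge is not the end-of-string marker.
        while len(child) == 1 and "$" not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compact[label] = collapse(child)
    return compact


def contains(compact, word):
    """Look a string up in the collapsed trie by matching whole edge labels."""
    node, rest = compact, word
    while rest:
        for label, child in node.items():
            if label != "$" and rest.startswith(label):
                node, rest = child, rest[len(label):]
                break
        else:
            return False  # no edge continues the remaining characters
    return "$" in node
```

With the three strings "itisatruth", "itwas", and "item", the collapsed structure has a single "it" edge from the root, followed by the edges "isatruth", "was", and "em". Edges exist only where stored strings actually diverge.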

While these ideas were evidently invented with text documents in mind, there’s no reason they can’t be applied to other kinds of strings – specifically, to those stored in relational databases. (And numbers can just be treated as strings of bits.) As I wrote last year in discussing solidDB, which uses a similar approach:

The canonical index structure in a disk-centric OLTP RDBMS is a tree of blocks. The record sought is in a block somewhere. There are index blocks whose entries are pointers to the correct block based on values in the index column. There are index blocks of pointers to other index blocks. And so on. One can traverse these trees in very few steps, but each step is costly, because each step involves examining the whole block.

SolidDB, by way of contrast, uses a core index structure called the trie. The key value on which the record search is based is divided into chunks of bits. Each chunk leads to a tree node with a small number of choices for the next chunk. There are more steps, but each step is much cheaper.
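To make the “chunks of bits” idea concrete, here is a small Python sketch. The 32-bit key width and 4-bit chunk size are assumptions picked for readability, not figures from solidDB or ScaleDB, and the function names are hypothetical.

```python
CHUNK_BITS = 4   # assumed chunk width; a real engine would choose its own
KEY_BITS = 32    # assumed fixed key width for this sketch


def chunks(key):
    """Split an integer key into CHUNK_BITS-wide pieces, most significant first."""
    mask = (1 << CHUNK_BITS) - 1
    for shift in range(KEY_BITS - CHUNK_BITS, -1, -CHUNK_BITS):
        yield (key >> shift) & mask


def insert(root, key, record):
    """Walk (and create) one small node per chunk, then store the record."""
    node = root
    for chunk in chunks(key):
        node = node.setdefault(chunk, {})
    node["record"] = record


def lookup(root, key):
    """Return (record, steps); each step inspects one node with at most 16 entries."""
    node, steps = root, 0
    for chunk in chunks(key):
        node = node.get(chunk)
        steps += 1
        if node is None:
            return None, steps
    return node.get("record"), steps
```

With these settings every successful lookup takes exactly 8 steps, more than a block tree would need, but each step is a single small-node probe rather than a search through a whole block.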

Benefits of this strategy include compression and in-memory performance. But a naive implementation would, as in other pointer-based systems, lead to unacceptable disk thrashing. ScaleDB’s answer is to layer the index, essentially creating a “trie of tries.” The company confidently claims that, in almost all cases, data can be found via a single disk read. Part of that story is the assertion that their indexing scheme achieves tremendous compression vs. conventional b-trees.

So far, that all sounds like a performance win, of unclear magnitude. (ScaleDB says it’s hoping for a 3X or better performance advantage versus traditional b-tree-based approaches.) But there’s another cool part as well. The ScaleDB trie doesn’t necessarily end with the first row it finds; it also reaches through to capture foreign-key relationships. E.g., if customer FOO123 places an order with OrderID BAR456, the BAR456 isn’t just found via the path B-A-R-4-5-6. It also can be found via FOO-1-2-3-BAR-456. Thus, referential integrity and updatable views are baked into the core database management architecture.
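Here is a toy Python sketch of the FOO123/BAR456 example, purely to illustrate why reaching a child row through its parent’s key gives a form of referential integrity for free. The class and method names are invented, and nothing here reflects ScaleDB’s actual data structures.

```python
class Node:
    """One index node: an optional row plus child nodes keyed by the next key part."""

    def __init__(self):
        self.row = None
        self.children = {}


class PathIndex:
    """Toy index in which an order is reachable by its own key and via its customer."""

    def __init__(self):
        self.root = Node()

    def _find(self, path):
        node = self.root
        for part in path:
            node = node.children.get(part)
            if node is None:
                return None
        return node

    def add_customer(self, customer_id, row):
        self.root.children.setdefault(customer_id, Node()).row = row

    def add_order(self, customer_id, order_id, row):
        customer = self._find([customer_id])
        if customer is None or customer.row is None:
            # There is no parent path to hang the order on, so the insert
            # fails; that is the referential-integrity check.
            raise KeyError(f"unknown customer {customer_id!r}")
        # The same row object is registered under two paths.
        self.root.children.setdefault(order_id, Node()).row = row
        customer.children.setdefault(order_id, Node()).row = row

    def get(self, *path):
        node = self._find(path)
        return None if node is None else node.row

    def orders_of(self, customer_id):
        customer = self._find([customer_id])
        return [] if customer is None else [n.row for n in customer.children.values()]
```

In this sketch, get("BAR456") and get("FOO123", "BAR456") return the same row, and the customer-to-orders join is simply a walk over the customer node’s children.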

I look forward to seeing how this all works out, in Release 1 and beyond.

Edit: One way to think of this is as the integration of the network and relational data models, à la IDMS/R, but with more compact linked lists. And I believe Predrag Dizdarevic when he tells me IDMS/R did wind up working pretty well, in a rare instance of a DBMS technology success post-acquisition by CA.


Categories: Other

Supporting evidence for the DBMS disruption story

Wed, 2008-04-09 23:31

As previously announced, I did a webcast this afternoon, discussing database diversity. The title of the talk was taken directly from a post – What leading DBMS vendors don’t want you to realize – that argued mid-range DBMS are suitable for a broad variety of tasks. The overriding theme was a Clayton Christensen-style “disruption” narrative.

The sponsor was EnterpriseDB, which is fitting. While not the biggest DBMS industry disrupter in terms of revenue or visible impact (MySQL and Netezza say “Hi”), the Postgres family in general and EnterpriseDB in particular epitomize the disruption threat like nobody else, because of how broadly they substitute for market-leading database managers.

As I promised on the call, below is a post with links to further research backing up the points made. They’re numbered to match some of the presentation slides, which you can find at this link.

3. Much of the discussion of database diversity comes from a series of posts I coordinated with Mike Stonebraker.

4. At various times, starting on Slide 4, I made reference to datatype extensibility, a key feature of Oracle and DB2 – and a key advantage of Postgres over MySQL.

10. Capping off the database diversity discussion, Slide 10 mirrors this 11-point version of a data management software taxonomy.

13-14. I’ve posted many times about data warehousing DBMS and related technologies, including this overview of major analytic DBMS products, another recent overview of data warehouse specialty technologies, and an attempt to distinguish between data warehouse appliance myths and realities. Of particular interest for further research may be our sections on data warehouse appliances and columnar DBMS.

15. I do most of my posting about text search over on Text Technologies, specifically in the search category. Vendors I specifically mentioned as blending search with other kinds of data retrieval were Mark Logic and Attivio.

16. There’s a section here on native XML database management.

17. We also have a section on managing RDF and other graphical data models.

18. Ditto complex event/stream processing.

19. The only embeddable DBMS I’ve written much about recently is solidDB. And frankly, even in that case I’ve focused more on mid-tier caching uses, the now-canceled MySQL relationship, or general technology than I did specifically on embedded uses.

22-24. Back in February, 2007 I made what is probably still my clearest post explaining why I think market-leading DBMS vendors are in the process of getting disrupted.


Categories: Other

My own data management software taxonomy

Wed, 2008-04-09 23:14

On a recent webcast, I presented an 11-node data management software taxonomy, updating a post commenting on Mike Stonebraker’s. It goes:

1. High-end OLTP/general-purpose DBMS
2. Mid-range OLTP/general-purpose DBMS
3. Row-based analytic RDBMS
4. Column- or array-based analytic RDBMS
5. Text search engines
6. XML and OO DBMS (but these may merge with search)
7. RDF and other graphical DBMS (but these may merge with relational)
8. Event/stream processing engines (aka CEP)
9. Embedded DBMS for devices
10. Sub-DBMS file managers (e.g. MapReduce/Hadoop)
11. Science DBMS

Obviously, this is a work in progress. In particular, while there’s clearly more than one kind of analytic DBMS, partitioning them into categories is not easy.


Categories: Other

Kickfire is de-cloaking

Tue, 2008-04-08 19:19

Kickfire, the renamed C2, is doing one of those buzz-building rollouts in which they make sure the first word comes from people on their payroll golly-gee-whizzing. You can see those at Xarpb and Diamond Notes, as well as a forthcoming article in MySQL magazine. Farhan Mashraqi also appears to be involved. Kickfire is also sponsoring the MySQL user conference next week.

I plan to write more after I get some substance, but a few things seem clear:

1. Kickfire’s product is an appliance that functions as a MySQL storage engine.
2. There’s a custom chip involved.
3. Kickfire plans to throw around the “stream processing” buzzphrase a lot.

Now, “stream processing” means a lot of different things to different people. E.g., Netezza uses the phrase just because their FPGA throws away a lot of data before ever routing it to more conventional SQL processing. But pending a briefing, I’m guessing that Kickfire’s sense is similar to what underlies the case for using CEP in BI.

Edit: Here’s an update after an actual Kickfire briefing.


Categories: Other

Positioning the data warehouse appliances and specialty DBMS

Sat, 2008-04-05 20:10

There now are four hardware vendors that each offer or seem about to announce two different tiers of data warehouse appliances: Sun, HP, EMC, and Teradata. Specifically:

In addition, multiple hardware vendors have “reference architecture” technical arrangements with Oracle, to try to capture some of the benefits of appliances. And IBM is constantly in partnership discussions with data warehouse specialists, notwithstanding having multiple data warehouse offerings of its own.

Positioning of these various offerings is confused. Part of the reason is the large vendors’ posture of “We’re big and trustworthy, and those little upstart vendors aren’t – until the moment we partner with one of them.” Part of the reason is the small vendors’ stance of “We can do all things for all people – and by the way, 9 of the 14 customers we’ve ever had are all doing pretty much the same thing.” And part of the reason is just an industry penchant for secrecy.

To a first approximation, I think there are two sensible ways to define the tiers. In each case, we’re talking about what kinds of databases the various products are suited for.

  • Criterion S (for “Size”). “Bigger than Oracle can handle” vs. “Small enough that Oracle can handle it” (but that depends on what the definition of “handle” is).

  • Criterion U (for “Usage”). “Full enterprise data warehouse” vs. “big honking data mart”.

But those are very different classification rules – many products that might be upper-tier by Criterion S are lower-tier by Criterion U, and vice-versa. For example:

  • Teradata’s current products are at the upper end by either criterion. Even so, a significant fraction of older Teradata installations are below 5 terabytes or even 1 terabyte in size.

  • More generally, Teradata emphasizes Criterion U. Hence any future low-end products will surely be positioned as lower-tier by that criterion. Beyond that, I wouldn’t be surprised if their release is delayed, with the final version of those products differing from what previously leaked. E.g., they might well be designed to compete with newer vendors that are upper-tier by Criterion S.

  • Netezza has clearly made it into the upper tier by the Size criterion. Most of its installations are lower-tier by Criterion U, but it trumpets a few exceptions that it describes as “enterprise data warehouses” in success stories.

  • DATAllegro is upper tier by Criterion S — more so than any other vendor except Teradata, in that there are at least two credible stories of DATAllegro warehouses at or above the quarter-petabyte mark. Even so, DATAllegro is still mainly in the lower tier by Criterion U. I.e., the most natural use of DATAllegro technology is to build Very Big data marts.

  • Vertica is a purely lower-tier Criterion U player, given its focus on single fact table schemas. But it’s well on its way into the upper tier by Criterion S.

  • Dataupia straddles the boundary of the tiers by Criterion S. That is, it’s meant to offload existing Oracle, SQL Server, or DB2 databases, or in some OEM cases to be a cheaper alternative. That sounds lower-tier. On the other hand, it has one 120 terabyte reference, which puts it squarely in the upper tier. By Criterion U it’s pretty lower-tier.

  • ParAccel seems lower-tier by either criterion. And I’m too burned out on ParAccel’s secrecy to probe hard for exceptions.

  • Oracle, MS SQL Server, et al. are – pretty much by definition – lower-tier by Criterion S, but upper-tier by Criterion U.

  • HP Neoview is obviously meant to get to the higher end by both criteria. But like most specialty products, right now it’s further along by the Size criterion than the Usage one. Even so, it seems no further along by Criterion S than HP’s partner Vertica is.

  • Greenplum has clearly gotten to the upper tier by the Size criterion. But like most of the competition, it still seems to be in the lower tier by Usage.

  • Infobright is in the lower tier by either criterion. (They don’t even have an MPP offering yet.)

  • Kognitio KX2 is in the lower tier by either criterion. However, Kognitio aspires to move up when measured by Usage.

  • The last time I looked, Sybase IQ was lower tier by either criterion.



Categories: Other

EMC is partnering with ParAccel

Sat, 2008-04-05 19:51

A talk about a ParAccel/EMC partnership has been promised for a forthcoming EMC user conference. Otherwise, ParAccel is exposing no useful information on the matter.*

*So what else is new?

The talk is called Highly Scalable Analytic Appliance Powered by EMC and ParAccel, and the abstract says:

Large and medium size enterprises are struggling with the technical and business challenges associated with processing operational data in near real time while executing increasingly complex queries on multi-terabyte data warehouses. To address these challenges EMC and ParAccel have jointly engineered and developed a highly scalable and performant analytic appliance. This solution is built on EMC CLARiiON midrange CX-3 UltraScale networked storage and ParAccel’s analytic columnar data store. Customers can simply deploy the EMC/ParAccel analytic appliance by simply extending their existing EMC footprint on enterprise ready storage while leveraging EMC’s proven solutions.

Hat tip to Mark Madsen.

Categories: Other

Webcast on database diversity Wednesday April 9 2 pm Eastern

Tue, 2008-04-01 23:42

Once or twice a year, EnterpriseDB sponsors a webcast for me. The last two were super well-attended. And most people stayed to the end, which is generally an encouraging sign!

The emphasis this time is on alternatives to the market-leading DBMS. I’ll highlight the advantages of both data warehousing specialists and general-purpose mid-range DBMS (naturally focusing on the latter, given who the sponsor is). The provocative title is taken from a January, 2008 post — What leading DBMS vendors don’t want you to realize. If you read every word of this blog, there probably won’t be much new for you. :) But I’d love to have you listen in and perhaps ask a question anyway!

You can register on EnterpriseDB’s webcast page, which also has an archived webcast I did for them in October, 2007.

Categories: Other