Re: Relationals vs. Objects Databases I

From: <d39902_at_doxbox.pnl.gov>
Date: 1998/02/06
Message-ID: <6bfp7p$9kc1@bbs2.pnl.gov>

In article <34D9EBE2.FF63A32D_at_young.epc.lmms.lmco.com>, DTF User <user_at_young.epc.lmms.lmco.com> wrote:

[*snip*]

>general certain performance gain (something like 3-5 times). But if you need
>pattern
>search or interval comparisons then forget it. RDBMS will outperform ODBMS by a
>huge

Why will an RDBMS out perform an ODBMS on a pattern match or an interval comparison? Is there a fundamentally different mechanism for comparing strings or integers that RDBMSs employ? I don't think so...

>For example: you have just three object classes - customers, transactions and
>inventory.
>Number of instances are 20000, 50000 and 5000000 respectively which I would say
>relatively modest.

Yes, that's modest by the databases I have implemented in an OODB - though I must admit that I'm using an OODB precisely because the size and complexity of data that I'm dealing with would cripple an RDBMS...

>You need to select customers by geographic regions who ordered
>within certain period of time product of certain categories with total sale greater
>then
>certain amount. I would like to see how ODBMS will perform in this case AND, which
>is
>important but almost always ommitted by ODBMS propaganda, how much time and
>efforts will it take to create such a query. With RDBMS it is a single query - 6 -7
>lines in old
>plain SQL or few keystrokes and mouse clicks with many of readily available query
>and
>report tools.

Well, you don't give me much to go on here, but I'll give it a whirl. You'll just have to forgive me if I make schematic assumptions here. So here's a trivial schema that captures your description:

   cust (custID, name, street, city, state, zip, region)    inv(itemID, cost, category)
   trans(custID, transID, date)
   transDetail(transID, itemID, quantity)

Note that in reality you'd probably want item information separate from inventory and a separate table that maps items into potentially multiple categories. But we can at least work with this. NOTE: I don't consider this a very good schema - it certainly has some functional dependencies that are naughty.

OK, so the query you proposed above is given by something like the following where I assume you want customers in the Pacific Northwest who bought widgets from you in December of 1997 and the total value of those widgets was > $50:

select cust.* from cust where
cust.region='PNW' and

(select sum(inv.cost) from trans, transDetail, inv where

     trans.custID = cust.custID and 
     transDetail.transID = trans.transID and
     transDetail.itemID = inv.itemID and
     trans.date >= '12/1/1997' and trans.date < '1/1/1998' and
     inv.category = 'widget') > 50

So yes, that's about 8 lines (could be fewer if I'd used longer lines). Now, how about the computational complexity of this operation? We can probably assume that each table has a hash index on the "***ID" key to optimize the joins. We should also assume a btree index on the trans table for date. If we're going to do this frequently, we'll want an index by cust region as well, and maybe item by category - though that's not likely to be of much help here since the optimizer would likely choose to join cust with trans, then join with transDetail and finally with inv (the proof is left to the reader who can consult any textbook for the rationale).

OK, so let me propose a simple OO schema using one of the OODB I'm pretty familiar with. First, though, let me note that there are many ways to skin this cat in an OO schema but I'll try to give something reasonably equivalent to the above. Also, this isn't a "real" schema since I'm not going to confuse things with the syntax of the particular OODB I'm considering. OK, so here it is:

   cust (name, street, city, state, zip, region, transactions)
   inv(itemID, cost, category)
   trans(cust, date, items)

In cust, transactions is a list of trans. In trans, items is a dictionary (i.e., a collection of key/value pairs) where the inv is the key and the quantity is the value (e.g., {inv, quantity} "tuples").

So here's my implementation of the query above using the DML of the database I have in mind (note, it trivially assumes the existence of a set containing all the cust objects in our database):

allCustomers select: [:c | ((c region) = 'PNW') and:

      ((c transactions) detect: [:t | ((t date) >= '12/1/1997') and:
				      (t date) < '1/1/1998') and:
         (((t items) inject: 0 into: [:sum:i | (i key category) = 'widget')
		ifTrue: [sum + ((i key cost) * (i value))]]) > 50)])]

Now I know you might object that this isn't clear to you, but I don't expect it to be. It's a rough estimate (rough because I wrote it in about twice the time that I took to write the SQL above, and I haven't been actively working with GemStone for a few months now, yet I've been using SQL for decades) of a GemStone Smalltalk query. I considered giving an ODMG schema and OQL query, but since one might quibble about the existence of an OODB that implements both the DDL and DML of the standard, I chose to stick with a real, working OODB (GemStone in this case).

A couple of salient observations. First, I would argue that for the person who knows both Smalltalk and SQL it's not much more difficult to write one query over another. So much for the argument favoring the ease of writing SQL vs. OODB queries. Oh, and I should also point out that GemStone can execute this Smalltalk query interactively just like you would with an SQL interpreter, there's no need to compile a C++ program or anything like that. Other than that, I don't think the compactness of one solution over another is a very persuasive argument when talking about the expressiveness or power of a query language.

Second, as for the performance, the big win for the OODB is that there are no joins. The inner-select in the SQL above is a multi-way join that must be executed over and over again for every customer in the PNW. Granted, we can optimize that some with indexes, but the OODB solution I gave above completely eliminates the join processing altogether. So to those who try to slight OODBs with the criticism that they don't do "joins" I say whopeee! All things being equal, I think it's relatively easy to make the argument that the OODB query I gave is as efficient as the SQL query. Sure, we can debate about the level of optimization of particular products, but in the sense of evaluating the computational complexity of the two solutions, they're not that much different (the proof, again, is left as an exercise to the reader).

As for the availability of off-the-shelf report writers, there I must concede that RDBMS have it over OODBs hands down, no question about it.

David Hansen -- David M. Hansen | Pacific Northwest National Laboratory Senior Research Scientist | Information Sciences & Engineering dm_hansen_at_pnl.gov | PO Box 999 MS K7-22 (509)375-2362, fax (509)375-3641 | Richland, WA 99352-0999

Received on Fri Feb 06 1998 - 00:00:00 CST