Re: S.O.D.A. database Query API - call for comments

From: Philip Lijnzaad <lijnzaad_at_ebi.ac.uk>
Date: 14 May 2001 17:33:11 +0100
Message-ID: <u7r8xrsxfc.fsf_at_sol6.ebi.ac.uk>


On Mon, 14 May 2001 17:40:24 +0200,
"Carl" == Carl Rosenberger <carl_at_db4o.com> wrote:

>> why ? The theory is so general that it allows clever (and fast) 'hacks',
>> because it is based on operating on sets of things rigth from start. I
>> believe this is what makes it so suitable to large-scale data management.

Carl> The approaches definitely come from two totally different directions:
Carl> - SQL *is* the language of sets and mass updates
Carl> - Object orientation is the exact opposite. You define single object
Carl> behaviour.

yes, agreed.

Carl> The two extremes both need the functionality of the other side:

Carl> Let's hope we can all learn from eachother to create the best of both Carl> worlds.

yes, certainly.

>> Why ? Using relational calculus, you 'massage' the relevant (parts of)
>> entities into a set of data that is of interest to the business logic, in
 Carl> the
>> most optimal way. The massaged (parts of) entities need not correspond to
 Carl> any
>> proper object type known to the system, but that is irrelevant.

Carl> This is where I see one of the largest advantages of relational databases:
Carl> You can always set up a two-dimensional table of your result, which is
Carl> instantaniously readable by a human.

yes, but it also comes in very handy if you want to find/update/delete a set of objects based on some complicated predicate. I.e., I think it's fine if the OR contains complex hand-crafted queries, e.g. for report generation.

Carl> This is where I see one of the largest disadvantages of relational Carl> databases:

Carl> Reconstructing objects and the relation between them is not supported
Carl> by the database engine. Once you start using a mapping tool to help,
Carl> you build an *additional* in-memory-database of the objects that you
Carl> have "checked-out" to the user-application. Ideally the additional
Carl> object layer would also take care about locking issues. This is where
Carl> object databases theoretically have great system-immanent advantages
Carl> since they need no additional layer.

yes, definitely.

Carl> How many tables would you use to represent an inheritance hierarchy
Carl> of the
Carl> following?
Carl> - person
Carl> - employee
Carl> - manager

>>
>> Depends, but usually three. This can be 'horizontal' (only tables for
>> leaf-classes; query super-classes through a UNION of a restriction of
>> them)
>> or vertical (each class its own table, with sub-classes having just the
>> bits
>> that are 'new', + a foreign key that is also a primary key; query
>> sub-classes
>> through a join of super- and sub-table).
Carl> This is a fair and honest answer. "Depends" shows that you have a
Carl> problem on a decision, depending on application and performance
Carl> issues. The need to take a decision already is the sign of a mismatch.

yes, I suppose that's true, but I don't see it as much of a problem; real life has a habit of complicating things anyway, so being pragmatic is not so much an option as a necessity. If you have a terabyte database of fairly complicated objects, I'm sure that at some point you'll start adding silly little object class to keep the damn thing up and running and up to date.

Carl> Thinking OO, the above is simple and straigth forward: Carl> manager extends employee extends person.

Carl> Using your second suggestion, the necessary queries for manager names looks
Carl> like this:
Carl> - Relational:
Carl> select person.name from person, employee, manager
Carl>   where employee.person = person.id
Carl>   and manager.employee = employee.id
Carl> - Object-oriented:
Carl> database.get(manager);


Carl> Now would you be so kind to write a query that completely puts all
Carl> person objects back together again, with the following addition to the
Carl> datamodel: - customer extends person - sponsor extends person - VC Carl> extends sponsor - self_employed extends employee

Carl> I bet, I will be faster:
Carl> database.get(person)

No, exactly these kinds of queries is trivial; it's still

select * from person;

In contrast, if you use the first mapping (i.e., 'horizontal', it would be (assuming that a person's attribs are just id, name and phone):

  select id,name,phone from employee
  union
  select id,name,phone from manager

(and if there are persons that are not an Employee, there 'd have to be a separate person_only table).

Addition of customer, sponsor and VC would require an additional

  union
  select id,name,phone from customer
  union
  select id,name,phone from sponsor
  union
  select id,name,phone from vc

All the person id's are unique across the union (and in fact I dislike this way of doing things, because you can't set up referential integrety as flexibly as with the first schema.

Carl> Accordingly we would like to extend the query functionality for object Carl> databases.

This would be indeed very useful.

Carl> Maintainability is a matter of the implementation. Schema evolution
Carl> is
Carl> not a problem inherent to object datases in principle.

>>
>> I'm not so sure, because objects don't have the closure property. I.e.,
>> products, projection, set operations and restrictions of relations are
>> again relations. This symmetry is not available for objects.
Carl> To some extent, you may be right: A table may exist on it's own. You
Carl> can transfer it anywhere independantly.  If you transfer a second
Carl> table, the joins will be there again, simply because they only consist
Carl> of numbers.

yes, that is true and important, but that's not what I mean. If you have to add a customer_contact table because you now want to allow more than one contact per customer, you can just add the table, and create the views that make the thing behave as if the old 'one contact per customer' situation was still the case. The old application continues to keep working, and you can start adjusting the object layer and the applications.

Carl> Object databases would transfer objects independantly. A complete
Carl> restore of a large object network is only possible with access to
Carl> OIDs. Ideally these would be unique worldwide (GUIDs).

yes.

Carl> Now what is the usecase?
Carl> Replication!

Carl> This is where GUIDs also become worthwhile using on relational systems.

yes.

Carl> I don't understand your
Carl> round-trip-problem. This is dependant on the implementation.
>>
>> in theory yes (if you are clever enough to do query transformations), but
>> in
>> practice no. In practice, complicated queries end up being done by
>> explicit
>> pointer chasing, because that's the only thing available.

Carl> Query transformation and optimization is quite simple:
Carl> - think in sets
Carl> - identify reocurring sets
Carl> - evaluate one set at a time
Carl> - always use the smallest set first
Carl> - narrow larger sets by smaller sets

yes, this is true if you get the query in a nice closed form, but typically bits and pieces of the object navigation occur in wildly different parts of an application program (e.g., in different compilation units). That doesn't make it easier.

>> Apart from the inability to do convenient aggregations over the data, you
>> loose the ability to work with sets of things ('things' not necessarily
>> being objects). Compared to that, the round-trip problem is indeed minor.

Carl> "Sets of things" should be available on object databases by providing Carl> indices.

Carl> Yes, OR layers can improve the speed by caching objects, but this only
Carl> works for usecases, where a small and defined amounts of data are
Carl> accessed. For typical High-End-Usecases you can assume that caches are
Carl> always full.

I'm not sure; this will depend on the application. Usually, the access patterns of (parts of) objects that come from an RDBMS thru an OR layer, will be fairly fixed, and can be parameterized to have good cache hit characteristics.

>> A view is as fast as the query that
>> defines it. So stating that a view is slower than a table may be true, but
>> you're comparing apples and oranges.

Carl> I fully agree.
Carl> A view is *only* as fast as the query that defines it.

yes, but it can be faster than an equivalent table too! This would happen if the join goes thru a very small table that you happen to be basing your selection on. So again, apples and oranges.

Carl> Sorry for my bad wording. The negative performance hit occurs when
Carl> queries join views with other views or tables. Most (all ?) current
Carl> query optimizers build all views first (run all underlying queries),
Carl> before evaluating them against eachother.

This would be a very stupid way to implement queries involving views. The sensible way to execute queries involving (non-materialized) views is to substitute the view definition in the new query, then optimize and execute this query (i.e., the query that is executed will only see real tables, and will have access to all the indexes, statistics and clustering that apply to those tables). Oracle appears to do it like that, and I'd be surprised if other databases do it differently (I will abstain from passing comment MSSQL). The obvious exception would be materialized views, which don't seem to be so common. Cheers,

                                                                      Philip
-- 
If you have a procedure with 10 parameters, you probably missed some. (Kraulis)
-----------------------------------------------------------------------------
Philip Lijnzaad, lijnzaad_at_ebi.ac.uk \ European Bioinformatics Institute,rm A2-08
+44 (0)1223 49 4639                 / Wellcome Trust Genome Campus, Hinxton
+44 (0)1223 49 4468 (fax)           \ Cambridgeshire CB10 1SD,  GREAT BRITAIN
PGP fingerprint: E1 03 BF 80 94 61 B6 FC  50 3D 1F 64 40 75 FB 53
Received on Mon May 14 2001 - 18:33:11 CEST

Original text of this message