Representing Types in the Catalog (was Re: Distributed foreign keys)

From: Paul Vernon <paul.vernon_at_ukk.ibmm.comm>
Date: Tue, 8 Jul 2003 12:39:58 +0100
Message-ID: <beean7$1l0g$>

Stupid outlook express won't let me reply to really deep threads ;-), so I need to post this in a new one. (I get a '441 Line 3 too long' as apparently it refs all prev messages in a thread in a header and does not wrap it when it reaches the SMTP max line length.)

Bob Badour" <> wrote in message news:guoOa.415$
> "Paul Vernon" <paul.vernon_at_ukk.ibmm.comm> wrote in message
> news:becgrp$2h7m$
> > True, and as they are constants (until we consider schema evolution) they
> > might as well all be view constants in the database catalog.
> Okay, a light went on. The set of values of a type is a literal (constant)
> which one specifies with an expression. Because a function of a literal is
> itself a literal, the expression describing the literal could be as simple
> as a bunch of values or as complex as a deeply nested relational expression.
> Since the expression is always a cartesian product, wouldn't it suffice to
> just have a set of components of a possible representation? One could derive
> the expression easily enough just by taking the product of each component.

Not all tuples of the cartesion product will be valid possiable representatoins. In my integer interval example only those with a lower bound less than or equal (yes, >= was a typo before) to the upper bound would be valid (assuming a inclusive-exclusive range poss representation).

So it's both the components and the constraints on them that we need.

> > Does SINGLE_BIT == boolean ?
> > I suggest not. That would, for me, make a BIT type generator (probably
> > generating bit stings of both fixed and variable length) the single type
> from
> > which all others would be built.
> Are you suggesting that the boolean type is built as a BIT(1) ?

That a possible representation of Boolean would be values of type BIT(1). Yes. That another representaion would be the string values {"True", "False"}. Yes.

> > Agreed. I suggest existing DBMSes already have reasonably low in cost
> query
> > cost estimators, and that future advances will allow cost estimates to be
> > revised during the run of a query to make up for initial bad estimates.
> But are they sufficiently accurate? The revision during the run sounds
> worrisome to me. I would hate a sixteen hour query to abort after six
> hours--especially if someone needs the results first thing in the morning.

Well if after 6 hours the DBMS comes across some massively skewed data that it's previous statistics did not know about (it's happened to me when we messed up a hashing algorithm used to partition data on a parallel database), and so the estimated query time shoots up to say 6 days, then if the DBMS could find no better plan at that point for dealing with the skewed data, then you would be pleased if it gave you the option of stopping the query at that point (but, maybe keeping the any working results so far in memory so that a subsequent, modified query might reuse the 6 hour work done so far).

> > Well I do want type literals one way or another. If I had <NOT> then
> simply
> > <NOT>ing an empty relation will give be it's type literal.
> > Here is an argument for Integer enumeration, I just want it for all types.
> >
> If you have one (either range literals or <not>), you don't really need the
> other because you can derive it easily enough.

Agreed. As long as types are well defined (i.e. we know all the values that they contain) Type literals and <NOT> are interchangeable.

Possible representation literals, where the components of the each possible representation are exposed allow us to ensure that our types are well defined. In particular it makes sure that every type value has exactly one (canonical) representation for each possible rep. That there is an equivalence between pos rep values (none of this where one POINT type has cartesiaon and polar poss reps that cannot be equivalent using RATIONAL numbers). We also get to formalise the "THE_" operators (aka. GET & SET) of poss representations, because these just become the relation attributes of the pos rep literal. So to get at the poss rep components for a given type value, just JOIN to the pos rep literal and select the component you are interested in. To set a component, just select the row where the other components of the poss rep are the same as current and the one you want to change is the value you want to change it to.
Non-canonical representations (i.e. multiple poss rep values mapping to one type value) would, I guess, be held in a separate catalog literal that maps to the canonical form for the particular poss rep.

All in all, it looks to me like a type system that does have some useful properties, and solves some of the problems that I am (and I think that databases are) interested it.

It might even be useful for the alleged "fundamental purpose of a type system" that is "to prevent the occurrence of execution errors during the running of a program". Replace, "running a program" with "evaluating a [relational] expression", then I would indeed hope so.

> I suggest changing the focus from using expressions to define the set of
> values as a relational literal to using expressions to describe the possible
> representations etc.

Same thing (if the expressions are non "parameterized")

> From a description of the possible representations, one
> can derive or generate an expression to define the set of values. A
> parameterized type would use a parameterized expression for the possible
> representation etc.

Sounds reasonable. What we need is the definition of a Relational Expression as a data type. What are some plausible possible representations of Relational Expressions? Are parameterised REs a super-type of a REs? For REs one poss rep is simply all strings that meet the RE BNF, but to get a poss rep that starts to exposes the components we need to look a representing parse trees as relation values (or, actually, database values). On that point, I love the mad standard SQL that Vadim Tropashko found for parsing in this article:
I was even more impressed that translating it to DB2 was trivial, and that it runs quickly (at least for his small example)

> > Allow an extension of the algebra to make it Turing equivalent I (naively)
> > guess. How far to being Turing equivalent does adding <TCLOSE> make "A" ?
> I don't know. I strongly suspect full recursion support would get it there,
> though.

Agreed. I wonder if <TCLOSE> should be replaced with generalised recursion?

> > > > > Drop the names of relations? How does one refer to them without
> names?
> > > >
> > > > By their tuple type. Obviously that does mean enforcing POOD however.
> > >
> > > That seems like a complex way to refer to a relation compared to using a
> > > name.
> >
> > Wouldn't deny that. Relation names work OK as shorthands for tuple types,
> it's
> > just that not all tuple tpyes should need them and some might have more
> than
> > one...
> The names I am thinking of are names of relation variables or relation
> valued database attributes, if you prefer, and not shorthands for types. A
> relation of employee number and employee's name has the same type as a
> relation of employee number and employee's manager's name, but they are
> different things.

Only if you use the same attribute names. Remember that the type of a tuple is the same as the heading of the tuple. I.e. it consists of the set of both the attribute names AND attributes types. This, possibly counter intuitive definition, is stated a number of times in TTM, but still easy to miss.

In the case above the two tuple types might be

    {<Employee_Number,EMP_NUM>, <Name,NAME>} and

    {<Manager_Number,EMP_NUM>, <Name,NAME>}

Then the name "Employee" could be a shorthand for the first tuple type and "Manager" for the second.

Paul Vernon
Business Intelligence, IBM Global Services Received on Tue Jul 08 2003 - 13:39:58 CEST

Original text of this message