Re: Representing Types in the Catalog (was Re: Distributed foreign keys)

From: Bob Badour <bbadour_at_golden.net>
Date: Tue, 8 Jul 2003 13:20:01 -0400
Message-ID: <1_DOa.435$%R5.65300577_at_mantis.golden.net>


"Paul Vernon" <paul.vernon_at_ukk.ibmm.comm> wrote in message news:beean7$1l0g$3_at_gazette.almaden.ibm.com...
> Stupid outlook express won't let me reply to really deep threads ;-), so I
> need to post this in a new one. (I get a '441 Line 3 too long' as
apparently
> it refs all prev messages in a thread in a header and does not wrap it
when it
> reaches the SMTP max line length.)
>
> Bob Badour" <bbadour_at_golden.net> wrote in message
> news:guoOa.415$7M3.59788339_at_mantis.golden.net...
> > "Paul Vernon" <paul.vernon_at_ukk.ibmm.comm> wrote in message
> > news:becgrp$2h7m$1_at_gazette.almaden.ibm.com...
> [snip]
> > > True, and as they are constants (until we consider schema evolution)
they
> > > might as well all be view constants in the database catalog.
> >
> > Okay, a light went on. The set of values of a type is a literal
(constant)
> > which one specifies with an expression. Because a function of a literal
is
> > itself a literal, the expression describing the literal could be as
simple
> > as a bunch of values or as complex as a deeply nested relational
expression.
> >
> > Since the expression is always a cartesian product, wouldn't it suffice
to
> > just have a set of components of a possible representation? One could
derive
> > the expression easily enough just by taking the product of each
component.
>
> Not all tuples of the cartesion product will be valid possiable
> representatoins.

This just means we need to represent the constraints on the components of a possible representation as well as the components.

> In my integer interval example only those with a lower bound
> less than or equal (yes, >= was a typo before) to the upper bound would
be
> valid (assuming a inclusive-exclusive range poss representation).

A type declaration comprises possible representations and constraints. One can derive the relation literal as a cartesian product of the components of a possible representation restricted by the constraints.

> So it's both the components and the constraints on them that we need.

Yes, I agree. The question is how best to represent components and how best to represent constraints in the catalog.

> [snip]
> > > Does SINGLE_BIT == boolean ?
> > > I suggest not. That would, for me, make a BIT type generator (probably
> > > generating bit stings of both fixed and variable length) the single
type
> > from
> > > which all others would be built.
> >
> > Are you suggesting that the boolean type is built as a BIT(1) ?
>
> That a possible representation of Boolean would be values of type BIT(1).
Yes.
> That another representaion would be the string values {"True", "False"}.
Yes.
>
> [snip]
> > > Agreed. I suggest existing DBMSes already have reasonably low in cost
> > query
> > > cost estimators, and that future advances will allow cost estimates to
be
> > > revised during the run of a query to make up for initial bad
estimates.
> >
> > But are they sufficiently accurate? The revision during the run sounds
> > worrisome to me. I would hate a sixteen hour query to abort after six
> > hours--especially if someone needs the results first thing in the
morning.
>
> Well if after 6 hours the DBMS comes across some massively skewed data
that
> it's previous statistics did not know about (it's happened to me when we
> messed up a hashing algorithm used to partition data on a parallel
database),
> and so the estimated query time shoots up to say 6 days, then if the DBMS
> could find no better plan at that point for dealing with the skewed data,
then
> you would be pleased if it gave you the option of stopping the query at
that
> point

The dba always has the option of stopping the query by killing the process. If the cost estimation has potential inaccuracies that cause the estimate to suddenly jump from hours to days after six hours of processing, presumably the cost estimation has potential inaccuraces that could cause the estimate to suddenly drop back down to hours after another 15 minutes of processing.

If the safety system is going to automate the decision to abort a query, it has to make that decision up front. Users have to be able to leave a long-running query knowing that when they return the dbms will have produced a result or will still be working at it.

> (but, maybe keeping the any working results so far in memory so that a
> subsequent, modified query might reuse the 6 hour work done so far).

Let's say my boss needs a result for a 9am meeting, and I expect the dbms to take 16 hours to deliver the result given the volume of data and the complexity of the calculations. I wait until 2pm to start the query to avoid unecessary load on the system during the peak hours of 9am to 2pm.

Having the 8pm image of the work in progress won't help me one bit when I come in the next morning at 7am to retrieve the result and format it for presentation. With 10 hours processing remaining to be done, my boss won't have the needed results in 2 hours.

When I leave at 6pm the first night, I need to have confidence the dbms will keep at it until it finishes.

> [snip]
> > > Well I do want type literals one way or another. If I had <NOT> then
> > simply
> > > <NOT>ing an empty relation will give be it's type literal.
> > > Here is an argument for Integer enumeration, I just want it for all
types.
> > > http://www.dbazine.com/tropashko2.html
> >
> > If you have one (either range literals or <not>), you don't really need
the
> > other because you can derive it easily enough.
>
> Agreed. As long as types are well defined (i.e. we know all the values
that
> they contain) Type literals and <NOT> are interchangeable.
>
> Possible representation literals, where the components of the each
possible
> representation are exposed allow us to ensure that our types are well
defined.
> In particular it makes sure that every type value has exactly one
(canonical)
> representation for each possible rep. That there is an equivalence between
pos
> rep values (none of this where one POINT type has cartesiaon and polar
poss
> reps that cannot be equivalent using RATIONAL numbers).

Possible representation literals won't prevent that.

> We also get to formalise the "THE_" operators (aka. GET & SET) of poss
> representations, because these just become the relation attributes of the
pos
> rep literal.

The "THE_" operators are already formalized without setting anything. If a variable's declared type has a possible representation, P1, with A and B components, the "THE_" operator to the left of the assignment operator is just a shorthand for another assignment:

THE_A(var1) := "NewAValue";

is a shorthand for

var1 := P1( A := "NewAValue", B := THE_B( var1 ) );

> So to get at the poss rep components for a given type value, just
> JOIN to the pos rep literal and select the component you are interested
in.
> To set a component, just select the row where the other components of the
poss
> rep are the same as current and the one you want to change is the value
you
> want to change it to.

How does this assign or change anything?

> Non-canonical representations (i.e. multiple poss rep values mapping to
one
> type value) would, I guess, be held in a separate catalog literal that
maps to
> the canonical form for the particular poss rep.

Catalog literals seem like a dead end to me. The catalog needs to represent the type, and from this representation the dbms can derive a literal as necessary.

It does not seem feasible to parameterize catalog literals, but it is easy enough to parameterize other representations of a type. Using the parameterized representation, the dbms can instantiate a literal when provided arguments for the parameters.

> All in all, it looks to me like a type system that does have some useful
> properties, and solves some of the problems that I am (and I think that
> databases are) interested it.

I am less interested in some of the properties you want, but I agree a relation literal can represent the set of values of a possible representation easily enough. However, I do not think a relation literal is the appropriate way to represent a type in the catalog.

The catalog representation must suffice to derive the appropriate relation literal, of course.

> It might even be useful for the alleged "fundamental purpose of a type
system"
> that is "to prevent the occurrence of execution errors during the running
of a
> program". Replace, "running a program" with "evaluating a [relational]
> expression", then I would indeed hope so.

As would a representation of a type that allows one to derive the relation literal for a possible representation as needed.

> [snip]
> > I suggest changing the focus from using expressions to define the set of
> > values as a relational literal to using expressions to describe the
possible
> > representations etc.
>
> Same thing (if the expressions are non "parameterized")
>
> > From a description of the possible representations, one
> > can derive or generate an expression to define the set of values. A
> > parameterized type would use a parameterized expression for the possible
> > representation etc.
>
> Sounds reasonable. What we need is the definition of a Relational
Expression
> as a data type. What are some plausible possible representations of
Relational
> Expressions? Are parameterised REs a super-type of a REs?

Parameterised REs are types with operations that result in REs.

> For REs one poss rep is simply all strings that meet the RE BNF, but to
get a
> poss rep that starts to exposes the components we need to look a
representing
> parse trees as relation values (or, actually, database values). On that
> point, I love the mad standard SQL that Vadim Tropashko found for parsing
in
> this article:
> http://www.dbazine.com/tropashko1.html
> I was even more impressed that translating it to DB2 was trivial, and that
it
> runs quickly (at least for his small example)

Cool. I am going to have to follow Vadim's stuff more closely.

> [snip]
> > > Allow an extension of the algebra to make it Turing equivalent I
(naively)
> > > guess. How far to being Turing equivalent does adding <TCLOSE> make
"A" ?
> >
> > I don't know. I strongly suspect full recursion support would get it
there,
> > though.
>
> Agreed. I wonder if <TCLOSE> should be replaced with generalised
recursion?

TTM has recursion as a relational prescription.

> [snip]
> > > > > > Drop the names of relations? How does one refer to them without
> > names?
> > > > >
> > > > > By their tuple type. Obviously that does mean enforcing POOD
however.
> > > >
> > > > That seems like a complex way to refer to a relation compared to
using a
> > > > name.
> > >
> > > Wouldn't deny that. Relation names work OK as shorthands for tuple
types,
> > it's
> > > just that not all tuple tpyes should need them and some might have
more
> > than
> > > one...
> >
> > The names I am thinking of are names of relation variables or relation
> > valued database attributes, if you prefer, and not shorthands for types.
A
> > relation of employee number and employee's name has the same type as a
> > relation of employee number and employee's manager's name, but they are
> > different things.
>
> Only if you use the same attribute names. Remember that the type of a
tuple is
> the same as the heading of the tuple. I.e. it consists of the set of both
the
> attribute names AND attributes types.

Good point.

> This, possibly counter intuitive
> definition, is stated a number of times in TTM, but still easy to miss.
>
> In the case above the two tuple types might be
> {<Employee_Number,EMP_NUM>, <Name,NAME>}
> and
> {<Manager_Number,EMP_NUM>, <Name,NAME>}
>
> Then the name "Employee" could be a shorthand for the first tuple type and
> "Manager" for the second.

I was actually thinking of this situation:

    {<Employee_Number,EMP_NUM>, <Name,NAME>} and

    {<Employee_Number,EMP_NUM>, <Manager_Name,NAME>}

And to be honest, I am not sure what I would call them. <g> Received on Tue Jul 08 2003 - 19:20:01 CEST

Original text of this message