Re: foundations of relational theory? - some references for the truly starving

From: Mike Preece <michael_at_preece.net>
Date: 24 Oct 2003 06:56:30 -0700
Message-ID: <1b0b566c.0310240556.3bf271e6_at_posting.google.com>


Costin Cozianu <c_cozianu_at_hotmail.com> wrote in message news:<bn7rd0$tul6k$1_at_ID-152540.news.uni-berlin.de>...

This is theory right? Theorising's a bit like imagining. So I'll imagine you're not a diificult person to discuss things with...

[snip]

> > If no "frame fault" in involved then performance is the same whether
> > the item being accessed is 15 bytes long or 1500. Performance is also
> > equal regardless of the number of items on the file or total file
> > size. It will take no longer to read one 1500 byte item from a 10 gig
> > file containing 30m items than it would to read a 15 byte item from a
> > 10k file containing 100 items.
> >
>
> BS. Or do you mean you don't write to log files ?

No - we don't usually. It's not usually wanted or needed. We can if you want. It's very easily done - centrally, with a trigger fired when the file is updated if you'd like.

>
> You also have to count that increasing record size decreases the number
> of rows stored in a single page (frame)

Yes, it decreases the remaining allocated space when you write new data to a file.

>
> > It might be worth mentioning at this stage that the amount of data you
> > can store in a Pick item will require far less physical disk space
> > than on any other (uncompressed) database.
>
> Another big bullshit of the month.

So now I hope you'll attempt to show me clearly and concisely how you'd store the data in my example below in fewer bytes an another equally useable database. Pick one. You guys live under the impression that you are technologically advanced.

>
> > There are two reasons for
> > this:
> > 1) because every field is of variable length - every datum is just as
> > long as it needs to be and no more and empty fields occupy no space at
> > all (other than single character system delimiters) -
>
> Sure. Every DBMS and their grandmo's have this feature, actually it's
> probably implemented better because numbers are in binary, there's a
> certain room for further updates, etc, etc.
>
> and
> > 2) because there is often no need for pointers to related data (the
> > related data is within the item itself).
> >
> >
>
> Oh, yes. Very big deal. A 4 bytes is a tremendous disaster.

Four bytes? Where did that come from? You're going to tell me that the total cost in disk space for a separate table to hold related values with pointers between parent and child tables and then to go to another level with pointers between the child table and it's child table, tables headers and all the rest of I don't know what for an infinite amount of data is, ummm..., lets me count them... 1 2 3 4... you did say FOUR BYTES? Well spear me!

> In any case,
> clusters and arrays and nested tables are there.
>

Oh. So that's alright then is it?

> >>So if the master table is accessed/updated more frequently then
> >>accesses for the information in the two of them combined
> >
> >
> > What "master" table? Which two? In Pick/MV there need only be one
> > item. One access/update. One head movement. READ INVOICE FROM
> > INVOICES,'12345' need not require any other reads to get the "line
> > item details" - they can be in the INVOICE item as multivalues (and
> > possibly subvalues for finer detail within each "line item").
> >
> >
>
> One access ? Does it write byte by byte, or does it write frames to the
> disks ?
>

Afaik, and I'm prepared to defer to those that know more about the internals, it reads a frame into memory, updates it, marks it as updated and it is at some later stage flushed to disk.  

> How about transaction logs ?
>

How about them? You want them - you can have them.

> And spear me the head movements, per update. This calculus is as valid
> as asking people how many clock cycles a Pentium 4 Xeon will spend for
> an ADD AX, BX. Guess what, more often then not is less than one.
>

Ah - now there's something interesting! I often see people use "then" when I think they mean "than". Is this an american thing or what?  

> >>there's not a
> >>good case to cluster the parent/children records (or whatever your
> >>nomenclature is) together.
> >>

There is often no good reason for both to exist in the first place. This is comp.databases.theory. Databases exist to record and retrieve data. Relational theory is the theory that some data is related to other data. I think some extremely brilliant mathemetician came up with that startling concept. If it didn't make sense to store related data together then every datum would be held seperately. There'd be more pointers than data. Just because the surface of the screen you're looking at can be plotted using 2 coordinates - like columns and rows - doesn't mean databases must be similarly limited. The Pick database allows you to store related data together. The reason for storing related data in a seperate file on a Pick database is not for some ridiculous reason like an inability to be able to plot it in 2 dimensional column and row terms.

> >>
> >>>In Pick the nested rows don't need pointers - they are in-situ. The
> >
> > item,
> >
> >>>or record, contains all of the related data - hence we get it all
> >
> > in a
> >
> >>>single read. No need to "dereference" anything. Not such a "plenty
> >
> > of
> >
> >>>complications".
> >>>
> >>
> >>The "complications are with regards to integrity constraints,
> >>normalization theory and query optimization.
> >>

What do you really mean by 'integrity'? Don't you really mean that the pointers between the *often unnecessary* tables must be maintained or the whole crock of s**t falls apart and makes big brown smelly stains all over your two dimensional world? Well here's a startling concept: What if you didn't need those pointers between the unnecessary tables because they don't actually exist?

And 'normalisation'? You have to fit your data into a 2 dimensional structure although you get the sneaking suspicion sometimes that it's actually a lot more difficult to do so than it should be somehow? That's because you're insisting on making life difficult for yourself. Well go on if that's what rocks your boat. Knock yourself out!

> >
> >
> > Data integrity is simplified, normalisation is greatly simplified and
> > there's no need to optimise anything.
>
> No because you pre-program moving your cursors just like in FoxPro, right ?
>

You guys! What the...? <deep sigh!>  

[big snip]

Oh look! Here's that example. Remember? You wanted to see an example...

> >
> > Example:
> >
> > We have a sales ledger requirement.
> >
> > We need to be able to record, against each invoice:
> >
> > * The Products ordered; (The product codes are composite keys
> > comprising a generic Product Number concatenated with a Size code and
> > a Colour code - only we don't know whether they are numeric or
> > alphabetic or alphanumeric, or how many characters long they are. All
> > we know for sure is that they must exist on the Product file, that the
> > Product File is keyed on Product Number, and that there are lists of
> > allowable Sizes and Colours for each Product item)
> > * The Quantity of each Product ordered; (Must be a whole number of
> > units greater than 0)
> > * The Date on which each Product ordered is required to be despatched
> > by; (we know they want to be able to despatch products in partial
> > shipments as and when they're available when the truck leaves the
> > depot on its delivery rounds)
> > * The Quantity of each partial shipment;
> > * The Date of each partial shipment.
> >
> > There are other requirements - but they are not known/defined at this
> > stage. We were lucky to get this much detail. The CEO for the company
> > we're writing this application for is a very hard man to pin down. He
> > wants it now but he doesn't know what it is exactly. We had a five
> > minute chat with him and, based on that, we come up with:-
> >
> > Pick file layout description:
> > File: INVOICES
> > ID: Invoice Number
> > 001: Product codes (multivalued)
> > 002: Quantity ordered (multivalued)
> > 003: Requested Ship-by Dates (multivalued)
> > 004: Quantity despatched - for each partial shipment (subvalued)
> > 005: Despatched Date for each partial shipment (subvalued)
> >
>
> This is already non-sense.

Why? What don't you understand? About the example I mean? It's kinda like what they call a "real world" requirement.

We want to store and retrieve data relating to invoices. There are 5 things we know about them. There can be an infinite number of Product codes, each with a single numeric quantity ordered and a single ship-by date. For each related (product, quantity & ship-by date) there can be an infinite number of partial shipments comprising a (single numeric quantity despatched and a despatch date).

There might also be no shipment for a product on order btw.

>
> You mean you can;t have a record type
> type InvoiceLine = record ProductCode: int , QuantityOrdered: float,
> QuantityDispatched: float, RequestDeadline: date
>

I don't know. In your equivalent database design you're free to do whatever you like - so long as you give the customer what they need.

> And have a single multivalued column with those records ? So how does
> the DBMS *enforce* the needed integrity constraints ?
>

Up to you.

In the Pick database as described, there is one file. No pointers to related data. All of the related data is in a single item.  

> > We create the file and write a program to allow them to start using
> > it. They spend the morning entering data they had recorded on various
> > bits of paper.
> >
>
> Oh sure. With no integrity constraints. Remember good old days, where we
> had a "Data Quality" department with data operators to double check the
> non-sense.

And I'm guessing there was a lot of that was there?

>
> > Later that day we look at some of the data. Here's what it looks
> > like:-
> >
> > Actual Pick file contents:
> > INVOICES
> > ID: 12345
> > 001: ABC*123*RED]ABC*456*RED]DEF*123*BLUE]DEF*789*WHITE
> > 002: 100]400]200]300
> > 003: 13085]13090]13090]13087
> > 004: 50\50]230]200]100\120
> > 005: 13079\13080]13080]13079]13078\13080
> >
>
> That looks great.

Thanks. But then - you're probably teasing me aren't you?

Each line (001-005) is what we call an 'attribute', See the layout described above.

The 'ABC' and 'DEF' in attribute 1 are Product Numbers. Funny looking numbers aren't they. Hang on - I'll ask why they're called numbers...

...The woman in the office says they used to be numbers but they changed some of them to letters a while ago. They still refer to them as numbers though. Some of the products are numbers, some are letters and some have hyphens and things in them.

The '123', '456' & '789' are Sizes.

The '*' is just an asterisk. Could be anything really - so long as it's not something that's ever going to be in any of the values it's delimiting.

You've probably got it now.

Oh yes! The ']' & '\' characters? The ']' represents a char(253) which is a Pick system delimiter called a 'value mark'. They seperate each of the multiple values within each attribute. They '\' is a char(252) and we call that a 'subvalue mark' - to seperate the multiple sub-values within each 'multivalue'. Don't worry too much about them. They don't mean anything outside of Pick.  

>
> > The boss now wants a printout of this particular invoice.
> >
>
> Hey, I was doing that when I was a kid.
>

Yeah?! Really?  

> > We create some dictionary items:
> >
> > ED DICT INVOICES Product Size Colour OrderQty ReqShipDate ShipQty
> > ActShipDate
> >
> > Product
> > 001 S
> > 002 1
> > 008 G*1
> > 009 L
> > 010 7
> >
> > Size
> > 001 S
> > 002 1
> > 008 G1*1
> > 009 L
> > 010 4
> >
> > Colour
> > 001 S
> > 002 1
> > 008 G2*1
> > 009 L
> > 010 6
> >
> > OrderQty
> > 001 S
> > 002 2
> > 003 Quantity]Ordered
> > 009 R
> > 010 8
> >
> > ReqShipDate
> > 001 S
> > 002 3
> > 003 Ship By
> > 007 D
> > 009 R
> > 010 11
> >
> > ShipQty
> > 001 S
> > 002 4
> > 003 Quantity]Despatched
> > 009 R
> > 010 10
> >
> > ActShipDate
> > 001 S
> > 002 5
> > 003 Despatched On
> > 007 D
> > 009 R
> > 010 13
> >
>
> IS this binary file dump, or is this code, or what;s this mumbo jumbo ?

They're Pick dictionary items. We use them in Pick to produce... hey! look at this... Just like when you were a kid...

>
> > ...and produce a listing by entering:
> >
> > :list invoices '12345' product size colour total orderqty reqshipdate
> > total shipqty actshipdate
> >
> > Page 1 invoices 18:12:14 23
> > Oct 2003
> >
> > invoices.. product size colour Quantity Ship By.... Quantity..
> > Despatched On
> > Ordered Despatched
> >
> > 12345 ABC 123 RED 100 28 Oct 2003 50 22
> > Oct 2003
> > 50 23
> > Oct 2003
> > ABC 456 RED 400 02 Nov 2003 230 23
> > Oct 2003
> > DEF 123 BLUE 200 02 Nov 2003 200 22
> > Oct 2003
> > DEF 789 WHITE 300 30 Oct 2003 100 21
> > Oct 2003
> > 20 23
> > Oct 2003
> > *** 1000 650
> >
> > [405] 1 items listed out of 1 items.
> >
>
> Now, that's quite a feat.

Shucks. Not really. Just something I knocked up - probably while you were flaming some poor chap on the newsgroup.

>
> > Question: How many disk reads do you think it took to get all of the
> > information shown above relating to the invoice?
> >
>
> Approximately the same as in any modern DBMS.
>

It'll be interesting to to some benchmarks when you've got your system ready. The CEO is an impatient man though! How long do you think it'll take?  

Better get to it then huh?

Cheers mate,
Mike. Received on Fri Oct 24 2003 - 15:56:30 CEST

Original text of this message