Re: Representing data on disk in an MV database - was Re: foundations of relational theory? - some references for the truly starving

From: Mark Brown <mbrown_at_drexelmgt.com>
Date: Fri, 24 Oct 2003 07:35:39 GMT
Message-ID: <fD4mb.60595$th6.19041_at_twister.socal.rr.com>


Comments below

"Mike Preece" <michael_at_preece.net> wrote in message news:1b0b566c.0310231443.4ce7517f_at_posting.google.com...
> Jonathan Leffler <jleffler_at_earthlink.net> wrote in message
> news:<bUTlb.2937$wc3.186_at_newsread3.news.pas.earthlink.net>...
> > Mike Preece wrote:
> > > [...]
> > If I'm correct, and if an MV database is storing integer values in the
> > range 0..999 (or maybe 0..9999),
>
> There used to be an integer limit. I'm honestly not sure if there is
> anymore. I'll look it up if you'd like.

Since AP came out, Pick/RD has been using what they call "string math". The runtime first tries to convert the number to 64-bit binary. If it overflows, it falls back to doing whatever arithmetic is required on the digit strings themselves instead of on native numbers. Remember, we're not talking scientific notation here. This is business data, and as such four decimal places is usually enough and values are usually < 2^63.
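
I obviously can't show their code, but here's a rough Python sketch of that general idea (all names are mine; addition of non-negative integers only): try the native 64-bit path first, fall back to digit-string arithmetic when the result won't fit.

INT64_MAX = 2**63 - 1

def string_add(a: str, b: str) -> str:
    """School-book addition on decimal digit strings, right to left."""
    n = max(len(a), len(b))
    a, b = a.rjust(n, "0"), b.rjust(n, "0")
    digits, carry = [], 0
    for da, db in zip(reversed(a), reversed(b)):
        carry, d = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

def add(a: str, b: str) -> str:
    # Python ints never actually overflow; the check just simulates the 64-bit limit.
    x, y = int(a), int(b)
    if x <= INT64_MAX and y <= INT64_MAX and x + y <= INT64_MAX:
        return str(x + y)        # fits in 64-bit binary: use native math
    return string_add(a, b)      # "overflow": do it on the strings

print(add("12345", "678"))                  # native path: 13023
print(add("9223372036854775807", "1"))      # string path: 9223372036854775808
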
>
> > then the space used is no worse than
> > in a system which uses a binary data representation, but as soon as
> > you're dealing with numbers in the millions, instead of using a fixed
> > length (4 bytes for a 32-bit integer), this system seems to be using a
> > lot more space. (Yes, I quietly ignored negative numbers - I'm
> > curious about how they're represented on disk too.)
> >
>
> OK. I'm prepared to concede this point - to a degree. Packed decimal
> isn't allowed in Pick - out of the box, you'd have to 'roll-your-own'
> functions to allow it. This would potentially mess with system
> delimiters (x'7F' would be a problem for example). The number 12345
> takes 5 bytes, as does 123.45 (numbers are generally stored in
> internal format with input and output conversions applied). Minus
> 123.45 takes 6 bytes. Obviously packed decimals take up less space. I
> suppose then, if your data comprises mostly packed decimals in a
> database that caters for variable length data (I wonder how fields
> would be delimited though), you would be able to get more data into
> less physical disk space. As a matter of academic interest I'd like to
> know which database can do this and how it does it.
>
Without actually seeing the source code, but after many long conversations with MS engineers (I'm originally from Seattle), I get the impression they store data in blocks a little like we do with frames. Their data is pretty much all the same (fixed length), so all you need is a header to describe the data type and then a long string of bytes, X bytes per entry. So if you have a column of numbers (long integer), they'll all be 4 bytes long. Since rows are assembled from relative position pointers, you can find anything pretty fast by simply pointing at the header, inc'ing by the header length, then inc'ing by relative record # * column width. Voila.
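
To make that concrete, here's a toy Python sketch of my impression (the header layout and widths are made up, not MS's actual format):

import struct

HEADER_LEN = 8      # made-up header: a type tag plus the entry width
COL_WIDTH = 4       # "long integer" column: 4 bytes per entry

values = [10, 20, 30, 40, 50]
block = struct.pack("<II", 1, COL_WIDTH)                 # fake header
block += b"".join(struct.pack("<i", v) for v in values)  # fixed-width entries

def fetch(block: bytes, rel_rec_no: int) -> int:
    # point at the header, inc by header length, inc by rel rec # * column width
    offset = HEADER_LEN + rel_rec_no * COL_WIDTH
    return struct.unpack_from("<i", block, offset)[0]

print(fetch(block, 2))      # -> 30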

In SQL, a column of "text" or "string" data type will have 256-byte entries: the max data length. MS Access defaults to 50 bytes (these may be different now). A "memo" data type is different, as are VARCHAR data types; but those have the added overhead of a length field attached so the system will know where they start and stop. Those length fields are each 4-byte unsigned integers (thus the 4GB limit on memo strings).
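
The length-prefix part looks something like this (again my toy layout, not the real one); the 32-bit prefix is exactly what caps a value at 4GB:

import struct

def pack_varchar(s: str) -> bytes:
    data = s.encode("utf-8")
    return struct.pack("<I", len(data)) + data   # 4-byte unsigned length prefix

def unpack_varchar(buf: bytes, offset: int):
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return buf[start:start + length].decode("utf-8"), start + length

buf = pack_varchar("Hello") + pack_varchar("variable-length world")
s1, nxt = unpack_varchar(buf, 0)
s2, _ = unpack_varchar(buf, nxt)
print(s1, "|", s2)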

I actually did all this one Saturday afternoon soon after I learned SQL. I thought they had some wonderful ideas, like on-the-fly translates (joins), and I wanted to see if it could be done in Pick.

I built a large file with an item called TABLES, which was a multivalued list of table names. Then, for each table, an item T$tablename defined the column names along with stuff like column width, data iconv/oconv, etc.; and then column records C$table.column held the data itself.

Each column record was a long dynamic array, one attribute per row value. "Select Name, Addr, City from Custs Where CustID = 100" is just a matter of reading the records C$CUSTS.NAME, .ADDR, .CITY and .CUSTID from the file; finding the position of 100 in the CUSTID column; and using that vector to build an item: name<fnd> @AM addr<fnd> @AM city<fnd>
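
For anyone who wants the flavour without the Pick/BASIC, here's roughly the same scheme simulated in Python (the file becomes a dict, dynamic arrays become lists; the data is made up):

FILE = {
    "TABLES": ["CUSTS"],
    "T$CUSTS": ["NAME", "ADDR", "CITY", "CUSTID"],    # column definitions
    "C$CUSTS.NAME":   ["Smith", "Jones", "Brown"],
    "C$CUSTS.ADDR":   ["1 Elm St", "2 Oak Ave", "3 Pine Rd"],
    "C$CUSTS.CITY":   ["Seattle", "Tacoma", "Renton"],
    "C$CUSTS.CUSTID": ["98", "99", "100"],
}

def select(table, columns, where_col, where_val):
    # find the position in the where-column; use it to pull each named column
    fnd = FILE[f"C${table}.{where_col}"].index(where_val)
    return [FILE[f"C${table}.{col}"][fnd] for col in columns]

# Select Name, Addr, City from Custs Where CustID = 100
print(select("CUSTS", ["NAME", "ADDR", "CITY"], "CUSTID", "100"))
# -> ['Brown', '3 Pine Rd', 'Renton']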

Update, Insert and Delete all work pretty much the same way. New data always goes on the end; I kept indexes on various "key" fields, but the data always stays in its original sequence. Sorting by unindexed columns required a temp file for speed, and I never quite got move next/previous through my record set working properly.
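
Insert, in that simulation, is just an append to the end of every column record plus a note in the key index (my sketch again, not the original BASIC):

FILE = {
    "C$CUSTS.NAME":   ["Smith", "Jones", "Brown"],
    "C$CUSTS.CUSTID": ["98", "99", "100"],
}
KEY_INDEX = {v: i for i, v in enumerate(FILE["C$CUSTS.CUSTID"])}  # CUSTID -> row

def insert(table, row):
    for col, val in row.items():       # new data always goes on the end
        FILE[f"C${table}.{col}"].append(val)
    KEY_INDEX[row["CUSTID"]] = len(FILE[f"C${table}.CUSTID"]) - 1

insert("CUSTS", {"NAME": "Doe", "CUSTID": "101"})
print(KEY_INDEX["101"])    # -> 3: existing rows never move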

The really tricky part, of course, is the command-line parser that understands stuff like

"Insert Name, Addr, City, State into NewCust Select CustomerName as Name, CustAddr as Addr, Town as City, UCase(St) as State from Custs"

But Hey! It was only one afternoon while my wife was out shopping.

Mark Brown