Re: Representing data on disk in an MV database - was Re: foundations of relational theory? - some references for the truly starving

From: Jonathan Leffler <jleffler_at_earthlink.net>
Date: Fri, 24 Oct 2003 04:19:14 GMT
Message-ID: <6L1mb.4003$wc3.803_at_newsread3.news.pas.earthlink.net>

Mike Preece wrote:

> Jonathan Leffler <jleffler_at_earthlink.net> wrote:

>>Mike Preece wrote:
>>>[...]
>>> It might be worth mentioning at this stage that the amount of data you
>>>can store in a Pick item will require far less physical disk space
>>>than on any other (uncompressed) database. There are two reasons for
>>>this:
>>>1) because every field is of variable length - every datum is just as
>>>long as it needs to be and no more and empty fields occupy no space at
>>>all (other than single character system delimiters) - and
>>>2) [...]
>>
>>Does that mean that all field values are stored as printable character 
>>strings?

>
> Yes they are.
>

>>I've seen similar comments before, and always been puzzled 
>>by them.  It seems to be the only way to understand the use of a fixed 
>>byte code as the value separator (sorry - I've forgotten the correct 
>>MV terminology).

>
> Apology not necessary. Thanks anyway - it's like a breath of fresh
> air.
> Char(255) = End of item mark
> Char(254) = Attribute mark
> Char(253) = Value mark
> Char(252) = SubValue mark
>

>>If I'm correct, and if an MV database is storing integer values in the 
>>range 0..999 (or maybe 0..9999), then the space used is no worse than 
>>in a system which uses a binary data representation, but as soon as 
>>you're dealing with numbers in the millions, instead of using a fixed 
>>length (4 bytes for a 32-bit integer), this system seems to be using a 
>>lot more space.  (Yes, I quietly ignored negative numbers - I'm 
>>curious about how they're represented on disk too.)

>
> OK. I'm prepared to concede this point - to a degree. Packed decimal
> isn't allowed in Pick - out of the box, you'd have to 'roll-your-own'
> functions to allow it. This would potentially mess with system
> delimiters (x'7F' would be a problem for example). The number 12345
> takes 5 bytes, as does 123.45 (numbers are generally stored in
> internal format with input and output conversions applied). Minus
> 123.45 takes 6 bytes. Obviously packed decimals take up less space. I
> suppose then, if your data comprises mostly packed decimals in a
> database that caters for variable length data (I wonder how fields
> would be delimited though), you would be able to get more data into
> less physical disk space. As a matter of academic interest I'd like to
> know which database can do this and how it does it.

Well, maybe I can shed some light on this - at least as it applies to the Informix databases (SE, OnLine, IDS, XPS that is - not RedBrick, nor the U2 systems, nor yet Cloudscape or the other half-dozen or so DBMS that Informix acquired at various times), and slightly less definitively to DB2 in its various forms. I'm fairly sure that the basic ideas I'll outline apply to Oracle, MS SQL Server, Sybase -- and even some of the non-relational systems such as IMS.

All columns in an SQL table are typed. The majority of those types have a fixed size representation on disk (and in memory). Consequently, there's no need to delimit the data on disk. For example, the first four bytes of a particular record type on disk contain a 32-bit signed integer in some canonical format (either big-endian or little-endian). That's why I prevaricated about 0..999 vs 0..9999; with the end marker byte, 3 digit numbers in MV databases occupy the same space as a 4-byte integer in the average SQL system. So, in practice, the range -99..999 would occupy four bytes on disk (counting the marker byte), or -999..9999 not counting the marker byte.

Different SQL DBMS have different schemes for storing variable length character data. Some will organize the storage so that the data for the fixed length columns is stored at the beginning of the record and the variable length data at the end - others interleave the fixed and variable length data. And for some types of variable length data (large objects, binary or text - aka blobs), the data for the variable length object is stored completely separately, with just the (fixed length) information about where to find the actual data stored in the main record. No doubt there are other schemes too. There are different performance implications for the different organizations, though it is second order (minor detail) performance. Generally, the smaller variable length types (eg VARCHAR(255) or smaller), will use a layout such as a 1-byte length followed by that many bytes of data.

Now, when you have a variable size list of integers (eg SET{INTEGER}), it is reasonable to guess that there is a variable length record prefixed by a count of the number of integers and then that many fixed-length integer values.

-- 
Jonathan Leffler                   #include <disclaimer.h>
Email: jleffler_at_earthlink.net, jleffler_at_us.ibm.com
Guardian of DBD::Informix v2003.04 -- http://dbi.perl.org/

Received on Fri Oct 24 2003 - 06:19:14 CEST

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message