Re: Help Data-Types

From: Mikito Harakiri <mikharakiri_at_yahoo.com>
Date: 19 Aug 2002 16:10:00 -0700
Message-ID: <bdf69bdf.0208191510.12084745_at_posting.google.com>


GoranG <no_at_spam.net> wrote in message news:<pfn1mu8u6e6tph4gpcal282bsc6kcbpjfh_at_4ax.com>...
> On 18 Aug 2002 09:57:58 -0700, mikharakiri_at_yahoo.com (Mikito Harakiri)
> wrote:
>
> >"Dave" <dave_at_dmcomm.com> wrote in message news:<_Ug79.27$Tl4.10801289_at_news.incc.net>...
> >> Just a quick question on a datatypes since I'm a young novice.
> >> What the heck would an IP address go under? Text? Integer (long, short?) ??
> >>
> >> Thanks for the help.
> >> D
> >
> >Four integer columns?
>
> My first idea as well. Seems that nothing is lost by normalizing and a
> lot is gained...

I'm not sure what you mean by normalization here. Higher normal forms (>=2) aren't concerned with domain splitting at all, while 1st normal form is about disallowing nested relations. AFAIK Database theory tells us very little if nothing at all about functions applied upon domains. None of the relational algebra operators allow any functions. On practical side, however, if we split ip address into four parts, then we want concatenating them (with dots, or not) in some view as a convenience for the end-user/application-developer.  

> So, I am puzzled by amount of the answers that suggest contrary...
>
> Is it so because the IP address is almost all ways used as a whole (in
> usual applications) and therefore should be thought of as whole in
> semantic sense? Unless you are modeling a router ;)

Again, since a simple view can aggregate a whole from the parts, there is no logical whole vs. part distinction at all. Clearly, the only distinction is a programming convenience of using the whole or parts. Probably, this is the only reason stopping us from the design like this

table person (

   ...

   birthday_century integer,
   birthday_year_in_tens integer,
   birthday_year_in_ones integer,

   ...
)  

> Similar to date datatype - if there were no date datatype we could
> model it as single real number or as three integers (both design
> require non trivial constrains).
> Btw, what is relational purist answer/approach on the date data type?

From scientist's perspective the time is just a scalar domain isomorfic to real numbers. Unfortunately, people made every possible attempt to complicate the issue -- I mean Grigorian Calendar. The base system is just ridiculous: x60, x60, x24, x28-31, x12. It's not even uniform (0 year is absent, seconds are dropped at the century boundaries, etc). Storing data in a skewed base system is just plain wrong. Alternatively, we store data in the scientific base system (milliseconds after millenium) and transform that data to the dumb end user. (The end user is dumb, because he doesn't realize that the cost of changing calendar is lower than paying hordes of programers that have to program around the goofy calendar;-).  

> Standard answer that I read to such questions is: It depends on the
> semantic model which depends on the intention of usage.

That "advice" contains zero information for me.  

> However, in most cases those data types might be used in both meanings
> (whole and parts) within the same system - not to mention that the
> usage might be unknown.
>
> So is this why modeling is called art? since there are no definitive
> answers to these questions, and for every rule there are two
> exceptions?

Again, since the whole vs. part dilemma is a matter of complexity, and complexity definition is tricky (to say at least), then, you are right, the programming, in general, and modeling, in particular, is art.  

> >Then, enforsing constraints and querieng is the
> >simplest of all. Alternatively, it could be just a single integer:
> >
> >a3*256^3+a2*256^2+a1*256^1+a0
>
> My second idea as well. This seems more natural since it fits the
> actual data type (it'll even support IPv6). Data representation in
> xxx.xxx.xxx.xxx is considered formatting function (part of this
> formatting is data extraction). However textual representation seem
> cheaper (speed vs space costs), faster (in most cases, since it is
> already formatted) and better supported (bitwise operations are less
> supported compared to string functions - all IMHO)

I can hardly see any speed vs. cost tradeoff here. First, the storage difference between varchar and integer is negligeable. With more options progressively avaliable in the database (column data compression ,for example) the difference ceases to zero. Second, one don't have to use bitwise operations: modulo and division are supported by virtully all database products. The CPU difference between biwise and ariphmetic operations is negligeable from optimizer perspective.  

> Final digressions - does the following statement hold water?:
> For all proposed data types there exist semantics in which the
> proposed data type is the better then all other.

I don't think so. If you agree that different representations differ by complexity measure only, then it is well known that for finite data structures there is fundamentally no absolute complexity measure.

On more practical note, there also is one more issue related to part vs. whole -- query optimization. It could be easily seen that if we split a column into many, then the optimizer might possibly have more complete statistics about the data distribution. Maybe, the data are skewed in such a wierd way that it can't be vizible from the whole -- single column distribution. Therefore, splitting the data seems to be better performance approach. On the other hand, nothing prevents database engine from collecting statistics on pseudocolumns (and we already do so for functional indexes). Received on Tue Aug 20 2002 - 01:10:00 CEST

Original text of this message