Re: Units of Measurement in the Database Model
Date: 14 Apr 2002 10:15:55 -0700
Message-ID: <57da7b56.0204140915.11087419_at_posting.google.com>
mikharakiri_at_yahoo.com (Mikito Harakiri) wrote in message news:<bdf69bdf.0204131003.3527c104_at_posting.google.com>...
> paul_geoffrey_brown_at_yahoo.com (Paul G. Brown) wrote in message news:<57da7b56.0204122058.6773d7da_at_posting.google.com>...
[ snip ]
> > I oversimplified.
>
> To the point where the expence of creating and maintaining
> user-defined datatype is not quite obvious to the reader? Let
> summarise what a user needs to do in order to be able to store
> "pounds" and "kilogramms" together:
>
> 1. Define a user defined type (with methods and operators).
> 2. Define index(es) on the new type.
> 3. Define statistics and histograms.
> 4. Extend aggregate operators to embrace the new type.
>
> (Didn't I forgot anything?-) And we compare this to a view with a
> trivial conversion factor?
What items on this list (except statistics and histograms) could be left out by application developers? They will inevitably do measurements as an ADT, write a compare to do sorts, and provide basic math operations.
And this is actually a problem . Let's focus on measurements of length (or distance). If the application wanted to manage both FEET and CENTIMETERS, then the code to handle the conversion is going to have to be written, at least once, for each front end-language that uses the database. Once in Java, C#, Perl (for CGI plug-ins) and so on. Even if all you wanted to do was to write a conversion function then doing it once in the DBMS minimizes the amount of code, and maximizes the system's overall integrity.
The basic motivation for the domain-centric approach is that applications are often required to store the original units: that is, information about the original units is not an artifact. And I can think of numerous examples where this is necessary. For legal reasons a firm might be required to store information in the original units: foreign exchange (US$, EU, etc), commodities contracts, scientific measurements (hold the units argument for a second and think about how to handle error-bars). In GIS systems you always need to record more complex 'unit-like' information, such as the projection in which some distance or area calculations were performed.
If you take your position to it's extremes, why do programming languages have abstract types at all? Why not just write transformation logic to turn it all into tagged bit arrays? I grant you that this is (more or less) precisely what does happen under the covers, but the evidence of the growing reach of the SQL type system and the emphasis on typing in programming languages generally suggests that the trend goes the other way.
> A typical shotcut that DBMS would suggest to the user is defining just
> a single method that would map a value of user defined type into a
> number, so that most of the bullet 1, and all of 2,3, and 4 becomes
> unnecessary. But this mapping is esentially a conversion into standard
> units!
Some extensible DBMS products (notably DB2) take exactly this approach. You define a DISTINCT TYPE OF DOUBLE USING COMPARISONS, write your conversion function, and you're done. But this approach can't extend to certain types of information: Rational and Complex numbers, for example, or the more exotic unit forms, and it doesn't retain information about the original unit.
Doing a new type really isn't all that hard. In Postgres, you can get away with compare, in_func and _out_func. If you want math, then plus(), minus(), and so on are necessary. I wrote an entire library of weights-n-measures for INFORMIX in 'C'. All up it's about 1000 LOC, and it took me about ten days. (Granted, I was familiar with the tools and interfaces.)
>
> > Instead of Measurement, use a label like 'Length'. The only valid units
> > for a length would be the ones that could be meaningfully converted from
> > one to another. Thus comparing Parsecs with Kilogram meters per second
> > squared would not be possible.
>
> Note, that some of the users, for example, physicists, might want to
> compare mass and length. Do you suggest different "LengthInMeters"
> classes for different categories of users?
I suggest designing your data model to accurately and rigorously reflect your user's problem domain. I'm unclear on why they might want to, but if the physicist/users could define algorithmically what they mean by comparing mass and length (or how to map values from one domain into another) then I could support it quite elegantly. IMHO, LengthInMeters and LengthInYards are synonyms for the same domain, just as 'Employees' and 'Workers' might be different names for the same entity type in ER modelling. I'm not suggesting that you go nuts, but it does seem to me that a design putting units in a normalized table is guilt of 'multiplying entities' (Thanks Rev. Occam).
[ SNIP ]
> I don't see why measurements are different from other domains. In an
> enterprise system that spans the US and France with headquaters
> located in the US, do we store the names of the employees located in
> France in French?
So you concur! French and American employee names can be mixed and compared in quite straightforwrd ways: they can be concatenated, an interleaved list of names sorted, and so on. What's more complex is what happens when you introduce Kanji into the mix. But they key thing is not to compare employee names with Tokyo stock symbols or measurements of temperature on the surface of the sun over time.
And look, there are all kinds of problems with the FK/VIEW approach. For one thing, this is not a view you will be able to UPDATE freely (what happens if you try to update the unit?) And how does a DBA or report writer working with the data specify variables in queries? And then there is the small challenge of re-implementing the conversion/arithmetic routines in every front end (or, god forbid, getting user consensus on the 'best' canonical unit to adopt: I can just feel the JAD session now, with bigots on both sides going round and round the mulberry bush on the merits of pound feet per second versus Newtons), and what about replication or federated/distributed systems among databases electing to adopt different 'standards'?
In a nuthshell, the UDT is a) code you are going to have to write anyway, but by centralizing it you tighten the database's integrity, reliability and security, b) it solves the set of problems with a minimum of 'sorry, you can't do that' (retain original measures, UPDATE, write reports), and c) it affords the very highest amount of flexibility. It's single drawback is that it is an unfamiliar path up the database development mountain.
Hope this clarifies some things. The discussion has been very good.
KR
Pb
Received on Sun Apr 14 2002 - 19:15:55 CEST
