Re: Examples of SQL anomalies?

From: Brian Selzer <brian_at_selzer-software.com>
Date: Wed, 9 Jul 2008 10:04:51 -0400
Message-ID: <8A3dk.14663$mh5.14494_at_nlpi067.nbdc.sbc.com>


"Cimode" <cimode_at_hotmail.com> wrote in message news:eb1fcce7-aaaf-44e3-b7a5-eb1afc438dc1_at_k13g2000hse.googlegroups.com...
>
>> >> From what I've seen, poor design usually translates to inadequate
>> >> design.
>> >> If you don't spend enough time on design, you end up reinventing the
>> >> wheel
>> >> halfway through the implementation, and it ends up costing upwards of
>> >> double
>> >> the budget.
>> > How about the idea that poor(inadequate if you feel better with this
>> > term) design = using NULLS
>>
>> NULLs are only one way to deal with missing information. From your
>> comments
>> it appears that you are equating a design that uses NULLs with a design
>> that
>> permits missing information.
> I do not quite understand how you come to that conclusion or what may
> lead you to it. My point here is that using NULLS to handle missing
> information creates more problems than it solves.
>

I disagree. It is the misuse of NULLs that creates problems. Some would say that any use is a misuse, but they're just being stupid or contrary. As I explained below, it is possible to implement an equivalent database that does not use NULLs, but it tends to be a much more cumbersome affair--requiring decomposition into many tables, the introduction of flag attributes, and the enforcement of many more interrelational constraints. It seems to me that there is much more potential for problems handling missing information without NULLs: all it takes is neglecting to define one of the interrelational constraints or neglecting to add a flag attribute when needed. In addition, queries against such a database tend to require more joins and can be more complicated due to the addition of flag attributes.

Here's an example that requires the addition of a flag attribute:

Original schema that uses NULL:
R1 {Part#, Location} for parts that do not have a lot R2 {Part#, Location, Lot#} where Lot# can be NULL for parts that have a lot but it may not have been supplied.

Alternate schema that doesn't use NULL:
R3 {Part#, Location, HasLot}
R4 {Part#, Location, Lot#}
R4[Part#, Location] IN (R3 WHERE HasLot) [Part#, Location]

Splitting R2 into two tables would cause there to be two relations that have the same heading as R1. These are then combined into R3 with the addition of the flag HasLot.

>> While a design that uses NULLs necessarily
>> permits missing information, there are designs that do not use NULLs that
>> also permit missing information. In fact, for every design that employs
>> NULLs, there is an equivalent design that does not.
> OK.
>
>> >> >> >> Why would they have to spend hours? Hopefully any system would
>> >> >> >> present
>> >> >> >> enough complete information to obviate most guesses, but for the
>> >> >> >> rest,
>> >> >> >> given
>> >> >> >> a choice between being given a potentially wrong answer and
>> >> >> >> being
>> >> >> >> told
>> >> >> >> that
>> >> >> >> the information is incomplete, I (and most people I know) would
>> >> >> >> prefer
>> >> >> >> the
>> >> >> >> latter, even if it required a little extra effort on my part.
>> >> >> > But then, following this logic and the 8 item weigth example, the
>> >> >> > system would then require the information that the sum is
>> >> >> > relevant
>> >> >> > only up to 8 items to be available to people for them to make
>> >> >> > *educated* guesses. Don't you think that makes things more
>> >> >> > complicated? Most people I know would tell you, you did a poor
>> >> >> > design
>> >> >> > job and that they don't need this irrelevant information.
>>
>> >> >> The information is already available. If SUM returned an
>> >> >> indeterminant
>> >> >> result, then the user should be able to issue additional queries
>> >> >> qualified
>> >> >> with something like, WHERE weight IS NULL or WHERE weight is NOT
>> >> >> NULL,
>> >> >> in
>> >> >> order to find out why the result was indeterminant. It certainly
>> >> >> isn't
>> >> >> rocket science.
>> > What information is available ? The information fact that it is not
>> > correct(suspect) ? How is it already available ?
>>
>> If the information were not already available, then how could a query
>> that
>> performs a SUM be issued?
> I suppose that the information you are refering to is the fact that
> marker *suspect*. But don't you think that a binary (suspect/non
> suspect) information would tend to be highly subjective on the
> designer's end ?
>

The information I am referring to is in the table that is the target of the SUM. Each row either has a value to be summed or doesn't have a value.

>> > What interest in building systems that bring the information that the
>> > system is not reliable and impose on people a two step that would
>> > force them to re-interpretate and force them to permanently using NOT
>> > NULL/IS NOT NULL.
>>
>> Who said that the system is not reliable? Returning an indeterminate
>> result
>> to a query that cannot be exactly answered exemplifies robustness, not a
>> lack of reliability.
> OK can you define the difference between system's *lack of robustness*
> and system's *lack of reliability* ?
> I do not see quite the difference.
>

I can certainly define the difference between *robustness* and *lack of reliability*. *robust* and *reliable* are basicly synonymous. The only difference is that *robust* carries with it a sense of strength and the ability to resist attempts to tear it down, whereas *reliable* instead carries with it a sense of repeatability or faithfulness.

>> It is certainly better than telling the user that no
>> answer can be had, since there may be instances where the same query
>> /can/
>> be exactly answered.
>>
>> > Rocket science is understanding how is that simpler than simply taking
>> > the NULLS out of the equation and not have to deal with all this.
>>
>> If the design permits missing information, then you will have to deal
>> with
>> all this. Again, NULLs are just one way to deal with missing
>> information.
>> If there should be a value, but it hasn't been supplied, then there needs
>> to
>> be an indicator that there should be a value. This applies regardless of
>> whether NULLs are used or not. So to handle missing information without
>> NULLs, you need an indication (possibly implicit) that there should be a
>> value, which may need to be in a separate table, and another table for
>> those
>> values that should have been supplied and have. That way the absence of
>> a
>> tuple containing a value can be correctly interpreted, because it could
>> be
>> that there shouldn't be a value, and that's why there isn't a tuple.
> OK so if I understand right. So basically you are talking about
> relation decomposition in case NULLS are not to be used to handle
> missing information ?

Isn't that the only alternative?

>> >> > But what user are you refering the end user or the designer? Should
>> >> > an application end user have to know about IS NULL / IS NOT NULL?
>>
>> >> Both, and neither. The user could be the designer, or an application
>> >> developer, or the user could be an end-user writing a Crystal Report
>> >> or
>> >> integrating a spreadsheet. I would think that in order to integrate a
>> >> spreadsheet, the end-user would have to know about IS NULL / IS NOT
>> >> NULL
>> >> if
>> >> any of the columns in any of his queries can have NULLs.
>> > So, basically the end user is intended as designer on a per use and
>> > per application basis? Don't you believe this would leave the meaning
>> > of information open to designer's interpretation?
>>
>> Isn't it already?
> I tend to believe that interpretation of the meaning of data should be
> done by the end user who do not necessarily have SQL background. I
> would also tend to think that designer's role is merely to *implement*
> a construct/structure/formalization that would facilitate such
> interpretation for the end user . I do believe that using NULLS to
> handle missing information confuses both designer's and end user
> because they have to reconcile their respective subjectivenesses.

I disagree again (but mainly with the last sentence): either there should be a value and it has been supplied, or there should be a value but it hasn't been supplied. There is nothing subjective about that. Subjectivity arises when NULL is used to mean more or other than "there should be a value but it hasn't been supplied." I consider that a misuse of NULL. As the result of an aggregate (or any expression, for that matter) NULL indicates "at least one value hasn't been supplied," which really boils down to "there should be a value but it hasn't been supplied." SQL's implementation sucks.

I have no use for inapplicable nulls.

>
> Not to mention that on a physical standpoint, it degrades performance
> because of additional index scans performed when each optional column
> of a table should incluse or not IS NULL/IS NOT NULL.
>

I would argue that the extra joins required for an equivalent design that does not use nulls as well as the enforcement of many more interrelational constraints would degrade performance more. In addition, you would still have to deal with the flag attributes, which being boolean are not conducive to indexing.

> End user tend to end up frustrated by a lack of correctness due to
> designer's subjectiveness, a lack of accuracy due to a proprietary
> implementation of the markers, and an important performance
> degradation due to ISNULL index cumulative overhead.

Again, it is the misuse of NULL that is the cause of the frustration. Received on Wed Jul 09 2008 - 16:04:51 CEST

Original text of this message