Re: Nulls, integrity, the closed world assumption and events

From: dawn <dawnwolthuis_at_gmail.com>
Date: 18 Jan 2007 07:02:03 -0800
Message-ID: <1169132523.355992.33600_at_q2g2000cwa.googlegroups.com>


JOG wrote:
> dawn wrote:
> > JOG wrote:
> > [snip]
> > > As I said I prefer empty sets to nulls as a stop gap to the missing
> > > information issue. I find relational decomposition even more intuitive,
> > > but am still undecided if there is not a better methodology out there
> > > somewhere than 6NF. However, I am a firm believer that an elegant
> > > theory will lead to good practice.
> >
> > Any elegant theory? That has surely steered us wrong in the past. I
> > don't know enough about relativity, quantum theory, string theory and
> > the like, but I'm pretty sure there are examples for my point among
> > those.
>
> While I had to wait for the noise you've generated again to die down a
> bit dawn, I must say I do find it hard to keep conversations with
> yourself on track - you take the conversation off on such tangents to
> the main point, which then develop a polemic life of their own, that it
> is very hard to maintain focus. I really hope that this isn't
> intentional,

Nope, I suspect I just come from a different angle: not as a computer scientist or mathematician, but as a practitioner and then budget officer over the years the data processing industry transitioned from file systems to DBMSs to RDBMSs (or approximations thereof, a qualifier one must add for this crowd). I have tried to adjust my terminology and thought processes enough to be part of the discussion here, since there are some mysteries I've been trying to clear up (why the industry headed to 1NF and 3VL, and when it will rectify the existing situation, for example).

> because I am /sure/ you must know what I was referring to
> the truism that good theory leads to good practice.

Actually, I thought you were suggesting that if we come up with what we consider to be good theory, such as relational theory as it was in the 1970s, then we will have good practice. Relational theory seems like "good theory" (at least to most people and for the most part), but we botched the practice in some ways by thinking along the lines you suggest, that good theory leads to good practice. I don't want to see the industry repeat this mistake by employing the same mantra again. So, I didn't see your statement as a truism, but as a way for us to repeat our mistakes by not testing (e.g. by collecting empirical data on) the practice that stems from "good theory."

> If a theory doesn't
> lead to good practice, it is a darn good indication that it wasn't a
> good theory in the first place. Either way, this is all pretty
> irrelevant and distracting stuff.

You made a statement with which I disagreed, and which I thought was what led us to disrupt things in an unfortunate way in the first place. I apologize if my response to a statement you thought obvious was distracting, but I honestly think it is a central issue in how the industry took a bad turn.

> > I, too, think that modeling using relations and applying
> > set-based functions is good, but do not share your intuition for
> > splitting up predicates whenever there is a chance of using the form a
> > null that works with 2VL.
>
> You should do though! Why? Because even though an empty set is a value,
> the propositions we state, and predicates we use to describe
> commonalities, do not contain sets, and would only extremely rarely
> ever mention sets. I think entrenched techie processes have clouded
> your intuition - you are thinking in terms of record systems, and what
> will fit in the 'holes' they engender, instead of considering that when
> we state facts, by definition, they have no holes! Never!

I'm hesitant to disagree because I suspect this is another truism. I will make a couple of comments, however.

  1. Modeling with 2VL and nulls works very well. We have thrown it out before, as an industry, thinking that it did not align well with some theory. We should be careful not to simply toss out what works well.
  2. When we collect data from a human and show it back, we are often collecting it with "blanks" and then showing it back with blanks. For example, there might be a screen and related reports carrying the proposition: Person 12345 is a male named John Doe whose hair color is _________. The user sees the blank and does not fill it in, either because John Doe is bald or because they do not know his hair color. Someone else then sees a report that provides this very proposition back. It is ambiguous, but if it were important that it not be, we could have collected additional data and insisted that the data entry person tell us why they are leaving it blank. Human beings work with propositions that include blanks all the time, performing logical functions based on them. It seems that we could model human language and behavior with our stored propositions and do likewise. In other words, when we model the propositions the way the user does, we are simulating human behavior: the behavior humans would use if the data were stored on paper forms in file cabinets. It is the way a person would intuitively work with the data. If we cannot simulate this with a computer, then why not? Again, it seems to work very well, and from what I have seen, users of systems with this approach to NULL have fewer problems querying and working with them. The system then "thinks like" they do. A rough sketch of the difference follows.
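
Here is a quick Python sketch of the difference I mean (mine alone, with made-up names and data, and Python merely standing in for whatever a DBMS actually does). Under 3VL, John Doe satisfies neither "hair color is brown" nor its negation, so he silently drops out of both queries; under 2VL with null-as-empty, every row lands on one side or the other, just as with paper forms:

    NULL = object()  # stand-in for an SQL NULL under 3VL

    people = [
        {"id": 12345, "name": "John Doe", "hair_color": NULL},  # left blank
        {"id": 12346, "name": "Jane Roe", "hair_color": "brown"},
    ]

    def eq_3vl(a, b):
        # Any comparison involving NULL yields UNKNOWN (here, None).
        if a is NULL or b is NULL:
            return None
        return a == b

    # 3VL: John Doe appears in neither result.
    brown = [p for p in people if eq_3vl(p["hair_color"], "brown") is True]
    not_brown = [p for p in people if eq_3vl(p["hair_color"], "brown") is False]

    # 2VL with null-as-empty: the blank is just the empty value, so the
    # two queries partition the data the way a person flipping through
    # paper forms would partition it.
    people2 = [
        {"id": 12345, "name": "John Doe", "hair_color": ""},
        {"id": 12346, "name": "Jane Roe", "hair_color": "brown"},
    ]
    brown2 = [p for p in people2 if p["hair_color"] == "brown"]
    not_brown2 = [p for p in people2 if p["hair_color"] != "brown"]  # includes John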

> And to think in terms of holes like this really isn't intuitive - for
> example, say you're writing some XML document. You are hardly going to
> create a tag with nothing in it, when you could have just left the tag
> out altogether are you?

If it is an XML document to back a report to the user, one might do so.  If we are repeating the tags for each column with each row of data, then (ugly as that is), so be it. For a null value, I would think you would have <hairColor /> (or <hairColor></hairColor>) for that row in your XML.

> The logic you are following is asking me to put
> in that tag with an empty set inside it.

Yes, or the shorthand version of that in XML.

> Framed this way it seems like
> madness (if still preferable to putting an sql-null in there).

Ah, but it isn't, and likely results in less madness than the SQL approach.

> > I was advocating migrating from the "legacy"
> > Pick approach to an SQL-DBMS approach when I was hit over the head
> > repeatedly with the fact that it was more intuitive for developers and
> > end-users, resulting in fewer data and process defects when using null
> > as empty set with 2VL than null as unknown with 3VL. I suspect there
> > would be few, if any, folks who have worked with each who prefer 3VL
> > and related relational decomposition.
>
> I do not see the coupling you make with 3VL and 6NF.

I don't recall where 6NF entered this mix, but my understanding is that many (not all) put the elimination of nulls into 1NF, and all subsequent NFs are built on top of that. I did not intend to couple 3VL with 6NF per se.

> > > > > 2) If John Doe has no cars then no proposition that satisfies the
> > > > > Predicate Person(id, name, cars) will involve him (there is no mention
> > > > > of sets in the english version of the predicate).
> > > >
> > > > Once the predicate (loosely) below
> > > > person with id has this name and these cars
> > > >
> > > > is modeled, this model also permits us to see the propositions that
> > > > might not have naturally arisen, such as
> > > >
> > > > Person 123456 is John Doe who has no cars (of which we are aware)
> > >
> > > Well Decomposition allows this without a null by recording that very
> > > statement in a relation with a corresponding predicate. I am of one
> > > mind that commenting on missing information should not (yes,
> > > theoretically) be squashed into a predicate concerning known
> > > information. However, I also understand we are currently lacking in
> > > tools that make this process quick and easy to use.
> >
> > I have yet to understand how it is in any way simpler to develop or
> > use or less costly to maintain when a relation such as (id, name, ssn,
> > status, status_date, former_names, birth_date, address_ids,
> > email_addresses, phones, phone_types, cars, children_ids) turns into 8
> > 1NF tables.
>
> You've craftily (?) mixed up a couple of things here - MV and 6NF.

I wasn't intending to comment on 6NF (since I have to look it up every time it comes up), just on the decomposition of relations to eliminate NULLs. A sketch of what I mean is below.
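
For anyone following along, here is a minimal sketch of that decomposition (mine, with hypothetical names; Python lists standing in for relations). The optional attribute gets its own relation, and a person with no known hair color simply has no row there, so no null is ever stored:

    # One relation with a nullable attribute:
    person = [
        (12345, "John Doe", None),  # hair color unknown or absent
        (12346, "Jane Roe", "brown"),
    ]

    # Decomposed so that no relation holds a null:
    person_base = [
        (12345, "John Doe"),
        (12346, "Jane Roe"),
    ]
    person_hair_color = [
        (12346, "brown"),  # John Doe simply has no row here
    ]

    # Recombining the two for a report is an outer join by hand,
    # and the blank promptly reappears in the result:
    hair = dict(person_hair_color)
    view = [(pid, name, hair.get(pid)) for pid, name in person_base]
    # [(12345, 'John Doe', None), (12346, 'Jane Roe', 'brown')]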

> 6NF - well I never said that it would be more efficient or less costly
> to maintain, so why bring that up? (again I do find it hard to keep the
> discussion on the rails, when these tangents stemming from debates with
> other people crop up).

I'm missing the point of bringing up 6NF here. I have read Date's comments on DK/NF and instantly dropped them from my brain. I don't know (or particularly care, but perhaps I should) how 6NF plays into this. Why are you bringing it up?

> Nevertheless, those are physical issues, not
> logical ones anyhow. Simpler to develop? Well that depends how you need
> to use the data, but again only an issue whatsoever if one is lacking
> the correct tools to translate from logical layer to your application
> code's particular unique conceptual model. I agree we do need better OR
> mapping tools -

I think we can pretty much skip them if we use a more consistent data model throughout our software. (See http://www.tincat-group.com/mewsings/2006/02/dont-suffer-impedance.html )

> or better still application code that handles data
> better than just with OO.

Whether writing the application with JavaScript, Visual Basic, DataBASIC, PHP, Java, C#, C, C++, Perl or any functional, procedural, or OO language, you still have that same mismatch with the RM. There is one side of this that is consistent in the mismatch and it isn't OO.

> Nevertheless these are all practical issues,
> not theoretical ones!

This is where I'm missing the meaning of "theory". When I abstracted it from practice, I was chided by others indicating that it isn't good database theory unless it is related to good database practice -- that practical application is inherent in the theory, that we don't have theory in a vacuum, but in relation to something. So, if there are two practices and one is seen as better than the other (recognizing we are missing empirical data, but let's say we had it) but our theory directs us to the other, then is it good theory? Shouldn't the practice at least be used to test the theory? To what extent can "database theory" function without caring about its implications for writing good, maintainable software? If relational theory had remained "on paper" then I would have no problems at all with what it did to our industry.

> MV - As a coder I can promise you that looming Query Bias brings me out
> in a cold sweat.

Is that a theory issue or a practical one ;-)

> Using the inevitable MV in OO for example - ugh,
> iterator loop upon iterator loop upon iterator loop to access the info
> I need buried three vectors down. Bring on relationship-oriented
> programming.
>
> [snip]
> Dawn wrote:
> > JOG wrote:
> > > with your view that bias was not a concern, whereas most here would say
> > > it was crucial for shared data.
> >
> > I would not say it is of no concern, but that there are tradeoffs and
> > drawing the line in the sand there eliminates some possibly better
> > solutions.
>
> This would be a provocative discussion to have in another thread. Imo
> agreeing that contentions boil down to whether one believes query bias
> is important or not in the logical model, progresses the debate
> substantially.

I agree that from your perspective it would. Without looking at the tradeoffs, I would think that if we just ask the question "is query bias something you want to avoid" the answer would be "yup." So, I think you have to take each data model as a whole in order to get enough comparison to ask "given what we gain, can we accept query bias in this form?"

> [snip]
> > > This is more helpful to me, so below is an example of what I mean -
> > > please excuse the perhaps unfamiliar syntax with which I describe the
> > > true statements being recorded:
> > >
> > > Non-MV person-table:
> > > Id(1) & Name(Tom) & Car(Nissan)
> > > Id(1) & Name(Tom) & Car(Porsche)
> > > Id(2) & Name(Bob) & Car(Aston Martin)
> > >
> > > MV person-table:
> > > Id(1) & Name(Tom) & Car(Nissan) & Car(Porsche)
> > > Id(2) & Name(Bob) & Car(Aston Martin)
> > >
> > > Now if I JOIN the Non-MV version with a relation describing cars,
> > > notably why they were designed say, I'd get:
> > >
> > > Non-MV Joined-table:
> > > Id(1) & Name(Tom) & Car(Nissan) & Designed(Japan)
> > > Id(1) & Name(Tom) & Car(Porsche) & Designed(Italy)
> > > Id(2) & Name(Bob) & Car(Aston Martin) & Designed(UK)
> > >
> > > And I could read the corresponding propositions as "The person with ID
> > > 1 is called Tom and has a Nissan car which are designed in Japan", for
> > > example. Now for the MV JOIN:
> > >
> > > MV Joined-table:
> > > Id(1) & Name(Tom) & Car(Nissan) & Car(Porsche) & Designed(Italy) &
> > > Designed(Japan)
> > > Id(2) & Name(Bob) & Car(Aston Martin) & Designed(UK)
> >
> > Actually, when performing a join on a nested attribute, you first
> > "unnest" or "ungroup" it, so in this case you would have the same as
> > your prior join.
>
> So you have to reinterpret MV data temporarily into 1NF just prior to a
> JOIN? (I say reinterpret because I see no simple mathematical
> transition)

No, I don't think it is reinterpreted; it is merely interpreted. The mathematics of it is the mathematics of the Tutorial D UNGROUP, perhaps, as I think that corresponds to the IBM U2 UNNEST.
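
In Python rather than U2 or Tutorial D syntax (so treat this strictly as an illustration, not anyone's actual implementation), the unnest-then-join on your Tom and Bob data would go something like this, with each Designed value staying paired with its Car:

    people = [
        {"Id": 1, "Name": "Tom", "Car": ["Nissan", "Porsche"]},  # MV attribute
        {"Id": 2, "Name": "Bob", "Car": ["Aston Martin"]},
    ]
    designed = {"Nissan": "Japan", "Porsche": "Italy", "Aston Martin": "UK"}

    # UNNEST/UNGROUP the Car attribute: one row per (person, car) pair.
    unnested = [
        {"Id": p["Id"], "Name": p["Name"], "Car": car}
        for p in people
        for car in p["Car"]
    ]

    # An ordinary join on Car now preserves the Car/Designed
    # correspondence that was lost in the MV-to-MV join above.
    joined = [dict(row, Designed=designed[row["Car"]]) for row in unnested]
    # [{'Id': 1, 'Name': 'Tom', 'Car': 'Nissan', 'Designed': 'Japan'}, ...]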

> Then this is surely an indication that the model requires
> 1NF for correct manipulation,

The data in the join result might not be in 1NF; it is only the list on which we are joining that we unnest, because that is the interpretation of what we want to do when joining on such an attribute. Recognize that if we are doing an actual JOIN in MV, rather than linking and navigating around, then we are switching from our di-graph model to a set/relation model and using SQL. Otherwise we do not speak of such joins in MV any more than we talk about JOINs of web pages.

> and remembering that manipulation is one
> of the three vital tenets of a data model, along with structure and
> integrity, this is pretty much the raison d'etre of the logical layer!

Again, when working with JOINs, we are working with the "relational model" of data and not the MV native model.

> I am becoming more and more inclined to think that it is better /tools/
> to translate between layers that could provide you with the
> functionality you desire.

Perhaps; however, each translation has its cost, and developers, in the end, need to be able to trace them all for troubleshooting and debugging.

> > You get more benefit from this approach when there
> > are multiple multi-values. You can then just unnest the joined
> > attribute and take all others along for the ride as "property lists"
> > without doing the nasty cartesian cross-product thing with every
> > multivalue.
>
> Entities and ownership have nothing to do with propositions.

The propositions come from a conceptual model where entities and properties are relevant (or can be).

> I worry
> that you often seem to mix up the terminology of entities and
> propositions.

When I am talking about propositions, it is in the logical model; entities are in the conceptual layer. Perhaps the issue is that I do not see the mapping from conceptual to logical as one-way? That is just a guess, as you seem to want to stay in the world of modeled propositions, which from my perspective are only useful when actions can be taken, which means they must be interpreted, getting us back to the conceptual layer.

> In my, granted, very limited experience this suggests
> an OO, process-centric as opposed to a data-centric outlook.

Hmmm.

> > This means that, for example, if you want to add one more
> > attribute to your view and it happens to be one that has multiple
> > values, such as phone numbers, you can just add that attribute into
> > your view and all of your reports still work. You do not end up adding
> > more rows into your view. Do not underestimate how sweet this is.
>
> This is a point however that I think has a basis worth investigating
> (although it is concerned with schema evolution and query
> preservation). I am not saying that the MV approach you suggest is any
> solution (after all it requires a similar change from value domains to
> set domains for the relevant attribute), but I think the topic is
> worthy of research.

Good.

> > > I seem to have lost any correspondence between the Designed role and
> > > the Car role in the first proposition because of the multivalues,
> >
> > Yes, because you "forgot" to unnest the attribute on which you were
> > joining.
> >
> > > especially given the &'s are commutative, and so ordering is
> > > unimportant. The more I join the more this problem extrapolates.
> > >
> > > N.B. If you did want to represent the multivalues as:
> > > Id(1) & Name(Tom) & Car( { Nissan, Porsche } )
> > > I see no way of keeping the representation consistent by incorporating
> > > the Designed role within the value {Nissan, Porsche} as the result of a
> > > JOIN - unless you use a nested relation such as:
> > >
> > > Id(1) & Name(Tom) & Car( { Name(Nissan) , Name(Porsche) } )
> > >
> > > in which case the Join applies to the relation value in the Car
> > > attribute, giving:
> > > Id(1) & Name(Tom) & Car( { Name(Nissan) & Designed(Japan),
> > > Name(Porsche) & Designed(Italy) } )
> > >
> > > I certainly know which representation appeals to me - and that's the
> > > first, 1NF example. Clear, intuitive and generating readable
> > > propositions. regards, J.
> >
> > Does my response help clarify? --dawn
>
> Well yes, the second half does (as I said I found the first half quite
> bad for getting anywhere here).

Sorry 'bout that :-( I'm much more delightful in person -- still working on my writing skills.

> However, of course I think there is a
> mistake in not facilitating JOINs without having to do some form of
> dodgy reinterpretation of a table beforehand.

I'll grant that set processing functions, in general, require switching to the set-based model (the RM), but this is rarely done (which I would think you would find to be another problem, with those nested iterations and all the rest that otherwise takes place behind the scenes).

> And of course
> propositions don't have ownership sets, or property lists, just ands,
> ors, nots and therefores.

Not until you try to interpret them.

> Y'know, I honestly /wanted/ Multi-Values to make theoretical sense,
> just as I once wanted binary graphs to make sense.

Did you want them to make more theoretical sense or more set-theoretical sense?

> I initially thought
> that atomic (non-MV) values were perhaps just an unnecessary vestige of
> Codd's 1969 decision to use mathematical relations before he turned a
> 'tuple' into a finite partial map with the addition of attribute names
> in 1970. But evaluating the effects of Query Bias and the effect on
> Join manipulation has meant I have /had/ to change my thinking on this,
> to concede that there is something more to 1NF than just fitting data
> into a relation, and to stop myself banging my head against a brick
> wall in the name of my own obstinacy. It's been a frustrating trip, but
> life is too short to get caught down dead ends. J.

So, taking different journeys, equally frustrating perhaps, we have landed at different destinations. In the end, I'm looking at results and practice. In spite of never caring much about money, I find it frames what I'm after if I put the question this way: "If I have $x to spend over n years on the development and ongoing support of software applications, using what approach(es) will I get the best software solutions for my investment?"

I wanted to proof-read this before sending, but real work has cropped up, so I'm just sending this and hope it is reasonably helpful. cheers! --dawn