Re: Mixed, or conditional, datatypes

From: Bob Badour <bbadour_at_pei.sympatico.ca>
Date: Mon, 26 Mar 2007 12:58:42 GMT
Message-ID: <6sPNh.15442$PV3.158964_at_ursa-nb00s0.nbnet.nb.ca>


Chris Edwards wrote:

> On 03/26/07 07:42, JOG wrote:
>

>>On Mar 25, 2:23 pm, Bob Badour <bbad..._at_pei.sympatico.ca> wrote:
>>
>>>sinister wrote:
>>>
>>>>Example:  suppose I'm storing responses to a diagnostic interview.  One of
>>>>the questions asked of the subject is something like
>>>>7.  Have  you always lived in the US?

>
> [...]
>
>>>The truth of the matter is you really have two surveys; even though,
>>>someone presented the surveys to respondents as a union of all the
>>>questions. Lifelong residents answer one survey and part time residents
>>>answer another.
>>>
>>> From the above observations, the design seems obvious: Use two relations.
>>
>>Just to echo this, without a shadow of a doubt more than one relation
>>is required, because you are attemptiing to record more than one type
>>of proposition - and the RM is based on collectivizing sets of
>>propositions of the same type. A "person_details" relation and a
>>"years_in_US" relation hence seems intuitive, with the latter
>>containing an appropriate foreign key to the former, and a simple JOIN
>>being utilized in queries if and when appropriate.

>
> Of course! Why did nobody use examples like this in the courses I
> studied? I know that Date, for example, takes pains to emphasise that
> that is precisely what relations are (sets of uniform propositions), but
> it seems to be all too easily overlooked.
>
> So would the join operation in this case would be an outer join? Since
> this would result in a relation with a lot of nulls in the
> non-applicable responses, would it be so bad to store the nulls in the
> base relation?

Outer join is just a shorthand for a union. The simplistic idea that one can use a marker or value that will be suitable for all situations is naive.

If one uses the data with a statistics package like SAS or SPSS to perform regression analysis, for example, one might translate the combinations of answers into a code that indicates various ranges of tenure or one might translate the answers into several flags etc.

Having something like 'A'=lifelong, B=less than a year, C=a year to five years etc. or having a handful of Y/N flags involves no NULL markers at all. The query is as easy to write as a union as it would be to write as an outer join.

While my knowledge of statistics is sorely lacking, my experience suggests to me that knowledgeable people avoid continuous variables for regressions preferring discrete values instead.

> I suppose another approach would be to store the response as a single
> attribute of a domain that supports the values "lifelong resident" (and
> perhaps "never lived in the US") in addition to numbers for indicating
> the length of time since moving there. Of course, realistically, you'd
> probably just record the date the respondent started living in the US
> (which could be their date of birth, for a lifelong resident).
>
> BTW, is it a coincidence that the "variant record" construct in C is
> called "union"?

It is almost certainly not a coincidence if for no other reason than the popular meaning of the word "union". Received on Mon Mar 26 2007 - 14:58:42 CEST

Original text of this message