Dirty data (was: Newbie question)

From: mAsterdam <mAsterdam_at_vrijdag.org>
Date: Thu, 23 Jun 2005 19:20:41 +0200
Message-ID: <42baef6b$0$34656$e4fe514c_at_news.xs4all.nl>


David Cressey wrote:
> Paul wrote:
>

>>True. But then many database constraints can be duplicated on the client
>>to avoid repeated round-trips to the server. There was a long thread
>>here a year or so ago on the desirability of clients being able to read
>>the database constraints instead of being redundantly recoded. And then
>>risking getting out of sync with each other.

>
> There is value in avoiding round trips to the server.
>
> With regard to keeping the client and the database in synch with regard to
> data constraints (database constraints?),
> there is another alternative. Allow both the client and the database to
> inherit the constraints from a common source.
> That's why Kenneth Downs' work is so interesting to me.
>
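One way to picture "inherit the constraints from a common source": declare the constraint once, then derive both the database-side check and the client-side check from that single declaration. What follows is only a minimal sketch under assumptions. The class `RangeConstraint`, the column name `quantity` and its bounds are hypothetical, and this is not a description of how Kenneth Downs' work actually does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeConstraint:
    """Hypothetical single source of truth for one integer range constraint."""
    column: str
    minimum: int
    maximum: int

    def to_sql_check(self) -> str:
        # Database side: render the declaration as a CHECK clause.
        return f"CHECK ({self.column} BETWEEN {self.minimum} AND {self.maximum})"

    def accepts(self, value: int) -> bool:
        # Client side: the same bounds, evaluated without a round trip.
        return self.minimum <= value <= self.maximum


# Illustrative declaration; the name and bounds are made up for this sketch.
QUANTITY = RangeConstraint(column="quantity", minimum=1, maximum=999)
```

Because both renderings come from the same object, changing a bound in one place changes it for the client and the database alike, so the two cannot drift out of sync.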
>>True again, there is no guarantee someone won't type in another valid
>>key, but it does knock out the vast majority of mistakes. I guess the
>>most common error is to make a mistake in a single character (writing a
>>1 as a 7 for example). I think that checksums are specifically designed
>>to ensure that changing any one character will invalidate the checksum.
>>I could be wrong here but if not, they certainly would take care of most
>>cases.

>
> Yep.
>
>>And to check that the key is valid (i.e. in the database) and not just
>>well-formed, you do need a trip to the server. But hopefully not many
>>entries will manage to get to that stage.

>
> Maybe "not many erroneous entries". Hopefully, the vast majority of entries
> will be correct, right?
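To make the checksum point concrete: the thread names no particular scheme, so the sketch below assumes the Luhn algorithm, a common check-digit method that detects every single-digit substitution (such as writing a 1 as a 7). The function names are illustrative. It only tests that a key is well-formed; whether the key actually exists in the database still takes a trip to the server.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the
    # rightmost payload digit; a doubled value above 9 has 9 subtracted.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def is_well_formed(key: str) -> bool:
    """Client-side test: is the last digit the Luhn check digit of the rest?"""
    if len(key) < 2 or not (key.isascii() and key.isdigit()):
        return False
    return luhn_check_digit(key[:-1]) == int(key[-1])
```

With the payload `7992739871` the check digit is 3, so `79927398713` is well-formed, while `79927398773` (the 1 mistyped as a 7) is rejected without contacting the server.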

Earlier, in

http://groups.google.nl/group/comp.databases.theory/browse_thread/thread/5161ba3961681a05/1c87055f4fa064a2?q=mAsterdam&rnum=58&hl=nl#1c87055f4fa064a2

I wrote:

> What information is lost at capture-time?
>
> 1. reference to information not already available.
>
> 2. information that doesn't fit the existing structure.
>
> Number 1 needs additional reference information,
> 2 needs a change of model.
> Both may be too cumbersome.
>
> 3. information contradicting information already available.
>
> 4. loss due to mistakes, sometimes caused by (4a) interface inadequacies.
>
> 5. loss due to "minimal input to get the job done"
> (not caring about the shared data).

Somewhat later:

> ... I do *not* assume here
> that the ultimate goal of the actor providing the information
> is to do just that (providing the information). So an "import"
> or "load" for maintaining the database does not suffer from this
> loss. My assumption here is (I used it earlier, in the thread
> on "multiple specification of constraints") that the actor
> providing the information provides the information as a side
> effect of trying to achieve a goal which is - at least
> partly - *outside* the scope of the database.

Dawn M. Wolthuis added:

>> 6. Information that is "lost in translation"
>> The person (or service) updating the data understands it, but cues that
>> might help someone interpret the data are missing.  This might be a
>> combination of a couple of the others, but seemed worthy of another bullet.

>
> Agreed. It is premature yet to aim for independence of the factors.
> Maybe a clever reformulation could distill some basics and reduce
> the number of bullets. Later.
>
> BTW: The "person (or service)" - let's say "actor".
> Any objections?
>
>> 7. Loss due to inaccurate or misleading metadata
>> The information might be added correctly but when passed along in the form
>> of a report, the information describing the data could be misleading (even
>> to the extent that the untagged data is fine, but the described/tagged data
>> would appear wrong).

>
> Would "unforeseen context" describe what you mean here?
[snip]

> "Perfect information", "perfect knowledge of
> information needs" and "confluence of goals"
> seem to slip in as implicit assumptions way too often.
Received on Thu Jun 23 2005 - 19:20:41 CEST
