Dirty data (was: Newbie question)
Date: Thu, 23 Jun 2005 19:20:41 +0200
David Cressey wrote:
> Paul wrote:
>>True. But then many database constraints can be duplicated on the
>>client to avoid repeated round-trips to the server. There was a long
>>thread here a year or so ago on the desirability of clients being able
>>to read the database constraints instead of having them redundantly
>>recoded, and then risking getting out of sync with each other.
> There is value in avoiding round trips to the server.
> With regard to keeping the client and the database in sync with regard to
> data constraints (database constraints?),
> there is another alternative. Allow both the client and the database to
> inherit the constraints from a common source.
> That's why Kenneth Downs' work is so interesting to me.
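The "common source" idea can be sketched roughly as follows. This is a hypothetical illustration, not Kenneth Downs' actual tooling: each constraint is declared once as data, and both the server-side SQL CHECK clause and the client-side validator are generated from that single declaration, so the two cannot drift out of sync.

```python
# Hypothetical sketch: one constraint declaration per column, from which
# both the SQL CHECK clause and a client-side predicate are derived.

SPEC = {
    "quantity": ("min", 1),                    # quantity >= 1
    "status":   ("enum", ("open", "closed")),  # status in enumeration
}

def to_sql(col, kind, arg):
    """Render one declaration as a SQL boolean expression."""
    if kind == "min":
        return f"{col} >= {arg}"
    if kind == "enum":
        vals = ", ".join(f"'{v}'" for v in arg)
        return f"{col} IN ({vals})"
    raise ValueError(kind)

def to_predicate(kind, arg):
    """Render the same declaration as a client-side check."""
    if kind == "min":
        return lambda v: v >= arg
    if kind == "enum":
        return lambda v: v in arg
    raise ValueError(kind)

def ddl_checks(table):
    """Server side: CHECK clauses for the table definition."""
    return [f"ALTER TABLE {table} ADD CHECK ({to_sql(c, k, a)})"
            for c, (k, a) in SPEC.items()]

def invalid_columns(row):
    """Client side: pre-flight validation with no server round-trip."""
    return [c for c, (k, a) in SPEC.items()
            if c in row and not to_predicate(k, a)(row[c])]
```

Since both outputs come from `SPEC`, adding or tightening a constraint in one place updates the database DDL and the client validation together.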
>>True again, there is no guarantee someone won't type in another valid
>>key, but it does knock out the vast majority of mistakes. I guess the
>>most common error is to make a mistake in a single character (writing a
>>1 as a 7 for example). I think that checksums are specifically designed
>>to ensure that changing any one character will invalidate the checksum.
>>I could be wrong here but if not, they certainly would take care of most
>>cases.
>>And to check that the key is valid (i.e. in the database) and not just
>>well-formed, you do need a trip to the server. But hopefully not many
>>entries will manage to get to that stage.
> Maybe "not many erroneous entries". Hopefully, the vast majority of entries
> will be correct, right?
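For what it's worth, the single-character claim above does hold for check-digit schemes such as Luhn (the one used on credit-card numbers): it is constructed so that altering any single digit changes the check result, catching the "typed a 1 as a 7" class of capture-time errors before any trip to the server. A minimal sketch:

```python
# Luhn check-digit validation. Changing any single digit of a valid
# number makes the check fail, so well-formedness can be verified
# entirely on the client.

def luhn_valid(number: str) -> bool:
    digits = [int(ch) for ch in number]
    total = 0
    # From the rightmost digit, double every second digit; when the
    # doubled value exceeds 9, subtract 9 (equivalent to summing its digits).
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("79927398713"))  # the classic Luhn test number -> True
print(luhn_valid("79927398714"))  # one digit changed -> False
```

Note the caveat: Luhn catches all single-digit errors and most, but not all, adjacent transpositions, and it says nothing about whether the key actually exists; as Paul says, that still needs a round-trip.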
> What information is lost at capture-time?
> 1. reference to information not already available.
> 2. information that doesn't fit the existing structure.
> Number 1 needs additional reference information,
> 2 needs a change of model.
> Both may be too cumbersome.
> 3. information contradicting information already available.
> 4. loss due to mistakes, sometimes caused by (4a) interface inadequacies.
> 5. loss due to "minimal input to get the job done"
> (not caring about the shared data).
> ... I do *not* assume here
> that the ultimate goal of the actor providing the information
> is to do just that (providing the information). So an "import"
> or "load" for maintaining the database does not suffer from this
> loss. My assumption here is (I used it earlier, in the thread
> on "multiple specification of constraints") that the actor
> providing the information provides the information as a side
> effect of trying to achieve a goal which is - at least
> partly - *outside* the scope of the database.
Dawn M. Wolthuis added:
>> 6. Information that is "lost in translation"
>> The person (or service) updating the data understands it, but cues that
>> might help someone interpret the data are missing. This might be a
>> combination of a couple of the others, but seemed worthy of another bullet.
> Agreed. It is still premature to aim for independence of the factors.
> Maybe a clever reformulation could distill some basics and reduce
> the number of bullets. Later.
> BTW: The "person (or service)" - let's say "actor".
> Any objections?
>> 7. Loss due to inaccurate or misleading metadata
>> The information might be added correctly but when passed along in the form
>> of a report, the information describing the data could be misleading (even
>> to the extent that the untagged data is fine, but the described/tagged data
>> would appear wrong).
> Would "unforeseen context" describe what you mean here?
> "Perfect information", "perfect knowledge of
> information needs" and "confluence of goals"
> seem to slip in as implicit assumptions way too often.
Received on Thu Jun 23 2005 - 19:20:41 CEST