Re: check sum uniqueness in database

From: Richard D. Latham <lathamr_at_us.ibm.com>
Date: 19 Feb 2003 23:44:33 -0600
Message-ID: <of57bj5q.fsf_at_us.ibm.com>


leebertarian_at_yahoo.com (leebert) writes:

> I'm a DBA working on a data warehouse project... so the usual
> disclaimer applies to "Please forgive my ignorance, I am not a
> mathematician...."
>
> We are investigating the potential for compacting & representing
> common demographic profiles via check sum values, so we don't have to
> 'look up' previously existing complex demographic profiles. IOW the
> uniqueness of a 'common' demographic profile is functionally inherent
> in the concatenated elements of that profile. As an oversimplified
> example, 20 people share a demographic profile of 2 children, 2 cars,
> 2 wives, 1 dog, 1 cat which would be represented by the string: 22211,
> and 2 million people share another demographic profile of 2 cars, 1
> wife, 2 children and 1 dog (string: 2121).
>

< snip >

>
> Worse, we are dealing with 'important' data with potentially millions
> of unique demographic profiles, so we can't risk 'false positive'
> matches (and DBAs are conservative by nature, data integrity and all
> that stuff...).
>
< snip ?

Well, if your tolerance for collisions in the reduced representation is in fact == 0, and not some very small number , you won't be able to use a hash :-)

>
> Of course, being skeptical, I'm willing to accept that there's no
> perfect cure or that these 'strategies' amount to naive bit
> twiddling... or that we are going from 'vanishingly scarce' to
> 'vanishingly scarce ... and beyond!' but the holy grail is just simply
> out of reach.
>

The crypto-weenies could help in computing the entropy of the range of values, which would give you the theoretical minimum # of bits to exactly map every possible demographic profile. This will be a tremendous amount of work, I suspect ...

If your tolerance for collisions is in fact > 1 in 2 to the 64th power or so, you'll want to investigate the usage of one of the off-the-shelf cryptographically strong hashes ... again, if this is a possibility, the denizens of sci.crypt can point you at implementations of such beasts in just about any language you'd care to name.

-- 
#include  <disclaimer.std>    /* I don't speak for IBM ...           */
                              /* Heck, I don't even speak for myself */
                              /* Don't believe me ? Ask my wife :-)  */
Richard D. Latham   lathamr_at_us.ibm.com
Received on Thu Feb 20 2003 - 06:44:33 CET

Original text of this message