check sum uniqueness in database

From: leebert <leebertarian_at_yahoo.com>
Date: 19 Feb 2003 19:01:26 -0800
Message-ID: <a386bae7.0302191901.53d69d26_at_posting.google.com>


I'm a DBA working on a data warehouse project... so the usual disclaimer applies: please forgive my ignorance, I am not a mathematician.

We are investigating the potential for compacting & representing common demographic profiles via checksum values, so we don't have to 'look up' previously existing complex demographic profiles. IOW, the uniqueness of a 'common' demographic profile is functionally inherent in the concatenated elements of that profile. As an oversimplified example: 20 people share a demographic profile of 2 children, 2 cars, 2 wives, 1 dog and 1 cat, which would be represented by the string 22211, while 2 million people share another demographic profile of 2 cars, 1 wife, 2 children and 1 dog (string: 2121).
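Just to make that concrete, the 'concatenation' is nothing fancier than mashing the profile fields into one string. The struct and field names below are made up, of course; it's only a sketch:

    #include <stdio.h>

    /* Hypothetical profile fields, purely for illustration. */
    struct profile {
        int children, cars, wives, dogs, cats;
    };

    /* Builds e.g. "22211" for 2 children, 2 cars, 2 wives, 1 dog, 1 cat. */
    void make_profile_key(const struct profile *p, char *buf, size_t len)
    {
        snprintf(buf, len, "%d%d%d%d%d",
                 p->children, p->cars, p->wives, p->dogs, p->cats);
    }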

The real concatenated strings are much longer, however: 80-120 bytes wide, bringing with them overhead in terms of space & performance in the database (table, index, etc.). With check summing we sought a way to encode the original demographic profiles into a self-generating unique key that is small & avoids a code-table lookup (demographic profile 222175A78972349797324ACDEFCONFBAR976987ZAPBLAH encoded as profile 42324 -- woohoo!). Packing these long keys down into a nice neat lil' ol' checksum value would be just dandy (maybe ultimately code the algorithm in C and run it in-process in DB2... a free lunch!).
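By 'checksum value' I mean something along these lines -- FNV-1a here is just a cheap, arbitrary stand-in for whatever 32-bit checksum we'd actually pick, not a claim that it's the right algorithm:

    #include <stdint.h>

    /* 32-bit FNV-1a over the concatenated profile string (illustrative only). */
    uint32_t checksum32(const char *s)
    {
        uint32_t h = 2166136261u;            /* FNV offset basis */
        while (*s) {
            h ^= (unsigned char)*s++;
            h *= 16777619u;                  /* FNV prime */
        }
        return h;
    }

    /* checksum32("222175A78972349797324ACDEFCONFBAR976987ZAPBLAH") collapses
       the 80-120 byte profile string into a single 4-byte key to store/index. */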

Or so we thought... Rather quickly I found in my reading on checksum algorithms that just about any 32-bit checksum algorithm cannot guarantee uniqueness. And upping the ante to 64 bits doesn't give any guarantees either; it just staves off the inevitable 'false match.'
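Here's the back-of-the-envelope birthday-problem arithmetic that scared me off, assuming (just for the sake of argument) about 2 million distinct profiles and a uniformly distributed checksum:

    #include <stdio.h>
    #include <math.h>

    /* P(at least one collision) ~= 1 - exp(-n*(n-1) / (2*m)),
       where n = number of distinct profiles, m = number of checksum values. */
    int main(void)
    {
        double n = 2e6;                    /* ~2 million distinct profiles (assumed) */
        double pairs = n * (n - 1.0) / 2.0;

        printf("32-bit: %f\n", 1.0 - exp(-pairs / pow(2.0, 32)));  /* ~1.0: a clash is certain */
        printf("64-bit: %e\n", 1.0 - exp(-pairs / pow(2.0, 64)));  /* ~1e-7: rare, never zero */
        return 0;
    }

So at 32 bits a false match among millions of profiles isn't a risk, it's a certainty; at 64 bits it's merely unlikely.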

Worse, we are dealing with 'important' data with potentially millions of unique demographic profiles, so we can't risk 'false positive' matches (and DBAs are conservative by nature, data integrity and all that stuff...).

However, what we are *really* after isn't so much a complete replacement for the original value but rather sufficient compaction, speed & unique representation.

So here are some questions:

  1. Would concatenating or summing the checksum and reverse checksum of a value increase uniqueness? (Same algorithm; we can just toggle the loop so it either chews from the end or chews from the beginning.)
  2. Would concatenating or summing two checksums from two different checksum algorithms, each with its own bias towards 'false positives', bring the chances of a false match down to vanishingly scarce? (See the sketch after this list for roughly what I mean.)
  3. Would concatenating onto the checksum a part of the original value that tends towards greater uniqueness (higher cardinality) generate a more unique signature?
  4. Or some sufficiently efficient / thorough / contortionist combination of the above strategies?
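For question 2, here's roughly the shape of what I'm picturing -- FNV-1a and djb2 below are just two arbitrary, cheap string hashes standing in for 'two algorithms with different biases':

    #include <stdint.h>

    /* Two different 32-bit string hashes (illustrative choices only). */
    uint32_t hash_fnv1a(const char *s)
    {
        uint32_t h = 2166136261u;
        while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
        return h;
    }

    uint32_t hash_djb2(const char *s)
    {
        uint32_t h = 5381u;
        while (*s) h = h * 33u + (unsigned char)*s++;
        return h;
    }

    /* Pack the two 32-bit results into one 64-bit composite key. */
    uint64_t composite_key(const char *profile)
    {
        return ((uint64_t)hash_fnv1a(profile) << 32) | hash_djb2(profile);
    }

I realize this only widens the key to 64 bits (and the two halves aren't truly independent), so it pushes the birthday math further out rather than eliminating it -- which is really the heart of my question.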

Of course, being skeptical, I'm willing to accept that there's no perfect cure, or that these 'strategies' amount to naive bit twiddling... or that we are merely going from 'vanishingly scarce' to 'vanishingly scarce ... and beyond!' while the holy grail remains just out of reach.

Thanks in advance,
/lee
