check sum uniqueness in database
Date: 19 Feb 2003 19:01:26 -0800
Message-ID: <a386bae7.0302191901.53d69d26_at_posting.google.com>
I'm a DBA working on a data warehouse project... so the usual
disclaimer applies to "Please forgive my ignorance, I am not a
mathematician...."
We are investigating the potential for compacting & representing
common demographic profiles via check sum values, so we don't have to
'look up' previously existing complex demographic profiles. IOW the
uniqueness of a 'common' demographic profile is functionally inherent
in the concatenated elements of that profile. As an oversimplified
example, 20 people share a demographic profile of 2 children, 2 cars,
2 wives, 1 dog, 1 cat which would be represented by the string: 22211,
and 2 million people share another demographic profile represented by
its own string.
The real concatenated strings are much longer, however, 80-120 bytes wide, bringing with them overhead in terms of space & performance in the database (table, index, etc.). With check summing we sought a way to encode the original demographic profiles into a self-generating unique key that is small & avoids a code-table lookup (demographic profile 222175A78972349797324ACDEFCONFBAR976987ZAPBLAH encoded as profile 42324 -- woohoo!). Packing these long keys down into a nice neat lil' ol' checksum value would be just dandy (maybe ultimately code the algorithm in C, run it in-process in DB2.. a free lunch!).
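For what it's worth, here is a minimal sketch of the idea in Python, using CRC-32 as a stand-in checksum (the profile string is the hypothetical one from above; any concrete algorithm choice is an assumption, not a recommendation):

```python
import zlib

# Hypothetical concatenated demographic profile (the real ones are 80-120 bytes)
profile = b"222175A78972349797324ACDEFCONFBAR976987ZAPBLAH"

# CRC-32 packs the arbitrary-width profile down into a fixed 32-bit key,
# computable in-process with no code-table lookup
key = zlib.crc32(profile)
print(key)  # an unsigned integer in [0, 2**32)
```

The key is deterministic, so the same profile always maps to the same value; the catch, as discussed below, is that two *different* profiles can map to the same value too.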
Or so we thought... Rather quickly I found in my reading on checksum algorithms that no 32-bit checksum algorithm can guarantee uniqueness. And upping the ante to 64 bits doesn't give any guarantees either; it just staves off the inevitable 'false match.'
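The 'inevitable false match' is the birthday problem: hashing n distinct profiles into a space of 2^b values gives roughly a 1 - exp(-n(n-1)/2^(b+1)) chance of at least one collision. A quick sketch, plugging in the 2-million row count from the example above:

```python
import math

def collision_probability(n_items: int, bits: int) -> float:
    """Birthday-bound approximation of the probability that at least
    two of n_items uniformly random values collide in a 2**bits space."""
    space = 2.0 ** bits
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))

print(collision_probability(2_000_000, 32))  # effectively 1.0: a collision is near certain
print(collision_probability(2_000_000, 64))  # on the order of 1e-7: rare, but never zero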
However, what we are *really* after isn't so much a complete replacement for the original value but rather sufficient compaction, speed & unique representation.
- Would concatenating or summing the checksum and reverse checksum of a value increase uniqueness? (same algorithm, just we can toggle the loop so it either chews from the end or chews from the beginning)
- Would concatenating or summing 2 checksums from 2 different checksum algorithms, each with its own bias towards 'false positives,' combine to bring the chance of a false match down to vanishingly scarce?
- Would concatenating a part of the original value that tends towards greater uniqueness (cardinality) generate a more unique signature?
- Or some sufficiently efficient / thorough / contortionist combination of the above strategies?
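On the combination question, a sketch of what concatenating two different 32-bit checksums looks like (CRC-32 and Adler-32 here purely as illustration; the two algorithms were not designed to fail independently, so this is weaker than a true 64-bit hash). A 128-bit cryptographic digest such as MD5 is the more conventional route to 'vanishingly scarce':

```python
import hashlib
import zlib

def combined_signature(profile: bytes) -> bytes:
    """Concatenate two different 32-bit checksums into a 64-bit key.
    Illustrative only: the algorithms' collision behaviours are
    correlated, so this is weaker than a real 64-bit hash."""
    crc = zlib.crc32(profile)
    adler = zlib.adler32(profile)
    return crc.to_bytes(4, "big") + adler.to_bytes(4, "big")

def md5_signature(profile: bytes) -> bytes:
    """128-bit digest: accidental collisions are astronomically
    unlikely at warehouse row counts (though still not guaranteed)."""
    return hashlib.md5(profile).digest()

sig = combined_signature(b"222175A78972349797324ACDEFCONFBAR976987ZAPBLAH")
```

Either signature could be stored as a fixed-width binary column and indexed in place of the 80-120 byte profile string.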
Of course, being skeptical, I'm willing to accept that there's no
perfect cure or that these 'strategies' amount to naive bit
twiddling... or that we are going from 'vanishingly scarce' to
'vanishingly scarce ... and beyond!' while the holy grail stays just
out of reach.
Thanks in advance,
/lee
Received on Thu Feb 20 2003 - 04:01:26 CET