Re: check sum uniqueness in database

From: Jim Kennedy <kennedy-down_with_spammers_at_attbi.com>
Date: Thu, 20 Feb 2003 06:01:29 GMT
Message-ID: <Za_4a.188464$Ec4.197580_at_rwcrnsc52.ops.asp.att.net>


I think it would be faster to put bitmap indexes on those columns and let Oracle combine the bitmaps together if need be. I think you are trying to write your own db or "be smarter than it"; let the db do what it does best: look up data.
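
To make that concrete, here is a toy Python sketch of the idea (my own
illustration with made-up data, not Oracle's actual internals): keep one
bitmap per (column, value), and a multi-column predicate becomes a
bitwise AND.

    # Toy model of bitmap-index combination (illustration only).
    # One bitmap per (column, value); each bit position is a row id.
    rows = [
        {"children": 2, "cars": 2, "dogs": 1},   # row 0
        {"children": 2, "cars": 1, "dogs": 1},   # row 1
        {"children": 2, "cars": 2, "dogs": 0},   # row 2
    ]

    bitmaps = {}
    for rowid, row in enumerate(rows):
        for col, val in row.items():
            bitmaps[(col, val)] = bitmaps.get((col, val), 0) | (1 << rowid)

    # WHERE children = 2 AND cars = 2  ->  bitwise AND of two bitmaps
    hits = bitmaps[("children", 2)] & bitmaps[("cars", 2)]
    print([r for r in range(len(rows)) if hits >> r & 1])   # -> [0, 2]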

You could post your question to asktom.oracle.com. Jim

--
Replace part of the email address: kennedy-down_with_spammers_at_attbi.com
with family.  Remove the negative part, keep the minus sign.  You can figure
it out.
"leebert" <leebertarian_at_yahoo.com> wrote in message
news:a386bae7.0302191901.53d69d26_at_posting.google.com...

> I'm a DBA working on a data warehouse project... so the usual
> disclaimer applies: "Please forgive my ignorance, I am not a
> mathematician...."
>
> We are investigating the potential for compacting & representing
> common demographic profiles via checksum values, so we don't have to
> 'look up' previously existing complex demographic profiles. IOW the
> uniqueness of a 'common' demographic profile is functionally inherent
> in the concatenated elements of that profile. As an oversimplified
> example, 20 people share a demographic profile of 2 children, 2 cars,
> 2 wives, 1 dog, 1 cat, which would be represented by the string 22211,
> and 2 million people share another demographic profile of 2 cars, 1
> wife, 2 children and 1 dog (string: 2121).
>
> The real concatenated strings are much longer, however (80-120 bytes
> wide), bringing with them overhead in terms of space & performance in
> the database (table, index, etc.). With checksumming we sought a way
> to encode the original demographic profiles into a self-generating
> unique key that is small & avoids a code-table lookup (demographic
> profile 222175A78972349797324ACDEFCONFBAR976987ZAPBLAH encoded as
> profile 42324 -- woohoo!). Packing these long keys down into a nice
> neat lil' ol' checksum value would be just dandy (maybe ultimately
> code the algorithm in C, run it in-process in DB2.. a free lunch!).
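
For what it's worth, the scheme described above is a one-liner once an
algorithm is picked; a minimal sketch in Python (zlib's CRC-32 here only
as a stand-in for whatever you'd actually code in C):

    import zlib

    profile = "222175A78972349797324ACDEFCONFBAR976987ZAPBLAH"
    key = zlib.crc32(profile.encode())   # 32-bit surrogate key, no lookup table
    print(key)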
>
> Or so we thought... Rather quickly I found in my reading on checksum
> algorithms that just about any 32-bit checksum algorithm cannot
> guarantee uniqueness. And upping the ante to 64-bit doesn't give any
> guarantees either; it just staves off the inevitable 'false match.'
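
That matches the birthday bound: for n uniformly distributed b-bit
checksums, P(collision) is roughly 1 - exp(-n^2 / 2^(b+1)). A quick
check in Python, assuming the checksums behave like uniform hashes:

    import math

    def collision_prob(n, bits):
        # Birthday approximation: P ~= 1 - exp(-n^2 / 2^(bits+1))
        return 1.0 - math.exp(-float(n) * n / 2.0 ** (bits + 1))

    print(collision_prob(2_000_000, 32))   # ~1.0: a 32-bit collision is certain
    print(collision_prob(2_000_000, 64))   # ~1.1e-07: rare, but never zero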
>
> Worse, we are dealing with 'important' data with potentially millions
> of unique demographic profiles, so we can't risk 'false positive'
> matches (and DBAs are conservative by nature, data integrity and all
> that stuff...).
>
> However, what we are *really* after isn't so much a complete
> replacement for the original value but rather sufficient compaction,
> speed & unique representation.
>
> So here's some questions:
>
> 1. Would concatenating or summing the checksum and reverse checksum of
> a value increase uniqueness? (Same algorithm; we just toggle the loop
> so it chews from either the end or the beginning. See the sketch after
> this list.)
>
> 2. Would concatenating or summing two checksums from two different
> checksum algorithms help? Each has its own bias towards 'false
> positives,' but combined they might bring the chances of a false match
> down to vanishingly scarce (also covered in the sketch below).
>
> 3. Would concatenating a part of the original value that tends towards
> greater uniqueness (cardinality) generate a more unique signature?
>
> 4. Or some sufficiently efficient / thorough / contortionist
> combination of the above strategies?
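
On questions 1 and 2, the cheapest way to judge is to measure. A minimal
sketch (Python; zlib's crc32 and adler32 stand in for your two
algorithms, and the reversed-input CRC plays the "chew from the other
end" role):

    import random
    import string
    import zlib

    def rand_profile(width=100):
        # Stand-in for a real 80-120 byte demographic string.
        chars = string.ascii_uppercase + string.digits
        return "".join(random.choice(chars) for _ in range(width)).encode()

    def signature(data):
        # Questions 1 + 2 combined: forward CRC-32, reversed-input CRC-32,
        # and Adler-32, concatenated into one 96-bit value.
        return (zlib.crc32(data), zlib.crc32(data[::-1]), zlib.adler32(data))

    seen = set()
    collisions = 0
    for _ in range(200_000):
        sig = signature(rand_profile())
        if sig in seen:
            collisions += 1
        seen.add(sig)
    print(collisions, "collisions in 200000 signatures")

Two caveats: a CRC of a string and a CRC of its reversal are correlated,
so the pair carries fewer than 64 independent bits; and however wide the
signature, you only push the false-match odds down, never to zero.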
>
> Of course, being skeptical, I'm willing to accept that there's no
> perfect cure or that these 'strategies' amount to naive bit
> twiddling... or that we are going from 'vanishingly scarce' to
> 'vanishingly scarce ... and beyond!' while the holy grail stays just
> out of reach.
>
> Thanks in advance,
> /lee