Re: check sum uniqueness in database

From: Christian Bau <christian.bau_at_cbau.freeserve.co.uk>
Date: Thu, 20 Feb 2003 07:44:50 +0000
Message-ID: <christian.bau-B6562D.07445020022003_at_slb-newsm1.svr.pol.co.uk>


In article <a386bae7.0302191901.53d69d26_at_posting.google.com>,  leebertarian_at_yahoo.com (leebert) wrote:

> We are investigating the potential for compacting & representing
> common demographic profiles via check sum values <snip>
 

> So here's some questions:

You have two choices: a checksum or compression. If you use a checksum, the best you can get is an algorithm that produces pseudo-random checksums: identical "profiles" will always get the same checksum, but different "profiles" may get the same checksum by chance.

The CRC-32 algorithm is quite good. It produces 32-bit checksums, so there are about 4 billion different results. With 4 million records, there is a one-in-a-thousand chance that a given profile collides with some other record's checksum. If you look up each of your 4 million records once, you will see about four thousand false matches.
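A quick sketch of the checksum approach using Python's standard-library CRC-32 (the profile string here is a made-up example, not the poster's actual data format):

```python
import zlib

# A demographic "profile" serialized as bytes (hypothetical example data).
profile = b"age=35-44;region=NE;income=50-75k;household=3"

# zlib.crc32 returns an unsigned 32-bit checksum.
checksum = zlib.crc32(profile)
print(f"CRC-32: {checksum:#010x}")

# Identical input always yields the same checksum...
assert zlib.crc32(b"age=35-44;region=NE;income=50-75k;household=3") == checksum

# ...but a different profile can, with probability about 1/2^32 per pair,
# collide with it by chance. Equal checksums therefore only *suggest* a
# match; unequal checksums definitely rule one out.
```

The asymmetry in the last comment is the practical point: the checksum can be used as a fast pre-filter, with a full comparison of the profile data to confirm any apparent match.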

CRC-32 can be extended to CRC-64, which produces about 18 billion billion (2^64) different checksums. The chance of any false positive anywhere in a 4 million record database is then well under one in a million.
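The figures above follow from a birthday-paradox estimate, which can be checked with a few lines of arithmetic (the formula is the standard approximation, not anything specific to CRC):

```python
# Birthday-bound approximation: with n random k-bit checksums, the chance
# of at least one collision somewhere is roughly n*(n-1)/2 / 2**k.
# (The approximation only holds while the result is well below 1.)
n = 4_000_000
pairs = n * (n - 1) / 2

p32 = pairs / 2**32   # far above 1: collisions are essentially certain
p64 = pairs / 2**64   # a few times 10^-7: under one in a million

print(f"32-bit bound: {p32:.3g}")
print(f"64-bit bound: {p64:.3g}")
```

So at 32 bits, collisions within a 4-million-row table are unavoidable and must be handled; at 64 bits they are unlikely but still not impossible, which is why a confirming comparison is still prudent.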

This only works because CRC is of known good quality. Use any homegrown method, and chances are your checksums are systematically non-random, making false positives immensely more likely. Best to hire a mathematician as a consultant.

For compression, good old Huffman coding or arithmetic coding should do things quite nicely. Again, hire a mathematician.
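For the compression route, a quick illustration: Python's zlib uses DEFLATE, which combines LZ77 with Huffman coding, so repetitive profile data shrinks dramatically (the sample data is again hypothetical):

```python
import zlib

# Many records share the same few field patterns, so a batch of
# profiles is highly compressible.
profiles = b"\n".join(
    b"age=35-44;region=NE;income=50-75k" for _ in range(1000)
)

packed = zlib.compress(profiles, level=9)
print(len(profiles), "->", len(packed), "bytes")

# Unlike a checksum, compression is lossless: the original data
# is recovered exactly, with no chance of false matches.
assert zlib.decompress(packed) == profiles
```

That last property is the real trade-off between the two approaches: compression keeps all the information at the cost of more storage than a fixed-size checksum, while a checksum is tiny but irreversible and only probabilistically unique.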
