Re: check sum uniqueness in database
Date: Thu, 20 Feb 2003 07:44:50 +0000
Message-ID: <christian.bau-B6562D.07445020022003_at_slb-newsm1.svr.pol.co.uk>
In article <a386bae7.0302191901.53d69d26_at_posting.google.com>, leebertarian_at_yahoo.com (leebert) wrote:
> We are investigating the potential for compacting & representing
> common demographic profiles via check sum values <snip>
You have two choices: checksums or compression. With a checksum, the best you can get is an algorithm that produces pseudo-random checksums: identical "profiles" always get the same checksum, but different "profiles" may also get the same checksum by chance.
The CRC-32 algorithm is quite good. It produces 32-bit checksums, so there are about 4 billion possible values. With 4 million records, a lookup for a profile not in the database still has roughly a one-in-a-thousand chance of matching some record's checksum; look up each of your 4 million records once and you can expect a few thousand false matches.
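A quick sketch of that arithmetic, using Python's built-in zlib.crc32 (the profile string here is made up purely for illustration):

```python
import zlib

# CRC-32 of a hypothetical demographic profile string.
profile = "age=35-44;region=NE;income=Q3"
crc = zlib.crc32(profile.encode("ascii")) & 0xFFFFFFFF
print(f"CRC-32 = {crc:08x}")

# With 2**32 possible checksums and 4 million stored records, a
# lookup of a profile NOT in the database still matches some
# record's checksum with probability about:
n_records = 4_000_000
p_false = n_records / 2**32
print(p_false)                # roughly one in a thousand

# Doing 4 million such lookups gives an expected number of
# false matches of about:
print(n_records * p_false)    # a few thousand
```

The same call works on any byte string, so the "profile" can be whatever canonical serialization of the demographic fields you settle on, as long as identical profiles serialize identically.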
CRC-32 can be extended to CRC-64, which produces about 18 billion billion (2^64) different checksums. The chance of any collision at all among 4 million records is then roughly one in two million (a birthday-problem estimate).
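The "any false positive" figure is a birthday-problem estimate; a small sketch of the bound, using the record count from above:

```python
def expected_collisions(n, bits):
    """Birthday bound: expected number of colliding pairs among
    n uniformly random checksums of the given bit width."""
    return n * (n - 1) / (2 * 2**bits)

n = 4_000_000
# With 32-bit checksums, collisions among the records themselves
# are essentially certain:
print(expected_collisions(n, 32))   # on the order of two thousand pairs
# With 64-bit checksums the expectation drops far below one:
print(expected_collisions(n, 64))   # roughly one in two million
```

Note this measures collisions *within* the stored set; the per-lookup false-match probability for a single foreign profile is even smaller, about n / 2^64.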
This only works because CRC is of known good quality. Use any homegrown method, and chances are your checksums are systematically non-random, making false positives immensely more likely. Best to hire a mathematician as a consultant.
For compression, good old Huffman coding or arithmetic coding should do quite nicely. Again, hire a mathematician.
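For illustration, a minimal Huffman-code construction in Python (a sketch only, not production code; the input string is a made-up stand-in for profile data):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table {symbol: bitstring} for the given text."""
    freq = Counter(text)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol gets code "0".
        return {sym: "0" for sym in freq}
    # Heap entries: (frequency, tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "aaaabbbcc d"          # hypothetical profile data
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
# Frequent symbols get short codes, so the total bit count beats a
# fixed-width encoding (here compared against 8 bits per character).
print(codes)
print(len(encoded), "bits vs", 8 * len(text), "bits fixed-width")
```

Arithmetic coding squeezes out a little more by not rounding code lengths to whole bits, but it is fiddlier to implement correctly, which is part of why the consultant advice above stands.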