Re: cyclical redundancy checksum algorithm(s)?

From: Volker Hetzer <firstname.lastname_at_ieee.org>
Date: Thu, 28 Sep 2006 19:33:51 +0200
Message-ID: <efh125$v08$1_at_nntp.fujitsu-siemens.com>


Karen Hill schrieb:
> I just finished reading one of Ralph Kimball's books. In it he
> mentions something called a cyclical redundancy checksum (crc)
> function. A crc function is a hash function that generates a checksum.
>
> I am wondering a few things. A crc function would be extremely useful
> and time saving in determining if a row needs to be updated or not (are
> the values the same, if yes don't update, if not update). In fact
> Ralph Kimball states that this is a way to check for changes. You just
> have an extra column for the crc checksum. When you go to update data,
> generate a crc checksum and compare it to the one in the crc column.
> If they are same, your data has not changed.
>
> Yet what happens if there is a collision of the checksum for a row?
It's simple:
If a crc indicates a difference, you can rely on the answer. If a crc indicates equality, you have to do a real comparison.

Therefore a CRC can give you a large speedup, depending on the situation and CRC size. Small CRCs (like 2 or 3 bit) are very fast to compare but the false positive percentage is high, necessitating lots of full comparisons. crc's almost as long as the data sets themselves are almost as slow as exhaustive comparison but ere almost always correct.

Given a certain average data record size, hash size, number of data records and relative frequency of comparison versus data change (i.e. crc calculation) you can figure out how much time you save when using crc's.

But in most circumstances you don't have to bother because...

> Is there a crc function in postgresql? If not what algorithm would I
> need to use to create one in pl/pgsql?
...normally a simple 32bit crc is amply sufficient. Given 2^16 data records, about /one/ pair out of them will share one crc, and, given one data record, /one/ other out of 2^32 data records will have the same crc.
They are fast to compute and can be indexed as integer. That's what I did when I once had to implement a simple disk mirroring application using mysql for the comparisons.
This compared several million file names (daily) using the method indicated above and worked flawlessly and fast.

Lots of Greetings!
Volker

-- 
For email replies, please substitute the obvious.
Received on Thu Sep 28 2006 - 19:33:51 CEST

Original text of this message