Re: [Spamprobe-users] hash question
From: Brian B. <bb...@us...> - 2005-03-31 15:42:11
Thomas Schürger wrote:
>>If I used a digest (128 bit MD5 hash) the record size would double to 24
>>bytes but losses would be virtually eliminated. Computing an MD5 for
>>each term would probably be significantly slower than either a 32 or 64
>>bit hash as well.
>
> Using a 64 bit hash key sounds reasonable (in relation to the number of
> unhashed keys to store). More important would be the use of a good
> hash function which minimizes collisions. Something like taking the
> first 64 bits from the MD5, SHA-1 or RIPEMD160 hashes could be a good
> choice.

I'll have to write a program to experiment with hash collision rates on
my 2.24 million term database. I have some free hash code to use:
evahash, FNV, and MD5. I'll be very interested to see how many
collisions I get vs. CPU time to compute the hash.

I'm inclined to rework the hash format to use a 64 bit key just as a
precaution. Maybe I could offer a choice of either with a little code
tweaking.

All the best,
++Brian
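
For anyone who wants to try the same kind of experiment, here is a rough sketch in Python. It is not SpamProbe code, and every function name in it is made up for illustration. It assumes the FNV-1a variant of FNV (the message above does not say which variant is meant) and leaves out evahash. It counts how many distinct terms land on an already used key, and also gives the birthday-style estimate of colliding pairs you would expect from an ideal hash.

```python
import hashlib

# Standard FNV-1a offset basis and prime for each width.
# Assumption: FNV-1a; the original post only says "FNV".
FNV_PARAMS = {
    32: (0x811C9DC5, 0x01000193),
    64: (0xCBF29CE484222325, 0x100000001B3),
}


def fnv1a(data: bytes, bits: int = 64) -> int:
    """FNV-1a hash of data, 32 or 64 bits wide."""
    h, prime = FNV_PARAMS[bits]
    mask = (1 << bits) - 1
    for byte in data:
        h ^= byte
        h = (h * prime) & mask
    return h


def md5_prefix(data: bytes, bits: int = 64) -> int:
    """First `bits` bits of the MD5 digest, as an integer."""
    digest = hashlib.md5(data).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def count_collisions(terms, hash_fn) -> int:
    """Number of distinct terms whose key was already taken by another term."""
    seen = set()
    collisions = 0
    for term in set(terms):  # duplicates of the same term are not collisions
        key = hash_fn(term.encode("utf-8"))
        if key in seen:
            collisions += 1
        else:
            seen.add(key)
    return collisions


def expected_collisions(n: int, bits: int) -> float:
    """Birthday approximation: expected colliding pairs for an ideal hash."""
    return n * (n - 1) / 2 / 2 ** bits


if __name__ == "__main__":
    # Placeholder terms; a real run would read them from the term database.
    sample = ["free", "money", "viagra", "meeting", "free money"]
    print("fnv1a-32 collisions:", count_collisions(sample, lambda d: fnv1a(d, 32)))
    print("fnv1a-64 collisions:", count_collisions(sample, lambda d: fnv1a(d, 64)))
    print("md5-64   collisions:", count_collisions(sample, md5_prefix))
    print("ideal 32 bit, 2.24M terms:", expected_collisions(2_240_000, 32))
    print("ideal 64 bit, 2.24M terms:", expected_collisions(2_240_000, 64))
```

Under the ideal-hash assumption the estimate comes out to several hundred colliding pairs at 32 bits for 2.24 million terms and far below one at 64 bits, which fits the inclination toward a 64 bit key. CPU cost per hash could be compared separately, for example with the standard `timeit` module.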