
## geotools-devel

 [Geotools-devel] hashCode and integer overflow From: Ian Schneider - 2004-01-28 17:12:32 ```Hi,

While digging into hashCode problems (which I thought I had a fix for), I found a strange situation. The code for hashCode in DefaultFeatureType is:

    int hash = typeName.hashCode();
    int ns = namespace.hashCode();
    hash = ns == 0 ? hash : ns * hash;
    for (int i = 0, ii = types.length; i < ii; i++) {
        hash *= types[i].hashCode();
    }
    return hash;

I was getting a zero on occasion, so of course I looked at DefaultAttributeType (the type array) and saw its hashCode calculated as the product of String.hashCode and Class.hashCode. Where's the culprit? I found the perfect overflow error in a data set:

    System.out.println(-536870912 * -366131984); // 0
    System.out.println(536870912 * 366131984);   // 0
    System.out.println(-536870912 * -366131985); // 536870912

So, the lesson here is: do not assume that multiplying a bunch of non-zero integers together will always produce a non-zero integer. How's that for luck?

Ian ```
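[Editor's note] Ian's "perfect overflow" is a consequence of Java's int multiplication wrapping modulo 2^32: the trailing zero bits of the factors add up, and once they reach 32 the low word of the product is entirely zero. Here -536870912 is -2^29 (29 trailing zero bits) and -366131984 is divisible by 16 (4 trailing zero bits), so 29 + 4 = 33 >= 32. A minimal sketch:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // Java int multiplication wraps modulo 2^32, so only the low
        // 32 bits of the mathematical product survive.
        int a = -536870912;  // -2^29: 29 trailing zero bits
        int b = -366131984;  // divisible by 16: 4 trailing zero bits
        System.out.println(Integer.numberOfTrailingZeros(a)); // 29
        System.out.println(Integer.numberOfTrailingZeros(b)); // 4
        // 29 + 4 = 33 trailing zero bits >= 32, so the wrapped product is 0.
        System.out.println(a * b); // 0
    }
}
```

This also explains why changing the last factor by one (-366131985, which is odd) restores a non-zero result: an odd factor contributes no trailing zeros.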
 Re: [Geotools-devel] hashCode and integer overflow From: Martin Desruisseaux - 2004-01-28 19:03:48 ```Ian Schneider a écrit :

> System.out.println(-536870912 * -366131984); // 0
> System.out.println(536870912 * 366131984); // 0
> System.out.println(-536870912 * -366131985); // 536870912
>
> So, lesson here is do not assume that multiplying a bunch of non-zero
> integers together will always lead to a non-zero integer.

It seems to me that it is somewhat unusual to multiply hash codes in a 'hashCode()' computation. I don't know if there is a recommended practice (except for the "A1 + 37*A2" below), but here is what I usually do:

* hashCode() should be fast, otherwise it defeats part of its purpose. We don't need to always take everything into account. For example, if a class has two fields, 'A' and 'B', and 'B' is almost always the same for a given value of 'A' (for example because it is a value derived from 'A'), I do not include 'B' in the hash code computation. Including 'A' is enough in most cases.

* If there is an array, I usually don't include every array element in the hash code computation, for the same reason as above (there is often a strong correlation between elements in the same array). For example, I may include only 1/4 of the elements.

* Do not multiply hash codes; add them instead and multiply one of them by some small prime number, e.g. hashCode1 + 37*hashCode2. When I combine hash codes from two totally different objects, I usually perform only a XOR operation. For example:

    int A0, A1, A2, A3;
    return ((A0*37 + A1)*37 + A2)*37 + A3;

    String A;
    Class B;
    return A.hashCode() ^ B.hashCode();

The reason for the *37 in the first example was to avoid two hash codes cancelling each other. Cancellation may occur when adding similar elements (e.g. bounding box coordinates like -10 -10 +10 +10). But it seems very unlikely to me that the hash code of a String would cancel the hash code of a Class just by chance, and it seems to me that multiplying by 37 would not change this probability...

Regards,

Martin. ```
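[Editor's note] Martin's accumulate-with-a-prime pattern can be sketched directly from his example; the field names below are the illustrative A0..A3 from his email, not Geotools code. With plain addition a symmetric bounding box hashes to zero, while the *37 scaling breaks the symmetry:

```java
public class HashCombining {
    // The pattern from Martin's email: add fields, scaling the running
    // total by a small prime so similar values don't cancel.
    static int combine(int a0, int a1, int a2, int a3) {
        return ((a0 * 37 + a1) * 37 + a2) * 37 + a3;
    }

    public static void main(String[] args) {
        // Bounding box coordinates -10 -10 +10 +10:
        System.out.println(-10 + -10 + 10 + 10);       // plain sum cancels to 0
        System.out.println(combine(-10, -10, 10, 10)); // non-zero
    }
}
```

Java's own conventions (e.g. String.hashCode and java.util.Objects.hash) use the same shape with 31 as the multiplier.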
 Re: [Geotools-devel] hashCode and integer overflow From: Ian Schneider - 2004-01-28 19:52:26 ```Martin,

Thanks for replying.

> It seems to me that it is somewhat unusual to multiply hash codes in a
> 'hashCode()' computation. I don't know if there is a recommended
> practice (except for the "A1 + 37*A2" below), but here is what I
> usually do:

Agreed, multiply is no good.

> * hashCode() should be fast, otherwise it defeats part of its purpose.
>   We don't need to always take everything into account. For example,
>   if a class has two fields, 'A' and 'B', and 'B' is almost always the
>   same for a given value of 'A' (for example because it is a value
>   derived from 'A'), I do not include 'B' in the hash code computation.
>   Including 'A' is enough in most cases.

Agreed, again. Multiplication also incurs extra overhead (though small, it can be compounded easily). In this case, the key optimization I've added is caching the hashCode in DefaultFeatureType.

> * Do not multiply hash codes; add them instead and multiply one of
>   them by some small prime number, e.g. hashCode1 + 37*hashCode2.

This is precisely what I first did in DefaultFeature; instead of

    hash *= 13 * schema.hashCode()

I had

    hash += 13 * schema.hashCode()

(See below for a discussion of why this doesn't work well in this particular case.)

> When I combine hash codes from two totally different objects, I
> usually perform only a XOR operation. For example:
>
>     int A0, A1, A2, A3;
>     return ((A0*37 + A1)*37 + A2)*37 + A3;
>
>     String A;
>     Class B;
>     return A.hashCode() ^ B.hashCode();
>
> The reason for the *37 in the first example was to avoid two hash
> codes cancelling each other. Cancellation may occur when adding
> similar elements (e.g. bounding box coordinates like -10 -10 +10 +10).
> But it seems very unlikely to me that the hash code of a String would
> cancel the hash code of a Class just by chance, and it seems to me
> that multiplying by 37 would not change this probability...

Again, this is my current change for DefaultFeatureType: the hash is the XOR of typeName, namespace, and each AttributeType. I'm currently examining the distribution of the hashing mechanism. It should be noted that HashMap itself will rehash the hashCode to defend against poor hashes (though defending against 0 is quite hard).

Discussion of featureId.hashCode() problems:
---------------------------------------------------

The other interesting observation is String.hashCode. It turns out that when Shapefile creates its featureId, it appends an incrementing count to a base name, that is: feature.1, feature.2, etc. Looking at the String.hashCode method, we see that this leads to the behavior that every 10 Features have hashCodes that differ by only 1. Example (name : hashCode of string):

    east_hru.1 : -647155476
    east_hru.2 : -647155475
    east_hru.3 : -647155474
    east_hru.4 : -647155473
    east_hru.5 : -647155472
    east_hru.6 : -647155471
    east_hru.7 : -647155470
    east_hru.8 : -647155469
    east_hru.9 : -647155468

Then, because of the extra digit shifted in, you see this:

    east_hru.10 : 1413016772
    east_hru.11 : 1413016773
    east_hru.12 : 1413016774
    east_hru.13 : 1413016775
    east_hru.14 : 1413016776
    east_hru.15 : 1413016777
    east_hru.16 : 1413016778
    east_hru.17 : 1413016779
    east_hru.18 : 1413016780
    east_hru.19 : 1413016781
    east_hru.20 : 1413016803

(notice the small jump at 20)

When you take this into consideration with the above hashCode technique (adding the string hash to 13 * schema.hashCode()), you get some pretty poorly distributed hashes... Since schema.hashCode() is the same for every Feature, multiplying it by 13 simply incurs superfluous overhead. So, for now, I think the best bet is (in DefaultFeatureType):

    return featureId.hashCode() * schema.hashCode()

(We cross our fingers here and hope that cancellations are few and far between.)

Hope that all makes sense. I haven't committed any changes yet, as I want to test load distribution in HashMap. Barring any obvious breakdown, and without anyone else's objections, I'll commit my changes and close the issue.

Regards,

Ian ```
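[Editor's note] Ian's observation about sequential feature IDs is easy to reproduce. String.hashCode is defined as s[0]*31^(n-1) + ... + s[n-1], so the final character carries weight 1, and IDs differing only in the last digit produce hash codes differing by exactly 1. A sketch using the "east_hru" base name from his example:

```java
public class FeatureIdHashes {
    public static void main(String[] args) {
        // Print the hash codes of sequential shapefile-style feature IDs.
        // Within each run of ten (same digit count), consecutive IDs
        // differ only in their last character, so hashes differ by 1.
        for (int i = 1; i <= 21; i++) {
            String id = "east_hru." + i;
            System.out.println(id + " : " + id.hashCode());
        }
        // The last character has weight 1 in String.hashCode:
        System.out.println("east_hru.2".hashCode()
                         - "east_hru.1".hashCode()); // 1
    }
}
```

The "small jump at 20" follows from the same formula: going from ".19" to ".20" changes the next-to-last character (weight 31) by +1 and the last character by -9, a net change of 31 - 9 = 22.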
 Re: [Geotools-devel] hashCode and integer overflow From: Martin Desruisseaux - 2004-01-30 06:10:14 ```Hello all

Ian Schneider a écrit :

> So, for now, I think the best bet is (in DefaultFeatureType):
>
> return featureId.hashCode() * schema.hashCode()
>
> (We cross our fingers here and hope that cancellations are few and far
> between)

In the case of 536870912 * 366131984 printing 0, the hexadecimal form sheds some light:

    2000,0000 * 15D2,BB10 = 2BA,5762,0000,0000

If the 20000000 value comes from some hash code computation in Geotools 2 code, maybe we should revisit that hash code computation as well?

Martin. ```
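[Editor's note] Martin's hexadecimal decomposition can be checked by doing the multiplication in 64-bit arithmetic and then truncating to int, which keeps only the low 32 bits of the product:

```java
public class HexOverflow {
    public static void main(String[] args) {
        // Exact 64-bit product of the two factors from Ian's data set.
        long full = 0x20000000L * 0x15D2BB10L;      // 536870912 * 366131984
        System.out.println(Long.toHexString(full)); // 2ba576200000000
        // int multiplication keeps only the low 32 bits, all of which
        // are zero here, hence the surprising 0.
        System.out.println((int) full);             // 0
    }
}
```

The eight trailing zero hex digits of 0x2BA576200000000 are exactly the 32 bits that survive the int truncation.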