From: Alexandre F. <ale...@gm...> - 2010-02-12 16:57:10
|
Attaching inline since the mailing list apparently tore the attachment off. Thanks Clif for telling me!

-Alex

On 2/11/10, Alexandre Ferrieux <ale...@gm...> wrote:
> Since it appears that timings are tricky, and that worst-case
> (including malevolent) performance is at least as interesting as the
> average, here is a small pure-Tcl function that computes the buckets'
> contents for a given array. It yields a list of {size bucket} pairs,
> where size==[llength bucket], sorted by decreasing size. You can
> verify its honesty by comparing the sizes with what [array statistics]
> says.
>
> Hope this tool will let somebody discover the
> killing-HTTP-header-attack we are all dying to see.
>
> -Alex

proc nbuck va {
    upvar $va a
    regexp {, ([0-9]+) buckets} [array statistics a] -> n
    return $n
}

proc next {e l} {
    set pos [lsearch $l $e]
    if {$pos<0} {error "not found: $e in [list $l]"}
    incr pos
    return [lindex $l $pos]
}

proc bucketeer var {
    upvar $var ar
    array set x [array get ar]
    array unset x *
    set nb [nbuck x]
    foreach {k v} [array get ar] {
        set x($k) $v
        set n [nbuck x]
        if {$n==$nb} {
            set k2 [next $k [array names x]]
            lappend buck($k2) $k
            set buck($k) $buck($k2)
            unset buck($k2)
        } else {
            set buck($k) [list $k]
            set nb $n
        }
    }
    set out {}
    foreach {k b} [array get buck] {
        lappend out [list [llength $b] $b]
    }
    return [lsort -integer -decreasing -index 0 $out]
}
|
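A minimal usage sketch for the tool above (the array contents here are made up purely for illustration; the reported sizes can be cross-checked against [array statistics]):

    # fill an array somehow, e.g. with the "collide" proc that appears
    # later in this thread, then list its buckets, largest first
    array set a {foo 1 bar 2 grill 3}
    foreach pair [bucketeer a] {
        lassign $pair size keys
        puts "$size: $keys"
    }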
From: Gustaf N. <ne...@wu...> - 2010-02-14 17:20:48
|
On 11.02.10 01:08, Alexandre Ferrieux wrote:
> Since it appears that timings are tricky,

Apparently. It turns out that in both mini-benchmarks (the two "foo" functions) the hash function for string keys (HashStringKey()) is not called at all (!) in the timed loop (e.g. Tcl variables are compiled locals). Funny enough, just by changing the string hash function one can measure different performance values (maybe due to processor-level caching or similar external effects).

Two more observations:

- At least on x86 Intel, gcc compiles the classical Tcl string hash function pretty well. The loop body of the hash function, "result += (result<<3) + c;", results in just two machine instructions (one lea + one add). This is hard to beat by assembly coding (if someone is interested, I have an apparently slightly faster inline assembly version for x86 64-bit). So the classical Tcl hash function appears well suited for modern compilers, and gcc does very well here (I used gcc 4.2.1). FNV needs the slower integer multiply; that's why FNV is slower (measured by Donal's hashes.c test).

- For small hash tables, only a few bits of the computed hash values are used. For the initial size, these are only the last two bits. With every hash table resize (RebuildTable) two more bits are added. It is not clear how the properties of hash functions propagate when only a small fraction (e.g. 2, 4, 6 etc. bits) of the computed hash value is actually used.

It looks to me as if the original Tcl hash function has quite good properties for the task it is used for.

In order to make hashing faster in Tcl, it is most likely more promising to look at the Tcl hash machinery altogether. The granularity of the functions does not look perfectly suited for the needs of current Tcl. So the current definitions

==================================
Tcl_HashKeyType {
    ....
    HashKeyProc
    CompareHashKeysProc
    AllocHashEntryProc
    FreeHashEntryProc
}

Tcl_HashEntry *
Tcl_CreateHashEntry(...)
{
    ...
    if (tablePtr->keyType == TCL_STRING_KEYS) {
        typePtr = &tclStringHashKeyType;
    } else if (tablePtr->keyType == TCL_ONE_WORD_KEYS) {
        typePtr = &tclOneWordHashKeyType;
    } else if (tablePtr->keyType == TCL_CUSTOM_TYPE_KEYS
            || tablePtr->keyType == TCL_CUSTOM_PTR_KEYS) {
        typePtr = tablePtr->typePtr;
    } else {
        typePtr = &tclArrayHashKeyType;
    }
    ...
==================================

could become something like

==================================
Tcl_HashKeyType {
    ....
    CreateHashEntryProc
    FreeHashEntryProc
}

#define Tcl_CreateHashEntry(tablePtr,..) \
    (...)tablePtr->type->CreateHashEntryProc(...)
==================================

to provide tailored create functions for every hashKeyType. Tcl does not call HashKeyProc, CompareHashKeysProc, or AllocHashEntryProc outside CreateHashEntry. The common "keyType == ..." sermon used in several functions of the hash API does not look perfect either, and could be replaced by a macro like the one sketched....

best regards
-gustaf neumann
|
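For readers following along, here is a rough Tcl-level model of the two inner loops being compared (illustrative only: the real implementations are in C; the constants are the usual 32-bit FNV offset basis and prime, shown here in the FNV-1a style, which may differ in detail from the exact variant that was benchmarked):

    proc classicHash {s} {
        set h 0
        foreach c [split $s ""] {
            # classic Tcl string hash: result += (result << 3) + c
            set h [expr {($h + ($h << 3) + [scan $c %c]) & 0xffffffff}]
        }
        return $h
    }

    proc fnvHash {s} {
        set h 0x811c9dc5                     ;# FNV offset basis
        foreach c [split $s ""] {
            # xor in the character, then multiply by the FNV prime
            set h [expr {(($h ^ [scan $c %c]) * 0x01000193) & 0xffffffff}]
        }
        return $h
    }

The classic loop body is one shift and two adds per character; the FNV loop body needs a multiply per character, which is the speed difference discussed above.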
From: Donal K. F. <don...@ma...> - 2010-02-14 23:54:47
Attachments:
donal_k_fellows.vcf
|
On 14/02/2010 16:58, Gustaf Neumann wrote:
> On 11.02.10 01:08, Alexandre Ferrieux wrote:
>> Since it appears that timings are tricky,
> Apparently. It turns out that in both mini-benchmarks (the two
> "foo" functions) the hash function for string keys (HashStringKey())
> is not called at all (!) in the timed loop (e.g. Tcl variables are
> compiled locals).

That's why you have to use array variables there. Though that also doesn't call HashStringKey; it calls the equivalent for Tcl_Obj keys.

> Funny enough, just by changing the string hash function one can
> measure different performance values (maybe due to processor-level
> caching or similar external effects).

There are a lot of tricky bits to measuring the performance. That's why I switched to using a direct test (and had to tweak that a lot too in the end so as to stop the gcc optimizer from making me test something other than what I really wanted!) with the result, after much experimentation, that the FNV hash function is slower than Tcl's original hash function. Not that it's a bottleneck for any existing code.

The other metrics we can use here relate to the differences in distribution of hash keys, which can have quite a substantial impact on the overall hash *table* performance. With general data, the performance seems to be about the same; it's difficult to construct an example with ordinary keys where the difference is really worth mentioning. With "malicious" data it's different though; http://wiki.tcl.tk/5164 describes how the old function (mis)behaves, and with FNV that part is unremarkable. (There should be sets of keys which are a problem for FNV too, but I've not constructed any. I have found[*] a reference to some code for generating collisions against FNV, <URL:http://www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html>, but I'm not sure about the practicality of any attack based on the methods described there. I also don't know what the effect is of hashing over utf-8 instead of something 8-bit clean.)

> - For small hash tables, only a few bits of the computed hash
> values are used. For the initial size, these are only the last two
> bits. With every hash table resize (RebuildTable) two more bits are
> added. It is not clear how the properties of hash functions
> propagate when only a small fraction (e.g. 2, 4, 6 etc. bits) of
> the computed hash value is actually used.

That tends not to be an issue, because if anyone tries to exploit it, the table gets rebuilt and more bits of the hash are used. A way to defeat the whole problem would be to take the whole hash value mod some number that is coprime with the hash table size, but mod is a fairly expensive operation. It's only really a pressing issue when people are generating many keys with an identical full hash value (by default, Tcl does a comparison of the full hashcode before proceeding to a key comparison). As noted before, that's depressingly easy with Tcl's old hash function.

> It looks to me as if the original Tcl hash function has quite
> good properties for the task it is used for.

So long as people are not using it in a setting where an attacker controls the keys, that's true. If an attacker can dictate what values are being hashed (true in a number of cases like HTTP header parsing) then you've got a significant performance problem.

> In order to make hashing faster in Tcl, it is most likely more
> promising to look at the Tcl-hash machinery altogether.

That's a swamp I've preferred to stay out of, because there are many constraints due to the requirement to maintain high degrees of binary compatibility. Reworking the hash code in depth is probably an activity for 9.0 and not before.

The other alternative is Bob Jenkins's lookup3, but that's big, scary, and needs to know the length of the data being hashed (which either means a strlen() every time for strings, or forces us to break our binary-compatibility guarantees).

Donal.

[* _This_ thread is one of the top 10 hits that I find when I google for "fnv hash collision generating", which is a bit self-referential. ]
|
From: Gustaf N. <ne...@wu...> - 2010-02-16 15:25:55
|
On 15.02.10 00:54, Donal K. Fellows wrote:
> ...
> than what I really wanted!) with the result, after much experimentation,
> that the FNV hash function is slower than Tcl's original hash function.
> Not that it's a bottleneck for any existing code.

Whatever "bottleneck" means: since FNV is slower than the old hash function, and assuming the "collision attack" is not a problem for most programs, most programs will be slower (admittedly, by a small degree).

Concerning the "attack": the problem is apparently that "aj" and "ba" result in the same hash value under the classical Tcl string hash function. The same is true for any other combination of "aj" and "ba" as long as they have the same length, e.g. Hash(baba) == Hash(ajaj) == Hash(baaj) == Hash(ajba). Since the hash is the same, the index of the bucket (based on the low bits) is the same, therefore the same bucket is used for all these keys. For these hash keys, lookup becomes a linear search. Since the hash values are equal, the hash-value comparison cannot avoid the string comparison. With two two-character pairs, for a string length of 4 we have just 4 collisions (with longer strings, exponentially more, as "collide" with string length 30 shows). The situation is the same for many more two-character pairs.

How bad is this in practice? This "attack" is not more than bad performance in a pathological case, or?

Concerning FNV: for string lengths of less than two characters, FNV and the classic Tcl function return the same hash value (just a side note). As Lars pointed out, it is in general possible to find strings $x and $y such that FNV($x) = FNV($y). But for the "collision attack" hash equality is not required. The hash function implies a linear search (in the buckets) if

   INDEX(Hash($x)) == INDEX(Hash($y))

no matter which hash function is used (where INDEX is the function that calculates the bucket from the hash value). Linear search is only a problem when the number of entries in the bucket is high.

It is very easy to come up with values where, also with FNV, the hash lookup becomes a linear search. Consider (with FNV compiled in):

   foreach key {a e i m q u y 1 5 tcl_platform tcl_library} { set x($key) 1 }
   puts [array statistics x]
   % 11 entries in table, 4 buckets
   % number of buckets with 0 entries: 3
   % number of buckets with 10 or more entries: 1
   % average search distance for entry: 6.0

or for example:

   foreach key {
       0z 19 1i 1y 2l 33 3c 3s 46 4f 4v 55 5e 5u 68 6h 6x 7o 82 8b 8r 91
       9a 9q a9 ai ay bl c3 cc cs d6 df dv e5 ee eu f8 fh fx go h2 hb hr
       i1 ia iq
   } { set y($key) 1 }
   puts [array statistics y]
   % 47 entries in table, 16 buckets
   % number of buckets with 0 entries: 15
   % number of buckets with 10 or more entries: 1
   % average search distance for entry: 24.0

In both cases, all entries are added into just one bucket. I see no reason why this can't happen for every other reasonable table size. In the case of "collide" with 32768 hash table entries, just 14 bits of the hash keys are used to determine the bucket. So neither FNV (nor any other hash function) can guarantee that the buckets are filled evenly and that the degenerate case cannot happen.

>> It looks to me as if the original Tcl hash function has quite
>> good properties for the task it is used for.
> So long as people are not using it in a setting where an attacker
> controls the keys, that's true. If an attacker can dictate what values
> are being hashed (true in a number of cases like HTTP header parsing)
> then you've got a significant performance problem.

Well, in my understanding "attacker" means "performance attacker". I guess the argument in the HTTP case is that one transmits 30,000 malicious HTTP header entries to slow down the array access from 0.3 microseconds to 518 microseconds (measured on my machine on the "collide" data). I am pretty sure that transmitting 30,000 header fields costs much more.

>> In order to make hashing faster in Tcl, it is most likely more
>> promising to look at the Tcl-hash machinery altogether.
>
> That's a swamp I've preferred to stay out of, because there are many
> constraints due to the requirement to maintain high degrees of binary
> compatibility. Reworking the hash code in depth is probably an activity
> for 9.0 and not before.

If a high degree of binary compatibility is the only concern: it should be possible to keep the existing machinery in 8.* in parallel to a more optimized version, but the main question is most likely: is it worth the effort?

To my current understanding, the main issue is the whole machinery, not only the hash function. The issues are
(a) uneven distribution of entries over the buckets,
(b) linear search in the buckets,
(c) some more or less constant overhead, which could be improved.

One approach is to improve the bucket-growing policy (e.g. linear hashing, extendible hashing). This would address (a) and (b), but require some more work. Another, conceptually simpler approach is to replace the linear search in the buckets by e.g. AVL trees (these trees keep themselves balanced). This would address (b); compatibility would not be an issue. Changing the hash function just addresses (a).

> The other alternative is Bob Jenkins's lookup3, but that's big, scary,
> and needs to know the length of the data being hashed (which either
> means a strlen() every time for strings, or forces us to break our
> binary-compatibility guarantees).

Well, HashVarKey in tclVar.c has the key length, so there is no strlen() required there. This is exactly the place where "collide" stresses the hash function. When I try different hash functions on "collide", FNV is not yielding the best results; I get better figures with lookup8 (tested on 64-bit) and murmur (references below). Note as well that the average search distance decreases from 3.0 with FNV to 2.0 with lookup8 and murmur. Also the number of buckets without entries is much better with lookup3/8 and murmur.

-gustaf neumann

references:
http://burtleburtle.net/bob/hash/index.html
http://murmurhash.googlepages.com/

PS: I have the suspicion that the avg search length reported by "array statistics" is not correct, since it always reports values ending with ".0"; I have added values which I believe are more accurate in brackets.

##### FNV
57581 microseconds per iteration
32776 entries in table, 16384 buckets
number of buckets with 0 entries: 8331
number of buckets with 1 entries: 632
number of buckets with 2 entries: 1188
number of buckets with 3 entries: 1570
number of buckets with 4 entries: 1602
number of buckets with 5 entries: 1330
number of buckets with 6 entries: 826
number of buckets with 7 entries: 470
number of buckets with 8 entries: 252
number of buckets with 9 entries: 123
number of buckets with 10 or more entries: 60
average search distance for entry: 3.0 (2.0350)

##### murmur
54371 microseconds per iteration
32776 entries in table, 16384 buckets
number of buckets with 0 entries: 2218
number of buckets with 1 entries: 4401
number of buckets with 2 entries: 4457
number of buckets with 3 entries: 2995
number of buckets with 4 entries: 1484
number of buckets with 5 entries: 549
number of buckets with 6 entries: 196
number of buckets with 7 entries: 59
number of buckets with 8 entries: 19
number of buckets with 9 entries: 6
number of buckets with 10 or more entries: 0
average search distance for entry: 2.00 (1.1569)

##### lookup3
53402 microseconds per iteration
32776 entries in table, 16384 buckets
number of buckets with 0 entries: 2237
number of buckets with 1 entries: 4367
number of buckets with 2 entries: 4534
number of buckets with 3 entries: 2855
number of buckets with 4 entries: 1521
number of buckets with 5 entries: 621
number of buckets with 6 entries: 176
number of buckets with 7 entries: 54
number of buckets with 8 entries: 18
number of buckets with 9 entries: 1
number of buckets with 10 or more entries: 0
average search distance for entry: 2.00 (1.1584)

##### lookup8
55521 microseconds per iteration
32776 entries in table, 16384 buckets
number of buckets with 0 entries: 2209
number of buckets with 1 entries: 4413
number of buckets with 2 entries: 4432
number of buckets with 3 entries: 3052
number of buckets with 4 entries: 1413
number of buckets with 5 entries: 595
number of buckets with 6 entries: 200
number of buckets with 7 entries: 49
number of buckets with 8 entries: 17
number of buckets with 9 entries: 3
number of buckets with 10 or more entries: 1
average search distance for entry: 2.0 (1.1561)
|
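A rough Tcl model of the INDEX mapping discussed above (a sketch only: it assumes the power-of-two table sizes and low-bit masking described in this thread; the real bucket selection happens in C inside tclHash.c):

    # with a power-of-two bucket count, only the low bits of the hash matter
    proc INDEX {hash numBuckets} {
        expr {$hash & ($numBuckets - 1)}
    }

Any two keys whose hash values agree in those low bits land in the same bucket and are then found only by a linear search along that bucket's chain, which is exactly the degenerate case shown in the statistics above.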
From: Donal K. F. <don...@ma...> - 2010-02-16 16:46:04
Attachments:
donal_k_fellows.vcf
|
On 16/02/2010 15:25, Gustaf Neumann wrote:
> How bad is this in practice? This "attack" is not more than bad
> performance in a pathological case, or?

It's exactly a performance issue. The issue is exactly that it is far too easy to trigger this worst-case behaviour. When the behaviour is triggered, Tcl is still functioning correctly though. It never thinks two keys are the same unless they've been compared using full equality. (It just happens to think they're different if the hashcodes are different; it's a one-way implication.)

> Concerning FNV: for string lengths of less than two characters, FNV and
> the classic Tcl function return the same hash value (just a side
> note).

Only because I'd made a mistake. :-) The FNV algorithm uses a baseline value which means that the hashcodes are different.

> As Lars pointed out, it is in general possible to find strings
> $x and $y such that FNV($x) = FNV($y). But for the "collision attack"
> hash equality is not required. The hash function implies a linear
> search (in the buckets) if
>
>    INDEX(Hash($x)) == INDEX(Hash($y))
>
> no matter which hash function is used (where INDEX is the function that
> calculates the bucket from the hash value). Linear search is only a problem
> when the number of entries in the bucket is high.

Precisely. Even 8 values per bucket isn't a big problem. The problem comes when you go into the realm of thousands. That's when it matters.

> It is very easy to come up with values where, also with FNV,
> the hash lookup becomes a linear search. Consider (with FNV
> compiled in): [...]
> In both cases, all entries are added into just one bucket. I see no
> reason why this can't happen for every other reasonable table size.

If the hashcodes are the same, the problem happens. If the hashcodes are only partially the same (to some number of LSBs) then small hash tables have a problem, but that will go away over some size.

> In the case of "collide" with 32768 hash table entries, just 14 bits
> of the hash keys are used to determine the bucket. So neither FNV
> (nor any other hash function) can guarantee that the buckets are
> filled evenly and that the degenerate case cannot happen.

The point is that it should be "hard" to find a large number of collisions with any particular value. Preferably there should be minimal collisions at all, but that's tricky to achieve. Being able to bury things that you *do* want is worse, because typically in practice most keys are not created equal.

> if a high degree of binary compatibility is the only concern:
> it should be possible to keep the existing machinery in 8.* in parallel to
> a more optimized version, but the main question is most likely: is it
> worth the effort?

The only major chunk of effort round here is this email thread. :-)

[The rest of your message I'll get around to a bit later tonight.]

Donal.
|
From: Donal K. F. <don...@ma...> - 2010-02-16 22:36:28
Attachments:
donal_k_fellows.vcf
|
On 16/02/2010 15:25, Gustaf Neumann wrote:
> if a high degree of binary compatibility is the only concern:
> it should be possible to keep the existing machinery in 8.* in
> parallel to a more optimized version, but the main question is most
> likely: is it worth the effort?
>
> To my current understanding, the main issue is the whole machinery,
> not only the hash function. The issues are
> (a) uneven distribution of entries over the buckets,
> (b) linear search in the buckets,
> (c) some more or less constant overhead, which could be improved.
>
> One approach is to improve the bucket-growing policy (e.g. linear
> hashing, extendible hashing). This would address (a) and (b), but
> require some more work.
> Another, conceptually simpler approach is to replace the linear
> search in the buckets by e.g. AVL trees (these trees keep themselves
> balanced). This would address (b); compatibility would not be an
> issue.
> Changing the hash function just addresses (a).

Switching to a wholly new map system has the disadvantage of needing really heavy testing. In other words, it's a total replacement for the contents of tclHash.c (and tclLiteral.c too). Changing the hash function has much smaller "engineering impact".

The main benefit of going to a balanced tree is that it provides a strong upper bound on the cost of mapping a key to a value for any particular size of map. Whether that's as low a cost as for Tcl's hash tables in the normal case, well, I don't know. For a balanced binary tree with 32k elements, I'd expect ~15 comparisons per lookup. With hash tables, we're talking 2 or 3 (by your figures). So a binary tree is too simplistic, and we'd be into much more complex territory. Sorting all that out would make tinkering with hash functions seem rather like small beer. It'd probably attract less comment on this list though (by the Bikeshed Principle). ;-)

> Well, HashVarKey in tclVar.c has the key length, so there is no
> strlen() required there. This is exactly the place where "collide"
> stresses the hash function.

I was more concerned about HashStringKey in tclHash.c, but I probably ought to measure instead of worrying about it. :-)

> When I try different hash functions on "collide", FNV is not yielding
> the best results; I get better figures with lookup8 (tested on 64-bit)
> and murmur (references below). Note as well that the average search
> distance decreases from 3.0 with FNV to 2.0 with lookup8 and murmur. Also
> the number of buckets without entries is much better with lookup3/8 and
> murmur.

As noted before, lookup3 is probably the best hash. It's reported to be highly resistant to attack, and is also easy to cut down for different hash sizes. I only didn't go with it because it is also a big and scary piece of code, and it doesn't perform so well with C string keys because of the need to perform a strlen() first. That might not be a real concern though; I need to do some testing. (I do know, from the various pages that a thorough googling dredges up, that attacking lookup3 is reckoned to be very difficult with a sub-brute-force technique.)

IIRC, some of the others make alignment assumptions, and I know that those assumptions aren't safe for Tcl. For example, I'm pretty sure that command and namespace resolution don't guarantee to use aligned strings, and not all systems cope as well as Intel with non-aligned accesses - I've had problems with this on SPARC in the past. AIUI, murmur is one of the hashes with this problem; it assumes that all non-aligned bits are at the end.

A big advantage of FNV was that it was easy to see that it was safe in that respect, since it sticks to character-level accesses; I knew it would not crash inside its implementation. Auditing the other hash functions for this sort of thing is more challenging.

> PS: I have the suspicion that the avg search length reported by
> "array statistics" is not correct, since it always reports values
> ending with ".0"; I have added values which I believe are more
> accurate in brackets.

I've not had problems with (admittedly trivial) testing. You're measuring at a point when there are about two times as many values as buckets, so perfect loading will produce two values per bucket. Since we're not always using the same number of values per hash table, it's useful to look at the distribution for different numbers of values per bucket. (I suspect the problems with FNV relate to how Tcl handles the hashcode when mapping it to a bucket.) It's quite possible that you'll find similar patterns at other loading ratios. I just note that these things are not at all constant.

Donal.
|
From: Gustaf N. <ne...@wu...> - 2010-02-17 11:05:14
|
On 16.02.10 17:45, Donal K. Fellows wrote:
>> How bad is this in practice? This "attack" is not more than bad
>> performance in a pathological case, or?
> It's exactly a performance issue. The issue is exactly that it is far
> too easy to trigger this worst-case behaviour.

My point is: for an "attacker", it is still very easy to attack the runtime behavior of the existing bucket distribution algorithm with FNV (or some other hash function). Get http://alice.wu-wien.ac.at:8000/xowiki/download/file/keys and try with a seeded FNV (I used 0x811c9dc5 as seed):

   set f [open /tmp/keys r]; set keys [read $f]; close $f
   foreach key [split $keys] { set x($key) 1 }

> 55917 entries in table, 65536 buckets
> number of buckets with 0 entries: 65528
> number of buckets with 1 entries: 4
> number of buckets with 10 or more entries: 4
> average search distance for entry: 6989.14

You see, most buckets are empty and the average is horrible. Certainly, other hash functions need a different set of keys for the "attack".

>> Concerning FNV: for string lengths of less than two characters, FNV and
>> the classic Tcl function return the same hash value (just a side
>> note).
> Only because I'd made a mistake. :-) The FNV algorithm uses a baseline
> value which means that the hashcodes are different.

I should have spotted this as well! I fixed this in my implementation. The overall effects are very small; the distribution pattern of the "collide" test is not changed by using a seed value.

> If the hashcodes are the same, the problem happens.

Of course. The bucket distribution algorithm based on least significant bits can't provide an even distribution when hash values are identical. (Tom, a modulo function does not help either in this case.)

> If the hashcodes are
> only partially the same (to some number of LSBs) then small hash tables
> have a problem, but that will go away over some size.

No, the problem won't go away if "attacked" (just the likelihood of it happening accidentally might go down). The point is: even if a hash algorithm produced no duplicates (which is impossible), one can get a strongly unbalanced bucket distribution as long as the number of buckets is significantly smaller than the number of hash values. This kind of attack is not hard at all.

> Switching to a wholly new map system has the disadvantage of needing
> really heavy testing. .... Changing the hash function
> has much smaller "engineering impact".

Of course. Especially when going towards e.g. linear hashing (using more sophisticated bucket splitting). On the other hand, this area is well understood and is used in many systems today.

> The main benefit of going to a balanced tree is that it provides a
> strong upper bound on the cost of mapping a key to a value for any
> particular size of map.

Well, it changes worst-case linear to worst-case logarithmic.

> Whether that's as low a cost as for Tcl's hash
> tables in the normal case, well, I don't know. For a balanced binary
> tree with 32k elements, I'd expect ~15 comparisons per lookup. With hash
> tables, we're talking 2 or 3 (by your figures).

I guess you got me wrong. The mentioned option was not to replace hashing by trees in general (imho the worst engineering option) but to replace the linear search in the bucket list by a balanced tree, such that a bucket list with 32k elements requires ~15 comparisons on average instead of 16k. This approach would require less engineering than going towards e.g. linear hashing. It would be a remedy (safety net) against all the mentioned hash attacks.

> As noted before, lookup3 is probably the best hash.

I would consider Murmur 2.0 a serious contender. It is simpler and apparently faster than lookup3 (also newer), with excellent properties (see the statistics at http://murmurhash.googlepages.com/).

> IIRC, some of the others make alignment assumptions
> AIUI, murmur is one of the hashes with this problem

For both lookup3 and murmur, versions exist for aligned and unaligned data. Part of the length of the lookup3 code is that it contains essentially 3 versions (reading 32-bit chunks, 16-bit chunks and 8-bit chunks). The 32-bit chunk version can most probably lead to unaligned access when the keys are not on word boundaries:

   const uint32_t *k = (const uint32_t *)key;   /* read 32-bit chunks */

Exactly the same is true for murmur and lookup3; both support char and word chunks for input.

>> PS: I have the suspicion that the avg search length reported by
>> "array statistics" is not correct
>
> I've not had problems with (admittedly trivial) testing. You're
> measuring at a point when there are about two times as many values as
> buckets, so perfect loading will produce two values per bucket.

I would think that the average search length per entry should be calculated via sum(entries(bucket)^2)/numEntries for unsuccessful searches, and half of that for successful searches (every insert requires an unsuccessful search).

-gustaf neumann
|
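A literal Tcl transcription of the formula Gustaf proposes (illustrative only; bucketSizes is assumed to be the list of per-bucket entry counts, e.g. as recoverable with the bucketeer proc from the start of this thread):

    proc avgSearchLength {bucketSizes} {
        set n 0; set sumSq 0
        foreach b $bucketSizes {
            incr n $b
            incr sumSq [expr {$b * $b}]
        }
        # average for an unsuccessful search; the claim is that a
        # successful search costs about half of this
        expr {double($sumSq) / $n}
    }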
From: Kristoffer L. <se...@sc...> - 2010-02-17 11:18:29
|
On 17 Feb 2010, at 13:04, Gustaf Neumann wrote:
> I guess you got me wrong. The mentioned option was not to
> replace hashing by trees in general (imho the worst engineering option)
> but to replace the linear search in the bucket list by a balanced
> tree, such that a bucket list with 32k elements requires ~15
> comparisons on average instead of 16k.

Sticking in my 2 cents here, but this is indeed a common-knowledge solution to the hash problem, and a very reasonable one if you can avoid heavy overhead for small hashes. I'm a bit of a data-structures purist and love balanced trees. The aesthetic, structural and self-managing aspects of them have a huge appeal to me. However, I do understand their limitations, and a combination might indeed be a decent solution - if there even is a serious problem to begin with.

--
Kristoffer Lawson, Co-Founder, Scred // http://www.scred.com/
|
From: Tom J. <tom...@gm...> - 2010-02-16 18:31:04
|
Gustaf,

Try out my Tcl hash functions from my previous email. The problem is not the hash function; it is generally the method of transforming the hash value into a bucket index. The better-performing hashes shown in your post are the result of a lot of randomness in the low-order bits. This is to be expected of a hash which mixes all the bits in with the new bits.

If the entire hash value was used, it could make a lot of sense to increase the sophistication of the hash function, but it would be much easier to just take a prime modulus of the hash value once. This would take into account all the bits, and would help all simple hash functions.

For instance, here are the results with a patched tcl8.6 which show how bad the bucket choice can sometimes be compared to using a modulus reduction. Note collide is as follows:

proc collide {level {accum ""}} {
    global a
    if {$level} {
        incr level -1
        collide $level "aj6$accum"
        collide $level "bap$accum"
    } else {
        set a($accum) 1
    }
}

Note to Donal: I know this collide proc doesn't collide, but it generates equal-length keys useful for testing.

% source hash2.tcl
595596 microseconds per iteration
32768 entries in table, 16384 buckets
number of buckets with 0 entries: 12289
number of buckets with 1 entries: 11
number of buckets with 2 entries: 36
number of buckets with 3 entries: 124
number of buckets with 4 entries: 242
number of buckets with 5 entries: 385
number of buckets with 6 entries: 473
number of buckets with 7 entries: 586
number of buckets with 8 entries: 580
number of buckets with 9 entries: 504
number of buckets with 10 entries: 408
number of buckets with 11 entries: 288
number of buckets with 12 entries: 204
number of buckets with 13 entries: 107
number of buckets with 14 entries: 70
number of buckets with 15 entries: 34
number of buckets with 16 entries: 19
number of buckets with 17 entries: 13
number of buckets with 18 entries: 8
number of buckets with 19 entries: 2
number of buckets with 20 entries: 1
number of buckets with 10000 or more entries: 0
average search distance for entry: 5.0

(Obviously the average distance is more than 5.0, more like 10.)

Using modulus 16567 to choose a bucket:

Modulus 16567 (hash%modulus = bucket)

Bucket Fill hashmod1 DJB hash from cdb
buckets with 1 = 4612 total 4612
buckets with 2 = 4489 total 13590
buckets with 3 = 2925 total 22365
buckets with 4 = 1439 total 28121
buckets with 5 = 582 total 31031
buckets with 6 = 206 total 32267
buckets with 7 = 60 total 32687
buckets with 8 = 9 total 32759
buckets with 9 = 1 total 32768
buckets with 0 = 2244

Bucket Fill hashmod2 Current Tcl hash
buckets with 1 = 4514 total 4514
buckets with 2 = 4432 total 13378
buckets with 3 = 2964 total 22270
buckets with 4 = 1469 total 28146
buckets with 5 = 570 total 30996
buckets with 6 = 214 total 32280
buckets with 7 = 52 total 32644
buckets with 8 = 12 total 32740
buckets with 9 = 2 total 32758
buckets with 10 = 1 total 32768
buckets with 0 = 2337

Bucket Fill hashmod3 FNV Hash
buckets with 1 = 4558 total 4558
buckets with 2 = 4504 total 13566
buckets with 3 = 2955 total 22431
buckets with 4 = 1418 total 28103
buckets with 5 = 610 total 31153
buckets with 6 = 181 total 32239
buckets with 7 = 55 total 32624
buckets with 8 = 9 total 32696
buckets with 9 = 8 total 32768
buckets with 0 = 2269

The current Tcl hash does better (internally) than FNV:

520529 microseconds per iteration
32768 entries in table, 16384 buckets
number of buckets with 0 entries: 8448
number of buckets with 1 entries: 928
number of buckets with 2 entries: 1296
number of buckets with 3 entries: 1348
number of buckets with 4 entries: 1176
number of buckets with 5 entries: 1024
number of buckets with 6 entries: 910
number of buckets with 7 entries: 576
number of buckets with 8 entries: 380
number of buckets with 9 entries: 196
number of buckets with 10 or more entries: 102
average search distance for entry: 3.2

tom jackson
|
From: Gustaf N. <ne...@wu...> - 2010-02-17 11:49:38
|
On 16.02.10 19:30, Tom Jackson wrote:
> The problem is not
> the hash function, it is generally the method of transforming the hash
> value into a bucket index.

Not sure what you refer to as the problem. If the hash function returns the same hash value for many keys, no bucket distribution mechanism (from hash value to index) will work, no matter whether the least significant bits or a modulo is used.

The anomaly of "collide" can be fixed via e.g. FNV, but there are other algorithms that seem to provide better results and which are faster. However, with all these functions it is still possible to exploit the linear search in the buckets with tailored sets of keys. That problem can be solved by replacing the linear search in the buckets with something with better properties (e.g. an AVL tree); but even here, the implementation would benefit from better hash functions (to avoid string comparisons). Another, even more complex task would be to replace the bucket management altogether by algorithms like linear hashing or extendible hashing, but that is an even larger project requiring much more evaluation and testing. Btw, that would be an interesting GSOC topic...

I was starting to look into the topic since xotcl spends a lot of time in hash lookups, so any improvements here would help xotcl (and other Tcl OO systems) as well. The situation with xotcl is in many aspects different from the "attack" scenarios: in an OO environment, many small hash tables are involved (the number of variables and methods per class is typically less than 20, non-compiled local variables of procs are in hash tables, etc.). Therefore I like the fast "classical" hash function of Tcl there. I would also be happy about a reduction of the constant cost of the Tcl hash machinery. On the first part, I am relaxed, since at least in some scenarios xotcl could provide its own custom hash type (as Tcl vars do).

-gustaf neumann
|
From: Tom J. <tom...@gm...> - 2010-02-17 18:41:53
|
On Wed, Feb 17, 2010 at 3:56 AM, Donal K. Fellows <don...@ma...> wrote:
>
> Mod is quite an expensive operation, at least on this machine. Applying
> a mod operation (mod a large prime) to the result of the classic hash
> before returning nearly takes the cost of hashing up to equivalence with
> FNV.

My idea was to just apply the mod once (to the final hash, and in the bucket selection code); however, on further testing the current Tcl hash produces values which (unless there is a collision) produce a good bucket distribution. You can't apply the mod inside the hash function because it doesn't have access to the size of the hash table (number of buckets). Also, the modulus is not "large"; it is just a prime close to the number of buckets. However, the current bucket code works very well with the current hash function; both FNV and the DJB hash perform worse.

>> The system only needs to be broken once. All that is required is a set
>> of keys (maybe a few thousand) which hash to the same value. After
>> that the entire algorithm is broken (assuming the goal is to poison
>> the bucket code with collisions). Assuming a 32 bit key-space, it
>> should be easy to find at least one set of 10000 strings which share
>> the same hash value.
>>
>> Do I care to find such a set? NO! But this is a triviality for
>> cryptographers.
>
> Remember, cryptographers are mathematicians. They have a definition of
> "trivial" which doesn't match that used by software engineers. I can
> remember a number of acquaintances state that providing examples once an
> existence proof was done was trivial. Can't say that I really bought
> their argument... :-)

The keyspace is only 32 bits; this is trivial by any measure. I guess Gustaf has already found several sets of keys. Of course, the real problem is that you only need the low-order bits of the hash to match, so the keyspace for the buckets is much smaller than 32 bits.

> In terms of hash attacks, they only have real fangs (given that we don't
> conflate hashcode equivalence with equality) if it is possible to make
> either:
>
> a) A large number of *short* keys with the same hashcode, or
> b) A substantial number of keys with the same hashcode as some
>    particular key from a set not nominated by the attacker (typically
>    defined by the communication protocol)
>
> The aim of the first one is to swamp the hash table with totally
> irrelevant things. The aim of the second is to retard required lookups.

b) fits the typical definition of a collision. Of course collisions have nothing to do with cryptography, since the key is always compared anyway. Hopefully the key comparison stops on the first mismatch, so the actual number of bits compared is more related to the length of a common prefix between all keys in one bucket. But since (again) the key is always compared, other types of attacks would be just as effective in consuming resources: a few very long keys which hash to the same value, or get into the same bucket. Also, the communication protocol itself will consume more resources transmitting the data, and the hashing and storage of large quantities of data is also an attack; in fact, that attack would be more effective if you use a more expensive hash function.

Like I said earlier: the utility of cryptography is found in the ability to transfer the expensive calculations to the attacker. If instead you provide a platform where the attacker can make you perform many expensive calculations, the attacker has already succeeded without lifting a finger.

It is much better to solve this problem at the application level: limit the number of headers, or the length of an entire request. Otherwise the attacker can just send millions of headers instead of thousands; what difference does it make how performance is degraded? DOS attacks are not sophisticated attacks. Maybe just send a very long URL and see if the whole URL gets written to disk in the log file. Eventually the disk fills up.

>> I know there is a huge bias against AOLserver, so I digress to Apache.
>> How does Apache handle headers?
>>
>> From http://thomas.eibner.dk/apache/table.html (Apache table API):
>>
>> "Tables in Apache works almost like hashes in Perl. The main
>> difference between the two is that you can have two of the same keys
>> in the Apache table API. The table lookups are done in a
>> case-insensitive manner."
>>
>> So for everyone who has some problem with AOLserver, maybe note that
>> headers in Apache are handled very similarly. More specifically they
>> are not handled as associative arrays or dictionaries.
>
> Checking the source to Apache (strictly the APR) I see that they're
> using hashing inside their implementation of tables. OK, they're using a
> truly trivial hash, but I suppose it's better than what they had before
> (a single linear list).
>
> Check for yourself at:
> http://svn.apache.org/repos/asf/apr/apr/trunk/tables/apr_tables.c

Note that Apache does not modify the key, it can store multiple copies of the same key, and it can search for keys in a case-insensitive manner. Unfortunately their case-insensitive search is limited to ASCII chars. Headers are also stored in a linear array (but it looks like you can actually only store two keys with the same name).

The only way to justify using an array instead of a linear list {{key value} {key value}} is if you don't think it is important to maintain the data in the form it was sent. This is a valid argument. But note that in order to add a key, or lappend to an existing key, you have to search the array first. A large array will consume more resources during creation than a list containing the same data.

tom jackson
|
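A sketch of the kind of application-level limit Tom suggests (the variable names and the limit of 100 are invented for the example; a real server would make these configurable):

    set maxHeaders 100
    set nHeaders 0
    foreach line $headerLines {
        if {[incr nHeaders] > $maxHeaders} {
            error "request rejected: too many header fields"
        }
        # ... parse $line and store the header ...
    }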
From: Gustaf N. <ne...@wu...> - 2010-02-18 13:21:56
|
On 17.02.10 19:41, Tom Jackson wrote:
> The keyspace is only 32 bits; this is trivial by any measure. I guess
> Gustaf has already found several sets of keys. Of course, the real
> problem is that you only need the low-order bits of the hash to match,
> so the keyspace for the buckets is much smaller than 32 bits.

Exactly. It is sufficient to attack a much smaller keyspace to achieve the reported slowdowns. To make a lookup in a table with 55000 entries a linear search, it is sufficient to find hash values with identical 14 low bits (in the current mapping from hash key to bucket). One can construct such key sets for every hash function in a C program in less than 20 seconds (finding e.g. 70000 keys with the same hash suffix as "hello", based on FNV).

As pointed out earlier, this attack is not possible when using a randomized seed and hash functions like murmur or lookup3, which mix the seed properly into the hash value, since on every start of Tcl they will return sufficiently different hash values for the same keys. Since Tcl supports custom hash tables, every C-level application can provide a different hash function for its own hash tables, with or without seeds.

In case there is still someone listening: below are some results from loading dict/words into an array (the script Donal posted) with 4 hash algorithms. They show that Tcl's classic hash function does very well; murmur is second, and the Perl hash function is the worst of these.

-gustaf neumann

CLASSIC   218022.7906 microseconds per iteration
MURMUR    219907.73859999998 microseconds per iteration
LOOKUP3   225928.0354 microseconds per iteration
PERL_HASH 249029.0566 microseconds per iteration
|
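To illustrate the seeding idea in isolation (this is not what Tcl ships, and it is deliberately simplified; Gustaf's point is that the seed must be mixed thoroughly into the hash, as murmur and lookup3 do, for the randomization to be effective):

    # pick a seed once per interpreter/process start
    set ::hashSeed [expr {int(rand() * 0x100000000)}]

    proc seededHash {s} {
        # start from the seed instead of a fixed offset basis
        set h [expr {$::hashSeed ^ 0x811c9dc5}]
        foreach c [split $s ""] {
            set h [expr {(($h ^ [scan $c %c]) * 0x01000193) & 0xffffffff}]
        }
        return $h
    }

An attacker who cannot learn the seed can no longer precompute a key set that lands in a single bucket, which is the property being discussed here.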
From: Lars H. <Lar...@re...> - 2010-02-15 23:52:49
|
Tom Jackson wrote:
> Hey,
>
> As far as I can tell this entire operation and the logic used to
> justify it is totally messed up.
>
> Unless I'm missing some additional code, one of the basic
> justifications for this change is completely wrong.
>
> That justification is the wiki entry which shows FNV somehow creates a
> hash key distribution that outsmarts the test program "collide". From
> my reading of the collide program it was designed to find collisions.
> Unfortunately the code seems to find X candidates which have the same
> hash value and then create an array.
>
> Guess what! This trick works for any hash function. As far as I can
> tell it works for the new FNV hash, and one which I tried, the DJB hash.

Yes, but it is extremely simple for the classic Tcl hash, because:

1. It has the property that if $x and $y are two equal-length strings which hash to the same 32-bit value, then $a$x$b and $a$y$b will be a pair of strings which hash to the same value too, for any strings $a and $b.

2. There are very many pairs of strings as short as 2 characters which hash to the same 32-bit value.

FNV escapes 1 by using ^ to add in the characters, and 2 by using a multiplier larger than 255. It's probably still possible to find strings $x and $y such that FNV($x) = FNV($y), but I suspect they'd need to be of length at least 5, and chances are pretty good that FNV($a$x) != FNV($a$y) (naively I'd say 255/256, but there could be strings $x and $y which are less sensitive to the prefix $a).

With the FNV hash function, an HTTP header parser can to some extent randomise the results of an attacker by picking a $salt prefix and using $salt$header as the hash entry for data that has to do with the $header prefix. With the classic hash function, that has no effect whatsoever; an attack works just as well no matter what $salt is chosen.

Lars Hellström
|
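Lars's point 1 is easy to check with the classicHash sketch given earlier in this thread (not a built-in command): "aj" and "ba" collide, and wrapping both in the same prefix and suffix preserves the collision:

    % classicHash aj
    979
    % classicHash ba
    979
    % expr {[classicHash xxajyy] == [classicHash xxbayy]}
    1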
From: Tom J. <tom...@gm...> - 2010-02-16 09:30:47
|
2010/2/15 Lars Hellström <Lar...@re...>:
> Tom Jackson wrote:
>> Hey,
>>
>> As far as I can tell this entire operation and the logic used to
>> justify it is totally messed up.
>>
>> Unless I'm missing some additional code, one of the basic
>> justifications for this change is completely wrong.
>>
>> That justification is the wiki entry which shows FNV somehow creates a
>> hash key distribution that outsmarts the test program "collide". From
>> my reading of the collide program it was designed to find collisions.
>> Unfortunately the code seems to find X candidates which have the same
>> hash value and then create an array.
>>
>> Guess what! This trick works for any hash function. As far as I can
>> tell it works for the new FNV hash, and one which I tried, the DJB hash.
>
> Yes, but it is extremely simple for the classic Tcl hash, because:
>
> 1. It has the property that if $x and $y are two equal-length strings which
> hash to the same 32-bit value, then $a$x$b and $a$y$b will be a pair of
> strings which hash to the same value too, for any strings $a and $b.
>
> 2. There are very many pairs of strings as short as 2 characters which hash
> to the same 32-bit value.
>
> FNV escapes 1 by using ^ to add in the characters, and 2 by using a
> multiplier larger than 255. It's probably still possible to find strings $x
> and $y such that FNV($x) = FNV($y), but I suspect they'd need to be of
> length at least 5, and chances are pretty good that FNV($a$x) != FNV($a$y)
> (naively I'd say 255/256, but there could be strings $x and $y which are
> less sensitive to the prefix $a).
>
> With the FNV hash function, an HTTP header parser can to some extent
> randomise the results of an attacker by picking a $salt prefix and using
> $salt$header as the hash entry for data that has to do with the $header
> prefix. With the classic hash function, that has no effect whatsoever; an
> attack works just as well no matter what $salt is chosen.

Here is a short testing script which compares the hash function in Tcl (internal) as well as three external hash functions. I use the collide proc to generate the array keys and display the stats for the internal bucket distribution. Then the keys are fed into the other procs and the resultant hash values are saved. The hash values are reduced modulo some prime number close to the number of buckets used by the internal hash array.

Here's what I noticed:

The original code (current Tcl hash code) is slow to compute [collide 15], almost three minutes. Just changing the internal hash function sped this up two orders of magnitude: just a few seconds. I guess this is the result of the collisions?

The internal bucket assignment is far worse of a problem than the hash function itself. With the exception of hash collisions, which obviously go into the same bucket no matter what, the bucket assignment code causes even more collisions.

The external hash functions perform similarly when buckets are determined by their value modulo some prime number. The current Tcl hash function also does okay with powers of two, but FNV and the DJB hash do not. When using the prime method, the distributions are all much better than the internal bucket assignment.

The FNV hash, when used as the internal hash function, somehow resists most of the bucket assignment problems. The results are still worse than the external FNV using a prime modulus.

Maybe there is a reason to go to all the trouble of finding a good hash function only to throw away most of the randomness in the higher-order bits. A prime modulus would maintain this randomness as much as possible.

# Note aj => ajr and ba => bas in collide:
proc collide {level {accum ""}} {
    global a
    if {$level} {
        incr level -1
        collide $level "ajr$accum"
        collide $level "bas$accum"
    } else {
        set a($accum) 1
    }
}

puts [time {collide 10}]
puts [array statistics a]
set names [array names a]

# Change modulus for different size arrays:
# Examples:
# 19 251 521 1021 2063 4889 10007 16567
namespace eval ::hash {
    variable modulus 1021
}

proc ::hash::hashString { string } {
    variable modulus
    set hash 5381
    foreach char [split $string ""] {
        scan $char %c char
        set hash [expr {(($hash << 5) + $hash) ^ $char}]
        set hash [string trimleft [string range $hash end-12 end] 0]
    }
    return [list $string [expr {$hash % $modulus}] $hash]
}

proc ::hash::hashString2 { string } {
    set hash 0
    variable modulus
    foreach char [split $string ""] {
        scan $char %c char
        set hash [expr {($hash + ($hash << 3) + $char)}]
        set hash [string trimleft [string range $hash end-12 end] 0]
    }
    return [list $string [expr {$hash % $modulus}] $hash]
}

proc ::hash::hashString3 { string } {
    variable modulus
    set hashPrime 1000193
    set hash [expr 0x811c9dc5]
    foreach char [split $string ""] {
        scan $char %c char
        set hash [expr {($hash * $hashPrime) ^ $char}]
        set hash [string trimleft [string range $hash end-12 end] 0]
    }
    return [list $string [expr {$hash % $modulus}] $hash]
}

foreach word $names {
    set hash1 [::hash::hashString $word]
    set hash2 [::hash::hashString2 $word]
    set hash3 [::hash::hashString3 $word]
    incr hashmod1([lindex $hash1 1])
    incr hashmod2([lindex $hash2 1])
    incr hashmod3([lindex $hash3 1])
}

set desc1 "DJB hash from cdb"
set desc2 "Current Tcl hash"
set desc3 "FNV Hash"

foreach array {1 2 3} {
    foreach {mod count} [array get hashmod$array] {
        incr bucket${array}($count)
    }
}

if {0} {
    foreach array {1 2 3} {
        puts "hashmod$array [set desc$array]"
        foreach name [lsort -integer [array names hashmod$array]] {
            puts "$name => [set hashmod${array}($name)]"
        }
    }
}

puts "Modulus $::hash::modulus (hash%modulus = bucket)"
foreach array {1 2 3} {
    puts "Bucket Fill hashmod$array [set desc$array]"
    set total 0
    set usedBuckets 0
    foreach name [lsort -integer [array names bucket$array]] {
        set buckets [expr {[set bucket${array}($name)] * $name}]
        incr total $buckets
        incr usedBuckets [set bucket${array}($name)]
        puts "buckets with $name = [set bucket${array}($name)] total $total"
    }
    puts "buckets with 0 = [expr $::hash::modulus - $usedBuckets]\n\n"
}
|
From: Donal K. F. <don...@ma...> - 2010-02-16 22:44:30
Attachments:
donal_k_fellows.vcf
|
On 16/02/2010 18:30, Tom Jackson wrote:
> Try out my Tcl hash functions from my previous email. The problem is not
> the hash function; it is generally the method of transforming the hash
> value into a bucket index. The better-performing hashes shown in your
> post are the result of a lot of randomness in the low-order bits. This
> is to be expected of a hash which mixes all the bits in with the new
> bits.
>
> If the entire hash value was used, it could make a lot of sense to
> increase the sophistication of the hash function, but it would be much
> easier to just take a prime modulus of the hash value once. This
> would take into account all the bits.

Actually, if we were using lookup3 then the entropy is distributed pretty evenly across the entire key. It's a very sophisticated hash function that's resistant to a lot of odd abuses.

Donal.
|
From: Donal K. F. <don...@ma...> - 2010-02-17 11:56:54
Attachments:
hashspeed.tcl
donal_k_fellows.vcf
|
On 17/02/2010 01:36, Tom Jackson wrote:
> No shit! The cost is extreme.

I've been probing the cost of using various hash functions (I've done the comparison of the classic algorithm against FNV and lookup3) when used in Tcl, and the attached script contains preliminary results for two keysets:

   - all dictionary words
   - the numbers from 0 to 500k

The tests basically check the cost of doing 9 hash lookups per key; array lookups are close to as pure a test of this as we can generate in Tcl. (The cost of the "machinery" is about a third of the total time.) The cost of FNV over these (presumably near-normal) keys is 3-10% over the classic algorithm. The cost of lookup3 is 10-30% over the classic algorithm. Other key-sets might have other costs.

Note that the only thing that is changing between the runs is the contents of the function TclHashObjKey in tclObj.c, and the change was done by altering #ifdefs only. Everything else is just Tcl HEAD (plus a compiled-in lookup3.c; I'm not retyping *that* code!)

> The problem here is that in order to get good distribution to all
> bits, you have to use a near cryptographic quality hash. A
> cryptographic hash can be easily truncated to generate near random
> bucket indexes. This is the desired goal, but it is much more easily
> achieved most of the time with a single modulus operation which takes
> into account all bits. Which is easier: one modulus operation or a
> near-cryptographic hash which can be safely truncated?

Mod is quite an expensive operation, at least on this machine. Applying a mod operation (mod a large prime) to the result of the classic hash before returning nearly takes the cost of hashing up to equivalence with FNV.

> The system only needs to be broken once. All that is required is a set
> of keys (maybe a few thousand) which hash to the same value. After
> that the entire algorithm is broken (assuming the goal is to poison
> the bucket code with collisions). Assuming a 32 bit key-space, it
> should be easy to find at least one set of 10000 strings which share
> the same hash value.
>
> Do I care to find such a set? NO! But this is a triviality for
> cryptographers.

Remember, cryptographers are mathematicians. They have a definition of "trivial" which doesn't match that used by software engineers. I can remember a number of acquaintances state that providing examples once an existence proof was done was trivial. Can't say that I really bought their argument... :-)

In terms of hash attacks, they only have real fangs (given that we don't conflate hashcode equivalence with equality) if it is possible to make either:

  a) A large number of *short* keys with the same hashcode, or
  b) A substantial number of keys with the same hashcode as some
     particular key from a set not nominated by the attacker (typically
     defined by the communication protocol)

The aim of the first one is to swamp the hash table with totally irrelevant things. The aim of the second is to retard required lookups.

> I know there is a huge bias against AOLserver, so I digress to Apache.
> How does Apache handle headers?
>
> From http://thomas.eibner.dk/apache/table.html (Apache table API):
>
> "Tables in Apache works almost like hashes in Perl. The main
> difference between the two is that you can have two of the same keys
> in the Apache table API. The table lookups are done in a
> case-insensitive manner."
>
> So for everyone who has some problem with AOLserver, maybe note that
> headers in Apache are handled very similarly. More specifically they
> are not handled as associative arrays or dictionaries.

Checking the source to Apache (strictly the APR) I see that they're using hashing inside their implementation of tables. OK, they're using a truly trivial hash, but I suppose it's better than what they had before (a single linear list). Check for yourself at:

   http://svn.apache.org/repos/asf/apr/apr/trunk/tables/apr_tables.c

> All I can say is that the more I investigate the original Tcl hash,
> the more my respect grows. It seems to perform better most of the
> time.

It's very very fast, but weak. With non-malicious inputs, its weakness doesn't really matter all that much and its speed is nice. With malicious inputs, which are easy to create, it's no better than a linear list (plus some memory costs). The alternative hash functions that are recommended in the literature (FNV, lookup3) are slower, but harder to push into the ill-performing case (in the case of lookup3, it is believed to require effort nearly equivalent to just scanning through all possible inputs, which isn't a threat; FNV is faster but weaker).

Donal.
|
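The hashspeed.tcl attachment is not reproduced here; a rough sketch of the shape of such a test, as described above (several array reads per key so that hashing dominates the timed loop), might be:

    proc bench {keys} {
        foreach k $keys { set a($k) 1 }
        time {
            foreach k $keys {
                set v $a($k); set v $a($k); set v $a($k)
                set v $a($k); set v $a($k); set v $a($k)
                set v $a($k); set v $a($k); set v $a($k)
            }
        } 10
    }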
From: Donal K. F. <don...@ma...> - 2010-02-18 09:23:49
Attachments:
donal_k_fellows.vcf
|
On 17/02/2010 18:41, Tom Jackson wrote:
> My idea was to just apply the mod once (to the final hash and in the
> bucket selection code), however, on further testing the current Tcl
> hash produces values which (unless a collision) produce a good bucket
> distribution. You can't apply the mod inside the hash function
> because it doesn't have access to the size of the hash table (number
> of buckets).

You could (except in the hash function in tclLiteral.c), as you've got access
to the tablePtr. But the problem with that is that when you resize the hash
table you've got to recalculate the hashcode for all the entries. That pushes
up the cost of resizing quite a lot.

> The keyspace is only 32 bits, this is trivial by any measure. I guess
> Gustaf has already found several sets of keys. Of course, the real
> problem is that you only need the low order bits of the hash to match,
> so the keyspace for the buckets is much smaller than 32 bits.

It's not so trivial if it takes around 2**31 attempts to find another key
with a given hash; that sort of brute-force trouble is unavoidable without
using a much larger hashcode, and that pushes the cost up. The problem with
the hashing scheme that we've currently got is that there are just too many
ways to generate collisions; concatenating colliding strings with constant
non-colliding strings generates more colliding strings. But the cost to most
Tcl code of fixing this is too high. Most Tcl code just doesn't have the
collision problem.

Because of *that*, I've reverted to the old hash function. It's fast, very
very fast; trying to better it for speed is hard since it is doing exactly
what modern microprocessors have been optimized to do. It's also actually
good enough for 99.$lots% of all uses, as it produces reasonable results with
short keys (much of the research on hashes seems to focus on longer keys).
Other approaches are needed for some apps, sure, but they can be shuffled off
to extensions without us feeling too bad.

I hate reverting code. But in this case, the preponderance of the evidence is
that it's not fair on most code to make it bear the performance cost of the
malicious-key-choice edge cases. (It was larger than I expected.)

> b) fits the typical definition of a collision. Of course collisions
> have nothing to do with cryptography, since the key is always compared
> anyway. Hopefully the key comparison stops on the first mismatch, so
> the actual number of bits compared is more related to the length of a
> common prefix between all keys in one bucket.

One of the problems with Tcl's hash is that it's trivial to make other values
that hash to the same thing as receiver-nominated keys. That is, even if you
don't look at keys that you don't know about (other than to store them in
case some part of the processing needs them; this is a reasonable mode of
operation when building plugin-extensible apps) you still bear the cost.

> Note that Apache does not modify the key, it can store multiple copies
> of the same key, and can search for keys in a case-insensitive manner.

Unrelated to what we do. We implement a map, they implement a slightly
different structure (a multimap, IIRC). Hashing techniques would work just as
well with one as the other. Making Tcl's hash tables work that way wouldn't
be too hard, but isn't really about changing that part of the code.

> The only way to justify using an array instead of a linear list {{key
> value} {key value}} is if you don't think it is important to maintain
> the data in the form it was sent. This is a valid argument. But note
> that in order to add a key, or lappend to an existing key, you have to
> search the array first. A large array will consume more resources
> during creation than a list containing the same data.

I'm not criticising other parts of what they're doing, just the process of
going from key to bucket chain when adding. Theirs is particularly lame, as
it loads all keys that start with the same letter into the same bucket (so
trivial to attack it's unfunny), but it's not as lame as when it was all just
an alist, as it was in older versions of Apache; adds might have been quick,
but lookups would have sucked...

Donal.
|
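The "too many ways to generate collisions" point follows directly from the
structure of the classic hash. Because it consumes characters left to right
as h = h*9 + c, two strings with equal hashes stay equal-hashed when the same
suffix is appended: h(p.x) = h(p)*9^len(x) + h(x) (mod 2^32). The two-character
blocks "aj" and "ba" already collide, so every string assembled from those
blocks collides too. A small pure-Tcl sketch (the proc is a script-level
model of the classic hash, not the C code itself):

proc classicHash {s} {
    set h 0
    foreach ch [split $s ""] {
        set h [expr {($h * 9 + [scan $ch %c]) & 0xffffffff}]
    }
    return $h
}

# The two blocks collide...
puts [list [classicHash aj] [classicHash ba]]
# ...and so does every string built out of them (2^n strings of n blocks):
foreach s {ajajba ajbaaj baajaj bababa} {
    puts "$s -> [classicHash $s]"
}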
From: Tom J. <tom...@gm...> - 2010-02-18 21:31:53
|
On Thu, Feb 18, 2010 at 1:23 AM, Donal K. Fellows <don...@ma...> wrote:
> On 17/02/2010 18:41, Tom Jackson wrote:
>> The only way to justify using an array instead of a linear list {{key
>> value} {key value}} is if you don't think it is important to maintain
>> the data in the form it was sent. This is a valid argument. But note
>> that in order to add a key, or lappend to an existing key, you have to
>> search the array first. A large array will consume more resources
>> during creation than a list containing the same data.
>
> I'm not criticising other parts of what they're doing, just the process
> of going from key to bucket chain when adding. Theirs is particularly
> lame, as it loads all keys that start with the same letter into the same
> bucket (so trivial to attack it's unfunny), but it's not as lame as when
> it was all just an alist, as it was in older versions of Apache; adds
> might have been quick, but lookups would have sucked...

I really wish I could figure out how anyone ever got the idea that arrays are
better than lists if the data is only used a few times. Somehow you have to
pay for the creation of the array. Arrays make sense if you need continuous
access to the same data over a long period of time.

Another important point about Tcl arrays is that the history suggests they
were designed as a method for mapping strings to opaque structures (or
pointers, whatever). The string-data to binary-data mapping is one of the
most important features of the Tcl language. So you can't use a list to map
strings to an opaque structure, because lists are "values". Somehow this
important fact is being turned upside down: we are focusing on the key and
not the value.

Now, back to performance. You have to pay for the creation and maintenance of
the array; this is an up-front cost. If you have 10 minutes to visit
Disneyland, your per-ride cost will be very high.

Enough talk. Here is a script which uses the FNV algorithm to create an
array, using the collide keys. After the script I have timings (create the
array with the keys, create a name-value list with the keys). The result is
that it takes so much time to create the array that I have plenty of time
left over to do very slow linear searches... and I can do them
case-insensitively. In addition, I can recover the order of the input list.
Why anyone thinks it is a good idea to spend time destroying information is
beyond me, but basically that is what any list->array transform gives you.

# original collide proc
proc collide {level {accum ""} } {
    global a
    if {$level} {
        incr level -1
        collide $level "aj$accum"
        collide $level "ba$accum"
    } else {
        lappend a($accum) 1
    }
}

# new proc just generates keys
proc collideNames {level {accum ""} } {
    global names
    if {$level} {
        incr level -1
        collideNames $level "aj$accum"
        collideNames $level "ba$accum"
    } else {
        lappend names $accum
    }
}

# generate lots of keys
collideNames 15

set hostChoice [list host Host hosT hOst hoSt HoSt HOst hoST]
set valueIndex 0
set hostIndex 0

# create array using keys; note collide runs faster because it is a compiled proc
# the loop here matches the following list creation loop
puts "create host array: [time {
    foreach name $names {
        lappend a($name) [incr valueIndex]
        if {!($valueIndex % 10007)} {
            incr hostIndex
            lappend a([lindex $hostChoice $hostIndex]) $valueIndex
        }
    }
}]"
puts [array statistics a]

set valueIndex 0
set hostIndex 0

puts "create host list: [time {
    foreach name $names {
        lappend headerList [list $name [incr valueIndex]]
        if {!($valueIndex % 10007)} {
            incr hostIndex
            lappend headerList [list [lindex $hostChoice $hostIndex] $valueIndex]
        }
    }
}]"

puts "finding First Host: [time {set hosts [lsearch -inline -index 0 $headerList Host]}]"
puts "finding First Host nocase: [time {set hosts [lsearch -inline -nocase -index 0 $headerList Host]}]"
puts "finding All Hosts: [time {set hosts [lsearch -inline -all -index 0 $headerList Host]}]"
puts "finding All Hosts nocase: [time {set hosts [lsearch -inline -all -nocase -index 0 $headerList Host]}]"
puts $hosts

### End script

The only thing interesting here is the extra time required to create the
array. But let's not even get into the problems of searching for a "host"
header with no case:

% array names a -regexp (h|H)(o|O)(s|S)(t|T)
hosT hOst Host
% time {array names a -regexp (h|H)(o|O)(s|S)(t|T)}
120592 microseconds per iteration

Then you have to retrieve the values. Here is what I get (the array first,
then the list representation):

create host array: 1025203 microseconds per iteration (wow! over one second)
32771 entries in table, 16384 buckets
number of buckets with 0 entries: 8342
number of buckets with 1 entries: 604
number of buckets with 2 entries: 1144
number of buckets with 3 entries: 1657
number of buckets with 4 entries: 1616
number of buckets with 5 entries: 1271
number of buckets with 6 entries: 848
number of buckets with 7 entries: 487
number of buckets with 8 entries: 237
number of buckets with 9 entries: 111
number of buckets with 10 entries: 47
number of buckets with 11 entries: 15
number of buckets with 12 entries: 4
number of buckets with 14 entries: 1
number of buckets with 10000 or more entries: 0
average search distance for entry: 3.0
create host list: 381651 microseconds per iteration (2.5 times faster)
finding First Host: 1867 microseconds per iteration
finding First Host nocase: 1697 microseconds per iteration
finding All Hosts: 6618 microseconds per iteration
finding All Hosts nocase: 40670 microseconds per iteration
Host keys: {Host 10007} {hosT 20014} {hOst 30021}

Somewhere the conventional wisdom about list vs. array access needs to be
discredited.

tom jackson
|
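For what it's worth, the list-of-pairs representation benchmarked above can
be wrapped in a couple of tiny helpers that keep the Apache-table-like
behaviour discussed earlier: insertion order preserved, duplicate keys
allowed, case-insensitive lookups. The helper names below are invented for
illustration; this is a sketch, not a proposal for the core.

# Hypothetical helpers around a plain list of {key value} pairs: adds never
# search, order and duplicates are kept, lookups use lsearch with -nocase.

proc header-add {listVar key value} {
    upvar 1 $listVar hdrs
    lappend hdrs [list $key $value]
}

proc header-get {hdrs key} {
    # Return every value stored under $key, matched case-insensitively,
    # oldest first.
    set values {}
    foreach pair [lsearch -inline -all -nocase -index 0 $hdrs $key] {
        lappend values [lindex $pair 1]
    }
    return $values
}

set headers {}
header-add headers Host example.com
header-add headers Accept text/html
header-add headers hOsT evil.example
puts [header-get $headers host]    ;# prints: example.com evil.example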