From: Paul A. <pau...@gm...> - 2010-02-17 01:08:21
|
I am trying to eliminate compiler warnings and have run into a problem with the definition of Tcl_SetResult. The code in question is:

    Tcl_SetResult(interp, Tcl_ErrnoMsg(Tcl_GetErrno()), TCL_STATIC);

which gives the compiler (gcc) warning:

    warning: passing argument 2 of ‘Tcl_SetResult’ discards qualifiers from pointer target type

Argument 2 of Tcl_SetResult takes a char * argument, but Tcl_ErrnoMsg returns a const char *. The final argument TCL_STATIC goes along with the const char *, but does not suppress the warning. Besides just ignoring the warning, the only other solution I can think of is a separate Tcl_SetResultStatic(Tcl_Interp *interp, const char *result). |
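A common way out of this, short of adding a new public API, is a tiny wrapper that confines the const-discarding cast to one place. The sketch below only illustrates the idea described above: Tcl provides no Tcl_SetResultStatic, and the helper name here is hypothetical.

    #include <tcl.h>

    /*
     * Hypothetical helper, not part of the Tcl API. With TCL_STATIC,
     * Tcl promises not to modify or free the string, so discarding
     * the const qualifier is confined to this one well-understood spot.
     */
    static void
    SetResultStatic(Tcl_Interp *interp, const char *result)
    {
        Tcl_SetResult(interp, (char *) result, TCL_STATIC);
    }

    /* Usage, under the same assumption:
     *   SetResultStatic(interp, Tcl_ErrnoMsg(Tcl_GetErrno()));
     */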
From: Donal K. F. <don...@ma...> - 2010-02-16 22:44:30
|
On 16/02/2010 18:30, Tom Jackson wrote: > Try out my Tcl hash functions in my previous email. The problem is not > the hash function, it is generally the method of transforming the hash > value into a bucket index. The better performing hashes shown in your > post are the result of a lot of randomness in the low order bits. This > is to be expected of a hash which mixes all the bits in with the new > bits. > > If the entire hash value was used, it could make a lot of sense to > increase the sophistication of the hash function, but it would be much > easier to just take a prime modulus of hash value just once. This > would take into account all the bits. Actually, if we were using lookup3 then the entropy is distributed pretty evenly across the entire key. It's a very sophisticated hash function that's resistant to a lot of odd abuses. Donal. |
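For readers following the thread, the two bucket-selection strategies being contrasted look roughly like the sketch below. The constants and function names are placeholders for illustration, not code from tclHash.c.

    #define NUM_BUCKETS 16384       /* a power of two, as in Tcl's resizing */
    #define BUCKET_PRIME 16567      /* a prime near the bucket count */

    /* Mask-style selection keeps only the low-order bits of the hash,
     * so it is only as good as the entropy in those bits... */
    static unsigned int MaskIndex(unsigned int hash) {
        return hash & (NUM_BUCKETS - 1);
    }

    /* ...whereas a prime modulus folds every bit of the hash into the
     * bucket index, at the price of an integer division. */
    static unsigned int PrimeIndex(unsigned int hash) {
        return hash % BUCKET_PRIME;
    }

This is the crux of Tom's argument: with a modulus reduction, even a simple hash whose low-order bits mix poorly can still spread keys across buckets.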
From: Donal K. F. <don...@ma...> - 2010-02-16 22:36:28
|
On 16/02/2010 15:25, Gustaf Neumann wrote: > if a high degree of binary compatibility is the only concern: > it should be possible to keep the existing machinery in 8.* in > parallel to a more optimized version, but the main question is most > like, is it worth the effort? > > For my current understanding, the main issue is the whole machinery, > not only the hash function. The issues are > (a) uneven distribution of entries in the bucket, > (b) linear search in the buckets > (c) some more or less constant overhead, which could be improved. > > One approach is to improve the bucket-growing policy (e.g. linear > hashing, extendible hashing). This would address (a) and (b), but > require some more work. > Another, conceptually more simple approach is to replace the linear > search in the buckets by e.g. avl trees (these tree keep themselves > balanced). This would address (b), compatibility would not be an > issue. > Changing the hash function just addresses (a). Switching to a wholly new map system has the disadvantage of needing really heavy testing. In other words, it's a total replacement for the contents of tclHash.c (and tclLiteral.c too). Changing the hash function has much smaller "engineering impact". The main benefit of going to a balanced tree is that it provides a strong upper bound on the cost of mapping a key to a value for any particular size of map. Whether that's as low a cost as for Tcl's hash tables in the normal case, well, I don't know. For a balanced binary tree with 32k elements, I'd expect ~15 comparisons per lookup. With hash tables, we're talking 2 or 3 (by your figures). So a binary tree is too simplistic, and we'd be into much more complex territory. Sorting all that out would make tinkering with hash functions seem rather like small beer. It'd probably attract less comment on this list though (by the Bikeshed Principle). ;-) > Well, HashVarKey in TclVar has the key length - so there is no > strlen() required there. This is exactly the place, where "collide" > stresses the hash function. I was more concerned about HashStringKey in tclHash.c, but I probably ought to measure instead of worrying about it. :-) > When i try different hash functions on "colide", FNV is not yielding > the best results. i get better figures with lookup8 (tested on 64bit) > and murmur (references below). Note as well, that the average search > distance decreases from FNV 3.0 to 2.0 with lookup8 and murmur. Also > the number of buckets without entries is much better in lookup3/8 and > murmur. As noted before, lookup3 is probably the best hash. It's reported to be highly resistant to attack, and is also easy to cut down for different hash sizes. I only didn't go with it because it is also a big and scary piece of code, and it doesn't perform so well with C string keys because of the need to perform a strlen() first. Might not be a real concern though; I need to do some testing. (I do know from the various pages that a thorough googling dredges up that attacking lookup3 is reckoned to be very difficult with a sub-brute-force technique.) IIRC, some of the others make alignment assumptions, and I know that those assumptions aren't safe for Tcl. For example, I'm pretty sure that command and namespace resolution don't guarantee to use aligned strings, and not all systems cope as well as Intel with non-aligned accesses - I've had problems with this on SPARC in the past. AIUI, murmur is one of the hashes with this problem; it assumes that all non-aligned bits are at the end. 
A big advantage of FNV was that it was easy to see that it was safe that way since it stuck to character-level accesses; I knew it would not crash inside its implementation. Auditing the other hash functions for this sort of thing is more challenging. > PS: i have the suspicion that the avg search length reported by > "array statistics" is not correct, since it reports always values > ending with ".0"; i have added values which i believe are more > accurate in brackets. I've not had problems with (admittedly trivial) testing. You're measuring at a point when there are about two times as many values as buckets, so perfect loading will produce two values per bucket. Since we're not always using the same number of values per hash table, it's useful to look at the distribution for different numbers of values per bucket. (I suspect the problems with FNV relate to how Tcl handles the hashcode when mapping it to a bucket.) It's quite possible that you'll find similar patterns at other loading ratios. I just note that these things are not at all constant. Donal. |
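For concreteness, a byte-at-a-time FNV-1a over a NUL-terminated string looks like the sketch below, using the published 32-bit offset basis and prime. It is a generic illustration, not the code that went into Tcl; it is shown because it has the two properties discussed above: it needs no strlen() and makes no alignment assumptions.

    /*
     * Generic FNV-1a over a NUL-terminated C string, published 32-bit
     * parameters. Walks the key one byte at a time (no strlen(), no
     * aligned loads) and mixes every byte into all 32 bits via the
     * large prime multiplier.
     */
    static unsigned int
    Fnv1aString(const char *string)
    {
        unsigned int hash = 0x811c9dc5u;    /* FNV offset basis */
        unsigned char c;

        while ((c = (unsigned char) *string++) != 0) {
            hash ^= c;
            hash *= 16777619u;              /* FNV prime */
        }
        return hash;
    }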
From: Daniel A. S. <da...@us...> - 2010-02-16 22:29:42
|
On Feb 16, 2010, at 12:43 PM, Jan Nijtmans wrote: > 2010/2/16 Daniel A. Steffen <da...@us...>: >>> reverted earlier rename from tcl*Stubs to >>> tcl*ConstStubs, it's not necessary at all. >> >> On Feb 5, 2010, at 20:53, nij...@us... wrote: >> >>> Eliminate the need for an extra Stubs Pointer >>> for adressing a static stub table: Just change >>> the exported table from static to MODULE_SCOPE. >> >> for the record, these were actually quite deliberate changes and part of the overall "const stubs tables" and "tclStubsPtr out of libtcl.so" efforts which were discussed at some length on email & chat back in 2007. > > Don't worry, I didn't revert any of those deliberate original > changes. However, making all stub tables static, and then > use an extra pointer to make the main stub table accessible > was rather clumsy. Therefore I extended genStubs.tcl to find > out which one is intended to be exported. It's a simplification, > retaining the original idea. In that process, I changed the name > of the stub table symbols, later discovering that that was > not necessary at all. So I reverted part of my own later changes. well, no... the name of the MODULE_SCOPE symbols used by Tcl_CreateInterp() and TclTommath_Init() before your latest changes were tclConstStubsPtr and tclTomMathConstStubsPtr. Now they are tclStubs and tclTomMathStubs, which is back to what they were in 2008 before [Patch 1938497] went in, c.f. http://tcl.cvs.sourceforge.net/viewvc/tcl/tcl/generic/tclStubInit.c?r1=1.152&r2=1.153 http://tcl.cvs.sourceforge.net/viewvc/tcl/tcl/generic/tclBasic.c?r1=1.296&r2=1.297 again the intent was (in particular for tclStubs) to have an _different_ exported name from before [Patch 1938497] to allow detection of inadvertent external use of these symbols in cases where MODULE_SCOPE is not effective. |
From: Jan N. <nij...@us...> - 2010-02-16 20:43:17
|
2010/2/16 Daniel A. Steffen <da...@us...>: >> reverted earlier rename from tcl*Stubs to >> tcl*ConstStubs, it's not necessary at all. > > On Feb 5, 2010, at 20:53, nij...@us... wrote: > >> Eliminate the need for an extra Stubs Pointer >> for adressing a static stub table: Just change >> the exported table from static to MODULE_SCOPE. > > for the record, these were actually quite deliberate changes and part of the overall "const stubs tables" and "tclStubsPtr out of libtcl.so" efforts which were discussed at some length on email & chat back in 2007. Don't worry, I didn't revert any of those deliberate original changes. However, making all stub tables static, and then use an extra pointer to make the main stub table accessible was rather clumsy. Therefore I extended genStubs.tcl to find out which one is intended to be exported. It's a simplification, retaining the original idea. In that process, I changed the name of the stub table symbols, later discovering that that was not necessary at all. So I reverted part of my own later changes. Regards, Jan Nijtmans |
From: Daniel A. S. <da...@us...> - 2010-02-16 18:41:05
|
On Feb 15, 2010, at 22:56, nij...@us... wrote: > Update of /cvsroot/tcl/tcl/tools > In directory sfp-cvsdas-2.v30.ch3.sourceforge.com:/tmp/cvs-serv22621/tools > > Modified Files: > genStubs.tcl > Log Message: > reverted earlier rename from tcl*Stubs to > tcl*ConstStubs, it's not necessary at all. On Feb 5, 2010, at 20:53, nij...@us... wrote: > Eliminate the need for an extra Stubs Pointer > for adressing a static stub table: Just change > the exported table from static to MODULE_SCOPE. for the record, these were actually quite deliberate changes and part of the overall "const stubs tables" and "tclStubsPtr out of libtcl.so" efforts which were discussed at some length on email & chat back in 2007. the intent behind the name change was to uncover any accidental use of the private (implementation detail) tcl*Stubs symbols outside of tclStubInit.c and Tcl_CreateInterp() resp. TclTommath_Init() (note that MODULE_SCOPE is not guaranteed to be effective on unix platforms, e.g. it has no effect with older gcc or non-gcc compilers, so accidental use of these exported symbols is quite possible) I've been out of the loop for the last few months, so I'll assume that these reversals are being made for sound reasons and have been equally widely discussed, just wanted to note the background behind the original change... Now back to our regular programming on the exciting world of hash functions ;-) |
From: Tom J. <tom...@gm...> - 2010-02-16 18:31:04
|
Gustaf,

Try out my Tcl hash functions in my previous email. The problem is not the hash function, it is generally the method of transforming the hash value into a bucket index. The better performing hashes shown in your post are the result of a lot of randomness in the low order bits. This is to be expected of a hash which mixes all the bits in with the new bits.

If the entire hash value was used, it could make a lot of sense to increase the sophistication of the hash function, but it would be much easier to just take a prime modulus of the hash value just once. This would take into account all the bits. This would help all simple hash functions.

For instance, here are the results with a patched tcl8.6 which show how bad the bucket choice can sometimes be compared to using a modulus reduction. Note that collide is as follows:

    proc collide {level {accum ""}} {
        global a
        if {$level} {
            incr level -1
            collide $level "aj6$accum"
            collide $level "bap$accum"
        } else {
            set a($accum) 1
        }
    }

Note to Donal: I know the collide proc doesn't collide, but it generates equal length keys useful for testing.

    % source hash2.tcl
    595596 microseconds per iteration
    32768 entries in table, 16384 buckets
    number of buckets with 0 entries: 12289
    number of buckets with 1 entries: 11
    number of buckets with 2 entries: 36
    number of buckets with 3 entries: 124
    number of buckets with 4 entries: 242
    number of buckets with 5 entries: 385
    number of buckets with 6 entries: 473
    number of buckets with 7 entries: 586
    number of buckets with 8 entries: 580
    number of buckets with 9 entries: 504
    number of buckets with 10 entries: 408
    number of buckets with 11 entries: 288
    number of buckets with 12 entries: 204
    number of buckets with 13 entries: 107
    number of buckets with 14 entries: 70
    number of buckets with 15 entries: 34
    number of buckets with 16 entries: 19
    number of buckets with 17 entries: 13
    number of buckets with 18 entries: 8
    number of buckets with 19 entries: 2
    number of buckets with 20 entries: 1
    number of buckets with 10000 or more entries: 0
    average search distance for entry: 5.0

(Obviously the average distance is more than 5.0, more like 10.)

Using modulus 16567 to choose a bucket:

    Modulus 16567 (hash%modulus = bucket)

    Bucket Fill hashmod1 DJB hash from cdb
    buckets with 1 = 4612 total 4612
    buckets with 2 = 4489 total 13590
    buckets with 3 = 2925 total 22365
    buckets with 4 = 1439 total 28121
    buckets with 5 = 582 total 31031
    buckets with 6 = 206 total 32267
    buckets with 7 = 60 total 32687
    buckets with 8 = 9 total 32759
    buckets with 9 = 1 total 32768
    buckets with 0 = 2244

    Bucket Fill hashmod2 Current Tcl hash
    buckets with 1 = 4514 total 4514
    buckets with 2 = 4432 total 13378
    buckets with 3 = 2964 total 22270
    buckets with 4 = 1469 total 28146
    buckets with 5 = 570 total 30996
    buckets with 6 = 214 total 32280
    buckets with 7 = 52 total 32644
    buckets with 8 = 12 total 32740
    buckets with 9 = 2 total 32758
    buckets with 10 = 1 total 32768
    buckets with 0 = 2337

    Bucket Fill hashmod3 FNV Hash
    buckets with 1 = 4558 total 4558
    buckets with 2 = 4504 total 13566
    buckets with 3 = 2955 total 22431
    buckets with 4 = 1418 total 28103
    buckets with 5 = 610 total 31153
    buckets with 6 = 181 total 32239
    buckets with 7 = 55 total 32624
    buckets with 8 = 9 total 32696
    buckets with 9 = 8 total 32768
    buckets with 0 = 2269

The current Tcl hash does better (internally) than FNV:

    520529 microseconds per iteration
    32768 entries in table, 16384 buckets
    number of buckets with 0 entries: 8448
    number of buckets with 1 entries: 928
    number of buckets with 2 entries: 1296
    number of buckets with 3 entries: 1348
    number of buckets with 4 entries: 1176
    number of buckets with 5 entries: 1024
    number of buckets with 6 entries: 910
    number of buckets with 7 entries: 576
    number of buckets with 8 entries: 380
    number of buckets with 9 entries: 196
    number of buckets with 10 or more entries: 102
    average search distance for entry: 3.2

tom jackson |
From: Donal K. F. <don...@ma...> - 2010-02-16 16:46:04
|
On 16/02/2010 15:25, Gustaf Neumann wrote: > How bad is this in practice? This "attack" is not more than a bad > performance in a pathological case, or? It's exactly a performance issue. The issue is exactly that it is far too easy to trigger this worst-case behaviour. When the behaviour is triggered, Tcl is still functioning correctly though. It never thinks two keys are the same unless they've been compared using full equality. (It just happens to think they're different if the hashcodes are different; it's a one-way implication.) > Concerning FNV: for string lengths less then two characters, FNV and > the classic Tcl function return the same hash value (just a side > note). Only because I'd made a mistake. :-) The FNV algorithm uses a baseline value which means that the hashcodes are different. > As Lars pointed out, it is in general possible to find strings > $x and $y such that FNV($x) = FNV($y). But for the "collision attack" > hash-equality is not required. The hash function implies a linear > search (in the buckets), if > > INDEX(Hash($x)) == INDEX(Hash($x)) > > no matter which hash function is used (where INDEX is the function to > calculate from the hash value the bucket). Linear search is only a problem > when the number of entries in the bucket is high. Precisely. Even 8 values per bucket isn't a big problem. The problem comes when you go into the realm of thousands. That's when it matters. > It is very easy to come up with values, where also with FNV > the hash lookup becomes a linear search. Consider (with FNV > compiled in): [...] > In both cases, all entries are added into just one bucket. I see no > reason, why this can't happen for every other reasonable table size. If the hashcodes are the same, the problem happens. If the hashcodes are only partially the same (to some number of LSBs) then small hash tables have a problem, but things will go away over some size. > In the case of "collide" with 32768 hash table entries, just 14 bits > of the hash keys are used to determine the bucket. So, neither FNV > (nor any other hash function) can guarantee that a the buckets are > filled evenly and the degenerated case cannot happen. The point is that it should be "hard" to find a large number of collisions with any particular value. Preferably there should be minimal collisions at all, but that's tricky to achieve. But being able to bury things that you *do* want is worse. That's because typically in practice most keys are not created equal. > if a high degree of binary compatibility is the only concern: > it should be possible to keep the existing machinery in 8.* in parallel to > a more optimized version, but the main question is most like, is it > worth the effort? The only major chunk of effort round here is this email thread. :-) [The rest of your message I'll get around to a bit later tonight.] Donal. |
From: Gustaf N. <ne...@wu...> - 2010-02-16 15:25:55
|
On 15.02.10 00:54, Donal K. Fellows wrote:
> ...
> than what I really wanted!) with the result, after much experimentation,
> that the FNV hash function is slower than Tcl's original hash function.
> Not that it's a bottleneck for any existing code.

Whatever bottleneck means: since FNV is slower than the old hash function, and assuming the "collision attack" is not a problem for most programs, most programs will be slower (admittedly, by a small degree).

Concerning the "attack": the problem is apparently that "aj" and "ba" result in the same hash key under the classical Tcl string hash function. The same is true for any other combination of "aj" and "ba" as long as they have the same length, e.g. Hash(baba) == Hash(ajaj) == Hash(baaj) == Hash(ajba). Since the hash is the same, the index of the bucket (based on the low bits) is the same, therefore the same bucket is used for all these keys. For these hash keys, lookup becomes a linear search. Since the hash values are equal, the hash-value comparison cannot avoid the string comparison. With 2 two-character pairs, for a string length of 4 we have just 4 collisions (with longer strings, exponentially more, as "collide" with string length 30 shows). The situation is the same for many more 2-character pairs. How bad is this in practice? This "attack" is no more than bad performance in a pathological case, or is it?

Concerning FNV: for string lengths of less than two characters, FNV and the classic Tcl function return the same hash value (just a side note). As Lars pointed out, it is in general possible to find strings $x and $y such that FNV($x) = FNV($y). But for the "collision attack" hash equality is not required. The hash function implies a linear search (in the buckets) if

    INDEX(Hash($x)) == INDEX(Hash($y))

no matter which hash function is used (where INDEX is the function that calculates the bucket from the hash value). Linear search is only a problem when the number of entries in the bucket is high. It is very easy to come up with values where, also with FNV, the hash lookup becomes a linear search. Consider (with FNV compiled in):

    foreach key {a e i m q u y 1 5 tcl_platform tcl_library} { set x($key) 1 }
    puts [array statistics x]
    % 11 entries in table, 4 buckets
    % number of buckets with 0 entries: 3
    % number of buckets with 10 or more entries: 1
    % average search distance for entry: 6.0

or for example:

    foreach key {
        0z 19 1i 1y 2l 33 3c 3s 46 4f 4v 55 5e 5u 68 6h 6x 7o 82 8b 8r
        91 9a 9q a9 ai ay bl c3 cc cs d6 df dv e5 ee eu f8 fh fx go h2
        hb hr i1 ia iq
    } { set y($key) 1 }
    puts [array statistics y]
    % 47 entries in table, 16 buckets
    % number of buckets with 0 entries: 15
    % number of buckets with 10 or more entries: 1
    % average search distance for entry: 24.0

In both cases, all entries are added into just one bucket. I see no reason why this can't happen for every other reasonable table size. In the case of "collide" with 32768 hash table entries, just 14 bits of the hash keys are used to determine the bucket. So, neither FNV (nor any other hash function) can guarantee that the buckets are filled evenly and that the degenerate case cannot happen.

>> It looks to me as if the original Tcl hash function has quite
>> good properties for the task it is used.
> So long as people are not using it in a setting where an attacker
> controls the keys, that's true. If an attacker can dictate what values
> are being hashed (true in a number of cases like HTTP header parsing)
> then you've got a significant performance problem.

Well, in my understanding "attacker" means "performance attacker". I guess the argument in the HTTP case is that one transmits 30,000 malicious HTTP header entries to slow down the array access from 0.3 microseconds to 518 microseconds (measured on my machine on the "collide" data). I am pretty sure that transmitting 30,000 header fields costs much more.

>> In order to make hashing faster in Tcl, it is most likely more
>> promising to look at the Tcl-hash machinery altogether.
>
> That's a swamp I've preferred to stay out of, because there are many
> constraints due to the requirement to maintain high degrees of binary
> compatibility. Reworking the hash code in depth is probably an activity
> for 9.0 and not before.

If a high degree of binary compatibility is the only concern: it should be possible to keep the existing machinery in 8.* in parallel to a more optimized version, but the main question is most likely: is it worth the effort?

For my current understanding, the main issue is the whole machinery, not only the hash function. The issues are (a) uneven distribution of entries in the buckets, (b) linear search in the buckets, and (c) some more or less constant overhead which could be improved. One approach is to improve the bucket-growing policy (e.g. linear hashing, extendible hashing). This would address (a) and (b), but require some more work. Another, conceptually simpler approach is to replace the linear search in the buckets by e.g. AVL trees (these trees keep themselves balanced). This would address (b); compatibility would not be an issue. Changing the hash function just addresses (a).

> The other alternative is Bob Jenkins's lookup3, but that's big, scary,
> and needs to know the length of the data being hashed (which either
> means a strlen() every time for strings, or forces us to break our
> binary-compatibility guarantees).

Well, HashVarKey in TclVar has the key length - so there is no strlen() required there. This is exactly the place where "collide" stresses the hash function. When I try different hash functions on "collide", FNV is not yielding the best results. I get better figures with lookup8 (tested on 64bit) and murmur (references below). Note as well that the average search distance decreases from 3.0 with FNV to 2.0 with lookup8 and murmur. Also the number of buckets without entries is much better with lookup3/8 and murmur.

-gustaf neumann

references:
http://burtleburtle.net/bob/hash/index.html
http://murmurhash.googlepages.com/

PS: I have the suspicion that the avg search length reported by "array statistics" is not correct, since it always reports values ending with ".0"; I have added values which I believe are more accurate in brackets.
    # FNV
    57581 microseconds per iteration
    32776 entries in table, 16384 buckets
    number of buckets with 0 entries: 8331
    number of buckets with 1 entries: 632
    number of buckets with 2 entries: 1188
    number of buckets with 3 entries: 1570
    number of buckets with 4 entries: 1602
    number of buckets with 5 entries: 1330
    number of buckets with 6 entries: 826
    number of buckets with 7 entries: 470
    number of buckets with 8 entries: 252
    number of buckets with 9 entries: 123
    number of buckets with 10 or more entries: 60
    average search distance for entry: 3.0 (2.0350)

    ##### murmur
    54371 microseconds per iteration
    32776 entries in table, 16384 buckets
    number of buckets with 0 entries: 2218
    number of buckets with 1 entries: 4401
    number of buckets with 2 entries: 4457
    number of buckets with 3 entries: 2995
    number of buckets with 4 entries: 1484
    number of buckets with 5 entries: 549
    number of buckets with 6 entries: 196
    number of buckets with 7 entries: 59
    number of buckets with 8 entries: 19
    number of buckets with 9 entries: 6
    number of buckets with 10 or more entries: 0
    average search distance for entry: 2.00 (1.1569)

    ##### lookup3
    53402 microseconds per iteration
    32776 entries in table, 16384 buckets
    number of buckets with 0 entries: 2237
    number of buckets with 1 entries: 4367
    number of buckets with 2 entries: 4534
    number of buckets with 3 entries: 2855
    number of buckets with 4 entries: 1521
    number of buckets with 5 entries: 621
    number of buckets with 6 entries: 176
    number of buckets with 7 entries: 54
    number of buckets with 8 entries: 18
    number of buckets with 9 entries: 1
    number of buckets with 10 or more entries: 0
    average search distance for entry: 2.00 (1.1584)

    ##### lookup8
    55521 microseconds per iteration
    32776 entries in table, 16384 buckets
    number of buckets with 0 entries: 2209
    number of buckets with 1 entries: 4413
    number of buckets with 2 entries: 4432
    number of buckets with 3 entries: 3052
    number of buckets with 4 entries: 1413
    number of buckets with 5 entries: 595
    number of buckets with 6 entries: 200
    number of buckets with 7 entries: 49
    number of buckets with 8 entries: 17
    number of buckets with 9 entries: 3
    number of buckets with 10 or more entries: 1
    average search distance for entry: 2.0 (1.1561) |
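On the question in the PS: one plausible definition of "average search distance" is the expected number of chain entries examined for a successful lookup, which can be computed from a bucket-fill histogram like the ones above. The sketch below uses the FNV histogram purely as sample input and is only one possible reading of the statistic; whether it matches what [array statistics] (or the bracketed figures) actually reports is exactly what is in doubt here.

    #include <stdio.h>

    int main(void)
    {
        /* histogram[i] = number of buckets holding exactly 'entries'
         * keys, taken from the FNV run above; the "10 or more" tail is
         * omitted because its exact per-bucket counts are not given. */
        static const struct { int entries; long buckets; } histogram[] = {
            {1, 632}, {2, 1188}, {3, 1570}, {4, 1602}, {5, 1330},
            {6, 826}, {7, 470}, {8, 252}, {9, 123}, {0, 0}
        };
        double probes = 0.0, keys = 0.0;
        int i;

        for (i = 0; histogram[i].entries != 0; i++) {
            int n = histogram[i].entries;
            long b = histogram[i].buckets;

            keys += (double) n * b;
            /* Finding the k-th entry in a chain costs k probes, so a
             * bucket of n entries contributes 1+2+...+n = n(n+1)/2. */
            probes += (double) b * n * (n + 1) / 2.0;
        }
        printf("average probes per successful lookup: %.4f\n", probes / keys);
        return 0;
    }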
From: Colin M. <co...@ch...> - 2010-02-16 13:44:42
|
Donal K. Fellows wrote: > On 15/02/2010 20:55, Tom Jackson wrote: > >> Regardless of any problems with the current hash, why does anyone care >> if some idiot chooses an array to store HTTP headers? Should the Tcl >> language protect everyone because they don't realize they need to >> understand that user data needs to be validated, or at least not >> trusted? > > WTF are you talking about? I take it you seem to think that validating > is done before the initial parse of the values into a data structure, > rather than afterwards? Interesting concept. I'm not aware of any > non-trivial web server that does that. > >> Of course I'm on record as being opposed to using an array to handle >> HTTP headers, only the hard headed would continue to use arrays to >> represent HTTP headers, or create any local variables automatically. >> Instead, what you do is to create a representation of the user data >> and then query it using known keys. The HTTP spec permits a client to send headers with essentially arbitrary names. Anything which matches X-* is a valid HTTP header, according to the 1.0 and 1.1 specs. I presume it's possible to craft a series of such headers which would exploit an easily exploitable hash. Validation of individual headers, then, isn't really a viable approach. One could work around the possibility of a spammy malicious client sending too much of this stuff, pragmatically one would limit the size of headers. I doubt whether the issue is that HTTP servers using hashed structures can be made to degenerate. I think it's more that any use of hashed structures can be thus subverted. HTTP servers are probably only an obvious example where a dict or array is exploitable. BTW, I haven't read the record Tom is on, decrying the use of arrays to store HTTP headers. I think Donal's right, though - they're the obvious data type, and no other data type seems to me to have the desired properties. Colin. |
From: Colin M. <co...@ch...> - 2010-02-16 13:16:24
|
John Ousterhout wrote: >> One very glaring way of looking at these flaws is that the old >> hash function only distinguishes between 2551 different two-byte >> sequences, despite there being 65536 of them; this is less than >> 4%. Conversely, for *any* pair of neighbouring bytes of a key, >> there are on average 24 other pairs of bytes that could be >> substituted without changing the value of the old hash function. >> Yet another way of looking at it (though probably hard to >> formalise): the old hash function only keeps about 3.3 bits >> of "randomness" for every 8 bits of key fed in to it! >> > > This is true. I knew this at the time, and it was an intentional > (and actually correct!) choice. It turns out that typical strings > don't actually have 8 bits of information per byte: for example, > most strings that get hashed either consist of purely alphabetic > stuff or sequences of digits. Even in the case of alphabetic > strings, some characters are much more likely than others, so > the space of strings doesn't even grow by a factor of 26 with > each additional character that's added. For example, how many > 2-character words and numbers are there in the English language? > I'll bet it's a lot less than 2551. > If there is less entropy in a set of english words than in the set of extended-ascii characters necessary to encode those words, then one would expect less entropy in the output than in the output of a random bunch of binary. That's not a virtue, it's the law. If, however, there is exploitable entropy in the domain being hashed, it's no virtue of the hash function that it discards it. I would also like to point out that the set of identifiers (being that they are not english words, but often contractions and concatenations and mungings of them) is not the same as the set of english words ... is it? Someone should do the old SHRLDU ETAOIN comparison over them. Colin. |
From: Donal K. F. <don...@ma...> - 2010-02-16 11:57:05
|
On 15/02/2010 20:55, Tom Jackson wrote: > More interesting to me is why the similar function it tclLiteral.c was > not changed: > > line914 result += (result<<3) + bytes[i]; Oh dear, that was an oversight. It will be fixed very soon. :-) > So basically all the pro-change arguments are based upon un-shown > work. My tests show no difference when the collide proc is used as > given, but if you change the collide proc, the current hash seems to > work very well. What's the point of that? With the collision demonstrator (which is published on the Wiki) you have a simple way of generating problematic keys. By "problematic" I mean "large numbers of keys with the same hashcode". To argue that changing the collide procedure gets rid of the problem is missing the whole damn point! We're not testing the collision generator. We're testing the hash function. What I would be interested in is a demonstration of an equivalent collision generator for FNV without requiring rapid growth of key sizes (as those would force the attacker to do more work). > One interesting change is that the run time goes way down. Maybe there > is something special about the original proc. This special feature > does not reduce the security of the hash, for that you need a generic > method to find a collision with a particular string. You're mistaken. This is not a strict security issue because we never assume that two different keys have a different hashcode. What it is is a potentially-crippling performance attack, since it lets anyone who can control what the keys are to make as many keys with the same hashcode as they want (hence they always end up in the same hash bucket even when we resize, leading to O(n**2) performance instead of O(1)). The collide procedure demonstrates the core of how to exploit this problem; all that's then needed is code that puts the keys in an array or dictionary (which is the natural way to write a handler for an http request in tclhttpd or Wub) and you've got a real problem. Yes, only a Denial-of-Service attack, but those matter. > My recommendation: demonstrate a real vulnerability to the original > hash before introducing a fix. No vulnerability has been demonstrated. Do you accept the argument above? > Regardless of any problems with the current hash, why does anyone care > if some idiot chooses an array to store HTTP headers? Should the Tcl > language protect everyone because they don't realize they need to > understand that user data needs to be validated, or at least not > trusted? WTF are you talking about? I take it you seem to think that validating is done before the initial parse of the values into a data structure, rather than afterwards? Interesting concept. I'm not aware of any non-trivial web server that does that. > Of course I'm on record as being opposed to using an array to handle > HTTP headers, only the hard headed would continue to use arrays to > represent HTTP headers, or create any local variables automatically. > Instead, what you do is to create a representation of the user data > and then query it using known keys. Oh, we're supposed to use leprechaun juice to hold the representation? > But of course the problem is that the core developers want everything: > unconscious security. All we're looking for is a way to generate hashcodes for arbitrary "short" values (short is probably "no more than 80 characters" really) that are unlikely to be the same. 
We can't (and never could) guarantee that there will be no clashes, but we do want it to be difficult to generate arbitrarily many clashes because that enables DoS attacks. > Many years ago I learned a simple lesson: everyone that came before me > knew what they were doing. If I intend to change it, I must first > understand the choices they made. I don't see any understanding here. I certainly see no understanding in your message. You're mixing up different hashing schemes and different types of hashes and different reasons for hashing and, well, the result is you're way off base. Donal. |
From: Arjen M. <arj...@de...> - 2010-02-16 10:14:44
|
Hello Mwandha, you better ask this type of questions about the _use_ of Tcl on the comp.lang.tcl newsgroup. This list is meant for discussing the implementation of Tcl itself. Regards, Arjen On 2010-02-16 10:19, mwandha wrote: > Hullo to you all, > > i am a newbie seeking assistance...i am looking for someone with knowledge > of the Time Synchronization protocols. > > At the moment i am facing a problem with coming up with the TCL script for > Master-slave format of the protocol.. i have created the nodes set aside the > master but now the functionality of the master is the biggest problem...i am > not sure where to proceed next...all out of ideas at the moment...you can > find attached a script of the code i am using > http://old.nabble.com/file/p27605489/masters3.tcl masters3.tcl > > i also get an error from it that does say "routes not yet > computed"......Please advice accordingly...all help is dully welcome.. > > > thanks and regards > > ----- > Mwandha Alfred Obadiah | Network / System Technician > Phone: +256 78-2302850 | Email:anm...@gm... > Twitter: http://www.twitter.com/Obabauer > Address: P O Box 8198 | Kampala | Uganda |
From: Tom J. <tom...@gm...> - 2010-02-16 09:30:47
|
2010/2/15 Lars Hellström <Lar...@re...>: > Tom Jackson skrev: >> >> Hey, >> >> As far as I can tell this entire operation and the logic used to >> justify it is totally messed up. >> >> Unless I'm missing some additional code, one of the basic >> justifications for this change is completely wrong. >> >> That justification is the wiki entry which shows FNV somehow creates a >> hash key distribution that outsmarts the test program "collide". From >> my reading of the collide program it was designed to find collisions. >> Unfortunately the code seems to find X candidates which have the same >> hash value and then create an array. >> >> Guess what! This trick works for any hash function. As far as I can >> tell it works for the new FNV hash and one which I tried the djb hash. > > Yes, but it is extremely simple for the classic Tcl hash, because: > > 1. It has the property that if $x and $y are two equal length strings which > hash to the same 32-bit value, then $a$x$b and $a$y$b will be a pair of > strings which hash to the same value too, for any strings $a and $b. > > 2. There are very many pairs of strings as short as 2 characters which hash > to the same 32-bit value. > > FNV escapes 1 by using ^ to add in the characters, and 2 by using a > multiplier larger than 255. It's probably still possible to find strings $x > and $y such that FNV($x) = FNV($y), but I suspect they'd need to be of > length at least 5, and chances are pretty good that FNV($a$x) != FNV($a$y) > (naively I'd say 255/256, but there could be strings $x and $y which are > less sensitive to the prefix $a). > > With the FNV hash function, an HTTP header parser can to some extent > randomise the results of an attacker by picking a $salt prefix and using > $salt$header as hash entry for data that has to do with the the $header > prefix. With the classic hash function, that has no effect whatsoever; an > attack works just as well no matter what $salt is chosen. Here is a short testing script which compares the hash function in Tcl (internal) as well as three external hash functions. I use the collide proc to generate the array keys and display the stats for the internal bucket distribution. Then the keys are fed into the other procs and the resultant hash values are saved. The hash values are reduced modulo some prime number close to the number of buckets used by the internal hash array. Here's what I noticed: The original code (current Tcl hash code) is slow to compute [collide 15], almost three minutes. Just changing the internal hash function sped this up two orders of magnitude: just a few seconds. I guess this is the result of the collisions? The internal bucket assignment is far worse of a problem than the hash function itself. With the exception of hash collisions, which obviously go into the same bucket no matter what, the bucket assignment code causes even more collisions. The external hash functions perform similarly when buckets are determined by their modulus some prime number. The current Tcl hash function also does okay with powers of two, but FNV and the DJB hash do not. When using the prime method, the distributions are all much better than the internal bucket assignment. The FNV hash, when used as the internal hash function somehow resists most of the bucket assignment problems. The results are still worse than the external FNV using a prime modulus. Maybe there is a reason to go to all the trouble of finding a good hash function only to throw away most of the randomness in the higher order bits. 
A prime modulus would maintain this randomness as much as possible.

    # Note aj => ajr and ba => bas in collide:
    proc collide {level {accum ""}} {
        global a
        if {$level} {
            incr level -1
            collide $level "ajr$accum"
            collide $level "bas$accum"
        } else {
            set a($accum) 1
        }
    }

    puts [time {collide 10}]
    puts [array statistics a]
    set names [array names a]

    # Change modulus for different size arrays:
    # Examples:
    # 19 251 521 1021 2063 4889 10007 16567
    namespace eval ::hash {
        variable modulus 1021
    }

    proc ::hash::hashString { string } {
        variable modulus
        set hash 5381
        foreach char [split $string ""] {
            scan $char %c char
            set hash [expr {(($hash << 5) + $hash) ^ $char}]
            set hash [string trimleft [string range $hash end-12 end] 0]
        }
        return [list $string [expr {$hash % $modulus}] $hash]
    }

    proc ::hash::hashString2 { string } {
        set hash 0
        variable modulus
        foreach char [split $string ""] {
            scan $char %c char
            set hash [expr {($hash + ($hash << 3) + $char)}]
            set hash [string trimleft [string range $hash end-12 end] 0]
        }
        return [list $string [expr {$hash % $modulus}] $hash]
    }

    proc ::hash::hashString3 { string } {
        variable modulus
        set hashPrime 1000193
        set hash [expr 0x811c9dc5]
        foreach char [split $string ""] {
            scan $char %c char
            set hash [expr {($hash * $hashPrime) ^ $char}]
            set hash [string trimleft [string range $hash end-12 end] 0]
        }
        return [list $string [expr {$hash % $modulus}] $hash]
    }

    foreach word $names {
        set hash1 [::hash::hashString $word]
        set hash2 [::hash::hashString2 $word]
        set hash3 [::hash::hashString3 $word]
        incr hashmod1([lindex $hash1 1])
        incr hashmod2([lindex $hash2 1])
        incr hashmod3([lindex $hash3 1])
    }

    set desc1 "DJB hash from cdb"
    set desc2 "Current Tcl hash"
    set desc3 "FNV Hash"

    foreach array {1 2 3} {
        foreach {mod count} [array get hashmod$array] {
            incr bucket${array}($count)
        }
    }

    if {0} {
        foreach array {1 2 3} {
            puts "hashmod$array [set desc$array]"
            foreach name [lsort -integer [array names hashmod$array]] {
                puts "$name => [set hashmod${array}($name)]"
            }
        }
    }

    puts "Modulus $::hash::modulus (hash%modulus = bucket)"
    foreach array {1 2 3} {
        puts "Bucket Fill hashmod$array [set desc$array]"
        set total 0
        set usedBuckets 0
        foreach name [lsort -integer [array names bucket$array]] {
            set buckets [expr {[set bucket${array}($name)] * $name}]
            incr total $buckets
            incr usedBuckets [set bucket${array}($name)]
            puts "buckets with $name = [set bucket${array}($name)] total $total"
        }
        puts "buckets with 0 = [expr $::hash::modulus - $usedBuckets]\n\n"
    } |
From: mwandha <anm...@gm...> - 2010-02-16 09:19:38
|
Hullo to you all, i am a newbie seeking assistance...i am looking for someone with knowledge of the Time Synchronization protocols. At the moment i am facing a problem with coming up with the TCL script for Master-slave format of the protocol.. i have created the nodes set aside the master but now the functionality of the master is the biggest problem...i am not sure where to proceed next...all out of ideas at the moment...you can find attached a script of the code i am using http://old.nabble.com/file/p27605489/masters3.tcl masters3.tcl i also get an error from it that does say "routes not yet computed"......Please advice accordingly...all help is dully welcome.. thanks and regards ----- Mwandha Alfred Obadiah | Network / System Technician Phone: +256 78-2302850 | Email:anm...@gm... Twitter: http://www.twitter.com/Obabauer Address: P O Box 8198 | Kampala | Uganda -- View this message in context: http://old.nabble.com/Master-Slave-Time-Synchronization-Protocols-tp27605489p27605489.html Sent from the tcl-core mailing list archive at Nabble.com. |
From: Lars H. <Lar...@re...> - 2010-02-15 23:52:49
|
Tom Jackson wrote:
> Hey,
>
> As far as I can tell this entire operation and the logic used to
> justify it is totally messed up.
>
> Unless I'm missing some additional code, one of the basic
> justifications for this change is completely wrong.
>
> That justification is the wiki entry which shows FNV somehow creates a
> hash key distribution that outsmarts the test program "collide". From
> my reading of the collide program it was designed to find collisions.
> Unfortunately the code seems to find X candidates which have the same
> hash value and then create an array.
>
> Guess what! This trick works for any hash function. As far as I can
> tell it works for the new FNV hash and one which I tried the djb hash.

Yes, but it is extremely simple for the classic Tcl hash, because:

1. It has the property that if $x and $y are two equal length strings which hash to the same 32-bit value, then $a$x$b and $a$y$b will be a pair of strings which hash to the same value too, for any strings $a and $b.

2. There are very many pairs of strings as short as 2 characters which hash to the same 32-bit value.

FNV escapes 1 by using ^ to add in the characters, and 2 by using a multiplier larger than 255. It's probably still possible to find strings $x and $y such that FNV($x) = FNV($y), but I suspect they'd need to be of length at least 5, and chances are pretty good that FNV($a$x) != FNV($a$y) (naively I'd say 255/256, but there could be strings $x and $y which are less sensitive to the prefix $a).

With the FNV hash function, an HTTP header parser can to some extent randomise the results of an attacker by picking a $salt prefix and using $salt$header as the hash entry for data that has to do with the $header prefix. With the classic hash function, that has no effect whatsoever; an attack works just as well no matter what $salt is chosen.

Lars Hellström |
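Lars's two properties are easy to check numerically. The sketch below reimplements the classic multiply-by-9 string hash stand-alone (it is an illustration, not code lifted from tclHash.c) and shows both that "aj" and "ba" collide and that wrapping colliding, equal-length keys in an arbitrary prefix and suffix preserves the collision, which is why salting does not help.

    #include <stdio.h>

    /* Classic Tcl-style string hash: result = result*9 + byte. */
    static unsigned int
    ClassicHash(const char *string)
    {
        unsigned int result = 0;
        unsigned char c;

        while ((c = (unsigned char) *string++) != 0) {
            result += (result << 3) + c;
        }
        return result;
    }

    int main(void)
    {
        /* Property 2: many two-byte pairs already collide, e.g.
         * 'a'*9+'j' and 'b'*9+'a' are both 979. */
        printf("aj -> %u, ba -> %u\n", ClassicHash("aj"), ClassicHash("ba"));

        /* Property 1: any prefix/suffix around colliding, equal-length
         * strings preserves the collision, so a salt changes nothing. */
        printf("saltajpad -> %u, saltbapad -> %u\n",
                ClassicHash("saltajpad"), ClassicHash("saltbapad"));
        return 0;
    }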
From: Donal K. F. <don...@ma...> - 2010-02-15 22:02:16
|
On 14/02/2010 22:56, Lars Hellström wrote: >> For truly small tables, a Lisp-style alist would actually be cheaper. >> Not that they scale... > > That is exactly what is in the buckets, is it not? Close, but not quite. Right now, we also keep the full hashcode so that we can do a cheap pre-check on the equality before going to the full equality test. Apart from that (and some housekeeping info), a linked list of key-value pairs is exactly what we've got. The net effect is that on any lookup (read or update) the hash function is only invoked once, and it's not invoked at all on rebuild. > Raises the question > of whether it might be a good idea to have hash tables start out with > exactly one bucket and skip computing hash keys until the first time > the table is rebuilt. (This adds a small constant time to all accesses, > and drops one larger constant and one key-size-proportional term for > accesses in small tables.) But it complicates the code further and would require a different default hash table size. Right now, we start with 4 cells (preallocated in the hash table structure) so that we minimize the number of mallocs for small tables. > It's probably not possible as a change > under-the-hood of Tcl8's hash tables (e.g. hash entries cache hash > keys, if I read the code correctly), but might be of interest to those > who consider alternatives for e.g. dicts or Tcl9. Further afield (9?) I want to do arrays with things other than hash tables backing them up. Vectors (as in BLT) and databases (as in metakit) are particularly interesting cases. (The main problem with this is working out what to do about traces, but that's a story for another time.) Once that mechanism is done, I'm sure we could try using balanced trees or something like that for array storage. However, I'm not at all convinced that the gain will be worth it for most applications. Let the code prove itself on its own merit when that happens! FWIW, the one thing that I wish had been more clearly enunciated in my undergraduate data structures course (long time ago now!) was just how effective hash tables really are. I suspect the lecturer just didn't believe in them, and so focused on tree balancing and things like that. But hashes are *so* good in practice and, as long as you have a good enough hash function, are easy to implement correctly too. > One very glaring way of looking at these flaws is that the old hash > function only distinguishes between 2551 different two-byte sequences, > despite there being 65536 of them; this is less than 4%. True, but because we apply hashes to our (pseudo-)UTF-8 representation, not all pairs are possible. (We use different hash algorithms for standard pointers and structures, and always have.) I'm not sure what the effect of this is, and I'm *definitely* not sure what effect this has on the attack-prone-ness of the FNV hash. > Conversely, > for *any* pair of neighbouring bytes of a key, there are on average 24 > other pairs of bytes that could be substituted without changing the > value of the old hash function. Yet another way of looking at it > (though probably hard to formalise): the old hash function only keeps > about 3.3 bits of "randomness" for every 8 bits of key fed in to it! I suspect that's not too bad for "nice" (i.e., normal) hash keys. It's probably not far off the entropy level of both numbers and letters in English words. 
> That the old function multiplies by such a low number as 9 is why it's > not so good already at distinguishing between two-byte keys; the 2551 > above arises as (255*9+255) - (0*9+0) + 1 -- maximal contribution from > two bytes minus minimal contribution from two bytes plus 1. Multiplying > by 1 instead (i.e., dropping that result<<3 term) would probably make > it even faster, but also an even poorer hash. ;-) Everything's made more complex by the fact that we chop high bits off the hash value when building the table indexes. That's known (apparently) to be non-optimal for FNV. OTOH, we resize our hash tables too, and then use more bits of the hash so the bits aren't really ignored. There's not been that much study of the types of hash table that Tcl uses and what the effects of the resizing algorithm are on overall hash efficiency. (That may be because it's a sophisticated thing to be doing - a lot of hash table implementations don't bother - and complex to analyse too.) Too many things wrong that can't be fixed without breaking compatibility with lots of existing code. Donal. |
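As a point of reference for the remark that chopping off the high bits is non-optimal for FNV: the FNV authors' recommended way to reduce a hash to fewer bits is xor-folding rather than plain masking. The sketch below is illustrative only and is not what Tcl's table-index code does.

    /*
     * Xor-fold a 32-bit hash down to 'bits' bits: mix the discarded
     * high bits back into the low bits instead of throwing them away.
     */
    static unsigned int
    XorFold(unsigned int hash, int bits)
    {
        unsigned int mask = (1u << bits) - 1;

        return (hash >> bits) ^ (hash & mask);
    }

    /* e.g. a 14-bit bucket index for a 16384-bucket table:
     *   index = XorFold(hash, 14);
     */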
From: Tom J. <tom...@gm...> - 2010-02-15 20:55:39
|
Hey, As far as I can tell this entire operation and the logic used to justify it is totally messed up. Unless I'm missing some additional code, one of the basic justifications for this change is completely wrong. That justification is the wiki entry which shows FNV somehow creates a hash key distribution that outsmarts the test program "collide". From my reading of the collide program it was designed to find collisions. Unfortunately the code seems to find X candidates which have the same hash value and then create an array. Guess what! This trick works for any hash function. As far as I can tell it works for the new FNV hash and one which I tried the djb hash. Then I made changes to the "collide" procedure. Apparently the result is very sensitive to the data pre-pended to the string to be hashed. Regardless of these results, the best distribution seems to be achieved with the djb hash: result = 5381; for (c=*string++ ; c ; c=*string++) { result = ( ( result<<5 ) + result ) ^ c; } More interesting to me is why the similar function it tclLiteral.c was not changed: line914 result += (result<<3) + bytes[i]; But many words have been thrown around about cryptography, safety, whatever. This really makes no sense. The structure of the hash, which weighs toward low memory usage, is designed to open the door to many collisions. The whole design creates the potential for many collisions. You could compare this to a secure hash which maps into an enormous keyspace. The measure of security is based upon the difficulty of finding a collision with a given string. So basically all the pro-change arguments are based upon un-shown work. My tests show no difference when the collide proc is used as given, but if you change the collide proc, the current hash seems to work very well. For instance, change the collide proc to this: proc collide {level {accum ""}} { global a if {$level} { incr level -1 collide $level "jb$accum" collide $level "ba$accum" } else { set a($accum) 1 } } One interesting change is that the run time goes way down. Maybe there is something special about the original proc. This special feature does not reduce the security of the hash, for that you need a generic method to find a collision with a particular string. My recommendation: demonstrate a real vulnerability to the original hash before introducing a fix. No vulnerability has been demonstrated. Basically a cryptographic hash maps an infinite keyspace into a fixed, but very large hashspace. If the hashspace is scaled to the number of used keys, you can imagine that the cryptographic function is lost, assuming it ever existed. Cryptography is the modern equivalent of a leprechaun tying ribbons around every tree in order to disguise the tree hiding the treasure. The scaled hash functions tend to cut down most of the trees, making the treasure much easier to find. Regardless of any problems with the current hash, why does anyone care if some idiot chooses an array to store HTTP headers? Should the Tcl language protect everyone because they don't realize they need to understand that user data needs to be validated, or at least not trusted? Of course I'm on record as being opposed to using an array to handle HTTP headers, only the hard headed would continue to use arrays to represent HTTP headers, or create any local variables automatically. Instead, what you do is to create a representation of the user data and then query it using known keys. But of course the problem is that the core developers want everything: unconscious security. 
Many years ago I learned a simple lesson: everyone that came before me knew what they were doing. If I intend to change it, I must first understand the choices they made. I don't see any understanding here. tom jackson On Mon, Feb 15, 2010 at 7:14 AM, Donal K. Fellows <don...@ma...> wrote: > On 15/02/2010 04:56, John Ousterhout wrote: >> >> It's possible that the proposed new algorithm is better than the >> original Tcl one; it looks like it might use a more complex >> approach that tends to spread the information from each character >> more quickly across all of the bits of the hash value, whereas >> the multiply-by-9 approach only gradually does this. > > The original Tcl algorithm is not bad for most keys. It both keeps bits > low down and spreads them up, and it uses operations that don't lose too > much information (i.e., multiplication and addition). The FNV algorithm > is remarkably similar in structure (so advantageous in terms of ease of > transition) but uses a different multiplier that spreads the information > more thoroughly and uses xor rather than addition (though we all know > those are actually closely related operations). > > The main issue with the old hash algorithm was that it was too easy to > attack it and turn hash table accesses from O(1) to O(n**2). A > literature search (not very easy; this is not a heavily studied area) > indicates that the FNV hash is quite a bit more resistant; attacks would > require using much longer keys and would be more likely to hit some > other limit. > > The other advantage of the FNV algorithm is that it is well-known, > reasonably well-studied, and does a reasonably good job with producing > keys for the space of "short words"; I'm not really the guy saying that > it is quite good, a number of others are (URLs elsewhere in this > thread). I don't know of anyone other than Tcl using the old hash > algorithm (though related variants have been studied and found lacking). > > I'm tempted by Bob Jenkins's lookup3 hash, but it's big, complex, and > really wants to know the length of the string first so implementing it > efficiently for string keys is awkward[*]. We could use it for Tcl_Obj > keys, but I'm currently keener on using the same algorithm for both > types of user-visible keys. (I'm also a bit hesitant because of the > complexity of lookup3; it's non-trivial to code.) > > [Good advice elided] >> >> So, if you decide to change, make sure you measure very carefully >> on representative datasets, and make sure you've studied the >> properties of hashing functions enough to understand exactly >> how yours is working and why it is better. > > On non-malicious datasets, Tcl's original algorithm works pretty well. > It's the case when someone's deliberately trying to provoke bad > behaviour that concerned me. FNV seems to do better (and has reasonable > distribution behaviour on the non-malicious datasets I've tried it with, > such as the contents of /usr/share/dict/words and the first million > cardinal numbers). > > As a side note, changing the hash has flushed out a small number of > places in Tcl where we had code that actually did depend on the order of > hashing. These places were bugs waiting to happen... > > Donal. > [* Maintaining binary and source compatibility was a cast-iron > requirement for this change, so storing the length of a string key > was not possible; there's nowhere in a Tcl_HashEntry for it to go > and offsets of fields are baked into user code because of macros. 
] |
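A minimal, self-contained way to test claims like those in the message above is to feed the same generated keys through a candidate hash function in C and tally bucket occupancy directly. The sketch below does this for the old Tcl string hash; the key-generation scheme mirrors the modified collide proc quoted above, and the bucket count and depth are arbitrary illustration values, not anything taken from Tcl's sources.

    /*
     * Sketch: how the old Tcl string hash spreads collide-style keys over a
     * fixed number of buckets.  Not Tcl source; key scheme and bucket count
     * are illustrative assumptions.
     */
    #include <stdio.h>

    #define LEVELS   16                /* 2^16 keys, like a "collide 16" run   */
    #define NBUCKETS 4096              /* bucket count; must be a power of 2   */

    static int bucket[NBUCKETS];

    /* The classic Tcl string hash: result = result*9 + c for each byte. */
    static unsigned oldTclHash(const char *s)
    {
        unsigned result = 0;
        for (; *s; s++) {
            result += (result << 3) + (unsigned char) *s;
        }
        return result;
    }

    int main(void)
    {
        char key[2 * LEVELS + 1];
        unsigned n;
        int i, used = 0, longest = 0;

        /* Build every string of LEVELS blocks, each block "jb" or "ba",
         * mirroring the modified collide proc quoted above. */
        for (n = 0; n < (1u << LEVELS); n++) {
            for (i = 0; i < LEVELS; i++) {
                const char *block = ((n >> i) & 1) ? "jb" : "ba";
                key[2 * i] = block[0];
                key[2 * i + 1] = block[1];
            }
            key[2 * LEVELS] = '\0';
            bucket[oldTclHash(key) & (NBUCKETS - 1)]++;
        }
        for (i = 0; i < NBUCKETS; i++) {
            if (bucket[i] > 0) {
                used++;
                if (bucket[i] > longest) longest = bucket[i];
            }
        }
        printf("%d of %d buckets used, longest chain %d\n",
               used, NBUCKETS, longest);
        return 0;
    }

Swapping in a different hash function (FNV, djb, or a modified key scheme) makes it easy to compare how many buckets are used and how long the worst chain gets for the same key set.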
From: Larry M. <lm...@bi...> - 2010-02-15 16:16:44
|
On Mon, Feb 15, 2010 at 03:14:07PM +0000, Donal K. Fellows wrote: > The main issue with the old hash algorithm was that it was too easy to > attack it and turn hash table accesses from O(1) to O(n**2). So there must be bug reports or feature requests from users complaining about this. Could we see those? > On non-malicious datasets, Tcl's original algorithm works pretty well. > It's the case when someone's deliberately trying to provoke bad > behaviour that concerned me. FNV seems to do better (and has reasonable > distribution behaviour on the non-malicious datasets I've tried it with, > such as the contents of /usr/share/dict/words and the first million > cardinal numbers). What's wrong with letting people who are trying to break the system suffer poor performance? You've got working code, well thought out, field tested for a couple of decades, no complaints that I have seen, and you want to change it. That's fine but there should be a compelling case for doing so. In the absence of one any sane group of engineers would not change something that works just because they think the new one might be better. Experience has shown all of us that every change has some unexpected costs. -- --- Larry McVoy lm at bitmover.com http://www.bitkeeper.com |
From: Donal K. F. <don...@ma...> - 2010-02-15 15:14:30
|
On 15/02/2010 04:56, John Ousterhout wrote: > It's possible that the proposed new algorithm is better than the > original Tcl one; it looks like it might use a more complex > approach that tends to spread the information from each character > more quickly across all of the bits of the hash value, whereas > the multiply-by-9 approach only gradually does this. The original Tcl algorithm is not bad for most keys. It both keeps bits low down and spreads them up, and it uses operations that don't lose too much information (i.e., multiplication and addition). The FNV algorithm is remarkably similar in structure (so advantageous in terms of ease of transition) but uses a different multiplier that spreads the information more thoroughly and uses xor rather than addition (though we all know those are actually closely related operations). The main issue with the old hash algorithm was that it was too easy to attack it and turn hash table accesses from O(1) to O(n**2). A literature search (not very easy; this is not a heavily studied area) indicates that the FNV hash is quite a bit more resistant; attacks would require using much longer keys and would be more likely to hit some other limit. The other advantage of the FNV algorithm is that it is well-known, reasonably well-studied, and does a reasonably good job with producing keys for the space of "short words"; I'm not really the guy saying that it is quite good, a number of others are (URLs elsewhere in this thread). I don't know of anyone other than Tcl using the old hash algorithm (though related variants have been studied and found lacking). I'm tempted by Bob Jenkins's lookup3 hash, but it's big, complex, and really wants to know the length of the string first so implementing it efficiently for string keys is awkward[*]. We could use it for Tcl_Obj keys, but I'm currently keener on using the same algorithm for both types of user-visible keys. (I'm also a bit hesitant because of the complexity of lookup3; it's non-trivial to code.) [Good advice elided] > So, if you decide to change, make sure you measure very carefully > on representative datasets, and make sure you've studied the > properties of hashing functions enough to understand exactly > how yours is working and why it is better. On non-malicious datasets, Tcl's original algorithm works pretty well. It's the case when someone's deliberately trying to provoke bad behaviour that concerned me. FNV seems to do better (and has reasonable distribution behaviour on the non-malicious datasets I've tried it with, such as the contents of /usr/share/dict/words and the first million cardinal numbers). As a side note, changing the hash has flushed out a small number of places in Tcl where we had code that actually did depend on the order of hashing. These places were bugs waiting to happen... Donal. [* Maintaining binary and source compatibility was a cast-iron requirement for this change, so storing the length of a string key was not possible; there's nowhere in a Tcl_HashEntry for it to go and offsets of fields are baked into user code because of macros. ] |
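For readers comparing the two functions under discussion, here is a minimal side-by-side sketch of the loop bodies; it is an illustration, not the actual Tcl source. FNV is shown in its 32-bit FNV-1a form (offset basis 2166136261, prime 16777619); the thread does not say which FNV variant the patch adopted, so treat that choice as an assumption.

    #include <stdio.h>

    static unsigned oldTclHash(const char *s)
    {
        unsigned result = 0;
        for (; *s; s++) {
            /* equivalent to result = result*9 + c */
            result += (result << 3) + (unsigned char) *s;
        }
        return result;
    }

    static unsigned fnv1aHash(const char *s)
    {
        unsigned hash = 2166136261u;          /* FNV offset basis */
        for (; *s; s++) {
            hash ^= (unsigned char) *s;       /* xor in the byte ...            */
            hash *= 16777619u;                /* ... then multiply by the prime */
        }
        return hash;
    }

    int main(void)
    {
        const char *key = "example";
        printf("old: %u   fnv1a: %u\n", oldTclHash(key), fnv1aHash(key));
        return 0;
    }

The structural similarity (one multiply and one combine per byte) is what makes the transition cheap; the difference lies in the much larger multiplier and the xor.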
From: John O. <ou...@pa...> - 2010-02-15 04:57:32
|
> One very glaring way of looking at these flaws is that the old > hash function only distinguishes between 2551 different two-byte > sequences, despite there being 65536 of them; this is less than > 4%. Conversely, for *any* pair of neighbouring bytes of a key, > there are on average 24 other pairs of bytes that could be > substituted without changing the value of the old hash function. > Yet another way of looking at it (though probably hard to > formalise): the old hash function only keeps about 3.3 bits > of "randomness" for every 8 bits of key fed in to it! This is true. I knew this at the time, and it was an intentional (and actually correct!) choice. It turns out that typical strings don't actually have 8 bits of information per byte: for example, most strings that get hashed either consist of purely alphabetic stuff or sequences of digits. Even in the case of alphabetic strings, some characters are much more likely than others, so the space of strings doesn't even grow by a factor of 26 with each additional character that's added. For example, how many 2-character words and numbers are there in the English language? I'll bet it's a lot less than 2551. Suppose instead of the current calculation "result = result*9 + c" we used "result = result*256 + c". The problem with this is that result grows way too quickly, so after processing just a few characters the first character in the string has been shifted into the high-order bits of result. We only use the low-order bits as the hash bucket index, so this means that only the last few characters of the string influence that index. As a result, there isn't enough entropy/information in the low-order bits that are actually used, so more hash collisions result. Ideally we'd like the magnitude of result to grow at the same rate as the overall table size, so that the useful information in the hash value falls into the bits that we actually use. I picked the constant 9 after a lot of experimentation with different hash values. Constants that are either larger or smaller cause worse hash table performance. It's possible that the proposed new algorithm is better than the original Tcl one; it looks like it might use a more complex approach that tends to spread the information from each character more quickly across all of the bits of the hash value, whereas the multiply-by-9 approach only gradually does this. However, BE VERY CAREFUL ABOUT HASH FUNCTIONS. Picking a good hash function seems simple but is actually very complex, with lots of subtle gotchas. Many people think they know good functions, but most of them don't actually behave very well and few people do extensive tests on data that is representative of how the hash functions will actually be used. Hash functions that work well on some kinds of data may not work well on other kinds. I got started in this because I wrote an application using an "off-the-shelf" hash function and discovered that the application performed very poorly at large scale. I eventually discovered that it was only using a small portion of the hash bucket array because the hash function did not successfully spread the hash values across the entire array; there were so many bucket collisions that the entire application performed badly. I ended up with the current function after a lot of experiments. 
So, if you decide to change, make sure you measure very carefully on representative datasets, and make sure you've studied the properties of hashing functions enough to understand exactly how yours is working and why it is better. -John- |
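A small sketch of the point about the multiplier, assuming a 256-bucket table (an arbitrary illustration size): with a multiplier of 256, the early characters of a key are shifted entirely out of the low-order bits, so keys that differ only at the start all land in the bucket chosen by their final characters, whereas with the multiplier 9 the prefix still influences the low bits. The key names below are made up.

    #include <stdio.h>

    /* result = result*m + c over the bytes of s; m = 9 is Tcl's old choice. */
    static unsigned hashWithMultiplier(const char *s, unsigned m)
    {
        unsigned result = 0;
        for (; *s; s++) {
            result = result * m + (unsigned char) *s;
        }
        return result;
    }

    int main(void)
    {
        const char *keys[] = { "alphaSuffix", "bravoSuffix",
                               "deltaSuffix", "gammaSuffix" };
        int i;

        for (i = 0; i < 4; i++) {
            unsigned h9   = hashWithMultiplier(keys[i], 9);
            unsigned h256 = hashWithMultiplier(keys[i], 256);
            /* The low 8 bits stand in for the index into a 256-bucket table. */
            printf("%-12s  m=9 -> bucket %3u   m=256 -> bucket %3u\n",
                   keys[i], h9 & 255u, h256 & 255u);
        }
        return 0;
    }

With m=256 every key above hashes to the bucket determined by its last byte, so all four share a bucket; with m=9 the differing prefixes still feed into the low-order bits.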
From: Larry M. <lm...@bi...> - 2010-02-15 03:20:43
|
> That is exactly what is in the buckets, is it not? That raises the question
> of whether it might be a good idea to have hash tables start out with
> exactly one bucket and skip computing hash keys until the first time
> the table is rebuilt. (This adds a small constant time to all accesses,
> and drops one larger constant and one key-size-proportional term for
> accesses in small tables.) It's probably not possible as a change
> under the hood of Tcl8's hash tables (e.g. hash entries cache hash
> keys, if I read the code correctly), but might be of interest to those
> who consider alternatives for e.g. dicts or Tcl9.

We have a hash implementation that I and another guy built at SGI based on Oz Yigit's sdbm code. It's more or less dbm(3)-compatible, has in-memory and mmap-based on-disk versions, and scales down somewhat. I believe that you can get the index and data in 128 bytes (it doesn't matter whether you are on 32 or 64 bits, but the key/data are limited to 16 bits (64K) in size). You can compile it to use 32-bit key/value sizes, but that makes the memory footprint bigger. I can make a shar file of this code if anyone wants to play with it to see if it would be useful for Tcl. The actual hashing function used is pluggable; the code doesn't care. It comes with a handful of them, but you could plug the Tcl one (new or old) in if you liked. -- --- Larry McVoy lm at bitmover.com http://www.bitkeeper.com |
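A generic sketch of the kind of "pluggable hash function" interface being described; the names and structure here are invented for illustration and are not the actual sdbm-derived code being offered.

    #include <stdio.h>
    #include <string.h>

    /* Any hash function over a byte range can be plugged in. */
    typedef unsigned (*hashFn)(const void *key, size_t len);

    struct tableConfig {
        size_t maxPairSize;      /* e.g. the 64K key+data limit mentioned above */
        hashFn hash;             /* old Tcl hash, FNV, djb, ... */
    };

    /* One possible plug-in: the classic Tcl string hash over a byte range. */
    static unsigned tclStyleHash(const void *key, size_t len)
    {
        const unsigned char *p = key;
        unsigned result = 0;
        size_t i;

        for (i = 0; i < len; i++) {
            result += (result << 3) + p[i];
        }
        return result;
    }

    int main(void)
    {
        struct tableConfig cfg = { 65536, tclStyleHash };
        const char *key = "example";

        printf("%u\n", cfg.hash(key, strlen(key)));
        return 0;
    }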
From: Donal K. F. <don...@ma...> - 2010-02-14 23:54:47
|
On 14/02/2010 16:58, Gustaf Neumann wrote:
> On 11.02.10 01:08, Alexandre Ferrieux wrote:
>> Since it appears that timings are tricky,
> Apparently. It turns out that in both mini-benchmarks (the two
> "foo" functions) the hash function for string keys (HashStringKey())
> is not called at all (!) in the timed loop (e.g. Tcl variables are
> compiled locals).

That's why you have to use array variables there. Though that also doesn't call HashStringKey; it calls the equivalent for Tcl_Obj keys.

> Funnily enough, just by changing the string hash function one can
> measure different performance values (maybe due to processor-level
> caching or similar external effects).

There are a lot of tricky bits to measuring the performance. That's why I switched to using a direct test (and had to tweak that a lot too in the end so as to stop the gcc optimizer from making me test something other than what I really wanted!) with the result, after much experimentation, that the FNV hash function is slower than Tcl's original hash function. Not that it's a bottleneck for any existing code.

The other metrics we can use here relate to the differences in distribution of hash keys, which can have quite a substantial impact on the overall hash *table* performance. With general data, the performance seems to be about the same; it's difficult to construct an example with ordinary keys where the difference is really worth mentioning. With "malicious" data it's different though; http://wiki.tcl.tk/5164 describes how the old function (mis)behaves, and with FNV that part is unremarkable. (There should be sets of keys which are a problem for FNV too, but I've not constructed any. I have found[*] a reference to some code for generating collisions against FNV, <URL:http://www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html>, but I'm not sure about the practicality of any attack based on the methods described there. I also don't know what the effect is of hashing over utf-8 instead of something 8-bit clean.)

> - For small hash tables, only a few bits of the computed hash
>   values are used. For the initial size, these are only the last two
>   bits. With every hash table resize (RebuildTable), two more bits are
>   added. It is not clear how the properties of hash functions
>   propagate when only a small fraction (e.g. 2, 4, 6, etc. bits) of
>   the computed hash value is actually used.

Tends not to be an issue, because if anyone tries to exploit it, the table gets rebuilt and more bits of the hash are used. A way to defeat the whole problem would be to take the whole hash value mod some number that is coprime with the hash table size, but mod is a fairly expensive operation. It's only really a pressing issue when people are generating many keys with an identical full hash value (by default, Tcl does a comparison of the full hashcode before proceeding to a key comparison). As noted before, that's really depressingly easy with Tcl's old hash function.

> It looks to me as if the original Tcl hash function has quite
> good properties for the task it is used for.

So long as people are not using it in a setting where an attacker controls the keys, that's true. If an attacker can dictate what values are being hashed (true in a number of cases like HTTP header parsing), then you've got a significant performance problem.

> In order to make hashing faster in Tcl, it is most likely more
> promising to look at the Tcl hash machinery as a whole. 
That's a swamp I've preferred to stay out of, because there are many constraints due to the requirement to maintain high degrees of binary compatibility. Reworking the hash code in depth is probably an activity for 9.0 and not before. The other alternative is Bob Jenkins's lookup3, but that's big, scary, and needs to know the length of the data being hashed (which either means a strlen() every time for strings, or forces us to break our binary-compatibility guarantees). Donal. [* _This_ thread is one of the top 10 hits that I find when I google for "fnv hash collision generating", which is a bit self-referential. ] |
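A sketch of the two bucket-indexing strategies touched on above, under one reading of the "mod some number that is coprime" remark: masking the low bits of a power-of-two table is cheap, but a small table then sees only a few bits of the hash, whereas reducing the full hash modulo a prime-sized table lets every bit influence the index, at the cost of an integer division. The table sizes and example hash values below are arbitrary.

    #include <stdio.h>

    static unsigned bucketByMask(unsigned hash)
    {
        const unsigned nbuckets = 16;       /* power of two, as Tcl uses  */
        return hash & (nbuckets - 1);       /* only the low 4 bits matter */
    }

    static unsigned bucketByMod(unsigned hash)
    {
        const unsigned nbuckets = 17;       /* a prime bucket count            */
        return hash % nbuckets;             /* all 32 bits influence the index */
    }

    int main(void)
    {
        /* Two hashes that agree in their low 4 bits but differ higher up. */
        unsigned h1 = 0x00000005u, h2 = 0xABCD0005u;

        printf("mask: %u vs %u\n", bucketByMask(h1), bucketByMask(h2)); /* collide */
        printf("mod:  %u vs %u\n", bucketByMod(h1),  bucketByMod(h2));  /* differ here */
        return 0;
    }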
From: Lars H. <Lar...@re...> - 2010-02-14 22:56:15
|
Donal K. Fellows wrote:
> On 09/02/2010 16:15, Gustaf Neumann wrote:
>> The output of Donal showing the buckets certainly indicates
>> that FNV leads to fewer collisions. However, from my experience,
>> we have in Tcl many hash tables with only a few items, so
>> I would not be surprised if most programs run rather slower
>> with FNV than with the old Tcl hash function.
>
> It depends critically on the operation mix, what keys are being used,
> etc., and it's hard to test in situ because it's typically used in
> conjunction with other (much more expensive) operations like memory
> allocation. Given that, keeping bucket chains short is probably the best
> metric of all.

Indeed. That's what can cause a difference in the asymptotic complexity of these algorithms.

> For truly small tables, a Lisp-style alist would actually be cheaper.
> Not that they scale...

That is exactly what is in the buckets, is it not? That raises the question of whether it might be a good idea to have hash tables start out with exactly one bucket and skip computing hash keys until the first time the table is rebuilt. (This adds a small constant time to all accesses, and drops one larger constant and one key-size-proportional term for accesses in small tables.) It's probably not possible as a change under the hood of Tcl8's hash tables (e.g. hash entries cache hash keys, if I read the code correctly), but might be of interest to those who consider alternatives for e.g. dicts or Tcl9.

>> maybe, some handcoded inline assembly could speed up the function.

An interesting possibility in that area might be to use something like the BSWAP instruction (http://www1.arl.wustl.edu/~lockwood/class/cs306/books/artofasm/Chapter_6/CH06-1.html#HEADING1-291) instead of a multiplication, on every second iteration of the loop (I'm imagining a "Duff's device" style loop unrolling). Since this switches what is most and least significant in the hash word, it would have the effect of making the remaining multiplications mix bits "downwards" as well as "upwards".

> Maybe, but then we'd have to maintain it on all our supported platforms.
> :-( The only time we've ever used assembly in Tcl's implementation was
> because there was *no* other way to do it (and was for dealing with
> particularly odd parts of Windows on x86 only).

Well, it's not technically necessary to use the same hash function on all platforms, is it? On the other hand, I agree it's probably not significant enough to bother with anyway, once the more obvious flaws of the current hash function have been dealt with.

One very glaring way of looking at these flaws is that the old hash function only distinguishes between 2551 different two-byte sequences, despite there being 65536 of them; this is less than 4%. Conversely, for *any* pair of neighbouring bytes of a key, there are on average 24 other pairs of bytes that could be substituted without changing the value of the old hash function. Yet another way of looking at it (though probably hard to formalise): the old hash function only keeps about 3.3 bits of "randomness" for every 8 bits of key fed into it!

Gustaf Neumann wrote:
> - At least on x86 Intel, gcc compiles the classical Tcl string hash
>   function pretty well. The loop body of the hash function,
>   "result += (result<<3) + c;",
>   results in just two machine instructions (one lea + one add).
>   This is hard to beat by assembly coding (if someone is interested, I
>   have an apparently slightly faster inline assembly version for x86
>   64-bit). 
So, the classical Tcl hash function appears well suited for > modern compilers, and gcc does very well here (i used gcc 4.2.1). > FNV needs the slower integer multiply, that's why the FNV > is slower (measured by Donals hashes.c test). That the old function multiplies by such a low number as 9 is why it's not so good already at distinguishing between two-byte keys; the 2551 above arises as (255*9+255) - (0*9+0) + 1 -- maximal contribution from two bytes minus minimal contribution from two bytes plus 1. Multiplying by 1 instead (i.e., dropping that result<<3 term) would probably make it even faster, but also an even poorer hash. ;-) Lars Hellström |
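The 2551 figure is easy to check mechanically; here is a minimal sketch (not taken from the thread) that runs the old hash over every possible two-byte key and counts the distinct results.

    #include <stdio.h>

    int main(void)
    {
        /* Maximum two-byte hash is 255*9 + 255 = 2550, so 2551 slots suffice. */
        static unsigned char seen[2551];
        int b1, b2, distinct = 0;

        for (b1 = 0; b1 < 256; b1++) {
            for (b2 = 0; b2 < 256; b2++) {
                unsigned h = 0;
                h += (h << 3) + b1;          /* result = result*9 + byte */
                h += (h << 3) + b2;
                if (!seen[h]) {
                    seen[h] = 1;
                    distinct++;
                }
            }
        }
        printf("%d distinct hash values for 65536 two-byte keys\n", distinct);
        return 0;
    }

It prints 2551, matching the (255*9+255) - (0*9+0) + 1 calculation above.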
From: Gustaf N. <ne...@wu...> - 2010-02-14 17:20:48
|
On 11.02.10 01:08, Alexandre Ferrieux wrote:
> Since it appears that timings are tricky,

Apparently. It turns out that in both mini-benchmarks (the two "foo" functions) the hash function for string keys (HashStringKey()) is not called at all (!) in the timed loop (e.g. Tcl variables are compiled locals). Funnily enough, just by changing the string hash function one can measure different performance values (maybe due to processor-level caching or similar external effects).

Two more observations:

- At least on x86 Intel, gcc compiles the classical Tcl string hash function pretty well. The loop body of the hash function, "result += (result<<3) + c;", results in just two machine instructions (one lea + one add). This is hard to beat by assembly coding (if someone is interested, I have an apparently slightly faster inline assembly version for x86 64-bit). So the classical Tcl hash function appears well suited for modern compilers, and gcc does very well here (I used gcc 4.2.1). FNV needs a slower integer multiply; that's why FNV is slower (as measured by Donal's hashes.c test).

- For small hash tables, only a few bits of the computed hash values are used. For the initial size, these are only the last two bits. With every hash table resize (RebuildTable), two more bits are added. It is not clear how the properties of hash functions propagate when only a small fraction (e.g. 2, 4, 6, etc. bits) of the computed hash value is actually used.

It looks to me as if the original Tcl hash function has quite good properties for the task it is used for. In order to make hashing faster in Tcl, it is most likely more promising to look at the Tcl hash machinery as a whole. The granularity of the functions does not look perfectly suited to the needs of current Tcl. So the current definitions

==================================
Tcl_HashKeyType {
    ....
    HashKeyProc
    CompareHashKeysProc
    AllocHashEntryProc
    FreeHashEntryProc
}

Tcl_HashEntry *
Tcl_CreateHashEntry(...)
{
    ...
    if (tablePtr->keyType == TCL_STRING_KEYS) {
        typePtr = &tclStringHashKeyType;
    } else if (tablePtr->keyType == TCL_ONE_WORD_KEYS) {
        typePtr = &tclOneWordHashKeyType;
    } else if (tablePtr->keyType == TCL_CUSTOM_TYPE_KEYS
            || tablePtr->keyType == TCL_CUSTOM_PTR_KEYS) {
        typePtr = tablePtr->typePtr;
    } else {
        typePtr = &tclArrayHashKeyType;
    }
    ...
==================================

could become something like

==================================
Tcl_HashKeyType {
    ....
    CreateHashEntryProc
    FreeHashEntryProc
}

#define Tcl_CreateHashEntry(tablePtr, ...) \
    (...) tablePtr->type->CreateHashEntryProc(...)
==================================

to provide a tailored create function for every hashKeyType. Tcl does not call HashKeyProc, CompareHashKeysProc, or AllocHashEntryProc outside CreateHashEntry. The common "keyType == ..." sermon used in several functions of the hash API does not look perfect either and could be replaced by a macro like the one sketched....

best regards
-gustaf neumann |
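A rough, generic illustration of the restructuring being proposed (not actual Tcl code, and the names below are invented): each key type carries its own create function, so the lookup path dispatches through one function pointer instead of testing keyType in every API entry point.

    #include <stdio.h>

    struct table;                               /* stand-in for Tcl_HashTable   */

    typedef void *(*createEntryFn)(struct table *t, const void *key, int *isNew);

    struct keyType {                            /* stand-in for Tcl_HashKeyType */
        const char   *name;
        createEntryFn createEntry;
    };

    struct table {
        const struct keyType *type;
    };

    /* Dispatch replaces the keyType == TCL_STRING_KEYS / ... chain. */
    #define CREATE_ENTRY(t, key, isNewPtr) \
        ((t)->type->createEntry((t), (key), (isNewPtr)))

    static void *createStringEntry(struct table *t, const void *key, int *isNew)
    {
        (void) t;
        *isNew = 1;                             /* real code would hash and probe */
        printf("string-key create: %s\n", (const char *) key);
        return NULL;
    }

    static const struct keyType stringKeys = { "string", createStringEntry };

    int main(void)
    {
        struct table t = { &stringKeys };
        int isNew;

        CREATE_ENTRY(&t, "example", &isNew);
        return 0;
    }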