From: Tom J. <tom...@gm...> - 2010-02-18 21:31:53
|
On Thu, Feb 18, 2010 at 1:23 AM, Donal K. Fellows <don...@ma...> wrote: > On 17/02/2010 18:41, Tom Jackson wrote: >> The only way to justify using an array instead of a linear list {{key >> value} {key value}} is if you don't think it is important to maintain >> the data in the form it was sent. This is a valid argument. But note >> that in order to add a key, or lappend to an existing key, you have to >> search the array first. A large array will consume more resources >> during creation than a list containing the same data. > > I'm not criticising other parts of what they're doing. Just the process > of going from key to bucket chain to add to. Theirs is particularly lame > as it loads all keys that start with the same letter into the same > bucket (so trivial to attack it's unfunny) but it's not as lame as when > it was all just an alist. It was in older versions of Apache; adds might > have been quick, but lookups would have sucked... I really wish I could figure out how anyone ever got the idea that arrays are better than lists if the data is only used a few times. Somehow you have to pay for the creation of the array. This makes sense if you need continuous access to the same data over a long period of time. Another important point about Tcl arrays is that the history suggests that they were designed as a method for mapping strings to opaque structures (or pointers, whatever). The string-data to binary-data mapping is one of the most important features of the Tcl language. So you can't use a list to map strings to an opaque structure., because lists are "values". Somehow this important fact is being turned upside down. We are focusing on the key and not the value. Now, back to performance: You have to pay for the creation and maintenance of the array. This is an up-front cost. If you have 10 minutes to visit Disneyland, your per-ride cost will be very high. Enough talk. Here is a script which uses the FNV algorithm to create an array, using the collide keys. After the script I have timings (create the array with the keys, create a name-value list with the keys). The result is that it takes so much time to create the array, I have plenty of time left over to do very slow linear searches...and I can do them case-insensitive. In addition I can recover the order of the input list. Why anyone thinks it is a good idea to spend time destroying information is beyond me. But basically that is what any list->array transform gives you. # original collide proc proc collide {level {accum ""} } { global a if {$level} { incr level -1 collide $level "aj$accum" collide $level "ba$accum" } else { lappend a($accum) 1 } } # new proc just generates keys proc collideNames {level {accum ""} } { global names if {$level} { incr level -1 collideNames $level "aj$accum" collideNames $level "ba$accum" } else { lappend names $accum } } # generate lots of keys collideNames 15 set hostChoice [list host Host hosT hOst hoSt HoSt HOst hoST] set valueIndex 0 set hostIndex 0 # create array using keys, note collide runs faster because it is a compiled proc # the loop here matches the following list creation loop. puts "create host array: [time { foreach name $names { lappend a($name) [incr valueIndex] if {!($valueIndex % 10007)} { incr hostIndex lappend a([lindex $hostChoice $hostIndex]) $valueIndex } } }]" puts [array statistics a] set valueIndex 0 set hostIndex 0 puts "create host list: [time { foreach name $names { lappend headerList [list $name [incr valueIndex]] if {!($valueIndex % 10007)} { incr hostIndex lappend headerList [list [lindex $hostChoice $hostIndex] $valueIndex] } } }]" puts "finding First Host: [time {set hosts [lsearch -inline -index 0 $headerList Host]}]" puts "finding First Host nocase: [time {set hosts [lsearch -inline -index 0 $headerList Host]}]" puts "finding All Hosts: [time {set hosts [lsearch -inline -all -index 0 $headerList Host]}]" puts "finding All Hosts nocase: [time {set hosts [lsearch -inline -all -nocase -index 0 $headerList Host]}]" puts $hosts ### End script The only thing interesting here is the extra time required to create the array. But lets not even get into the problems of searching for a "host" header with no case: % array names a -regexp (h|H)(o|O)(s|S)(t|T) hosT hOst Host % time {array names a -regexp (h|H)(o|O)(s|S)(t|T)} 120592 microseconds per iteration Then you have to retrieve the values. Here is what I get with a list representation: create host array: 1025203 microseconds per iteration (wow! over one second) 32771 entries in table, 16384 buckets number of buckets with 0 entries: 8342 number of buckets with 1 entries: 604 number of buckets with 2 entries: 1144 number of buckets with 3 entries: 1657 number of buckets with 4 entries: 1616 number of buckets with 5 entries: 1271 number of buckets with 6 entries: 848 number of buckets with 7 entries: 487 number of buckets with 8 entries: 237 number of buckets with 9 entries: 111 number of buckets with 10 entries: 47 number of buckets with 11 entries: 15 number of buckets with 12 entries: 4 number of buckets with 14 entries: 1 number of buckets with 10000 or more entries: 0 average search distance for entry: 3.0 create host list: 381651 microseconds per iteration (2.5 times faster) finding First Host: 1867 microseconds per iteration finding First Host nocase: 1697 microseconds per iteration finding All Hosts: 6618 microseconds per iteration finding All Hosts nocase: 40670 microseconds per iteration Host keys: {Host 10007} {hosT 20014} {hOst 30021} Somewhere the conventional wisdom about list vs. array access needs to be discredited. tom jackson |