Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

On Thu, Feb 18, 2010 at 1:23 AM, Donal K. Fellows
<don...@ma...> wrote:
> On 17/02/2010 18:41, Tom Jackson wrote:
>> The only way to justify using an array instead of a linear list {{key
>> value} {key value}} is if you don't think it is important to maintain
>> the data in the form it was sent. This is a valid argument. But note
>> that in order to add a key, or lappend to an existing key, you have to
>> search the array first. A large array will consume more resources
>> during creation than a list containing the same data.
>
> I'm not criticising other parts of what they're doing. Just the process
> of going from key to bucket chain to add to. Theirs is particularly lame
> as it loads all keys that start with the same letter into the same
> bucket (so trivial to attack it's unfunny) but it's not as lame as when
> it was all just an alist. It was in older versions of Apache; adds might
> have been quick, but lookups would have sucked...

I really wish I could figure out how anyone ever got the idea that
arrays are better than lists if the data is only used a few times.

Somehow you have to pay for the creation of the array. This makes
sense if you need continuous access to the same data over a long
period of time.

Another important point about Tcl arrays is that the history suggests
that they were designed as a method for mapping strings to opaque
structures (or pointers, whatever). The string-data to binary-data
mapping is one of the most important features of the Tcl language. So
you can't use a list to map strings to an opaque structure., because
lists are "values".

Somehow this important fact is being turned upside down. We are
focusing on the key and not the value.

Now, back to performance: You have to pay for the creation and
maintenance of the array. This is an up-front cost. If you have 10
minutes to visit Disneyland, your per-ride cost will be very high.

Enough talk. Here is a script which uses the FNV algorithm to create
an array, using the collide keys. After the script I have timings
(create the array with the keys, create a name-value list with the
keys). The result is that it takes so much time to create the array, I
have plenty of time left over to do very slow linear searches...and I
can do them case-insensitive. In addition I can recover the order of
the input list.  Why anyone thinks it is a good idea to spend time
destroying information is beyond me. But basically that is what any
list->array transform gives you.

# original collide proc
proc collide {level {accum ""} } {
    global a
    if {$level} {
        incr level -1
        collide $level "aj$accum"
        collide $level "ba$accum"
    } else {
        lappend a($accum) 1
    }
}

# new proc just generates keys
proc collideNames {level {accum ""} } {
    global names
    if {$level} {
        incr level -1
        collideNames $level "aj$accum"
        collideNames $level "ba$accum"
    } else {
        lappend names $accum
    }
}

# generate lots of keys
collideNames 15

set hostChoice [list host Host hosT hOst hoSt HoSt HOst hoST]
set valueIndex 0
set hostIndex 0

# create array using keys, note collide runs faster because it is a
compiled proc
# the loop here matches the following list creation loop.
puts "create host array: [time {

foreach name $names {
    lappend a($name) [incr valueIndex]
    if {!($valueIndex % 10007)} {
	incr hostIndex
	lappend a([lindex $hostChoice $hostIndex]) $valueIndex
    }
}
}]"

puts [array statistics a]

set valueIndex 0
set hostIndex 0

puts "create host list: [time {

foreach name $names {
    lappend headerList [list $name [incr valueIndex]]
    if {!($valueIndex % 10007)} {
	incr hostIndex
	lappend headerList [list [lindex $hostChoice $hostIndex] $valueIndex]
    }
}
}]"

puts "finding First Host: [time {set hosts [lsearch -inline -index 0
$headerList Host]}]"
puts "finding First Host nocase: [time {set hosts [lsearch -inline
-index 0 $headerList Host]}]"

puts "finding All Hosts: [time {set hosts [lsearch -inline -all -index
0 $headerList Host]}]"

puts "finding All Hosts nocase: [time {set hosts [lsearch -inline -all
-nocase -index 0 $headerList Host]}]"
puts $hosts

### End script

The only thing interesting here is the extra time required to create
the array. But lets not even get into the problems of searching for a
"host" header with no case:

% array names a -regexp (h|H)(o|O)(s|S)(t|T)
hosT hOst Host

% time {array names a -regexp (h|H)(o|O)(s|S)(t|T)}
120592 microseconds per iteration

Then you have to retrieve the values.

Here is what I get with a list representation:

create host array: 1025203 microseconds per iteration (wow! over one second)
32771 entries in table, 16384 buckets
number of buckets with 0 entries: 8342
number of buckets with 1 entries: 604
number of buckets with 2 entries: 1144
number of buckets with 3 entries: 1657
number of buckets with 4 entries: 1616
number of buckets with 5 entries: 1271
number of buckets with 6 entries: 848
number of buckets with 7 entries: 487
number of buckets with 8 entries: 237
number of buckets with 9 entries: 111
number of buckets with 10 entries: 47
number of buckets with 11 entries: 15
number of buckets with 12 entries: 4
number of buckets with 14 entries: 1
number of buckets with 10000 or more entries: 0
average search distance for entry: 3.0

create host list: 381651 microseconds per iteration (2.5 times faster)

finding First Host: 1867 microseconds per iteration

finding First Host nocase: 1697 microseconds per iteration

finding All Hosts: 6618 microseconds per iteration

finding All Hosts nocase: 40670 microseconds per iteration

Host keys:
{Host 10007} {hosT 20014} {hOst 30021}

Somewhere the conventional wisdom about list vs. array access needs to
be discredited.

tom jackson

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

The Tool Command Language implementation

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm