This patch addresses two areas:
1) The current hash function ignores the high byte
of UTF-16 characters. This is not good.
2) There is a cheap way to switch from the current
method of linear probing (to resolve collisions)
to the slightly better double hashing.
Ad 1) hashing of wide characters:
Currently, the algorithm is like this:
h = h * 33 + (unsigned char)c;
where h is the hash value and c is a character
in the string argument (char or wchar_t).
It seems this hashing algorithm belongs to a family
of algorithms of the type:
h = h * <some number> + character
where <some number> is taken froma list of tried
and proven values like 31, 33, 37, 41, 65599 which
should be (but are not always) prime numbers.
I added a wchar_t version (#ifdef XML_UNICODE)
of this type: h = h * 65599 + (unsigned short)c;
Ad 2) double hashing:
Currently the hash table index is calculated from
the hash value by masking out the lower bits
according to table size. This works because the
table size is a power of 2. But this also means that
some bits of the hash value are unused. From
these we can therefore generate the second hash
value with little extra effort.
The patch also updates the inlined hash table
implementation used for detecting duplicate
prefixed attributes (in function storeAtts()).
See the attached file hashpatch1.diff
(to be applied against xmlparse.c rev. 1.109).
Log in to post a comment.