#61 Hash table improvement

closed-accepted
None
5
2003-09-14
2003-03-07
No

This patch addresses two areas:

1) The current hash function ignores the high byte
of UTF-16 characters. This is not good.
2) There is a cheap way to switch from the current
method of linear probing (to resolve collisions)
to the slightly better double hashing.

Ad 1) hashing of wide characters:
Currently, the algorithm is like this:
h = h * 33 + (unsigned char)c;
where h is the hash value and c is a character
in the string argument (char or wchar_t).

It seems this hashing algorithm belongs to a family
of algorithms of the type:
h = h * <some number> + character
where <some number> is taken froma list of tried
and proven values like 31, 33, 37, 41, 65599 which
should be (but are not always) prime numbers.

I added a wchar_t version (#ifdef XML_UNICODE)
of this type: h = h * 65599 + (unsigned short)c;

Ad 2) double hashing:
Currently the hash table index is calculated from
the hash value by masking out the lower bits
according to table size. This works because the
table size is a power of 2. But this also means that
some bits of the hash value are unused. From
these we can therefore generate the second hash
value with little extra effort.

The patch also updates the inlined hash table
implementation used for detecting duplicate
prefixed attributes (in function storeAtts()).

See the attached file hashpatch1.diff
(to be applied against xmlparse.c rev. 1.109).

Discussion

  • Karl Waclawek

    Karl Waclawek - 2003-03-07

    First patch version.

     
  • Karl Waclawek

    Karl Waclawek - 2003-03-24

    Logged In: YES
    user_id=290026

    Improvement to first patch version (hashpatch1.diff):

    For calculating second hash, don't overwrite the
    rightmost bit of the hash value's unused portion.
    This means: change ">> (power)" to ">> ((power) - 1)":

    #define PROBE_STEP(hash, mask, power) \ ((unsigned char)((((hash) & ~(mask)) >> ((power) - 1)) &
    ((mask) >> 2)) \ + | (unsigned char)(1))

     
  • Karl Waclawek

    Karl Waclawek - 2003-03-29

    Logged In: YES
    user_id=290026

    Further change for PROBE_STEP macro:
    To prevent overflow (the result should fit into one Byte)
    we replace "& (mask >> 2)" with "& (unsigned long)0xFF":

    #define BYTE_MASK ((unsigned long)0xFF)
    #define PROBE_STEP(hash, mask, power) \ ((unsigned char)((((hash) & ~(mask)) >> ((power) - 1)) \ & BYTE_MASK) \ | (unsigned char)(1))

    which limits the maxium step size to 0xFF.
    Large steps are not a good idea anyway, since that
    might lead to cache misses.

     
  • Karl Waclawek

    Karl Waclawek - 2003-04-01

    Logged In: YES
    user_id=290026

    The last change is not necessary, since the cast
    to unsigned char takes care of overflow. It actually
    introduces a bug, since for small sizes the step size
    can now be larger than the table size.

    We can change back and write it like this:

    #define SECOND_HASH(hash, mask, power) \ ((((hash) & ~(mask)) >> ((power) - 1)) & ((mask) >> 2))
    #define PROBE_STEP(hash, mask, power) \ ((unsigned char)((SECOND_HASH(hash, mask, power)) |
    1))

     
  • Karl Waclawek

    Karl Waclawek - 2003-08-27

    Improved patch

     
  • Karl Waclawek

    Karl Waclawek - 2003-08-27

    Logged In: YES
    user_id=290026

    Attached new patch file hashpatch2.diff.
    Includes changes suggested in comment from 2003-04-01.
    Also changed hash function to be closer to Python's string
    hash and improved comments on the inlined attributes hash
    table in storeAttributes().

     
  • Karl Waclawek

    Karl Waclawek - 2003-08-28

    Logged In: YES
    user_id=290026

    Added one more improvement (attached as hashpatch3.diff):

    Removed the need to initialize the hash table for the duplicate
    prefixed attributes check (see function storeAttributes()).

    The size of the hash table will be at least twice the largest
    number of prefixed attributes encountered in any one element.
    Once that element is processed, any subsequent call to
    storeAttributes() would need to clear the entire (large) hash
    table as long as there is at least one prefixed attribute.

    The solution was to add a version counter such that all
    hash table entries get a version flag, which is updated
    with every call to storeAttributes().

     
  • Karl Waclawek

    Karl Waclawek - 2003-08-28

    Remove hash table initialization.

     
  • Karl Waclawek

    Karl Waclawek - 2003-09-13

    Logged In: YES
    user_id=290026

    Added three e-mail messages I sent to Fred Drake regarding
    benchmarks for this patch.

    The file names are Benchmarking1,2,3.txt.

     
  • Karl Waclawek

    Karl Waclawek - 2003-09-13

    E-mail message re benchmarking the patch

     
  • Karl Waclawek

    Karl Waclawek - 2003-09-13

    E-mail message re benchmarking the patch

     
  • Karl Waclawek

    Karl Waclawek - 2003-09-13

    E-mail message re benchmarking the patch

     
  • Karl Waclawek

    Karl Waclawek - 2003-09-14

    Logged In: YES
    user_id=290026

    Patch for xmlparse.c committed on 2003/09/13.
    New revision: xmlparse.c 1.110.

     
  • Karl Waclawek

    Karl Waclawek - 2003-09-14
    • status: open --> closed-accepted
     

Log in to post a comment.

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks