judy-1.0.5 with a new version "JudyGet.c"

                          - by -
                        Doug Baskins
                    dougbaskins@yahoo.com
                         8Nov2012


I have had a goal of improving the performance of Judy by 2X or better
for some time now.  It has been a tough problem to tackle and will not
be ready for awhile.  Recently, people have been adding enhancements to
Judy-1.0.5 and have requested that they be included with Judy.  Faster
JudyNext and friends, JudySL with string length specifying and JudyHS
tree walking come to mind.  These have been on my to do list for a long
time, but have been lower priority than my 2X goal.  So I decided to
take another look at judy-1.0.5, which has not had code changes for over
7 years, to see if I could apply some of the simple performance
modifications I have learned from my 2X project.  The single file
JudyGet.c is the biggest and simplest improvement I could find.  This is
it.  This new JudyGet.c avoids the DCD check on branches by delaying the
check until the final node (the Leaf node).  It does this by counting
the levels (States--;) while walking down the tree; if one is skipped,
the DCD check is done only on the final node.  A DCD check is needed
when a level (or state) is skipped during a tree walk.  (In the 2X Judy,
I changed the name to CIb -- Common_Index_bits.)  In practice, more than
one level is almost never skipped, so the check is seldom needed --
usually never in 32-bit Judy, and once in 64-bit Judy.
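
To make the idea concrete, here is a toy sketch in plain C.  This is
NOT the actual JudyGet.c code -- the Node structure, the Prefix field
and the Get() name are all illustrative inventions.  A narrow pointer
that skips a level leaves some key bytes undecoded, and only then does
the leaf need the single deferred prefix (DCD) comparison:

    #include <stdint.h>
    #include <stddef.h>

    /* Toy 32-bit trie node, decoding one key byte per level.  Levels
       4..1 are branches; Level 0 is a leaf.  A child pointer may skip
       levels (a narrow pointer), which is what makes a DCD check
       necessary in the first place. */
    typedef struct Node
    {
        int          Level;       /* 4..1 = branch, 0 = leaf            */
        uint32_t     Prefix;      /* key bits decoded above this node,
                                     low byte zero (valid at a leaf)    */
        uint8_t      LeafByte;    /* lowest key byte (valid at a leaf)  */
        struct Node *Child[256];  /* subexpanses (valid at a branch)    */
    } Node;

    int Get(Node *P, uint32_t Index)      /* 1 = found, 0 = not found */
    {
        int States  = P->Level;   /* expected level, counted down      */
        int Skipped = 0;

        while (P->Level > 0)      /* walk down to the leaf */
        {
            P = P->Child[(Index >> (8 * (P->Level - 1))) & 0xff];
            if (P == NULL)
                return 0;
            States--;                  /* a direct child is 1 level down */
            if (P->Level != States)    /* narrow pointer skipped levels  */
            {
                Skipped = 1;
                States  = P->Level;    /* resynchronize and keep walking */
            }
        }
        /* One deferred DCD check at the leaf replaces a check at every
           skipped branch; it is needed only if a level was skipped.
           (With no skip, every high byte was decoded by branching.)   */
        if (Skipped && (P->Prefix != (Index & ~(uint32_t)0xff)))
            return 0;
        return (P->LeafByte == (uint8_t)Index);
    }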

Extensive performance testing against the original Judy is included
here.  Compare the "4.6_i3570_32Gb_Orig" and "4.6_i3570_32Gb"
directories, and see the "bat2" file, which built and ran the tests.
Also included are the "4.6_i3570_8Gb" tests, done with RAM that has a
CAS latency of 9 instead of 10.  Both were run at a 1600 MHz RAM speed.
The processor is an i5-3570K clocked at 4.6 GHz on an ASUS Maximus V
motherboard.  The 4.6 GHz is achieved with virtually one click in the
BIOS and runs without a hitch.  I was surprised at how much improvement
a CAS of 9 made; I need to investigate this more.  I really only ran
that test to check the RAM (for my nephew), but included it here for
interest.  I use "jbgraph" as a front-end script to gnuplot to compare
the results:

    # jbgraph -c5 -NL -Lx xxxxxx.plot

The -c5 selects column 5 in xxxxxx.plot; you can add as many plot files
and columns as desired.  -NL == No Log, -Lx == Log x-axis.

Modified versions of JudyPrivate.h and JudyMallocIF.c are also
included, but they are NOT necessary for the improved performance.

JudyPrivate.h is modified to use __builtin_popcount() when __GNUC__ is
defined by the compiler.  This allows testing with and without the
"popcnt" (bit counting) instruction, which is enabled with -msse4.2 as
a compile option.  My tests indicate that it does not help
significantly (in judy-1.0.5).  The 2X version of Judy is another
story: it uses bit counting for foreshortened Leaf searches and for
bigger, dynamically sized bit-mapped objects.  The size of JudyLGet.o
and Judy1Test.o would expand several fold if it were not for the
"popcnt" instruction, which is used many times.
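
The guard is of this general form -- a sketch only; the BITCOUNT32 name
is my own here, not the actual macro in JudyPrivate.h, and the fallback
is just the classic portable SWAR bit count:

    #include <stdint.h>

    #ifdef __GNUC__
    /* gcc and icc emit the hardware "popcnt" instruction for this
       builtin when compiled with -msse4.2 (or -mpopcnt); otherwise
       they emit a software sequence or a library call. */
    #define BITCOUNT32(WORD)  ((unsigned)__builtin_popcount(WORD))
    #else
    /* Portable fallback: the classic SWAR population count. */
    static unsigned BitCount32(uint32_t w)
    {
        w = w - ((w >> 1) & 0x55555555);
        w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
        return (((w + (w >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24;
    }
    #define BITCOUNT32(WORD)  BitCount32(WORD)
    #endif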

JudyMallocIF.c is modified (only when -DRAMMETRICS is defined) to
measure the memory used by each kind of object Judy allocates during
tree growth.  This helps show which objects have an effect on
performance.  My conclusion is that the new JudyGet.c is a great help
to 64-bit Judy, but not to the 32-bit version.
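
The flavor of the instrumentation is roughly this (a sketch; the
counter names and the particular objects shown are my own assumptions,
not the actual code in JudyMallocIF.c):

    #include <Judy.h>   /* for Word_t */

    #ifdef RAMMETRICS
    /* One running total (in words) per Judy object type; each malloc
       or free site adds the +/- size of the object it handles. */
    static Word_t j__TotalWordsJBL;   /* linear branches */
    static Word_t j__TotalWordsJBB;   /* bitmap branches */
    static Word_t j__TotalWordsJLL;   /* linear leaves   */
    #define RAM_ADD(Total, Words)  ((Total) += (Words))
    #else
    #define RAM_ADD(Total, Words)  /* compiled out when not measuring */
    #endif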

The 32-bit Judy needs more extensive modifications to help performance.
Judy1 - 32 bit needs deletion of the Leaf1 (already done in 64 bit).
This would get rid of the huge "mountain" in the performance curve
caused by that object.  (I have no explanation of why a linear search
of a small number of byte objects takes so long.)  JudyL - 32 bit needs
to get rid of the cJL_JPIMMED_2_0*.  The reason why this helps is a big
mystery to me.  The code difference is the lack of a PPjp, which is
only used for a DCD check in the IMMED_2_0* code.  The code and
execution overhead of the PPjp is minimal, and the frequency of
execution is minimal, but the performance difference is dramatic.  I
think this is also part of the mystery of why compiler options can make
such dramatic performance differences.  I am not willing to do the
testing of the Insert/Delete code modifications needed for this change
-- at this time.

I have found -O2 to be the overall best with both gcc and icc.  -O3 is
NOT a good choice.  Also, I have been misled far too often by thinking
that -Os is a good choice.  The code differences (in the *.s files)
seem to be branches to common exit code rather than leaving it in-line.
Does a "jmp" instruction sometimes incur a mis-predict?  I wouldn't
think so.

The development of a new and much faster Judy has been hampered greatly
by the "mystery".  I have seen greater-than-2X changes in performance
in areas where the source code has NOT changed.  This occurs even when
the data structure (shape) of the tree is exactly the same.  I have
been trying to find the cause for several years now.  At first I
thought it had something to do with code alignment, then with TLB
misses changing with conditions outside of the Benchmark.  Then I
thought the DATA was stealing a TLB entry from the TEXT code when it
shouldn't.  But I had no way to measure this.  However, after
experimenting with the placement of the JudyGet code on 1 or 2 adjacent
pages, I concluded this was not probable (i.e., code needing 1 vs. 2
TEXT TLB entries to execute made no difference).  In frustration, I
experimented with a nop instruction placed in the Judy1Test.s file in a
place that is never executed, and found that it could make a 2X change
in performance (especially in a small *.o walking a large tree, > 10^9
population).

Then I discovered that VERY small changes in the code that did linear
searches could have a dramatic effect on performance.  This suggested
that branch prediction was the culprit.  But I have not been able to
figure out when branch prediction works and when it does not.  I would
like to have some way to "tell" the compiler when a branch is (or is
not) going to be taken with greater than 99% probability.  I have
noticed that the Intel icc compiler attempts to statically predict (in
the *.s file comments) the chances of a branch being taken, and in
reality it is often very wrong.  However, the icc compiler will produce
code that is much faster than the code from the gcc compiler.  The
reason (in the *.s) is not obvious to me.  See JudyL-64Bit.png and
Judy1-64Bit.png for a starter.
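
For what it is worth, gcc (and icc) do provide a static hint for
exactly this: __builtin_expect().  Whether the hardware predictor
actually rewards the resulting code layout is, of course, part of the
same mystery.  Here is a sketch of its use in a linear search -- the
LIKELY/UNLIKELY macro names are the usual convention, not anything in
Judy:

    #include <stdint.h>

    /* Hint that a condition is almost always true/false. */
    #define LIKELY(x)    __builtin_expect(!!(x), 1)
    #define UNLIKELY(x)  __builtin_expect(!!(x), 0)

    /* In a linear leaf search, any single compare is almost always a
       miss, so the match is the unlikely branch: */
    static int LinearSearch(const uint16_t *Leaf, int Count,
                            uint16_t Index)
    {
        for (int i = 0; i < Count; i++)
            if (UNLIKELY(Leaf[i] == Index))
                return i;          /* found at position i */
        return -1;                 /* not in this leaf */
    }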

I have come to a very, very general conclusion.  Making the generated
code smaller (# size *.o) by design (not by compiler option) seems
generally to help performance.  However, there are exceptions that
simply "fly in the face of reality".  See the best and worst of the 4
different tests -- 32/64-bit Judy1 and JudyL.  There are 18 variations
of each.  Two different compilers (icc, gcc), with and without
"popcnt", and 3 compiler options (-Os, -O2 and -O3).  Obviously, there
is a "mystery" yet to be solved.

If you want to experiment with the new JudyGet.c, please email me and I
will send you a tarball.  If I get overwhelmed, I will put it on
sourceforge.net as version judy-1.0.6.  For now, I would like to keep
track of where the new JudyGet.c goes -- just in case a bug is
discovered.

Feedback will be appreciated -- especially on the "mystery".

Thanks for your interest,

doug
 
Doug Baskins <dougbaskins@yahoo.com>