[Geoip-c-discuss] Performance patches for the geoip-c library

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hello - I love the MaxMind GeoIP databases!

I have made a number of performance changes to the C GPL library that
I am attaching to this mail for anybody that may want to use
them. I'll describe the changes below, but first let me get your
attention with the fabulous benchmarks ;). I did them on Linux. The mods
might need a little love for non-Linux environments.

The first set of benchmarks are taken using the standard benchmark
program in the 1.4.4 geoip distribution doing the country lookups. The
measurements are taken on a core2-duo - the benchmark runs single
threaded. Line A is the standard 1.4.4 distribution for comparison,
line b has my patch applied. All numbers are transactions per second
(more is better).

Country Lookup
   NO_FLAGS      CHECK_CACHE    MEMORY_CACHE     (MEMORY_CACHE | CHECK_CACHE)
A: 225,140       168,302        618,552          552,760
B: 226,918 +0%   233,735 +39%   2,112,676 +242%  5,412,988 +879%

The next set is the same benchmark, but is the city db lookups instead
of the country ones. Most of my testing was with the city database.

   NO_FLAGS      INDEX_CACHE   (INDEX_CACHE | CHECK_CACHE)    MEMORY_CACHE
A: 124,481       687,534        282,352                        2,526,315
B: 128,369 +3%   724,688  +5%   653,975  +132%                 6,168,270 +144%

I have another benchmark I conjured up - it is derived from the
standard one. Mine starts by choosing 3 million random IP addresses
(still represented as text dotted decimal) and measures how long it
takes to call the city api on each of them. The standard one just
looks up the same 4 addresses over and over again - my test has more
realistic cache behavior and fairly exercises the database.

I measure the test 3 ways: (A) With the 1.4.4 GeoIP distribution, (B)
against my patches using the standard city api, and finally (C) with an
API I added that lets you provide the memory to put the city record
into - this takes some pressure off the allocator.

Measurements in TPS (more is better). % improvemts are vs (A).

   NO_FLAGS    CHECK_CACHE   MEMORY_CACHE     (MEMORY_CACHE | CHECK_CACHE)
A: 208,281     148,201       2,830,474         500,002
B: 210,451 +1% 203,957 +38%  4,910,321 +73%    2,944,836  +489%
C: 211,968 +2% 206,008 +39%  5,732,488 +103%   3,208,800  +542%

The changes come from 3 different areas.

The most prominent change is the addition of a new in memory index
that is created only when MEMORY_CACHE is provided. This index costs
about 37 MB of extra memory - a classic time vs space tradeoff. If the
maintainers want to integrate these patches into the base
distribution, perhaps a new flag should be used instead of
piggybacking on MEMORY_CACHE - though 37 is not an especially large
number if you're willing to set MEMORY_CACHE.

The new index, instead of using the radix tree layout that is included
in the data file, creates a hash table of the first 17 bits of the
netmask.  masks shorter than 17 bits are simply hashed as their set of
/17 subnets. Masks longer than /17 (which is 99.7% of the database)
contain pointers to a "super index node" that works a lot like the
radix tree except each node represents 3 bits. This arrangement means
that you get the first 17 bits in one memory lookup (instead of 17),
for a pretty reasonable memory cost - especially when you consider the
low levels of the tree are pretty dense to start with. The
super-index-nodes cover up to three levels of tree using 8 32 bit ints
that are directly indexed from the 3 bits of interest - addresses
longer than 20 bits are chained to the next super index node using a
flag bit in the int.  Since one node covers 3 levels
of tree, the most accesses any one address should require is 6 (the
hash lookup, and 5 super-index-nodes). Almost 1/3 of addresses in the
DB can be looked up in 4 or less, and 88% are done in 5 or less. The
radix tree required on average 27 accesses per lookup.

going with an initial /16, /4, /4, etc.. approach instead of /17, /3,
/3.. improves the all cache case 5-10% TPS at the cost of almost
doubling the ram overhead.. so /17-3 seems to be the sweet spot.

This is all transparent to the API and does not require any changes to
the disk format. It is directly used by _GeoIP_seek_record() and thus
optimizes the lookup of any DB..

The next largest change is to the handling of the CHECK_CACHE
flag. The existing code was mtime and stat() based. If CHECK_CACHE was
set the stat() was repeated on each lookup and the resulting mtime
value was compared to a cached value. However, mtime has only a 1
second granularity so it doesn't make sense to do this thousands and
thousands of times each second. The easy optimization here is to
substitute gettimeofday() for stat() just to make sure we are on a new
second before calling stat(). This is kind of a silly optimization -
if you're serious about performance and want the CHECK_CACHE
functionality the right thing to do would be implement inotify()
elsewhere in the application and use that in an event driven style to
determine when to reload the database, instead of polling with system
calls. Nonetheless - this technique has a good impact on the library.

The final changes are to the City-DB extract_record code. There were
some double traversals of strings in there that could be optimized,
and a number of string allocations could be condensed into the initial
record allocation. Avoiding malloc/free at very high transaction rates
is pretty important. In that vein, I also added another city lookup
function that allows the caller to supply an address to put the data
into - which also saves an allocation.

Best Regards,
Patrick