From: John P. R. <ro...@cs...> - 2013-10-12 05:13:16
Hi all:
Based on my original email, I put together a warm_cache function.
The test bed is demo.py against postgres (on the local loopback) under
cygwin. I set the rdbms cache size to 10000 and added 999 users to the
default setup for a total of 1001 users.
Without using warm_cache, I get the following stats:
Time elapsed: 2.218750s
Cache hits: 17104, misses 1014. Loading items: 0.578125 secs.
Filtering: 0.000000 secs.
cache_hits_status: 14 cache_misses_status: 8
cache_hits_priority: 8 cache_misses_priority: 5
cache_hits_user: 17082 cache_misses_user: 1001
Enabling warm_cache from getnode() (right at the end of getnode(),
before the final 'return node') with a batch of 100 entries loaded
per call, I get:
Time elapsed: 1.765625s
Cache hits: 18106, misses 12. Loading items: 0.031250 secs.
Filtering: 0.000000 secs.
cache_hits_status: 21 cache_misses_status: 1
cache_hits_priority: 12 cache_misses_priority: 1
cache_hits_user: 18073 cache_misses_user: 10
With a batch of 500 entries loaded at a time:
Time elapsed: 1.671875s
Cache hits: 18114, misses 4. Loading items: 0.000000 secs.
Filtering: 0.000000 secs.
cache_hits_status: 21 cache_misses_status: 1
cache_hits_priority: 12 cache_misses_priority: 1
cache_hits_user: 18081 cache_misses_user: 2
So it looks like about a 22% improvement.
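For reference, here is how that figure falls out of the timings above; the two warmed runs land on either side of 22%:

```python
# Percentage improvement of the two warmed runs over the
# unwarmed 2.218750s baseline quoted above.
base = 2.218750
for warmed in (1.765625, 1.671875):
    pct = 100.0 * (base - warmed) / base
    print("%.6fs -> %.1f%% faster" % (warmed, pct))
```

The batch-of-100 run comes out around 20% faster and the batch-of-500 run around 25%, so "about 22%" is a fair middle figure.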
(Note the rdbms cache size is set to 10000. If I use a size of 100 and
don't warm the cache, the time elapsed is 4.828125s. Currently, if the
cache isn't large enough to hold all the items, warm_cache fails since
it just keeps blowing out the cache.)
However, I have an interesting issue. If I call:
self._materialize_multilinks(classname, nodeid, node, mls)
from within warm_cache, the for loop:
for values in self.cursor:
aborts early. I assume what's happening is that the old result set is
wiped out by the database access used to materialize the multilinks.
If I don't call self._materialize_multilinks(classname, nodeid, node,
mls), things work as expected. The numbers above were gathered without
the self._materialize_multilinks() calls.
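The cursor-clobbering behavior can be reproduced outside Roundup. Here is a minimal sqlite3 sketch (invented table and column names, not Roundup's schema) showing the failure and one possible workaround: draining the outer result set with fetchall() before issuing the nested per-row queries.

```python
import sqlite3

# Standalone demonstration (not Roundup code) of the suspected cause of
# issue 1: re-executing on a cursor while iterating its result set
# discards the remaining rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table _user (id integer, _username text)")
cur.executemany("insert into _user values (?, ?)",
                [(i, "user%d" % i) for i in range(5)])

# Broken pattern: a nested query (standing in for
# _materialize_multilinks) reuses the cursor mid-iteration.
cur.execute("select id from _user")
seen_broken = []
for (nodeid,) in cur:
    seen_broken.append(nodeid)
    cur.execute("select _username from _user where id = ?", (nodeid,))
    cur.fetchone()

# Workaround: drain the outer result set first, then issue the nested
# queries; the loop now sees all five rows.
cur.execute("select id from _user")
seen_safe = []
for (nodeid,) in cur.fetchall():
    seen_safe.append(nodeid)
    cur.execute("select _username from _user where id = ?", (nodeid,))
    cur.fetchone()

print(len(seen_broken), len(seen_safe))
```

The fetchall() approach holds the whole result set in memory, which interacts badly with a large table (though maxitems could cap the damage); opening a second cursor for the multilink queries would be the other obvious fix, if the backend allows it.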
The warm_cache function currently looks like:
def warm_cache(self, classname, maxitems=0, fetch_multilinks=False):
    """Perform the fetch of (some or) all items of class.

    Can use maxitems to limit the number of items.

    FIXME: if fetch_multilinks set to True is a performance
    failure, then it should be set to false by default and
    should be passed through from the caller. Note that this works
    if and only if the item in the cache without fetched multilinks
    can be safely used where fetched multilinks is needed.
    """
    # figure the columns we're fetching
    cl = self.classes[classname]
    cols, mls = self.determine_columns(list(cl.properties.iteritems()))
    scols = ','.join([col for col, dt in cols])
    scols = "id," + scols   # add the id, we need it

    # Get the items. The where clause selects only non-retired
    # items. This means that retired items won't be pre-fetched but
    # the normal getnode() call will put them in the cache
    # if requested.
    sql = 'select %s from _%s where __retired__ = 0' % (scols, classname)

    # invoke the sql
    self.sql(sql)

    # count the number of items we have added to the cache
    items = 0
    # import pdb; pdb.set_trace()
    if __debug__:
        print "rdbms_common.py/Database::warm_cache Sql done iterating\n"
        print "rdbms_common.py/Database::warm_cache Returned items: %s\n" % (
            self.cursor.rowcount)

    local_cursor = self.cursor
    for values in local_cursor:
        if maxitems and items >= maxitems:
            if __debug__:
                print "rdbms_common.py/Database::warm_cache maxitems loaded, breaking.\n"
            break

        # nodeid is the first thing returned
        nodeid = values[0]
        if __debug__:
            print "rdbms_common.py/Database::warm_cache Nodeid %s loading" % nodeid

        key = (classname, str(nodeid))
        if key in self.cache:
            if __debug__:
                print "rdbms_common.py/Database::warm_cache Key in self.cache, continuing\n"
            continue
        else:
            items = items + 1

        # make up the node
        node = {}
        props = cl.getprops(protected=1)
        for col in range(len(cols)):
            name = cols[col][0][1:]
            if name.endswith('_int__'):
                # XXX eugh, this test suxxors
                # ignore the special Interval-as-seconds column
                continue
            value = values[col + 1]  # offset by 1 because of prepended 'id'
            if value is not None:
                value = self.to_hyperdb_value(props[name].__class__)(value)
            node[name] = value

        if fetch_multilinks and mls:
            # if this runs it kills the for loop over self.cursor;
            # maybe it stomps on the cursor with another request?
            self._materialize_multilinks(classname, nodeid, node, mls)

        # save off in the cache
        key = (classname, str(nodeid))
        self._cache_save(key, node)
Set the rdbms cache size in config.ini larger than the number of items
you plan to cache. If it's too low, warm_cache simply churns the LRU
cache and we end up in a worse situation.
Add a call to warm_cache right at the end of getnode() in
roundup/backends/rdbms_common.py. Add the call:
self.warm_cache(classname, 100)
before the final 'return node' to load in batches of 100.
Invoke with:
CGI_SHOW_TIMING=1 python ./demo.py
(Note that reporting class-level cache hits requires other pretty
obvious patches, but you can see the results in the native code, so I
am not including them.)
There are two issues with the code that I know of (and probably many
more I don't know about).
1) the issue with _materialize_multilinks causing the loop to exit
2) how to load data when the cache size is less than the number of
items. I add data in the order returned by the database. If I
add the first, say, 100 items to a cache sized at 150 items, the next
time warm_cache is called it will nuke the first 50 items from the
cache. Then the third time it's called it will re-add the first 50
items, removing the second 50 items, which it will then nicely
re-add. So it does a very effective job of thrashing the cache.
I think to handle 2, I can look at the top of the LRU cache (i.e. the
most recent item to be added) and skip all the items I get until I
reach that item, then start loading the items after it. I hope this
will prevent me from re-caching data that I have already put in the
cache.
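To make the skip-until-newest idea concrete, here is a toy sketch against a throwaway OrderedDict-based LRU (not Roundup's actual cache, and ToyLRU/warm are invented names). It assumes the database returns rows in a stable order, and it doesn't yet handle the case where the newest cached item has itself been evicted:

```python
from collections import OrderedDict

class ToyLRU:
    """Throwaway LRU stand-in: newest at the end, evict from the front."""
    def __init__(self, size):
        self.size = size
        self.data = OrderedDict()
    def add(self, key, value):
        if key in self.data:
            self.data.pop(key)
        elif len(self.data) >= self.size:
            self.data.popitem(last=False)   # evict the oldest entry
        self.data[key] = value
    def newest(self):
        return next(reversed(self.data)) if self.data else None

def warm(cache, rows, batch):
    """Load up to `batch` rows, resuming after the newest cached key
    so successive calls extend the cache instead of churning it."""
    resume_after = cache.newest()
    skipping = resume_after is not None
    loaded = 0
    for nodeid, node in rows:
        if skipping:
            if nodeid == resume_after:
                skipping = False    # start loading from the next row
            continue
        if loaded >= batch:
            break
        cache.add(nodeid, node)
        loaded += 1
    return loaded

rows = [(i, {"id": i}) for i in range(10)]
cache = ToyLRU(6)
warm(cache, rows, 4)    # first call loads nodeids 0-3
warm(cache, rows, 4)    # second call resumes after 3, loads 4-7
```

After the two calls the cache holds 2 through 7: the second batch evicted 0 and 1 rather than re-adding them, which is the behavior issue 2 is after.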
So let me know if these patches work for you, and if anybody has
bright ideas about how to handle issue 1 or 2, let me know.
--
-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.