From: John P. R. <ro...@cs...> - 2013-10-12 05:13:16
Hi all:
Based on my original email, I put together a warm_cache function.
The test bed is demo.py against postgres (on the local loopback) under
cygwin. I set the rdbms cache size to 10000 and added 999 users to the
default setup for a total of 1001 users.
Without using warm_cache, I get the following stats:
Time elapsed: 2.218750s
Cache hits: 17104, misses 1014. Loading items: 0.578125 secs.
Filtering: 0.000000 secs.
cache_hits_status: 14 cache_misses_status: 8
cache_hits_priority: 8 cache_misses_priority: 5
cache_hits_user: 17082 cache_misses_user: 1001
Enabling warm_cache from getnode() (right at the end of getnode(),
before the final 'return node') with a batch of 100 entries loaded
per call, I get:
Time elapsed: 1.765625s
Cache hits: 18106, misses 12. Loading items: 0.031250 secs.
Filtering: 0.000000 secs.
cache_hits_status: 21 cache_misses_status: 1
cache_hits_priority: 12 cache_misses_priority: 1
cache_hits_user: 18073 cache_misses_user: 10
With a batch of 500 entries loaded at a time:
Time elapsed: 1.671875s
Cache hits: 18114, misses 4. Loading items: 0.000000 secs.
Filtering: 0.000000 secs.
cache_hits_status: 21 cache_misses_status: 1
cache_hits_priority: 12 cache_misses_priority: 1
cache_hits_user: 18081 cache_misses_user: 2
So it looks like about a 22% improvement.
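For reference, here is how that figure falls out of the timings above; the two warmed runs land on either side of 22%:

```python
# Percentage improvement of the two warmed runs over the
# unwarmed 2.218750s baseline quoted above.
base = 2.218750
for warmed in (1.765625, 1.671875):
    pct = 100.0 * (base - warmed) / base
    print("%.6fs -> %.1f%% faster" % (warmed, pct))
```

The batch-of-100 run comes out around 20% faster and the batch-of-500 run around 25%, so "about 22%" is a fair middle figure.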
(Note the rdbms cache size is set to 10000. If I use a size of 100 and
don't warm the cache, the time elapsed is 4.828125s. Currently, if the
cache isn't large enough to hold all the items, warm_cache fails since
it just keeps blowing out the cache.)
However, I have an interesting issue. If I call:
self._materialize_multilinks(classname, nodeid, node, mls)
from within warm_cache, the for loop:
for values in self.cursor:
aborts early. I assume what's happening is that the old result set is
wiped out by the database access used to materialize the multilinks.
If I don't call self._materialize_multilinks(classname, nodeid, node,
mls), things work as expected. The numbers above were gathered without
the self._materialize_multilinks() calls.
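The cursor-clobbering behavior can be reproduced outside Roundup. Here is a minimal sqlite3 sketch (invented table and column names, not Roundup's schema) showing the failure and one possible workaround: draining the outer result set with fetchall() before issuing the nested per-row queries.

```python
import sqlite3

# Standalone demonstration (not Roundup code) of the suspected cause of
# issue 1: re-executing on a cursor while iterating its result set
# discards the remaining rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table _user (id integer, _username text)")
cur.executemany("insert into _user values (?, ?)",
                [(i, "user%d" % i) for i in range(5)])

# Broken pattern: a nested query (standing in for
# _materialize_multilinks) reuses the cursor mid-iteration.
cur.execute("select id from _user")
seen_broken = []
for (nodeid,) in cur:
    seen_broken.append(nodeid)
    cur.execute("select _username from _user where id = ?", (nodeid,))
    cur.fetchone()

# Workaround: drain the outer result set first, then issue the nested
# queries; the loop now sees all five rows.
cur.execute("select id from _user")
seen_safe = []
for (nodeid,) in cur.fetchall():
    seen_safe.append(nodeid)
    cur.execute("select _username from _user where id = ?", (nodeid,))
    cur.fetchone()

print(len(seen_broken), len(seen_safe))
```

The fetchall() approach holds the whole result set in memory, which interacts badly with a large table (though maxitems could cap the damage); opening a second cursor for the multilink queries would be the other obvious fix, if the backend allows it.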
The warm_cache function currently looks like:
def warm_cache(self, classname, maxitems=0, fetch_multilinks=False):
    """Perform the fetch of (some or) all items of class.

    Can use maxitems to limit the number of items.

    FIXME: if fetch_multilinks set to True is a performance
    failure, then it should be set to false by default and
    should be passed through from the caller. Note that this works
    if and only if the item in the cache without fetched multilinks
    can be safely used where fetched multilinks is needed.
    """
    # figure the columns we're fetching
    cl = self.classes[classname]
    cols, mls = self.determine_columns(list(cl.properties.iteritems()))
    scols = ','.join([col for col, dt in cols])
    scols = "id," + scols   # add the id, we need it

    # Get the items. The where clause selects only non-retired
    # items. This means that retired items won't be pre-fetched but
    # the normal getnode() call will put them in the cache
    # if requested.
    sql = 'select %s from _%s where __retired__ = 0' % (scols, classname)

    # invoke the sql
    self.sql(sql)

    # count the number of items we have added to the cache
    items = 0
    # import pdb; pdb.set_trace()
    if __debug__:
        print "rdbms_common.py/Database::warm_cache Sql done iterating\n"
        print "rdbms_common.py/Database::warm_cache Returned items: %s\n" % (
            self.cursor.rowcount)

    local_cursor = self.cursor
    for values in local_cursor:
        if maxitems and items >= maxitems:
            if __debug__:
                print "rdbms_common.py/Database::warm_cache maxitems loaded, breaking.\n"
            break

        # nodeid is the first thing returned
        nodeid = values[0]
        if __debug__:
            print "rdbms_common.py/Database::warm_cache Nodeid %s loading" % nodeid

        key = (classname, str(nodeid))
        if key in self.cache:
            if __debug__:
                print "rdbms_common.py/Database::warm_cache Key in self.cache, continuing\n"
            continue
        else:
            items = items + 1

        # make up the node
        node = {}
        props = cl.getprops(protected=1)
        for col in range(len(cols)):
            name = cols[col][0][1:]
            if name.endswith('_int__'):
                # XXX eugh, this test suxxors
                # ignore the special Interval-as-seconds column
                continue
            value = values[col + 1]  # offset by 1 because of prepended 'id'
            if value is not None:
                value = self.to_hyperdb_value(props[name].__class__)(value)
            node[name] = value

        if fetch_multilinks and mls:
            # if this runs it kills the for loop over self.cursor;
            # maybe it stomps on the cursor with another request?
            self._materialize_multilinks(classname, nodeid, node, mls)

        # save off in the cache
        key = (classname, str(nodeid))
        self._cache_save(key, node)
Set the rdbms cache size in config.ini larger than the number of items
you plan to cache. If it's too low, warm_cache simply churns the LRU
cache and we end up in a worse situation.
Add a call to warm_cache right at the end of getnode() in
roundup/backends/rdbms_common.py. Add the call:
self.warm_cache(classname, 100)
before the final 'return node' to load in batches of 100.
Invoke with:
CGI_SHOW_TIMING=1 python ./demo.py
(Note that reporting class-level cache hits requires other pretty
obvious patches, but you can see the results in the native code, so I
am not including them.)
There are two issues with the code that I know of (and probably many
more I don't know about).
1) the issue with _materialize_multilinks causing the loop to exit
2) how to load data when the cache size is less than the number of
items. I add data in the order returned by the database. If I
add the first, say, 100 items to a cache sized at 150 items, the next
time warm_cache is called it will nuke the first 50 items from the
cache. Then the third time it's called it will re-add the first 50
items, removing the second 50 items, which it will then nicely
re-add. So it does a very effective job of thrashing the cache.
I think to handle 2, I can look at the top of the LRU cache (i.e. the
most recent item to be added) and skip all the items I get until I
reach that item, then start loading the items after it. I hope this
will prevent me from re-caching data that I have already put in the
cache.
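To make the skip-until-newest idea concrete, here is a toy sketch against a throwaway OrderedDict-based LRU (not Roundup's actual cache, and ToyLRU/warm are invented names). It assumes the database returns rows in a stable order, and it doesn't yet handle the case where the newest cached item has itself been evicted:

```python
from collections import OrderedDict

class ToyLRU:
    """Throwaway LRU stand-in: newest at the end, evict from the front."""
    def __init__(self, size):
        self.size = size
        self.data = OrderedDict()
    def add(self, key, value):
        if key in self.data:
            self.data.pop(key)
        elif len(self.data) >= self.size:
            self.data.popitem(last=False)   # evict the oldest entry
        self.data[key] = value
    def newest(self):
        return next(reversed(self.data)) if self.data else None

def warm(cache, rows, batch):
    """Load up to `batch` rows, resuming after the newest cached key
    so successive calls extend the cache instead of churning it."""
    resume_after = cache.newest()
    skipping = resume_after is not None
    loaded = 0
    for nodeid, node in rows:
        if skipping:
            if nodeid == resume_after:
                skipping = False    # start loading from the next row
            continue
        if loaded >= batch:
            break
        cache.add(nodeid, node)
        loaded += 1
    return loaded

rows = [(i, {"id": i}) for i in range(10)]
cache = ToyLRU(6)
warm(cache, rows, 4)    # first call loads nodeids 0-3
warm(cache, rows, 4)    # second call resumes after 3, loads 4-7
```

After the two calls the cache holds 2 through 7: the second batch evicted 0 and 1 rather than re-adding them, which is the behavior issue 2 is after.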
So let me know if these patches work for you, and if anybody has
bright ideas about how to handle issue 1 or 2, let me know.
--
-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.