From: Eric B. <er...@go...> - 2004-09-28 07:31:25
|
Colin Paul Adams wrote: > I've finished the tracing facility, and fixed several bugs which > showed up along the way. > > The tracing analysis showed that gexslt was taking the same broad > execution path as Saxon, so I next turned my attention to profiling. > > Two of the most prominent routines were compiling a regular > expression, and performing matches against it. > Commenting out the part of the stylesheet that uses regular > expressions (only the tokenize() function uses them at the moment), > reduced the execution time from 133s to 46s, and memory usage was > negligible with the latter. > So I then tried replacing > the regular expression stuff with ST_SPLITTER (the particular use of > tokenize was to split into white-space separated strings) as a one-off > test. > The run time was then 132s, and the memory usage was back to around 500-600 > MB showing that the regular expression library itself is not at > fault. > > But at least it points me to a plausible optimisation - I could avoid > the repeated compilation thousands of times of the same regular > expression, by caching compiled regular expressions in as > DS_HASH_TABLE (indexed by a combination of the text of the regular > expression and the flags), and I could also cache the results of > searches against particular input strings. > > In this particular case that should work very well (although I still > have to investigate where the memory growth is coming from). But it > brings me to the subject of caching. > In the case in question, Only 1 regular expression would ever be > compiled, and only one input string will be matched against - so it > should be a sure-fire winner. (I'm going to go ahead and test it out > today, after I've checked in my current work). > > In general, there are unlikely to be many regular expressions used in > a stylesheet, even if the stylesheet was itself generated by a > transform. But the number of input strings could easily grow very > large, so caching every single one could prove to be very expensive on > memory usage. So I think some UT_CACHE (or should it be DS_CACHE?) > classes (such as UT_LRU_CACHE) might be a good idea. I'm in the middle of a three day meeting, but here are some thoughts about this problem. My experience showed that sometimes using caches gives worse performances than without them. Indeed this is the case when computing the objects each time is faster than the time it takes to manage the objects in the cache. But of course you never know unless you try. Here it looks like using ST_SPLITTER instead of regexps gives the same performances, so the time is probably not spent in the compilation of the regexps (since I don't think there is such similar pre-computation in the ST_SPLITTER implementation). So what is left when processing input strings is merely string traversals (and substrings building). Now, what does it take to manage a cache? You will need to store some computed results in some sort if hash table with keys based on strings. But because the keys are based on strings (probably the regexp strings and/or input strings), in order to get the hash code of these strings some of the characters of the strings need to be traversed. And in order to cope with keys with the same hash code we need to call `is_equal' on these strings which will invoke string traversal. If there are key collisions, then `is_equal' is called several times. And then there is the LRU part to be handled. All in all, it's not obvious that using a cache will give better prerformances because there are string traversals in both cases. Finally, here is a possible cache implementation: my_cache: DS_HASH_TABLE [OBJECT, KEY] ... create my_cache.make (cache_size) object: OBJECT key: KEY ... ...get key... my_cache.search (key) if my_cache.found then object := my_cache.found_item my_cache.remove_found_item my_cache.put_last (object, key) else object := ...some computation... if my_cache.is_full then my_cache.start my_cache.remove (my_cache.key_for_iteration) end my_cache.put_last (object, key) end ...use object... -- Eric Bezault mailto:er...@go... http://www.gobosoft.com |