From: Mark T. <ma...@di...> - 2012-05-09 22:53:49
Thanks for testing :)

Demian Katz <dem...@vi...> writes:

> Thanks, Mark. I've done the build and it seems to be working. A
> couple of things, though:
>
> 1.) The compiled browse-handler.jar file is now much larger than it
> used to be (up from 3.8MB to 5.2MB); any idea what could have caused
> this growth? It doesn't seem like your minor changes to the code
> should have had this much impact. It looks like I haven't recompiled
> the browse logic since updating to Solr 3.5 (it was last built with
> 3.4), so that's the most likely explanation... but it's a little
> surprising.

Hm, that is surprising. One thing I notice is that your jar files seem
to have duplicate file entries. I didn't even know you could do that,
but there you have it :)

I notice that my build.xml doesn't force a "clean" when you build, so I
wonder if it might be appending new files to a pre-existing jar in that
directory. I've just modified the build process to clean up more
aggressively before the build, so I'm hoping that if you try again you
won't see the same behaviour.

> 2.) Now that I have so many authorities, building the index takes
> much, much longer than before -- several hours. I don't think I'm
> going to be able to do this nightly anymore because the process is
> just too expensive. Do you think there might be a way to speed this
> up? I'm not sure exactly what is eating all the time, but I assume
> it's lookups in the Solr authority index. Since my authority data is
> going to update much less frequently than my bib data, is there some
> way we could avoid repeating this work on every update? For example,
> could the prior, existing SQLite database be used as a cache for
> authority data to avoid duplicate analysis, only going to Solr when a
> new/changed heading is encountered? If this would save time, perhaps
> a switch could be added to the index script so that we do a full
> update once a week and faster incrementals the rest of the time.
> Anyway, just brainstorming -- since I don't know the architecture of
> this very well, it's entirely possible that there's nothing we can do
> to make it better... but it would be nice!

I've just loaded the FAST data in locally, and it sure is slow. Yeesh.
I started it running on my "topic" browse here and got bored after 30
minutes and stopped it.

Unfortunately the change I made to support larger numbers of "instead
of" headings actually made the performance significantly worse for
large data sets like this. I've just pushed up a fix for this too:
instead of always requesting $BIGNUMBER hits from Lucene, I now just
start with a low number (20) and double it on demand for the small
number of cases where that ends up being necessary.

That gets my build time for my "topic" browse down to 2 minutes 50
seconds on my machine here, so hopefully you see a similar speedup.

Thanks,

Mark

--
Mark Triggs
<ma...@di...>
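
For anyone reading along in the archive, the "start with a low number
and double it on demand" idea from the message above can be sketched
roughly as follows. This is only an illustration, not the actual
browse indexer code: fetch_all_matches, fake_search and FAKE_INDEX are
hypothetical names, the search callable merely stands in for a Lucene
query, and Python is used purely for brevity.

```python
def fetch_all_matches(search, query, initial_limit=20):
    """Return every hit for `query` while requesting as few as possible.

    `search(query, limit)` is a hypothetical callable that returns at
    most `limit` hits; it stands in for a real Lucene search call.
    """
    limit = initial_limit
    while True:
        hits = search(query, limit)
        # Fewer hits than requested means nothing was cut off.
        if len(hits) < limit:
            return hits
        # The result list came back full, so more hits may exist:
        # double the limit and ask again.
        limit *= 2


# Hypothetical in-memory "index", used only to exercise the sketch.
FAKE_INDEX = {
    "common heading": ["see-from %d" % i for i in range(3)],
    "rare heading": ["see-from %d" % i for i in range(75)],
}
requested_limits = []


def fake_search(query, limit):
    # Record each requested limit so the doubling is visible.
    requested_limits.append(limit)
    return FAKE_INDEX.get(query, [])[:limit]


common = fetch_all_matches(fake_search, "common heading")
rare = fetch_all_matches(fake_search, "rare heading")
```

In this sketch a heading with only a few matches is resolved by a
single small request, and only the occasional heading with many
matches pays for the extra round trips.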