From: Greg P. <gre...@gm...> - 2010-05-30 23:22:23

Demian and I were having some off-list discussions regarding this, and I
started playing last night on my home laptop. Interestingly enough, I was
asked today at work if I'd like to take our new server with solid-state
storage and benchmark VuFind index build and use times against it. I'll
switch my testing to there, specifically looking at:

1) Building a 'biblio2' core, grabbing date data out of the 'biblio' core
at build time.
2) Optimizing, merging and dictionary building on 'biblio2'.
3) Having Solr swap 'biblio' and 'biblio2' to bring the new core online
with no user downtime (see the sketch below).
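
For step 3, the swap itself is just one call to Solr's CoreAdmin handler.
A minimal sketch in Java (the host, port and error handling are my
assumptions for a stock multicore install, not anything VuFind ships):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CoreSwap {
        public static void main(String[] args) throws Exception {
            // Ask the CoreAdmin handler to swap the two core names. Once
            // this returns, queries on 'biblio' are served by the freshly
            // built index that was 'biblio2' a moment ago.
            URL url = new URL("http://localhost:8983/solr/admin/cores"
                    + "?action=SWAP&core=biblio&other=biblio2");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            int status = conn.getResponseCode();
            conn.disconnect();
            if (status != 200) {
                throw new RuntimeException("Core swap failed: HTTP " + status);
            }
            System.out.println("Swapped 'biblio' and 'biblio2'.");
        }
    }

The swap is atomic as far as searchers are concerned, and anything in
flight should finish against the old core, so 'biblio' never goes dark.
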
I'm keen to keep the script design modular, however, so a database-based
approach would work off the same logic. Having said that, I'm also
considering using SolrMarc from source so I can run some profiling against
it, and this raises the option (for Solr) of querying directly through
embedded Solr to improve performance (that's the plan anyway). Any thoughts
on trying to internalise this logic away from a BeanShell script?

Also, my current data set is the USQ catalogue (maybe 400k records). If
anyone has a larger data set they're able to make available, I'm happy to
try it out.

Ta,
Greg

On 28 May 2010 17:17, Greg Pendlebury <gre...@gm...> wrote:

> I like this idea. Perhaps I'd suggest a second Solr core rather than the
> database, but it's not much of a difference.
>
> USQ indexes into the 'biblio2' core and swaps them to rebuild the index.
> I had this discussion recently with Eoghan actually, and soon after
> stumbled on how to have Solr perform the core swap for you with no
> downtime.
>
> It wouldn't even be that complicated if we could manage to have Solr sync
> the field we're talking about.
>
> Greg
>
> On 28 May 2010 01:50, Demian Katz <dem...@vi...> wrote:
>
>> > This could be done by hashing out the hits for the search. The hash
>> > could be stored in the vufind-user-db together with the saved search.
>>
>> One major obstacle I see to this approach is actually obtaining the
>> list of IDs for hashing in the first place. If you only use the values
>> returned in the first page of Solr results, you won't get useful
>> results -- any time a relevance algorithm is tweaked, everything will
>> change. If you need to get every value out of the Solr results, you're
>> going to be dealing with an unbounded data set that could be large and
>> time-consuming to process. I wonder if there are any Solr plug-ins
>> specifically designed to handle this situation... might be worth
>> probing the solr-user list.
>>
>> For the greatest flexibility, though, I think the cleanest solution to
>> allow various types of feeds, harvests, etc. would be to add two fields
>> to the Solr index -- one for "first time record was added to Solr" and
>> another for "most recent time record was changed." Obviously, if these
>> fields were easy to populate, we would have them already... but here's
>> a crazy thought for a possible approach:
>>
>> 1.) Create a database table that duplicates the contents of the "first
>> time added" field in Solr (simply linking ID to date). This is the only
>> data that needs to survive Solr destruction in order to rebuild
>> everything appropriately if a full rebuild is necessary.
>>
>> 2.) To populate the "first time added" field, look up the current
>> record's ID in the table from step 1. If it is not found, use the
>> current date/time. Be sure to update the database as well as Solr
>> itself. (This could be accomplished in a SolrMarc BeanShell script.)
>>
>> 3.) To populate "most recent time record was changed," look at the MARC
>> 005 field. If this is newer than the value found in step 2, use MARC
>> 005. If it is older, use the value from step 2 (this assumes that we
>> are interested in tracking the times of changes to OUR INDEX, not the
>> data that is in the index). (Again, this could be done in BeanShell.)
>>
>> "First time added" will never change, but "most recent time record was
>> changed" will update according to change dates found in the record
>> itself. Exact details would be different for non-MARC records, but as
>> long as the necessary data was available in the records, the process
>> would be nearly the same.
>>
>> I don't like the extra complexity added by this solution... but I think
>> it might work in the absence of a better solution.
>>
>> - Demian
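
For what it's worth, steps 2 and 3 above could boil down to something like
the following, sketched in Java on top of marc4j (BeanShell will happily
take Java syntax). The firstIndexedDates map and the method names are
hypothetical stand-ins for the real database lookup and SolrMarc wiring:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.HashMap;
    import java.util.Map;
    import org.marc4j.marc.ControlField;
    import org.marc4j.marc.Record;

    public class IndexDates {
        // Stand-in for the step 1 table: record ID -> first-indexed date.
        // A real version would read from and write back to the database.
        private Map<String, Date> firstIndexedDates =
                new HashMap<String, Date>();

        // MARC 005 is "date and time of latest transaction":
        // yyyymmddhhmmss.f, of which only the first 14 characters matter.
        private static final SimpleDateFormat MARC_005 =
                new SimpleDateFormat("yyyyMMddHHmmss");

        // Step 2: first time this record entered OUR index.
        public Date getFirstIndexed(String id) {
            Date first = firstIndexedDates.get(id);
            if (first == null) {
                first = new Date();               // new record: stamp it now
                firstIndexedDates.put(id, first); // and persist it
            }
            return first;
        }

        // Step 3: most recent change as far as OUR index is concerned.
        public Date getLastChanged(Record record, String id) throws Exception {
            Date first = getFirstIndexed(id);
            ControlField f005 = (ControlField) record.getVariableField("005");
            if (f005 == null || f005.getData().length() < 14) {
                return first; // no usable 005: fall back to first-indexed
            }
            Date changed = MARC_005.parse(f005.getData().substring(0, 14));
            // A 005 older than our first-indexed date means the index
            // change is the newer event, so use that instead.
            return changed.after(first) ? changed : first;
        }
    }

Since the step 1 table is the only state outside Solr, a full rebuild just
replays the same logic: "first indexed" comes back from the table and
"last changed" regenerates from 005.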