At the University of Virginia a dump is made every night from our SirsiDynix Unicorn ILS of all records that have been added or modified, along with a file containing the ids of records to be deleted.  A cron job uses SolrMarc to process the adds, updates, and deletes.  (If you pass SolrMarc a file whose name ends in .del, it will delete the ids listed in that file, one per line.)  Then, using other utility programs that ship with SolrMarc, the adds, updates, and deletes are merged into the saved full record dump, so that the updated full dump accurately reflects what is actually in the Solr index, and so that a full reindex can be performed if needed without first having to export all of the records from our ILS (which takes as long as or longer than the indexing process for us!).
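
In rough outline, that nightly cron job looks something like the sketch below.  The directory names, file names, and the plain java -jar invocation of SolrMarc are placeholders for illustration, and the final merge step is only hinted at, since the exact utility and options depend on your installation:

  #!/bin/sh
  # Nightly processing sketch -- every path and file name here is a placeholder.
  DUMPDIR=/ils/exports            # where the ILS writes its nightly files
  SOLRMARC=/usr/local/solrmarc    # SolrMarc installation directory
  TODAY=`date +%Y%m%d`

  # Index the added/modified records.
  java -jar $SOLRMARC/SolrMarc.jar $SOLRMARC/config.properties \
       $DUMPDIR/updates_$TODAY.mrc

  # Passing a file whose name ends in .del makes SolrMarc delete the ids
  # listed in it, one per line.
  java -jar $SOLRMARC/SolrMarc.jar $SOLRMARC/config.properties \
       $DUMPDIR/deletes_$TODAY.del

  # Finally, merge the adds/updates and deletes into the saved full record
  # dump with the SolrMarc utility programs, so the dump stays in step with
  # the index.  (The merge command itself is omitted here; it depends on
  # which utilities ship with your SolrMarc distribution.)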

For those who receive a MARC record file of the records to be deleted, rather than a list of ids, one possibility is to preprocess those records (using the printrecord utility) to create the list of ids that SolrMarc expects.  Another possibility is to add a carefully crafted indexing specification to the xxxx_index.properties file, taking advantage of SolrMarc's ability to delete a record when a given indexing rule produces an empty result.
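
For example, something along these lines could turn a file of to-be-deleted MARC records into the .del file that SolrMarc expects.  The printrecord invocation and its output format below are assumptions, so check the utility's own usage notes before relying on it:

  # Pull the record ids (001 control fields) out of a file of MARC records
  # slated for deletion and write them, one per line, to a .del file.
  printrecord deletes.mrc | grep '^001' | sed 's/^001 *//' > deletes.del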

If anyone is interested in more details I can provide them.

-Bob Haschart


M.Fake@lse.ac.uk wrote:
Thanks everyone. We're a Voyager site, but I'm embarrassed to admit that
until these emails we hadn't even noticed the /util directory... Plenty
of good advice here for us to set up a workable deletion routine.

Michael.

Michael Fake
Library Systems Manager
London School of Economics & Political Science
T: +44 (0) 20 7955 6447
E: m.fake@lse.ac.uk

-----Original Message-----
From: Mark Triggs [mailto:mtriggs@nla.gov.au] 
Sent: 09 November 2009 20:25
To: Barnett, Jeffrey
Cc: vufind-general@lists.sourceforge.net
Subject: Re: [VuFind-General] Mass deletions on VuFind

Hi Jeffrey,

Just for interest's sake, we do a little bit of index gymnastics
ourselves in two different ways:

  - Our catalogue is load-balanced across two servers, each with its own
    Apache, code base and Solr.  Every night we apply updates to the
    Solr on the master server; then, once the update/delete/optimize is
    finished, we sync those indexes to the slave.  On that last step we
    get around the need to restart Solr by blowing away the slave's
    index files, rsyncing the new ones into place and then sending a
    '<commit/>' to the slave, which causes it to reopen the index (a
    rough sketch of this step appears after this list).

  - We run an internal "staff" instance of our Catalogue which is
    updated from our Voyager system more frequently (about once every 5
    minutes).  Every night after the above update process we replace the
    staff instance's indexes with the production indexes in order to
    ensure that the two are in sync at least once per day.

    In this case I've found I can get around fully duplicating the index
    files on disk by just populating the staff Solr data/index/
    directory with symlinks back to the production indexes (e.g.: cd
    staff/data/index; ln -s /production/data/index/* .).  You can get
    away with this because Lucene segments are write-once, and it means
    that the process of synchronising the two indexes is almost instant.
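
For the record, the master -> slave sync in the first point boils down to
something like this (the host name, paths, port and core name are made up
for the example):

    # Replace the slave's index files with the master's copies, then ask
    # the slave's Solr to reopen the index -- no restart needed.
    ssh slave 'rm -rf /opt/vufind/solr/biblio/index/*'
    rsync -av /opt/vufind/solr/biblio/index/ \
          slave:/opt/vufind/solr/biblio/index/
    curl http://slave:8080/solr/biblio/update \
         -H 'Content-Type: text/xml' --data-binary '<commit/>'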

Not sure if any of that is useful for anyone, but just thought I'd share
our experiences.

Cheers,

Mark


"Barnett, Jeffrey" <jeffrey.barnett@yale.edu> writes:

  
We also use the Voyager deletes file on a daily basis, but with an 8
million record base index, the re-optimization process takes an hour
all on its own.  Our solution has been to run two copies/versions of
the index and of vufind.sh.  We do the full update/delete/optimize
sequence in "test" mode on a separate jetty port. Then we swap the
config.ini file from the test port for the production service (keeping
the Apache /vufind/web interface live the whole time).  Then we
"vufind.sh stop" the index that was in production, delete it, make a
copy of the running, rebuilt test index, restart it (which takes
another 15 min for a 24GB index), and let it re-warm.  We verify the
new production
index via the admin interface, and finally swap the config.ini file
back to the "normal" jetty port.  The whole sequence runs as a series
of cron jobs starting at about 8:30 AM, and running through about
10:00.  If anything fails in either the initial update or the subsequent
copy step, the "other" index remains online until the problem can be
corrected.  Right now we do this on one machine, but we are thinking
of doing it on two in the future as part of an enhanced disaster
recovery plan.  YMMV, but for anyone without the luxury of a workday
"down time" window, I think the two-index "always on" approach is
pretty scalable, and pretty easy to automate once all the pieces are
in place.
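
In outline the cron sequence amounts to something like the following
(every path, port and file name below is a placeholder rather than our
actual setup):

  #!/bin/sh
  # Sketch of the two-index "always on" swap; all names are illustrative.

  # 1. Run the full update/delete/optimize against the "test" index, which
  #    has its own copy of vufind.sh and listens on a separate jetty port.
  /usr/local/vufind-test/vufind.sh start
  # ... nightly update, deletes and optimize run against the test port ...

  # 2. Point the live Apache /vufind/web interface at the test port by
  #    swapping config.ini.
  cp /usr/local/vufind/web/conf/config.ini /tmp/config.ini.prod
  cp /usr/local/vufind/web/conf/config.ini.test \
     /usr/local/vufind/web/conf/config.ini

  # 3. Stop the index that was in production, replace it with a copy of
  #    the rebuilt test index, and restart it so it can re-warm.
  /usr/local/vufind/vufind.sh stop
  rm -rf /usr/local/vufind/solr/biblio/index
  cp -r /usr/local/vufind-test/solr/biblio/index /usr/local/vufind/solr/biblio/
  /usr/local/vufind/vufind.sh start

  # 4. After verifying the new production index in the admin interface,
  #    swap config.ini back so Apache talks to the normal jetty port again.
  cp /tmp/config.ini.prod /usr/local/vufind/web/conf/config.ini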