From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-18 13:07:43
|
Matthew,

I've run into similar issues. Did your import script and Solr configuration changes help with the issue?

Thanks,
Antonio Barrera
Princeton University Library

-----Original Message-----
From: vuf...@li... [mailto:vuf...@li...] On Behalf Of Matthew Hooper
Sent: Tuesday, September 18, 2007 1:14 AM
To: vuf...@li...
Subject: Re: [VuFind-Tech] import error

Hi all,

This is my first time posting, so please be patient with me. I'd like to add a few comments and thoughts about importing records via the import-solr.php script, which some people have had problems with.

I've been trying to load ~670,000 bib records into the system, following the instructions in the install files for VuFind. On several occasions the Solr service stopped responding to the POST requests that add new records to the index. As a result, the import sometimes finished, but in other cases it stopped at a certain point and refused to go any further; in my case things started going wrong at around 70,000 records. In some cases it finished parsing the entire file but loaded only a small percentage of the records into the system, even though the file was in UTF-8 MARC XML format.

After a bit of searching I came across a few Solr-related links that seemed to help in tuning the system so that it is more likely to load all the records. Firstly, a quick Solr tutorial from the Apache website:

http://lucene.apache.org/solr/tutorial.htm

And a FAQ document for Solr, which has some hints about increasing timeouts so that POSTing is less likely to fail:

http://wiki.apache.org/solr/FAQ

This got me thinking about the import script itself, which essentially posts files to the Solr update URL and then indexes that data. I rewrote a small portion of the import process using some PHP hints and cURL to try to make the process safer.
Here's a sample of the code in case people are curious:

    // Post one record to Solr's update handler.
    $ch = curl_init();
    $solr_url = $configArray['SOLR']['url'] . '/update';
    curl_setopt($ch, CURLOPT_URL, $solr_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_POST, 1);
    // Treat HTTP status codes >= 400 as failures.
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    // The charset is a parameter of Content-Type, not a separate header.
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml; charset=utf-8'));
    curl_setopt($ch, CURLOPT_POSTFIELDS, $record);
    curl_exec($ch);
    // Close the cURL resource and free up system resources.
    curl_close($ch);
    unset($ch);

The other thought was to use the Java import .jars included in the official Solr releases to do the importing in a thread-safe fashion.

Whenever the import process has died, I've had to stop the service, remove and recreate the data directory in the vufind folder, and clear out the data folder in the solr directory before starting the VuFind service back up and trying again. I'm wondering if it might be worthwhile to split the import process into two stages, i.e. first create the XML files in the data folder, then parse all those files using the XSL and post them off to the Solr server, so that records can be re-indexed or added with minimal fuss.

It looks like the VuFind web services use the .xml files in the vufind data folder when you click on the staff view (i.e. to read the MARC record), so these files need to be preserved. I'm guessing the IDs in the 001 tag are also used for a lot of the keys relating to things such as comments and favourites, so reindexing isn't a good thing unless the IDs stay the same. This would indicate that perhaps the 001 tag needs to be the bib_id from your ILMS to make life a little easier.

VuFind looks like it has a lot of potential; it's just getting the data into the system in the first place that seems to be a pain at the moment (as I'm typing, the import process has stopped at ~87,000 records, so I'm going to try increasing the timeout).

Cheers,
Matt.
--
Matthew Hooper
Systems Officer, Flinders University Library
G.P.O. Box 2100
ADELAIDE, South Australia 5001
P +618 8201 2068
F +618 8201 2508
E Mat...@fl...
|
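The two-stage import suggested in the message could look roughly like the sketch below. It is only an illustration under stated assumptions, not VuFind's actual import code: the function names postToSolr, postWithRetry and reindexDirectory, the $transform callable (standing in for the XSL step) and the timeout values are all hypothetical. The cURL timeouts are client-side settings and are separate from any server-side timeouts covered in the Solr FAQ.

```php
<?php
// Hypothetical helper: POST one XML document to a Solr update URL with
// explicit client-side timeouts. Returns true on success, false otherwise.
function postToSolr(string $solrUrl, string $xml, int $timeoutSeconds = 60): bool
{
    $ch = curl_init($solrUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);      // HTTP >= 400 counts as failure
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);     // illustrative value
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml; charset=utf-8'));
    curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
    $ok = curl_exec($ch) !== false;
    curl_close($ch);
    return $ok;
}

// Hypothetical helper: call $post up to $maxAttempts times, optionally
// pausing between attempts, and report whether any attempt succeeded.
function postWithRetry(callable $post, string $xml, int $maxAttempts = 3, int $pauseSeconds = 0): bool
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        if ($post($xml)) {
            return true;
        }
        if ($attempt < $maxAttempts && $pauseSeconds > 0) {
            sleep($pauseSeconds);
        }
    }
    return false;
}

// Stage 2 of the import: read the XML files saved in stage 1, convert each
// one with $transform (assumed to wrap the XSL step), and post the result.
// Returns the list of files that could not be indexed, so they can be retried
// later without touching the saved files or their IDs.
function reindexDirectory(string $dir, callable $transform, callable $post): array
{
    $failed = array();
    $files = glob(rtrim($dir, '/') . '/*.xml');
    foreach ($files ?: array() as $file) {
        $xml = file_get_contents($file);
        if ($xml === false || !postWithRetry($post, $transform($xml))) {
            $failed[] = $file;
        }
    }
    return $failed;
}
```

Because the posting step is passed in as a callable, a caller could wrap postToSolr with the update URL built from $configArray['SOLR']['url'] . '/update', and rerun reindexDirectory on just the failed files instead of wiping the index and starting over.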