From: Matthew H. <Mat...@fl...> - 2007-09-18 05:13:28
Attachments:
Matthew Hooper.vcf
Hi all,

This is my first time posting, so please be patient with me. I wanted to add a few comments and thoughts about importing records via the import-solr.php script, which some people have had problems with.

I've been trying to load ~670,000 bib records into the system following the instructions in the VuFind install files. What I found was that on several occasions the Solr service would stop responding to the POST requests that add new records to the index. As a result, the import sometimes finished, but in other cases it would stall and refuse to go past a certain number of records - in my case around 70,000 seemed to be the point where things started going wrong. In some cases it finished parsing the entire file but only loaded a small percentage of records into the system, despite the file being in UTF-8 MARC XML format.

After a bit of searching I came across a few Solr-related links which helped in tuning the system to be more likely to load all the records. Firstly, a quick Solr tutorial from the Apache website:

http://lucene.apache.org/solr/tutorial.htm

And a FAQ document for Solr which had some hints about increasing timeouts so POSTing would be less likely to fail:

http://wiki.apache.org/solr/FAQ

This got me thinking about the import script itself, which essentially posts files to the Solr update URL and indexes that data. I rewrote a small portion of the import process using some PHP hints and cURL to try to make the process safer. Here's a sample of the code in case people are curious:

// post record to Solr
$ch = curl_init();
$solr_url = $configArray['SOLR']['url'] . '/update';
curl_setopt($ch, CURLOPT_URL, $solr_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: text/xml; charset=utf-8'));
curl_setopt($ch, CURLOPT_POSTFIELDS, $record);
curl_exec($ch);
// close cURL resource and free up system resources
curl_close($ch);
unset($ch);

The other thought was to use the Java import .jars included in the official Solr releases to do the importing in a thread-safe fashion.

Whenever the import process has died, I've had to stop the service, remove and recreate the data directory in the vufind folder, clear out the data folder in the solr directory, and then start the vufind service back up and try again. I'm wondering whether it might be worthwhile to split the import process into two stages, i.e. create the XML files in the data folder first, and then parse all those files with the XSL and post them to the Solr server, so that records can be re-indexed or added with minimal fuss.

It looks like the VuFind web services use the .xml files in the vufind data folder when you click on the staff view (i.e. to read the MARC record), so these files need to be preserved. I'm guessing the IDs in the 001 tag are also used for a lot of the keys relating to things such as comments and favourites, so re-indexing isn't a good thing unless the IDs stay the same. This suggests that the 001 tag should probably hold the bib_id from your ILMS to make life a little easier.

VuFind looks like it has a lot of potential; it's just getting the data into the system in the first place that seems to be a pain at the moment (as I'm typing, the import process has stopped at ~87,000 records, so I'm going to try increasing the timeout).

Cheers,

Matt.
--
Matthew Hooper
Systems Officer, Flinders University Library
G.P.O. Box 2100, ADELAIDE, South Australia 5001
P +618 8201 2068  F +618 8201 2508
E Mat...@fl...
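On the timeout point above, a minimal sketch of how the cURL call in Matt's snippet could be hardened with explicit connect/read timeouts and basic error reporting. The timeout values, the error handling, and the reuse of $configArray and $record are illustrative assumptions, not code from the actual import script:

// Hedged sketch: POST one record to Solr with timeouts and basic error reporting.
// $configArray and $record are assumed to exist as in the original import script.
$solr_url = $configArray['SOLR']['url'] . '/update';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $solr_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: text/xml; charset=utf-8'));
curl_setopt($ch, CURLOPT_POSTFIELDS, $record);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // give Tomcat/Jetty time to accept the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 120);         // allow a slow commit to finish before giving up

$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

if ($response === false || $httpCode != 200) {
    // log and retry (or queue the record) instead of silently dropping it
    error_log('Solr update failed: HTTP ' . $httpCode . ' ' . curl_error($ch));
}

curl_close($ch);

A retry loop around curl_exec() (with a short sleep between attempts) would address the stalls Matt describes without having to restart the whole import.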
From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-18 13:07:43
Matthew,

I've run into similar issues. Did your import script and Solr configuration changes help the issue?

Thanks,

Antonio Barrera
Princeton University Library
From: Wayne G. <ws...@wm...> - 2007-09-18 13:18:05
Matt,

I think you bring up a good point about thread safety. I had done a little work with a Java version of the import script. I haven't gotten that far, but my basic idea was to skip the yaz-marcdump step and write an import/indexing mechanism that goes directly from a flat MARC file into the index and writes the proper XML to the output directory.

Solr can do portions of its indexing in parallel, but you need to do some JVM tuning, which will also help. By default, the Jetty instance is pretty generic and you need to set some tuning options to run this a bit better. Assuming you have at least 2 GB of RAM, try setting an environment variable JAVA_OPTIONS with "-server -Xmx1024m -Xms1024m -XX:+UseParallelGC -XX:+AggressiveOpts" and try it again.

Wayne

--
/**
 * Wayne Graham
 * Earl Gregg Swem Library
 * PO Box 8794
 * Williamsburg, VA 23188
 * 757.221.3112
 * http://swem.wm.edu/blogs/waynegraham/
 */
From: Andrew N. <and...@vi...> - 2007-09-18 13:36:15
> Vufind looks like it has a lot of potential, it's just getting the data
> into the system in the first place that seems to be a pain at the
> moment (as I'm typing the import process has stopped at ~87000 records
> so I'm going to try and increase the timeout).

This was exactly my hope in open sourcing the code: others would be able to find better ways to do things, making VuFind better for everyone.

Please feel free to make suggestions or submit patches. Have you found your cURL code to be faster or better than the existing HTTP_Client code?

Thanks!
Andrew
From: Matthew H. <Mat...@fl...> - 2007-09-19 01:29:21
Attachments:
Matthew Hooper.vcf
Hi Andrew,

I've sort of shot myself in the foot with the use of cURL over the standard HTTP client requests, since now the requests are being sent off to the server faster, causing the service to keel over sooner. On the plus side, from reading some of the Solr documentation, POST requests don't seem to be limited to one record per request, so I'm trying a batch process of approximately 20 records per POST request with a 3-second sleep after each post.

I should explain that the test server I'm working with is a single-processor PC (not 32 or whatever multiple Chris has) with only 512 MB of RAM. The way I understand it works (or doesn't, in some cases) is that Tomcat has a limited input buffer, i.e. only so many requests it can handle at one time. Sending one record off at a time essentially compounds the problem on servers with limited resources, whereas if you can increase the Tomcat input buffer, complete the processing of requests faster, or send fewer requests, the buffer doesn't fill up so fast, i.e. there aren't so many Tomcat processes running in parallel competing for resources. At the moment, with 20 records per request and a 3-second sleep, I'm sitting on about 24 TCP connections to Tomcat, whereas previously the number of connections would just increase until Tomcat died, and then the imports would go really quick.... :-b

I'm making use of the % (modulo) operator in PHP, i.e. concatenating into the $record variable until the request number modulo the batch size has a remainder of 0, then posting the request off and sleeping 3 seconds. This seems to reduce the number of concurrent Tomcat processes during an import, though it does make the import take longer to run.

There's got to be an easier way to index a directory of XML-formatted data without having to make a TCP connection to post the results of a read off to Solr - i.e. some sort of bulk ingest process that runs on a directory and won't drag the Solr service down during the ingest.

Anyway, if you've got the processing power, a suggestion might be to try something like Wayne is doing with multi-threaded Java importing, and perhaps batch the records in chunks to further reduce the indexing time.

Cheers,

Matt.
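A rough sketch of the modulo-based batching Matt describes. The batch size, sleep interval, variable names and the postToSolr() helper are illustrative assumptions (the helper would wrap the cURL call shown earlier); it relies on the fact that a single Solr <add> may contain multiple <doc> elements:

// Hedged sketch: batch ~20 records per POST with a pause between requests.
// $records is assumed to be an array of <doc>...</doc> strings built by the import script.
$batchSize = 20;
$buffer    = '';
$count     = 0;

foreach ($records as $doc) {
    $buffer .= $doc;
    $count++;

    // modulo check: flush the buffer every $batchSize records
    if ($count % $batchSize == 0) {
        postToSolr('<add>' . $buffer . '</add>');   // hypothetical helper wrapping the cURL POST
        $buffer = '';
        sleep(3);   // give Tomcat a chance to drain its input queue
    }
}

// flush any remainder that didn't fill a whole batch
if ($buffer != '') {
    postToSolr('<add>' . $buffer . '</add>');
}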
From: Andrew N. <and...@vi...> - 2007-09-19 13:56:32
> There's got to be an easier way to index a directory of xml formatted
> data without having to make a tcp connection to post the results of a
> read off to solr - ie some sort of bulk ingest process that runs on a
> directory and won't drag the solr service down during the ingest.
>
> Anyway if you've got the processing power, a suggestion might be to try
> something like Wayne is doing with multi threaded java importing and
> perhaps batch the records in chunks to further reduce the indexing
> time.

Yes, this is where I see the future of the import script heading. It should be a Java application that can take advantage of the Java classes. It is known that doing imports to Solr over TCP/IP is many times slower than using the Solr Java classes.

Our goal with open sourcing VuFind is to offer something great to other institutions that don't have the ability to develop a similar application, as well as to build a collective group of collaborators to help make it better. If we all work together on this, we can make this a much better application.

Andrew
From: Wayne G. <ws...@wm...> - 2007-09-19 15:44:56
Hi,

Just to give folks a heads-up on what I'm working on, I thought I'd outline it a bit and see if anyone has feedback. For a Java implementation, it occurs to me that there are several goals:

- Decrease the number of steps to go from the initial MARC file to files/index
- Speed up the indexing process
- Make sure the program isn't any more difficult to use than the current scripted solution

On the first goal, pulling data directly out of a MARC file is reasonably trivial using marc4j. The only big problem is that you have to read the records sequentially with an iterator. This kind of sucks because you can't set an arbitrary number of splits in the records and process them in parallel by default. I think we'd need to do some testing to break large files into chunks and see if it's in fact faster to index in parallel than it is to go sequentially.

For speeding up the indexing, I think a couple of things can be done. First, for folks who will be running their Solr instance on the same box they have VuFind installed on, we can take advantage of a direct connection to Solr, so there's no TCP overhead, just direct I/O. However, some folks will need to do this remotely, so there should also be a method to post the content with an HttpURLConnection. This will be an order of magnitude slower than a direct connection, but may be necessary for some implementations.

To the last point, what I was thinking is that the program would be called with something along the lines of:

java -jar import-solr.jar

However, I would like to build in some flexibility for naming the MARC file to import, where the Solr instance is, and where to store the data on the system. That's a more complex call, but something like:

java -Dvufind.marc.file=/usr/local/vufind/import/import.mrc -Dvufind.solr.home=/usr/local/vufind/solr/jetty/webapps/solr -Dvufind.solr.data=... -jar import-solr.jar

A simple bash script (along the lines of the vufind startup script) could be written to make this really painless.

I did have a question start floating around in my mind as this thread developed... I have a background in Lucene, but not so much Solr, and I've done this to create a "cached" version of web pages in Lucene. In Solr, can you create stored but unindexed fields? That is, storing the MARC XML as an essentially "hidden" field in the index. There are probably a lot of very good reasons not to do this (it'll increase the size of your index by the length of the XML records, though that could be minimized by storing the entire XML file as a single line), but I thought I'd float it out there.

Wayne
From: Andrew N. <and...@vi...> - 2007-10-01 13:43:42
Well, I have been thinking about this as well. Our designer who did the interface hates the fact that everything is in templates except for the record view pages, which are in XSLT files. We have been talking about moving everything over to Solr. So we could add fields that are not used for searching into Solr as unindexed fields, as you mention. I think this is probably the best route to go, and much better than duplicating the entire MARC record into one field. For example, I am storing the 260c field but not the 260a or 260b. We could add those fields as unindexed and be able to use them for displaying record details.

Chris, we could work on this together, if you would be interested?

Andrew

> -----Original Message-----
> From: Chris Delis
> Sent: Friday, September 28, 2007 5:58 PM
> Subject: [VuFind-Tech] Non-indexed MarcXML records in SOLR a bad idea? (was Re: Java Importer (was import error))
>
> On Wed, Sep 19, 2007 at 11:44:40AM -0400, Wayne Graham wrote:
> >
> > In Solr, can you create stored but unindexed fields? That is, storing the
> > MARC XML as an essentially "hidden" field in the index.
>
> Does anyone know the answer to this question? The need for local XML
> files pains me to no end. If it were possible to store the XML in
> SOLR, say, in one non-indexed field, and it did not affect performance
> that much (especially in the faceted searching, which is probably my
> most important concern), I would love to implement it. Would it be as
> simple as creating a new SOLR field like so:
>
> <field name="marcrecord" type="text" indexed="false" stored="true" termVectors="true"/>
>
> ????
>
> I was thinking about creating this new field in SOLR, storing the
> MarcXML file in this field, and then retrieving it from SOLR instead of
> reading it in via local XML files. But if someone already knows whether
> this is stupid, a performance nightmare, etc., I'd like to know
> beforehand. Next week I will probably try to learn more about SOLR,
> but I just wanted to see if anyone knowledgeable with this stuff might
> be able to offer some insight.
>
> Thanks,
> Chris
From: Chris D. <ce...@ui...> - 2007-10-01 14:04:26
On Mon, Oct 01, 2007 at 09:43:36AM -0400, Andrew Nagy wrote:
> Chris, we could work on this together, if you would be interested?

Count me in!

Yes, the reason I was thinking about putting the whole MarcXML file into one field (instead of each MARC field into its own) was that it was the simplest to implement with respect to the current design (easier to keep up with the main line of development). But if we put all of the fields into Solr and only choose a subset of them to be indexed, it makes it easier for those who might want to index more (or fewer) of them in their own implementation.

Sounds good to me!

Chris
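If the MARC XML (or a set of unindexed display fields) ends up stored in Solr, the staff view could read it back from the index rather than from the local .xml files. A rough sketch, assuming a stored field named marcrecord as in Chris's example and the standard Solr XML response format; the function name and URL handling are illustrative assumptions:

// Hedged sketch: fetch the stored MARC XML for one record from Solr's XML response.
// The 'marcrecord' field name follows Chris's proposed schema entry and is an assumption.
// file_get_contents() is used for brevity; cURL as in the import script would work the same way.
function getStoredMarc($solrBase, $id) {
    $url = $solrBase . '/select?q=' . urlencode('id:' . $id) . '&fl=marcrecord&rows=1';
    $response = file_get_contents($url);
    if ($response === false) {
        return null;
    }

    $xml = simplexml_load_string($response);
    // the stored field comes back as <str name="marcrecord"> inside the first <doc>
    $nodes = $xml->xpath('//doc/str[@name="marcrecord"]');

    return count($nodes) ? (string) $nodes[0] : null;
}

// usage: $marcXml = getStoredMarc($configArray['SOLR']['url'], $recordId);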
From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-19 15:57:52
I actually have the software on two machines. Both are essentially our staff desktop Dells - one is a couple of years old and one is brand new, and both have 2 GB of RAM. Similar quality machines. The odd thing is that the old machine imports records to its local Solr at least 20 times faster than the new machine. So I tried importing records remotely from the new server to the old server, and the fast speeds still held up. I'd suggest anyone who has slow import performance (or even an outright stopped process) try it on another machine. Until I figure out why the new machine is having this problem, I'm going to use Solr on the older machine, and the "pub" interface will be on the new one.

Antonio
From: Wayne G. <ws...@wm...> - 2007-09-19 16:22:35
Are there different versions of Java on the boxes?
From: Casson, R. D. <cas...@mu...> - 2007-09-19 17:09:37
I've only tangentially followed this thread, but just a couple of questions/things to maybe think about:

1) yaz-marcdump will split files, which might help with parallel indexing.

2) Has it been confirmed that going the "embedded solr" route is faster at indexing than the HTTP interface? I barely followed that thread on the solr lists too... ;)

rob (who needs to get off a couple of lists to follow the important ones closer)
From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-19 17:29:23
Not really. The older machine has 1.6.0, the new machine 1.6.0_02.
From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-21 19:09:41
Through some trial and error, this is how I've improved my importing to about 30k records per 20 minutes.

First, I distributed tasks. An older production-quality server now hosts the Solr install (it has its own full VuFind install, but I only use the Solr portion of it). The new desktop that I had intended to use as my test server holds the data files and web interface, and I run the importing from the new desktop against the server.

I also rewrote the import process into two steps. The first step does everything the original PHP import script does except post to Solr; I added a step to duplicate the output XML files (named iteratively) into a separate temporary directory. Step 2 uses those XML files to post to Solr. Step 2 accepts two arguments, a starting number and an ending number, so I can run the step 2 import concurrently just by changing the starting and ending numbers. I've been doing that in batches of 10k, though I haven't tried more than 3 concurrent 10k batches at a time. Also, by using single XML files when posting, I can see which file was the last one processed, so in case there is some sort of failure, I know exactly where to pick up.

Seems to work. But I do look forward to the Java import.

Antonio
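A rough sketch of what Antonio's "step 2" could look like. The file naming, directory path and command-line handling are assumptions based on his description (iteratively numbered XML files, with a start and end number passed as arguments), and postToSolr() is the same hypothetical cURL helper sketched earlier:

// Hedged sketch: post a numbered range of pre-built Solr <add> XML files, e.g.
//   php post-range.php 10001 20000
// Paths and file names are assumptions; adjust to match the files written in step 1.
$start = (int) $argv[1];
$end   = (int) $argv[2];
$dir   = '/usr/local/vufind/import/solr-xml';   // assumed output directory from step 1

for ($i = $start; $i <= $end; $i++) {
    $file = $dir . '/' . $i . '.xml';
    if (!file_exists($file)) {
        continue;
    }

    $ok = postToSolr(file_get_contents($file));   // hypothetical helper wrapping the cURL POST
    if (!$ok) {
        // stop here so the run can be restarted from this file number
        die("Failed at record file $i\n");
    }
}

echo "Posted files $start through $end\n";

Several copies of this script can then be run concurrently over non-overlapping ranges, which is the parallelism Antonio describes.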
From: Wayne G. <ws...@wm...> - 2007-09-21 19:55:38
I just dropped Andrew a note, but I wanted to let everyone know what I found today...

Because there are variations in how MARC is implemented (recall Andrew's note about the unique ID being in the 949a and not the 001 field), I refactored the code to change the way it handles records. The flow creates a direct connection to the Solr server (it doesn't need to be running) and then reads in the MARC file. It iterates over the records, and for each record it writes a MARC XML file and converts the record to the format needed for Solr using a stylesheet (based on the marcxml2solr stylesheet). Each record is then sent to Solr using a custom requestHandler that maps to the XmlUpdateRequestHandler in Solr.

Unfortunately, this is much (MUCH) slower, as there are extra parsing and evaluation steps happening that I was able to skip by mapping specific fields directly to the index fields in Solr. Yesterday I could do 10,100 records (to make sure there was at least one autocommit in there) in less than 2 minutes (around 1:45 on average). Today's method, which is more flexible, runs the same number of documents in about 21 minutes.

Just as a straw poll: other than the unique ID, which other fields differ from the standard ones included in the XSLT file? If it's just the unique ID, I'm thinking that could be passed in as a variable, since handling this in memory is so much faster.

Wayne
From: Chris D. <ce...@ui...> - 2007-09-18 13:50:08
On Tue, Sep 18, 2007 at 02:43:30PM +0930, Matthew Hooper wrote:
>
> This is my first time posting so please be patient with me. I was just going
> to add a few comments and thoughts regarding importing records via the
> import-solr.php script which some people have had problems with.

Hi Matt,

Although I haven't personally experienced the same failures you describe, I am finding myself having to work around some of the import script's limitations. My problem with the current version is that it is not very efficient for my needs. I am still in the pilot phase of the project, so I need to change the Solr schemas quite often (at least I will be doing this a lot in the near future when I add new facets, etc.).

I have a huge collection (25 million bib records from a library consortium, 9 million of them unique). I don't need to import all of them for my pilot, but I do plan on testing with 5 or 6 million (3 or 4 schools' worth of data) at a time. So far, I am seeing 10 hours of processing time per 500,000 bib records. During this processing, I noticed the CPU was at 100% (and I have 32 available CPUs on this machine). This got me thinking that I was hitting a bottleneck at the script level and not so much in the Lucene database. For me, it was absolutely necessary to parallelize this process - in other words, to run several of these import processes at the same time. When I ran 4 import scripts at once (I had to make slight modifications to the script; for example, one change was to remove the optimize at the end - that needs to be done once, after all of the scripts have finished running), I was able to import 2,000,000 bib records in the same 10 hours.

So, if you or anyone else on this list is kind enough to share a multi-threaded version of the import script (written in Java or whatever), I would love to test it! :-)

Thanks,
Chris
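Since the optimize is pulled out of the individual import runs, a final commit/optimize can be sent to Solr as its own step once all the parallel scripts finish. A minimal sketch, assuming the same /update handler the import script already posts to; the use of $configArray and the zero timeout are illustrative assumptions:

// Hedged sketch: issue a single <commit/> and <optimize/> after all parallel imports are done.
$solr_update = $configArray['SOLR']['url'] . '/update';

foreach (array('<commit/>', '<optimize/>') as $command) {
    $ch = curl_init($solr_update);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: text/xml; charset=utf-8'));
    curl_setopt($ch, CURLOPT_POSTFIELDS, $command);
    curl_setopt($ch, CURLOPT_TIMEOUT, 0);   // optimizing a large index can take a long time; no client timeout
    curl_exec($ch);
    curl_close($ch);
}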
From: Wayne G. <ws...@wm...> - 2007-09-18 16:12:39
|
OK...wow. A test box with 32 processors? I'm jealous. Wally had mentioned this to me, but I'll ask you also, were all your processors spiking or just one? I'll try to get my Java in a little better shape this week (e.g. working) and send it to you. Wayne Chris Delis wrote: > [...] -- /** * Wayne Graham * Earl Gregg Swem Library * PO Box 8794 * Williamsburg, VA 23188 * 757.221.3112 * http://swem.wm.edu/blogs/waynegraham/ */ |
From: Chris D. <ce...@ui...> - 2007-09-18 16:20:05
|
On Tue, Sep 18, 2007 at 12:12:30PM -0400, Wayne Graham wrote: > OK...wow. A test box with 32 processors? I'm jealous. 32 cores, actually. But from the OS' perspective, it looks like 32 processors. It's a Sun T1000. The production machine will most likely be a T2000. (I have mixed feelings about the hardware, BTW. It is not the best environment for open-source development, IMHO.) > > Wally had mentioned this to me, but I'll ask you also, were all your > processors spiking or just one? Just one, which is why I quickly decided to run simultaneous PHP import scripts. :-) > > I'll try to get my Java in a little better shape this week (e.g. > working) and send it to you. Great! I would love to help (and plan to, in the future) with VUFind-related (main branch) development but I am busy working on other things at the moment. Thanks, Chris > > Wayne > > Chris Delis wrote: > > [...] |
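To make the parallel-import idea concrete, here is a rough sketch (not code from the thread) of a small CLI driver that forks one import per chunk of records and waits for all of them to finish before the index is optimized. It assumes the pcntl extension is available, that import-solr.php has been changed to take the record file as an argument and to skip its final optimize (see Chris's later message in this thread), and the chunk file names and PHP binary path are placeholders.

<?php
// Sketch only: run several copies of the import at once, then optimize afterwards.
// Assumes CLI PHP with the pcntl extension and a modified import-solr.php that
// accepts a file name argument.
$chunks = array('catalog-1.xml', 'catalog-2.xml', 'catalog-3.xml', 'catalog-4.xml');
$pids   = array();

foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("Could not fork a worker for $chunk\n");
    } elseif ($pid == 0) {
        // Child process: run one import over its chunk of records.
        pcntl_exec('/usr/bin/php', array('import-solr.php', $chunk));
        exit(1); // only reached if the exec itself fails
    }
    $pids[$pid] = $chunk;
}

// Parent: wait until every import has finished before touching the index again.
foreach ($pids as $pid => $chunk) {
    pcntl_waitpid($pid, $status);
    echo $chunk . ' finished with exit code ' . pcntl_wexitstatus($status) . "\n";
}

echo "All imports done - safe to send <optimize/> to Solr now.\n";
?>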
From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-18 16:29:24
|
Chris, Care to share the php code changes for the import script? Thanks, Antonio Barrera Princeton University Library -----Original Message----- From: vuf...@li... [mailto:vuf...@li...] On Behalf Of Chris Delis Sent: Tuesday, September 18, 2007 12:20 PM To: Wayne Graham Cc: vuf...@li... Subject: Re: [VuFind-Tech] import error [...] |
From: Chris D. <ce...@ui...> - 2007-09-18 16:48:33
|
On Tue, Sep 18, 2007 at 12:29:23PM -0400, Antonio Barrera wrote: > Chris, > > Care to share the php code changes for the import script? I would, except it is almost identical to the original (I only commented out the optimize call at the end). For proof of concept, I simply set up several catalog.xml files (splitting up my 2,000,000 records into 4 sets of 500,000) and ran the import script from different places (making use of Unix soft links). I only did this once, but next time I think it would be easier to edit the import script so it will accept the XML file as a parameter (instead of the hard-coded catalog.xml file :-) Chris > > [...] |
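A minimal sketch of the change Chris describes, nothing official; the option name and the surrounding structure of import-solr.php are assumptions:

<?php
// Sketch: take the record file (and an optional flag to skip the optimize)
// from the command line instead of hard-coding catalog.xml.
$xml_file    = isset($argv[1]) ? $argv[1] : 'catalog.xml';
$do_optimize = !in_array('--no-optimize', $argv);

if (!file_exists($xml_file)) {
    die("Record file not found: $xml_file\n");
}
echo "Importing records from $xml_file\n";

// ... the existing parse / transform / POST loop would run here, unchanged ...

if ($do_optimize) {
    // Post <optimize/> to the Solr update URL, reusing the same curl pattern
    // that the import POSTs already use.
    echo "Optimizing index...\n";
}
?>

The four chunks could then be started as php import-solr.php catalog-1.xml --no-optimize and so on, with a single optimize issued once the last one returns.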
From: Wayne G. <ws...@wm...> - 2007-09-19 21:08:24
|
I've spent some time this afternoon playing with embedded solr and marc4j. So far I'm encouraged by the results. Right now, this is the way it works (and I'll leave a caveat that it's not 100% functional yet). You point the main program to the raw marc formatted file and the home directory for Solr. The program reads each record into memory from the marc file and first converts it in memory to marcxml format and then writes to the disk. With the record still in memory, it programmatically evaluates the in-memory marc record to prepare the document for Solr. Right now it's skipping the step of transforming this with an XSLT processor, but I may try to add that later this week as the rule-set is already conveniently detailed, and I only really need to load the file once. As far as speeds, in a straight up, strong arm indexing, I did 100 records in about 5 seconds consistently to read, index, and optimize. It's been doing about 1000 in 30 seconds, and did 10000 in under 2 minutes. I don't have comparative data on this box for the same indexing with PHP, but I seriously doubt the PHP version can beat these numbers. I'll keep folks updated. Wayne Wayne Graham wrote: > Hi, > > Just to give folks a heads up on what I'm working on, I thought I'd > outline it a bit and see if anyone had feedback. > > For a Java implementation, it occurs to me that there are several goals. > > - Decrease the number of steps to go from initial marc to files/index > - Speed up the indexing process > - Make sure the program isn't any more difficult to use than the current > scripted solution > > In the first goal, pulling data directly out of a marc file is reasonably > trivial using marc4j. The only big problem is that you have to read the > records sequentially with an iterator. This kind of sucks because you > can't set an arbitrary number of splits in the records and process them > in parallel by default. I think we'd need to do some testing to break up > large files into chunks and see if it's in fact faster to index in > parallel than it is to go sequentially. > > For the speeding of the indexing, I think there can be a couple of > things done. First, for folks who will be running their Solr instance on > the same box they have Vufind installed, we can take advantage of a > direct connection to Solr, so there's no TCP overhead, just direct IO. > However, there are some folks that would need to do this remotely, so a > method to post the content with an HttpURLConnection is also needed. This will be an > order of magnitude slower than a direct connection, but may be necessary > for some implementations. > > To the last point, what I was thinking is that the program would be > called with something along the lines of > > java -jar import-solr.jar > > However, I would like to build in some flexibility for naming the marc > file to import, where the Solr instance is, and where to store the data > on the system. That's a more complex call, but something like > > java -Dvufind.marc.file=/usr/local/vufind/import/import.mrc > -Dvufind.solr.home=/usr/local/vufind/solr/jetty/webapps/solr > -Dvufind.solr.data=... -jar > > A simple bash script (along the lines of the vufind startup script) > could be written to make this really painless. > > I did have a question start floating around in my mind as this thread > developed...I have a background in Lucene, but not so much Solr, and > I've done this to create a "cached" version of web pages in Lucene. In > Solr, can you create stored, but unindexed fields? Storing the marc > XML as an essentially "hidden" field in the index. There are probably a lot of > very good reasons not to do this (it'll increase the size of your index > by the length of the XML records, but could be minimized by storing the > entire xml file as a single line), but I thought I'd float it out there. > > Wayne > > Andrew Nagy wrote: >>> There's got to be an easier way to index a directory of xml formatted >>> data without having to make a tcp connection to post the results of a >>> read off to solr - ie some sort of bulk ingest process that runs on a >>> directory and won't drag the solr service down during the ingest. >>> >>> Anyway if you've got the processing power, a suggestion might be to try >>> something like Wayne is doing with multi threaded java importing and >>> perhaps batch the records in chunks to further reduce the indexing >>> time. >> Yes, this is where I see the future of the import script heading. It should be a java application that can take advantage of the java classes. It is known that doing imports to solr over TCP/IP is many factors slower than using the solr java classes. >> >> Our goals with open sourcing vufind is to offer something great to other institutions that don't have the ability to develop a similar application. As well as to build a collective group of collaborators to help make it better. If we all work together on this, we can help make this a much better application. >> >> Andrew >> >> [...] -- /** * Wayne Graham * Earl Gregg Swem Library * PO Box 8794 * Williamsburg, VA 23188 * 757.221.3112 * http://swem.wm.edu/blogs/waynegraham/ */ |
From: Andrew N. <and...@vi...> - 2007-09-20 14:07:01
|
Wayne, this sounds awesome. Thanks for working on this! I know that Casey from Seattle Public had built a java importer for library records into solr. You can see the source here: http://fac-back-opac.googlecode.com/svn/trunk/indexer/ One thing to consider, that I think needs to go into the importer, is more flexibility with fields. For example, Matt Mackey has his unique Id not in the 001 field but in the 949a field. This would be nice to be able to easily adjust in the importer. As well if there are any local custom fields, etc. I was thinking about creating a config file or xml file for the importer that would act as a mapping that could easily be adjusted. Again, thanks, this is great! Andrew > -----Original Message----- > From: Wayne Graham [mailto:ws...@wm...] > Sent: Wednesday, September 19, 2007 5:08 PM > To: Andrew Nagy > Cc: vuf...@li... > Subject: Re: [VuFind-Tech] Java Importer (was import error) > > [...] |
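Purely as a sketch of the kind of mapping file Andrew is describing (the file name, section name and keys are all invented for illustration), the PHP side could be as simple as parse_ini_file plus a tiny spec parser:

<?php
// Sketch: read a field-mapping file such as marc_map.ini, which might contain
//
//   [mapping]
//   id        = 949a
//   title     = 245ab
//   publisher = 260b
//
// so a site that keeps its bib id in 001 only has to change one line.
$map = parse_ini_file('marc_map.ini', true);

// Split a spec like "949a" into a MARC tag and its subfield codes.
function parse_spec($spec)
{
    $codes = (string) substr($spec, 3); // empty for control fields such as 001
    return array(
        'tag'       => substr($spec, 0, 3),
        'subfields' => ($codes === '') ? array() : str_split($codes)
    );
}

foreach ($map['mapping'] as $solr_field => $spec) {
    $parsed = parse_spec($spec);
    echo $solr_field . ' <= MARC ' . $parsed['tag'] .
         ($parsed['subfields'] ? ' $' . implode(',$', $parsed['subfields']) : ' (control field)') .
         "\n";
}
?>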
From: Wayne G. <ws...@wm...> - 2007-09-20 15:15:00
|
I think this is a big reason to use the XSLT in the transformation. The thing I'm going to work on (hopefully today) is pulling out the PHP processing in the xslt code. I've been waffling on whether to do this in pure XSLT or in a combination of Java/XSLT (basically porting what you've done). I think a pure XSLT solution would be ideal, however functions weren't introduced into the specification until the 2.0 release. You can run it with exslt libraries, but I'm not sure how this would actually 1) be as efficient and 2) work. I'd love to have a single stylesheet that is agnostic to the processor that could process any of this. That way, you can pick your processing poison, and you would need to do minimal tweaking to make it work in the varied environments out there. I took a look at the trunk code and unfortunately it looks like the Python is doing all the work...pretty much a Python version of the PHP script. Wayne Andrew Nagy wrote: > [...] -- /** * Wayne Graham * Earl Gregg Swem Library * PO Box 8794 * Williamsburg, VA 23188 * 757.221.3112 * http://swem.wm.edu/blogs/waynegraham/ */ |
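For anyone following along, the processor-specific coupling being discussed looks roughly like this on the calling side (the stylesheet and record file names are placeholders): a stylesheet that relies on php:function() callbacks can only be run through PHP's ext/xsl, which is the portability problem a pure XSLT or EXSLT sheet would avoid.

<?php
// Sketch of a PHP-tied transform step; a processor-agnostic stylesheet would
// drop the registerPHPFunctions()/php:function() dependency entirely.
$xml = new DOMDocument();
$xml->load('record.xml');      // placeholder MARCXML input

$xsl = new DOMDocument();
$xsl->load('import.xsl');      // placeholder stylesheet name

$proc = new XSLTProcessor();
$proc->registerPHPFunctions(); // enables php:function() callbacks inside the xsl
$proc->importStylesheet($xsl);

echo $proc->transformToXML($xml);
?>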
From: Chris D. <ce...@ui...> - 2007-09-28 21:57:42
|
On Wed, Sep 19, 2007 at 11:44:40AM -0400, Wayne Graham wrote: > > I did have a question start floating around in my mind as this thread > developed...I have a background in Lucene, but not so much Solr, and > I've done this to create a "cached" version of web pages in Lucene. In > Solr, can you create stored, but unindexed fields? Storing the marc > XML as an essentially "hidden" field in the index. There are probably a lot of > very good reasons not to do this (it'll increase the size of your index > by the length of the XML records, but could be minimized by storing the > entire xml file as a single line), but I thought I'd float it out there. > > Wayne > Does anyone know the answer to this question? The need for local XML files pains me to no end. If it were possible to store the XML in SOLR, say, in one non-indexed field, and it did not affect performance that much (especially in the faceted searching, which is probably the most important concern of mine), I would love to implement it. Would it be as simple as creating a new SOLR field like so: <field name="marcrecord" type="text" indexed="false" stored="true" termVectors="true"/> ???? I was thinking about creating this new field in SOLR, storing the MarcXML into this field, then retrieving it from SOLR instead of reading it in via local XML files. But, if someone already knows if this is stupid, a performance nightmare, etc., I'd like to know beforehand. Next week, I will probably try to learn more about SOLR, but just wanted to see if anyone knowledgeable with this stuff might be able to offer some insight. Thanks, Chris |
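If the field behaves the way Chris hopes, the record view could pull the MARCXML back out of an ordinary Solr query instead of opening files in the data directory. The sketch below assumes the marcrecord field from the schema snippet above; the Solr URL and the id value are placeholders. (One small note: as far as I know term vectors are only kept for indexed fields, so the termVectors attribute probably does nothing on an indexed="false" field.)

<?php
// Sketch: fetch the stored (unindexed) MARCXML for one record straight from Solr.
// The field name follows the schema example above; the URL and id are placeholders.
$solr_url = 'http://localhost:8983/solr/select';
$id       = '000123456';

$params = array(
    'q'  => 'id:' . $id,
    'fl' => 'id,marcrecord'
);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $solr_url . '?' . http_build_query($params));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
curl_close($ch);

// Pull the stored field out of Solr's standard XML response.
$doc = simplexml_load_string($response);
foreach ($doc->xpath('//doc/str[@name="marcrecord"]') as $field) {
    $marcxml = (string) $field;  // the original MARCXML, ready for the existing XSLT views
    echo $marcxml . "\n";
}
?>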
From: Wayne G. <ws...@wm...> - 2007-10-01 14:19:19
|
As far as the indexing goes, there's just a couple of lines to refactor in the Java code. One thing I've noticed in the indexer rewrite is that the marc4j libraries use <collection> as the root element, and the marcdump utility uses <record>. I've been working on the XSLT that generates the different views of the information and should have an updated version of the stylesheets soon that _should_ work with whichever version is being used. With the termVectors on for the new field, you can easily do a "more like this" search. However, I'm a little wary of this for a raw marc record since so much of the data is embedded in control data and subfields. I think with the termVectors on for the derived values in the index (e.g. book, subjects, etc.) you'll get better results without the increased load on the indexing and storage. I would hesitate to put the rest of the views into the index rather than working with the XSLT. This probably has more to do with my proclivity to separate data and display than any real objection, but I think it's rather elegant (and robust) to use XSLT to transform the information you get back from the index (be it a simple field, or a complex document like XML). Anyway, there's my 2 cents... Wayne Andrew Nagy wrote: > Well I have been thinking about this as well. Our designer who did the interface hates the fact that everything is in templates except for the record view pages that are in xslt files. We have been talking about moving everything over to solr. So we could add fields that are not used for searching into solr as unindexed fields as you mention. I think this is probably the best route to go. I think this would be much better than duplicating the entire marc record into one field. For example, I am storing the 260c field but not the 260a 260b. We could add those fields as unindexed and be able to use them for displaying record details. > > Chris, we could work on this together if you would be interested? > > Andrew > > >> [...] -- /** * Wayne Graham * Earl Gregg Swem Library * PO Box 8794 * Williamsburg, VA 23188 * 757.221.3112 * http://swem.wm.edu/blogs/waynegraham/ */ |