From: Matthew H. <Mat...@fl...> - 2007-09-19 01:29:21
Hi Andrew,

I've sort of shot myself in the foot with the use of curl over the standard HTTP client requests, since the requests are now being sent off to the server faster, causing the service to keel over quicker. On the plus side, from reading some of the Solr documentation, POST requests don't seem to be limited to one record per request, so I'm trying a batch process of approx. 20 records per POST request with a 3 second sleep after each POST.

I should explain that the test server I'm working with is a single-processor PC (not 32 or whatever multiple Chris has) with only 512MB RAM. The way I understand it works (or doesn't, in some cases) is that Tomcat has a limited input buffer, i.e. only so many requests it can handle at one time. Sending one record off at a time essentially compounds the problem on servers with limited resources, whereas if you can increase the Tomcat input buffer, complete the processing of requests faster, or send fewer requests, the input buffer doesn't fill up so fast, i.e. not so many Tomcat processes running in parallel competing for resources. At the moment, with 20 records per request and a 3 second sleep, I'm sitting on about 24 TCP connections to Tomcat, whereas previously the number of connections would just increase until Tomcat died, and then the imports would go really quick.... :-b

I'm making use of the % arithmetic operator in PHP, i.e. concatenating the $record variable until the request number % the batch size (% is modulo, not division) has a remainder of 0, and then posting the request off and sleeping 3 seconds (there's a rough sketch of the loop at the very bottom of this message, below the quoted text). This seems to reduce the number of concurrent Tomcat processes during an import, though it does make the import take longer to run.

There's got to be an easier way to index a directory of XML-formatted data without having to make a TCP connection to post the results of each read off to Solr, i.e. some sort of bulk ingest process that runs on a directory and won't drag the Solr service down during the ingest.

Anyway, if you've got the processing power, a suggestion might be to try something like Wayne is doing with multi-threaded Java importing, and perhaps batch the records in chunks to further reduce the indexing time.

Cheers,
Matt.

--
Matthew Hooper
Systems Officer, Flinders University Library
G.P.O. Box 2100
ADELAIDE, South Australia 5001
P +618 8201 2068
F +618 8201 2508
E Mat...@fl...

:-----Original Message-----
:From: Andrew Nagy [mailto:and...@vi...]
:Sent: Tuesday, 18 September 2007 11:06 PM
:To: Matthew Hooper; vuf...@li...
:Subject: RE: [VuFind-Tech] import error
:
:> Vufind looks like it has a lot of potential, it's just getting the
:> data into the system in the first place that seems to be a pain at the
:> moment (as I'm typing the import process has stopped at ~87000 records
:> so I'm going to try and increase the timeout).
:
:This was exactly my hope in open sourcing the code. Others
:would be able to find better ways to do things making VuFind
:better for everyone.
:
:Please feel free to make suggestions or submit patches. Have
:you found your CURL code to be faster or better than the
:existing HTTP_Client code?
:
:Thanks!
:Andrew
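
P.S. For anyone who wants to try the same thing, here's a rough sketch of the batch-and-sleep loop I described above. Untested as pasted, and the Solr URL, the records directory, and the one-<doc>-per-file layout are placeholders, not my actual setup, so adjust for your own install. It assumes PHP's curl extension is available.

<?php
// Hypothetical Solr update handler URL -- change for your install.
$solrUrl   = 'http://localhost:8080/solr/update';
$batchSize = 20;  // records per POST request
$pause     = 3;   // seconds to sleep after each POST

// POST an XML payload to Solr using the curl extension.
function postToSolr($url, $xml) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

$record = '';
$count  = 0;
foreach (glob('/path/to/records/*.xml') as $file) { // placeholder directory
    $record .= file_get_contents($file); // assumed to hold one <doc>...</doc>
    $count++;
    // modulo test: flush the buffer every $batchSize records, then sleep
    if ($count % $batchSize == 0) {
        postToSolr($solrUrl, '<add>' . $record . '</add>');
        $record = '';
        sleep($pause);
    }
}
// flush the final partial batch, if any
if ($record != '') {
    postToSolr($solrUrl, '<add>' . $record . '</add>');
}
// commit so the newly added records become searchable
postToSolr($solrUrl, '<commit/>');
?>

The sleep after each batch is the crude part, but it's what keeps the Tomcat connection count flat on my little test box; a smarter version would probably watch the response times instead of sleeping a fixed interval.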