From: Matthew H. <Mat...@fl...> - 2007-09-19 01:29:21
Hi Andrew,

I've sort of shot myself in the foot with the use of curl over the standard HTTP client requests, since the requests are now being sent off to the server faster, causing the service to keel over quicker. On the plus side, from reading some of the Solr documentation, POST requests don't seem to be limited to one record per request, so I'm trying a batch process of approx. 20 records per POST request with a 3 second sleep after each POST.

I should explain that the test server I'm working with is a single-processor PC (not 32 or whatever multiple Chris has) with only 512MB RAM. The way I understand it works (or doesn't, in some cases) is that Tomcat has a limited input buffer, i.e. only so many requests it can handle at one time. Sending one record off at a time essentially compounds the problem on servers with limited resources, whereas if you can increase the Tomcat input buffer, complete the processing of requests faster, or send fewer requests, the input buffer doesn't fill up so fast, i.e. not so many Tomcat processes running in parallel competing for resources. At the moment, with 20 records per request and a 3 second sleep, I'm sitting on about 24 TCP connections to Tomcat, whereas previously the number of connections would just increase until Tomcat died, and then the imports would go really quick.... :-b

I'm making use of the % arithmetic operator in PHP, i.e. concatenating the $record variable until the request number % the batch size (% is modulo, not division) has a remainder of 0, and then posting the request off and sleeping 3 seconds (there's a rough sketch of the loop at the very bottom of this message, below the quoted text). This seems to reduce the number of concurrent Tomcat processes during an import, though it does make the import take longer to run.

There's got to be an easier way to index a directory of XML-formatted data without having to make a TCP connection to post the results of each read off to Solr, i.e. some sort of bulk ingest process that runs on a directory and won't drag the Solr service down during the ingest.

Anyway, if you've got the processing power, a suggestion might be to try something like Wayne is doing with multi-threaded Java importing, and perhaps batch the records in chunks to further reduce the indexing time.

Cheers,
Matt.

--
Matthew Hooper
Systems Officer, Flinders University Library
G.P.O. Box 2100
ADELAIDE, South Australia 5001
P +618 8201 2068
F +618 8201 2508
E Mat...@fl...

:-----Original Message-----
:From: Andrew Nagy [mailto:and...@vi...]
:Sent: Tuesday, 18 September 2007 11:06 PM
:To: Matthew Hooper; vuf...@li...
:Subject: RE: [VuFind-Tech] import error
:
:> Vufind looks like it has a lot of potential, it's just getting the
:> data into the system in the first place that seems to be a pain at the
:> moment (as I'm typing the import process has stopped at ~87000 records
:> so I'm going to try and increase the timeout).
:
:This was exactly my hope in open sourcing the code. Others
:would be able to find better ways to do things making VuFind
:better for everyone.
:
:Please feel free to make suggestions or submit patches. Have
:you found your CURL code to be faster or better than the
:existing HTTP_Client code?
:
:Thanks!
:Andrew
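
P.S. For anyone who wants to try the same thing, here's a rough sketch of the batch-and-sleep loop I described above. Untested as pasted, and the Solr URL, the records directory, and the one-<doc>-per-file layout are placeholders, not my actual setup, so adjust for your own install. It assumes PHP's curl extension is available.

<?php
// Hypothetical Solr update handler URL -- change for your install.
$solrUrl   = 'http://localhost:8080/solr/update';
$batchSize = 20;  // records per POST request
$pause     = 3;   // seconds to sleep after each POST

// POST an XML payload to Solr using the curl extension.
function postToSolr($url, $xml) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

$record = '';
$count  = 0;
foreach (glob('/path/to/records/*.xml') as $file) { // placeholder directory
    $record .= file_get_contents($file); // assumed to hold one <doc>...</doc>
    $count++;
    // modulo test: flush the buffer every $batchSize records, then sleep
    if ($count % $batchSize == 0) {
        postToSolr($solrUrl, '<add>' . $record . '</add>');
        $record = '';
        sleep($pause);
    }
}
// flush the final partial batch, if any
if ($record != '') {
    postToSolr($solrUrl, '<add>' . $record . '</add>');
}
// commit so the newly added records become searchable
postToSolr($solrUrl, '<commit/>');
?>

The sleep after each batch is the crude part, but it's what keeps the Tomcat connection count flat on my little test box; a smarter version would probably watch the response times instead of sleeping a fixed interval.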