From: Matthew H. <Mat...@fl...> - 2007-09-18 05:13:28
Attachments:
Matthew Hooper.vcf
Hi all,

This is my first time posting, so please be patient with me. I wanted to add a few comments and thoughts about importing records via the import-solr.php script, which some people have had problems with.

I've been trying to load ~670,000 bib records into the system following the instructions in the VuFind install files. What I found was that on several occasions the Solr service would stop responding to the POST requests that add new records to the index. As a result, the import sometimes finished, but in other cases it would stall and refuse to go past a certain number of records - in my case around 70,000 seemed to be the point where things started going wrong. In some cases it finished parsing the entire file but only loaded a small percentage of records into the system, despite the file being in UTF-8 MARC XML format.

After a bit of searching I came across a few Solr-related links which helped in tuning the system to be more likely to load all the records. Firstly, a quick Solr tutorial from the Apache website:

http://lucene.apache.org/solr/tutorial.htm

And a FAQ document for Solr which had some hints about increasing timeouts so POSTing would be less likely to fail:

http://wiki.apache.org/solr/FAQ

This got me thinking about the import script itself, which essentially posts files to the Solr update URL and indexes that data. I rewrote a small portion of the import process using some PHP hints and cURL to try to make the process safer. Here's a sample of the code in case people are curious:

// post record to Solr
$ch = curl_init();
$solr_url = $configArray['SOLR']['url'] . '/update';
curl_setopt($ch, CURLOPT_URL, $solr_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: text/xml; charset=utf-8'));
curl_setopt($ch, CURLOPT_POSTFIELDS, $record);
curl_exec($ch);
// close cURL resource and free up system resources
curl_close($ch);
unset($ch);

The other thought was to use the Java import .jars included in the official Solr releases to do the importing in a thread-safe fashion.

Whenever the import process has died, I've had to stop the service, remove and recreate the data directory in the vufind folder, clear out the data folder in the solr directory, and then start the vufind service back up and try again. I'm wondering whether it might be worthwhile to split the import process into two stages, i.e. create the XML files in the data folder first, and then parse all those files with the XSL and post them to the Solr server, so that records can be re-indexed or added with minimal fuss.

It looks like the VuFind web services use the .xml files in the vufind data folder when you click on the staff view (i.e. to read the MARC record), so these files need to be preserved. I'm guessing the IDs in the 001 tag are also used for a lot of the keys relating to things such as comments and favourites, so re-indexing isn't a good thing unless the IDs stay the same. This suggests that the 001 tag should probably hold the bib_id from your ILMS to make life a little easier.

VuFind looks like it has a lot of potential; it's just getting the data into the system in the first place that seems to be a pain at the moment (as I'm typing, the import process has stopped at ~87,000 records, so I'm going to try increasing the timeout).

Cheers,

Matt.
--
Matthew Hooper
Systems Officer, Flinders University Library
G.P.O. Box 2100, ADELAIDE, South Australia 5001
P +618 8201 2068  F +618 8201 2508
E Mat...@fl...
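On the timeout point above, a minimal sketch of how the cURL call in Matt's snippet could be hardened with explicit connect/read timeouts and basic error reporting. The timeout values, the error handling, and the reuse of $configArray and $record are illustrative assumptions, not code from the actual import script:

// Hedged sketch: POST one record to Solr with timeouts and basic error reporting.
// $configArray and $record are assumed to exist as in the original import script.
$solr_url = $configArray['SOLR']['url'] . '/update';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $solr_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: text/xml; charset=utf-8'));
curl_setopt($ch, CURLOPT_POSTFIELDS, $record);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // give Tomcat/Jetty time to accept the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 120);         // allow a slow commit to finish before giving up

$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

if ($response === false || $httpCode != 200) {
    // log and retry (or queue the record) instead of silently dropping it
    error_log('Solr update failed: HTTP ' . $httpCode . ' ' . curl_error($ch));
}

curl_close($ch);

A retry loop around curl_exec() (with a short sleep between attempts) would address the stalls Matt describes without having to restart the whole import.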
From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-18 13:07:43
Matthew,

I've run into similar issues. Did your import script and Solr configuration changes help the issue?

Thanks,

Antonio Barrera
Princeton University Library
From: Wayne G. <ws...@wm...> - 2007-09-18 13:18:05
Matt,

I think you bring up a good point about thread safety. I had done a little work with a Java version of the import script. I haven't gotten that far, but my basic idea was to skip the yaz-marcdump step and write an import/indexing mechanism that goes directly from a flat MARC file into the index and writes the proper XML to the output directory.

Solr can do portions of its indexing in parallel, but you need to do some JVM tuning, which will also help. By default, the Jetty instance is pretty generic and you need to set some tuning options to run this a bit better. Assuming you have at least 2 GB of RAM, try setting an environment variable JAVA_OPTIONS with "-server -Xmx1024m -Xms1024m -XX:+UseParallelGC -XX:+AggressiveOpts" and try it again.

Wayne

--
/**
 * Wayne Graham
 * Earl Gregg Swem Library
 * PO Box 8794
 * Williamsburg, VA 23188
 * 757.221.3112
 * http://swem.wm.edu/blogs/waynegraham/
 */
From: Andrew N. <and...@vi...> - 2007-09-18 13:36:15
> Vufind looks like it has a lot of potential, it's just getting the data
> into the system in the first place that seems to be a pain at the
> moment (as I'm typing the import process has stopped at ~87000 records
> so I'm going to try and increase the timeout).

This was exactly my hope in open sourcing the code: others would be able to find better ways to do things, making VuFind better for everyone.

Please feel free to make suggestions or submit patches. Have you found your cURL code to be faster or better than the existing HTTP_Client code?

Thanks!
Andrew
From: Matthew H. <Mat...@fl...> - 2007-09-19 01:29:21
Attachments:
Matthew Hooper.vcf
Hi Andrew,

I've sort of shot myself in the foot with the use of cURL over the standard HTTP client requests, since now the requests are being sent off to the server faster, causing the service to keel over sooner. On the plus side, from reading some of the Solr documentation, POST requests don't seem to be limited to one record per request, so I'm trying a batch process of approximately 20 records per POST request with a 3-second sleep after each post.

I should explain that the test server I'm working with is a single-processor PC (not 32 or whatever multiple Chris has) with only 512 MB of RAM. The way I understand it works (or doesn't, in some cases) is that Tomcat has a limited input buffer, i.e. only so many requests it can handle at one time. Sending one record off at a time essentially compounds the problem on servers with limited resources, whereas if you can increase the Tomcat input buffer, complete the processing of requests faster, or send fewer requests, the buffer doesn't fill up so fast, i.e. there aren't so many Tomcat processes running in parallel competing for resources. At the moment, with 20 records per request and a 3-second sleep, I'm sitting on about 24 TCP connections to Tomcat, whereas previously the number of connections would just increase until Tomcat died, and then the imports would go really quick.... :-b

I'm making use of the % (modulo) operator in PHP, i.e. concatenating into the $record variable until the request number modulo the batch size has a remainder of 0, then posting the request off and sleeping 3 seconds. This seems to reduce the number of concurrent Tomcat processes during an import, though it does make the import take longer to run.

There's got to be an easier way to index a directory of XML-formatted data without having to make a TCP connection to post the results of a read off to Solr - i.e. some sort of bulk ingest process that runs on a directory and won't drag the Solr service down during the ingest.

Anyway, if you've got the processing power, a suggestion might be to try something like Wayne is doing with multi-threaded Java importing, and perhaps batch the records in chunks to further reduce the indexing time.

Cheers,

Matt.
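A rough sketch of the modulo-based batching Matt describes. The batch size, sleep interval, variable names and the postToSolr() helper are illustrative assumptions (the helper would wrap the cURL call shown earlier); it relies on the fact that a single Solr <add> may contain multiple <doc> elements:

// Hedged sketch: batch ~20 records per POST with a pause between requests.
// $records is assumed to be an array of <doc>...</doc> strings built by the import script.
$batchSize = 20;
$buffer    = '';
$count     = 0;

foreach ($records as $doc) {
    $buffer .= $doc;
    $count++;

    // modulo check: flush the buffer every $batchSize records
    if ($count % $batchSize == 0) {
        postToSolr('<add>' . $buffer . '</add>');   // hypothetical helper wrapping the cURL POST
        $buffer = '';
        sleep(3);   // give Tomcat a chance to drain its input queue
    }
}

// flush any remainder that didn't fill a whole batch
if ($buffer != '') {
    postToSolr('<add>' . $buffer . '</add>');
}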
From: Andrew N. <and...@vi...> - 2007-09-19 13:56:32
> There's got to be an easier way to index a directory of xml formatted
> data without having to make a tcp connection to post the results of a
> read off to solr - ie some sort of bulk ingest process that runs on a
> directory and won't drag the solr service down during the ingest.
>
> Anyway if you've got the processing power, a suggestion might be to try
> something like Wayne is doing with multi threaded java importing and
> perhaps batch the records in chunks to further reduce the indexing
> time.

Yes, this is where I see the future of the import script heading. It should be a Java application that can take advantage of the Java classes. It is known that doing imports to Solr over TCP/IP is many times slower than using the Solr Java classes.

Our goal with open sourcing VuFind is to offer something great to other institutions that don't have the ability to develop a similar application, as well as to build a collective group of collaborators to help make it better. If we all work together on this, we can make this a much better application.

Andrew
From: Wayne G. <ws...@wm...> - 2007-09-19 15:44:56
Hi,

Just to give folks a heads-up on what I'm working on, I thought I'd outline it a bit and see if anyone has feedback. For a Java implementation, it occurs to me that there are several goals:

- Decrease the number of steps to go from the initial MARC file to files/index
- Speed up the indexing process
- Make sure the program isn't any more difficult to use than the current scripted solution

On the first goal, pulling data directly out of a MARC file is reasonably trivial using marc4j. The only big problem is that you have to read the records sequentially with an iterator. This kind of sucks because you can't set an arbitrary number of splits in the records and process them in parallel by default. I think we'd need to do some testing to break large files into chunks and see if it's in fact faster to index in parallel than it is to go sequentially.

For speeding up the indexing, I think a couple of things can be done. First, for folks who will be running their Solr instance on the same box they have VuFind installed on, we can take advantage of a direct connection to Solr, so there's no TCP overhead, just direct I/O. However, some folks will need to do this remotely, so there should also be a method to post the content with an HttpURLConnection. This will be an order of magnitude slower than a direct connection, but may be necessary for some implementations.

To the last point, what I was thinking is that the program would be called with something along the lines of:

java -jar import-solr.jar

However, I would like to build in some flexibility for naming the MARC file to import, where the Solr instance is, and where to store the data on the system. That's a more complex call, but something like:

java -Dvufind.marc.file=/usr/local/vufind/import/import.mrc -Dvufind.solr.home=/usr/local/vufind/solr/jetty/webapps/solr -Dvufind.solr.data=... -jar import-solr.jar

A simple bash script (along the lines of the vufind startup script) could be written to make this really painless.

I did have a question start floating around in my mind as this thread developed... I have a background in Lucene, but not so much Solr, and I've done this to create a "cached" version of web pages in Lucene. In Solr, can you create stored but unindexed fields? That is, storing the MARC XML as an essentially "hidden" field in the index. There are probably a lot of very good reasons not to do this (it'll increase the size of your index by the length of the XML records, though that could be minimized by storing the entire XML file as a single line), but I thought I'd float it out there.

Wayne
From: Andrew N. <and...@vi...> - 2007-10-01 13:43:42
Well, I have been thinking about this as well. Our designer who did the interface hates the fact that everything is in templates except for the record view pages, which are in XSLT files. We have been talking about moving everything over to Solr. So we could add fields that are not used for searching into Solr as unindexed fields, as you mention. I think this is probably the best route to go, and much better than duplicating the entire MARC record into one field. For example, I am storing the 260c field but not the 260a or 260b. We could add those fields as unindexed and be able to use them for displaying record details.

Chris, we could work on this together, if you would be interested?

Andrew

> -----Original Message-----
> From: Chris Delis
> Sent: Friday, September 28, 2007 5:58 PM
> Subject: [VuFind-Tech] Non-indexed MarcXML records in SOLR a bad idea? (was Re: Java Importer (was import error))
>
> On Wed, Sep 19, 2007 at 11:44:40AM -0400, Wayne Graham wrote:
> >
> > In Solr, can you create stored but unindexed fields? That is, storing the
> > MARC XML as an essentially "hidden" field in the index.
>
> Does anyone know the answer to this question? The need for local XML
> files pains me to no end. If it were possible to store the XML in
> SOLR, say, in one non-indexed field, and it did not affect performance
> that much (especially in the faceted searching, which is probably my
> most important concern), I would love to implement it. Would it be as
> simple as creating a new SOLR field like so:
>
> <field name="marcrecord" type="text" indexed="false" stored="true" termVectors="true"/>
>
> ????
>
> I was thinking about creating this new field in SOLR, storing the
> MarcXML file in this field, and then retrieving it from SOLR instead of
> reading it in via local XML files. But if someone already knows whether
> this is stupid, a performance nightmare, etc., I'd like to know
> beforehand. Next week I will probably try to learn more about SOLR,
> but I just wanted to see if anyone knowledgeable with this stuff might
> be able to offer some insight.
>
> Thanks,
> Chris
From: Chris D. <ce...@ui...> - 2007-10-01 14:04:26
On Mon, Oct 01, 2007 at 09:43:36AM -0400, Andrew Nagy wrote:
> Chris, we could work on this together, if you would be interested?

Count me in!

Yes, the reason I was thinking about putting the whole MarcXML file into one field (instead of each MARC field into its own) was that it was the simplest to implement with respect to the current design (easier to keep up with the main line of development). But if we put all of the fields into Solr and only choose a subset of them to be indexed, it makes it easier for those who might want to index more (or fewer) of them in their own implementation.

Sounds good to me!

Chris
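If the MARC XML (or a set of unindexed display fields) ends up stored in Solr, the staff view could read it back from the index rather than from the local .xml files. A rough sketch, assuming a stored field named marcrecord as in Chris's example and the standard Solr XML response format; the function name and URL handling are illustrative assumptions:

// Hedged sketch: fetch the stored MARC XML for one record from Solr's XML response.
// The 'marcrecord' field name follows Chris's proposed schema entry and is an assumption.
// file_get_contents() is used for brevity; cURL as in the import script would work the same way.
function getStoredMarc($solrBase, $id) {
    $url = $solrBase . '/select?q=' . urlencode('id:' . $id) . '&fl=marcrecord&rows=1';
    $response = file_get_contents($url);
    if ($response === false) {
        return null;
    }

    $xml = simplexml_load_string($response);
    // the stored field comes back as <str name="marcrecord"> inside the first <doc>
    $nodes = $xml->xpath('//doc/str[@name="marcrecord"]');

    return count($nodes) ? (string) $nodes[0] : null;
}

// usage: $marcXml = getStoredMarc($configArray['SOLR']['url'], $recordId);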
From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-19 15:57:52
I actually have the software on two machines. Both are essentially our staff desktop Dells - one is a couple of years old and one is brand new, and both have 2 GB of RAM. Similar quality machines. The odd thing is that the old machine imports records to its local Solr at least 20 times faster than the new machine. So I tried importing records remotely from the new server to the old server, and the fast speeds still held up. I'd suggest anyone who has slow import performance (or even an outright stopped process) try it on another machine. Until I figure out why the new machine is having this problem, I'm going to use Solr on the older machine, and the "pub" interface will be on the new one.

Antonio
From: Wayne G. <ws...@wm...> - 2007-09-19 16:22:35
Are there different versions of Java on the boxes?
From: Casson, R. D. <cas...@mu...> - 2007-09-19 17:09:37
I've only tangentially followed this thread, but just a couple of questions/things to maybe think about:

1) yaz-marcdump will split files, which might help with parallel indexing.

2) Has it been confirmed that going the "embedded solr" route is faster at indexing than the HTTP interface? I barely followed that thread on the solr lists too... ;)

rob (who needs to get off a couple of lists to follow the important ones closer)
From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-19 17:29:23
Not really. The older machine has 1.6.0, the new machine 1.6.0_02.
From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-21 19:09:41
Through some trial and error, this is how I've improved my importing to about 30k records per 20 minutes.

First, I distributed tasks. An older production-quality server now hosts the Solr install (it has its own full VuFind install, but I only use the Solr portion of it). The new desktop that I had intended to use as my test server holds the data files and web interface, and I run the importing from the new desktop against the server.

I also rewrote the import process into two steps. The first step does everything the original PHP import script does except post to Solr; I added a step to duplicate the output XML files (named iteratively) into a separate temporary directory. Step 2 uses those XML files to post to Solr. Step 2 accepts two arguments, a starting number and an ending number, so I can run the step 2 import concurrently just by changing the starting and ending numbers. I've been doing that in batches of 10k, though I haven't tried more than 3 concurrent 10k batches at a time. Also, by using single XML files when posting, I can see which file was the last one processed, so in case there is some sort of failure, I know exactly where to pick up.

Seems to work. But I do look forward to the Java import.

Antonio
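A rough sketch of what Antonio's "step 2" could look like. The file naming, directory path and command-line handling are assumptions based on his description (iteratively numbered XML files, with a start and end number passed as arguments), and postToSolr() is the same hypothetical cURL helper sketched earlier:

// Hedged sketch: post a numbered range of pre-built Solr <add> XML files, e.g.
//   php post-range.php 10001 20000
// Paths and file names are assumptions; adjust to match the files written in step 1.
$start = (int) $argv[1];
$end   = (int) $argv[2];
$dir   = '/usr/local/vufind/import/solr-xml';   // assumed output directory from step 1

for ($i = $start; $i <= $end; $i++) {
    $file = $dir . '/' . $i . '.xml';
    if (!file_exists($file)) {
        continue;
    }

    $ok = postToSolr(file_get_contents($file));   // hypothetical helper wrapping the cURL POST
    if (!$ok) {
        // stop here so the run can be restarted from this file number
        die("Failed at record file $i\n");
    }
}

echo "Posted files $start through $end\n";

Several copies of this script can then be run concurrently over non-overlapping ranges, which is the parallelism Antonio describes.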
From: Wayne G. <ws...@wm...> - 2007-09-21 19:55:38
I just dropped Andrew a note, but I wanted to let everyone know what I found today...

Because there are variations in how MARC is implemented (recall Andrew's note about the unique ID being in the 949a and not the 001 field), I refactored the code to change the way it handles records. The flow creates a direct connection to the Solr server (it doesn't need to be running) and then reads in the MARC file. It iterates over the records, and for each record it writes a MARC XML file and converts the record to the format needed for Solr using a stylesheet (based on the marcxml2solr stylesheet). Each record is then sent to Solr using a custom requestHandler that maps to the XmlUpdateRequestHandler in Solr.

Unfortunately, this is much (MUCH) slower, as there are extra parsing and evaluation steps happening that I was able to skip by mapping specific fields directly to the index fields in Solr. Yesterday I could do 10,100 records (to make sure there was at least one autocommit in there) in less than 2 minutes (around 1:45 on average). Today's method, which is more flexible, runs the same number of documents in about 21 minutes.

Just as a straw poll: other than the unique ID, which other fields differ from the standard ones included in the XSLT file? If it's just the unique ID, I'm thinking that could be passed in as a variable, since handling this in memory is so much faster.

Wayne
From: Chris D. <ce...@ui...> - 2007-09-18 13:50:08
On Tue, Sep 18, 2007 at 02:43:30PM +0930, Matthew Hooper wrote:
>
> This is my first time posting so please be patient with me. I was just going
> to add a few comments and thoughts regarding importing records via the
> import-solr.php script which some people have had problems with.

Hi Matt,

Although I haven't personally experienced the same failures you describe, I am finding myself having to work around some of the import script's limitations. My problem with the current version is that it is not very efficient for my needs. I am still in the pilot phase of the project, so I need to change the Solr schemas quite often (at least I will be doing this a lot in the near future when I add new facets, etc.).

I have a huge collection (25 million bib records from a library consortium, 9 million of them unique). I don't need to import all of them for my pilot, but I do plan on testing with 5 or 6 million (3 or 4 schools' worth of data) at a time. So far, I am seeing 10 hours of processing time per 500,000 bib records. During this processing, I noticed the CPU was at 100% (and I have 32 available CPUs on this machine). This got me thinking that I was hitting a bottleneck at the script level and not so much in the Lucene database. For me, it was absolutely necessary to parallelize this process - in other words, to run several of these import processes at the same time. When I ran 4 import scripts at once (I had to make slight modifications to the script; for example, one change was to remove the optimize at the end - that needs to be done once, after all of the scripts have finished running), I was able to import 2,000,000 bib records in the same 10 hours.

So, if you or anyone else on this list is kind enough to share a multi-threaded version of the import script (written in Java or whatever), I would love to test it! :-)

Thanks,
Chris
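Since the optimize is pulled out of the individual import runs, a final commit/optimize can be sent to Solr as its own step once all the parallel scripts finish. A minimal sketch, assuming the same /update handler the import script already posts to; the use of $configArray and the zero timeout are illustrative assumptions:

// Hedged sketch: issue a single <commit/> and <optimize/> after all parallel imports are done.
$solr_update = $configArray['SOLR']['url'] . '/update';

foreach (array('<commit/>', '<optimize/>') as $command) {
    $ch = curl_init($solr_update);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: text/xml; charset=utf-8'));
    curl_setopt($ch, CURLOPT_POSTFIELDS, $command);
    curl_setopt($ch, CURLOPT_TIMEOUT, 0);   // optimizing a large index can take a long time; no client timeout
    curl_exec($ch);
    curl_close($ch);
}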
From: Wayne G. <ws...@wm...> - 2007-09-18 16:12:39
|
OK...wow. A test box with 32 processors? I'm jealous. Wally had mentioned this to me, but I'll ask you also, were all your processors spiking or just one? I'll try to get my Java in a little better shape this week (e.g. working) and send it to you. Wayne Chris Delis wrote: > [...] -- /** * Wayne Graham * Earl Gregg Swem Library * PO Box 8794 * Williamsburg, VA 23188 * 757.221.3112 * http://swem.wm.edu/blogs/waynegraham/ */ |
From: Chris D. <ce...@ui...> - 2007-09-18 16:20:05
|
On Tue, Sep 18, 2007 at 12:12:30PM -0400, Wayne Graham wrote: > OK...wow. A test box with 32 processors? I'm jealous. 32 cores, actually. But from the OS' perspective, it looks like 32 processors. It's a Sun T1000. The production machine will most likely be a T2000. (I have mixed feelings about the hardware, BTW. It is not the best environment for open-source development, IMHO.) > > Wally had mentioned this to me, but I'll ask you also, were all your > processors spiking or just one? Just one, which is why I quickly decided to run simultaneous PHP import scripts. :-) > > I'll try to get my Java in a little better shape this week (e.g. > working) and send it to you. Great! I would love to help (and plan to, in the future) with VUFind-related (main branch) development but I am busy working on other things at the moment. Thanks, Chris > > Wayne > > Chris Delis wrote: > > [...] |
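To make the parallel-import idea concrete, here is a rough sketch (not code from the thread) of a small CLI driver that forks one import per chunk of records and waits for all of them to finish before the index is optimized. It assumes the pcntl extension is available, that import-solr.php has been changed to take the record file as an argument and to skip its final optimize (see Chris's later message in this thread), and the chunk file names and PHP binary path are placeholders.

<?php
// Sketch only: run several copies of the import at once, then optimize afterwards.
// Assumes CLI PHP with the pcntl extension and a modified import-solr.php that
// accepts a file name argument.
$chunks = array('catalog-1.xml', 'catalog-2.xml', 'catalog-3.xml', 'catalog-4.xml');
$pids   = array();

foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("Could not fork a worker for $chunk\n");
    } elseif ($pid == 0) {
        // Child process: run one import over its chunk of records.
        pcntl_exec('/usr/bin/php', array('import-solr.php', $chunk));
        exit(1); // only reached if the exec itself fails
    }
    $pids[$pid] = $chunk;
}

// Parent: wait until every import has finished before touching the index again.
foreach ($pids as $pid => $chunk) {
    pcntl_waitpid($pid, $status);
    echo $chunk . ' finished with exit code ' . pcntl_wexitstatus($status) . "\n";
}

echo "All imports done - safe to send <optimize/> to Solr now.\n";
?>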
From: Antonio B. <abarrera@Princeton.EDU> - 2007-09-18 16:29:24
|
Chris, Care to share the php code changes for the import script? Thanks, Antonio Barrera Princeton University Library -----Original Message----- From: vuf...@li... [mailto:vuf...@li...] On Behalf Of Chris Delis Sent: Tuesday, September 18, 2007 12:20 PM To: Wayne Graham Cc: vuf...@li... Subject: Re: [VuFind-Tech] import error [...] |
From: Chris D. <ce...@ui...> - 2007-09-18 16:48:33
|
On Tue, Sep 18, 2007 at 12:29:23PM -0400, Antonio Barrera wrote: > Chris, > > Care to share the php code changes for the import script? I would, except it is almost identical to the original (I only commented out the optimize call at the end). For proof of concept, I simply set up several catalog.xml files (splitting up my 2,000,000 records into 4 sets of 500,000) and ran the import script from different places (making use of Unix soft links). I only did this once, but next time I think it would be easier to edit the import script so it will accept the XML file as a parameter (instead of the hard-coded catalog.xml file :-) Chris > > [...] |
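A minimal sketch of the change Chris describes, nothing official; the option name and the surrounding structure of import-solr.php are assumptions:

<?php
// Sketch: take the record file (and an optional flag to skip the optimize)
// from the command line instead of hard-coding catalog.xml.
$xml_file    = isset($argv[1]) ? $argv[1] : 'catalog.xml';
$do_optimize = !in_array('--no-optimize', $argv);

if (!file_exists($xml_file)) {
    die("Record file not found: $xml_file\n");
}
echo "Importing records from $xml_file\n";

// ... the existing parse / transform / POST loop would run here, unchanged ...

if ($do_optimize) {
    // Post <optimize/> to the Solr update URL, reusing the same curl pattern
    // that the import POSTs already use.
    echo "Optimizing index...\n";
}
?>

The four chunks could then be started as php import-solr.php catalog-1.xml --no-optimize and so on, with a single optimize issued once the last one returns.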
From: Wayne G. <ws...@wm...> - 2007-09-19 21:08:24
|
I've spent some time this afternoon playing with embedded solr and marc4j. So far I'm encouraged by the results. Right now, this is the way it works (and I'll leave a caveat that it's not 100% functional yet). You point the main program to the raw marc formatted file and the home directory for Solr. The program reads each record into memory from the marc file and first converts it in memory to marcxml format and then writes to the disk. With the record still in memory, it programmatically evaluates the in-memory marc record to prepare the document for Solr. Right now it's skipping the step of transforming this with an XSLT processor, but I may try to add that later this week as the rule-set is already conveniently detailed, and I only really need to load the file once. As far as speeds, in a straight up, strong arm indexing, I did 100 records in about 5 seconds consistently to read, index, and optimize. It's been doing about 1000 in 30 seconds, and did 10000 in under 2 minutes. I don't have comparative data on this box for the same indexing with PHP, but I seriously doubt the PHP version can beat these numbers. I'll keep folks updated. Wayne Wayne Graham wrote: > Hi, > > Just to give folks a heads up on what I'm working on, I thought I'd > outline it a bit and see if anyone had feedback. > > For a Java implementation, it occurs to me that there are several goals. > > - Decrease the number of steps to go from initial marc to files/index > - Speed up the indexing process > - Make sure the program isn't any more difficult to use than the current > scripted solution > > In the first goal, pulling data directly out of a marc file is reasonably > trivial using marc4j. The only big problem is that you have to read the > records sequentially with an iterator. This kind of sucks because you > can't set an arbitrary number of splits in the records and process them > in parallel by default. I think we'd need to do some testing to break up > large files into chunks and see if it's in fact faster to index in > parallel than it is to go sequentially. > > For the speeding of the indexing, I think there can be a couple of > things done. First, for folks who will be running their Solr instance on > the same box they have Vufind installed, we can take advantage of a > direct connection to Solr, so there's no TCP overhead, just direct IO. > However, there are some folks that would need to do this remotely, so a > method to post the content with an HttpURLConnection is also needed. This will be an > order of magnitude slower than a direct connection, but may be necessary > for some implementations. > > To the last point, what I was thinking is that the program would be > called with something along the lines of > > java -jar import-solr.jar > > However, I would like to build in some flexibility for naming the marc > file to import, where the Solr instance is, and where to store the data > on the system. That's a more complex call, but something like > > java -Dvufind.marc.file=/usr/local/vufind/import/import.mrc > -Dvufind.solr.home=/usr/local/vufind/solr/jetty/webapps/solr > -Dvufind.solr.data=... -jar > > A simple bash script (along the lines of the vufind startup script) > could be written to make this really painless. > > I did have a question start floating around in my mind as this thread > developed...I have a background in Lucene, but not so much Solr, and > I've done this to create a "cached" version of web pages in Lucene. In > Solr, can you create stored, but unindexed fields? Storing the marc > XML as an essentially "hidden" field in the index. There are probably a lot of > very good reasons not to do this (it'll increase the size of your index > by the length of the XML records, but could be minimized by storing the > entire xml file as a single line), but I thought I'd float it out there. > > Wayne > > Andrew Nagy wrote: >>> There's got to be an easier way to index a directory of xml formatted >>> data without having to make a tcp connection to post the results of a >>> read off to solr - ie some sort of bulk ingest process that runs on a >>> directory and won't drag the solr service down during the ingest. >>> >>> Anyway if you've got the processing power, a suggestion might be to try >>> something like Wayne is doing with multi threaded java importing and >>> perhaps batch the records in chunks to further reduce the indexing >>> time. >> Yes, this is where I see the future of the import script heading. It should be a java application that can take advantage of the java classes. It is known that doing imports to solr over TCP/IP is many factors slower than using the solr java classes. >> >> Our goals with open sourcing vufind is to offer something great to other institutions that don't have the ability to develop a similar application. As well as to build a collective group of collaborators to help make it better. If we all work together on this, we can help make this a much better application. >> >> Andrew >> >> [...] -- /** * Wayne Graham * Earl Gregg Swem Library * PO Box 8794 * Williamsburg, VA 23188 * 757.221.3112 * http://swem.wm.edu/blogs/waynegraham/ */ |
From: Andrew N. <and...@vi...> - 2007-09-20 14:07:01
|
Wayne, this sounds awesome. Thanks for working on this! I know that Casey from Seattle Public had built a java importer for library records into solr. You can see the source here: http://fac-back-opac.googlecode.com/svn/trunk/indexer/ One thing to consider, that I think needs to go into the importer, is more flexibility with fields. For example, Matt Mackey has his unique Id not in the 001 field but in the 949a field. This would be nice to be able to easily adjust in the importer. As well if there are any local custom fields, etc. I was thinking about creating a config file or xml file for the importer that would act as a mapping that could easily be adjusted. Again, thanks, this is great! Andrew > -----Original Message----- > From: Wayne Graham [mailto:ws...@wm...] > Sent: Wednesday, September 19, 2007 5:08 PM > To: Andrew Nagy > Cc: vuf...@li... > Subject: Re: [VuFind-Tech] Java Importer (was import error) > > [...] |
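Purely as a sketch of the kind of mapping file Andrew is describing (the file name, section name and keys are all invented for illustration), the PHP side could be as simple as parse_ini_file plus a tiny spec parser:

<?php
// Sketch: read a field-mapping file such as marc_map.ini, which might contain
//
//   [mapping]
//   id        = 949a
//   title     = 245ab
//   publisher = 260b
//
// so a site that keeps its bib id in 001 only has to change one line.
$map = parse_ini_file('marc_map.ini', true);

// Split a spec like "949a" into a MARC tag and its subfield codes.
function parse_spec($spec)
{
    $codes = (string) substr($spec, 3); // empty for control fields such as 001
    return array(
        'tag'       => substr($spec, 0, 3),
        'subfields' => ($codes === '') ? array() : str_split($codes)
    );
}

foreach ($map['mapping'] as $solr_field => $spec) {
    $parsed = parse_spec($spec);
    echo $solr_field . ' <= MARC ' . $parsed['tag'] .
         ($parsed['subfields'] ? ' $' . implode(',$', $parsed['subfields']) : ' (control field)') .
         "\n";
}
?>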
From: Wayne G. <ws...@wm...> - 2007-09-20 15:15:00
|
I think this is a big reason to use the XSLT in the transformation. The thing I'm going to work on (hopefully today) is pulling out the PHP processing in the xslt code. I've been waffling on whether to do this in pure XSLT or in a combination of Java/XSLT (basically porting what you've done). I think a pure XSLT solution would be ideal, however functions weren't introduced into the specification until the 2.0 release. You can run it with exslt libraries, but I'm not sure how this would actually 1) be as efficient and 2) work. I'd love to have a single stylesheet that is agnostic to the processor that could process any of this. That way, you can pick your processing poison, and you would need to do minimal tweaking to make it work in the varied environments out there. I took a look at the trunk code and unfortunately it looks like the Python is doing all the work...pretty much a Python version of the PHP script. Wayne Andrew Nagy wrote: > [...] -- /** * Wayne Graham * Earl Gregg Swem Library * PO Box 8794 * Williamsburg, VA 23188 * 757.221.3112 * http://swem.wm.edu/blogs/waynegraham/ */ |
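For anyone following along, the processor-specific coupling being discussed looks roughly like this on the calling side (the stylesheet and record file names are placeholders): a stylesheet that relies on php:function() callbacks can only be run through PHP's ext/xsl, which is the portability problem a pure XSLT or EXSLT sheet would avoid.

<?php
// Sketch of a PHP-tied transform step; a processor-agnostic stylesheet would
// drop the registerPHPFunctions()/php:function() dependency entirely.
$xml = new DOMDocument();
$xml->load('record.xml');      // placeholder MARCXML input

$xsl = new DOMDocument();
$xsl->load('import.xsl');      // placeholder stylesheet name

$proc = new XSLTProcessor();
$proc->registerPHPFunctions(); // enables php:function() callbacks inside the xsl
$proc->importStylesheet($xsl);

echo $proc->transformToXML($xml);
?>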
From: Chris D. <ce...@ui...> - 2007-09-28 21:57:42
|
On Wed, Sep 19, 2007 at 11:44:40AM -0400, Wayne Graham wrote: > > I did have a question start floating around in my mind as this thread > developed...I have a background in Lucene, but not so much Solr, and > I've done this to create a "cached" version of web pages in Lucene. In > Solr, can you create stored, but unindexed fields? Storing the marc > XML as an essentially "hidden" field in the index. There are probably a lot of > very good reasons not to do this (it'll increase the size of your index > by the length of the XML records, but could be minimized by storing the > entire xml file as a single line), but I thought I'd float it out there. > > Wayne > Does anyone know the answer to this question? The need for local XML files pains me to no end. If it were possible to store the XML in SOLR, say, in one non-indexed field, and it did not affect performance that much (especially in the faceted searching, which is probably the most important concern of mine), I would love to implement it. Would it be as simple as creating a new SOLR field like so: <field name="marcrecord" type="text" indexed="false" stored="true" termVectors="true"/> ???? I was thinking about creating this new field in SOLR, storing the MarcXML into this field, then retrieving it from SOLR instead of reading it in via local XML files. But, if someone already knows if this is stupid, a performance nightmare, etc., I'd like to know beforehand. Next week, I will probably try to learn more about SOLR, but just wanted to see if anyone knowledgeable with this stuff might be able to offer some insight. Thanks, Chris |
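If the field behaves the way Chris hopes, the record view could pull the MARCXML back out of an ordinary Solr query instead of opening files in the data directory. The sketch below assumes the marcrecord field from the schema snippet above; the Solr URL and the id value are placeholders. (One small note: as far as I know term vectors are only kept for indexed fields, so the termVectors attribute probably does nothing on an indexed="false" field.)

<?php
// Sketch: fetch the stored (unindexed) MARCXML for one record straight from Solr.
// The field name follows the schema example above; the URL and id are placeholders.
$solr_url = 'http://localhost:8983/solr/select';
$id       = '000123456';

$params = array(
    'q'  => 'id:' . $id,
    'fl' => 'id,marcrecord'
);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $solr_url . '?' . http_build_query($params));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
curl_close($ch);

// Pull the stored field out of Solr's standard XML response.
$doc = simplexml_load_string($response);
foreach ($doc->xpath('//doc/str[@name="marcrecord"]') as $field) {
    $marcxml = (string) $field;  // the original MARCXML, ready for the existing XSLT views
    echo $marcxml . "\n";
}
?>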
From: Wayne G. <ws...@wm...> - 2007-10-01 14:19:19
|
As far as the indexing goes, there's just a couple of lines to refactor in the Java code. One thing I've noticed in the indexer rewrite is that the marc4j libraries use <collection> as the root element, and the marcdump utility uses <record>. I've been working on the XSLT that generates the different views of the information and should have an updated version of the stylesheets soon that _should_ work with whichever version is being used. With the termVectors on for the new field, you can easily do a "more like this" search. However, I'm a little wary of this for a raw marc record since so much of the data is embedded in control data and subfields. I think with the termVectors on for the derived values in the index (e.g. book, subjects, etc.) you'll get better results without the increased load on the indexing and storage. I would hesitate to put the rest of the views into the index rather than working with the XSLT. This probably has more to do with my proclivity to separate data and display than any real objection, but I think it's rather elegant (and robust) to use XSLT to transform the information you get back from the index (be it a simple field, or a complex document like XML). Anyway, there's my 2 cents... Wayne Andrew Nagy wrote: > Well I have been thinking about this as well. Our designer who did the interface hates the fact that everything is in templates except for the record view pages that are in xslt files. We have been talking about moving everything over to solr. So we could add fields that are not used for searching into solr as unindexed fields as you mention. I think this is probably the best route to go. I think this would be much better than duplicating the entire marc record into one field. For example, I am storing the 260c field but not the 260a 260b. We could add those fields as unindexed and be able to use them for displaying record details. > > Chris, we could work on this together if you would be interested? > > Andrew > > >> [...] -- /** * Wayne Graham * Earl Gregg Swem Library * PO Box 8794 * Williamsburg, VA 23188 * 757.221.3112 * http://swem.wm.edu/blogs/waynegraham/ */ |