From: Shepard, Thomas - 1150 - MITLL <tsh...@ll...> - 2013-11-08 21:32:12
Demian, I finally had some success in loading and viewing in VuFind over 14,000 records from a tab-separated source file! I did this by generating a batch of simple XML files and creating my own XSL, consulting schema.xml as I mapped the source fields/columns to the Solr tags.

Here is my final question of the week: am I understanding correctly that there is only one date field and one url field in the current schema? We have multiple urls and multiple dates that will need to be imported into VuFind, as well as a comment field. Do I need to create these myself? Do I accomplish this in schema.xml? I thought, for example, that publishDate was defined as a multivalued field, but my attempts to add multiple dates did not work.

Anyway, I should stress the positive: I did manage to import tabular data and get that data to display in VuFind. Now for the really fun stuff!

Thanks again,
Thom Shepard

From: Demian Katz [mailto:dem...@vi...]
Sent: Thursday, November 07, 2013 2:57 PM
To: Shepard, Thomas - 1150 - MITLL
Cc: vuf...@li...
Subject: RE: xml Import advice?

Have you tried restarting Solr or running the util/optimize.php script after performing your update? Sometimes records don't show up right away in Solr. If that doesn't help, a useful next step may be to use the --test-only switch of the import script to make sure the transformation is working, e.g.:

    cd $VUFIND_HOME/import
    php import-xsl.php --test-only $VUFIND_HOME/harvest/XML/one-of-your-files.xml

If there's a problem with the transformation process, this should show you what's going wrong and help you to debug. Let me know if you need further assistance!

- Demian

From: Shepard, Thomas - 1150 - MITLL [mailto:tsh...@ll...]
Sent: Thursday, November 07, 2013 2:25 PM
To: Demian Katz
Cc: vuf...@li...
Subject: RE: xml Import advice?
Demian, To follow up on our earlier exchanges, I've written a Perl script to generate individual XML files from a tab-separated text file, one XML file for each record row, and placed them into the subfolder harvest/XML. I copied dspace.properties and dspace.xsl to mitll.properties and mitll.xsl and modified them appropriately (or so I thought). I then ran the following command:

    ./batch-import-xsl.sh `basename $0` XML mitll.properties

Most of the XML files "seem" to load into VuFind. The screen echo indicated that most were successful, and these XML files moved to the processed subfolder, leaving the unsuccessful ones in XML, as expected. (Most of those that failed were due to bad characters in the XML text, which I believe I can fix.) I restarted VuFind and performed some searches for records that seemed to be imported. But none of these imported records can be found in VuFind. I have tried this several times, deleting the index before each new attempt. The biblio index does seem to get populated, but I wonder if I need to map my data to a collection or institution value in order for my data to show up with a search.

I also wonder if I should not have used the dspace XSL as my model; I did so mainly because of its use of the Dublin Core schema. Will the XSL importer accept simple well-formed XML, or does it only work with an OAI schema? In other words, can I have something as simple as...

    <?xml version="1.0" encoding="utf-8" standalone="no"?>
    <record>
      <id>1160983</id>
      <dc:date>3/29/2013</dc:date>
      <dc:title>Medecine et maladies infectieuses</dc:title>
      <url>http://libproxy.mit.edu/login?url=http://www.sciencedirect.com/science/journal/0399077X</url>
      <dc:description>Science direct is an online service blah blah</dc:description>
    </record>

...so long as my XSL converts this to the appropriate update schema?
For example, I might use the following code to transform description:

    <!-- DESCRIPTION -->
    <xsl:if test="//dc:description">
      <field name="description">
        <xsl:value-of select="//dc:description" />
      </field>
    </xsl:if>

...producing output along the lines of:

    <add>
      <doc>
        <field name="id">1160983</field>
        <field name="title">Medecine et maladies infectieuses</field>
        <field name="date">3/29/2013</field>
        <field name="description">Science direct is an online service blah blah</field>
        <field name="url">http://libproxy.mit.edu/login?url=http://www.sciencedirect.com/science/journal/0399077X</field>
      </doc>
    </add>

Am I on the right track here? Can you tell how I am failing to find the records that I appear to have imported?

Thanks again,
Thom Shepard

From: Demian Katz [mailto:dem...@vi...]
Sent: Friday, November 01, 2013 3:57 PM
To: Shepard, Thomas - 1150 - MITLL
Subject: RE: xml Import advice?

Yes, there is a shell script that calls the importer on a directory full of XML files. The most common use case here is that you use VuFind's OAI-PMH harvester to harvest files to a directory, then you run batch-import-xsl.sh to import everything in that directory. This causes the XML files to get moved to a "processed" subdirectory, with problem records getting left behind for analysis. You can find the batch scripts in the harvest subdirectory of VuFind.

- Demian

________________________________
From: Shepard, Thomas - 1150 - MITLL [tsh...@ll...]
Sent: Friday, November 01, 2013 3:40 PM
To: Demian Katz
Subject: RE: xml Import advice?

Oh... Is there a mechanism or utility in VuFind to process a batch of thousands of XML records from, say, a Drupal or Omeka instance? Surely libraries and other repositories have a need to do this. Can you connect me to any other users successfully loading non-MARC records as batches into VuFind?

Thanks,
Thom

From: Demian Katz [mailto:dem...@vi...]
Sent: Friday, November 01, 2013 3:18 PM
To: Shepard, Thomas - 1150 - MITLL
Subject: RE: xml Import advice?

VuFind's XML importer is actually designed to read one record at a time -- it expects a separate XML file for each record.

- Demian

________________________________
From: Shepard, Thomas - 1150 - MITLL [tsh...@ll...]
Sent: Friday, November 01, 2013 3:13 PM
To: Demian Katz
Subject: RE: xml Import advice?

Demian, Deciding to give up on this method (for now), I turned to modifying the XML I've been generating and using VuFind's built-in transformers. One small step... I created a group of records using a Dublin Core schema. Selecting the DSpace properties, I managed to pull these records into VuFind. The problem was that they all went in as a single record! Looking at the XSL, I see why. Here is a small sample of my XML:

    <?xml version="1.0" encoding="windows-1252" standalone="no"?>
    <oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <record>
        <id>31287005122714</id>
        <dc:identifier>(Sirsi) 31287005122714</dc:identifier>
        <dc:identifier>(OCoLC)660161850</dc:identifier>
        <dc:identifier>(Sirsi)31287005122714</dc:identifier>
        <dc:creator>Ford, Kenneth William, 1926-</dc:creator>
        <dc:title>101 quantum questions : what you need to know about the world you can't see / Kenneth W. Ford.</dc:title>
        <dc:date>Cambridge, Mass. : Harvard University Press, 2011.</dc:date>
        <dc:description>nbl110603</dc:description>
        <dc:subject>Quantum theory Miscellanea.</dc:subject>
        <dc:subject>Quantum theory Popular works.</dc:subject>
        <collection>1</collection>
      </record>
      <record>
        <id>31287005149998</id>
        <dc:identifier>(Sirsi) 31287005149998</dc:identifier>
        <dc:identifier>(OCoLC)641998678</dc:identifier>
        <dc:identifier>(Sirsi)31287005149998</dc:identifier>
        <dc:creator>Sakurai, J. J. (Jun John), 1933-1982.</dc:creator>
        <dc:title>Modern quantum mechanics / J.J. Sakurai, Jim Napolitano.</dc:title>
        <dc:date>Boston : Addison-Wesley, c2011.</dc:date>
        <dc:description>nbl101119</dc:description>
        <dc:subject>Quantum theory Textbooks.</dc:subject>
        <dc:creator>Napolitano, Jim.</dc:creator>
        <collection>1</collection>
      </record>
      ...
    </oai_dc:dc>

The XSL parser does not seem to do anything with the <record></record> tags. I assume I need something on the order of:

    <xsl:template match="record">
    </xsl:template>

Or are the built-in XSL scripts looking for a different kind of record separator?

Thanks,
Thom

From: Demian Katz [mailto:dem...@vi...]
Sent: Friday, November 01, 2013 9:15 AM
To: Shepard, Thomas - 1150 - MITLL
Subject: RE: xml Import advice?

It looks like the --noproxy setting requires a parameter, so it is consuming your -H switch. Try removing --noproxy or else repeating the URL behind it. Looking at build.xml is a pretty reliable way of determining VuFind version. The differences between 2.1 and 2.1.1 are very small, so an upgrade should not be difficult. However, I don't think your CLI problems are related to your version; I'm pretty sure the problems I mentioned were fixed prior to 2.1. Two other things to try:

1.) Run the deletes.php script from the directory containing it... e.g. cd /usr/local/vufind2/util prior to php deletes.php -- sometimes running scripts from another directory seems to cause problems with the routing.

2.) Make sure all the appropriate environment variables are set correctly (VUFIND_HOME, VUFIND_LOCAL_DIR, VUFIND_MODULES).

If you're still having trouble, let me know the exact command you are executing and I'll dig deeper.

- Demian

________________________________
From: Shepard, Thomas - 1150 - MITLL [tsh...@ll...]
Sent: Friday, November 01, 2013 9:10 AM
To: Demian Katz
Subject: RE: xml Import advice?

I tried:

    curl "http://localhost:8181/solr/biblio/update/?commit=true" --noproxy -H "Content-Type: text/xml" --upload-file "/usr/local/vufind2/import/incremental_test.xml"

also:

    curl "http://localhost:8181/solr/biblio/update/?commit=true" --noproxy --upload-file "/usr/local/vufind2/import/incremental_test.xml"

and some other variations, but always get a proxy error message, as well as:

    curl: (3) <url> malformed

I looked at build.xml and saw that we are running just VuFind 2.1, not 2.1.1 (unless there is a better way to tell which version we are running). Hopefully we can update asap.

Thanks,
Thom

From: Demian Katz [mailto:dem...@vi...]
Sent: Friday, November 01, 2013 8:52 AM
To: Shepard, Thomas - 1150 - MITLL
Subject: RE: xml Import advice?

I think the curl problem is that the target URL needs to be the first parameter -- try moving --noproxy after the localhost line. If that's still not working, you could also try modifying your Perl script to do the posting for you directly... or you could export to Dublin Core, which would allow you to use one of the example XSLTs with minimal modification.

- Demian

________________________________
From: Shepard, Thomas - 1150 - MITLL [tsh...@ll...]
Sent: Friday, November 01, 2013 8:24 AM
To: Demian Katz
Subject: RE: xml Import advice?

Demian, I have tried every conceivable variation of the curl command, all giving me back the same error. For example:

    curl --noproxy "http://localhost:8181/solr/biblio/update/?commit=true" -H "Content-Type: text/xml" --upload-file "/usr/local/vufind2/import/incremental_test.xml"
    curl: no URL specified!
    curl: try 'curl --help' or 'curl --manual' for more information

    curl --noproxy "http://localhost:8181/solr/biblio/update/?commit=true" -H "Content-Type: text/xml" -T "/usr/local/vufind2/import/incremental_test.xml"
    curl: no URL specified!
    curl: try 'curl --help' or 'curl --manual' for more information

    curl --noproxy "http://localhost:8181/solr/biblio/update/?commit=true" -T "/usr/local/vufind2/import/incremental_test.xml"
    curl: no URL specified!
    curl: try 'curl --help' or 'curl --manual' for more information

    curl --noproxy "http://localhost:8181/solr/biblio/update/?commit=true" -T "/usr/local/vufind2/import/incremental_test.xml";
    curl: no URL specified!
    curl: try 'curl --help' or 'curl --manual' for more information

    curl --noproxy "http://localhost:8181/solr/biblio/update/?commit=true" -T /usr/local/vufind2/import/incremental_test.xml
    curl: no URL specified!
    curl: try 'curl --help' or 'curl --manual' for more information

    curl --noproxy "http://localhost:8181/solr/biblio/update/?commit=true" -T "/usr/local/vufind2/import/incremental_test.xml"
    curl: no URL specified!
    curl: try 'curl --help' or 'curl --manual' for more information

    curl --noproxy "http://localhost:8181/solr/update/?commit=true" -T "/usr/local/vufind2/import/incremental_test.xml"
    curl: no URL specified!
    curl: try 'curl --help' or 'curl --manual' for more information

    curl --noproxy "http://localhost:8181/solr/biblio/update/?commit=true" -H "Content-Type: text/xml" --upload-file "/usr/local/vufind2/import/incremental_test.xml"
    curl: no URL specified!
    curl: try 'curl --help' or 'curl --manual' for more information

As for using the provided XSLT scripts, I'd be happy to try those but I haven't found any simple XML examples to test them with. That is to say, I am generating my own XML from a database. I query and export as a delimited text file, then use Perl to convert to the Solr update schema.
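[Editor's note: since --noproxy consumes the argument that follows it, as Demian explains, one way to sidestep curl's option parsing entirely is to post the update file from a short script. Below is a minimal sketch in Python; the Solr URL and file path are the ones used in this thread, and the helper names (build_update_request, post_update_file) are hypothetical. Actually sending the request still requires a reachable Solr instance.]

```python
# Sketch: posting a Solr XML update file without curl, so no --noproxy
# option-parsing can interfere. URL/path are from the thread; adjust locally.
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8181/solr/biblio/update?commit=true"

def build_update_request(xml_bytes, url=SOLR_UPDATE_URL):
    """Build the POST request Solr expects for an XML update."""
    return urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )

def post_update_file(path, url=SOLR_UPDATE_URL):
    """Read an update file and send it to Solr; returns the response body."""
    with open(path, "rb") as fh:
        req = build_update_request(fh.read(), url)
    with urllib.request.urlopen(req) as resp:  # requires a running Solr
        return resp.read().decode("utf-8")
```

Usage would be `post_update_file("/usr/local/vufind2/import/incremental_test.xml")`; because there is no proxy-related flag involved, any proxy behavior is governed by the http_proxy/no_proxy environment variables instead.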
Again, the MARC import mechanism works wonderfully well, so I am even considering using Perl to convert a database query into a MARC flat file, though that ought to be unnecessary (and a lot of work). Or maybe I should attempt to write my own XSLT transformer, though again I would want to do something a lot simpler than, say, OAI-PMH. Is there a sample XML that I could use as a simple model, for which an XSLT transformer already exists?

Thanks again,
Thom

From: Demian Katz [mailto:dem...@vi...]
Sent: Friday, November 01, 2013 7:47 AM
To: Shepard, Thomas - 1150 - MITLL
Subject: RE: xml Import advice?

I think the problem here is that -T and --upload-file are variant forms of the same option... so when you say:

    -T --upload-file

it thinks that you are trying to upload a file named "--upload-file." Just drop either one of those two options, and the other one should work.

Also note that VuFind includes its own XML importer that allows you to apply XSLTs to XML in order to import it. If you have your own process for generating the XML files, that's fine... but the provided tool is useful if you need to hook the imports directly to PHP code (particularly important if you use the optional "change tracking" feature, since the import process needs to be able to write data to VuFind's MySQL database to keep track of record history).

- Demian

________________________________
From: Shepard, Thomas - 1150 - MITLL [tsh...@ll...]
Sent: Thursday, October 31, 2013 4:22 PM
To: Demian Katz
Subject: xml Import advice?

Demian, It is of utmost importance for us at Lincoln Lab to figure out ways to import non-MARC records. I believed if I could produce XML with the update schema as described in the Solr import documentation***, then we could do incremental imports without any problem. But I have been working days on this without success. Our MARC imports are going in perfectly, but my supervisor wants to add records from other sources.
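[Editor's note on the multivalued-field question raised at the top of the thread: in the Solr update schema, a multivalued field is expressed by repeating the <field> element, one element per value, and the field must be declared multiValued="true" in schema.xml for Solr to accept it. A sketch of generating such a document follows; the field names and the build_update_xml helper are illustrative, not part of VuFind.]

```python
# Sketch: generating a Solr update document where list values become
# repeated <field> elements. ElementTree also escapes &, <, > in text,
# which avoids the "bad characters" import failures mentioned earlier.
import xml.etree.ElementTree as ET

def build_update_xml(records):
    """records: list of dicts mapping field name -> value or list of values."""
    add = ET.Element("add")
    for rec in records:
        doc = ET.SubElement(add, "doc")
        for name, value in rec.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                # One <field> per value: this is how multiValued fields
                # are represented in the update schema.
                ET.SubElement(doc, "field", name=name).text = v
    return ET.tostring(add, encoding="unicode")

example = build_update_xml([{
    "id": "1160983",
    "title": "Medecine et maladies infectieuses",
    "publishDate": ["2013", "2012"],
}])
print(example)
```

If multiple dates still do not index, the likely culprit is the schema definition rather than the document shape; a field declared without multiValued="true" will reject the second value.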
I got some advice from other folks, but as it turned out, they were using earlier versions of VuFind, and we are using 2.0. What I've been using so far is the following, along with some path variations:

    curl --noproxy "http://localhost:8181/solr/biblio/update/?commit=true" -H "Content-Type: text/xml" -T --upload-file "/usr/local/vufind2/import/incremental_test.xml"

But all I get in return is:

    curl: Can't open '--upload-file'!
    curl: try 'curl --help' or 'curl --manual' for more information

Any clues to getting this to work would be most appreciated.

Thanks,
Thom

***
    <add>
      <doc>
        <field name="employeeId">05991</field>
        <field name="office">Bridgewater</field>
        <field name="skills">Perl</field>
        <field name="skills">Java</field>
      </doc>
      [<doc> ... </doc>[<doc> ... </doc>]]
    </add>

Thom Shepard
MIT Lincoln Lab
244 Wood St.
Lexington, MA 01523
tsh...@ll...
781 981 0370
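[Editor's note: since the importer expects one record per file, the combined oai_dc document shown earlier in the thread can be split mechanically instead of adding an <xsl:template match="record"> to the stylesheet. A sketch follows; the split_records helper name is hypothetical, and the element layout assumed is exactly that of the sample (un-namespaced <record> and <id> children, dc:-prefixed metadata).]

```python
# Sketch: split a combined file like the oai_dc sample into one file per
# <record>, matching the one-record-per-file layout the importer expects.
import xml.etree.ElementTree as ET
from pathlib import Path

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # keep the dc: prefix on output

def split_records(combined_path, out_dir):
    """Write each <record> child of the input file to its own XML file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    root = ET.parse(combined_path).getroot()
    written = []
    for record in root.findall("record"):
        # Name each output file after the record's <id> element.
        path = out / f"{record.findtext('id')}.xml"
        ET.ElementTree(record).write(path, encoding="utf-8",
                                     xml_declaration=True)
        written.append(path)
    return written
```

The resulting files could then be dropped into harvest/XML and run through batch-import-xsl.sh as described earlier in the thread.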