From: Demian K. <dem...@vi...> - 2013-02-22 13:22:27
Which XSLT are you using? The answer might lie in the XSLT code that pulls data out of the <subject> tag. Also, if you tell me which XSLT is in use, I can try to reproduce the problem here.

As another approach, have you tried deleting the offending line and then retyping it manually? Is it possible that some non-printing character in there is messing things up without being obviously visible? (This seems unlikely if xmllint is happy... but it's such a weird problem that anything seems worth a try.)

- Demian

________________________________
From: Tod Olson [to...@uc...]
Sent: Friday, February 22, 2013 8:10 AM
To: vuf...@li... Tech Mailinglist
Subject: [VuFind-Tech] OAI import oddnesses

vufind-tech,

We're seeing something strange with our OAI harvest: an "unterminated entity reference" error with XML that passes xmllint.

We have about 5 records out of thousands which raise an unterminated entity reference error message. These records have a "&amp;" character entity in the <subject>. The XSLT importer complains that there is an unterminated entity reference right after the "&". Based on the error message, it's as if the data is XML un-encoded twice:

    Processing /data/pyrite/vufind2/harvest/../local/harvest/APF//1361223215_oai_lib_uchicago_edu_apf3_03094.xml ...
    PHP Warning: DOMDocument::createElement(): unterminated entity reference the Redeemer Church (Chicago, Ill.) in /data/pyrite/vufind2/module/VuFind/src/VuFind/XSLT/Import/VuFind.php on line 424
    PHP Warning: DOMDocument::createElement(): unterminated entity reference the Redeemer Church (Chicago in /data/pyrite/vufind2/module/VuFind/src/VuFind/XSLT/Import/VuFind.php on line 424
    Successfully imported /data/pyrite/vufind2/harvest/../local/harvest/APF//1361223215_oai_lib_uchicago_edu_apf3_03094.xml...

Here's the offending <subject>; the file is attached as apf3_03094.xml:

    <subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>

More strangely, this does not happen for all <subject>s containing "&amp;"; it seems to be particular to this subject. For example, the second attached file, apf2_09151.xml, has the following and raises no error:

    <subject>Marshall Field &amp; Company</subject>

Back to the problem record: I do notice that the topics in the resulting Solr document are:

    <arr name="topic">
      <str>St</str>
      <str>Paul</str>
      <str>St. Paul</str>
      <str>Ill.)</str>
      <str>Military education</str>
      <str>Typewriters</str>
      <str>Sailors</str>
    </arr>

So something is happening with the problematic <subject> to split out "St" and "Paul" and "St. Paul" and "Ill.)" as separate strings, where I would expect just one string to be produced.

Has anyone run into something like this before? Any clues or advice would be appreciated.

-Tod

Tod Olson <to...@uc...>
Systems Librarian
University of Chicago Library
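Demian's suggestion to look for a non-printing character can be checked mechanically rather than by retyping the line. The following is only a minimal sketch in Python, not part of VuFind: the function name find_suspect_chars and the "dirty" sample string (with a no-break space planted after "St.") are hypothetical, made up for illustration.

```python
import unicodedata


def find_suspect_chars(text):
    """Return (offset, code point, Unicode name) for each character that is
    invisible or easy to overlook: control, format, and unusual space chars.
    Ordinary spaces, tabs, and line breaks are ignored."""
    hits = []
    for offset, ch in enumerate(text):
        if ch in "\n\r\t ":
            continue
        if unicodedata.category(ch) in ("Cc", "Cf", "Zs", "Zl", "Zp"):
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((offset, "U+%04X" % ord(ch), name))
    return hits


# The subject text as it appears in the thread: nothing suspicious.
clean = "St. Paul & the Redeemer Church (Chicago, Ill.)"

# Hypothetical variant with a no-break space (U+00A0) after "St.".
dirty = "St.\u00a0Paul & the Redeemer Church (Chicago, Ill.)"

print(find_suspect_chars(clean))
print(find_suspect_chars(dirty))
```

To check a harvested record, the same function could be applied to the file contents, for example find_suspect_chars(open(path, encoding="utf-8").read()), where path is whichever file is under suspicion.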
From: Filipe MS B. (UA) <fs...@ua...> - 2013-02-22 13:48:44
Hi!

Again writing on the run (I have to conduct a focus group in about half an hour; it's what I do for a living now :)). I had my laptop packed and ready to go, but took it back out (I read this message on my phone) because I couldn't leave without replying, as I won't be able to for the rest of the afternoon.

To the point: I would check the field (in the XSLT, the source, or the interaction between them) that comes before the offending one. I've had a quick look at the source .xml, and the previous field is

    <field name="format">Image</field>

So there is no problem there; my guess is that the problem lies within some string produced by the explodes you are doing.

Let me explain (I'll be back later with more concrete data): perhaps, when the importer reaches

    <subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>

without the previous field having been closed with a "</something>", it throws that error. But this is just a theory.

Sorry, I'll come back to this later if the cause of the problem has not been found in the meantime.

Filipe
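Tod's remark that the data looks as if it were XML un-encoded twice can be illustrated outside VuFind. PHP's documentation states that the value passed to DOMDocument::createElement() is not escaped, so a string that has already been decoded once and still contains a bare "&" would be read as the start of an entity. Whether that is what happens at line 424 of VuFind.php is an assumption here, not something the thread confirms. The sketch below uses Python's standard library purely to show the general effect; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

source = "<subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>"

# Parsing decodes the entity once: the text now holds a bare "&".
text = ET.fromstring(source).text

# Pasting the decoded text back into markup without escaping it
# produces XML that is no longer well-formed.
unescaped = "<str>" + text + "</str>"
try:
    ET.fromstring(unescaped)
    reparse_ok = True
except ET.ParseError:
    reparse_ok = False

# Escaping the text before building markup keeps it well-formed,
# and the value survives the round trip unchanged.
escaped = "<str>" + escape(text) + "</str>"
round_trip = ET.fromstring(escaped).text

print(text)
print(reparse_ok)
print(round_trip == text)
```

In PHP, the equivalent precaution would be to create the element without a value and append a node made with DOMDocument::createTextNode(), which does escape its content.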