Which XSLT are you using? The answer might lie in the XSLT code that is pulling data out of the <subject> tag. Also, if you tell me which XSLT is in use here, I can try to reproduce the problem over here.
As another approach, have you tried deleting the offending line and then retyping it manually? Is it possible there is some non-printing character in there that's messing things up without being obviously visible? (This seems unlikely if xmllint is happy, but it's such a weird problem that anything seems worth a try.)
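If retyping is awkward, a quick scan for invisible characters is another way to test the same idea. This is only a minimal sketch in Python, not part of VuFind; the function name find_suspicious_chars and the sample string (with a zero-width space planted in it on purpose) are made up for illustration.

```python
import unicodedata

def find_suspicious_chars(text):
    """Return (offset, codepoint, name) for each control or format
    character in text, ignoring ordinary tab, LF and CR."""
    hits = []
    for offset, ch in enumerate(text):
        if ch in "\t\n\r":
            continue
        # Cc = control characters, Cf = format characters
        # (zero-width space, byte order mark, soft hyphen, ...)
        if unicodedata.category(ch) in ("Cc", "Cf"):
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((offset, f"U+{ord(ch):04X}", name))
    return hits

# Hypothetical sample: a zero-width space is deliberately inserted
# after the entity so the scan has something to find.
sample = "<subject>St. Paul &amp;\u200b the Redeemer Church (Chicago, Ill.)</subject>"
for hit in find_suspicious_chars(sample):
    print(hit)
```

To check a harvested record, read the file as UTF-8 (for example open(path, encoding="utf-8").read()) and pass the result to the function; an empty list means nothing of this kind was found.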
From: Tod Olson [email@example.com]
Sent: Friday, February 22, 2013 8:10 AM
To: firstname.lastname@example.org Tech Mailinglist
Subject: [VuFind-Tech] OAI import oddnesses
We're seeing something strange with our OAI harvest: an "unterminated entity reference" error with XML that passes xmllint.
We have about 5 records out of thousands which raise an unterminated entity reference error message. These records have a "&amp;" character entity in the <subject>. The XSLT importer complains that there is an unterminated entity reference right after the "&".
Based on the error message, it's as if the data is XML un-encoded twice:
Processing /data/pyrite/vufind2/harvest/../local/harvest/APF//1361223215_oai_lib_uchicago_edu_apf3_03094.xml ...
PHP Warning: DOMDocument::createElement(): unterminated entity reference the Redeemer Church (Chicago, Ill.) in /data/pyrite/vufind2/module/VuFind/src/VuFind/XSLT/Import/VuFind.php on line 424
PHP Warning: DOMDocument::createElement(): unterminated entity reference the Redeemer Church (Chicago in /data/pyrite/vufind2/module/VuFind/src/VuFind/XSLT/Import/VuFind.php on line 424
Successfully imported /data/pyrite/vufind2/harvest/../local/harvest/APF//1361223215_oai_lib_uchicago_edu_apf3_03094.xml...
Here's the offending <subject>; the file is attached as apf3_03094.xml:
<subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>
More strangely, this does not happen for all <subject>s containing "&"; it seems to be particular to this subject. For example, the second attached file, apf2_09151.xml, has the following and raises no error:
<subject>Marshall Field &amp; Company</subject>
Back to that problem record, I do notice that the topics in the resulting Solr document are:
So something is happening with the problematic <subject> to split out "St" and "Paul" and "St. Paul" and "Ill.)" all as separate strings, where I would expect just one string to be produced.
Has anyone run into something like this before? Any clues or advice would be appreciated.
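To make the "un-encoded twice" idea concrete, here is a minimal sketch using Python's ElementTree rather than the PHP importer, so it illustrates the hypothesis only and says nothing definite about what VuFind.php does. The element name topic is made up for the example. The warning itself comes from DOMDocument::createElement(), so one guess is that already-decoded text is being handed over as markup somewhere.

```python
import xml.etree.ElementTree as ET

# Well-formed source record fragment, as xmllint would accept it.
source = "<subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>"

# First decode: parsing turns &amp; into a literal "&".
text = ET.fromstring(source).text

# Pasting the decoded text back into markup without re-escaping
# leaves a bare "&", which a second parse rejects.
unescaped = "<topic>" + text + "</topic>"
try:
    ET.fromstring(unescaped)
    second_parse_ok = True
except ET.ParseError as err:
    second_parse_ok = False
    print("second parse failed:", err)

# Setting the value as text lets the library escape it on output.
topic = ET.Element("topic")  # hypothetical element name
topic.text = text
escaped = ET.tostring(topic, encoding="unicode")
print(escaped)
```

If something similar is going on in the importer, the usual remedy is to add the value as a text node so that the library escapes it, instead of passing it where markup is expected.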
Tod Olson <email@example.com>
University of Chicago Library