Which XSLT are you using?  The answer might lie in the XSLT code that is pulling data out of the <subject> tag.  Also, if you tell me which XSLT is in use here, I can try to reproduce the problem over here.

As another approach, have you tried deleting the offending line and then retyping it manually?  Is it possible there is some non-printing character in there that's messing things up without being obviously visible?  (This seems unlikely if xmllint is happy...  but it's such a weird problem that anything seems worth a try).
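
If retyping is inconvenient, a quick scan can confirm or rule out hidden characters. The following is only a rough Python sketch, not part of VuFind; the helper name `find_suspicious_chars` and the choice of Unicode categories to flag are my own assumptions.

```python
import unicodedata


def find_suspicious_chars(text):
    """Return (line, column, codepoint, name) for characters that are easy to miss.

    Flags control and format characters, Unicode line/paragraph separators,
    and any space character other than the plain ASCII space.
    Tabs and carriage returns are ignored.
    """
    hits = []
    # Split on "\n" only, so unusual separators are reported rather than consumed.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in ("\t", "\r"):
                continue
            category = unicodedata.category(ch)
            odd_space = category == "Zs" and ch != " "
            if category in ("Cc", "Cf", "Zl", "Zp") or odd_space:
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


# Made-up sample with a no-break space and a zero-width space planted in it.
sample = "<subject>St. Paul\u00a0&amp; the Redeemer\u200b Church</subject>"
for hit in find_suspicious_chars(sample):
    print(hit)
```

To run it against the real record, read the file as text first, e.g. `find_suspicious_chars(open("apf3_03094.xml", encoding="utf-8").read())`. An empty result would suggest the problem lies elsewhere.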

- Demian

From: Tod Olson [tod@uchicago.edu]
Sent: Friday, February 22, 2013 8:10 AM
To: vufind-tech@lists.sourceforge.net Tech Mailinglist
Subject: [VuFind-Tech] OAI import oddnesses

vufind-tech,

We're seeing something strange with our OAI harvest: an "unterminated entity reference" error on XML that passes xmllint.

We have about 5 records out of thousands which raise an "unterminated entity reference" error. These records have an "&amp;" character entity in the <subject>. The XSLT importer complains that there is an unterminated entity reference right after the "&amp;". Based on the error message, it's as if the data is being XML-decoded twice:

Processing /data/pyrite/vufind2/harvest/../local/harvest/APF//1361223215_oai_lib_uchicago_edu_apf3_03094.xml ...
PHP Warning:  DOMDocument::createElement(): unterminated entity reference  the Redeemer Church (Chicago, Ill.) in /data/pyrite/vufind2/module/VuFind/src/VuFind/XSLT/Import/VuFind.php on line 424
PHP Warning:  DOMDocument::createElement(): unterminated entity reference  the Redeemer Church (Chicago in /data/pyrite/vufind2/module/VuFind/src/VuFind/XSLT/Import/VuFind.php on line 424
Successfully imported /data/pyrite/vufind2/harvest/../local/harvest/APF//1361223215_oai_lib_uchicago_edu_apf3_03094.xml...

Here's the offending <subject>; the file is attached as apf3_03094.xml:

        <subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>
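
To make the double-decoding guess concrete, here is a minimal Python sketch. It is not the VuFind PHP code, and `rewrap_unescaped` is a hypothetical stand-in for any step that treats already-decoded text as markup. It only shows the general failure mode: once "&amp;" has been decoded to a bare "&", handing that text to something that expects escaped XML fails at the ampersand.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

source = "<subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>"

# First decode: the parser turns "&amp;" into a literal "&".
decoded = ET.fromstring(source).text


def rewrap_unescaped(value):
    """Hypothetical helper: wraps decoded text in markup without re-escaping it."""
    return ET.fromstring("<str>" + value + "</str>")


# Second decode: the bare "&" is now read as the start of an entity reference.
try:
    rewrap_unescaped(decoded)
    second_parse_error = None
except ET.ParseError as exc:
    second_parse_error = exc

# Re-escaping before wrapping avoids the problem and preserves the text.
rewrapped = ET.fromstring("<str>" + escape(decoded) + "</str>")

print(repr(decoded))
print("second parse failed:", second_parse_error is not None)
print(repr(rewrapped.text))
```

If something similar is happening in the importer, escaping the value (or adding it as a text node rather than as markup) at the point where the element is created would be the place to look. That is an assumption on my part, not something confirmed from the code.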

More strangely, this does not happen for all <subject>s containing "&amp;"; it seems to be particular to this subject. For example, the second attached file, apf2_09151.xml, has the following and raises no error:

        <subject>Marshall Field &amp; Company</subject>

Back to that problem record, I do notice that the topics in the resulting Solr document are:

<arr name="topic">
        <str>St</str>
        <str>Paul</str>
        <str>St. Paul</str>
        <str>Ill.)</str>
        <str>Military education</str>
        <str>Typewriters</str>
        <str>Sailors</str>
</arr>

So something is happening with the problematic <subject> to split out "St" and "Paul" and "St. Paul" and "Ill.)" all as separate strings, where I would expect just one string to be produced.

Has anyone run into something like this before? Any clues or advice would be appreciated.

-Tod

Tod Olson <tod@uchicago.edu>
Systems Librarian    
University of Chicago Library