PS: Regarding “without the previous field being closed, “</something>””: this could be caused, for instance, by an unexpected “ in the middle of the field value.

 

 

From: Filipe MS Bento (UA)
Sent: Friday, 22 February 2013 13:47
To: 'Demian Katz'; Tod Olson; vufind-tech@lists.sourceforge.net Tech Mailinglist
Subject: RE: OAI import oddnesses

 

Hi!

 

Again writing on the run (I have to conduct a focus group in about half an hour; it’s what I do for a living now :) ). I had my laptop packed and ready to go, but took it out again (I read this message on my phone) because I couldn’t leave without replying, as I won’t be able to reply during the rest of the afternoon.

 

Getting to the point: I would check the field (XSLT, source or interaction) before the offending one. I’ve had a quick look at the source .xml, and the previous field is

 

   <field name="format">Image</field>

 

So no problem there; I guess the problem lies within some string resulting from the explodes you are doing.

 

Let me explain (I’ll be back later with more concrete data): perhaps the problem lies in the fact that when the importer reaches

 

                <subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>

 

without the previous field being closed (“</something>”), it will throw that error. But this is just theory on my part.

 

Sorry, I’ll come back to this later if a solution (or the cause of the problem) has not been found in the meantime,

 

Filipe

 

 

From: Demian Katz [mailto:demian.katz@villanova.edu]
Sent: Friday, 22 February 2013 13:22
To: Tod Olson; vufind-tech@lists.sourceforge.net Tech Mailinglist
Subject: Re: [VuFind-Tech] OAI import oddnesses

 

Which XSLT are you using?  The answer might lie in the XSLT code that is pulling data out of the <subject> tag.  Also, if you tell me which XSLT is in use here, I can try to reproduce the problem over here.

As another approach, have you tried deleting the offending line and then retyping it manually?  Is it possible there is some non-printing character in there that's messing things up without being obviously visible?  (This seems unlikely if xmllint is happy...  but it's such a weird problem that anything seems worth a try).
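
For the non-printing character check, something along these lines could be run over the offending line. This is only a rough Python sketch, not part of VuFind; the function name find_suspicious_chars and the "dirty" sample (with a zero-width space planted in it) are made up for illustration.

```python
import unicodedata


def find_suspicious_chars(text):
    """Return (index, code point, Unicode name) for characters that are
    invisible or easy to miss: control, format, and unusual space characters."""
    hits = []
    for index, char in enumerate(text):
        category = unicodedata.category(char)
        # Cc = control, Cf = format (e.g. zero-width space); ordinary
        # tab/newline/carriage return are not reported.
        is_control_or_format = category in ("Cc", "Cf") and char not in "\t\n\r"
        # Zs = space separators; anything other than a plain space is suspect.
        is_odd_space = category == "Zs" and char != " "
        if is_control_or_format or is_odd_space:
            name = unicodedata.name(char, "<unnamed>")
            hits.append((index, "U+{:04X}".format(ord(char)), name))
    return hits


clean = "<subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>"
# Made-up example: the same line with a zero-width space hidden after "&amp;".
dirty = clean.replace("&amp;", "&amp;\u200b")

print(find_suspicious_chars(clean))
print(find_suspicious_chars(dirty))
```

An empty list for the real line would suggest that hidden characters are not the cause.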

- Demian


From: Tod Olson [tod@uchicago.edu]
Sent: Friday, February 22, 2013 8:10 AM
To: vufind-tech@lists.sourceforge.net Tech Mailinglist
Subject: [VuFind-Tech] OAI import oddnesses

vufind-tech,

We're seeing something strange with our OAI harvest: an "unterminated entity reference" error with XML that passes xmllint.

We have about 5 records out of thousands which raise an unterminated entity reference error message. These records have a "&amp;" character entity in the <subject>. The XSLT importer complains that there is an unterminated entity reference right after the "&amp;". Based on the error message, it's as if the data is XML un-encoded twice:

Processing /data/pyrite/vufind2/harvest/../local/harvest/APF//1361223215_oai_lib_uchicago_edu_apf3_03094.xml ...
PHP Warning:  DOMDocument::createElement(): unterminated entity reference  the Redeemer Church (Chicago, Ill.) in /data/pyrite/vufind2/module/VuFind/src/VuFind/XSLT/Import/VuFind.php on line 424
PHP Warning:  DOMDocument::createElement(): unterminated entity reference  the Redeemer Church (Chicago in /data/pyrite/vufind2/module/VuFind/src/VuFind/XSLT/Import/VuFind.php on line 424
Successfully imported /data/pyrite/vufind2/harvest/../local/harvest/APF//1361223215_oai_lib_uchicago_edu_apf3_03094.xml...
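
To illustrate what "un-encoded twice" would look like, here is a minimal Python sketch. It is not the importer's PHP code, and the helper names (decoded_text, rebuild_unescaped, rebuild_escaped, is_well_formed) are hypothetical. The parser decodes "&amp;" to "&" once; if that decoded text is then placed back into markup without re-escaping, the bare "&" begins an entity reference that is never terminated.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SOURCE = "<subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>"


def decoded_text(xml_fragment):
    """Parse the fragment; the parser decodes &amp; to a bare & (first decode)."""
    return ET.fromstring(xml_fragment).text


def rebuild_unescaped(name, value):
    """Hypothetical faulty step: paste the decoded value into markup as-is."""
    return "<{0}>{1}</{0}>".format(name, value)


def rebuild_escaped(name, value):
    """Same step, but re-escaping the value before it becomes markup again."""
    return "<{0}>{1}</{0}>".format(name, escape(value))


def is_well_formed(xml_fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_fragment)
        return True
    except ET.ParseError:
        return False


value = decoded_text(SOURCE)
print(is_well_formed(rebuild_unescaped("field", value)))  # False
print(is_well_formed(rebuild_escaped("field", value)))    # True
```

Note that this sketch would fail for any value containing "&", so by itself it does not account for only some subjects raising the warning.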

Here's the offending <subject>; the file is attached as apf3_03094.xml:

        <subject>St. Paul &amp; the Redeemer Church (Chicago, Ill.)</subject>

More strangely, this does not happen for all <subject>s containing "&amp;"; it seems to be particular to this subject. For example, the second attached file, apf2_09151.xml, has the following and raises no error:

        <subject>Marshall Field &amp; Company</subject>

Back to that problem record, I do notice that the topics in the resulting Solr document are:

<arr name="topic">
        <str>St</str>
        <str>Paul</str>
        <str>St. Paul</str>
        <str>Ill.)</str>
        <str>Military education</str>
        <str>Typewriters</str>
        <str>Sailors</str>
</arr>

So something is happening with the problematic <subject> to split out "St" and "Paul" and "St. Paul" and "Ill.)" all as separate strings, where I would expect just one string to be produced.

Has anyone run into something like this before? Any clues or advice would be appreciated.

-Tod

       






Tod Olson <tod@uchicago.edu>
Systems Librarian    
University of Chicago Library