Indeed, the problem you describe is the same one we are having. Fedora
stores Unicode character references internally just fine, but when one calls
the system disseminators, they resolve the references to literal characters,
which breaks any subsequent processing. I sent Chris a reference to an
object from our Fedora that illustrates the behavior. I hope that he will
be able to tell us more about the underlying cause.
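
To make the failure mode concrete, here is a minimal Python sketch of what
happens when references are resolved and the text is then passed on without
being re-escaped. It is only an illustration, not Fedora's actual code path:
the sample title, the SOURCE string and the is_well_formed helper are all
made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical Dublin Core fragment (not taken from a real object).
# The title holds an entity reference and a hexadecimal character reference.
SOURCE = (
    '<dc xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>Fronts &amp; jets &#x2010; a survey</dc:title>'
    '</dc>'
)

DC_NS = '{http://purl.org/dc/elements/1.1/}'


def is_well_formed(document):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Parsing resolves both references to literal characters.
title = ET.fromstring(SOURCE).find(DC_NS + 'title').text

# Naive step: paste the resolved text straight into new markup.
# The bare ampersand makes the result ill-formed for the next parser.
naive = '<title>' + title + '</title>'

# Safer step: let a serializer write the text, so the ampersand is
# escaped again on the way out.
element = ET.Element('title')
element.text = title
safe = ET.tostring(element, encoding='unicode')

print(is_well_formed(naive))  # False
print(is_well_formed(safe))   # True
print(safe)
```

If the disseminators do something like the naive step, any downstream XSLT
or browser will reject the output, which matches what we see.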
Medford, MA 02155
----- Original Message -----
From: "Nat Eiseman" <neiseman@...>
To: "Fedora-users" <fedora-users@...>
Sent: Thursday, April 29, 2004 7:28 PM
Subject: Re: [Fedora-users] XML character reference resolution issue
> Chris Wilper wrote:
> > Hi Rob,
> > I'm trying to track this down...
> > After you ingest an object with the ampersand, how
> > are you getting the dublin core datastream out?
> > (is it via "getItem" or "getDublinCore"?)
> > - Chris
> > -----Original Message-----
> > From: fedora-users-bounces@...
> > [mailto:fedora-users-bounces@...] On Behalf Of Robert Chavez
> > Sent: Thursday, April 29, 2004 3:48 PM
> > To: fedora-users@...
> > Subject: [Fedora-users] XML character reference resolution issues
> > Greetings,
> > We are having some minor XML related issues with Fedora and I am
> > wondering if anyone has had a similar experience, or any ideas of how to
> > deal with this issue.
> > Some of the user supplied descriptive metadata datastreams (see sample
> > below) for digital objects that we are ingesting contain characters,
> > such as ampersands, that are encoded as entity references (&amp;) or
> > unicode hexadecimal character references. For example, in the sample
> > below, the <dc:title> element contains an &amp;.
> > The objects validate and ingest with no problem. But, we have several
> > disseminators that are basically XSLTs that perform some simple
> > disseminations of parts of this metadata datastream. For example,
> > getDublinCore might perform an XSLT transformation to disseminate the
> > text in the <dc:title> element.
> Rob, Chris, I'd like to get in on this one.
> I create my fedora objects using Perl, by reading the contents of a
> linux directory which contains all of the files for an online journal
> article. These include a master file, article-number.xml, and ca.
> 50-150 (s)html, gif, jpg and other files of virtually any MIME type.
> Each of these files, including article-number.xml, becomes a datastream
> and I parse article-number.xml to get the Dublin Core elements, labels,
> etc. These are physics papers from around the world and are _full_ of
> unicode characters. Fedora happily ingests the xml object file and all
> of the data files, ampersands and all.
> Retrieving these data is more problematical. In some cases fedora
> attempts to parse the xml and in others delivers it to the browser
> (Mozilla) intact.
> From the Dissemination Index view in Mozilla, getDublinCore produces:
> XML Parsing Error: undefined entity
> Line Number 79, Column 29:<orgname>University of
> while viewDublinCore produces:
> [FedoraAccessServlet] An error has occured in accessing the Fedora
> Access Subsystem. The error was " fedora.server.errors.GeneralException
> ". Reason: ServiceMethodDispatcher returned error. The underlying error
> was a java.lang.reflect.InvocationTargetExceptionThe message was "null"
> Input Request was:
> Request Parameters
> PID = eptest-2:226
> bDefPID = fedora-system:3
> methodName = viewDublinCore
> asOfDateTime = null
> Other Parameters Found:
> getItem DC produces (note &s in dc:Creator field):
> This XML file does not appear to have any style information
> associated with it. The document tree is shown below.
> NAKAMURA ET AL.: KUROSHIO MEANDER IN THE EAST CHINA SEA
> Hirohiko H. Nakamura firstname.lastname@example.org‐u.ac.jp
> 4223 Oceanography: General: Descriptive and regional oceanography 4520
> Oceanography: Physical: Eddies and mesoscale processes 4528
> Oceanography: Physical: Fronts and jets 4576 Oceanography: Physical:
> Western boundary currents
> <dc:source>J. Geophys. Res. 108:C11</dc:source>
> <dc:rights>Available By Subscription</dc:rights>
> All of the other get... methods produce output like this last one.
> From the Default Disseminator - Item Index View,
> Dublin Core for the Document object displays the "no style information"
> form of output above (ampersands and all) while article-number.xml
> produces a parsing error like the first example shown.
> I have not noticed any problems with the Management GUI.
> I believe that these errors would not occur if the DTD for the articles
> and its associated files was available to the process, but why does
> fedora try to parse the xml anyway? I don't see the point, but I have
> to resolve this problem one way or another before we can put our
> repository into production mode.
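
The DTD theory can be checked outside Fedora. The following is a minimal
Python sketch, assuming the failing references are named entities (the
&eacute; entity and the orgname content here are invented for illustration)
that are declared only in the article DTD. Without the declaration, a
non-validating parser has no way to resolve the name; with the declaration
in an internal subset, the same content parses.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: &eacute; is not one of the five predefined XML
# entities, so it must be declared somewhere the parser can see.
WITHOUT_DTD = '<orgname>Universit&eacute; de Paris</orgname>'

# Same content, with the entity declared in an internal DTD subset.
WITH_INTERNAL_SUBSET = (
    '<!DOCTYPE orgname [<!ENTITY eacute "&#233;">]>'
    '<orgname>Universit&eacute; de Paris</orgname>'
)


def parse_text(document):
    """Return the root element's text, or the parser's error message."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError as err:
        return 'ParseError: ' + str(err)


# Fails with an "undefined entity" error, like the Mozilla message above.
print(parse_text(WITHOUT_DTD))

# Succeeds, because the declaration travels with the document.
print(parse_text(WITH_INTERNAL_SUBSET))
```

If this is what is happening, replacing named entities with numeric
character references before ingest would be one possible workaround, since
numeric references need no DTD.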
> Tomorrow I will create some fubar objects manually to see if I can learn
> anything else. BTW, I created an object using URL redirection mode,
> hoping that that would let the process access the DTD. The result was
> that the URL for the DTD, given in article-number.xml, was converted to
> the default document root for localhost on this machine (linux2.agu.org
> is a virtual host). Not sure if fedora or apache or the OS did this,
> but will try to determine that.
> Nathaniel J. Eiseman, Ph.D.
> Electronic Publishing Systems Development neiseman@...
> American Geophysical Union http://www.agu.org
> 2000 Florida Avenue, NW 202-777-7523
> Washington, DC 20009
> Fedora-users mailing list