> My CDATA source document string contains the character entity
> which represents the bullet character.
There's no such thing as a "character entity" in XML. It's an entity
reference, and there must be a matching entity definition in the DTD. The
reference expands to whatever the DTD defines it as.
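To make the point about definitions concrete, here is a minimal sketch in Java. It uses the JDK's built-in JAXP parser rather than saxon:parse(), and the entity name "bull", the element name "doc" and the class name are all assumptions chosen for illustration, not taken from your document. It shows that the same reference expands to whatever the internal DTD subset says it does.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityExpansionDemo {

    // Parses XML held in a Java string (a character stream, as with a
    // StringReader) and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical documents: the same reference, two different definitions.
        String asBullet = "<!DOCTYPE doc [ <!ENTITY bull \"&#x2022;\"> ]><doc>&bull;</doc>";
        String asStar = "<!DOCTYPE doc [ <!ENTITY bull \"*\"> ]><doc>&bull;</doc>";

        // Print the expansion as code points so that no console encoding is involved.
        rootText(asBullet).codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        rootText(asStar).codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

The first document should yield U+2022 and the second U+002A: the name of the entity carries no meaning of its own.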
> On my local windows environment the transformation (via
> saxon:parse()) works fine and the actual bullet character is
> the result in my output document (i.e. the character entity
> reference is resolved to the correct 'real' character). My
> output method is "xml" and my output encoding is "utf8".
The output method and encoding have nothing to do with this.
> However when I run the same transformation in a unix
> environment the same character entity reference is resolved
> to the " character (double quote).
Are you sure that this is what saxon:parse() is returning, or is it a
problem that occurs later, when the result document is serialized? Check
using string-to-codepoints() to see what the reference actually expands to.
> After looking at the Parse function implementation, I think
> the problem might be due to the fact that the InputSource
> object is created with a character stream (StringReader)
> object. And any encoding declarations within the source
> document and even the encoding attribute on the class itself
> are ignored. (Please correct me if I'm wrong.)
Well, if the DTD containing the definition of the entity has been
mis-encoded, then this is a possibility. You haven't said whether the DTD is
internal or external. If it's internal, then yes: saxon:parse() accepts a
string as input, someone has to decode bytes to create that string, and if
they decode those bytes using the wrong encoding, a problem could arise at
that point. So it rather depends on where your string came from.
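As an illustration of how decoding with the wrong encoding corrupts a string before any XML parser sees it, here is a small Java sketch. The class and method names are made up for the example, and it is not meant to reproduce your exact symptom (the double quote), only the general mechanism.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class DecodingDemo {

    // Formats each code point of s as U+XXXX, separated by single spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The bullet character U+2022 encoded in UTF-8 is the three bytes E2 80 A2.
        byte[] bytes = {(byte) 0xE2, (byte) 0x80, (byte) 0xA2};

        // Same bytes, two different decoders.
        String right = new String(bytes, StandardCharsets.UTF_8);
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(codePoints(right)); // U+2022
        System.out.println(codePoints(wrong)); // U+00E2 U+0080 U+00A2
    }
}
```

If the code that builds your string relies on the platform default encoding, Windows and Unix machines can legitimately decode the same bytes differently, which is why it matters where the string came from.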
> So I'm guessing the encoding type is somehow determined by
> the environment default? If this is not the case then I
> don't have a clue what else may be causing this anomaly.
First thing, I think, is to determine whether saxon:parse() returned the
incorrect character, or whether the problem occurred later during
serialization. Once you know that, the problem space is reduced by half.
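The two checks can be kept apart along the following lines. This is only a sketch under stated assumptions: it uses the JDK's JAXP parser and identity transformer as stand-ins for saxon:parse() and the Saxon serializer, and the document, class and method names are hypothetical. Within a stylesheet, string-to-codepoints() plays the role of step 1.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseVersusSerialize {

    // Parses XML from a string; no byte decoding happens at this stage.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Serializes the document to bytes in the requested output encoding.
    static byte[] serialize(Document doc, String encoding) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    // Renders bytes as two-digit upper-case hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Document doc = parse("<doc>&#x2022;</doc>"); // hypothetical input

        // Step 1: what did the parser produce? Inspect code points, not glyphs.
        String text = doc.getDocumentElement().getTextContent();
        text.codePoints().forEach(cp -> System.out.printf("parsed: U+%04X%n", cp));

        // Step 2: what did the serializer write? Inspect raw bytes, not a
        // terminal or editor rendering of them.
        System.out.println("bytes: " + hex(serialize(doc, "UTF-8")));
    }
}
```

If step 1 already shows the wrong code point, the serializer is not to blame. If step 1 is correct and the bytes in step 2 are wrong, or are correct but displayed wrongly by whatever you use to view the file, the parser is off the hook.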