Re: [Exist-open] entity problem

Scott Gillespie wrote:

> It is interesting now, in my stylesheet this works correctly:

> <xsl:output method="html" version="4.0" encoding="ISO-8859-1" indent="yes"/>

> but the following gives me Â°
> <xsl:output method="html" version="4.0" encoding="UTF-8" indent="yes"/>

It depends on what you mean by "works correctly" and "gives me". What is it
you actually want?

The value of ISO-8859-1 on xsl:output's encoding attribute will cause the
XSLT processor's serializer module to write any instance of the abstract
character DEGREE SIGN into the output stream as the single-byte ISO-8859-1
code point 0xb0. If you then view this in an application (or on a console)
correctly configured for ISO-8859-1, you will see a degree sign.
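
To make the byte-level claim concrete, here is a small Python sketch. It is
only an illustration of the encoding step, not what any particular XSLT
serializer runs internally, and the variable names are my own.

```python
# DEGREE SIGN is the abstract character U+00B0.
degree = "\u00b0"

# Serialising it as ISO-8859-1 produces exactly one byte, 0xb0.
latin1_bytes = degree.encode("iso-8859-1")
print(latin1_bytes.hex())  # b0

# A viewer configured for ISO-8859-1 maps that byte back to the same character.
roundtrip = latin1_bytes.decode("iso-8859-1")
print(roundtrip == degree)  # True
```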

The value of "UTF-8" on that same attribute will, given the same input
document and output tree, cause the serializer to write any instance of
the abstract character DEGREE SIGN into the output stream as the two-byte
utf-8 sequence 0xc2 0xb0. Now if you view that output in an application
that is expecting to receive ISO-8859-1, it will misinterpret that
single-codepoint two-byte sequence as two separate single-byte code
points, the ISO-8859-1 mappings of LATIN CAPITAL LETTER A WITH CIRCUMFLEX
and DEGREE SIGN. So that's what you will see on screen unless you are
using a utf-8 savvy viewer, but it confirms that the correct utf-8 byte
sequence is present. Switch the encoding of your viewing app to utf-8 and
the single degree sign glyph should replace the incorrect pair of glyphs.
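
The same misreading can be reproduced outside any viewer. In this Python
sketch the "viewer" is just a decode call with the wrong encoding name,
which is an assumption made for illustration rather than a model of any
real application.

```python
import unicodedata

degree = "\u00b0"

# Serialising as utf-8 produces the two-byte sequence 0xc2 0xb0.
utf8_bytes = degree.encode("utf-8")
print(utf8_bytes.hex())  # c2b0

# A viewer that wrongly assumes ISO-8859-1 sees two separate characters.
misread = utf8_bytes.decode("iso-8859-1")
misread_names = [unicodedata.name(ch) for ch in misread]
print(misread_names)

# A viewer that correctly assumes utf-8 sees the single degree sign.
correct = utf8_bytes.decode("utf-8")
print(correct == degree)  # True
```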

If, however, you really are switching your viewer app appropriately before
trying to view the two output files in their respective encodings, and in
correct utf-8 mode you still see Â°, then you have hit the bugbear of
double transcoding somewhere. This happens (and all Perl programmers will
issue a sympathetic groan here) when some component in a character
handling library thinks that a utf-8 character stream is actually in an
ISO-8859-n 8-bit encoding and so needs transcoding into utf-8, when in
fact it doesn't. The snag is that although it is a trivial matter to
detect byte values in an ISO-8859-1 stream that are not valid utf-8, there
is no reliable algorithm (though there are various heuristics) for
detecting that a stream really and truly is in utf-8 already and so
shouldn't be transcoded again.
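
One such heuristic, sketched in Python: try a strict utf-8 decode and see
whether it fails. The function name looks_like_utf8 is hypothetical, and
the second call shows exactly why this can never be more than a heuristic.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed utf-8.

    This only proves the bytes COULD be utf-8. It cannot prove they
    were not intended as ISO-8859-1 text that happens to be valid utf-8.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0xb0 (ISO-8859-1 degree sign) is a stray continuation byte,
# so it is easy to reject as utf-8.
print(looks_like_utf8(b"\xb0"))  # False

# 0xc2 0xb0 is valid utf-8 for DEGREE SIGN, but it is equally valid
# ISO-8859-1 for the two characters A-circumflex and degree sign.
print(looks_like_utf8(b"\xc2\xb0"))  # True
```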

So if told (wrongly) that a character stream is ISO-8859-1 encoded and
needs transcoding to utf-8, any transcoding routine will happily attempt
to do just that, with disastrous semantic consequences if there are any
multi-byte utf-8 sequences present in the input, because then each
component byte in each utf-8 sequence is treated as a distinct character
value and so gets translated into the corresponding utf-8 sequence for
that spurious value. In other words, what lies behind the two displayed
glyphs Â° in a utf-8 aware rendering of a double-transcoded document are
the *four bytes* 0xc3 0x82 0xc2 0xb0, with the first pair being the utf-8
transcoding of the misinterpreted lead byte of the original utf-8
sequence, and the second pair being the utf-8 transcoding of the second
byte in the sequence, likewise wrongly taken to be a distinct character
code point. The only way of settling with 100% certainty which case you
have is to load the document into a hex editor and inspect the number and
values of the bytes at that point.
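
If no hex editor is to hand, the four-byte result is easy to reproduce and
inspect in Python. This is a sketch of the faulty step in isolation; where
the second pass actually happens in your own pipeline is something I can
only guess at.

```python
degree = "\u00b0"

# First (correct) pass: the character is serialised as utf-8.
once = degree.encode("utf-8")

# Faulty second pass: the utf-8 bytes are wrongly taken to be ISO-8859-1
# and transcoded to utf-8 again, one byte at a time.
twice = once.decode("iso-8859-1").encode("utf-8")

# Hex dump of both versions, as a hex editor would show them.
once_hex = " ".join(f"{b:02x}" for b in once)
twice_hex = " ".join(f"{b:02x}" for b in twice)
print(once_hex)   # c2 b0
print(twice_hex)  # c3 82 c2 b0

# Reversing the faulty pass recovers the original, provided every
# character in the damaged text falls within the ISO-8859-1 range.
repaired = twice.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(repaired == degree)  # True
```

Two bytes at that point in the file mean a correct utf-8 document viewed
in the wrong mode; four bytes mean the document itself has been transcoded
twice.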

Double transcoding comes about either by application programmer error,
sending the data through a transcoding pass twice, or by a system library
defect, whereby the system fails to track the encoding of the character
stream it is processing. This is the domain of the horrendous "utf-8 flag"
in Perl, but it has its counterparts in all languages which attempt to
handle utf-8 variable-length encoding sequences while staying backward
compatible with older one-byte-per-character encodings (or with the
different processing paradigms required for stateful multi-byte encoding
schemes).

Michael Beddow
