From: Michael B. <mbe...@mb...> - 2006-03-01 23:33:17
Scott Gillespie wrote:

> It is interesting now, in my stylesheet this works correctly:
> <xsl:output method="html" version="4.0" encoding="ISO-8859-1" indent="yes"/>
> but the following gives me Â°
> <xsl:output method="html" version="4.0" encoding="UTF-8" indent="yes"/>

It depends on what you mean by "works correctly" and "gives me". What is it you actually want?

The value of ISO-8859-1 on xsl:output's encoding attribute will cause the XSLT processor's serializer module to write any instance of the abstract character DEGREE SIGN into the output stream as the single-byte ISO-8859-1 code point 0xb0. If you then view this in an application (or on a console) correctly configured for ISO-8859-1, you will see a degree sign.

The value of "UTF-8" on that same attribute will, given the same input document and output tree, cause the serializer to write any instance of the abstract character DEGREE SIGN into the output stream as the two-byte utf-8 sequence 0xc2 0xb0. Now if you view that output in an application that is expecting to receive ISO-8859-1, it will misinterpret that single-codepoint two-byte sequence as two separate single-byte code points, namely the ISO-8859-1 mappings of LATIN CAPITAL LETTER A WITH CIRCUMFLEX and DEGREE SIGN. So that is what you will see on screen unless you are using a utf-8-savvy viewer, but it confirms that the correct utf-8 byte sequence is present. Switch the encoding of your viewing app to utf-8 and the single degree sign glyph should replace the incorrect pair of glyphs.

If, however, you really are switching your viewer app appropriately before trying to view the two output files in their respective encodings, and in correct utf-8 mode you still see Â°, then you have hit the bugbear of double transcoding somewhere.
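The byte-level behaviour described so far can be checked outside any XSLT processor. What follows is only a minimal sketch in Python (chosen purely for illustration; it is not what the serializer or any viewer does internally), and the helper names serialize, view_as and describe are hypothetical. It prints hex values and Unicode character names rather than glyphs, so the result does not depend on how your console is configured.

```python
import unicodedata

DEGREE_SIGN = "\u00b0"  # the abstract character DEGREE SIGN


def serialize(text, encoding):
    """Return the bytes written for text in the given output encoding."""
    return text.encode(encoding)


def view_as(data, assumed_encoding):
    """Decode bytes the way a viewer set to assumed_encoding would."""
    return data.decode(assumed_encoding)


def describe(text):
    """List the Unicode names of the characters in text."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    latin1 = serialize(DEGREE_SIGN, "iso-8859-1")
    utf8 = serialize(DEGREE_SIGN, "utf-8")
    print(latin1.hex())  # one byte
    print(utf8.hex())    # two bytes

    # Correct UTF-8 output read by a viewer that expects ISO-8859-1:
    print(describe(view_as(utf8, "iso-8859-1")))

    # The same bytes read by a viewer correctly set to UTF-8:
    print(describe(view_as(utf8, "utf-8")))
```

If the first call to describe reports two characters and the second reports one, the output file itself is fine and only the viewer's encoding setting needs changing.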
Double transcoding happens (and all Perl programmers will issue a sympathetic groan here) when some component in a character handling library thinks that a utf-8 character stream is actually in an ISO-8859-n 8-bit encoding and so needs transcoding into utf-8, when in fact it doesn't. The snag is that although it is a trivial matter to detect non-utf-8 values in an ISO-8859-1 stream, there is no reliable algorithm (though there are various heuristics) for detecting that a stream really and truly is in utf-8 already and so shouldn't be transcoded again.

So if told (wrongly) that a character stream is ISO-8859-1 encoded and needs transcoding to utf-8, any transcoding routine will happily attempt to do just that, with disastrous semantic consequences if there are any multi-byte utf-8 sequences present in the input, because then each component byte in each utf-8 sequence is treated as a distinct character value and so gets translated into the corresponding utf-8 sequence for that spurious value. In other words, what lies behind the two displayed glyphs Â° in a utf-8-aware rendering of a double-transcoded document are the *four bytes* 0xc3 0x82 0xc2 0xb0, with the first pair being the utf-8 transcoding of the misinterpreted lead byte of the original utf-8 sequence, and the second pair being the utf-8 transcoding of the second byte in the sequence, likewise wrongly taken to be a distinct character code point. The only way of settling which you have with 100% certainty is to load the document into a hex editor and inspect the number and values of the bytes at that point.

Double transcoding comes about either by application programmer error, sending the data through a transcoding pass twice, or by a system library defect, whereby the system fails to track the encoding of the character stream it is processing.
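To make the four-byte result concrete, here is a rough Python sketch of a transcoding pass applied to data that is already utf-8. The function names are hypothetical, and the code only imitates the faulty step; it makes no claim about how Perl or any particular library implements it.

```python
def transcode_latin1_to_utf8(data):
    """Transcode bytes *assumed* to be ISO-8859-1 into UTF-8.

    Every byte value is a valid ISO-8859-1 character in Python's codec,
    so this never fails, even when the assumption is wrong.
    """
    return data.decode("iso-8859-1").encode("utf-8")


def is_valid_utf8(data):
    """Report whether data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def undo_one_transcoding(data):
    """Reverse one spurious ISO-8859-1 -> UTF-8 pass.

    Raises UnicodeEncodeError if the decoded text contains characters
    above U+00FF, in which case the data was not double-transcoded
    in this particular way.
    """
    return data.decode("utf-8").encode("iso-8859-1")


if __name__ == "__main__":
    correct = "\u00b0".encode("utf-8")           # c2 b0
    doubled = transcode_latin1_to_utf8(correct)  # second, unwanted pass

    print(correct.hex())
    print(doubled.hex())

    # A lone 0xb0 (ISO-8859-1 degree sign) is easy to reject as UTF-8 ...
    print(is_valid_utf8(b"\xb0"))
    # ... but the double-transcoded bytes are perfectly valid UTF-8,
    # so validity alone cannot reveal the damage.
    print(is_valid_utf8(doubled))

    print(undo_one_transcoding(doubled) == correct)
```

Printing the hex values here does the job of the hex editor: two bytes at the position of the degree sign mean the file is sound, four bytes mean it has been through the mill twice. The repair function is a heuristic only, since a document may legitimately contain the two characters Â° side by side.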
Keeping track of a stream's encoding in this way is the domain of the horrendous "utf-8 flag" in Perl, but it has its counterparts in all languages which attempt to handle utf-8 variable-length encoding sequences while staying backward compatible with older one-byte-per-character encodings (or with the different processing paradigms required for stateful multi-byte encoding schemes).

Michael Beddow