From: Scott G. <ws...@vi...> - 2006-03-02 18:51:49
Thanks for the lesson in encoding. That clears things up quite a bit for
me. I'm obviously a little deficient in knowledge in this area and
appreciate the detailed response. I think everything is working as
expected now.

Scott

...................................
Project Manager / Programmer
Virginia Center for Digital History
http://www.vcdh.virginia.edu
P:434.924.4777
F:434.243.5566

Michael Beddow wrote:
> Scott Gillespie wrote:
>
>> It is interesting now, in my stylesheet this works correctly:
>
>> <xsl:output method="html" version="4.0" encoding="ISO-8859-1" indent="yes"/>
>
>> but the following gives me Â°
>> <xsl:output method="html" version="4.0" encoding="UTF-8" indent="yes"/>
>
> It depends on what you mean by "works correctly" and "gives me". What is it
> you actually want?
>
> The value of ISO-8859-1 on xsl:output's encoding attribute will cause the
> XSLT processor's serializer module to write any instance of the abstract
> character DEGREE SIGN into the output stream as the single byte ISO-8859-1
> code point 0xb0. If you then view this in an application (or on a console)
> correctly configured for ISO-8859-1, you will see a degree sign.
>
> The value of "UTF-8" on that same attribute will, given the same input
> document and output tree, cause the serializer to write any instance of the
> abstract character DEGREE SIGN into the output stream as the two-byte
> utf-8 sequence 0xc2 0xb0. Now if you view that output in an application that
> is expecting to receive ISO-8859-1 then it will misinterpret that
> single-codepoint 2-byte sequence as two separate single byte code points,
> for the ISO-8859-1 mappings of LATIN CAPITAL LETTER A WITH CIRCUMFLEX and
> DEGREE SIGN. So that's what you will see on screen unless you are using a
> utf-8 savvy viewer, but it confirms that the correct utf-8 byte sequence is
> present.
> Switch the encoding of your viewing app to utf-8 and the single degree sign
> glyph should replace the incorrect pair of glyphs.
>
> If however, you really are switching your viewer app appropriately before
> trying to view the two output files in their respective encodings, and in
> correct utf-8 mode you still see Â° then you have hit the bugbear of double
> transcoding somewhere. This happens (and all Perl programmers will issue a
> sympathetic groan here) when some component in a character handling library
> thinks that a utf-8 character stream is actually in an ISO-8859-n 8 bit
> encoding and so needs transcoding into utf-8, when in fact it doesn't. The
> snag is that although it is a trivial matter to detect non-utf-8 values in
> an ISO-8859-1 stream, there is no reliable algorithm (though there are
> various heuristics) for detecting that a stream really and truly is in utf-8
> already and so shouldn't be transcoded again.
>
> So if told (wrongly) that a character stream is ISO-8859-1 encoded and needs
> transcoding to utf-8, any transcoding routine will happily attempt to do
> just that, with disastrous semantic consequences if there are any multi-byte
> utf-8 sequences present in the input, because then each component byte
> in each utf-8 sequence is treated as a distinct character value and so gets
> translated into the corresponding utf-8 sequence for that spurious value. In
> other words, what lies behind the two displayed glyphs Â° in a utf-8 aware
> rendering of a double-transcoded document are the *four bytes* 0xc3 0x82
> 0xc2 0xb0, with the first pair being the utf-8 transcoding of the
> misinterpreted lead byte of the original utf-8 sequence, and the second pair
> being the utf-8 transcoding of the second byte in the sequence, likewise
> wrongly taken to be a distinct character code point.
> The only way of settling which you have with 100% certainty is to load the
> document into a hex editor and inspect the number and values of the bytes
> at that point.
>
> Double transcoding comes about either by application programmer error,
> sending the data through a transcoding pass twice, or by a system library
> defect, whereby the system fails to track the encoding of the character
> stream it is processing. This is the domain of the horrendous "utf-8 flag"
> in Perl, but it has its counterparts in all languages which attempt to
> handle utf-8 variable-length encoding sequences while staying backward
> compatible with older one-byte-per-character encodings (or with the
> different processing paradigms required for stateful multi-byte encoding
> schemes).
>
> Michael Beddow
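
The byte sequences described in the quoted message can be reproduced with a short Python 3 sketch. This is only an illustration under stated assumptions: the helper names (`hex_bytes`, `double_transcode`, `undo_double_transcode`) are made up for this example and do not come from the thread, from eXist, or from any XSLT processor. It uses only Python's built-in `str.encode` and `bytes.decode`.

```python
# Illustrative sketch only; helper names are hypothetical, not from the thread.

DEGREE = "\u00b0"  # the abstract character DEGREE SIGN


def hex_bytes(data: bytes) -> str:
    """Render bytes the way a hex editor would show them, e.g. '0xc2 0xb0'."""
    return " ".join(f"0x{b:02x}" for b in data)


def double_transcode(utf8_bytes: bytes) -> bytes:
    """Simulate the bug: wrongly treat utf-8 bytes as ISO-8859-1 and
    transcode them to utf-8 a second time."""
    return utf8_bytes.decode("iso-8859-1").encode("utf-8")


def undo_double_transcode(data: bytes) -> bytes:
    """Reverse one spurious transcoding pass. Only valid when the input
    really was double-transcoded; raises UnicodeError otherwise."""
    return data.decode("utf-8").encode("iso-8859-1")


# What the serializer writes for each value of the encoding attribute.
latin1 = DEGREE.encode("iso-8859-1")  # single byte
utf8 = DEGREE.encode("utf-8")         # two-byte sequence

# Correct utf-8 viewed by an application expecting ISO-8859-1:
# two glyphs on screen, but only two bytes in the file.
misread = utf8.decode("iso-8859-1")

# Genuine double transcoding: four bytes in the file.
doubled = double_transcode(utf8)

print("ISO-8859-1 output:", hex_bytes(latin1))
print("utf-8 output:     ", hex_bytes(utf8))
print("utf-8 misread as ISO-8859-1:", misread)
print("double-transcoded:", hex_bytes(doubled))
print("repaired:         ", hex_bytes(undo_double_transcode(doubled)))
```

Counting the bytes is what separates the two cases: a file that merely looks wrong in a mis-configured viewer holds two bytes for the degree sign, while a double-transcoded file holds four. The repair function is a heuristic in the sense the quoted message describes; it cannot tell on its own whether a stream was double-transcoded, so it should only be applied after inspecting the bytes.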