From: Colin P. A. <co...@co...> - 2007-10-13 15:59:23
I have just worked out the cause of an interesting bug in the XSLT library. The problem occurs with the following transformation:

    <?xml version="1.0"?>
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                   xmlns="http://www.w3.org/1999/xhtml" version="2.0">
      <xsl:output method="xhtml" indent="no" encoding="UTF-8" normalization-form="NFC"/>
      <xsl:template match="/">
        <html>
          <body>Á</body>
        </html>
      </xsl:template>
    </xsl:transform>

The text in the body is the code-point sequence x41,x301, that is, the character A followed by a combining acute accent. This is represented internally within the XSLT engine as a UC_UTF8_STRING of 3 bytes (count = 2).

The request to normalize to normalization form NFC converts this sequence to a single composed character, an A with an acute accent. The process by which this is done finishes with a single INTEGER, which is then converted to a STRING using {UC_UNICODE_ROUTINES}.code_to_string. This routine, however, returns a Latin-1 STRING (not a UC_UTF8_STRING) when the code is small enough, whereas the actual process of encoding the string for writing to a file expects UC_UTF8_STRINGs. Since in this case the requested output is UTF-8, class XM_XSLT_UTF8_ENCODER passes the STRING through unchanged to the XM_OUTPUT object for efficiency, so the Latin-1 string reaches the output without being re-encoded.

There are several ways I can fix this bug. One is to change XM_XSLT_UTF8_ENCODER (and the other encoders) so that they test the dynamic type of the STRING and convert as necessary. Another is to change the Unicode normalization routines to always create UC_UTF8_STRINGs, either by writing a variant of {UC_UNICODE_ROUTINES}.code_to_string or by changing it so that it always produces UC_UTF8_STRINGs.

But I can think of other possibilities. What if the input string is a STRING_32? I think that I would want to see STRING_32 output in that case, so neither UC_UTF8_STRING nor plain Latin-1 STRING_8 would be very acceptable. So my instinct is to change the normalization routines so that the dynamic type of the input string is preserved.
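The byte-level mismatch described above can be reproduced outside the library. The following is a minimal sketch in Python rather than Eiffel; it uses none of the Gobo classes, and the variable names are purely illustrative. It assumes only that a "Latin-1 STRING" holds one byte per character, and stands in for that with Python's `latin-1` codec. It also prints the decomposed form of a Latin-1 character, which bears on whether a Latin-1 string type can be preserved under NFD.

```python
import unicodedata

# The stylesheet's text node: 'A' (U+0041) followed by a combining acute (U+0301).
decomposed = "\u0041\u0301"

# NFC composes the pair into the single code point U+00C1.
composed = unicodedata.normalize("NFC", decomposed)
code = ord(composed)

# Illustrative stand-in for a Latin-1 string: one byte per character.
latin1_bytes = composed.encode("latin-1")

# What a UTF-8 writer actually needs for the same character: two bytes.
utf8_bytes = composed.encode("utf-8")

print(hex(code))           # 0xc1
print(latin1_bytes.hex())  # c1
print(utf8_bytes.hex())    # c381

# Passing the Latin-1 byte through unchanged does not give valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    passed_through_ok = True
except UnicodeDecodeError:
    passed_through_ok = False
print(passed_through_ok)   # False

# Decomposition goes the other way: a Latin-1 character yields a code above 255.
nfd_codes = [ord(c) for c in unicodedata.normalize("NFD", "\u00c1")]
print([hex(c) for c in nfd_codes])  # ['0x41', '0x301']
```

The single byte c1 is not a valid UTF-8 sequence on its own, which is consistent with the symptom: an encoder that assumes its input is already UTF-8 and copies it verbatim emits a malformed file.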
Preserving the dynamic type will work for NFC and NFKC (composition), but it won't work for NFD or NFKD since, given non-ASCII Latin-1 input, they will produce codes beyond 255.

Also, the routines as_nfc and as_nfd (as opposed to to_nfc and to_nfd) return, by design, the input object if it is already in the desired normalization form. So a solution that always produced a UC_UTF8_STRING when given ASCII or Latin-1 STRING input is undesirable.

All opinions welcome.

--
Colin Adams
Preston
Lancashire