[gobo-eiffel-develop] UTF-8 strings changing to Latin-1 strings


I have just worked out the cause of an interesting bug in the XSLT
library.

The problem occurs with the following transformation:

<?xml version="1.0"?> 
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml" version="2.0">

<xsl:output method="xhtml" indent="no" encoding="UTF-8" normalization-form="NFC"/>

<xsl:template match="/">
  <html>
        <body>&#x41;&#x301;</body>
  </html>
</xsl:template>

</xsl:transform>

The code-point sequence x41,x301 consists of the character A followed
by a combining acute accent. This is represented internally within the
XSLT engine as a UC_UTF8_STRING of 3 bytes (count = 2).

The request to normalize to normalization form NFC converts this
sequence to a single composed character, an A with an acute accent.
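
As an illustration outside Eiffel, here is a minimal Python sketch using
the standard unicodedata module. It is not the Gobo implementation, only
a check of the code points and byte counts described above.

```python
import unicodedata

# U+0041 (A) followed by U+0301 (combining acute accent)
decomposed = "\u0041\u0301"

# NFC composes the pair into the single character U+00C1
composed = unicodedata.normalize("NFC", decomposed)

# In UTF-8 the decomposed form takes 1 byte for A plus 2 bytes for U+0301
decomposed_utf8 = decomposed.encode("utf-8")

# After composition there is a single code point left
composed_code = ord(composed)
```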

The process by which this is done finishes with a single INTEGER which
is then converted to a STRING using
{UC_UNICODE_ROUTINES}.code_to_string.

But this routine returns a Latin-1 STRING (not a UC_UTF8_STRING), when
the code is small enough.

But the actual process of encoding the string for writing to a file
expects UC_UTF8_STRINGs. Since in this case the requested output
is UTF-8, class XM_XSLT_UTF8_ENCODER passes the STRING through
unchanged to the XM_OUTPUT object for efficiency, so the Latin-1
character is written out without being converted to UTF-8.
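
To make the effect concrete, here is a small Python sketch of the
byte-level difference. It assumes the Latin-1 STRING holds the composed
character as the single byte 0xC1 and that this byte reaches the output
unchanged. The helper is_valid_utf8 is hypothetical and not part of any
Gobo class.

```python
composed = "\u00c1"  # A with acute accent, the result of NFC composition

# What a Latin-1 string holds for this character: one byte
latin1_bytes = composed.encode("latin-1")

# What a UTF-8 output stream should contain: two bytes
utf8_bytes = composed.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Under that assumption, passing the Latin-1 byte through to a file
declared as UTF-8 yields output that a UTF-8 reader rejects.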

There are several ways I can fix this bug.

One is to change the XM_XSLT_UTF8_ENCODER (and other encoders) so that
they test the dynamic type of STRING and convert as necessary.

Another is to change the Unicode normalization routines to always
create UC_UTF8_STRINGs either by writing a variant on
{UC_UNICODE_ROUTINES}.code_to_string, or changing it so as to always
produce UC_UTF8_STRINGs.
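
The first option can be sketched roughly as follows, in Python rather
than Eiffel. The classes Latin1Bytes and Utf8Bytes and the function
encode_for_utf8_output are hypothetical stand-ins for STRING,
UC_UTF8_STRING and the encoder. They only show the shape of a
dynamic-type test, not the real XM_XSLT_UTF8_ENCODER code.

```python
class Latin1Bytes(bytes):
    """Hypothetical stand-in for a plain Latin-1 STRING."""


class Utf8Bytes(bytes):
    """Hypothetical stand-in for a UC_UTF8_STRING."""


def encode_for_utf8_output(value: bytes) -> bytes:
    """Return bytes suitable for a UTF-8 output stream."""
    if isinstance(value, Utf8Bytes):
        # Already UTF-8: pass through unchanged, as the encoder does today
        return bytes(value)
    # Otherwise treat the content as Latin-1 and transcode it
    return bytes(value).decode("latin-1").encode("utf-8")
```

The cost is one type test per string on the output path, which is the
price of keeping the pass-through case fast.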

But I can think of other possibilities. What if the input string is a
STRING_32? I think that I would want to see STRING_32 output in that
case, so neither UC_UTF8_STRING nor plain Latin-1 STRING_8 would be
very acceptable.

So my instinct is to change the normalization routines so that the
dynamic type of the input string is preserved. This will work for NFC
and NFKC (composition), but it won't work for NFD or NFKD because,
given non-ASCII Latin-1 input, they can produce codes beyond 255.
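
A quick check of the NFD case, again as a Python sketch with the
standard unicodedata module rather than the Gobo routines. The helper
fits_latin1 is hypothetical.

```python
import unicodedata

# A with acute accent: a non-ASCII character that fits in Latin-1
latin1_input = "\u00c1"

# NFD splits it back into A plus the combining acute accent
decomposed = unicodedata.normalize("NFD", latin1_input)
codes = [ord(c) for c in decomposed]


def fits_latin1(text: str) -> bool:
    """Return True if every code point is at most 255."""
    return all(ord(c) <= 255 for c in text)
```

The combining accent U+0301 lies outside the Latin-1 range, so the
decomposed result cannot be held in a Latin-1 string.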

Also, the routines as_nfc and as_nfd (as opposed to to_nfc and to_nfd)
return (by design) the input object if it is already in the desired
normalization form. So a solution that always produced UC_UTF8_STRING
when given ASCII or Latin-1 STRING input is undesirable.
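
The "return the input object" behaviour can be sketched like this in
Python (3.8 or later, for unicodedata.is_normalized). The function name
as_nfc is borrowed from the routine above only for illustration and
does not reproduce its implementation.

```python
import unicodedata


def as_nfc(text: str) -> str:
    """Return text itself when it is already in NFC, else a normalized copy."""
    if unicodedata.is_normalized("NFC", text):
        # Already normalized: hand back the very same object
        return text
    return unicodedata.normalize("NFC", text)


already_nfc = "abc"
same_object = as_nfc(already_nfc)

needs_work = "A\u0301"
composed = as_nfc(needs_work)
```

Since the first call hands back its argument, the result has whatever
type the caller passed in, which is why a fixed result type does not
fit the as_ variants.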

All opinions welcome.
-- 
Colin Adams
Preston Lancashire