CLISP - an ANSI Common Lisp / Bugs / #370 utf-8 conversion error

Jörg Höhle - 2006-11-10

Logged In: YES
user_id=377168

Confirmed.

Note that man utf-8 says:
"The UCS code values 0xd800–0xdfff (UTF-16
surrogates) as well as 0xfffe and 0xffff
(UCS non-characters) should not appear in conforming
UTF-8 streams."
I.e. your loop does not produce something that can be
transformed to UTF-8.
iconv complains about exactly that.

Of course, that's no reason for an internal error in CLISP.

The offending code at file-position 1431 is
55323 = #xD81B = #o154033 = #b1101 100000 011011,
generating the three-byte sequence #(#xED #xA0 #x9B) = #(237
160 155).
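
The arithmetic behind that three-byte sequence can be checked by hand. The following is only a sketch in portable Common Lisp, not CLISP's actual encoder; the name code-point-to-utf-8 is made up for illustration, and the function deliberately performs no validity check, so a surrogate is encoded mechanically.

```lisp
;; Illustrative helper (hypothetical name, not part of CLISP):
;; encode one code point, given as an integer, into a list of UTF-8
;; byte values. No validity check is done on purpose.
(defun code-point-to-utf-8 (code)
  (flet ((cont (shift)
           ;; continuation byte 10xxxxxx carrying 6 payload bits
           (logior #x80 (ldb (byte 6 shift) code))))
    (cond ((< code #x80)                       ; 0xxxxxxx
           (list code))
          ((< code #x800)                      ; 110xxxxx 10xxxxxx
           (list (logior #xC0 (ash code -6)) (cont 0)))
          ((< code #x10000)                    ; 1110xxxx 10xxxxxx 10xxxxxx
           (list (logior #xE0 (ash code -12)) (cont 6) (cont 0)))
          (t                                   ; 11110xxx + 3 continuation bytes
           (list (logior #xF0 (ash code -18)) (cont 12) (cont 6) (cont 0))))))

(code-point-to-utf-8 #xD81B) ; => (237 160 155), i.e. #xED #xA0 #x9B
```

The bit groups 1101 / 100000 / 011011 shown above are exactly the payloads of the lead byte and the two continuation bytes.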

BTW, you forgot code-char in (dolist (c l) (princ c o)))
You generated a pure ASCII file in your 2nd example.
Note also that it reaches code 65536, which may not be intended.

Maybe CLISP should error out when trying to write or convert
a character whose code lies between #xd800 and #xdfff to a
stream with :external-format UTF-8.
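
Such a check could look roughly like this. It is a sketch of the idea only: surrogate-code-p and check-utf-8-encodable are hypothetical names, they work on integer codes rather than on characters, and nothing here reflects how CLISP's encoding.d is organized.

```lisp
;; Hypothetical guard, for illustration only.
(defun surrogate-code-p (code)
  "True if CODE lies in the UTF-16 surrogate range."
  (<= #xD800 code #xDFFF))

(defun check-utf-8-encodable (code)
  "Return CODE if it may be encoded as UTF-8, else signal an error."
  (when (or (surrogate-code-p code)
            (not (<= 0 code #x10FFFF)))
    (error "Code point #x~X cannot be encoded as UTF-8." code))
  code)

(check-utf-8-encodable #x10400) ; returns the code unchanged
;; (check-utf-8-encodable #xD81B) would signal an error
```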

(convert-string-to-bytes (string (code-char #x10400))
charset:utf-8)
*** - Internal error: statement in file "encoding.d", line
2805 has been reached!!
Please send the authors of the program a description
how you produced this error!
The following restarts are available:
ABORT :R1 ABORT

Note that I'm not familiar at all with unicode and UTF-8,
multilingual planes, reserved code points etc.


Jörg Höhle - 2006-11-13

Logged In: YES
user_id=377168

The original bug (internal error) is gone with today's patch.

However, 2 UTF-8 issues are unclear to me. See
http://unicode.org/unicode/faq/utf_bom.html

Issue A:
"Each UTF is reversible, thus every UTF supports lossless
round tripping: mapping from any Unicode coded character
sequence S to a sequence of bytes and back will produce S
again. To ensure round tripping, a UTF mapping must also
map all code points that are not valid Unicode characters to
unique byte sequences. These invalid code points are the 66
noncharacters (including FFFE and FFFF), as well as unpaired
surrogates."

CLISP does not support round-tripping. #\UD800 (code-char
#xd800) is
converted to a 3-byte sequence. Reading this back yields an
error.

What would be consistent is either
a) disallow #\uD800 (and continue to refuse to convert from
bytes), or
b) continue to allow these codes, and accept to convert from
bytes.
It seems the above text mandates b).

The same FAQ also says, about unpaired UTF-16 surrogates:
"Unicode conformance requires that encoding form conversion
always results in valid data stream. Therefore a converter
must treat this as an error."
This can be read in favour of a). But is the context
(utf-16) appropriate here?

Issue B: Another non-conformance: substituting broken byte
sequences
"A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and
must never be generated. When faced with this illegal byte
sequence while transforming or interpreting, a UTF-8
conformant process must treat the first byte 110xxxxx₂ as an
illegal termination error: for example, either signaling an
error, filtering the byte out, or representing the byte with
a marker such as FFFD (REPLACEMENT CHARACTER). In the latter
two cases, it will continue processing at the second byte
0xxxxxxx₂."

(ext:convert-string-from-bytes #(#xc1 #x61 #x62)
(ext:make-encoding :charset charset:utf-8
:input-error-action #\5))
"5b"
is wrong, it should generate "5ab".
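
To make the expected behaviour concrete, here is a rough sketch of a decoder that substitutes a marker for one bad byte and then resumes at the very next byte, as the quoted text describes. It is not CLISP's code; decode-utf-8-leniently is a made-up name, BYTES is assumed to be a vector of octets, and code-char is assumed to accept any valid Unicode code point (true for a Unicode-enabled build).

```lisp
;; Hypothetical lenient decoder, for illustration only.
(defun decode-utf-8-leniently (bytes replacement)
  "Decode the vector BYTES; write the character REPLACEMENT for each bad byte."
  (let ((n (length bytes))
        (i 0))
    (flet ((continuation-p (k)
             ;; true if BYTES[k] exists and has the form 10xxxxxx
             (and (< k n) (= (logand (aref bytes k) #xC0) #x80)))
           (sequence-length (b)
             ;; length announced by lead byte B, or NIL if B cannot lead
             (cond ((< b #x80) 1)
                   ((<= #xC2 b #xDF) 2)   ; #xC0 and #xC1 are never valid
                   ((<= #xE0 b #xEF) 3)
                   ((<= #xF0 b #xF4) 4)
                   (t nil))))
      (with-output-to-string (out)
        (loop while (< i n)
              do (let* ((b (aref bytes i))
                        (len (sequence-length b))
                        (code (and len
                                   (loop for k from (1+ i) below (+ i len)
                                         always (continuation-p k))
                                   ;; fold the payload bits together
                                   (reduce (lambda (acc byte)
                                             (logior (ash acc 6)
                                                     (logand byte #x3F)))
                                           bytes
                                           :start (1+ i) :end (+ i len)
                                           :initial-value
                                           (if (= len 1)
                                               b
                                               (logand b (ash #x7F (- len))))))))
                   (if (and code
                            ;; reject overlong forms, surrogates, out of range
                            (>= code (aref #(0 0 #x80 #x800 #x10000) len))
                            (not (<= #xD800 code #xDFFF))
                            (<= code #x10FFFF))
                       (progn (write-char (code-char code) out)
                              (incf i len))
                       ;; bad byte: emit the marker, resume at the NEXT byte
                       (progn (write-char replacement out)
                              (incf i)))))))))

(decode-utf-8-leniently #(#xc1 #x61 #x62) #\5) ; => "5ab"
```

The key design choice is that an error advances by exactly one byte, so the bytes following a bad lead byte are examined again instead of being swallowed.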

Testing other malformed sequences with
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
reveals other similar errors in CLISP's UTF-8 handling.

In Ubuntu Dapper, Firefox-1.5 appears to have excellent
UTF-8 support, while Gnome-terminal (2.14.2) has several
substitutions wrong according to that testfile.


Jörg Höhle - 2006-11-13

status: open --> open-fixed

Jörg Höhle - 2006-11-13

Logged In: YES
user_id=377168

Ad issue A:
RFC3629 http://www.faqs.org/rfcs/rfc3629.html
"The definition of UTF-8 prohibits encoding character
numbers between U+D800 and U+DFFF, which are reserved for
use with the UTF-16 encoding form (as surrogate pairs) and
do not directly represent characters."
implies a)

but again, what do I know about Unicode?


Jörg Höhle - 2006-11-20

assigned_to: haible --> hoehle

status: open-fixed --> closed-fixed

Jörg Höhle - 2006-11-20

Logged In: YES
user_id=377168
Originator: NO

Thank you for your bug report.
The bug has been fixed in the CVS tree.
You can either wait for the next release (recommended)
or check out the current CVS tree (see http://clisp.cons.org)
and build CLISP from the sources (be advised that between
releases the CVS tree is very unstable and may not even build
on your platform).


Jörg Höhle - 2006-11-20

Logged In: YES
user_id=377168
Originator: NO

Close this item, src/TODO is good enough.
The original bug is fixed.

Issue A (round-trip): moved to src/TODO
Note that once implemented, (code-char #xD800) will not yield a character, because this code point does not designate a character. The original loop above will error out.

Issue B (handling broken sequences): Markus Kuhn's text is a recommendation of his own, not a standard of any sort.
(ext:convert-string-from-bytes #(#xC5 65 66) (ext:make-encoding :charset charset:utf-8 :input-error-action #\Z)) yields "ZB". It could yield "ZAB" if somebody cared
and if all such changes don't break invariants about mbslen and mbstowcs -- PTC?

Same for
(ext:convert-string-from-bytes #(65 #xB5 66) (ext:make-encoding :charset charset:utf-8 :input-error-action #\Z))
which yields "AB" but could yield "AZB", like iconv does:
(ext:convert-string-from-bytes #(65 #xB5 66) (ext:make-encoding :charset "\\utf-8" :input-error-action #\Z))
