[ clisp-Bugs-1575569 ] utf-8 conversion error


Bugs item #1575569, was opened at 2006-10-12 00:12
Message generated for change (Comment added) made by hoehle
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=101355&aid=1575569&group_id=1355

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: libiconv
Group: lisp error
>Status: Closed
Resolution: Fixed
Priority: 5
Private: No
Submitted By: Sam Steingold (sds)
>Assigned to: Jörg Höhle (hoehle)
Summary: utf-8 conversion error

Initial Comment:
(ext:convert-string-to-bytes
  (map 'string 'code-char
      (loop for i from 0 below 44100
        collect (round (* 65536/2 
          (1+ (sin (* 441 (/ i  2 pi 44100)))))))) 
  charset:utf-8)

*** - Internal error: statement in file "encoding.d",
line 2805 has been reached!!
      Please send the authors of the program a
description how you produced this error!

also:
(setq l (loop for i from 0 below 44100
          collect (round (* 65536/2 
          (1+ (sin (* 441 (/ i  2 pi 44100))))))))
(with-open-file (o "/tmp/foo" :direction :output
:external-format charset:utf-8) 
(dolist (c l) (princ c o)))
(with-open-file (o "/tmp/foo") (file-length o))
==> 204572
this is a good file:
iconv -f utf-8 -t UCS2 -o /tmp/foo.ucs2 /tmp/foo
works fine

(setq s (map 'string 'code-char l))
(with-open-file (o "/tmp/foo" :direction :output
:external-format charset:utf-8) 
(write-sequence s o) nil)
(with-open-file (o "/tmp/foo") (file-length o))
==> 126244 (different size!!!)
this is a bad file:
iconv -f utf-8 -t UCS2 -o /tmp/foo.ucs2 /tmp/foo
iconv: illegal input sequence at position 1431

----------------------------------------------------------------------

>Comment By: Jörg Höhle (hoehle)
Date: 2006-11-20 14:25

Message:
Logged In: YES 
user_id=377168
Originator: NO

Close this item, src/TODO is good enough.
The original bug is fixed.

Issue A (round-trip): moved to src/TODO
Note that once implemented, (code-char #xD800) will not yield a character,
because this code point does not designate a character. The original loop
above will error out.

Issue B (handling broken sequences): Markus Kuhn's text is a
recommendation of his own, not a standard of any sort.
(ext:convert-string-from-bytes #(#xC5 65 66) (ext:make-encoding :charset
charset:utf-8 :input-error-action #\Z)) yields "ZB". It could yield "ZAB"
if somebody cared, and if such a change does not break invariants about
mbslen and mbstowcs -- PTC?

Same for
(ext:convert-string-from-bytes #(65 #xB5 66) (ext:make-encoding :charset
charset:utf-8 :input-error-action #\Z))
which yields "AB" but could yield "AZB", like iconv does:
(ext:convert-string-from-bytes #(65 #xB5 66) (ext:make-encoding :charset
"\\utf-8" :input-error-action #\Z))

----------------------------------------------------------------------

Comment By: Jörg Höhle (hoehle)
Date: 2006-11-20 14:25

Message:
Logged In: YES 
user_id=377168
Originator: NO

thank you for your bug report.
the bug has been fixed in the CVS tree.
you can either wait for the next release (recommended)
or check out the current CVS tree (see http://clisp.cons.org)
and build CLISP from the sources (be advised that between
releases the CVS tree is very unstable and may not even build
on your platform).

----------------------------------------------------------------------

Comment By: Jörg Höhle (hoehle)
Date: 2006-11-13 16:29

Message:
Logged In: YES 
user_id=377168

Ad issue A:
RFC3629 http://www.faqs.org/rfcs/rfc3629.html
"The definition of UTF-8 prohibits encoding character
numbers between U+D800 and U+DFFF, which are reserved for
use with the UTF-16 encoding form (as surrogate pairs) and
do not directly represent characters."
implies a)

but again, what do I know about Unicode?
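
A minimal sketch of what reading a) would mean for an encoder, in portable
Common Lisp. UTF8-ENCODE-STRICT is a hypothetical helper written for
illustration, not part of CLISP; it works on integer code points so that it
does not depend on what CODE-CHAR accepts.

```lisp
;; Hypothetical helper, not a CLISP function: encode one code point as a
;; list of UTF-8 bytes, refusing what RFC 3629 prohibits.
(defun utf8-encode-strict (code)
  ;; CONT builds a continuation byte 10xxxxxx from the 6 bits at POS.
  (flet ((cont (pos) (logior #x80 (ldb (byte 6 pos) code))))
    (cond ((or (< code 0) (> code #x10FFFF))
           (error "#x~X is outside the Unicode code space" code))
          ((<= #xD800 code #xDFFF)
           (error "#x~X is a surrogate; UTF-8 must not encode it" code))
          ((< code #x80) (list code))
          ((< code #x800)
           (list (logior #xC0 (ash code -6)) (cont 0)))
          ((< code #x10000)
           (list (logior #xE0 (ash code -12)) (cont 6) (cont 0)))
          (t
           (list (logior #xF0 (ash code -18)) (cont 12) (cont 6) (cont 0))))))

(utf8-encode-strict #x10400)   ; => (240 144 144 128)
(utf8-encode-strict #xD800)    ; signals an error
```

Under reading b) the surrogate clause would be dropped and the decoder
would have to accept the resulting 3-byte sequences.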

----------------------------------------------------------------------

Comment By: Jörg Höhle (hoehle)
Date: 2006-11-13 15:52

Message:
Logged In: YES 
user_id=377168

The original bug (internal error) is gone with today's patch.

However, 2 UTF-8 issues are unclear to me. See
http://unicode.org/unicode/faq/utf_bom.html

Issue A:
"Each UTF is reversible, thus every UTF supports lossless
round tripping: mapping from any Unicode coded character
sequence S to a sequence of bytes and back will produce S
again. To ensure round tripping, a UTF mapping  must also
map all code points that are not valid Unicode characters to
unique byte sequences. These invalid code points are the 66
noncharacters (including FFFE and FFFF), as well as unpaired
surrogates."

CLISP does not support round-tripping. #\UD800, i.e. (code-char #xd800),
is converted to a 3-byte sequence. Reading this back yields an error.

What would be consistent is either
a) disallow #\uD800 (and continue to refuse to convert from bytes), or
b) continue to allow these codes, and accept them when converting from
bytes.
It seems the above text mandates b).

The same FAQ also says, about unpaired UTF-16 surrogates:
"Unicode conformance requires that encoding form conversion
always results in valid data stream. Therefore a converter
must treat this as an error."
This can be read in favour of a). But is the context
(utf-16) appropriate here?

Issue B: Another non-conformance: substituting broken byte
sequences
"A sequence such as <110xxxxx2 0xxxxxxx2> is illegal, and
must never be generated. When faced with this illegal byte
sequence while transforming or interpreting, a UTF-8
conformant process must treat the first byte 110xxxxx2 as an
illegal termination error: for example, either signaling an
error, filtering the byte out, or representing the byte with
a marker such as FFFD (REPLACEMENT CHARACTER). In the latter
two cases, it will continue processing at the second byte
0xxxxxxx2."

(ext:convert-string-from-bytes #(#xc1 #x61 #x62)
(ext:make-encoding :charset charset:utf-8
:input-error-action #\5))
"5b"
is wrong, it should generate "5ab".
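
The substitution rule quoted above can be sketched in portable Common Lisp
as follows. DECODE-UTF8-RESYNC is a hypothetical name, not a CLISP function,
and this is a simplified illustration rather than CLISP's decoder: it only
checks lead and continuation bit patterns, and does not reject overlong
forms or surrogates.

```lisp
;; Hypothetical sketch: decode a vector of bytes as UTF-8. A byte that
;; cannot start or complete a sequence is replaced by MARKER, and decoding
;; resumes at the very next byte, so nothing after it is swallowed.
(defun decode-utf8-resync (bytes marker)
  (let ((out (make-string-output-stream))
        (n (length bytes))
        (i 0))
    (loop while (< i n) do
      (let* ((b (aref bytes i))
             ;; Number of continuation bytes announced by the lead byte,
             ;; or NIL for a stray continuation byte or an invalid lead.
             (extra (cond ((< b #x80) 0)
                          ((<= #xC0 b #xDF) 1)
                          ((<= #xE0 b #xEF) 2)
                          ((<= #xF0 b #xF7) 3)
                          (t nil))))
        (if (and extra
                 (<= (+ i extra) (1- n))
                 (loop for k from 1 to extra
                       always (= #x80 (logand #xC0 (aref bytes (+ i k))))))
            ;; Well-formed sequence: assemble the code point.
            (let ((code (if (zerop extra) b (ldb (byte (- 6 extra) 0) b))))
              (loop for k from 1 to extra
                    do (setf code (logior (ash code 6)
                                          (ldb (byte 6 0) (aref bytes (+ i k))))))
              (write-char (or (and (< code char-code-limit) (code-char code))
                              marker)
                          out)
              (incf i (1+ extra)))
            ;; Broken sequence: substitute this one byte only.
            (progn (write-char marker out)
                   (incf i)))))
    (get-output-stream-string out)))

(decode-utf8-resync #(#xc1 #x61 #x62) #\5)   ; => "5ab"
```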

Testing other wrong sequences using
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
reveals other similar errors in CLISP's UTF-8 handling.

In Ubuntu Dapper, Firefox-1.5 appears to have excellent
UTF-8 support, while Gnome-terminal (2.14.2) has several
substitutions wrong according to that testfile.

----------------------------------------------------------------------

Comment By: Jörg Höhle (hoehle)
Date: 2006-11-10 14:35

Message:
Logged In: YES 
user_id=377168

Confirmed.

Note that man utf-8 says:
       The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as
       0xfffe and 0xffff (UCS non-characters) should not appear in
       conforming UTF-8 streams.
I.e. your loop does not produce something that can be
transformed to UTF-8.
iconv complains about exactly that.

Of course, that's no reason for an internal error in CLISP.

The offending code at file-position 1431 is
55323 #xD81B, #o154033, #b1101 100000 011011
generating the three-byte sequence (#xED #xA0 #x9B), i.e. #(237 160 155).
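
These figures can be re-derived with a short sketch in portable Common Lisp.
SAMPLE is the expression from the report; SURROGATE-CODE-P, UTF8-LENGTH and
FIRST-SURROGATE-INDEX are hypothetical helpers introduced here. The byte
layout used is the generic three-byte UTF-8 form 1110xxxx 10xxxxxx 10xxxxxx,
applied without any validity check.

```lisp
;; The i-th value of the loop in the report (primary value of ROUND only).
(defun sample (i)
  (values (round (* 65536/2 (1+ (sin (* 441 (/ i 2 pi 44100))))))))

(defun surrogate-code-p (code)
  (<= #xD800 code #xDFFF))

;; Number of bytes a code point takes in UTF-8.
(defun utf8-length (code)
  (cond ((< code #x80) 1) ((< code #x800) 2) ((< code #x10000) 3) (t 4)))

(defun first-surrogate-index ()
  (loop for i from 0 below 44100
        when (surrogate-code-p (sample i)) return i))

(let* ((index (first-surrogate-index))
       (code (sample index))
       ;; Bytes written before the offending character.
       (offset (loop for j from 0 below index sum (utf8-length (sample j)))))
  (format t "index ~D, code ~D = #x~X, byte offset ~D~%" index code code offset)
  (format t "bytes: ~{#x~X~^ ~}~%"
          (list (logior #xE0 (ldb (byte 4 12) code))
                (logior #x80 (ldb (byte 6 6) code))
                (logior #x80 (ldb (byte 6 0) code)))))
;; prints:
;; index 477, code 55323 = #xD81B, byte offset 1431
;; bytes: #xED #xA0 #x9B
```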

BTW, you forgot code-char in (dolist (c l) (princ c o)))
You generated a pure ASCII file in your 2nd example.
Note also that it reaches code 65536, which may not be intended.

Maybe CLISP should error out when trying to write/convert
(code-char (between #xd800 #xdfff)) to a stream with
:external-format UTF-8

(convert-string-to-bytes (string (code-char #x10400))
charset:utf-8)
*** - Internal error: statement in file "encoding.d", line
2805 has been reached!!
      Please send the authors of the program a description
how you produced this error!
The following restarts are available:
ABORT          :R1      ABORT

Note that I'm not familiar at all with unicode and UTF-8,
multilingual planes, reserved code points etc.

----------------------------------------------------------------------
