From: SourceForge.net <no...@so...> - 2006-11-20 13:25:53
Bugs item #1575569, was opened at 2006-10-12 00:12
Message generated for change (Comment added) made by hoehle
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=101355&aid=1575569&group_id=1355

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: libiconv
Group: lisp error
>Status: Closed
Resolution: Fixed
Priority: 5
Private: No
Submitted By: Sam Steingold (sds)
>Assigned to: Jörg Höhle (hoehle)
Summary: utf-8 conversion error

Initial Comment:
(ext:convert-string-to-bytes
 (map 'string 'code-char
      (loop for i from 0 below 44100
            collect (round (* 65536/2 (1+ (sin (* 441 (/ i 2 pi 44100))))))))
 charset:utf-8)
*** - Internal error: statement in file "encoding.d", line 2805 has been reached!!
      Please send the authors of the program a description how you produced this error!

also:

(setq l (loop for i from 0 below 44100
              collect (round (* 65536/2 (1+ (sin (* 441 (/ i 2 pi 44100))))))))
(with-open-file (o "/tmp/foo" :direction :output :external-format charset:utf-8)
  (dolist (c l) (princ c o)))
(with-open-file (o "/tmp/foo") (file-length o))
==> 204572

this is a good file:
iconv -f utf-8 -t UCS2 -o /tmp/foo.ucs2 /tmp/foo
works fine

(setq s (map 'string 'code-char l))
(with-open-file (o "/tmp/foo" :direction :output :external-format charset:utf-8)
  (write-sequence s o) nil)
(with-open-file (o "/tmp/foo") (file-length o))
==> 126244 (different size!!!)

this is a bad file:
iconv -f utf-8 -t UCS2 -o /tmp/foo.ucs2 /tmp/foo
iconv: illegal input sequence at position 1431

----------------------------------------------------------------------

>Comment By: Jörg Höhle (hoehle)
Date: 2006-11-20 14:25

Message:
Logged In: YES
user_id=377168
Originator: NO

Close this item, src/TODO is good enough.
The original bug is fixed.

Issue A (round-trip): moved to src/TODO
Note that once implemented, (code-char #xD800) will not yield a character,
because this code point does not designate a character. The original loop
above will error out.

Issue B (handling broken sequences): Markus Kuhn's text is a recommendation
of his own, not a standard of any sort.
(ext:convert-string-from-bytes #(#xC5 65 66)
  (ext:make-encoding :charset charset:utf-8 :input-error-action #\Z))
yields "ZB". It could yield "ZAB" if somebody cared and if all such changes
don't break invariants about mbslen and mbstowcs -- PTC?
Same for
(ext:convert-string-from-bytes #(65 #xB5 66)
  (ext:make-encoding :charset charset:utf-8 :input-error-action #\Z))
which yields "AB" but could yield "AZB", like iconv does:
(ext:convert-string-from-bytes #(65 #xB5 66)
  (ext:make-encoding :charset "\\utf-8" :input-error-action #\Z))

----------------------------------------------------------------------

Comment By: Jörg Höhle (hoehle)
Date: 2006-11-20 14:25

Message:
Logged In: YES
user_id=377168
Originator: NO

Thank you for your bug report.
The bug has been fixed in the CVS tree.
You can either wait for the next release (recommended)
or check out the current CVS tree (see http://clisp.cons.org)
and build CLISP from the sources (be advised that between
releases the CVS tree is very unstable and may not even build
on your platform).

----------------------------------------------------------------------

Comment By: Jörg Höhle (hoehle)
Date: 2006-11-13 16:29

Message:
Logged In: YES
user_id=377168

Ad issue A: RFC3629 http://www.faqs.org/rfcs/rfc3629.html
"The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters."
implies a)
but again, what do I know about Unicode?
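
For comparison, here is a small sketch of how one strict UTF-8 codec treats the two issues above. It uses Python's built-in codec rather than CLISP, purely as an assumption for illustration; the helper names encode_strict and decode_with_marker are hypothetical, and nothing here describes what encoding.d does or should do. It shows option a) for issue A (lone surrogates are refused in both directions) and, for issue B, decoding that resumes at the byte right after the offending one.

```python
# Illustration only: Python's built-in UTF-8 codec, not CLISP's encoding.d.

def encode_strict(code_point):
    """Encode one code point as UTF-8; a lone surrogate raises UnicodeEncodeError."""
    return chr(code_point).encode("utf-8")

def decode_with_marker(data):
    """Decode UTF-8, substituting U+FFFD for each ill-formed sequence."""
    return data.decode("utf-8", errors="replace")

# Issue A, option a): U+D800 is refused when encoding ...
try:
    encode_strict(0xD800)
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True

# ... and its would-be 3-byte form ED A0 80 is refused when decoding.
try:
    bytes([0xED, 0xA0, 0x80]).decode("utf-8")
    surrogate_bytes_rejected = False
except UnicodeDecodeError:
    surrogate_bytes_rejected = True

# Issue B: only the offending byte is replaced, the next byte is kept.
broken_lead = decode_with_marker(bytes([0xC5, 65, 66]))  # lead byte, no continuation
stray_cont = decode_with_marker(bytes([65, 0xB5, 66]))   # stray continuation byte
```

With this codec the marker is always U+FFFD rather than a caller-chosen character such as #\Z, so the two results correspond to the "ZAB" and "AZB" shapes mentioned above.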

----------------------------------------------------------------------

Comment By: Jörg Höhle (hoehle)
Date: 2006-11-13 15:52

Message:
Logged In: YES
user_id=377168

The original bug (internal error) is gone with today's patch.

However, 2 UTF-8 issues are unclear to me.
See http://unicode.org/unicode/faq/utf_bom.html

Issue A:
"Each UTF is reversible, thus every UTF supports lossless round
tripping: mapping from any Unicode coded character sequence S to a
sequence of bytes and back will produce S again. To ensure round
tripping, a UTF mapping must also map all code points that are not
valid Unicode characters to unique byte sequences. These invalid code
points are the 66 noncharacters (including FFFE and FFFF), as well as
unpaired surrogates."

CLISP does not support round-trip.
#\UD800 (code-char #xd800) is converted to a 3 byte sequence.
Reading this back yields an error.

What would be consistent is either
a) disallow #\uD800 (and continue to refuse to convert from bytes), or
b) continue to allow these codes, and accept converting them from bytes.
It seems the above text mandates b).

The same FAQ also says, about unpaired UTF-16 surrogates:
"Unicode conformance requires that encoding form conversion always
results in valid data stream. Therefore a converter must treat this as
an error."
This can be read in favour of a). But is the context (UTF-16)
appropriate here?

Issue B:
Another non-conformance: substituting broken byte sequences
"A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be
generated. When faced with this illegal byte sequence while
transforming or interpreting, a UTF-8 conformant process must treat
the first byte 110xxxxx₂ as an illegal termination error: for example,
either signaling an error, filtering the byte out, or representing the
byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter
two cases, it will continue processing at the second byte 0xxxxxxx₂."

(ext:convert-string-from-bytes #(#xc1 #x61 #x62)
  (ext:make-encoding :charset charset:utf-8 :input-error-action #\5))
"5b"
is wrong, it should generate "5ab".

Testing other wrong sequences using
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
reveals other similar errors in CLISP's UTF-8 handling.

In Ubuntu Dapper, Firefox-1.5 appears to have excellent UTF-8 support,
while Gnome-terminal (2.14.2) has several substitutions wrong according
to that test file.

----------------------------------------------------------------------

Comment By: Jörg Höhle (hoehle)
Date: 2006-11-10 14:35

Message:
Logged In: YES
user_id=377168

Confirmed.

Note that man utf-8 says:
The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as
0xfffe and 0xffff (UCS non-characters) should not appear in conforming
UTF-8 streams.

I.e. your loop does not produce something that can be transformed to
UTF-8. iconv complains about exactly that.
Of course, that's no reason for an internal error in CLISP.

The offending code at file-position 1431 is
55323 #xD81B, #o154033, #b1101 100000 011011
generating the three-byte sequence (#xED #xA0 #x9B), i.e. #(237 160 155)

BTW, you forgot code-char in
(dolist (c l) (princ c o))
You generated a pure ASCII file in your 2nd example.

Note also that it reaches code 65536, which may not be intended?

Maybe CLISP should error out when trying to write/convert
(code-char (between #xd800 #xdfff)) to a stream with :external-format
UTF-8

(convert-string-to-bytes (string (code-char #x10400)) charset:utf-8)
*** - Internal error: statement in file "encoding.d", line 2805 has been reached!!
      Please send the authors of the program a description how you produced this error!
The following restarts are available:
ABORT          :R1      ABORT

Note that I'm not familiar at all with unicode and UTF-8, multilingual
planes, reserved code points etc.
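
The bit arithmetic behind the three-byte sequence reported above for code 55323 can be checked with a short sketch. It is written in Python rather than Lisp as an assumption for illustration, and the names utf8_bytes_unchecked and is_surrogate are hypothetical. The function applies the plain UTF-8 bit layout without any validity check, which is how a surrogate ends up as bytes that iconv then rejects; it is not a reconstruction of CLISP's code.

```python
# Illustration only: the UTF-8 bit layout applied blindly, with no validity check.

def utf8_bytes_unchecked(cp):
    """Return the UTF-8 byte values for code point cp, surrogates included."""
    if cp < 0x80:
        return [cp]
    if cp < 0x800:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]

def is_surrogate(cp):
    """True for the range U+D800..U+DFFF reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF

# 55323 = #xD81B = 1101 100000 011011 in binary, split 4 + 6 + 6 bits.
offending = utf8_bytes_unchecked(55323)

# A code point beyond the BMP, as in the #x10400 example, takes four bytes.
supplementary = utf8_bytes_unchecked(0x10400)
```

A converter that wants to refuse such input would test is_surrogate before applying the layout, which matches the suggestion above that writing these codes to a UTF-8 stream should signal an error.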

----------------------------------------------------------------------