From: SourceForge.net <no...@so...> - 2006-11-13 14:52:09
Bugs item #1575569, was opened at 2006-10-12 00:12
Message generated for change (Comment added) made by hoehle
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=101355&aid=1575569&group_id=1355

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: libiconv
Group: lisp error
Status: Open
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Sam Steingold (sds)
Assigned to: Bruno Haible (haible)
Summary: utf-8 conversion error

Initial Comment:
(ext:convert-string-to-bytes
 (map 'string 'code-char
      (loop for i from 0 below 44100
        collect (round (* 65536/2 (1+ (sin (* 441 (/ i 2 pi 44100))))))))
 charset:utf-8)
*** - Internal error: statement in file "encoding.d", line 2805 has been reached!!
      Please send the authors of the program a description how you produced this error!

also:

(setq l (loop for i from 0 below 44100
          collect (round (* 65536/2 (1+ (sin (* 441 (/ i 2 pi 44100))))))))
(with-open-file (o "/tmp/foo" :direction :output :external-format charset:utf-8)
  (dolist (c l) (princ c o)))
(with-open-file (o "/tmp/foo") (file-length o))
==> 204572

this is a good file:
iconv -f utf-8 -t UCS2 -o /tmp/foo.ucs2 /tmp/foo
works fine

(setq s (map 'string 'code-char l))
(with-open-file (o "/tmp/foo" :direction :output :external-format charset:utf-8)
  (write-sequence s o) nil)
(with-open-file (o "/tmp/foo") (file-length o))
==> 126244 (different size!!!)

this is a bad file:
iconv -f utf-8 -t UCS2 -o /tmp/foo.ucs2 /tmp/foo
iconv: illegal input sequence at position 1431

----------------------------------------------------------------------

>Comment By: Jörg Höhle (hoehle)
Date: 2006-11-13 15:52

Message:
Logged In: YES
user_id=377168

The original bug (internal error) is gone with today's patch.

However, 2 UTF-8 issues are unclear to me.
See http://unicode.org/unicode/faq/utf_bom.html

Issue A:
"Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must also map all code points that are not valid Unicode characters to unique byte sequences. These invalid code points are the 66 noncharacters (including FFFE and FFFF), as well as unpaired surrogates."

CLISP does not support round-tripping. #\UD800 (code-char #xd800) is converted to a 3-byte sequence. Reading this back yields an error.

What would be consistent is either
a) disallow #\uD800 (and continue to refuse to convert from bytes), or
b) continue to allow these codes, and accept converting them back from bytes.
It seems the above text mandates b).

The same FAQ also says, about unpaired UTF-16 surrogates:
"Unicode conformance requires that encoding form conversion always results in valid data stream. Therefore a converter must treat this as an error."
This can be read in favour of a). But is the context (UTF-16) appropriate here?

Issue B: Another non-conformance: substituting broken byte sequences
"A sequence such as <110xxxxx2 0xxxxxxx2> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx2 as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx2."

(ext:convert-string-from-bytes #(#xc1 #x61 #x62)
  (ext:make-encoding :charset charset:utf-8 :input-error-action #\5))
"5b" is wrong, it should generate "5ab".

Testing other wrong sequences using
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
reveals other similar errors in CLISP's UTF-8 handling.
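
The substitution rule quoted under Issue B can be sketched in portable Common Lisp. This is only an illustration under stated assumptions, not CLISP's decoder: the names decode-sequence-at and decode-utf-8-lenient are hypothetical, the policy of emitting one marker per offending byte is just one of the options the FAQ lists, and code-char is assumed to accept every Unicode scalar value (true on Unicode-capable implementations).

```lisp
;; Hypothetical helpers for illustration only; they are not part of CLISP.

(defun decode-sequence-at (bytes i)
  "Return the code point of the well-formed UTF-8 sequence starting at
index I of BYTES and its length in bytes, or NIL if there is none."
  (let* ((lead (aref bytes i))
         ;; Number of continuation bytes the lead byte calls for.
         ;; COND yields NIL for bytes that can never start a sequence
         ;; (#x80-#xC1 and #xF5-#xFF).
         (extra (cond ((< lead #x80) 0)
                      ((<= #xC2 lead #xDF) 1)
                      ((<= #xE0 lead #xEF) 2)
                      ((<= #xF0 lead #xF4) 3))))
    (when (and extra (< (+ i extra) (length bytes)))
      (let ((code (case extra
                    (0 lead)
                    (1 (logand lead #x1F))
                    (2 (logand lead #x0F))
                    (t (logand lead #x07)))))
        (dotimes (k extra)
          (let ((b (aref bytes (+ i k 1))))
            ;; Every continuation byte must look like 10xxxxxx.
            (unless (= (logand b #xC0) #x80)
              (return-from decode-sequence-at nil))
            (setf code (logior (ash code 6) (logand b #x3F)))))
        ;; Reject overlong forms, surrogates and values past #x10FFFF.
        (when (and (>= code (aref #(0 #x80 #x800 #x10000) extra))
                   (not (<= #xD800 code #xDFFF))
                   (<= code #x10FFFF))
          (values code (1+ extra)))))))

(defun decode-utf-8-lenient (bytes &optional (marker #\?))
  "Decode the octet vector BYTES as UTF-8. A byte that does not start a
well-formed sequence is replaced by MARKER, and decoding resumes at the
very next byte."
  (let ((i 0) (chars '()))
    (loop while (< i (length bytes))
          do (multiple-value-bind (code len) (decode-sequence-at bytes i)
               (cond (code (push (code-char code) chars)
                           (incf i len))
                     (t (push marker chars)
                        (incf i)))))
    (coerce (nreverse chars) 'string)))
```

With this sketch, (decode-utf-8-lenient #(#xc1 #x61 #x62) #\5) returns "5ab": only the illegal byte #xC1 is replaced, and the following bytes #x61 and #x62 are still decoded as "a" and "b".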
In Ubuntu Dapper, Firefox-1.5 appears to have excellent UTF-8 support, while Gnome-terminal (2.14.2) gets several substitutions wrong according to that test file.

----------------------------------------------------------------------

Comment By: Jörg Höhle (hoehle)
Date: 2006-11-10 14:35

Message:
Logged In: YES
user_id=377168

Confirmed.

Note that man utf-8 says:
  The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as 0xfffe and 0xffff (UCS non-characters) should not appear in conforming UTF-8 streams.
I.e. your loop does not produce something that can be transformed to UTF-8. iconv complains about exactly that.

Of course, that's no reason for an internal error in CLISP.

The offending code at file-position 1431 is 55323
#xD81B, #o154033, #b1101 100000 011011
generating the three-byte sequence
(#xED #xA0 #x9B), i.e. #(237 160 155)

BTW, you forgot code-char in
  (dolist (c l) (princ c o)))
You generated a pure ASCII file in your 2nd example.

Note also that it reaches code 65536, which may not be intended?

Maybe CLISP should error out when trying to write/convert (code-char (between #xd800 #xdfff)) to a stream with :external-format UTF-8

(convert-string-to-bytes (string (code-char #x10400)) charset:utf-8)
*** - Internal error: statement in file "encoding.d", line 2805 has been reached!!
      Please send the authors of the program a description how you produced this error!
The following restarts are available:
ABORT          :R1      ABORT

Note that I'm not familiar at all with Unicode and UTF-8, multilingual planes, reserved code points etc.

----------------------------------------------------------------------
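
The byte arithmetic in the 2006-11-10 comment can be checked with a small portable encoder. This is a sketch under stated assumptions and says nothing about how encoding.d works: the function name encode-code-point-utf-8 and its :allow-surrogates keyword are hypothetical, and CODE is assumed to be a non-negative integer. By default it signals an error for #xD800-#xDFFF, which corresponds to option a) of Issue A; passing :allow-surrogates t corresponds to the encoding half of option b).

```lisp
;; Hypothetical helper for illustration only; it is not part of CLISP.
(defun encode-code-point-utf-8 (code &key allow-surrogates)
  "Return the UTF-8 octets for the non-negative integer CODE as a list.
Surrogates signal an error unless ALLOW-SURROGATES is true."
  (flet ((tail (shift)
           ;; Continuation byte 10xxxxxx built from the 6 bits at SHIFT.
           (logior #x80 (logand (ash code (- shift)) #x3F))))
    (cond ((and (<= #xD800 code #xDFFF) (not allow-surrogates))
           (error "Code point #x~X is a surrogate." code))
          ((< code #x80)
           (list code))
          ((< code #x800)
           (list (logior #xC0 (ash code -6)) (tail 0)))
          ((< code #x10000)
           (list (logior #xE0 (ash code -12)) (tail 6) (tail 0)))
          ((< code #x110000)
           (list (logior #xF0 (ash code -18)) (tail 12) (tail 6) (tail 0)))
          (t
           (error "#x~X is beyond the Unicode range." code)))))
```

For example, (encode-code-point-utf-8 #xD81B :allow-surrogates t) returns (237 160 155), the same three bytes #xED #xA0 #x9B reported for file position 1431, and (encode-code-point-utf-8 #x10400) returns (240 144 144 128), the four-byte form expected for a code point outside the Basic Multilingual Plane.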