#370 utf-8 conversion error

Milestone: lisp error
Status: closed-fixed
Labels: libiconv (7)
Priority: 5
Updated: 2006-11-20
Created: 2006-10-11
Private: No

(ext:convert-string-to-bytes
 (map 'string 'code-char
      (loop for i from 0 below 44100
            collect (round (* 65536/2
                              (1+ (sin (* 441 (/ i 2 pi 44100))))))))
 charset:utf-8)

*** - Internal error: statement in file "encoding.d",
line 2805 has been reached!!
Please send the authors of the program a
description how you produced this error!

also:
(setq l (loop for i from 0 below 44100
              collect (round (* 65536/2
                                (1+ (sin (* 441 (/ i 2 pi 44100))))))))
(with-open-file (o "/tmp/foo" :direction :output
                   :external-format charset:utf-8)
  (dolist (c l) (princ c o)))
(with-open-file (o "/tmp/foo") (file-length o))
==> 204572
this is a good file:
iconv -f utf-8 -t UCS2 -o /tmp/foo.ucs2 /tmp/foo
works fine

(setq s (map 'string 'code-char l))
(with-open-file (o "/tmp/foo" :direction :output
                   :external-format charset:utf-8)
  (write-sequence s o) nil)
(with-open-file (o "/tmp/foo") (file-length o))
==> 126244 (different size!!!)
this is a bad file:
iconv -f utf-8 -t UCS2 -o /tmp/foo.ucs2 /tmp/foo
iconv: illegal input sequence at position 1431

Discussion

  • Jörg Höhle

    Jörg Höhle - 2006-11-10

    Logged In: YES
    user_id=377168

    Confirmed.

    Note that man utf-8 says:
    The UCS code values 0xd800–0xdfff (UTF-16
    surrogates) as well as 0xfffe and 0xffff
    (UCS non-characters) should not appear in conforming
    UTF-8 streams.
    I.e. your loop does not produce something that can be
    transformed to UTF-8.
    iconv complains about exactly that.
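
As an independent cross-check, the sample sequence from the report can be regenerated outside CLISP. The sketch below is Python rather than Lisp; the names `sample_codes` and `first_surrogate` are made up for illustration, and it assumes double-precision floats agree with CLISP's `pi` closely enough that the rounding comes out the same.

```python
import math

def sample_codes(n=44100):
    """Regenerate the code points from the report's LOOP form."""
    # (round (* 65536/2 (1+ (sin (* 441 (/ i 2 pi 44100))))))
    return [round(32768 * (1 + math.sin(441 * (i / 2 / math.pi / 44100))))
            for i in range(n)]

def is_surrogate(code):
    """True for code points in the UTF-16 surrogate range."""
    return 0xD800 <= code <= 0xDFFF

def first_surrogate(codes):
    """Return (index, code) of the first surrogate, or None."""
    for index, code in enumerate(codes):
        if is_surrogate(code):
            return index, code
    return None

codes = sample_codes()
index, code = first_surrogate(codes)
# Every earlier code lies between #x800 and #xD7FF,
# so each one takes three bytes in UTF-8.
byte_offset = 3 * index
print(index, hex(code), byte_offset)
```

Under those assumptions the first surrogate is element 477 of the list, and three bytes per preceding character puts it at byte offset 1431, the position iconv reports.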

    Of course, that's no reason for an internal error in CLISP.

    The offending code at file-position 1431 is
    55323 (#xD81B, #o154033, #b1101 100000 011011),
    generating the three-byte sequence #(#xED #xA0 #x9B),
    i.e. #(237 160 155).
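
The byte values quoted here follow from applying the three-byte UTF-8 layout mechanically to the bit groups of the code. A minimal Python sketch (the helper `utf8_three_bytes` is hypothetical and deliberately omits the surrogate check a conforming encoder would need):

```python
def utf8_three_bytes(code):
    """Apply the 3-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx."""
    if not 0x800 <= code <= 0xFFFF:
        raise ValueError("code point does not fit the 3-byte form")
    return bytes([0xE0 | (code >> 12),          # top 4 bits:    1101
                  0x80 | ((code >> 6) & 0x3F),  # middle 6 bits: 100000
                  0x80 | (code & 0x3F)])        # low 6 bits:    011011

encoded = utf8_three_bytes(0xD81B)
print(" ".join(f"{b:02x}" for b in encoded))  # ed a0 9b
print(list(encoded))                          # [237, 160, 155]
```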

    BTW, you forgot code-char in (dolist (c l) (princ c o)).
    You generated a pure ASCII file in your 2nd example.
    Note also that it reaches code 65536, which may not be intended.

    Maybe CLISP should signal an error when trying to write/convert
    (code-char (between #xd800 #xdfff)) to a stream with
    :external-format UTF-8.

    (convert-string-to-bytes (string (code-char #x10400))
    charset:utf-8)
    *** - Internal error: statement in file "encoding.d", line
    2805 has been reached!!
    Please send the authors of the program a description
    how you produced this error!
    The following restarts are available:
    ABORT :R1 ABORT
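
For reference, #x10400 lies outside the Basic Multilingual Plane, so a valid UTF-8 encoding needs four bytes, and UTF-16 represents it as a surrogate pair. The comment does not say why CLISP failed on this input; the Python sketch below only shows what the expected encodings look like.

```python
ch = chr(0x10400)

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")

# Split the UTF-16 form into its two 16-bit code units.
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

print(" ".join(f"{b:02x}" for b in utf8))  # f0 90 90 80
print(hex(high), hex(low))                 # 0xd801 0xdc00
```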

    Note that I'm not familiar at all with unicode and UTF-8,
    multilingual planes, reserved code points etc.

     
  • Jörg Höhle

    Jörg Höhle - 2006-11-13

    The original bug (internal error) is gone with today's patch.

    However, 2 UTF-8 issues are unclear to me. See
    http://unicode.org/unicode/faq/utf_bom.html

    Issue A:
    "Each UTF is reversible, thus every UTF supports lossless
    round tripping: mapping from any Unicode coded character
    sequence S to a sequence of bytes and back will produce S
    again. To ensure round tripping, a UTF mapping must also
    map all code points that are not valid Unicode characters to
    unique byte sequences. These invalid code points are the 66
    noncharacters (including FFFE and FFFF), as well as unpaired
    surrogates."

    CLISP does not support round-trip. #\UD800 (code-char #xd800)
    is converted to a 3-byte sequence. Reading this back yields an
    error.

    What would be consistent is either
    a) disallow #\uD800 (and continue to refuse to convert it from
    bytes), or
    b) continue to allow these codes, and also accept them when
    converting from bytes.
    It seems the above text mandates b).

    The same FAQ also says, about unpaired UTF-16 surrogates:
    "Unicode conformance requires that encoding form conversion
    always results in valid data stream. Therefore a converter
    must treat this as an error."
    This can be read in favour of a). But is the context
    (utf-16) appropriate here?
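
The two consistent choices can be illustrated with Python's UTF-8 codec, which offers both through its error handlers. This is only an analogy: `strict` and `surrogatepass` are Python's names and say nothing about CLISP's API, and `try_encode` and `try_decode` are hypothetical helpers.

```python
lone = chr(0xD800)  # an unpaired surrogate

def try_encode(text, errors):
    """Encode to UTF-8, returning None when the codec refuses."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None

def try_decode(data, errors):
    """Decode from UTF-8, returning None when the codec refuses."""
    try:
        return data.decode("utf-8", errors)
    except UnicodeDecodeError:
        return None

# Option a): refuse the surrogate in both directions.
strict_bytes = try_encode(lone, "strict")
strict_text = try_decode(b"\xed\xa0\x80", "strict")

# Option b): allow it in both directions, so the round trip works.
lenient_bytes = try_encode(lone, "surrogatepass")
lenient_text = try_decode(lenient_bytes, "surrogatepass")
```

The inconsistent state described above corresponds to mixing the two: encoding as in b) but decoding as in a).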

    Issue B: Another non-conformance: substituting broken byte
    sequences
    "A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and
    must never be generated. When faced with this illegal byte
    sequence while transforming or interpreting, a UTF-8
    conformant process must treat the first byte 110xxxxx₂ as an
    illegal termination error: for example, either signaling an
    error, filtering the byte out, or representing the byte with
    a marker such as FFFD (REPLACEMENT CHARACTER). In the latter
    two cases, it will continue processing at the second byte
    0xxxxxxx₂."

    (ext:convert-string-from-bytes #(#xc1 #x61 #x62)
      (ext:make-encoding :charset charset:utf-8
                         :input-error-action #\5))
    "5b"
    is wrong; it should generate "5ab".
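
For comparison, Python's decoder behaves the way the FAQ text describes when asked to substitute: only the offending byte is replaced, and decoding resumes at the next byte. It substitutes U+FFFD rather than a caller-chosen character, so the marker differs from the CLISP example.

```python
data = bytes([0xC1, 0x61, 0x62])

# errors="replace" turns each undecodable byte into U+FFFD
# and then continues with the byte that follows it.
decoded = data.decode("utf-8", errors="replace")

print(ascii(decoded))  # '\ufffdab'
```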

    Testing other wrong sequences using
    http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
    reveals other similar errors in CLISP's UTF-8 handling.

    In Ubuntu Dapper, Firefox-1.5 appears to have excellent
    UTF-8 support, while Gnome-terminal (2.14.2) has several
    substitutions wrong according to that testfile.

     
  • Jörg Höhle

    Jörg Höhle - 2006-11-13
    • status: open --> open-fixed
     
  • Jörg Höhle

    Jörg Höhle - 2006-11-13

    Ad issue A:
    RFC3629 http://www.faqs.org/rfcs/rfc3629.html
    "The definition of UTF-8 prohibits encoding character
    numbers between U+D800 and U+DFFF, which are reserved for
    use with the UTF-16 encoding form (as surrogate pairs) and
    do not directly represent characters."
    implies a)

    but again, what do I know about Unicode?

     
  • Jörg Höhle

    Jörg Höhle - 2006-11-20
    • assigned_to: haible --> hoehle
    • status: open-fixed --> closed-fixed
     
  • Jörg Höhle

    Jörg Höhle - 2006-11-20

    thank you for your bug report.
    the bug has been fixed in the CVS tree.
    you can either wait for the next release (recommended)
    or check out the current CVS tree (see http://clisp.cons.org)
    and build CLISP from the sources (be advised that between
    releases the CVS tree is very unstable and may not even build
    on your platform).

     
  • Jörg Höhle

    Jörg Höhle - 2006-11-20

    Close this item, src/TODO is good enough.
    The original bug is fixed.

    Issue A (round-trip): moved to src/TODO
    Note that once implemented, (code-char #xD800) will not yield a character, because this code point does not designate a character. The original loop above will error out.

    Issue B (handling broken sequences): Markus Kuhn's text is a recommendation of his own, not a standard of any sort.
    (ext:convert-string-from-bytes #(#xC5 65 66) (ext:make-encoding :charset charset:utf-8 :input-error-action #\Z)) yields "ZB". It could yield "ZAB" if somebody cared
    and if all such changes don't break invariants about mbslen and mbstowcs -- PTC?

    Same for
    (ext:convert-string-from-bytes #(65 #xB5 66) (ext:make-encoding :charset charset:utf-8 :input-error-action #\Z))
    which yields "AB" but could yield "AZB", like iconv does:
    (ext:convert-string-from-bytes #(65 #xB5 66) (ext:make-encoding :charset "\\utf-8" :input-error-action #\Z))
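
The "ZAB" and "AZB" results can be sketched in Python with a custom decode error handler that mimics an `:input-error-action` of `#\Z`. The handler name `substitute-z` and the helper `decode_with_z` are hypothetical; this shows the substitute-and-resume behaviour, not how CLISP or iconv implement it.

```python
import codecs

def substitute_z(error):
    """Replace the undecodable span with "Z", then resume after it."""
    if isinstance(error, UnicodeDecodeError):
        return ("Z", error.end)
    raise error

codecs.register_error("substitute-z", substitute_z)

def decode_with_z(data):
    """Decode a sequence of byte values, marking bad bytes with "Z"."""
    return bytes(data).decode("utf-8", errors="substitute-z")

lead_without_continuation = decode_with_z([0xC5, 65, 66])
stray_continuation = decode_with_z([65, 0xB5, 66])

print(lead_without_continuation)  # ZAB
print(stray_continuation)         # AZB
```

In both cases only the single bad byte is consumed, so the valid ASCII bytes around it survive.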

     
