#108 UTF-16 min/max_bytes_per_char must be 2

closed-invalid
libiconv (7)
5
2002-05-06
2002-04-30
No

Hi,

impnotes says: "UTF-16, the 16-bit UNICODE character
set. Every character is represented as two bytes."

However, TheEncoding->min/max_bytes_per_char is
currently 1 and 8, respectively.

# Please export fehler_encoding(enc) from encoding.d into lispbibl.d
# for the sake of module writers

LISPFUNN(enc_minmax,1)
{
  var object encoding = popSTACK();
  if (!encodingp(encoding))
    fehler(error,GETTEXT("not an encoding."));
  value1 = fixnum(TheEncoding(encoding)->min_bytes_per_char);
  value2 = fixnum(TheEncoding(encoding)->max_bytes_per_char);
  mv_count = 2;
}
FFI[12]> (ext::encoding-minmax charset:utf-16)
1 ; 8

Others may be broken as well, e.g.
(CHARSET:CP1133 1 8), which is said to be "an extension
of the ASCII character set, suitable for Laotian"
(*if* I can assume that "extension of ASCII" means "not
multibyte"), or (CHARSET:KOI8-RU 1 8), while KOI8-U and
KOI8-R are reported as 1;1.

Regards,
Jörg Höhle.

Here are the values currently not reported as 1;1.
UNICODE-16/32 and UTF-8 are correct; most others I
don't know (impnotes mentions some as multibyte, some
as ASCII extensions):
((CHARSET:BIG5HKSCS 1 8) (CHARSET:CP936 1 8)
(CHARSET:JAVA 1 6)
(CHARSET:SHIFT-JIS 1 8) (CHARSET:CP932 1 8)
(CHARSET:ISO-2022-CN-EXT 1 8)
(CHARSET:UTF-7 1 8) (CHARSET:CP950 1 8)
(CHARSET:GEORGIAN-ACADEMY 1 8)
(CHARSET:GEORGIAN-PS 1 8) (CHARSET:HZ 1 8)
(CHARSET:EUC-CN 1 8)
(CHARSET:UTF-8 1 3) (CHARSET:UNICODE-16 2 2)
(CHARSET:VISCII 1 8)
(CHARSET:JOHAB 1 8) (CHARSET:KOI8-RU 1 8)
(CHARSET:UNICODE-32 4 4)
(CHARSET:GBK 1 8) (CHARSET:GB18030 1 8)
(CHARSET:ARMSCII-8 1 8)
(CHARSET:ISO-2022-JP-2 1 8) (CHARSET:TCVN 1 8)
(CHARSET:ISO-2022-JP-1 1 8)
(CHARSET:MULELAO-1 1 8) (CHARSET:EUC-KR 1 8)
(CHARSET:BIG5 1 8)
(CHARSET:EUC-JP 1 8) (CHARSET:EUC-TW 1 8)
(CHARSET:CP949 1 8)
(CHARSET:UCS-4 4 4) (CHARSET:CP1133 1 8)
(CHARSET:UCS-2 2 2)
(CHARSET:TIS-620 1 8) (CHARSET:UTF-16 1 8)
(CHARSET:ISO-2022-CN 1 8)
(CHARSET:UNICODE-32-BIG-ENDIAN 4 4)
(CHARSET:UNICODE-16-BIG-ENDIAN 2 2)
(CHARSET:ISO-2022-JP 1 8) (CHARSET:ISO-2022-KR 1 8)
(CHARSET:UNICODE-32-LITTLE-ENDIAN 4 4)
(CHARSET:UNICODE-16-LITTLE-ENDIAN 2 2))

Discussion

  • Sam Steingold

    Sam Steingold - 2002-04-30

    Are you sure you need these numbers to be correct to get the
    correct translations?

     
  • Bruno Haible

    Bruno Haible - 2002-05-06
    • status: open --> closed-invalid
     
  • Bruno Haible

    Bruno Haible - 2002-05-06

    > impnotes says: "UTF-16, the 16-bit UNICODE character
    > set. Every character is represented as two bytes."
    >
    > However, TheEncoding->min/max_bytes_per_char is
    > currently 1, resp.8.

    This is because it's an encoding based on iconv, and it
    isn't worth building extra knowledge about specific
    encodings into clisp.

    > LISPFUNN(enc_minmax,1)
    > {
    > var object encoding = popSTACK();
    > if (!encodingp(encoding)) fehler(error,GETTEXT("not
    > an encoding."));
    > value1 =
    > fixnum(TheEncoding(encoding)->min_bytes_per_char);
    > value2 =
    > fixnum(TheEncoding(encoding)->max_bytes_per_char);
    > mv_count = 2;
    > }
    > FFI[12]> (ext::encoding-minmax charset:utf-16)
    > 1 ; 8

    If you want this function to be as accurate as possible,
    then you need to add a table containing extra knowledge
    about specific encodings.

    > Here are the values currently not reported as 1;1.
    > UNICODE-16/32 and UTF-8 are correct, most others I
    > don't know, (impnotes mentions some as multibyte, some
    > as ASCII extension):
    > (CHARSET:JAVA 1 6)

    This one is correct.

    > (CHARSET:UNICODE-16 2 2) (CHARSET:UCS-4 4 4)

    This one too.
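
As a rough illustration of the "table containing extra knowledge" suggested in this thread, here is a minimal sketch in plain C. It is not CLISP code: the names `charset_width`, `known_widths`, and `charset_minmax` are hypothetical, and the function falls back to the conservative 1/8 bounds that the report shows for iconv-based encodings. The JAVA, UCS-2, and UCS-4 values are the ones confirmed as correct in the thread. The UTF-16 and KOI8-RU entries are assumptions: UTF-16 is listed as 2 to 4 bytes to allow for surrogate pairs (it would be 2/2 if only 16-bit characters are representable, and a byte-order mark is not counted), and KOI8-RU is treated as a single-byte charset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical override entry; nothing like this exists in CLISP itself. */
struct charset_width {
  const char *name;     /* charset name as printed in the report, e.g. "UTF-16" */
  unsigned min_bytes;   /* smallest number of bytes one character can take */
  unsigned max_bytes;   /* largest number of bytes one character can take */
};

static const struct charset_width known_widths[] = {
  { "UCS-2",   2, 2 },  /* confirmed correct in the thread */
  { "UCS-4",   4, 4 },  /* confirmed correct in the thread */
  { "JAVA",    1, 6 },  /* confirmed correct in the thread */
  { "UTF-16",  2, 4 },  /* assumption: surrogate pairs allowed, BOM ignored */
  { "KOI8-RU", 1, 1 },  /* assumption: single-byte charset */
};

/* Conservative bounds the report shows for iconv-based encodings. */
enum { ICONV_DEFAULT_MIN = 1, ICONV_DEFAULT_MAX = 8 };

/* Store the byte bounds for NAME, using the table when it has an entry
   and the conservative iconv defaults otherwise. */
void charset_minmax(const char *name, unsigned *min_bytes, unsigned *max_bytes)
{
  size_t i;
  assert(name != NULL && min_bytes != NULL && max_bytes != NULL);
  for (i = 0; i < sizeof known_widths / sizeof known_widths[0]; i++) {
    if (strcmp(known_widths[i].name, name) == 0) {
      *min_bytes = known_widths[i].min_bytes;
      *max_bytes = known_widths[i].max_bytes;
      return;
    }
  }
  *min_bytes = ICONV_DEFAULT_MIN;
  *max_bytes = ICONV_DEFAULT_MAX;
}
```

Calling `charset_minmax("UTF-16", &lo, &hi)` would fill in the table values, while a name without an entry (say "EUC-JP") keeps the 1/8 defaults. The trade-off is the one pointed out in the thread: every entry is knowledge that has to be maintained by hand alongside iconv.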

     
