
#23 better way to check for UTF-8

open
nobody
None
5
2008-08-01
2008-07-31
Anonymous
No

The new 3.0 ZIP and (beta) UNZIP have support for storing UTF-8 in the "name" field.

The way they check whether the "operating system environment" is in UTF-8 mode is:

in ZIP:
loc = setlocale(LC_CTYPE, "en_US.UTF-8");
in UNZIP:
loc = setlocale(LC_CTYPE, "en_GB.UTF-8");

First, it is strange that it uses two different locales.

Second, this fails on platforms that use only UTF-8 and no legacy stuff because they have no non-UTF8 locale and so the UTF-8 locale is called "en_US" and NOT "en_US.UTF-8" (it's just a free-form name, so...).

This is the actual bug we encountered on the Nokia N800 device (Maemo platform, GNU environment): there are locales like "de_DE", "en_US" etc and they are all in UTF-8 codeset.

(Also, it is rude to override the locale the user has set for himself: if he's not using UTF-8 (even though a UTF-8 locale may be installed as part of glibc, if he even knows that) and he doesn't want to use UTF-8, why force him to?)

The easiest fix is to instead _check_ whether ZIP / UNZIP are run in a UTF-8 locale:

----------

#include <locale.h>
#include <langinfo.h>
#include <string.h> /* for strcmp() */

setlocale(LC_CTYPE, ""); /* tell the base library that we support locales; this loads the locale the user has selected */

char *codeset = nl_langinfo(CODESET); /* get the codeset in use, for example "UTF-8" */

if (strcmp(codeset, "UTF-8") == 0) {
    using_utf8 = 1; /* using_utf8 is the existing flag in ZIP / UNZIP */
}

----------

Discussion

  • Logged In: YES
    user_id=765110
    Originator: NO

    diff -up orig/zip30/tailor.h zip30/tailor.h
    --- orig/zip30/tailor.h 2008-05-06 21:38:12.000000000 +0200
    +++ zip30/tailor.h 2008-07-31 14:38:23.771881100 +0200
    @@ -332,6 +332,10 @@ IZ_IMP char *mktemp();
    #ifdef UNICODE_SUPPORT
    # if defined( UNIX) || defined( VMS)
    # include <locale.h>
    +# include <langinfo.h>
    +# ifndef SETLOCALE
    +# define SETLOCALE(category, locale) setlocale(category, locale)
    +# endif /* ndef SETLOCALE */
    # endif /* defined( UNIX) || defined( VMS) */
    # include <wchar.h>
    # include <wctype.h>
    diff -up orig/zip30/zip.c zip30/zip.c
    --- orig/zip30/zip.c 2008-07-05 12:34:06.000000000 +0200
    +++ zip30/zip.c 2008-07-31 14:17:48.623351200 +0200
    @@ -2204,20 +2204,20 @@ char **argv; /* command line
    writing, and displaying (if the fonts are loaded) all
    characters in UTF-8. */
    {
    - char *loc;
    + char *encoding;

    /*
    loc = setlocale(LC_CTYPE, NULL);
    printf(" Initial language locale = '%s'\n", loc);
    */

    - loc = setlocale(LC_CTYPE, "en_US.UTF-8");
    + encoding = nl_langinfo(CODESET);

    /*
    printf("langinfo %s\n", nl_langinfo(CODESET));
    */

    - if (loc != NULL) {
    + if (strcmp(encoding, "UTF-8") == 0) {
    /* using UTF-8 character set so can set UTF-8 GPBF bit 11 */
    using_utf8 = 1;
    /*

     
  • Logged In: YES
    user_id=765110
    Originator: NO

    (there is a 'SETLOCALE(LC_CTYPE, "");' call a few lines above the context already, hence the patch doesn't do it a second time).

    Hope that helps.

     
  • Ed Gordon
    2008-08-01

    Logged In: YES
    user_id=1172496
    Originator: NO

    > The way they check whether the "operating system environment" is in UTF-8 mode is:

    > in ZIP:
    > loc = setlocale(LC_CTYPE, "en_US.UTF-8");
    > in UNZIP:
    > loc = setlocale(LC_CTYPE, "en_GB.UTF-8");

    > First, it is strange that it uses two different locales.

    That's because the Zip and UnZip maintainers are in different countries. Since the locale "UTF-8" does not exist, some specific locale was needed. At least that's what we thought.

    > Second, this fails on platforms that use only UTF-8 and no legacy stuff
    > because they have no non-UTF8 locale and so the UTF-8 locale is called
    > "en_US" and NOT "en_US.UTF-8" (it's just a free-form name, so...).

    Didn't know that. On the test platforms we have the above worked fine. That's why we need help from the community. (Thanks!)

    It's possible your way doesn't work on our test platforms, as these platforms need to be told that UTF-8 is what's wanted.

    > This is the actual bug we encountered on the Nokia N800 device (Maemo
    > platform, GNU environment): there are locales like "de_DE", "en_US" etc and
    > they are all in UTF-8 codeset.

    OK. If you can provide links to documentation on this we'd appreciate it.

    > (Also, it is rude to override the locale the user has set for himself: if
    > he's not using UTF-8 (but would have an UTF-8 locale installed as part of
    > glibc, if he even knows that) but he's doesn't want to use UTF-8, why force
    > him to?)

    A UTF-8 path can't be read or written unless the locale is set to UTF-8, or so we found.

    > The easiest fix is to instead _check_ whether ZIP / UNZIP are run in an
    > UTF-8 locale:

    This may not be the case when files are extracted. We felt that file paths should be restored correctly regardless of the current character set, if the system supports UTF-8. The archive could have paths from various character sets that need to be restored.

    > char* codeset = nl_langinfo(CODESET); /* get the codeset used. for example "UTF-8". */

    This can vary from system to system.

    > if (strcmp(codeset, "UTF-8") == 0) {
    > using_utf8 = 1;
    > }

    Not sure this will work. Note that the locale is only being changed within the Zip program, not the locale of the system. At least that is the intent. We need some details of the problem to work it.

     
  • Ed Gordon
    2008-08-01

    • summary: better was to check for UTF-8 --> better way to check for UTF-8
     
  • Logged In: YES
    user_id=765110
    Originator: NO

    Hi,

    gordone wrote:
    > Danny wrote:
    >> in ZIP:
    >> loc = setlocale(LC_CTYPE, "en_US.UTF-8");
    >> in UNZIP:
    >> loc = setlocale(LC_CTYPE, "en_GB.UTF-8");
    >> First, it is strange that it uses two different locales.

    >That's because the Zip and UnZip maintainers are in different countries.
    >Since the locale "UTF-8" does not exist, some specific locale was needed.
    >At least that's what we thought.

    I see. Well, the entire point of locales is to localize the environment to a country, so makes sense it includes a country name :)

    Unicode is a special case the locale people didn't design in, so you need a country setting for LC_CTYPE which is admittedly weird.

    (Note that most locale settings don't have anything to do with the charset so you need the country setting anyway - for example to select currency, decimal point representation, date representation, sort order, digit grouping, ...)

    >> Second, this fails on platforms that use only UTF-8 and no legacy stuff because they have no non-UTF8 locale and so the UTF-8 locale is called "en_US" and NOT "en_US.UTF-8" (it's just a free-form name, so...).

    >Didn't know that. On the test platforms we have the above worked fine.
    >That's why we need help from the community. (Thanks!)

    You're welcome :)

    I see. I mostly used older, more mature platforms before; they were all _migrating_ to UTF-8 and thus still supported the older locales, so the problem of using the original names for the new coding never surfaced.

    The platform the problem happens on is a new embedded platform and there it apparently made sense to get rid of the (big) legacy locales and only have UTF-8. -> bam, assumption broken :)

    >It's possible your way doesn't work on our test platforms, as these platforms need to be told that UTF-8 is what's wanted.

    That's what's supposed to happen, isn't it?
    Shouldn't a program:
    - as long as nobody sets a locale, use C (ASCII) encoding.
    - if somebody sets a locale, use the encoding specified in that locale.

    Ah, I think I see the misunderstanding:

    The _environment_ should tell the program that UTF-8 is what's wanted, not the program itself (the program doesn't set its own locale because how would it know what the system wanted to be used for input, display, file name storage, ...).

    So if there is no locale set on the test platform, it's not supposed to use any locale (as far as I understand).

    Newer systems that can handle the weird multibyte 8-bit byte sequences that are UTF-8 will have an UTF-8 locale set in the environment anyway.

    The only problem arises when one doesn't check the locale itself for the encoding, but parses the locale's name.
    Or tries to set a specific locale that wasn't specified by the user directly or indirectly (which seems weird to me, frankly).
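
    The difference between parsing the locale's name and asking the locale for its encoding can be sketched as below. This is only an illustration: both helper names are hypothetical and are not part of the Zip or UnZip sources. On a platform such as the one described in this report, where the UTF-8 locale is simply called "de_DE", the name-based guess fails while the codeset query still answers correctly.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Fragile: guesses the encoding from the locale's name (hypothetical helper).
   A UTF-8 locale that is simply named "de_DE" is not recognised. */
int name_suggests_utf8(const char *locale_name)
{
    assert(locale_name != NULL);
    return strstr(locale_name, "UTF-8") != NULL
        || strstr(locale_name, "utf8") != NULL;
}

/* More robust: adopt the locale selected by the environment, then ask the
   C library which codeset that locale uses (hypothetical helper). */
int environment_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0; /* the environment names a locale that is not installed */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

    With LANG=de_DE on the device described here, name_suggests_utf8("de_DE") returns 0, whereas environment_is_utf8() would be expected to return 1 because "locale charmap" reports UTF-8.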

    >> This is the actual bug we encountered on the Nokia N800 device (Maemo platform, GNU environment): there are locales like "de_DE", "en_US" etc and they are all in UTF-8 codeset.

    >OK. If you can provide links to documentation on this we'd appreciate it.

    I'm trying to find documentation about encoding on the Maemo platform, will get back to you.

    Using the shell, one gets:
    $ locale
    LANG=de_DE
    LC_CTYPE="de_DE"
    LC_NUMERIC="de_DE"
    LC_TIME="de_DE"
    LC_COLLATE="de_DE"
    LC_MONETARY="de_DE"
    LC_MESSAGES="de_DE"
    LC_PAPER="de_DE"
    LC_NAME="de_DE"
    LC_ADDRESS="de_DE"
    LC_TELEPHONE="de_DE"
    LC_MEASUREMENT="de_DE"
    LC_IDENTIFICATION="de_DE"
    LC_ALL=

    $ locale charmap
    UTF-8

    >> ... doesn't want to use UTF-8, why force him to?)

    >A UTF-8 path can't be read or written unless the locale is set to UTF-8, or so we found.

    Do you mean to the console? If the locale is not UTF-8, the console probably isn't UTF-8 capable either and so it will print gibberish, also after you set a UTF-8 locale :) Not that that happened to me, but just saying.

    I think you want to use iconv(3) to convert freely between charsets.

    >> The easiest fix is to instead _check_ whether ZIP / UNZIP are run in an UTF-8 locale:

    >This may not be the case when files are extracted. We felt that file paths should be restored correctly regardless of the current character set, if the system supports UTF-8. The archive could have paths from various character sets that need to be restored.

    Well, UNIX is weird in that way: The correct character set for the file system is apparently defined by the locale, with all the weird fun that entails :-)

    LANG dependent encoding
    <http://linux.derkeiler.com/Newsgroups/comp.os.linux.setup/2004-12/0849.html>

    GCC on file name encodings
    <http://gcc.gnu.org/ml/java-prs/2003-q1/msg00033.html>

    Global file name encodings suggested
    <http://mail.nl.linux.org/linux-utf8/2001-06/msg00244.html>.

    Funn...

    >> char* codeset = nl_langinfo(CODESET); /* get the codeset used. for example "UTF-8". */

    >This can vary from system to system.

    Hmm? Of course, isn't that the point?

    When you added UTF-8 support to ZIP, you wanted to make sure it also works on non-UTF-8 systems, right?
    With the UNIX libc authors, it was the same: as long as the locale doesn't use UTF-8, it's not supposed to use UTF-8 (because UTF-8 is unknown to the rest of the system and you could neither type it with the keyboard nor display it on screen anyway).

    >> if (strcmp(codeset, "UTF-8") == 0) {
    >> using_utf8 = 1;
    >> }

    >Not sure this will work.

    You mean whether this will work everywhere? It works here :)

    Similar code was already in the ZIP release, but it was commented out. I'm not sure why, does it causes trouble?

    >Note that the locale is only being changed within the Zip program, not the locale of the system. At least that is the intent.

    That's what it does.

    It fails on that platform though because a locale of that name does not exist. For all we know, the correct UTF-8 locale could be called "thingaboo" or "default" or whatever. I'm not sure what the objective is. If you want to convert text from UTF-8, use iconv(3).

    (btw, the problem in the ZIP/UNZIP code doesn't arise directly from the failed setlocale() call, it arises because "using_utf8 = 0").

    >We need some details of the problem to work it.

    I hope that helps :)

     
  • Logged In: YES
    user_id=765110
    Originator: NO

    $ locale -a
    C
    da_DK
    de_DE
    el_GR
    en_GB
    en_US
    es_ES
    es_MX
    fi_FI
    fr_CA
    fr_FR
    it_IT
    nl_NL
    no_NO
    POSIX
    pt_BR
    pt_PT
    ru_RU
    sv_SE

     
  • Ed Gordon
    2008-08-02

    Logged In: YES
    user_id=1172496
    Originator: NO

    >>Since the locale "UTF-8" does not exist, some specific locale was needed.

    >>At least that's what we thought.

    >I see. Well, the entire point of locales is to localize the environment to
    >a country, so makes sense it includes a country name :)

    >Unicode is a special case the locale people didn't design in, so you
    >need a country setting for LC_CTYPE which is admittedly weird.

    Yep.

    >I see. I mostly used older maturer platforms before and they were all
    >_migrating_ to UTF-8 and thus still supporting the older locales and thus
    >the problem of using the original names for the new coding never surfaced.

    >The platform the problem happens on is a new embedded platform and there
    >it apparently made sense to get rid of the (big) legacy locales and only
    >have UTF-8. -> bam, assumption broken :)

    It seems an assumed standard was broken, that the new platform didn't stick
    with the way others were doing it. So that's something extra to design to.

    >>It's possible your way doesn't work on our test platforms, as these
    >>platforms need to be told that UTF-8 is what's wanted.

    >That's what's supposed to happen, isn't it?
    >Shouldn't a program:
    >- as long as nobody sets a locale, use C (ASCII) encoding.
    >- if somebody sets a locale, use the encoding specified in that locale.

    Yep. There's detailed documentation on how the different settings of a
    locale are used to process specific character set and other information.

    >Ah, I think I see the misunderstanding:

    >The _environment_ should tell the program that UTF-8 is what's wanted, not
    >the program itself (the program doesn't set its own locale because how
    >would it know what the system wanted to be used for input, display, file
    >name storage, ...).

    No, that's not quite how I interpret this. The locale set in the program sets
    how the program is asking to interact with the operating system. In the
    case where we have a UTF-8 string we want to write to the file system as
    UTF-8, we set the UTF-8 locale so that the standard file calls that write
    to the file system can interpret the UTF-8 strings. That's the main point
    of switching to UTF-8, so these paths on the file system can be read and
    written.

    Console I/O is a separate issue, and so far we've found that most console
    windows don't have the fonts loaded to display characters outside the
    current loaded character set. See how code pages are switched in and
    out to support different characters in the documentation. However, if
    a UTF-8 path is written by Zip using UTF-8 as the locale, then the file
    browser displays the path correctly, even if Japanese, which is what is
    intended.

    I'm not following what ill effects are resulting from how Zip does the
    processing, other than the effects of having the wrong locale for the
    target platforms.

    >So if there is no locale set on the test platform, it's not supposed to
    >use any locale (as far as I understand).

    There's always a locale. Indeed, the default locale for a C program is
    the standard C locale, which has limited character support. In
    general, the only way to know the OS calls will support UTF-8 is to set
    it to a UTF-8 locale.

    >Newer systems that can handle the weird multibyte 8-bit byte sequences
    >that are UTF-8 will have an UTF-8 locale set in the environment anyway.

    That's an assumption, at least from the view of a generic application like
    Zip that can be run on all sorts of platforms. Linux doesn't do that and
    UTF-8 must be set as the locale to get it.

    >The only problem arises when one doesn't check the locale itself for the
    >encoding, but parses the locale's name.
    >Or tries to set a specific locale that wasn't specified by the user
    >directly or indirectly (which seems weird to me, frankly).

    I think you're not following the key point that Zip (and UnZip) have UTF-8
    strings to read and write as paths on the file system. They don't really
    care what the user set the system locale to be, they just need to be
    able to read and write paths in UTF-8. On Linux, this happens just fine.

    >>> This is the actual bug we encountered on the Nokia N800 device (Maemo
    >>> platform, GNU environment): there are locales like "de_DE", "en_US" etc and
    >>> they are all in UTF-8 codeset.

    >>OK. If you can provide links to documentation on this we'd appreciate it.

    >I'm trying to find documentation about encoding on the Maemo platform,
    >will get back to you.

    OK. Appreciate that. This needs to be fixed so that all cases are covered
    and tested. This problem introduces new cases that apparently didn't exist
    while we were designing this code.

    >Using the shell, one gets:
    >$ locale
    >LANG=de_DE
    >LC_CTYPE="de_DE"
    ...
    >LC_COLLATE="de_DE"
    ...
    >LC_ALL=

    >$ locale charmap
    >UTF-8

    Yep. The CTYPE locale setting is important here. The collate setting I believe
    impacts how the strings are sorted in some cases and can be significant. The
    ALL setting can be used to set everything, but Zip and probably UnZip (still in
    development) don't care about the other settings and the local settings are left
    as is.

    >>> ... doesn't want to use UTF-8, why force him to?)

    UTF-8 is needed to write the paths stored in the archive. If the user doesn't
    want that to happen, they can disable Unicode support, either by recompiling
    without it or by using the option that disables it.

    Without UTF-8 support, some paths on the file system may not be readable
    (as they're in another character set) or writable (as the current locale does
    not support some of the characters in the paths).

    >A UTF-8 path can't be read or written unless the locale is set to UTF-8,
    >or so we found.

    >Do you mean to the console?

    No. The file system.

    >If the locale is not UTF-8, the console
    >probably isn't UTF-8 capable either and so it will print gibberish, also
    >after you set a UTF-8 locale :) Not that that happened to me, but just
    >saying.

    If your console can automatically handle Japanese paths (assuming it's
    not set for Japanese already), then great. Over here the fonts to show
    Japanese on the console are not there so I generally see spaces instead.

    >I think you want to use iconv(3) to convert freely between charsets.

    No, we looked at that and there's reasons we're not using it. First, the
    license is not compatible with the Info-ZIP license and we are resistant
    to force people to get the iconv library (not supported on all platforms)
    separately before they can compile our code. Second, to use iconv
    you need to know the official names of the character sets you are
    converting from and to, and that generally is not known for Zip paths.
    Third, iconv is not needed. UTF-8 is just fine for storing all character
    set characters and standard system calls are usually available for
    converting to and from the local character set and Unicode that are
    more available than iconv.
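
    As an illustration of converting from the local character set to UTF-8 with standard library calls only, here is a minimal sketch. It is not the Info-ZIP implementation; the function names utf8_encode and local_to_utf8 are hypothetical, and the sketch assumes that wchar_t holds Unicode code points, which is the case on glibc but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Encodes one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid code points */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Converts a string in the current LC_CTYPE encoding to UTF-8.
   Returns the length written (without the NUL), or (size_t)-1 on error.
   Assumes wchar_t values are Unicode code points. */
size_t local_to_utf8(const char *local, char *out, size_t out_size)
{
    mbstate_t state;
    size_t remaining, used = 0;

    assert(local != NULL && out != NULL);
    if (out_size == 0)
        return (size_t)-1;
    memset(&state, 0, sizeof state);
    remaining = strlen(local);

    while (remaining > 0) {
        wchar_t wc;
        unsigned char buf[4];
        size_t len;
        size_t n = mbrtowc(&wc, local, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
            return (size_t)-1; /* invalid or incomplete sequence */
        len = utf8_encode((unsigned long)wc, buf);
        if (len == 0 || used + len >= out_size)
            return (size_t)-1; /* not encodable, or output buffer too small */
        memcpy(out + used, buf, len);
        used += len;
        local += n;
        remaining -= n;
    }
    out[used] = '\0';
    return used;
}
```

    Note that mbrtowc() decodes according to the current LC_CTYPE locale, so the result depends on which locale the program has selected with setlocale().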

    >>> The easiest fix is to instead _check_ whether ZIP / UNZIP are run in an
    >>> UTF-8 locale:

    >>This may not be the case when files are extracted. We felt that file
    >>paths should be restored correctly regardless of the current character set,
    >>if the system supports UTF-8. The archive could have paths from various
    >>character sets that need to be restored.

    >Well, UNIX is weird in that way: The correct character set for the file
    >system is apparently defined by the locale, with all the weird fun that
    >entails :-)

    >LANG dependent encoding
    ><http://linux.derkeiler.com/Newsgroups/comp.os.linux.setup/2004-12/0849.html>

    >GCC on file name encodings
    ><http://gcc.gnu.org/ml/java-prs/2003-q1/msg00033.html>

    >Global file name encodings suggested
    ><http://mail.nl.linux.org/linux-utf8/2001-06/msg00244.html>.

    >Funn...

    I'll try to check the above references later this weekend.

    Most modern Unix file systems store the paths on the drives as
    Unicode as far as I know. The OS then converts the paths to other
    character sets as needed. More importantly, if the UTF-8 locale is
    set, the OS handles converting the paths to whatever format is used
    for the file system.

    >>> char* codeset = nl_langinfo(CODESET); /* get the codeset used. for example "UTF-8". */

    >>This can vary from system to system.

    >Hmm? Of course, isn't that the point?

    Well, yeah. However, there's no standard way to determine that the current
    locale is UTF-8, which there should be.
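
    For reference, nl_langinfo(CODESET) is specified by POSIX, but the exact spelling of the returned name is left to the implementation, and spellings other than "UTF-8" (for example "utf8") have reportedly been seen on some systems. A tolerant comparison could look like the sketch below; the helper names are hypothetical and not taken from the Zip sources.

```c
#include <assert.h>
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Returns 1 if `codeset` names UTF-8 under a relaxed comparison that
   ignores letter case and the separators '-' and '_', otherwise 0. */
int is_utf8_codeset(const char *codeset)
{
    static const char want[] = "utf8";
    size_t i = 0;

    assert(codeset != NULL);
    for (; *codeset != '\0'; codeset++) {
        unsigned char c = (unsigned char)*codeset;

        if (c == '-' || c == '_')
            continue; /* skip separators */
        if (want[i] == '\0' || tolower(c) != want[i])
            return 0; /* extra or mismatching character */
        i++;
    }
    return want[i] == '\0'; /* all of "utf8" must have been matched */
}

/* Adopts the environment's locale and checks its codeset. */
int current_locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;
    return is_utf8_codeset(nl_langinfo(CODESET));
}
```

    The relaxed match accepts "UTF-8", "utf8" and "UTF8" alike while still rejecting names such as "UTF-16" or "ISO-8859-1".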

    It doesn't matter though, as setting a UTF-8 locale tells the OS that paths to
    OS calls are UTF-8 and the OS should do the right thing. If setting a UTF-8
    locale fails, the OS does not support it and Zip acts accordingly.

    >When you added UTF-8 support to ZIP, you wanted to make sure it also works
    >on non-UTF-8 systems, right?

    We want Zip to work if UTF-8 is supported or not.

    >With the UNIX libc authors, it was the same: as long as the locale doesn't
    >use UTF-8, it's not supposed to use UTF-8 (because UTF-8 is unknown to the
    >rest of the system and you could neither type it with the keyboard nor
    >display it on screen anyway).

    The point is to read and write file system paths so that directories with
    files in a mix of character sets can be restored. Zip does that now on
    Linux and Windows. Displaying fonts in console windows and being
    able to type in Japanese for instance would be good, but is not
    essential to archiving and restoring directories.

    It may be that Zip and UnZip need to use different locales for file system
    calls and for console calls, switching between them as needed. We
    didn't get that far but should look at that in Zip 3.1.

    >>> if (strcmp(codeset, "UTF-8") == 0) {
    >>> using_utf8 = 1;
    >>> }

    >>Not sure this will work.

    >You mean whether this will work everywhere? It works here :)

    It might work as a check for the current locale setting.

    If all you got is UTF-8, then great. Indeed, in your case there's
    no need to switch locales. On the Linux system here, if the locale
    is not switched to UTF-8, paths in UTF-8 don't work.

    >Similar code was already in the ZIP release, but it was commented out. I'm
    >not sure why, does it cause trouble?

    Probably. I'll have to look at it.

    >>Note that the locale is only being changed within the Zip program, not
    >>the locale of the system. At least that is the intent.

    >That's what it does.

    So far so good then.

    >>It fails on that platform though because a locale of that name does not
    >>exist.

    So it sounds like the fix is to wire in the correct locale name for that
    platform. Then everything else should work.

    >For all we know, the correct UTF-8 locale could be called
    >"thingaboo" or "default" or whatever. I'm not sure what the objective is.
    >If you want to convert text
    >from UTF-8, use iconv(3).

    The objective is to activate the OS UTF-8 support for file system calls.

    We don't want to convert text between anything other than UTF-8 and
    the local character set, and iconv is not needed for that.

    >(btw, the problem in the ZIP/UNZIP code doesn't arise directly from the
    >failed setlocale() call, it arises because "using_utf8 = 0").

    Sure. So send in a patch that fixes that for your target platforms. Be
    sure to put it in some #ifdef block specific for these platforms. Any
    patch that makes standard Linux platforms fail will be rejected.

    >I hope that helps :)

    Some.

    Thanks.

     
  • Logged In: YES
    user_id=765110
    Originator: NO

    Hi,

    First, excuse me for my long winded post, yet again :)

    Summary:

    As a compromise, could we test the nl_langinfo CODESET first and if it's not UTF-8 already, set it to your hardcoded UTF-8 locale? That would work.
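
    A minimal sketch of that compromise, assuming the caller passes a NULL-terminated list of fallback locale names (for instance the "en_US.UTF-8" and "en_GB.UTF-8" names currently hardcoded in Zip and UnZip). The function names are hypothetical and this is not a patch against the actual sources.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the current LC_CTYPE locale reports the codeset "UTF-8". */
int codeset_is_utf8(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}

/* Step 1: honour the locale chosen by the environment and keep it if it
   is already UTF-8.
   Step 2: otherwise try each hardcoded fallback name in turn, verifying
   the codeset after every successful setlocale() call.
   Returns 1 if LC_CTYPE ends up in a UTF-8 locale, 0 otherwise. */
int select_utf8_ctype(const char *const *fallbacks)
{
    const char *const *name;

    assert(fallbacks != NULL);

    if (setlocale(LC_CTYPE, "") != NULL && codeset_is_utf8())
        return 1;

    for (name = fallbacks; *name != NULL; name++) {
        if (setlocale(LC_CTYPE, *name) != NULL && codeset_is_utf8())
            return 1;
    }
    return 0;
}
```

    The return value could then be assigned to using_utf8. On a system that only ships locales such as "de_DE" in UTF-8, step 1 succeeds and the hardcoded names are never tried; on a system with a legacy default locale, step 2 behaves like the existing code.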

    Long winded stuff (read only if you have a lot of time :)):

    gordone wrote:
    >It seems an assumed standard was broken, that the new platform didn't stick with the way others were doing it. So that's something extra to design to.

    Yeah, it's true that that's the first platform I've seen that didn't use the convention "en_US.UTF-8" for UTF-8 locales. Strange.

    >Danny wrote:
    >>The _environment_ should tell the program that UTF-8 is what's wanted, not the program itself (the program doesn't set its own locale because how would it know what the system wanted to be used for input, display, file name storage, ...).

    >No, that's not quite how I interpret this. The locale set in the program sets how the program is asking to interact with the operating system.

    The locale system is only known in the base library; the kernel doesn't care about it (If it did, my life would be a lot easier ;)). It doesn't magically convert before passing things to the kernel either.

    No really, check for yourself:

    in <http://ftp.gnu.org/gnu/glibc/glibc-2.6.1.tar.bz2>:
    - glibc "open": in "sysdeps/unix/sysv/linux/open64.c" line 28 ff.

    Hmm, of course it depends what you mean by "Operating System".
    In any case, "The locale set in the program sets how the program is asking to interact with the base library as it pertains to sorting, formatting and error messages"? :)

    > In the case where we have a UTF-8 string we want to write to the file system as UTF-8, we set the UTF-8 locale so that the standard file calls that write to the file system can interpret the UTF-8 strings. That's the main point of switching to UTF-8, so these paths on the file system can be read and written.

    Heh, if that worked, I would be so grateful. Unfortunately UNIX is weird and there's no encoding information whatsoever in the kernel's open() / creat() system calls (nor even in the directory entry itself for some file systems) and so it couldn't pass the encoding information even if it wanted to.

    Hence the locale was a convention set by the system administrator specifying how the byte sequence (that is the file's name) was supposed to be interpreted.

    >Console I/O is a separate issue

    Ah, I misunderstood then, sorry.

    > However, if a UTF-8 path is written by Zip using UTF-8 as the locale, then the file browser displays the path correctly, even if Japanese, which is what is intended.

    Fair enough, it's convention by the GNOME people that file names are in UTF-8 except when the environment variable "G_BROKEN_FILENAMES" is set. No joke ;)

    Most UNIX text utilities just store the file name just how they got it from the terminal - in the argument of the command line (and that depends on the terminal, not the locale).

    > However, if a UTF-8 path is written by Zip using UTF-8 as the

    Written to disk? The kernel's open() system call doesn't care about the locale :(

    >I'm not following what ill effects are resulting from how Zip does the processing, other than the effects of having the wrong locale for the target platforms.

    A simple test of:
    1) save a ZIP file containing a file whose name has umlauts in MS Windows.
    2) restore the contents on the device.
    yields the file name's umlauts broken. (and what it displays with "unzip -v" is wrong, since it's presumably printing the Windows ANSI codepage bytes to an UTF-8 terminal)

    I checked the beta UNZIP and it may be because it never checks the extra field for the UTF-8 name and prefers that. Or is it because of the failed setlocale() call? In any case, if it would just check the current locale before changing the locale, it would see that if it were to just leave it alone, it would have UTF-8 support.

    In any case, I read the ZIP source code to mean "if UTF-8 capable, just store the UTF-8 name into the archive's directory entry main name field". Which it now doesn't on the device.

    >>So if there is no locale set on the test platform, it's not supposed to use any locale (as far as I understand).
    >There's always a locale.

    Well, technically there is, but it's a backward compatibility measure for the programs written in a time when there was in fact no locale at all because locales were not invented yet. And that's why when I said "no locale", I meant "practically no locale". :)

    >Indeed, the default locale for a C program is the standard C locale, which has limited character support. In general, the only way to know the OS calls will support UTF-8 is to set it to a UTF-8 locale.

    As a compromise, could we test the nl_langinfo CODESET first and if it's not UTF-8 already, set it to your hardcoded UTF-8 locale? That would work.

    >>Newer systems that can handle the weird multibyte 8-bit byte sequences that are UTF-8 will have an UTF-8 locale set in the environment anyway.

    >That's an assumption,

    It's the same kind of assumption as this one: if the system supports locales at all, it will have a LANG or LC_* environment variable set; otherwise it won't. I'm not sure whether it's a safe assumption to make, but...

    >at least from the view of a generic application like Zip that can be run on all sorts of platforms. Linux doesn't do that and UTF-8 must be set as the locale to get it.

    I see your point that it must be robust and work even when the system is set up wrong.

    I'm sorry to be so nitpicky as to what happens where but locales are messy and so I feel being extra clear is better (if annoying :)).

    >I think you're not following the key point that Zip (and UnZip) have UTF-8 strings to read and write as paths on the file system. They don't really care what the user set the system locale to be, they just need to be able to read and write paths in UTF-8. On Linux, this happens just fine.

    Ah, it would be nice if it were so. The file system API (that is, the VFS interface of Linux) has no encoding field for file names and hence any locale you set or not set will just fall into the void. :)

    > Yep. The CTYPE locale setting is important here. The collate setting I believe impacts how the strings are sorted in some cases and can be significant.

    Yes.

    >UTF-8 is needed to write the paths stored in the archive.

    >Without UTF-8 support, some paths on the file system may not be readable (as they're in another character set) or writable (as the current locale does not support some of the characters in the paths).

    That's why paths in the kernel are opaque byte sequences with no explicit meaning field (or encoding or character set) or even implicit meaning as far as the kernel is concerned.

    >A UTF-8 path can't be read or written unless the locale is set to UTF-8, or so we found.

    The kernel doesn't care about the locale. It doesn't even know it.

    >>If the locale is not UTF-8, the console probably isn't UTF-8 capable either and so it will print gibberish, also after you set a UTF-8 locale :)

    >If your console can automatically handle Japanese paths (assuming it's not set for Japanese already), then great. Over here the fonts to show Japanese on the console are not there so I generally see spaces instead.

    I didn't mean fonts.

    I mean let's say as an extreme case you have a terminal that can do only 7 bit ASCII.
    You are logged into the system using that terminal and in a shell.

    Now you start a program that (sets the locale - who cares? - and) writes a UTF-8 sequence, let's say \xc3\x96, using the write() system call on file descriptor 1. What do you want the console (which is far away and only gets \xc3\x96, nothing else) to do?

    Which is why I said that the locale settings (like the TERM environment variable) are user settings that depend on the capabilities of the terminal the user is on right now. A program just setting a UTF-8 locale doesn't help in that case; it still won't be able to print the name.

    It's the same with file paths, if you open(\xc3\x96), the kernel will merrily create a file that is named \xc3\x96 but if you type "ls" and your locale is "C" (which it reasonably is, given my hypothetical 7-bit ASCII terminal), you'll not see the correct UTF-8 character (in fact ls could even just sanity check and err out).

    Then you suggested setting the locale to UTF-8, which makes the situation little better:

    "ls" will be able to sort the names correctly - taking into account that \xc3\x96 is supposed to come after "O" and before "P" - but then it tries to print it, write(1, \xc3\x96), and since the locale is UTF-8 capable "ls" doesn't err out and so tells the terminal "write \xc3\x96" and the terminal goes "huh??" and probably writes something like \x67\x22 ('g"') or beeps wildly ;)

    Note that in no case was the name of the file (the byte sequence passed to the kernel) any different.
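
    The claim that the name bytes are the same regardless of locale can be checked with a small experiment. The following is only a sketch, assuming a Linux file system that does no name normalization and a writable /tmp; the helper name `name_roundtrips` and the directory template are made up for illustration.

```c
#define _XOPEN_SOURCE 700   /* for mkdtemp() */
#include <dirent.h>
#include <fcntl.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a file called `name` in a fresh temporary
 * directory while LC_CTYPE is set to `locale`, then read the directory back.
 * Returns 1 if readdir() reports exactly the same bytes, 0 if it does not,
 * and -1 on a setup error.
 */
int name_roundtrips(const char *locale, const char *name)
{
    char dir[] = "/tmp/zipname-XXXXXX";
    char path[512];
    struct dirent *entry;
    DIR *d;
    int fd, found = 0;

    /* If the locale is not installed this fails and the current one is kept. */
    setlocale(LC_CTYPE, locale);

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof path, "%s/%s", dir, name);

    /* The name is handed to the kernel as plain bytes. */
    fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0600);
    if (fd < 0) {
        rmdir(dir);
        return -1;
    }
    close(fd);

    /* Read the directory back and compare byte for byte. */
    d = opendir(dir);
    if (d != NULL) {
        while ((entry = readdir(d)) != NULL) {
            if (strcmp(entry->d_name, name) == 0)
                found = 1;
        }
        closedir(d);
    }

    unlink(path);
    rmdir(dir);
    return found;
}
```

    Calling `name_roundtrips("C", "\xc3\x96.txt")` and `name_roundtrips("en_US.UTF-8", "\xc3\x96.txt")` should give the same answer if the kernel really treats the name as an opaque byte sequence.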

    >we looked at that [iconv] and there's reasons we're not using it. [license, unmet dependencies]

    Fair enough.

    >you need to know the official names of the character sets you are converting from and to, and that generally is not known for Zip paths.

    That would be exactly what I would store in the directory entry: the name of the character set the name is in. But always using UTF-8 is a good compromise and good enough for all cases I can think of, with most systems using it anyway and all.

    >UTF-8 is just fine for storing all character set characters and standard system calls are usually available for converting to and from the local character set and Unicode that are more available than iconv.

    Really? Where?

    >Most modern Unix file systems store the paths on the drives as Unicode as far as I know. The OS then converts the paths to other character sets as needed.

    What is "the OS"?

    >More importantly, if the UTF-8 locale is set, the OS handles converting the paths to whatever format is used for the file system.

    No. Whatever byte array you pass to the open() system call, it stays just as it was until it hits the VFS in the kernel.

    >>> char* codeset = nl_langinfo(CODESET); /* get the codeset used. for example "UTF-8". */

    >Well, yeah. However, there's no standard way to determine that the current locale is UTF-8, which there should be.

    I'm not trying to be annoying, maybe I'm just losing it, but doesn't the above code do exactly that?

    With "It can vary from system to system" I understood "if the system is UTF-8 capable the result will be 'UTF-8' and otherwise it will be something else" - hence it varies from system to system. Did I get it wrong?

    I can check the SUSv2 specification on the matter if you want, but I'm pretty sure it's exactly what it does.

    Locale Naming Guideline for Linux <http://www.openi18n.org/docs/text/LocNameGuide-V10.txt>.

    UTF-8 and Unicode FAQ for Unix/Linux <http://www.cl.cam.ac.uk/~mgk25/unicode.html>:
    >>>if (((s = getenv("LC_ALL")) && *s) ||
    >>> ((s = getenv("LC_CTYPE")) && *s) ||
    >>> ((s = getenv("LANG")) && *s)) {
    >>> if (strstr(s, "UTF-8"))
    >>> utf8_mode = 1;
    >>> }
    >>>"This relies of course on all UTF-8 locales having the name of the encoding in their name, which is not always the case, therefore the nl_langinfo() query is clearly the better method."

    > It doesn't matter though, as setting a UTF-8 locale tells the OS that paths to OS calls are UTF-8 and the OS should do the right thing. If setting a UTF-8 locale fails, the OS does not support it and Zip acts accordingly.

    In my case, it does support UTF-8, as you saw in the terminal output. It is just not found since it has a different name.
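
    The difference between looking at a locale's name and asking the C library for its codeset can be sketched as follows. This is an illustration only; the function names `name_mentions_utf8` and `active_locale_is_utf8` are made up here and are not part of Zip.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Name-based heuristic: only looks at how the locale happens to be called. */
int name_mentions_utf8(const char *locale_name)
{
    assert(locale_name != NULL);
    return strstr(locale_name, "UTF-8") != NULL;
}

/* Codeset query: ask the C library which encoding the active locale uses.
 * The caller is expected to have called setlocale(LC_CTYPE, "") before. */
int active_locale_is_utf8(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

    On a platform where the UTF-8 locale is simply called "en_US", `name_mentions_utf8("en_US")` returns 0 even though the locale is UTF-8, while the codeset query does not depend on the name at all.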

    <http://www.cl.cam.ac.uk/~mgk25/ucs/norm_charmap.c>
    >>> "Unfortunately the names used by the CODESET are not yet standardized".

    Note that it doesn't have any special case for matching multiple different spellings of "UTF-8".
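
    If differing spellings of the codeset name are nevertheless a concern, a tolerant comparison is cheap to write. This is a sketch under the assumption that variants differ only in case and in '-' or '_' separators; `codeset_is_utf8` is a hypothetical helper, not something from Zip or from norm_charmap.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if `codeset` spells UTF-8 in a common way
 * ("UTF-8", "utf8", "Utf_8", ...), ignoring case and '-' / '_'. */
int codeset_is_utf8(const char *codeset)
{
    static const char want[] = "utf8";
    size_t i = 0;

    assert(codeset != NULL);
    for (; *codeset != '\0'; codeset++) {
        unsigned char c = (unsigned char)*codeset;
        if (c == '-' || c == '_')
            continue;               /* skip separators */
        if (want[i] == '\0' || tolower(c) != want[i])
            return 0;               /* extra or mismatching character */
        i++;
    }
    return want[i] == '\0';         /* all of "utf8" was matched */
}
```

    It would be used as `codeset_is_utf8(nl_langinfo(CODESET))` in place of the plain strcmp().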

    >>When you added UTF-8 support to ZIP, you wanted to make sure it also works on non-UTF-8 systems, right?

    >We want Zip to work if UTF-8 is supported or not.

    Which is why the "using_utf8" variable is there to stay.
    Good to know.

    >The point is to read and write file system paths so that directories with files in a mix of character sets can be restored. Zip does that now on Linux and Windows. Displaying fonts in console windows and being able to type in Japanese for instance would be good, but is not essential to archiving and restoring directories.

    >It may be that Zip and UnZip need to use different locales for file system calls and for console calls, switching between them as needed. We didn't get that far but should look at that in Zip 3.1.

    >>> if (strcmp(codeset, "UTF-8") == 0) {
    >>> using_utf8 = 1;
    >>> }

    >It might work as a check for the current locale setting.

    > On the Linux system here, if the locale is not switched to UTF-8, paths in UTF-8 don't work.

    O_O

    I want to see that. I want to believe that while Ulrich Drepper does a lot of things, he does not modify the base library in a way that makes path names break sometimes :)

    >>Similar code was already in the ZIP release, but it was commented out. I'm not sure why, does it causes trouble?

    >Probably. I'll have to look at it.

    Thanks.

    >So it sounds like the fix is to wire in the correct locale name for that platform. Then everything else should work.

    I would suggest doing the nl_langinfo and if that didn't report "UTF-8", use setlocale().

    >The objective is to activate the OS UTF-8 support for file system calls.

    There is no such OS UTF-8 support for file system calls.

    >We don't want to convert text between anything other than UTF-8 and the local character set, and iconv is not needed for that.

    >Sure. So send in a patch that fixes that for your target platforms. Be sure to put it in some #ifdef block specific for these platforms. Any patch that makes standard Linux platforms fail will be rejected.

    How about a patch that first uses nl_langinfo and then, if that doesn't report UTF-8, falls back to setlocale()? If either succeeds, set "using_utf8 = 1", that is:

    -------------------------------------
    #include <locale.h>
    #include <langinfo.h>
    #include <string.h>

    /* Tell the base library that we support locales. This loads the locale the user has selected. */
    setlocale(LC_CTYPE, "");

    /* Get the codeset currently in use, for example "UTF-8". */
    char *codeset = nl_langinfo(CODESET);

    if (strcmp(codeset, "UTF-8") == 0) {
        using_utf8 = 1;
    } else if (setlocale(LC_CTYPE, "en_US.UTF-8") != NULL) {
        /* Otherwise fall back to explicitly selecting a UTF-8 locale. */
        using_utf8 = 1;
    }
    -------------------------------------

    Thanks for your time.

     
  • Ed Gordon
    2008-09-28

    Sorry about the delay, but the list of things to do is getting long lately. I hope to reply to this shortly, and it looks like something can be done. A couple thoughts though. It seems we are missing slightly in interpreting what each is saying, so I need to step through my reply more carefully. Also, the functioning of Zip and the UnZip beta were worked out through research and much trial and error. I twinge when you suggest something should work in some way and I have experience to the contrary. Anyway, I should have time tomorrow to work through this.

     
  • Happy new year :-)

    >It seems we are missing slightly in interpreting what each is saying, so I need to step through my reply more carefully.

    Yeah, encoding issues are complex and I probably misunderstood part of what you said.

    >Also, the functioning of Zip and the UnZip beta were worked out through research and much trial and error.

    > I twinge when you suggest something should work in some way and I have experience to the contrary.

    Oh no, I'm sorry, I didn't mean it like that.

    I appreciate your effort to have ZIP/UNZIP work on so many platforms. And I know that there are tradeoffs to be made when being cross-platform, and quirks on each platform.

    I'm quite OK with having an "#ifdef" for the Maemo platform. Note that the setlocale call for Maemo would be 'setlocale(LC_CTYPE, "en_US")'; this sets a UTF-8 locale, and everything then works *(only) because the entire platform is UTF-8 anyway*. But please note that this problem is indicative of a bigger one that will come back to you on other platforms.
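
    One way to avoid wiring in a platform-specific locale name is to try a list of candidates and verify each with nl_langinfo() instead of trusting the name. This is a rough sketch, not a proposed patch; `select_utf8_locale` and the candidate list are hypothetical.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find a UTF-8 LC_CTYPE locale without relying on its
 * name. Tries the user's own locale first (""), then each name in the
 * NULL-terminated `candidates` list. Returns the name that worked ("" for
 * the user's locale), or NULL after restoring the previous locale.
 */
const char *select_utf8_locale(const char *const *candidates)
{
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);
    const char *name = "";

    assert(candidates != NULL);
    /* setlocale() may reuse its buffer, so keep a private copy. */
    snprintf(saved, sizeof saved, "%s", current != NULL ? current : "C");

    for (;;) {
        /* Accept a locale only if it loads AND its codeset is UTF-8. */
        if (setlocale(LC_CTYPE, name) != NULL &&
            strcmp(nl_langinfo(CODESET), "UTF-8") == 0)
            return name;
        if (*candidates == NULL)
            break;
        name = *candidates++;
    }

    setlocale(LC_CTYPE, saved);
    return NULL;
}
```

    With a candidate list such as { "en_US.UTF-8", "en_US", NULL }, the second entry would be accepted on Maemo because its codeset is UTF-8, and rejected on a system where "en_US" is a legacy 8-bit locale.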

    You are right that getting things to work in practice is a lot harder than just saying "this is how it should be". I am also a practical person, hence I checked the actual places in the libc, kernel etc. where things happen and didn't go by some high-horse theoretical framework. You'll notice that nowhere in the entire chain, from your program calling open() up to the kernel filesystem code doing the lookup(), is there any charset conversion or even charset information. Hence I found it interesting that your tests on the Linux box failed depending on locale when there is clearly no locale information anywhere in the call stack. I'm not saying it didn't happen, but I'd really like to see the call stack from gdb in the failure case :-)

     