#23 better way to check for UTF-8

open
nobody
None
5
2008-08-01
2008-07-31
Anonymous
No

The new 3.0 ZIP and (beta) UNZIP have support for storing UTF-8 in the "name" field.

The way they check whether the "operating system environment" is in UTF-8 mode is:

in ZIP:
loc = setlocale(LC_CTYPE, "en_US.UTF-8");
in UNZIP:
loc = setlocale(LC_CTYPE, "en_GB.UTF-8");

First, it is strange that it uses two different locales.

Second, this fails on platforms that use only UTF-8 and no legacy stuff because they have no non-UTF8 locale and so the UTF-8 locale is called "en_US" and NOT "en_US.UTF-8" (it's just a free-form name, so...).

This is the actual bug we encountered on the Nokia N800 device (Maemo platform, GNU environment): there are locales like "de_DE", "en_US" etc and they are all in UTF-8 codeset.

(Also, it is rude to override the locale the user has set for himself: if he's not using UTF-8 (even though a UTF-8 locale may be installed as part of glibc, if he even knows that) and he doesn't want to use UTF-8, why force him to?)

The easiest fix is to instead _check_ whether ZIP / UNZIP are run in an UTF-8 locale:

----------

#include <locale.h>
#include <langinfo.h>
#include <string.h>

/* Tell the base library that we support locales.
   This loads the locale the user has selected. */
setlocale(LC_CTYPE, "");

/* Get the codeset in use, for example "UTF-8". */
char *codeset = nl_langinfo(CODESET);

if (strcmp(codeset, "UTF-8") == 0) {
    using_utf8 = 1;
}

----------
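One caveat with a literal strcmp() against "UTF-8": C libraries do not all spell the codeset name the same way (glibc reports "UTF-8", but spellings such as "utf8" reportedly turn up elsewhere). Below is a minimal sketch of a more forgiving check. The helper names codeset_is_utf8() and locale_uses_utf8() are made up for illustration and are not part of the Zip sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `codeset` names UTF-8, ignoring
   letter case and '-' / '_' separators ("UTF-8", "utf8", "Utf_8"). */
int codeset_is_utf8(const char *codeset)
{
    static const char want[] = "UTF8";
    size_t i = 0;

    assert(codeset != NULL);
    for (; *codeset != '\0'; codeset++) {
        if (*codeset == '-' || *codeset == '_')
            continue;                  /* skip separators */
        if (want[i] == '\0' ||
            toupper((unsigned char)*codeset) != want[i])
            return 0;                  /* too long, or a mismatch */
        i++;
    }
    return want[i] == '\0';            /* all of "UTF8" was matched */
}

/* Hypothetical helper: adopts the user's LC_CTYPE locale and reports
   whether its codeset is UTF-8. */
int locale_uses_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;                      /* environment names an unknown locale */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

With something like this, the assignment in the snippet above would become `using_utf8 = locale_uses_utf8();`.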

Discussion

  • Danny Milosavljevic

    Logged In: YES
    user_id=765110
    Originator: NO

    diff -up orig/zip30/tailor.h zip30/tailor.h
    --- orig/zip30/tailor.h 2008-05-06 21:38:12.000000000 +0200
    +++ zip30/tailor.h 2008-07-31 14:38:23.771881100 +0200
    @@ -332,6 +332,10 @@ IZ_IMP char *mktemp();
    #ifdef UNICODE_SUPPORT
    # if defined( UNIX) || defined( VMS)
    # include <locale.h>
    +# include <langinfo.h>
    +# ifndef SETLOCALE
    +# define SETLOCALE(category, locale) setlocale(category, locale)
    +# endif /* ndef SETLOCALE */
    # endif /* defined( UNIX) || defined( VMS) */
    # include <wchar.h>
    # include <wctype.h>
    diff -up orig/zip30/zip.c zip30/zip.c
    --- orig/zip30/zip.c 2008-07-05 12:34:06.000000000 +0200
    +++ zip30/zip.c 2008-07-31 14:17:48.623351200 +0200
    @@ -2204,20 +2204,20 @@ char **argv; /* command line
    writing, and displaying (if the fonts are loaded) all
    characters in UTF-8. */
    {
    - char *loc;
    + char *encoding;

    /*
    loc = setlocale(LC_CTYPE, NULL);
    printf(" Initial language locale = '%s'\n", loc);
    */

    - loc = setlocale(LC_CTYPE, "en_US.UTF-8");
    + encoding = nl_langinfo(CODESET);

    /*
    printf("langinfo %s\n", nl_langinfo(CODESET));
    */

    - if (loc != NULL) {
    + if (strcmp(encoding, "UTF-8") == 0) {
    /* using UTF-8 character set so can set UTF-8 GPBF bit 11 */
    using_utf8 = 1;
    /*

     
  • Danny Milosavljevic

    Logged In: YES
    user_id=765110
    Originator: NO

    (there is a 'SETLOCALE(LC_CTYPE, "");' call a few lines above the context already, hence the patch doesn't do it a second time).

    Hope that helps.

     
  • Ed Gordon

    Ed Gordon - 2008-08-01

    Logged In: YES
    user_id=1172496
    Originator: NO

    > The way they check whether the "operating system environment" is in UTF-8
    > mode is:

    > in ZIP:
    > loc = setlocale(LC_CTYPE, "en_US.UTF-8");
    > in UNZIP:
    > loc = setlocale(LC_CTYPE, "en_GB.UTF-8");

    > First, it is strange that it uses two different locales.

    That's because the Zip and UnZip maintainers are in different countries. Since the locale "UTF-8" does not exist, some specific locale was needed. At least that's what we thought.

    > Second, this fails on platforms that use only UTF-8 and no legacy stuff
    > because they have no non-UTF8 locale and so the UTF-8 locale is called
    > "en_US" and NOT "en_US.UTF-8" (it's just a free-form name, so...).

    Didn't know that. On the test platforms we have the above worked fine. That's why we need help from the community. (Thanks!)

    It's possible your way doesn't work on our test platforms, as these platforms need to be told that UTF-8 is what's wanted.

    > This is the actual bug we encountered on the Nokia N800 device (Maemo
    > platform, GNU environment): there are locales like "de_DE", "en_US" etc and
    > they are all in UTF-8 codeset.

    OK. If you can provide links to documentation on this we'd appreciate it.

    > (Also, it is rude to override the locale the user has set for himself: if
    > he's not using UTF-8 (but would have an UTF-8 locale installed as part of
    > glibc, if he even knows that) but he's doesn't want to use UTF-8, why force
    > him to?)

    A UTF-8 path can't be read or written unless the locale is set to UTF-8, or so we found.

    > The easiest fix is to instead _check_ whether ZIP / UNZIP are run in an
    > UTF-8 locale:

    This may not be the case when files are extracted. We felt that file paths should be restored correctly regardless of the current character set, if the system supports UTF-8. The archive could have paths from various character sets that need to be restored.

    > char* codeset = nl_langinfo(CODESET); /* get the codeset used. for example "UTF-8". */

    This can vary from system to system.

    > if (strcmp(codeset, "UTF-8") == 0) {
    > using_utf8 = 1;
    > }

    Not sure this will work. Note that the locale is only being changed within the Zip program, not the locale of the system. At least that is the intent. We need some details of the problem to work it.

     
  • Ed Gordon

    Ed Gordon - 2008-08-01
    • summary: better was to check for UTF-8 --> better way to check for UTF-8
     
  • Danny Milosavljevic

    Logged In: YES
    user_id=765110
    Originator: NO

    Hi,

    gordone wrote:
    > Danny wrote:
    >> in ZIP:
    >> loc = setlocale(LC_CTYPE, "en_US.UTF-8");
    >> in UNZIP:
    >> loc = setlocale(LC_CTYPE, "en_GB.UTF-8");
    >> First, it is strange that it uses two different locales.

    >That's because the Zip and UnZip maintainers are in different countries.
    >Since the locale "UTF-8" does not exist, some specific locale was needed.
    >At least that's what we thought.

    I see. Well, the entire point of locales is to localize the environment to a country, so it makes sense that it includes a country name :)

    Unicode is a special case the locale people didn't design in, so you need a country setting for LC_CTYPE which is admittedly weird.

    (Note that most locale settings don't have anything to do with the charset so you need the country setting anyway - for example to select currency, decimal point representation, date representation, sort order, digit grouping, ...)

    >> Second, this fails on platforms that use only UTF-8 and no legacy stuff because they have no non-UTF8 locale and so the UTF-8 locale is called "en_US" and NOT "en_US.UTF-8" (it's just a free-form name, so...).

    >Didn't know that. On the test platforms we have the above worked fine.
    >That's why we need help from the community. (Thanks!)

    You're welcome :)

    I see. I mostly used older, more mature platforms before. They were all _migrating_ to UTF-8 and so still supported the older locales, which is why the problem of reusing the original names for the new encoding never surfaced.

    The platform the problem happens on is a new embedded platform and there it apparently made sense to get rid of the (big) legacy locales and only have UTF-8. -> bam, assumption broken :)

    >It's possible your way doesn't work on our test platforms, as these platforms need to be told that UTF-8 is what's wanted.

    That's what's supposed to happen, isn't it?
    Shouldn't a program:
    - as long as nobody sets a locale, use C (ASCII) encoding.
    - if somebody sets a locale, use the encoding specified in that locale.
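These two rules match how the C library behaves: every program starts out in the "C" locale, and the environment's choice only takes effect once the program calls setlocale() with an empty string. A small sketch of that, with function names invented for this example:

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: name of the LC_CTYPE locale currently in effect.
   Passing NULL only queries the locale; it changes nothing. */
const char *ctype_locale_name(void)
{
    const char *name = setlocale(LC_CTYPE, NULL);

    assert(name != NULL);   /* a pure query of a valid category should not fail */
    return name;
}

/* Prints the locale in effect now, then again after adopting the
   environment's choice. Called first thing in main(), the first line
   shows the startup locale. */
void show_locale_switch(void)
{
    printf("before: %s\n", ctype_locale_name());

    /* "" means "whatever LC_ALL / LC_CTYPE / LANG say". */
    if (setlocale(LC_CTYPE, "") == NULL)
        printf("environment names a locale that is not installed\n");
    else
        printf("after:  %s\n", ctype_locale_name());
}
```

On the N800 with LANG=de_DE, one would expect the first line to name the "C" locale and the second to name de_DE, with no ".UTF-8" suffix anywhere.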

    Ah, I think I see the misunderstanding:

    The _environment_ should tell the program that UTF-8 is what's wanted, not the program itself (the program doesn't set its own locale because how would it know what the system wanted to be used for input, display, file name storage, ...).

    So if there is no locale set on the test platform, it's not supposed to use any locale (as far as I understand).

    Newer systems that can handle the weird multibyte 8-bit byte sequences that are UTF-8 will have an UTF-8 locale set in the environment anyway.

    The only problem arises when one doesn't check the locale itself for the encoding, but parses the locale's name.
    Or tries to set a specific locale that wasn't specified by the user directly or indirectly (which seems weird to me, frankly).

    >> This is the actual bug we encountered on the Nokia N800 device (Maemo platform, GNU environment): there are locales like "de_DE", "en_US" etc and they are all in UTF-8 codeset.

    >OK. If you can provide links to documentation on this we'd appreciate it.

    I'm trying to find documentation about encoding on the Maemo platform, will get back to you.

    Using the shell, one gets:
    $ locale
    LANG=de_DE
    LC_CTYPE="de_DE"
    LC_NUMERIC="de_DE"
    LC_TIME="de_DE"
    LC_COLLATE="de_DE"
    LC_MONETARY="de_DE"
    LC_MESSAGES="de_DE"
    LC_PAPER="de_DE"
    LC_NAME="de_DE"
    LC_ADDRESS="de_DE"
    LC_TELEPHONE="de_DE"
    LC_MEASUREMENT="de_DE"
    LC_IDENTIFICATION="de_DE"
    LC_ALL=

    $ locale charmap
    UTF-8

    >> ... doesn't want to use UTF-8, why force him to?)

    >A UTF-8 path can't be read or written unless the locale is set to UTF-8, or so we found.

    Do you mean to the console? If the locale is not UTF-8, the console probably isn't UTF-8 capable either and so it will print gibberish, also after you set a UTF-8 locale :) Not that that happened to me, but just saying.

    I think you want to use iconv(3) to convert freely between charsets.

    >> The easiest fix is to instead _check_ whether ZIP / UNZIP are run in an UTF-8 locale:

    >This may not be the case when files are extracted. We felt that file paths should be restored correctly regardless of the current character set, if the system supports UTF-8. The archive could have paths from various character sets that need to be restored.

    Well, UNIX is weird in that way: The correct character set for the file system is apparently defined by the locale, with all the weird fun that entails :-)

    LANG dependent encoding
    <http://linux.derkeiler.com/Newsgroups/comp.os.linux.setup/2004-12/0849.html>

    GCC on file name encodings
    <http://gcc.gnu.org/ml/java-prs/2003-q1/msg00033.html>

    Global file name encodings suggested
    <http://mail.nl.linux.org/linux-utf8/2001-06/msg00244.html>.

    Funn...

    >> char* codeset = nl_langinfo(CODESET); /* get the codeset used. for example "UTF-8". */

    >This can vary from system to system.

    Hmm? Of course, isn't that the point?

    When you added UTF-8 support to ZIP, you wanted to make sure it also works on non-UTF-8 systems, right?
    With the UNIX libc authors, it was the same: as long as the locale doesn't use UTF-8, it's not supposed to use UTF-8 (because UTF-8 is unknown to the rest of the system and you could neither type it with the keyboard nor display it on screen anyway).

    >> if (strcmp(codeset, "UTF-8") == 0) {
    >> using_utf8 = 1;
    >> }

    >Not sure this will work.

    You mean whether this will work everywhere? It works here :)

    Similar code was already in the ZIP release, but it was commented out. I'm not sure why; does it cause trouble?

    >Note that the locale is only being changed within the Zip program, not the locale of the system. At least that is the intent.

    That's what it does.

    It fails on that platform though because a locale of that name does not exist. For all we know, the correct UTF-8 locale could be called "thingaboo" or "default" or whatever. I'm not sure what the objective is. If you want to convert text from UTF-8, use iconv(3).

    (btw, the problem in the ZIP/UNZIP code doesn't arise directly from the failed setlocale() call, it arises because "using_utf8 = 0").

    >We need some details of the problem to work it.

    I hope that helps :)

     
  • Danny Milosavljevic

    Logged In: YES
    user_id=765110
    Originator: NO

    $ locale -a
    C
    da_DK
    de_DE
    el_GR
    en_GB
    en_US
    es_ES
    es_MX
    fi_FI
    fr_CA
    fr_FR
    it_IT
    nl_NL
    no_NO
    POSIX
    pt_BR
    pt_PT
    ru_RU
    sv_SE

     
  • Ed Gordon

    Ed Gordon - 2008-08-02

    Logged In: YES
    user_id=1172496
    Originator: NO

    >>Since the locale "UTF-8" does not exist, some specific locale was needed.

    >>At least that's what we thought.

    >I see. Well, the entire point of locales is to localize the environment to
    >a country, so makes sense it includes a country name :)

    >Unicode is a special case the locale people didn't design in, so you
    >need a country setting for LC_CTYPE which is admittedly weird.

    Yep.

    >I see. I mostly used older maturer platforms before and they were all
    >_migrating_ to UTF-8 and thus still supporting the older locales and thus
    >the problem of using the original names for the new coding never surfaced.

    >The platform the problem happens on is a new embedded platform and there
    >it apparently made sense to get rid of the (big) legacy locales and only
    >have UTF-8. -> bam, assumption broken :)

    It seems an assumed standard was broken, that the new platform didn't stick
    with the way others were doing it. So that's something extra to design to.

    >>It's possible your way doesn't work on our test platforms, as these
    >>platforms need to be told that UTF-8 is what's wanted.

    >That's what's supposed to happen, isn't it?
    >Shouldn't a program:
    >- as long as nobody sets a locale, use C (ASCII) encoding.
    >- if somebody sets a locale, use the encoding specified in that locale.

    Yep. There's detailed documentation on how the different settings of a
    locale are used to process specific character set and other information.

    >Ah, I think I see the misunderstanding:

    >The _environment_ should tell the program that UTF-8 is what's wanted, not
    >the program itself (the program doesn't set its own locale because how
    >would it know what the system wanted to be used for input, display, file
    >name storage, ...).

    No, that's not quite how I interpret this. The locale set in the program sets
    how the program is asking to interact with the operating system. In the
    case where we have a UTF-8 string we want to write to the file system as
    UTF-8, we set the UTF-8 locale so that the standard file calls that write
    to the file system can interpret the UTF-8 strings. That's the main point
    of switching to UTF-8, so these paths on the file system can be read and
    written.

    Console I/O is a separate issue, and so far we've found that most console
    windows don't have the fonts loaded to display characters outside the
    current loaded character set. See how code pages are switched in and
    out to support different characters in the documentation. However, if
    a UTF-8 path is written by Zip using UTF-8 as the locale, then the file
    browser displays the path correctly, even if Japanese, which is what is
    intended.

    I'm not following what ill effects are resulting from how Zip does the
    processing, other than the effects of having the wrong locale for the
    target platforms.

    >So if there is no locale set on the test platform, it's not supposed to
    >use any locale (as far as I understand).

    There's always a locale. Indeed, the default locale for a C program is
    the standard C locale, which has limited character support. In
    general, the only way to know the OS calls will support UTF-8 is to set
    it to a UTF-8 locale.

    >Newer systems that can handle the weird multibyte 8-bit byte sequences
    >that are UTF-8 will have an UTF-8 locale set in the environment anyway.

    That's an assumption, at least from the view of a generic application like
    Zip that can be run on all sorts of platforms. Linux doesn't do that and
    UTF-8 must be set as the locale to get it.

    >The only problem arises when one doesn't check the locale itself for the
    >encoding, but parses the locale's name.
    >Or tries to set a specific locale that wasn't specified by the user
    >directly or indirectly (which seems weird to me, frankly).

    I think you're not following the key point that Zip (and UnZip) have UTF-8
    strings to read and write as paths on the file system. They don't really
    care what the user set the system locale to be, they just need to be
    able to read and write paths in UTF-8. On Linux, this happens just fine.

    >>> This is the actual bug we encountered on the Nokia N800 device (Maemo
    >>> platform, GNU environment): there are locales like "de_DE", "en_US" etc and
    >>> they are all in UTF-8 codeset.

    >>OK. If you can provide links to documentation on this we'd appreciate
    >>it.

    >I'm trying to find documentation about encoding on the Maemo platform,
    >will get back to you.

    OK. Appreciate that. This needs to be fixed so that all cases are covered
    and tested. This problem introduces new cases that apparently didn't exist
    while we were designing this code.

    >Using the shell, one gets:
    >$ locale
    >LANG=de_DE
    >LC_CTYPE="de_DE"
    ...
    >LC_COLLATE="de_DE"
    ...
    >LC_ALL=

    >$ locale charmap
    >UTF-8

    Yep. The CTYPE locale setting is important here. The collate setting I believe
    impacts how the strings are sorted in some cases and can be significant. The
    ALL setting can be used to set everything, but Zip and probably UnZip (still in
    development) don't care about the other settings and the local settings are left
    as is.

    >>> ... doesn't want to use UTF-8, why force him to?)

    UTF-8 is needed to write the paths stored in the archive. If the user doesn't
    want that to happen, they can disable Unicode support, either by recompiling
    without it or by using the option that disables it.

    Without UTF-8 support, some paths on the file system may not be readable
    (as they're in another character set) or writable (as the current locale does
    not support some of the characters in the paths).

    >>A UTF-8 path can't be read or written unless the locale is set to UTF-8,
    >>or so we found.

    >Do you mean to the console?

    No. The file system.

    >If the locale is not UTF-8, the console
    >probably isn't UTF-8 capable either and so it will print gibberish, also
    >after you set a UTF-8 locale :) Not that that happened to me, but just
    >saying.

    If your console can automatically handle Japanese paths (assuming it's
    not set for Japanese already), then great. Over here the fonts to show
    Japanese on the console are not there so I generally see spaces instead.

    >I think you want to use iconv(3) to convert freely between charsets.

    No, we looked at that and there's reasons we're not using it. First, the
    license is not compatible with the Info-ZIP license and we are resistant
    to force people to get the iconv library (not supported on all platforms)
    separately before they can compile our code. Second, to use iconv
    you need to know the official names of the character sets you are
    converting from and to, and that generally is not known for Zip paths.
    Third, iconv is not needed. UTF-8 is just fine for storing all character
    set characters and standard system calls are usually available for
    converting to and from the local character set and Unicode that are
    more available than iconv.
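For illustration only, here is a sketch of what a conversion without iconv could look like. The comment above does not say which calls Info-ZIP actually uses, so mbstowcs() is a guess at the kind of call meant, and the helper names utf8_encode() and local_to_utf8() are invented. The sketch also assumes that wchar_t holds Unicode code points, which is true where __STDC_ISO_10646__ is defined (glibc, for example) but not everywhere.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Returns the number of bytes written (1..4), or 0 if cp is a
   surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: convert a string in the current locale's
   encoding to UTF-8. Returns a malloc()ed string the caller must
   free(), or NULL if the input is not valid in the current locale. */
char *local_to_utf8(const char *local)
{
    size_t max, n, i, len = 0;
    wchar_t *wide;
    char *utf8;

    assert(local != NULL);
    max = strlen(local);                   /* wide chars never outnumber bytes */
    wide = malloc((max + 1) * sizeof *wide);
    utf8 = malloc(max * 4 + 1);            /* at most 4 UTF-8 bytes per char */
    if (wide == NULL || utf8 == NULL)
        goto fail;

    n = mbstowcs(wide, local, max + 1);    /* local charset -> wide chars */
    if (n == (size_t)-1)
        goto fail;

    for (i = 0; i < n; i++) {
        unsigned char buf[4];
        size_t k = utf8_encode((unsigned long)wide[i], buf);

        if (k == 0)
            goto fail;
        memcpy(utf8 + len, buf, k);
        len += k;
    }
    utf8[len] = '\0';
    free(wide);
    return utf8;

fail:
    free(wide);
    free(utf8);
    return NULL;
}
```

Note that mbstowcs() interprets its input according to the current LC_CTYPE locale, so the result depends on which locale the program has selected.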

    >>> The easiest fix is to instead _check_ whether ZIP / UNZIP are run in an
    >>> UTF-8 locale:

    >>This may not be the case when files are extracted. We felt that file
    >>paths should be restored correctly regardless of the current character set,
    >>if the system supports UTF-8. The archive could have paths from various
    >>character sets that need to be restored.

    >Well, UNIX is weird in that way: The correct character set for the file
    >system is apparently defined by the locale, with all the weird fun that
    >entails :-)

    >LANG dependent encoding
    ><http://linux.derkeiler.com/Newsgroups/comp.os.linux.setup/2004-12/0849.html>

    >GCC on file name encodings
    ><http://gcc.gnu.org/ml/java-prs/2003-q1/msg00033.html>

    >Global file name encodings suggested
    ><http://mail.nl.linux.org/linux-utf8/2001-06/msg00244.html>.

    >Funn...

    I'll try to check the above references later this weekend.

    Most modern Unix file systems store the paths on the drives as
    Unicode as far as I know. The OS then converts the paths to other
    character sets as needed. More importantly, if the UTF-8 locale is
    set, the OS handles converting the paths to whatever format is used
    for the file system.

    >>> char* codeset = nl_langinfo(CODESET); /* get the codeset used. for
    >>> example "UTF-8". */

    >>This can vary from system to system.

    >Hmm? Of course, isn't that the point?

    Well, yeah. However, there's no standard way to determine that the current
    locale is UTF-8, which there should be.

    It doesn't matter though, as setting a UTF-8 locale tells the OS that paths to
    OS calls are UTF-8 and the OS should do the right thing. If setting a UTF-8
    locale fails, the OS does not support it and Zip acts accordingly.

    >When you added UTF-8 support to ZIP, you wanted to make sure it also works
    >on non-UTF-8 systems, right?

    We want Zip to work if UTF-8 is supported or not.

    >With the UNIX libc authors, it was the same: as long as the locale doesn't
    >use UTF-8, it's not supposed to use UTF-8 (because UTF-8 is unknown to the
    >rest of the system and you could neither type it with the keyboard nor
    >display it on screen anyway).

    The point is to read and write file system paths so that directories with
    files in a mix of character sets can be restored. Zip does that now on
    Linux and Windows. Displaying fonts in console windows and being
    able to type in Japanese for instance would be good, but is not
    essential to archiving and restoring directories.

    It may be that Zip and UnZip need to use different locales for file system
    calls and for console calls, switching between them as needed. We
    didn't get that far but should look at that in Zip 3.1.

    >>> if (strcmp(codeset, "UTF-8") == 0) {
    >>> using_utf8 = 1;
    >>> }

    >>Not sure this will work.

    >You mean whether this will work everywhere? It works here :)

    It might work as a check for the current locale setting.

    If all you got is UTF-8, then great. Indeed, in your case there's
    no need to switch locales. On the Linux system here, if the locale
    is not switched to UTF-8, paths in UTF-8 don't work.

    >Similar code was already in the ZIP release, but it was commented out. I'm
    >not sure why, does it causes trouble?

    Probably. I'll have to look at it.

    >>Note that the locale is only being changed within the Zip program, not
    >>the locale of the system. At least that is the intent.

    >That's what it does.

    So far so good then.

    >>It fails on that platform though because a locale of that name does not
    >>exist.

    So it sounds like the fix is to wire in the correct locale name for that
    platform. Then everything else should work.

    >For all we know, the correct UTF-8 locale could be called
    >"thingaboo" or "default" or whatever. I'm not sure what the objective is.
    >If you want to convert text
    >from UTF-8, use iconv(3).

    The objective is to activate the OS UTF-8 support for file system calls.

    We don't want to convert text between anything other than UTF-8 and
    the local character set, and iconv is not needed for that.

    >(btw, the problem in the ZIP/UNZIP code doesn't arise directly from the
    >failed setlocale() call, it arises because "using_utf8 = 0").

    Sure. So send in a patch that fixes that for your target platforms. Be
    sure to put it in some #ifdef block specific for these platforms. Any
    patch that makes standard Linux platforms fail will be rejected.

    >I hope that helps :)

    Some.

    Thanks.

     
  • Danny Milosavljevic

    Logged In: YES
    user_id=765110
    Originator: NO

    Hi,

    First, excuse me for my long-winded post, yet again :)

    Summary:

    As a compromise, could we test the nl_langinfo CODESET first and if it's not UTF-8 already, set it to your hardcoded UTF-8 locale? That would work.
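A rough sketch of that compromise, not an actual patch: the function name select_utf8_locale() is made up, and the fallback name is simply the one Zip 3.0 hardcodes today.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Fallback locale name: the one zip.c currently passes to setlocale(). */
#define FALLBACK_UTF8_LOCALE "en_US.UTF-8"

/* Hypothetical helper: returns 1 if LC_CTYPE ends up in a UTF-8
   locale, 0 otherwise. */
int select_utf8_locale(void)
{
    const char *codeset;

    /* Step 1: honour the user's locale and check whether it is
       already UTF-8 (the N800 case: "de_DE" with a UTF-8 codeset). */
    if (setlocale(LC_CTYPE, "") != NULL) {
        codeset = nl_langinfo(CODESET);
        assert(codeset != NULL);
        if (strcmp(codeset, "UTF-8") == 0)
            return 1;
    }

    /* Step 2: otherwise try the hardcoded name, as Zip does now.
       If this fails, the locale from step 1 stays in effect. */
    if (setlocale(LC_CTYPE, FALLBACK_UTF8_LOCALE) != NULL)
        return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;

    return 0;
}
```

In zip.c the result would feed the existing flag, along the lines of `using_utf8 = select_utf8_locale();`.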

    Long-winded stuff (read only if you have a lot of time :)):

    gordone wrote:
    >It seems an assumed standard was broken, that the new platform didn't stick with the way others were doing it. So that's something extra to design to.

    Yeah, it's true that that's the first platform I've seen that didn't use the convention "en_US.UTF-8" for UTF-8 locales. Strange.

    >Danny wrote:
    >>The _environment_ should tell the program that UTF-8 is what's wanted, not the program itself (the program doesn't set its own locale because how would it know what the system wanted to be used for input, display, file name storage, ...).

    >No, that's not quite how I interpret this. The locale set in the program sets how the program is asking to interact with the operating system.

    The locale system is only known in the base library; the kernel doesn't care about it (If it did, my life would be a lot easier ;)). It doesn't magically convert before passing things to the kernel either.

    No really, check for yourself:

    in <http://ftp.gnu.org/gnu/glibc/glibc-2.6.1.tar.bz2>:
    - glibc "open": in "sysdeps/unix/sysv/linux/open64.c" line 28 ff.

    Hmm, of course it depends what you mean by "Operating System".
    In any case, how about: "The locale set in the program sets how the program is asking to interact with the base library as it pertains to sorting, formatting and error messages"? :)

    > In the case where we have a UTF-8 string we want to write to the file system as UTF-8, we set the UTF-8 locale so that the standard file calls that write to the file system can interpret the UTF-8 strings. That's the main point of switching to UTF-8, so these paths on the file system can be read and written.

    Heh, if that worked, I would be so grateful. Unfortunately UNIX is weird and there's no encoding information whatsoever in the kernel's open() / creat() system calls (nor even in the directory entry itself for some file systems) and so it couldn't pass the encoding information even if it wanted to.

    Hence the locale was a convention set by the system administrator specifying how the byte sequence (that is the file's name) was supposed to be interpreted.

    >Console I/O is a separate issue

    Ah, I misunderstood then, sorry.

    > However, if a UTF-8 path is written by Zip using UTF-8 as the locale, then the file browser displays the path correctly, even if Japanese, which is what is intended.

    Fair enough, it's convention by the GNOME people that file names are in UTF-8 except when the environment variable "G_BROKEN_FILENAMES" is set. No joke ;)

    Most UNIX text utilities just store the file name just how they got it from the terminal - in the argument of the command line (and that depends on the terminal, not the locale).

    > However, if a UTF-8 path is written by Zip using UTF-8 as the

    Written to disk? The kernel's open() system call doesn't care about the locale :(

    >I'm not following what ill effects are resulting from how Zip does the processing, other than the effects of having the wrong locale for the target platforms.

    A simple test of:
    1) save a ZIP file containing a file whose name has umlauts in MS Windows.
    2) restore the contents on the device.
    yields the file name's umlauts broken. (and what it displays with "unzip -v" is wrong, since it's presumably printing the Windows ANSI codepage bytes to an UTF-8 terminal)

    I checked the beta UNZIP and it may be because it never checks the extra field for the UTF-8 name and prefers that. Or is it because of the failed setlocale() call? In any case, if it just checked the current locale before changing it, it would see that by leaving the locale alone it would already have UTF-8 support.

    In any case, I read the ZIP source code to mean "if UTF-8 capable, just store the UTF-8 name into the archive's directory entry main name field". Which it now doesn't on the device.

    >>So if there is no locale set on the test platform, it's not supposed to use any locale (as far as I understand).
    >There's always a locale.

    Well, technically there is, but it's a backward compatibility measure for programs written at a time when there was in fact no locale at all, because locales had not been invented yet. And that's why when I said "no locale", I meant "practically no locale". :)

    >Indeed, the default locale for a C program is the standard C locale, which has limited character support. In general, the only way to know the OS calls will support UTF-8 is to set it to a UTF-8 locale.

    As a compromise, could we test the nl_langinfo CODESET first and if it's not UTF-8 already, set it to your hardcoded UTF-8 locale? That would work.

    >>Newer systems that can handle the weird multibyte 8-bit byte sequences that are UTF-8 will have an UTF-8 locale set in the environment anyway.

    >That's an assumption,

    It's an assumption of the same kind as this one: if the system supports locales at all, it will have a LANG or LC_* environment variable set; otherwise it won't. I'm not sure whether it's a safe assumption to make, but...

    >at least from the view of a generic application like Zip that can be run on all sorts of platforms. Linux doesn't do that and UTF-8 must be set as the locale to get it.

    I see your point that it must be robust and work even when the system is set up wrong.

    I'm sorry to be so nitpicky as to what happens where but locales are messy and so I feel being extra clear is better (if annoying :)).

    >I think you're not following the key point that Zip (and UnZip) have UTF-8 strings to read and write as paths on the file system. They don't really care what the user set the system locale to be, they just need to be able to read and write paths in UTF-8. On Linux, this happens just fine.

    Ah, it would be nice if it were so. The file system API (that is, the VFS interface of Linux) has no encoding field for file names and hence any locale you set or not set will just fall into the void. :)

    > Yep. The CTYPE locale setting is important here. The collate setting I believe impacts how the strings are sorted in some cases and can be significant.

    Yes.

    >UTF-8 is needed to write the paths stored in the archive.

    >Without UTF-8 support, some paths on the file system may not be readable (as they're in another character set) or writable (as the current locale does not support some of the characters in the paths).

    That's why paths in the kernel are opaque byte sequences with no explicit meaning field (or encoding or character set) or even implicit meaning as far as the kernel is concerned.

    >A UTF-8 path can't be read or written unless the locale is set to UTF-8, or so we found.

    The kernel doesn't care about the locale. It doesn't even know it.

    >>If the locale is not UTF-8, the console probably isn't UTF-8 capable either and so it will print gibberish, also after you set a UTF-8 locale :)

    >If your console can automatically handle Japanese paths (assuming it's not set for Japanese already), then great. Over here the fonts to show Japanese on the console are not there so I generally see spaces instead.

    I didn't mean fonts.

    I mean let's say as an extreme case you have a terminal that can do only 7 bit ASCII.
    You are logged into the system using that terminal and in a shell.

    Now you start a program that (sets the locale - who cares? - and) writes a UTF-8 sequence, let's say, \xc3\x96 using the write() system call on file descriptor 1. What do you want the console (which is far away and only gets \xc3\x96, nothing else) to do?

    Which is why I said that the locale settings (like the TERM environment variable too) are user settings (depending on the capabilities of the terminal the user is on right now) and the program just setting an UTF-8 locale doesn't help in that case, it still won't be able to print the name just the same.

    It's the same with file paths, if you open(\xc3\x96), the kernel will merrily create a file that is named \xc3\x96 but if you type "ls" and your locale is "C" (which it reasonably is, given my hypothetical 7-bit ASCII terminal), you'll not see the correct UTF-8 character (in fact ls could even just sanity check and err out).

    Then you suggested to set the locale to UTF-8 which makes the situation little better:

    "ls" will be able to sort the names correctly - taking into account that \xc3\x96 is supposed to come after "O" and before "P" - but then it tries to print it, write(1, \xc3\x96), and since the locale is UTF-8 capable "ls" doesn't err out and so tells the terminal "write \xc3\x96" and the terminal goes "huh??" and probably writes something like \x67\x22 ('g"') or beeps wildly ;)

    Note that in no case the name of the file (the byte sequence passed to the kernel) was any different.
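
    As a rough illustration of that last point, here is a minimal sketch. It assumes Linux with a byte-transparent file system such as ext4 or tmpfs; the helper name name_roundtrips is made up for this example and is not part of Zip. It creates a file from raw bytes under a given locale and then checks that readdir() hands back exactly the same bytes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create dir/raw_name after selecting locale_name,
   then scan dir for an entry with byte-identical name.
   Returns 1 if found, 0 if not found, -1 on error. */
int name_roundtrips(const char *dir, const char *raw_name, const char *locale_name)
{
    char path[4096];
    DIR *d;
    struct dirent *e;
    int fd, n, found = 0;

    assert(dir != NULL && raw_name != NULL && locale_name != NULL);

    /* If the locale is not installed, setlocale() fails and the locale
       stays as it was; that is fine for this demonstration. */
    setlocale(LC_ALL, locale_name);

    n = snprintf(path, sizeof path, "%s/%s", dir, raw_name);
    if (n < 0 || n >= (int)sizeof path)
        return -1;

    /* The kernel receives the bytes as they are; no locale is involved. */
    fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    d = opendir(dir);
    if (d == NULL) {
        unlink(path);
        return -1;
    }
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, raw_name) == 0)
            found = 1;
    }
    closedir(d);
    unlink(path);
    return found;
}
```

    Calling it once with "C" and once with a UTF-8 locale name should give the same result on such a system, since only user-space tools like "ls" interpret the bytes.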

    >we looked at that [iconv] and there's reasons we're not using it. [license, unmet dependencies]

    Fair enough.

    >you need to know the official names of the character sets you are converting from and to, and that generally is not known for Zip paths.

    That would be exactly what I would store in the directory entry: the name of the character set the name is in. But always using UTF-8 is a good compromise and good enough for all cases I can think of, with most systems using it anyway and all.

    >UTF-8 is just fine for storing all character set characters and standard system calls are usually available for converting to and from the local character set and Unicode that are more available than iconv.

    Really? Where?

    >Most modern Unix file systems store the paths on the drives as Unicode as far as I know. The OS then converts the paths to other character sets as needed.

    What is "the OS"?

    >More importantly, if the UTF-8 locale is set, the OS handles converting the paths to whatever format is used for the file system.

    No. Whatever byte array you pass to the open() system call, it stays just as it was until it hits the VFS in the kernel.

    >>> char* codeset = nl_langinfo(CODESET); /* get the codeset used. for example "UTF-8". */

    >Well, yeah. However, there's no standard way to determine that the current locale is UTF-8, which there should be.

    I'm not trying to be annoying, maybe I'm just losing it, but doesn't the above code do exactly that?

    With "It can vary from system to system" I understood "if the system is UTF-8 capable the result will be 'UTF-8' and otherwise it will be something else" - hence it varies from system to system. Did I get it wrong?

    I can check the SUSv2 specification on the matter if you want, but I'm pretty sure it's exactly what it does.

    Locale Naming Guideline for Linux <http://www.openi18n.org/docs/text/LocNameGuide-V10.txt>.

    UTF-8 and Unicode FAQ for Unix/Linux <http://www.cl.cam.ac.uk/~mgk25/unicode.html>:
    >>>if (((s = getenv("LC_ALL")) && *s) ||
    >>> ((s = getenv("LC_CTYPE")) && *s) ||
    >>> ((s = getenv("LANG")) && *s)) {
    >>> if (strstr(s, "UTF-8"))
    >>> utf8_mode = 1;
    >>> }
    >>>"This relies of course on all UTF-8 locales having the name of the encoding in their name, which is not always the case, therefore the nl_langinfo() query is clearly the better method."

    > It doesn't matter though, as setting a UTF-8 locale tells the OS that paths to OS calls are UTF-8 and the OS should do the right thing. If setting a UTF-8 locale fails, the OS does not support it and Zip acts accordingly.

    In my case, it does support UTF-8, as you saw in the terminal output. It's just not found since it has a different name.

    <http://www.cl.cam.ac.uk/~mgk25/ucs/norm_charmap.c>
    >>> "Unfortunately the names used by the CODESET are not yet standardized".

    Note that it doesn't have any special case for matching multiple different spellings of "UTF-8".

    >>When you added UTF-8 support to ZIP, you wanted to make sure it also works on non-UTF-8 systems, right?

    >We want Zip to work if UTF-8 is supported or not.

    Which is why the "using_utf8" variable is there to stay.
    Good to know.

    >The point is to read and write file system paths so that directories with files in a mix of character sets can be restored. Zip does that now on Linux and Windows. Displaying fonts in console windows and being able to type in Japanese for instance would be good, but is not essential to archiving and restoring directories.

    >It may be that Zip and UnZip need to use different locales for file system calls and for console calls, switching between them as needed. We didn't get that far but should look at that in Zip 3.1.

    >>> if (strcmp(codeset, "UTF-8") == 0) {
    >>> using_utf8 = 1;
    >>> }

    >It might work as a check for the current locale setting.

    > On the Linux system here, if the locale is not switched to UTF-8, paths in UTF-8 don't work.

    O_O

    I want to see that. I want to believe that while Ulrich Drepper does a lot of things, he does not modify the base library in a way that makes path names break sometimes :)

    >>Similar code was already in the ZIP release, but it was commented out. I'm not sure why, does it cause trouble?

    >Probably. I'll have to look at it.

    Thanks.

    >So it sounds like the fix is to wire in the correct locale name for that platform. Then everything else should work.

    I would suggest doing the nl_langinfo and if that didn't report "UTF-8", use setlocale().

    >The objective is to activate the OS UTF-8 support for file system calls.

    There is no such OS UTF-8 support for file system calls.

    >We don't want to convert text between anything other than UTF-8 and the local character set, and iconv is not needed for that.

    >Sure. So send in a patch that fixes that for your target platforms. Be sure to put it in some #ifdef block specific for these platforms. Any patch that makes standard Linux platforms fail will be rejected.

    How about a patch that first uses nl_langinfo and then, if that's not UTF-8, setlocale()? If either succeeds, set "using_utf8 = 1", that is:

    -------------------------------------
    #include <locale.h>
    #include <langinfo.h>
    #include <string.h>

    setlocale(LC_CTYPE, ""); /* tell the base library that we support locales; this loads the locale the user has selected */

    char* codeset = nl_langinfo(CODESET); /* get the codeset currently in use, for example "UTF-8" */

    if (strcmp(codeset, "UTF-8") == 0) {
        using_utf8 = 1; /* the user's own locale is already UTF-8 */
    } else {
        if (setlocale(LC_CTYPE, "en_US.UTF-8") != NULL) {
            using_utf8 = 1; /* fall back to the hardcoded UTF-8 locale */
        }
    }
    -------------------------------------
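
    For trying this outside of the Zip sources, the same idea can be wrapped in a standalone function. This is only a sketch: the name detect_utf8 and its fallback_locale parameter are invented for the example, and unlike the snippet above it re-checks the codeset after the fallback setlocale() call instead of trusting the locale name.

```c
#define _POSIX_C_SOURCE 200809L
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the process ends up in a UTF-8
   LC_CTYPE locale, 0 otherwise. fallback_locale may be NULL. */
int detect_utf8(const char *fallback_locale)
{
    const char *codeset;

    /* Load whatever the user selected via LC_ALL, LC_CTYPE or LANG. */
    setlocale(LC_CTYPE, "");
    codeset = nl_langinfo(CODESET);
    if (codeset != NULL && strcmp(codeset, "UTF-8") == 0)
        return 1;

    /* The user's locale is not UTF-8: try the hardcoded name, then ask
       the library again which codeset is actually in effect. */
    if (fallback_locale != NULL && setlocale(LC_CTYPE, fallback_locale) != NULL) {
        codeset = nl_langinfo(CODESET);
        return codeset != NULL && strcmp(codeset, "UTF-8") == 0;
    }
    return 0;
}
```

    On a platform like Maemo, where "en_US" is already a UTF-8 locale, the first check should succeed and the hardcoded name is never tried.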

    Thanks for your time.

     
  • Ed Gordon

    Ed Gordon - 2008-09-28

    Sorry about the delay, but the list of things to do is getting long lately. I hope to reply to this shortly, and it looks like something can be done. A couple of thoughts though. It seems we are missing slightly in interpreting what each is saying, so I need to step through my reply more carefully. Also, the functioning of Zip and the UnZip beta were worked out through research and much trial and error. I twinge when you suggest something should work in some way and I have experience to the contrary. Anyway, I should have time tomorrow to work through this.

     
  • Danny Milosavljevic

    Happy new year :-)

    >It seems we are missing slightly in interpreting what each is saying, so I need to step through my reply more carefully.

    Yeah, encoding issues are complex and I probably misunderstood part of what you said.

    >Also, the functioning of Zip and the UnZip beta were worked out through research and much trial and error.

    > I twinge when you suggest something should work in some way and I have experience to the contrary.

    Oh no, I'm sorry, I didn't mean it like that.

    I appreciate your effort to have ZIP/UNZIP work on so many platforms. And I know that there are tradeoffs to be made when being cross-platform, and quirks on each platform.

    I'm quite OK with having an "#ifdef" for the Maemo platform. (Note that the setlocale call for Maemo would be 'setlocale(LC_CTYPE, "en_US")'; this will set a UTF-8 locale, and indeed everything works then, *(only) since the entire platform is UTF-8 anyway*.) But please note that this problem is indicative of a bigger problem that will come back to you on other platforms.

    You are right that getting things to work in practise is a lot harder than just saying "this is how it should be". I am also a practical person, hence I checked the actual code places in the libc, kernel etc where things happen and didn't go by some high-horse theoretical framework. You'll notice that nowhere in the entire chain from your program calling open() up to the kernel filesystem code doing the lookup() there is any charset conversion or even charset information. Hence I found it interesting that your tests on the Linux box failed depending on locale when there is clearly no locale information anywhere in the call stack. I'm not saying it didn't happen, but I'd really like to see the call stack from gdb in the failure case :-)

     
  • Danny Milosavljevic

    So the minimal, lowest-risk architecture-specific patch (when you really don't want to touch anything else) is:

    Cut here-----------------
    --- orig/zip30/zip.c 2008-07-05 12:34:06.000000000 +0200
    +++ zip30/zip.c 2009-01-02 22:05:55.000000000 +0100
    @@ -2211,7 +2211,11 @@ char **argv; /* command line
    printf(" Initial language locale = '%s'\n", loc);
    */

    +#ifdef MAEMO
    + loc = setlocale(LC_CTYPE, "");
    +#else
    loc = setlocale(LC_CTYPE, "en_US.UTF-8");
    +#endif

    /*
    printf("langinfo %s\n", nl_langinfo(CODESET));
    Cut here-----------------

    I added the preprocessor symbol "MAEMO" to our Maemo autobuilder rules file in order to be able to distinguish the platform:
    $(MAKE) -f unix/Makefile flags CC="$(CC)" CFLAGS_NOOPT="$(CFLAGS) -DMAEMO"

     
  • Ed Gordon

    Ed Gordon - 2009-01-10

    >So the minimal, lowest-risk architecture-specific patch (when you really
    >don't want to touch anything else) is:
    >
    >Cut here-----------------
    >--- orig/zip30/zip.c 2008-07-05 12:34:06.000000000 +0200
    >+++ zip30/zip.c 2009-01-02 22:05:55.000000000 +0100
    >@@ -2211,7 +2211,11 @@ char **argv; /* command line
    > printf(" Initial language locale = '%s'\n", loc);
    > */
    >
    >+#ifdef MAEMO
    >+ loc = setlocale(LC_CTYPE, "");
    >+#else
    > loc = setlocale(LC_CTYPE, "en_US.UTF-8");
    >+#endif
    >
    > /*
    > printf("langinfo %s\n", nl_langinfo(CODESET));
    >Cut here-----------------
    >
    >I added the preprocessor symbol "MAEMO" to our Maemo autobuilder rules
    >file in order to be able to distinguish the platform:
    > $(MAKE) -f unix/Makefile flags CC="$(CC)" CFLAGS_NOOPT="$(CFLAGS)
    >-DMAEMO"

    Looks good. If nothing else works, this should be good for adding to the next
    Zip 3.1 beta. (Been distracted trying to help get UnZip 6.0 out, but Zip 3.1
    should start showing progress later this month.)

    As an aside, it seems always useful to be able to distinguish a specific platform,
    so thanks for adding the MAEMO symbol.

    -------------------------------------------

    >Happy new year :-)

    And to you. Happy new year!

    >> I twinge when you suggest something should work in some way and I have
    >experience to the contrary.
    >
    >Oh no, I'm sorry, I didn't mean it like that.

    No problem. It's just our job to make sure (as best we can) we don't break
    anything as we add things.

    >I'm quite OK with having an "#ifdef" for the Maemo platform (note that the
    >setlocale call for Maemo would be 'setlocale(LC_CTYPE, "en_US")' - this
    >will set an UTF-8 locale and indeed everything works then *(only) since the
    >entire platform is UTF-8 anyway* -,

    OK.

    >but please note that this problem is
    >indicative of a bigger problem that will come back to you on other
    >platforms.

    Maybe. Hard to guess what might come to be, but so far yours is
    the only exception. Then again, it might be starting a trend.

    >You'll
    >notice that nowhere in the entire chain from your program calling open() up
    >to the kernel filesystem code doing the lookup() there is any charset
    >conversion or even charset information. Hence I found it interesting that
    >your tests on the Linux box failed depending on locale when there is
    >clearly no locale information anywhere in the call stack. I'm not saying it
    >didn't happen, but I'd really like to see the call stack from gdb in the
    >failure case :-)

    I think the actual interpretation may be in the file system. Different
    file systems support Unicode and other character sets in different ways.
    I don't know that much about it, but Unicode support is definitely
    dependent on the file system.

    -------------------------------------------

    >As a compromise, could we test the nl_langinfo CODESET first and if it's
    >not UTF-8 already, set it to your hardcoded UTF-8 locale? That would work.

    Yeah, that might. Again, I've been distracted recently and my replies might
    reflect that, but it seems that may be a good way to do it.

    >>>The _environment_ should tell the program that UTF-8 is what's wanted,
    >not the program itself (the program doesn't set its own locale because how
    >would it know what the system wanted to be used for input, display, file
    >name storage, ...).
    >
    >>No, that's not quite how I interpret this. The locale set in the program
    >sets how the program is asking to interact with the operating system.
    >
    >The locale system is only known in the base library; the kernel doesn't
    >care about it (If it did, my life would be a lot easier ;)). It doesn't
    >magically convert before passing things to the kernel either.

    But the file system apparently does.

    >> In the case where we have a UTF-8 string we want to write to the file
    >system as UTF-8, we set the UTF-8 locale so that the standard file calls
    >that write to the file system can interpret the UTF-8 strings. That's the
    >main point of switching to UTF-8, so these paths on the file system can be
    >read and written.
    >
    >Heh, if that worked, I would be so grateful.

    It does. At least in the UNIX testing I've done.

    >Unfortunately UNIX is weird
    >and there's no encoding information whatsoever in the kernel's open() /
    >creat() system calls (nor even in the directory entry itself for some file
    >systems) and so it couldn't pass the encoding information even if it wanted
    >to.

    I don't know. I assume the file system can get this information if it needs
    it. It's clear that when Unicode is set as the character set, characters in
    various languages can be set in the file paths, and when it isn't, only
    characters in the current character set can be read and written using
    file system calls. So the system interprets character codes differently
    based on the locale set.

    >>Console I/O is a separate issue
    >
    >Ah, I misunderstood then, sorry.

    We hope to tackle that at some point. On UNIX there's the code page issue
    and on Windows there are specific calls for handling UTF-16 console I/O
    that would need wrapping to fit into the Zip and UnZip sources.

    >> However, if a UTF-8 path is written by Zip using UTF-8 as the
    >
    >Written to disk? The kernel's open() system call doesn't care about the
    >locale :(

    But something downstream does. If UTF-8 is not set as the locale,
    then giving open() UTF-8 strings does not work well.

    >A simple test of:
    >1) save a ZIP file containing a file whose name has umlauts in MS
    >Windows.

    Zip 3.0 has been tested to do that.

    >2) restore the contents on the device.

    How?

    >yields the file name's umlauts broken. (and what it displays with "unzip
    >-v" is wrong, since it's presumably printing the Windows ANSI codepage
    >bytes to an UTF-8 terminal)

    The current UnZip does not support Unicode. UnZip 6.0 should.

    Also, console I/O is seemingly always in the current code page (because
    of font support). The actual path is visible in Windows Explorer and the
    various Linux file navigation tools.

    >I checked the beta UNZIP and it may be because it never checks the extra
    >field for the UTF-8 name and prefers that. Or is it because of the failed
    >setlocale() call?

    Probably. When the locale can't be set, things don't work.

    >In any case, if it would just check the current locale
    >before changing the locale, it would see that if it were to just leave it
    >alone, it would have UTF-8 support.

    Yeah, I think that's the best approach (better than the patch at the top).
    This should get into Zip 3.1 shortly. UnZip 6.0 is about to go out so
    I'm not sure if it will get in before UnZip 6.1.

    >In any case, I read the ZIP source code to mean "if UTF-8 capable, just
    >store the UTF-8 name into the archive's directory entry main name field".

    Yeah. This uses a new UTF-8 flag to say that the standard path is UTF-8.
    This flag is used by an unzip to know to handle the path as UTF-8.

    >Which it now doesn't on the device.

    Probably because it can't determine that UTF-8 is to be used.

    >>>So if there is no locale set on the test platform, it's not supposed to
    >use any locale (as far as I understand).
    >>There's always a locale.
    >
    >Well, technically there is, but it's a backward compatibility measure for
    >the programs written in a time when there was in fact no locale at all
    >because locales were not invented yet. And that's why when I said "no
    >locale", I meant "practically no locale". :)

    We require locale support before enabling Unicode support. If there is
    no locale support, we don't try to do Unicode.

    >It's an assumption of the same kind like if the system supports locales at
    >all, it will have a LANG or LC_* environment variable set at all, otherwise
    >not. I'm not sure whether it's a safe assumption to make, but...

    If a port does not provide sufficient locale support, we don't support Unicode
    there.

    >>>If the locale is not UTF-8, the console probably isn't UTF-8 capable
    >either and so it will print gibberish, also after you set a UTF-8 locale
    >:)

    Most consoles don't seem to handle Unicode, just the current code
    page. The goal is to read and write the correct paths on the file system,
    which usually supports Unicode better than the console.

    >I mean let's say as an extreme case you have a terminal that can do only 7
    >bit ASCII.
    >You are logged into the system using that terminal and in a shell.
    >
    >Now you start a program that (sets the locale - who cares? - and) writes a
    >UTF-8 sequence, let's say, \xc3\x96 using the write() system call on file
    >descriptor 1. What do you want the console (which is far away and only gets
    >\xc3\x96, nothing else) to do?

    Zip and UnZip provide an ASCII escape sequence when a character is
    not supported in the current character set. These are basically the
    Unicode character code in ASCII hex characters, like #U0393 for the
    Greek capital gamma.
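
    A minimal sketch of producing such an escape follows. It is not the code from the Zip sources: the function name escape_code_point is made up, only the four-digit "#U" form mentioned above is handled, and the use of lowercase hex letters is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write "#U" followed by four hex digits for
   code_point into buf. Returns buf, or NULL if the code point needs
   more than four hex digits or the buffer is too small. */
char *escape_code_point(unsigned long code_point, char *buf, size_t size)
{
    int n;

    assert(buf != NULL);

    if (code_point > 0xFFFFUL)
        return NULL;

    /* Letter case of the hex digits is an assumption of this sketch. */
    n = snprintf(buf, size, "#U%04lx", code_point);
    if (n < 0 || (size_t)n >= size)
        return NULL;
    return buf;
}
```

    With a buffer of at least 7 bytes, escape_code_point(0x0393, buf, sizeof buf) fills buf with the "#U0393" form described above.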

    >Which is why I said that the locale settings (like the TERM environment
    >variable too) are user settings (depending on the capabilities of the
    >terminal the user is on right now) and the program just setting an UTF-8
    >locale doesn't help in that case, it still won't be able to print the name
    >just the same.

    To print a name on the console would require switching in the right code page.
    Until consoles support Unicode fully, proper display of all paths on the
    console may not be practical.

    >It's the same with file paths, if you open(\xc3\x96), the kernel will
    >merrily create a file that is named \xc3\x96 but if you type "ls" and your
    >locale is "C" (which it reasonably is, given my hypothetical 7-bit ASCII
    >terminal), you'll not see the correct UTF-8 character (in fact ls could
    >even just sanity check and err out).

    Yep.

    >Then you suggested to set the locale to UTF-8 which makes the situation
    >little better:
    >
    >"ls" will be able to sort the names correctly - taking into account that
    >\xc3\x96 is supposed to come after "O" and before "P" - but then it tries
    >to print it, write(1, \xc3\x96), and since the locale is UTF-8 capable "ls"
    >doesn't err out and so tells the terminal "write \xc3\x96" and the terminal
    >goes "huh??" and probably writes something like \x67\x22 ('g"') or beeps
    >wildly ;)

    Yep.

    >Note that in no case the name of the file (the byte sequence passed to the
    >kernel) was any different.

    Yep. The system just interprets it differently.

    >>we looked at that [iconv] and there's reasons we're not using it.
    >[license, unmet dependencies]
    >
    >Fair enough.

    iconv is nice for what it does, but is not practical and is not needed
    for what Zip and UnZip need to do.

    >>you need to know the official names of the character sets you are
    >converting from and to, and that generally is not known for Zip paths.
    >
    >That would be exactly what I would store in the directory entry: the name
    >of the character set the name is in.

    This becomes a problem because the names of character sets are not
    consistent between operating systems. It's also not needed as long as Unicode is
    used as the common character set.

    >But always using UTF-8 is a good
    >compromise and good enough for all cases I can think of, with most systems
    >using it anyway and all.

    It's also now the zip standard, as of the last year.

    >>UTF-8 is just fine for storing all character set characters and standard
    >system calls are usually available for converting to and from the local
    >character set and Unicode that are more available than iconv.
    >
    >Really? Where?

    The standard ANSI multi-byte to wide character and wide character to
    multi-byte calls.
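
    For readers unfamiliar with those calls, here is a small sketch built on the standard mbstowcs() function. The wrapper name local_to_wide is invented for the example. The conversion follows whatever LC_CTYPE locale the program has set, and whether the resulting wchar_t values are Unicode code points depends on the platform (they are on glibc), so treat that part as an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: convert a string in the current locale's
   encoding to wide characters. Returns the number of wide characters
   written (not counting the terminator), or (size_t)-1 if the input is
   not valid in the current LC_CTYPE locale. */
size_t local_to_wide(const char *local, wchar_t *wide, size_t max_wide)
{
    size_t n;

    assert(local != NULL && wide != NULL && max_wide > 0);

    /* Leave room for the terminator; mbstowcs() does not add one when
       it stops because the output limit was reached. */
    n = mbstowcs(wide, local, max_wide - 1);
    if (n == (size_t)-1)
        return n;
    wide[n] = L'\0';
    return n;
}
```

    The reverse direction works the same way with wcstombs(). Both only convert between the wide form and the encoding of the locale currently in effect, which is why the locale setting matters for them.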

    >>Most modern Unix file systems store the paths on the drives as
    >Unicode as far as I know. The OS then converts the paths to other
    >character sets as needed.
    >
    >What is "the OS"?

    Anything the application (Zip or UnZip) is running on, including
    kernel, file system, everything. The above calls know what the local
    character set is and handle the conversions.

    >>More importantly, if the UTF-8 locale is set, the OS handles converting
    >the paths to whatever format is used for the file system.
    >
    >No. Whatever byte array you pass to the open() system call, it stays just
    >as it was until it hits the VFS in the kernel.

    The local path is converted to Unicode, the character set is set to Unicode,
    then the normal calls like open() all work with Unicode transparently.

    >>>> char* codeset = nl_langinfo(CODESET); /* get the codeset used. for
    >example "UTF-8". */
    >
    >>Well, yeah. However, there's no standard way to determine that the
    >current locale is UTF-8, which there should be.
    >
    >I'm not trying to be annoying, maybe I'm just losing it, but doesn't the
    >above code do exactly that?

    I'm not sure all ports that support locales have this call. I'll have to
    look.

    >With "It can vary from system to system" I understood "if the system is
    >UTF-8 capable the result will be 'UTF-8' and otherwise it will be something
    >else" - hence it varies from system to system. Did I get it wrong?

    Maybe not. All systems I have worked with do not start out in
    Unicode mode.

    >Locale Naming Guideline for Linux
    ><http://www.openi18n.org/docs/text/LocNameGuide-V10.txt>.
    >
    >UTF-8 and Unicode FAQ for Unix/Linux
    ><http://www.cl.cam.ac.uk/~mgk25/unicode.html>:
    >>>>if (((s = getenv("LC_ALL")) && *s) ||
    >>>> ((s = getenv("LC_CTYPE")) && *s) ||
    >>>> ((s = getenv("LANG")) && *s)) {
    >>>> if (strstr(s, "UTF-8"))
    >>>> utf8_mode = 1;
    >>>> }
    >>>>"This relies of course on all UTF-8 locales having the name of the
    >encoding in their name, which is not always the case, therefore the
    >nl_langinfo() query is clearly the better method."

    I'll have to check.

    >>It may be that Zip and UnZip need to use different locales for file
    >system calls and for console calls, switching between them as needed. We
    >didn't get that far but should look at that in Zip 3.1.
    >
    >>>> if (strcmp(codeset, "UTF-8") == 0) {
    >>>> using_utf8 = 1;
    >>>> }
    >
    >>It might work as a check for the current locale setting.

    Will check this.

    >> On the Linux system here, if the locale
    >is not switched to UTF-8, paths in UTF-8 don't work.
    >
    >O_O
    >
    >I want to see that. I want to believe that while Ulrich Drepper does a lot
    >of things, he does not modify the base library in a way that makes path
    >names break sometimes :)

    It's the interpretation of the paths.

    >I would suggest doing the nl_langinfo and if that didn't report "UTF-8",
    >use setlocale().

    OK. I'm starting to agree with that.

    >-------------------------------------
    >setlocale(LC_CTYPE, ""); /* tell base library that we support locales.
    >This will load the locale the user has selected. */
    >
    >char* codeset = nl_langinfo(CODESET); /* get the codeset currently used.
    >for example "UTF-8". */
    >
    >if (strcmp(codeset, "UTF-8") == 0) {
    > using_utf8 = 1;
    >} else {
    > if (setlocale(LC_CTYPE, "en_US.UTF-8") != NULL) {
    > using_utf8 = 1;
    > }
    >}
    >-------------------------------------

    Something like that.

    >Thanks for your time.

    And yours.

    Keep an eye out for the next Zip 3.1 beta.

     
