Re: [Mingw-users] Quoted wildcards in arguments to MinGW programs and bug 3482704 (Unicode)


On 10-5-2012 22:37, Keith Marshall wrote:
> On 10/05/12 08:36, waterlan wrote:
>> How about globbing when there are files with Unicode characters?
> The current implementation supports only ASCII file names.  It had
> occurred to me that a further UTF-16LE implementation may be desirable,
> but that will be for "later".
>
>> The normal MSVCRT.DLL globbing doesn't work. You get a question mark
>> for the Unicode characters, and when your program tries to open the file
>> you get 'file not found'.
> Sure.  You need a UTF-16LE implementation to support UTF-16LE file
> names.  However, the existing MinGW start-up code supports only the
> ASCII name space, and my present focus is on replacing that.

To be precise, MinGW programs support the 8-bit ANSI name space, which
is more than 7-bit ASCII.

>
>> So what I do to read Unicode arguments is this:
>>
>>       wchar_t *cmdstr;
>>       wchar_t **wargv;
>>
>>       cmdstr = GetCommandLineW();
>>       wargv = CommandLineToArgvW(cmdstr,&argc);
>>
>> Actually I don't know if this does any globbing.
> I don't either.  I suspect not, since there's no selector to enable it,

It does not; I checked. It returns the parameters unglobbed. That is not
a problem for my wcd program, where I use this, but for dos2unix I would
like to have globbing with Unicode support, because users typically
type: unix2dos *.txt. At the moment that fails on every file name that
contains characters outside the active ANSI code page. I suspect this is
a problem for many Asian users.
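
To make clear what "globbing with Unicode support" would have to do for
each directory entry, here is a minimal sketch of the matching step on
wide strings. It is only an illustration: wild_match() is a hypothetical
name, not part of MinGW or MSVCRT, and it handles only '*' and '?',
compares case-sensitively (Windows file names normally are not), and
does no directory scanning at all.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: return 1 if name matches pat, else 0.
 * '*' matches any run of characters (including none),
 * '?' matches exactly one character.
 * No bracket expressions, no escapes, case-sensitive. */
int wild_match(const wchar_t *pat, const wchar_t *name)
{
    const wchar_t *star = NULL;   /* pattern position after the last '*' */
    const wchar_t *retry = NULL;  /* name position to resume from on backtrack */

    assert(pat != NULL && name != NULL);

    while (*name) {
        if (*pat == L'*') {
            star = ++pat;         /* remember where to restart the pattern */
            retry = name;
        } else if (*pat == L'?' || *pat == *name) {
            ++pat;
            ++name;
        } else if (star) {
            pat = star;           /* let the last '*' swallow one more char */
            name = ++retry;
        } else {
            return 0;
        }
    }
    while (*pat == L'*')          /* trailing stars match the empty rest */
        ++pat;
    return *pat == L'\0';
}
```

A program that gets its arguments from CommandLineToArgvW() would still
have to enumerate the directory itself and call something like this for
every entry.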

> as there would be if you used __wgetmainargs() instead.  However, if you
> were to make that change, I would expect similarly broken globbing to
> that exhibited by __getmainargs(), in the ASCII case; the primary
> purpose of this effort is to fix that, while also adding a few more
> POSIX-like enhancements as a secondary (optional) benefit.
>
>> Perhaps it would be handy if you add another glob function that
>> automatically returns all file names (including Unicode names) in UTF-8
> UTF-8?  What's that?  Okay, I do know, really; just couldn't resist a
> dig at the blatantly misinformed brainwashing Microsoft subject their
> users to, when they attempt to convince us that
>
>     Unicode == UTF-16LE
>
> (as if UTF-32, UTF-32LE, UTF-32BE, UTF-16, UTF-16BE, and UTF-8 either
> don't exist at all, or somehow, don't qualify as Unicode formats).
>
>> instead of ANSI code. This is then still ASCII compatible, and if your
>> program needs to support Unicode on Windows the program can convert the
>> UTF-8 arguments to wide characters.
> When I have the glob(3) implementation working to my satisfaction, I may
> consider adapting it to provide a UTF-16LE based _wglob() variant; I
> have no plans for a UTF-8 specific variant.  (Actually, there's no
> particular reason why the present version wouldn't work with UTF-8, as
> is; the limitation would stem from the capability of _findfirst() and
> _findnext(), as used by opendir(3) and readdir(3) in libmingwex.a's
> dirent implementation, to return file names which include non-ASCII
> 8-bit characters).
>

_findfirst() and _findnext() support 8-bit (ANSI) characters. In theory
UTF-8 should not be a problem, but as you said, MS 'wisely' decided to
use UTF-16, and we have to live with it. NTFS stores file names in
UTF-16. Non-Unicode programs get a translated result from the OS, with
characters that are not in the active code page replaced by question
marks. On Unix, in contrast, the file system is ignorant of any encoding
scheme and file names are just a stream of bytes.
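
A toy model of that translation shows why the resulting name can no
longer be opened. This is not what Windows really runs: to_8bit_lossy()
is a hypothetical name, and the sketch assumes a Latin-1-like code page
in which U+0000..U+00FF map to the same byte value, which real code
pages such as Windows-1252 do not do exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model of a lossy UTF-16 to 8-bit conversion.
 * ASSUMPTION: Latin-1-like code page (U+0000..U+00FF keep their value).
 * Every other UTF-16 code unit becomes '?', so the original name is lost.
 * Output is truncated silently if the buffer is too small.
 * Returns the number of bytes written, excluding the terminating NUL. */
size_t to_8bit_lossy(const uint16_t *in, char *out, size_t outsize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL && outsize > 0);

    while (*in && n + 1 < outsize) {
        uint16_t u = *in++;
        out[n++] = (u <= 0xFF) ? (char)u : '?';
    }
    out[n] = '\0';
    return n;
}
```

A name made of two CJK characters followed by ".txt" comes out as
"??.txt", and no file with that name exists on disk, hence 'file not
found'.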

UTF-8 support has improved a bit in Windows 7. For instance, you can run
batch scripts in UTF-8 format if you set the Command Prompt (OEM) code
page to 65001; this did not work in Vista. But I think that practically
nobody changes the default code page, and Windows programs work with the
system ANSI code page, which you typically don't want to change.
Therefore I think a good way to work on Windows is to interface with the
OS in UTF-16 and work internally in your program with UTF-8, at least if
you want a portable program that follows the Unix mantra "Everything is
a Stream of Bytes".

This is what I did in my wcd program, which supports Unicode by using
_wfindfirst() and _wfindnext() when searching the disk and then directly
converting the results to UTF-8 with WideCharToMultiByte(). The core of
the program stays portable and just handles a stream of bytes. Only in
the places where I have to interface with the OS do I convert the UTF-8
back to wide characters on Windows.
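
For readers without a Windows box at hand, the conversion step can be
sketched portably. This is not the wcd code, which calls
WideCharToMultiByte(); utf16_to_utf8() is a hypothetical, hand-rolled
stand-in that shows what such a conversion has to do, including
surrogate pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a NUL-terminated UTF-16 string as UTF-8.
 * Returns the number of bytes written (excluding the NUL), or (size_t)-1
 * if the buffer is too small or a surrogate is unpaired; on failure the
 * contents of out are unspecified. */
size_t utf16_to_utf8(const uint16_t *in, char *out, size_t outsize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return (size_t)-1;

    while (*in) {
        uint32_t cp = *in++;
        size_t need;

        if (cp >= 0xD800 && cp <= 0xDBFF) {           /* high surrogate */
            if (*in < 0xDC00 || *in > 0xDFFF)
                return (size_t)-1;                    /* no low surrogate */
            cp = 0x10000 + ((cp - 0xD800) << 10)
                         + (uint32_t)(*in++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                        /* lone low surrogate */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need >= outsize)                      /* keep room for NUL */
            return (size_t)-1;

        if (need == 1) {
            out[n++] = (char)cp;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (char)(0xF0 | (cp >> 18));
            out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Everything after this call can treat the name as an ordinary char
string, which is what keeps the core of the program portable.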

But perhaps there are MinGW programmers who are not interested in
writing portable code and would just like glob to return UTF-16 names.
That would also be fine for me, because I can convert UTF-16 to UTF-8.
But then I don't see how you would pass the globbed UTF-16 names into
the program.

regards,

-- 
Erwin Waterlander
http://waterlan.home.xs4all.nl/
