From: Erwin W. <wat...@xs...> - 2012-05-11 06:42:30
On 10-5-2012 22:37, Keith Marshall wrote:
> On 10/05/12 08:36, waterlan wrote:
>> How about globbing when there are files with Unicode characters?
> The current implementation supports only ASCII file names. It had
> occurred to me that a further UTF-16LE implementation may be desirable,
> but that will be for "later".
>
>> The normal MSVCRT.DLL globbing doesn't work. You get a question mark
>> for the Unicode characters, and when your program tries to open the file
>> you get 'file not found'.
> Sure. You need a UTF-16LE implementation to support UTF-16LE file
> names. However, the existing MinGW start-up code supports only the
> ASCII name space, and my present focus is on replacing that.

MinGW programs support the 8-bit ANSI name space, to be precise. That is
more than 7-bit ASCII.

>
>> So what I do to read Unicode arguments is this:
>>
>> wchar_t *cmdstr;
>> wchar_t **wargv;
>>
>> cmdstr = GetCommandLineW();
>> wargv = CommandLineToArgvW(cmdstr,&argc);
>>
>> Actually I don't know if this does any globbing.
> I don't either. I suspect not, since there's no selector to enable it,

It does not. I checked. It returns the unglobbed parameters. This is not
a problem for my wcd program, where I use this. But for dos2unix I would
like to have globbing with Unicode support, because users typically type:
unix2dos *.txt. At the moment it will fail on all file names that have
Unicode characters. I suspect this is a problem for a lot of Asian users.

> as there would be if you used __wgetmainargs() instead. However, if you
> were to make that change, I would expect similarly broken globbing to
> that exhibited by __getmainargs(), in the ASCII case; the primary
> purpose of this effort is to fix that, while also adding a few more
> POSIX-like enhancements as a secondary (optional) benefit.
>
>> Perhaps it would be handy if you add another glob function that
>> automatically returns all file names (including Unicode names) in UTF-8
> UTF-8? What's that?
> Okay, I do know, really; just couldn't resist a
> dig at the blatantly misinformed brainwashing Microsoft subject their
> users to, when they attempt to convince us that
>
> Unicode == UTF-16LE
>
> (as if UTF-32, UTF-32LE, UTF-32BE, UTF-16, UTF-16BE, and UTF-8 either
> don't exist at all, or somehow, don't qualify as Unicode formats).
>
>> instead of ANSI code. This is then still ASCII compatible, and if your
>> program needs to support Unicode on Windows the program can convert the
>> UTF-8 arguments to wide characters.
> When I have the glob(3) implementation working to my satisfaction, I may
> consider adapting it to provide a UTF-16LE based _wglob() variant; I
> have no plans for a UTF-8 specific variant. (Actually, there's no
> particular reason why the present version wouldn't work with UTF-8, as
> is; the limitation would stem from the capability of _findfirst() and
> _findnext(), as used by opendir(3) and readdir(3) in libmingwex.a's
> dirent implementation, to return file names which include non-ASCII
> 8-bit characters).
>

_findfirst() and _findnext() support 8-bit (ANSI) characters. In theory
UTF-8 should not be a problem, but as you said, MS 'wisely' decided to
use UTF-16. We have to live with it.

In the NTFS file system, file names are stored in UTF-16. Non-Unicode
programs get a translated result from the OS, with characters that are
not in the active code page translated to question marks. This is in
contrast to Unix, where the file system is ignorant of any encoding
scheme and file names are just a stream of bytes.

UTF-8 support has improved a bit in Windows 7. For instance, you can run
batch scripts in UTF-8 format if you set the Command Prompt OEM code page
to 65001. This was not working in Vista. But I think that practically
nobody changes the default code page. And Windows programs work with the
system ANSI code page, which you typically don't want to change.
Therefore I think a good way to work on Windows is to interface with the
OS in UTF-16 and work internally in your program with UTF-8. At least if
you want to have a portable program, and work with the Unix mantra
"Everything is a Stream of Bytes".

This is what I did in my wcd program, which supports Unicode by using
_wfindfirst() and _wfindnext() when searching the disk, and then directly
converts the results to UTF-8 with WideCharToMultiByte(). The core of the
program then stays portable and just handles a stream of bytes. Only at
the places where I have to interface with the OS do I convert the UTF-8
back to wide characters on Windows.

But perhaps there are MinGW programmers who are not interested in writing
portable code and would just like to see glob return UTF-16 names. That
would also be fine for me, because I can convert UTF-16 to UTF-8. But
then I don't see how you would return the globbed UTF-16 names to the
program.

regards,

--
Erwin Waterlander
http://waterlan.home.xs4all.nl/
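
For illustration, the UTF-16 to UTF-8 step described above can be
sketched in portable C. This is not wcd's code. On Windows the same job
is done by WideCharToMultiByte() with the CP_UTF8 code page; the
hand-written utf16_to_utf8() below is a hypothetical stand-in, so that
the sketch compiles on any platform. It assumes NUL-terminated input in
native byte order and replaces unpaired surrogates with U+FFFD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from wcd): convert a NUL-terminated UTF-16
 * string in native byte order to NUL-terminated UTF-8.
 * Returns the number of bytes written, excluding the NUL, or (size_t)-1
 * if dst (capacity cap bytes) is too small. Unpaired surrogates are
 * replaced with U+FFFD. */
static size_t utf16_to_utf8(const uint16_t *src, char *dst, size_t cap)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != 0) {
        uint32_t cp = *src++;
        unsigned char buf[4];
        size_t len;

        if (cp >= 0xD800 && cp <= 0xDBFF && *src >= 0xDC00 && *src <= 0xDFFF) {
            /* High surrogate followed by a low surrogate: combine the pair. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*src++ - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* lone surrogate: substitute the replacement character */
        }

        if (cp < 0x80) {                 /* 1 byte: plain ASCII */
            buf[0] = (unsigned char)cp;
            len = 1;
        } else if (cp < 0x800) {         /* 2 bytes */
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 2;
        } else if (cp < 0x10000) {       /* 3 bytes */
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 3;
        } else {                         /* 4 bytes: beyond the BMP */
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 4;
        }

        /* Always leave room for the terminating NUL. */
        if (n + len + 1 > cap)
            return (size_t)-1;
        memcpy(dst + n, buf, len);
        n += len;
    }
    if (n + 1 > cap)
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

On Windows, where wchar_t is 16 bits wide, a name returned by
_wfindnext() could be passed to such a function (or to
WideCharToMultiByte()) right at the OS boundary, so that the rest of the
program only ever sees bytes. ASCII characters come out unchanged, which
is what keeps the result ASCII compatible.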