From: Tor L. <tm...@ik...> - 2009-07-09 10:22:22
> Well, yes, ..., but software that uses the 8-bit calls is broken
> anyway in the sense that it stops working the moment the codepage
> changes.

I don't know what you mean with "the moment the codepage changes". You
make the situation sound worse than it is. The system codepage of a
machine does not change without overwriting the Windows installation
with a different language edition of Windows, as far as I know.

Code that is written to use the "normal" C library (plain "char") APIs,
and the A-suffixed versions of Win32 APIs (that is, without any suffix
at all, assuming UNICODE is not defined), for instance code ported
straight from Unix, does work fine in most cases on various language
editions of Windows with different system codepages, and is able to
handle non-ASCII file names in the system codepage in question. You
don't need to write such code to work in just one particular system
codepage. (In fact, it would be hard to do that intentionally.)

"Narrow char" code will usually, to the best of my knowledge, work fine
on a Western Windows installation, a Greek one, an Arabic one, a Hebrew
one etc. without recompilation, and will handle files with names in
those codepages (all of which include plain ASCII in the 7-bit half).

(On systems with East Asian double-byte system codepages, such "plain"
code will also work mostly fine, except that doing things like
strchr(filename, '\\') to find directory separators will break, as some
double-byte characters have '\\' as the second byte. Ditto for '/'. To
properly handle strings encoded in double-byte system codepages, one
should use the multi-byte string functions like _mbschr().)

It is only the case where a system has files with names containing
characters not in the system codepage that absolutely *requires* using
Unicode, i.e. wide character strings and wide character APIs, to handle
such files. I have no idea how common or rare such situations are, but
they might be quite common in some parts of the world, or in
institutions that regularly handle files from different parts of the
world. In my personal opinion, it is important to be prepared for such
situations. That is why I tend to bring up the issue of being Unicode
aware.

--tml
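
(To illustrate the strchr() / _mbschr() point above, here is a minimal
sketch; it is not from the original mail, the file name is
hypothetical, and it assumes the Windows CRT. In Shift-JIS, the
character 能 is the byte pair 0x94 0x5C, and 0x5C is also the byte
value of '\\'.)

#include <cstring>
#include <cstdio>
#include <mbstring.h>   // _mbschr(): Windows CRT only
#include <mbctype.h>    // _setmbcp()

int main()
{
    _setmbcp(932);                      // CRT multibyte codepage: Shift-JIS
    const char* name = "\x94\x5C.mp3";  // "能.mp3" in Shift-JIS

    // strchr() matches the trail byte 0x5C of 能 and reports a bogus
    // directory separator; _mbschr() steps over whole double-byte
    // characters and correctly finds none.
    const char* bad = std::strchr(name, '\\');
    const unsigned char* good = _mbschr((const unsigned char*)name, '\\');

    std::printf("strchr: %s, _mbschr: %s\n",
                bad  ? "bogus separator found" : "none found",
                good ? "separator found"       : "none found");
    return 0;
}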
From: Marian C. <ci...@in...> - 2009-07-09 19:07:05
>> Well, yes, ..., but software that uses the 8-bit calls is broken
>> anyway in the sense that it stops working the moment the codepage
>> changes.
>
> I don't know what you mean with "the moment the codepage changes". You
> make the situation sound worse than it is. The system codepage of a
> machine does not change without overwriting the Windows installation
> with a different language edition of Windows, as far as I know.

I thought you could change it from Control Panel, or at least from
RegEdit. Anyway, that's not what I meant; even if you can do it, it's
not a normal use scenario. I was thinking more in terms of transferring
data between computers with different codepages, e.g. unpacking an
archive that your "different codepage" friend sent you.

As for making the situation sound worse than it is, you're right. Most
applications work fine with 8-bit calls and local codepages. My
personal case is an MP3 diagnosis and correction tool (called MP3
Diags), which I wrote on Linux and then ported to Windows. It seems to
work fine as long as you stay within ASCII, but it doesn't see anything
above that, because of a mix of UTF-8 and local-codepage calls. I can
improve it to always use the local codepage, but that doesn't really
fix it, because on my computer I have files whose names don't fit in
the local codepage (and if I change the codepage, then other names
won't fit).

If you go beyond Western Europe and North America, the chance of people
having MP3 files whose names fall outside the local codepage increases
sharply. Besides being invisible (for now) to my program, such files
can't be copied, opened, compared, played ... by tools that only
understand the local codepage. I think MP3 files are more likely than
others to have "wrong" names: while users have been taught to stick to
ASCII when creating files if they want to stay out of trouble, CD
rippers can use whatever characters they please.

So this sort of brings me to my main point: we're stuck in a situation
where users would like to use more characters in file names, but they
don't, because many tools can't deal with such names; the tools, on the
other hand, have little incentive to change, because users have learnt
their lesson and just use ASCII, and perhaps the local codepage. So my
suggestion is that the tools should change.

Then why don't *I* start making the change? Well, I'm already spending
a lot of time on the freely available MP3 tool, and there are other
things that I need or want to do. Also, it would take me a lot more
time to make, say, MinGW UTF-8-aware than it would take somebody who is
already familiar with it.
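
(A small sketch of what "don't fit in the local codepage" means in
practice; it is not from the original mail, the Greek file name is
hypothetical, and it assumes a Western ANSI codepage such as 1252.
Converting such a name to the narrow form loses information, which
Windows reports through the "used default char" flag:)

#include <windows.h>
#include <cstdio>

int main()
{
    // Hypothetical name; on a cp1252 system the Greek letters have no
    // narrow representation.
    const wchar_t* wname = L"\u03bc\u03bf\u03c5\u03c3\u03b9\u03ba\u03ae.mp3"; // μουσική.mp3

    char narrow[MAX_PATH];
    BOOL lost = FALSE;
    WideCharToMultiByte(CP_ACP, 0, wname, -1,
                        narrow, sizeof narrow, NULL, &lost);

    // If "lost" is TRUE, the narrow name is already wrong: fopen() or
    // CreateFileA() on it would look for "???????.mp3", while
    // CreateFileW(wname, ...) opens the real file.
    std::printf("ANSI view: %s (information lost: %s)\n",
                narrow, lost ? "yes" : "no");
    return 0;
}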
From: Marian C. <ci...@in...> - 2009-07-17 07:13:43
> So this sort of brings me to my main point: we're stuck in a situation
> where users would like to use more characters in file names, but they
> don't, because many tools can't deal with such names; the tools, on
> the other hand, have little incentive to change, because users have
> learnt their lesson and just use ASCII, and perhaps the local
> codepage. So my suggestion is that the tools should change. Then why
> don't *I* start making the change? Well, I'm already spending a lot of
> time on the freely available MP3 tool, and there are other things that
> I need or want to do. Also, it would take me a lot more time to make,
> say, MinGW UTF-8-aware than it would take somebody who is already
> familiar with it.

In case somebody stumbles upon this thread looking for a solution to
the same problem, here's what I did: I created drop-in replacement
classes for fstream / ifstream / ofstream, which take Unicode names
(UTF-8 or UTF-16) in their constructors and in their open() methods.

The code is available for download from the MP3 Diags project. You need
the files fstream_unicode.h and fstream_unicode.cpp. You can also take
a look at them at
http://mp3diags.svn.sourceforge.net/viewvc/mp3diags/src/fstream_unicode.h?view=markup
and
http://mp3diags.svn.sourceforge.net/viewvc/mp3diags/src/fstream_unicode.cpp?view=markup

They haven't been heavily tested, but they seem to work fine for me.
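
(For readers who only want the general shape of such a wrapper, here is
a rough sketch of the underlying technique. This is not the
fstream_unicode code, the helper name is made up, and it assumes
MinGW/libstdc++, whose fstream has no wide-character open:)

#include <windows.h>
#include <cstdio>
#include <istream>
#include <string>
#include <vector>
#include <ext/stdio_filebuf.h>   // libstdc++ extension

// Hypothetical helper: UTF-8 -> UTF-16.
std::wstring utf8ToUtf16(const std::string& s)
{
    int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, NULL, 0);
    std::vector<wchar_t> buf(n);
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &buf[0], n);
    return std::wstring(&buf[0]);
}

int main()
{
    std::wstring wname = utf8ToUtf16("\xC3\xA9t\xC3\xA9.mp3"); // UTF-8 "été.mp3"
    FILE* f = _wfopen(wname.c_str(), L"rb");   // wide-char open: any name works
    if (!f)
        return 1;
    {
        // Wrap the FILE* in a filebuf; from here on it is an ordinary
        // istream, regardless of what the file is called.
        __gnu_cxx::stdio_filebuf<char> fbuf(f, std::ios::in | std::ios::binary);
        std::istream in(&fbuf);
        // ... read from "in" as usual ...
    }
    std::fclose(f);
    return 0;
}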
From: Tor L. <tm...@ik...> - 2009-07-10 09:15:04
> So this sort of brings me to my main point: we're stuck in a situation
> where users would like to use more characters in file names, but they
> don't, because many tools can't deal with such names;

Thus it is a good thing that modern programming environments like Java
and C# use Unicode for file names (for all strings, in fact). Just
write your code in Java or C# and there is no problem with arbitrary
file names on Windows.

Also, for instance, the GTK+ stack API uses Unicode (UTF-8) for file
names on Windows. I don't know what other comparable toolkits like Qt
do; hopefully the same. So if you for some reason don't want to use
Java or C#, use C (or C++), but with a library / toolkit that provides
a UTF-8 view of the file system.

Writing application code in plain C, in this century, without any
higher-level toolkit than the C library or basic Win32 APIs, sounds a
bit odd to me.

--tml
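
(As an illustration of the "UTF-8 view of the file system" that GLib,
the base of the GTK+ stack, provides; the file name is hypothetical.
g_fopen() from <glib/gstdio.h> takes a UTF-8 name and, on Windows,
converts it to UTF-16 and calls the wide-char CRT internally, so the
same source line handles any name:)

#include <glib/gstdio.h>   // g_fopen(): UTF-8 file names on all platforms
#include <cstdio>

int main()
{
    // UTF-8 bytes of the hypothetical name "μουσική.mp3".
    FILE* f = g_fopen("\xCE\xBC\xCE\xBF\xCF\x85\xCF\x83"
                      "\xCE\xB9\xCE\xBA\xCE\xAE.mp3", "rb");
    if (f)
    {
        // ... read the file ...
        std::fclose(f);
    }
    return 0;
}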
From: Mark <mar...@gm...> - 2009-07-12 07:47:42
Tor Lillqvist wrote:
> Writing application code in plain C, in this century, without any
> higher-level toolkit than the C library or basic Win32 APIs, sounds a
> bit odd to me.
>
> --tml

I can't let that pass, really :-) There may be good reasons; for
instance, my current project is a Windows port of a highly portable
browser. Although its browser function needs a few libs, it makes sense
for its GUI to be as unbloated as possible, as well as [hopefully, when
I get it right :-)] working compatibly in as many different flavours of
Windows as possible; hence w32api / gdi.

Best,
Mark

http://www.halloit.com
Key ID 046B65CF