From: Erwin W. <wat...@xs...> - 2013-02-15 07:04:55
On 14-2-2013 22:55, John Brown wrote:
>> 2013/2/13 John Brown
>> I did not fully understand your test cases, but I do not see where your
>> tests show that MSYS does not convert characters.
>>
>> Renato Silva
>> Hopefully you did not understand because I didn't either @@. But I have
>> corrected it and I hope you do now.
> I was wondering if something was wrong with me. I am glad to see that I
> am OK.
>
>> John, as for your file format, ok I understand it better
>> now, but it's still just crazy to me. I can't see much sense in their
>> triple-digit thing, for example.
> Neither do I, but I am sure of their method. I observed the pattern
> when I calculated the total that I could not see based on other
> numbers that were always clear.
>
> As for the permille, it does not seem to have any numerical value.
> It may be there just for decoration. In any case, other than the
> triple-digits, the portions of the file that I need can be easily
> extracted using regular expressions. I can search-and-replace the
> triple-digits, again using regular expressions.
>
>> Actually, iconv gives an error instead, due to the permille.
> ...
>>
>> >iconv -f latin1 -t cp850 original_bytes.txt
>> iconv: test.txt:1:1: cannot convert
>>
> I got that too. I did not bother to report it because I was satisfied
> - I managed to make MSYS display the same output as Notepad, thanks to
> Erwin Waterlander. I was also tired.
>
> And thanks for the tip about ls --show-control-chars.
>

Hi,

Remember that MSYS is derived from a very old Cygwin 1.3. Cygwin only
started to support locales (and Unicode) properly in version 1.7.
MSYS 2.0 will be based on Cygwin 1.7. Only then will we be free of the
code page annoyance.

The OEM code pages are an annoyance to all non-Unicode Windows command
line programs. English-speaking people don't notice it so much, because
the English language doesn't use many diacritical marks like accents or
umlauts.
Usually ASCII is sufficient for English, with a few exceptions such as
naïve or passé.

I don't understand why Microsoft didn't make the default OEM code page
equal to the ANSI code page a long time ago. A good moment would have
been Vista, which was also available to the public in a 64-bit edition.
OEM code pages exist for backwards compatibility with real DOS programs.
How many people still run DOS command line programs on Windows? I think
the majority of command line programs are Windows programs by now. I
think they should have switched the default code page in cmd.exe and
PowerShell to ANSI, and the few people who run DOS programs could switch
to CP850 or whatever. What's the point of PowerShell being backwards
compatible with DOS programs with respect to code pages?

And what really surprises me is that even on 64-bit Windows the DOS OEM
code pages are the default, while it is not even possible to run a DOS
program in cmd.exe on 64-bit Windows (because NTVDM has been removed).

Microsoft's advice is to write Unicode programs, and to use the Windows
API for that (try WriteConsoleW). A Windows Unicode command line program
will produce consistent output, independent of the active code page.
Then the only limitation is the font. But programs ported from Unix
typically don't use the Windows API. And that is why Cygwin 1.7 had to
build a UTF-8 layer that translates to and from Windows' internal UTF-16
format.

So what can you do now? For yourself, if you don't run real DOS
programs, it's fine to switch the default OEMCP permanently to 1252. If
you distribute software for Windows, there is no way around using
Microsoft's Unicode functions if you want to get out of the code page
trouble.

regards,

--
Erwin Waterlander
http://waterlan.home.xs4all.nl/
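
As a small illustration of the ANSI/OEM mismatch described above, here
is a minimal Python sketch. Python is only a convenient choice for the
example, and the helper names shown_as and can_encode are made up; they
are not part of MSYS, Cygwin or the Windows API. The sketch encodes a
word in the ANSI code page (cp1252), decodes the same bytes with an OEM
code page (cp850) the way a console would, and checks whether the
permille sign exists in each code page.

```python
# Illustrative sketch only: the helper names are hypothetical and are
# not taken from MSYS, Cygwin or any Windows API.


def shown_as(text, written_in, displayed_in):
    """Encode text in one code page, then decode those bytes in another.

    This mimics a program that writes ANSI bytes to a console which
    interprets them using the OEM code page.
    """
    return text.encode(written_in).decode(displayed_in, errors="replace")


def can_encode(text, codepage):
    """Return True if every character of text exists in codepage."""
    try:
        text.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for word in ("naïve", "passé"):
        ansi = word.encode("cp1252")
        oem = word.encode("cp850")
        # ascii() and hex() keep the output itself plain ASCII, so the
        # demo does not depend on the console code page.
        print(ascii(word), "cp1252:", ansi.hex(), "cp850:", oem.hex())
        print("  ANSI bytes read as cp850:",
              ascii(shown_as(word, "cp1252", "cp850")))

    permille = "\u2030"  # the permille sign
    print("permille in cp1252:", can_encode(permille, "cp1252"))
    print("permille in cp850: ", can_encode(permille, "cp850"))
```

Plain ASCII text survives the round trip unchanged, which is why the
problem is easy to miss with English text. The permille sign exists in
cp1252 but not in cp850, which is consistent with a conversion to cp850
failing on it, although the sketch does not inspect the actual bytes of
the file discussed in this thread.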