From: John B. <joh...@ho...> - 2013-02-12 13:07:31
|
Hello All,

Which codepage does an MSYS terminal use?

I have a file output.dat which seems to be mostly text with 0x0A line endings, but there are a few bytes that are out of the 7-bit ASCII range. As a result, the file looks different depending on the tool used to view it.

In a CMD.EXE window (code page = 850 according to chcp), `type output.dat' prints
áëTOTALáî©é.23

In an MSYS window, `cat output.dat' prints
%TOTAL O¸'.23

Notepad displays it as
‰TOTAL Œ¸‚.23

The actual 14 bytes in a hex editor are:
A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33

I am not sure what you will see because I am sending this as text. If outlook.com sends it as UTF-8, all will be well, but if not ...

I believe that the MSYS output is closest to what the creators of the file had in mind. I would like to know what `cat' and/or MSYS did to produce that output.

Regards,
John Brown. |
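The CMD.EXE and Notepad renderings can be reproduced from the 14 bytes alone. The following is a minimal Python sketch, assuming that `type' in CMD.EXE simply shows the raw bytes in CP850 and that Notepad shows them in Windows-1252; the variable names are illustrative only.

```python
# The 14 bytes quoted from the hex editor.
raw = bytes.fromhex("A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33")

# Assumption: CMD.EXE `type` interprets the bytes in the OEM code page (CP850),
# while Notepad interprets them in the ANSI code page (Windows-1252).
as_oem = raw.decode("cp850")    # 'áëTOTALáî©é.23'
as_ansi = raw.decode("cp1252")  # '\xa0‰TOTAL\xa0Œ¸‚.23' (0xA0 is a no-break space)
```

Same bytes, two code pages, two different strings: nothing in the file itself changes between the two views.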
From: Renato S. <br....@gm...> - 2013-02-12 16:08:12
|
2013/2/12 John Brown <joh...@ho...> > Which codepage does an MSYS terminal use? > > I have a file output.dat which seems to be mostly text with 0x0A line > endings, but there are a few bytes that are out of the 7-bit ASCII > range. As a result, the file looks different depending on the tool > used to view it. > > In a CMD.EXE window (code page = 850 according to chcp), > `type output.dat' prints > áëTOTALáî©é.23 > > In a MSYS window, `cat output.dat' prints > %TOTAL O¸'.23 > > Notepad displays it as > ‰TOTAL Œ¸‚.23 > > The actual 14 bytes in a hex editor are: > A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33 > > I am not sure what you will see because I am sending this as > text. If outlook.com sends it as UTF-8, all will be well, > but if not ... > > I believe that the MSYS output is closest to what the creators > of the file had in mind. I would like to know what `cat' and/or > MSYS did to produce that output. > I don't think knowing what encoding is used by the "MSYS terminal" will help with your problem. You need to rather find out if the original file is really supposed to be read as text, and if so, what encoding was used to generate it. Based on the file name and contents, and from the fact that it doesn't look like UTF-8, CP850 or ISO-8859-1, my guess is that the bytes following that total may represent a number, not a set of characters, so that you would need to concatenate all bits together to properly read it. Either way, can you provide more details about this file and your overall problem? |
From: John B. <joh...@ho...> - 2013-02-12 18:31:24
|
On Tue, 12 Feb 2013 14:07:22 -0200, Renato Silva wrote:
> 2013/2/12 John Brown
> Which codepage does an MSYS terminal use?
>
> I have a file output.dat which seems to be mostly text with 0x0A line
> endings, but there are a few bytes that are out of the 7-bit ASCII
> range. As a result, the file looks different depending on the tool
> used to view it.

[examples of different output snipped]

> I believe that the MSYS output is closest to what the creators
> of the file had in mind. I would like to know what `cat' and/or
> MSYS did to produce that output.
>
> I don't think knowing what encoding is used by the "MSYS terminal" will
> help with your problem. You need to rather find out if the original
> file is really supposed to be read as text, and if so, what encoding
> was used to generate it.

I did not give the full story. The line that I showed was a single line from the file, which represents the total on an invoice. Actually, I understand the data well enough to do what I need to do.

In the MSYS example
%TOTAL O¸'.23w
if we consider the two bytes to the left of the decimal point, the first byte (the one that looks like a speck of dust on your monitor) means the digit 8, and the next byte (looks like a single quote) means that the character is repeated 3 times. Therefore the TOTAL is 888.23. I am just curious about what they were thinking when they implemented this unnecessary shorthand, and it *is* unnecessary because if the TOTAL is 12345.67, they write it out as 12,345.67 with the comma to separate thousands. Clearly there is no shortage of space.

So it is my opinion that:
1) The original file is supposed to be read as text, and
2) The encoding that was used to generate the file is similar to the one that is in effect when I run `cat' on the file in an MSYS window.

I know the numeric values of those bytes, and one way or another that knowledge should be enough. I am just curious about why they chose those bytes for that purpose. For example, I see a speck, but maybe they see a symbol that looks like an 8, e.g. the infinity symbol.

Also, now that I know that `$ cat <file>' can give me different results from `C:\> type <file>', I am curious about that too.

I could ask the vendor about their file, but they probably would not tell me.

Regards,
John Brown. |
From: BGINFO4X <bgi...@kz...> - 2013-02-12 19:00:58
|
> > In the MSYS example
> %TOTAL O¸'.23w
> if we consider the two bytes to the left of the decimal point,
> the first byte (the one that looks like a speck of dust on your
> monitor) means the digit 8, and the next byte (looks like a single
> quote) means that the character is repeated 3 times. Therefore the
> TOTAL is 888.23. I am just curious about what they were thinking
> when they implemented this unnecessary shorthand, and it *is*
> unnecessary because if the TOTAL is 12345.67, they write it out
> as 12,345.67 with the comma to separate thousands. Clearly there is
> no shortage of space.
> >

Perhaps a silly question: have you checked the regional settings in the Control Panel? Depending on the region selected, the thousands separator is sometimes a comma (,) and sometimes a dot (.).

Regards. |
From: Erwin W. <wat...@xs...> - 2013-02-12 21:06:13
|
On 12-2-2013 14:07, John Brown wrote:
> Hello All,
>
> Which codepage does an MSYS terminal use?
>
> I have a file output.dat which seems to be mostly text with 0x0A line
> endings, but there are a few bytes that are out of the 7-bit ASCII
> range. As a result, the file looks different depending on the tool
> used to view it.
>
> In a CMD.EXE window (code page = 850 according to chcp),
> `type output.dat' prints
> áëTOTALáî©é.23
>
> In a MSYS window, `cat output.dat' prints
> %TOTAL O¸'.23
>
> Notepad displays it as
> ‰TOTAL Œ¸‚.23
>
> The actual 14 bytes in a hex editor are:
> A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33
>
>

By default the command prompt (cmd.exe) uses an OEM (DOS) code page. "type <file>" will show you the content of the file as on DOS. If you have a 32-bit Windows, you can run the DOS editor "edit". Edit will show you the same.

Notepad is a Windows program and will show the file in the ANSI system code page, which you set in the control panel.

Your ANSI code page is CP1252, and your OEM code page is CP850. For instance, the second character, with number 0x89, is a promille symbol in CP1252 and an e with umlaut in CP850. See also http://czyborra.com/charsets/codepages.html

Msys cat is a Windows program. I think the msys terminal runs in the same code page as cmd.exe. But cat tries to be smart and does a fuzzy translation to CP850. It translates the promille symbol to a percent symbol, and OE (0x8c) to O. I cannot find any documentation on this translation of cat.

In the registry you can change the default OEM code page of cmd.exe. I think this will also change msys' default code page. If you make it equal to the system ANSI code page CP1252 you will see the same text everywhere.

regards,
Erwin |
From: Erwin W. <wat...@xs...> - 2013-02-12 21:42:06
|
On 12-2-2013 22:05, Erwin Waterlander wrote:
[earlier quoted text snipped]
> In the registry you can change the default OEM code page of cmd.exe. I
> think this will also change msys' default code page. If you make it
> equal to the system ANSI code page CP1252 you will see the same text
> everywhere.
>

I tested it. The page http://superuser.com/questions/387569/how-do-i-permantly-set-the-command-prompt-codepage-in-windows-7 tells you how to change the OEM code page permanently. It requires a reboot. After that, cat in msys (in ConEmu) shows the same symbols as Notepad. If you use the standard msys console, you have to set the font to Lucida Console in the properties, because the raster font only supports the OEM code page.

regards,
Erwin |
From: Erwin W. <wat...@xs...> - 2013-02-12 22:24:49
|
On 12-2-2013 22:05, Erwin Waterlander wrote:
[earlier quoted text snipped]
> Msys cat is a windows program. I think the msys terminal runs in the
> same code page as cmd.exe. But cat tries to be smart and does a fuzzy
> translation to CP850. It translates the promille symbol to a percent
> symbol, and OE (0x8c) to O. I cannot find any documentation on this
> translation of cat.
>

It is not cat that does a translation, but the msys dll, which is derived from cygwin 1.3.

With a default OEM code page 850, if you start msys like this from a command prompt:

set CYGWIN=codepage:oem
msys.bat

then the output of cat is the same as that of type in cmd.exe.

The default setting is CYGWIN=codepage:ansi

Erwin |
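The "fuzzy" translation discussed here resembles what Windows calls best-fit mapping: when a character has no exact equivalent in the target code page, a similar-looking one is substituted. Below is a rough Python sketch of that idea. The `BEST_FIT` table and the function `ansi_to_oem` are hypothetical and cover only the three substitutions observed in this thread; they are not the actual table or code used by the msys dll or by Windows.

```python
raw = bytes.fromhex("A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33")

# Hypothetical fallback table: only the substitutions reported in the thread.
BEST_FIT = {
    "\u2030": "%",   # per mille sign -> percent sign
    "\u0152": "O",   # OE ligature -> plain O
    "\u201a": "'",   # single low-9 quotation mark -> apostrophe
}

def ansi_to_oem(data: bytes) -> bytes:
    """Convert CP1252 bytes to CP850, substituting unmappable characters."""
    out = bytearray()
    for ch in data.decode("cp1252"):
        try:
            out += ch.encode("cp850")          # exact equivalent exists
        except UnicodeEncodeError:
            out += BEST_FIT.get(ch, "?").encode("cp850")
    return bytes(out)

# What a CP850 console would display after such a conversion.
shown = ansi_to_oem(raw).decode("cp850")
```

Under these assumptions `shown` matches the MSYS rendering quoted in the thread: the per mille sign becomes a percent sign, Œ becomes O, and the low quotation mark becomes an apostrophe, while the cedilla survives because CP850 contains it.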
From: John B. <joh...@ho...> - 2013-02-12 23:00:09
|
On Tue, 12 Feb 2013 23:24:30 +0100, Erwin Waterlander wrote:
> On 12-2-2013 22:05, Erwin Waterlander wrote:
> > On 12-2-2013 14:07, John Brown wrote:
> >> Hello All,
> >>
> >>
> >> In a CMD.EXE window (code page = 850 according to chcp),
> >> `type output.dat' prints
> >> áëTOTALáî©é.23
> >>
> >> In a MSYS window, `cat output.dat' prints
> >> %TOTAL O¸'.23
> >>
> >> Notepad displays it as
> >> ‰TOTAL Œ¸‚.23
> >>
> >> The actual 14 bytes in a hex editor are:
> >> A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33
> >
> It is not cat that does a translation, but the msys dll, which is
> derived from cygwin 1.3.
>
> With a default OEM code page 850, if you start msys like this from a
> command prompt:
>
> set CYGWIN=codepage:oem
> msys.bat
>
> Then the output of cat is the same as type in cmd.exe.
>
> The default setting is CYGWIN=codepage:ansi
>
> Erwin
>

If the default code page of the msys dll is ANSI and this is responsible for the output of `cat', then why is that output different from what is displayed in Notepad? Notepad++ shows the same thing as Notepad, and Notepad++ shows "UNIX" (for the 0x0A line endings) and "ANSI" in its status bar.

Regards,
John Brown. |
From: Earnie B. <ea...@us...> - 2013-02-13 02:39:05
|
On Tue, Feb 12, 2013 at 5:24 PM, Erwin Waterlander wrote: > > It is not cat that does a translation, but the msys dll, which is > derived from cygwin 1.3. > > With a default OEM code page 850, if you start msys like this from a > command prompt: > > set CYGWIN=codepage:oem > msys.bat > > Then the output of cat is the same as type in cmd.exe. > > The default setting is CYGWIN=codepage:ansi CAUTION: Using the CYGWIN environment variable may be detrimental to the health of MSYS. At the time I created MSYS I chose to ignore it and provide a standard set of defaults. Some things might still work but I know some will cause hazard. -- Earnie -- https://sites.google.com/site/earnieboyd |
From: John B. <joh...@ho...> - 2013-02-13 03:42:21
|
On Tue, 12 Feb 2013 21:38:58 -0500, Earnie Boyd wrote: > > CAUTION: Using the CYGWIN environment variable may be detrimental to > the health of MSYS. At the time I created MSYS I chose to ignore it > and provide a standard set of defaults. Some things might still work > but I know some will cause hazard. > > -- > Earnie Never fear. I have no intention of changing my settings. I just want an explanation for what I am seeing. Regards, John Brown. |
From: Earnie B. <ea...@us...> - 2013-02-13 02:42:49
|
On Tue, Feb 12, 2013 at 6:00 PM, John Brown wrote: > > If the default code page of the msys dll is ANSI and this is > responsible for the output of `cat', then why is that output > different from what is displayed in Notepad? Notepad++ shows > the same thing as Notepad, and Notepad++ shows "UNIX" (for the > 0x0A line endings), and "ANSI" in its status bar. > The cat binary supplied by MSYS sends binary streams of data to the terminal. It is the terminal's job to interpret the data. No conversion is done by cat; that is just a misunderstanding of what happens. -- Earnie -- https://sites.google.com/site/earnieboyd |
From: Renato S. <br....@gm...> - 2013-02-13 13:00:01
|
2013/2/12 Erwin Waterlander <wat...@xs...>
> Msys cat is a windows program. I think the msys terminal runs in the
> same code page as cmd.exe. But cat tries to be smart and does a fuzzy
> translation to CP850. It translates the promille symbol to a percent
> symbol, and OE (0x8c) to O. I cannot find any documentation on this
> translation of cat.
>

I would assume the encoding is the same as cmd.exe's as well, simply because it is the same terminal, the Windows console. That is, unless you're using rxvt, MinTTY or something else, and unless bash is doing something in the middle.

Yes, I see the same crazy conversion of cat's output in the Windows console. A few tests here:

1. The cat command applied to a Latin1 text file shows its contents in the active codepage of the Windows console, CP850. I would instead expect the output to be displayed erroneously, as if the bytes were CP850, just like cmd.exe's type does for the same text file.

2. A script I wrote, which prints in CP850 by default, is displayed correctly as CP850, as expected.

3. A testing program written as Latin1 prints its text as expected, that is, as Latin1 mistaken for CP850. However, when the output is piped to cat (both in bash and cmd.exe), it gets displayed as if the bytes themselves were CP850. I would expect the pipe operation to not change anything.

2013/2/13 Earnie Boyd <ea...@us...>
> The cat binary supplied by MSYS sends binary streams of data to the
> terminal. It is the terminal's job to interpret the data. No conversion
> is done by cat; that is just a misunderstanding of what happens.
>

Then what's the explanation for what we are seeing above? The third item proves that either the pipes (from *both* bash and cmd.exe) or cat is misbehaving. |
From: Renato S. <br....@gm...> - 2013-02-13 03:19:48
|
John, I'm finding your explanation a little confusing. I think you're just trying to randomly guess the encoding, but I still don't see how MSYS' "default encoding" would help you with that. If you want to keep playing with it, just install iconv or MinTTY and try out each encoding you suspect it might be. |
From: John B. <joh...@ho...> - 2013-02-13 10:41:35
|
On Wed, 13 Feb 2013 01:19:40 -0200 , Renato Silva wrote: > > > John, I'm finding your explanation a little confusing. I will make one more attempt. 1) Technically, the file is a binary file. However, it consists of mostly plain ASCII text with 0x0A line endings and a few characters (bytes) > 0x7F. 2) 99.9% of the time, a number is represented by itself, so 123.45 is written as 123.45, and 12345.67 is written as 12,345.67 using the familiar characters in the range 0x30 - 0x39. However, when the number contains a digit that repeats 3 times, that sequence of three digits is represented by what appears to be a byte (> 0x7F) that represents the digit and another that indicates that it repeats 3 times. In 888.##, #,888.## and ##,888.## (where # is a digit) the "888" is represented as described with the same two bytes in all cases, but 8,88#.## is written as 8,88#.##. There are a handful of such non-7bit-ASCII byte sequences. I have found most of them by writing a small program to search thousands of files. An easier but not necessarily quicker way would have been to ask someone to enter in the test system transactions with the numbers that I am interested in. 3) This logical interpretation is not readily apparent when you look at the CMD.EXE output. It was when I looked at the MSYS output that it started to make sense. However, it does not need to make sense. It only has to be consistent. If I know the non-ASCII bytes that represent 888, I can replace them with 888. 4) I was simply curious about why they chose those bytes. On further investigation, it seems that the bytes in which I am interested are in a contiguous range, and treating the file as ANSI does no harm. I can extract the parts that I want using regular expressions. > I think you're > just trying to random guess the encoding, You are absolutely correct. Maybe you are not confused after all. > but I still don't see how > MSYS' "default encoding" would help you with that. 
MSYS output was more reasonable (less random-looking) than CMD output, but Notepad is even better than MSYS. By the way, MSYS `file' says that the file is 'Non-ISO extended-ASCII text, with LF, NEL line terminators'. On Linux, it just says 'data'. > If you want to keep > playing with it, just install iconv or MinTTY and go trying out each > encoding you suspect it is. I don't think that I will bother. That is too much work for something that I do not really *need* to know, and as I said, I can process it if I assume that it is ANSI. Regards, John Brown. |
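The substitution described in point 3 of the message above can be sketched in a few lines of Python. Only the pair 0xB8 0x82 standing for 888 comes from this thread; the table name `SHORTHAND` and the function `expand` are made up for illustration, and any further byte pairs would have to be discovered from real data. Plain byte replacement is used here instead of the regular expressions John mentions, simply because a single fixed pair does not need a pattern.

```python
# Known shorthand from the thread: B8 82 means the digit 8 repeated 3 times.
# Other pairs are unknown here and deliberately not guessed.
SHORTHAND = {b"\xb8\x82": b"888"}

def expand(line: bytes) -> bytes:
    """Replace each known shorthand byte pair with its plain-ASCII digits."""
    for pair, digits in SHORTHAND.items():
        line = line.replace(pair, digits)
    return line

sample = bytes.fromhex("A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33")
expanded = expand(sample)   # the line now ends with b"888.23"
```

Working on bytes rather than decoded text keeps the result independent of whichever code page the viewer happens to use.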
From: waterlan <wat...@xs...> - 2013-02-13 07:09:16
|
Earnie Boyd wrote on 2013-02-13 03:42:
> On Tue, Feb 12, 2013 at 6:00 PM, John Brown wrote:
>>
>> If the default code page of the msys dll is ANSI and this is
>> responsible for the output of `cat', then why is that output
>> different from what is displayed in Notepad? Notepad++ shows
>> the same thing as Notepad, and Notepad++ shows "UNIX" (for the
>> 0x0A line endings), and "ANSI" in its status bar.
>>
>
> The cat binary supplied by MSYS sends binary streams of data to the
> terminal. It is the terminal's job to interpret the data. No
> conversion is done by cat; that is just a misunderstanding of what
> happens.
>

Notepad++ says "Unix" only because of the Unix line break. "Unix" doesn't say anything about the encoding used. All encodings (OEM, ANSI, ISO, Unicode, ...) can have Unix, DOS or Mac line breaks.

regards,

--
Erwin Waterlander
http://waterlan.home.xs4all.nl/ |
From: John B. <joh...@ho...> - 2013-02-13 12:22:05
|
On Wed, 13 Feb 2013 08:09:06 +0100, Erwin Waterlander wrote:
>
>> On Tue, Feb 12, 2013 at 6:00 PM, John Brown wrote:
>>>
>>> If the default code page of the msys dll is ANSI and this is
>>> responsible for the output of `cat', then why is that output
>>> different from what is displayed in Notepad? Notepad++ shows
>>> the same thing as Notepad, and Notepad++ shows "UNIX" (for the
>>> 0x0A line endings), and "ANSI" in its status bar.
>>>
>>
>
> Notepad++ says "Unix" only because of the Unix line break. "Unix"
> doesn't say anything about the used encoding. All encodings, OEM, ANSI,
> ISO, Unicode, ..., can have Unix or DOS or Mac line breaks.
>
> regards,
>
> --
> Erwin Waterlander

Indeed "UNIX" says nothing about the encoding. Immediately after "UNIX", what do we see but:

    (for the 0x0A line endings)
             ^^^^^^^^^^^^^^^^^

So clearly I know why Notepad++ said "UNIX".

For reasons best known to yourself, you ignored the rest of the sentence where I wrote:

    *and "ANSI" in its status bar*.
          ^^^^

So let me try again:

1) I opened the file in Notepad. I noticed that what was displayed was different from the MSYS output, which you said is ANSI by default.

2) Notepad uses ANSI by default, but it also works with UTF-8, Unicode and Unicode Big-endian. Notepad does not conveniently display the name of the encoding that it is using. You can find out by File -> Save As. I did that and it said ANSI as I expected.

3) To confirm it, I opened the file in Notepad++, which *does* show the encoding that it is using. It said ANSI as I expected, and the output was the same as in Notepad.

4) However, the output was different from the MSYS window, so back to my original question:

If MSYS uses ANSI by default, and Notepad uses ANSI by default, then why is the file not displayed the same in both windows?

Regards,
John Brown. |
From: Renato S. <br....@gm...> - 2013-02-13 13:06:19
|
2013/2/12 John Brown <joh...@ho...>
> In a MSYS window, `cat output.dat' prints
> %TOTAL O¸'.23
>
> Notepad displays it as
> ‰TOTAL Œ¸‚.23
>

2013/2/13 John Brown <joh...@ho...>
> If MSYS uses ANSI by default, and Notepad uses ANSI by default,
> then why is the file not displayed the same in both windows?
>

Have you noticed how similar they look? Try applying the same font to the terminal and the text editor, and the two should look identical, as you want. |
From: Renato S. <br....@gm...> - 2013-02-13 13:27:17
|
2013/2/13 Renato Silva <br....@gm...>
> Have you noticed how similar they look? Try applying the same font
> to the terminal and the text editor, and the two should look identical, as you want.

Your terminal's font is likely unable to display these characters correctly in the same encoding you are using in the text editors.

Also, a tip for your guessing process (which I think you are already applying anyway): try to find encodings in which the characters corresponding to those bytes make some sense in your context. For example, if the permille sign (‰) makes sense in your context, then the encoding you want could be Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252> (assuming only one encoding was used in the file), since its entry for 0x89 is exactly that character. |
From: John B. <joh...@ho...> - 2013-02-13 13:49:19
Attachments:
msys.png
notepad.png
|
On Wed, 13 Feb 2013 11:05:31 -0200, Renato Silva wrote: > 2013/2/12 John Brown > In a MSYS window, `cat output.dat' prints > %TOTAL O¸'.23 > > Notepad displays it as > ‰TOTAL Œ¸‚.23 > > Have you noticed how similar they look like? Try applying the same font > for the terminal and the text editor, and it should look equal as you > want. I do not think that they look similar at all, even as text in my email. I set the font to Lucida Console 24pt in MSYS and Notepad. Please see the results as I see them in the attached PNG files msys.png and notepad.png Of the 4 visible non-ASCII characters, (remember, only the non-ASCII characters are problematic) only one (the one that I have been calling a speck of dust) is the same in both images. The percent sign in MSYS is the one that I learned to write all those years ago, but the one in Notepad has an extra zero or oval or whatever you want to call it. The O in MSYS becomes something completely different in Notepad. The single quote in MSYS is a comma in Notepad. Regards, John Brown. |
From: Earnie B. <ea...@us...> - 2013-02-13 13:58:17
|
On Wed, Feb 13, 2013 at 8:49 AM, John Brown wrote: > The percent sign in MSYS is the one that I learned to write all those years > ago, but the one in Notepad has an extra zero or oval or whatever you want > to call it. The O in MSYS becomes something completely different in > Notepad. The single quote in MSYS is a comma in Notepad. Try using stty command to set the characteristics of the terminal. Perhaps ``stty raw''? -- Earnie -- https://sites.google.com/site/earnieboyd |
From: John B. <joh...@ho...> - 2013-02-13 14:34:30
|
On Wed, 13 Feb 2013 08:58:09 -0500, Earnie Boyd wrote: > > Try using stty command to set the characteristics of the terminal. > > Perhaps ``stty raw''? > > -- > Earnie Hello Earnie, I ran `stty raw' and then `cat output.dat', but it had no effect. I tried `stty istrip' (clear high bit) just to see what would happen, but there was no change either. Regards, John Brown. |
From: John B. <joh...@ho...> - 2013-02-13 14:04:54
|
Hello Renato, I needed to scroll down to see the rest of your message. Thanks to you I now know that ‰ = permille. At least one other poster mentioned it, but I did not know which character he was talking about. > > Also, a tip for your guessing process (which I think you are already > applying anyway): trying to find out encodings where the characters > corresponding to those bytes make some sense in your context. That's right. My message to you included a detailed explanation of my reasoning process. In any case, as I said, I know how to process the file. Now I am investigating the difference between Notepad and MSYS output. They should be the same if they are using the same encoding, but they are not. Regards, John Brown. |
From: Renato S. <br....@gm...> - 2013-02-13 17:00:54
|
2013/2/13 John Brown <joh...@ho...>
> I do not think that they look similar at all, even as text in my email.
>
> I set the font to Lucida Console 24pt in MSYS and Notepad. Please see the
> results as I see them in the attached PNG files msys.png and notepad.png
>

I was thinking that the bytes were being converted to a closer equivalent because the font was unable to display the original character. However, I don't think percent is a good replacement for permille, and a question mark or something similar is probably always better for these cases. Besides that, I get what's in your screenshot here too: the same font leads to quite different results.

I think the reason why you see " ‰TOTAL Œ¸‚.23" in the text editor and " %TOTAL O¸'.23" in the command prompt is rather that in the latter the bytes are being converted from Windows-1252/Latin1 to CP850 (even though iconv -f latin1 -t cp850 does not print the exact same output).

2013/2/13 John Brown <joh...@ho...>
> Now I am investigating the difference between Notepad and MSYS output.
> They should be the same if they are using the same encoding, but they are
> not.
>

If you run the first command below, you should see exactly the same output from cat as you see in the text editors. Without it, even though you are printing the same bytes (which we are assuming here to be latin1/cp1252), you are printing them to the Windows console, which has a different encoding set (cp850 here), and hence your bytes are mistaken for cp850 data when they actually are not.

$ cmd //c chcp 1252
$ cat yourfile
‰TOTAL Œ¸‚.23 |
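One plausible reason why converting from latin1 does not give the same result as converting from Windows-1252: in Latin-1 (ISO-8859-1) the bytes 0x80 to 0x9F are control characters, whereas Windows-1252 assigns printable characters such as ‰ and Œ to that range. The small Python check below illustrates the difference; it is a sketch of the encodings involved, not a description of what iconv or the console does internally, and the helper `strict_to_cp850` is a made-up name.

```python
raw = bytes.fromhex("A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33")

# Latin-1 decodes 0x89 to the control character U+0089;
# Windows-1252 decodes the same byte to the per mille sign U+2030.
latin1_text = raw.decode("latin-1")
cp1252_text = raw.decode("cp1252")

def strict_to_cp850(text: str) -> bool:
    """Return True if every character has an exact CP850 equivalent."""
    try:
        text.encode("cp850")
        return True
    except UnicodeEncodeError:
        return False

latin1_ok = strict_to_cp850(latin1_text)   # False: U+0089 is not in CP850
cp1252_ok = strict_to_cp850(cp1252_text)   # False: U+2030 is not in CP850
```

Neither interpretation converts to CP850 without loss, so any tool that produces CP850 output has to substitute something, and which substitute appears depends on the source encoding it assumed.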
From: waterlan <wat...@xs...> - 2013-02-13 15:00:16
|
John Brown wrote on 2013-02-13 13:21:
[earlier quoted text snipped]
> If MSYS uses ANSI by default, and Notepad uses ANSI by default,
> then why is the file not displayed the same in both windows?
>

Because your default OEMCP is CP850, and the msys dll will do some 'smart' conversion of CP1252 to CP850.

As I wrote earlier: if you change your default OEMCP permanently to 1252 in the registry, reboot your PC, start a new MSYS, and set the MSYS terminal font to the TrueType font Lucida Console, then your MSYS shell will show exactly the same as Notepad, because msys will then no longer do a conversion.

See also http://superuser.com/questions/387569/how-do-i-permantly-set-the-command-prompt-codepage-in-windows-7

--
Erwin Waterlander
http://waterlan.home.xs4all.nl/ |
From: John B. <joh...@ho...> - 2013-02-13 15:46:33
|
On Wed, 13 Feb 2013 16:00:04 +0100, Erwin Waterlander wrote:
>
> John Brown wrote on 2013-02-13 13:21:
> > If MSYS uses ANSI by default, and Notepad uses ANSI by default,
> > then why is the file not displayed the same in both windows?
> >
>
> Because your default OEMCP is CP850. And the msys dll will do some
> 'smart' conversion of CP1252 to CP850.
>
> As I wrote earlier:
> If you change your default OEMCP permanently to 1252 in the registry,
> reboot your PC, and start a new MSYS, set MSYS terminal font to true
> type Lucida Console, then your MSYS shell will show exactly the same as
> notepad. Because now msys will not do a conversion.
>
> see also
> http://superuser.com/questions/387569/how-do-i-permantly-set-the-command-prompt-codepage-in-windows-7
>
>
> --
> Erwin Waterlander

It worked. Thanks.

Regards,
John Brown. |