2013/2/12 Renato Silva <br.renatosilva@gmail.com>
and from the fact that it doesn't look like UTF-8, CP850 or ISO-8859-1, my guess is that the bytes following that total may represent a number, not a set of characters

Actually, from the fact that it didn't seem to make much sense in any common encoding. However John already explained that it's not the case, and that the file looks fine as ANSI (Windows-1252), given his knowledge of the context involved.

2013/2/12 Erwin Waterlander <waterlan@xs4all.nl>
I think the msys terminal runs in the same code page as cmd.exe. But cat tries to be smart and does a fuzzy translation to CP850. It translates the promille symbol to a percent symbol, and OE (0x8c) to O. I cannot find any documentation on this translation of cat.

We need to remember that there are many agents involved. There is the Windows console (or any other replacement such as MinTTY, rxvt etc.), and there is bash, and there are programs like cat, ls etc. Therefore, the terminal may be working with one encoding, the shell with another, and the other programs with yet another one. For example, starting bash from cmd.exe and echoing a non-ASCII string to a file produces an ANSI-encoded output, instead of CP850 which is the encoding used by cmd.exe. And what Erwin said about cat is also reproducible from bash's built-in echo and printf, for example:

for b in A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33; do
    printf "\x$b\n"
done


This has the same crazy conversion effect as running cat on a file containing such bytes. Thus, as Erwin said later, I am starting to agree that it's a problem with global components of MSYS, not just bash or a specific program. I don't know either what kind of conversion this is, since I don't think replacing permille with a percent sign is a good conversion, and since using iconv to convert from latin1/cp1252 gives an error on the permille symbol.
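
To make the byte values concrete, here is a minimal Python sketch (Python is only my choice for illustration; the thread itself uses bash and cmd.exe) that decodes the bytes from the loop above under both encodings, with no conversion applied. The names SAMPLE, as_ansi and as_oem are made up for this example.

```python
# Bytes taken from the printf loop above.
SAMPLE = bytes.fromhex("A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33")

# Read as ANSI (Windows-1252): what a text editor such as Notepad shows.
as_ansi = SAMPLE.decode("cp1252")

# Read as CP850 without touching the bytes: what one would expect from
# cmd.exe's type on a CP850 console.
as_oem = SAMPLE.decode("cp850")

# repr() keeps the non-breaking spaces visible as escapes.
print(repr(as_ansi))
print(repr(as_oem))
```

Neither reading matches what cat prints in the Windows console, which is why an extra conversion step is suspected.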

2013/2/12 John Brown <johnbrown105@hotmail.com>
If the default code page of the msys dll is ANSI and this is responsible for the output of `cat', then why is that output different from what is displayed in Notepad?
 
2013/2/13 John Brown <johnbrown105@hotmail.com>
If MSYS uses ANSI by default, and Notepad uses ANSI by default, then why is the file not displayed the same in both windows?
 
Just to clarify, it is because cat and MSYS as a whole seem to perform a crazy encoding conversion, as described above.

2013/2/13 Earnie Boyd <earnie@users.sourceforge.net>
The cat binary supplied by MSYS sends binary streams of data to the terminal.  It is the terminal's job to interpret the data.  No conversion is done by cat; that is just a misunderstanding of what happens.

Just to clarify, it's not what our test cases indicate, as explained above. I think cat is not a culprit itself, but some shared component of MSYS.
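
One way to separate "the program changed the bytes" from "the terminal interpreted them differently" is to capture a program's output through a pipe and compare it with the input. The sketch below does that; the helper name passes_bytes_unchanged is hypothetical. It only observes the pipe path, so it would not detect a conversion that happens only when writing to the Windows console.

```python
import subprocess

# Same bytes as in the printf loop earlier in this thread.
SAMPLE = bytes.fromhex("A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33")


def passes_bytes_unchanged(command, data: bytes) -> bool:
    """Feed data to the command's stdin and report whether its stdout is identical."""
    result = subprocess.run(
        command,
        input=data,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout == data
```

For instance, passes_bytes_unchanged(["cat"], SAMPLE) could be run from a Python interpreter that is able to launch the MSYS cat.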

2013/2/13 Renato Silva <br.renatosilva@gmail.com>
John, I'm finding your explanation a little confusing. I think you're just trying to guess the encoding at random, but I still don't see how MSYS' "default encoding" would help you with that.

Well, he did explain that with Windows-1252, the content makes some sense to him. John, as for your file format, ok I understand it better now, but it's still just crazy to me. I can't see much sense in their triple-digit thing, for example.

2013/2/13 Renato Silva <br.renatosilva@gmail.com>
  1. The cat command applied to a Latin1 text file shows its contents as the active codepage in Windows console, CP850. I would expect the output to be displayed erroneously as if it was CP850 instead, just like cmd.exe's type does for the same text file.
I think I expressed myself wrongly. The cat command applied to a Latin1 text file converts the source bytes using some crazy procedure as Erwin and I explained above, something that looks like making the CP850 output look similar to that of Latin1/ANSI (for example by replacing permille with percent). I would rather expect the same output as cmd.exe's type command, which erroneously thinks the file is CP850.
  1. A script I wrote, which prints in CP850 by default, is displayed correctly as CP850, as expected.
  2. A testing program written as Latin1 prints text correctly as Latin1 mistaken for CP850. However, when the output is piped to cat (both in bash and cmd.exe), it gets displayed as if the bytes themselves were CP850. I would expect the pipe operation to not change anything.
Correcting myself again, when it's piped to cat it gets displayed as described in the previous paragraph. It's not that it's as if the bytes were CP850, but that it tried to make the text in CP850 look similar to what it looks like in ANSI.

2013/2/13 Renato Silva <br.renatosilva@gmail.com>
Have you noticed how similar they look? Try applying the same font to the terminal and the text editor, and they should look the same, as you want.
 
2013/2/13 Renato Silva <br.renatosilva@gmail.com>
Your terminal's font is likely unable to display these characters correctly in the same encoding you are using in the text editors.

Just to clarify, they look similar because MSYS is doing this (by modifying the bytes, a fuzzy encoding conversion as we're explaining), not because of a different font.

2013/2/13 waterlan <waterlan@xs4all.nl>
Because your default OEMCP is CP850. And the msys dll will do some 'smart' conversion of CP1252 to CP850.

As explained, I think the same.

2013/2/13 Renato Silva <br.renatosilva@gmail.com>
I think the reason why you see " ‰TOTAL Œ¸‚.23" in text editor and " %TOTAL O¸'.23" in command prompt is rather because in the latter the bytes are being converted from Windows-1252/Latin1 to CP850 (even though iconv -f latin1 -t cp850 does not print the exact same output).
 
Actually, iconv gives an error instead, due to the permille. It only runs successfully, with output that is not quite the same, when we instead paste the former string into cmd.exe and pipe it to iconv, as shown below. And to be clear, the byte conversion looks like it is performed by MSYS, maybe by some ancient conversion code inherited from old Cygwin (since, for example, it is replacing ‰ with % instead of giving an error like iconv does)?

>iconv -f latin1 -t cp850 original_bytes.txt
iconv: test.txt:1:1: cannot convert

>echo ‰TOTAL Œ¸‚.23 | iconv -f latin1 -t cp850
%TOTAL O÷'.23
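
The failure on the file is what a strict converter is expected to do: in ISO-8859-1 the byte 0x89 is a C1 control character, in Windows-1252 it is the permille sign, and CP850 contains neither. A minimal Python sketch of the same strictness follows. It uses Python's own codecs, not iconv, and strict_convert is a name invented for this example.

```python
def strict_convert(data: bytes, source: str, target: str = "cp850"):
    """Convert data from source to target, or return None if any
    character has no exact mapping in the target code page."""
    try:
        return data.decode(source).encode(target)
    except UnicodeError:
        return None


# 0x89 cannot be represented in CP850 under either source encoding.
print(strict_convert(b"\x89", "latin-1"))
print(strict_convert(b"\x89", "cp1252"))

# The cedilla (0xB8 in Windows-1252) does exist in CP850, at 0xF7.
print(strict_convert(b"\xb8", "cp1252"))
```

A strict converter therefore cannot be what produces the percent sign; some lossy substitution step is needed for that.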

The following commands should make cat's output look the same as the file contents in the text editors:

Typo :-/
 
2013/2/13 John Brown <johnbrown105@hotmail.com>
Now I am investigating the difference between Notepad and MSYS output. They should be the same if they are using the same encoding, but they are not.

If you run the command below, you should be able to see exactly the same output from cat as you see from the text editors. Without it, even though you are printing the same bytes (which we're assuming here to be latin1/cp1252), you are printing them to the Windows console, which has a different encoding set (cp850 here), and hence your bytes are mistaken for cp850 data when they actually are not.

My mistake again. If the bytes were printed unmodified and displayed as if they were CP850, then the output would look the same as cmd.exe's type, which it doesn't. As explained, it's rather some fuzzy encoding conversion by MSYS itself.

2013/2/13 Renato Silva <br.renatosilva@gmail.com>
I'm not sure if MSYS is doing any conversion here though, since both John's problem (seeing the same bytes differently in text editor and Windows console) and the cat's problem I've reported (see quote below), are perfectly reproducible from outside bash and with the same results.

I think I couldn't reproduce the problem in bash and hence I doubted that MSYS was guilty. But I did today: as shown above, the printf and echo built-ins share the same problem as cat on a CP850 terminal. Also, both problems are actually the same.

2013/2/13 John Brown <johnbrown105@hotmail.com>
I did not fully understand your test cases, but I do not see where your tests show that MSYS does not convert characters. `cat' is a MSYS program. Maybe `cat' does not explicitly convert characters, but when it calls printf, putchar, etc, (which live in msys-1.0.dll), they perform the conversion?

Hopefully you did not understand because I didn't either @@. But I have corrected it and I hope you do now. Yes, it looks like some shared component of MSYS is doing the crazy conversion, rather than only cat. Maybe non-interactive scripts are not affected and MSYS somehow detects the Windows console and performs this crazy conversion only there? I ask because I just changed MinTTY to CP850 and cat didn't make such a translation (the file content is displayed as in cmd.exe's type).
So all this in sum:
  1. You see different outputs (but somewhat similar) when you expected the same because your data is ANSI (Windows-1252) and your terminal (Windows console) is CP850, and MSYS for some reason is doing a fuzzy encoding conversion.
  2. At least to me, that looks like a bug, especially if we aren't able at all to obtain the original bytes as they are, regardless of how they're going to look.
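
As a rough model of the fuzzy conversion described in this thread, the Python sketch below decodes the bytes as Windows-1252, converts each character to CP850 when an exact mapping exists, and otherwise falls back to a look-alike. The FALLBACK table is an assumption built only from the three substitutions reported here (‰ to %, Œ to O, ‚ to '); it is not MSYS's actual table, which nobody in the thread has located.

```python
# Hypothetical look-alike table, taken from the substitutions seen in this thread.
FALLBACK = {
    "\u2030": "%",  # permille sign
    "\u0152": "O",  # capital OE ligature
    "\u201a": "'",  # single low-9 quotation mark
}


def fuzzy_to_cp850(data: bytes) -> bytes:
    """Convert Windows-1252 bytes to CP850, substituting look-alikes
    (or '?') for characters CP850 cannot represent."""
    out = bytearray()
    for ch in data.decode("cp1252"):
        try:
            out += ch.encode("cp850")
        except UnicodeEncodeError:
            out += FALLBACK.get(ch, "?").encode("cp850")
    return bytes(out)


SAMPLE = bytes.fromhex("A0 89 54 4F 54 41 4C A0 8C B8 82 2E 32 33")
converted = fuzzy_to_cp850(SAMPLE)

# What a CP850 console would display for the converted bytes.
print(repr(converted.decode("cp850")))
```

With this table the result reads the same as the string seen in the command prompt, which is consistent with the hypothesis but does not prove where the conversion happens. It also shows why the original bytes cannot be recovered afterwards: the substitution is lossy.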