
#67 Detect console codepage when getting text from it

Milestone: v1.0 (example)
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2023-02-02
Created: 2023-02-02
Creator: Maximin
Private: No

Hello!

It seems that Blat does not honor the console codepage when reading text from the console if the system codepage differs from the console input codepage. For historical reasons, that is the case in many typical Japanese and Cyrillic system setups. For example, a system codepage of 1251 with a console input codepage of 866 is common.
The same symbols passed through the "body" command-line parameter come out OK, because they arrive in the system encoding.
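
To illustrate the mismatch, here is a minimal sketch, not taken from Blat. The two decoder functions are hypothetical helpers that cover only the basic Russian letters of each codepage, which is enough to show that one byte means different letters in CP866 and CP1251.

```cpp
#include <cstdint>

// Partial CP866 decoder (hypothetical helper): ASCII plus Russian letters.
// Bytes outside the covered ranges are reported as U+FFFD in this sketch.
std::uint32_t decodeCp866(unsigned char b) {
    if (b < 0x80) return b;                                  // ASCII
    if (b <= 0xAF) return 0x0410 + (b - 0x80);               // U+0410..U+043F
    if (b >= 0xE0 && b <= 0xEF) return 0x0440 + (b - 0xE0);  // U+0440..U+044F
    if (b == 0xF0) return 0x0401;                            // capital IO
    if (b == 0xF1) return 0x0451;                            // small io
    return 0xFFFD;                                           // not covered here
}

// Partial CP1251 decoder (hypothetical helper): same coverage as above.
std::uint32_t decodeCp1251(unsigned char b) {
    if (b < 0x80) return b;                                  // ASCII
    if (b >= 0xC0) return 0x0410 + (b - 0xC0);               // U+0410..U+044F
    if (b == 0xA8) return 0x0401;                            // capital IO
    if (b == 0xB8) return 0x0451;                            // small io
    return 0xFFFD;                                           // not covered here
}
```

With these helpers, the byte 0xE0 typed on a CP866 console is the letter U+0440, but the same byte read as CP1251 is U+0430, a different letter. ASCII bytes decode identically in both, which is why plain English text is not affected.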

This makes it impossible to mail messages with national symbols when using a single hyphen as the body parameter.
I found this while struggling with a Perl script, but it is easily reproduced from a plain command prompt whose input codepage differs from the system one.

It seems that Blat should use the result of GetConsoleCP, instead of CP_ACP as it does now, for body text read from the console, and then call MultiByteToWideChar with that codepage. That way the text is decoded to Unicode before Blat tries to encode the body into some other encoding.
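
The suggestion could be sketched roughly as follows. This is not Blat code: the function name decodeConsoleInput is hypothetical, the fallback to CP_ACP when no console is attached is an assumption, and the non-Windows branch is only a placeholder so the sketch builds elsewhere.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Decode a whole buffer of bytes read from the console into wide text.
// Hypothetical helper, not part of Blat.
std::wstring decodeConsoleInput(const std::string& raw) {
    if (raw.empty()) return std::wstring();
#ifdef _WIN32
    UINT cp = GetConsoleCP();   // returns 0 on failure, e.g. no console attached
    if (cp == 0) cp = CP_ACP;   // assumption: fall back to the system codepage
    int len = static_cast<int>(raw.size());
    // First call: ask how many wide characters the result needs.
    int wideLen = MultiByteToWideChar(cp, 0, raw.data(), len, NULL, 0);
    if (wideLen <= 0) return std::wstring();
    std::wstring wide(static_cast<std::wstring::size_type>(wideLen), L'\0');
    // Second call: perform the actual conversion into the buffer.
    MultiByteToWideChar(cp, 0, raw.data(), len, &wide[0], wideLen);
    return wide;
#else
    // Placeholder for non-Windows builds only: widen each byte one-to-one.
    std::wstring wide;
    for (unsigned char b : raw) wide += static_cast<wchar_t>(b);
    return wide;
#endif
}
```

Converting the whole buffer in one call, rather than byte by byte, also matters for double-byte codepages such as Japanese 932, where a single lead byte cannot be converted on its own. The wide text can then be encoded to the target charset, for example with WideCharToMultiByte and CP_UTF8.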

You can verify the result by setting -charset equal to the console codepage. The correct result is a log whose body contains the same above-ASCII symbols that were entered on the console. That ensures the "console-Unicode-console" cycle finished correctly.

Otherwise Blat sends complete garbage.

I don't have enough skill to make a patch myself, because my C++ experience was very long ago. But it is enough to find the cause and make my suggestions.

I hope that someone will patch this.

Discussion

  • Maximin

    Maximin - 2023-02-02

    And yes, this applies to NT systems (GetConsoleCP is available starting from Windows 2000), and such setups date from that era.

     
  • Maximin

    Maximin - 2023-02-02

    I somewhat underestimated my C++ skills. I still remember something. :)
    So I was somewhat wrong to assume that decoding console input is required only when its codepage is not equal to the system one.
    No, it should ALWAYS be decoded to Unicode. I haven't managed to test it when the console is Unicode (UTF-7 or UTF-8). It seems that Unicode console input is still broken in Windows. For that case, I think, the code needs further improvement.
    I don't know how to submit code changes to Blat, so I'm posting a diff for the blat.cpp file here. Now it works.
    Also, I should note that auto-guessing whether a Unicode string was passed from its first 3 symbols is a bad idea overall.

    I think further improvement of this patch should go as follows.
    Now the console always produces Unicode. So the output mail codepage in this case should always be UTF-8, unless the user wants to shoot themselves in the foot and explicitly specifies a different codepage (possibly incompatible with the symbols entered) in a parameter.
    But currently, if the first 3 symbols aren't plain ASCII, Blat shoots itself in the foot automatically by assuming that the output codepage should be ISO 8859-1, which is very likely incorrect.
    So if you are working with console input and want to be 100% sure that your non-ASCII symbols are encoded correctly, you should always explicitly specify "-codepage" (most likely UTF-8).
    In that case everything works just perfectly.
    I haven't managed to do that yet.
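
To show why UTF-8 is a safe output once the input has been decoded to Unicode, here is a minimal portable sketch. It is not Blat code: cp866ToCodePoint, appendUtf8 and cp866ToUtf8 are hypothetical helpers, and the CP866 table covers only ASCII and the basic Russian letters.

```cpp
#include <cstdint>
#include <string>

// Partial CP866 decoder (hypothetical helper): ASCII plus Russian letters.
std::uint32_t cp866ToCodePoint(unsigned char b) {
    if (b < 0x80) return b;                                  // ASCII
    if (b <= 0xAF) return 0x0410 + (b - 0x80);               // U+0410..U+043F
    if (b >= 0xE0 && b <= 0xEF) return 0x0440 + (b - 0xE0);  // U+0440..U+044F
    if (b == 0xF0) return 0x0401;                            // capital IO
    if (b == 0xF1) return 0x0451;                            // small io
    return 0xFFFD;                                           // not covered here
}

// Append one Unicode code point to a string as UTF-8.
void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode console bytes from CP866 and re-encode them as UTF-8.
std::string cp866ToUtf8(const std::string& raw) {
    std::string out;
    for (unsigned char b : raw) appendUtf8(out, cp866ToCodePoint(b));
    return out;
}
```

Every code point has a UTF-8 encoding, so this direction cannot lose symbols. Encoding the same text to ISO 8859-1 would have no representation for the Cyrillic letters at all.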

    Patch below.

    -                CommonData.TempConsole.Add((char)i);
    -            }
    +               //taking into account console codepage
    +               char inputChar = (char)i;
    +               UINT consoleCP = GetConsoleCP();
    +               if (consoleCP && (consoleCP != CP_UTF7 && consoleCP != CP_UTF8))
    +               {
    +                   int wideCount = MultiByteToWideChar(consoleCP, 0, &inputChar, 1, NULL, 0); //returns a count of wide chars, 0 on failure
    +                   if (wideCount == 1) //successful conversion of single char is possible
    +                   {
    +                       wchar_t unicodeChar;
    +
    +                       MultiByteToWideChar(consoleCP, 0, &inputChar, 1, &unicodeChar, 1);
    +
    +                       CommonData.TempConsole.Add(unicodeChar);
    +                   }
    +                   else //cannot convert - fallback
    +                       CommonData.TempConsole.Add(inputChar);
    +               }
    +               else //fallback when input console is unknown or already in UTF-7/8 (still broken in a 2023, so that's an untested case)
    +               {
    +                   CommonData.TempConsole.Add(inputChar);
    +               }
    +           }
    
     
