
#538 Incorrect conversion result from utf-8 to wchar_t by codecvt_utf8 on windows

Milestone: v1.0 (example)
Status: open
Owner: niXman
Priority: 5
Updated: 2021-07-15
Created: 2016-05-06
Creator: Li Xiang
Private: No

Environment Tried: Win 10 / Win 2012 64bit, zh_CN/en_US locale
Version Tested: x86_64, seh/sjlj, posix, 5.3.0/5.2.0

Consider the following code:

#include <codecvt>
#include <locale>
#include <cstdio>
#include <string>
#include <windows.h>

using std::wstring_convert;
using std::codecvt_utf8;
using std::wstring;

int main()
{
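    // Convert UTF-8 to wchar_t with the standard library facet.
    wstring_convert<codecvt_utf8<wchar_t>, wchar_t> cv;
    const char* s = u8"file.txt";
    wstring filename = cv.from_bytes(s);

    // Convert the same bytes with the Win32 API for comparison.
    wchar_t buffer[256];
    MultiByteToWideChar(CP_UTF8, 0, s, -1, buffer, 256);

    // Print each code unit produced by codecvt_utf8.
    for (wchar_t c : filename)
    {
        printf("%d ", (int)c);
    }
    printf("\n");

    // Print each code unit produced by MultiByteToWideChar.
    for (int i = 0; buffer[i] != 0; ++i)
    {
        printf("%d ", (int)buffer[i]);
    }
    printf("\n");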

    return 0;
}

compile command line:
g++ 1.cc -O2 -std=c++14 -s

expected result:
102 105 108 101 46 116 120 116
102 105 108 101 46 116 120 116

actual result:
26112 26880 27648 25856 11776 29696 30720 29296
102 105 108 101 46 116 120 116

Every character value in the result is multiplied by 256.

It looks like a regression introduced in 5.2.0; 5.1.0 is OK.
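
For characters below U+0100, a factor of 256 is exactly what exchanging the two bytes of a 16-bit code unit produces: 'f' is 0x0066 (102), and with its bytes swapped it reads as 0x6600 (26112). A minimal sketch of that arithmetic follows. `swap_bytes16` is a hypothetical helper written only for illustration; it is not part of libstdc++ and says nothing about where in the library the swap actually happens.

```cpp
#include <cstdint>

// Hypothetical helper: exchange the two bytes of a 16-bit code unit.
constexpr std::uint16_t swap_bytes16(std::uint16_t u)
{
    return static_cast<std::uint16_t>(((u & 0x00FFu) << 8) | ((u & 0xFF00u) >> 8));
}

// 'f' (102) and '.' (46) map to the first and fifth values of the actual result.
static_assert(swap_bytes16(102) == 26112, "'f' byte-swapped");
static_assert(swap_bytes16(46) == 11776, "'.' byte-swapped");
```

Applying the same swap a second time restores the original value, which is why the output looks scaled rather than corrupted.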

Discussion

  • Li Xiang

    Li Xiang - 2016-05-06

    A second check shows that:

    • Clang 3.7/3.8 on Windows has the same issue.
    • g++ on Linux is not affected.

    The issue seems to be in libstdc++.dll.

     

    Last edit: Li Xiang 2017-01-02
  • Li Xiang

    Li Xiang - 2016-05-09

    It seems codecvt incorrectly chose big endian. Setting little endian does not work.

     
  • Emily Leiviskä

    Emily Leiviskä - 2016-11-02

    We are also affected by this (Win7, Latin-1, MinGW 6.2.0-2).

     

    Last edit: Emily Leiviskä 2016-11-02
  • niXman

    niXman - 2016-11-02
    • assigned_to: niXman
     
  • dejan crnila

    dejan crnila - 2017-06-07

    We are using 6.3.0 and are also affected by this. Is there any information on a solution?

     
    • niXman

      niXman - 2017-06-07

      What about version 7.1?

      I have no solution yet...

       
  • Jan Niklas Hasse

    Still happens in 7.2.0.

     
  • Zufu Liu

    Zufu Liu - 2018-01-31

    Changing the line

    wstring_convert<codecvt_utf8<wchar_t>, wchar_t> cv;
    

    to

    wstring_convert<codecvt_utf8<wchar_t, 0x10ffff, std::little_endian>, wchar_t> cv;
    

    works fine on Win10 x64 1709 with gcc version 7.2.0 (x86_64-posix-seh-rev0, Built by MinGW-W64 project) and produces the expected output:

    D:\>g++ -Wall -Wextra -O2 t1.cpp -s
    
    D:\>a
    102 105 108 101 46 116 120 116
    102 105 108 101 46 116 120 116
    
     

    Last edit: Zufu Liu 2018-01-31
  • Roman Khazanskii

    Still happens in 8.1.0 but workaround by Zufu Liu works!
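
The workaround from this thread can be wrapped in a small function so that call sites do not repeat the template arguments. This is only a sketch: the name `utf8_to_wide` is hypothetical, and the only behavior it relies on is what was reported above, namely that passing `std::little_endian` explicitly gives correct results on the affected MinGW-w64 builds. Note that `<codecvt>` and `std::wstring_convert` were deprecated in C++17, so newer compilers may emit deprecation warnings.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper around the workaround from this thread:
// request little-endian output explicitly via the third template
// argument of codecvt_utf8.
inline std::wstring utf8_to_wide(const std::string& bytes)
{
    using Facet = std::codecvt_utf8<wchar_t, 0x10ffff, std::little_endian>;
    std::wstring_convert<Facet, wchar_t> cv;
    return cv.from_bytes(bytes);  // throws std::range_error on malformed input
}
```

With this in place, the line in the original program becomes `wstring filename = utf8_to_wide(s);`.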

     
