
#486 String conversion macros assume UCS-2 and DBCS encoding

8
pending
Internal (140)
1
2023-09-08
2021-02-19
No

The string conversion macros, used widely within the OWLNext source code to convert between narrow and wide strings, assume that wide strings are encoded as UCS-2 and that narrow character sets cannot have more than two code units per code point (character). Neither assumption holds any longer.

Narrow to wide conversion (A2W, A2CW, _A2W and _A2W_A)

The size in bytes of the temporary buffer allocated within A2W is set to two times the length of the narrow string (in code units, including the null-terminator). This may not be enough for the wide string, which in Windows 2000 and later is encoded as UTF-16 and hence may have multiple code units per code point (character). The internal function OwlA2WHelper does no error checking (the return value from MultiByteToWideChar is ignored).

This is also the case for A2CW and the conditional macros _A2W (active only in Unicode build mode) and _A2W_A (active only in ANSI build mode), which are simple wrappers around A2W.

Wide to narrow conversion (W2A, W2CA, _W2A and _W2A_A)

The macro W2A assumes that the narrow string will fit in a buffer twice the length of the wide string (in code units, including the null-terminator). This used to be sufficient for ANSI strings, as Windows' support for multi-byte character sets for narrow ANSI strings was limited to two code units per code point (double-byte character sets). See "Unicode and MBCS". However, the underlying function WideCharToMultiByte does support conversion to UTF-8, which requires up to four code units per code point (and up to three for a single UTF-16 code unit), and Windows now (finally!) has a UTF-8 code page in ANSI mode. See "Use the UTF-8 code page". The buffer allocated by the macro is hence no longer sufficient. In addition, the internal function OwlW2AHelper does no error checking (the return value from WideCharToMultiByte is ignored).

This is also the case for W2CA and the conditional macros _W2A (active only in Unicode build mode) and _W2A_A (active only in ANSI build mode), which are simple wrappers around W2A.
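As an illustration of the alternative to a fixed "twice the length" buffer, here is a minimal portable sketch (C++17) that first measures the exact output size, then converts, and reports failure instead of ignoring it. The function name Utf16ToUtf8Checked is hypothetical and not part of OWLNext; the encoder is hand-written so the example does not depend on the Windows API, and it is not the implementation used in the library.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not an OWLNext function): converts UTF-16 to UTF-8.
// Returns std::nullopt for an unpaired surrogate rather than ignoring the error.
std::optional<std::string> Utf16ToUtf8Checked(std::u16string_view w)
{
  // Pass 1: validate the input and measure the exact number of UTF-8 code units.
  std::size_t size = 0;
  for (std::size_t i = 0; i < w.size(); ++i)
  {
    const char16_t c = w[i];
    if (c >= 0xD800 && c <= 0xDBFF) // High surrogate: must be followed by a low one.
    {
      const bool paired = i + 1 < w.size() && w[i + 1] >= 0xDC00 && w[i + 1] <= 0xDFFF;
      if (!paired) return std::nullopt;
      size += 4; // A surrogate pair (two UTF-16 code units) needs four bytes.
      ++i;
    }
    else if (c >= 0xDC00 && c <= 0xDFFF) return std::nullopt; // Stray low surrogate.
    else size += (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;
  }

  // Pass 2: convert into a string reserved to the measured size.
  std::string out;
  out.reserve(size);
  for (std::size_t i = 0; i < w.size(); ++i)
  {
    char32_t cp = w[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) // Combine the (already validated) surrogate pair.
    {
      cp = 0x10000 + ((cp - 0xD800) << 10) + (w[i + 1] - 0xDC00);
      ++i;
    }
    if (cp < 0x80)
      out += static_cast<char>(cp);
    else if (cp < 0x800)
    {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  assert(out.size() == size); // The measured size and the converted size must agree.
  return out; // std::string keeps the result null-terminated.
}
```

With the Windows API, the analogous approach would be to call WideCharToMultiByte once with an output size of zero to obtain the required buffer size, then call it again with a buffer of that size, checking the return value both times.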

Related

Feature Requests: #174
Wiki: OWLNext_Roadmap_and_Prereleases
Wiki: Strings_in_OWLNext

Discussion

  • Vidar Hasfjord

    Vidar Hasfjord - 2023-08-27

    Here is an example that will fail if the Windows system default ACP is set to UTF-8 (or the application manifest specifies UTF-8):

    // Using the macros:
    {
      USES_CONVERSION;
      const auto s = L"€€";
      const auto n = string(W2A(s));
      CHECK(A2W(n.c_str()) == wstring(s));
    }
    
    // Using new conversion functions (implemented in terms of the macros):
    {
      const auto s = L"€€";
      const auto n = ConvertToNarrow(s);
      CHECK(ConvertToWide(n) == s);
    }
    

    Note that in UTF-8 a euro sign is represented as three code units (0xE2, 0x82, 0xAC). Converting a single euro sign will succeed, since it requires 4 bytes including null-termination, which happens to equal twice the input string length including null-termination: 2*(1 + 1) = 4. For two or more euro signs the conversion fails: two euro signs require a buffer size of 2*3 + 1 = 7, while the macro allocates just 2*(2 + 1) = 6. The conversion function does not write beyond the buffer, but it does not null-terminate the result, causing a buffer overrun on read.
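The buffer arithmetic above can be checked at compile time. This is only a sketch of the sizes involved; the names MacroBufferSize and EuroUtf8Size are made up for illustration and do not exist in OWLNext.

```cpp
#include <cstddef>

// Bytes allocated by the old W2A macro, per the description in this ticket:
// twice the wide string length in code units, including the null-terminator.
constexpr std::size_t MacroBufferSize(std::size_t wideLength)
{
  return 2 * (wideLength + 1);
}

// Bytes needed for n euro signs in UTF-8: three code units each, plus the null-terminator.
constexpr std::size_t EuroUtf8Size(std::size_t n)
{
  return 3 * n + 1;
}

// One euro sign happens to fit exactly.
static_assert(EuroUtf8Size(1) == 4 && MacroBufferSize(1) == 4, "one euro sign fits");

// Two euro signs need 7 bytes, but only 6 are allocated.
static_assert(EuroUtf8Size(2) == 7 && MacroBufferSize(2) == 6, "two euro signs do not fit");
```

The shortfall grows with the length of the string: for n euro signs it is (3n + 1) - (2n + 2) = n - 1 bytes.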

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2023-08-27
    • assigned_to: Vidar Hasfjord
    • Group: unspecified --> Owlet
     
  • Vidar Hasfjord

    Vidar Hasfjord - 2023-08-27
    • status: open --> pending
     
  • Vidar Hasfjord

    Vidar Hasfjord - 2023-08-27

    This issue has been resolved by the removal of the conversion macros [r6464] as well as the reimplementation of ConvertToNarrow and ConvertToWide [r6470].

     

    Related

    Commit: [r6464]
    Commit: [r6470]

  • Vidar Hasfjord

    Vidar Hasfjord - 2023-09-08
    • Group: Owlet --> 8
     
  • Vidar Hasfjord

    Vidar Hasfjord - 2023-09-08

    The old conversion macros have now been removed on the trunk as well [r6537], thereby resolving this issue across all our code, including extension libraries.

     

    Related

    Commit: [r6537]

