Menu

#2093 AddCharUTF is broken for DBCS code page

Bug
closed-fixed
5
2019-06-27
2019-03-27
Zufu Liu
No

treatAsDBCS is always false, which cause Editor::AddCharUTF() call UTF32FromUTF8(), which may throw exception.

with code.page=936, typing 蓝猫 in SciTE crashed, tying in Notepad2 the text is truncated.

This bug was discovered by Xuewu Liu.

1 Attachments

Discussion

1 2 > >> (Page 1 of 2)
  • Zufu Liu

    Zufu Liu - 2019-03-27

    The title seems not accurate, maybe it's "IME broken for DBCS code page".
    No bug in UTF-8 code page.
    I cann't test AddCharUTF16() used in WM_CHAR / WM_UNICHAR.

     
  • Neil Hodgson

    Neil Hodgson - 2019-03-27

    treatAsDBCS is archaic and is not used by any core platforms. It was retained in case it was used by non-core platforms. There is a lot of history here. See here for example.

    The crash is most likely UTF32FromUTF8 being asked to write too much:

            if (ui == tlen) {
                throw std::runtime_error("UTF32FromUTF8: attempted write beyond end");
            }
    

    The exception is only about a year old but it was still an error before that.

     
  • Zufu Liu

    Zufu Liu - 2019-03-27

    rough stacktrace:

    ScintillaWin::HandleCompositionWindowed()
        AddWString("蓝猫")
            CodePageOfDocument() -> 936
            StringEncode(): "蓝猫" (\u84DD\u732B) -> "\xc0\xb6\xc3\xa8"
            AddCharUTF("\xc0\xb6", 2)
                Editor::AddCharUTF("\xc0\xb6", 2)
                    UTF32FromUTF8("\xc0\xb6", utf32, 1)
                        UTF8BytesOfLead['\xc0'] -> 1
                        UTF8BytesOfLead['\xb6'] -> 1
    

    I think CodePageOfDocument() is also buggy, which doesn't consistent with underline buffer like Document::WordCharacterClass().

     

    Last edit: Zufu Liu 2019-03-27
  • Zufu Liu

    Zufu Liu - 2019-03-28

    Anothor excample, (\u65E5) in cp932 ("\x93\xfa"), because of if ((ch < 0xC0) || (1 == len)), NotifyChar() will use a truncated value 0x93.

     
  • Zufu Liu

    Zufu Liu - 2019-03-28

    Some fix for AddCharUTF (assuming len=0 is already handled at beginning of AddCharUTF):

    int ch = static_cast<unsigned char>(s[0]);
    if (treatAsDBCS) {
        if (len > 1) {
            ch = (ch << 8) | static_cast<unsigned char>(s[1]);
        }
    } else {
       ...
    }
    NotifyChar(ch);
    

    Second part of ScintillaWin::AddCharUTF16():

    const bool treatAsDBCS = IsDBCSCodePage(codePage);
    inBufferCP[size] = '\0';
    AddCharUTF(inBufferCP, size, treatAsDBCS);
    
     
    • Neil Hodgson

      Neil Hodgson - 2019-03-28

      AddCharUTF16 isn't in the stack trace so changing it won't fix the crash. The same result should occur for the various routes to AddCharUTF.

      AddCharUTF is inserting into the document so it could use the document's code page to determine whether calculating the value for NotifyChar should go through UTF32FromUTF8 or just jam the bytes together.

       
  • Zufu Liu

    Zufu Liu - 2019-03-29

    It indeed not fix the crash, just shows it need handled same way as AddWString().

    Here is a problem, document's code page may not the same as CodePageOfDocument(). the latter (depending on font character set of STYLE_DEFAULT) is used to convert the input (UTF-16 or UTF-8) character/string to a buffer used by AddCharUTF().

    That why I say CodePageOfDocument() is buggy, which doesn't consistent with underline buffer like Document::WordCharacterClass().

     
    • Neil Hodgson

      Neil Hodgson - 2019-03-30

      CodePageOfDocument needs to look at the character set to be able to handle single byte character sets for Russian, Greek, etc. .

       
  • Zufu Liu

    Zufu Liu - 2019-04-03

    How about the change:

    UINT CodePageFromCharSet(DWORD characterSet, UINT documentCodePage) noexcept {
        if (documentCodePage) {
            return documentCodePage;
        }
    

    and remove DBCS code pages in the switch (leaving only SBCS code pages), this change matches dbcsCodePage in document,
    the result code page is used to intact (converting in and out) with underline buffer.

    Old CodePageFromCharSet renamed to another name, which is only used by Surface (like SurfaceD2D) to match Win32 GDI rendering with ANSI functions,
    the result code page is only used to view the underline buffer, as a shortcut to re-encoding (convert to UTF-8) the underline buffer with a proper code page.

     
    • Neil Hodgson

      Neil Hodgson - 2019-04-05

      Why are you setting a character set that is not consistent with the code page?

       
      • Zufu Liu

        Zufu Liu - 2019-04-06

        By default , when a font supports "Western European" , Windows font choose dialog shows "Western European" (DEFAULT_CHARSET) in the script dialog instead of a charset matches current locale.

        Not all font implements all charsets, compare three fonts on my system:
        Microsoft Yahei UI supports GB2312, Western European, Greek, Turkish, Central European and Cyrillic.
        Microsoft JhengHei UI supports Western European, Big5 and Greek.
        DejaVu Sans Mono supports Western European, Arabic, Greek, Turkish, Baltic, Central European, Cyrillic, and Vietnamese.

        When using DejaVu Sans Mono with GDI and SC_CHARSET_GB2312, text in ANSI/GBK is not rendered as expected (whole text instead of only Chinese characters is rendered with SimSun 宋体 on my system).
        While in D2D, the same text is rendered as expected with DejaVu Sans Mono and Microsoft YaHei/Microsoft YaHei UI as the fallback font for Chinese characters.

        Thus when using GDI, locale charset can't been set without knowing the font actual supports it.

         
  • Zufu Liu

    Zufu Liu - 2019-04-05

    The change from 3.6.6 to 3.6.7:

    -   case SC_CHARSET_DEFAULT: return documentCodePage;
    +   case SC_CHARSET_DEFAULT: return documentCodePage ? documentCodePage : 1252;
    

    had caused some bugs on using ANSI code page when using fonts with default charset.
    https://github.com/XhmikosR/notepad2-mod/issues/173
    https://github.com/XhmikosR/notepad2-mod/issues/195
    https://github.com/rizonesoft/Notepad3/issues/23

    previously it returns CP_ACP (0), but now 1252 when document code page is zero.

     
    • Neil Hodgson

      Neil Hodgson - 2019-04-05

      SC_CHARSET_DEFAULT now means Windows-1252 on Windows. If you want to use the current Windows ANSI code page then that should be retrieved and the code page and character set initialized to match that.

       
  • Zufu Liu

    Zufu Liu - 2019-04-06

    Maybe DEFAULT_CHARSET on windows really means locale invariant default charset, at least when using a font with "Western European" scripts, it covers ASCII and many Latin-1 characters.

    https://docs.microsoft.com/en-us/windows/desktop/api/wingdi/nf-wingdi-enumfontfamiliesexa

     
  • Zufu Liu

    Zufu Liu - 2019-04-06

    UTF-8 code page has same problem in GDI, when charset is not supported by the font, system fallback font is used to render the whole text.

     
  • Zufu Liu

    Zufu Liu - 2019-04-08

    The patch fix original crash.

     
  • Zufu Liu

    Zufu Liu - 2019-04-08

    Suggest moving SC_INDICATOR_INPUT and other three (and MAXLENINPUTIME in Qt) to a header, they are defined on each platform.
    export the four will let application to custom their colors (the default blue color is hardly to be seen with dark theme).

    I think maxLenInputIME/MAXLENINPUTIME can be removed, the IME itself will limit the input characters (20 or so). for ScintillaWin::AddCharUTF16(), 16 bytes is enough for the two UTF-16 code points.

     

    Last edit: Zufu Liu 2019-04-08
    • Neil Hodgson

      Neil Hodgson - 2019-04-10

      The different platforms have different IME indicators and the input/target/converted/unknown breakdown is limiting. Cocoa is more dynamic and sends its own preferred decorations but that is complex to implement. More cases should be handled in the Qt implementation.

      IIRC maxLenInputIME was used when there were fixed size buffers.

       
  • Zufu Liu

    Zufu Liu - 2019-04-08

    patch updated, replaced pdoc->CodePageFamily() != efUnicode with pdoc->dbcsCodePage != SC_CP_UTF8 and with change to ScintillaWin::
    AddCharUTF16().

    AddCharUTF16() is now same as AddWString(). if AddWString is changed to std::wstring_view, AddCharUTF16 can be changed to a call to AddWString(wsv).

     
  • Neil Hodgson

    Neil Hodgson - 2019-04-10

    Committed change to Editor as [a7d3bc].

    Unifying AddCharUTF16 into AddWString would be worthwhile.

     

    Related

    Commit: [a7d3bc]

  • Neil Hodgson

    Neil Hodgson - 2019-04-10
    • labels: --> scintilla, DBCS
    • status: open --> open-fixed
     
  • Zufu Liu

    Zufu Liu - 2019-04-10

    A naive approach to merge AddCharUTF16.

    If StringEncode() changed to std::wstring_view, const std::wstring uniChar(wsv, i, ucWidth); can be replaced with wsv.substr(i, ucWidth) as in second patch.

     
  • Neil Hodgson

    Neil Hodgson - 2019-04-10

    Broke the AddCharUTF16 changes into two parts.

    [bbef7c] changes StringEncode, StringDecode, and StringMapCase to take view arguments as they are similar.

    [bf93fe] replaces AddCharUTF16 with AddWString and uses views. Changes from AddCharUTF16-2.diff include some additional const and removing the declaration of AddCharUTF16.

     

    Related

    Commit: [bbef7c]
    Commit: [bf93fe]

1 2 > >> (Page 1 of 2)

Log in to post a comment.