Scintilla / Bugs / #2093 AddCharUTF is broken for DBCS code page

Zufu Liu - 2019-03-27

The title seems not accurate, maybe it's "IME broken for DBCS code page".
No bug in UTF-8 code page.
I cann't test AddCharUTF16() used in WM_CHAR / WM_UNICHAR.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-03-27

treatAsDBCS is archaic and is not used by any core platforms. It was retained in case it was used by non-core platforms. There is a lot of history here. See here for example.

The crash is most likely UTF32FromUTF8 being asked to write too much:

if (ui == tlen) { throw std::runtime_error("UTF32FromUTF8: attempted write beyond end"); }

The exception is only about a year old but it was still an error before that.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

rough stacktrace:

ScintillaWin::HandleCompositionWindowed()
    AddWString("蓝猫")
        CodePageOfDocument() -> 936
        StringEncode(): "蓝猫" (\u84DD\u732B) -> "\xc0\xb6\xc3\xa8"
        AddCharUTF("\xc0\xb6", 2)
            Editor::AddCharUTF("\xc0\xb6", 2)
                UTF32FromUTF8("\xc0\xb6", utf32, 1)
                    UTF8BytesOfLead['\xc0'] -> 1
                    UTF8BytesOfLead['\xb6'] -> 1

I think CodePageOfDocument() is also buggy, which doesn't consistent with underline buffer like Document::WordCharacterClass().

Last edit: Zufu Liu 2019-03-27

Zufu Liu - 2019-03-28

Anothor excample, 日 (\u65E5) in cp932 ("\x93\xfa"), because of if ((ch < 0xC0) || (1 == len)), NotifyChar() will use a truncated value 0x93.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-03-28

Some fix for AddCharUTF (assuming len=0 is already handled at beginning of AddCharUTF):

int ch = static_cast<unsigned char>(s[0]); if (treatAsDBCS) { if (len > 1) { ch = (ch << 8) | static_cast<unsigned char>(s[1]); } } else { ... } NotifyChar(ch);

Second part of ScintillaWin::AddCharUTF16():

const bool treatAsDBCS = IsDBCSCodePage(codePage); inBufferCP[size] = '\0'; AddCharUTF(inBufferCP, size, treatAsDBCS);
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2019-03-28
  
  AddCharUTF16 isn't in the stack trace so changing it won't fix the crash. The same result should occur for the various routes to AddCharUTF.
  
  AddCharUTF is inserting into the document so it could use the document's code page to determine whether calculating the value for NotifyChar should go through UTF32FromUTF8 or just jam the bytes together.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-03-29

It indeed not fix the crash, just shows it need handled same way as AddWString().

Here is a problem, document's code page may not the same as CodePageOfDocument(). the latter (depending on font character set of STYLE_DEFAULT) is used to convert the input (UTF-16 or UTF-8) character/string to a buffer used by AddCharUTF().

That why I say CodePageOfDocument() is buggy, which doesn't consistent with underline buffer like Document::WordCharacterClass().

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2019-03-30
  
  CodePageOfDocument needs to look at the character set to be able to handle single byte character sets for Russian, Greek, etc. .
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-04-03

How about the change:

UINT CodePageFromCharSet(DWORD characterSet, UINT documentCodePage) noexcept { if (documentCodePage) { return documentCodePage; }

and remove DBCS code pages in the switch (leaving only SBCS code pages), this change matches dbcsCodePage in document,
the result code page is used to intact (converting in and out) with underline buffer.

Old CodePageFromCharSet renamed to another name, which is only used by Surface (like SurfaceD2D) to match Win32 GDI rendering with ANSI functions,
the result code page is only used to view the underline buffer, as a shortcut to re-encoding (convert to UTF-8) the underline buffer with a proper code page.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2019-04-05
  
  Why are you setting a character set that is not consistent with the code page?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Zufu Liu - 2019-04-06
    
    By default , when a font supports "Western European" , Windows font choose dialog shows "Western European" (DEFAULT_CHARSET) in the script dialog instead of a charset matches current locale.
    
    Not all font implements all charsets, compare three fonts on my system:
    Microsoft Yahei UI supports GB2312, Western European, Greek, Turkish, Central European and Cyrillic.
    Microsoft JhengHei UI supports Western European, Big5 and Greek.
    DejaVu Sans Mono supports Western European, Arabic, Greek, Turkish, Baltic, Central European, Cyrillic, and Vietnamese.
    
    When using DejaVu Sans Mono with GDI and SC_CHARSET_GB2312, text in ANSI/GBK is not rendered as expected (whole text instead of only Chinese characters is rendered with SimSun 宋体 on my system).
    While in D2D, the same text is rendered as expected with DejaVu Sans Mono and Microsoft YaHei/Microsoft YaHei UI as the fallback font for Chinese characters.
    
    Thus when using GDI, locale charset can't been set without knowing the font actual supports it.
    
    font-charset-gdi-d2d.png
    
    font-charset.png
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-04-05

The change from 3.6.6 to 3.6.7:

- case SC_CHARSET_DEFAULT: return documentCodePage; + case SC_CHARSET_DEFAULT: return documentCodePage ? documentCodePage : 1252;

had caused some bugs on using ANSI code page when using fonts with default charset.
https://github.com/XhmikosR/notepad2-mod/issues/173
https://github.com/XhmikosR/notepad2-mod/issues/195
https://github.com/rizonesoft/Notepad3/issues/23

previously it returns CP_ACP (0), but now 1252 when document code page is zero.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2019-04-05
  
  SC_CHARSET_DEFAULT now means Windows-1252 on Windows. If you want to use the current Windows ANSI code page then that should be retrieved and the code page and character set initialized to match that.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-04-05

Buy the way,
the page https://www.scintilla.org/LongTermDownload.html is out dated, seems it should link to https://scintilla.sourceforge.io/LongTermDownload.html

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2019-04-05
  
  OK, [9e5612].
  
  Related
  
  Commit: [9e5612]
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-04-06

Maybe DEFAULT_CHARSET on windows really means locale invariant default charset, at least when using a font with "Western European" scripts, it covers ASCII and many Latin-1 characters.

https://docs.microsoft.com/en-us/windows/desktop/api/wingdi/nf-wingdi-enumfontfamiliesexa

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-04-06

UTF-8 code page has same problem in GDI, when charset is not supported by the font, system fallback font is used to render the whole text.

font-charset-utf8.png

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-04-08

The patch fix original crash.

dbcs.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-04-08

Suggest moving SC_INDICATOR_INPUT and other three (and MAXLENINPUTIME in Qt) to a header, they are defined on each platform.
export the four will let application to custom their colors (the default blue color is hardly to be seen with dark theme).

I think maxLenInputIME/MAXLENINPUTIME can be removed, the IME itself will limit the input characters (20 or so). for ScintillaWin::AddCharUTF16(), 16 bytes is enough for the two UTF-16 code points.

Last edit: Zufu Liu 2019-04-08

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2019-04-10
  
  The different platforms have different IME indicators and the input/target/converted/unknown breakdown is limiting. Cocoa is more dynamic and sends its own preferred decorations but that is complex to implement. More cases should be handled in the Qt implementation.
  
  IIRC maxLenInputIME was used when there were fixed size buffers.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-04-08

patch updated, replaced pdoc->CodePageFamily() != efUnicode with pdoc->dbcsCodePage != SC_CP_UTF8 and with change to ScintillaWin::
AddCharUTF16().

AddCharUTF16() is now same as AddWString(). if AddWString is changed to std::wstring_view, AddCharUTF16 can be changed to a call to AddWString(wsv).

dbcs2.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-04-10

Committed change to Editor as [a7d3bc].

Unifying AddCharUTF16 into AddWString would be worthwhile.

Related

Commit: [a7d3bc]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-04-10

labels: --> scintilla, DBCS

status: open --> open-fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-04-10

A naive approach to merge AddCharUTF16.

If StringEncode() changed to std::wstring_view, const std::wstring uniChar(wsv, i, ucWidth); can be replaced with wsv.substr(i, ucWidth) as in second patch.

AddCharUTF16-2.diff

AddCharUTF16.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-04-10

Broke the AddCharUTF16 changes into two parts.

[bbef7c] changes StringEncode, StringDecode, and StringMapCase to take view arguments as they are similar.

[bf93fe] replaces AddCharUTF16 with AddWString and uses views. Changes from AddCharUTF16-2.diff include some additional const and removing the declaration of AddCharUTF16.

Related

Commit: [bbef7c]
Commit: [bf93fe]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

AddCharUTF is broken for DBCS code page

Group

Searches

Help

#2093 AddCharUTF is broken for DBCS code page

Discussion

Related

Related

Related