The title seems not accurate, maybe it's "IME broken for DBCS code page".
No bug in UTF-8 code page.
I cann't test AddCharUTF16() used in WM_CHAR / WM_UNICHAR.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
treatAsDBCS is archaic and is not used by any core platforms. It was retained in case it was used by non-core platforms. There is a lot of history here. See here for example.
The crash is most likely UTF32FromUTF8 being asked to write too much:
AddCharUTF16 isn't in the stack trace so changing it won't fix the crash. The same result should occur for the various routes to AddCharUTF.
AddCharUTF is inserting into the document so it could use the document's code page to determine whether calculating the value for NotifyChar should go through UTF32FromUTF8 or just jam the bytes together.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
It indeed not fix the crash, just shows it need handled same way as AddWString().
Here is a problem, document's code page may not the same as CodePageOfDocument(). the latter (depending on font character set of STYLE_DEFAULT) is used to convert the input (UTF-16 or UTF-8) character/string to a buffer used by AddCharUTF().
That why I say CodePageOfDocument() is buggy, which doesn't consistent with underline buffer like Document::WordCharacterClass().
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
and remove DBCS code pages in the switch (leaving only SBCS code pages), this change matches dbcsCodePage in document,
the result code page is used to intact (converting in and out) with underline buffer.
Old CodePageFromCharSet renamed to another name, which is only used by Surface (like SurfaceD2D) to match Win32 GDI rendering with ANSI functions,
the result code page is only used to view the underline buffer, as a shortcut to re-encoding (convert to UTF-8) the underline buffer with a proper code page.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
By default , when a font supports "Western European" , Windows font choose dialog shows "Western European" (DEFAULT_CHARSET) in the script dialog instead of a charset matches current locale.
Not all font implements all charsets, compare three fonts on my system:
Microsoft Yahei UI supports GB2312, Western European, Greek, Turkish, Central European and Cyrillic.
Microsoft JhengHei UI supports Western European, Big5 and Greek.
DejaVu Sans Mono supports Western European, Arabic, Greek, Turkish, Baltic, Central European, Cyrillic, and Vietnamese.
When using DejaVu Sans Mono with GDI and SC_CHARSET_GB2312, text in ANSI/GBK is not rendered as expected (whole text instead of only Chinese characters is rendered with SimSun 宋体 on my system).
While in D2D, the same text is rendered as expected with DejaVu Sans Mono and Microsoft YaHei/Microsoft YaHei UI as the fallback font for Chinese characters.
Thus when using GDI, locale charset can't been set without knowing the font actual supports it.
SC_CHARSET_DEFAULT now means Windows-1252 on Windows. If you want to use the current Windows ANSI code page then that should be retrieved and the code page and character set initialized to match that.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Maybe DEFAULT_CHARSET on windows really means locale invariant default charset, at least when using a font with "Western European" scripts, it covers ASCII and many Latin-1 characters.
Suggest moving SC_INDICATOR_INPUT and other three (and MAXLENINPUTIME in Qt) to a header, they are defined on each platform.
export the four will let application to custom their colors (the default blue color is hardly to be seen with dark theme).
I think maxLenInputIME/MAXLENINPUTIME can be removed, the IME itself will limit the input characters (20 or so). for ScintillaWin::AddCharUTF16(), 16 bytes is enough for the two UTF-16 code points.
Last edit: Zufu Liu 2019-04-08
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
The different platforms have different IME indicators and the input/target/converted/unknown breakdown is limiting. Cocoa is more dynamic and sends its own preferred decorations but that is complex to implement. More cases should be handled in the Qt implementation.
IIRC maxLenInputIME was used when there were fixed size buffers.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
If StringEncode() changed to std::wstring_view, const std::wstring uniChar(wsv, i, ucWidth); can be replaced with wsv.substr(i, ucWidth) as in second patch.
[bbef7c] changes StringEncode, StringDecode, and StringMapCase to take view arguments as they are similar.
[bf93fe] replaces AddCharUTF16 with AddWString and uses views. Changes from AddCharUTF16-2.diff include some additional const and removing the declaration of AddCharUTF16.
The title seems not accurate, maybe it's "IME broken for DBCS code page".
No bug in UTF-8 code page.
I cann't test AddCharUTF16() used in WM_CHAR / WM_UNICHAR.
treatAsDBCS is archaic and is not used by any core platforms. It was retained in case it was used by non-core platforms. There is a lot of history here. See here for example.
The crash is most likely UTF32FromUTF8 being asked to write too much:
The exception is only about a year old but it was still an error before that.
rough stacktrace:
I think
CodePageOfDocument()is also buggy, which doesn't consistent with underline buffer likeDocument::WordCharacterClass().Last edit: Zufu Liu 2019-03-27
Anothor excample,
日(\u65E5) in cp932 ("\x93\xfa"), because ofif ((ch < 0xC0) || (1 == len)), NotifyChar() will use a truncated value 0x93.Some fix for AddCharUTF (assuming len=0 is already handled at beginning of AddCharUTF):
Second part of ScintillaWin::AddCharUTF16():
AddCharUTF16 isn't in the stack trace so changing it won't fix the crash. The same result should occur for the various routes to AddCharUTF.
AddCharUTF is inserting into the document so it could use the document's code page to determine whether calculating the value for NotifyChar should go through UTF32FromUTF8 or just jam the bytes together.
It indeed not fix the crash, just shows it need handled same way as AddWString().
Here is a problem, document's code page may not the same as CodePageOfDocument(). the latter (depending on font character set of STYLE_DEFAULT) is used to convert the input (UTF-16 or UTF-8) character/string to a buffer used by AddCharUTF().
That why I say CodePageOfDocument() is buggy, which doesn't consistent with underline buffer like Document::WordCharacterClass().
CodePageOfDocument needs to look at the character set to be able to handle single byte character sets for Russian, Greek, etc. .
How about the change:
and remove DBCS code pages in the switch (leaving only SBCS code pages), this change matches dbcsCodePage in document,
the result code page is used to intact (converting in and out) with underline buffer.
Old CodePageFromCharSet renamed to another name, which is only used by Surface (like SurfaceD2D) to match Win32 GDI rendering with ANSI functions,
the result code page is only used to view the underline buffer, as a shortcut to re-encoding (convert to UTF-8) the underline buffer with a proper code page.
Why are you setting a character set that is not consistent with the code page?
By default , when a font supports "Western European" , Windows font choose dialog shows "Western European" (DEFAULT_CHARSET) in the script dialog instead of a charset matches current locale.
Not all font implements all charsets, compare three fonts on my system:
Microsoft Yahei UI supports GB2312, Western European, Greek, Turkish, Central European and Cyrillic.
Microsoft JhengHei UI supports Western European, Big5 and Greek.
DejaVu Sans Mono supports Western European, Arabic, Greek, Turkish, Baltic, Central European, Cyrillic, and Vietnamese.
When using DejaVu Sans Mono with GDI and SC_CHARSET_GB2312, text in ANSI/GBK is not rendered as expected (whole text instead of only Chinese characters is rendered with SimSun 宋体 on my system).
While in D2D, the same text is rendered as expected with DejaVu Sans Mono and Microsoft YaHei/Microsoft YaHei UI as the fallback font for Chinese characters.
Thus when using GDI, locale charset can't been set without knowing the font actual supports it.
The change from 3.6.6 to 3.6.7:
had caused some bugs on using ANSI code page when using fonts with default charset.
https://github.com/XhmikosR/notepad2-mod/issues/173
https://github.com/XhmikosR/notepad2-mod/issues/195
https://github.com/rizonesoft/Notepad3/issues/23
previously it returns CP_ACP (0), but now 1252 when document code page is zero.
SC_CHARSET_DEFAULT now means Windows-1252 on Windows. If you want to use the current Windows ANSI code page then that should be retrieved and the code page and character set initialized to match that.
Buy the way,
the page https://www.scintilla.org/LongTermDownload.html is out dated, seems it should link to https://scintilla.sourceforge.io/LongTermDownload.html
OK, [9e5612].
Related
Commit: [9e5612]
Maybe DEFAULT_CHARSET on windows really means locale invariant default charset, at least when using a font with "Western European" scripts, it covers ASCII and many Latin-1 characters.
https://docs.microsoft.com/en-us/windows/desktop/api/wingdi/nf-wingdi-enumfontfamiliesexa
UTF-8 code page has same problem in GDI, when charset is not supported by the font, system fallback font is used to render the whole text.
The patch fix original crash.
Suggest moving SC_INDICATOR_INPUT and other three (and MAXLENINPUTIME in Qt) to a header, they are defined on each platform.
export the four will let application to custom their colors (the default blue color is hardly to be seen with dark theme).
I think maxLenInputIME/MAXLENINPUTIME can be removed, the IME itself will limit the input characters (20 or so). for ScintillaWin::AddCharUTF16(), 16 bytes is enough for the two UTF-16 code points.
Last edit: Zufu Liu 2019-04-08
The different platforms have different IME indicators and the input/target/converted/unknown breakdown is limiting. Cocoa is more dynamic and sends its own preferred decorations but that is complex to implement. More cases should be handled in the Qt implementation.
IIRC maxLenInputIME was used when there were fixed size buffers.
patch updated, replaced
pdoc->CodePageFamily() != efUnicodewithpdoc->dbcsCodePage != SC_CP_UTF8and with change to ScintillaWin::AddCharUTF16().
AddCharUTF16() is now same as AddWString(). if AddWString is changed to std::wstring_view, AddCharUTF16 can be changed to a call to AddWString(wsv).
Committed change to Editor as [a7d3bc].
Unifying AddCharUTF16 into AddWString would be worthwhile.
Related
Commit: [a7d3bc]
A naive approach to merge AddCharUTF16.
If StringEncode() changed to std::wstring_view,
const std::wstring uniChar(wsv, i, ucWidth);can be replaced withwsv.substr(i, ucWidth)as in second patch.Broke the AddCharUTF16 changes into two parts.
[bbef7c] changes StringEncode, StringDecode, and StringMapCase to take view arguments as they are similar.
[bf93fe] replaces AddCharUTF16 with AddWString and uses views. Changes from AddCharUTF16-2.diff include some additional
constand removing the declaration of AddCharUTF16.Related
Commit: [bbef7c]
Commit: [bf93fe]