According to https://github.com/python/cpython/issues/72530, 0x80 in Windows code page 936 and in web GBK maps to the Euro sign € (U+20AC).
The change for IsDBCSValidSingleByte() is simple:
@@ -90,10 +90,13 @@
bool IsDBCSValidSingleByte(int codePage, int ch) noexcept {
switch (codePage) {
case cp932:
+ // Shift_jis
return ch == 0x80
|| (ch >= 0xA0 && ch <= 0xDF)
|| (ch >= 0xFD);
-
+ case cp936:
+ // GBK
+ return ch == 0x80;
default:
return false;
}
But I am not sure whether it will cause problems on earlier Windows versions or on non-Windows systems.
Tested the following code on XP, Vista and Win7:
the output is
len=1, code=20AC
as on Win 10 and 11.
Committed as [808977]. Works on Linux/GTK and macOS/Cocoa as well. Failures on other platforms can be addressed if they occur.
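For anyone wanting to check the single-byte classification in isolation, here is a minimal standalone sketch of the function as it reads after the patch. The numeric values given to `cp932` and `cp936` and the helper `CheckSingleByteExamples()` are assumptions made for this illustration, not taken from the commit.

```cpp
#include <cassert>

// Code page identifiers. The numeric values are an assumption of this sketch.
constexpr int cp932 = 932; // Shift_JIS
constexpr int cp936 = 936; // GBK

// Reports whether ch is treated as a valid single byte for the given
// DBCS code page, following the cases shown in the patch above.
bool IsDBCSValidSingleByte(int codePage, int ch) noexcept {
	switch (codePage) {
	case cp932:
		return ch == 0x80
			|| (ch >= 0xA0 && ch <= 0xDF)
			|| (ch >= 0xFD);
	case cp936:
		return ch == 0x80;
	default:
		return false;
	}
}

// Hypothetical helper: spot checks for the cases discussed in this thread.
void CheckSingleByteExamples() {
	assert(IsDBCSValidSingleByte(cp936, 0x80));  // Euro sign in Windows 936
	assert(!IsDBCSValidSingleByte(cp936, 0x81)); // GBK lead byte, not a single byte
	assert(IsDBCSValidSingleByte(cp932, 0xB1));  // half-width katakana range
	assert(!IsDBCSValidSingleByte(949, 0x80));   // code pages not listed return false
}
```

Calling `CheckSingleByteExamples()` from a test harness should pass silently. It only covers the classification logic and says nothing about how a given platform renders the byte.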
Off-topic: the single-byte range for CP932/Shift-JIS seems to contain EUDC (end-user-defined characters?).
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt
vs https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
EUDC from https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/
Should these EUDC also be included?
Without a more specific benefit, such as a report of use of some single-byte EUDC, I think changing behaviour is more likely to produce new problems.
OK.
0x80 could be omitted for CP932/Shift_JIS, as it just maps to the C1 control character U+0080 in CP932 and is unsupported in other Japanese encodings:
Omitting it would cause a visual change, though: the byte would be rendered by the platform as an invisible character, box, or question block, instead of by Scintilla as a hex blob.