cp936/GBK treat 0x80 as valid single byte


#1575 cp936/GBK treat 0x80 as valid single byte

Milestone: Committed

Status: open

Owner: Neil Hodgson

Labels: Scintilla (410) encoding (7) dbcs (7)

Priority: 5

Updated: 2026-01-21

Created: 2026-01-18

Creator: Zufu Liu

Private: No

According to https://github.com/python/cpython/issues/72530, 0x80 in Windows code page 936 and in web GBK maps to the Euro sign € (U+20AC).
The change to IsDBCSValidSingleByte() is simple:

@@ -90,10 +90,13 @@
 bool IsDBCSValidSingleByte(int codePage, int ch) noexcept {
    switch (codePage) {
    case cp932:
+       // Shift_jis
        return ch == 0x80
            || (ch >= 0xA0 && ch <= 0xDF)
            || (ch >= 0xFD);
-
+   case cp936:
+       // GBK
+       return ch == 0x80;
    default:
        return false;
    }

But I am not sure whether it will cause problems on earlier or non-Windows systems.
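
For reference, here is a minimal sketch of how the function might read with the patch applied. The values given to cp932 and cp936 are an assumption (the numeric Windows code page identifiers), and constexpr is added only so that the illustrative checks can run at compile time; neither is part of the patch itself.

```cpp
// Assumption: cp932 and cp936 hold the numeric Windows code page identifiers.
constexpr int cp932 = 932;  // Shift_JIS
constexpr int cp936 = 936;  // GBK

// constexpr is added here only so the checks below run at compile time;
// the patch does not change the function's declaration.
constexpr bool IsDBCSValidSingleByte(int codePage, int ch) noexcept {
    switch (codePage) {
    case cp932:
        // Shift_JIS
        return ch == 0x80
            || (ch >= 0xA0 && ch <= 0xDF)
            || (ch >= 0xFD);
    case cp936:
        // GBK: 0x80 decodes to the Euro sign U+20AC on Windows.
        return ch == 0x80;
    default:
        return false;
    }
}

// Illustrative checks of the patched behaviour.
static_assert(IsDBCSValidSingleByte(cp936, 0x80), "0x80 is a valid single byte in GBK");
static_assert(!IsDBCSValidSingleByte(cp936, 0x81), "0x81 is a GBK lead byte, not a single byte");
static_assert(!IsDBCSValidSingleByte(949, 0x80), "other code pages are unchanged");
```

Only the cp936 case is new; every other code page still falls through to the default branch.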

Discussion

Zufu Liu - 2026-01-18

> not sure whether it will cause problems on earlier or non-Windows systems.

I tested the following code on Windows XP, Vista, and 7:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        char chars[2] = {'\x80'};
        wchar_t code[2] = {0};
        /* Decode the single byte 0x80 using code page 936 (GBK). */
        int len = MultiByteToWideChar(936, 0, chars, 1, code, 2);
        printf("len=%d, code=%04X\n", len, code[0]);
        return 0;
    }

The output is len=1, code=20AC, the same as on Windows 10 and 11.

Neil Hodgson - 2026-01-19

Group: Initial --> Committed

Neil Hodgson - 2026-01-19

Committed as [808977]. Works on Linux/GTK and macOS/Cocoa as well. Failures on other platforms can be addressed if they occur.

Related

Commit: [808977]


Zufu Liu - 2026-01-19

Off-topic: the single-byte range for CP932/Shift-JIS seems to contain EUDC (end-user-defined characters?).
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt
vs https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

EUDC from https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/

Code Page   EUDC                Control
932         0xa0, 0xfd - 0xff
936         0xff
949, 950    0xff                0x80
1361        0xd4 - 0xff         0x80 - 0x83

Should these EUDC also be included?
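
To make the question concrete, the EUDC column above could be expressed as a small lookup table. This is only a hypothetical sketch: the names ByteRange, eudcSingleBytes, and IsSingleByteEUDC are invented for illustration and do not exist in Scintilla, and the Control column is deliberately left out.

```cpp
// Hypothetical table of the single-byte EUDC values listed above.
struct ByteRange {
    int codePage;
    int first;  // first byte of the range, inclusive
    int last;   // last byte of the range, inclusive
};

constexpr ByteRange eudcSingleBytes[] = {
    {932, 0xA0, 0xA0},
    {932, 0xFD, 0xFF},
    {936, 0xFF, 0xFF},
    {949, 0xFF, 0xFF},
    {950, 0xFF, 0xFF},
    {1361, 0xD4, 0xFF},
};

// Returns true when ch falls in a listed EUDC range for codePage.
constexpr bool IsSingleByteEUDC(int codePage, int ch) noexcept {
    for (const ByteRange &range : eudcSingleBytes) {
        if (range.codePage == codePage && ch >= range.first && ch <= range.last) {
            return true;
        }
    }
    return false;
}

// Illustrative checks against the table.
static_assert(IsSingleByteEUDC(932, 0xA0), "0xA0 is listed as EUDC for 932");
static_assert(!IsSingleByteEUDC(936, 0x80), "0x80 in 936 is the Euro sign, not EUDC");
```

The 932 entries (0xA0 and 0xFD to 0xFF) are already accepted by the ranges in IsDBCSValidSingleByte() shown in the patch, so only the rows for 936, 949, 950, and 1361 would represent a change in behaviour.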


Neil Hodgson - 2026-01-19

Without a more specific benefit, such as a report of use of some single-byte EUDC, I think changing behaviour is more likely to produce new problems.

Zufu Liu - 2026-01-20

OK.

Zufu Liu - 2026-01-21

0x80 can be omitted from CP932/Shift_JIS, as it just maps to the C1 control character U+0080 in CP932 and is unsupported in the other Japanese encodings:

>>> pages = ['cp932', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213']
>>> [b'\x80'.decode(page, 'backslashreplace') for page in pages]
['\x80', '\\x80', '\\x80', '\\x80', '\\x80', '\\x80', '\\x80']

Omitting it would cause a visual change, though: the byte is currently rendered by the platform as an invisible character, a box, or a question mark, whereas it would then be rendered by Scintilla as a hex blob.
