
#1575 cp936/GBK treat 0x80 as valid single byte

Group: Committed
Status: open
Priority: 5
Updated: 2026-01-21
Created: 2026-01-18
Creator: Zufu Liu
Private: No

According to https://github.com/python/cpython/issues/72530, byte 0x80 in Windows code page 936 and in web GBK maps to the Euro sign € (U+20AC).
The change to IsDBCSValidSingleByte() is simple:

@@ -90,10 +90,13 @@
 bool IsDBCSValidSingleByte(int codePage, int ch) noexcept {
    switch (codePage) {
    case cp932:
+       // Shift_jis
        return ch == 0x80
            || (ch >= 0xA0 && ch <= 0xDF)
            || (ch >= 0xFD);
-
+   case cp936:
+       // GBK
+       return ch == 0x80;
    default:
        return false;
    }

But I am not sure whether it will cause problems on earlier or non-Windows systems.
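
For experimenting with the patched logic outside Scintilla, the switch above can be modelled in a few lines of Python. This is only an illustrative sketch: the function name `is_dbcs_valid_single_byte` and the constants `CP932` and `CP936` are hypothetical, and the constants are assumed to equal the Windows code page numbers.

```python
# Illustrative Python model of the patched switch; not Scintilla code.
# Assumption: the code page constants equal the Windows code page numbers.
CP932 = 932  # Shift_JIS
CP936 = 936  # GBK


def is_dbcs_valid_single_byte(code_page: int, ch: int) -> bool:
    """Mirror the patched switch for a byte value `ch` in `code_page`."""
    if code_page == CP932:
        return ch == 0x80 or 0xA0 <= ch <= 0xDF or ch >= 0xFD
    if code_page == CP936:
        # Added by the patch: 0x80 is the Euro sign in Windows 936 and web GBK.
        return ch == 0x80
    # Every other code page keeps the default branch.
    return False


# 950 stands in for "any other code page" that falls through to the default.
for page in (CP932, CP936, 950):
    print(page, is_dbcs_valid_single_byte(page, 0x80))
```

With the patch, only 0x80 is accepted as a single byte for code page 936; every other byte value still returns false there.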

Discussion

  • Zufu Liu

    Zufu Liu - 2026-01-18

    not sure whether it will cause problems on earlier or non-Windows systems.

    I tested the following code on Windows XP, Vista and 7:

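    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        // A lone 0x80 byte; the second element is zero-initialised.
        char chars[2] = {'\x80'};
        wchar_t code[2] = {0};
        // Convert one byte from code page 936 (GBK) to UTF-16.
        int len = MultiByteToWideChar(936, 0, chars, 1, code, 2);
        printf("len=%d, code=%04X\n", len, (unsigned)code[0]);
        return 0;
    }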
    

    The output is len=1, code=20AC, the same as on Windows 10 and 11.

     
  • Neil Hodgson

    Neil Hodgson - 2026-01-19
    • Group: Initial --> Committed
     
  • Neil Hodgson

    Neil Hodgson - 2026-01-19

    Committed as [808977]. Works on Linux/GTK and macOS/Cocoa as well. Failures on other platforms can be addressed if they occur.

     

    Related

    Commit: [808977]

  • Neil Hodgson

    Neil Hodgson - 2026-01-19

    Without a more specific benefit, such as a report of use of some single-byte EUDC, I think changing behaviour is more likely to produce new problems.

     
    • Zufu Liu

      Zufu Liu - 2026-01-20

      OK.

       
  • Zufu Liu

    Zufu Liu - 2026-01-21

    0x80 can be omitted from CP932/Shift_JIS, as it just maps to the C1 control character U+0080 in CP932 and is unsupported in the other Japanese encodings:

    >>> pages = ['cp932', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213']
    >>> [b'\x80'.decode(page, 'backslashreplace') for page in pages]
    ['\x80', '\\x80', '\\x80', '\\x80', '\\x80', '\\x80', '\\x80']
    >>>
    

    However, omitting it will cause a visual change: the byte is currently rendered by the platform as an invisible character, a box or a question mark block, whereas it would then be rendered by Scintilla as a hex blob.
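
The visual difference can be sketched in Python. This is a rough illustration under stated assumptions, not Scintilla's drawing code: `render_single_byte` is a hypothetical helper, and the `x80` text only approximates how a hex blob for an invalid byte is assumed to look.

```python
def render_single_byte(ch: int, treated_as_valid: bool) -> str:
    """Hypothetical helper: the text that would be drawn for a lone CP932 byte."""
    if treated_as_valid:
        # The byte is decoded and handed to the platform; CP932 maps 0x80 to U+0080.
        return bytes([ch]).decode("cp932")
    # Assumed hex blob form for a byte that is not a valid character.
    return f"x{ch:02X}"


current = render_single_byte(0x80, treated_as_valid=True)
proposed = render_single_byte(0x80, treated_as_valid=False)
print(repr(current), repr(proposed))
```

In the first case the platform receives the C1 control U+0080 and decides how to draw it; in the second case the editor itself draws a visible marker for the byte.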

     
