Scintilla / Feature Requests / #1408 Improve DBCS processing

Zufu Liu - 2021-06-23

LexAccessor::IsLeadByte() can be optimized into:

return static_cast<unsigned char>(ch) > 0x80 && encodingType == dbcs && pAccess->IsDBCSLeadByte(ch);
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2021-07-02
  
  Committed with change to >= 0x80 to avoid detailed knowledge of valid DBCS lead bytes spread between modules. While every DBCS encoding starts lead bytes at 0x81 or later, its simpler to handle this as 'non-ASCII -> special treatment'.
  https://github.com/ScintillaOrg/lexilla/commit/e594d14612fe26346454ee62c2582302c9c6e5d6
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2021-06-23

result in inconsistent rendering appearance and editing experience.

That is difficult to duplicate and check whether fixed. Just what are the problems?

Can a test be added to test/simpleTests.py or test/unit/testDocument.cxx that will fail with the current implementation and pass after the fix?

GetCharacterAndWidth.diff

This contains extraneous changes that make it difficult to review. A fix should address exactly the problem with no additional optimizations or cleanups. Fixing DBCS should require zero changes to the UTF-8 path.

Is the chunk that reverses the tests for !dbcsCodePage and UTF8IsAscii(leadByte) a good change? Who knows? There is no rationale or benchmark. It is entirely superfluous to using IsDBCSTrailByteInvalid.

Also, I think the name IsDBCSTrailByteInvalid can be shorted into IsDBCSTrailByte to avoid if (!.

That would require inverting the logic within the function with no unit tests.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-06-24

Added GetCharacterAndWidth test case (\xfe\xff for code page 936, 949, 950 and 1361) .
the code in GetCharacterAndWidth() is just de-indented after move !UTF8IsAscii(leadByte) && dbcsCodePage to one line.

As UTF-8 is popular and Scintilla is focus on code editor, check for ASCII first may have benefits.

GetCharacterAndWidth.cxx

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Zufu Liu - 2021-06-24
  
  Updated test case to "\x84\xff.", which is invalid in all DBCS code pages.
  
  SECTION("GetCharacterAndWidth") { doc.DeleteChars(0, doc.Length()); REQUIRE(0 == doc.Length()); doc.SetDBCSCodePage(932); REQUIRE(doc.CodePage() == 932); const Sci::Position length = doc.InsertString(0, "\x84\xff.", 3); REQUIRE(3 == length); REQUIRE(3 == doc.Length()); Sci::Position width = 0; int ch = doc.GetCharacterAndWidth(0, &width); REQUIRE(width == 1); REQUIRE(ch == 0x84); width = 0; ch = doc.GetCharacterAndWidth(1, &width); REQUIRE(width == 1); REQUIRE(ch == 0xff); width = 0; ch = doc.GetCharacterAndWidth(2, &width); REQUIRE(width == 1); REQUIRE(ch == '.'); }
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-06-24

fp= open('test.txt', 'wb'); fp.write(b'\x84\xff.'); fp.close()

for test.txt, it's treated as three characters in UTF-8 but two with DBCS code pages on press left/right arrows; but all rendered as two hex blobs + one normal character.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-06-25

The the confusing function IsDBCSLeadByteInvalid() can be removed, it's only used in Editor::SetRepresentations(). the combination for IsDBCSLeadByteNoExcept + IsDBCSLeadByteInvalid is equivalent to following code (only make sense for 932):

constexpr bool IsDBCSValidSingleByte(int codePage, int ch) noexcept { switch (codePage) { case 932: return ch == 0x80 || (ch >= 0xA0 && ch <= 0xDF) || (ch >= 0xFD); default: return false; } }

attached the changes to Editor::SetRepresentations() and a script to print all valid DBCS lead, tail and single bytes.

Last edit: Zufu Liu 2021-06-25

DBCS.py

SetRepresentations-2.cxx
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-06-26

All current changes:

Remove IsDBCSLeadByteInvalid, add IsDBCSValidSingleByte, update SetRepresentations

Rename IsDBCSTrailByteInvalid to IsDBCSTrailByteNoExcept, tested in DBCS-2.py

Check trail bye in CharacterAfter and GetCharacterAndWidth

DBCS-2.cxx

DBCS-2.diff

DBCS-2.py
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2021-07-01

I don't understand the line

character = character + leadByte;

Which just leads to invalid UTF-8 bytes being reported as 2*leadByte.

I could understand an objection to using a singleton surrogate (which comes from https://www.python.org/dev/peps/pep-0383/ ) but this seems quite confusing.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-07-01

It should be character = 0xDC80 + character; to match old character = 0xDC80 + leadByte;.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2021-07-02

Here is a minimal patch that appears to achieve the initial purpose.

InvalidTrail.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Zufu Liu - 2021-07-02
  
  It does the same thing, but in GetCharacterAndWidth(), IsDBCSLeadByteNoExcept() can be avoided by moving UTF8IsAscii() out from UTF-8 block.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Neil Hodgson - 2021-07-03
    
    by moving UTF8IsAscii() out from UTF-8 block
    
    That can be part of an optimization or code improvement change that causes no change in behaviour.
    
    The InvalidTrail.patch was pushed (accidentally) so is now [64fd59].
    
    Related
    
    Commit: [64fd59]
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2021-07-03

labels: --> scintilla, dbcs

Group: Initial --> Committed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2021-07-03

Committed other changes with further modifications as [b2b3ee], [e13006], [c58daf], [ab3655].

Related

Commit: [ab3655]
Commit: [b2b3ee]
Commit: [c58daf]
Commit: [e13006]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-07-03

The unused function IsDBCSLeadByteInvalid() can be removed.
LenChar(), MovePositionOutsideChar() and NextPosition() will some trail byte checks.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-07-04

Incomplete changes, update GetCharacterAndWidth() test string to "H\x84\xff\x84H" (can be used for all DBCS code pages).

following tests failed, need some change in NextPosition() (before return (pos - 1 - ((pos - posTemp) & 1))).

pos = doc.NextPosition(2, -1); REQUIRE(pos == 1); pos = doc.NextPosition(3, -1); REQUIRE(pos == 2);

DBCS-3-test.cxx

DBCS-3.diff
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2021-07-05
  
  This is causing failures in test/win32Tests.py for the 4-byte text "[\x85\xff]" due to differing results from counting characters then converting them to UTF-16: GetTextLength returns 4 as the 2 bytes are considered separate now but GetText returns a 3 character string "[・]", as the Win32 function MultiByteToWideChar substitutes U+30FB Katakana Middle Dot for bad characters.
  
  It is likely that applications are expecting agreement between these APIs and may fail with the change.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Neil Hodgson - 2021-07-05
    
    Its likely best to fix this particular issue by using MultiByteToWideChar to count the characters although that requires some more work to avoid potentially large allocations.
    
    However, there may be more downstream effects. For example, SCI_POSITIONRELATIVECODEUNITS , SCI_COUNTCODEUNITS and related APIs will have different results and these may interact with features like copy to clipboard.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Neil Hodgson - 2021-07-06
      
      Committed change to WM_GETTEXTLENGTH and WM_GETTEXT as [b41b81].
      
      Related
      
      Commit: [b41b81]
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Zufu Liu - 2021-07-06
        
        https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-getwindowtextlengthw
        https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-getwindowtextw
        
        GetWindowTextLength() and GetWindowText() returns int, so I think a single MultiByteToWideChar() call is enough instead of line by line.
        
        As UTF16FromUTF8 assuming input are valid UTF-8, it can not be used for WM_GETTEXT too.
        
        I think CountUTF16 can be used for WM_GETTEXTLENGTH because
        
        the GetWindowTextLength function may return a value that is larger than the actual length of the text.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Neil Hodgson - 2021-07-06
        
        GetWindowTextLength() and GetWindowText() returns int, so I think a single MultiByteToWideChar() call is enough instead of line by line.
        
        MultiByteToWideChar can only handle contiguous text and the gap may be in a position that divides the bytes of a DBCS character. I didn't want to move the gap for a read-only operation so they are performed line-by-line so as to only allocate enough memory for the longest line instead of enough to copy the requested length which is often the whole document. The UTF-8 case may still allocate a large buffer.
        
        As UTF16FromUTF8 assuming input are valid UTF-8, it can not be used for WM_GETTEXT too.
        
        A variant of UTF16FromUTF8 that provides some defined handling for invalid bytes could be useful here as well as for copying to the clipboard.
        
        There is the question of what to substitute for invalid bytes and whether there should be options to tailor this behaviour. Potential choices would be to ignore invalid bytes; to substitute a known marker character (like '�', '?', or '・'); or a range of values (like Python's surrogateescape) so that sufficiently interested clients could recover the original invalid bytes.
        
        For some of the possible encoding error variations, see:
        https://docs.python.org/3/library/codecs.html#codec-base-classes
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-07-05

It's likely return value for Document::CountUTF16() is already inconsistent with MultiByteToWideChar for invalid UTF-8.

if (utf8status & UTF8MaskInvalid) pos++; else pos += utf8status & UTF8MaskWidth;

Unicode Standard 13.0, "3.9 Unicode Encoding Forms", section "Constraints on Conversion Processes" have some examples.

treat each invalid byte as one code point is consistent with current UTF-8 paths.
Different API may handles consecutive invalid bytes in different ways.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2021-07-05
  
  The above report is for code page 932.
  
  UTF16FromUTF8 is used instead of MultiByteToWideChar for UTF-8. There may be similar inconsistencies with UTF16FromUTF8 but they aren't detected by the unit tests.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2021-07-06

IsDBCSLeadByteInvalid removed with [7412aa].

Related

Commit: [7412aa]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Improve DBCS processing

Group

Searches

Help

#1408 Improve DBCS processing

Discussion

Related

Related

Related

Related