Menu

#1408 Improve DBCS processing

Committed
closed
5
2021-07-26
2021-06-23
Zufu Liu
No

Currently IsDBCSTrailByteInvalid() is only used inside DBCSDrawBytes(), other places unconditional treats IsDBCSLeadByteNoExcept()==true as two bytes, result in inconsistent rendering appearance and editing experience.

Attachment patch fixed GetCharacterAndWidth() and CharacterAfter().

Also, I think the name IsDBCSTrailByteInvalid can be shorted into IsDBCSTrailByte to avoid if (!.

1 Attachments

Discussion

1 2 > >> (Page 1 of 2)
  • Zufu Liu

    Zufu Liu - 2021-06-23

    LexAccessor::IsLeadByte() can be optimized into:

    return static_cast<unsigned char>(ch) > 0x80
        && encodingType == dbcs && pAccess->IsDBCSLeadByte(ch);
    
     
    • Neil Hodgson

      Neil Hodgson - 2021-07-02

      Committed with change to >= 0x80 to avoid detailed knowledge of valid DBCS lead bytes spread between modules. While every DBCS encoding starts lead bytes at 0x81 or later, its simpler to handle this as 'non-ASCII -> special treatment'.
      https://github.com/ScintillaOrg/lexilla/commit/e594d14612fe26346454ee62c2582302c9c6e5d6

       
  • Neil Hodgson

    Neil Hodgson - 2021-06-23

    result in inconsistent rendering appearance and editing experience.

    That is difficult to duplicate and check whether fixed. Just what are the problems?

    Can a test be added to test/simpleTests.py or test/unit/testDocument.cxx that will fail with the current implementation and pass after the fix?

    GetCharacterAndWidth.diff

    This contains extraneous changes that make it difficult to review. A fix should address exactly the problem with no additional optimizations or cleanups. Fixing DBCS should require zero changes to the UTF-8 path.

    Is the chunk that reverses the tests for !dbcsCodePage and UTF8IsAscii(leadByte) a good change? Who knows? There is no rationale or benchmark. It is entirely superfluous to using IsDBCSTrailByteInvalid.

    Also, I think the name IsDBCSTrailByteInvalid can be shorted into IsDBCSTrailByte to avoid if (!.

    That would require inverting the logic within the function with no unit tests.

     
  • Zufu Liu

    Zufu Liu - 2021-06-24

    Added GetCharacterAndWidth test case (\xfe\xff for code page 936, 949, 950 and 1361) .
    the code in GetCharacterAndWidth() is just de-indented after move !UTF8IsAscii(leadByte) && dbcsCodePage to one line.

    As UTF-8 is popular and Scintilla is focus on code editor, check for ASCII first may have benefits.

     
    • Zufu Liu

      Zufu Liu - 2021-06-24

      Updated test case to "\x84\xff.", which is invalid in all DBCS code pages.

      SECTION("GetCharacterAndWidth") {
          doc.DeleteChars(0, doc.Length());
          REQUIRE(0 == doc.Length());
          doc.SetDBCSCodePage(932);
          REQUIRE(doc.CodePage() == 932);
          const Sci::Position length = doc.InsertString(0, "\x84\xff.", 3);
          REQUIRE(3 == length);
          REQUIRE(3 == doc.Length());
          Sci::Position width = 0;
          int ch = doc.GetCharacterAndWidth(0, &width);
          REQUIRE(width == 1);
          REQUIRE(ch == 0x84);
          width = 0;
          ch = doc.GetCharacterAndWidth(1, &width);
          REQUIRE(width == 1);
          REQUIRE(ch == 0xff);
          width = 0;
          ch = doc.GetCharacterAndWidth(2, &width);
          REQUIRE(width == 1);
          REQUIRE(ch == '.');
      }
      
       
  • Zufu Liu

    Zufu Liu - 2021-06-24
    fp= open('test.txt', 'wb'); fp.write(b'\x84\xff.'); fp.close()
    

    for test.txt, it's treated as three characters in UTF-8 but two with DBCS code pages on press left/right arrows; but all rendered as two hex blobs + one normal character.

     
  • Zufu Liu

    Zufu Liu - 2021-06-25

    The the confusing function IsDBCSLeadByteInvalid() can be removed, it's only used in Editor::SetRepresentations(). the combination for IsDBCSLeadByteNoExcept + IsDBCSLeadByteInvalid is equivalent to following code (only make sense for 932):

    constexpr bool IsDBCSValidSingleByte(int codePage, int ch) noexcept {
        switch (codePage) {
        case 932:
            return ch == 0x80
                || (ch >= 0xA0 && ch <= 0xDF)
                || (ch >= 0xFD);
    
        default:
            return false;
        }
    }
    

    attached the changes to Editor::SetRepresentations() and a script to print all valid DBCS lead, tail and single bytes.

     

    Last edit: Zufu Liu 2021-06-25
  • Zufu Liu

    Zufu Liu - 2021-06-26

    All current changes:
    1. Remove IsDBCSLeadByteInvalid, add IsDBCSValidSingleByte, update SetRepresentations
    2. Rename IsDBCSTrailByteInvalid to IsDBCSTrailByteNoExcept, tested in DBCS-2.py
    3. Check trail bye in CharacterAfter and GetCharacterAndWidth

     
  • Neil Hodgson

    Neil Hodgson - 2021-07-01

    I don't understand the line

    character = character + leadByte;
    

    Which just leads to invalid UTF-8 bytes being reported as 2*leadByte.

    I could understand an objection to using a singleton surrogate (which comes from https://www.python.org/dev/peps/pep-0383/ ) but this seems quite confusing.

     
  • Zufu Liu

    Zufu Liu - 2021-07-01

    It should be character = 0xDC80 + character; to match old character = 0xDC80 + leadByte;.

     
  • Neil Hodgson

    Neil Hodgson - 2021-07-02

    Here is a minimal patch that appears to achieve the initial purpose.

     
    • Zufu Liu

      Zufu Liu - 2021-07-02

      It does the same thing, but in GetCharacterAndWidth(), IsDBCSLeadByteNoExcept() can be avoided by moving UTF8IsAscii() out from UTF-8 block.

       
      • Neil Hodgson

        Neil Hodgson - 2021-07-03

        by moving UTF8IsAscii() out from UTF-8 block

        That can be part of an optimization or code improvement change that causes no change in behaviour.

        The InvalidTrail.patch was pushed (accidentally) so is now [64fd59].

         

        Related

        Commit: [64fd59]

  • Neil Hodgson

    Neil Hodgson - 2021-07-03
    • labels: --> scintilla, dbcs
    • Group: Initial --> Committed
     
  • Zufu Liu

    Zufu Liu - 2021-07-03

    The unused function IsDBCSLeadByteInvalid() can be removed.
    LenChar(), MovePositionOutsideChar() and NextPosition() will some trail byte checks.

     
  • Zufu Liu

    Zufu Liu - 2021-07-04

    Incomplete changes, update GetCharacterAndWidth() test string to "H\x84\xff\x84H" (can be used for all DBCS code pages).

    following tests failed, need some change in NextPosition() (before return (pos - 1 - ((pos - posTemp) & 1))).

    pos = doc.NextPosition(2, -1);
    REQUIRE(pos == 1);
    pos = doc.NextPosition(3, -1);
    REQUIRE(pos == 2);
    
     
    • Neil Hodgson

      Neil Hodgson - 2021-07-05

      This is causing failures in test/win32Tests.py for the 4-byte text "[\x85\xff]" due to differing results from counting characters then converting them to UTF-16: GetTextLength returns 4 as the 2 bytes are considered separate now but GetText returns a 3 character string "[・]", as the Win32 function MultiByteToWideChar substitutes U+30FB Katakana Middle Dot for bad characters.

      It is likely that applications are expecting agreement between these APIs and may fail with the change.

       
      • Neil Hodgson

        Neil Hodgson - 2021-07-05

        Its likely best to fix this particular issue by using MultiByteToWideChar to count the characters although that requires some more work to avoid potentially large allocations.

        However, there may be more downstream effects. For example, SCI_POSITIONRELATIVECODEUNITS , SCI_COUNTCODEUNITS and related APIs will have different results and these may interact with features like copy to clipboard.

         
        • Neil Hodgson

          Neil Hodgson - 2021-07-06

          Committed change to WM_GETTEXTLENGTH and WM_GETTEXT as [b41b81].

           

          Related

          Commit: [b41b81]

          • Zufu Liu

            Zufu Liu - 2021-07-06

            https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-getwindowtextlengthw
            https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-getwindowtextw

            GetWindowTextLength() and GetWindowText() returns int, so I think a single MultiByteToWideChar() call is enough instead of line by line.

            As UTF16FromUTF8 assuming input are valid UTF-8, it can not be used for WM_GETTEXT too.

            I think CountUTF16 can be used for WM_GETTEXTLENGTH because

            the GetWindowTextLength function may return a value that is larger than the actual length of the text.

             
            • Neil Hodgson

              Neil Hodgson - 2021-07-06

              GetWindowTextLength() and GetWindowText() returns int, so I think a single MultiByteToWideChar() call is enough instead of line by line.

              MultiByteToWideChar can only handle contiguous text and the gap may be in a position that divides the bytes of a DBCS character. I didn't want to move the gap for a read-only operation so they are performed line-by-line so as to only allocate enough memory for the longest line instead of enough to copy the requested length which is often the whole document. The UTF-8 case may still allocate a large buffer.

              As UTF16FromUTF8 assuming input are valid UTF-8, it can not be used for WM_GETTEXT too.

              A variant of UTF16FromUTF8 that provides some defined handling for invalid bytes could be useful here as well as for copying to the clipboard.

              There is the question of what to substitute for invalid bytes and whether there should be options to tailor this behaviour. Potential choices would be to ignore invalid bytes; to substitute a known marker character (like '�', '?', or '・'); or a range of values (like Python's surrogateescape) so that sufficiently interested clients could recover the original invalid bytes.

              For some of the possible encoding error variations, see:
              https://docs.python.org/3/library/codecs.html#codec-base-classes

               
  • Zufu Liu

    Zufu Liu - 2021-07-05

    It's likely return value for Document::CountUTF16() is already inconsistent with MultiByteToWideChar for invalid UTF-8.

    if (utf8status & UTF8MaskInvalid)
        pos++;
    else
        pos += utf8status & UTF8MaskWidth;
    

    Unicode Standard 13.0, "3.9 Unicode Encoding Forms", section "Constraints on Conversion Processes" have some examples.

    treat each invalid byte as one code point is consistent with current UTF-8 paths.
    Different API may handles consecutive invalid bytes in different ways.

     
    • Neil Hodgson

      Neil Hodgson - 2021-07-05

      The above report is for code page 932.

      UTF16FromUTF8 is used instead of MultiByteToWideChar for UTF-8. There may be similar inconsistencies with UTF16FromUTF8 but they aren't detected by the unit tests.

       
  • Neil Hodgson

    Neil Hodgson - 2021-07-06

    IsDBCSLeadByteInvalid removed with [7412aa].

     

    Related

    Commit: [7412aa]

1 2 > >> (Page 1 of 2)

Log in to post a comment.