Currently IsDBCSTrailByteInvalid() is only used inside DBCSDrawBytes(); other places unconditionally treat IsDBCSLeadByteNoExcept()==true as a two-byte character, resulting in inconsistent rendering appearance and editing experience.
The attached patch fixes GetCharacterAndWidth() and CharacterAfter().
Also, I think the name IsDBCSTrailByteInvalid could be shortened to IsDBCSTrailByte to avoid writing `if (!...)`.
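A positive-sense predicate could look like the following hedged sketch for Shift-JIS (code page 932), where valid trail bytes are 0x40..0x7E and 0x80..0xFC; the function name and the restriction to one code page are assumptions for illustration, not Scintilla's actual implementation:

```cpp
#include <cassert>

// Hedged sketch: trail-byte validity for Shift-JIS (code page 932) only.
// A renamed IsDBCSTrailByte-style predicate returns true for valid trail
// bytes, so callers write `if (valid)` instead of `if (!invalid)`.
bool IsDBCSTrailByte932(unsigned char trail) noexcept {
    return trail >= 0x40 && trail <= 0xFC && trail != 0x7F;
}
```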
LexAccessor::IsLeadByte() can be optimized into:
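One possible shape of that optimization, as a hedged sketch with an assumed `codePage` parameter rather than LexAccessor's real member layout: treat any non-ASCII byte as needing special handling when a DBCS code page is active, instead of consulting per-encoding lead-byte tables.

```cpp
#include <cassert>

// Hedged sketch (the codePage parameter is an assumption): any byte >= 0x80
// gets special treatment when a DBCS code page is set, avoiding spreading
// knowledge of per-encoding lead-byte ranges across modules.
bool IsLeadByteSketch(int codePage, char ch) noexcept {
    return codePage != 0 && static_cast<unsigned char>(ch) >= 0x80;
}
```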
Committed with a change to `>= 0x80` to avoid detailed knowledge of valid DBCS lead bytes spreading between modules. While every DBCS encoding starts lead bytes at 0x81 or later, it's simpler to handle this as "non-ASCII -> special treatment".
https://github.com/ScintillaOrg/lexilla/commit/e594d14612fe26346454ee62c2582302c9c6e5d6
That is difficult to reproduce and to check whether it is fixed. What exactly are the problems?
Can a test be added to test/simpleTests.py or test/unit/testDocument.cxx that will fail with the current implementation and pass after the fix?
This contains extraneous changes that make it difficult to review. A fix should address exactly the problem with no additional optimizations or cleanups. Fixing DBCS should require zero changes to the UTF-8 path.
Is the chunk that reverses the tests for !dbcsCodePage and UTF8IsAscii(leadByte) a good change? Who knows? There is no rationale or benchmark. It is entirely superfluous to using IsDBCSTrailByteInvalid.
That would require inverting the logic within the function with no unit tests.
Added a GetCharacterAndWidth test case ("\xfe\xff" for code pages 936, 949, 950 and 1361). The code in GetCharacterAndWidth() is just de-indented after moving `!UTF8IsAscii(leadByte) && dbcsCodePage` onto one line. As UTF-8 is popular and Scintilla is focused on code editing, checking for ASCII first may have benefits.
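The ordering idea can be sketched as follows (a hedged illustration; the helper name and return codes are assumptions, not GetCharacterAndWidth() itself): test the cheap ASCII case first, since most bytes in source code are ASCII, and only then branch on the code page.

```cpp
#include <cassert>

// Hedged sketch of checking ASCII before anything else.
inline bool UTF8IsAsciiSketch(unsigned char b) noexcept {
    return b < 0x80;
}

// Classify a lead byte: 0 = single ASCII byte (common fast path),
// 1 = DBCS path (trail byte must still be validated), 2 = UTF-8 path.
int LeadByteKind(unsigned char leadByte, bool dbcsCodePage) noexcept {
    if (UTF8IsAsciiSketch(leadByte))
        return 0;
    return dbcsCodePage ? 1 : 2;
}
```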
Updated the test case to "\x84\xff.", which is invalid in all DBCS code pages. For test.txt, it's treated as three characters in UTF-8 but two with DBCS code pages when pressing the left/right arrows; but all are rendered as two hex blobs + one normal character.
The confusing function IsDBCSLeadByteInvalid() can be removed; it's only used in Editor::SetRepresentations(). The combination of IsDBCSLeadByteNoExcept + IsDBCSLeadByteInvalid is equivalent to the following code (which only makes sense for 932):
Attached are the changes to Editor::SetRepresentations() and a script to print all valid DBCS lead, trail and single bytes.
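The equivalence for code page 932 could look roughly like this hedged sketch (function names and exact byte ranges are assumptions based on Shift-JIS, where single bytes are ASCII plus halfwidth katakana 0xA1..0xDF and lead bytes are 0x81..0x9F and 0xE0..0xFC):

```cpp
#include <cassert>

// Hedged sketch for code page 932 (Shift-JIS) only.
bool IsDBCSValidSingleByte932(unsigned char ch) noexcept {
    return ch <= 0x7F || (ch >= 0xA1 && ch <= 0xDF);
}

bool IsDBCSLeadByteNoExcept932(unsigned char ch) noexcept {
    return (ch >= 0x81 && ch <= 0x9F) || (ch >= 0xE0 && ch <= 0xFC);
}

// A byte that is neither a valid single byte nor a valid lead byte needs a
// representation (hex blob) in SetRepresentations.
bool NeedsRepresentation932(unsigned char ch) noexcept {
    return !IsDBCSValidSingleByte932(ch) && !IsDBCSLeadByteNoExcept932(ch);
}
```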
Last edit: Zufu Liu 2021-06-25
All current changes:
1. Remove IsDBCSLeadByteInvalid, add IsDBCSValidSingleByte, update SetRepresentations
2. Rename IsDBCSTrailByteInvalid to IsDBCSTrailByteNoExcept, tested in DBCS-2.py
3. Check trail byte in CharacterAfter and GetCharacterAndWidth
I don't understand the line, which just leads to invalid UTF-8 bytes being reported as `2*leadByte`. I could understand an objection to using a singleton surrogate (which comes from https://www.python.org/dev/peps/pep-0383/ ) but this seems quite confusing.
It should be `character = 0xDC80 + character;` to match the old `character = 0xDC80 + leadByte;`. Here is a minimal patch that appears to achieve the initial purpose.
It does the same thing, but in GetCharacterAndWidth(), IsDBCSLeadByteNoExcept() can be avoided by moving UTF8IsAscii() out of the UTF-8 block.
That can be part of an optimization or code improvement change that causes no change in behaviour.
The InvalidTrail.patch was pushed (accidentally) so is now [64fd59].
Related
Commit: [64fd59]
Committed other changes with further modifications as [b2b3ee], [e13006], [c58daf], [ab3655].
Related
Commit: [ab3655]
Commit: [b2b3ee]
Commit: [c58daf]
Commit: [e13006]
The unused function IsDBCSLeadByteInvalid() can be removed.
LenChar(), MovePositionOutsideChar() and NextPosition() will need some trail byte checks.
Incomplete changes: update the GetCharacterAndWidth() test string to "H\x84\xff\x84H" (which can be used for all DBCS code pages).
The following tests failed; some change is needed in NextPosition() (before `return (pos - 1 - ((pos - posTemp) & 1))`).
This is causing failures in test/win32Tests.py for the 4-byte text "[\x85\xff]" due to differing results from counting characters then converting them to UTF-16: GetTextLength returns 4 as the 2 bytes are now considered separate, but GetText returns a 3-character string "[・]", as the Win32 function MultiByteToWideChar substitutes U+30FB Katakana Middle Dot for bad characters.
It is likely that applications are expecting agreement between these APIs and may fail with the change.
It's likely best to fix this particular issue by using MultiByteToWideChar to count the characters, although that requires some more work to avoid potentially large allocations.
However, there may be more downstream effects. For example, SCI_POSITIONRELATIVECODEUNITS , SCI_COUNTCODEUNITS and related APIs will have different results and these may interact with features like copy to clipboard.
Committed change to WM_GETTEXTLENGTH and WM_GETTEXT as [b41b81].
Related
Commit: [b41b81]
https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-getwindowtextlengthw
https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-getwindowtextw
GetWindowTextLength() and GetWindowText() return int, so I think a single MultiByteToWideChar() call is enough instead of going line by line.
As UTF16FromUTF8 assumes its input is valid UTF-8, it cannot be used for WM_GETTEXT either.
I think CountUTF16 can be used for WM_GETTEXTLENGTH because MultiByteToWideChar can only handle contiguous text and the gap may be in a position that divides the bytes of a DBCS character. I didn't want to move the gap for a read-only operation, so conversions are performed line-by-line so as to only allocate enough memory for the longest line instead of enough to copy the requested length, which is often the whole document. The UTF-8 case may still allocate a large buffer.
A variant of UTF16FromUTF8 that provides some defined handling for invalid bytes could be useful here as well as for copying to the clipboard.
There is the question of what to substitute for invalid bytes and whether there should be options to tailor this behaviour. Potential choices would be to ignore invalid bytes; to substitute a known marker character (like '�', '?', or '・'); or a range of values (like Python's `surrogateescape`) so that sufficiently interested clients could recover the original invalid bytes. For some of the possible encoding error variations, see:
https://docs.python.org/3/library/codecs.html#codec-base-classes
It's likely the return value of Document::CountUTF16() is already inconsistent with MultiByteToWideChar for invalid UTF-8.
Unicode Standard 13.0, "3.9 Unicode Encoding Forms", section "Constraints on Conversion Processes", has some examples.
Treating each invalid byte as one code point is consistent with the current UTF-8 paths.
Different APIs may handle consecutive invalid bytes in different ways.
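The "one code unit per invalid byte" rule can be sketched as a portable counter (a hedged illustration, not Document::CountUTF16 itself; the name and the exact lead-byte classification are assumptions):

```cpp
#include <cassert>
#include <string>

// Hedged sketch: count the UTF-16 code units needed for a byte string,
// charging each invalid byte one code unit.
size_t CountUTF16Sketch(const std::string &s) noexcept {
    size_t units = 0;
    size_t i = 0;
    while (i < s.length()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        size_t len = 1;   // bytes consumed from the input
        size_t add = 1;   // UTF-16 code units produced
        if (lead < 0x80) {
            // ASCII: one byte, one code unit
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4;
            add = 2;      // supplementary plane: surrogate pair
        }
        // If any continuation byte is missing or malformed, fall back to
        // treating the lead byte alone as one invalid byte.
        bool ok = i + len <= s.length();
        for (size_t k = 1; ok && k < len; k++) {
            ok = (static_cast<unsigned char>(s[i + k]) & 0xC0) == 0x80;
        }
        if (!ok) {
            len = 1;
            add = 1;
        }
        units += add;
        i += len;
    }
    return units;
}
```

With this rule the 4-byte text "[\x85\xff]" from the earlier report counts as 4 code units, whereas MultiByteToWideChar's substitution behaviour yields 3 characters.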
The above report is for code page 932.
UTF16FromUTF8 is used instead of MultiByteToWideChar for UTF-8. There may be similar inconsistencies with UTF16FromUTF8 but they aren't detected by the unit tests.
IsDBCSLeadByteInvalid removed with [7412aa].
Related
Commit: [7412aa]