Use a pre-computed table for UTF8CharLength and UTF8BytesOfLead to improve performance and make the code cleaner.
Three big changes:
1. In win32/PlatWin.cxx, inside the method void SurfaceGDI::MeasureWidths(Font &font_, const char *s, int len, XYPOSITION *positions), UTF8BytesOfLead is changed to UTF8CharLength. This is a bug fix, since UTF8CharLength is used in the D2D, Qt and Cocoa surfaces.
2. The length checking in size_t UTF32FromUTF8(const char *s, size_t len, unsigned int *tbuf, size_t tlen) (in UniConversion.cxx) is removed, since there is no such check in the corresponding function size_t UTF16FromUTF8(const char *s, size_t len, wchar_t *tbuf, size_t tlen).
3. The array UTF8BytesOfLead is changed from int to const unsigned char; I think it will have better cache locality. The assembly generated by VC (command line option: /FAs) is the same for const unsigned char UTF8BytesOfLead[256]; and const int UTF8BytesOfLead[256];.
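For reference, a sketch of what the pre-computed table can look like, following the convention discussed below where every invalid lead byte (a trail byte 0x80..0xBF, an overlong lead 0xC0..0xC1, or an out-of-range lead 0xF5..0xFF) maps to 1:

    // Bytes in a UTF-8 character, indexed by its lead byte; invalid
    // lead bytes map to 1 so they are consumed as isolated blobs.
    const unsigned char UTF8BytesOfLead[256] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 00 - 0F ASCII
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 10 - 1F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 20 - 2F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 30 - 3F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 40 - 4F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 50 - 5F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 60 - 6F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 70 - 7F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 80 - 8F trail bytes, invalid as lead
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 90 - 9F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // A0 - AF
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // B0 - BF
        1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // C0 - CF (C0/C1 overlong, invalid)
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // D0 - DF
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // E0 - EF
        4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // F0 - FF (F5..FF invalid)
    };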
The patch has UTF8CharLength exactly the same as UTF8BytesOfLead although I don't think that was the intent. The intent appears to be to treat the invalid lead bytes 0x80..0xC1 and 0xF5..0xFF differently for strong compatibility with previous behaviour. UTF8BytesOfLead forces invalid values to length 1 which is likely safer. The utf8LengthFromLead functions are probably just historical, written without noticing that UTF8CharLength existed, rather than consciously differentiated.
Autogeneration has a numbered section feature in template lines that can be seen in src/CaseConvert.cxx to handle multiple insertions. However, the UTF-8 tables should never change, unlike other tables which may change when the Unicode standard is updated or when features are added to Scintilla/SciTE. Thus it should be OK to copy the tables in without the Autogeneration comments.
It would increase safety to add length checking to UTF16FromUTF8 instead of removing it from UTF32FromUTF8. This is checking to prevent writing beyond the end of a memory allocation so it may be better to fail more strongly perhaps with an exception or assertion.
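A sketch of that stronger check, assuming the loop variables of UTF16FromUTF8 as quoted above (ui would be the output index, and outLen the number of UTF-16 code units the current character produces):

    #include <stdexcept>

    // Before writing the next character's code units:
    if (ui + outLen > tlen) {
        // Writing would run past the caller's allocation; fail loudly
        // rather than silently truncating or corrupting memory.
        throw std::runtime_error("UTF16FromUTF8: attempted write beyond end of output");
    }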
Converting from UTF-8 character length to UTF-16 character length is changed to use an expression which, to me, is less obvious than the previous one. If there is a benefit to this then it should be encapsulated in a function.
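For illustration, a minimal helper (the name here is made up; the assumption is that only 4-byte UTF-8 sequences, which encode code points above the BMP, need a UTF-16 surrogate pair):

    // UTF-16 code units produced by one UTF-8 character of byteCount bytes:
    // 4-byte sequences need a surrogate pair, everything else fits in one unit.
    inline size_t UTF16LengthFromUTF8ByteCount(size_t byteCount) noexcept {
        return (byteCount < 4) ? 1 : 2;
    }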
Declaring variables in a switch branch has more complex behaviour so it's better to add a scope for isolation in this change:
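A sketch of the scoped 4-byte branch, assuming us is the input viewed as unsigned char, ui/tbuf are the output index and buffer, and the SURROGATE_* constants come from UniConversion.h:

    switch (charLen) {
    case 4: {
            // Braces give the branch its own scope: 'value' is destroyed at
            // the closing brace and later labels cannot jump over its
            // initialization.
            unsigned int value = (us[i] & 0x7) << 18;
            value |= (us[i + 1] & 0x3F) << 12;
            value |= (us[i + 2] & 0x3F) << 6;
            value |= us[i + 3] & 0x3F;
            tbuf[ui++] = static_cast<wchar_t>(((value - 0x10000) >> 10) + SURROGATE_LEAD_FIRST);
            tbuf[ui] = static_cast<wchar_t>((value & 0x3FF) + SURROGATE_TRAIL_FIRST);
            break;
        }
    // other cases ...
    }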
The patch and script are updated. Length checking in UTF16FromUTF8 is added. Because of the lack of an else branch, the original length checking in UTF32FromUTF8 may have an infinite loop; UTF32FromUTF8 seems buggy.
Last edit: Zufu Liu 2018-03-17
The line is wrong; please change the type to an appropriate one.
You seem to prefer UTF8CharLength over UTF8BytesOfLead, such as in the change to PlatWin.cxx.
I prefer the values in UTF8BytesOfLead since these return 1 for invalid cases. Scintilla's drawing behaviour is to draw invalid bytes in a UTF-8 file as isolated single-byte blobs. UTF8CharLength treats invalid lead bytes as the starts of either 2 or 4 byte characters which I think is less consistent and less safe. It's less consistent as it differs from UTF8Classify/UTF8DrawBytes and thus drawing, and it's less safe as it may lead to calling code trying to grab the next byte(s) to construct a bogus character - eating a single byte will allow better resynchronisation when, for example, an invalid lead byte is followed by an ASCII byte.
Another approach would be to treat invalid lead bytes as more invalid and output a marker character such as unicodeReplacementChar (0xFFFD). This is more lossy since all the invalid lead bytes are converted to the same 0xFFFD making it more difficult to track back from failures. A separate version of the UTF8BytesOfLead table could use 0 or 5 for invalid lead bytes.
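For illustration, how a conversion loop might use such a table (UTF8BytesOfLeadStrict is hypothetical here; 0 marks an invalid lead byte):

    const int byteCount = UTF8BytesOfLeadStrict[us[i]];
    if (byteCount == 0) {
        // Emit the marker, consume exactly one byte, and resynchronise
        // on the next byte.
        tbuf[ui++] = static_cast<wchar_t>(unicodeReplacementChar);   // 0xFFFD
        i++;
        continue;
    }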
When running out of room, UTF32FromUTF8 should terminate but may write an unexpected 0 to the output.
Since the charLen is now calculated up-front, some assertions or checks can be added before the switch. Something like
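(a reconstruction of the idea, assuming charLen comes from UTF8BytesOfLead and i/len index the input):

    #include <cassert>

    // charLen is 1..4 by construction of the table.
    assert(charLen > 0 && charLen <= 4);
    // The lead byte plus its trail bytes must fit within the input.
    if (i + charLen > len) {
        break;   // truncated final character
    }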
Last edit: Neil Hodgson 2018-03-18
Did you mean using UTF8BytesOfLead in UTF16FromUTF8, UTF32FromUTF8 and the surfaces' MeasureWidths? How about UTF16Length? The current UTF8CharLength table matches the original behaviour, except for the D2D surface.
I would use UTF8BytesOfLead in UTF16FromUTF8, UTF32FromUTF8, UTF16Length and surface's MeasureWidths.
Currently, MeasureWidths (and other Surface methods) should only be given valid UTF-8 when in Unicode mode, as the platform-independent drawing code breaks up the document text and uses special 'blob' code for invalid bytes. That means that there should be no difference between UTF8BytesOfLead and UTF8CharLength.
UTF16Length should use the same technique as UTF16FromUTF8 as they are generally used as a pair. UTF16Length is called in 3 places to measure / allocate before UTF16FromUTF8 is called. Its other use is for WM_GETTEXTLENGTH which is often called by applications before WM_GETTEXT which uses UTF16FromUTF8.
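The pairing in calling code looks roughly like this (a sketch; UTF16FromUTF8 as quoted earlier, and a UTF16Length(const char *, size_t) signature assumed):

    #include <vector>

    // Measure first so the allocation is exact, then convert into it.
    const size_t lengthUTF16 = UTF16Length(s, len);
    std::vector<wchar_t> buffer(lengthUTF16);
    UTF16FromUTF8(s, len, buffer.data(), buffer.size());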
The main places where UTF8BytesOfLead versus UTF8CharLength may produce changed behaviour are in clipboard and drag-and-drop actions. If the user selects a range to copy on Windows, that range is converted to UTF-16 by UTF16FromUTF8 so they may see better/worse behaviour depending on how UTF16FromUTF8 is implemented. Say you have "a\xB5yz" in the document in UTF-8 mode which means the '\xB5' is an invalid lead byte. Copy the whole document to the clipboard. If UTF8CharLength is used, this produces the 3-character UTF-16 string "aչz" but when UTF8BytesOfLead is used in both UTF16FromUTF8 and UTF16Length then the 4-character string "aµyz" is produced. I think this is better since it has preserved the 'y' and there is a reasonable guess for the '\xB5'.
The above example may occur because a Windows-1252 file containing "aµyz" was opened as if it was UTF-8. Some Windows-1252 character values in 0x80..0xFF, like 'µ', match Unicode because Windows-1252 is based on ISO 8859-1, whose 0x80..0xFF values map directly to the same Unicode code points. Windows-1252 is a very common encoding.
If the invalid lead byte was represented with the Unicode Replacement Character then there would be "a�yz" on the clipboard. I think the best behaviour can be argued in different ways but I would personally prefer "aµyz" to "a�yz".
OK, I will remove the UTF8CharLength table.
Sorry, I have something I must leave for; I may return after 8 PM, UTC+8.
Hi Neil, the patch is updated.
*.asm is added to .hgignore, and $(DIR_O)\*.asm is added to scintilla.mak's clean target; please ignore these two changes if they are not needed.
Last edit: Zufu Liu 2018-03-18
OK.
UTF8CharLength is still called in gtk/PlatGTK.cxx causing a compilation failure. It should be replaced with using UTF8BytesOfLead.
I don't understand why "static" was added to UTF8IsTrailByte which is already inline. Other inline functions are not marked static in Scintilla headers.
Making UnicodeFromUTF8 depend on UTF8BytesOfLead means there's no real benefit to it being in a separate header so it should move into UniConversion.h. I can't really recall why it is a separate header but the most likely reason was so it could be used inside separate lexers without adding dependencies like UniConversion.cxx. The unit tests are also broken as test/unit/testUnicodeFromUTF8.cxx doesn't include UniConversion.h and the unit tests don't link UniConversion.cxx. This is sufficiently complex that UnicodeFromUTF8 should be left out of the initial change set.
Reusing a value in UTF16FromUTF8 for different purposes is confusing. Making the variable used for the output length const should optimize nicely. Or inline the expression into the check.
In UTF16Length, charLen is only assigned once so should be const.
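Roughly (a sketch of the loop body under discussion, with variable names assumed):

    // Both are assigned exactly once per character, so both can be const.
    const unsigned char charLen = UTF8BytesOfLead[us[i]];
    const size_t lenChar16 = (charLen < 4) ? 1 : 2;   // surrogate pair for 4-byte chars
    ulen += lenChar16;
    i += charLen;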
Patch updated; unit tests now pass.
There are three patches:
utf8-table-4.diff contains changes to UnicodeFromUTF8.h and related files.
utf8-table-4no-UnicodeFromUTF8.diff does not include changes to UnicodeFromUTF8.h and related files.
utf8-table-4del-UnicodeFromUTF8.diff has the function UnicodeFromUTF8() moved to UniConversion.h and the header UnicodeFromUTF8.h removed.
Last edit: Zufu Liu 2018-03-19
The current implementation of UTF8Classify is rather complex; I have made a simpler one according to RFC 3629, section 4, "Syntax of UTF-8 Byte Sequences":
https://tools.ietf.org/html/rfc3629#section-4
You can review these changes at
https://github.com/zufuliu/notepad2/blob/master/scintilla/src/UniConversion.cxx#L269
and
https://github.com/zufuliu/notepad2/blob/master/scintilla/src/UniConversion.h#L25
In this change (UTF8Classify), the table UTF8BytesOfLead is replaced with an inline function to reduce binary size:
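A possible shape for such a function (an assumption of the general idea, not the exact code in the linked repository):

    inline int UTF8BytesOfLead(unsigned char ch) noexcept {
        if (ch < 0xC2) {    // ASCII, trail bytes, and overlong C0/C1 leads
            return 1;
        }
        if (ch < 0xE0) {
            return 2;
        }
        if (ch < 0xF0) {
            return 3;
        }
        if (ch < 0xF5) {
            return 4;
        }
        return 1;           // F5..FF can never appear in valid UTF-8
    }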
Should I pack UTF8Classify's change with this patch, or make a new patch?
Last edit: Zufu Liu 2018-03-19
I don't think the minor saving is worth the loss of clarity caused by merging the widths with other information. It's more difficult to tell what is going on by looking at the values in UTF8ClassifyTable.
The content of UTF8ClassifyTable has a description above it at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/UniConversion.cxx#L203
It's auto-generated by https://github.com/zufuliu/notepad2/blob/master/scintilla/scripts/GenerateCharTable.py#L27
Last edit: Zufu Liu 2018-03-19
UTF8Classify is more complex so it needs some unit tests and should also be in a new issue. Adding multiple features into a single issue makes it more complex to see what is happening, particularly if some things are integrated into one release and others are deferred.
1) I think it would help to add a new test/unit/testUniConversion.cxx to ensure that the changes made to UniConversion for this issue are good and stay good in the future. Something of similar scope to testUnicodeFromUTF8.cxx with both UTF16FromUTF8 and UTF32FromUTF8 checked against the different lengths of character as well as an invalid string (see the sketch after this list).
2) Then, when UnicodeFromUTF8 is moved into UniConversion, its tests are moved into testUniConversion.
3) (new issue) Later, tests should be added to testUniConversion for UTF8Classify to capture its current behaviour.
4) Changes to UTF8Classify can be checked to see if there are changes in behaviour.
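A sketch of how testUniConversion.cxx could start, following the Catch style of the existing unit tests (the specific cases here are illustrative assumptions):

    #include "UniConversion.h"
    #include "catch.hpp"

    TEST_CASE("UTF16FromUTF8") {
        SECTION("OneByte") {
            wchar_t tbuf[1] = {0};
            REQUIRE(UTF16FromUTF8("a", 1, tbuf, 1) == 1U);
            REQUIRE(tbuf[0] == L'a');
        }
        SECTION("FourBytes") {
            // U+10000: 4 UTF-8 bytes become a UTF-16 surrogate pair.
            const char *s = "\xF0\x90\x80\x80";
            wchar_t tbuf[2] = {0, 0};
            REQUIRE(UTF16FromUTF8(s, 4, tbuf, 2) == 2U);
            REQUIRE(tbuf[0] == 0xD800);
            REQUIRE(tbuf[1] == 0xDC00);
        }
        SECTION("InvalidLead") {
            // "a\xB5yz": \xB5 is an invalid lead byte, expected to pass
            // through as a single unit when UTF8BytesOfLead is used.
            const char *s = "a\xB5yz";
            wchar_t tbuf[4] = {0, 0, 0, 0};
            REQUIRE(UTF16FromUTF8(s, 4, tbuf, 4) == 4U);
            REQUIRE(tbuf[1] == 0xB5);
        }
    }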
Do you mean I should keep UnicodeFromUTF8.h unchanged (apply utf8-table-4no-UnicodeFromUTF8.diff), and add a unit test testUniConversion.cxx?
Yes, make a patch containing utf8-table-4no-UnicodeFromUTF8.diff along with a new testUniConversion.cxx containing tests of UTF16FromUTF8 and UTF32FromUTF8.
New patch with unit test.
Unit test rewritten; the REQUIRE() tests are split up.
That looks reasonable.
I added some more test cases and am not really happy about some changed behaviour with invalid input. The previous functions were more forgiving of spurious lead bytes at the end of the string.
Here are the extra test cases for UTF16FromUTF8 - the first was discussed before and works well.
The second fails as it tries to read beyond the end of the input string. The previous version would generally find memory after the end (often a NUL) and so would appear to work OK, although it is undefined behaviour and does need fixing.
I think the most forgiving behaviour would be for the input to be treated as if there are NUL bytes past the len and not throw an exception for input exhaustion. This is based on thinking about the user copying some text that ends with a spurious lead byte - they shouldn't be punished by nothing going on the clipboard.
Throwing for output exhaustion is still good as the caller should always allocate enough memory.
Rather than complicating the input string access code, it may be better to drop out where the input exhaustion check is currently:
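Something along these lines (a sketch over the loop variables discussed above):

    // A truncated final character means the input is exhausted, so stop
    // converting instead of reading past len or throwing.
    if (i + charLen > len) {
        break;
    }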
How about this behaviour (copy as many invalid bytes as possible):
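For illustration, one shape this could take (a sketch, not the exact snippet from the patch):

    if (i + charLen > len) {
        // The final character is truncated: emit each remaining byte
        // individually so nothing is silently dropped.
        for (; i < len; i++) {
            tbuf[ui++] = static_cast<wchar_t>(static_cast<unsigned char>(s[i]));
        }
        break;
    }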
Last edit: Zufu Liu 2018-03-19
or
Last edit: Zufu Liu 2018-03-19
Since i + charLen > len only occurs at the string end, I think we can decode as many bytes as possible, rather than dropping the remainder.