Use a pre-computed table for UTF8CharLength and UTF8BytesOfLead to improve performance and make the code cleaner.
Three big changes:
1. In win32/PlatWin.cxx, inside the method void SurfaceGDI::MeasureWidths(Font &font_, const char *s, int len, XYPOSITION *positions), UTF8BytesOfLead is changed to UTF8CharLength. This is a bug fix, since UTF8CharLength is used in the D2D, Qt and Cocoa surfaces.
2. The length checking in size_t UTF32FromUTF8(const char *s, size_t len, unsigned int *tbuf, size_t tlen) (in UniConversion.cxx) is removed, since there is no such check in the corresponding function size_t UTF16FromUTF8(const char *s, size_t len, wchar_t *tbuf, size_t tlen).
3. The array UTF8BytesOfLead is changed from int to const unsigned char; I think it will have better cache locality. The assembly generated by VC (command line option: /FAs) is the same for const unsigned char UTF8BytesOfLead[256]; and const int UTF8BytesOfLead[256];.
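For reference, a sketch of what the pre-computed table can look like, following the convention discussed below where every invalid lead byte (a trail byte 0x80..0xBF, an overlong lead 0xC0..0xC1, or an out-of-range lead 0xF5..0xFF) maps to 1:

    // Bytes in a UTF-8 character, indexed by its lead byte; invalid
    // lead bytes map to 1 so they are consumed as isolated blobs.
    const unsigned char UTF8BytesOfLead[256] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 00 - 0F ASCII
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 10 - 1F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 20 - 2F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 30 - 3F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 40 - 4F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 50 - 5F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 60 - 6F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 70 - 7F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 80 - 8F trail bytes, invalid as lead
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 90 - 9F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // A0 - AF
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // B0 - BF
        1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // C0 - CF (C0/C1 overlong, invalid)
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // D0 - DF
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // E0 - EF
        4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // F0 - FF (F5..FF invalid)
    };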
The patch has UTF8CharLength exactly the same as UTF8BytesOfLead although I don't think that was the intent. The intent appears to be to treat the invalid lead bytes 0x80..0xC1 and 0xF5..0xFF differently for strong compatibility with previous behaviour. UTF8BytesOfLead forces invalid values to length 1 which is likely safer. The utf8LengthFromLead functions are probably just historical, written without noticing that UTF8CharLength existed, rather than consciously differentiated.
Autogeneration has a numbered section feature in template lines that can be seen in src/CaseConvert.cxx to handle multiple insertions. However, the UTF-8 tables should never change, unlike other tables which may change when the Unicode standard is updated or when features are added to Scintilla/SciTE. Thus it should be OK to copy the tables in without the Autogeneration comments.
It would increase safety to add length checking to UTF16FromUTF8 instead of removing it from UTF32FromUTF8. This is checking to prevent writing beyond the end of a memory allocation so it may be better to fail more strongly perhaps with an exception or assertion.
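A sketch of that stronger check, assuming the loop variables of UTF16FromUTF8 as quoted above (ui would be the output index, and outLen the number of UTF-16 code units the current character produces):

    #include <stdexcept>

    // Before writing the next character's code units:
    if (ui + outLen > tlen) {
        // Writing would run past the caller's allocation; fail loudly
        // rather than silently truncating or corrupting memory.
        throw std::runtime_error("UTF16FromUTF8: attempted write beyond end of output");
    }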
Converting from UTF-8 character length to UTF-16 character length is changed to use an expression which, to me, is less obvious than the previous one. If there is a benefit to this then it should be encapsulated in a function.
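For illustration, a minimal helper (the name here is made up; the assumption is that only 4-byte UTF-8 sequences, which encode code points above the BMP, need a UTF-16 surrogate pair):

    // UTF-16 code units produced by one UTF-8 character of byteCount bytes:
    // 4-byte sequences need a surrogate pair, everything else fits in one unit.
    inline size_t UTF16LengthFromUTF8ByteCount(size_t byteCount) noexcept {
        return (byteCount < 4) ? 1 : 2;
    }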
Declaring variables in a switch branch has more complex behaviour so it's better to add a scope for isolation in this change:
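A sketch of the scoped 4-byte branch, assuming us is the input viewed as unsigned char, ui/tbuf are the output index and buffer, and the SURROGATE_* constants come from UniConversion.h:

    switch (charLen) {
    case 4: {
            // Braces give the branch its own scope: 'value' is destroyed at
            // the closing brace and later labels cannot jump over its
            // initialization.
            unsigned int value = (us[i] & 0x7) << 18;
            value |= (us[i + 1] & 0x3F) << 12;
            value |= (us[i + 2] & 0x3F) << 6;
            value |= us[i + 3] & 0x3F;
            tbuf[ui++] = static_cast<wchar_t>(((value - 0x10000) >> 10) + SURROGATE_LEAD_FIRST);
            tbuf[ui] = static_cast<wchar_t>((value & 0x3FF) + SURROGATE_TRAIL_FIRST);
            break;
        }
    // other cases ...
    }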
The patch and script are updated. Length checking in UTF16FromUTF8 is added. Because of the lack of an else branch, the original length checking in UTF32FromUTF8 may have an infinite loop; UTF32FromUTF8 seems buggy.
Last edit: Zufu Liu 2018-03-17
The line is wrong; please change the type to an appropriate one.
You seem to prefer UTF8CharLength over UTF8BytesOfLead, such as in the change to PlatWin.cxx.
I prefer the values in UTF8BytesOfLead since these return 1 for invalid cases. Scintilla's drawing behaviour is to draw invalid bytes in a UTF-8 file as isolated single-byte blobs. UTF8CharLength treats invalid lead bytes as the starts of either 2 or 4 byte characters which I think is less consistent and less safe. It's less consistent as it differs from UTF8Classify/UTF8DrawBytes and thus drawing, and it's less safe as it may lead to calling code trying to grab the next byte(s) to construct a bogus character - eating a single byte will allow better resynchronisation when, for example, an invalid lead byte is followed by an ASCII byte.
Another approach would be to treat invalid lead bytes as more invalid and output a marker character such as unicodeReplacementChar (0xFFFD). This is more lossy since all the invalid lead bytes are converted to the same 0xFFFD making it more difficult to track back from failures. A separate version of the UTF8BytesOfLead table could use 0 or 5 for invalid lead bytes.
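For illustration, how a conversion loop might use such a table (UTF8BytesOfLeadStrict is hypothetical here; 0 marks an invalid lead byte):

    const int byteCount = UTF8BytesOfLeadStrict[us[i]];
    if (byteCount == 0) {
        // Emit the marker, consume exactly one byte, and resynchronise
        // on the next byte.
        tbuf[ui++] = static_cast<wchar_t>(unicodeReplacementChar);   // 0xFFFD
        i++;
        continue;
    }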
When running out of room, UTF32FromUTF8 should terminate but may write an unexpected 0 to the output.
Since the charLen is now calculated up-front, some assertions or checks can be added before the switch. Something like
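(a reconstruction of the idea, assuming charLen comes from UTF8BytesOfLead and i/len index the input):

    #include <cassert>

    // charLen is 1..4 by construction of the table.
    assert(charLen > 0 && charLen <= 4);
    // The lead byte plus its trail bytes must fit within the input.
    if (i + charLen > len) {
        break;   // truncated final character
    }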
Last edit: Neil Hodgson 2018-03-18
Did you mean using UTF8BytesOfLead in UTF16FromUTF8, UTF32FromUTF8 and the surfaces' MeasureWidths? How about UTF16Length? The current UTF8CharLength table matches the original behaviour, except for the D2D surface.
I would use UTF8BytesOfLead in UTF16FromUTF8, UTF32FromUTF8, UTF16Length and surface's MeasureWidths.
Currently, MeasureWidths (and other Surface methods) should only be given valid UTF-8 when in Unicode mode, as the platform-independent drawing code breaks up the document text and uses special 'blob' code for invalid bytes. That means that there should be no difference between UTF8BytesOfLead and UTF8CharLength.
UTF16Length should use the same technique as UTF16FromUTF8 as they are generally used as a pair. UTF16Length is called in 3 places to measure / allocate before UTF16FromUTF8 is called. Its other use is for WM_GETTEXTLENGTH which is often called by applications before WM_GETTEXT which uses UTF16FromUTF8.
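The pairing in calling code looks roughly like this (a sketch; UTF16FromUTF8 as quoted earlier, and a UTF16Length(const char *, size_t) signature assumed):

    #include <vector>

    // Measure first so the allocation is exact, then convert into it.
    const size_t lengthUTF16 = UTF16Length(s, len);
    std::vector<wchar_t> buffer(lengthUTF16);
    UTF16FromUTF8(s, len, buffer.data(), buffer.size());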
The main places where UTF8BytesOfLead versus UTF8CharLength may produce changed behaviour are in clipboard and drag-and-drop actions. If the user selects a range to copy on Windows, that range is converted to UTF-16 by UTF16FromUTF8 so they may see better/worse behaviour depending on how UTF16FromUTF8 is implemented. Say you have "a\xB5yz" in the document in UTF-8 mode which means the '\xB5' is an invalid lead byte. Copy the whole document to the clipboard. If UTF8CharLength is used, this produces the 3-character UTF-16 string "aչz" but when UTF8BytesOfLead is used in both UTF16FromUTF8 and UTF16Length then the 4-character string "aµyz" is produced. I think this is better since it has preserved the 'y' and there is a reasonable guess for the '\xB5'.
The above example may occur because a Windows-1252 file containing "aµyz" was opened as if it was UTF-8. Some Windows-1252 character values in 0x80..0xFF, like 'µ', match Unicode because Windows-1252 is based on ISO 8859-1, whose 0x80..0xFF values map directly to the same Unicode code points. Windows-1252 is a very common encoding.
If the invalid lead byte was represented with the Unicode Replacement Character then there would be "a�yz" on the clipboard. I think the best behaviour can be argued in different ways but I would personally prefer "aµyz" to "a�yz".
OK, I will remove the UTF8CharLength table.
Sorry, I have something I must leave for; I may return after 8 PM, UTC+8.
Hi Neil, the patch is updated.
*.asm is added to .hgignore, and $(DIR_O)\*.asm is added to scintilla.mak's clean target; please ignore these two changes if they are not needed.
Last edit: Zufu Liu 2018-03-18
OK.
UTF8CharLength is still called in gtk/PlatGTK.cxx causing a compilation failure. It should be replaced with using UTF8BytesOfLead.
I don't understand why "static" was added to UTF8IsTrailByte which is already inline. Other inline functions are not marked static in Scintilla headers.
Making UnicodeFromUTF8 depend on UTF8BytesOfLead means there's no real benefit to it being in a separate header so it should move into UniConversion.h. I can't really recall why it is a separate header but the most likely reason was so it could be used inside separate lexers without adding dependencies like UniConversion.cxx. The unit tests are also broken as test/unit/testUnicodeFromUTF8.cxx doesn't include UniConversion.h and the unit tests don't link UniConversion.cxx. This is sufficiently complex that UnicodeFromUTF8 should be left out of the initial change set.
Reusing a value in UTF16FromUTF8 for different purposes is confusing. Making the variable used for the output length const should optimize nicely. Or inline the expression into the check.
In UTF16Length, charLen is only assigned once so should be const.
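Roughly (a sketch of the loop body under discussion, with variable names assumed):

    // Both are assigned exactly once per character, so both can be const.
    const unsigned char charLen = UTF8BytesOfLead[us[i]];
    const size_t lenChar16 = (charLen < 4) ? 1 : 2;   // surrogate pair for 4-byte chars
    ulen += lenChar16;
    i += charLen;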
Patch updated; unit tests now pass.
There are three patches:
utf8-table-4.diff contains changes to UnicodeFromUTF8.h and related files.
utf8-table-4no-UnicodeFromUTF8.diff does not include changes to UnicodeFromUTF8.h and related files.
utf8-table-4del-UnicodeFromUTF8.diff has the function UnicodeFromUTF8() moved to UniConversion.h and the header UnicodeFromUTF8.h removed.
Last edit: Zufu Liu 2018-03-19
The current implementation of UTF8Classify is rather complex; I have made a simpler one according to RFC 3629, section 4, "Syntax of UTF-8 Byte Sequences":
https://tools.ietf.org/html/rfc3629#section-4
You can review these changes at
https://github.com/zufuliu/notepad2/blob/master/scintilla/src/UniConversion.cxx#L269
and
https://github.com/zufuliu/notepad2/blob/master/scintilla/src/UniConversion.h#L25
In this change (UTF8Classify), the table UTF8BytesOfLead is replaced with an inline function to reduce binary size:
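A possible shape for such a function (an assumption of the general idea, not the exact code in the linked repository):

    inline int UTF8BytesOfLead(unsigned char ch) noexcept {
        if (ch < 0xC2) {    // ASCII, trail bytes, and overlong C0/C1 leads
            return 1;
        }
        if (ch < 0xE0) {
            return 2;
        }
        if (ch < 0xF0) {
            return 3;
        }
        if (ch < 0xF5) {
            return 4;
        }
        return 1;           // F5..FF can never appear in valid UTF-8
    }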
Should I pack UTF8Classify's change with this patch, or make a new patch?
Last edit: Zufu Liu 2018-03-19
I don't think the minor saving is worth the loss of clarity caused by merging the widths with other information. It's more difficult to tell what is going on by looking at the values in UTF8ClassifyTable.
The content of UTF8ClassifyTable has a description above it at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/UniConversion.cxx#L203
It's auto-generated by https://github.com/zufuliu/notepad2/blob/master/scintilla/scripts/GenerateCharTable.py#L27
Last edit: Zufu Liu 2018-03-19
UTF8Classify is more complex so it needs some unit tests and should also be in a new issue. Adding multiple features into a single issue makes it more complex to see what is happening, particularly if some things are integrated into one release and others are deferred.
1) I think it would help to add a new test/unit/testUniConversion.cxx to ensure that the changes made to UniConversion for this issue are good and stay good in the future. Something of similar scope to testUnicodeFromUTF8.cxx with both UTF16FromUTF8 and UTF32FromUTF8 checked against the different lengths of character as well as an invalid string (see the sketch after this list).
2) Then, when UnicodeFromUTF8 is moved into UniConversion, its tests are moved into testUniConversion.
3) (new issue) Later, tests should be added to testUniConversion for UTF8Classify to capture its current behaviour.
4) Changes to UTF8Classify can be checked to see if there are changes in behaviour.
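A sketch of how testUniConversion.cxx could start, following the Catch style of the existing unit tests (the specific cases here are illustrative assumptions):

    #include "UniConversion.h"
    #include "catch.hpp"

    TEST_CASE("UTF16FromUTF8") {
        SECTION("OneByte") {
            wchar_t tbuf[1] = {0};
            REQUIRE(UTF16FromUTF8("a", 1, tbuf, 1) == 1U);
            REQUIRE(tbuf[0] == L'a');
        }
        SECTION("FourBytes") {
            // U+10000: 4 UTF-8 bytes become a UTF-16 surrogate pair.
            const char *s = "\xF0\x90\x80\x80";
            wchar_t tbuf[2] = {0, 0};
            REQUIRE(UTF16FromUTF8(s, 4, tbuf, 2) == 2U);
            REQUIRE(tbuf[0] == 0xD800);
            REQUIRE(tbuf[1] == 0xDC00);
        }
        SECTION("InvalidLead") {
            // "a\xB5yz": \xB5 is an invalid lead byte, expected to pass
            // through as a single unit when UTF8BytesOfLead is used.
            const char *s = "a\xB5yz";
            wchar_t tbuf[4] = {0, 0, 0, 0};
            REQUIRE(UTF16FromUTF8(s, 4, tbuf, 4) == 4U);
            REQUIRE(tbuf[1] == 0xB5);
        }
    }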
Do you mean I should keep UnicodeFromUTF8.h unchanged (apply utf8-table-4no-UnicodeFromUTF8.diff), and add a unit test testUniConversion.cxx?
Yes, make a patch containing utf8-table-4no-UnicodeFromUTF8.diff along with a new testUniConversion.cxx containing tests of UTF16FromUTF8 and UTF32FromUTF8.
New patch with unit test.
Unit test rewritten; the REQUIRE() tests are split up.
That looks reasonable.
I added some more test cases and am not really happy about some changed behaviour with invalid input. The previous functions were more forgiving of spurious lead bytes at the end of the string.
Here are the extra test cases for UTF16FromUTF8 - the first was discussed before and works well.
The second fails as it tries to read beyond the end of the input string. The previous version would generally find memory after the end (often a NUL) and so would appear to work OK, although it is undefined behaviour and does need fixing.
I think the most forgiving behaviour would be for the input to be treated as if there are NUL bytes past the len and not throw an exception for input exhaustion. This is based on thinking about the user copying some text that ends with a spurious lead byte - they shouldn't be punished by nothing going on the clipboard.
Throwing for output exhaustion is still good as the caller should always allocate enough memory.
Rather than complicating the input string access code, it may be better to drop out where the input exhaustion check is currently:
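Something along these lines (a sketch over the loop variables discussed above):

    // A truncated final character means the input is exhausted, so stop
    // converting instead of reading past len or throwing.
    if (i + charLen > len) {
        break;
    }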
How about this behaviour (copy as many invalid bytes as possible):
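For illustration, one shape this could take (a sketch, not the exact snippet from the patch):

    if (i + charLen > len) {
        // The final character is truncated: emit each remaining byte
        // individually so nothing is silently dropped.
        for (; i < len; i++) {
            tbuf[ui++] = static_cast<wchar_t>(static_cast<unsigned char>(s[i]));
        }
        break;
    }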
Last edit: Zufu Liu 2018-03-19
or
Last edit: Zufu Liu 2018-03-19
Since i + charLen > len only occurs at the string end, I think we can decode as many bytes as possible, rather than dropping the remainder.