Scintilla / Feature Requests / #1347 Optimize CellBuffer::BasicInsertString()

Zufu Liu - 2020-04-13

Replace the loop in ContractionState<LINE>::InsertLines() with InsertSpace() and FillRange() will make it even faster. The loop in ContractionState<LINE>::DeleteLines() can be changed to DeleteRange() too,

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2020-04-13

Do you have timing results for this change?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2020-04-14

OK, I will try to find some public big files on the Internet and do the timing.

off topic:
Patch lambda-noexcept.diff changes all lambda to noexcept, they were found by rg "\]\(" command, https://github.com/BurntSushi/ripgrep
Other noexcept can be found with Visual Studio 2019's Code Analysis, like bool operator()(int a, int b) (can be changed to const noexcept) in Sorter (in AutoComplete.cxx).

Edit: ripgrep doesn't support .hgignore, but the silver searcher does (command ag "\]\("). https://github.com/ggreer/the_silver_searcher or https://github.com/k-takata/the_silver_searcher-win32

Last edit: Zufu Liu 2020-04-14

lambda-noexcept.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2020-04-14

Isn't it possible to avoid duplicating logic for the last character by running the loop one more iteration but guarding aginst reading past end?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

It can be change to that with following changes:

unsigned char ch = ' ';
const char * const end = s + insertLength;
while (ptr < end) {
    ch = *ptr++;
    switch (ch) {
    case '\r':
        if (ptr < end && *ptr == '\n') {
            ++ptr;
        }
        [[fallthrough]];

Last edit: Zufu Liu 2020-04-14

Zufu Liu - 2020-04-14

Whole changes to avoid last character, the performance lost is very subtle (5ms for a 4,962,027 lines file with CR+LF line endings on i5).

Type Time

old loop 231ms

with testing last character 169ms

avoid testing last character 174ms

SSE2 73ms

AVX2 69ms

Last edit: Zufu Liu 2020-04-14

BasicInsertString2.cxx

BasicInsertString2.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Type	Time
old loop	231ms
with testing last character	169ms
avoid testing last character	174ms
SSE2	73ms
AVX2	69ms

Neil Hodgson - 2020-04-15

Adding IsActive to PerLine appears good. The modularity appears better to me if CellBuffer decides how to respond to PerLine::IsActive by calling that early in BasicInsertString then passing the result as an extra flag to CellBuffer::InsertLine and LineVector::InsertLine, repurposing the lineStart argument as a bit set.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Zufu Liu - 2020-04-15
  
  Sorry, I seems forgot something. Optimize out perLine inside CellBuffer::BasicInsertString() is not worth for most normal operations: after document been lexered, there may have line levels or line states, so perLine is likely always active. And it also depends on document content, as it only been called once per-line. So I only optimize out it in Document::InsertString() for large chunk of string.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2020-04-15

While there is a significant performance improvement with the SSE/AVX2 code, it does make the code more complex and difficult to change. More line end modes could be added in the future: form feed and vertical tab may be treated as line ends or isolated carriage returns could be treated as not line ends.

A mode that only allowed \n and \r\n line ends would cover almost all files (old-style Mac and other archaic system files are now quite rare) and be simpler and perform better as line splitting can be done after every \n. It would match the behaviour of some other editors such as vi. Some lexers may fail against such files though (if they assume an isolated \r increments the line number) so could be validated and marked as OK.

Most SSE code I've seen does an initial loop to achieve aligned loads before the main loop assumes aligned loads whereas this patch does (potentially) unaligned loads. Since the buffer is provided by the application, it can choose to align. Have you looked at the speed difference between aligned and unaligned buffers with this code?

For now, the SSE change doesn't appear worth including to me but that could change in the future.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2020-04-15

MSVC seems never generates aligned loading when the alignment is not known as compile time even with __assume (tested assembly generated for IsUTF8() and IsUTF7() in https://github.com/zufuliu/notepad2/blob/master/src/EditEncoding.c ).
For end of line detection, because following code, aligned loading is not possible.

if (*next == '\n') { // CR+LF across boundary ++next; }

AVX2 is 7 years old, while SSE2 is already 20 years old, all x64 CPU supports it. I think the SSE2/AVX2 code (about 60 lines) can be inserted before while (ptr < end) in BasicInsertString2.diff, possible after moving to C++20, where there is a portable CTZ (std::countr_zero() in <bit>).

How about BasicInsertString2.diff (optimize patch up out of the loop) in in https://sourceforge.net/p/scintilla/feature-requests/1347/#9c49 I think it stil make code faster.

Last edit: Zufu Liu 2020-04-17
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2020-04-19
  
  The current plan for C++20 is for core Scintilla to start using it 12 months after strong support of almost all C++20 features in the 3 main compilers.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

For InsertLine, it may be faster to buffer changes then make multiple at once replacing InsertLine with

void InsertLines(Sci::Line lineFirst, const Sci::Position *positions, size_t lines, bool lineStart);

and continuing this into PerLine which is likely to improve performance in other cases such as when some of the per-line data is active.

When avoiding many reallocations of the line starts, the number of lines can be estimated at the end of BasicInsertString or in the Document code:

// If this is the first block loaded and huge body allocated
// then estimate line size and thus number of lines likely
// then allocate that number of lines. May avoid much resizing
// of line starts array. (with obvious added methods)  
if ((position == 0) && (substance.AllocatedLength() > 10'000'000)) {
    const size_t lineSizeMean = (substance.Length() / Lines());
    // Over-allocate lines by 10% in case first block non-typical.
    const size_t linesDoc = substance.AllocatedLength() / lineSizeMean * 11 / 10;
    plv->AllocateLines(linesDoc);
}

Zufu Liu - 2020-04-18

Change to InsertLines() indeed make the function faster, I added the changes (without adding InsertLines to PerLine) in https://github.com/zufuliu/notepad2/commit/acc51b7ff526515290bc6dbcc582aac41e2f7e32
After some tests, it seems the cache count affects performance, too small or too large cache count will decrease the improvement.
Tomorrow I will measure (by enable AttachConsole() in wWinMain() in Notepad2.c, and uncomment ElapsedPeriod related lines in CellBuffer.cxx) the change on i5 with the same file used in above table.

I added SetInitLineCount() to LineVector in https://github.com/zufuliu/notepad2/commit/09de3a28b992c90e586b351c83a847d005ca9a91 to make starts not resizing. For Notepad2, the exact total line count is known after line endings detection (EditDetectEOLMode() in Edit.c, implemented with AVX2/SSE2 + POPCNT or table lookup).

CellBuffer::ResetLineEnds() can take advantage of current line count or allocated size.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2020-04-19

Add InsertLines(Sci::Line lineFirst, Sci::Line lineCount) to PerLine improved the performance a lot, nearly same effect as set perLine to nullptr.
https://github.com/zufuliu/notepad2/commit/1472bafa24e1c7e7050729241d9088bfd519c1c9

The change can be measured by change (in CellBuffer.h)

// set perLine to nullptr #define Enable_WithoutPerLine 1 #define EnablePerLine_InsertLines 0 // without InsertLines #define Enable_WithoutPerLine 0 #define EnablePerLine_InsertLines 0 // with InsertLines #define Enable_WithoutPerLine 0 #define EnablePerLine_InsertLines 1

LineMarkers::InsertLines() and LineTabstops::InsertLines() have side effect: InsertValue() doesn't accept nullptr, for simplicity, InsertEmpty() is used.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2020-04-19

I seems got strange result after measuring Notepad2 on i5 with enwiki-20200401-pages-articles-multistream-index.txt.bz2 from https://dumps.wikimedia.org/enwiki/20200401/
The file is 894 MiB, 20,128,864 lines (only LF).

result-i5-0419.txt

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

An external SCI_ API for setting the initial line count doesn't work as well for SciTE as it is using the ILoader interface from a background thread and loading 128K blocks. Blocks are used to avoid having an extra copy of the file in memory. When loading huge files, its common to be near the limit of physical RAM in the machine so an extra copy may switch to paging, possibly thrashing. Having the estimation inside BasicInsertString meant it was applicable to SciTE and to other applications without modifying them. Its probably better to leave this to the application and add another method to ILoader in the next major release. It should be possible to call SetInitLineCount at any time without destroying any data (it may be called after loading the first block) so could be called something similar to AllocateLines.

Inserting multiple partitions at once also helps. This is made more complex by the partitions being sized by either int or ptrdiff_t so there is one that method that does a bulk copy and another that performs casts. This patch also includes some commented out code that moved whether the line indices are active out to LineVector but that isn't important once the lines are being batched.

diff -r 7a8ff85aa5a8 src/CellBuffer.cxx
--- a/src/CellBuffer.cxx    Fri Apr 17 20:26:40 2020 +1000
+++ b/src/CellBuffer.cxx    Mon Apr 20 07:51:14 2020 +1000
@@ -65,6 +65,7 @@
    virtual void SetPerLine(PerLine *pl) noexcept = 0;
    virtual void InsertText(Sci::Line line, Sci::Position delta) noexcept = 0;
    virtual void InsertLine(Sci::Line line, Sci::Position position, bool lineStart) = 0;

+   virtual void InsertLines(Sci::Line line, const Sci::Position *positions, size_t lines, bool lineStart) = 0;
    virtual void SetLineStart(Sci::Line line, Sci::Position position) noexcept = 0;
    virtual void RemoveLine(Sci::Line line) = 0;
    virtual Sci::Line Lines() const noexcept = 0;
@@ -177,6 +178,29 @@
            perLine->InsertLine(line);
        }
    }
+   void InsertLines(Sci::Line line, const Sci::Position *positions, size_t lines, bool lineStart) override {
+       const POS lineAsPos = static_cast<POS>(line);
+       if constexpr (sizeof(Sci::Position) == sizeof(POS)) {
+           starts.InsertPartitions(lineAsPos, positions, lines);
+       } else {
+           starts.InsertPartitionsWithCast(lineAsPos, positions, lines);
+       }
+       /*
+       if (activeIndices) {
+           if (activeIndices & SC_LINECHARACTERINDEX_UTF32) {
+               startsUTF32.InsertLines(line, lines);
+           }
+           if (activeIndices & SC_LINECHARACTERINDEX_UTF16) {
+               startsUTF16.InsertLines(line, lines);
+           }
+       }
+       if (perLine) {
+           if ((line > 0) && lineStart)
+               line--;
+           perLine->InsertLines(line, lines);
+       }
+       */
+   }
    void SetLineStart(Sci::Line line, Sci::Position position) noexcept override {
        starts.SetPartitionStartPosition(static_cast<POS>(line), static_cast<POS>(position));
    }
diff -r 7a8ff85aa5a8 src/CellBuffer.h
--- a/src/CellBuffer.h  Fri Apr 17 20:26:40 2020 +1000
+++ b/src/CellBuffer.h  Mon Apr 20 07:51:14 2020 +1000
@@ -166,6 +166,7 @@
    Sci::Line LineFromPosition(Sci::Position pos) const noexcept;
    Sci::Line LineFromPositionIndex(Sci::Position pos, int lineCharacterIndex) const noexcept;
    void InsertLine(Sci::Line line, Sci::Position position, bool lineStart);
+   void InsertLines(Sci::Line line, const Sci::Position *position, size_t lines, bool lineStart);
    void RemoveLine(Sci::Line line);
    const char *InsertString(Sci::Position position, const char *s, Sci::Position insertLength, bool &startSequence);

diff -r 7a8ff85aa5a8 src/Partitioning.h
--- a/src/Partitioning.h    Fri Apr 17 20:26:40 2020 +1000
+++ b/src/Partitioning.h    Mon Apr 20 07:51:14 2020 +1000
@@ -122,6 +122,24 @@
        stepPartition++;
    }


+   void InsertPartitions(T partition, const T *positions, size_t length) {
+       if (stepPartition < partition) {
+           ApplyStep(partition);
+       }
+       body->InsertFromArray(partition, positions, 0, length);
+       stepPartition += static_cast<T>(length);
+   }
+
+   void InsertPartitionsWithCast(T partition, const ptrdiff_t *positions, size_t length) {
+       // Used for 64-bit builds when T is 32-bits
+       if (stepPartition < partition) {
+           ApplyStep(partition);
+       }
+       for (size_t i = 0; i < length; i++)
+           body->Insert(partition+i, static_cast<T>(positions[i]));
+       stepPartition += static_cast<T>(length);
+   }
+
    void SetPartitionStartPosition(T partition, T pos) noexcept {
        ApplyStep(partition+1);
        if ((partition < 0) || (partition > body->Length())) {

Neil Hodgson - 2020-04-19

The UTF-8 line ends case can be optimized because the bytes that end LS, PS, and NEL are 0xa8, 0xa9, and 0x85 so only performing the complex test for non-ASCII bytes like:

if (utf8LineEnds && (ch >= 0x80)) { const unsigned char back3[3] = { chBeforePrev, chPrev, ch }; if (UTF8IsSeparator(back3) || UTF8IsNEL(back3 + 1)) {
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2020-04-20

Add (ch >= 0x80) seems does nothing, I end up with a table lookup like

uint8_t eolTable[256]{}; eolTable[static_cast<uint8_t>('\n')] = 1; eolTable[static_cast<uint8_t>('\r')] = 2; if (utf8LineEnds) { // see UniConversion.h for LS, PS and NEL eolTable[0x85] = 4; eolTable[0xa8] = 3; eolTable[0xa9] = 3; }

change to InsertPartitions() improved a lot for Win32 builds;
the table lookup improved a lot for UTF-8 line ends; form feed or line feed can be added in the further when needed.

attached test results.

result-i5-0420.txt
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2020-04-20
  
  The reason for the good speedup in Win32 builds is that it matches the sizeof(Sci::Position) == sizeof(POS) condition so calls InsertFromArray. The same would occur for a 64-bit build with largeDocument set.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Build Scintilla with attached changes (table lookup) indeed make SciTE faster.

inserting single character is handled at the end of loop.

BasicInsertString3.cpp

While nice and fast, that change doesn't work with some Unicode line end manipulations. test/simpleTests.py has several tests for LS, PS, and NEL and the results follow. I couldn't quickly see the problem but hope to get to it later. simpleTests.py can be run after building the DLL and require matching bit widths: 32-bit Python with a 32-bit DLL or both 64-bit. To see which particular test step caused the failure, add a 'msg=' argument to assertEquals.

>pyw -u "simpleTests.py"
simpleTests
.................................................................................................................................F.F...............F.F........................................
======================================================================
FAIL: testNELBreakApartEOL (simpleTests.TestSimple)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\u\hg\scintilla\test\simpleTests.py", line 481, in testNELBreakApartEOL
    self.assertEquals(self.ed.LineCount, 2)
AssertionError: 1 != 2

======================================================================
FAIL: testNELFragmentedEOLStart (simpleTests.TestSimple)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\u\hg\scintilla\test\simpleTests.py", line 456, in testNELFragmentedEOLStart
    self.assertEquals(self.ed.LineCount, 2)
AssertionError: 1 != 2

======================================================================
FAIL: testUBreakApartEOL (simpleTests.TestSimple)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\u\hg\scintilla\test\simpleTests.py", line 399, in testUBreakApartEOL
    self.assertEquals(self.ed.LineCount, 2)
AssertionError: 1 != 2

======================================================================
FAIL: testUFragmentedEOLStart (simpleTests.TestSimple)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\u\hg\scintilla\test\simpleTests.py", line 374, in testUFragmentedEOLStart
    self.assertEquals(self.ed.LineCount, 2)
AssertionError: 1 != 2

----------------------------------------------------------------------
Ran 190 tests in 0.230s

FAILED (failures=4)

self.assertEquals(self.ed.LineCount, 2, msg=f"Failed with {i}")

Last edit: Neil Hodgson 2020-04-21

originally code

simpleTests2
testNELBreakApartEOL
BasicInsertString insertLength=4, chBeforePrev=00, chPrev=00
        78 c2 85 79
        InsertLine line=1, position=3, lineStart=1
BasicInsertString insertLength=1, chBeforePrev=00, chPrev=00
        78
        RemoveLine line=1
BasicInsertString insertLength=1, chBeforePrev=00, chPrev=78
        c2
        InsertLine line=1, position=3, lineStart=0
        RemoveLine line=1
BasicInsertString insertLength=1, chBeforePrev=78, chPrev=c2
        85
        InsertLine line=1, position=3, lineStart=0
BasicInsertString insertLength=1, chBeforePrev=c2, chPrev=85
        79
.
----------------------------------------------------------------------
Ran 1 test in 0.016s

OK

table lookup

simpleTests2
testNELBreakApartEOL
BasicInsertString insertLength=4, chBeforePrev=00, chPrev=00
        78 c2 85 79
        InsertLine line=1, position=3, lineStart=1
BasicInsertString insertLength=1, chBeforePrev=00, chPrev=00
        78
        RemoveLine line=1
BasicInsertString insertLength=1, chBeforePrev=00, chPrev=78
        c2
F
======================================================================
FAIL: testNELBreakApartEOL (simpleTests2.TestSimple2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "F:\Libs\scintilla\test\simpleTests2.py", line 47, in testNELBreakApartEOL
    self.assertEquals(self.ed.LineCount, 2)
AssertionError: 1 != 2

----------------------------------------------------------------------
Ran 1 test in 0.003s

FAILED (failures=1)

simpleTests2.zip

Updated log, it seems something is wrong after the loop.
old code

testNELBreakApartEOL
BasicInsertString insertLength=4, chBeforePrev=00, chPrev=00
        78 c2 85 79
        InsertLine line=1, position=3, lineStart=1
        chBeforePrev=85, chPrev=79, ch=79, chAfter=00
BasicInsertString insertLength=1, chBeforePrev=00, chPrev=00
        78
        chBeforePrev=00, chPrev=78, ch=78, chAfter=c2
        RemoveLine line=1
BasicInsertString insertLength=1, chBeforePrev=00, chPrev=78
        c2
        chBeforePrev=78, chPrev=c2, ch=c2, chAfter=85
        InsertLine line=1, position=3, lineStart=0
        RemoveLine line=1
BasicInsertString insertLength=1, chBeforePrev=78, chPrev=c2
        85
        InsertLine line=1, position=3, lineStart=0
        chBeforePrev=c2, chPrev=85, ch=85, chAfter=79
BasicInsertString insertLength=1, chBeforePrev=c2, chPrev=85
        79
        chBeforePrev=85, chPrev=79, ch=79, chAfter=00

table lookup

testNELBreakApartEOL
BasicInsertString insertLength=4, chBeforePrev=00, chPrev=00
        78 c2 85 79
        InsertLine line=1, position=3, lineStart=1
        chBeforePrev=c2, chPrev=85, ch=79, chAfter=00
BasicInsertString insertLength=1, chBeforePrev=00, chPrev=00
        78
        chBeforePrev=00, chPrev=00, ch=78, chAfter=c2
        RemoveLine line=1
BasicInsertString insertLength=1, chBeforePrev=00, chPrev=78
        c2
        chBeforePrev=00, chPrev=78, ch=c2, chAfter=85

simpleTests2-2.zip

Zufu Liu - 2020-04-21

the failure can be fixed by assigning chBeforePrev and chPrev like:

} else if (utf8LineEnds && !UTF8IsAscii(chAfter)) { chBeforePrev = chPrev; chPrev = ch; // May have end of UTF-8 line end in buffer and start in insertion

in originally code, they are assigned at the of the loop; but in table lookup code, they not assigned after last character

ch = *end; if (ptr == end) {
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2020-04-21

Whole changes with SSE2 and AVX2 optimization for non-UTF-8 line endings, don't need C++20.
I will check whether Intel C++ could generate TZCNT (_tzcnt_u32) without any flags.

BasicInsertString-0421.cxx

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Optimize CellBuffer::BasicInsertString()

Group

Searches

Help

#1347 Optimize CellBuffer::BasicInsertString()

Related

Discussion