First part (the BasicInsertString.diff) is to Optimize CellBuffer::BasicInsertString() to only patch up CRLF before the loop to insert lines, this improved loading file with CR+LF line endings (commonly used on Windows).
For non-UTF-8 line endings, the loop can be optimized with AVX2, SSE2 or table loop to make detection for end of line more efficient; The code implements these optimizations is available at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/CellBuffer.cxx#L987
This is a small optimization, allocation for starts or call perLine->InsertLine() in plv->InsertLine() actually consumes more time, especially the later.
I think perLine->InsertLine() can be disabled (by set it to nullptr) when there is no active per-line data (i.e. length is zero), this avoid calling Document::InsertLine() on each plv->InsertLine(). On the first time application adding text (for most applications, this is the whole document loaded from disk), there would have no per-line data before the document been styled.
Implementation brief idea: in Document::InsertString(), check perLineData to see whether there are active per-line data, if no then call cb.SetPerLine(nullptr), cb.InsertString() and cb.SetPerLine(this); else directly call cb.InsertString(); The code implements this change is available at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/Document.cxx#L1212
Replace the loop in
ContractionState<LINE>::InsertLines()with InsertSpace() and FillRange() will make it even faster. The loop inContractionState<LINE>::DeleteLines()can be changed to DeleteRange() too,Do you have timing results for this change?
OK, I will try to find some public big files on the Internet and do the timing.
off topic:
Patch lambda-noexcept.diff changes all lambda to noexcept, they were found by
rg "\]\("command, https://github.com/BurntSushi/ripgrepOther noexcept can be found with Visual Studio 2019's Code Analysis, like
bool operator()(int a, int b)(can be changed toconst noexcept) in Sorter (in AutoComplete.cxx).Edit: ripgrep doesn't support .hgignore, but the silver searcher does (command
ag "\]\("). https://github.com/ggreer/the_silver_searcher or https://github.com/k-takata/the_silver_searcher-win32Last edit: Zufu Liu 2020-04-14
Isn't it possible to avoid duplicating logic for the last character by running the loop one more iteration but guarding aginst reading past end?
It can be change to that with following changes:
Last edit: Zufu Liu 2020-04-14
Whole changes to avoid last character, the performance lost is very subtle (5ms for a 4,962,027 lines file with CR+LF line endings on i5).
Last edit: Zufu Liu 2020-04-14
Adding IsActive to PerLine appears good. The modularity appears better to me if CellBuffer decides how to respond to PerLine::IsActive by calling that early in BasicInsertString then passing the result as an extra flag to CellBuffer::InsertLine and LineVector::InsertLine, repurposing the lineStart argument as a bit set.
Sorry, I seems forgot something. Optimize out perLine inside CellBuffer::BasicInsertString() is not worth for most normal operations: after document been lexered, there may have line levels or line states, so perLine is likely always active. And it also depends on document content, as it only been called once per-line. So I only optimize out it in Document::InsertString() for large chunk of string.
While there is a significant performance improvement with the SSE/AVX2 code, it does make the code more complex and difficult to change. More line end modes could be added in the future: form feed and vertical tab may be treated as line ends or isolated carriage returns could be treated as not line ends.
A mode that only allowed \n and \r\n line ends would cover almost all files (old-style Mac and other archaic system files are now quite rare) and be simpler and perform better as line splitting can be done after every \n. It would match the behaviour of some other editors such as vi. Some lexers may fail against such files though (if they assume an isolated \r increments the line number) so could be validated and marked as OK.
Most SSE code I've seen does an initial loop to achieve aligned loads before the main loop assumes aligned loads whereas this patch does (potentially) unaligned loads. Since the buffer is provided by the application, it can choose to align. Have you looked at the speed difference between aligned and unaligned buffers with this code?
For now, the SSE change doesn't appear worth including to me but that could change in the future.
MSVC seems never generates aligned loading when the alignment is not known as compile time even with
__assume(tested assembly generated for IsUTF8() and IsUTF7() in https://github.com/zufuliu/notepad2/blob/master/src/EditEncoding.c ).For end of line detection, because following code, aligned loading is not possible.
AVX2 is 7 years old, while SSE2 is already 20 years old, all x64 CPU supports it. I think the SSE2/AVX2 code (about 60 lines) can be inserted before
while (ptr < end)in BasicInsertString2.diff, possible after moving to C++20, where there is a portable CTZ (std::countr_zero()in<bit>).How about BasicInsertString2.diff (optimize patch up out of the loop) in in https://sourceforge.net/p/scintilla/feature-requests/1347/#9c49 I think it stil make code faster.
Last edit: Zufu Liu 2020-04-17
The current plan for C++20 is for core Scintilla to start using it 12 months after strong support of almost all C++20 features in the 3 main compilers.
For InsertLine, it may be faster to buffer changes then make multiple at once replacing InsertLine with
and continuing this into PerLine which is likely to improve performance in other cases such as when some of the per-line data is active.
When avoiding many reallocations of the line starts, the number of lines can be estimated at the end of BasicInsertString or in the Document code:
Change to InsertLines() indeed make the function faster, I added the changes (without adding InsertLines to PerLine) in https://github.com/zufuliu/notepad2/commit/acc51b7ff526515290bc6dbcc582aac41e2f7e32
After some tests, it seems the cache count affects performance, too small or too large cache count will decrease the improvement.
Tomorrow I will measure (by enable AttachConsole() in wWinMain() in Notepad2.c, and uncomment ElapsedPeriod related lines in CellBuffer.cxx) the change on i5 with the same file used in above table.
I added SetInitLineCount() to LineVector in https://github.com/zufuliu/notepad2/commit/09de3a28b992c90e586b351c83a847d005ca9a91 to make starts not resizing. For Notepad2, the exact total line count is known after line endings detection (EditDetectEOLMode() in Edit.c, implemented with AVX2/SSE2 + POPCNT or table lookup).
CellBuffer::ResetLineEnds()can take advantage of current line count or allocated size.Add
InsertLines(Sci::Line lineFirst, Sci::Line lineCount)to PerLine improved the performance a lot, nearly same effect as set perLine to nullptr.https://github.com/zufuliu/notepad2/commit/1472bafa24e1c7e7050729241d9088bfd519c1c9
The change can be measured by change (in CellBuffer.h)
LineMarkers::InsertLines() and LineTabstops::InsertLines() have side effect: InsertValue() doesn't accept nullptr, for simplicity, InsertEmpty() is used.
I seems got strange result after measuring Notepad2 on i5 with enwiki-20200401-pages-articles-multistream-index.txt.bz2 from https://dumps.wikimedia.org/enwiki/20200401/
The file is 894 MiB, 20,128,864 lines (only LF).
An external SCI_ API for setting the initial line count doesn't work as well for SciTE as it is using the ILoader interface from a background thread and loading 128K blocks. Blocks are used to avoid having an extra copy of the file in memory. When loading huge files, its common to be near the limit of physical RAM in the machine so an extra copy may switch to paging, possibly thrashing. Having the estimation inside BasicInsertString meant it was applicable to SciTE and to other applications without modifying them. Its probably better to leave this to the application and add another method to ILoader in the next major release. It should be possible to call SetInitLineCount at any time without destroying any data (it may be called after loading the first block) so could be called something similar to AllocateLines.
Inserting multiple partitions at once also helps. This is made more complex by the partitions being sized by either int or ptrdiff_t so there is one that method that does a bulk copy and another that performs casts. This patch also includes some commented out code that moved whether the line indices are active out to LineVector but that isn't important once the lines are being batched.
The UTF-8 line ends case can be optimized because the bytes that end LS, PS, and NEL are 0xa8, 0xa9, and 0x85 so only performing the complex test for non-ASCII bytes like:
Add
(ch >= 0x80)seems does nothing, I end up with a table lookup likechange to
InsertPartitions()improved a lot for Win32 builds;the table lookup improved a lot for UTF-8 line ends; form feed or line feed can be added in the further when needed.
attached test results.
The reason for the good speedup in Win32 builds is that it matches the
sizeof(Sci::Position) == sizeof(POS)condition so calls InsertFromArray. The same would occur for a 64-bit build withlargeDocumentset.Build Scintilla with attached changes (table lookup) indeed make SciTE faster.
inserting single character is handled at the end of loop.
While nice and fast, that change doesn't work with some Unicode line end manipulations. test/simpleTests.py has several tests for LS, PS, and NEL and the results follow. I couldn't quickly see the problem but hope to get to it later. simpleTests.py can be run after building the DLL and require matching bit widths: 32-bit Python with a 32-bit DLL or both 64-bit. To see which particular test step caused the failure, add a 'msg=' argument to assertEquals.
Last edit: Neil Hodgson 2020-04-21
originally code
table lookup
Updated log, it seems something is wrong after the loop.
old code
table lookup
the failure can be fixed by assigning chBeforePrev and chPrev like:
in originally code, they are assigned at the of the loop; but in table lookup code, they not assigned after last character
Whole changes with SSE2 and AVX2 optimization for non-UTF-8 line endings, don't need C++20.
I will check whether Intel C++ could generate TZCNT (
_tzcnt_u32) without any flags.