While fixing https://github.com/zufuliu/notepad2/issues/236 I found several changes that improve search speed.
They are in the following three commits:
Commit 1: https://github.com/zufuliu/notepad2/commit/526cee6363f32c041a7c366421878ef7a8362aeb
Commit 2: https://github.com/zufuliu/notepad2/commit/84ea3babbc6547c43962573a31c220f7c929594f
Commit 3: https://github.com/zufuliu/notepad2/commit/de27fd98f15769d2e6c91e54e106b0224ea42280
Summary of changes:
Commit 1: For case-insensitive search, use MakeLowerCase() to replace pcf->Fold() when widthChar is 1 (UTF-8 and DBCS), or when the character is ASCII (SBCS).
Commit 2: Avoid calling NextCharacter(). For SBCS, it can be replaced with pos += increment. For DBCS, pos += widthFirstCharacter can be used for forward search (similar to UTF-8). For case-sensitive search, pos += increment can be used when CharAt(pos) is a safe one-byte character (encoded in a single byte and not part of any multi-byte sequence, i.e. all characters in SBCS, ASCII in UTF-8, C0 control characters in DBCS).
Commit 3: Optimized case-sensitive search for long patterns by moving pos to one position after the first mismatched character when that character is not in the search pattern. This helps a lot when the document contains text that matches the beginning of the long pattern but is followed by a character not in the pattern (as in the linked Notepad2 issue).
Besides, I think SciTE's MatchMarker::Continue() can be changed to time on bytes instead of lines, see [feature-requests:#1373].
Case-sensitive search can use the Boyer-Moore-Horspool-Sunday algorithm, which works better (except when lengthFind is 1). I added an implementation based on the code at http://www-igm.univ-mlv.fr/~lecroq/string/node19.html in https://github.com/zufuliu/notepad2/commit/334befc6c0b945f60f5ea5241258bd9fcbb2cf39
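For reference, a minimal sketch of the Sunday (Quick Search) variant over a contiguous buffer, following the Lecroq page linked above; the commit adapts the idea to Scintilla's non-contiguous buffer and to backward search, so the names and details below are illustrative only.

```cpp
#include <array>
#include <climits>
#include <cstddef>
#include <cstring>

// Sunday (Quick Search) forward scan over a contiguous buffer.
// Returns the offset of the first match or -1 when not found.
ptrdiff_t QuickSearch(const char *text, size_t textLength,
                      const char *pattern, size_t patternLength) {
	// Shift table indexed by the byte just past the current window;
	// bytes absent from the pattern allow the maximum shift.
	std::array<size_t, UCHAR_MAX + 1> shift;
	shift.fill(patternLength + 1);
	for (size_t i = 0; i < patternLength; i++) {
		shift[static_cast<unsigned char>(pattern[i])] = patternLength - i;
	}
	size_t pos = 0;
	while (pos + patternLength <= textLength) {
		if (memcmp(text + pos, pattern, patternLength) == 0) {
			return static_cast<ptrdiff_t>(pos);
		}
		if (pos + patternLength == textLength) {
			break;	// No byte after the window left to index the table.
		}
		pos += shift[static_cast<unsigned char>(text[pos + patternLength])];
	}
	return -1;
}
```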
https://github.com/zufuliu/notepad2/commit/d308d41f7ec7bf876a76780b691572752bb4372c fixed a backward-search bug with the Sunday algorithm.
In Commit 2:
Isn't this safe only for forward searches, as the leadByte doesn't determine the width of the previous character?
For Boyer-Moore there is an implementation in the standard library with <functional> containing boyer_moore_searcher and boyer_moore_horspool_searcher.
https://en.cppreference.com/w/cpp/utility/functional/boyer_moore_horspool_searcher
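A small usage sketch with a contiguous std::string_view (the wrapper function is just for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string_view>

// Find needle in haystack with the standard Boyer-Moore-Horspool searcher.
// Returns the offset of the first match or std::string_view::npos.
size_t FindWithHorspool(std::string_view haystack, std::string_view needle) {
	const std::boyer_moore_horspool_searcher searcher(needle.begin(), needle.end());
	const auto it = std::search(haystack.begin(), haystack.end(), searcher);
	return (it == haystack.end()) ? std::string_view::npos
		: static_cast<size_t>(it - haystack.begin());
}
```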
No, safeChar is 256 for single-byte forward and backward, and 0x80 (which can be changed to 0xc2) for UTF-8 forward and backward.
It is 0x80 for DBCS forward, and 0x31, 0x40 or 0x41 for DBCS backward depending on the code page (the minimum trail byte).
The current Boyer-Moore-Horspool-Sunday code is at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/Document.cxx#L2006
libc++ boyer_moore_horspool_searcher at
https://github.com/llvm/llvm-project/blob/main/libcxx/include/experimental/functional#L377
Here is a test case that relies on the ambiguity of trail bytes in DBCS:
The space byte was chosen to provoke the safeCharDBCS case in the second patch.
Two test cases were added to test/unit/testDocument.cxx with [d20926]. It may help maintain the searching feature through optimizations to add some more tests for edge cases.
One direction that may be useful in optimizing is to remove a layer or two of abstraction from NextCharacter / NextPosition by writing specialized NextCharacter functions for iterating SBCS/UTF-8/DBCS (or further specializing for forwards and backwards) then using them in code that has checked for encoding type or writing generic search code that takes them as template arguments.
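A hypothetical sketch of that direction, with encoding-specific steppers instantiated into one generic forward matcher; the View concept, the MatchAt helper, and the function names here are illustrative assumptions rather than the existing Document API:

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper comparing the search string byte-by-byte at pos;
// View only needs a CharAt(position) accessor.
template <typename View>
bool MatchAt(const View &view, size_t pos, std::string_view search) noexcept {
	for (size_t i = 0; i < search.length(); i++) {
		if (view.CharAt(pos + i) != search[i]) {
			return false;
		}
	}
	return true;
}

// Encoding-specific steppers: each advances by one whole character.
struct StepSBCS {
	template <typename View>
	static size_t Next(const View &, size_t pos) noexcept {
		return pos + 1;	// Every character is a single byte.
	}
};

struct StepUTF8 {
	template <typename View>
	static size_t Next(const View &view, size_t pos) noexcept {
		const unsigned char lead = view.CharAt(pos);
		// Width from the lead byte; invalid leads are not handled in this sketch.
		const size_t width = (lead < 0x80) ? 1 : (lead < 0xE0) ? 2 : (lead < 0xF0) ? 3 : 4;
		return pos + width;
	}
};

// Generic forward matcher instantiated per encoding, so there is no
// per-character encoding dispatch inside the loop.
template <typename Stepper, typename View>
ptrdiff_t FindForwardGeneric(const View &view, size_t pos, size_t endPos,
                             std::string_view search) noexcept {
	while (pos + search.length() <= endPos) {
		if (MatchAt(view, pos, search)) {
			return static_cast<ptrdiff_t>(pos);
		}
		pos = Stepper::Next(view, pos);
	}
	return -1;
}
```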
Related
Commit: [d20926]
While UTF-8 doesn't show the same ambiguous trail byte bug as DBCS due to its self-synchronizing nature, it has changed the way backward iteration is performed, potentially leading to different results over invalid UTF-8. If possible, forward and backward iteration should try to match at whole characters, not character fragments and the same positions should be checked backwards and forwards.
Backwards searches are performed much less often than forward searches so it is fine to concentrate optimization on the forward case and use accurate but slow code for backwards.
For UTF-8 forward search, the self-synchronizing aspect of UTF-8 makes it safe to perform the same byte-by-byte scan as for SBCS when the search string is valid UTF-8. A non-trail byte (all characters start with a non-trail byte) at the start of the search string will always match at character/fragment start positions just like UTF-8 iteration does. It is extremely rare to search for invalid UTF-8.
One occasion where searching for invalid UTF-8 may be reasonable could be when some process is buggy and produces invalid files with character fragments like isolated trail bytes. A search string that starts with a trail byte could find these but it shouldn't match trail-bytes that are part of a valid character. Thus, the document should be searched by following the UTF-8 iteration scheme, not byte-by-byte.
Much coverage of searching performance points to combining memchr and memcmp as a simple technique that works well because of how much those two calls are optimized. The basic code for a contiguous buffer looks similar to:
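Roughly like this sketch (names are illustrative):

```cpp
#include <cstddef>
#include <cstring>

// Locate the first byte of the needle with memchr, then confirm the rest
// with memcmp; on a failed confirmation continue from the next byte.
ptrdiff_t FindInBuffer(const char *haystack, size_t haystackLength,
                       const char *needle, size_t needleLength) {
	if (needleLength == 0) {
		return 0;	// Empty needle matches at the start.
	}
	const char *current = haystack;
	const char *const end = haystack + haystackLength;
	while (needleLength <= static_cast<size_t>(end - current)) {
		// Only look where a full match could still fit.
		const size_t searchable = end - current - (needleLength - 1);
		const char *found = static_cast<const char *>(memchr(current, needle[0], searchable));
		if (!found) {
			return -1;
		}
		if (memcmp(found, needle, needleLength) == 0) {
			return found - haystack;
		}
		current = found + 1;
	}
	return -1;
}
```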
In Scintilla, the memory isn't contiguous but an adapter can hide that.
Start with a class to provide a view over the split vector:
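Something along these lines (a sketch; the exact field layout is just for illustration):

```cpp
#include <cstddef>

// Read-only view over the two spans of the split vector: the text before
// the gap and the text after it, so searches can avoid copying the buffer.
struct SplitView {
	const char *segment1 = nullptr;	// Covers positions [0, length1).
	size_t length1 = 0;
	const char *segment2 = nullptr;	// Offset so segment2[position] is valid for [length1, length).
	size_t length = 0;	// Total length of the document in bytes.

	char CharAt(size_t position) const noexcept {
		if (position < length1) {
			return segment1[position];
		}
		if (position < length) {
			return segment2[position];
		}
		return '\0';
	}
};
```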
Then add methods that are versions of memchr and memcmp that work over the split (not thoroughly tested):
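Sketched here as free functions over the view:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string_view>

// memchr across the split: search [start, start + length) for byte ch,
// returning its document position or -1.
ptrdiff_t SplitFindChar(const SplitView &view, size_t start, size_t length, int ch) noexcept {
	size_t range1Length = 0;
	if (start < view.length1) {
		range1Length = std::min(length, view.length1 - start);
		const char *match = static_cast<const char *>(memchr(view.segment1 + start, ch, range1Length));
		if (match) {
			return match - view.segment1;
		}
		start += range1Length;
	}
	const char *match = static_cast<const char *>(memchr(view.segment2 + start, ch, length - range1Length));
	if (match) {
		return match - view.segment2;
	}
	return -1;
}

// memcmp across the split: does the document at start exactly match text?
bool SplitMatch(const SplitView &view, size_t start, std::string_view text) noexcept {
	size_t range1Length = 0;
	if (start < view.length1) {
		range1Length = std::min(text.length(), view.length1 - start);
		if (memcmp(view.segment1 + start, text.data(), range1Length) != 0) {
			return false;
		}
		start += range1Length;
	}
	if (range1Length < text.length()) {
		if (memcmp(view.segment2 + start, text.data() + range1Length, text.length() - range1Length) != 0) {
			return false;
		}
	}
	return true;
}
```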
Forward searching for SBCS or valid UTF-8 is then:
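A sketch of the loop, wrapped in a stand-alone function with the word-option checks left out (the names here are illustrative):

```cpp
// Forward search of [startPos, endPos) for 'search' using the split
// memchr/memcmp helpers above. Returns the match position or -1.
ptrdiff_t FindForwardBytes(const SplitView &cbView, size_t startPos, size_t endPos,
                           std::string_view search) noexcept {
	if (search.empty() || endPos < search.length()) {
		return -1;
	}
	const unsigned char charStartSearch = search[0];
	const size_t lastStart = endPos - search.length();	// Last position a match can begin.
	size_t pos = startPos;
	while (pos <= lastStart) {
		// Jump straight to the next occurrence of the first byte.
		const ptrdiff_t found = SplitFindChar(cbView, pos, lastStart - pos + 1, charStartSearch);
		if (found < 0) {
			break;
		}
		pos = static_cast<size_t>(found);
		// Confirm the rest of the search string at this position.
		if (SplitMatch(cbView, pos, search)) {
			return static_cast<ptrdiff_t>(pos);
		}
		pos++;
	}
	return -1;
}
```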
This is quite a bit faster (around 8x for a 6 byte not-present search string on Editor.cxx) than the current code and has very little initialisation overhead. Reverse searching requires memrchr which isn't portable but implementing a non-optimised version is simple.
Alternate implementation: add FindChar and Match to SplitVector and make available through CellBuffer. SplitView becomes invalid after any document modification but CellBuffer remains valid. May lose some performance through extra layers of calls.
The memmem approach would work for UTF-8 and DBCS case-sensitive forward search when the first byte of the search string is not a trail byte.
A libc++ developer says cbView.Match(pos, search) is faster; see __search_substring for std::string_view::find at https://github.com/llvm/llvm-project/blob/main/libcxx/include/__string#L946
A standard library compatible iterator over the document will allow use of a wide variety of standard library features, such as std::search. An iterator that works with all std::search searchers and takes a SplitView by pointer (adding the SplitView into the iterator was slower, possibly as it doubles the size to 32 bytes) produced these results (milliseconds for 20 runs over Editor.cxx * 100) for MSVC 2019 64-bit release mode:
Forward case-sensitive search SBCS and UTF8
I wouldn't read too much into the 5% between most of the cases. It's good evidence that collapsing the gap isn't needed for fast searching.
As well as the simple byte-by-byte iterator, it may be worthwhile having a DBCS iterator that treats the document as a sequence of characters instead of bytes, like UTF8Iterator does, to interface with std::regex.
BIterator, based on ByteIterator but with extra functions as needed:
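A sketch of such an iterator (Pos() is an accessor added for retrieving the match position; the exact member set is illustrative):

```cpp
#include <cstddef>
#include <iterator>

// Random-access iterator over a SplitView held by pointer so the iterator
// stays 16 bytes; usable as the haystack iterator for std::search searchers.
class BIterator {
	const SplitView *view = nullptr;
	size_t position = 0;
public:
	using iterator_category = std::random_access_iterator_tag;
	using value_type = char;
	using difference_type = ptrdiff_t;
	using pointer = const char *;
	using reference = const char &;

	BIterator() noexcept = default;
	BIterator(const SplitView &view_, size_t position_) noexcept :
		view(&view_), position(position_) {}
	size_t Pos() const noexcept { return position; }

	char operator*() const noexcept { return view->CharAt(position); }
	char operator[](ptrdiff_t n) const noexcept { return view->CharAt(position + n); }

	BIterator &operator++() noexcept { ++position; return *this; }
	BIterator operator++(int) noexcept { BIterator old = *this; ++position; return old; }
	BIterator &operator--() noexcept { --position; return *this; }
	BIterator operator--(int) noexcept { BIterator old = *this; --position; return old; }
	BIterator &operator+=(ptrdiff_t n) noexcept { position += n; return *this; }
	BIterator &operator-=(ptrdiff_t n) noexcept { position -= n; return *this; }
	BIterator operator+(ptrdiff_t n) const noexcept { return BIterator(*view, position + n); }
	friend BIterator operator+(ptrdiff_t n, const BIterator &it) noexcept { return it + n; }
	BIterator operator-(ptrdiff_t n) const noexcept { return BIterator(*view, position - n); }
	ptrdiff_t operator-(const BIterator &other) const noexcept {
		return static_cast<ptrdiff_t>(position) - static_cast<ptrdiff_t>(other.position);
	}

	bool operator==(const BIterator &other) const noexcept { return position == other.position; }
	bool operator!=(const BIterator &other) const noexcept { return position != other.position; }
	bool operator<(const BIterator &other) const noexcept { return position < other.position; }
	bool operator>(const BIterator &other) const noexcept { return position > other.position; }
	bool operator<=(const BIterator &other) const noexcept { return position <= other.position; }
	bool operator>=(const BIterator &other) const noexcept { return position >= other.position; }
};
```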
This can be used like so:
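For example (a sketch; the wrapper function name is illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string_view>

// Plug BIterator into a standard searcher over the document view.
// Returns the position of the first match or -1.
ptrdiff_t FindWithSearcher(const SplitView &cbView, std::string_view what) {
	const std::boyer_moore_horspool_searcher searcher(what.begin(), what.end());
	const BIterator docBegin(cbView, 0);
	const BIterator docEnd(cbView, cbView.length);
	const BIterator found = std::search(docBegin, docEnd, searcher);
	return (found == docEnd) ? -1 : static_cast<ptrdiff_t>(found.Pos());
}
```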
Measured with MinGW g++ 9.2.0 -Os (optimized for space) on Windows using the same cases as with MSVC above. Some minor fixes were made. This table includes the start up cost of each approach by performing a search (for "Scintilla" in Editor.cxx) that succeeds quickly. The search values shouldn't be directly compared with the startup values as they are for different cases and repetitions.
The Boyer-Moore variants start-up times are sensitive to the actual string being searched for and boyer_moore_horspool_searcher is commonly closer to boyer_moore_searcher.
There are many factors that go into the start-up and search speeds but, very roughly speaking, the Boyer-Moore variants reach parity with the 2 fastest setup cases after searching about 3K. This isn't significant for user-initiated find dialogs but it's more of a concern for programmatic actions like highlighting each instance of a common identifier.
There could be another option flag to indicate whether to use a searcher that requires more expensive initialisation. Another possibility is a multi-find API that marks each match with an indicator and spreads the start-up cost over multiple matches.
It's possible that more optimisation of SplitView and BIterator may improve search results.
There is a problem with case-insensitive SBCS search in Commit 1: UTF8IsAscii is called on a signed char so it always fails. A case-insensitive search for б in "Б б" using the Windows-1251 encoding will match the б but not the upper-case Б. The ch argument should be cast to unsigned char. Overloaded UTF8IsAscii for char and unsigned char can be added to UniConversion.h to make things work.
Committed changes similar to Commit 1 and Commit 2 as [49b5dc], [8fdbe4], [e44a54]. Made UTF8IsAscii more specific with [aab1df]. Added search tests in [55dc74] and added support for benchmarking using unit tests with [df32c5].
There's a strong speed improvement although it differs by case and compiler/optimization options. Benchmarks were run on EditView.cxx searching for a 6-byte needle that isn't present.
Further improvements using SplitView in [c09fe4] and [dbb18a].
There are more improvements that can be made but adding more special cases can make other improvements and maintenance more difficult. It may be worthwhile adding a SplitViewIterator (like BIterator above) as a more 'standard C++' way of working as that allows using standard library features more often.
Other improvements that are reasonable but don't appear to be huge wins include:
- Changing IsDBCSLeadByteNoExcept to a table lookup, avoiding the switch (dbcsCodePage).
- Searching with std::lower_bound in a sorted array, which is O(log n). This improves case-insensitive searching in a document that is mostly in a language like Russian.
Related
Commit: [49b5dc]
Commit: [55dc74]
Commit: [8fdbe4]
Commit: [aab1df]
Commit: [c09fe4]
Commit: [dbb18a]
Commit: [df32c5]
Commit: [e44a54]
Attached a sketchy (may not work) implementation of SplitViewIterator.
The safeChar aspect of case-sensitive search in Commit 2 was not included as that led to problems with reverse search for DBCS, and forward && UTF8IsAscii(leadByte) achieved good performance for the most common cases.
I have a table-based version of IsDBCSLeadByteNoExcept() and IsDBCSTrailByteNoExcept() at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/CharClassify.h#L77 and
https://github.com/zufuliu/notepad2/blob/master/scintilla/src/CharClassify.cxx#L699
but it uses dynamic allocation (which may fail, so it differs from the current switch code). However, the table can also be used in Platform code, e.g. SurfaceD2D::MeasureWidths().
There are only 2847 case-sensitive characters (966 of which are below 0x600) in Unicode 13 (per Python 3.9). I made a small function to check case sensitivity, used in Notepad2 to automatically set SCFIND_MATCHCASE; it is generated by updateCaseSensitivityBlock in https://github.com/zufuliu/notepad2/blob/master/scintilla/scripts/GenerateCaseConvert.py#L412
It seems SplitView can be moved into Document.h, so it can be reused for checkTextAndStyle in EditView::LayoutLine().
I tried this but didn't see a consistent speed up on some searching tests: some cases were a little better and some a little worse (+/- 10%). It's possible that trading some memory for logic pushes other data out of cache. This was with a simple dumb implementation: add two arrays on Document and set them up in SetDBCSCodePage.
Does this occur often enough to be worthwhile since [a-zA-Z] are case-sensitive?
OK, although it'll probably be CellBuffer.h to allow it to be used in CellBuffer.cxx and for there to be a method to return a SplitView from a CellBuffer instead of calling 4 methods.
Committed change to SplitView as [d950ae].
Related
Commit: [d950ae]
const std::string_view suffix = search + 1; can be changed to const std::string_view suffix(search + 1, lengthFind - 1); to avoid a second strlen().
PTRDIFF_MAX can be changed to -1, so (pos == PTRDIFF_MAX) becomes pos < 0.
SplitFindChar + SplitMatch is very slow for some search patterns, e.g. the one from https://github.com/zufuliu/notepad2/issues/236
https://github.com/zufuliu/notepad2/issues/236#issuecomment-708430913
https://github.com/zufuliu/notepad2/issues/236#issuecomment-708440553 (unfortunately the link https://fex.net/s/ypz5fv5 has expired; I have a local copy)
OK. [cf6689]
Fixed bug with SplitFindChar. [92fde0]
https://github.com/zufuliu/notepad2/issues/236:
Could also have a threshold for length and contents of search text: 15,000 commas is not reasonable. The goal of a highlight current word / selection is to show multiple occurrences of an identifier or similar.
Related
Commit: [92fde0]
Commit: [cf6689]
The test file is available again (thanks mahotilo) at http://mhdev.bplaced.net/data/_uploaded/file/consumption.zip
Such a long run of commas (first occurring on line 2 after a GUID: 16129 commas) is rare in normal usage, but there is still a need to find them. Boyer-Moore or similar would improve the average case for different search patterns. P.S. Python is going to use a glibc-memmem-like two-way algorithm: https://bugs.python.org/issue41972
While finding 15,000 commas may be a useful deliberate action, automatically marking each occurrence is not. Placing too much emphasis on extreme cases leads to code that degrades other cases. Due to initialization costs, boyer-moore is slower on short searches.
Python switches to two-way when size thresholds are exceeded. I think it's needle >= 100 and haystack >= 2000 on this line although there's another check for m >= 100 && w >= 8000.