Scintilla / Feature Requests / #1381 Optimize Document::FindText()

Zufu Liu - 2020-10-16

Case sensitive search can use the Boyer-Moore-Horspool-Sunday algorithm, which works better (except when lengthFind is 1), I added a implementation based on code on http://www-igm.univ-mlv.fr/~lecroq/string/node19.html in https://github.com/zufuliu/notepad2/commit/334befc6c0b945f60f5ea5241258bd9fcbb2cf39

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Zufu Liu - 2020-10-21
  
  https://github.com/zufuliu/notepad2/commit/d308d41f7ec7bf876a76780b691572752bb4372c fixed backward search bug with Sunday algorithm.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2021-06-30

In commit 2:

if (leadByte < safeChar) { pos += increment;

Isn't this safe only for forward searches as the leadByte doesn't determine the width of the previous character.

For Boyer-Moore there is an implementation in the standard library with <functional> containing boyer_moore_searcher and boyer_moore_horspool_searcher.
https://en.cppreference.com/w/cpp/utility/functional/boyer_moore_horspool_searcher
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Zufu Liu - 2021-06-30
  
  No, safeChar is 256 for single byte forward and backward, 0x80 (can be changed to 0xc2) for UTF-8 forward and backward.
  0x80 for DBCS forward, 0x31, 0x40 or 0x41 for DBCS backward based on the code page (the minimum trail byte).
  
  current Boyer-Moore-Horspool-Sunday code at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/Document.cxx#L2006
  libc++ boyer_moore_horspool_searcher at
  https://github.com/llvm/llvm-project/blob/main/libcxx/include/experimental/functional#L377
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Neil Hodgson - 2021-06-30
    
    Here is a test case that relies on the ambiguity of trail bytes in DBCS:
    
    SECTION("SearchInShiftJIS") { // {CJK UNIFIED IDEOGRAPH-9955} is two bytes: {0xE9, 'b'} in Shift-JIS // The 'b' can be incorrectly matched by the search string 'b' when the search // does not iterate the text correctly. DocPlus doc("ab\xe9" "b ", 932); // a b {CJK UNIFIED IDEOGRAPH-9955} {space} std::string finding = "b"; // Search forwards Sci::Position lengthFinding = finding.length(); Sci::Position location = doc.document.FindText(0, doc.document.Length(), finding.c_str(), FindOption::MatchCase, &lengthFinding); REQUIRE(location == 1); // Search backwards lengthFinding = finding.length(); location = doc.document.FindText(doc.document.Length(), 0, finding.c_str(), FindOption::MatchCase, &lengthFinding); REQUIRE(location == 1); }
    
    The space byte was chosen to provoke the safeCharDBCS case in the second patch.
    
    Two test cases were added to test/unit/testDocument.cxx with [d20926] . It may help maintain the searching feature through optimizations to add some more tests for edge cases.
    
    One direction that may be useful in optimizing is to remove a layer or two of abstraction from NextCharacter / NextPosition by writing specialized NextCharacter functions for iterating SBCS/UTF-8/DBCS (or further specializing for forwards and backwards) then using them in code that has checked for encoding type or writing generic search code that takes them as template arguments.
    
    Related
    
    Commit: [d20926]
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Neil Hodgson - 2021-07-01
    
    While UTF-8 doesn't show the same ambiguous trail byte bug as DBCS due to its self-synchronizing nature, it has changed the way backward iteration is performed, potentially leading to different results over invalid UTF-8. If possible, forward and backward iteration should try to match at whole characters, not character fragments and the same positions should be checked backwards and forwards.
    
    Backwards searches are performed much less often than forward searches so it is fine to concentrate optimization on the forward case and use accurate but slow code for backwards.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

For UTF-8 forward search, the self-synchronizing aspect of UTF-8 makes it safe to perform the same byte-by-byte scan as for SBCS when the search string is valid UTF-8. A non-trail byte (all characters start with a non-trail byte) at the start of the search string will always match at character/fragment start positions just like UTF-8 iteration does. It is extremely rare to search for invalid UTF-8.

One occasion where searching for invalid UTF-8 may be reasonable could be when some process is buggy and produces invalid files with character fragments like isolated trail bytes. A search string that starts with a trail byte could find these but it shouldn't match trail-bytes that are part of a valid character. Thus, the document should be searched by following the UTF-8 iteration scheme, not byte-by-byte.

Much coverage of searching performance points to combining memchr and memcmp as a simple technique that works well because of how much those two calls are optimized. The basic code for a contiguous buffer looks similar to:

while (p < end) {
    p = memchr(p, search[0], end - p);
    if (!p)
        return notThere;
   if (!memcmp(p+1, search+1, lenSearch-1))
        return p - start;
    p++;
}

In Scintilla, the memory isn't contiguous but an adapter can hide that.

Start with a class to provide a view over the split vector:

struct SplitView {
    const char *s1 = nullptr;
    size_t len1 = 0;
    const char *s2 = nullptr;
    size_t len = 0;

    SplitView(CellBuffer &cb) {
        s1 = cb.RangePointer(0, 0);
        len1 = cb.GapPosition();
        s2 = cb.RangePointer(len1, 0) - len1;
        len = cb.Length();
    }

    char CharAt(size_t position) const noexcept {
        if (position < len1) {
            return s1[position];
        }
        if (position < len) {
            return s2[position];
        }
        return 0;
    }
}

Then add methods that are versions of memchr and memcmp that work over the split (not thoroughly tested):

size_t FindChar(size_t start, size_t length, int ch) const {
    size_t range1Length = 0;
    if (start < len1) {
        range1Length = std::min(length, len1 - start);
        const char *match = static_cast<const char *>(memchr(s1 + start, ch, range1Length));
        if (match) {
            return match - s1;
        }
        start += range1Length;
    }
    const char *match2 = static_cast<const char *>(memchr(s2 + start, ch, length - range1Length));
    if (match2) {
        return match2 - s2 + len1;
    }
    return SIZE_MAX;
}

bool Match(size_t start, std::string_view sv) {
    // Break into 2 and memcmp both
    size_t range1Length = 0;
    if (start < len1) {
        range1Length = std::min(sv.length(), len1 - start);
        if (memcmp(s1 + start, sv.data(), range1Length) != 0) {
            return false;
        }
        start += range1Length;
        sv.remove_prefix(range1Length);
        if (sv.empty()) {
            return true;
        }
    }
    return memcmp(s2 + start, sv.data(), sv.length() - range1Length) == 0;
}

Forward searching for SBCS or valid UTF-8 is then:

SplitView cbView(cb);
std::string_view suffix = search + 1;
while (pos < endSearch) {
    pos = cbView.FindChar(pos, limitPos-pos, search[0]);
    if (pos == SIZE_MAX) {
        break;
    }
    if (cbView.Match(pos+1, suffix) && CheckOtherConditions(pos)) {
        return pos;
    }
    pos++;
}

This is quite a bit faster (around 8x for a 6 byte not-present search string on Editor.cxx) than the current code and has very little initialisation overhead. Reverse searching requires memrchr which isn't portable but implementing a non-optimised version is simple.

Alternate implementation: add FindChar and Match to SplitVector and make available through CellBuffer. SplitView becomes invalid after any document modification but CellBuffer remains valid. May lose some performance through extra layers of calls.

Zufu Liu - 2021-07-06

The memmem approach would works for UTF-8 and DBCS case sensitive forward search when first search is not a trail byte.

libc++ developer says cbView.Match(pos, search) is faster, see __search_substring for std::string_view::find at https://github.com/llvm/llvm-project/blob/main/libcxx/include/__string#L946

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

A standard library compatible iterator over the document will allow use of wide variety of standard library features, such as std::search. An iterator that works with all std::search searchers and takes a SplitView by pointer (adding the SplitView into the iterator was slower, possibly as it doubles the size to 32 bytes) produced these results (milliseconds for 20 runs over Editor.cxx * 100) for MSVC 2019 64-bit release mode:

Forward case-sensitive search SBCS and UTF8

milliseconds	case
444	collapse gap then std::string_view of whole document and std::string_view::find
934	std::default_searcher over BIterator
473	std::boyer_moore_searcher over BIterator
453	std::boyer_moore_horspool_searcher over BIterator
475	SplitView FindChar + Match (memchr + memcmp)

I wouldn't read too much into the 5% between most of the cases. Its good evidence that collapsing the gap isn't needed for fast searching.

As well as the simple byte-by-byte iterator, it may be worthwhile having a DBCS iterator that treats the document as a sequence of characters instead of bytes like UTF8Iterator does to interface with std::regex.

BIterator, based on ByteIterator but with extra functions as needed:

class BIterator {
public:
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef char value_type;
    typedef ptrdiff_t difference_type;
    typedef char *pointer;
    typedef char &reference;

    const SplitView *sv;
    Sci::Position position;

    BIterator(const SplitView *sv_, Sci::Position position_ = 0) noexcept :
        sv(sv_), position(position_) {
    }
    BIterator(const BIterator &other) noexcept {
        sv = other.sv;
        position = other.position;
    }
    BIterator(BIterator &&other) noexcept {
        sv = other.sv;
        position = other.position;
    }
    BIterator &operator=(const BIterator &other) noexcept {
        if (this != &other) {
            sv = other.sv;
            position = other.position;
        }
        return *this;
    }
    BIterator &operator=(BIterator &&) noexcept = default;
    ~BIterator() = default;
    char operator*() const noexcept {
        return sv->CharAt(position);
    }
    BIterator &operator++() noexcept {
        position++;
        return *this;
    }
    BIterator operator++(int) noexcept {
        BIterator retVal(*this);
        position++;
        return retVal;
    }
    BIterator &operator--() noexcept {
        position--;
        return *this;
    }
    BIterator &operator+=(ptrdiff_t a) noexcept {
        position += a;
        return *this;
    }
    ptrdiff_t operator-(const BIterator &other) const noexcept {
        return position - other.position;
    }
    BIterator operator-(const ptrdiff_t difference) const noexcept {
        return BIterator(sv, position - difference);
    }
    BIterator operator+(const ptrdiff_t difference) const noexcept {
        return BIterator(sv, position + difference);
    }
    bool operator==(const BIterator &other) const noexcept {
        return sv == other.sv && position == other.position;
    }
    bool operator!=(const BIterator &other) const noexcept {
        return sv != other.sv || position != other.position;
    }
    Sci::Position Pos() const noexcept {
        return position;
    }
};

This can be used like so:

SplitView cbView(cb);
BIterator itStart(&cbView, startPos);
BIterator itEnd(&cbView, endPos);

std::boyer_moore_searcher theSearcher(search, search + lengthFind);
// or std::default_searcher or std::boyer_moore_horspool_searcher
BIterator it = itStart;
while (it != itEnd) {
    it = std::search(it, itEnd, theSearcher);
    if (it == itEnd) {
        break;
    }
    if (MatchesWordOptions(word, wordStart, it.Pos(), lengthFind)) {
        return it.Pos();
    }
    it++;
}

Neil Hodgson - 2021-07-08

Measured with MinGW g++ 9.2.0 -Os (optimized for space) on Windows using the same cases as with MSVC above. Some minor fixes were made. This table includes the start up cost of each approach by performing a search (for "Scintilla" in Editor.cxx) that succeeds quickly. The search values shouldn't be directly compared with the startup values as they are for different cases and repetitions.

search start-up case

609 953 collapse gap

1155 1165 default_searcher

529 6127 boyer_moore_searcher

517 3514 boyer_moore_horspool_searcher

716 1004 SplitView FindChar + Match

The Boyer-Moore variants start-up times are sensitive to the actual string being searched for and boyer_moore_horspool_searcher is commonly closer to boyer_moore_searcher.

There are many factors that go into the start-up and search speeds but, very roughly speaking, the Boyer-Moore variants reach parity with the 2 fastest setup cases after searching about 3K. This isn't significant for user-initiated find dialogs but its more of a concern for programmatic actions like highlighting each instance of a common identifier.

There could be another option flag to indicate whether to use a searcher that requires more expensive initialisation. Another possibility is a multi-find API that marks each match with an indicator and spreads the start-up cost over multiple matches.

Its possible that more optimisation of SplitView and BIterator may improve search results.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

search	start-up	case
609	953	collapse gap
1155	1165	default_searcher
529	6127	boyer_moore_searcher
517	3514	boyer_moore_horspool_searcher
716	1004	SplitView FindChar + Match

Neil Hodgson - 2021-07-12

There is a problem with case-insensitive SBCS search in Commit 1: UTF8IsAscii is called on a signed char so always fails. A case-insensitive search for б in Б б using Windows-1251 encoding will match the б but not the upper-case Б.

The ch argument should be cast to unsigned char.

const char ch = CharAt(pos + indexSearch); const char chTest = searchData[indexSearch]; if (UTF8IsAscii(static_cast<unsigned char>(ch))) { found = chTest == MakeLowerCase(ch);
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Zufu Liu - 2021-07-12
  
  Overloaded UTF8IsAscii for char and unsigned char can be added into UniConversion.h to make things works.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Committed similar to Commit1 and Commit2 as [49b5dc], [8fdbe4], [e44a54]. Made UTF8IsAscii more specific with [aab1df]. Added search tests in [55dc74] and added support for benchmarking using unit tests with [df32c5].

There's strong speed improvement although it differs by case and compiler / optimization options. Benchmarks run on EditView.cxx searching for a 6 byte needle that isn't present.

.	MSVC /O2 /Oi /GL Original	New	Change	g++ -Os Original	New	Change	g++ -O2 Original	New	Change
case-insensitive search	4002	2116	189%	6493	3575	182%	4997	2035	246%
case-sensitive search	2277	1471	155%	4693	2450	192%	3714	1551	239%
case-sensitive reverse search	2402	1958	123%	4645	2948	158%	3936	2060	191%
case-insensitive search UTF8	4097	3340	123%	6491	3582	181%	4329	2514	172%
case-sensitive search UTF8	4081	1475	277%	6630	2447	271%	4855	1544	314%
case-insensitive search 932	8149	3909	208%	13980	5546	252%	10747	3860	278%
case-sensitive search 932	4664	1479	315%	8254	2448	337%	6460	1539	420%
Average			198%			224%			265%

Further improvements using SplitView in [c09fe4] and [dbb18a].

There are more improvements that can be made but adding more special cases can make other improvements and maintenance more difficult. It may be worthwhile adding a SplitViewIterator (like BIterator above) as a more 'standard C++' way of working as that allows using standard library features more often.

Other improvements that are reasonable but don't appear to be huge wins include:

Converting IsDBCSLeadByteNoExcept to a table lookup, avoiding the switch (dbcsCodePage).
Changing the single byte case folding table so its simpler and not behind a virtual call
Using indexed lookup for a subset of Unicode characters instead of std::lower_bound in a sorted array which is log(n). This improves case-insensitive searching in a document that is mostly in a language like Russian.
Using boyer-moore or similar

Commit: [49b5dc]
Commit: [55dc74]
Commit: [8fdbe4]
Commit: [aab1df]
Commit: [c09fe4]
Commit: [dbb18a]
Commit: [df32c5]
Commit: [e44a54]

Last edit: Neil Hodgson 2021-07-14

Neil Hodgson - 2021-07-14

Attached a sketchy (may not work) implementation of SplitViewIterator.

SplitViewIterator.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2021-07-14

The safeChar aspect of case-sensitive search in Commit2 was not included as that led to problems with reverse search for DBCS and forward && UTF8IsAscii(leadByte) achieved good performance for the most common cases.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-07-14

I have a tabled version for IsDBCSLeadByteNoExcept() and IsDBCSTrailByteNoExcept() at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/CharClassify.h#L77 and
https://github.com/zufuliu/notepad2/blob/master/scintilla/src/CharClassify.cxx#L699
but it use dynamic allocation (which may fails, so is different from current switch code). howerver the table can be used for Platform code, e.g. SurfaceD2D::MeasureWidths().

As there are only 2847 (where 966 less than 0x600) case sensitive characters (Python 3.9, Unicode 13). I made a small function to check character sensitive used in Notepad2 to auto set SCFIND_MATCHCASE. it generated by updateCaseSensitivityBlock in https://github.com/zufuliu/notepad2/blob/master/scintilla/scripts/GenerateCaseConvert.py#L412

It seems SplitView can be moved into Document.h, so it can be reused for checkTextAndStyle in EditView::LayoutLine().

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2021-07-15
  
  tabled version for IsDBCSLeadByteNoExcept() and IsDBCSTrailByteNoExcept()
  
  I tried this but didn't see a consistent speed up on some searching tests: some cases were a little better and some a little worse (+/- 10%). Its possible trading some memory for logic pushes other data out of cache. This was with a simple dumb implementation: add two arrays on Document and set them up in SetDBCSCodePage.
  
  --- a/src/Document.h Mon Jul 12 15:55:43 2021 +1000 +++ b/src/Document.h Thu Jul 15 12:01:50 2021 +1000 @@ -271,6 +271,8 @@ Scintilla::EndOfLine eolMode; /// Can also be SC_CP_UTF8 to enable UTF-8 mode int dbcsCodePage; + bool leadBytes[0x100] {}; + bool trailBytes[0x100] {}; Scintilla::LineEndType lineEndBitSet; int tabInChars; int indentInChars; --- a/src/Document.cxx Mon Jul 12 15:55:43 2021 +1000 +++ b/src/Document.cxx Thu Jul 15 12:01:50 2021 +1000 @@ -237,6 +237,12 @@ cb.SetLineEndTypes(lineEndBitSet & LineEndTypesSupported()); cb.SetUTF8Substance(CpUtf8 == dbcsCodePage); ModifiedAt(0); // Need to restyle whole document + if (dbcsCodePage && dbcsCodePage != CpUtf8) { + for (int byteValue = 0; byteValue < 0x100; byteValue++) { + leadBytes[byteValue] = IsDBCSLeadByteNoExcept(byteValue); + trailBytes[byteValue] = IsDBCSTrailByteNoExcept(byteValue); + } + } return true; } else { return false;
  
  check character sensitive used in Notepad2 to auto set SCFIND_MATCHCASE
  
  Does this occur often enough to be worthwhile since [a-zA-Z] are case-sensitive?
  
  It seems SplitView can be moved into Document.h,
  
  OK, although it'll probably be CellBuffer.h to allow it to be used in CellBuffer.cxx and for there to be a method to return a SplitView from a CellBuffer instead of calling 4 methods.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2021-07-15

Committed change to SplitView as [d950ae].

Related

Commit: [d950ae]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-07-15

const std::string_view suffix = search + 1; can be changed to const std::string_view suffix(search + 1, lengthFind - 1); to avoid second strlen().
PTRDIFF_MAX can be changed to -1, so(pos == PTRDIFF_MAX) => pos < 0.

SplitFindChar+SplitMatch is very slow for some search pattern, e.g. the one from https://github.com/zufuliu/notepad2/issues/236

https://github.com/zufuliu/notepad2/issues/236#issuecomment-708430913
https://github.com/zufuliu/notepad2/issues/236#issuecomment-708440553 (unfortunately the link https://fex.net/s/ypz5fv5 was expired, I have a local copy)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2021-07-16
  
  suffix ... PTRDIFF_MAX
  
  OK. [cf6689]
  
  Fixed bug with SplitFindChar. [92fde0]
  
  https://github.com/zufuliu/notepad2/issues/236:
  
  There should be some threshold to automatically disable this feature for big files.
  
  Could also have a threshold for length and contents of search text: 15,000 commas is not reasonable. The goal of a highlight current word / selection is to show multiple occurrences of an identifier or similar.
  
  Related
  
  Commit: [92fde0]
  Commit: [cf6689]
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2021-07-16

The test file is available again (thanks mahotilo) at http://mhdev.bplaced.net/data/_uploaded/file/consumption.zip

such long commas (first occurred on line 2 after GUID, 16129 commas) is rare for normal usage, but there still have the needs to find them. boyer-moore or similar would improve average cases for different search pattern. P.S. Python is going to use glibc memmem like two-way: https://bugs.python.org/issue41972

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2021-07-16
  
  While finding 15,000 commas may be a useful deliberate action, automatically marking each occurrence is not. Placing too much emphasis on extreme cases leads to code that degrades other cases. Due to initialization costs, boyer-moore is slower on short searches.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2021-07-16
  
  Python switches to two-way when size thresholds are exceeded. I think its needle >= 100 and haystack >= 2000 on this line although there's another check for m >= 100 && w >= 8000.
  
  if (m >= 100 && w >= 2000 && w / m >= 5) {
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Optimize Document::FindText()

Group

Searches

Help

#1381 Optimize Document::FindText()

Related

Discussion

Related

Related

Related

Related