Menu

#1381 Optimize Document::FindText()

Initial
open
None
5
2021-07-16
2020-10-15
Zufu Liu
No

During fix https://github.com/zufuliu/notepad2/issues/236 I found several changes that improve the speed.
They are in following three commits:
Commit 1: https://github.com/zufuliu/notepad2/commit/526cee6363f32c041a7c366421878ef7a8362aeb
Commit 2: https://github.com/zufuliu/notepad2/commit/84ea3babbc6547c43962573a31c220f7c929594f
Commit 3: https://github.com/zufuliu/notepad2/commit/de27fd98f15769d2e6c91e54e106b0224ea42280

Summery of changes:
Commit 1. For case insensitive search, use MakeLowerCase() to replace pcf->Fold() when widthChar is 1 (UTF-8 and DBCS), or when it's ASCII (SBCS).

Commit 2. Avoid calling NextCharacter(). For SBCS, it can be replaced with pos += increment. For DBCS, pos += widthFirstCharacter; can be used for forward search (similar to UTF-8). For case sensitive search, pos += increment can be used when CharAt(pos) is a safe one byte character (encoded in single byte, not part of any multi-byte sequence, i.e. all characters in SBCS, ASCII in UTF-8, C0 control characters in DBCS).

Commit 3: Optimized case sensitive search for long pattern by move pos to one after fist mismatch character when the character is not in search pattern. This helps a lot when document contains text that matches begging of the long pattern, but followed by character not in the pattern (as in the linked Notepad2 issue).

Beside, I think SciTE's MatchMarker::Continue() can be changed to timing on bytes instead of lines, see [feature-requests:#1373].

Related

Feature Requests: #1373

Discussion

  • Neil Hodgson

    Neil Hodgson - 2021-06-30

    In commit 2:

    if (leadByte < safeChar) {
        pos += increment;
    

    Isn't this safe only for forward searches as the leadByte doesn't determine the width of the previous character.

    For Boyer-Moore there is an implementation in the standard library with <functional> containing boyer_moore_searcher and boyer_moore_horspool_searcher.
    https://en.cppreference.com/w/cpp/utility/functional/boyer_moore_horspool_searcher

     
    • Zufu Liu

      Zufu Liu - 2021-06-30

      No, safeChar is 256 for single byte forward and backward, 0x80 (can be changed to 0xc2) for UTF-8 forward and backward.
      0x80 for DBCS forward, 0x31, 0x40 or 0x41 for DBCS backward based on the code page (the minimum trail byte).

      current Boyer-Moore-Horspool-Sunday code at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/Document.cxx#L2006
      libc++ boyer_moore_horspool_searcher at
      https://github.com/llvm/llvm-project/blob/main/libcxx/include/experimental/functional#L377

       
      • Neil Hodgson

        Neil Hodgson - 2021-06-30

        Here is a test case that relies on the ambiguity of trail bytes in DBCS:

        SECTION("SearchInShiftJIS") {
            // {CJK UNIFIED IDEOGRAPH-9955} is two bytes: {0xE9, 'b'} in Shift-JIS
            // The 'b' can be incorrectly matched by the search string 'b' when the search
            // does not iterate the text correctly.
            DocPlus doc("ab\xe9" "b ", 932);    // a b {CJK UNIFIED IDEOGRAPH-9955} {space}
            std::string finding = "b";
            // Search forwards
            Sci::Position lengthFinding = finding.length();
            Sci::Position location = doc.document.FindText(0, doc.document.Length(), finding.c_str(), FindOption::MatchCase, &lengthFinding);
            REQUIRE(location == 1);
            // Search backwards
            lengthFinding = finding.length();
            location = doc.document.FindText(doc.document.Length(), 0, finding.c_str(), FindOption::MatchCase, &lengthFinding);
            REQUIRE(location == 1);
        }
        

        The space byte was chosen to provoke the safeCharDBCS case in the second patch.

        Two test cases were added to test/unit/testDocument.cxx with [d20926] . It may help maintain the searching feature through optimizations to add some more tests for edge cases.

        One direction that may be useful in optimizing is to remove a layer or two of abstraction from NextCharacter / NextPosition by writing specialized NextCharacter functions for iterating SBCS/UTF-8/DBCS (or further specializing for forwards and backwards) then using them in code that has checked for encoding type or writing generic search code that takes them as template arguments.

         

        Related

        Commit: [d20926]

      • Neil Hodgson

        Neil Hodgson - 2021-07-01

        While UTF-8 doesn't show the same ambiguous trail byte bug as DBCS due to its self-synchronizing nature, it has changed the way backward iteration is performed, potentially leading to different results over invalid UTF-8. If possible, forward and backward iteration should try to match at whole characters, not character fragments and the same positions should be checked backwards and forwards.

        Backwards searches are performed much less often than forward searches so it is fine to concentrate optimization on the forward case and use accurate but slow code for backwards.

         
  • Neil Hodgson

    Neil Hodgson - 2021-07-05

    For UTF-8 forward search, the self-synchronizing aspect of UTF-8 makes it safe to perform the same byte-by-byte scan as for SBCS when the search string is valid UTF-8. A non-trail byte (all characters start with a non-trail byte) at the start of the search string will always match at character/fragment start positions just like UTF-8 iteration does. It is extremely rare to search for invalid UTF-8.

    One occasion where searching for invalid UTF-8 may be reasonable could be when some process is buggy and produces invalid files with character fragments like isolated trail bytes. A search string that starts with a trail byte could find these but it shouldn't match trail-bytes that are part of a valid character. Thus, the document should be searched by following the UTF-8 iteration scheme, not byte-by-byte.

    Much coverage of searching performance points to combining memchr and memcmp as a simple technique that works well because of how much those two calls are optimized. The basic code for a contiguous buffer looks similar to:

    while (p < end) {
        p = memchr(p, search[0], end - p);
        if (!p)
            return notThere;
       if (!memcmp(p+1, search+1, lenSearch-1))
            return p - start;
        p++;
    }
    

    In Scintilla, the memory isn't contiguous but an adapter can hide that.

    Start with a class to provide a view over the split vector:

    struct SplitView {
        const char *s1 = nullptr;
        size_t len1 = 0;
        const char *s2 = nullptr;
        size_t len = 0;
    
        SplitView(CellBuffer &cb) {
            s1 = cb.RangePointer(0, 0);
            len1 = cb.GapPosition();
            s2 = cb.RangePointer(len1, 0) - len1;
            len = cb.Length();
        }
    
        char CharAt(size_t position) const noexcept {
            if (position < len1) {
                return s1[position];
            }
            if (position < len) {
                return s2[position];
            }
            return 0;
        }
    }
    

    Then add methods that are versions of memchr and memcmp that work over the split (not thoroughly tested):

    size_t FindChar(size_t start, size_t length, int ch) const {
        size_t range1Length = 0;
        if (start < len1) {
            range1Length = std::min(length, len1 - start);
            const char *match = static_cast<const char *>(memchr(s1 + start, ch, range1Length));
            if (match) {
                return match - s1;
            }
            start += range1Length;
        }
        const char *match2 = static_cast<const char *>(memchr(s2 + start, ch, length - range1Length));
        if (match2) {
            return match2 - s2 + len1;
        }
        return SIZE_MAX;
    }
    
    bool Match(size_t start, std::string_view sv) {
        // Break into 2 and memcmp both
        size_t range1Length = 0;
        if (start < len1) {
            range1Length = std::min(sv.length(), len1 - start);
            if (memcmp(s1 + start, sv.data(), range1Length) != 0) {
                return false;
            }
            start += range1Length;
            sv.remove_prefix(range1Length);
            if (sv.empty()) {
                return true;
            }
        }
        return memcmp(s2 + start, sv.data(), sv.length() - range1Length) == 0;
    }
    

    Forward searching for SBCS or valid UTF-8 is then:

    SplitView cbView(cb);
    std::string_view suffix = search + 1;
    while (pos < endSearch) {
        pos = cbView.FindChar(pos, limitPos-pos, search[0]);
        if (pos == SIZE_MAX) {
            break;
        }
        if (cbView.Match(pos+1, suffix) && CheckOtherConditions(pos)) {
            return pos;
        }
        pos++;
    }
    

    This is quite a bit faster (around 8x for a 6 byte not-present search string on Editor.cxx) than the current code and has very little initialisation overhead. Reverse searching requires memrchr which isn't portable but implementing a non-optimised version is simple.

    Alternate implementation: add FindChar and Match to SplitVector and make available through CellBuffer. SplitView becomes invalid after any document modification but CellBuffer remains valid. May lose some performance through extra layers of calls.

     
  • Zufu Liu

    Zufu Liu - 2021-07-06

    The memmem approach would works for UTF-8 and DBCS case sensitive forward search when first search is not a trail byte.

    libc++ developer says cbView.Match(pos, search) is faster, see __search_substring for std::string_view::find at https://github.com/llvm/llvm-project/blob/main/libcxx/include/__string#L946

     
  • Neil Hodgson

    Neil Hodgson - 2021-07-07

    A standard library compatible iterator over the document will allow use of wide variety of standard library features, such as std::search. An iterator that works with all std::search searchers and takes a SplitView by pointer (adding the SplitView into the iterator was slower, possibly as it doubles the size to 32 bytes) produced these results (milliseconds for 20 runs over Editor.cxx * 100) for MSVC 2019 64-bit release mode:

    Forward case-sensitive search SBCS and UTF8

    milliseconds case
    444 collapse gap then std::string_view of whole document and std::string_view::find
    934 std::default_searcher over BIterator
    473 std::boyer_moore_searcher over BIterator
    453 std::boyer_moore_horspool_searcher over BIterator
    475 SplitView FindChar + Match (memchr + memcmp)

    I wouldn't read too much into the 5% between most of the cases. Its good evidence that collapsing the gap isn't needed for fast searching.

    As well as the simple byte-by-byte iterator, it may be worthwhile having a DBCS iterator that treats the document as a sequence of characters instead of bytes like UTF8Iterator does to interface with std::regex.

    BIterator, based on ByteIterator but with extra functions as needed:

    class BIterator {
    public:
        typedef std::bidirectional_iterator_tag iterator_category;
        typedef char value_type;
        typedef ptrdiff_t difference_type;
        typedef char *pointer;
        typedef char &reference;
    
        const SplitView *sv;
        Sci::Position position;
    
        BIterator(const SplitView *sv_, Sci::Position position_ = 0) noexcept :
            sv(sv_), position(position_) {
        }
        BIterator(const BIterator &other) noexcept {
            sv = other.sv;
            position = other.position;
        }
        BIterator(BIterator &&other) noexcept {
            sv = other.sv;
            position = other.position;
        }
        BIterator &operator=(const BIterator &other) noexcept {
            if (this != &other) {
                sv = other.sv;
                position = other.position;
            }
            return *this;
        }
        BIterator &operator=(BIterator &&) noexcept = default;
        ~BIterator() = default;
        char operator*() const noexcept {
            return sv->CharAt(position);
        }
        BIterator &operator++() noexcept {
            position++;
            return *this;
        }
        BIterator operator++(int) noexcept {
            BIterator retVal(*this);
            position++;
            return retVal;
        }
        BIterator &operator--() noexcept {
            position--;
            return *this;
        }
        BIterator &operator+=(ptrdiff_t a) noexcept {
            position += a;
            return *this;
        }
        ptrdiff_t operator-(const BIterator &other) const noexcept {
            return position - other.position;
        }
        BIterator operator-(const ptrdiff_t difference) const noexcept {
            return BIterator(sv, position - difference);
        }
        BIterator operator+(const ptrdiff_t difference) const noexcept {
            return BIterator(sv, position + difference);
        }
        bool operator==(const BIterator &other) const noexcept {
            return sv == other.sv && position == other.position;
        }
        bool operator!=(const BIterator &other) const noexcept {
            return sv != other.sv || position != other.position;
        }
        Sci::Position Pos() const noexcept {
            return position;
        }
    };
    

    This can be used like so:

    SplitView cbView(cb);
    BIterator itStart(&cbView, startPos);
    BIterator itEnd(&cbView, endPos);
    
    std::boyer_moore_searcher theSearcher(search, search + lengthFind);
    // or std::default_searcher or std::boyer_moore_horspool_searcher
    BIterator it = itStart;
    while (it != itEnd) {
        it = std::search(it, itEnd, theSearcher);
        if (it == itEnd) {
            break;
        }
        if (MatchesWordOptions(word, wordStart, it.Pos(), lengthFind)) {
            return it.Pos();
        }
        it++;
    }
    
     
  • Neil Hodgson

    Neil Hodgson - 2021-07-08

    Measured with MinGW g++ 9.2.0 -Os (optimized for space) on Windows using the same cases as with MSVC above. Some minor fixes were made. This table includes the start up cost of each approach by performing a search (for "Scintilla" in Editor.cxx) that succeeds quickly. The search values shouldn't be directly compared with the startup values as they are for different cases and repetitions.

    search start-up case
    609 953 collapse gap
    1155 1165 default_searcher
    529 6127 boyer_moore_searcher
    517 3514 boyer_moore_horspool_searcher
    716 1004 SplitView FindChar + Match

    The Boyer-Moore variants start-up times are sensitive to the actual string being searched for and boyer_moore_horspool_searcher is commonly closer to boyer_moore_searcher.

    There are many factors that go into the start-up and search speeds but, very roughly speaking, the Boyer-Moore variants reach parity with the 2 fastest setup cases after searching about 3K. This isn't significant for user-initiated find dialogs but its more of a concern for programmatic actions like highlighting each instance of a common identifier.

    There could be another option flag to indicate whether to use a searcher that requires more expensive initialisation. Another possibility is a multi-find API that marks each match with an indicator and spreads the start-up cost over multiple matches.

    Its possible that more optimisation of SplitView and BIterator may improve search results.

     
  • Neil Hodgson

    Neil Hodgson - 2021-07-12

    There is a problem with case-insensitive SBCS search in Commit 1: UTF8IsAscii is called on a signed char so always fails. A case-insensitive search for б in Б б using Windows-1251 encoding will match the б but not the upper-case Б.

    The ch argument should be cast to unsigned char.

    const char ch = CharAt(pos + indexSearch);
    const char chTest = searchData[indexSearch];
    if (UTF8IsAscii(static_cast<unsigned char>(ch))) {
        found = chTest == MakeLowerCase(ch);
    
     
    • Zufu Liu

      Zufu Liu - 2021-07-12

      Overloaded UTF8IsAscii for char and unsigned char can be added into UniConversion.h to make things works.

       
  • Neil Hodgson

    Neil Hodgson - 2021-07-14

    Committed similar to Commit1 and Commit2 as [49b5dc], [8fdbe4], [e44a54]. Made UTF8IsAscii more specific with [aab1df]. Added search tests in [55dc74] and added support for benchmarking using unit tests with [df32c5].

    There's strong speed improvement although it differs by case and compiler / optimization options. Benchmarks run on EditView.cxx searching for a 6 byte needle that isn't present.

    . MSVC /O2 /Oi /GL Original New Change g++ -Os Original New Change g++ -O2 Original New Change
    case-insensitive search 4002 2116 189% 6493 3575 182% 4997 2035 246%
    case-sensitive search 2277 1471 155% 4693 2450 192% 3714 1551 239%
    case-sensitive reverse search 2402 1958 123% 4645 2948 158% 3936 2060 191%
    case-insensitive search UTF8 4097 3340 123% 6491 3582 181% 4329 2514 172%
    case-sensitive search UTF8 4081 1475 277% 6630 2447 271% 4855 1544 314%
    case-insensitive search 932 8149 3909 208% 13980 5546 252% 10747 3860 278%
    case-sensitive search 932 4664 1479 315% 8254 2448 337% 6460 1539 420%
    Average 198% 224% 265%

    Further improvements using SplitView in [c09fe4] and [dbb18a].

    There are more improvements that can be made but adding more special cases can make other improvements and maintenance more difficult. It may be worthwhile adding a SplitViewIterator (like BIterator above) as a more 'standard C++' way of working as that allows using standard library features more often.

    Other improvements that are reasonable but don't appear to be huge wins include:

    • Converting IsDBCSLeadByteNoExcept to a table lookup, avoiding the switch (dbcsCodePage).
    • Changing the single byte case folding table so its simpler and not behind a virtual call
    • Using indexed lookup for a subset of Unicode characters instead of std::lower_bound in a sorted array which is log(n). This improves case-insensitive searching in a document that is mostly in a language like Russian.
    • Using boyer-moore or similar
     

    Related

    Commit: [49b5dc]
    Commit: [55dc74]
    Commit: [8fdbe4]
    Commit: [aab1df]
    Commit: [c09fe4]
    Commit: [dbb18a]
    Commit: [df32c5]
    Commit: [e44a54]


    Last edit: Neil Hodgson 2021-07-14
  • Neil Hodgson

    Neil Hodgson - 2021-07-14

    Attached a sketchy (may not work) implementation of SplitViewIterator.

     
  • Neil Hodgson

    Neil Hodgson - 2021-07-14

    The safeChar aspect of case-sensitive search in Commit2 was not included as that led to problems with reverse search for DBCS and forward && UTF8IsAscii(leadByte) achieved good performance for the most common cases.

     
  • Zufu Liu

    Zufu Liu - 2021-07-14

    I have a tabled version for IsDBCSLeadByteNoExcept() and IsDBCSTrailByteNoExcept() at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/CharClassify.h#L77 and
    https://github.com/zufuliu/notepad2/blob/master/scintilla/src/CharClassify.cxx#L699
    but it use dynamic allocation (which may fails, so is different from current switch code). howerver the table can be used for Platform code, e.g. SurfaceD2D::MeasureWidths().

    As there are only 2847 (where 966 less than 0x600) case sensitive characters (Python 3.9, Unicode 13). I made a small function to check character sensitive used in Notepad2 to auto set SCFIND_MATCHCASE. it generated by updateCaseSensitivityBlock in https://github.com/zufuliu/notepad2/blob/master/scintilla/scripts/GenerateCaseConvert.py#L412

    It seems SplitView can be moved into Document.h, so it can be reused for checkTextAndStyle in EditView::LayoutLine().

     
    • Neil Hodgson

      Neil Hodgson - 2021-07-15

      tabled version for IsDBCSLeadByteNoExcept() and IsDBCSTrailByteNoExcept()

      I tried this but didn't see a consistent speed up on some searching tests: some cases were a little better and some a little worse (+/- 10%). Its possible trading some memory for logic pushes other data out of cache. This was with a simple dumb implementation: add two arrays on Document and set them up in SetDBCSCodePage.

      --- a/src/Document.h    Mon Jul 12 15:55:43 2021 +1000
      +++ b/src/Document.h    Thu Jul 15 12:01:50 2021 +1000
      @@ -271,6 +271,8 @@
          Scintilla::EndOfLine eolMode;
          /// Can also be SC_CP_UTF8 to enable UTF-8 mode
          int dbcsCodePage;
      +   bool leadBytes[0x100] {};
      +   bool trailBytes[0x100] {};
          Scintilla::LineEndType lineEndBitSet;
          int tabInChars;
          int indentInChars;
      --- a/src/Document.cxx  Mon Jul 12 15:55:43 2021 +1000
      +++ b/src/Document.cxx  Thu Jul 15 12:01:50 2021 +1000
      @@ -237,6 +237,12 @@
              cb.SetLineEndTypes(lineEndBitSet & LineEndTypesSupported());
              cb.SetUTF8Substance(CpUtf8 == dbcsCodePage);
              ModifiedAt(0);  // Need to restyle whole document
      +       if (dbcsCodePage && dbcsCodePage != CpUtf8) {
      +           for (int byteValue = 0; byteValue < 0x100; byteValue++) {
      +               leadBytes[byteValue] = IsDBCSLeadByteNoExcept(byteValue);
      +               trailBytes[byteValue] = IsDBCSTrailByteNoExcept(byteValue);
      +           }
      +       }
              return true;
          } else {
              return false;
      

      check character sensitive used in Notepad2 to auto set SCFIND_MATCHCASE

      Does this occur often enough to be worthwhile since [a-zA-Z] are case-sensitive?

      It seems SplitView can be moved into Document.h,

      OK, although it'll probably be CellBuffer.h to allow it to be used in CellBuffer.cxx and for there to be a method to return a SplitView from a CellBuffer instead of calling 4 methods.

       
  • Neil Hodgson

    Neil Hodgson - 2021-07-15

    Committed change to SplitView as [d950ae].

     

    Related

    Commit: [d950ae]

  • Zufu Liu

    Zufu Liu - 2021-07-16

    The test file is available again (thanks mahotilo) at http://mhdev.bplaced.net/data/_uploaded/file/consumption.zip

    such long commas (first occurred on line 2 after GUID, 16129 commas) is rare for normal usage, but there still have the needs to find them. boyer-moore or similar would improve average cases for different search pattern. P.S. Python is going to use glibc memmem like two-way: https://bugs.python.org/issue41972

     
    • Neil Hodgson

      Neil Hodgson - 2021-07-16

      While finding 15,000 commas may be a useful deliberate action, automatically marking each occurrence is not. Placing too much emphasis on extreme cases leads to code that degrades other cases. Due to initialization costs, boyer-moore is slower on short searches.

       
    • Neil Hodgson

      Neil Hodgson - 2021-07-16

      Python switches to two-way when size thresholds are exceeded. I think its needle >= 100 and haystack >= 2000 on this line although there's another check for m >= 100 && w >= 8000.

      if (m >= 100 && w >= 2000 && w / m >= 5) {
      
       

Log in to post a comment.