Menu

#2405 Regex backward search and substitution bugs

Bug
closed-fixed
5
2023-12-25
2023-10-12
Zufu Liu
No

Split from [feature-requests:#1500] as a bug report.

  1. RESearch substitution failed for backward search due to bopat and eopat been cleared inside search.Execute().
  2. (related to [bugs:#2244]) bopat and eopat may contains position in the middle of a character for RESearch and ByteIterator, this can be fixed by call MovePositionOutsideChar() to ensure only whole characters selected (already did for first search.eopat[0]).
  3. RESearch backward search does not return longest match (e.g. for \w+ only find one letter/byte). this can be fixed by
    (A) similar to MatchOnLines(), change the code to Execute(di, search.eopat[0], endOfLine), then fix infinite loop for empty match like \<.
    (B) use current code, but save current longest match (consecutive matches that has same eopat[0], then first one is the longest). the code is better to use search.Execute(di, doc->NextPosition(pos, 1), endOfLine) to avoid search from middle of character (e.g. for \w+).
  4. I think it's better to put #ifdef REGEX_MULTILINE block inside an if forward search (infinite for backward search) to make it usable.
  5. (not bug?) I don't understand what if (match[0].first == match[0].second) is doing, as match is not changed by above std::regex_search().

Related

Bugs: #2244
Feature Requests: #1500

Discussion

1 2 3 > >> (Page 1 of 3)
  • Zufu Liu

    Zufu Liu - 2023-10-12

    then fix infinite loop for empty match like \<.

    This is related to [bugs:#2157] change bol to line start position or change BOW to ignore bol would fix the infinite loop.

     

    Related

    Bugs: #2157

  • Zufu Liu

    Zufu Liu - 2023-10-13
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -2,8 +2,8 @@
    
    
     1. `RESearch` substitution failed for backward search due to `bopat`  and `eopat` been cleared inside `search.Execute()`.
     2. (related to [bugs:#2244]) `bopat` and `eopat` may contains position in the middle of a character for `RESearch` and `ByteIterator`, this can be fixed by call `MovePositionOutsideChar()`  to ensure only whole characters selected (already did for first `search.eopat[0]`). 
    -3. `RESearch` backward search does not return longest match (e.g. for `\w` only find one letter/byte). this can be fixed by
    +3. `RESearch` backward search does not return longest match (e.g. for `\w+` only find one letter/byte). this can be fixed by
        (A) similar to `MatchOnLines()`, change the code to `Execute(di, search.eopat[0], endOfLine)`, then fix infinite loop for empty match like `\&lt;`.
        (B) use current code, but save current longest match (consecutive matches that has same `eopat[0]`, then first one is the longest). the code is better to use `search.Execute(di, doc-&gt;NextPosition(pos, 1), endOfLine)` to avoid search from middle of character (e.g. for `\w+`).
    -4. I think it&#39;s better to put `#ifdef REGEX_MULTILINE` block inside an if forward search (infinite for backward search) to it usable.
    +4. I think it&#39;s better to put `#ifdef REGEX_MULTILINE` block inside an if forward search (infinite for backward search) to make it usable.
     5. (not bug?) I don&#39;t understand what `if (match[0].first == match[0].second)` is doing, as `match` is not changed by above `std::regex_search()`.
    
     

    Related

    Bugs: #2244

  • Zufu Liu

    Zufu Liu - 2023-10-13

    First patch (split out from SubstituteByPosition-1011-2.diff) to fix substitution failure for RESearch backward search, and adding test case.

     
  • Neil Hodgson

    Neil Hodgson - 2023-10-14

    Probably worthwhile using std::array for bopat and eopat so they can be copied easily bopat = search.bopat.

        static constexpr int MAXTAG = 10;
        static constexpr int NOTFOUND = -1;
    
    
    -   Sci::Position bopat[MAXTAG];
    -   Sci::Position eopat[MAXTAG];
    +   using MatchPositions = std::array<Sci::Position, MAXTAG>;
    +   MatchPositions bopat;
    +   MatchPositions eopat;
        std::string pat[MAXTAG];
    
     private:
    
     
    • Zufu Liu

      Zufu Liu - 2023-10-14

      There is warning for bopat = search.bopat:
      warning C4701: potentially uninitialized local variable 'bopat' used

       
      • Zufu Liu

        Zufu Liu - 2023-10-14

        Updated patch to use std::array.

         
        • Zufu Liu

          Zufu Liu - 2023-10-14

          Fixed header order for <array>.

           
          • Neil Hodgson

            Neil Hodgson - 2023-10-14

            Committed with [82822d] and preparatory [c5acbb]. std::array::fill isn't noexcept so propagated to RESearch::Clear.

             

            Related

            Commit: [82822d]
            Commit: [c5acbb]

            • Zufu Liu

              Zufu Liu - 2023-10-14

              There is no bugprone-exception-escape warning from clang-tidy and VS Code Analysis.

              move [[fallthrough]]; out from the brace (a compound statement) fixed VS Code Analysis warning: warning C26819: Unannotated fallthrough between switch labels (es.78).

               
              • Neil Hodgson

                Neil Hodgson - 2023-10-15

                VS Code Analysis before removing noexcept:

                C:\u\hg\scintilla\src\RESearch.cxx(265): warning C26447: 
                The function is declared 'noexcept' but calls function 'fill()' 
                which may throw exceptions (f.6).
                C:\u\hg\scintilla\src\RESearch.cxx(266): warning C26447: 
                The function is declared 'noexcept' but calls function 'fill()' 
                which may throw exceptions (f.6).
                
                 
                • Zufu Liu

                  Zufu Liu - 2023-10-15

                  No warnings from 17.7.5, maybe because I use C++latest.

                   
            • Neil Hodgson

              Neil Hodgson - 2023-10-17

              This (current hg state) fails on Ubuntu Linux 23.10 with g++ 13.2:

              Document
                RegexSearchAndSubstitution
              -------------------------------------------------------------------------------
              testDocument.cxx:456
              ...............................................................................
              
              testDocument.cxx:484: FAILED:
                REQUIRE( lengthFinding == 5 )
              with expansion:
                2 == 5
              
              ===============================================================================
              test cases:   41 |   40 passed | 1 failed
              assertions: 3351 | 3350 passed | 1 failed
              
               
              • Zufu Liu

                Zufu Liu - 2023-10-17

                It passed on Windows msys2 GCC 13.2 with both UTF8Iterator.

                 
                • Neil Hodgson

                  Neil Hodgson - 2023-10-18

                  It appears to be a locale issue with only ASCII characters in the alnum set.

                   
                  • Zufu Liu

                    Zufu Liu - 2023-10-18

                    workaround based on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

                    @@ -38,6 +38,18 @@
                     using namespace Scintilla;
                     using namespace Scintilla::Internal;
                    
                    +#if !defined(_WIN32) && !defined(NO_CXX11_REGEX)
                    +// set global locale to pass std::regex related tests
                    +// see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776
                    +struct GlobalLocaleInitializer {
                    
                    +   GlobalLocaleInitializer() {
                    +       try {
                    +           std::locale::global(std::locale("en_US.UTF-8"));
                    +       } catch (...) {}
                    +   }
                    +} globalLocaleInitializer;
                    +#endif
                    +
                     // Test Document.
                    
                     struct Folding {
                    
                     
                    • Neil Hodgson

                      Neil Hodgson - 2023-10-18
                       

                      Related

                      Commit: [b11098]

  • Neil Hodgson

    Neil Hodgson - 2023-10-14

    I think it's better to put #ifdef REGEX_MULTILINE block inside an if forward search (infinite for backward search) to make it usable.

    If CRLF is still an issue then this really is just for experimenters not users.

     
    • Zufu Liu

      Zufu Liu - 2023-10-14

      OK, not full tested the three engines.

       
  • Neil Hodgson

    Neil Hodgson - 2023-10-14

    call MovePositionOutsideChar() to ensure only whole characters selected

    Yes, each tagged part should be extended to whole characters.

     
    • Zufu Liu

      Zufu Liu - 2023-10-14

      Add MovePositionOutsideChar() for RESearch and ByteIterator.

       
      • Neil Hodgson

        Neil Hodgson - 2023-10-15

        This is a more significant and dangerous change than I first thought. It isn't covered by examples or tests. Moving backwards has a potential for no-progress infinite loops. RESearch and ByteIterator are byte-string oriented and are unlikely to be made DBCS-safe with minor tweaks. DBCS-compliant regex would be a larger project.

        There are simple cases with a bracketed regular expression where the visual match doesn't match the tag values so can lose text: replacing (!.) with [\1], for example. Extending the end of tag ranges (eopats) similarly to the main match is safer.

         
        • Zufu Liu

          Zufu Liu - 2023-10-15

          Change for ByteIterator seems is safe as Pos() is only used to construct final eopat not change matching behavior. RESearch.bopat can be changed at end of Execute()?

           
          • Zufu Liu

            Zufu Liu - 2023-10-16

            Changed BOT to bopat[static_cast<int>(*ap++)] = ci.MovePositionOutsideChar(lp, -1); to not decrease lp, but ensure tag range include whole character. added test (including failure example for DBCS).

            I don't understand when replace (!.) with[\1] would cause data lose, as dot now matches a whole character instead of single byte.

            Not sure whether is worth to fix DBCS problem (match started from trailing ASCII byte), a simple fix/improvement is run RESearch::Execute() switch block in a loop, when match is found butlp is inside middle of character, try to match from next character. This is just few lines of change though.

             
            • Zufu Liu

              Zufu Liu - 2023-10-16

              run RESearch::Execute() switch block in a loop

              it's actually much simple:

              default:            /* regular matching all the way. */
                  while (lp < endp) {
                      ep = PMatch(ci, lp, endp, ap);
                      if (ep != NOTFOUND) {
                          // fix match started from DBCS trailing ASCII byte
                          const Sci::Position pos = ci.MovePositionOutsideChar(lp, -1);
                          if (pos == lp) {
                              break;
                          }
                      }
                      lp++;
                  }
                  break;
              
               
1 2 3 > >> (Page 1 of 3)

Log in to post a comment.

MongoDB Logo MongoDB