#613 Regular expressions search/replace issue within text files with special characters (german Umlaute like ä, ö, ...)

Milestone: Release_xx.yy
Status: open
Owner: nobody
Labels: search (10)
Type: Patch
Updated: 2024-01-02
Created: 2018-02-01
Creator: CBUser2017
Private: No

Hi folks,

I'm having issues with C::B 16.01 when using regular expressions to find/replace text in plain text files containing German umlauts (ä, ö, ü), with the advanced regexp feature enabled.

The search itself (in file src/src/find_replace.cpp) works, but the selection sometimes starts at the wrong position and, if the matched text contains special characters, has the wrong length.

The problem is that wxRegEx.GetMatch returns the character position (not the byte position!) and the number of characters matched by the regexp.
Setting the correct selection (start position, selection length) requires the byte position and the byte length.
Both can be calculated using the function UTF8Length from

#include "../src/scintilla/src/Uniconversion.h"

size_t utf8start = UTF8Length( text.c_str(), start );
size_t utf8len = UTF8Length( text.Mid( start, len ).c_str(), len );
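
To illustrate the difference between character counts and byte counts independently of wxWidgets and Scintilla, here is a minimal standard C++ sketch. The helper names `utf8Size` and `utf8ByteLength` are hypothetical and not part of either library; the text is assumed to be available as a sequence of Unicode code points.

```cpp
#include <cstddef>
#include <string>

// Number of bytes a single Unicode code point occupies in UTF-8.
std::size_t utf8Size(char32_t codePoint)
{
    if (codePoint < 0x80)    return 1; // ASCII
    if (codePoint < 0x800)   return 2; // e.g. ä, ö, ü
    if (codePoint < 0x10000) return 3;
    return 4;
}

// Hypothetical helper: UTF-8 byte length of `count` characters of `text`,
// starting at character index `first` (clamped to the end of the text).
std::size_t utf8ByteLength(const std::u32string& text,
                           std::size_t first, std::size_t count)
{
    std::size_t bytes = 0;
    for (std::size_t i = first; i < text.size() && i - first < count; ++i)
        bytes += utf8Size(text[i]);
    return bytes;
}
```

With the text `U"\u00E4void Test"`, the first 6 characters occupy 7 bytes, so a match starting at character 6 starts at byte 7.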

As an experiment, I've already made a small patch to the ::Find function around line 1172 (don't forget to add the include mentioned above!):

        if (re.Matches(text))
        {
            size_t start, len;
            re.GetMatch(&start, &len, 0);
            size_t utf8start = UTF8Length( text.c_str(), start );
            pos = utf8start + data->start; // before: pos = start + data->start;
            size_t utf8len = UTF8Length( text.Mid( start, len ).c_str(), len ) ;
            len = utf8len;
            lengthFound = len;

            if ((start==0) && (len==0)) //For searches for "^" or "$" (and null returning variants on this) need to make sure we have forward progress and not simply matching on a previous BOL/EOL find
            {
                text = text.Mid(1);
                if (re.Matches(text))
                {
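                    re.GetMatch(&start, &len, 0);
                    utf8start = UTF8Length( text.c_str(), start );
                    // NOTE: the "+ 1" assumes the skipped first character is a single byte in UTF-8
                    pos = utf8start + data->start + 1; // before: pos = start + data->start + 1;
                    utf8len = UTF8Length( text.Mid( start, len ).c_str(), len );
                    lengthFound = utf8len; // use the byte length, as in the first branch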
                }
                else
                    pos=-1;
            }
        }

To test the faulty behaviour I've made a very small file containing:
ävoid Test::Functionä0( void )
ävoid Test::Functionä1( void )

If you now start a regexp search using the pattern Test.+( you'll see that, without the patch, C::B selects " Test::Functionä" instead of "Test::Functionä0(".

Here we see both effects in one:
1. the invalid start position
2. the invalid selection length

The behaviour is easy to explain: each ä character requires 2 bytes in its UTF-8 representation (which Scintilla uses internally). The first "ä" leads to the invalid start position, while the second one leads to the invalid length.

I use find-and-replace with regular expressions very often, but with the advanced regexp option it is practically unusable. For now I've deactivated the advanced regexp.

Could someone please have a look at this problem? So far I've only looked at the find feature...

Kind regards

ChristophMS

Discussion

  • Teodor Petrov - 2018-02-01

    Can you please provide a patch against latest trunk version?
    See how to do it: http://wiki.codeblocks.org/index.php/Creating_a_patch_to_submit_(Patch_Tracker)

     
  • CBUser2017 - 2018-02-01

    Sorry, I could not find the "Submit A Patch" link, so I attached a diff file.
    Hopefully you're able to use it.

    I've also had a look at the other find/replace functions and extended the earlier "patch".
    There are now three patched sections. It has been tested in a small environment and seems to work.
    Please check whether I've found all relevant code lines before applying the patch :-)

    Kind regards

    ChristophMS

     
  • Teodor Petrov - 2018-02-01

    Do you know that this is an internal Scintilla header and that it is not supposed to be used by applications?

     
  • Teodor Petrov - 2018-02-02

    Note that you could probably use these Scintilla calls to get the correct positions: http://www.scintilla.org/ScintillaDoc.html#SCI_POSITIONBEFORE
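
A rough sketch of what such a position query does, emulated on a plain `std::string` so that it runs without Scintilla. The Scintilla documentation lists messages such as SCI_POSITIONAFTER and SCI_POSITIONRELATIVE for stepping over whole characters; the functions below (`positionRelative`, `matchToBytes`) and the `ByteRange` struct are hypothetical stand-ins, not Scintilla or Code::Blocks API. The sketch assumes valid UTF-8 and that one regex character corresponds to one Unicode code point.

```cpp
#include <cstddef>
#include <string>

// True if `byte` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool isContinuation(char byte)
{
    return (static_cast<unsigned char>(byte) & 0xC0) == 0x80;
}

// Starting at byte position `pos` (assumed to be on a character boundary),
// step forward `chars` whole characters and return the resulting byte
// position, clamped to the end of the buffer.
std::size_t positionRelative(const std::string& utf8,
                             std::size_t pos, std::size_t chars)
{
    while (chars > 0 && pos < utf8.size())
    {
        ++pos; // lead byte
        while (pos < utf8.size() && isContinuation(utf8[pos]))
            ++pos; // trailing bytes of the same character
        --chars;
    }
    return pos;
}

struct ByteRange
{
    std::size_t start;
    std::size_t length;
};

// Convert a match given in characters (relative to the byte position
// `searchStart`) into a byte start and byte length.
ByteRange matchToBytes(const std::string& utf8, std::size_t searchStart,
                       std::size_t charStart, std::size_t charLen)
{
    const std::size_t begin = positionRelative(utf8, searchStart, charStart);
    const std::size_t end = positionRelative(utf8, begin, charLen);
    return {begin, end - begin};
}
```

For the test line from the report, a match at character 6 with 17 characters maps to byte start 7 and byte length 18. In the editor, the same two steps would be delegated to the control instead of walking the bytes by hand.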

     
  • CBUser2017 - 2018-02-02

    Sorry, I'm not (yet) familiar with the internals of Scintilla and Code::Blocks. I found that header while debugging through the code. The patch I made was a "quick" one to get it working for me.
    I don't know how to use the Scintilla calls at the moment. If that's the better way, it should be used instead.

    Can you make a patch using the scintilla commands?

     
    • Teodor Petrov - 2018-02-02

      I could, but I don't have time to spend on this at the moment...

       
  • CBUser2017 - 2018-02-02

    It's the same for me at the moment, but that's OK since the patch is working for me right now.
    More importantly, this bug should be kept open so that someone else can fix it for one of the next releases ...

     
  • Teodor Petrov - 2021-05-17
    • labels: --> search
    • Type: Bug_Report --> Patch
     
  • Robert Morin - 2024-01-02

    Hi,

    I have the same issue with French characters with version svn 13394.

    I am attaching a test file to show the problem, and I will try to update the patch according to Petrov's suggestion.

    Regards,
    Robert

     
