
Bug in Regex

Olivier
2014-06-28
2014-07-29
  • Olivier

    Olivier - 2014-06-28

    I found a small bug in François-R Boyer's regex replacement:
    when a character takes 4 bytes in UTF-8 (a code point above U+FFFF), the regex search returns the wrong result.
    (See also the comments in the Notepad++ discussion of François-R Boyer's version: http://sourceforge.net/p/notepad-plus/bugs/4531/#34c0)
    For example, searching for U+20089 (a Japanese glyph) finds U+0009 (tab) instead.

    This is caused by an error in the conversion between UTF-32 and UTF-16.
    Here is a fix:
    --- PythonScript-1.0.6-orig/PythonScript/src/UtfConversion.h 2014-06-22 17:01:48 +0000
    +++ PythonScript-1.0.6/PythonScript/src/UtfConversion.h 2014-06-24 18:06:08 +0000
    @@ -89,8 +89,9 @@
    utf16Out.length(1);
    }
    else if (utf32 <= maximum_unicode_point) {
    + utf32-=0x10000u;
    utf16Out[0] = U16(utf32 >> 10) + lead_surrogate_base;
    - utf16Out[1] = U16(utf32 & 0x3F) + tail_surrogate_base;
    + utf16Out[1] = U16(utf32 & 0x3FF) + tail_surrogate_base;
    utf16Out.length(2);
    }
    else utf16Out.length(0);
    @@ -103,7 +104,7 @@
    if (utf16.length() == 1)
    return utf16[0];
    if (utf16.length() == 2)
    - return ((utf16[0]&0x3F)<<10) | (utf16[1]&0x3F);
    + return ((utf16[0]-lead_surrogate_base)<<10) + (utf16[1]-tail_surrogate_base)+0x10000u;
    return INVALID;
    }
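The arithmetic in the patch can be checked outside the C++ code. The following is a minimal Python 3 sketch, not the PythonScript source: the function names are hypothetical, and the two base constants are assumed to be the standard UTF-16 surrogate bases (0xD800 and 0xDC00), since the diff does not show their definitions. It suggests one way a tab could come out of a search for U+20089: a round trip through the unpatched conversions collapses the code point to U+0009.

```python
# Assumption: standard UTF-16 surrogate bases; the diff does not show these values.
LEAD_SURROGATE_BASE = 0xD800
TAIL_SURROGATE_BASE = 0xDC00


def to_utf16_buggy(cp):
    """Unpatched logic: no 0x10000 offset, and only a 6-bit mask on the low part."""
    lead = (cp >> 10) + LEAD_SURROGATE_BASE
    tail = (cp & 0x3F) + TAIL_SURROGATE_BASE
    return lead, tail


def from_utf16_buggy(lead, tail):
    """Unpatched logic: keeps only 6 bits from each surrogate."""
    return ((lead & 0x3F) << 10) | (tail & 0x3F)


def to_utf16_fixed(cp):
    """Patched logic: subtract 0x10000, then split into 10 high and 10 low bits."""
    cp -= 0x10000
    lead = (cp >> 10) + LEAD_SURROGATE_BASE
    tail = (cp & 0x3FF) + TAIL_SURROGATE_BASE
    return lead, tail


def from_utf16_fixed(lead, tail):
    """Patched logic: remove both bases, recombine, and add the 0x10000 offset back."""
    return ((lead - LEAD_SURROGATE_BASE) << 10) + (tail - TAIL_SURROGATE_BASE) + 0x10000


if __name__ == "__main__":
    cp = 0x20089
    print(hex(from_utf16_buggy(*to_utf16_buggy(cp))))  # round trip with the old code
    print([hex(u) for u in to_utf16_fixed(cp)])        # surrogate pair with the fix
    print(hex(from_utf16_fixed(*to_utf16_fixed(cp))))  # round trip with the fix
```

Whether the regex engine actually performs this exact round trip is an assumption; the sketch only shows that the old masks discard enough bits for U+20089 and U+0009 to become indistinguishable.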

    Olivier

     
  • Dave Brotherstone

    Hmmm... I can't reproduce this.

    editor.write(u'Hello \u4e79 World \t Tab')
    editor.rereplace(u'\u4e79', '**HERE**')
    

    I get:

    Hello **HERE** World     Tab
    

    There's a tab still between "World" and "Tab". I'm not saying the fix is wrong (it looks reasonable, at first glance), but I'd like to be able to reproduce the case it fixes first.

    Thanks for reporting, and the fix, and apologies for the delay in responding.

    Cheers,
    Dave.

     
  • Olivier

    Olivier - 2014-07-29

    The problem only occurs when the character takes 4 bytes (two 16-bit words) in UTF-16 and the search runs in 'ignore-case' mode.
    So your test with U+4E79 doesn't trigger it.
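To see why U+4E79 never reaches the surrogate-pair branch, one can count the UTF-16 code units each character needs. This is a small Python 3 sketch using only the standard codec; the helper name `utf16_units` is hypothetical.

```python
def utf16_units(cp):
    """Return how many 16-bit code units the code point needs in UTF-16."""
    # Each UTF-16 code unit is 2 bytes; 'utf-16-le' adds no byte-order mark.
    return len(chr(cp).encode("utf-16-le")) // 2


if __name__ == "__main__":
    for cp in (0x4E79, 0x20089):
        print("U+%04X needs %d UTF-16 code unit(s)" % (cp, utf16_units(cp)))
```

Only code points above U+FFFF need two units, so only those exercise the conversion code touched by the patch.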

    Searching for U+20089 in a text that contains tabs matches a tab, which it should not.

    import re
    editor.write(u'Hello  World \t Tab')
    editor.rereplace(u'\U00020089', '==HERE==', re.IGNORECASE)
    

    I get:

    Hello  World ==HERE== Tab

    Olivier

     
