Bug in Regex

A Python Scripting plugin for Notepad++

Status: Beta

Brought to you by: davegb3

Forum: Open Discussion

Creator: Olivier

Created: 2014-06-28

Updated: 2014-07-29

Olivier - 2014-06-28

I found a small bug in François-R Boyer's regex replacement:
When a character takes 4 bytes in UTF-8 (that is, it lies outside the Basic Multilingual Plane), the regex search returns the wrong result.
(See also the comments in the Notepad++ discussion of François-R Boyer's version: http://sourceforge.net/p/notepad-plus/bugs/4531/#34c0)
For example, a search for U+20089 (a Japanese glyph) matches U+0009 (tab) instead.

This is caused by an error in the conversion between UTF-32 and UTF-16.
Here is a fix:
--- PythonScript-1.0.6-orig/PythonScript/src/UtfConversion.h 2014-06-22 17:01:48 +0000
+++ PythonScript-1.0.6/PythonScript/src/UtfConversion.h 2014-06-24 18:06:08 +0000
@@ -89,8 +89,9 @@
utf16Out.length(1);
}
else if (utf32 <= maximum_unicode_point) {
+ utf32-=0x10000u;
utf16Out[0] = U16(utf32 >> 10) + lead_surrogate_base;
- utf16Out[1] = U16(utf32 & 0x3F) + tail_surrogate_base;
+ utf16Out[1] = U16(utf32 & 0x3FF) + tail_surrogate_base;
utf16Out.length(2);
}
else utf16Out.length(0);
@@ -103,7 +104,7 @@
if (utf16.length() == 1)
return utf16[0];
if (utf16.length() == 2)
- return ((utf16[0]&0x3F)<<10) | (utf16[1]&0x3F);
+ return ((utf16[0]-lead_surrogate_base)<<10) + (utf16[1]-tail_surrogate_base)+0x10000u;
return INVALID;
}
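
The patched arithmetic can be sketched in Python to check it by hand. This is only an illustration, not the plugin's code: the constant values (0xD800, 0xDC00, 0x10FFFF) and the condition for the single-unit branch are assumptions, since the diff does not show them, and the function names are hypothetical.

```python
# Assumed values for the constants named in the diff (not shown in the patch).
LEAD_SURROGATE_BASE = 0xD800
TAIL_SURROGATE_BASE = 0xDC00
MAXIMUM_UNICODE_POINT = 0x10FFFF


def utf32_to_utf16(code_point):
    """Return the UTF-16 code units for a code point, following the patched logic."""
    if code_point <= 0xFFFF:  # assumed condition for the one-unit branch
        return [code_point]
    if code_point <= MAXIMUM_UNICODE_POINT:
        code_point -= 0x10000  # the subtraction added by the patch
        return [
            (code_point >> 10) + LEAD_SURROGATE_BASE,
            (code_point & 0x3FF) + TAIL_SURROGATE_BASE,  # 10-bit mask, not 0x3F
        ]
    return []


def utf16_to_utf32(units):
    """Return the code point for one or two UTF-16 code units, or None if invalid."""
    if len(units) == 1:
        return units[0]
    if len(units) == 2:
        return (
            ((units[0] - LEAD_SURROGATE_BASE) << 10)
            + (units[1] - TAIL_SURROGATE_BASE)
            + 0x10000
        )
    return None


# Round trip for the character from the report.
pair = utf32_to_utf16(0x20089)
print([hex(u) for u in pair], hex(utf16_to_utf32(pair)))
```

With these assumptions, U+20089 becomes the pair D840 DC89 and decodes back to U+20089.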

Olivier

Dave Brotherstone - 2014-07-15

Hmmm... I can't reproduce this.

editor.write(u'Hello \u4e79 World \t Tab')
editor.rereplace(u'\u4e79', '**HERE**')

I get:

Hello **HERE** World Tab

There's a tab still between "World" and "Tab". I'm not saying the fix is wrong (it looks reasonable, at first glance), but I'd like to be able to reproduce the case it fixes first.

Thanks for reporting, and the fix, and apologies for the delay in responding.

Cheers,
Dave.

Olivier - 2014-07-29

The problem occurs only when the character takes 4 bytes (two 16-bit words) in UTF-16 and the search runs in ignore-case mode.
Your test with U+4E79 fits in a single UTF-16 word, so it doesn't trigger the problem.

Searching for U+20089 in a text that contains tabs matches the tab, which it should not.

import re
editor.write(u'Hello World \t Tab')
editor.rereplace(u'\U00020089', '==HERE==', re.IGNORECASE)

I get:

Hello World ==HERE== Tab
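
The arithmetic behind the tab match can be checked with a small Python 3 sketch. It reimplements only the pre-patch return expression from the diff above; the function name is hypothetical, and nothing here calls the plugin itself.

```python
import struct


def old_utf16_to_utf32(lead, tail):
    """Pre-patch decoding expression from the diff: only 6 bits kept per unit."""
    return ((lead & 0x3F) << 10) | (tail & 0x3F)


# Obtain the real surrogate pair for U+20089 from Python's own UTF-16 encoder.
lead, tail = struct.unpack('>2H', '\U00020089'.encode('utf-16-be'))

decoded = old_utf16_to_utf32(lead, tail)
print(hex(lead), hex(tail), hex(decoded))
```

The pair is D840 DC89. The lead unit's low 6 bits are zero and the tail unit's low 6 bits are 0x09, so the old expression yields U+0009, the tab character.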

Olivier