I found a small bug in François-R Boyer's regex replacement:
For characters that take 4 bytes in UTF-8 (code points above U+FFFF), the regex search returns the wrong match.
(See also the comments in the Notepad++ discussion of François-R Boyer's version: http://sourceforge.net/p/notepad-plus/bugs/4531/#34c0)
In the example there, a search for U+20089 (a Japanese glyph) finds U+0009 (tab) instead.
This is caused by an error in the conversion between UTF-32 and UTF-16.
Here is a fix:
--- PythonScript-1.0.6-orig/PythonScript/src/UtfConversion.h 2014-06-22 17:01:48 +0000
+++ PythonScript-1.0.6/PythonScript/src/UtfConversion.h 2014-06-24 18:06:08 +0000
@@ -89,8 +89,9 @@
utf16Out.length(1);
}
else if (utf32 <= maximum_unicode_point) {
+ utf32-=0x10000u;
utf16Out[0] = U16(utf32 >> 10) + lead_surrogate_base;
- utf16Out[1] = U16(utf32 & 0x3F) + tail_surrogate_base;
+ utf16Out[1] = U16(utf32 & 0x3FF) + tail_surrogate_base;
utf16Out.length(2);
}
else utf16Out.length(0);
@@ -103,7 +104,7 @@
if (utf16.length() == 1)
return utf16[0];
if (utf16.length() == 2)
- return ((utf16[0]&0x3F)<<10) | (utf16[1]&0x3F);
+ return ((utf16[0]-lead_surrogate_base)<<10) + (utf16[1]-tail_surrogate_base)+0x10000u;
return INVALID;
}
Olivier
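To make the arithmetic in the patch easier to follow, here is a minimal standalone sketch in plain Python 3 (not PythonScript, and not the actual C++ code). The function names are hypothetical; the constants 0xD800 and 0xDC00 are the standard UTF-16 surrogate bases, which I assume is what lead_surrogate_base and tail_surrogate_base hold. It mirrors the old and new expressions from the diff and shows one plausible way U+20089 can collapse to U+0009.

```python
# Standard UTF-16 surrogate bases (assumed to match the C++ constants).
LEAD_SURROGATE_BASE = 0xD800
TAIL_SURROGATE_BASE = 0xDC00


def encode_buggy(cp):
    """Pre-patch encoding: no 0x10000 offset, tail keeps only 6 bits."""
    lead = (cp >> 10) + LEAD_SURROGATE_BASE
    tail = (cp & 0x3F) + TAIL_SURROGATE_BASE
    return lead, tail


def decode_buggy(lead, tail):
    """Pre-patch decoding: masks both units down to 6 bits."""
    return ((lead & 0x3F) << 10) | (tail & 0x3F)


def encode_fixed(cp):
    """Patched encoding: subtract 0x10000, then split into 10 + 10 bits."""
    cp -= 0x10000
    lead = (cp >> 10) + LEAD_SURROGATE_BASE
    tail = (cp & 0x3FF) + TAIL_SURROGATE_BASE
    return lead, tail


def decode_fixed(lead, tail):
    """Patched decoding: inverse of encode_fixed."""
    return (((lead - LEAD_SURROGATE_BASE) << 10)
            + (tail - TAIL_SURROGATE_BASE) + 0x10000)


if __name__ == "__main__":
    cp = 0x20089
    print([hex(u) for u in encode_fixed(cp)])          # correct surrogate pair
    print(hex(decode_buggy(*encode_fixed(cp))))        # old decoder on that pair
    print(hex(decode_fixed(*encode_fixed(cp))))        # round trip with the fix
```

With the old decoder, both masks throw away the high bits, so the pair for U+20089 decodes to 0x9, which is the tab character. With the patched pair of functions the value survives the round trip.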
Hmmm... I can't reproduce this.
editor.write(u'Hello \u4e79 World \t Tab')
editor.rereplace(u'\u4e79', '**HERE**')
I get:
Hello **HERE** World Tab
There's still a tab between "World" and "Tab". I'm not saying the fix is wrong (it looks reasonable, at first glance), but I'd like to be able to reproduce the case it fixes first.
Thanks for reporting, and the fix, and apologies for the delay in responding.
Cheers,
Dave.
The problem occurs only when the character takes 4 bytes (two 16-bit code units) in UTF-16 and the search runs in ignore-case mode.
So your test with U+4E79 doesn't trigger the problem.
Searching for U+20089 in a text that contains tabs will match a tab, which it should not.
import re
editor.write(u'Hello World \t Tab')
editor.rereplace(u'\U00020089', '==HERE==', re.IGNORECASE)
I get:
Hello World ==HERE== Tab
Olivier
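For anyone who wants to check which characters fall into the affected range, here is a small sketch in plain Python 3 (outside Notepad++, so no editor object is involved). The helper name utf16_units is made up for this example; it only relies on the built-in 'utf-16-be' codec.

```python
def utf16_units(ch):
    """Return the UTF-16 code units of a one-character string as integers."""
    data = ch.encode('utf-16-be')
    return [int.from_bytes(data[i:i + 2], 'big')
            for i in range(0, len(data), 2)]


if __name__ == "__main__":
    for ch in (u'\t', u'\u4e79', u'\U00020089'):
        units = utf16_units(ch)
        print('U+%04X -> %s' % (ord(ch), ' '.join('%04X' % u for u in units)))
```

U+4E79 fits in a single 16-bit unit, so it never goes through the surrogate-pair branch of the conversion. U+20089 needs two units, which is the case the patch changes.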