I found a small bug in François-R Boyer's regex replacement:

For characters that take 4 bytes in UTF-8 (code points above U+FFFF), the regex search returns the wrong result.

(See also the comments in the Notepad++ ticket where this version by François-R Boyer is discussed: http://sourceforge.net/p/notepad-plus/bugs/4531/#34c0)

In this example, searching for U+20089 (a Japanese glyph) finds U+0009 (tab) instead.

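This is caused by an error in the conversion between UTF-32 and UTF-16: the code point is not reduced by 0x10000 before being split into surrogates, and only 6 bits (mask 0x3F) are kept where 10 bits (mask 0x3FF) are needed.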
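As a rough sketch of why U+0009 comes out, the unpatched arithmetic can be replayed on its own. The base values 0xD800 and 0xDC00 are assumed here (they are the standard UTF-16 surrogate bases, but the definitions of the constants are not part of the patch), and the namespace and function names are made up for illustration:

```cpp
#include <cstdint>

namespace unpatched {

// Assumed values: the standard UTF-16 surrogate bases.
constexpr std::uint32_t lead_surrogate_base = 0xD800u;
constexpr std::uint32_t tail_surrogate_base = 0xDC00u;

// First code unit as the old code computes it: 0x10000 is never subtracted.
constexpr std::uint16_t encode_lead(std::uint32_t utf32) {
    return static_cast<std::uint16_t>((utf32 >> 10) + lead_surrogate_base);
}

// Second code unit as the old code computes it: only the low 6 bits survive.
constexpr std::uint16_t encode_tail(std::uint32_t utf32) {
    return static_cast<std::uint16_t>((utf32 & 0x3Fu) + tail_surrogate_base);
}

// Old reverse conversion: 6-bit masks on both units, no 0x10000 offset.
constexpr std::uint32_t decode(std::uint16_t lead, std::uint16_t tail) {
    return ((lead & 0x3Fu) << 10) | (tail & 0x3Fu);
}

// UTF-32 -> UTF-16 -> UTF-32 with the old code.
constexpr std::uint32_t round_trip(std::uint32_t utf32) {
    return decode(encode_lead(utf32), encode_tail(utf32));
}

}  // namespace unpatched
```

With these assumptions, U+20089 is encoded as 0xD880 0xDC09 instead of the correct 0xD840 0xDC89, and `unpatched::round_trip(0x20089u)` gives 0x9, which matches the tab that the search finds.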

Here is a fix:

--- PythonScript-1.0.6-orig/PythonScript/src/UtfConversion.h 2014-06-22 17:01:48 +0000
+++ PythonScript-1.0.6/PythonScript/src/UtfConversion.h 2014-06-24 18:06:08 +0000
@@ -89,8 +89,9 @@
 utf16Out.length(1);
 }
 else if (utf32 <= maximum_unicode_point) {
+ utf32-=0x10000u;
 utf16Out[0] = U16(utf32 >> 10) + lead_surrogate_base;
- utf16Out[1] = U16(utf32 & 0x3F) + tail_surrogate_base;
+ utf16Out[1] = U16(utf32 & 0x3FF) + tail_surrogate_base;
 utf16Out.length(2);
 }
 else utf16Out.length(0);
@@ -103,7 +104,7 @@
 if (utf16.length() == 1)
 return utf16[0];
 if (utf16.length() == 2)
- return ((utf16[0]&0x3F)<<10) | (utf16[1]&0x3F);
+ return ((utf16[0]-lead_surrogate_base)<<10) + (utf16[1]-tail_surrogate_base)+0x10000u;
 return INVALID;
 }
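
The patched arithmetic can be checked outside PythonScript with a small standalone sketch. As before, the surrogate base values are assumed to be the standard 0xD800 and 0xDC00, and the namespace and function names are hypothetical, not taken from UtfConversion.h:

```cpp
#include <cstdint>

namespace patched {

// Assumed values: the standard UTF-16 surrogate bases.
constexpr std::uint32_t lead_surrogate_base = 0xD800u;
constexpr std::uint32_t tail_surrogate_base = 0xDC00u;

// Lead surrogate: top 10 bits of (code point - 0x10000).
constexpr std::uint16_t encode_lead(std::uint32_t utf32) {
    return static_cast<std::uint16_t>(
        ((utf32 - 0x10000u) >> 10) + lead_surrogate_base);
}

// Tail surrogate: low 10 bits of (code point - 0x10000).
constexpr std::uint16_t encode_tail(std::uint32_t utf32) {
    return static_cast<std::uint16_t>(
        ((utf32 - 0x10000u) & 0x3FFu) + tail_surrogate_base);
}

// Reverse conversion, same expression as the patched return statement.
// Only meaningful for a valid lead/tail surrogate pair.
constexpr std::uint32_t decode(std::uint16_t lead, std::uint16_t tail) {
    return ((lead - lead_surrogate_base) << 10)
         + (tail - tail_surrogate_base) + 0x10000u;
}

}  // namespace patched
```

Under these assumptions U+20089 becomes 0xD840 0xDC89 and decodes back to 0x20089; the boundary values U+10000 and U+10FFFF also survive the round trip.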

Olivier