Support for Unicode extended characters in Np++ 6.5.5 Unicode

  • Yann

    Hi all,

I've been using Np++ for a while, but recently I have noticed some oddities in this neat text editor.

    I am using Windows 7 64 bit Ultimate, locale setting is Chinese, with all patches installed.

When I try to open a text file containing extended Unicode characters with code points larger than U+FFFF, Np++ displays the text correctly ONLY when the text is UTF-8 encoded.
    It doesn't show these extended characters correctly when the encoding is UTF-16BE or UTF-16LE.
    I haven't tested the UTF-8 (w/o BOM) case, since I rarely work with UTF-8 encoded texts that lack a BOM.

Besides, if you open a text file containing extended characters properly encoded in UTF-8, convert it in Np++ to UTF-16 BE or LE, save it, close and restart the program, and reload the file, the extended characters become a total mess.

    In all these scenarios the file encodings are correctly detected, though.

In addition, I have found that regex searching in Np++ with the \x{} escape is limited to Unicode BMP characters (code points <= U+FFFF). If you try to search for a character with a larger code point, for example \x{20191}, it won't succeed.

My question is: is this a software bug, a feature not yet implemented, or a Windows-only issue?
    I am looking for work-arounds, quick fixes, or alternatives to solve these issues.

    Thanks very much.

    If you like, please see the attached picture to have a better understanding of my problem.
    Problem ScreenShot Collage
    Download the sample text files:


    Hello Yann and All

    Some general points about UNICODE ( to begin with ! ) :

UNICODE is a standard that organizes the management of all the code-points of the characters of the Universal Character Set ( UCS )

The UCS spans from 0x0 to 0x10FFFF ( 1114112 code-points ), which are split into 17 planes of 65536 values each :

• The first plane ( Plane 0 ), named the Basic Multilingual Plane ( BMP ), contains "almost" all the characters of any modern language. But, in practice, it can only code 57054 characters, due to some reserved areas :

The SURROGATE mechanism ( pairs of 16-bits code units used to represent code-points above 0xFFFF ), between the code-points 0xD800 and 0xDFFF

The Private Use Area, between the code-points 0xE000 and 0xF8FF

Some NON-characters ( code-points that will never be assigned to valid Unicode characters ), between the code-points 0xFDD0 and 0xFDEF, plus the two code-points 0xFFFE and 0xFFFF

• The second plane ( Plane 1 ), named Supplementary Multilingual Plane ( SMP ), contains some "extra" linguistic characters. It codes 65534 characters ( 65536 minus the two NON-characters 0x1FFFE and 0x1FFFF )

• The third plane ( Plane 2 ), named Supplementary Ideographic Plane ( SIP ), contains some "extra" ideographic CJK characters. It also codes 65534 characters, like Plane 1.

• The 11 following planes, from 3 to 13, are still NOT used by the UNICODE consortium.

• The plane 14, named Supplementary Special-purpose Plane ( SSP ), contains some non-graphic characters.

• Finally, planes 15 and 16, named Supplementary Private Use Area A and B ( SPUA-A and SPUA-B ), contain private-use characters.
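The 57054 figure above follows directly from the reserved ranges just listed; a few lines of Python, for instance, reproduce the count :

```python
# Plane 0 ( BMP ) spans 0x0000 - 0xFFFF : 65536 code-points in total
bmp = 0x10000

surrogates = 0xDFFF - 0xD800 + 1            # 2048 code-points, SURROGATE area
private_use = 0xF8FF - 0xE000 + 1           # 6400 code-points, Private Use Area
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2   # 32 noncharacters + 0xFFFE and 0xFFFF

print(bmp - surrogates - private_use - noncharacters)  # 57054
```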

The different UNICODE encodings, that allow all these UNICODE characters to be stored in files, are :

    • The UCS-2 Big Endian encoding, which codes every character, of code-point
      < 0xFFFE, in TWO bytes, with the Most Significant Byte written first.

• The UTF-16 Big Endian encoding, which codes every character, of code-point between 0x0 and 0x10FFFF, in TWO bytes ( or FOUR bytes, as a SURROGATE pair, above 0xFFFF ), with the Most Significant Byte of each 16-bits unit written first.

• The UCS-2 Little Endian encoding, which codes every character, of code-point
      < 0xFFFE, in TWO bytes, with the Least Significant Byte written first.

• The UTF-16 Little Endian encoding, which codes every character, of code-point between 0x0 and 0x10FFFF, in TWO bytes ( or FOUR bytes, as a SURROGATE pair, above 0xFFFF ), with the Least Significant Byte of each 16-bits unit written first.

• The UTF-8 encoding. It's a very clever encoding, which codes every character, of code-point up to 0x10FFFF, in ONE, TWO, THREE or FOUR bytes, depending upon the value of the code-point.

    • The UCS-4 encoding, which codes every character, of code-point up to 0x7FFFFFFF, in FOUR bytes.

• The UTF-32 encoding, a subset of the UCS-4 encoding, which codes every character, of code-point up to 0x10FFFF, in FOUR bytes


    Notice that the UCS-4, UTF-32 and UTF-16 encodings are NOT yet supported by Notepad++.

Let's consider a brief example, with the EURO character, of absolute code-point 8364 ( Hex value = 0x20AC ) :

    • With the UCS-2 Big Endian or the UTF-16 Big Endian encodings, it's coded with the TWO bytes 0x20 , 0xAC

    • With the UCS-2 Little Endian or the UTF-16 Little Endian encodings, it's coded with the TWO bytes 0xAC , 0x20

    • With the UTF-8 encoding, it's coded with the THREE bytes 0xE2, 0x82 , 0xAC

• With the UCS-4 or UTF-32 encodings ( Big Endian ), it would be coded with the FOUR bytes 0x00 , 0x00 , 0x20 , 0xAC
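These byte sequences are easy to verify in any Unicode-aware language. In Python, for instance, str.encode gives exactly the bytes listed above ( the -be / -le suffixes select the byte order explicitly, without writing a BOM ) :

```python
euro = '\u20ac'   # the EURO character, code-point 0x20AC

print(euro.encode('utf-16-be').hex(' '))   # 20 ac
print(euro.encode('utf-16-le').hex(' '))   # ac 20
print(euro.encode('utf-8').hex(' '))       # e2 82 ac
print(euro.encode('utf-32-be').hex(' '))   # 00 00 20 ac
```

Since 0x20AC lies inside the BMP, the UTF-16 bytes here coincide with the UCS-2 ones.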

And the highest possible VALID UNICODE code-point, 1114109 ( 0x10FFFD ), being a code-point above 0xFFFF, can ONLY be coded with FOUR bytes, in the five encodings below :

    • 0xF4 , 0x8F , 0xBF , 0xBD, with the UTF-8 encoding

    • 0xDB , 0xFF , 0xDF , 0xFD, with the UTF-16 Big Endian encoding, using SURROGATE pairs

• 0xFF , 0xDB , 0xFD , 0xDF, with the UTF-16 Little Endian encoding, using SURROGATE pairs ( each 16-bits unit is byte-swapped, but the high surrogate still comes first )

    • 0x00 , 0x10 , 0xFF, 0xFD, with the UCS-4 or the UTF-32 encodings
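Python can confirm these FOUR-byte sequences, too, including the surrogate-pair computation itself :

```python
cp = 0x10FFFD              # highest VALID Unicode code-point
ch = chr(cp)

print(ch.encode('utf-8').hex(' '))       # f4 8f bf bd
print(ch.encode('utf-16-be').hex(' '))   # db ff df fd
print(ch.encode('utf-16-le').hex(' '))   # ff db fd df
print(ch.encode('utf-32-be').hex(' '))   # 00 10 ff fd

# The SURROGATE pair, computed by hand :
v = cp - 0x10000                 # 20-bits offset = 0xFFFFD
high = 0xD800 + (v >> 10)        # 0xDBFF, high ( lead ) surrogate
low = 0xDC00 + (v & 0x3FF)       # 0xDFFD, low ( trail ) surrogate
print(hex(high), hex(low))       # 0xdbff 0xdffd
```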

    See the Surrogate mechanism at the address below :

So, Yann, unfortunately, it's NOT a software bug, nor a future feature not yet implemented, nor a Windows issue !

It's just that, with the two encodings UCS-2BE and UCS-2LE, it will ALWAYS be impossible to code Unicode characters with code-points > 0xFFFF :-)

So, with Notepad++, ONLY the UTF-8 encoding can encode ALL the characters of the Universal Character Set ! Therefore, when you re-encode a UTF-8 text, which contains characters with code-points > 0xFFFF, into the UCS-2BE or UCS-2LE encodings, it's quite normal that the resulting text contains odd characters !!!

About regular expression searches, you're perfectly right : the form \x{nnnn} doesn't allow you to search for characters outside the BMP ( with code-points over 0xFFFF ) :-(

But, as any character with code-point > 0xFFFF is coded, in UTF-8, with exactly FOUR bytes, you could use this ( ugly ) method as a work-around :

• Find the FOUR UTF-8 bytes of your character, with the Swiss-Army-knife tool at the address below :

• Re-encode, in Notepad++, your UTF-8 text as an ANSI text, so that the file is seen as individual bytes, from 0x00 to 0xFF.

• Finally, search, in regular expression mode, for the expression \x..\x..\x..\x.. ( the 4 consecutive bytes which represent the UTF-8 encoding of the searched character ). Remember that, with an ANSI text, the regular expression forms \x{..} and \x{....} are INVALID. So, in an ANSI file, the form \x.. is the only VALID one !

For example, your character, with hex code-point 0x20191, is a unified HAN ideogram, part of the CJK Unified Ideographs Extension B area below :

The UTF-8 tool says that this character ( 0x20191 ) is encoded in UTF-8 with the 4 bytes 0xF0, 0xA0, 0x86 and 0x91.

So, after re-encoding in ANSI, it could be searched with the simple regular expression \xF0\xA0\x86\x91
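If you prefer not to depend on the online tool, the same bytes and the same regular expression can be derived with a short Python sketch ( the function name utf8_byte_regex is just an illustrative choice ) :

```python
def utf8_byte_regex(code_point):
    """Build the byte-wise regular expression ( \\xNN\\xNN\\xNN\\xNN ) that
    matches the UTF-8 encoding of the given code-point, for use on a
    file re-opened as ANSI."""
    return ''.join('\\x%02X' % byte for byte in chr(code_point).encode('utf-8'))

print(utf8_byte_regex(0x20191))   # \xF0\xA0\x86\x91
```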


To finish, a ( non-exhaustive ! ) list of some addresses about that topic :

    • General information about UNICODE, Unicode encodings and Unicode planes :

    • General information about Code Pages :

• Information about the Byte Order Mark ( BOM ), Endianness and the Unicode block Specials :

• General information about UNICODE v6.3, the different Character Code Charts in PDF files, and the alphabetic list of all non-ideographic Unicode characters :

• Additional information about the Unicode Collation Algorithm and Unicode Regular Expressions, unfortunately NOT supported by N++ :

    • A very practical UTF-tool to determine the UTF-8 encoding bytes from any UNICODE code-point or the opposite :



    Last edit: THEVENOT Guy 2014-07-11
    • Yann

Thanks, guy038, for your informative help!

If this could be added to the user manual that comes with the software package, there wouldn't be such "issues".

Since it is definitely clear that Np++ supports only the ANSI / UCS-2 ( LE/BE ) / UTF-8 ( with and without BOM ) encodings, I feel confident saying that the French localization of Np++ mis-translates the "Encoding" menu : it uses "UTF-16" where the original English version says "UCS-2". However, claiming UTF-16 support means supporting UTF-16 surrogates, from what I have learnt. Np++ is apparently NOT supporting surrogates, as the screenshot of my problem suggests : UTF-16 surrogates are treated like any other plain UCS-2 printable characters.

So I think you would agree with me that the French localization of the "Encoding" menu in Np++ is misleading.


    Hi Yann,

Many thanks for your feedback ! The best way to fully understand a topic is ALWAYS to try explaining it to someone else !!

While reading your post, I just realized that I was completely wrong regarding the definition of both encodings, UCS-2 and UTF-16 :-( They are, definitely, NOT identical encodings !! Actually, as it's well explained at the Wiki address :

• The UTF-16 encoding can code every Unicode code-point between 0x0 and 0x10FFFF. Characters with code-points over 0xFFFF are encoded in FOUR bytes, by using the surrogate pairs mechanism. To sum up, the UTF-16 encoding fully encodes ALL Unicode characters !

• The older UCS-2BE and UCS-2LE encodings, used in Notepad++, are a subset of UTF-16, which also encode Unicode characters in two bytes, but they DON'T use the surrogate mechanism. Therefore, they CAN'T encode Unicode characters with code-points over 0xFFFF. Only Unicode characters with code-points between 0x0 and 0xFFFF may be encoded with the UCS-2 encodings !
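The difference shows up clearly in code : a real UTF-16 decoder recombines a surrogate pair into ONE character, whereas a UCS-2 style reader simply sees two independent 16-bits units, which is essentially what Notepad++ displays. A Python sketch ( using struct here, as Python has no built-in UCS-2 codec ) :

```python
import struct

data = b'\xdb\xff\xdf\xfd'   # U+10FFFD encoded in UTF-16 Big Endian

# Proper UTF-16 : the surrogate pair becomes ONE character
print(ascii(data.decode('utf-16-be')))   # '\U0010fffd'

# UCS-2 view : two independent 16-bits units, i.e. two bogus "characters"
units = struct.unpack('>2H', data)
print([hex(u) for u in units])           # ['0xdbff', '0xdffd']
```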

So, I've modified my previous post to correct this error ! Also, the highest code-point that can be coded with the UCS-4 encoding is 0x7FFFFFFF, as it's the upper limit of the UCS Transformation Format ( UTF ) !

Moreover, I added some encoding examples and a link to the surrogate mechanism. Please, Yann, let me know if it still contains any errors :-)

Regarding the translation, you're perfectly right ! Generally, I write posts for non-French people, so I prefer to keep the default English localization of my Notepad++, as I sometimes attach a screenshot to my posts !

Indeed, I'm French, but I had NEVER noticed that the expressions UCS-2 Big Endian and UCS-2 Little Endian are MIS-translated into the French expressions UTF-16BE and UTF-16LE !!

    But, you just have to edit the NativeLang.xml file, located in the Notepad++ installation folder, to change these wrong expressions and, then, re-start Notepad++ !



    Last edit: THEVENOT Guy 2014-05-05