Support for Unicode extended characters in Np++ 6.5.5 Unicode

Yann
2014-04-29
2014-05-05
  • Yann
    2014-04-29

    Hi all,

    I've been using Np++ for quite a while, but recently I have observed some odd behavior with this neat text editor.

    I am using Windows 7 64 bit Ultimate, locale setting is Chinese, with all patches installed.

    When I try to open a text file with extended Unicode characters having code points larger than U+FFFF, Np++ displays the text normally ONLY when it is UTF-8 encoded.
    It doesn't show these extended characters correctly when the encoding is UTF-16BE or UTF-16LE.
    I haven't tested the UTF-8 (w/o BOM) case since I rarely manipulate UTF-8 encoded texts without a BOM.

    Besides, if you open a text file containing extended characters properly encoded in UTF-8, convert it in Np++ to UTF-16 BE or LE, save it, close Np++, restart the program, and reload the file, the extended characters become a total mess.

    In all these scenarios the file encodings are correctly detected, though.

    In addition, I have found regex searching in Np++ using the \x{} escape to be limited to Unicode BMP characters (code points <= U+FFFF). If you try to search for characters with larger code points, for example \x{20191}, it won't succeed.

    My question is: is this a software bug, a feature not implemented yet, or a Windows-only issue?
    I am looking for workarounds, quick fixes, or alternatives to solve these issues.

    Thanks very much.

    If you like, please see the attached picture to have a better understanding of my problem.
    Problem ScreenShot Collage
    Download the sample text files:
    http://pan.baidu.com/s/1pJJgUfT

     
  • THEVENOT Guy
    2014-05-01

    Hello Yann and All

    Some general points about UNICODE ( to begin with ! ) :

    UNICODE is a standard that organizes the management of all the code-points of the characters of the Universal Character Set ( UCS ).

    The UCS lies between 0x0 and 0x10FFFF ( 1114112 code-points ), which are split into 17 planes of 65536 values :

    • The first plane ( Plane 0 ), named Basic Multilingual Plane ( BMP ), contains "almost" all the characters of modern languages. In practice, though, it can code only 57054 characters, because of some reserved areas :

    The SURROGATE mechanism ( a 32-bits emulation ), between the code-points 0xD800 and 0xDFFF

    The Private Area, between the code-points 0xE000 and 0xF8FF

    Some NON-character code-points, between 0xFDD0 and 0xFDEF, plus the two code-points 0xFFFE and 0xFFFF

    • The second plane ( Plane 1 ), named Supplementary Multilingual Plane ( SMP ), contains some "extra" linguistic characters. It codes 65534 characters ( 65536 minus the two NON-character code-points 0x1FFFE and 0x1FFFF )

    • The third plane ( Plane 2 ), named Supplementary Ideographic Plane ( SIP ), contains some "extra" ideographic CJK characters. It also codes 65534 characters, as Plane 1.

    • The 11 following planes, from 3 to 13, are still NOT used by the Unicode Consortium ( as of Unicode 6.3 ).

    • The plane 14, named Supplementary Special-purpose Plane ( SSP ), contains some non-graphic characters.

    • Finally, planes 15 and 16, named Supplementary Private Use Area A and B ( SPUA ), contain private characters.
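
    As a quick sanity check of the figures above, here is a minimal Python sketch ( Python is used only for illustration, and the names plane_of, BMP_RESERVED and usable_bmp_count are hypothetical helpers, NOT part of Notepad++ ). It derives the plane of a code-point and recomputes the 57054 usable BMP positions from the reserved ranges listed above :

```python
# Each plane holds 0x10000 (65536) values, so the plane index is the
# code point shifted right by 16 bits.
def plane_of(code_point: int) -> int:
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16

# Reserved BMP ranges as listed in the post (inclusive bounds).
BMP_RESERVED = [
    (0xD800, 0xDFFF),  # surrogates
    (0xE000, 0xF8FF),  # private use area
    (0xFDD0, 0xFDEF),  # noncharacters
    (0xFFFE, 0xFFFF),  # noncharacters
]

def usable_bmp_count() -> int:
    reserved = sum(high - low + 1 for low, high in BMP_RESERVED)
    return 0x10000 - reserved

print(plane_of(0x20AC))    # 0  (EURO SIGN, in the BMP)
print(plane_of(0x20191))   # 2  (Supplementary Ideographic Plane)
print(plane_of(0x10FFFD))  # 16 (Supplementary Private Use Area B)
print(usable_bmp_count())  # 57054
```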


    The different UNICODE encodings, which allow all these UNICODE characters to be stored in files, are :

    • The UCS-2 Big Endian encoding, which codes every character, of code-point < 0xFFFE, in TWO bytes, with the Most Significant Byte written first.

    • The UTF-16 Big Endian encoding, which codes every character, of code-point between 0x0 and 0x10FFFF, in TWO bytes ( code-points up to 0xFFFF ) or FOUR bytes ( a SURROGATE pair, for code-points above 0xFFFF ), with the Most Significant Byte of each 16-bit unit written first.

    • The UCS-2 Little Endian encoding, which codes every character, of code-point < 0xFFFE, in TWO bytes, with the Least Significant Byte written first.

    • The UTF-16 Little Endian encoding, which codes every character, of code-point between 0x0 and 0x10FFFF, in TWO bytes ( code-points up to 0xFFFF ) or FOUR bytes ( a SURROGATE pair, for code-points above 0xFFFF ), with the Least Significant Byte of each 16-bit unit written first.

    • The UTF-8 encoding. It's a very clever encoding, which codes every character, of code-point < 0x10FFFE, in ONE, TWO, THREE or FOUR bytes, depending upon the value of the code-point.

    • The UCS-4 encoding, which codes every character, of code-point up to 0x7FFFFFFF, in FOUR bytes.

    • The UTF-32 encoding, subset of the UCS-4 encoding, which codes every character, of code-point < 0x10FFFE, in FOUR bytes

    VERY IMPORTANT :

    Notice that the UCS-4, UTF-32 and UTF-16 encodings are NOT yet supported by Notepad++.


    Let's consider a brief example, with the EURO character, of absolute code-point 8364 ( Hex value = 0x20AC ) :

    • With the UCS-2 Big Endian or the UTF-16 Big Endian encodings, it's coded with the TWO bytes 0x20 , 0xAC

    • With the UCS-2 Little Endian or the UTF-16 Little Endian encodings, it's coded with the TWO bytes 0xAC , 0x20

    • With the UTF-8 encoding, it's coded with the THREE bytes 0xE2, 0x82 , 0xAC

    • With the UCS-4 or UTF-32 encodings ( Big Endian ), it should be coded with the FOUR bytes 0x00 , 0x00 , 0x20 , 0xAC
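
    These byte sequences can be reproduced with a short Python sketch ( for illustration only ). Note the assumption : Python has no separate UCS-2 codec, so the UTF-16 codecs stand in for UCS-2 here, which gives identical bytes for any BMP character such as the EURO sign :

```python
euro = "\u20ac"  # EURO SIGN, code point 8364 (0x20AC)

# The explicit-endian codecs emit no BOM, so the output is only the
# bytes of the character itself.
for codec in ("utf-16-be", "utf-16-le", "utf-8", "utf-32-be"):
    print(codec, euro.encode(codec).hex(" ").upper())

# utf-16-be 20 AC
# utf-16-le AC 20
# utf-8 E2 82 AC
# utf-32-be 00 00 20 AC
```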

    And the highest possible VALID UNICODE code-point 1114109 ( 0x10FFFD ), as it's a code-point above 0xFFFF, can ONLY be coded, in the 5 encodings below, with FOUR bytes :

    • 0xF4 , 0x8F , 0xBF , 0xBD, with the UTF-8 encoding

    • 0xDB , 0xFF , 0xDF , 0xFD, with the UTF-16 Big Endian encoding, using SURROGATE pairs

    • 0xFF , 0xDB , 0xFD , 0xDF, with the UTF-16 Little Endian encoding, using SURROGATE pairs ( the bytes are swapped inside each 16-bit unit, but the high surrogate still comes first )

    • 0x00 , 0x10 , 0xFF, 0xFD, with the UCS-4 or the UTF-32 encodings

    See the Surrogate mechanism at the address below :

    http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF
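
    The surrogate arithmetic itself is short. Here is a minimal Python sketch of it ( the function name surrogate_pair is a hypothetical helper, chosen for illustration ), cross-checked against Python's built-in UTF-16 codecs :

```python
def surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x10FFFD)
print(hex(high), hex(low))               # 0xdbff 0xdffd

# Byte order applies inside each 16-bit unit; the high surrogate is
# always written before the low surrogate.
be = high.to_bytes(2, "big") + low.to_bytes(2, "big")
le = high.to_bytes(2, "little") + low.to_bytes(2, "little")
print(be.hex(" ").upper())               # DB FF DF FD
print(le.hex(" ").upper())               # FF DB FD DF

# Cross-check against the standard codecs.
assert be == chr(0x10FFFD).encode("utf-16-be")
assert le == chr(0x10FFFD).encode("utf-16-le")
```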


    So, Yann, unfortunately, it's NOT a software bug, NOT a future feature that isn't implemented yet, and NOT a Windows issue !

    It's just that, with the two encodings UCS-2BE and UCS-2LE, it will ALWAYS be impossible to code Unicode characters, with code-point > 0xFFFF :-)

    So, with Notepad++, ONLY the UTF-8 encoding can encode ALL the characters of the Universal Character Set ! Therefore, when you re-encode a UTF-8 text, which contains characters with code-point > 0xFFFF, in UCS-2BE or UCS-2LE encodings, it's quite normal that the resulting text contains odd characters !!!
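
    Before converting a UTF-8 file to UCS-2, one could check, outside the editor, whether it contains any character above 0xFFFF. A minimal Python sketch, assuming the text has already been read and decoded as UTF-8 ( the helper name supplementary_chars is hypothetical ) :

```python
def supplementary_chars(text):
    """Return (index, 'U+XXXX') for every character outside the BMP."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if ord(char) > 0xFFFF
    ]

sample = "a\u20ac" + chr(0x20191)   # 'a', EURO SIGN, then a Plane 2 ideograph
print(supplementary_chars(sample))             # [(2, 'U+20191')]
print(supplementary_chars("plain BMP text"))   # []
```

    An empty list means every character fits in UCS-2. Otherwise, keeping the file in UTF-8 is the safer choice.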


    About regular expression searches, you're perfectly right : the form \x{nnnn} doesn't allow you to search for characters outside the BMP ( with code-point over 0xFFFF ) :-(

    But, as any character with code-point > 0xFFFF is coded, in UTF-8, with exactly FOUR bytes, you could use this ( ugly ) method, as a work-around :

    • Try to find the FOUR UTF-8 bytes of your character, with this Swiss-knife tool, at the address below :

    http://www.cogsci.ed.ac.uk/%7erichard/utf-8.html

    • Re-encode, in Notepad++, your UTF-8 text as an ANSI text, with individual bytes encoding, from 0x00 to 0xff.

    • Finally, search, in regular expression mode, for the expression \x..\x..\x..\x.. ( 4 consecutive bytes which represent the UTF-8 encoding of the searched character ). Remember that, with an ANSI text, the regular expression forms \x{..} and \x{....} are INVALID. So, in an ANSI file, the form \x.. is the only VALID form !

    For example, your character, with hex code-point 0x20191, is a unified HAN ideograph, part of the CJK Unified Ideographs Extension B area below :

    http://www.unicode.org/charts/PDF/U20000.pdf

    The UTF-8 tool says that this character ( 0x20191 ) is encoded in UTF-8 with the 4 bytes 0xF0, 0xA0, 0x86 and 0x91.

    So, it could be searched, after re-encoding in ANSI, with the simple regular expression below :

    \xF0\xA0\x86\x91
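
    If the online tool is not at hand, the same pattern can be derived with a few lines of Python. This is only a sketch : the helper name ansi_regex_for is hypothetical, and the output is simply the \x.. form described above, to be pasted into the search field :

```python
def ansi_regex_for(code_point):
    """Build the \\xNN pattern that matches a character's UTF-8 bytes."""
    utf8_bytes = chr(code_point).encode("utf-8")
    return "".join(f"\\x{byte:02X}" for byte in utf8_bytes)

print(ansi_regex_for(0x20191))   # \xF0\xA0\x86\x91
```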


    To finish, a ( non exhaustive ! ) list of some addresses, about that topic :

    • General information about UNICODE, Unicode encodings and Unicode planes :

    http://en.wikipedia.org/wiki/Unicode
    http://en.wikipedia.org/wiki/UTF-16
    http://en.wikipedia.org/wiki/UTF-8
    http://en.wikipedia.org/wiki/UTF-32
    http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

    • General information about Code Pages :

    http://en.wikipedia.org/wiki/Code_page
    http://www.i18nguy.com/unicode/codepages.html
    http://en.wikibooks.org/wiki/Unicode/Character_reference
    http://www.lingua-systems.com/knowledge/unicode-mappings
    http://www.fileformat.info/info/charset/index.htm

    • Information about Byte Order Mark ( BOM ), Endianness and the Unicode block Specials

    http://en.wikipedia.org/wiki/Byte_order_mark
    http://en.wikipedia.org/wiki/Endianness
    http://en.wikipedia.org/wiki/Unicode_Specials

    • General information about UNICODE v6.3, the different Character Code Charts in PDF files, and the alphabetic list of all non ideographic Unicode characters :

    http://www.unicode.org/versions/Unicode6.3.0/
    http://www.unicode.org/charts/
    http://www.unicode.org/charts/charindex.html

    • Additional information about the Unicode Collation algorithm and the Unicode Regular expressions, unfortunately NOT supported by N++ :

    http://www.unicode.org/reports/tr10/tr10-28.html
    http://www.unicode.org/reports/tr18/

    • A very practical UTF-tool to determine the UTF-8 encoding bytes from any UNICODE code-point or the opposite :

    http://www.cogsci.ed.ac.uk/%7erichard/utf-8.html

    Cheers,

    guy038

     
    Last edit: THEVENOT Guy 2014-07-11
    • Yann
      2014-05-05

      Thanks, guy038 for your informative help!

      If this could be added to the user manual that comes along with the software package, then there wouldn't be such "issues".

      Since it is now clear that Np++ supports only the ANSI / UCS-2 (LE/BE) / UTF-8 (with and without BOM) encodings, I feel confident in saying that the French localization of Np++ mis-translated the "Encoding" menu. It uses "UTF-16" instead of the "UCS-2" of the original English version. However, claiming support for UTF-16 implies support for UTF-16 surrogates, according to what I have learnt from Unicode.org (http://www.unicode.org/faq/utf_bom.html#utf16-11). Np++ apparently does NOT support surrogates, as the screenshot in my problem suggests. UTF-16 surrogates are treated like any other plain UCS-2 printable characters.

      So I think you would agree with me that the French localization in Np++ of the "Encoding" menu is misleading.

       
  • THEVENOT Guy
    2014-05-05

    Hi Yann,

    Many thanks for your feedback ! The best thing to fully understand a topic is ALWAYS trying to explain it to someone else !!

    While reading your post, I just realized that I was completely wrong regarding the definition of the two encodings UCS-2 and UTF-16 :-( They are, definitely, NOT identical encodings !! Actually, as it's well explained at the Wiki address :

    http://en.wikipedia.org/wiki/UTF-16

    • The UTF-16 encoding can code every Unicode code-point between 0x0 and 0x10FFFF. Characters with code-points up to 0xFFFF take one 16-bit unit ( TWO bytes ). Characters with code-points over 0xFFFF take two 16-bit units ( FOUR bytes ), by using the surrogate pairs mechanism. To sum up, the UTF-16 encoding fully encodes ALL Unicode characters !

    • The older UCS-2BE and UCS-2LE encodings, used in Notepad++, are a subset of UTF-16 : they also encode Unicode characters in two bytes, but they DON'T use the surrogate mechanism. Therefore, they CAN'T encode Unicode characters with code-point over 0xFFFF. Only Unicode characters with code-points between 0x0 and 0xFFFF may be encoded with the UCS-2 encodings !
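
    To see why a UCS-2-only reader shows garbage for such characters, here is a minimal Python sketch. It is an assumption-laden simulation, NOT Notepad++'s actual code : the hypothetical helper ucs2_units simply reads the bytes as independent 16-bit units, and is_surrogate flags the units that fall in the surrogate area :

```python
import struct

def ucs2_units(data, big_endian=True):
    """Read bytes as independent 16-bit units, as a UCS-2-only reader would."""
    count = len(data) // 2
    fmt = (">" if big_endian else "<") + "H" * count
    return list(struct.unpack(fmt, data[: count * 2]))

def is_surrogate(unit):
    return 0xD800 <= unit <= 0xDFFF

text = "A" + chr(0x20191)                 # one BMP character, one Plane 2 character
units = ucs2_units(text.encode("utf-16-be"))
print([hex(unit) for unit in units])      # ['0x41', '0xd840', '0xdd91']
print([is_surrogate(unit) for unit in units])   # [False, True, True]
```

    The single ideograph becomes two separate units, each of which a UCS-2 reader treats as a character of its own.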

    So, I've modified my previous post, to correct this error ! Also, the highest code-point, coded with the UCS-4 encoding, is 0x7FFFFFFF, as the original UCS-4 code space is 31 bits wide !

    Moreover, I added some encoding examples and a link to the surrogate mechanism. Please, Yann, let me know if it still contains some errors :-)

    Regarding the translation, you're perfectly right ! Generally, I write posts for non-French people. So, I prefer to keep the default English localization of my Notepad++, as I sometimes attach a screenshot to my posts !

    Indeed, I'm French but I had NEVER noticed that the expressions UCS-2 Big Endian and UCS-2 Little Endian are MIS-translated into the French expressions UTF-16BE and UTF-16LE !!

    But, you just have to edit the NativeLang.xml file, located in the Notepad++ installation folder, to change these wrong expressions and, then, re-start Notepad++ !

    Cheers,

    guy038

     
    Last edit: THEVENOT Guy 2014-05-05