About Encodings, Conversions and Character Sets

2013-06-23
  • THEVENOT Guy

    Hi, all,

    Yesterday, I read a help topic about encodings, at the address http://sourceforge.net/p/notepad-plus/discussion/331754/thread/ab0ebb78/

    This topic isn't very "fresh", except the last post.

    So, to sum up, I would like to share some information about encodings, conversions and character sets, in Notepad++.

    From now on, the text below describes the behaviour of version 6.3.3 of Notepad++, or above.

    Let's suppose a new file containing ONLY the three characters A±€

    I specifically chose the extended characters ± and € because they are part of most Microsoft ANSI regional code pages, such as Windows-1252, Windows-1255, Windows-1250 ...

    With an ANSI encoding, the hexadecimal values of the three characters of the test file are, respectively :

    41, B1 and 80

    With a UNICODE encoding, the hexadecimal values ( code-points ) of the three characters of the test file are, respectively :

    0041, 00B1 and 20AC
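
As a quick check of these code-points, here is a minimal Python sketch. Python is not part of Notepad++ ; it is only used here for illustration, and the variable names are mine :

```python
# Unicode code-point of each character of the test string,
# formatted as four hexadecimal digits.
text = "A±€"
code_points = [f"{ord(ch):04X}" for ch in text]
print(code_points)
```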

    As this file contains characters with a code-point > 7F, then, regardless of its current encoding, if you convert this test file ( Menu Encoding / Sub-Menu Convert to ... ), the real contents of the test file become :

    • ANSI => 41 , B1 , 80
    • UTF-8 without BOM => 41 , C2 B1 , E2 82 AC
    • UTF-8 => EF BB BF , 41 , C2 B1 , E2 82 AC
    • UCS-2 Big Endian => FE FF, 00 41 , 00 B1 , 20 AC
    • UCS-2 Little Endian => FF FE, 41 00 , B1 00 , AC 20

    Notes :

    I insist on the words Convert to ... because the option Encode in ... behaves differently, which will be discussed later in this post.

    The leading bytes of the last three lines ( EF BB BF, FE FF and FF FE ) represent a HEADER, which is NEVER displayed in Notepad++ and is used to identify the right encoding of a file.
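
The five byte sequences listed above can be reproduced with a short Python sketch. It assumes that "ANSI" means Windows-1252 ( Python codec cp1252 ) and it uses the UTF-16 codecs for UCS-2, which give the same bytes for characters of the Basic Multilingual Plane. The dictionary keys simply reuse the Notepad++ menu names :

```python
import codecs

text = "A±€"

# Assumption: "ANSI" is Windows-1252 (cp1252) on this machine.
results = {
    "ANSI": text.encode("cp1252"),
    "UTF-8 without BOM": text.encode("utf-8"),
    "UTF-8": text.encode("utf-8-sig"),  # utf-8-sig prepends EF BB BF
    # The -be / -le codecs write no BOM, so it is added by hand.
    "UCS-2 Big Endian": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UCS-2 Little Endian": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
}

for name, data in results.items():
    print(f"{name:20} => {data.hex(' ').upper()}")
```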

    In UCS-2 Big Endian :

    • The header is the Unicode character 0xFEFF and represents the BOM ( Byte Order Mark ). If this character is found, further, in the file, it stands for the character ZWNBSP ( Zero Width Non-Breaking Space ).

    • Every valid Unicode character, of the Basic Multilingual Plane, ( from 0000 to D7FF and from E000 to FFFD ) is coded with TWO bytes,

    • The FIRST byte stored is the Most Significant Byte of each sequence of two bytes, so the three characters of the test file, are stored : 00 41 , 00 B1 , 20 AC

    In UCS-2 Little Endian :

    • The header is the sequence FFFE which represents the character 0xFEFF ( BOM ), with the Least Significant Byte, written FIRST.

    • Every valid Unicode character, of the Basic Multilingual Plane, ( from 0000 to D7FF and from E000 to FFFD ) is coded with TWO bytes,

    • The FIRST byte stored is the Least Significant Byte of each sequence of two bytes, so the three characters of the test file, are stored : 41 00 , B1 00 , AC 20
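
The difference between the two byte orders can be sketched in Python with int.to_bytes, here applied to the € character ( 20AC ) and to the BOM ( FEFF ). The variable names are only illustrative :

```python
euro = 0x20AC  # code-point of the € character
bom = 0xFEFF   # code-point of the Byte Order Mark

# Big endian: Most Significant Byte first.
euro_big = euro.to_bytes(2, "big")
bom_big = bom.to_bytes(2, "big")

# Little endian: Least Significant Byte first.
euro_little = euro.to_bytes(2, "little")
bom_little = bom.to_bytes(2, "little")

print(euro_big.hex(" "), "/", euro_little.hex(" "))
print(bom_big.hex(" "), "/", bom_little.hex(" "))
```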

    In UTF-8 :

    • The header is the sequence EFBBBF which represents the UTF-8 form of the character 0xFEFF ( BOM )

    • Every valid Unicode character, of the Basic Multilingual Plane, ( from 0000 to D7FF and from E000 to FFFD ) is coded with :

    1 byte if Unicode value of the character is < 0x0080 ( 128 )
    2 bytes if Unicode value of the character is > 0x007f ( 127 ) and < 0x0800 ( 2048 )
    3 bytes if Unicode value of the character is > 0x07ff ( 2047 ) and < 0xFFFE ( 65534 )

    • A single UTF-8 byte, with a hexadecimal value :

    from 00 to 7F, stands for a standard character of a one byte sequence
    from 80 to BF, stands for a continuation byte, in a two or three bytes sequence
    from C0 to C1, is a forbidden value
    from C2 to DF, is the FIRST byte of a two bytes sequence
    from E0 to EF, is the FIRST byte of a three bytes sequence
    from F0 to F4, is the FIRST byte of a four bytes sequence, so it never occurs for a character of the UNICODE Basic Multilingual Plane ( it codes a value > \xFFFF )
    from F5 to FF, is, ALWAYS, a forbidden value
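
The byte ranges above translate directly into a small lookup function. This is only a sketch ; classify_utf8_byte and its labels are hypothetical names, and the range F0 to F4 is labelled as the lead byte of a four bytes sequence, which is what UTF-8 uses for characters beyond the Basic Multilingual Plane :

```python
def classify_utf8_byte(value: int) -> str:
    """Return the role of a single byte in a UTF-8 stream."""
    if not 0x00 <= value <= 0xFF:
        raise ValueError("not a byte value")
    if value <= 0x7F:
        return "single"        # one byte sequence
    if value <= 0xBF:
        return "continuation"  # 2nd or later byte of a sequence
    if value <= 0xC1:
        return "forbidden"
    if value <= 0xDF:
        return "lead-2"        # FIRST byte of a two bytes sequence
    if value <= 0xEF:
        return "lead-3"        # FIRST byte of a three bytes sequence
    if value <= 0xF4:
        return "lead-4"        # only for characters > FFFF
    return "forbidden"


roles = [classify_utf8_byte(b) for b in "A±€".encode("utf-8")]
print(roles)
```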

    So the three characters of the test file are : 41 , C2 B1 , E2 82 AC ( one byte for the character A, two bytes for the character ± and three bytes for the character € )

    Refer to this link for further information about UTF-8 :
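
To see where these bytes come from, here is a hand-written encoder limited to the Basic Multilingual Plane. It is a sketch for illustration only ( encode_bmp_utf8 is a hypothetical name ) ; in practice, str.encode("utf-8") does the job :

```python
def encode_bmp_utf8(code_point: int) -> bytes:
    """Encode one BMP code-point as UTF-8, following the 1/2/3 bytes rule."""
    if not 0 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid BMP character")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = b"".join(encode_bmp_utf8(ord(ch)) for ch in "A±€")
print(encoded.hex(" ").upper())
```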
    http://en.wikipedia.org/wiki/UTF-8

    In UTF-8 without BOM :

    • The encoding of characters is identical to UTF-8, but there's NO header ( BOM ). So, the three invisible bytes, at the very beginning of the file, are ABSENT.
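
A program can recognize the three headers described so far by looking at the first bytes of a file. The function below is a simplified sketch with a hypothetical name ( detect_bom ) ; it is NOT the detection code of Notepad++ and it ignores other encodings, such as UTF-32 :

```python
import codecs
from typing import Optional


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding announced by a leading BOM, or None."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "UCS-2 Big Endian"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "UCS-2 Little Endian"
    return None  # ANSI or UTF-8 without BOM: the header cannot tell


print(detect_bom(bytes.fromhex("EFBBBF41C2B1E282AC")))
print(detect_bom(bytes.fromhex("41C2B1E282AC")))
```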

    In ANSI :

    • No header is present at the very beginning of file.

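    • Each character which belongs to the current ANSI code page is coded with a one byte sequence, whatever its UNICODE code-point ( for example, € is \x20AC in UNICODE, but 80 in Windows-1252 ), so the three characters of the test file are simply stored :  41 , B1 , 80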
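
Note that the byte 80, for the € character, comes from the code page table and not from its Unicode code-point 20AC. A small Python sketch, assuming the codec names cp1252, cp1250 and cp1255 for the three Windows code pages mentioned earlier, and latin-1 ( ISO-8859-1 ) as a counter-example which has no € character :

```python
text = "±€"

# The same two characters, encoded with three Windows "ANSI" code pages.
encoded = {cp: text.encode(cp) for cp in ("cp1252", "cp1250", "cp1255")}
for cp, data in encoded.items():
    print(cp, "=>", data.hex(" ").upper())

# ISO-8859-1 maps bytes 00-FF straight to code-points 0000-00FF,
# so the € character (20AC) cannot be encoded at all.
try:
    "€".encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
print("latin-1 can encode €:", latin1_has_euro)
```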

    IMPORTANT :

    • If the default encoding for a new document ( Menu Settings / Preferences / New document ) is set to UTF-8 without BOM with the box Apply on opened ANSI files ticked :

    then, for a file without any character with hexadecimal value > \x7F, regardless of its current encoding, a conversion to ANSI automatically sets the encoding of this file to UTF-8 without BOM, on next opening.

    • If the default encoding for a new document is different from above :

    then, for a file without any character with hexadecimal value > \x7F, regardless of its current encoding, a conversion to UTF-8 without BOM automatically sets the encoding of this file to ANSI, on next opening.

    • Conversion of current file to UTF-8, UCS-2 Big Endian or UCS-2 Little Endian is ALWAYS immediate.

    • Conversion to ANSI or UTF-8 without BOM, is ALWAYS immediate, if current file contains, at least one character > \x7F.
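
A likely reason for this behaviour is that a file with only characters <= \x7F has exactly the same bytes in ANSI and in UTF-8 without BOM, so nothing in the file itself can tell the two encodings apart. The sketch below illustrates that point, assuming Windows-1252 for ANSI ; it says nothing about how Notepad++ actually decides :

```python
ascii_text = "Hello, Notepad++"
extended_text = "A±€"

# Pure ASCII: both encodings produce identical bytes.
same_for_ascii = ascii_text.encode("cp1252") == ascii_text.encode("utf-8")

# With characters > 7F, the bytes differ, so the encoding matters.
same_for_extended = extended_text.encode("cp1252") == extended_text.encode("utf-8")

print(same_for_ascii, same_for_extended)
```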

    DIFFERENCES between the options Convert to .... and Encode in .... OR Character sets, in the Encoding menu :

    • The option Convert to ... transforms the current file, from its current encoding, into a file with the same characters, translated into the chosen encoding.

    Use this option, ONLY if all the characters of the file are correctly displayed. Characters which cannot be represented in the new encoding will be replaced by a question mark ( ? )

    The contents of the current file, after conversion, are ALWAYS modified.
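
The replacement by a question mark can be imitated in Python with the errors="replace" option of str.encode. This only mimics the behaviour described above, assuming Windows-1252 for ANSI ; the extra character Ω is my own addition to the test string, chosen because it is absent from that code page :

```python
text = "A±€Ω"  # Ω (03A9) does not exist in Windows-1252

# Characters missing from the target code page become "?".
converted = text.encode("cp1252", errors="replace")
round_trip = converted.decode("cp1252")

print(converted.hex(" ").upper())
print(round_trip)
```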

    • The option Encode in ... OR Character sets / ... / ..., applies the chosen encoding to the existing contents of the current file.

    Generally, the contents of the current file, after encoding, are not modified and ONLY displaying is changed.

    But, if the current OR the new encoding is UTF-8, UCS-2 Big Endian or UCS-2 Little Endian, then the contents of the current file are modified.

    Use this option, ONLY if some characters of the file are unreadable or displayed as small boxes.

    If your current file is already correctly displayed, this option will, generally, make odd characters appear :

    For example, if the current encoding of the test file is ANSI, the option Encode in UTF-8 displays the bytes B1 and 80 as xB1x80, in the same way as control characters, because these bytes are not part of a legal UTF-8 sequence.

    If the current encoding of the test file is UTF-8 without BOM, the option Encode in ANSI, for example, displays the string AÂ±â‚¬, one character for each byte of the file ( 41 , C2 , B1 , E2 , 82 , AC )

    And, if the current encoding of the test file is ANSI, the option Character Sets / Western European / OEM850 displays the string A▒Ç, in DOS CP 850, according to the contents of the test file ( 41 , B1 , 80 )
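
These three examples amount to reading the SAME bytes with a different decoder. A Python sketch of the idea, assuming cp1252 for ANSI and cp850 for OEM850 ; the backslashreplace error handler is used to make the illegal UTF-8 bytes visible, a little like the xB1x80 display of Notepad++ :

```python
ansi_bytes = bytes.fromhex("41 B1 80")           # test file saved as ANSI
utf8_bytes = bytes.fromhex("41 C2 B1 E2 82 AC")  # test file saved as UTF-8

# ANSI bytes read as UTF-8: B1 and 80 are continuation bytes without a
# lead byte, so they are illegal and shown as escapes.
as_utf8 = ansi_bytes.decode("utf-8", errors="backslashreplace")

# UTF-8 bytes read as ANSI: every byte becomes one character.
as_ansi = utf8_bytes.decode("cp1252")

# ANSI bytes read with the DOS code page 850.
as_oem850 = ansi_bytes.decode("cp850")

print(as_utf8)
print(as_ansi)
print(as_oem850)
```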


    Whew... I spent some hours analysing these different concepts, running a ton of tests and trying to extract the main ideas about them !

    Hope it'll be useful to someone :-)

    Cheers,

    guy038

    P.S.

    You may find further documentation, on Wikipedia and other sites, at the addresses below :

    http://en.wikipedia.org/wiki/UTF-16

    http://www.i18nguy.com/unicode/codepages.html#msftdos

    http://en.wikipedia.org/wiki/Endianness

    http://en.wikipedia.org/wiki/Byte_order_mark

    http://en.wikipedia.org/wiki/Unicode

    http://www.unicode.org/charts/charindex.html

    http://www.unicode.org/charts/

    http://en.wikipedia.org/wiki/Unicode_Specials

    http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

     
    Last edit: THEVENOT Guy 2013-08-11