Yesterday, I read a help topic about encodings, at the address http://sourceforge.net/p/notepad-plus/discussion/331754/thread/ab0ebb78/
This topic isn't very "fresh", except for the last post.
So, to sum up, I would like to share some information about encodings, conversions and character sets in Notepad++.
From now on, the text below concerns the behaviour of version 6.3.3 of Notepad++, or above.
Let's suppose a new file containing ONLY the three characters A±€
I deliberately chose the extended characters ± and € because they are part of most Microsoft ANSI regional code pages, such as Windows-1252, Windows-1255, Windows-1250 ...
With an ANSI encoding, the hexadecimal values of the three characters of the test file are, respectively :
41, B1 and 80
With UNICODE encoding, the hexadecimal values ( code-points ) of the three characters of the test file are, respectively :
0041, 00B1 and 20AC
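As a quick check of the two sets of values above, here is a minimal Python sketch. It assumes that the ANSI code page is Windows-1252 (Python codec name cp1252), which is only one of the possible ANSI code pages, and the variable names are mine.

```python
text = "A±€"

# Unicode code-points, written with four hexadecimal digits each
code_points = [f"{ord(ch):04X}" for ch in text]

# ANSI bytes, ASSUMING the Windows-1252 code page
ansi_bytes = text.encode("cp1252").hex(" ").upper()

print(code_points)  # ['0041', '00B1', '20AC']
print(ansi_bytes)   # 41 B1 80
```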
As this file contains characters with code-point > 7F, then, regardless of the actual encoding, if you convert this test file ( Menu Encoding / Sub-Menu Convert to ....... ), the real contents of the test file become :
Convert to ANSI : 41 , B1 , 80
Convert to UTF-8 without BOM : 41 , C2 B1 , E2 82 AC
Convert to UTF-8 : EF BB BF , 41 , C2 B1 , E2 82 AC
Convert to UCS-2 Big Endian : FE FF , 00 41 , 00 B1 , 20 AC
Convert to UCS-2 Little Endian : FF FE , 41 00 , B1 00 , AC 20
I stress the words Convert to ... because the option Encode in ... behaves differently, which will be discussed later in this post.
The leading part ( EF BB BF , FE FF or FF FE ) represents a HEADER, which is NEVER displayed in Notepad++ and is used to identify the right encoding of a file.
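To make the conversions and the header more concrete, here is a small Python sketch that rebuilds the five byte sequences and then recognises the header again. It is only an illustration, not the way Notepad++ does it : I assume Windows-1252 for ANSI, I use Python's UTF-16 codecs for UCS-2 ( the two are identical for characters of the Basic Multilingual Plane ), and the names hexdump and sniff_header are mine.

```python
import codecs

TEXT = "A±€"

def hexdump(data: bytes) -> str:
    """Return the bytes as upper-case hexadecimal pairs, separated by spaces."""
    return data.hex(" ").upper()

# Header (if any) followed by the encoded characters
converted = {
    "ANSI": TEXT.encode("cp1252"),  # ASSUMPTION: Windows-1252 code page
    "UTF-8 without BOM": TEXT.encode("utf-8"),
    "UTF-8": codecs.BOM_UTF8 + TEXT.encode("utf-8"),
    "UCS-2 Big Endian": codecs.BOM_UTF16_BE + TEXT.encode("utf-16-be"),
    "UCS-2 Little Endian": codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le"),
}

def sniff_header(data: bytes) -> str:
    """Guess the encoding from the header only (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UCS-2 Big Endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UCS-2 Little Endian"
    return "no header ( ANSI or UTF-8 without BOM )"

for name, data in converted.items():
    print(f"{name:20} {hexdump(data):27} -> {sniff_header(data)}")
```

Note that the header alone cannot separate ANSI from UTF-8 without BOM, since neither of them has one.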
In UCS-2 Big Endian :
The header is the Unicode character 0xFEFF, stored as the two bytes FE FF, and represents the BOM ( Byte Order Mark ). If this character is found further on in the file, it stands for the character ZWNBSP ( Zero Width Non-Breaking Space ).
Every valid Unicode character of the Basic Multilingual Plane ( from 0000 to D7FF and from E000 to FFFD ) is coded with TWO bytes.
The FIRST byte stored is the Most Significant Byte of each sequence of two bytes, so the three characters of the test file are stored : 00 41 , 00 B1 , 20 AC
In UCS-2 Little Endian :
The header is the sequence FF FE, which represents the character 0xFEFF ( BOM ) with the Least Significant Byte written FIRST.
The FIRST byte stored is the Least Significant Byte of each sequence of two bytes, so the three characters of the test file are stored : 41 00 , B1 00 , AC 20
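The only difference between the two UCS-2 forms is the order of the two bytes of each character. A minimal Python sketch of this idea, where the function name ucs2_bytes is hypothetical and where only characters of the Basic Multilingual Plane are accepted :

```python
def ucs2_bytes(text: str, byteorder: str) -> bytes:
    """Store each BMP code-point on two bytes; byteorder is 'big' or 'little'."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"U+{code_point:04X} cannot be stored in UCS-2")
        out += code_point.to_bytes(2, byteorder)
    return bytes(out)

big = ucs2_bytes("A±€", "big")        # Most Significant Byte first
little = ucs2_bytes("A±€", "little")  # Least Significant Byte first

print(big.hex(" ").upper())     # 00 41 00 B1 20 AC
print(little.hex(" ").upper())  # 41 00 B1 00 AC 20

# The header is simply the character 0xFEFF, stored with the same rule
print(ucs2_bytes("\ufeff", "big").hex(" ").upper())     # FE FF
print(ucs2_bytes("\ufeff", "little").hex(" ").upper())  # FF FE
```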
In UTF-8 :
The header is the sequence EFBBBF which represents the UTF-8 form of the character 0xFEFF ( BOM )
Every valid Unicode character, of the Basic Multilingual Plane, ( from 0000 to D7FF and from E000 to FFFD ) is coded with :
1 byte if Unicode value of the character is < 0x0080 ( 128 )
2 bytes if Unicode value of the character is > 0x007f ( 127 ) and < 0x0800 ( 2048 )
3 bytes if Unicode value of the character is > 0x07ff ( 2047 ) and < 0xFFFE ( 65534 )
Consequently, in a UTF-8 file, a byte with a value :
from 00 to 7F, stands for a standard character of a one byte sequence
from 80 to BF, stands for a continuation byte, in a two or three bytes sequence
from C0 to C1, is a forbidden value
from C2 to DF, is the FIRST byte of a two bytes sequence
from E0 to EF, is the FIRST byte of a three bytes sequence
from F0 to F4, is the FIRST byte of a four bytes sequence, so it never occurs for a character of the UNICODE Basic Multilingual Plane ( it needs a value > \xFFFF )
from F5 to FF, is, ALWAYS, a forbidden value
So the three characters of the test file are : 41 , C2 B1 , E2 82 AC ( one byte for the character A, two bytes for the character ± and three bytes for character € )
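The two lists above ( number of bytes per character, then meaning of each byte value ) can be sketched in Python as follows. This is a hand-written illustration limited to the Basic Multilingual Plane, not a replacement for a real UTF-8 codec, and the names utf8_encode_bmp and classify_byte are mine.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode one BMP code-point with the 1 / 2 / 3 bytes rule."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0xFFFF:
        raise ValueError("not a valid character of the BMP")
    if code_point < 0x80:                       # 1 byte
        return bytes([code_point])
    if code_point < 0x800:                      # 2 bytes
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xE0 | (code_point >> 12),    # 3 bytes
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

def classify_byte(value: int) -> str:
    """Tell what a single byte value means inside a UTF-8 file."""
    if value <= 0x7F:
        return "one byte character"
    if value <= 0xBF:
        return "continuation byte"
    if value <= 0xC1:
        return "forbidden"
    if value <= 0xDF:
        return "first byte of a two bytes sequence"
    if value <= 0xEF:
        return "first byte of a three bytes sequence"
    if value <= 0xF4:
        return "first byte of a four bytes sequence ( outside the BMP )"
    return "forbidden"

for ch in "A±€":
    encoded = utf8_encode_bmp(ord(ch))
    print(ch, encoded.hex(" ").upper(), [classify_byte(b) for b in encoded])
```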
Refer to this link for further information about UTF-8 :
In UTF-8 without BOM :
In ANSI :
No header is present at the very beginning of the file.
Each character that belongs to the current ANSI code page is coded with a one byte sequence, so the three characters of the test file are simply stored : 41 , B1 , 80
then, for a file without any character with hexadecimal value > \x7F, regardless of its actual encoding, a conversion to ANSI automatically sets the encoding of this file to UTF-8 without BOM, on next opening.
then, for a file without any character with hexadecimal value > \x7F, regardless of its actual encoding, a conversion to UTF-8 without BOM automatically sets the encoding of this file to ANSI, on next opening.
Conversion of current file to UTF-8, UCS-2 Big Endian or UCS-2 Little Endian is ALWAYS immediate.
Conversion to ANSI or UTF-8 without BOM, is ALWAYS immediate, if current file contains, at least one character > \x7F.
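One fact helps to understand why files without any character > \x7F are a special case : their ANSI form and their UTF-8 without BOM form are byte for byte identical, and neither has a header, so nothing inside the file can tell them apart. A small Python check of that fact, assuming Windows-1252 for ANSI ( the helper has_byte_above_7f is hypothetical ) :

```python
def has_byte_above_7f(data: bytes) -> bool:
    """True if at least one byte of the file is greater than 0x7F."""
    return any(byte > 0x7F for byte in data)

ascii_only = "Hello, Notepad++"
test_file = "A±€"

# ASSUMPTION: cp1252 stands for the ANSI code page
same_for_ascii = ascii_only.encode("cp1252") == ascii_only.encode("utf-8")
same_for_test = test_file.encode("cp1252") == test_file.encode("utf-8")

print(has_byte_above_7f(ascii_only.encode("utf-8")), same_for_ascii)  # False True
print(has_byte_above_7f(test_file.encode("utf-8")), same_for_test)    # True False
```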
DIFFERENCES between the options Convert to .... and Encode in .... OR Character sets, in the Encoding menu :
With the option Convert to .... :
Use this option ONLY if all the actual characters of the file are correctly displayed. Characters that are not representable in the new encoding will be replaced by a question mark ?
The contents of the current file, after conversion, are ALWAYS modified.
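The replacement by a question mark can be imitated in Python with the "replace" error handler of str.encode. This only mimics the visible result and I don't claim that Notepad++ works this way internally. I assume Windows-1252 as the target, and I add the character Ω, which does not exist in that code page.

```python
text = "A±€Ω"  # Ω ( U+03A9 ) is not part of Windows-1252

# Characters without an equivalent in the target encoding become '?'
converted = text.encode("cp1252", errors="replace")

print(converted.hex(" ").upper())  # 41 B1 80 3F
print(converted.decode("cp1252"))  # A±€?
```

The original Ω is lost for good : converting back to a Unicode encoding gives the question mark, not the Ω.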
With the options Encode in .... OR Character sets :
Generally, the contents of the current file, after encoding, are not modified and ONLY the display is changed.
But, if the actual OR the future encoding is UTF-8, UCS-2 Big Endian or UCS-2 Little Endian, then the contents of the current file are modified.
Use this option ONLY if some characters of the file are unreadable or displayed as small boxes.
If your current file is already correctly displayed, the special characters will, generally, be wrongly displayed :
For example, if the actual encoding of the test file is ANSI, the option Encode in UTF-8 displays the two bytes B1 and 80 in the control character way, as the string xB1x80, because these bytes are not part of a legal UTF-8 sequence.
If the actual encoding of the test file is UTF-8 without BOM, the option Encode in ANSI, for example, displays the string AÂ±â‚¬, according to the actual contents of the file ( 41 , C2 , B1 , E2 , 82 , AC )
And, if the actual encoding of the test file is ANSI, the option Character Sets / Western European / OEM850 displays the string A▒Ç, in DOS CP 850, according to the actual contents of the test file ( 41 , B1 , 80 )
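These three examples amount to reading the SAME bytes with a different decoder. A minimal Python sketch of that reinterpretation, assuming Windows-1252 for ANSI and Python's cp850 codec for OEM850 :

```python
utf8_bytes = "A±€".encode("utf-8")    # 41 C2 B1 E2 82 AC
ansi_bytes = "A±€".encode("cp1252")   # 41 B1 80

# UTF-8 without BOM bytes, read as ANSI
as_ansi = utf8_bytes.decode("cp1252")
print(as_ansi)    # AÂ±â‚¬

# ANSI bytes, read as DOS code page 850
as_oem850 = ansi_bytes.decode("cp850")
print(as_oem850)  # A▒Ç

# ANSI bytes, read as UTF-8 : B1 cannot start a sequence, so decoding fails
try:
    ansi_bytes.decode("utf-8")
    bad_offset = None
except UnicodeDecodeError as exc:
    bad_offset = exc.start
print(bad_offset)  # 1
```

In each case the bytes of the file stay the same and only their interpretation changes.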
Wow... I spent some hours analysing these different concepts, running a ton of tests and trying to extract the main ideas about them !
Hope it'll be useful to someone :-)
You may find further documentation, on Wikipedia, at the addresses below :