
Encodings And Character Display

From notepad-plus

Revision as of 20:46, 26 August 2013 by Cchris (Talk | contribs)
Wrong display of non-ASCII characters


Users have reported typing non-Western characters (Russian, Hebrew, etc.) and getting text which is all garbled (strange characters).

Old plugins had been considered a likely cause, but as of Notepad++ v5.4.2 this no longer seems to hold.

Since there are many ways to represent the same character, called encodings, let's delve a bit into the matter. The general description is followed by numerous examples.


Encodings : a short overview

(contributed by guy038)

Let us consider a new file containing ONLY the three characters A±€

I deliberately chose the extended characters ± and € because they are part of most Microsoft ANSI regional code pages, such as Windows-1252, Windows-1255, Windows-1250 ...

With an ANSI encoding, the hexadecimal values of the three characters of the test file are, respectively:

41, B1 and 80.

(The usual $ or 0x prefixes are omitted, since the context makes it clear that these are hexadecimal codes.)

Note however that with different ANSI code pages, the byte representation could be different. For instance, in the DOS CP850 code page, the euro symbol is not part of the represented characters, and "plus or minus" has code F1.
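As a quick check of this code-page dependence, the following Python sketch encodes the two extended characters with two code pages. It assumes that Python's built-in cp1252 and cp850 codecs faithfully model Windows-1252 and DOS CP850; the helper name ansi_bytes is purely illustrative.

```python
def ansi_bytes(text, code_page):
    """Return the hex bytes of text in code_page, or None if it cannot be represented."""
    try:
        return text.encode(code_page).hex(" ")
    except UnicodeEncodeError:
        return None


plus_minus_1252 = ansi_bytes("±", "cp1252")
plus_minus_850 = ansi_bytes("±", "cp850")
euro_1252 = ansi_bytes("€", "cp1252")
euro_850 = ansi_bytes("€", "cp850")  # None: CP850 has no euro symbol

print(plus_minus_1252, plus_minus_850, euro_1252, euro_850)
```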

In Unicode, the hexadecimal code points of the three characters of the test file are, respectively:

0041, 00B1 and 20AC
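These code points can be read back with Python's built-in ord(), as in this minimal sketch (the variable names are illustrative only):

```python
text = "A±€"

# ord() returns the Unicode code point; format it as four hex digits.
code_points = [f"{ord(ch):04X}" for ch in text]

print(", ".join(code_points))
```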

As this file contains characters with code point > 7F, then, regardless of its current encoding, if you convert this test file with Encoding -> Convert to ..., the actual contents of the test file become:

   ANSI => 41 , B1 , 80
   UTF-8 without BOM => 41 , C2 B1 , E2 82 AC
   UTF-8 => EF BB BF , 41 , C2 B1 , E2 82 AC
   UCS-2 Big Endian => FE FF, 00 41 , 00 B1 , 20 AC
   UCS-2 Little Endian => FF FE, 41 00 , B1 00 , AC 20
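The byte sequences listed above can be reproduced outside Notepad++. The sketch below uses Python's standard codecs and assumes that ANSI means Windows-1252 and that, for these three BMP characters, UCS-2 coincides with UTF-16; the BOM is prepended by hand for the two UCS-2 variants. The names hex_bytes and representations are illustrative.

```python
import codecs

text = "A±€"


def hex_bytes(data):
    """Format bytes as space-separated upper-case hex pairs."""
    return data.hex(" ").upper()


representations = {
    "ANSI": text.encode("cp1252"),          # assumption: ANSI = Windows-1252
    "UTF-8 without BOM": text.encode("utf-8"),
    "UTF-8": text.encode("utf-8-sig"),      # the utf-8-sig codec prepends EF BB BF
    "UCS-2 Big Endian": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UCS-2 Little Endian": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
}

for name, data in representations.items():
    print(f"{name} => {hex_bytes(data)}")
```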

Since Notepad++ does not support UTF-32 encodings yet, we won't cover them here.

Note that the similar options found in Encoding -> Encode in ... have a different behaviour, as explained in detail in Convert Or Encode?

The leading byte groups EF BB BF, FE FF and FF FE in the list above represent a HEADER, which is NEVER displayed in Notepad++ and is used to identify the encoding of a file.

In UCS-2 Big Endian

The header is the Unicode character 0xFEFF and represents the BOM ( Byte Order Mark ). If this character is found elsewhere in the file, it stands for the character ZWNBSP ( Zero Width Non-Breaking Space ).

Every valid Unicode character of the Basic Multilingual Plane ( from 0000 to D7FF and from E000 to FFFD ) is coded with TWO bytes.

The FIRST byte stored is the Most Significant Byte of each sequence of two bytes, so the three characters of the test file are stored as: 00 41 , 00 B1 , 20 AC

In UCS-2 Little Endian

The header is the sequence FFFE, which represents the character 0xFEFF ( BOM ) with the Least Significant Byte written first.

Every valid Unicode character of the Basic Multilingual Plane ( from 0000 to D7FF and from E000 to FFFD ) is coded with TWO bytes.

The FIRST byte stored is the Least Significant Byte of each sequence of two bytes, so the three characters of the test file are stored as: 41 00 , B1 00 , AC 20
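The difference between the two byte orders can be sketched with int.to_bytes. The helper ucs2_bytes is a hypothetical name, and the sketch only handles code points that fit in two bytes.

```python
def ucs2_bytes(code_point, byteorder):
    """Store a BMP code point as two bytes; byteorder is "big" or "little"."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("code point does not fit in two bytes")
    return code_point.to_bytes(2, byteorder)


text = "A±€"
big = b"".join(ucs2_bytes(ord(ch), "big") for ch in text)
little = b"".join(ucs2_bytes(ord(ch), "little") for ch in text)

print(big.hex(" "))     # most significant byte first
print(little.hex(" "))  # least significant byte first
```

The same helper applied to the BOM character 0xFEFF yields FE FF in big-endian order and FF FE in little-endian order, which is how the header reveals the byte order of the file.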

In UTF-8

The header is the sequence EFBBBF, which represents the UTF-8 form of the character 0xFEFF ( BOM ).

Every valid Unicode character, of the Basic Multilingual Plane, ( from 0000 to D7FF and from E000 to FFFD ) is coded with :

  • 1 byte if the Unicode value of the character is < 0x0080 ( 128 )
  • 2 bytes if the Unicode value of the character is > 0x007F ( 127 ) and < 0x0800 ( 2048 )
  • 3 bytes if the Unicode value of the character is > 0x07FF ( 2047 ) and < 0xFFFE ( 65534 )
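The length rule can be written as a small function and compared with what Python's own UTF-8 encoder produces. This is a sketch for the Basic Multilingual Plane only; utf8_length is a hypothetical name, and the surrogate range D800 to DFFF is not treated specially.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a BMP code point."""
    if code_point < 0x0080:
        return 1
    if code_point < 0x0800:
        return 2
    if code_point <= 0xFFFF:
        return 3
    raise ValueError("code point is outside the Basic Multilingual Plane")


text = "A±€"
lengths = [utf8_length(ord(ch)) for ch in text]

# Cross-check against the standard encoder.
encoder_lengths = [len(ch.encode("utf-8")) for ch in text]

print(lengths, encoder_lengths)
```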

A single byte in a UTF-8 encoded file, depending on its hexadecimal value:

  • from 00 to 7F, stands for a standard character coded as a one-byte sequence
  • from 80 to BF, stands for a continuation byte in a two- or three-byte sequence
  • from C0 to C1, is a forbidden value
  • from C2 to DF, is the FIRST byte of a two-byte sequence
  • from E0 to EF, is the FIRST byte of a three-byte sequence
  • from F0 to F4, is the FIRST byte of a four-byte sequence, which only encodes characters outside the UNICODE Basic Multilingual Plane ( Value > \xFFFF ), so it never occurs for BMP characters
  • from F5 to FF, is, ALWAYS, a forbidden value

So the three characters of the test file are : 41 , C2 B1 , E2 82 AC ( one byte for the character A, two bytes for the character ± and three bytes for character € )
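The byte ranges listed above translate directly into a classifier. In the sketch below, the function name and the returned labels are illustrative choices, not Notepad++ terminology.

```python
def classify_utf8_byte(value):
    """Classify one byte of a UTF-8 stream by its value (0..255)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("not a byte value")
    if value <= 0x7F:
        return "single"        # one-byte sequence
    if value <= 0xBF:
        return "continuation"  # follows a lead byte
    if value <= 0xC1:
        return "forbidden"
    if value <= 0xDF:
        return "lead-2"        # first byte of a two-byte sequence
    if value <= 0xEF:
        return "lead-3"        # first byte of a three-byte sequence
    if value <= 0xF4:
        return "outside-bmp"   # only used for code points above FFFF
    return "forbidden"


test_file = bytes.fromhex("41 C2 B1 E2 82 AC")
roles = [classify_utf8_byte(b) for b in test_file]
print(roles)
```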

Please refer to this link for further information about UTF-8: http://en.wikipedia.org/wiki/UTF-8

In UTF-8 without BOM

The encoding of characters is identical to UTF-8, but there is NO header ( BOM ). So the three invisible bytes at the very beginning of the file are ABSENT.

In ANSI

No header is present at the very beginning of file.

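Each character available in the active ANSI code page is coded as a single byte, so the three characters of the test file are simply stored as: 41 , B1 , 80. Note that the byte value is not always the Unicode code point: in Windows-1252, € is stored as 80 although its code point is 20AC.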
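The headers described in the sections above can be looked for at the start of a file's bytes. The sketch below is not Notepad++'s detection logic, only an illustration of the three headers covered here; detect_header and HEADERS are hypothetical names, and a result of None means the file is either ANSI or UTF-8 without BOM.

```python
import codecs

# The three headers discussed above, with the names used in the Encoding menu.
HEADERS = [
    (codecs.BOM_UTF8, "UTF-8"),                    # EF BB BF
    (codecs.BOM_UTF16_BE, "UCS-2 Big Endian"),     # FE FF
    (codecs.BOM_UTF16_LE, "UCS-2 Little Endian"),  # FF FE
]


def detect_header(data):
    """Return the encoding announced by a leading header, or None if there is none."""
    for bom, name in HEADERS:
        if data.startswith(bom):
            return name
    return None


print(detect_header(bytes.fromhex("EF BB BF 41 C2 B1 E2 82 AC")))
print(detect_header(bytes.fromhex("41 B1 80")))
```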

IMPORTANT :

  • If the default encoding for a new document ( Settings -> Preferences -> New document ) is set to UTF-8 without BOM with the box Apply on opened ANSI files ticked, then, for a file without any character with hexadecimal value > \x7F, regardless of its actual encoding, a conversion to ANSI automatically sets the encoding of this file to UTF-8 without BOM the next time it is opened.
  • If the default encoding for a new document is different from the above, then, for a file without any character with hexadecimal value > \x7F, regardless of its actual encoding, a conversion to UTF-8 without BOM automatically sets the encoding of this file to ANSI the next time it is opened.

Conversion of the current file to UTF-8, UCS-2 Big Endian or UCS-2 Little Endian is ALWAYS immediate.

Conversion to ANSI or UTF-8 without BOM is ALWAYS immediate if the current file contains at least one character > \x7F.
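The reason files without any character > \x7F behave differently is that their bytes are identical in ANSI and in UTF-8 without BOM, so nothing in the file distinguishes the two. A minimal sketch, assuming Windows-1252 as the ANSI code page (the variable names are illustrative):

```python
def is_pure_ascii(data):
    """True if no byte in data is above 7F."""
    return all(byte <= 0x7F for byte in data)


plain = "plain text"
extended = "A±€"

# Pure ASCII text: both encodings give exactly the same bytes.
same_bytes = plain.encode("cp1252") == plain.encode("utf-8")

# With characters above 7F the two encodings diverge.
different_bytes = extended.encode("cp1252") != extended.encode("utf-8")

print(is_pure_ascii(plain.encode("utf-8")), same_bytes, different_bytes)
```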

References

You may find further documentation at the addresses below:

http://en.wikipedia.org/wiki/UTF-16

http://www.i18nguy.com/unicode/codepages.html#msftdos

http://en.wikipedia.org/wiki/Endianness

http://en.wikipedia.org/wiki/Byte_order_mark

http://en.wikipedia.org/wiki/Unicode

http://www.unicode.org/charts/charindex.html

http://www.unicode.org/charts/

http://en.wikipedia.org/wiki/Unicode_Specials

http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

What to check when characters display in an unusable way?

There are a few things to check:

  1. Encoding for the file may need to be set to UCS2 Little Endian, on the Format menu. This will happen when there is no OEM code page for your language, like for Esperanto.
  2. Does the font selected in Settings -> Styler Configurator -> Global Styles , Default Style support the characters you wish to type?
  3. Make sure advanced text services are on. This will also enable you to show the language bar, which will help in the step below.
  4. What is the input language for your keyboard? Notepad++ was shown to reset it according to current locale (still true in v6.5.5). Press Left Alt+Shift repeatedly until the right input language is found.
  5. If you get the right character set, but the wrong characters, repeatedly press Ctrl+Shift until the right keyboard layout is back.

Notepad++ not only provides standard (OEM437) support in ANSI mode, but also allows selecting a different code page. All known OEM/Windows/ISO standards are supported. They are to be found in Format -> Character Sets as a series of submenus grouping related languages together.

HTML and XML files allow auto detection of the encoding being used, and Notepad++ uses these mechanisms.

Please note that, as of v6.5.5, conversion from some Unicode format to a non-default ANSI code page is not supported.

If the encoding is still wrong

If you need to deal with an encoding different from these, for instance an EBCDIC code page, or if the font you selected doesn't switch to the right character set on its own, the following macro may be useful. It is to be inserted in shortcuts.xml inside the <Macros> tag:

<Macro name="setCharSet" Ctrl="yes" Alt="yes" Shift="no" Key="100">
   <Action type="0" message="2066" wParam="32" lParam="charset" sParam=""/>
</Macro>

Replace charset with the desired integer value; as usual, the value must be written inside double quotes. This can be any OEM or Windows code page number.

NOTE: you must fire this macro, and then change to some font. You may have to change to some bogus font and then back to the one you had.

If the style you wish to modify is not the default style, please look up the relevant styleID in styler.xml, and then replace "32" with that number.

Cyrillic scripts specific issues

For writers in Cyrillic scripts: there are three common, mutually different non-Unicode encodings:

  1. CP866 (DOS, OEM, ASCII). This dates from the DOS era, yet Microsoft still uses it in *.bat files, even in Windows 7 and Windows Server 2008 R2.
  2. KOI8-R. This comes from the Unix world. Many Unix/Linux servers still use it, but more and more of them are moving to UTF-8.
  3. CP1251 (ANSI). This is the Russian Windows code page. Modern Windows versions use Unicode, but programs that do not understand it still rely on CP1251.

Check whether the font you use has the encoding you expect. If not, then the font needs to be changed.

The following chart shows various mangling patterns for Cyrillic text, depending on how it is encoded and decoded.

Various renderings of Ещё раз ("Once again" in Russian)

Columns: Decoding | Rendering without UTF-8 conversion | Rendering with UTF-8 conversion

Initial encoding: CP 866 (Hex: 85 e9 f1 20 e0 a0 a7)
   CP 866 | Ещё раз | ┬Е├й├▒ ├а┬а┬з
   KOI-8 | ┘ИЯ Ю═ї | б┘ц╘ц╠ б═ц═цї
   CP 1251 | …йс а § | …éñ Г В В§
   CP 850 | àÚ± ÓẠ| ┬à├®├░ ├á┬á┬º

Initial encoding: KOI-8 (Hex: e5 dd a3 20 d2 c1 da)
   CP 866 | х▌г ╥┴┌ | ┬е┬Э└г ├Т┬а┬Ъ
   KOI-8 | Ещё раз | ц╔ц²бё ц▓ц│䑆
   CP 1251 | еЭЈ ТБЪ | ГµГќВЈ Г’ГЃГљ
   CP 850 | Õ¦ú Ê┴┌ | ├Á├Ø┬ú ├Æ├ü├Ü

Initial encoding: CP 1251 (Hex: c5 f9 b8 20 f0 e0 e7)
   CP 866 | ┼∙╕ ≡рч | ├Е├╣┬╕ ├░├а├з
   KOI-8 | еЫ╦ ПЮГ | ц┘ц╧б╦ ц╟ц═цї
   CP 1251 | Ещё раз | Г…Г№Вё Г°Г Г§
   CP 850 | ┼¨© ­Óþ | ├à├╣┬© ├░├á├º

Initial encoding: Unicode (UCS-2 LE) (Hex: 15 04 49 04 51 04 20 00 40 04 30 04 37 04)
   CP 866 | §♦I♦Q♦ @♦0♦7♦ | ╨Х╤З╤С ╤А╨░╨╖
   KOI-8 | §♦I♦Q♦ @♦0♦7♦ | п┙я┴я▒ я─п╟Я╥
   CP 1251 | §♦I♦Q♦ @♦0♦7♦ | Ещё СЂР°Р·
   CP 850 | §♦I♦Q♦ @♦0♦7♦ | ðòÐëÐæ ÐÇð░ðÀ
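The renderings without UTF-8 conversion arise from encoding the text with one code page and decoding the resulting bytes with another. The sketch below reproduces that mechanism with Python's standard codecs. The function render and the dictionary hex_forms are hypothetical names; Python's code-page tables may differ in places from the fonts and tools used to produce the chart, so not every cell is guaranteed to match character for character.

```python
SAMPLE = "Ещё раз"


def render(text, initial, decoding):
    """Encode text with one code page, then decode the same bytes with another."""
    return text.encode(initial).decode(decoding, errors="replace")


# Byte representations of the sample in each initial encoding.
hex_forms = {
    "CP 866": SAMPLE.encode("cp866").hex(" "),
    "KOI-8": SAMPLE.encode("koi8_r").hex(" "),
    "CP 1251": SAMPLE.encode("cp1251").hex(" "),
    "Unicode (UCS-2 LE)": SAMPLE.encode("utf-16-le").hex(" "),
}

for name, hex_form in hex_forms.items():
    print(f"{name}: {hex_form}")

# Matching code pages give the text back; mismatched ones give mangled text.
print(render(SAMPLE, "koi8_r", "koi8_r"))
print(render(SAMPLE, "koi8_r", "cp1251"))
```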

Windows 7

Problems viewing or typing various accented letters have been reported under Windows 7. It has also been reported that running Notepad++ in XP or Vista compatibility mode would solve the issue.
