
Latin-1 heuristics in newer versions

sf acc · 2014-10-05 to 2014-10-09
  • sf acc

    sf acc - 2014-10-05

    Hello!

    I realized that the heuristics for "extended ASCII" may have changed.

    In v6.6.9, create a text document with the content "§A§A§A§A§A" (using a hex editor), open it in npp, and it will show 5 Chinese characters (§ = 0xA7 is the "section sign"; see http://www.ascii-code.com).
    The "Encoding" menu doesn't show a check mark anywhere, not even in the "Character sets" submenus.
    Manually choosing "ISO 8859-1" helps.

    With only 4 repetitions, like "§A§A§A§A", this doesn't happen, and "Encode in ANSI" is checked.
    Also, in v6.5.3 it doesn't happen, not even with 5 repetitions.
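
    For reference, a minimal Python sketch that writes exactly those bytes (the file name Test.txt is just an example):

        # 0xA7 is the section sign "§" in Latin-1 / Windows-1252
        with open("Test.txt", "wb") as f:
            f.write(b"\xa7A" * 5)  # "* 4" gives the four-pair file that opens as ANSI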

    Bug or feature? :)

  • THEVENOT Guy

    THEVENOT Guy - 2014-10-05

    Hello sf acc,

    Indeed, you're perfectly right!! It seems there's a strange issue.

    I'm investigating it...

    Encoding problems are difficult because there are numerous defined character sets (due to the many languages used all over the world) and because of the numerous encodings used to write all the possible characters in files!

    Give me two or three hours to get a general idea of this problem, and I'll post the results of my research in this thread!

    See you soon

    Best Regards

    guy038


    Last edit: THEVENOT Guy 2014-10-05
  • Mike Cowperthwaite

    See also bug 4981.

  • THEVENOT Guy

    THEVENOT Guy - 2014-10-06

    Hello Sf acc and Mike,

    I didn't understand the logic N++ uses to choose a wrong encoding, but I found a way to prevent N++ from choosing it :-)

    See you tomorrow for the solution, because it's about 0:45 a.m. in France => not a decent hour to speak about encodings, is it?!

    Cheers,

    guy038


    Last edit: THEVENOT Guy 2014-10-06
  • THEVENOT Guy

    THEVENOT Guy - 2014-10-07

    Hi Sf acc, Mike and All,

    I spent several hours doing numerous tests, but I still don't understand why N++ sometimes opens a file with a wrong encoding :-( Luckily, I found a way to get the right behaviour most of the time. If you're in a hurry, go to the end of this post!!

    To begin with, an obvious remark: the smaller the file, the harder it is to detect the right encoding automatically. With files of only a few bytes (< 10), we may expect unpredictable results!! Still, the present issues are annoying.

    To Sf acc,

    So an ANSI file with contents §A§A§A§A is normally opened as ANSI in N++.

    But if you add a fifth pair "§A" => §A§A§A§A§A, it's opened with the Chinese Big5 encoding?!

    Of course, the displayed characters follow logically from this new encoding, so I won't dwell on them. It's just that N++ shouldn't choose this encoding!

    I tested every possibility, replacing the § character ( \xA7 ) with each byte between \x80 and \xFF.

    => Only 7 characters produce a wrong encoding:

    • The \x81, \x83, \x8b and \x98 characters opened the test file with the Japanese Shift-JIS encoding!

    • The \xa6, \xa7 and \xaa characters opened the test file with the Chinese Big5 encoding!

    The protocol I used was:

    1. Create a zero length file Test.txt

    2. Open Test.txt with the usual Windows Notepad

    3. Write the string €A€A€A€A€A

    4. Save this file with the ANSI encoding, under the name Test.txt

    5. Close Windows Notepad

    6. Open N++ ( Note that the Test.txt tab is still not present )

    7. Open the Test.txt file

    8. Note the encoding, chosen by N++

    9. Close the Test.txt tab first, and then close Notepad++

    10. Return to step 2, replacing the \x80 character ( €, in the Windows-1252 encoding ) with the \x81 character, and so on...
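
    If you want to automate steps 2-10, here is a small Python sketch that generates all the test files at once (one file per byte value; the Test_XX.txt names are just examples):

        # Generate one test file per byte value between \x80 and \xFF,
        # each containing five "<byte>A" pairs, to open in N++ one by one.
        for b in range(0x80, 0x100):
            with open("Test_%02X.txt" % b, "wb") as f:
                f.write((bytes([b]) + b"A") * 5)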

    So it looks like I rather wasted my time, as no logic can be found in these results :-(( However, for these 7 characters, as soon as you reduce the string to eight characters max, everything was always OK?!


    To Mike

    Unfortunately, your file ptest.txt isn't a strict UTF-8 file! Indeed, the hex contents of this file are:

    67 65 6d 2e 20 c2 a7 32 35 61 20 55 53 74 47

    Just note the two consecutive bytes c2 a7, which represent the UTF-8 encoding of the section sign §. So your file is rather a UTF-8 w/o BOM file.
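
    You can verify this with Python, for instance:

        >>> b"\xc2\xa7".decode("utf-8")   # the two bytes decode to the section sign
        '§'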

    Then I opened Windows Notepad, pasted your text gem. §25a UStG and saved it with the UTF-8 encoding. When you do so, Notepad, Notepad++ and every decent text editor put an invisible mark at the very beginning of the file, called the BOM ( Byte Order Mark ), with the exact Unicode value \xFEFF.

    When your file is saved with the Unicode Big Endian encoding ( UCS-2 Big Endian in N++ ), the two invisible bytes added at the very beginning of the file are \xFE\xFF, indicating that the most significant byte of the BOM, \xFE, is written first.

    When your file is saved with the classical Unicode encoding ( UCS-2 Little Endian in N++ ), the two invisible bytes \xFF\xFE are added, indicating that the least significant byte of the BOM, \xFF, is written first.

    When your file is saved with the UTF-8 encoding, the three invisible bytes \xEF\xBB\xBF are added. They simply represent the UTF-8 form of the BOM ( \xFEFF ).

    And when your file is saved, in Notepad++, with the UTF-8 w/o BOM encoding, the file contents are converted to UTF-8, but no BOM is added at the very beginning of the file!
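
    To sum up, these exact byte sequences can be checked with Python's codecs module, for instance:

        import codecs

        print(codecs.BOM_UTF16_BE)       # b'\xfe\xff'      -> UCS-2 Big Endian in N++
        print(codecs.BOM_UTF16_LE)       # b'\xff\xfe'      -> UCS-2 Little Endian in N++
        print(codecs.BOM_UTF8)           # b'\xef\xbb\xbf'  -> UTF-8 ( with BOM )
        print("§".encode("utf-8-sig"))   # b'\xef\xbb\xbf\xc2\xa7' : BOM + UTF-8 text
        print("§".encode("utf-8"))       # b'\xc2\xa7'      : no BOM, i.e. UTF-8 w/o BOM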


    To All,

    I also found some examples that seem even worse!

    1. Consider the string Sél¨¨ ( the beginning of the French word Sélection, followed by two diaereses ), i.e. the hex string 53 e9 6c a8 a8. Opened in N++, it produces the Thai TIS-620 encoding??

    2. If you get rid of the last \xa8 character ( 53 e9 6c a8 ), you get the right ANSI encoding

    3. If you only keep the three characters 53 e9 6c, this time, you get the UTF-8 w/o BOM encoding

    4. And if you choose the string S騨 ( 53 e9 a8 a8 ), Notepad++ chooses the Cyrillic OEM-866 encoding!

    I can't see any logic in these encoding choices!!
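
    If you'd like to recreate these four test files, here is a minimal Python sketch ( the file names are just examples; the comments note the encoding N++ picked in my tests ):

        # Each entry: ( example file name, raw bytes of the test string )
        cases = [
            ("case1.txt", b"\x53\xe9\x6c\xa8\xa8"),  # Sél¨¨ -> opened as TIS-620
            ("case2.txt", b"\x53\xe9\x6c\xa8"),      # Sél¨  -> opened as ANSI
            ("case3.txt", b"\x53\xe9\x6c"),          # Sél   -> opened as UTF-8 w/o BOM
            ("case4.txt", b"\x53\xe9\xa8\xa8"),      # Sé¨¨  -> opened as OEM-866
        ]
        for name, data in cases:
            with open(name, "wb") as f:
                f.write(data)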


    Luckily, I remembered that an Auto detect character encoding option was added in a recent version. I found that it was v6.5.5.

    So, in the latest version, v6.6.9, I UNCHECKED the menu option Preferences - MISC - Auto detect character encoding.

    Then, immediately, all the problems disappeared: N++ chose the right ANSI encoding in all cases, except my cases 3 and 4, where N++ chose the UTF-8 w/o BOM encoding. But remember my preliminary remark about files with tiny contents: with a few more bytes, the ANSI encoding is correctly detected :-))

    So, in N++, for these last two cases, I just chose the menu option Encode in ANSI, added some characters, saved the file and closed it. Then, when re-opened in N++, the encoding was the expected ANSI encoding!

    Cheers

    guy038


    Last edit: THEVENOT Guy 2014-10-07
  • sf acc

    sf acc - 2014-10-09

    Thank you very much!
    For now, disabling the auto detection helps.
    I'll keep an eye on the bug tracker though.

    Greetings from Austria