
Latin-1 heuristics in newer versions

sf acc · 2014-10-05 to 2014-10-09
  • sf acc

    sf acc - 2014-10-05

    Hello!

    I realized that the heuristics for "extended ASCII" may have changed.

    In v6.6.9, create a text document with the content "§A§A§A§A§A" (using a hex editor), open it in npp, and it will show 5 Chinese characters (§ = 0xA7 is the "section sign"; see http://www.ascii-code.com).
    The "Encoding" menu doesn't show a check mark anywhere, not even in the "Character sets" submenus.
    Manually choosing "ISO 8859-1" helps.

    With only 4 repetitions, like "§A§A§A§A", this doesn't happen, and "Encode in ANSI" is checked.
    Also, in v6.5.3 it doesn't happen, not even with 5 repetitions.
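
    For reference, a minimal Python sketch that writes exactly those bytes (the file name Test.txt is just an example):

        # 0xA7 is the section sign "§" in Latin-1 / Windows-1252
        with open("Test.txt", "wb") as f:
            f.write(b"\xa7A" * 5)  # "* 4" gives the four-pair file that opens as ANSI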

    Bug or feature? :)

  • THEVENOT Guy

    THEVENOT Guy - 2014-10-05

    Hello sf acc,

    Indeed, you're perfectly right!! It seems there's a strange issue.

    I'm investigating it...

    Encoding problems are difficult because there are numerous defined character sets (due to the many languages used all over the world) and because of the numerous encodings used to write all the possible characters in files!

    Give me two or three hours to get a general idea of this problem, and I'll post the results of my research in this thread!

    See you soon

    Best Regards

    guy038


    Last edit: THEVENOT Guy 2014-10-05
  • Mike Cowperthwaite

    See also bug 4981.

  • THEVENOT Guy

    THEVENOT Guy - 2014-10-06

    Hello Sf acc and Mike,

    I didn't understand the logic N++ uses to choose a wrong encoding, but I found a way to prevent N++ from choosing it :-)

    See you tomorrow for the solution, because it's about 0:45 a.m. in France => not a decent hour to speak about encodings, is it?!

    Cheers,

    guy038


    Last edit: THEVENOT Guy 2014-10-06
  • THEVENOT Guy

    THEVENOT Guy - 2014-10-07

    Hi Sf acc, Mike and All,

    I spent several hours doing numerous tests, but I still don't understand why N++ sometimes opens a file with a wrong encoding :-( Luckily, I found a way to get the right behaviour most of the time. If you're in a hurry, go to the end of this post!!

    To begin with, an obvious remark: the smaller the file, the harder it is to detect the right encoding automatically. With files of only a few bytes (< 10), we may expect unpredictable results!! Still, the present issues are annoying.

    To Sf acc,

    So an ANSI file with contents §A§A§A§A is normally opened as ANSI in N++.

    But if you add a fifth pair "§A" => §A§A§A§A§A, it's opened with the Chinese Big5 encoding?!

    Of course, the displayed characters follow logically from this new encoding, so I won't dwell on them. It's just that N++ shouldn't choose this encoding!

    I tested every possibility, replacing the § character ( \xA7 ) with each byte between \x80 and \xFF.

    => Only 7 characters produce a wrong encoding:

    • The \x81, \x83, \x8b and \x98 characters opened the test file with the Japanese Shift-JIS encoding!

    • The \xa6, \xa7 and \xaa characters opened the test file with the Chinese Big5 encoding!

    The protocol I used was:

    1. Create a zero length file Test.txt

    2. Open Test.txt with the usual Windows Notepad

    3. Write the string €A€A€A€A€A

    4. Save this file with the ANSI encoding, under the name Test.txt

    5. Close Windows Notepad

    6. Open N++ ( Note that the Test.txt tab is still not present )

    7. Open the Test.txt file

    8. Note the encoding, chosen by N++

    9. Close the Test.txt tab first, and then close Notepad++

    10. Return to step 2, replacing the \x80 character ( €, in the Windows-1252 encoding ) with the \x81 character, and so on...
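
    If you want to automate steps 2-10, here is a small Python sketch that generates all the test files at once (one file per byte value; the Test_XX.txt names are just examples):

        # Generate one test file per byte value between \x80 and \xFF,
        # each containing five "<byte>A" pairs, to open in N++ one by one.
        for b in range(0x80, 0x100):
            with open("Test_%02X.txt" % b, "wb") as f:
                f.write((bytes([b]) + b"A") * 5)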

    So it looks like I rather wasted my time, as no logic can be found in these results :-(( However, for these 7 characters, as soon as you reduce the string to eight characters max, everything was always OK?!


    To Mike

    Unfortunately, your file ptest.txt isn't a strict UTF-8 file! Indeed, the hex contents of this file are:

    67 65 6d 2e 20 c2 a7 32 35 61 20 55 53 74 47

    Just note the two consecutive bytes c2 a7, which represent the UTF-8 encoding of the section sign §. So your file is rather a UTF-8 w/o BOM file.
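
    You can verify this with Python, for instance:

        >>> b"\xc2\xa7".decode("utf-8")   # the two bytes decode to the section sign
        '§'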

    Then I opened Windows Notepad, pasted your text gem. §25a UStG and saved it with the UTF-8 encoding. When you do so, Notepad, Notepad++ and every decent text editor put an invisible mark at the very beginning of the file, called the BOM ( Byte Order Mark ), with the exact Unicode value \xFEFF.

    When your file is saved with the Unicode Big Endian encoding ( UCS-2 Big Endian in N++ ), the two invisible bytes added at the very beginning of the file are \xFE\xFF, indicating that the most significant byte of the BOM, \xFE, is written first.

    When your file is saved with the classical Unicode encoding ( UCS-2 Little Endian in N++ ), the two invisible bytes \xFF\xFE are added, indicating that the least significant byte of the BOM, \xFF, is written first.

    When your file is saved with the UTF-8 encoding, the three invisible bytes \xEF\xBB\xBF are added. They simply represent the UTF-8 form of the BOM ( \xFEFF ).

    And when your file is saved, in Notepad++, with the UTF-8 w/o BOM encoding, the file contents are converted to UTF-8, but no BOM is added at the very beginning of the file!
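
    To sum up, these exact byte sequences can be checked with Python's codecs module, for instance:

        import codecs

        print(codecs.BOM_UTF16_BE)       # b'\xfe\xff'      -> UCS-2 Big Endian in N++
        print(codecs.BOM_UTF16_LE)       # b'\xff\xfe'      -> UCS-2 Little Endian in N++
        print(codecs.BOM_UTF8)           # b'\xef\xbb\xbf'  -> UTF-8 ( with BOM )
        print("§".encode("utf-8-sig"))   # b'\xef\xbb\xbf\xc2\xa7' : BOM + UTF-8 text
        print("§".encode("utf-8"))       # b'\xc2\xa7'      : no BOM, i.e. UTF-8 w/o BOM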


    To All,

    I also found some examples that seem even worse!

    1. Consider the string Sél¨¨ ( the beginning of the French word Sélection, followed by two diaereses ), i.e. the hex string 53 e9 6c a8 a8. Opened in N++, it produces the Thai TIS-620 encoding??

    2. If you get rid of the last \xa8 character ( 53 e9 6c a8 ), you get the right ANSI encoding

    3. If you only keep the three characters 53 e9 6c, this time, you get the UTF-8 w/o BOM encoding

    4. And if you choose the string S騨 ( 53 e9 a8 a8 ), Notepad++ chooses the Cyrillic OEM-866 encoding!

    I can't see any logic in these encoding choices!!
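
    If you'd like to recreate these four test files, here is a minimal Python sketch ( the file names are just examples; the comments note the encoding N++ picked in my tests ):

        # Each entry: ( example file name, raw bytes of the test string )
        cases = [
            ("case1.txt", b"\x53\xe9\x6c\xa8\xa8"),  # Sél¨¨ -> opened as TIS-620
            ("case2.txt", b"\x53\xe9\x6c\xa8"),      # Sél¨  -> opened as ANSI
            ("case3.txt", b"\x53\xe9\x6c"),          # Sél   -> opened as UTF-8 w/o BOM
            ("case4.txt", b"\x53\xe9\xa8\xa8"),      # Sé¨¨  -> opened as OEM-866
        ]
        for name, data in cases:
            with open(name, "wb") as f:
                f.write(data)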


    Luckily, I remembered that an Auto detect character encoding option was added in a recent version. I found that it was v6.5.5.

    So, in the latest version, v6.6.9, I UNCHECKED the menu option Preferences - MISC - Auto detect character encoding.

    Then, immediately, all the problems disappeared: N++ chose the right ANSI encoding in all cases, except my cases 3 and 4, where N++ chose the UTF-8 w/o BOM encoding. But remember my preliminary remark about files with tiny contents: with a few more bytes, the ANSI encoding is correctly detected :-))

    So, in N++, for these last two cases, I just chose the menu option Encode in ANSI, added some characters, saved the file and closed it. Then, when re-opened in N++, the encoding was the expected ANSI encoding!

    Cheers

    guy038


    Last edit: THEVENOT Guy 2014-10-07
  • sf acc

    sf acc - 2014-10-09

    Thank you very much!
    For now, disabling the auto detection helps.
    I'll keep an eye on the bug tracker though.

    Greetings from Austria