#4781 Wrong encoding when opening document

Milestone: All
Status: accepted
Owner: nobody
Labels: encoding (7)
Priority: 6
Updated: 2014-09-01
Created: 2014-04-17
Creator: Watilin
Private: No

Version: 6.5.5 Unicode
OS: Win XP 32 bits

Notepad++ doesn't correctly detect the encoding of files saved as UTF-8 without BOM when they contain "§" (U+00A7, UTF-8 sequence C2 A7). It detects them as TIS-620 (Thai) instead.

Steps to reproduce

  1. create a new document
  2. set encoding to UTF-8 without BOM
  3. type any text containing "§"
  4. save and close the document
  5. re-open the document

Expected behavior

The document should be detected as UTF-8 without BOM and be left unchanged.
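
To illustrate why this expectation seems reasonable, here is a small Python sketch (not Notepad++ code; the helper name `is_valid_utf8` is made up for this example). It only shows that the bytes C2 A7 form a well-formed UTF-8 sequence, whereas a lone A7 byte, as a single-byte code page would store "§", does not:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


section_sign_utf8 = "§".encode("utf-8")

print(section_sign_utf8.hex())           # c2a7
print(is_valid_utf8(section_sign_utf8))  # True
print(is_valid_utf8(b"\xa7"))            # False: A7 alone is a stray continuation byte
```

Passing a strict UTF-8 check does not prove that a file is UTF-8, since the same bytes are also valid in TIS-620, so this is only a sketch of the expected outcome, not a claim about how the detector works or should work.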

Observed behavior

The document is detected as TIS-620, and all occurrences of "§" are replaced with the two Thai characters "ยง" (U+0E22 "Yo Yak" and U+0E07 "Ngo Ngu"), which appear as two question marks in boxes.
When I select the "Encode in UTF-8 without BOM" command, nothing happens.
When I select "Convert to UTF-8 without BOM", the characters "ยง" are converted into their UTF-8 sequences (E0 B8 A2, E0 B8 87) -- I checked with a hex editor.
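
The byte values above can be reproduced outside Notepad++. The following Python sketch is only an illustration; it assumes that Python's `tis_620` codec maps these two bytes the same way as the TIS-620 code page Notepad++ uses:

```python
original = "§"

# Bytes written to disk when the file is saved as UTF-8 without BOM.
utf8_bytes = original.encode("utf-8")
print(utf8_bytes.hex())                          # c2a7

# The same two bytes read as TIS-620 become two Thai characters.
misread = utf8_bytes.decode("tis_620")
print([f"U+{ord(ch):04X}" for ch in misread])    # ['U+0E22', 'U+0E07']

# "Convert to UTF-8" on the misread text then stores the Thai characters.
resaved = misread.encode("utf-8")
print(resaved.hex())                             # e0b8a2e0b887

# Before any conversion is saved, the misreading is still reversible.
recovered = misread.encode("tis_620").decode("utf-8")
print(recovered == original)                     # True
```

This matches what the hex editor shows: the file on disk still contains C2 A7 until a conversion is saved, after which it contains E0 B8 A2 E0 B8 87.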

This bug practically prevents the user from using "§" because, each time they open the document, they have to:

  1. Switch the encoding back to UTF-8 without BOM
  2. Perform a search/replace to restore all occurrences of "§"

I made a quick search and found that it may be related to #4731 and #4744.

Discussion

  • Watilin
    2014-04-17

    See also #4413, which is about Thai characters.

     
  • Don HO
    2014-04-29

    • status: open --> accepted
    • Priority: 5 --> 6
     
  • v_decadence
    2014-05-02

    I have a similar problem with Cyrillic. When I open some files, Notepad++ shows the Macintosh encoding instead of Windows-1251, and some characters (e.g. С and Я) are broken. Setting the right encoding manually and then re-saving and reopening the file does not help.
    How can I disable the autodetection feature? Everything worked fine before.