Problem with PHP file and Thai characters

2. Help
John
2011-02-24
2013-07-28
  • John
    2011-02-24

    A file with a .php extension and Thai characters will not open as "UTF-8 without BOM".  Here are some examples, and all work except a .php file with Thai characters. Note that a .html file with Thai opens correctly.
    File with .html extension and Spanish characters: opens correctly as "UTF-8 without BOM"
    File with .html extension and Thai characters: opens correctly as "UTF-8 without BOM"
    File with .php extension and Spanish characters: opens correctly as "UTF-8 without BOM"
    File with .php extension and Thai characters: OPENS AS ANSI AND TRASHES THE THAI CHARACTERS

    Even if I select "UTF-8 without BOM" as the encoding setting and check "Apply to opened ANSI files", the file still opens as ANSI.

    Sample files can be downloaded from http://ic.payap.ac.th/SampleFiles.zip.

    Can someone help with a workaround other than manually selecting the "Encode in UTF-8 without BOM" menu option every time? Should this be added as a bug?
    Any help is greatly appreciated.
    Thanks,
    John

     
  • cchris
    2011-02-24

    Your failing file fools Notepad++ by declaring a UTF-8 charset while not having a BOM. Since the file is thus inconsistent, N++ opens it as ANSI.
    After removing the charset meta attribute, the file opens as UTF-8.

    CChris

     
  • John
    2011-02-25

    CChris,

    Thanks for your response. Unfortunately the file still opens as ANSI for me. I've tried removing the whole meta Content-Type line and tried removing just the charset=UTF-8 portion, but in all cases it still opens as ANSI for me.

    If the meta tag line is the problem then why does it work if the file has a .html extension and why does it work with a .php file that has Spanish characters?

    I guess I just don't understand why it seems that file extension matters to N++ when deciding if the file should open as ANSI or UTF-8 without BOM. Why doesn't it open the .php file the same as the .html file?

    I don't think I would have to deal with this issue if the web server was running PHP5, but it is running PHP4 and I don't have any control over that so I'm stuck with having to save the files as UTF-8 without BOM.

    Thanks again for your time.
    John

     
  • cchris
    2011-02-25

    What is the locale on your OS? And which OS, for that matter? On XP with a French locale, I get the file to open as UTF-8 when I remove the charset part.

    CChris

     
  • John
    2011-02-28

    I'm using WinXP with U.S. locale.
    When you say yours opened as UTF-8, do you mean "UTF-8" or "UTF-8 without BOM"? If you look in the Encoding menu, which of those options is selected? I need mine to be "UTF-8 without BOM" because, if the BOM is saved, the website doesn't display properly under PHP4 (which I have to use for now).

    Thanks,
    John

     
  • John
    2011-02-28

    Thanks for the help, but I've decided to try and request that the server be upgraded to PHP5. I think that will be better in the long run. Hopefully this won't be an issue then because I think I'll be able to save as UTF-8 with BOM.
    Thanks,
    John

     
  • cchris
    2011-02-28

    With the charset attribute, the file opens as ANSI.
    Without the charset directive, it opens as UTF-8 (with BOM).

    CChris

     
  • I can confirm having this issue as well. I have a file that I have been using in a project for some time, and suddenly it stopped opening as UTF-8. Nothing I tried would make N++ open it as UTF-8 (without BOM), and eventually I discovered it was due to the addition of Thai characters.

    It really seems like a bug to me, since it works without these characters.

     
  • Typo
    2013-07-28

    I cannot believe how old this post is and this bug still exists.

    I just wasted hours today, not to mention the time I wasted previously, all because a Thai character ("฿" in this case) completely breaks the encoding detection. Any file with that character (and, I would assume, other Thai characters, after reading the above) will open as ANSI regardless of settings or any previous encoding conversion.

     
  • You're right. That's ridiculous. The UTF-8 detection code had a bug: for a certain range of characters (which includes U+0E3F, your Thai Baht symbol), it would incorrectly decide that the file was not valid UTF-8 and must therefore be in an 8-bit encoding (i.e. "ANSI").

    I've created a patch on the patch tracker to fix this. https://sourceforge.net/p/notepad-plus/patches/513/

    My build now works correctly with this patch for your scenario.

    Hopefully this will be integrated by Don soon.

    Dave.
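
For anyone who wants to check a file outside the editor, here is a minimal Python sketch of the kind of check discussed in this thread. It is not the Notepad++ code and not the patch linked above. The function names `is_valid_utf8` and `guess_encoding` are made up for illustration, and the returned labels simply mirror the Encoding menu entries mentioned earlier. It looks for a BOM first, then tests whether the bytes are well-formed UTF-8, and only falls back to "ANSI" when they are not. The Baht sign U+0E3F is encoded as the bytes E0 B8 BF, which a correct validator has to accept.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hand-rolled for illustration)."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                # 1-byte sequence (plain ASCII)
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:       # 2-byte sequence
            length, low, high = 2, 0x80, 0xBF
        elif 0xE0 <= lead <= 0xEF:     # 3-byte sequence (Thai lives here)
            length = 3
            # After E0 the next byte must be A0..BF (rejects overlong forms);
            # after ED it must be 80..9F (rejects surrogates).
            low = 0xA0 if lead == 0xE0 else 0x80
            high = 0x9F if lead == 0xED else 0xBF
        elif 0xF0 <= lead <= 0xF4:     # 4-byte sequence
            length = 4
            low = 0x90 if lead == 0xF0 else 0x80
            high = 0x8F if lead == 0xF4 else 0xBF
        else:
            return False               # stray continuation byte or invalid lead
        if i + length > n:
            return False               # sequence is cut off at end of data
        if not low <= data[i + 1] <= high:
            return False
        for b in data[i + 2 : i + length]:
            if not 0x80 <= b <= 0xBF:  # remaining bytes are plain continuations
                return False
        i += length
    return True


def guess_encoding(data: bytes) -> str:
    """Guess an encoding label for raw file bytes (assumed labels, see above)."""
    if data.startswith(UTF8_BOM):
        return "UTF-8"
    if is_valid_utf8(data):
        # Note: pure ASCII also ends up here, since ASCII is valid UTF-8.
        return "UTF-8 without BOM"
    return "ANSI"
```

To test a real file, read it in binary mode and pass the bytes in, e.g. `guess_encoding(open(path, "rb").read())`, where `path` is whatever file you are checking. Python's built-in `data.decode("utf-8")` performs the same validity check and raises `UnicodeDecodeError` on failure; the explicit loop is only there to show where a range check can go wrong.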