Problem with PHP file and Thai characters

John
2011-02-24
2013-07-28
  • John

    John - 2011-02-24

    A file with a .php extension that contains Thai characters will not open as "UTF-8 without BOM". Of the four cases below, only the .php file with Thai characters fails. Note that a .html file with Thai characters opens correctly.
    File with .html extension and Spanish characters: opens correctly as "UTF-8 without BOM"
    File with .html extension and Thai characters: opens correctly as "UTF-8 without BOM"
    File with .php extension and Spanish characters: opens correctly as "UTF-8 without BOM"
    File with .php extension and Thai characters: OPENS AS ANSI AND TRASHES THE THAI CHARACTERS

    Even if I enable the "UTF-8 without BOM" encoding setting and check "Apply to opened ANSI files", the file still opens as ANSI.

    Sample files can be downloaded from http://ic.payap.ac.th/SampleFiles.zip.

    Can someone help with a workaround other than manually selecting the "Encode in UTF-8 without BOM" menu option every time? Should this be added as a bug?
    Any help is greatly appreciated.
    Thanks,
    John

     
  • cchris

    cchris - 2011-02-24

    Your failing file fools Notepad++ by declaring a UTF-8 charset while not having a BOM. Since the file is thus inconsistent, N++ opens it as ANSI.
    With the charset meta attribute removed, the file opens as UTF-8.

    CChris

     
  • John

    John - 2011-02-25

    CChris,

    Thanks for your response. Unfortunately the file still opens as ANSI for me. I've tried removing the whole meta Content-Type line and tried removing just the charset=UTF-8 portion, but in all cases it still opens as ANSI for me.

    If the meta tag line is the problem then why does it work if the file has a .html extension and why does it work with a .php file that has Spanish characters?

    I guess I just don't understand why it seems that file extension matters to N++ when deciding if the file should open as ANSI or UTF-8 without BOM. Why doesn't it open the .php file the same as the .html file?

    I don't think I would have to deal with this issue if the web server was running PHP5, but it is running PHP4 and I don't have any control over that so I'm stuck with having to save the files as UTF-8 without BOM.

    Thanks again for your time.
    John

     
  • cchris

    cchris - 2011-02-25

    What is the locale on your OS? And your OS, for that matter? On XP with French locale, I get the file to open as UTF-8 when I remove the charset: part.

    CChris

     
  • John

    John - 2011-02-28

    I'm using WinXP with U.S. locale.
    When you say yours opened as UTF-8, do you mean "UTF-8" or "UTF-8 without BOM"? If you look in the Encoding menu, which one of those options is selected? I need mine to be "UTF-8 without BOM", because if the BOM is saved the website doesn't display properly under PHP4 (which I have to use for now).
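    For anyone who needs to tell the two cases apart outside the editor, here is a minimal Python sketch. It relies only on the fact that the UTF-8 BOM is the byte sequence EF BB BF at the very start of the file; the helper names and the sample text are made up for illustration and are not part of Notepad++.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, unchanged if there is none."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data

# Hypothetical sample content containing Thai text and the Baht sign.
sample = "ราคา 100 ฿".encode("utf-8")   # UTF-8 without BOM
with_bom = codecs.BOM_UTF8 + sample     # UTF-8 with BOM

print(has_utf8_bom(sample), has_utf8_bom(with_bom))
```

    Reading a file in binary mode (`open(path, "rb").read()`) and passing the bytes to `has_utf8_bom` shows which of the two forms was actually saved, and `strip_utf8_bom` gives the BOM-free bytes to write back.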

    Thanks,
    John

     
  • John

    John - 2011-02-28

    Thanks for the help, but I've decided to try and request that the server be upgraded to PHP5. I think that will be better in the long run. Hopefully this won't be an issue then because I think I'll be able to save as UTF-8 with BOM.
    Thanks,
    John

     
  • cchris

    cchris - 2011-02-28

    With the charset attribute, the file opens as ANSI.
    Without the charset directive, it opens as UTF-8 (with BOM).

    CChris

     
  • Philip Nicolcev

    Philip Nicolcev - 2012-10-22

    I can confirm having this issue as well. I have a file I have been using in a project for some time, and suddenly it stopped opening as UTF-8. Nothing I tried would make N++ open it as UTF-8 (without BOM), and eventually I discovered it was due to the addition of Thai characters.

    It really seems like a bug to me, since it works without these characters.

     
  • Typo

    Typo - 2013-07-28

    I cannot believe how old this post is and this bug still exists.

    I just wasted hours today, not to mention the time I wasted previously, all because Thai characters ("฿" in this case) completely break the encoding detection. Any file with that character (and, I would assume after reading the above, other Thai characters) will open as ANSI regardless of settings or any previous encoding conversion.

     
  • Dave Brotherstone

    You're right. That's ridiculous. The UTF-8 detection code had a bug: for a certain range of characters (U+0E3F, your Thai Baht symbol, is one of them), it would incorrectly decide the bytes were an invalid UTF-8 encoding, and therefore that the file must be in an 8-bit encoding (i.e. "ANSI").
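    To make the failure mode concrete, here is a minimal Python sketch of a strict UTF-8 validity check. It is not the Notepad++ code and not the patch; it only illustrates the standard rules such a detector has to follow. U+0E3F encodes as the three bytes E0 B8 BF, and after a lead byte of E0 the second byte must be in the range A0..BF. Spanish accented letters are two-byte sequences and never hit the E0 case, which may be why only the Thai files were affected. The function name is made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check that data is well-formed UTF-8 (no overlongs, no surrogates)."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # single-byte (ASCII)
            i += 1
            continue
        # For each lead byte: number of continuation bytes and the
        # allowed range of the FIRST continuation byte.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:                   # Thai (U+0E00..U+0E7F) starts here
            need, lo, hi = 2, 0xA0, 0xBF  # below A0 would be an overlong form
        elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes UTF-16 surrogates
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # nothing above U+10FFFF
        else:
            return False                  # stray continuation or invalid lead byte
        if i + need >= n:
            return False                  # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):      # remaining continuation bytes
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True

baht = "\u0e3f".encode("utf-8")           # the Thai Baht sign
print(baht.hex(), is_valid_utf8(baht))
```

    A detector that applies a wrong range check in the E0 branch would reject perfectly valid Thai text and fall back to ANSI, which matches the symptom reported in this thread.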

    I've created a patch on the patch tracker to fix this. https://sourceforge.net/p/notepad-plus/patches/513/

    My build now works correctly with this patch for your scenario.

    Hopefully this will be integrated by Don soon.

    Dave.

     
