automatic encoding detection

2005-09-21
2012-11-13
  • Hrotkó Gábor

    Hrotkó Gábor - 2005-09-21

    I think there should be automatic encoding detection. Right now I have to select the UTF-8 encoding by hand for every one of my UTF-8 encoded PHP files.

     
    • Paulius

      Paulius - 2005-09-22

      The only way for Notepad++ to recognize UTF-8 encoding is to put a BOM into the file. Notepad++ does this when you select the menu option Format->Encode in UTF-8.
      However, the PHP interpreter doesn't recognize the BOM, so it will OUTPUT those three bytes to the browser. So you cannot use files encoded this way for PHP scripts.
      Hence the menu option Format->UTF-8 without BOM. But when there is no BOM, Notepad++ cannot recognize the file as UTF-8 encoded... I think you get the idea by now...
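
For reference, the UTF-8 BOM discussed here is the three bytes EF BB BF at the very start of the file. A minimal Python sketch of checking for it and stripping it follows; the function names are made up for illustration and this is not how Notepad++ itself is written:

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Stripping the BOM before a script reaches the PHP interpreter would avoid the stray output described above, at the cost of losing the only marker the editor currently relies on.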

       
    • chorlya

      chorlya - 2005-09-27

      I don't know how, but Notepad2 *does* detect UTF-8 files even without a BOM. I guess it scans through the file looking for some UTF-8 specific byte sequences and then makes its guess based on that.
      I've been using it for almost 2 years now, mainly for that reason alone. Otherwise I work in Notepad++ :)
      I need that a lot, since I also work with PHP, and UTF-8 files with a BOM are just not acceptable.
      Any chance of adding detection for UTF-8 files without BOM to Notepad++?
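
The kind of byte-sequence scan guessed at above might look roughly like the Python sketch below. It is an assumption about the general technique, not Notepad2's actual code; the function name is hypothetical, and the check is simplified (it does not reject every overlong form or surrogate code point):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Guess whether BOM-less data is UTF-8.

    Returns True only if at least one multi-byte sequence is present and
    every byte >= 0x80 belongs to a structurally well-formed sequence.
    """
    i, n = 0, len(data)
    seen_multibyte = False
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # Plain ASCII byte: valid in UTF-8 and ANSI alike, so no evidence.
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            trailing = 1
        elif 0xE0 <= lead <= 0xEF:
            trailing = 2
        elif 0xF0 <= lead <= 0xF4:
            trailing = 3
        else:
            # Stray continuation byte or a lead byte UTF-8 never uses.
            return False
        tail = data[i + 1:i + 1 + trailing]
        # Every trailing byte must have the bit pattern 10xxxxxx.
        if len(tail) != trailing or any((b & 0xC0) != 0x80 for b in tail):
            return False
        seen_multibyte = True
        i += 1 + trailing
    return seen_multibyte
```

Pure ASCII input returns False here because it gives no evidence either way; a caller could treat that case as "keep the default encoding".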

       
    • Nobody/Anonymous

      I agree, there must be a way, since EditPlus and even Notepad on Windows XP do that...

      It would be a nice thing to have in Npp, I work with XML in Flash, and the default encoding is UTF-8.

      Rico

       
    • Josh Harris

      Josh Harris - 2005-09-29

      The relevant function is IsUTF8() from edit.c of the Notepad2 source.  If you add that and its supporting functions and defines to Utf8_16.cpp you can get Notepad++ to auto-detect UTF-8 files that do not have a signature.

      Here's the modified source for Utf8_16.cpp and Utf8_16.h if anyone wants to take a look at it:
      http://www.tateu.net/software/dl.php?f=Utf8_16_src

       
      • Nobody/Anonymous

        Josh,

        your Utf8_16_src works fine.
        I will integrate it into the project (if you agree with that, of course) to add auto-detection of UTF-8 without BOM.

        Don

         
    • Hrotkó Gábor

      Hrotkó Gábor - 2005-09-29

      When I have a PHP file with this encoding: Format/Encode in UTF-8 [with BOM]

      The BOM characters mess up the header handling in php:

      --------------------------------------
      ďťż<?php
      ob_start();
          if(!isset($PHP_AUTH_USER))
          {
              Header("WWW-Authenticate: Basic realm=\"test\"");
              Header("HTTP/1.0 401 Unauthorized");
              echo "errorstring";
              exit();
          }
          if (($PHP_AUTH_USER != "test") || ($PHP_AUTH_PW != "test"))
          {
              Header("WWW-Authenticate: Basic realm=\"test\"");
              Header("HTTP/1.0 401 Unauthorized");
              echo "errorstring";
              exit();
          }
      ob_end_flush();
      ?>

      <html>
      <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      </head>
      <body background="../../images/blue_vil.gif">
      ő
      </body>
      </html>
      --------------------------------------
      the server answers:

      Warning: Cannot modify header information - headers already sent by (output started at /www/data/...php:1) in /www/data/...php on line 5

      Warning: Cannot modify header information - headers already sent by (output started at /www/data/...php:1) in /www/data/...php on line 6
      errorstring

      but when I change the encoding to
      Format/Encode in ANSI, so that the PHP file no longer starts with the BOM characters (ďťż), it works fine.

      I think np++ shouldn't support BOM.

       
      • Paulius

        Paulius - 2005-09-29

        Yes, I agree the BOM isn't friendly to PHP scripts (that send headers). But the BOM is part of the Unicode standard, and it is (or might be) important for other file formats (not PHP scripts). So it is important. Saying that Notepad++ shouldn't support BOM is wrong. In fact, Notepad++ MUST support BOM.

        There could be some UTF-8 autodetection, but I don't think it would work in all cases (Unicode defines more than 60,000 characters; checking whether any of those are in the document would be slow and close to impossible).

        PS: Microsoft Notepad opens a "UTF-8 without BOM" encoded file correctly, but when you hit File->Save, the file is re-encoded to "UTF-8 with BOM". So Microsoft Notepad shouldn't be mentioned here as an example.

         
    • Nobody/Anonymous

      I agree, BOM may not be good for PHP but it can be useful for many other purposes.

      Some autodetection would be useful, at least for the most common characters, not all the 60000 characters.

      As for Notepad, I didn't mean that it supports BOM, I just said that it detects UTF-8.

      Rico

       
    • Josh Harris

      Josh Harris - 2005-09-29

      I certainly don't mind but it's not my code.  I just copied it over from the Notepad2 source.  Notepad2 is also GPL'd so I think it should be ok but I've never really looked into the GPL license for the exact details.

      Also, two other things I noticed this morning... 1) The encoding detection only checks the first blockSize (131,072) bytes, which is probably a good thing, though, since checking a large document adds loading time. 2) ANSI files greater than or equal to blockSize bytes are getting incorrectly detected as UTF-8.  This seems to be because data is not NULL terminated after fread: there is no 0 character at the end of the "char data[blockSize]" buffer after each fread call.  The problem seems to be fixed by adding "memset(&data[lenFile], 0, 1);" after each fread call in Notepad_plus.cpp.

       
    • Josh Harris

      Josh Harris - 2005-09-29

      Forget about the fix from item 2) above.  It only worked when compiled in Debug mode; in Release mode it caused a buffer overrun.  Instead, I added the null terminator in UTF8_mbslen_bytes and that seemed to work better.  I'll put up the modified source in about an hour.
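
One way to sidestep the terminator problem is to pass around only the bytes that were actually read and to tolerate a multi-byte sequence cut off at the block boundary. The Python sketch below illustrates that idea only; it is not the C++ fix described above, the function name is hypothetical, and the 131,072 figure is simply the blockSize quoted in the earlier post:

```python
import codecs

BLOCK_SIZE = 131072  # blockSize figure quoted in the thread


def first_block_is_utf8(stream, block_size: int = BLOCK_SIZE) -> bool:
    """Check whether the first block of a binary stream is valid UTF-8.

    Only the bytes actually returned by read() are examined, so no
    terminator is needed. A sequence truncated at the end of the block
    is buffered by the incremental decoder instead of being rejected.
    Note that pure ASCII also counts as valid here.
    """
    block = stream.read(block_size)
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(block, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

A stream opened with `open(path, "rb")` can be passed in directly; the limitation is that a lone high byte at the very end of the block is given the benefit of the doubt.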

       
      • Nobody/Anonymous

        Hi,
        is there a way today (in version 3.2) to set the encoding according to the file extension?
        For instance, you could choose always UTF-8 for XML, the same for Java, ANSI for some other types, and so on.
        You might consider this, since you can already define a specific style for each language.

         
    • Josh Harris

      Josh Harris - 2005-09-29

      Here's the modified version I was talking about, same link:
      http://www.tateu.net/software/dl.php?f=Utf8_16_src

       
      • Don HO

        Don HO - 2005-09-29

        Josh,

        I tested it with some UTF-8 files without a signature,
        and it seems good. This feature will be in the next release.

        Thank you.

        Don

         
        • Don HO

          Don HO - 2005-09-30

          The code you posted won't be included in the next release after all,
          since the author of Notepad2 requires that Notepad2 be mentioned in the Notepad++ documentation and in the About box.

          Don

           
          • Don HO

            Don HO - 2005-12-21

            The automatic encoding detection for UTF-8 without BOM will be supported by Notepad++ in v3.4 for good.

            I make a point of thanking Chris Severance for his compact, efficient and elegant code of this feature, especially for his generosity to share it.

            Here's further info about v3.4 RC:
            http://sourceforge.net/forum/forum.php?thread_id=1407017&forum_id=331753

            enjoy

            Don

             
    • Hrotkó Gábor

      Hrotkó Gábor - 2005-10-12

      Maybe there should at least be some workaround: when the file has a meta tag or declaration like:

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      or
      <?xml version="1.0" encoding="UTF-8"?>

      Notepad++ should switch automatically to "UTF-8 without BOM" (when the file has no BOM).

      There is also another request, to remember this "UTF-8 without BOM" option and make it the default:

      http://sourceforge.net/tracker/index.php?func=detail&aid=1238719&group_id=95717&atid=612385
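
The meta-tag workaround proposed above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name and the regular expressions are made up here, they look at whatever leading bytes the caller passes in, and they are far looser than a real HTML or XML parser:

```python
import re

# Hypothetical patterns for the two declarations quoted in the post.
_XML_DECL = re.compile(
    rb'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)
_META_CHARSET = re.compile(
    rb'<meta[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)


def declared_charset(head: bytes):
    """Return the charset named in an XML declaration or HTML meta tag.

    The result is lower-cased; None means no declaration was found.
    """
    for pattern in (_XML_DECL, _META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None
```

A file with no declaration, such as a PHP include, yields None, so this can only ever supplement some other detection method.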

       
      • Paulius

        Paulius - 2005-10-12

        Are you suggesting that Notepad++ should parse the files it opens, looking for such headers? Even if the file were well-formed HTML or XML (badly formed would be files that have PHP or other scripting code in them), it would heavily affect Notepad++ performance. And besides, that would only work with files that HAVE a header with such information. What if you are writing a PHP script that prints some (UTF-8 encoded) text but doesn't have the headers, because the file is supposed to be included from another PHP script which has the header?

        IMHO - I don't think that would work for Notepad++.

        Basically, I think there are two options:
        1) when opening a file, check if it has some byte sequences that might be UTF-8 encoded characters;
        2) use some external file to remember (store) the encodings of all files that it opens (or in fact, closes).

        Those two options have their own advantages and drawbacks:
        option 1:
        a) it's impossible to guess correctly in all cases whether a file is UTF-8 without BOM or ANSI;
        b) someone has to investigate and give a short list of the most frequently used UTF-8 characters;
        c) Notepad++ will open files more slowly, depending on the size of this character list.
        option 2:
        a) if you frequently look at source files, scripts etc., your encodings file will grow very fast;
        b) as the file grows, it will take more and more time for Notepad++ to find the right encoding;
        c) to prevent the file from becoming too large, there must be some method for trimming old entries for files that were not opened within a predefined period of time.

        Personally, I don't think it's worth doing something like that. I don't find it that hard to pick a menu item each time I open a file. But I guess that's just me... :)
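
Option 2 above, including the trimming of old entries, could be sketched in Python as below. The class and method names are hypothetical, and the store is kept in memory only; writing it to the external file mentioned in the post is left out:

```python
import time


class EncodingCache:
    """Remembers the encoding last used for each file path."""

    def __init__(self, max_age_seconds: float):
        self.max_age = max_age_seconds
        self._entries = {}  # path -> (encoding, last_used_timestamp)

    def remember(self, path, encoding, now=None):
        """Store or refresh the encoding for a path."""
        now = time.time() if now is None else now
        self._entries[path] = (encoding, now)

    def lookup(self, path, now=None):
        """Return the remembered encoding, or None if the path is unknown.

        A successful lookup counts as a use and refreshes the timestamp.
        """
        entry = self._entries.get(path)
        if entry is None:
            return None
        self.remember(path, entry[0], now)
        return entry[0]

    def prune(self, now=None):
        """Drop entries unused for longer than max_age; return the count."""
        now = time.time() if now is None else now
        stale = [
            path
            for path, (_, used) in self._entries.items()
            if now - used > self.max_age
        ]
        for path in stale:
            del self._entries[path]
        return len(stale)
```

Because the entries live in a dictionary keyed by path, lookup time does not grow with the number of remembered files, which softens drawback b); drawback a) is what prune() is for.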

         
    • Nobody/Anonymous

      Why doesn't the next version of Notepad++ auto-detect the encoding? I have waited so long :(

       
    • Nobody/Anonymous

      Hi, I found only this thread about the encoding issue.

      What about detection of other encodings (Cyrillic)?

      I mean, if I want to view the source of an HTML file served with Content-Type: text/html; charset=windows-1251 and Content-Language: RU, where can I change the encoding for fonts?

       
