#557 Autodetect character encoding

Milestone: Next_major_release
Status: closed
Owner: nobody
Labels: None
Priority: 8
Updated: 2014-08-28
Created: 2014-02-08
Creator: Giles Payne
Private: No

This patch adds automatic detection of the character encoding when a document is opened, using the universal charset detection library from Mozilla. It also adds an option to enable/disable this feature (enabled by default).
A brief search of the Feature Requests found 3 requests that would probably be solved by this:

1701 Encoding and format should work with existing documents

1820 Set character set according to windows settings on file open

1908 Detect encoding of files, for example ciryllic

1 Attachments

Discussion

  • Don HO

    Don HO - 2014-02-16

    Hi Giles,

    Thank you for your patch. It's indeed a very interesting feature.
    I had some problems applying your patch.
    Could you send me the "uchardet" folder and notepadPlus.vcproj directly, please?

    Don

     
  • Giles Payne

    Giles Payne - 2014-02-17

    Hi Don,
    I'm attaching the zipped uchardet folder and the notepadPlus.vcproj file.
    Giles

     

    Last edit: Giles Payne 2014-02-17
    • Don HO

      Don HO - 2014-02-23

      Thank you Giles.

      I integrated your patch and tested it with three files encoded in big5, gb2312 and windows-1250 (see attached file); all three detections are wrong.

      Could you try them with your compiled binary and tell me the result?

      Don

       
  • Giles Payne

    Giles Payne - 2014-02-23

    Hi Don,

    I tested the files with my build and got the following results:

    big5: recognized as TIS-620 (Thai), confidence level 0.2698 (big5 confidence level 0.00999)
    gb2312: recognized as IBM855 (Cyrillic), confidence level 0.5119 (gb2312 confidence level 0.00999)
    win1250: recognized as windows-1252, confidence level 0.5 (win1250 confidence level N/A)

    Looking through the uchardet source code, I found the following commented-out code (in nsSBCSGroupProber.cpp), which appears to indicate that win1250 detection is disabled because win1250 and win1252 are difficult to tell apart:
    // disable latin2 before latin1 is available, otherwise all latin1
    // will be detected as latin2 because of their similarity.
    //mProbers[10] = new nsSingleByteCharSetProber(&Latin2HungarianModel);
    //mProbers[11] = new nsSingleByteCharSetProber(&Win1250HungarianModel);

    In general, character encoding detection is very difficult with very short files like these, so I am not too surprised by these results.

    To avoid this kind of misdetection, the only thing I can suggest is to increase the value of MINIMUM_THRESHOLD in nsUniversalDetector.cpp by changing this line:

    #define MINIMUM_THRESHOLD (float)0.20

    (maybe try setting to something like 0.6)
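
    To illustrate why a higher threshold helps, here is a minimal sketch of threshold-gated selection. It is not the uchardet code: the Candidate struct and the pickCharset function are hypothetical names invented for this example, and the strict "greater than" comparison is an assumption. The idea is only that the best guess is reported when its confidence beats the threshold, and otherwise nothing is reported.

```cpp
#include <string>
#include <vector>

// Hypothetical record (not a uchardet type): one candidate charset
// and the confidence the detector assigned to it.
struct Candidate {
    std::string charset;
    float confidence;
};

// Return the highest-confidence candidate, but only if its confidence
// is strictly above minimumThreshold (the comparison is an assumption).
// An empty string means "no confident guess", so the caller can keep
// its default encoding instead of trusting a weak match.
std::string pickCharset(const std::vector<Candidate>& candidates,
                        float minimumThreshold) {
    const Candidate* best = nullptr;
    for (const Candidate& c : candidates) {
        if (best == nullptr || c.confidence > best->confidence) {
            best = &c;
        }
    }
    if (best != nullptr && best->confidence > minimumThreshold) {
        return best->charset;
    }
    return std::string();
}
```

    With the confidences reported above for the big5 file (TIS-620 at 0.2698, big5 at 0.00999), a threshold of 0.20 lets the wrong TIS-620 guess through, while 0.6 rejects it. The same holds for the gb2312 file (IBM855 at 0.5119) and the win1250 file (windows-1252 at 0.5).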

    Giles

     
    • Don HO

      Don HO - 2014-02-23

      Thank you Giles for this very detailed information.

      With MINIMUM_THRESHOLD set to 0.6, the detection is more accurate.
      I didn't integrate the settings part of the patch: if detection fails (win1252 or win1256), the encoding falls back to ANSI (the old behaviour), so I think it's OK.

      The patch was committed in SVN rev. 1189.

      Thank you again for your patch and your help.

      Don

       
  • Don HO

    Don HO - 2014-02-23
    • status: open --> accepted
    • Priority: 5 --> 8
     
  • Giles Payne

    Giles Payne - 2014-02-26

    Thanks Don - I really needed this feature for my office, where nearly all documents are in SJIS encoding (;・∀・)

     
  • Don HO

    Don HO - 2014-03-07
    • Status: accepted --> closed