Notepad++ / Patches / #557 Autodetect character encoding

Don HO - 2014-02-16

Hi Giles,

Thank you for your patch. It's indeed a very interesting feature.
I'm having some trouble applying your patch.
Could you send me the "uchardet" folder and notepadPlus.vcproj directly, please?

Don


Giles Payne - 2014-02-17

Hi Don,
I'm attaching the zipped uchardet folder and the notepadPlus.vcproj file.
Giles

Last edit: Giles Payne 2014-02-17

notepadPlus.vcproj

uchardet.zip

- Don HO - 2014-02-23
  
  Thank you Giles.
  
  I integrated your patch and tested it with three files encoded in big5, gb2312 and windows-1250 (see the attached file); all three detections are wrong.
  
  Could you try them with your compiled binary and tell me the result?
  
  Don
  
  codage.zip
  

Giles Payne - 2014-02-23

Hi Don,

I tested the files with my build and got the following results:

big5: recognized as TIS-620 (Thai) with confidence level 0.2698 (big5 confidence level 0.00999)
gb2312: recognized as IBM855 (Cyrillic) with confidence level 0.5119 (gb2312 confidence level 0.00999)
win1250: recognized as windows-1252 with confidence level 0.5 (win1250 confidence level N/A)

Looking through the uchardet source code, I found the following commented-out code (in nsSBCSGroupProber.cpp), which appears to indicate that win1250 detection is disabled because win1250 and win1252 are difficult to tell apart:
// disable latin2 before latin1 is available, otherwise all latin1
// will be detected as latin2 because of their similarity.
//mProbers[10] = new nsSingleByteCharSetProber(&Latin2HungarianModel);
//mProbers[11] = new nsSingleByteCharSetProber(&Win1250HungarianModel);

In general, character encoding detection is very difficult with very short files like these, so I am not too surprised by these results.
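A rough illustration of why short samples are ambiguous, using the big5 result above. This is only a sketch under stated assumptions: the two helper functions are hypothetical, they check byte ranges only, and they are not how uchardet's statistical probers work. The byte ranges are the commonly documented ones for Big5 lead/trail bytes and for the assigned TIS-620 code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if every byte is ASCII or an assigned
// TIS-620 code point (0xA1-0xDA or 0xDF-0xFB).
bool looksLikeTis620(const std::string& bytes) {
    for (unsigned char b : bytes) {
        const bool ascii = b < 0x80;
        const bool thai = (b >= 0xA1 && b <= 0xDA) || (b >= 0xDF && b <= 0xFB);
        if (!ascii && !thai) return false;
    }
    return true;
}

// Hypothetical helper: true if the bytes are ASCII mixed with well-formed
// Big5 pairs (lead 0x81-0xFE, trail 0x40-0x7E or 0xA1-0xFE).
bool looksLikeBig5(const std::string& bytes) {
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) continue;  // plain ASCII byte
        if (lead < 0x81 || lead > 0xFE || i + 1 >= bytes.size()) return false;
        const unsigned char trail = static_cast<unsigned char>(bytes[++i]);
        const bool trailOk =
            (trail >= 0x40 && trail <= 0x7E) || (trail >= 0xA1 && trail <= 0xFE);
        if (!trailOk) return false;
    }
    return true;
}
```

For the four bytes A4 A4 A4 E5 (the Big5 encoding of 中文), both functions return true, so the bytes alone cannot settle the question. A detector has to fall back on character-frequency statistics, which need more text than a very short file provides.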

To avoid this kind of misdetection, the only thing I can suggest is to increase the value of MINIMUM_THRESHOLD in nsUniversalDetector.cpp by changing this line:

#define MINIMUM_THRESHOLD (float)0.20

(maybe try setting to something like 0.6)
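The effect of the threshold on the three test files can be sketched as follows. This is a minimal illustration, not the code in nsUniversalDetector.cpp: Candidate and pickCharset are hypothetical names, and the sketch assumes the detector reports the highest-confidence candidate only when its confidence is strictly above the threshold.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One (charset, confidence) pair, as listed in the test results above.
struct Candidate {
    std::string charset;
    float confidence;
};

// Hypothetical helper, not part of uchardet: return the charset with the
// highest confidence, or an empty string when no candidate exceeds the
// threshold (the caller would then keep its default encoding).
std::string pickCharset(const std::vector<Candidate>& candidates,
                        float minimumThreshold) {
    assert(minimumThreshold >= 0.0f && minimumThreshold <= 1.0f);
    const Candidate* best = nullptr;
    for (const Candidate& c : candidates) {
        if (best == nullptr || c.confidence > best->confidence) {
            best = &c;
        }
    }
    if (best != nullptr && best->confidence > minimumThreshold) {
        return best->charset;
    }
    return "";  // nothing confident enough
}
```

With a threshold of 0.20, pickCharset({{"TIS-620", 0.2698f}, {"Big5", 0.00999f}}, 0.20f) returns "TIS-620", which is the wrong answer reported for the big5 file. With 0.6, none of the three top confidences (0.2698, 0.5119, 0.5) passes, so all three files get an empty result instead of a wrong charset.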

Giles

- Don HO - 2014-02-23
  
  Thank you, Giles, for this very detailed information.
  
  With MINIMUM_THRESHOLD set to 0.6, the detection is more accurate.
  I didn't integrate the settings part of the patch. If detection fails (win1252 or win1256), the encoding will be ANSI (the old behaviour), so I think it's OK.
  
  The patch was committed in SVN rev.1189
  
  Thank you again for your patch and your help.
  
  Don
  

Don HO - 2014-02-23

Status: open --> accepted

Priority: 5 --> 8

Giles Payne - 2014-02-26

Thanks Don - I really needed this feature for my office; nearly all documents are in SJIS encoding (；・∀・)


Don HO - 2014-03-07

Status: accepted --> closed


1701 Encoding and format should work with existing documents

1820 Set character set according to windows settings on file open

1908 Detect encoding of files, for example ciryllic