This patch adds automatic detection of character encoding when a document is opened, using the universal charset detection library from Mozilla. It also adds an option to enable/disable this feature (enabled by default).
A brief search of the Feature Requests found 3 feature requests that would probably be solved by this:
Hi Giles,
Thank you for your patch. It's indeed a very interesting feature.
I'm having some problems applying your patch.
Could you send me the "uchardet" folder and notepadPlus.vcproj directly, please?
Don
Hi Don,
Attaching zipped uchardet folder and notepadPlus.vcproj file
Giles
Last edit: Giles Payne 2014-02-17
Thank you Giles.
I integrated your patch and tested it with 3 files encoded in big5, gb2312 and windows-1250 (see attached file); the detections are all wrong.
Could you try them with your compiled binary and tell me the result?
Don
Hi Don,
I tested the files with my build and got the following results:
big5 recognized as TIS-620 (Thai) with confidence level: 0.2698 (big5 confidence level: 0.00999)
gb2312 recognized as IBM855 (Cyrillic) with confidence level: 0.5119 (gb2312 confidence level: 0.00999)
win1250 recognized as windows-1252 with confidence level: 0.5 (win1250 confidence level: N/A)
Looking through the uchardet source code, I found the following commented-out code (in nsSBCSGroupProber.cpp), which appears to indicate that win1250 detection is disabled because it is difficult to distinguish between win1250 and win1252:
// disable latin2 before latin1 is available, otherwise all latin1
// will be detected as latin2 because of their similarity.
//mProbers[10] = new nsSingleByteCharSetProber(&Latin2HungarianModel);
//mProbers[11] = new nsSingleByteCharSetProber(&Win1250HungarianModel);
In general, character encoding detection is very difficult with very short files like these so I am not too surprised by these results.
To avoid this kind of misdetection, the only thing I can suggest would be to increase the value of MINIMUM_THRESHOLD in nsUniversalDetector.cpp by changing this line:
#define MINIMUM_THRESHOLD (float)0.20
(maybe try setting to something like 0.6)
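To illustrate why raising the threshold helps, here is a minimal sketch of a confidence cutoff. It is not the actual uchardet code: the Candidate struct and the pickCharset function are hypothetical names made up for this example, and it assumes the detector simply takes the highest-confidence candidate and reports it only if it exceeds the threshold.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical type for illustration only; not part of uchardet.
struct Candidate {
    std::string charset;
    float confidence;
};

// Pick the candidate with the highest confidence, but report it only if
// that confidence is strictly above minimumThreshold. Otherwise return
// std::nullopt so the caller can fall back to its default encoding.
std::optional<std::string> pickCharset(const std::vector<Candidate>& candidates,
                                       float minimumThreshold) {
    const Candidate* best = nullptr;
    for (const auto& c : candidates) {
        if (best == nullptr || c.confidence > best->confidence) {
            best = &c;
        }
    }
    if (best != nullptr && best->confidence > minimumThreshold) {
        return best->charset;
    }
    return std::nullopt;
}
```

With the two confidence values reported above for the big5 file (Big5 at 0.00999, TIS-620 at 0.2698), a threshold of 0.20 lets the wrong TIS-620 guess through, while 0.6 rejects it and returns no result. Note that a higher threshold does not make the right answer win; it only turns a low-confidence wrong guess into "no detection".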
Giles
Thank you Giles for this very detailed information.
By setting MINIMUM_THRESHOLD to 0.6, the detection is more accurate.
I didn't integrate the settings part of the patch. If detection fails (win1252 or win1256), the encoding will be ANSI (the old behaviour), so I think it's OK.
The patch was committed in SVN rev.1189.
Thank you again for your patch and your help.
Don
Thanks Don - I really needed this feature for my office; nearly all documents are in SJIS encoding (;・∀・)