#481 Bugfix for Wrong display of # characters, bytes and words in status bar and Summary

Next_release
closed
nobody
None
5
2013-09-19
2013-05-30
FLS
No

For UTF-8 and UTF-16 encodings, Notepad++ (tested with version 6.3.3) sometimes shows wrong figures for the number of characters and words in the “Summary” dialogue (View – Summary ...).

Furthermore, it is not clear what the status bar displays: the number of characters or the number of bytes? For ANSI and UTF-8, it apparently displays the number of bytes, including CR and LF. For UTF-16, however, it seems to display the number of characters (including CR and LF). I would expect the status bar to show the number of selected characters, independently of encoding.

Refer also to bug report #4005: Selecting characters: "Ääni" is only 4 characters not 6
and patch #431 reporting bytes instead of characters.
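
To make the byte/character distinction concrete, here is a minimal sketch (in Python, purely for illustration; it is not part of the patch). It assumes Windows-1252 as the "ANSI" code page and counts UTF-16 without a BOM; the function name size_report is hypothetical.

```python
def size_report(s: str) -> dict:
    """Return the character count and the encoded byte length per encoding."""
    return {
        "characters": len(s),                       # Unicode code points
        "ansi_bytes": len(s.encode("cp1252")),      # assumption: ANSI = Windows-1252
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_bytes": len(s.encode("utf-16-le")),  # without BOM
    }

text = "\u00c4\u00e4ni"  # "Ääni", the example from bug report #4005
print(size_report(text))
```

For this word the sketch yields 4 characters, 4 bytes in ANSI, 6 bytes in UTF-8 and 8 bytes in UTF-16, which is why a byte-based status bar shows 6 for a 4-character selection in a UTF-8 file.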

Finally, the “Summary” dialogue displays different numbers than the status bar, especially for multi-byte characters like French “éèê” or German “äöü”. For those, the number of characters is reported wrongly.

Up to now, the status bar does not handle multi-area selections; it displays N/A instead.

To demonstrate the above, I have attached some prepared text files with instructions inside (see TestFiles_UnicodeCharacterWordCount.7z).

Analysing the reasons for this behaviour, I found that the status bar and the Summary dialogue use different subroutines in different places (ScintillaEditView.cpp and Notepad_plus.cpp). They do nearly the same things, but only nearly, and with bugs.

To correct those bugs, I have provided a patch. I merged the scattered functions and moved them to ScintillaEditView.cpp.
Furthermore, Scintilla version 3.3.x provides a function to retrieve the number of selected characters, independently of encoding (apparently Scintilla converts everything to UTF-8 and stores it that way internally). The patch uses this function, but also provides a replacement function for the currently used Scintilla 2.2.7.
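
The replacement function itself is in the patch and is not reproduced here. As a rough illustration of how characters can be counted in a UTF-8 buffer without editor support, the following sketch counts every byte that is not a continuation byte. It is written in Python for brevity rather than in the patch's C++, the name count_utf8_chars is hypothetical, and it assumes the buffer is valid UTF-8 and that the range starts on a character boundary.

```python
def count_utf8_chars(buf: bytes, start: int, end: int) -> int:
    """Count the characters that start inside buf[start:end].

    In UTF-8, continuation bytes have the bit pattern 10xxxxxx, so every
    byte that is not a continuation byte starts a new character.
    """
    return sum(1 for b in buf[start:end] if (b & 0xC0) != 0x80)

doc = "\u00c4\u00e4ni\r\n".encode("utf-8")  # "Ääni" followed by CR LF
print(len(doc), count_utf8_chars(doc, 0, len(doc)))
```

Note that this sketch counts CR and LF as one character each; whether line endings should count at all is a separate decision.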

There are two patches provided (based on SVN 1046):
a.) NppPatch_6.3.3_CountUnicodeCharsAndBytes_Preliminary.patch
The first patch keeps the old code (commented out) and adds many explanations, plus some comments for debugging and for verifying the new functions.

b.) NppPatch_6.3.3_CountUnicodeCharsAndBytes_Final.patch
This is the final, cleaned-up patch.
However, you are free to change how the figures are printed in the status bar and the Summary dialogue.

Commentcode: xCountUnicodeCharsAndBytes:

Discussion

  • FLS
    2013-05-30

    Here are the three files, zipped together:

     
  • Don HO
    2013-09-19

    • status: open --> closed
     
  • Don HO
    2013-09-19

    This bug has been fixed in v6.4.5
    Thank you for your patch.

     
    • FLS
      2013-09-22

      Dear Don
      Unfortunately, Npp v6.4.5 fixes only some of the bugs mentioned above in the status bar and in the “View – Summary...” dialogue. I have just downloaded Npp v6.4.5 from the Npp download site and run some detailed tests (see below).

      Furthermore, Scintilla 3.3.x provides a much more efficient function to get the number of selected characters, independently of encoding (SCI_COUNTCHARACTERS).

      Finally, the status bar and the Summary dialogue still use different subroutines in different places (ScintillaEditView.cpp and Notepad_plus.cpp). They do nearly the same things, but only nearly, and with bugs. That should be tidied up.

      You can simply use my comprehensively commented patch and look through it. By the way, my patch also handles multi-area selections.
      Perhaps, if I have some time, I will provide a patch against the current Npp version, using the Scintilla 3.3.x function.
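
For illustration only, handling a multi-area selection can be sketched as summing the counts over every selected range instead of reporting N/A. This is not the patch's code: it is Python rather than C++, the name count_selected is hypothetical, and it assumes a UTF-8 buffer with ranges given as byte offsets on character boundaries.

```python
from typing import Iterable, Tuple

def count_selected(buf: bytes, ranges: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (characters, bytes) summed over all selection ranges of a UTF-8 buffer."""
    chars = 0
    size = 0
    for start, end in ranges:
        lo, hi = min(start, end), max(start, end)  # a selection may run backwards
        chunk = buf[lo:hi]
        # Every byte that is not a UTF-8 continuation byte starts a character.
        chars += sum(1 for b in chunk if (b & 0xC0) != 0x80)
        size += len(chunk)
    return chars, size

# "éèê äöü": two selected areas, one on each side of the blank
doc = "\u00e9\u00e8\u00ea \u00e4\u00f6\u00fc".encode("utf-8")
print(count_selected(doc, [(0, 6), (7, 13)]))
```

With the two areas above, the sketch reports 6 characters in 12 bytes.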

      Regards
      FLS

      Detailed test with v6.4.5:

      Using the test files provided in TestFiles_UnicodeCharacterWordCount.7z (see attached), the following bugs and inconsistencies are still present:

      Using file Test-UTF-8_Count.txt:

      a) Select, for example, four characters at the end of a line and extend the selection to the first character of the next line. The status bar then reports: 6 characters | 2 lines.
      This means that the CR and LF characters are counted as characters! Is this correct?
      Furthermore, why are two lines reported as selected? When something is selected within a single line, zero lines are reported!

      b) If the whole text is selected (ctrl-A)
      * The Summary dialogue wrongly reports: 490 characters (without blanks). In fact, 490 is the number of all characters WITH blanks, excluding only the twenty CR and LF characters!
      * The status bar reports 510 characters, which is the number of characters plus the number of CR and LF characters.
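
The relationship between these figures (510 = 490 + 20) can be sketched as follows. This is only an illustration, not the Notepad++ code: the name summarize is hypothetical, and treating space and tab as "blanks" is an assumption.

```python
def summarize(text: str) -> dict:
    """Split a character count into the categories discussed above."""
    eol = sum(1 for c in text if c in "\r\n")    # CR and LF characters
    blanks = sum(1 for c in text if c in " \t")  # assumption: blank = space or tab
    return {
        "all": len(text),
        "without_eol": len(text) - eol,
        "without_eol_and_blanks": len(text) - eol - blanks,
    }

sample = "ab cd\r\nef\r\n"
print(summarize(sample))
```

In these terms, the status bar figure corresponds to "all" and the Summary figure to "without_eol", although the dialogue labels it as if it were "without_eol_and_blanks".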

      Using file Test-UTF-16_Count.txt:

      a) Select the seven multi-byte characters in line three.
      * The Summary dialogue reports wrongly: 24 selected characters (12 bytes).

      b) When the whole text is selected (ctrl-A)
      * The Summary dialogue reports wrongly: 638 characters (without blanks), but there are 22 CRLF-chars (42 bytes), 90 blanks and 543 other characters (=633 characters)!
      Furthermore, it is wrongly reported: 1276 selected characters (660 bytes).
      * In the status bar and in the Summary dialogue, a document length of 660 is reported. That is obviously wrong, because there are just 655 characters (even with CR-LF).

      Using file Test-WordCount.txt:

      a) Summary dialogue
      * There are 11 words but Summary dialogue reports 14!
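
The cause of the overcount is not analysed here. As a reference point, a Unicode-aware word count can be sketched as below. It is an illustration in Python, not the Notepad++ implementation; the name count_words is hypothetical, and the rule "a word is a whitespace-separated token containing at least one letter or digit" is an assumption.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one letter or digit."""
    return sum(1 for token in text.split() if any(ch.isalnum() for ch in token))

# "Ääni on hyvä - très bien": the lone dash is not counted as a word
sample = "\u00c4\u00e4ni on hyv\u00e4 - tr\u00e8s bien"
print(count_words(sample))
```

Because the count works on characters rather than bytes, accented letters such as “ä” or “è” do not split a word.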