
#1779 Incomplete/broken Unicode input on Windows

Bug
closed-fixed
5
2016-01-18
2015-11-18
Sam Hocevar
No

There are several problems with the handling of WM_CHAR and WM_UNICHAR messages on Windows:
* surrogate pairs (codepoints higher than U+FFFF) in WM_CHAR are not recombined and are instead replaced with two unrelated, garbage characters
* the WM_UNICHAR message handler casts wParam to a wchar_t and passes it to UTF-16 functions, which does not make sense since wParam is a UTF-32 value. This results in more invalid characters being input
* there is no huge difference between WM_CHAR and WM_UNICHAR, so their logic could be merged and WM_UNICHAR messages made to work even if IsUnicodeMode() is false

These bugs are fixed in the following merge request: https://sourceforge.net/p/scintilla/code/merge-requests/13/
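
To illustrate the first item in the list, here is a minimal sketch of recombining surrogate halves that arrive one per message, as WM_CHAR delivers them. It is not the code from the merge request: `SurrogateCombiner` and `Feed` are hypothetical names, and the policy for malformed input (substitute U+FFFD for a lone trail surrogate, discard a lead surrogate that is not followed by a trail) is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper, not Scintilla's implementation.
// Accumulates UTF-16 code units delivered one at a time and
// reports a full code point once one is available.
class SurrogateCombiner {
public:
    // Feed one UTF-16 code unit (for WM_CHAR, the low 16 bits of wParam).
    // Returns true and sets `out` when a code point is complete;
    // returns false while waiting for the second half of a pair.
    bool Feed(char16_t unit, char32_t &out) {
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            pendingLead_ = unit;            // lead surrogate: wait for the trail
            return false;
        }
        if (unit >= 0xDC00 && unit <= 0xDFFF) {
            if (pendingLead_ == 0) {
                out = 0xFFFD;               // assumption: lone trail becomes U+FFFD
                return true;
            }
            out = 0x10000
                + ((static_cast<char32_t>(pendingLead_) - 0xD800) << 10)
                + (static_cast<char32_t>(unit) - 0xDC00);
            pendingLead_ = 0;
            return true;
        }
        pendingLead_ = 0;                   // assumption: drop an unpaired lead
        out = unit;                         // ordinary BMP character
        return true;
    }

private:
    char16_t pendingLead_ = 0;              // 0 means no lead surrogate pending
};
```

Feeding 0xD83D and then 0xDE4C yields U+1F64C on the second call. A WM_UNICHAR handler would not need this step, since its wParam already holds a complete UTF-32 value.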

Related

Scintilla: 56433015b9363c101fceafcd
Scintilla: 564d7e313e5e837d1c03b7d8

Discussion

  • Neil Hodgson

    Neil Hodgson - 2015-11-18

    Which application are you using that sends WM_UNICHAR?

     
  • Sam Hocevar

    Sam Hocevar - 2015-11-18

    I am not aware of such an application; I tested it directly from within Notepad++ with the following code:

    PostMessage(MainHWND(), WM_UNICHAR, 0x1f64c, 0);
    

    This should output character U+1F64C (🙌) but right now it outputs U+F64C () which is not a real character. With the above patch, it works fine.
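
The U+F64C result is what remains when a UTF-32 value is narrowed to 16 bits. The sketch below reproduces that truncation and shows the UTF-16 encoding the value should have received instead. `TruncateTo16` and `EncodeUTF16` are hypothetical names used only for illustration; `char16_t` stands in for the 16-bit `wchar_t` used on Windows.

```cpp
#include <cstddef>
#include <cstdint>

// Mimics casting a UTF-32 wParam to a 16-bit wchar_t: the high bits are lost.
inline char16_t TruncateTo16(char32_t cp) {
    return static_cast<char16_t>(cp & 0xFFFF);
}

// Hypothetical helper: encode one code point as UTF-16.
// Writes 1 or 2 code units to `out` and returns the count,
// or 0 if `cp` is not a valid Unicode scalar value.
inline std::size_t EncodeUTF16(char32_t cp, char16_t out[2]) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = static_cast<char16_t>(cp);
        return 1;
    }
    cp -= 0x10000;
    out[0] = static_cast<char16_t>(0xD800 + (cp >> 10));    // lead surrogate
    out[1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // trail surrogate
    return 2;
}
```

For 0x1F64C, `TruncateTo16` gives 0xF64C, while `EncodeUTF16` produces the pair 0xD83D 0xDE4C.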

    My main reason for this bug report is because I wrote WinCompose (https://github.com/samhocevar/wincompose) which is affected by the WM_CHAR issue (it uses SendInput() which Windows translates to WM_CHAR messages).

    By the way, some changes in WM_UNICHAR handling date back to http://sourceforge.net/p/scintilla/bugs/604/ which was a questionable bug report IMHO: it should be possible to input Unicode characters in non-Unicode mode, because some of these could actually be valid in the current codepage (if it’s non-ASCII).

     
  • Neil Hodgson

    Neil Hodgson - 2015-11-19

    AddCharUTF16 has a comment that implies it is for multi-character strings, but it does not loop over each character. Too few SCN_CHARADDED notifications will be sent if it is called with a multi-character string. If it should handle multi-character strings then that should be implemented.
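
For reference, handling a multi-character string would mean walking the UTF-16 input one code point at a time so that each character can be inserted and notified separately. The following is only a sketch of that loop, not a proposed patch: `SplitCodePoints` is a hypothetical name, and passing unpaired surrogates through unchanged is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-16 string into code points.
// A caller could then insert each one and send one SCN_CHARADDED per element.
inline std::vector<char32_t> SplitCodePoints(const std::u16string &s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t cp = s[i];
        const bool isLead = cp >= 0xD800 && cp <= 0xDBFF;
        if (isLead && i + 1 < s.size()
                && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // Valid pair: combine both halves and skip the trail unit.
            cp = 0x10000 + ((cp - 0xD800) << 10)
               + (static_cast<char32_t>(s[i + 1]) - 0xDC00);
            ++i;
        }
        out.push_back(cp);  // assumption: unpaired surrogates pass through as-is
    }
    return out;
}
```

A string holding `a`, the pair 0xD83D 0xDE4C, and `b` is four code units long but yields three code points, so three notifications would be expected.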

    The line "utfval[len] = '\0';" in AddCharUTF16 is after the last use of utfval so should be removed or moved earlier.

     
  • Sam Hocevar

    Sam Hocevar - 2015-11-19

    Thanks for the review; I fixed the comment rather than the code because there is no scenario yet where AddCharUTF16 would be called on multi-character strings. I did not understand how to combine merge requests, so I created a new one (https://sourceforge.net/p/scintilla/code/merge-requests/14/). I apologize for the inconvenience.

     
  • Neil Hodgson

    Neil Hodgson - 2015-11-21

    Committed as [ce4680] with some minor changes to formatting and documentation.

    The code copied from HandleCompositionWindowed has problems (HandleCompositionInline is implemented better) which are part of current discussions so may be replaced. This code reports each byte in a DBCS character with a separate SCN_CHARADDED notification which may lead to the application reading sliced characters.
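
The slicing problem can be pictured as a grouping step that is missing: bytes belonging to one DBCS character should be reported together. The sketch below is an illustration only, not Scintilla's code. It hard-codes the lead-byte ranges of code page 932 (Shift-JIS) as an assumption so that it stays portable; `IsLeadByte932` and `SplitDBCS932` are hypothetical names, and real Windows code would more likely ask the system through `IsDBCSLeadByteEx`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: lead-byte ranges for code page 932 (Shift-JIS) only.
inline bool IsLeadByte932(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: group a byte string into whole characters, so that
// one notification per element never exposes half of a two-byte character.
inline std::vector<std::string> SplitDBCS932(const std::string &bytes) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < bytes.size(); ) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        // A lead byte takes the following byte with it, if there is one.
        const std::size_t len =
            (IsLeadByte932(b) && i + 1 < bytes.size()) ? 2 : 1;
        chars.push_back(bytes.substr(i, len));
        i += len;
    }
    return chars;
}
```

With the bytes 0x82 0xA0 followed by `A`, this returns two elements rather than three, which is the number of notifications an application would expect to receive.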

     

    Related

    Commit: [ce4680]

  • Neil Hodgson

    Neil Hodgson - 2015-11-21
    • labels: --> scintilla, win32, keyboard
    • status: open --> open-fixed
    • assigned_to: Neil Hodgson
     
  • Neil Hodgson

    Neil Hodgson - 2016-01-18
    • status: open-fixed --> closed-fixed
     
