Menu

#2295 Potential memory leak in win32/HanjaDic.cxx

Bug
closed-fixed
5
2022-03-03
2021-11-13
Zufu Liu
No

When OpenMainDic() or GetHanjaType() failed, hr will be changed, HJinterface will leak.

Patch fixed the leak (also remove type casts by replacing unsigned short with wchar_t).

1 Attachments

Related

Feature Requests: #1565

Discussion

1 2 > >> (Page 1 of 2)
  • Neil Hodgson

    Neil Hodgson - 2021-11-14

    Moving the OpenMainDic seems reasonable.

    On my Windows 10 system with Korean installed, there is no mshjdic.hanjadic (hr=0x800401f3 : Invalid class string) so the Hanja conversion never works but there is a imkrhjd.hanjadic which does work. Perhaps Korean machines are commonly set up with mshjdic.hanjadic but this code would be more widely applicable if both class IDs were tried.

    A way to test this is:

    • Open a UTF-8 document
    • Change the IME to Korean; Hangul
    • Type a syllable block like "ㄴㅏㅁ" (US keys "ska") -> "남"
    • Press VK_HANJA (right Ctrl) for a candidate list
    • Press Enter to convert to the first Hanja candidate "南"
    • Convert back to Hangul by selecting "南" and pressing VK_HANJA
    • Result is "남"

    IHanjaDic is defined by the type library included in imkrhjd.dll which is found in C:\Windows\System32\IME\IMEKR\DICTS. This can be imported into C++ with

    #import "file:C:\Windows\System32\IME\IMEKR\DICTS\imkrhjd.dll" raw_interfaces_only
    

    The resulting imkrhjd.tlh header found in the objects directory uses unsigned short instead of wchar_t. It differs slightly in other ways which may be due to using a different type library extraction tool or version. OleView is no longer provided by Microsoft and it apparently doesn't work on recent versions of Windows.

    The definition of IHanjaDic provided by the type library should be respected. Modifying it to reduce casting seems wrong. Ideally, Scintilla could #import the type library but it won't be present on most developers machines and this won't work with gcc.

    struct __declspec(uuid("ad75f3ac-18cd-48c6-a27d-f1e9a7dce432"))
    IHanjaDic : IUnknown
    {
        //
        // Raw methods provided by interface
        //
    
          virtual HRESULT __stdcall OpenMainDic ( ) = 0;
          virtual HRESULT __stdcall CloseMainDic ( ) = 0;
          virtual HRESULT __stdcall GetHanjaWords (
            /*[in]*/ BSTR bstrHangul,
            /*[out]*/ SAFEARRAY * * ppsaHanja,
            /*[out,retval]*/ VARIANT_BOOL * pfFound ) = 0;
          virtual HRESULT __stdcall GetHanjaChars (
            /*[in]*/ unsigned short wchHangul,
            /*[out]*/ BSTR * pbstrHanjaChars,
            /*[out,retval]*/ VARIANT_BOOL * pfFound ) = 0;
          virtual HRESULT __stdcall HanjaToHangul (
            /*[in]*/ BSTR bstrHanja,
            /*[out,retval]*/ BSTR * pbstrHangul ) = 0;
          virtual HRESULT __stdcall GetHanjaType (
            /*[in]*/ unsigned short wchHanja,
            /*[out,retval]*/ HANJA_TYPE * pHanjaType ) = 0;
          virtual HRESULT __stdcall GetHanjaSense (
            /*[in]*/ unsigned short wchHanja,
            /*[out,retval]*/ BSTR * pbstrSense ) = 0;
          virtual HRESULT __stdcall GetRadicalID (
            /*[in]*/ short SeqNumOfRadical,
            /*[out]*/ short * pRadicalID,
            /*[out]*/ unsigned short * pwchRadical ) = 0;
          virtual HRESULT __stdcall GetRadical (
            /*[in]*/ short nRadicalID,
            /*[out,retval]*/ struct IRadical * * ppIRadical ) = 0;
          virtual HRESULT __stdcall RadicalIDToHanja (
            /*[in]*/ short nRadicalID,
            /*[out,retval]*/ unsigned short * pwchRadical ) = 0;
          virtual HRESULT __stdcall GetHanja (
            /*[in]*/ unsigned short wchHanja,
            /*[out,retval]*/ struct IHanja * * ppIHanja ) = 0;
          virtual HRESULT __stdcall GetStrokes (
            /*[in]*/ short nStrokes,
            /*[out,retval]*/ struct IStrokes * * ppIStrokes ) = 0;
          virtual HRESULT __stdcall OpenDefaultCustomDic ( ) = 0;
          virtual HRESULT __stdcall OpenCustomDic (
            /*[in]*/ BSTR bstrPath,
            /*[out,retval]*/ long * plUdr ) = 0;
          virtual HRESULT __stdcall CloseDefaultCustomDic ( ) = 0;
          virtual HRESULT __stdcall CloseCustomDic (
            /*[in]*/ long lUdr ) = 0;
          virtual HRESULT __stdcall CloseAllCustomDics ( ) = 0;
          virtual HRESULT __stdcall GetDefaultCustomHanjaWords (
            /*[in]*/ BSTR bstrHangul,
            /*[out]*/ struct tagSAFEARRAY * * ppsaHanja,
            /*[out,retval]*/ VARIANT_BOOL * pfFound ) = 0;
          virtual HRESULT __stdcall GetCustomHanjaWords (
            /*[in]*/ long lUdr,
            /*[in]*/ BSTR bstrHangul,
            /*[out]*/ struct tagSAFEARRAY * * ppsaHanja,
            /*[out,retval]*/ VARIANT_BOOL * pfFound ) = 0;
          virtual HRESULT __stdcall PutDefaultCustomHanjaWord (
            /*[in]*/ BSTR bstrHangul,
            /*[in]*/ BSTR bstrHanja ) = 0;
          virtual HRESULT __stdcall PutCustomHanjaWord (
            /*[in]*/ long lUdr,
            /*[in]*/ BSTR bstrHangul,
            /*[in]*/ BSTR bstrHanja ) = 0;
          virtual HRESULT __stdcall get_MaxNumOfRadicals (
            /*[out,retval]*/ short * pVal ) = 0;
          virtual HRESULT __stdcall get_MaxNumOfStrokes (
            /*[out,retval]*/ short * pVal ) = 0;
          virtual HRESULT __stdcall get_DefaultCustomDic (
            /*[out,retval]*/ long * pVal ) = 0;
          virtual HRESULT __stdcall put_DefaultCustomDic (
            /*[in]*/ long pVal ) = 0;
          virtual HRESULT __stdcall get_MaxHanjaType (
            /*[out,retval]*/ HANJA_TYPE * pHanjaType ) = 0;
          virtual HRESULT __stdcall put_MaxHanjaType (
            /*[in]*/ HANJA_TYPE pHanjaType ) = 0;
    };
    
     
  • Zufu Liu

    Zufu Liu - 2021-11-14

    Full source without touch interface definition.
    Added the sample as a test at end of HanjaDic.cxx.

    (just guess) maybe "mshjdic.hanjadic" is come with Office, or old MS IME.

     
  • Zufu Liu

    Zufu Liu - 2021-11-14

    Changed bstrHangul to nullptr as it's /*[out,retval]*/, I'm not sure whether it's need be freed for each conversion.

    BSTR bstrHangul = nullptr;
    

    Sine the conversion is dictionary based, an improvement might pass consecutive Hanja characters in one conversion. Also conversion result might longer than input?

     
    • Neil Hodgson

      Neil Hodgson - 2021-11-14

      Changed bstrHangul to nullptr as it's /[out,retval]/, I'm not sure whether it's need be freed for each conversion.

      For safety, there should be no reuse of bstrHangul (as its [out], it shouldn't be freed by HanjaToHangul) and it should be explicitly freed on each iteration.

      Sine the conversion is dictionary based, an improvement might pass consecutive Hanja characters in one conversion. Also conversion result might longer than input?

      The original author may have had some reason to perform this character by character - perhaps there is something like unwanted Hangul normalisation or there were bugs in some release. We have no documentation on HanjaToHangul and, without knowing Korean, its difficult to know what is important.

      Added the sample as a test at end of HanjaDic.cxx.

      This sort of test code is a maintenance burden and tends to decay.

       
  • Zufu Liu

    Zufu Liu - 2021-11-14

    Updated the code, moved constructor to a separator function.

    As confirmed at https://github.com/zufuliu/notepad2/issues/393, mshjdic.hanjadic does not work, so changed the code to try imkrhjd.hanjadic first.

     
  • Zufu Liu

    Zufu Liu - 2021-11-15

    Mayeb it's better to surround the OpenMainDic() block and CloseMainDic() method with catch all (and mark HanjaDic() and OpenHanjaDic() as noexcept).

    try {           
    } catch (...) {
        // Ignore any exception
    }
    

    attachment simplified case Message::GetSelText: in Editor.cxx.

     
    • Neil Hodgson

      Neil Hodgson - 2021-11-16

      Since SelectionToHangul allocates so may throw, it doesn't help much to noexcept HanjaDic. I could see an argument to throw from HanjaDic on, for example, E_NOMEMORY from IHanjaDic::OpenMainDic but that would only be worthwhile if we were being much more thorough with out-of-memory and had a strategy for handling it.

      Over time, my opinion on adding noexcept has changed. Constructors often allocate memory so they shouldn't be noexcept even though linters complain. Trivial value constructors that will never allocate like PRectangle should be noexcept. Destructors should be noexcept and try to handle any possible exceptions as it should be possible to finalize safely and completely.

      Rule-of-zero means that the default operations should be concentrated in resource management classes and other classes should then use these classes and define no default operations themselves. Thus HanjaDic shouldn't define a destructor (or assignment, ...). Instead, HJInterface should be a specialization of std::unique_ptr like LexerInstance in Document.h. Something like

      struct UnknownReleaser {
          // Called by unique_ptr to destroy/free the resource
          template <class T>
          void operator()(T *pUnknown) noexcept {
              if (pUnknown) {
                  try {
                      pUnknown->Release();
                  } catch (...) {
                      // IUnknown::Release must not throw, ignore if it does.
                  }
              }
          }
      };
      
      class HanjaDic {
          std::unique_ptr<IHanjaDic, UnknownReleaser> HJinterface;
      

      There's some more complexity as CloseMainDic should be called if OpenMainDic succeeded but that could be moved out of construction/destruction.

       
  • Zufu Liu

    Zufu Liu - 2021-11-16

    Using UnknownReleaser simplified the code a lot. I added HanjaDicCloser, which seems fixed the CloseMainDic problem.

    struct HanjaDicCloser {
        IHanjaDic *HJinterface;
        ~HanjaDicCloser() {
            HJinterface->CloseMainDic();
        }
    };
    

    HanjaDic2.cxx contains a word based conversion and examples from https://www.koreanwikiproject.com/wiki/Hanja#Does_every_Hanja_character_have_only_one_sound_representation.3F

    But both function gives same result on my Win10, don't know whether this is the expected behavior or not, possible it's not worth the complex.

     

    Last edit: Zufu Liu 2021-11-16
  • Zufu Liu

    Zufu Liu - 2021-11-16

    Updated HanjaDicCloser to following to not expose the raw pointer.

    struct HanjaDicCloser {
        const HanjaDic *dict;
        ~HanjaDicCloser() {
            dict->CloseHanjaDic();
        }
    };
    

    Added BLUEnLIVE's comment (need wording updates, basically there is no need to make word based conversion) from
    https://github.com/zufuliu/notepad2/issues/393#issuecomment-970404057

     
    • Neil Hodgson

      Neil Hodgson - 2021-11-17
      C:\u\hg\scintilla\win32\HanjaDic.cxx(120): warning C26447: The function is declared 'noexcept' but calls function 'CloseHanjaDic()' which may throw exceptions (f.6).
      

      Destructors are default noexcept, so functions they call should be noexcept.

      UnknownReleaser can be reused in ScintillaWin.cxx and PlatWin.cxx so there will be a new header containing it along with ReleaseUnknown and DLLFunction which are generically useful and don't refer to other Scintilla identifiers. Don't have a great name but will call it WinTypes.h if there is nothing better.

       
  • Zufu Liu

    Zufu Liu - 2021-11-17

    Added catch all for the destructor. It seems HanjaDicCloser with IHanjaDic *HJinterface produces small function code than with const HanjaDic *dict;.

    Also added a thin wrapper ScopedBSTR for SysAllocString() and SysFreeString(); _bstr_t seems not usable on some mingw gcc,e.g. https://stackoverflow.com/questions/51363689/qt5-mingw-undefined-reference-to-convertstringtobstr

    For the header, I think it better to prefix it. WinTypes.h may conflict with Windows SDK header names. I don't have a better name either, PlatWinBase.h?

     
  • Zufu Liu

    Zufu Liu - 2021-11-18

    Added a generic template (similar to unique_ptr)

    template <class T, class Deleter>
    class ScopedPointer;
    

    I think it can be used with UnknownReleaser to replace some ReleaseUnknown() in PlatWin.cxx, like following:

    struct BSTRDeleter {
        void operator()(BSTR bstr) const noexcept {
            SysFreeString(bstr);
        }
    };
    
    using ScopedBSTR = ScopedPointer<OLECHAR, BSTRDeleter>;
    // ...
    const ScopedBSTR bstrHanja{SysAllocString(hanja)};
    ScopedBSTR bstrHangul;
    if (dict.HanjaToHangul(bstrHanja, &bstrHangul))
    

    Also, according to https://en.cppreference.com/w/cpp/memory/unique_ptr/%7Eunique_ptr
    null check in custom deleter can be omitted.

     
    • Neil Hodgson

      Neil Hodgson - 2021-11-18

      Why define ScopedPointer for these cases when the standard library defines unique_ptr? Linters and other developers already know what unique_ptr is.

      unique_ptr doesn't have & but that's because the semantics are difficult - should the original value be cleared first or be passed? For this code, passing a temporary, null BSTR, then reseting it into the smart pointer will be easily understood. This could be moved up into HanjaDic::HanjaToHangul, possibly returning a unique_ptr instead of using an out argument.

       
  • Zufu Liu

    Zufu Liu - 2021-11-19

    OK, changed to using UniqueBSTR = std::unique_ptr<OLECHAR[], BSTRDeleter>; with a new HanjaToHangul(BSTR bstrHanja, UniqueBSTR &bstrHangul) method (as there is no? grantee it returns null on fails):

    The HanjaToHangul(BSTR bstrHanja, BSTR* pbstrHangul) method is kept, so that in C++23 using out_ptr (see https://en.cppreference.com/w/cpp/memory/out_ptr_t) the code can be simplified to if (dict.HanjaToHangul(bstrHanja.get(), std::out_ptr(bstrHangul))).

     
    • Neil Hodgson

      Neil Hodgson - 2021-11-19

      Unsure if std::unique_ptr<OLECHAR[], BSTRDeleter> is correct since BSTR is OLECHAR*, not OLECHAR[]*. std::unique_ptr<OLECHAR, BSTRDeleter> would seem more correct. That means [] isn't available when retrieving characters but *bstrHangul can be used.

      Unused code shouldn't be included as it makes the file more difficult to understand and produces warnings.

      Its more difficult to trace code when methods have the same name so the private OpenHanjaDic should be renamed.

      CloseMainDic is STDMETHOD which is virtual COM_DECLSPEC_NOTHROW HRESULT STDMETHODCALLTYPE so its fine to make CloseHanjaDic noexcept and avoid the catch in ~HanjaDicCloser. The difference with IUnknown is that Release is just virtual ULONG STDMETHODCALLTYPE.

       
      • Zufu Liu

        Zufu Liu - 2021-11-19

        If we treat STDMETHOD as noexcept, then UniqueBSTR and HanjaDicCloser is not even needed at all, all method including GetHangulOfHanja can be marked as noexcept.

        I would prefer to rename public OpenHanjaDic and CloseHanjaDic just as Open and Close as the class name already says it's HanjaDic.

         

        Last edit: Zufu Liu 2021-11-20
        • Neil Hodgson

          Neil Hodgson - 2021-11-20

          OK, I thought you wanted allocations to be encapsulated, not just for exception safety. To me, BSTR is allocation so there is some value with encapsulating to ensure release but Open|CloseMainDic are life cycle with less benefit as there's no handle returned by OpenMainDic.

          I wouldn't define UniqueBSTR for just this code but might if BSTR was used more extensively.

          Open and Close are fine.

           
          • Zufu Liu

            Zufu Liu - 2021-11-20

            I currently can think only one usage for BSTR for other code: accessibility like Text and TextRange Control Patterns, maybe there are other COM APIs that use BSTR.

            UniqueBSTR is similar toUniqueString, so I think provide index operator is appreciate.

             
          • Zufu Liu

            Zufu Liu - 2021-11-21

            So, reverted the code to using UniqueBSTR = std::unique_ptr<OLECHAR[], BSTRDeleter>;.

            unique_ptr<OLECHAR[]> is array of OLECHAR not pointer to array of OLECHAR (OLECHAR[]* ? or OLECHAR**), so it behaviors same as unique_ptr<OLECHAR> (pointer to OLECHAR), except it has an index operator.

             
  • Zufu Liu

    Zufu Liu - 2021-11-19

    Simplified code with SysAllocStringLen:

    const wchar_t hanja[2] = { character, L'\0' };
    const UniqueBSTR bstrHanja{SysAllocString(hanja)};
    
    =>
    
    const UniqueBSTR bstrHanja{SysAllocStringLen(&character, 1)};
    
     
  • Neil Hodgson

    Neil Hodgson - 2021-11-22
    • labels: --> scintilla, win32, Korean
    • status: open --> open-fixed
     
  • Neil Hodgson

    Neil Hodgson - 2021-11-22

    Committed as [bab27e]. Subsequent change adds WinTypes.h.

     

    Related

    Commit: [bab27e]

  • Neil Hodgson

    Neil Hodgson - 2021-12-06
    • status: open-fixed --> closed-fixed
     
  • Neil Hodgson

    Neil Hodgson - 2022-01-19
    • status: closed-fixed --> open-accepted
     
  • Neil Hodgson

    Neil Hodgson - 2022-01-19

    Coverity dislikes UniqueBSTR as OLECHAR[] isn't the same as BSTR. This is in the expression :

    dict.HanjaToHangul(bstrHanja.get(), bstrHangul)
    
    1469035 COM bad conversion to BSTR
    
    Accessing a wchar_t as if it were a BSTR may read out of bounds memory or cause memory corruption.
    
    In Scintilla::​Internal::​HanjaDict::​GetHangulOfHanja(std::​basic_string<wchar_t, std::​char_traits<wchar_t>, std::​allocator<wchar_t>> &): A wide character string cannot be converted to a BSTR because it lacks the hidden length field (CWE-119)
    
    eLearning: Learn more about CWE-119
    

    CWE-119

     
1 2 > >> (Page 1 of 2)

Log in to post a comment.

MongoDB Logo MongoDB