On my Windows 10 system with Korean installed, there is no mshjdic.hanjadic (hr=0x800401f3 : Invalid class string) so the Hanja conversion never works but there is a imkrhjd.hanjadic which does work. Perhaps Korean machines are commonly set up with mshjdic.hanjadic but this code would be more widely applicable if both class IDs were tried.
A way to test this is:
Open a UTF-8 document
Change the IME to Korean; Hangul
Type a syllable block like "ㄴㅏㅁ" (US keys "ska") -> "남"
Press VK_HANJA (right Ctrl) for a candidate list
Press Enter to convert to the first Hanja candidate "南"
Convert back to Hangul by selecting "南" and pressing VK_HANJA
Result is "남"
IHanjaDic is defined by the type library included in imkrhjd.dll which is found in C:\Windows\System32\IME\IMEKR\DICTS. This can be imported into C++ with
The resulting imkrhjd.tlh header found in the objects directory uses unsigned short instead of wchar_t. It differs slightly in other ways which may be due to using a different type library extraction tool or version. OleView is no longer provided by Microsoft and it apparently doesn't work on recent versions of Windows.
The definition of IHanjaDic provided by the type library should be respected. Modifying it to reduce casting seems wrong. Ideally, Scintilla could #import the type library but it won't be present on most developers machines and this won't work with gcc.
struct__declspec(uuid("ad75f3ac-18cd-48c6-a27d-f1e9a7dce432"))IHanjaDic:IUnknown{//// Raw methods provided by interface//virtualHRESULT__stdcallOpenMainDic()=0;virtualHRESULT__stdcallCloseMainDic()=0;virtualHRESULT__stdcallGetHanjaWords(/*[in]*/BSTRbstrHangul,/*[out]*/SAFEARRAY**ppsaHanja,/*[out,retval]*/VARIANT_BOOL*pfFound)=0;virtualHRESULT__stdcallGetHanjaChars(/*[in]*/unsignedshortwchHangul,/*[out]*/BSTR*pbstrHanjaChars,/*[out,retval]*/VARIANT_BOOL*pfFound)=0;virtualHRESULT__stdcallHanjaToHangul(/*[in]*/BSTRbstrHanja,/*[out,retval]*/BSTR*pbstrHangul)=0;virtualHRESULT__stdcallGetHanjaType(/*[in]*/unsignedshortwchHanja,/*[out,retval]*/HANJA_TYPE*pHanjaType)=0;virtualHRESULT__stdcallGetHanjaSense(/*[in]*/unsignedshortwchHanja,/*[out,retval]*/BSTR*pbstrSense)=0;virtualHRESULT__stdcallGetRadicalID(/*[in]*/shortSeqNumOfRadical,/*[out]*/short*pRadicalID,/*[out]*/unsignedshort*pwchRadical)=0;virtualHRESULT__stdcallGetRadical(/*[in]*/shortnRadicalID,/*[out,retval]*/structIRadical**ppIRadical)=0;virtualHRESULT__stdcallRadicalIDToHanja(/*[in]*/shortnRadicalID,/*[out,retval]*/unsignedshort*pwchRadical)=0;virtualHRESULT__stdcallGetHanja(/*[in]*/unsignedshortwchHanja,/*[out,retval]*/structIHanja**ppIHanja)=0;virtualHRESULT__stdcallGetStrokes(/*[in]*/shortnStrokes,/*[out,retval]*/structIStrokes**ppIStrokes)=0;virtualHRESULT__stdcallOpenDefaultCustomDic()=0;virtualHRESULT__stdcallOpenCustomDic(/*[in]*/BSTRbstrPath,/*[out,retval]*/long*plUdr)=0;virtualHRESULT__stdcallCloseDefaultCustomDic()=0;virtualHRESULT__stdcallCloseCustomDic(/*[in]*/longlUdr)=0;virtualHRESULT__stdcallCloseAllCustomDics()=0;virtualHRESULT__stdcallGetDefaultCustomHanjaWords(/*[in]*/BSTRbstrHangul,/*[out]*/structtagSAFEARRAY**ppsaHanja,/*[out,retval]*/VARIANT_BOOL*pfFound)=0;virtualHRESULT__stdcallGetCustomHanjaWords(/*[in]*/longlUdr,/*[in]*/BSTRbstrHangul,/*[out]*/structtagSAFEARRAY**ppsaHanja,/*[out,retval]*/VARIANT_BOOL*pfFound)=0;virtualHRESULT__stdcallPutDefaultCustomHanjaWord(/*[in]*/BSTRbstrHangul,/*[in]*/BSTRbstrHanja)=0;virtualHRESULT__stdcallPutCustomHanjaWord(/*[in]*/longlUdr,/*[in]*/BSTRbstrHangul,/*[in]*/BSTRbstrHanja)=0;virtualHRESULT__stdcallget_MaxNumOfRadicals(/*[out,retval]*/short*pVal)=0;virtualHRESULT__stdcallget_MaxNumOfStrokes(/*[out,retval]*/short*pVal)=0;virtualHRESULT__stdcallget_DefaultCustomDic(/*[out,retval]*/long*pVal)=0;virtualHRESULT__stdcallput_DefaultCustomDic(/*[in]*/longpVal)=0;virtualHRESULT__stdcallget_MaxHanjaType(/*[out,retval]*/HANJA_TYPE*pHanjaType)=0;virtualHRESULT__stdcallput_MaxHanjaType(/*[in]*/HANJA_TYPEpHanjaType)=0;};
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Changed bstrHangul to nullptr as it's /*[out,retval]*/, I'm not sure whether it's need be freed for each conversion.
BSTRbstrHangul=nullptr;
Sine the conversion is dictionary based, an improvement might pass consecutive Hanja characters in one conversion. Also conversion result might longer than input?
Changed bstrHangul to nullptr as it's /[out,retval]/, I'm not sure whether it's need be freed for each conversion.
For safety, there should be no reuse of bstrHangul (as its [out], it shouldn't be freed by HanjaToHangul) and it should be explicitly freed on each iteration.
Sine the conversion is dictionary based, an improvement might pass consecutive Hanja characters in one conversion. Also conversion result might longer than input?
The original author may have had some reason to perform this character by character - perhaps there is something like unwanted Hangul normalisation or there were bugs in some release. We have no documentation on HanjaToHangul and, without knowing Korean, its difficult to know what is important.
Added the sample as a test at end of HanjaDic.cxx.
This sort of test code is a maintenance burden and tends to decay.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Since SelectionToHangul allocates so may throw, it doesn't help much to noexcept HanjaDic. I could see an argument to throw from HanjaDic on, for example, E_NOMEMORY from IHanjaDic::OpenMainDic but that would only be worthwhile if we were being much more thorough with out-of-memory and had a strategy for handling it.
Over time, my opinion on adding noexcept has changed. Constructors often allocate memory so they shouldn't be noexcept even though linters complain. Trivial value constructors that will never allocate like PRectangle should be noexcept. Destructors should be noexcept and try to handle any possible exceptions as it should be possible to finalize safely and completely.
Rule-of-zero means that the default operations should be concentrated in resource management classes and other classes should then use these classes and define no default operations themselves. Thus HanjaDic shouldn't define a destructor (or assignment, ...). Instead, HJInterface should be a specialization of std::unique_ptr like LexerInstance in Document.h. Something like
structUnknownReleaser{// Called by unique_ptr to destroy/free the resourcetemplate<classT>voidoperator()(T*pUnknown)noexcept{if(pUnknown){try{pUnknown->Release();}catch(...){// IUnknown::Release must not throw, ignore if it does.}}}};classHanjaDic{std::unique_ptr<IHanjaDic,UnknownReleaser>HJinterface;
There's some more complexity as CloseMainDic should be called if OpenMainDic succeeded but that could be moved out of construction/destruction.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
C:\u\hg\scintilla\win32\HanjaDic.cxx(120): warning C26447: The function is declared 'noexcept' but calls function 'CloseHanjaDic()' which may throw exceptions (f.6).
Destructors are default noexcept, so functions they call should be noexcept.
UnknownReleaser can be reused in ScintillaWin.cxx and PlatWin.cxx so there will be a new header containing it along with ReleaseUnknown and DLLFunction which are generically useful and don't refer to other Scintilla identifiers. Don't have a great name but will call it WinTypes.h if there is nothing better.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
For the header, I think it better to prefix it. WinTypes.h may conflict with Windows SDK header names. I don't have a better name either, PlatWinBase.h?
Why define ScopedPointer for these cases when the standard library defines unique_ptr? Linters and other developers already know what unique_ptr is.
unique_ptr doesn't have & but that's because the semantics are difficult - should the original value be cleared first or be passed? For this code, passing a temporary, null BSTR, then reseting it into the smart pointer will be easily understood. This could be moved up into HanjaDic::HanjaToHangul, possibly returning a unique_ptr instead of using an out argument.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
OK, changed to using UniqueBSTR = std::unique_ptr<OLECHAR[], BSTRDeleter>; with a new HanjaToHangul(BSTR bstrHanja, UniqueBSTR &bstrHangul) method (as there is no? grantee it returns null on fails):
The HanjaToHangul(BSTR bstrHanja, BSTR* pbstrHangul) method is kept, so that in C++23 using out_ptr (see https://en.cppreference.com/w/cpp/memory/out_ptr_t) the code can be simplified to if (dict.HanjaToHangul(bstrHanja.get(), std::out_ptr(bstrHangul))).
Unsure if std::unique_ptr<OLECHAR[], BSTRDeleter> is correct since BSTR is OLECHAR*, not OLECHAR[]*. std::unique_ptr<OLECHAR, BSTRDeleter> would seem more correct. That means [] isn't available when retrieving characters but *bstrHangul can be used.
Unused code shouldn't be included as it makes the file more difficult to understand and produces warnings.
Its more difficult to trace code when methods have the same name so the private OpenHanjaDic should be renamed.
CloseMainDic is STDMETHOD which is virtual COM_DECLSPEC_NOTHROW HRESULT STDMETHODCALLTYPE so its fine to make CloseHanjaDic noexcept and avoid the catch in ~HanjaDicCloser. The difference with IUnknown is that Release is just virtual ULONG STDMETHODCALLTYPE.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
If we treat STDMETHOD as noexcept, then UniqueBSTR and HanjaDicCloser is not even needed at all, all method including GetHangulOfHanja can be marked as noexcept.
I would prefer to rename public OpenHanjaDic and CloseHanjaDic just as Open and Close as the class name already says it's HanjaDic.
OK, I thought you wanted allocations to be encapsulated, not just for exception safety. To me, BSTR is allocation so there is some value with encapsulating to ensure release but Open|CloseMainDic are life cycle with less benefit as there's no handle returned by OpenMainDic.
I wouldn't define UniqueBSTR for just this code but might if BSTR was used more extensively.
Open and Close are fine.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I currently can think only one usage for BSTR for other code: accessibility like Text and TextRange Control Patterns, maybe there are other COM APIs that use BSTR.
UniqueBSTR is similar toUniqueString, so I think provide index operator is appreciate.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
So, reverted the code to using UniqueBSTR = std::unique_ptr<OLECHAR[], BSTRDeleter>;.
unique_ptr<OLECHAR[]> is array of OLECHAR not pointer to array of OLECHAR (OLECHAR[]* ? or OLECHAR**), so it behaviors same as unique_ptr<OLECHAR> (pointer to OLECHAR), except it has an index operator.
Coverity dislikes UniqueBSTR as OLECHAR[] isn't the same as BSTR. This is in the expression :
dict.HanjaToHangul(bstrHanja.get(),bstrHangul)
1469035 COM bad conversion to BSTR
Accessing a wchar_t as if it were a BSTR may read out of bounds memory or cause memory corruption.
In Scintilla::Internal::HanjaDict::GetHangulOfHanja(std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t>> &): A wide character string cannot be converted to a BSTR because it lacks the hidden length field (CWE-119)
eLearning: Learn more about CWE-119
Moving the
OpenMainDicseems reasonable.On my Windows 10 system with Korean installed, there is no
mshjdic.hanjadic(hr=0x800401f3 : Invalid class string) so the Hanja conversion never works but there is aimkrhjd.hanjadicwhich does work. Perhaps Korean machines are commonly set up withmshjdic.hanjadicbut this code would be more widely applicable if both class IDs were tried.A way to test this is:
IHanjaDicis defined by the type library included inimkrhjd.dllwhich is found inC:\Windows\System32\IME\IMEKR\DICTS. This can be imported into C++ withThe resulting
imkrhjd.tlhheader found in the objects directory usesunsigned shortinstead ofwchar_t. It differs slightly in other ways which may be due to using a different type library extraction tool or version. OleView is no longer provided by Microsoft and it apparently doesn't work on recent versions of Windows.The definition of IHanjaDic provided by the type library should be respected. Modifying it to reduce casting seems wrong. Ideally, Scintilla could #import the type library but it won't be present on most developers machines and this won't work with gcc.
Full source without touch interface definition.
Added the sample as a test at end of HanjaDic.cxx.
(just guess) maybe "mshjdic.hanjadic" is come with Office, or old MS IME.
Changed bstrHangul to nullptr as it's
/*[out,retval]*/, I'm not sure whether it's need be freed for each conversion.Sine the conversion is dictionary based, an improvement might pass consecutive Hanja characters in one conversion. Also conversion result might longer than input?
For safety, there should be no reuse of
bstrHangul(as its[out], it shouldn't be freed by HanjaToHangul) and it should be explicitly freed on each iteration.The original author may have had some reason to perform this character by character - perhaps there is something like unwanted Hangul normalisation or there were bugs in some release. We have no documentation on HanjaToHangul and, without knowing Korean, its difficult to know what is important.
This sort of test code is a maintenance burden and tends to decay.
Updated the code, moved constructor to a separator function.
As confirmed at https://github.com/zufuliu/notepad2/issues/393, mshjdic.hanjadic does not work, so changed the code to try imkrhjd.hanjadic first.
Mayeb it's better to surround the
OpenMainDic()block andCloseMainDic()method with catch all (and markHanjaDic()andOpenHanjaDic()asnoexcept).attachment simplified
case Message::GetSelText:in Editor.cxx.Since
SelectionToHangulallocates so may throw, it doesn't help much to noexceptHanjaDic. I could see an argument to throw fromHanjaDicon, for example,E_NOMEMORYfromIHanjaDic::OpenMainDicbut that would only be worthwhile if we were being much more thorough with out-of-memory and had a strategy for handling it.Over time, my opinion on adding noexcept has changed. Constructors often allocate memory so they shouldn't be noexcept even though linters complain. Trivial value constructors that will never allocate like PRectangle should be noexcept. Destructors should be noexcept and try to handle any possible exceptions as it should be possible to finalize safely and completely.
Rule-of-zero means that the default operations should be concentrated in resource management classes and other classes should then use these classes and define no default operations themselves. Thus
HanjaDicshouldn't define a destructor (or assignment, ...). Instead,HJInterfaceshould be a specialization of std::unique_ptr likeLexerInstanceinDocument.h. Something likeThere's some more complexity as
CloseMainDicshould be called ifOpenMainDicsucceeded but that could be moved out of construction/destruction.Using
UnknownReleasersimplified the code a lot. I addedHanjaDicCloser, which seems fixed theCloseMainDicproblem.HanjaDic2.cxx contains a word based conversion and examples from https://www.koreanwikiproject.com/wiki/Hanja#Does_every_Hanja_character_have_only_one_sound_representation.3F
But both function gives same result on my Win10, don't know whether this is the expected behavior or not, possible it's not worth the complex.
Last edit: Zufu Liu 2021-11-16
Updated
HanjaDicCloserto following to not expose the raw pointer.Added BLUEnLIVE's comment (need wording updates, basically there is no need to make word based conversion) from
https://github.com/zufuliu/notepad2/issues/393#issuecomment-970404057
Destructors are default noexcept, so functions they call should be noexcept.
UnknownReleasercan be reused in ScintillaWin.cxx and PlatWin.cxx so there will be a new header containing it along withReleaseUnknownandDLLFunctionwhich are generically useful and don't refer to other Scintilla identifiers. Don't have a great name but will call itWinTypes.hif there is nothing better.Added catch all for the destructor. It seems
HanjaDicCloserwithIHanjaDic *HJinterfaceproduces small function code than withconst HanjaDic *dict;.Also added a thin wrapper
ScopedBSTRforSysAllocString()andSysFreeString();_bstr_tseems not usable on some mingw gcc,e.g. https://stackoverflow.com/questions/51363689/qt5-mingw-undefined-reference-to-convertstringtobstrFor the header, I think it better to prefix it.
WinTypes.hmay conflict with Windows SDK header names. I don't have a better name either,PlatWinBase.h?Added a generic template (similar to
unique_ptr)I think it can be used with
UnknownReleaserto replace someReleaseUnknown()in PlatWin.cxx, like following:Also, according to https://en.cppreference.com/w/cpp/memory/unique_ptr/%7Eunique_ptr
null check in custom deleter can be omitted.
Why define
ScopedPointerfor these cases when the standard library definesunique_ptr? Linters and other developers already know whatunique_ptris.unique_ptrdoesn't have&but that's because the semantics are difficult - should the original value be cleared first or be passed? For this code, passing a temporary, null BSTR, thenreseting it into the smart pointer will be easily understood. This could be moved up intoHanjaDic::HanjaToHangul, possibly returning aunique_ptrinstead of using an out argument.OK, changed to
using UniqueBSTR = std::unique_ptr<OLECHAR[], BSTRDeleter>;with a newHanjaToHangul(BSTR bstrHanja, UniqueBSTR &bstrHangul)method (as there is no? grantee it returns null on fails):The
HanjaToHangul(BSTR bstrHanja, BSTR* pbstrHangul)method is kept, so that in C++23 usingout_ptr(see https://en.cppreference.com/w/cpp/memory/out_ptr_t) the code can be simplified toif (dict.HanjaToHangul(bstrHanja.get(), std::out_ptr(bstrHangul))).Unsure if
std::unique_ptr<OLECHAR[], BSTRDeleter>is correct sinceBSTRisOLECHAR*, notOLECHAR[]*.std::unique_ptr<OLECHAR, BSTRDeleter>would seem more correct. That means[]isn't available when retrieving characters but*bstrHangulcan be used.Unused code shouldn't be included as it makes the file more difficult to understand and produces warnings.
Its more difficult to trace code when methods have the same name so the private
OpenHanjaDicshould be renamed.CloseMainDicisSTDMETHODwhich isvirtual COM_DECLSPEC_NOTHROW HRESULT STDMETHODCALLTYPEso its fine to makeCloseHanjaDicnoexcept and avoid the catch in~HanjaDicCloser. The difference withIUnknownis thatReleaseis justvirtual ULONG STDMETHODCALLTYPE.If we treat
STDMETHODasnoexcept, thenUniqueBSTRandHanjaDicCloseris not even needed at all, all method includingGetHangulOfHanjacan be marked asnoexcept.I would prefer to rename public
OpenHanjaDicandCloseHanjaDicjust asOpenandCloseas the class name already says it'sHanjaDic.Last edit: Zufu Liu 2021-11-20
OK, I thought you wanted allocations to be encapsulated, not just for exception safety. To me, BSTR is allocation so there is some value with encapsulating to ensure release but
Open|CloseMainDicare life cycle with less benefit as there's no handle returned byOpenMainDic.I wouldn't define
UniqueBSTRfor just this code but might if BSTR was used more extensively.OpenandCloseare fine.I currently can think only one usage for BSTR for other code: accessibility like Text and TextRange Control Patterns, maybe there are other COM APIs that use BSTR.
UniqueBSTRis similar toUniqueString, so I think provide index operator is appreciate.So, reverted the code to
using UniqueBSTR = std::unique_ptr<OLECHAR[], BSTRDeleter>;.unique_ptr<OLECHAR[]>is array ofOLECHARnot pointer to array ofOLECHAR(OLECHAR[]*? orOLECHAR**), so it behaviors same asunique_ptr<OLECHAR>(pointer toOLECHAR), except it has an index operator.Simplified code with
SysAllocStringLen:Committed as [bab27e]. Subsequent change adds WinTypes.h.
Related
Commit: [bab27e]
Coverity dislikes
UniqueBSTRasOLECHAR[]isn't the same asBSTR. This is in the expression :CWE-119