From: William S F. <ws...@fu...> - 2011-03-25 20:08:09
On 23/03/11 16:04, Soren Soe wrote:
> William S Fulton wrote:
>> On 18/03/11 19:37, Soren Soe wrote:
>>> Hi,
>>>
>>> I am curious about the default typemap for, for example, std::string
>>> conversion to and from Java String. Java string is Unicode, and on the
>>> C++ side std::string is constructed from 8 bit char. The C++ strings
>>> are not necessarily UTF8 encoded in the current language locale, so when
>>> the swig typemap (std_string.i) uses GetStringUTFChars on the jstring,
>>> the resulting 8bit string may be garbage depending on the locale. I am
>>> working on an application that runs under both *nix and Windows.
>>>
>>> The UTF8 conversion works fine on Linux, but not so on Windows, where I
>>> am trying to get my application running on a Japanese OS with locale set
>>> to Japanese_Japan.932. The string encoding on the C++ side uses a
>>> multi-byte representation for the native characters, but the encoding is
>>> not UTF8.
>>>
>>> My question is why the default string typemaps are coded to use
>>> GetStringUTFChars and NewStringUTF? Shouldn't they be written to use
>>> the std::codecvt facet from the standard C++ library? The codecvt
>>> facet will convert between wchar_t and char according to the current
>>> locale.
>>>
>>> I have written my own typemaps for std::string and char* to use the
>>> std::codecvt facet and my application is now behaving as expected on the
>>> Japanese OS. However, I am worried that I am missing something
>>> fundamental here; I find it hard to imagine that the default swig
>>> typemaps are not I18N compatible.
>>>
>>> Any help/comments would be greatly appreciated.
>>>
>>
>> I think it is simply that they were written with ASCII in mind
>> and no-one has used them for anything outside of that. I don't recall
>> this issue being brought up before.
>>
>> I suggest you put a patch to the current typemaps on the SourceForge
>> patch tracker. A simple test using UTF would be much appreciated for
>> the US locale for regression testing.
>>
>> How does this work for char * in C only mode?
>>
>> William
>>
>>
> I would be happy to patch the typemaps as you suggest. However, I am not
> an expert on writing typemaps so I am sure the ones I wrote are *not* up
> to the required standard.
>
Modifications to the current typemaps would surely be okay.

> What's worse is that writing code to use the std::codecvt is not exactly
> straightforward. I had to write some support code to interface with the
> codecvt facet. There was no way I would litter the typemaps with the raw
> code, plus the support code is used in other places too for string
> conversions, so currently the typemaps make use of the support code. If
> I knew more about the proper way of organizing typemaps and sharing code
> between typemaps, I could attempt a patch as suggested that wouldn't
> rely on my support code. I will look into doing this, but if you have
> any pointers or examples to get me started in the right direction please
> let me know.
>
For support code, this can be put into a function which will only be
generated if you use fragments, as described here:
http://www.swig.org/Doc2.0/Typemaps.html#Typemaps_fragments

> As for the char* and C. These typemaps must be written to use the C
> native wide character to/from multi-byte conversion routines. In fact,
> maybe the C native conversions should be used in the std::string
> typemaps? Since my project deals with C++/Java only I was able to add
> specific typemaps for char* that use the codecvt facet, e.g. same
> conversions as for std::string.
>
Yes, we need to accommodate the lowest common denominator, and that often
means C code even though C++ might be used.

William
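
The C-native wide character to/from multi-byte conversion mentioned in this
thread can be sketched as follows. This is only a minimal sketch under stated
assumptions, not the SWIG typemap code and not the support code Soren wrote:
the helper names wide_to_locale and locale_to_wide are hypothetical, and the
sketch assumes the wide string has already been obtained from the Java side
(for instance via the JNI GetStringChars call, which yields UTF-16 and so
matches wchar_t only where wchar_t is 16 bits, as on Windows). It relies on
the standard wcstombs/mbstowcs routines, which convert according to the
current C locale.

```cpp
#include <cstddef>    // std::size_t
#include <cstdlib>    // std::wcstombs, std::mbstowcs, MB_CUR_MAX
#include <stdexcept>  // std::runtime_error
#include <string>
#include <vector>

// Hypothetical helper (not part of SWIG): convert a wide string to the
// multi-byte encoding of the current C locale.
// Note: conversion stops at the first embedded L'\0'.
std::string wide_to_locale(const std::wstring& ws) {
    // Each wide character needs at most MB_CUR_MAX bytes; +1 for the terminator.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), ws.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("character not representable in current locale");
    }
    return std::string(buf.data(), n);
}

// Hypothetical helper (not part of SWIG): convert a multi-byte string in the
// current C locale to a wide string.
std::wstring locale_to_wide(const std::string& s) {
    // The result never has more wide characters than the input has bytes.
    std::vector<wchar_t> buf(s.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multi-byte sequence for current locale");
    }
    return std::wstring(buf.data(), n);
}
```

A program would normally call std::setlocale(LC_CTYPE, "") before using these
helpers so that the user's locale is in effect; in the default "C" locale,
typically only ASCII text converts. In a SWIG interface, helpers of this kind
could be placed in a fragment so that they are generated only when a typemap
uses them.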