From: SourceForge.net <no...@so...> - 2012-03-13 13:17:43
Bugs item #3425505, was opened at 2011-10-18 16:23
Message generated for change (Comment added) made by mikegc
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=101645&aid=3425505&group_id=1645

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: csharp
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: mybluedog (mybluedog)
Assigned to: William Fulton (wsfulton)
Summary: C# Unicode Solution

Initial Comment:
C# is not handling Unicode strings properly on Windows. It's converting them to ANSI.

By default, .net marshals strings as tchars. This is effectively ANSI chars on my Windows 7 machine (which is surprising).

To correct this problem, there need to be two changes. Only one of them can be done with an .i file. The other requires a change to SWIG.

Firstly, SWIGWStringHelper needs 'MarshalAs' directives as shown (an .i file can do this)...

  protected class SWIGWStringHelper {

    [return: MarshalAs(UnmanagedType.BStr)]
    public delegate string SWIGWStringDelegate(IntPtr message);
    static SWIGWStringDelegate wstringDelegate = new SWIGWStringDelegate(CreateWString);

    [DllImport("$dllimport", EntryPoint="SWIGRegisterWStringCallback_$module")]
    public static extern void SWIGRegisterWStringCallback_$module(SWIGWStringDelegate wstringDelegate);

    [return: MarshalAs(UnmanagedType.BStr)]
    static string CreateWString([MarshalAs(UnmanagedType.LPWStr)]IntPtr cString) {
      return System.Runtime.InteropServices.Marshal.PtrToStringUni(cString);
    }

    static SWIGWStringHelper() {
      SWIGRegisterWStringCallback_$module(wstringDelegate);
    }
  }

  static protected SWIGWStringHelper swigWStringHelper = new SWIGWStringHelper();

Secondly, in PINVOKE.cs, all string-related DllImport directives need to have 'CharSet = CharSet.Unicode' appended to them as shown.

  [DllImport("xxx", EntryPoint="xxx", CharSet = CharSet.Unicode)]

(This requires a change in SWIG itself, as the DllImport cannot be manipulated through an .i file.)

----------------------------------------------------------------------

Comment By: Mike Christel (mikegc)
Date: 2012-03-13 06:17

Message:
A bit more detail on an attempt to follow earlier advice and use a "unicode2utf" function, which still fails to get runQuery or runAnnotatedQuery to work properly with a UTF-8 query string.

  /// <summary>
  /// Return utf version of unicode string.
  /// </summary>
  /// <param name="str">given string</param>
  /// <returns>converted string</returns>
  public String unicode2utf(string str)
  {
      // note: same failed behavior with runQuery regardless of BOM use and
      // true or false in next line as parameter to UTF8Encoding
      UTF8Encoding myUTF8Encoder = new UTF8Encoding(true);
      byte[] buff = myUTF8Encoder.GetBytes(str);
      return myUTF8Encoder.GetString(buff);
  }

  // With a valid QueryEnvironment set up as indriQE with ASCII queries all returning
  // expected counts, I try the following with queryInIndriQueryLanguage simply the one
  // word café which should return 1 result but always returns 0.
  Indri.ScoredExtentResultVector checkThese = indriQE.runQuery(unicode2utf(queryInIndriQueryLanguage), maxIndriResultsToConsider);
  str_seg_IDs = indriQE.documentMetadata(checkThese, "docno"); // always (zero results) for queryInIndriQueryLanguage == café

  // Same with runAnnotatedQuery:
  // Do the query through Indri, asking for at most nMaxSegmentsToFetch results (which get other data posted/loaded as well
  // for subsequent match details lookup).
  Indri.QueryAnnotation qa = indriQE.runAnnotatedQuery(unicode2utf(queryInIndriQueryLanguage), maxIndriResultsToConsider);
  try
  {
      // obtain the segment ID (as "docno") from the metadata of the documents
      res = qa.getResults(); // always res.Count == 0 for queryInIndriQueryLanguage == café
      if (res != null && res.Count > 0)
      {
          str_seg_IDs = indriQE.documentMetadata(res, "docno");
          // etc.
      }
  }
  catch (Exception ex)
  {
      // etc.
  }

----------------------------------------------------------------------

Comment By: Mike Christel (mikegc)
Date: 2012-03-13 06:10

Message:
The bug still exists with current Indri wrappers for C#. My concern is being able to search UTF-8 such as “café” from C#. I cannot get the UTF-8 query across via either runQuery or runAnnotatedQuery such that I am ever given non-zero results. The query always returns 0 results, even though the term “café” is present in the index and will return results when used via IndriRunQuery (or, I am told, when using the Java interface; hence, this bug is restricted to the C# wrapper). I need to produce results via a C# service, though, and so IndriRunQuery is not an option for me.

This issue was seemingly addressed a long time ago:
http://www.lemurproject.org/phorum/read.php?11,4029

However, that old commentary does not apply to the current state of things.

----------------------------------------------------------------------

Comment By: mybluedog (mybluedog)
Date: 2011-10-24 18:26

Message:
I should state the problem with a bit more accuracy and detail.

I am returning std::wstrings from C++.

I added the two lines that say "[return: MarshalAs(UnmanagedType.BStr)]" to Swig's .net code. Without these lines, .net converts the return value to an ANSI string right after the line that says "return System.Runtime.InteropServices.Marshal.PtrToStringUni(cString);".

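One way to check the claim above in isolation is a small managed-only round trip. This is only a sketch, not part of SWIG or of the reporter's code: the class name WideStringRoundTrip and the test string are made up for illustration. It copies a string into unmanaged memory as UTF-16 and reads it back with Marshal.PtrToStringUni. If the characters survive here, PtrToStringUni itself is not what narrows the string, which would be consistent with the loss happening afterwards, when the delegate's return value is marshaled.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-alone check; not taken from the SWIG-generated code.
static class WideStringRoundTrip
{
    static void Main()
    {
        // Illustrative non-ASCII input ("café" with an escaped e-acute).
        string original = "caf\u00e9";

        // Copy the managed string into unmanaged memory as null-terminated UTF-16.
        IntPtr native = Marshal.StringToHGlobalUni(original);
        try
        {
            // Same call that CreateWString uses on the incoming pointer.
            string roundTripped = Marshal.PtrToStringUni(native);

            // Prints True if no characters were lost in the round trip.
            Console.WriteLine(roundTripped == original);
        }
        finally
        {
            // Release the unmanaged buffer allocated above.
            Marshal.FreeHGlobal(native);
        }
    }
}
```

This only exercises the managed side. It says nothing about what the native callback receives once the return value crosses the P/Invoke boundary, which is where the MarshalAs attribute comes into play.
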
----------------------------------------------------------------------

Comment By: David Piepgrass (qwertie)
Date: 2011-10-20 09:33

Message:
SWIGWStringHelper.CreateWString returns a BSTR, which is a UTF-16 string type. What's the problem? It looks to me like CreateWString is only used for returning std::wstring, not wchar_t*.

Did you verify with a debugger that SWIG does what you think it does?

----------------------------------------------------------------------

Comment By: mybluedog (mybluedog)
Date: 2011-10-19 15:41

Message:
Thanks for that - I am a noob in both C# and SWIG.

All my C++ is using wchar. Stepping through with a debugger it's clear that all the problems are happening in C#. In particular, the CreateWString function in class SWIGWStringHelper is definitely converting incoming wide strings down to ANSI. The only ways I can think to prevent that are by using the 'MarshalAs' directives, or by somehow telling C# on a project level that it should be marshalling in Unicode by default. I looked for C# project settings as a first idea but failed to find anything. It still seems amazing to me that C# seems to favour Unicode everywhere except for marshaling!

Perhaps I have missed something in your comment, but I can't see how it solves the problem in SWIGWStringHelper::CreateWString.

----------------------------------------------------------------------

Comment By: David Piepgrass (qwertie)
Date: 2011-10-19 11:40

Message:
.NET has always used ANSI marshaling by default; AFAIK it is not dependent on which version of Windows you are running (regardless of what the documentation says).

In order for TCHAR to be defined as wchar_t in C++, build your project with UNICODE defined (in Visual Studio this is accomplished by setting General | Character Set = Use Unicode Character Set.) SWIG should also be told about TCHAR somehow, e.g.
with a typedef:

  typedef wchar_t TCHAR;

I customized my SWIG code to support unicode properly a couple of years back and I didn't need the "CharSet = CharSet.Unicode" things. Instead I used the "inattributes" attribute of imtype:

  %typemap(imtype, out="IntPtr",
           inattributes="[MarshalAs(UnmanagedType.LPWStr)]")
      wchar_t *, wchar_t[ANY], wchar_t[] "string"

I handled return values using IntPtr and Marshal.PtrToStringUni in the csout typemap.

However, if I remember correctly, nowadays SWIG supports unicode properly. I'm not sure why it's not working for you.