#1199 C# Unicode Solution

open
csharp (36)
5
2013-11-19
2011-10-18
mybluedog
No

C# is not handling Unicode strings properly on Windows. It's converting them to ANSI.

By default, .NET marshals strings as TCHARs. These are effectively ANSI chars on my Windows 7 machine (which is surprising).

To correct this problem, there need to be two changes. Only one of them can be done with an .i file. The other requires a change to SWIG.

Firstly, SWIGWStringHelper needs 'MarshalAs' directives as shown (an .i file can do this)...

protected class SWIGWStringHelper {

    [return: MarshalAs(UnmanagedType.BStr)]
    public delegate string SWIGWStringDelegate(IntPtr message);

    static SWIGWStringDelegate wstringDelegate = new SWIGWStringDelegate(CreateWString);

    [DllImport("$dllimport", EntryPoint="SWIGRegisterWStringCallback_$module")]
    public static extern void SWIGRegisterWStringCallback_$module(SWIGWStringDelegate wstringDelegate);

    [return: MarshalAs(UnmanagedType.BStr)]
    static string CreateWString([MarshalAs(UnmanagedType.LPWStr)]IntPtr cString) {
        return System.Runtime.InteropServices.Marshal.PtrToStringUni(cString);
    }

    static SWIGWStringHelper() {
        SWIGRegisterWStringCallback_$module(wstringDelegate);
    }
}

static protected SWIGWStringHelper swigWStringHelper = new SWIGWStringHelper();

Secondly, in PINVOKE.cs, all string-related DllImport directives need to have 'CharSet = CharSet.Unicode' appended to them, as shown:

[DllImport("xxx", EntryPoint="xxx", CharSet = CharSet.Unicode)]

(This requires a change in SWIG itself as the DllImport cannot be manipulated through an .i file).
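
As a minimal sketch of what the CharSet setting changes, the following declares a P/Invoke import with CharSet = CharSet.Unicode. It uses the Win32 function lstrlenW from kernel32.dll purely as a stand-in for a SWIG-generated entry point; that choice, and the names CharSetDemo and NativeLength, are assumptions for illustration and are not part of SWIG's output. The native call is skipped on non-Windows platforms.

```csharp
using System;
using System.Runtime.InteropServices;

static class CharSetDemo
{
    // CharSet.Unicode makes the marshaler pass System.String as a
    // null-terminated UTF-16 buffer (LPWSTR) instead of an ANSI one.
    // ExactSpelling stops the runtime from probing for suffixed names.
    [DllImport("kernel32.dll", EntryPoint = "lstrlenW",
               CharSet = CharSet.Unicode, ExactSpelling = true)]
    static extern int lstrlenW(string text);

    // Returns the length in UTF-16 code units as seen by native code,
    // or -1 when not running on Windows (hypothetical helper).
    public static int NativeLength(string text)
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
            return -1;
        return lstrlenW(text);
    }

    static void Main()
    {
        // "caf\u00E9" is "café" with é as a single UTF-16 code unit.
        Console.WriteLine(NativeLength("caf\u00E9"));
    }
}
```

On Windows the native side should see all four characters of the string. The same import without the CharSet setting would hand the function an ANSI buffer, which a wide-character entry point cannot interpret correctly.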

Discussion

  • .NET has always used ANSI marshaling by default; AFAIK it is not dependent on which version of Windows you are running (regardless of what the documentation says).

    In order for TCHAR to be defined as wchar_t in C++, build your project with UNICODE defined (in Visual Studio this is accomplished by setting General | Character Set = Use Unicode Character Set). SWIG should also be told about TCHAR somehow, e.g. with a typedef:

    typedef wchar_t TCHAR;

    I customized my SWIG code to support Unicode properly a couple of years back and I didn't need the "CharSet = CharSet.Unicode" settings. Instead I used the "inattributes" attribute of imtype:

    %typemap(imtype, out="IntPtr", inattributes="[MarshalAs(UnmanagedType.LPWStr)]") wchar_t *, wchar_t[ANY], wchar_t[] "string"

    I handled return values using IntPtr and Marshal.PtrToStringUni in the csout typemap.

    However, if I remember correctly, nowadays SWIG supports Unicode properly. I'm not sure why it's not working for you.
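
The IntPtr plus Marshal.PtrToStringUni approach described in the comment above can be sketched without any native library. This is only an illustration: the buffer is allocated with Marshal.StringToHGlobalUni as a stand-in for a wchar_t* returned from C++ (which matches only where wchar_t is 16 bits, as on Windows), and the names PtrToStringUniDemo and RoundTrip are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

static class PtrToStringUniDemo
{
    // Copies a managed string into unmanaged memory as UTF-16, then reads
    // it back the way a csout typemap could read a returned wide string.
    public static string RoundTrip(string text)
    {
        // Stand-in for a wchar_t* handed back by native code.
        IntPtr native = Marshal.StringToHGlobalUni(text);
        try
        {
            // Reads UTF-16 code units up to the terminating null.
            return Marshal.PtrToStringUni(native);
        }
        finally
        {
            Marshal.FreeHGlobal(native);
        }
    }

    static void Main()
    {
        string original = "caf\u00E9";
        Console.WriteLine(RoundTrip(original) == original);
    }
}
```

Because the pointer is passed as a plain IntPtr, the default string marshaling never touches the data, so no ANSI conversion can happen on this path.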

     
  • mybluedog
    2011-10-19

    Thanks for that - I am a noob in both C# and SWIG.

    All my C++ is using wchar_t. Stepping through with a debugger, it's clear that all the problems are happening in C#. In particular, the CreateWString function in class SWIGWStringHelper is definitely converting incoming wide strings down to ANSI. The only ways I can think of to prevent that are by using the 'MarshalAs' directives, or by somehow telling C# at the project level that it should marshal in Unicode by default. I looked for C# project settings as a first idea but failed to find anything. It still seems amazing to me that C# favours Unicode everywhere except for marshaling!

    Perhaps I have missed something in your comment, but I can't see how it solves the problem in SWIGWStringHelper::CreateWString.

     
  • SWIGWStringHelper.CreateWString returns a BSTR, which is a UTF-16 string type. What's the problem? It looks to me like CreateWString is only used for returning std::wstring, not wchar_t*. Did you verify with a debugger that SWIG does what you think it does?

     
  • mybluedog
    2011-10-25

    I should state the problem with a bit more accuracy and detail.

    I am returning std::wstrings from C++.

    I added the two lines that say "[return: MarshalAs(UnmanagedType.BStr)]" to Swig's .net code.

    Without these lines, .net converts the return value to an ANSI string right after the line that says "return System.Runtime.InteropServices.Marshal.PtrToStringUni(cString);".

     
  • Mike Christel
    2012-03-13

    The bug still exists with current Indri wrappers for C#. My concern is being able to search UTF-8 such as “café” from C#. I cannot get the UTF-8 query across via either runQuery or runAnnotatedQuery such that I am ever given non-zero results. The query always returns 0 results, even though the term “café” is present in the index and will return results when used via IndriRunQuery (or, I am told, when using the Java interface; hence, this bug is restricted to the C# wrapper). I need to produce results via a C# service, though, and so IndriRunQuery is not an option for me.

    This issue was seemingly addressed a long time ago: http://www.lemurproject.org/phorum/read.php?11,4029
    However, that old commentary does not apply to the current state of things.

     
  • Mike Christel
    2012-03-13

    Here is a bit more detail on my attempt to follow the earlier advice to use a "unicode2utf" function. It still fails to get runQuery or runAnnotatedQuery to work properly with a UTF-8 query string.

    /// <summary>
    /// Return utf version of unicode string.
    /// </summary>
    /// <param name="str">given string</param>
    /// <returns>converted string</returns>
    public String unicode2utf(string str)
    {
        // note: same failed behavior with runQuery regardless of BOM use
        // (true or false as the parameter to UTF8Encoding in the next line)
        UTF8Encoding myUTF8Encoder = new UTF8Encoding(true);
        byte[] buff = myUTF8Encoder.GetBytes(str);

        return myUTF8Encoder.GetString(buff);
    }

    // With a valid QueryEnvironment set up as indriQE, with ASCII queries all returning
    // expected counts, I try the following with queryInIndriQueryLanguage simply the one
    // word café, which should return 1 result but always returns 0.
    Indri.ScoredExtentResultVector checkThese = indriQE.runQuery(unicode2utf(queryInIndriQueryLanguage), maxIndriResultsToConsider);
    str_seg_IDs = indriQE.documentMetadata(checkThese, "docno");
    // always zero results for queryInIndriQueryLanguage == café

    // Same with runAnnotatedQuery:
    // Do the query through Indri, asking for at most nMaxSegmentsToFetch results
    // (which get other data posted/loaded as well for subsequent match details lookup).
    Indri.QueryAnnotation qa = indriQE.runAnnotatedQuery(unicode2utf(queryInIndriQueryLanguage), maxIndriResultsToConsider);

    try
    {
        // obtain the segment ID (as "docno") from the metadata of the documents
        res = qa.getResults();
        // always res.Count == 0 for queryInIndriQueryLanguage == café
        if (res != null && res.Count > 0)
        {
            str_seg_IDs = indriQE.documentMetadata(res, "docno");
            // etc.
        }
    }
    catch (Exception ex)
    {
        // etc.
    }
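
One observation that may explain why the helper makes no difference, offered as a sketch rather than a diagnosis of the Indri wrapper: a .NET string is always UTF-16 in memory, so encoding it to UTF-8 bytes and decoding those bytes again yields a string equal to the input. The UTF-8 form exists only in the intermediate byte array. The names Utf8Demo and EncodeThenDecode below are hypothetical.

```csharp
using System;
using System.Text;

static class Utf8Demo
{
    // Same shape as the unicode2utf helper: encode to UTF-8, then decode again.
    public static string EncodeThenDecode(string text)
    {
        var utf8 = new UTF8Encoding(true);
        byte[] bytes = utf8.GetBytes(text);   // GetBytes does not emit a BOM
        return utf8.GetString(bytes);
    }

    static void Main()
    {
        string query = "caf\u00E9";

        // The round trip gives back an equal string.
        Console.WriteLine(EncodeThenDecode(query) == query);   // prints "True"

        // The actual UTF-8 representation is the byte array itself.
        byte[] bytes = Encoding.UTF8.GetBytes(query);
        Console.WriteLine(BitConverter.ToString(bytes));       // prints "63-61-66-C3-A9"
    }
}
```

If the native side expects UTF-8, it is those bytes that would have to reach it intact. Whether the generated wrapper converts the string to ANSI on the way to native code is a separate question that this sketch does not address.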