
How to send non-ASCII text to R?

2016-02-06
2016-03-08
  • Adam Ryczkowski

    Adam Ryczkowski - 2016-02-06

    Excelsi-R doesn't handle Unicode as of now. I can receive proper Polish text from a stringVar variable in R when I use the following workaround in Excel: =REVAL("iconv(stringVar, from='UTF-8',to='CP1250')"). ('CP1250' is the Windows code page that covers Polish.)

    I see no workaround in the other direction, i.e. for sending Polish text (a variable label) into R.

    Can anyone tell me how to do it?

    I am a programmer; I know C++ and VB6, and I could make a fix myself if a project administrator contacts me and explains the details.

    Adam Ryczkowski

     
  • Ben Escoto

    Ben Escoto - 2016-02-09

    Hi Adam,

    Thanks for your interest in Excelsi-R. I would like Excelsi-R to handle
    non-ASCII strings, but I'm afraid I don't understand the specifics of
    how R and Windows handle the encodings of strings at the byte level.
    I'm not surprised Excelsi-R doesn't handle Unicode correctly, but not
    sure how to fix it.

    Since you're a C++/VBA programmer, let me describe what Excelsi-R does
    now and perhaps you can suggest a fix. The main functions are here:

    http://sourceforge.net/p/excelsir/code/ci/default/tree/Excelsi-R/excelsir_string.cpp

    Basically, Excel stores stuff inside cells in VARIANT format, so it
    seems strings inside variants are in BSTR format. The functions that
    convert BSTR to/from C++ std::string objects are pasted into the
    message below for convenience.

    Hmm, I'm mostly a Linux user, but I did a bit of googling and apparently
    under Windows I should generally be using std::wstring instead of
    std::string. I don't know if this is the problem, however.

    If you just need an ugly hack without any programming, presumably it's
    possible to quote Unicode inside Excel and get R to do the unquoting.
    For instance along the lines of eval(parse(text='"\u03b2"')).
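
    The escaping hack above could be done by any code that sees the text before it reaches R. As a minimal sketch, assuming the text is available as UTF-16 (as it is in a BSTR), the hypothetical helper ToEscapedRLiteral below (not part of Excelsi-R) writes every non-ASCII character as an R \uXXXX or \UXXXXXXXX escape, so that only ASCII has to survive the trip:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper, not part of Excelsi-R: build an ASCII-only R string
// literal from UTF-16 text. Non-ASCII characters become \uXXXX escapes
// (or \UXXXXXXXX for characters outside the Basic Multilingual Plane).
std::string ToEscapedRLiteral(const std::u16string& text)
{
    std::string out = "\"";
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t cp = text[i];
        // Combine a surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
            ++i;
        }
        if (cp == U'"' || cp == U'\\') {
            out += '\\';                       // keep the literal well-formed
            out += static_cast<char>(cp);
        } else if (cp >= 0x20 && cp < 0x7F) {
            out += static_cast<char>(cp);      // printable ASCII passes through
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, cp <= 0xFFFF ? "\\u%04x" : "\\U%08x",
                          static_cast<unsigned>(cp));
            out += buf;
        }
    }
    out += '"';
    return out;
}
```

    For the Polish word "żółw" this yields the ASCII text "\u017c\u00f3\u0142w" (quotes included), which R can turn back into the original string with eval(parse(text=...)) as in the example above.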

    Thanks,
    Ben

    // Forward declaration so ConvertBSTRToMBS can call it.
    std::string ConvertWCSToMBS(const wchar_t* pstr, long wslen);
    
    // BSTR -> std::string in the system ANSI code page (CP_ACP).
    std::string ConvertBSTRToMBS(BSTR bstr)
    {
        int wslen = ::SysStringLen(bstr);
        return ConvertWCSToMBS((wchar_t*)bstr, wslen);
    }
    
    std::string ConvertWCSToMBS(const wchar_t* pstr, long wslen)
    {
        if (wslen == 0)
            return std::string(); // wslen cannot be 0 below
    
        // First call: ask how many bytes the converted string needs.
        int len = ::WideCharToMultiByte(CP_ACP, 0 /* no flags */,
                                        pstr, wslen, NULL, 0, NULL, NULL);
    
        std::string dblstr(len, '\0');
        // Second call: convert; pstr is not necessarily NULL-terminated.
        len = ::WideCharToMultiByte(CP_ACP, 0 /* no flags */,
                                    pstr, wslen, &dblstr[0], len,
                                    NULL, NULL /* no default char */);
    
        return dblstr;
    }
    
    // std::string in the ANSI code page -> newly allocated BSTR.
    BSTR ConvertMBSToBSTR(const std::string& str)
    {
        int srclen = static_cast<int>(str.length());
        int wslen = ::MultiByteToWideChar(CP_ACP, 0 /* no flags */,
                                          str.data(), srclen,
                                          NULL, 0);
    
        BSTR wsdata = ::SysAllocStringLen(NULL, wslen);
        ::MultiByteToWideChar(CP_ACP, 0 /* no flags */,
                              str.data(), srclen,
                              wsdata, wslen);
        return wsdata;
    }
    
     

    Last edit: Ben Escoto 2016-02-09
  • Adam Ryczkowski

    Adam Ryczkowski - 2016-02-15

    Quite obviously, std::string is incapable of storing Unicode (http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring). I know that, technically speaking, strings are just arrays of bytes and the encoding shouldn't matter, but C++ has function overloads that can invoke the proper string recoding when you construct a wchar_t* from a _bstr_t (defined in comutil.h).

    On Windows, a BSTR already contains a string encoded in UTF-16, so all you (we?) need to do is use a proper conversion from UTF-16 to UTF-32 (std::wstring).

    So I guess, you need to do 2 things for a proper fix:
    1. Use the conversion from the accepted answer to http://stackoverflow.com/questions/6284524/bstr-to-stdstring-stdwstring-and-vice-versa (isn't it your question, by the way?) or better yet: http://stackoverflow.com/questions/3810730/convert-bstr-to-wstring. Don't use MultiByteToWideChar and WideCharToMultiByte, which cannot convert to UTF-32.
    2. Drop std::string in favor of std::wstring in every place that handles R commands (which might not be as simple as it sounds).
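
    To make the UTF-16 to UTF-32 step concrete, here is a minimal portable sketch. It assumes the BSTR contents have already been copied into a std::u16string; Utf16ToUtf32 is a hypothetical name, not something from Excelsi-R or the linked answers. Note that on Windows wchar_t is 16 bits wide, so a std::wstring there holds UTF-16 code units and a BSTR can be copied into it unchanged; decoding like this is only needed where real 32-bit code points are wanted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-16 code units into UTF-32 code points.
// Unpaired surrogates are replaced with U+FFFD (the replacement character).
std::u32string Utf16ToUtf32(const std::u16string& in)
{
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t unit = in[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {            // high surrogate
            if (i + 1 < in.size()
                && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                char32_t low = in[i + 1];
                ++i;                                       // consume the low half
                out.push_back(static_cast<char32_t>(
                    0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)));
            } else {
                out.push_back(U'\uFFFD');                  // missing low half
            }
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {     // stray low surrogate
            out.push_back(U'\uFFFD');
        } else {
            out.push_back(unit);                           // BMP: copied as-is
        }
    }
    return out;
}
```

    Polish letters all live in the Basic Multilingual Plane, so for them every UTF-16 unit maps to exactly one code point; the surrogate branch only matters for characters above U+FFFF.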

    I haven't tested this yet. I haven't programmed in C++ on Windows for about 10 years (I am also on Linux), so testing it will not be straightforward for me. Can you do it instead?

     
  • Adam Ryczkowski

    Adam Ryczkowski - 2016-02-15

    If you don't, I'll try to test it myself sometime next week.

     
  • Ben Escoto

    Ben Escoto - 2016-02-16

    Thanks for the links. I can look into this, but I'm a bit busy at the moment, so I won't be able to start until late next week.

    Your step #1 looks straightforward. Step #2 hopefully isn't too bad either, but it would involve figuring out what format R stores its strings in, and converting them accordingly. That might just be a bit of googling, or it might degenerate into reading through the source of R and Rserve, worrying about platform-specific issues, etc.

     
  • Ben Escoto

    Ben Escoto - 2016-02-29

    Hi, I just looked into this a bit. It seems straightforward; the one wrinkle is that R may use different default encodings on different platforms. By default, Rserve strings are encoded using the "unknown"/default encoding. (Rserve does seem to check the string encoding type and recode the strings if necessary.) However, there is an option (either in the configuration file or on the command line) to tell Rserve to use UTF-8. See: http://www.rforge.net/Rserve/doc.html.

    Would it be acceptable to you if Excelsi-R required Rserve to use UTF-8? According to the link above, the Java clients already require UTF-8, and it may become the default in the future anyway. It would be more work to handle the "default" encoding, because that could differ from client to server. Conceptually, the "default" encoding on the server might even be unknown/unavailable on the client.

     
  • Ben Escoto

    Ben Escoto - 2016-03-08

    I went ahead and made the change. I believe it's as simple as changing CP_ACP to CP_UTF8 in the code above, and switching the encoding of the Rserve server to UTF-8. There should be no issues holding Unicode inside std::string.
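
    To illustrate why std::string is enough once the target encoding is UTF-8: it is only a container of bytes, and UTF-8 is a byte encoding. The sketch below is a portable stand-in for what WideCharToMultiByte does with CP_UTF8, not the Excelsi-R code itself; Utf16ToUtf8 is a hypothetical name and the input is assumed to be UTF-16, as in a BSTR.

```cpp
#include <cstddef>
#include <string>

// Hypothetical portable stand-in for WideCharToMultiByte(CP_UTF8, ...):
// encode UTF-16 code units as UTF-8 bytes stored in a plain std::string.
std::string Utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Surrogate pair: combine into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;                       // unpaired surrogate
        }
        if (cp < 0x80) {                       // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes (covers Polish letters)
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

    One consequence to keep in mind: size() on the result counts bytes, not characters, so "ł" (U+0142) occupies two elements of the std::string.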

    I want to make one other minor change, then I'll release a new version of Excelsi-R.

     
