Excelsi-R doesn't handle Unicode as of now. I can receive proper Polish text from a stringVar variable in R when I use the following workaround in Excel: =REVAL("iconv(stringVar, from='UTF-8', to='CP1250')"). (CP1250 is the code page Windows uses for Polish.)
I see no workaround for the other direction, i.e. sending Polish text (a variable label) into R.
Can anyone tell me how to do it?
I am a programmer; I know C++ and VB6, and I might make a fix myself if a project administrator contacts me and explains the details.
Adam Ryczkowski
Thanks for your interest in Excelsi-R. I would like Excelsi-R to handle
non-ASCII strings, but I'm afraid I don't understand the specifics of
how R and Windows handle the encodings of strings at the byte level.
I'm not surprised Excelsi-R doesn't handle Unicode correctly, but I'm
not sure how to fix it.
Since you're a C++/VBA programmer, let me describe what Excelsi-R does
now and perhaps you can suggest a fix. The main functions are here:
http://sourceforge.net/p/excelsir/code/ci/default/tree/Excelsi-R/excelsir_string.cpp
Basically, Excel stores stuff inside cells in VARIANT format, so it
seems strings inside variants are in BSTR format. The functions that
convert BSTR to/from C++ std::string objects are pasted into the
message below for convenience.
Hmm, I'm mostly a linux user, but I did a bit of googling and apparently
under Windows I should generally be using std::wstring instead of
std::string. I don't know if this is the problem, however.
If you just need an ugly hack without any programming, presumably it's
possible to quote unicode inside Excel and get R to do the unquoting.
For instance along the lines of eval(parse(text='"\u03b2"')).
Thanks,
Ben
```cpp
std::string ConvertBSTRToMBS(BSTR bstr)
{
    int wslen = ::SysStringLen(bstr);
    return ConvertWCSToMBS((wchar_t*)bstr, wslen);
}

std::string ConvertWCSToMBS(const wchar_t* pstr, long wslen)
{
    int len = ::WideCharToMultiByte(CP_ACP, 0, pstr, wslen, NULL, 0, NULL, NULL);

    std::string dblstr(len, '\0');
    if (wslen == 0) return dblstr; // wslen cannot be 0 below
    len = ::WideCharToMultiByte(CP_ACP, 0 /* no flags */,
                                pstr, wslen /* not necessary NULL-terminated */,
                                &dblstr[0], len,
                                NULL, NULL /* no default char */);
    return dblstr;
}

BSTR ConvertMBSToBSTR(const std::string& str)
{
    int wslen = ::MultiByteToWideChar(CP_ACP, 0 /* no flags */,
                                      str.data(), str.length(),
                                      NULL, 0);

    BSTR wsdata = ::SysAllocStringLen(NULL, wslen);
    ::MultiByteToWideChar(CP_ACP, 0 /* no flags */,
                          str.data(), str.length(),
                          wsdata, wslen);
    return wsdata;
}
```
Last edit: Ben Escoto 2016-02-09
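The escaping hack mentioned above could in principle be automated on the client side. The following is only a rough sketch, not Excelsi-R code: EscapeForR is a hypothetical helper name, and it assumes the text is available as UTF-16 code units (as in a BSTR). It writes every non-ASCII character as an R \uXXXX or \UXXXXXXXX escape, so that only ASCII has to survive the trip and R rebuilds the characters when it parses the literal.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of Excelsi-R): build an ASCII-only R string
// literal from UTF-16 code units. Non-ASCII characters become \uXXXX, and
// characters outside the Basic Multilingual Plane become \UXXXXXXXX.
std::string EscapeForR(const std::u16string& text)
{
    std::string out = "\"";
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t cp = text[i];

        // Combine a high/low surrogate pair into one code point.
        // An unpaired surrogate is passed through as a \uXXXX escape.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
            ++i;
        }

        char buf[16];
        if (cp == U'"' || cp == U'\\') {
            out += '\\';                       // keep the literal well-formed
            out += static_cast<char>(cp);
        } else if (cp >= 0x20 && cp < 0x7F) {
            out += static_cast<char>(cp);      // printable ASCII, copied as-is
        } else if (cp <= 0xFFFF) {
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
            out += buf;
        } else {
            std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    out += '"';
    return out;
}
```

With this sketch, a string holding the single character U+03B2 comes out as the ASCII literal "\u03b2", the same literal used in the eval(parse(...)) example above.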
Quite obviously, std::string is incapable of storing Unicode (http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring). I know that, technically speaking, strings are just arrays of bytes and the encoding shouldn't matter, but C++ has function overloads that can invoke the proper string recoding when you construct a wchar_t* from a _bstr_t (defined in comutil.h).
On Windows a BSTR already contains a string encoded in UTF-16, so all you (we?) need to do is use a proper conversion from UTF-16 to UTF-32 (std::wstring).
So I guess you need to do 2 things for a proper fix:
1. Employ the conversion from the accepted answer to http://stackoverflow.com/questions/6284524/bstr-to-stdstring-stdwstring-and-vice-versa (isn't it your question, by the way?) or better yet: http://stackoverflow.com/questions/3810730/convert-bstr-to-wstring. Don't use MultiByteToWideChar and WideCharToMultiByte, which cannot convert to UTF-32.
2. Drop std::string in favor of std::wstring in every place that handles an R command (which might not be as simple as it sounds).
I haven't tested it yet. I haven't programmed in C++ on Windows for about 10 years (I am also based on Linux), so testing it will not be straightforward for me. Can you do it instead?
If you don't, I'll try to test it myself sometime next week.
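A side note on the UTF-16 versus UTF-32 point, offered as a small check rather than a tested fix: the width of wchar_t is platform-dependent. With Microsoft's compilers it is 16 bits, so std::wstring holds UTF-16 code units there, while on typical Linux toolchains it is 32 bits. The function names below are made up for illustration; they count how many code units the character U+1F600, which lies outside the Basic Multilingual Plane, occupies in each string type.

```cpp
#include <cstddef>
#include <string>

// Code units needed for U+1F600 in each encoding form.
// UTF-16 needs a surrogate pair; UTF-32 needs a single unit.
std::size_t Utf16Units() { return std::u16string(u"\U0001F600").size(); }
std::size_t Utf32Units() { return std::u32string(U"\U0001F600").size(); }

// std::wstring follows whatever width the platform gives wchar_t.
std::size_t WideUnits() { return std::wstring(L"\U0001F600").size(); }

// True where wchar_t is 16 bits wide (for example with MSVC on Windows).
bool WideStringIsUtf16() { return sizeof(wchar_t) == 2; }
```

Where WideStringIsUtf16() is true, copying a BSTR into a std::wstring keeps the text in UTF-16 and involves no recoding at all.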
Thanks for the links. I can look into this, but I'm a bit busy at the moment so won't be able to start until late next week.
Your step #1 looks straightforward. Step #2 hopefully isn't too bad either, but it would involve figuring out what format R stores its strings in, and converting them accordingly. That might just be a bit of googling, or it might degenerate into reading through the source of R and Rserve, worrying about platform-specific issues, etc.
Hi, I just looked into this a bit. It seems straightforward; the one wrinkle is that R may use different default encodings on different platforms. By default, Rserve strings are coded using the "unknown"/default encoding. (Rserve does seem to check the string encoding type and recode the strings if necessary.) However, there is an option (either in the configuration file or on the command line) to tell Rserve to use UTF-8. See: http://www.rforge.net/Rserve/doc.html.
Would it be acceptable to you if Excelsi-R required Rserve to use UTF-8? According to the link above, the Java clients already require UTF-8, and it may become the default in the future anyway. It would be more work to handle the "default" encoding, because that could differ from client to server. Conceptually, the "default" encoding on the server might even be unknown/unavailable on the client.
I went ahead and made the change. I believe it's as simple as changing CP_ACP to CP_UTF8 in the code above, and switching the encoding of the Rserve server to UTF-8. There should be no issues holding Unicode inside std::string.
I want to make one other minor change, then I'll release a new version of Excelsi-R.
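To illustrate why a std::string can hold Unicode once the target encoding is UTF-8, here is a portable sketch of the UTF-16 to UTF-8 direction. It is not the Excelsi-R code and does not call the Windows API: Utf16ToUtf8 is a hypothetical name, and the function is a hand-written stand-in for what WideCharToMultiByte does when given CP_UTF8, under the assumption that the input is UTF-16 as in a BSTR.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a CP_UTF8 conversion: encode UTF-16 code units
// as UTF-8 bytes stored in a std::string.
std::string Utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];

        // Combine a valid surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate: use the replacement character
        }

        if (cp < 0x80) {                       // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The Polish letter ł (U+0142), for instance, becomes the two bytes C5 82. The std::string then carries those bytes unchanged, provided the Rserve side is also configured for UTF-8.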