Hello,
I'm trying to understand how TinyXML interfaces with the outside "world", in other words its API.
TinyXML returns all strings as char (one-byte) strings, am I right?
So, if the parsed XML document is encoded in UTF-8, TinyXML returns char strings (byte strings) containing UTF-8 encoded Unicode content, am I right?
I use TinyXML on Windows CE 4.2 (Pocket PC 2003) and I have to work with Microsoft Unicode strings (wide-char strings).
So, I have the following function to convert Unicode strings encoded in one-byte chars (seems strange, right? ;-) to wide-char Unicode strings.
Here I use wstring (compiled with _UNICODE defined).
#include <windows.h>
#include <string>

std::wstring UnicodeStringToWString(const std::string& s)
{
    // Unicode is encoded as UTF-8 in one-byte-char strings, so
    // we have to use the CP_UTF8 code page in the conversion.
    // Get the required buffer size in wide chars (including the terminating null)
    int len = ::MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, NULL, 0);
    if (len <= 0)
        return std::wstring(); // conversion failed
    // Allocate wide-char buffer
    wchar_t* buffer = new wchar_t[len];
    // Translate the UTF-8 chars to wide chars
    len = ::MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, buffer, len);
    // Build the result without the terminating null; stay empty on failure
    std::wstring result;
    if (len > 0)
        result.assign(buffer, len - 1);
    delete [] buffer;
    return result;
}
It may seem strange that I use CP_UTF8, because MultiByteToWideChar is usually called with the CP_ACP code page.
But CP_UTF8 has to be used to tell the API call that it should treat the input char string as UTF-8 and only convert it to a wide-char string.
If CP_ACP is used, the bytes are interpreted in the system ANSI code page, so characters outside plain ASCII come out wrong.
I tested it with Simplified Chinese, Russian and Polish characters.
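To illustrate why the code page matters, here is a small portable sketch that decodes UTF-8 by hand into code points. Utf8ToCodePoints is a hypothetical helper written only for this illustration; it is not part of TinyXML or the Windows API, and it does not reproduce everything MultiByteToWideChar does (for example, it does not reject overlong forms or produce UTF-16 surrogate pairs).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (assumption, for illustration only): decode a UTF-8
// byte string into Unicode code points. Malformed or truncated sequences
// are replaced with U+FFFD; overlong forms and surrogates are not rejected.
std::vector<unsigned long> Utf8ToCodePoints(const std::string& s)
{
    std::vector<unsigned long> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        unsigned long cp;
        std::size_t extra; // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; } // stray byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    return out;
}
```

For example, "Łoskot" is 7 bytes in UTF-8 ("\xC5\x81oskot") but only 6 code points, the first being U+0141. A conversion that maps every byte to one character, as a single-byte code page does, sees 7 characters instead, which is why the non-ASCII letters get mangled.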
I would like to make sure that I understand TinyXML correctly.
So, any comments are welcome.
Mateusz Łoskot