If you want to get or set Unicode values using TinyXml on a Win32 system, you need to use WideCharToMultiByte to convert wchar_t* to char*.
If you pass Unicode text through char* without converting it first, you may get wrong results.
The code looks like this:
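#include <windows.h>   // WideCharToMultiByte, CP_UTF8
#include "tinyxml.h"
#ifdef TIXML_USE_STL
#include <iostream>
#include <sstream>
using namespace std;
#else
#include <stdio.h>
#endif
#include <stdlib.h>    // exit
#include <wchar.h>     // wcslen
#include <string>
#define XRB_DOM_TRANS_BUF_LEN 200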
// Converts a NUL-terminated wide string to the given code page (UTF-8 by default).
std::string Transcode(const wchar_t *pwszString, unsigned int nCodePage = CP_UTF8)
{
    std::string strRet;
    if (!pwszString) return strRet; // nothing to convert
    int nRawLen = (int)wcslen(pwszString);
    // safe upper bound for UTF-8: at most 3 bytes per UTF-16 code unit
    int nReqLen = nRawLen<<2;
    char szDst[XRB_DOM_TRANS_BUF_LEN], *pszDst=szDst; // initially, point to the stack array
    // fall back to the heap if szDst is not big enough
    if (nReqLen >= XRB_DOM_TRANS_BUF_LEN){
        pszDst = new char[nReqLen];
    }
    // convert; returns the number of bytes written, or 0 on failure
    int nLen = WideCharToMultiByte(nCodePage,0,pwszString,nRawLen,pszDst,nReqLen,0,0);
    if (nLen > 0){
        strRet.assign(pszDst, nLen); // copy exactly nLen bytes; no terminator needed
    }
    if (pszDst!=szDst) delete[] pszDst; // delete, if allocated
    return strRet;
}
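// Sets the text of the child element called "name", adding a text node if there is none.
void SetChildNodeTextValue(TiXmlNode* parent, const wchar_t* name, const wchar_t* value)
{
    if (!parent) return;
    TiXmlNode* tNode = parent->FirstChild(Transcode(name).c_str());
    if (tNode)
    {
        // FirstChild() returns 0 for an empty element, so check before calling ToText()
        TiXmlNode* tChild = tNode->FirstChild();
        TiXmlText* tText = tChild ? tChild->ToText() : 0;
        if (tText == 0)
        {
            TiXmlText tNewText( Transcode(value).c_str() );
            tNode->InsertEndChild(tNewText);
        }
        else
        {
            tText->SetValue( Transcode(value).c_str() );
        }
    }
}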
int main()
{
    TiXmlDocument doc( "utf8test.xml" );
    //TiXmlBase::SetCondenseWhiteSpace( false );
    bool loadOkay = doc.LoadFile();
    if ( !loadOkay )
    {
        printf( "Could not load test file 'utf8test.xml'. Error='%s'. Exiting.\n", doc.ErrorDesc() );
        return 1;
    }
    // find the root <document> element
    TiXmlNode* tNode = doc.FirstChildElement( "document" );
    if ( tNode )
    {
        // element name 汉语 ("Chinese language"), text 中文测试 ("Chinese test")
        SetChildNodeTextValue(tNode, L"汉语", L"中文测试");
    }
    doc.SaveFile("demotest.xml");
    return 0;
}
Yeah!
After I used your "Transcode()", everything works with my Simplified Chinese characters.
But if I want to read some characters, I need to call "MultiByteToWideChar()" to transcode them too. Could you put the transcoding inside the TinyXml project? Then I wouldn't need to call these functions myself when I use "UTF-8" characters.
Thanks.
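For the reading direction, a minimal sketch of the reverse helper might look like the code below. The names WidenUtf8 and WidenUtf8Example are hypothetical and not part of TinyXml. On Win32 it calls MultiByteToWideChar twice (once to get the length, once to convert); the non-Windows branch is only a simplified fallback that assumes a 32-bit wchar_t and does not reject overlong sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not part of TinyXml): UTF-8 bytes -> wide string.
std::wstring WidenUtf8(const std::string& utf8)
{
    std::wstring out;
    if (utf8.empty()) return out;
#ifdef _WIN32
    const int nSrc = static_cast<int>(utf8.size());
    // First call: ask how many wchar_t units the result needs.
    int nLen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), nSrc, NULL, 0);
    if (nLen <= 0) return out;                 // conversion failed
    out.resize(nLen);
    // Second call: convert into the buffer we just sized.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), nSrc, &out[0], nLen);
#else
    // Simplified fallback, assuming wchar_t can hold a whole code point.
    for (std::size_t i = 0; i < utf8.size(); ) {
        unsigned char c = static_cast<unsigned char>(utf8[i++]);
        unsigned long cp = c;
        int extra = 0;                         // continuation bytes expected
        if      ((c & 0x80) == 0x00) { extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else { cp = 0xFFFD; }                  // invalid lead byte
        for (; extra > 0; --extra) {
            if (i >= utf8.size() ||
                (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;                   // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i++]) & 0x3F);
        }
        out += static_cast<wchar_t>(cp);
    }
#endif
    return out;
}

// Usage sketch: the UTF-8 bytes of 中文 become two wide characters.
void WidenUtf8Example()
{
    std::wstring w = WidenUtf8("\xE4\xB8\xAD\xE6\x96\x87");
    assert(w.size() == 2 && w[0] == wchar_t(0x4E2D) && w[1] == wchar_t(0x6587));
}
```

With something like this, a value read from TinyXml (for example the const char* returned by Value()) could be passed through WidenUtf8 before it is handed to wide-character code.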
Encoding conversion (UTF-8, UTF-16, Big-5 (probably your Chinese system), Latin-1, ISO-xyz) is a huge problem. A far bigger problem than XML parsing. The OS (Windows, Mac, Linux, etc.) has orders of magnitude more code and data for encoding conversion than all of TinyXml put together.
Encoding conversion is much too big a problem to put in TinyXml.
You've got a nice little bit of code to solve your problem, and I'm hoping it will solve lots of people's troubles. But it's one little part of the encoding issue, and to put encoding conversion into the TinyXml code would be the first bucket of an ocean of code I would need to solve this across operating systems.
TinyXml tries to do a few simple things well:
- Process UTF-8 correctly. UTF-8 in, UTF-8 out.
- It will also process in legacy mode: default system encoding in, sometimes default system encoding out. (Legacy is broken, but tries to be broken in exactly the same way it was prior to 2.3).
Any patches or utilities people would like to post and share are always appreciated by me and others. Especially code like this that solves an immediate problem. I will not put them into the mainline however, because the issue is just too big.
lee
Further to what Lee said...
If you want a DOM XML parsing library in C++ with extensive support for various encodings, look at Xerces.
But as long as the Chinese characters are entered as valid UTF-8, TinyXml as-is should work fine...
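If you are not sure whether your input really is valid UTF-8, a quick check along these lines may help before you hand the bytes to the parser. This is only a sketch: IsValidUtf8 and IsValidUtf8Example are hypothetical names, not TinyXml functions. It rejects truncated sequences, overlong forms, surrogates and values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the bytes are well-formed UTF-8.
bool IsValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }       // plain ASCII byte

        std::size_t len;                       // total bytes in this sequence
        unsigned long cp, minCp;               // code point and smallest legal value
        if      ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; minCp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; minCp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; minCp = 0x10000; }
        else return false;                     // stray continuation or bad lead byte

        if (n - i < len) return false;         // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < minCp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}

// Usage sketch: the same character in two encodings.
void IsValidUtf8Example()
{
    assert(IsValidUtf8("\xE4\xB8\xAD"));   // 中 encoded as UTF-8
    assert(!IsValidUtf8("\xD6\xD0"));      // 中 encoded as GBK, not valid UTF-8
}
```

Text saved in the system's default Chinese code page will typically fail this check, which is a hint that it needs converting to UTF-8 first.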