On 18 Feb 2009, at 19:32, William Newbery wrote:



> On 10 Feb 2009, at 17:15, William Newbery wrote:
> 
> > I want to start using unicode strings.
> >
> > Looking at Py::String I see the method as_unicodestring; however, 
> > this returns a std::basic_string<Py_UNICODE> and provides no option 
> > of choosing an encoding...
> >
> > Another method is the encode method, which lets me provide the 
> > encoding but just returns another Py::String...
> >
> >
> > What exactly do I need to do to go between a Python unicode string 
> > and a std::wstring (where sizeof(wchar_t)==2) in UTF-16 encoding?
> 
> I take it that Py_UNICODE is 4 bytes on your platform.
> 
> You could try encode('utf-16') to get a Py::String that is in utf-16.
> Then use as_std_string() to get a std::string, use c_str() to get a 
> pointer to the contents and cast it to wchar_t *.
> 
> Adding an as_std_wstring would be a reasonable thing to add to PyCXX 
> to make this convenient.
> as_std_wstring could look inside the PyObject and avoid a number of 
> conversion steps.
> 
> Barry

The problem is that's basically a hack and results in several bugs, since you're stuffing a double-byte string into a std::string.
- Any UTF-16 character that has 00 for the first byte will break it. I don't know if there are any such characters in little-endian encoding, but for big-endian quite a lot will.


- std::string only terminates with a single \0, but UTF-16 needs \0\0. This means casting the c_str() to a wchar_t * won't work, because the character after the first \0 is outside the string and thus could be anything. So you end up having to make yet another copy by allocating a block of size()+2 bytes and making sure the last two bytes are both 0...

std::string does not use NUL to terminate strings. Use c_str() to get at the data and size() to find out the length.
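For example, a minimal sketch of that path (assuming the encode() and as_std_string() calls described above, and sizeof(wchar_t)==2 as in your question), copying by size() rather than scanning for a terminator:

    #include "CXX/Objects.hxx"
    #include <cstring>
    #include <string>

    std::wstring to_wstring_utf16( Py::String u )
    {
        // "utf-16-le" avoids the BOM that the plain "utf-16" codec prepends.
        Py::String bytes( u.encode( "utf-16-le" ) );
        std::string raw( bytes.as_std_string() );

        // Copy by size(), not by scanning for a terminator, so embedded
        // 00 bytes in the UTF-16 data are preserved.
        std::wstring result( raw.size() / sizeof( wchar_t ), L'\0' );
        if( !raw.empty() )
            std::memcpy( &result[0], raw.data(), raw.size() );
        return result;
    }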



"Adding a as_std_wstring would be a reasonable thing to add to PyCXX to make this convenient." wstring could be say ucs-2 or someother wide format as easily as utf_16, and then people may also want ucs-4, etc.

Something that can support all the different formats would be good.

e.g. maybe:
int Py::String::c_encode(const char *format, char *buffer, int buffersize);
where, if buffer is null, it just returns the number of bytes needed to encode in the given format. The user can then allocate the needed buffer and get the string encoded correctly in whatever format, ending up with something that is safe to cast to wchar_t or unsigned int or whatever is correct for that format. buffersize should also be in bytes to avoid confusion.
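
For illustration, a possible usage sketch of that two-pass pattern (hypothetical, since c_encode is only the proposal above, and the "utf-16-le" format name is just an example):

    #include "CXX/Objects.hxx"
    #include <string>
    #include <vector>

    std::wstring encode_utf16( Py::String s )
    {
        // First call with a null buffer to ask how many bytes are needed.
        int needed = s.c_encode( "utf-16-le", 0, 0 );
        if( needed <= 0 )
            return std::wstring();

        // Allocate that many bytes and do the real encode.
        std::vector<char> buffer( needed );
        s.c_encode( "utf-16-le", &buffer[0], needed );

        // Safe to reinterpret as wchar_t when sizeof(wchar_t)==2.
        return std::wstring( reinterpret_cast<const wchar_t *>( &buffer[0] ),
                             needed / sizeof( wchar_t ) );
    }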

Could you create a patch for this?

Barry