Re: Unicode to std::wstring


On 18 Feb 2009, at 19:32, William Newbery wrote:

>
>
> > On 10 Feb 2009, at 17:15, William Newbery wrote:
> >
> > > I want to start using unicode strings.
> > >
> > > Looking at Py::String I see the method as_unicodestring, however
> > > this returns a std::basic_string<Py_UNICODE> and provides no
> > > option to choose an encoding...
> > >
> > > Another method is the encode method, this lets me provide the
> > > encoding but just returns another Py::String...
> > >
> > >
> > > What exactly do I need to do to go between a python unicode string
> > > and a std::wstring (where sizeof(wchar_t)==2)in UTF-16 encoding?
> >
> > I take it that Py_UNICODE is 4 on your platform.
> >
> > You could try encode('utf-16') to get a Py::String that is in utf-16.
> > Then use as_std_string() to get a std::string, use c_str() to get a
> > pointer to the contents and cast it to const wchar_t *.
> >
> > Adding an as_std_wstring would be a reasonable thing to add to PyCXX
> > to make this convenient.
> > as_std_wstring could look inside the Py_Object and avoid a number of
> > conversion steps.
> >
> > Barry
>
> The problem is that's basically a hack and results in several bugs
> since you're stuffing a double-byte string into a std::string.
> - Any UTF-16 character that has 00 for its first byte will break it.
> I don't know if there are any such characters in little-endian
> encoding, but for big-endian quite a lot will.

>
> - std::string only terminates with a single \0 but UTF-16 needs \0\0.
> This means casting the c_str() to a wchar_t * won't work because the
> byte after the first \0 is outside the string, and thus
> could be anything. So you end up having to make yet another copy by
> allocating a block of size()+2 bytes and making sure both of the
> last two bytes are 0...

std::string does not rely on NUL to terminate strings; it stores its
length, so embedded NUL bytes are fine. Use c_str() to get to the data
and use size() to find out the length.
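
As a rough sketch of that idea: copy size() bytes into a 16-bit string
rather than relying on a terminator. utf16_from_bytes is a made-up helper
name, not part of PyCXX. It assumes the bytes are UTF-16 in native byte
order (Python's 'utf-16' codec writes a byte order mark first, which is
dropped here). It uses std::u16string; where sizeof(wchar_t)==2 the same
copy should work with std::wstring.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper (not a PyCXX API): turn raw UTF-16 bytes held in a
// std::string into a std::u16string. Assumes native byte order.
std::u16string utf16_from_bytes(const std::string &bytes)
{
    // UTF-16 data is a whole number of 2-byte code units.
    assert(bytes.size() % 2 == 0);

    // size() gives the byte count, so embedded zero bytes are copied too.
    std::u16string out(bytes.size() / 2, u'\0');
    if (!out.empty())
        std::memcpy(&out[0], bytes.data(), out.size() * sizeof(char16_t));

    // Drop a leading byte order mark if one is present.
    if (!out.empty() && out[0] == 0xFEFF)
        out.erase(0, 1);
    return out;
}
```

The result owns its own storage and terminator, so out.c_str() is safe
to hand to an API that expects a NUL-terminated 16-bit string.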

>
>
> "Adding an as_std_wstring would be a reasonable thing to add to PyCXX
> to make this convenient." wstring could be, say, UCS-2 or some other
> wide format as easily as UTF-16, and then people may also want
> UCS-4, etc.
>
> Something that can support all the different formats would be good.
>
> e.g. maybe:
> int Py::String::c_encode(const char *format, char *buffer, int buffersize);
> where if buffer is null it just returns the number of bytes needed
> to encode in the given format. The user can then allocate the needed
> buffer and get the string encoded correctly in whatever format,
> ending with something that is safe to cast to wchar_t or unsigned
> int or whatever is correct for that format. buffersize should again
> be in bytes to avoid confusion.
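
To make the calling convention concrete, here is a minimal sketch of the
"ask for the size, then fill the buffer" pattern. Both c_encode_utf16 and
to_utf16 are hypothetical stand-ins, not PyCXX functions. The sketch
drops the format parameter and only handles native-endian UTF-16, it
takes its input as a std::u32string of code points instead of a
Py::String, and it does not validate the code points.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the proposed c_encode(). Encodes code points as
// native-endian UTF-16 followed by a two-byte terminator. Always returns the
// number of bytes required; writes only if buffer is non-null and big enough.
int c_encode_utf16(const std::u32string &text, char *buffer, int buffersize)
{
    assert(buffer == nullptr || buffersize >= 0);

    std::vector<char16_t> units;
    for (char32_t cp : text) {
        if (cp >= 0x10000) {
            // Outside the BMP: split into a surrogate pair.
            cp -= 0x10000;
            units.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            units.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            units.push_back(static_cast<char16_t>(cp));
        }
    }
    units.push_back(u'\0');  // terminator: two zero bytes

    const int needed = static_cast<int>(units.size() * sizeof(char16_t));
    if (buffer != nullptr && buffersize >= needed)
        std::memcpy(buffer, units.data(), needed);
    return needed;
}

// Caller side: query the size, allocate, then encode into the allocation.
std::u16string to_utf16(const std::u32string &text)
{
    const int needed = c_encode_utf16(text, nullptr, 0);  // size query only
    std::u16string out(needed / sizeof(char16_t), u'\0');
    c_encode_utf16(text, reinterpret_cast<char *>(&out[0]), needed);
    out.pop_back();  // drop the terminator; the string tracks its own length
    return out;
}
```

Letting the caller allocate a buffer of the target character type, as
to_utf16 does, avoids casting a char buffer to a wider type afterwards.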

Could you create a patch for this?

Barry