From: Barry S. <ba...@ba...> - 2009-02-21 12:20:52
On 18 Feb 2009, at 19:32, William Newbery wrote:
>
>
> > On 10 Feb 2009, at 17:15, William Newbery wrote:
> >
> > > I want to start using Unicode strings.
> > >
> > > Looking at Py::String I see the method as_unicodestring, however
> > > this returns a std::basic_string<Py_UNICODE> and provides no
> > > option for choosing the encoding...
> > >
> > > Another option is the encode method, which lets me provide the
> > > encoding but just returns another Py::String...
> > >
> > >
> > > What exactly do I need to do to go between a Python Unicode string
> > > and a std::wstring (where sizeof(wchar_t)==2) in UTF-16 encoding?
> >
> > I take it that sizeof(Py_UNICODE) is 4 on your platform.
> >
> > You could try encode('utf-16') to get a Py::String that is in
> > UTF-16.
> > Then use as_std_string() to get a std::string, use c_str() to get a
> > pointer to the contents and cast it to a wchar_t pointer.
> >
> > Adding an as_std_wstring would be a reasonable thing to add to PyCXX
> > to make this convenient.
> > as_std_wstring could look inside the Py_Object and avoid a number of
> > conversion steps.
> >
> > Barry
>
> The problem is that's basically a hack and results in several bugs,
> since you're stuffing a double-byte string into a std::string.
> -Any UTF-16 character that has 00 for the first byte will break it.
> I don't know if there are any such characters in little-endian
> encoding, but for big-endian quite a lot will.
>
> -std::string only terminates with a single \0 but UTF-16 needs \0\0.
> This means casting the c_str() to a wchar_t pointer won't work because
> the character after the first \0 is outside the string, and thus
> could be anything. So you end up having to make yet another copy by
> allocating a block which is size()+2 bytes and making sure both of the
> last two bytes are 0...
std::string does not use NUL to mark the end of its contents. Use c_str()
to get to the data and use size() to find out the length.
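As a minimal sketch of that size()-based approach: assuming the
std::string already holds native-endian UTF-16 code units with no BOM
(Python's 'utf-16' codec prepends a BOM, whereas 'utf-16-le' and
'utf-16-be' do not), the bytes can be copied into a std::u16string.
The helper name utf16_bytes_to_u16string is hypothetical and not part
of PyCXX, and char16_t needs C++11. Where sizeof(wchar_t)==2 the same
copy should work with std::wstring.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of PyCXX).
// Copies a byte string holding native-endian UTF-16 code units (no BOM)
// into a std::u16string. The length is taken from size(), never from a
// NUL search, so 0x00 bytes inside the data are preserved.
std::u16string utf16_bytes_to_u16string(const std::string &bytes)
{
    if (bytes.size() % sizeof(char16_t) != 0)
        throw std::invalid_argument("UTF-16 byte count must be even");

    std::u16string out(bytes.size() / sizeof(char16_t), u'\0');
    if (!bytes.empty())
        std::memcpy(&out[0], bytes.data(), bytes.size());

    // Every input byte ended up in exactly one code unit.
    assert(out.size() * sizeof(char16_t) == bytes.size());
    return out;
}
```

The std::u16string manages its own terminator, so there is no need to
allocate size()+2 bytes by hand.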
>
>
> "Adding an as_std_wstring would be a reasonable thing to add to PyCXX
> to make this convenient." wstring could be, say, UCS-2 or some other
> wide format as easily as UTF-16, and then people may also want
> UCS-4, etc.
>
> Something that can support all the different formats would be good.
>
> e.g. maybe:
> int Py::String::c_encode(const char *format, char *buffer, int buffersize);
> where if buffer is null it just returns the number of bytes needed
> to encode in the given format. The user can then allocate the needed
> buffer and get the string encoded correctly in whatever format,
> ending with something that is safe to cast to wchar_t or unsigned
> int or whatever is correct for that format. buffersize should again
> be in bytes to avoid confusion.
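To make the proposed calling convention concrete, here is a rough
sketch of the "ask for the size, then fill the buffer" pattern. It is
only an illustration under several assumptions: c_encode_utf16 is a
hypothetical free function, not an existing PyCXX method; the format is
fixed to native-endian UTF-16 instead of taking a format argument; the
input is a std::u32string of valid code points rather than a
Py::String; sizes are std::size_t rather than int; and returning 0 for
a too-small buffer is just one possible convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the proposed c_encode, fixed to UTF-16.
// If buffer is null, returns the number of bytes needed, including a
// two-byte terminator. Otherwise writes the encoded text and returns
// the number of bytes written, or 0 if buffersize is too small.
// Assumes text contains only valid Unicode code points.
std::size_t c_encode_utf16(const std::u32string &text,
                           char *buffer, std::size_t buffersize)
{
    std::size_t units = 1;  // one code unit for the terminator
    for (char32_t cp : text)
        units += (cp > 0xFFFF) ? 2 : 1;  // surrogate pair above the BMP
    const std::size_t needed = units * sizeof(char16_t);

    if (buffer == nullptr)
        return needed;      // size query only
    if (buffersize < needed)
        return 0;           // caller's buffer is too small

    char *p = buffer;
    auto put = [&p](char16_t u) {
        std::memcpy(p, &u, sizeof u);  // byte copy avoids alignment issues
        p += sizeof u;
    };
    for (char32_t cp : text) {
        if (cp > 0xFFFF) {
            const char32_t v = cp - 0x10000;
            put(static_cast<char16_t>(0xD800 + (v >> 10)));
            put(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        } else {
            put(static_cast<char16_t>(cp));
        }
    }
    put(u'\0');             // both terminator bytes are zero
    assert(p == buffer + needed);
    return needed;
}
```

A caller would first pass a null buffer to learn the size, allocate
that many bytes, and then call again with the real buffer.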
Could you create a patch for this?
Barry