From: Barry S. <ba...@ba...> - 2009-02-21 12:20:52
On 18 Feb 2009, at 19:32, William Newbery wrote:

> > On 10 Feb 2009, at 17:15, William Newbery wrote:
> >
> > > I want to start using unicode strings.
> > >
> > > Looking at Py::String I see the method as_unicodestring, however
> > > this returns a std::basic_string<Py_UNICODE> and provides no
> > > option of encoding...
> > >
> > > Another method is the encode method, this lets me provide the
> > > encoding but just returns another Py::String...
> > >
> > > What exactly do I need to do to go between a python unicode string
> > > and a std::wstring (where sizeof(wchar_t)==2) in UTF-16 encoding?
> >
> > I take it that Py_UNICODE is 4 on your platform.
> >
> > You could try encode('utf-16') to get a Py::String that is in utf-16.
> > Then use as_std_string() to get a std::string, use c_str() to get a
> > pointer to the contents and cast it to wchar_t.
> >
> > Adding a as_std_wstring would be a reasonable thing to add to PyCXX
> > to make this convenient.
> > as_std_wstring could look inside the Py_Object and avoid a number of
> > conversion steps.
> >
> > Barry
>
> The problem is that's basically a hack and results in several bugs
> since you're stuffing a double byte string into a std::string.
>
> - Any utf-16 character that has 00 for the first byte will break it.
>   I don't know if there are any such characters in little endian
>   encoding, but for big endian quite a lot will.
>
> - std::string only terminates with a single \0 but utf-16 needs \0\0.
>   This means casting the c_str() to a wchar_t won't work because the
>   character after the first \0 is outside the string, and thus
>   could be anything. So you end up having to make yet another copy by
>   allocating a block which is size()+2 and making sure both of the
>   last two bytes are 0...

std::string does not use NUL to terminate strings. Use c_str() to get
to the data and use size() to find out the length.

> "Adding a as_std_wstring would be a reasonable thing to add to PyCXX
> to make this convenient."
>
> wstring could be, say, ucs-2 or some other wide format as easily as
> utf-16, and then people may also want ucs-4, etc.
>
> Something that can support all the different formats would be good.
>
> e.g. maybe:
> int Py::String::c_encode(const char *format, char *buffer, int buffersize);
> where if buffer is null it just returns the number of bytes needed
> to encode in the given format. The user can then allocate the needed
> buffer and get the string encoded correctly in whatever format,
> ending with something that is safe to cast to wchar_t or unsigned
> int or whatever is correct for that format. buffersize should again
> be in bytes to avoid confusion.

Could you create a patch for this?

Barry
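
P.S. A minimal sketch of the size()-based approach, in case it helps. It assumes the bytes came from encode('utf-16') and were copied into a std::string, and that the data may start with a byte order mark (little endian is assumed when there is none). The function name utf16_bytes_to_wstring is hypothetical and not part of PyCXX. It stores one UTF-16 code unit per wchar_t, so no pointer cast or extra NUL padding is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, NOT part of PyCXX.
// Converts raw UTF-16 bytes held in a std::string into a std::wstring
// that stores one UTF-16 code unit per element.
std::wstring utf16_bytes_to_wstring(const std::string &bytes)
{
    const unsigned char *p =
        reinterpret_cast<const unsigned char *>(bytes.data());
    const std::size_t n = bytes.size();  // byte count, embedded 0x00 included

    bool little_endian = true;           // assumption when no BOM is present
    std::size_t start = 0;
    if (n >= 2) {
        if (p[0] == 0xFF && p[1] == 0xFE) {
            little_endian = true;
            start = 2;                   // skip the little endian BOM
        } else if (p[0] == 0xFE && p[1] == 0xFF) {
            little_endian = false;
            start = 2;                   // skip the big endian BOM
        }
    }

    std::wstring out;
    out.reserve((n - start) / 2);
    // Walk the bytes in pairs; a trailing odd byte is ignored.
    for (std::size_t i = start; i + 1 < n; i += 2) {
        const unsigned int unit = little_endian
            ? (p[i] | (p[i + 1] << 8))
            : ((p[i] << 8) | p[i + 1]);
        out.push_back(static_cast<wchar_t>(unit));
    }
    assert(out.size() == (n - start) / 2);
    return out;
}
```

The loop is driven by size() rather than by a terminator, so code units with a 00 byte are handled like any other. Where wchar_t is wider than 2 bytes the result still holds UTF-16 code units, not code points, so surrogate pairs are left as they are.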