Barry Scott <barry@ba...> - 2009-02-21 12:20

> On 18 Feb 2009, at 19:32, William Newbery wrote:
>
> > > On 10 Feb 2009, at 17:15, William Newbery wrote:
> > >
> > > > I want to start using unicode strings.
> > > >
> > > > Looking at Py::String I see the method as_unicodestring, however
> > > > this returns a std::basic_string<Py_UNICODE> and provides no option
> > > > of encoding...
> > > >
> > > > Another method is the encode method, this lets me provide the
> > > > encoding but just returns another Py::String...
> > > >
> > > > What exactly do I need to do to go between a python unicode string
> > > > and a std::wstring (where sizeof(wchar_t)==2) in UTF-16 encoding?
> > >
> > > I take it that Py_UNICODE is 4 on your platform.
> > >
> > > You could try encode('utf-16') to get a Py::String that is in utf-16.
> > > Then use as_std_string() to get a std::string, use c_str() to get a
> > > pointer to the contents and cast it to wchar_t.
> > >
> > > Adding a as_std_wstring would be a reasonable thing to add to PyCXX
> > > to make this convenient.
> > > as_std_wstring could look inside the Py_Object and avoid a number of
> > > conversion steps.
> > >
> > > Barry
> >
> > The problem is that's basically a hack and results in several bugs
> > since you're stuffing a double byte string into a std::string.
> > -Any utf-16 character that has 00 for the first byte will break it.
> > I don't know if there are any such characters in little endian
> > encoding, but for big endian quite a lot will.
> >
> > -std::string only terminates with a single \0 but utf-16 needs \0\0.
> > This means casting the c_str() to a wchar_t won't work because the
> > character after the first \0 is outside the string, and thus
> > could be anything. So you end up having to make yet another copy by
> > allocating a block which is size()+2 and making sure both of the
> > last two bytes are 0...
>
> std::string does not use NUL to terminate strings. Use c_str() to get
> to the data and use size() to find out the length().
>
> > "Adding a as_std_wstring would be a reasonable thing to add to PyCXX
> > to make this convenient." wstring could be say ucs-2 or some other
> > wide format as easily as utf-16, and then people may also want
> > ucs-4, etc.
> >
> > Something that can support all the different formats would be good.
> >
> > eg maybe:
> > int Py::String::c_encode(const char *format, char *buffer, int buffersize);
> > where if *buffer is null it just returns the number of bytes needed
> > to encode in the given format. The user can then allocate the needed
> > buffer and get the string encoded correctly in whatever format,
> > ending with something that is safe to cast to wchar_t or unsigned
> > int or whatever is correct for that format. buffersize should again
> > be in bytes to avoid confusion.
>
> Could you create a patch for this?
>
> Barry

From my limited knowledge of the C API I was able to put this together. There are a few things I would like to do better but am not aware how to, namely:
-A way to calculate the required buffer size without actually encoding a bytes object
-A way to encode directly into a buffer rather than into a Python-created bytes object which then must be copied

Also I'm not sure how you're checking for and throwing exceptions that originate from Python code, so I've left that out.
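
To make the calling convention discussed above concrete, here is a rough standalone sketch of the "pass a null buffer to get the size, then call again to fill it" pattern. It is not the patch and not PyCXX code: encode_utf16le and to_u16string are hypothetical names, the input is a std::u32string of code points rather than a Python object, only UTF-16LE is handled, and returning 0 for a too-small buffer is just one possible choice. It also shows the size() point from the thread: the encoded bytes contain zero bytes, so the length has to come from size() and never from scanning for a NUL.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the proposed c_encode(): encodes code points
// as UTF-16LE bytes followed by a two-byte terminator.
// Returns the number of bytes needed. If buffer is null nothing is written.
// Returns 0 if buffer is non-null but buffersize (in bytes) is too small.
// Assumption: the input holds valid code points (no validation is done).
std::size_t encode_utf16le(const std::u32string &text, char *buffer,
                           std::size_t buffersize)
{
    // Count 16-bit units: one per BMP code point, two per code point
    // above U+FFFF (surrogate pair), plus one for the terminator.
    std::size_t units = 1;
    for (char32_t cp : text)
        units += (cp >= 0x10000) ? 2 : 1;
    const std::size_t needed = units * 2;

    if (buffer == nullptr)
        return needed;
    if (buffersize < needed)
        return 0;

    std::size_t pos = 0;
    // Write one 16-bit unit, low byte first (little endian).
    auto put = [&](unsigned unit) {
        buffer[pos++] = static_cast<char>(unit & 0xFF);
        buffer[pos++] = static_cast<char>((unit >> 8) & 0xFF);
    };
    for (char32_t cp : text) {
        if (cp >= 0x10000) {
            const char32_t v = cp - 0x10000;
            put(0xD800 + (v >> 10));   // high surrogate
            put(0xDC00 + (v & 0x3FF)); // low surrogate
        } else {
            put(cp);
        }
    }
    put(0); // two-byte terminator
    return needed;
}

// Usage: query the size, allocate, encode, then rebuild 16-bit units.
std::u16string to_u16string(const std::u32string &text)
{
    const std::size_t bytes = encode_utf16le(text, nullptr, 0);
    std::string raw(bytes, '\0');
    const std::size_t written = encode_utf16le(text, &raw[0], raw.size());
    assert(written == bytes);
    (void)written;

    // raw holds embedded zero bytes ("A" is 0x41 0x00), so its length
    // must come from size(), not from looking for a NUL.
    std::u16string result(raw.size() / 2 - 1, u'\0'); // drop terminator
    for (std::size_t i = 0; i < result.size(); ++i) {
        const unsigned lo = static_cast<unsigned char>(raw[2 * i]);
        const unsigned hi = static_cast<unsigned char>(raw[2 * i + 1]);
        result[i] = static_cast<char16_t>(lo | (hi << 8));
    }
    return result;
}
```

Assembling each unit from two bytes instead of casting the pointer keeps the sketch independent of the host byte order and of alignment. A real version inside PyCXX would still need the Python error checking that was left out above.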