From: Rasmus K. <kaj@e.kth.se> - 2003-02-25 14:02:57
This discussion seems to have gotten a bit out of hand, but anyway,
here's my view of it. These are the main features I want regarding
character conversion:

1. The library should efficiently cope with any legal XML.

2. I want to be able to get a std::string (with a specified encoding)
   of a value. Most of the XML I will handle will be ISO8859-15
   (that's just me, but replace the encoding with other encodings and
   a lot will be covered).

3. When necessary, I want to be able to get a std::wstring (a
   std::basic_string of wide characters). I want to be able to handle
   individual characters in the string, and I want to use standard
   C++. Then this is the "correct" way of handling unicode strings.

4. Sometimes, it would probably be nice to get utf8 data as well.

The ordering of the points reflects their relative importance to me
personally, but I believe the following conclusions to hold regardless
of how the points are reordered:

Point one suggests that the library should use utf8 internally. I
don't care much how this is done, but some kind of refptr<char*> seems
relatively sane while putting utf8 data in a std::string feels bad.

Points 2 - 4 suggest that the get_ / set_ string methods should be
templated on string class and conversion method. The alternative is
that I would have to call a converter at every call to a get_ / set_
string method, meaning in practice that I would have to write a
wrapper around libxml++ (which would feel ridiculous, since libxml++
itself is a wrapper around libxml).

Note that this _doesn't_ imply that the entire libxml++ would go from a
"dynamic library" to a "template library". Just that some of the
methods would be templates. Also, "common instantiations" of a
template method can - at least theoretically - be included in a
dynamic library.

-- 
Rasmus Kaj ----------------------------------------------- ra...@ka...
 \ What is the word where la is the middle, is the beginning, and the end?
  \------------------------------------- http://www.stacken.kth.se/~kaj/

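A minimal sketch of the templated get_ idea from points 2 - 4 above,
under loud assumptions: Node, get_value and the converter structs
(AsUtf8, AsLatin1, AsWide) are hypothetical names for illustration and
are not libxml++'s real API. The sketch keeps utf8 in a std::string
only for brevity (the mail prefers something like a refptr), targets
ISO8859-1 rather than ISO8859-15 to keep the table-free conversion
short, and assumes well-formed utf8 input. The explicit instantiations
at the end show how "common instantiations" could be compiled into a
dynamic library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Decode one code point from well-formed UTF-8 starting at s[i] and
// advance i past it. No error handling for malformed input.
inline char32_t next_code_point(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80) return b;                       // single-byte (ASCII)
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
    char32_t cp = b & (0x3F >> extra);            // payload bits of lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Conversion policies (hypothetical). Each names the string class it
// produces and how to build it from the internal UTF-8 data.
struct AsUtf8 {
    using result_type = std::string;
    static result_type from_utf8(const std::string& in) { return in; }
};

struct AsLatin1 {
    using result_type = std::string;
    static result_type from_utf8(const std::string& in) {
        result_type out;
        for (std::size_t i = 0; i < in.size();) {
            char32_t cp = next_code_point(in, i);
            // Anything outside ISO8859-1 is replaced by '?'.
            out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
        }
        return out;
    }
};

struct AsWide {
    using result_type = std::wstring;
    static result_type from_utf8(const std::string& in) {
        result_type out;
        for (std::size_t i = 0; i < in.size();)
            // Truncates code points above 0xFFFF if wchar_t is 16 bits.
            out += static_cast<wchar_t>(next_code_point(in, i));
        return out;
    }
};

// Hypothetical stand-in for a node class; stores its value as UTF-8.
class Node {
public:
    explicit Node(std::string utf8_value) : utf8_(std::move(utf8_value)) {}

    // Getter templated on the conversion policy, which also fixes the
    // returned string class.
    template <typename Converter>
    typename Converter::result_type get_value() const {
        return Converter::from_utf8(utf8_);
    }

private:
    std::string utf8_;
};

// "Common instantiations" that a dynamic library could ship prebuilt.
template std::string  Node::get_value<AsUtf8>() const;
template std::string  Node::get_value<AsLatin1>() const;
template std::wstring Node::get_value<AsWide>() const;
```

With this shape a caller would write something like
node.get_value<AsLatin1>() and never touch a converter directly, which
is the point of templating the accessor instead of wrapping the
wrapper.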
From: Stefan S. <se...@sy...> - 2003-02-25 15:43:28
Hi Rasmus,

you are listing important requirements, which, however, apply only
partly to libxml++ (the other half applies to the unicode library).

Rasmus Kaj wrote:

> 1. The library should efficiently cope with any legal XML.

agreed.

> 2. I want to be able to get a std::string (with a specified encoding)
>    of a value. Most of the XML I will handle will be ISO8859-15
>    (that's just me, but replace the encoding with other encodings and
>    a lot will be covered).

That is a requirement for the unicode lib, as libxml2 uses one
particular encoding internally (utf8), which you want to be able to
'transcode' into another.

> 3. When necessary, I want to be able to get a std::wstring (a
>    std::basic_string of wide characters). I want to be able to handle
>    individual characters in the string, and I want to use standard
>    C++. Then this is the "correct" way of handling unicode strings.

that, too, is an issue the unicode library has to deal with: I take it
that with 'standard C++' you mean you want to be able to access
characters with the '[]' operator. That is a requirement for the
specific encoding you use. With utf8, characters don't have a fixed
size, so you don't have random access. Instead you have to iterate over
the string to find the nth character. So, depending on what you want
to do with the string, one encoding may be better than another.

Please note that unicode does not fit into std::wstring wherever
wchar_t has only 16 bits (its width is platform-dependent), since
unicode needs 21 bits per character. Only the first 'plane' (the Basic
Multilingual Plane) fits into these 16 bits; for lots of characters you
need more, so the encoding becomes variably sized (meaning, as
explained above, there is no random access).

> 4. Sometimes, it would probably be nice to get utf8 data as well.

yep.

> The ordering of the points reflects their relative importance to me
> personally, but I believe the following conclusions to hold regardless
> of how the points are reordered:
>
> Point one suggests that the library should use utf8 internally. I
> don't care much how this is done, but some kind of refptr<char*> seems
> relatively sane while putting utf8 data in a std::string feels bad.

agreed.

> Points 2 - 4 suggest that the get_ / set_ string methods should be
> templated on string class and conversion method. The alternative is
> that I would have to call a converter at every call to a get_ / set_
> string method, meaning in practice that I would have to write a
> wrapper around libxml++ (which would feel ridiculous, since libxml++
> itself is a wrapper around libxml).

agreed.

> Note that this _doesn't_ imply that the entire libxml++ would go from a
> "dynamic library" to a "template library". Just that some of the
> methods would be templates. Also, "common instantiations" of a
> template method can - at least theoretically - be included in a
> dynamic library.

well, most methods deal with strings. I originally tried to factor out
the string-type agnostic part into a base class, but that didn't lead
anywhere. I agree that it would be possible to compile specific
'unicode bindings' to deal with Murray's points about
interface/implementation separation. Whether that's actually worth the
effort is another story.

Regards,
Stefan
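To illustrate the point about random access, here is a small sketch of
what "iterate over the string to find the nth character" looks like
for utf8, plus the reason a 16-bit unit is not enough for every
character. The function names (utf8_offset_of, utf8_length,
utf16_units) are made up for this example, and the code assumes
well-formed utf8; it is not taken from libxml2 or any unicode library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Byte offset of the n-th character (0-based), or std::string::npos if
// the string has fewer characters. This is a linear scan: there is no
// way to jump straight to character n, because characters take 1 to 4
// bytes each.
inline std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_continuation(static_cast<unsigned char>(s[i]))) continue;
        if (seen == n) return i;   // i is the lead byte of character n
        ++seen;
    }
    return std::string::npos;
}

// Number of characters (code points), again by scanning every byte.
inline std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char b : s)
        if (!is_continuation(b)) ++count;
    return count;
}

// Number of 16-bit units a code point needs in UTF-16: one inside the
// Basic Multilingual Plane, two (a surrogate pair) above it.
inline int utf16_units(char32_t cp) {
    assert(cp <= 0x10FFFF);        // upper bound of the Unicode range
    return cp > 0xFFFF ? 2 : 1;
}
```

So s[n] on a utf8 std::string gives the nth byte, not the nth
character, and a string of 16-bit units has the same problem as soon
as a character above U+FFFF shows up.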