[PyCS-devel] New feature in CVS: default encoding for strings
From: Georg B. <gb...@mu...> - 2002-12-10 18:15:05
Hi!

I fiddled around with some umlaut problems in pycs and found (and implemented) a solution. Now I feel dirty.

The story: a user has an umlaut (a char with the high bit set) in his name. He can't register, ping, or do anything else where his name or the title of his weblog is transferred, as long as it includes umlauts. Reason: Radio doesn't give the right encoding.

So the first stage was to hack something into pycs to replace the XML header with one that has the encoding parameter given. Now it should work, as the request gives a nice and cozy encoding, right? Wrong.

The second stage was that pycs barfs in all sorts of places when unicode strings are passed in, because parts of it don't work well with unicode. Strings with umlauts are converted to unicode strings, while normal strings stay normal strings. That's done transparently by the parser: as soon as you have umlauts, you get unicode strings. And metakit, for example, doesn't like them. So I had to add another hack. An even worse hack.

Want to know what hacks? OK, here they come:

- added a defaultencoding option to pycs.conf
- only if this setting is uncommented and set are the hacks activated. So everything should work as before without setting this. Nothing should break.
- added a shim for continue_request in pycs_xmlhandler.py to fetch the XML request and replace the standard XML header without encoding (and _only_ the one without encoding: if there is one with an encoding given, nothing will happen here!) with one with encoding="defaultencoding"
- added code to this shim to do everything manually. The original continue_request has exception handling; I copied it. The original used xmlrpclib.loads; I replaced that with the innards of this function from xmlrpclib.
- added a class Unmarshaller in pycs_xmlhandler.py to override the end_string method with one that changes unicode strings to iso-8859-1 encoded normal strings
- patch xmlrpclib.Unmarshaller, call getparser, unpatch xmlrpclib.Unmarshaller, work with the new patched unmarshaller. Yes, this is dirty, but it works because of the non-threadedness of Medusa: there is no parallel thread that might stumble over the patched version of Unmarshaller.

The rest is normal hacking, copying and pasting. Weird, bad, ugly. But it has one good quality: it works. Now pycs and Radio work happily together and accept umlauts in the iso-8859-1 encoding.

This is a problem because there are others out there using other encodings? Right. But there are already several problems in pycs and the Python environment, so we don't make it worse:

- metakit doesn't support unicode, only standard 8bit encodings
- the XML parsers available only support a limited range of encodings, most notably unicode, utf-8 and iso-8859-1, so if you use different chars you are already lost, as your XML in that encoding won't be parsed
- Radio doesn't give a flying fart about encodings and just delivers a standard XML header. That is meant as "take what I send, just store it", but in reality the XML standard defines it as UTF-8. You can actually be happy when Radio delivers iso-8859-1 chars in its XML and not actual Macintosh codes :-/

So if somebody is willing to fully support all encodings available in unicode, he would have to do the following:

- get the client applications to send XML in UTF-8
- rip out metakit and plug in something that understands UTF-8

To repeat: the hack is bad, but it is the only way currently known (at least to me) to get 8bit chars working for pycs. This is not only a problem with regard to Radio, but with regard to Metakit, too.

Comments?

bye, Georg
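
For readers who want to see the header-replacement step from the list of hacks in concrete form, here is a minimal sketch. It is not the code that went into pycs_xmlhandler.py: it uses today's Python 3 `xmlrpc.client` (the successor of `xmlrpclib`), and the function name `add_default_encoding` and the sample request are made up for illustration. A request with no XML declaration at all is left untouched, which is an assumption on my part.

```python
import re
import xmlrpc.client

# An XML declaration at the very start of the request body. The "rest" group
# holds whatever follows the version attribute (encoding, standalone, spaces).
_XML_DECL = re.compile(
    rb"""^<\?xml\s+version\s*=\s*("[^"]*"|'[^']*')(?P<rest>[^?]*)\?>"""
)


def add_default_encoding(body: bytes, default_encoding: str) -> bytes:
    """Add an encoding to the XML header, but only if it has none yet.

    Hypothetical helper: the name and signature are illustrative only.
    """
    match = _XML_DECL.match(body)
    if match is None or b"encoding" in match.group("rest"):
        # No declaration, or the client already stated an encoding:
        # leave the request exactly as it came in.
        return body
    new_decl = (
        b"<?xml version=" + match.group(1)
        + b' encoding="' + default_encoding.encode("ascii") + b'"'
        + match.group("rest") + b"?>"
    )
    return new_decl + body[match.end():]


# A made-up request in the style described above: iso-8859-1 bytes in the
# payload, but a header that does not say so.
request = (
    b'<?xml version="1.0"?><methodCall><methodName>ping</methodName>'
    b"<params><param><value><string>J\xfcrgen</string></value></param>"
    b"</params></methodCall>"
)

patched = add_default_encoding(request, "iso-8859-1")
params, method = xmlrpc.client.loads(patched)
```

With the header patched, the parser decodes the 0xFC byte as "ü". The second hack (turning the resulting unicode strings back into iso-8859-1 byte strings for metakit) is not shown here; in modern Python it would roughly correspond to calling `.encode("iso-8859-1")` on each string before storing it.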