Thread: [Modeling-users] working in unicode?
From: Mario R. <ma...@ru...> - 2003-04-20 12:25:55
Hello,

I would like to be able to write and read unicode (as transparently as possible) to a text attribute. Postgres/PyPgSQL supports this (see the 'Databases' section in http://dalchemy.com/opensource/unicodedoc/ -- an article I found very useful). However, I have no idea about MySQL.

It seems that Modeling operates in latin-1? Is this true? Can I work around this in any way, or configure the framework to assume utf-8 as the encoding? The problem with latin-1 is that it is not a unicode encoding, so even if most of the special characters I need to handle now are in latin-1, sooner or later there will be problems.

Regards, mario
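A minimal sketch of the latin-1 limitation raised here, assuming present-day Python 3 (where `str` is unicode text) rather than the Python of the thread. The helper name `can_encode` and the sample strings are made up for illustration.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# French fits in latin-1; Greek and Cyrillic do not. UTF-8 covers all three.
samples = ["Sébastien", "Ελληνικά", "Привет"]
for s in samples:
    print(s, can_encode(s, "latin-1"), can_encode(s, "utf-8"))
```

Only the first sample survives a latin-1 encode; all three encode to utf-8.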
From: Sebastien B. <sbi...@us...> - 2003-04-20 12:58:22
Hi,

Mario Ruggier <ma...@ru...> wrote:
> I would like to be able to write and read unicode (as transparently as
> possible) to a text attribute. Postgres/PyPgSQL supports this (see the
> 'Databases' section in http://dalchemy.com/opensource/unicodedoc/
> -- an article I found very useful). However I have no idea about mysql.
>
> It seems that modeling operates in latin-1? Is this true?
> Can I work around this in any way, or configure the framework to
> assume utf-8 as encoding? The problem with latin-1 is that it
> is not a unicode encoding, so even if most of the special characters
> I would need to handle now are latin-1, sooner or later there will
> be problems.

latin-1? No, I never assumed a particular encoding for strings. It may be that the framework does not behave correctly with the unicode type, but if so that is definitely a bug; it was not intended.

Two kinds of problems may arise:

1. the underlying adaptor itself (MySQLdb, psycopg, etc.) does not handle unicode strings very well -- I have no idea how adaptors handle this, I never tried it,

2. Modeling assumes that the strings' type is string, not unicode, and some operations fail.

I never tried unicode strings, hence I do not have more to say on that. Maybe you could go ahead and try it, and then report?

-- Sébastien.

PS: Oh, maybe you thought latin-1 because of the 'latin-1' in the model? That one only specifies how the model's XML file is produced; it has nothing to do with the framework runtime.
From: Yannick G. <ygi...@yg...> - 2003-04-20 13:53:11
On Sunday 20 April 2003 08:58, Sebastien Bigaret wrote:
> latin-1? No, I never assumed a particular encoding for strings. It
> can be that the framework does not behave correctly because of the
> unicode type, but if this happens this is definitely a bug, it was
> not intended.

We use unicode all the time. The first time you create an object, if you pass anything other than ASCII-only data to one of its attributes, you have to encode it explicitly with the encoding of your choice.

We use MySQL, which does not support unicode at all. We have to do our encoding explicitly, but you may have more luck with a unicode DB like Postgres.

Encoding explicitly has a few drawbacks: we can't do a case-insensitive search, and we can't be sure of the length of the encoded string before we encode it. You have to plan larger fields in your model. Three times larger than the intended length is safe in most situations. Latin-1 text gets about 1.5 times larger in UTF-8, but Cyrillic comes close to 2.5. Do your own tests with a reliable sample of data; UTF-8 may take up to 4 bytes per character:

    print len(u'\u58f7\u58d9\u58bc\u585e\u5859'.encode("utf-8"))

Typing the hex values is boring, so I made a PyQt app into which I can paste characters copied from kcharselect:

    #!/usr/bin/python
    # Written for Python 2 with the PyQt "qt" module.
    from autogen.UniToPyGui import UniToPyGui
    from qt import *
    import sys

    class UniToPyDialog(UniToPyGui):
        def __init__(self, *args):
            UniToPyGui.__init__(self, *args)
            self.connect(self.okBtn, SIGNAL("clicked()"), self.convertString)

        def convertString(self):
            # Pasted text and its length in characters
            unistr = unicode(self.uniTxt.text())
            self.uniSizeTxt.setText(str(len(unistr)))

            # Python literal with \u escapes, and the length of that literal
            pystr = repr(unistr)
            self.pyTxt.setText(pystr)
            self.pySizeTxt.setText(str(len(pystr)))

            # UTF-8 encoding and its length in bytes
            utfstr = unistr.encode("utf-8")
            self.utfTxt.setText(utfstr)
            self.utfSizeTxt.setText(str(len(utfstr)))

    if __name__ == "__main__":
        qapp = QApplication(sys.argv)
        dialog = UniToPyDialog()
        qapp.setMainWidget(dialog)
        dialog.show()
        qapp.exec_loop()

    # The END !

All you need is a UniToPyGui form with an OK button and a few text edits.

--
Yannick Gingras
Coder for OBB : Onymous Barrelled Baneberry
http://OpenBeatBox.org
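For sizing fields from your own data, as suggested in this message, a measurement along the following lines may help. It is only a sketch, written for present-day Python 3 rather than the Python 2 of the thread; the function name `utf8_growth` and the sample strings are hypothetical.

```python
def utf8_growth(text):
    """Return (characters, UTF-8 bytes, bytes per character) for non-empty text."""
    chars = len(text)
    size = len(text.encode("utf-8"))
    return chars, size, size / chars

# Hypothetical samples; replace them with text representative of your data.
samples = {
    "French": "Sébastien",
    "Cyrillic": "Привет",
    "CJK": "\u58f7\u58d9\u58bc\u585e\u5859",
}
for name, text in samples.items():
    chars, size, ratio = utf8_growth(text)
    print(f"{name}: {chars} chars -> {size} bytes ({ratio:.2f} bytes/char)")
```

The five CJK characters from the one-liner in the message encode to 15 bytes, that is 3 bytes each, which matches the "three times larger" rule of thumb for that kind of text.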
From: Mario R. <ma...@ru...> - 2003-04-21 11:14:11
On Sunday, Apr 20, 2003, at 14:58 Europe/Amsterdam, Sebastien Bigaret wrote:
> Mario Ruggier <ma...@ru...> wrote:
>> I would like to be able to write and read unicode (as transparently as
>> possible) to a text attribute. Postgres/PyPgSQL supports this (see the
>> 'Databases' section in http://dalchemy.com/opensource/unicodedoc/
>> -- an article I found very useful). However I have no idea about
>> mysql.
>>
>> It seems that modeling operates in latin-1? Is this true?
>> Can I work around this in any way, or configure the framework to
>> assume utf-8 as encoding? The problem with latin-1 is that it
>> is not a unicode encoding, so even if most of the special characters
>> I would need to handle now are latin-1, sooner or later there will
>> be problems.
>
> latin-1? No, I never assumed a particular encoding for strings. It
> can be that the framework does not behave correctly because of the
> unicode type, but if this happens this is definitely a bug, it was
> not intended.

It's me who was assuming a particular encoding, when setting the values ;)

> Two kinds of problems may arise:
>
> 1. the underlying adaptor itself (MySQLdb, psycopg, etc.) does not handle
> unicode strings very well -- I've no idea how adaptors handle this,
> never tried it,

PyPgSQL handles it transparently (but internally it encodes to UTF-8 anyway), so you can directly execute a query given as a unicode string, and it will return results encoded in client_encoding (see the article referenced above). What I was asking is: can I tell the framework to set extra parameters, such as client_encoding, when connecting and creating cursors?

> 2. mdl assumes that the strings' type is string, not unicode, and some
> operations fail.
>
> I never tried unicode strings, hence I do not have more to say on
> that.
>
> Maybe you could go ahead and try it, and then report?

OK: if I try to send a unicode string value where a string is expected, a ValidationException is generated for that attribute.

However, as expected, if I take care to encode all string values to utf-8 before sending them to the db via the framework, it all works fine. Filtering of objects also works fine, e.g. if I specify a qualifier such as 'someAtt=="someUtf8String"'.

I guess case-insensitive matches will not work, as Yannick pointed out (thanks!). But case does not really mean much when applied generically to unicode (the concept of case does not exist in all languages?).

The only issue is that it is a bit of a pity to have to make sure that all string values are sent, and received, as utf-8 encoded strings. But in fact this is a really small price to pay, especially if the input comes from a utf-8 encoded web form, which automatically sends data as utf-8 and wants to receive it in utf-8.

mario
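The workaround described in this message, encoding on the way in and decoding on the way out, can be kept in one place. The following is a generic sketch in present-day Python 3, not Modeling's API: `to_storage` and `from_storage` are hypothetical helpers, and how the resulting byte strings are handed to the framework is deliberately left out.

```python
ENCODING = "utf-8"

def to_storage(value):
    """Encode text to UTF-8 bytes; pass any other value through unchanged."""
    if isinstance(value, str):
        return value.encode(ENCODING)
    return value

def from_storage(value):
    """Decode UTF-8 bytes back to text; pass any other value through unchanged."""
    if isinstance(value, bytes):
        return value.decode(ENCODING)
    return value

name = "Sébastien"
stored = to_storage(name)            # what would be sent to the database layer
print(stored)
print(from_storage(stored) == name)  # the round trip gives back the original text
```

Keeping both directions in a pair of small functions means the choice of encoding is made once, rather than at every place a string attribute is set or read.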
From: Yannick G. <ygi...@yg...> - 2003-04-21 15:15:31
On Monday 21 April 2003 07:14, Mario Ruggier wrote:
> I guess case-insensitive matches will not work, as Yannick pointed out
> (thanks!).
> But, case does not really mean much when applied generically to unicode
> (the concept of case is not for all languages?).

I don't know much about Asian languages, but there is the concept of case in the Cyrillic alphabet. Have a look at kcharselect table no. 4. Same for Greek, same for French. If "Sébastien" is encoded in utf-8, you'll have a hard time matching it case-insensitively...

This is supposed to be managed by the RDBMS; it should know how to do a case-insensitive select on utf-8, but life is cruel... ; )

--
Yannick Gingras
Coder for OBB : Occasional Barricaded Buttressing
http://OpenBeatBox.org
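A small illustration of the matching problem described in this message, assuming present-day Python 3 (the thread predates it). Lowercasing UTF-8 bytes only affects ASCII letters, whereas lowercasing the decoded text also handles accented letters. The helper name `same_ignoring_case` is hypothetical.

```python
def same_ignoring_case(a, b):
    """Compare two UTF-8 byte strings case-insensitively by decoding first."""
    return a.decode("utf-8").casefold() == b.decode("utf-8").casefold()

upper = "SÉBASTIEN".encode("utf-8")
lower = "sébastien".encode("utf-8")

# Byte-level lowercasing leaves the two bytes of "É" untouched, so this fails.
print(upper.lower() == lower)
# Decoding first compares actual characters, so this succeeds.
print(same_ignoring_case(upper, lower))
```

This only helps for comparisons done in application code; a case-insensitive select still depends on what the database itself does with the stored bytes.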