From: SourceForge.net <no...@so...> - 2007-02-15 22:36:35
Bugs item #1659819, was opened at 2007-02-14 17:16
Message generated for change (Comment added) made by laukpe
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Core
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Pekka Laukkanen (laukpe)
Assigned to: Nobody/Anonymous (nobody)
Summary: Joining unicode items with string doesn't create unicode

Initial Comment:
When you join a list of unicode strings with a normal string (e.g. "x = ' '.join([u'a',u'b'])"), the returned item is a normal string instead of a unicode string. If the string used in join is a unicode string, then the outcome is also unicode.

If you later use the string returned in the former case as a unicode string (e.g. "unicode(x)"), you get a UnicodeError if the original unicode strings contained non-ascii characters.

In CPython you get a unicode string in both cases, as you would expect. See the examples using Jython 2.2b1 and Python 2.4.3 below. In Jython 2.2a1 things work differently due to http://jython.org/bugs/1538001, which is fixed in the beta.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> ul = [u'Hyv\u00E4',u'Good']
>>>
>>> x = ' '.join(ul)
>>> type(x)
<type 'str'>
>>> unicode(x)
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii decoding error: ordinal not in range(128)
>>>
>>> y = u' '.join(ul)
>>> type(y)
<type 'unicode'>
>>> unicode(y)
u'Hyv\xE4 Good'
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ul = [u'Hyv\u00E4',u'Good']
>>>
>>> x = ' '.join(ul)
>>> type(x)
<type 'unicode'>
>>> unicode(x)
u'Hyv\xe4 Good'
>>>
>>> y = u' '.join(ul)
>>> type(y)
<type 'unicode'>
>>> unicode(y)
u'Hyv\xe4 Good'
>>>

----------------------------------------------------------------------

>Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-16 00:36

Message:
Logged In: YES
user_id=1379331
Originator: YES

I was trying to find the cause of some of these problems (partly to see how easily I can understand Jython internals) and found the following in /trunk/jython/src/templates/str.expose (revision 3058, lines 41-48):

expose_meth: join o
    String result = self.str_join(arg0);
    //XXX: do we really need to check self?
    if (self instanceof PyUnicode || arg0 instanceof PyUnicode) {
        return new PyUnicode(result);
    } else {
        return new PyString(result);
    }

I know next to nothing about these expose files (as I understand it, the actual Java code is generated from them), but if I read it correctly, "arg0 instanceof PyUnicode" checks whether the argument given to the join method is a unicode object. If that's the case, then the bug in join is exactly here -- the argument given to it is a sequence and thus never unicode itself. Instead, the code should check whether any of the items in the sequence is of unicode type. The XXX comment can probably also be removed, because u''.join([]) should create a unicode object too.

If str.expose is the right place to fix these issues, then the replace method ought to get checks similar to those in join. Fixing join and replace still leaves the "'%s' % u'x'" issue open.

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-16 00:16

Message:
Logged In: YES
user_id=1379331
Originator: YES

Also, replace seems to be affected. I'm testing with a post-alpha snapshot this time because I don't have the beta installed on this machine.
I couldn't find other methods implemented by the Python string type that could have this problem, but it's probably better that someone else also goes through them.

Jython 2.2a3005 on java1.5.0_09 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'a b'.replace('b',u'b')
>>> type(x)
<type 'str'>
>>> x
'a b'
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'a b'.replace('b',u'b')
>>> type(x)
<type 'unicode'>
>>> x
u'a b'
>>>

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-15 12:36

Message:
Logged In: YES
user_id=1379331
Originator: YES

Concatenating string and unicode seems to work correctly.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'Good ' + u'Hyv\u00E4'
>>> type(x)
<type 'unicode'>
>>> x
u'Good Hyv\xE4'
>>> unicode(x)
u'Good Hyv\xE4'

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-15 12:35

Message:
Logged In: YES
user_id=1379331
Originator: YES

The same problem also appears if you create a string using a pattern like 'Something %s' and the substituted string is unicode. The examples below demonstrate this.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'Good %s' % u'Hyv\u00E4'
>>> type(x)
<type 'str'>
>>> x
'Good Hyv\xE4'
>>> unicode(x)
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii decoding error: ordinal not in range(128)
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'Good %s' % u'Hyv\u00E4'
>>> type(x)
<type 'unicode'>
>>> x
u'Good Hyv\xe4'
>>> unicode(x)
u'Good Hyv\xe4'
>>>

----------------------------------------------------------------------
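
The rule proposed in the 2007-02-16 00:36 comment (look at the items of the sequence, not at the sequence object) can be modelled in a few lines. This is only a rough sketch and not the actual Jython fix: the function name join_returns_unicode is hypothetical, and byte-string literals (b'...') are used as a stand-in for the "normal string" type, so it needs Python 2.6 or later (or Python 3) rather than the 2.2/2.4 interpreters shown in this thread.

```python
# Sketch of the type-promotion rule suggested for join in this thread.
# Assumptions: 'join_returns_unicode' is a made-up name, and b'...' literals
# stand in for the plain (non-unicode) string type.

text_type = type(u'')  # 'unicode' on Python 2, 'str' on Python 3


def join_returns_unicode(separator, items):
    """Return True if separator.join(items) should produce a unicode object."""
    # A unicode separator is enough on its own; this also covers u''.join([]).
    if isinstance(separator, text_type):
        return True
    # Otherwise inspect every item. Checking the sequence object itself
    # (as the quoted 'arg0 instanceof PyUnicode' test does) misses this case.
    for item in items:
        if isinstance(item, text_type):
            return True
    return False
```

With this rule, a plain separator joined with [u'Hyv\u00E4', u'Good'] is promoted to unicode, a plain separator joined with only plain strings stays a plain string, and a unicode separator with an empty list still yields unicode.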