From: SourceForge.net <no...@so...> - 2007-02-14 15:16:31
Bugs item #1659819, was opened at 2007-02-14 17:16
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: Core
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Pekka Laukkanen (laukpe)
Assigned to: Nobody/Anonymous (nobody)
Summary: Joining unicode items with string doesn't create unicode

Initial Comment:
When you join a list of unicode strings with a normal string (e.g. "x = ' '.join([u'a',u'b'])") the returned item is a normal string instead of a unicode string. If the string used in join is a unicode string, then the outcome is also unicode.

If you later use the string returned in the former case as a unicode string (e.g. "unicode(x)"), you get a UnicodeError if the original unicode strings contained non-ASCII characters.

In CPython you get a unicode string in both cases, as you would expect. See the examples using Jython 2.2b1 and Python 2.4.3 below. In Jython 2.2a1 things work differently due to http://jython.org/bugs/1538001, which is fixed in the beta.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> ul = [u'Hyv\u00E4',u'Good']
>>>
>>> x = ' '.join(ul)
>>> type(x)
<type 'str'>
>>> unicode(x)
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii decoding error: ordinal not in range(128)
>>>
>>> y = u' '.join(ul)
>>> type(y)
<type 'unicode'>
>>> unicode(y)
u'Hyv\xE4 Good'
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ul = [u'Hyv\u00E4',u'Good']
>>>
>>> x = ' '.join(ul)
>>> type(x)
<type 'unicode'>
>>> unicode(x)
u'Hyv\xe4 Good'
>>>
>>> y = u' '.join(ul)
>>> type(y)
<type 'unicode'>
>>> unicode(y)
u'Hyv\xe4 Good'
>>>

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867
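The "ordinal not in range(128)" failure reported for `unicode(x)` can be reproduced outside Jython. What follows is only a minimal Python 3 sketch, assuming that Python 3 `bytes` is an acceptable stand-in for the Python 2 byte string and that the non-ASCII character ends up as the single byte 0xE4 (Latin-1). `decode_ascii` is a hypothetical helper written for this illustration, not part of Jython or CPython.

```python
# Python 3 sketch: bytes stands in for the Python 2 / Jython byte string.
# Assumption: the character U+00E4 is stored as the single byte 0xE4.
raw = u'Hyv\u00e4 Good'.encode('latin-1')


def decode_ascii(data):
    """Hypothetical helper: decode with the ASCII codec, reporting failures."""
    try:
        return data.decode('ascii')
    except UnicodeDecodeError as exc:
        # exc.reason carries the short explanation from the codec.
        return 'failed: %s' % exc.reason


print(decode_ascii(b'Good'))  # pure ASCII data decodes without trouble
print(decode_ascii(raw))      # the 0xE4 byte is outside the ASCII range
```

The point of the sketch is that once the result of the join has been typed as a byte string, any later implicit ASCII decoding breaks on the first non-ASCII character.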
From: SourceForge.net <no...@so...> - 2007-02-15 10:35:24
Bugs item #1659819, was opened at 2007-02-14 17:16
Message generated for change (Comment added) made by laukpe

>Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-15 12:35

Message:
Logged In: YES
user_id=1379331
Originator: YES

The same problem also appears if you create a string using a pattern like 'Something %s' and the substituted string is unicode. The examples below demonstrate this.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'Good %s' % u'Hyv\u00E4'
>>> type(x)
<type 'str'>
>>> x
'Good Hyv\xE4'
>>> unicode(x)
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii decoding error: ordinal not in range(128)
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'Good %s' % u'Hyv\u00E4'
>>> type(x)
<type 'unicode'>
>>> x
u'Good Hyv\xe4'
>>> unicode(x)
u'Good Hyv\xe4'
>>>
From: SourceForge.net <no...@so...> - 2007-02-15 10:36:56
Bugs item #1659819, was opened at 2007-02-14 17:16
Message generated for change (Comment added) made by laukpe

>Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-15 12:36

Message:
Logged In: YES
user_id=1379331
Originator: YES

Concatenating a string and unicode seems to work correctly.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'Good ' + u'Hyv\u00E4'
>>> type(x)
<type 'unicode'>
>>> x
u'Good Hyv\xE4'
>>> unicode(x)
u'Good Hyv\xE4'
From: SourceForge.net <no...@so...> - 2007-02-15 22:16:48
Bugs item #1659819, was opened at 2007-02-14 17:16
Message generated for change (Comment added) made by laukpe

>Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-16 00:16

Message:
Logged In: YES
user_id=1379331
Originator: YES

The replace method also seems to be affected. I'm testing with a post-alpha snapshot this time because I don't have the beta installed on this machine. I couldn't find other Python string methods that could have this problem, but it's probably better that someone else also goes through them.

Jython 2.2a3005 on java1.5.0_09 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'a b'.replace('b',u'b')
>>> type(x)
<type 'str'>
>>> x
'a b'
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'a b'.replace('b',u'b')
>>> type(x)
<type 'unicode'>
>>> x
u'a b'
>>>
From: SourceForge.net <no...@so...> - 2007-02-15 22:36:35
Bugs item #1659819, was opened at 2007-02-14 17:16
Message generated for change (Comment added) made by laukpe

>Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-16 00:36

Message:
Logged In: YES
user_id=1379331
Originator: YES

I was trying to find the cause of some of these problems (partly to see how easily I can understand Jython internals) and found the following in /trunk/jython/src/templates/str.expose (revision 3058).

expose_meth: join o
    String result = self.str_join(arg0);
    //XXX: do we really need to check self?
    if (self instanceof PyUnicode || arg0 instanceof PyUnicode) {
        return new PyUnicode(result);
    } else {
        return new PyString(result);
    }

I know next to nothing about these expose files (as I understand it, the actual Java code is generated from them), but if I got it right, "arg0 instanceof PyUnicode" checks whether the argument given to the join method is a unicode object. If that's the case, then the bug in join is exactly here -- the argument given to it is a sequence and thus never unicode itself. Instead, the code should check whether any of the items in the sequence is of unicode type. The XXX comment can probably also be removed, because u''.join([]) should create a unicode object too.

If str.expose is the right place to fix these issues, then the replace method ought to get checks similar to those in join. Fixing join and replace still leaves the "'%s' % u'x'" issue open.
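The check proposed in the comment above (look at the items of the sequence rather than at the sequence object itself) can be sketched as follows. This is only a Python 3 model of the type-selection rule, not Jython's actual code: the classes `PyString` and `PyUnicode` here are hypothetical stand-ins that merely borrow the names from the quoted template, and `join_buggy` and `join_fixed` are illustrative helper names.

```python
class PyString(str):
    """Hypothetical stand-in for Jython's byte string type."""


class PyUnicode(str):
    """Hypothetical stand-in for Jython's unicode type."""


def join_buggy(sep, items):
    """Mirror the quoted template: only sep and the argument itself are checked."""
    result = str(sep).join(items)
    if isinstance(sep, PyUnicode) or isinstance(items, PyUnicode):
        return PyUnicode(result)
    return PyString(result)


def join_fixed(sep, items):
    """Promote to PyUnicode if the separator or any single item is unicode."""
    items = list(items)  # materialise so the items can be scanned and then joined
    result = str(sep).join(items)
    if isinstance(sep, PyUnicode) or any(isinstance(i, PyUnicode) for i in items):
        return PyUnicode(result)
    return PyString(result)


words = [PyUnicode('Hyv\u00e4'), PyUnicode('Good')]
print(type(join_buggy(PyString(' '), words)).__name__)
print(type(join_fixed(PyString(' '), words)).__name__)
```

Keeping the separator check in `join_fixed` reflects the remark that u''.join([]) should still create a unicode object even though the sequence is empty.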
From: SourceForge.net <no...@so...> - 2007-02-16 06:25:08
Bugs item #1659819, was opened at 2007-02-14 10:16
Message generated for change (Comment added) made by cgroves

>Group: targeted for 2.2beta2

>Comment By: Charles Groves (cgroves)
Date: 2007-02-16 01:25

Message:
Logged In: YES
user_id=1174327
Originator: NO

It's nice to see you digging into the code, Pekka. It does sound like you've pinpointed the problem in join. To test it out, you can modify the indented section past expose_meth: join o to be the fixed Java code you've suggested. Then run 'python gexpose.py str.expose ../org/python/core/PyString.java' from the templates directory. That'll update PyString.java with your changes. After that it's just a matter of running a development copy of Jython as described in http://wiki.python.org/jython/JythonDeveloperGuide

I'll probably get to this for beta2, but patches always speed things along.
From: SourceForge.net <no...@so...> - 2007-02-16 10:25:54
|
Bugs item #1659819, was opened at 2007-02-14 17:16 Message generated for change (Comment added) made by laukpe You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Core Group: targeted for 2.2beta2 Status: Open Resolution: None Priority: 5 Private: No Submitted By: Pekka Laukkanen (laukpe) Assigned to: Nobody/Anonymous (nobody) Summary: Joining unicode items with string doesn't create unicode Initial Comment: When you join a list of unicode strings with a normal string (e.g. "x = ' '.join([u'a',u'b'])") the returned item is a normal string instead of a unicode string. If the string used in join is a unicode string then also the outcome is unicode. If you use the string returned in the former case as a unicode string later in your code (e.g. "unicode(x)") you get an UnicodeError if original unicode strings contained non-ascii characters. In CPython you get a unicode string in both cases as you would expect. See examples using Jython 2.2b1 and Python 2.4.3 below. In Jython 2.2a1 things work differently due to http://jython.org/bugs/1538001 that's fixed in beta. Jython 2.2b1 on java1.5.0_10 (JIT: null) Type "copyright", "credits" or "license" for more information. >>> ul = [u'Hyv\u00E4',u'Good'] >>> >>> x = ' '.join(ul) >>> type(x) <type 'str'> >>> unicode(x) Traceback (innermost last): File "<console>", line 1, in ? UnicodeError: ascii decoding error: ordinal not in range(128) >>> >>> y = u' '.join(ul) >>> type(y) <type 'unicode'> >>> unicode(y) u'Hyv\xE4 Good' >>> Python 2.4.3 (#1, May 18 2006, 07:40:45) [GCC 3.3.3 (cygwin special)] on cygwin Type "help", "copyright", "credits" or "license" for more information. 
>>> ul = [u'Hyv\u00E4',u'Good']
>>>
>>> x = ' '.join(ul)
>>> type(x)
<type 'unicode'>
>>> unicode(x)
u'Hyv\xe4 Good'
>>>
>>> y = u' '.join(ul)
>>> type(y)
<type 'unicode'>
>>> unicode(y)
u'Hyv\xe4 Good'
>>>

----------------------------------------------------------------------

>Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-16 12:25

Message:
Logged In: YES
user_id=1379331
Originator: YES

I need to set up a Jython development environment before I start to play with this more, but I assume that's not too hard nowadays. I think these problems really need some automated tests, so I probably need to investigate how the Jython test system works too. If I run into any problems I'll send queries to jython-dev.

----------------------------------------------------------------------

Comment By: Charles Groves (cgroves)
Date: 2007-02-16 08:25

Message:
Logged In: YES
user_id=1174327
Originator: NO

It's nice to see you digging into the code, Pekka. It does sound like you've pinpointed the problem in join. To test it out, you can modify the indented section past expose_meth: join o to be the fixed Java code you've suggested. Then run 'python gexpose.py str.expose ../org/python/core/PyString.java' from the templates directory. That'll update PyString.java with your changes. After that it's just a matter of running a development copy of jython as described in http://wiki.python.org/jython/JythonDeveloperGuide

I'll probably get to this for beta2, but patches always speed things along.

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-16 00:36

Message:
Logged In: YES
user_id=1379331
Originator: YES

I was trying to find the cause for some of these problems (partly to see how easily I can understand Jython internals) and found the following in /trunk/jython/src/templates/str.expose (revision 3058).

expose_meth: join o
    String result = self.str_join(arg0);
    //XXX: do we really need to check self?
    if (self instanceof PyUnicode || arg0 instanceof PyUnicode) {
        return new PyUnicode(result);
    } else {
        return new PyString(result);
    }

I know next to nothing about these expose files (as I understand it, the actual Java code is generated from them), but if I read it correctly, "arg0 instanceof PyUnicode" checks whether the argument given to the join method is a unicode object. If that's the case then the bug in join is exactly here -- the argument given to it is a sequence and thus never unicode itself. Instead the code should check whether any of the items in the sequence is of unicode type. The XXX comment can also probably be removed because u''.join([]) should create a unicode object too.

If str.expose is the right place to fix these issues, then the replace method ought to get checks similar to those in join. Fixing join and replace still leaves the "'%s' % u'x'" issue open.

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-16 00:16

Message:
Logged In: YES
user_id=1379331
Originator: YES

Also replace seems to be affected. Testing with a post-alpha snapshot this time because I don't have the beta installed on this machine. I couldn't find other methods implemented by Python string that could have this problem, but it's probably better that someone else also goes through them.

Jython 2.2a3005 on java1.5.0_09 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'a b'.replace('b',u'b')
>>> type(x)
<type 'str'>
>>> x
'a b'
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'a b'.replace('b',u'b')
>>> type(x)
<type 'unicode'>
>>> x
u'a b'
>>>

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-15 12:36

Message:
Logged In: YES
user_id=1379331
Originator: YES

Concatenating a string and unicode seems to work correctly.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'Good ' + u'Hyv\u00E4'
>>> type(x)
<type 'unicode'>
>>> x
u'Good Hyv\xE4'
>>> unicode(x)
u'Good Hyv\xE4'

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-15 12:35

Message:
Logged In: YES
user_id=1379331
Originator: YES

The same problem also appears if you create a string using a pattern like 'Something %s' and the substituted string is unicode. The examples below demonstrate.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'Good %s' % u'Hyv\u00E4'
>>> type(x)
<type 'str'>
>>> x
'Good Hyv\xE4'
>>> unicode(x)
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii decoding error: ordinal not in range(128)
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'Good %s' % u'Hyv\u00E4'
>>> type(x)
<type 'unicode'>
>>> x
u'Good Hyv\xe4'
>>> unicode(x)
u'Good Hyv\xe4'
>>>

----------------------------------------------------------------------

You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867 |
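Editorial sketch of the diagnosis in the 2007-02-16 00:36 comment: the original check looks at the join argument itself, which is a sequence, rather than at its items. The classes Str and Uni below are hypothetical stand-ins for PyString and PyUnicode (the subclass relationship is an assumption made for the sketch), and JoinArgCheckSketch and its method names are illustrative only. None of this is Jython's actual code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for PyString; not the real Jython class.
class Str {
    final String value;
    Str(String value) { this.value = value; }
}

// Hypothetical stand-in for PyUnicode, assumed here to subclass Str.
class Uni extends Str {
    Uni(String value) { super(value); }
}

class JoinArgCheckSketch {
    // Mirrors the shape of the original test: it inspects only the argument
    // object. A list is never a Uni, so this is false for any sequence.
    static boolean argumentIsUnicode(Object arg) {
        return arg instanceof Uni;
    }

    // Inspects the items instead and stops at the first unicode one.
    static boolean anyItemIsUnicode(Iterable<?> items) {
        for (Object item : items) {
            if (item instanceof Uni) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Str> items = Arrays.<Str>asList(new Uni("Hyv\u00E4"), new Uni("Good"));
        System.out.println(argumentIsUnicode(items)); // prints false
        System.out.println(anyItemIsUnicode(items));  // prints true
    }
}
```

The item-level check has to walk the whole sequence when nothing in it is unicode, which is the overhead concern raised later in the thread.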
From: SourceForge.net <no...@so...> - 2007-02-17 13:49:13
|
Bugs item #1659819, was opened at 2007-02-14 17:16 Message generated for change (Comment added) made by laukpe You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Core Group: targeted for 2.2beta2 Status: Open Resolution: None Priority: 5 Private: No Submitted By: Pekka Laukkanen (laukpe) Assigned to: Nobody/Anonymous (nobody) Summary: Joining unicode items with string doesn't create unicode Initial Comment: When you join a list of unicode strings with a normal string (e.g. "x = ' '.join([u'a',u'b'])") the returned item is a normal string instead of a unicode string. If the string used in join is a unicode string then also the outcome is unicode. If you use the string returned in the former case as a unicode string later in your code (e.g. "unicode(x)") you get an UnicodeError if original unicode strings contained non-ascii characters. In CPython you get a unicode string in both cases as you would expect. See examples using Jython 2.2b1 and Python 2.4.3 below. In Jython 2.2a1 things work differently due to http://jython.org/bugs/1538001 that's fixed in beta. Jython 2.2b1 on java1.5.0_10 (JIT: null) Type "copyright", "credits" or "license" for more information. >>> ul = [u'Hyv\u00E4',u'Good'] >>> >>> x = ' '.join(ul) >>> type(x) <type 'str'> >>> unicode(x) Traceback (innermost last): File "<console>", line 1, in ? UnicodeError: ascii decoding error: ordinal not in range(128) >>> >>> y = u' '.join(ul) >>> type(y) <type 'unicode'> >>> unicode(y) u'Hyv\xE4 Good' >>> Python 2.4.3 (#1, May 18 2006, 07:40:45) [GCC 3.3.3 (cygwin special)] on cygwin Type "help", "copyright", "credits" or "license" for more information. 
>>> ul = [u'Hyv\u00E4',u'Good']
>>>
>>> x = ' '.join(ul)
>>> type(x)
<type 'unicode'>
>>> unicode(x)
u'Hyv\xe4 Good'
>>>
>>> y = u' '.join(ul)
>>> type(y)
<type 'unicode'>
>>> unicode(y)
u'Hyv\xe4 Good'
>>>

----------------------------------------------------------------------

>Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-17 15:48

Message:
Logged In: YES
user_id=1379331
Originator: YES

I was able to fix join and the patch is below. Before really submitting it I want to create tests for it, try to fix the other affected methods too, and get some comments on the patch.

The patch itself is not too complicated but I'm a bit worried about the overhead in iterating over the given sequence twice -- first in PyString.str_join and then again in this code. I made the check for possible unicode items so that it short-circuits but the worst case of iterating over the whole sequence when there is nothing unicode is unfortunately also the common case.

I'd say a better approach would be determining the return type already in PyString.str_join but that requires changes in so many places that it's better done by someone who also understands the big picture behind this expose/derive system. Already changing PyString.str_join to return PyString instead of String requires a few changes elsewhere in PyString.

Index: src/templates/str.expose
===================================================================
--- src/templates/str.expose (revision 3110)
+++ src/templates/str.expose (working copy)
@@ -40,12 +40,17 @@
 expose_meth: :b isupper
 expose_meth: join o
     String result = self.str_join(arg0);
-    //XXX: do we really need to check self?
-    if (self instanceof PyUnicode || arg0 instanceof PyUnicode) {
+    if (self instanceof PyUnicode) {
         return new PyUnicode(result);
-    } else {
-        return new PyString(result);
     }
+    PyObject iter = arg0.__iter__();
+    PyObject obj = null;
+    for (int i = 0; (obj = iter.__iternext__()) != null; i++) {
+        if (obj instanceof PyUnicode) {
+            return new PyUnicode(result);
+        }
+    }
+    return new PyString(result);
 expose_meth: :s ljust i
 expose_meth: :s lower
 expose_meth: :s lstrip S?

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-16 12:25

Message:
Logged In: YES
user_id=1379331
Originator: YES

I need to set up a Jython development environment before I start to play with this more, but I assume that's not too hard nowadays. I think these problems really need some automated tests, so I probably need to investigate how the Jython test system works too. If I run into any problems I'll send queries to jython-dev.

----------------------------------------------------------------------

Comment By: Charles Groves (cgroves)
Date: 2007-02-16 08:25

Message:
Logged In: YES
user_id=1174327
Originator: NO

It's nice to see you digging into the code, Pekka. It does sound like you've pinpointed the problem in join. To test it out, you can modify the indented section past expose_meth: join o to be the fixed Java code you've suggested. Then run 'python gexpose.py str.expose ../org/python/core/PyString.java' from the templates directory. That'll update PyString.java with your changes. After that it's just a matter of running a development copy of jython as described in http://wiki.python.org/jython/JythonDeveloperGuide

I'll probably get to this for beta2, but patches always speed things along.
From: SourceForge.net <no...@so...> - 2007-02-18 06:36:05
|
Bugs item #1659819, was opened at 2007-02-14 10:16 Message generated for change (Comment added) made by cgroves You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Core Group: targeted for 2.2beta2 Status: Open Resolution: None Priority: 5 Private: No Submitted By: Pekka Laukkanen (laukpe) Assigned to: Nobody/Anonymous (nobody) Summary: Joining unicode items with string doesn't create unicode Initial Comment: When you join a list of unicode strings with a normal string (e.g. "x = ' '.join([u'a',u'b'])") the returned item is a normal string instead of a unicode string. If the string used in join is a unicode string then also the outcome is unicode. If you use the string returned in the former case as a unicode string later in your code (e.g. "unicode(x)") you get an UnicodeError if original unicode strings contained non-ascii characters. In CPython you get a unicode string in both cases as you would expect. See examples using Jython 2.2b1 and Python 2.4.3 below. In Jython 2.2a1 things work differently due to http://jython.org/bugs/1538001 that's fixed in beta. Jython 2.2b1 on java1.5.0_10 (JIT: null) Type "copyright", "credits" or "license" for more information. >>> ul = [u'Hyv\u00E4',u'Good'] >>> >>> x = ' '.join(ul) >>> type(x) <type 'str'> >>> unicode(x) Traceback (innermost last): File "<console>", line 1, in ? UnicodeError: ascii decoding error: ordinal not in range(128) >>> >>> y = u' '.join(ul) >>> type(y) <type 'unicode'> >>> unicode(y) u'Hyv\xE4 Good' >>> Python 2.4.3 (#1, May 18 2006, 07:40:45) [GCC 3.3.3 (cygwin special)] on cygwin Type "help", "copyright", "credits" or "license" for more information. 
>>> ul = [u'Hyv\u00E4',u'Good'] >>> >>> x = ' '.join(ul) >>> type(x) <type 'unicode'> >>> unicode(x) u'Hyv\xe4 Good' >>> >>> y = u' '.join(ul) >>> type(y) <type 'unicode'> >>> unicode(y) u'Hyv\xe4 Good' >>> ---------------------------------------------------------------------- >Comment By: Charles Groves (cgroves) Date: 2007-02-18 01:36 Message: Logged In: YES user_id=1174327 Originator: NO I think we can actually remove the self instanceof PyUnicode as well while we're working on this; PyUnicode exposes a join method of its own that always returns PyUnicode. I only see a couple things you'd have to change to have str_join return PyString: PyString.join would have to return str_join(seq).string and PyUnicode.unicode_join would have to replace the result with an instance of PyUnicode if the str_join didn't require it. That's probably better because I think it means we could remove the custom method implementation code for join in str.expose in addition to getting rid of a second iteration. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867 |
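Editorial sketch of the single-pass idea discussed in this thread (deciding the result type inside the join itself so the sequence is not iterated a second time). ByteStr and UniStr are hypothetical stand-ins for PyString and PyUnicode, and SinglePassJoinSketch.join is an illustrative name. This is not the code applied to Jython, only one possible shape of the approach under those assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for PyString; not the real Jython class.
class ByteStr {
    final String string;
    ByteStr(String string) { this.string = string; }
}

// Hypothetical stand-in for PyUnicode, assumed here to subclass ByteStr.
class UniStr extends ByteStr {
    UniStr(String string) { super(string); }
}

class SinglePassJoinSketch {
    // Joins the items with the separator and records, in the same loop,
    // whether the separator or any item is unicode.
    static ByteStr join(ByteStr separator, Iterable<? extends ByteStr> items) {
        StringBuilder buf = new StringBuilder();
        boolean needsUnicode = separator instanceof UniStr;
        boolean first = true;
        for (ByteStr item : items) {
            if (!first) {
                buf.append(separator.string);
            }
            first = false;
            needsUnicode |= item instanceof UniStr;
            buf.append(item.string);
        }
        String joined = buf.toString();
        return needsUnicode ? new UniStr(joined) : new ByteStr(joined);
    }

    public static void main(String[] args) {
        List<ByteStr> items = Arrays.asList(new UniStr("Hyv\u00E4"), new ByteStr("Good"));
        ByteStr joined = join(new ByteStr(" "), items);
        System.out.println(joined instanceof UniStr); // prints true
    }
}
```

Because the type flag is updated while the pieces are appended, a sequence with no unicode items costs one traversal rather than two.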
From: SourceForge.net <no...@so...> - 2007-02-19 01:15:09
|
Bugs item #1659819, was opened at 2007-02-14 17:16 Message generated for change (Comment added) made by laukpe You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Core Group: targeted for 2.2beta2 Status: Open Resolution: None Priority: 5 Private: No Submitted By: Pekka Laukkanen (laukpe) Assigned to: Nobody/Anonymous (nobody) Summary: Joining unicode items with string doesn't create unicode Initial Comment: When you join a list of unicode strings with a normal string (e.g. "x = ' '.join([u'a',u'b'])") the returned item is a normal string instead of a unicode string. If the string used in join is a unicode string then also the outcome is unicode. If you use the string returned in the former case as a unicode string later in your code (e.g. "unicode(x)") you get an UnicodeError if original unicode strings contained non-ascii characters. In CPython you get a unicode string in both cases as you would expect. See examples using Jython 2.2b1 and Python 2.4.3 below. In Jython 2.2a1 things work differently due to http://jython.org/bugs/1538001 that's fixed in beta. Jython 2.2b1 on java1.5.0_10 (JIT: null) Type "copyright", "credits" or "license" for more information. >>> ul = [u'Hyv\u00E4',u'Good'] >>> >>> x = ' '.join(ul) >>> type(x) <type 'str'> >>> unicode(x) Traceback (innermost last): File "<console>", line 1, in ? UnicodeError: ascii decoding error: ordinal not in range(128) >>> >>> y = u' '.join(ul) >>> type(y) <type 'unicode'> >>> unicode(y) u'Hyv\xE4 Good' >>> Python 2.4.3 (#1, May 18 2006, 07:40:45) [GCC 3.3.3 (cygwin special)] on cygwin Type "help", "copyright", "credits" or "license" for more information. 
>>> ul = [u'Hyv\u00E4',u'Good'] >>> >>> x = ' '.join(ul) >>> type(x) <type 'unicode'> >>> unicode(x) u'Hyv\xe4 Good' >>> >>> y = u' '.join(ul) >>> type(y) <type 'unicode'> >>> unicode(y) u'Hyv\xe4 Good' >>> ---------------------------------------------------------------------- >Comment By: Pekka Laukkanen (laukpe) Date: 2007-02-19 03:15 Message: Logged In: YES user_id=1379331 Originator: YES I was able to get join and replace working correctly so that argument checking is done only in the hand written part of the PyString. I'll attach a patch so that you can check is it even close to be correct. Fixes are still missing tests and also "'%s' % u'x'" issue is open. File Added: patch.1 ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867 |
From: SourceForge.net <no...@so...> - 2007-02-28 18:58:48
|
Bugs item #1659819, was opened at 2007-02-14 10:16
Message generated for change (Comment added) made by cgroves
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Core
Group: targeted for 2.2beta2
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Pekka Laukkanen (laukpe)
Assigned to: Nobody/Anonymous (nobody)
Summary: Joining unicode items with string doesn't create unicode

Initial Comment:
When you join a list of unicode strings with a normal string (e.g. "x = ' '.join([u'a',u'b'])") the returned item is a normal string instead of a unicode string. If the string used in join is a unicode string then the outcome is also unicode. If you use the string returned in the former case as a unicode string later in your code (e.g. "unicode(x)") you get a UnicodeError if the original unicode strings contained non-ascii characters.

In CPython you get a unicode string in both cases as you would expect.

See examples using Jython 2.2b1 and Python 2.4.3 below. In Jython 2.2a1 things work differently due to http://jython.org/bugs/1538001 that's fixed in beta.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> ul = [u'Hyv\u00E4',u'Good']
>>>
>>> x = ' '.join(ul)
>>> type(x)
<type 'str'>
>>> unicode(x)
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii decoding error: ordinal not in range(128)
>>>
>>> y = u' '.join(ul)
>>> type(y)
<type 'unicode'>
>>> unicode(y)
u'Hyv\xE4 Good'
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ul = [u'Hyv\u00E4',u'Good']
>>>
>>> x = ' '.join(ul)
>>> type(x)
<type 'unicode'>
>>> unicode(x)
u'Hyv\xe4 Good'
>>>
>>> y = u' '.join(ul)
>>> type(y)
<type 'unicode'>
>>> unicode(y)
u'Hyv\xe4 Good'
>>>

----------------------------------------------------------------------

>Comment By: Charles Groves (cgroves)
Date: 2007-02-28 13:58

Message:
Logged In: YES
user_id=1174327
Originator: NO

join and replace fix applied in r3127. I opened a new bug for the format issue in http://jython.org/bugs/1671134

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-18 20:15

Message:
Logged In: YES
user_id=1379331
Originator: YES

I was able to get join and replace working correctly so that argument checking is done only in the hand-written part of PyString. I'll attach a patch so that you can check whether it is even close to correct. The fixes are still missing tests, and the "'%s' % u'x'" issue is also still open.
File Added: patch.1

----------------------------------------------------------------------

Comment By: Charles Groves (cgroves)
Date: 2007-02-18 01:36

Message:
Logged In: YES
user_id=1174327
Originator: NO

I think we can actually remove the self instanceof PyUnicode as well while we're working on this; PyUnicode exposes a join method of its own that always returns PyUnicode. I only see a couple of things you'd have to change to have str_join return PyString: PyString.join would have to return str_join(seq).string and PyUnicode.unicode_join would have to replace the result with an instance of PyUnicode if the str_join didn't require it. That's probably better because I think it means we could remove the custom method implementation code for join in str.expose in addition to getting rid of a second iteration.
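As a rough illustration of the single-pass idea in the comment above, here is a small Python model. The classes are hypothetical stand-ins that only borrow the names PyString, PyUnicode, str_join and unicode_join from the discussion; the real Jython code is Java and may well differ. In this sketch str_join notes whether any item is unicode while it joins, so no second iteration over the sequence is needed.

```python
# Hypothetical stand-ins for illustration only; not Jython's actual classes.
class PyString(object):
    def __init__(self, string):
        self.string = string  # the wrapped text, like the Java String field

    def str_join(self, seq):
        """Join seq in one pass and choose the result type along the way."""
        parts = []
        saw_unicode = False
        for item in seq:
            # Remember whether any item forces a unicode result.
            saw_unicode = saw_unicode or isinstance(item, PyUnicode)
            parts.append(item.string)
        joined = self.string.join(parts)
        if saw_unicode:
            return PyUnicode(joined)
        return PyString(joined)

    def join(self, seq):
        # Callers that want the raw text unwrap the result.
        return self.str_join(seq).string


class PyUnicode(PyString):
    def unicode_join(self, seq):
        # A unicode separator always yields unicode, even for an empty seq.
        result = self.str_join(seq)
        if not isinstance(result, PyUnicode):
            result = PyUnicode(result.string)
        return result


sep = PyString(' ')
items = [PyUnicode(u'Hyv\u00e4'), PyUnicode(u'Good')]
mixed = sep.str_join(items)                           # unicode items
plain = sep.str_join([PyString('a'), PyString('b')])  # plain items only
empty = PyUnicode(u'').unicode_join([])               # u''.join([]) case
```

Here mixed and empty end up as PyUnicode and plain stays a PyString, which is the behaviour the CPython sessions in this thread show for join.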
----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-17 08:48

Message:
Logged In: YES
user_id=1379331
Originator: YES

I was able to fix join and the patch is below. Before really submitting it I want to create tests for this first, try to fix the other affected methods too, and get some comments about the patch.

The patch itself is not too complicated but I'm a bit worried about the overhead of iterating over the given sequence twice -- first in PyString.str_join and then again in this code. I made the check for possible unicode items so that it short-circuits, but the worst case of iterating over the whole sequence when there is nothing unicode is unfortunately also the common case.

I'd say a better approach would be determining the return type already in PyString.str_join, but that requires changes in so many places that it's better done by someone who also understands the big picture behind this expose/derive system. Already changing PyString.str_join to return PyString instead of String requires a few changes elsewhere in PyString.

Index: src/templates/str.expose
===================================================================
--- src/templates/str.expose    (revision 3110)
+++ src/templates/str.expose    (working copy)
@@ -40,12 +40,17 @@
 expose_meth: :b isupper
 expose_meth: join o
     String result = self.str_join(arg0);
-    //XXX: do we really need to check self?
-    if (self instanceof PyUnicode || arg0 instanceof PyUnicode) {
+    if (self instanceof PyUnicode) {
         return new PyUnicode(result);
-    } else {
-        return new PyString(result);
     }
+    PyObject iter = arg0.__iter__();
+    PyObject obj = null;
+    for (int i = 0; (obj = iter.__iternext__()) != null; i++) {
+        if (obj instanceof PyUnicode) {
+            return new PyUnicode(result);
+        }
+    }
+    return new PyString(result);
 expose_meth: :s ljust i
 expose_meth: :s lower
 expose_meth: :s lstrip S?
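The tests mentioned above could start from something like the following sketch. It only restates the interpreter sessions from this thread as checks; the helper name is_unicode and the results dictionary are made up for illustration, and how such a script would be hooked into Jython's own test system is not covered here. On an interpreter where plain string literals are already unicode, the checks pass trivially.

```python
def is_unicode(value):
    # type(u'') is the unicode type on Python/Jython 2.x.
    return type(value) is type(u'')


ul = [u'Hyv\u00e4', u'Good']

# Each expression mixes a plain string with unicode operands.
results = {
    'join': ' '.join(ul),
    'replace': 'a b'.replace('b', u'b'),
    'format': 'Good %s' % u'Hyv\u00e4',
    'concat': 'Good ' + u'Hyv\u00e4',
}

# Collect the names of the operations that did not produce unicode.
failures = [name for name in results if not is_unicode(results[name])]
failures.sort()
print(failures)
```

On CPython this prints an empty list. Going by the sessions reported in this thread, an unfixed Jython would be expected to list join, replace and format, while concat already behaves correctly.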
----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-16 05:25

Message:
Logged In: YES
user_id=1379331
Originator: YES

I need to set up a Jython devenv before starting to play with this more, but I assume that's not too hard nowadays. I think these problems really need some automated tests, so I probably need to investigate how the Jython test system works too. If I run into any problems I'll send queries to jython-dev.

----------------------------------------------------------------------

Comment By: Charles Groves (cgroves)
Date: 2007-02-16 01:25

Message:
Logged In: YES
user_id=1174327
Originator: NO

It's nice to see you digging into the code, Pekka. It does sound like you've pinpointed the problem in join. To test it out, you can modify the indented section past expose_meth: join o to be the fixed Java code you've suggested. Then run 'python gexpose.py str.expose ../org/python/core/PyString.java' from the templates directory. That'll update PyString.java with your changes. After that it's just a matter of running a development copy of jython as described in http://wiki.python.org/jython/JythonDeveloperGuide

I'll probably get to this for beta2, but patches always speed things along.

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-15 17:36

Message:
Logged In: YES
user_id=1379331
Originator: YES

I was trying to find the cause for some of these problems (partly to see how easily I can understand Jython internals) and found the following in /trunk/jython/src/templates/str.expose (revision 3058).

expose_meth: join o
    String result = self.str_join(arg0);
    //XXX: do we really need to check self?
    if (self instanceof PyUnicode || arg0 instanceof PyUnicode) {
        return new PyUnicode(result);
    } else {
        return new PyString(result);
    }

I know next to nothing about these expose files (I've understood the actual Java code is generated from them) but if I got it correctly, "arg0 instanceof PyUnicode" checks whether the argument given to the join method is a unicode object. If that's the case then the bug in join is exactly here -- the argument given to it is a sequence and thus never unicode itself. Instead the code should check whether any of the items in the sequence is of unicode type. The XXX comment can also probably be removed because u''.join([]) should create a unicode object too.

If str.expose is the right place to fix these issues then the replace method also ought to get checks similar to those in join. Fixing join and replace still leaves the "'%s' % u'x'" issue open.

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-15 17:16

Message:
Logged In: YES
user_id=1379331
Originator: YES

Also replace seems to be affected. Testing with a post-alpha snapshot this time because I don't have the beta installed on this machine. I couldn't find other methods implemented by Python string that could have this problem, but it's probably better that someone else also goes through them.

Jython 2.2a3005 on java1.5.0_09 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'a b'.replace('b',u'b')
>>> type(x)
<type 'str'>
>>> x
'a b'
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'a b'.replace('b',u'b')
>>> type(x)
<type 'unicode'>
>>> x
u'a b'
>>>

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-15 05:36

Message:
Logged In: YES
user_id=1379331
Originator: YES

Concatenating string and unicode seems to work correctly.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'Good ' + u'Hyv\u00E4'
>>> type(x)
<type 'unicode'>
>>> x
u'Good Hyv\xE4'
>>> unicode(x)
u'Good Hyv\xE4'

----------------------------------------------------------------------

Comment By: Pekka Laukkanen (laukpe)
Date: 2007-02-15 05:35

Message:
Logged In: YES
user_id=1379331
Originator: YES

The same problem also appears if you create a string using a pattern like 'Something %s' and the substituted string is unicode. The examples below demonstrate.

Jython 2.2b1 on java1.5.0_10 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = 'Good %s' % u'Hyv\u00E4'
>>> type(x)
<type 'str'>
>>> x
'Good Hyv\xE4'
>>> unicode(x)
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii decoding error: ordinal not in range(128)
>>>

Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'Good %s' % u'Hyv\u00E4'
>>> type(x)
<type 'unicode'>
>>> x
u'Good Hyv\xe4'
>>> unicode(x)
u'Good Hyv\xe4'
>>>

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1659819&group_id=12867
|