From: Paul P. <pa...@pr...> - 2002-01-10 17:47:57
|
Brian Quinlan reported this strange result to me:

Jython 2.1 on java1.3.1_02 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x=u'\x81'
>>> unicode(x)
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii decoding error: ordinal not in range(128)

If Jython is going to unify 8-bit strings and Unicode strings (as Java
does) then it should probably treat them all as Unicode strings, not as
8-bit.

Paul Prescod
|
From: <bc...@wo...> - 2002-01-10 19:26:37
|
[Paul Prescod]

>Brian Quinlan reported this strange result to me:
>
>Jython 2.1 on java1.3.1_02 (JIT: null)
>Type "copyright", "credits" or "license" for more information.
>>>> x=u'\x81'
>>>> unicode(x)
>Traceback (innermost last):
>  File "<console>", line 1, in ?
>UnicodeError: ascii decoding error: ordinal not in range(128)

In CPython the unicode() builtin function will either:

- decode a byte string
- return a unicode argument unmodified.

In jython there is no difference between the strings u"\x81" and "\x81".
So we can only do one of these two things.

>If Jython is going to unify 8-bit strings and Unicode strings (as Java
>does)

IMO, there is very little unification between java byte arrays and java
strings and the methods that existed initially to convert between them
have since been deprecated. That is A Good Thing because it clearly
separates the obvious use of bytes and characters.

>then it should probably treat them all as Unicode strings, not as
>8-bit.

Then unicode() would be a no-op. It would just return the argument
without doing anything.

It is unfortunate that there are such differences between CPython and
Jython, but it is a natural consequence of our design where we decided
to work without a byte string type.

regards,
finn
|
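The two behaviours Finn lists for unicode() can be sketched in Python 3, where bytes and str are distinct types, much as str and unicode were in CPython 2. The helper name to_text is hypothetical and only mimics the dispatch described in the message above; it is not CPython's or Jython's actual implementation.

```python
def to_text(obj, encoding="ascii"):
    """Mimic the two cases of CPython 2's one-argument unicode() call."""
    if isinstance(obj, str):
        # Already text: return the argument unmodified.
        return obj
    if isinstance(obj, bytes):
        # Byte string: decode it; raises UnicodeDecodeError on bad input.
        return obj.decode(encoding)
    raise TypeError("expected str or bytes")


if __name__ == "__main__":
    # Text passes through untouched.
    print(repr(to_text("\x81")))
    try:
        # The byte 0x81 is outside the ASCII range, so decoding fails.
        to_text(b"\x81")
    except UnicodeDecodeError as exc:
        print("decoding failed:", exc.reason)
```

A runtime with a single string type cannot tell which branch applies, so it has to pick one of the two for every input. That is the constraint described above.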
From: Paul P. <pa...@pr...> - 2002-01-10 22:49:55
|
Finn Bock wrote:
>
>...
>
> IMO, there is very little unification between java byte arrays and java
> strings and the methods that existed initially to convert between them
> have since been deprecated. That is A Good Thing because it clearly
> separates the obvious use of bytes and characters.

You are right. What I meant was that "plain old strings" are Unicode in
Java. Byte arrays are not considered strings at all.

> >then it should probably treat them all as Unicode strings, not as
> >8-bit.
>
> Then unicode() would be a no-op. It would just return the argument
> without doing anything.

Is that a problem? If the user specifies an encoding then you could
decode. If they don't, I would suggest to just do a no-op. Under what
circumstances would the current exception be more helpful?

Paul Prescod
|
From: Brian Q. <br...@sw...> - 2002-01-11 00:14:38
|
Paul Prescod wrote:
> Finn Bock wrote:
> >
> >...
> >
> > IMO, there is very little unification between java byte arrays and java
> > strings and the methods that existed initially to convert between them
> > have since been deprecated. That is A Good Thing because it clearly
> > separates the obvious use of bytes and characters.
>
> You are right. What I meant was that "plain old strings" are Unicode in
> Java. Byte arrays are not considered strings at all.

And this is the real problem. Python strings are really just (unsigned)
byte arrays. In any case, it is problematic to map two different types
(string and Unicode objects) to the same Java type.

> > >then it should probably treat them all as Unicode strings, not as
> > >8-bit.
> >
> > Then unicode() would be a no-op. It would just return the argument
> > without doing anything.
>
> Is that a problem? If the user specifies an encoding then you could
> decode. If they don't, I would suggest to just do a no-op. Under what
> circumstances would the current exception be more helpful?

Because you are specifically looking for the exception to see if the
string can be converted to a Unicode object using the default encoding?

Cheers,
Brian
|
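Brian's use case, relying on the exception as a probe, might look like the following Python 3 sketch. The function name decodes_with is hypothetical, and the codec is passed explicitly because Python 3 has no changeable default encoding.

```python
def decodes_with(data, encoding="ascii"):
    """Return True if the byte string decodes cleanly with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for sample in (b"plain", b"\x81"):
        print(sample, decodes_with(sample))
```

If the one-argument call became a no-op, a probe like this would never report a failure, which is the loss of information Brian is pointing at.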
From: dman <ds...@ri...> - 2002-01-11 00:41:52
|
On Thu, Jan 10, 2002 at 04:16:14PM -0800, Brian Quinlan wrote:
| Paul Prescod wrote:

| > Is that a problem? If the user specifies an encoding then you could
| > decode. If they don't, I would suggest to just do a no-op. Under what
| > circumstances would the current exception be more helpful?
|
| Because you are specifically looking for the exception to see if the
| string can be converted to a Unicode object using the default encoding?

Is it supposed to be an error when trying to convert a unicode object
to a unicode object? I don't think so. I can convert an int to an
int.

>>> x = u"\u20ac"
>>> x = unicode( u"\u20ac" )
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii decoding error: ordinal not in range(128)
>>>

(I used assignment so I won't get the error of printing non-ascii
characters on an ascii display)

-D

--
He who finds a wife finds what is good
and receives favor from the Lord.
Proverbs 18:22
|
From: Samuele P. <ped...@bl...> - 2002-01-11 00:57:56
|
[dman]

> On Thu, Jan 10, 2002 at 04:16:14PM -0800, Brian Quinlan wrote:
> | Paul Prescod wrote:
>
> | > Is that a problem? If the user specifies an encoding then you could
> | > decode. If they don't, I would suggest to just do a no-op. Under what
> | > circumstances would the current exception be more helpful?
> |
> | Because you are specifically looking for the exception to see if the
> | string can be converted to a Unicode object using the default encoding?
>
> Is it supposed to be an error when trying to convert a unicode object
> to a unicode object? I don't think so. I can convert an int to an
> int.
>
> >>> x = u"\u20ac"
> >>> x = unicode( u"\u20ac" )
> Traceback (innermost last):
>   File "<console>", line 1, in ?
> UnicodeError: ascii decoding error: ordinal not in range(128)
> >>>
>
> (I used assignment so I won't get the error of printing non-ascii
> characters on an ascii display)
>

I start to think that Paul Prescod is right here.

In CPython

Python 2.1 (#15, Apr 16 2001, 18:25:49) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
>>> unicode(u"\xe9")
u'\xe9'
>>>

while

Jython 2.1 on java1.3.0 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> unicode(u"\xe9")
Traceback (innermost last):
  File "<console>", line 1, in ?
UnicodeError: ascii decoding error: ordinal not in range(128)
>>>

The question is: is it better to fail when CPython does not fail, or not to
fail when CPython fails and succeed when CPython succeeds? I'm maybe missing
something subtle but I prefer the latter, and so unicode without an encoding
should be a nop.

regards.
|
From: <bc...@wo...> - 2002-01-11 22:33:15
|
[samuele]

>I start to think that Paul Prescod is right here.
>
>In CPython
>
>Python 2.1 (#15, Apr 16 2001, 18:25:49) [MSC 32 bit (Intel)] on win32
>Type "copyright", "credits" or "license" for more information.
>>>> unicode(u"\xe9")
>u'\xe9'
>>>>
>
>while
>
>Jython 2.1 on java1.3.0 (JIT: null)
>Type "copyright", "credits" or "license" for more information.
>>>> unicode(u"\xe9")
>Traceback (innermost last):
>  File "<console>", line 1, in ?
>UnicodeError: ascii decoding error: ordinal not in range(128)
>>>>
>
>The question is: is it better to fail when CPython does not fail, or not to
>fail when CPython fails and succeed when CPython succeeds? I'm maybe missing
>something subtle but I prefer the latter, and so unicode without an encoding
>should be a nop.

The 2.2 docs have (the 2.1 docs didn't) a huge special case for the
single-arg unicode() call.

http://www.python.org/doc/current/lib/built-in-funcs.html

So, yes, we could change this for 2.2 without breaking the docs.

regards,
finn
|
From: <bc...@wo...> - 2002-01-11 13:18:00
|
On Thu, 10 Jan 2002 19:49:45 -0500, you wrote:

[Paul Prescod wrote]

> Is that a problem? If the user specifies an encoding then you could
> decode. If they don't, I would suggest to just do a no-op. Under what
> circumstances would the current exception be more helpful?

[Brian Quinlan]

> Because you are specifically looking for the exception to see if the
> string can be converted to a Unicode object using the default encoding?

No, it's because you can change the codec used to something more useful
than the default "ascii".

[dman]

>Is it supposed to be an error when trying to convert a unicode object
>to a unicode object? I don't think so.

And I agree. Passing a unicode object to unicode() should have been a
no-op. Jython only has this 'problem' because we can't have different
execution paths based on the type of string objects.

regards,
finn
|
From: <bc...@wo...> - 2002-01-11 10:49:03
|
>> Then unicode() would be a no-op. It would just return the argument
>> without doing anything.

[Paul Prescod]

>Is that a problem? If the user specifies an encoding then you could
>decode. If they don't, I would suggest to just do a no-op. Under what
>circumstances would the current exception be more helpful?

The single argument unicode() call will use the default encoding. That
can be changed to something other than "ascii" and then a no-op makes
less sense:

[d:\]p22 -S
Python 2.2 (#28, Dec 21 2001, 12:21:22) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.setdefaultencoding("cp1253")
>>> unicode("\x80")
u'\u20ac'
>>>

regards,
finn
|
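The effect of the codec choice in Finn's session can be reproduced in Python 3 by naming the codec at the call site instead of changing a global default. This is only a sketch: the helper decode_results is hypothetical, and sys.setdefaultencoding itself is not available in Python 3.

```python
def decode_results(data, codecs):
    """Decode the same bytes with several codecs; None marks a failure."""
    results = {}
    for name in codecs:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    outcome = decode_results(b"\x80", ["ascii", "cp1253", "latin-1"])
    for name, text in outcome.items():
        print(name, repr(text))
```

The same byte fails under "ascii", becomes the euro sign under "cp1253", and maps to U+0080 under "latin-1", so only the last of these behaves like a no-op.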
From: Paul P. <pa...@pr...> - 2002-01-11 17:12:19
|
Finn Bock wrote:
>
>...
>
> The single argument unicode() call will use the default encoding. That
> can be changed to something other than "ascii" and then a no-op makes
> less sense:
>
> [d:\]p22 -S

There is a reason you needed to put in that -S.

Changing the default encoding is not recommended and is not really
officially supported.

I would suggest that a higher priority be put on normal uses than on
variant ones.

In the interest of full disclosure, I have been a strong critic of the
idea that there should be a per-machine changeable default encoding.
Obviously enough people agree with me that it hasn't been made a
standard feature yet.

A public sys.setdefaultencoding would be a bad idea because:

 * it is a global variable, with all of the typical scalability
problems that implies. Setting it in one place could break code in some
library module you don't know about.

 * it doesn't work well with a networked world where information coming
off the network could be in random encodings based on other people's
machine encodings

Paul Prescod
|
From: <bc...@wo...> - 2002-01-11 22:39:04
|
[Paul Prescod]

>Finn Bock wrote:
>>
>>...
>>
>> The single argument unicode() call will use the default encoding. That
>> can be changed to something other than "ascii" and then a no-op makes
>> less sense:
>>
>> [d:\]p22 -S
>
>There is a reason you needed to put in that -S.

True. Which is why I added my startup command to the snippet.

>Changing the default encoding is not recommended and is not really
>officially supported.

Right, but the current default value of "ascii" is not written in stone.
It just happened to be the lowest common denominator that the major
unicode players could agree about. You too remember the fighting, it
wasn't pretty.

>I would suggest that a higher priority be put on normal uses than on
>variant ones.

Is a no-op a normal use? Maybe when porting CPython applications?

>In the interest of full disclosure, I have been a strong critic of the
>idea that there should be a per-machine changeable default encoding.

I agree that the setting should *not* be changeable, but I *strongly*
disagree with the current choice of ascii. It only makes sense to
Americans.

In jython the IMO obvious default would be the file.encoding property.
Maybe I should have picked that default when I added unicode support,
but after seeing the casualties of that discussion on python-dev, I
didn't dare.

regards,
finn
|
From: Paul P. <pa...@pr...> - 2002-01-11 23:18:09
|
Finn Bock wrote:
>
>...
>
> Is a no-op a normal use? Maybe when porting CPython applications?

It wasn't a porting project when Brian ran into it. But maybe he was
thinking in CPython terms as he was coding. Is it accurate to say that
either way, the single-arg unicode() call is useless in a
made-for-Jython application? If so, the question is how to emulate
CPython most directly. And then it is really a probabilistic issue
because you can never get 100% if CPython has two types where Jython
has only one.

If you turn it into a no-op, you will have the effect in some cases of
making Latin1 the default encoding (which is what I have proposed for
CPython). Consider:

If this is a no-op:

>>> x=u'\x81'
>>> x=unicode(x)

Then so is this:

>>> x='\x81'
>>> x=unicode(x)

I have no problem with that, myself, but it is precisely the proposal
that caused the heated i18n flamewars.

>...
>
> In jython the IMO obvious default would be the file.encoding property.
> Maybe I should have picked that default when I added unicode support,
> but after seeing the casualties of that discussion on python-dev, I
> didn't dare.

Still, that wouldn't be a cure-all. In CPython, this would always be a
no-op:

unicode(unicode(unicode(unicode(u"..."))))

In Jython, it would decode according to the file.encoding several
times, potentially changing the string every time.

Perhaps the one-arg version of the unicode function should simply be
illegal (deprecated) when applied to strings. If you want to decode
according to file.encoding, you could. If you want to decode according
to sys.getdefaultencoding, you could. If you want to hard-code ASCII,
you could.

It is probably an illusion to think that people can work with Unicode
without thinking about encodings anyhow. That's another reason I don't
think that file.encoding and sys.defaultencoding are particularly
useful. Ten years ago it made sense to guess at the file encoding based
on the user's machine locale and OS. Today, I don't think it does.
When push comes to shove, the end-user must specify the encoding of the
data rather than expecting operating systems or interpreters to guess
based on the locale.

Paul Prescod
|
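Paul's two observations, that a no-op is equivalent to a Latin-1 default and that repeated decoding with any other codec keeps changing the string, can be modelled in Python 3. This is a sketch under stated assumptions: redecode is a hypothetical helper that imitates a runtime with one string type by treating each character below U+0100 as a byte, and "cp437" merely stands in for whatever file.encoding might be on a given machine. It is not how Jython is implemented.

```python
def redecode(text, encoding):
    """Model a one-type runtime: treat each character as a byte, then decode.

    Assumption: every character is below U+0100, so latin-1 maps the
    characters back to bytes one-to-one.
    """
    return text.encode("latin-1").decode(encoding)


start = "\x81"

# Decoding as latin-1 maps each byte to the same code point: a no-op.
same = redecode(start, "latin-1")

# Any other codec reinterprets the data on every call.
once = redecode(start, "cp437")
twice = redecode(once, "cp437")

if __name__ == "__main__":
    print(repr(start), repr(same), repr(once), repr(twice))
```

Naming the codec at each call site, as the message suggests, avoids both problems because nothing is decoded implicitly.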