From: dman <ds...@vm...> - 2001-11-09 22:06:45
|
This bounced last time. -D

----- Forwarded message from dman <ds...@ri...> -----

From: dman <ds...@ri...>
Date: Fri, 9 Nov 2001 14:07:51 -0500
To: jyt...@li...
User-Agent: Mutt/1.2.5i
Mail-Followup-To: jyt...@li...

On Fri, Nov 09, 2001 at 04:25:15PM +0000, Finn Bock wrote:
| [dman]
|
| >Could someone please explain to me, again, why jython munges streams
| >after the fashion of ms windows if binary mode isn't specified?
|
| First, was the problem the CR-NL munging or the non-ascii munging
| <wink>? I assume you are running unix so it must have been the
| non-ascii munging that bit you.

The IMAP RFC states that all lines end in CRLF. When I printed out
the last 2 characters of sockfile.readline() I got something (the last
piece of data on the line), then 0xa. All the data was fine, except
that the CR was missing.

| The basic issue is how to deal with characters (16-bit) vs. bytes
| (8-bit). Java has two ways: Stream and Reader, but python only has one
| open() method. I decided to override the 'b' flag for this behavior
| because many (windows) programmers would already know about the 'b' flag
| on the open() function. By re-using the 'b' flag the default text mode
| was obvious because that is what windows uses.

Was the logic of input identical to text-files on windows, or is there
more to it than that? How does java decide what the encoding of the
data is (ie Unicode 16-bit chars or ASCII 8-bit chars)? How does it
decide to remove the CR, but not harm any other data in the stream?

I don't really understand much of Java's java.io package, other than
it takes some work to figure out which class has the method that does
what you want. IMO Python's read() and readline() methods are so much
simpler and get the job done just as well.

| >I just had a real (annoying) waste of time tracking down why imaplib
| >would throw an unexpected response exception
|
| Have you forgotten how JPython-1.1 did this?

I haven't forgotten because I never knew.
I've only used Jython >= 2.0. (and CPython, but that is irrelevant here)

| Data was written through a Writer but reading data was through
| an InputStream. With no way of changing that behaviour. What we have
| now is better by far.

Ok, I agree that allowing the 'b' flag to make it work "right" is
better than not allowing it. (Personally, I think that all streams
should just be streams with no magic munging under the programmer's
feet. That is, I think that there should only be "binary mode" reading
of files and sockets.)

| >(on all correct
| >responses, except UID responses), but worked beautifully with cpython.
| >I am now submitting the patch below to cpython on sourceforge (that's
| >where the module is maintained, right? I know that the debian package
| >uses cpython's modules).
|
| It seems like this change was submitted already:
|
| >http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=469910

Yeah, Martin von Loewis responded that the bug has already been fixed
in CVS and will be included in CPython 2.2.

| I'll apply the same patch to jython's version of imaplib.py in the next
| release.

Cool. BTW, the current version of the Debian package includes the patch.

-D

----- End forwarded message -----
|
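The asymmetry dman describes is easy to reproduce today. A minimal sketch in modern Python 3 terms; the sample IMAP greeting and the temp file are illustrative, not taken from the thread:

```python
# Sketch of the behaviour discussed above: text mode silently
# translates the CRLF that IMAP requires into a bare LF, while
# binary mode ('b') hands back the bytes untouched.
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"* OK IMAP4 ready\r\n")   # IMAP lines end in CRLF per the RFC

with open(path, "rb") as f:            # binary mode: CR preserved
    raw = f.readline()
with open(path, "r") as f:             # text mode: CRLF collapsed to LF
    text = f.readline()
os.remove(path)

print(repr(raw))    # b'* OK IMAP4 ready\r\n'
print(repr(text))   # '* OK IMAP4 ready\n'
```

This is exactly why imaplib, which checks for the trailing CR, needs the socket file opened in binary mode.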
From: <bc...@wo...> - 2001-11-13 19:21:50
|
[dman]

>The IMAP RFC states that all lines end in CRLF. When I printed out
>the last 2 characters of sockfile.readline() I got something (the last
>piece of data on the line), then 0xa. All the data was fine, except
>that the CR was missing.

Ok, I hope I understand the issue this time around. I agree that it
looks strangely asymmetric for non-windows platforms.

I have opened a bugreport but I have given it the lowest priority
because I suspect that CPython will adopt a somewhat similar behaviour
in 2.3. At the moment the python-dev people are discussing it in this
patch:

> http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=476814

>| The basic issue is how to deal with characters (16-bit) vs. bytes
>| (8-bit). Java has two ways: Stream and Reader, but python only has one
>| open() method. I decided to override the 'b' flag for this behavior
>| because many (windows) programmers would already know about the 'b' flag
>| on the open() function. By re-using the 'b' flag the default text mode
>| was obvious because that is what windows uses.
>
>Was the logic of input identical to text-files on windows, or is there
>more to it than that?

Not sure about the reason for the new-line algorithm; it isn't my design
or code. I guess it is partly based on the windows way and partly on the
way java handles line separators when doing line reading. Maybe JimH has
used a time machine to implement a sane scheme for dealing with cross
platform text files several years before CPython got around to it.

>How does java decide what the encoding of the
>data is (ie Unicode 16-bit chars or ASCII 8-bit chars)?

Maybe I misunderstand the question, but java's Reader/Writer classes
convert the file bytes to unicode characters while the Stream classes
read the file bytes as bytes.

>How does it
>decide to remove the CR, but not harm any other data in the stream?

Java's readLine() method(s) will remove all line-separator chars. Other
read() calls do not remove CR.
>I don't really understand much of Java's java.io package, other than
>it takes some work to figure out which class has the method that does
>what you want. IMO Python's read() and readline() methods are so much
>simpler and get the job done just as well.

No it doesn't. At least not when you try to combine unicode strings and
file I/O. When you don't need unicode then I agree that CPython-1.5.2
was very simple to use.

But in jython it is impossible to ignore the unicode problems because
all our strings are always unicode enabled. Imagine that you have a string
with a non-latin-1 character in it, a euro-sign for example. What should
happen when we try to write that to a file?

f.write(u"\u20AC")

I can think of three answers.

1) Throw a ValueError exception.
2) Silently ignore the high-order byte and write \xAC to the file.
3) Convert the chars according to the platform codec and write the
   result.

CPython-2.0 uses #1 except it throws exceptions for all characters above
127. Jython uses #2 for binary files and #3 for textfiles.

CPython has good technical reasons for that choice, but I think the
result is very bad and makes for unnatural use of unicode strings.

>I haven't forgotten because I never knew. I've only used Jython >=
>2.0. (and CPython, but that is irrelevant here)

My bad.

regards, finn
|
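Finn's three answers can be sketched concretely. This is an illustrative rendering in modern Python 3, not the Jython implementation; the variable names are invented:

```python
# The three possible answers to f.write(u"\u20AC"), sketched:
s = "\u20ac"   # one euro-sign *character*, not bytes

# 1) Refuse -- the policy of CPython 2.0's strict default codec:
try:
    s.encode("ascii")
    answer1 = "written"
except UnicodeEncodeError:
    answer1 = "exception"

# 2) Silently keep only the low-order byte, as Jython's binary
#    files did -- 0x20AC & 0xFF == 0xAC:
answer2 = bytes([ord(s) & 0xFF])

# 3) Convert via a platform codec, as Jython's text files did;
#    cp1252 is what a windows box in euroland would default to:
answer3 = s.encode("cp1252")

print(answer1, answer2, answer3)
```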
From: dman <ds...@ri...> - 2001-11-25 01:23:26
|
On Tue, Nov 13, 2001 at 07:25:07PM +0000, Finn Bock wrote:
| [dman]
|
| >The IMAP RFC states that all lines end in CRLF. When I printed out
| >the last 2 characters of sockfile.readline() I got something (the last
| >piece of data on the line), then 0xa. All the data was fine, except
| >that the CR was missing.
|
| Ok, I hope I understand the issue this time around. I agree that it
| looks strangely asymmetric for non-windows platforms.
|
| I have opened a bugreport but I have given it the lowest priority
| because I suspect that CPython will adopt a somewhat similar behaviour
| in 2.3. At the moment the python-dev people are discussing it in this
| patch:
|
| > http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=476814

Interesting. I also ran into the problem again when I tried to test
our program on windows shortly before submission. The pickle module
needs to have the file in "text" mode. I had changed it to binary as a
result of the previous problems with using text mode. Of course, it
worked beautifully both ways on Unix.

| >| The basic issue is how to deal with characters (16-bit) vs. bytes
| >| (8-bit). Java has two ways: Stream and Reader, but python only has one
| >| open() method. I decided to override the 'b' flag for this behavior
| >| because many (windows) programmers would already know about the 'b' flag
| >| on the open() function. By re-using the 'b' flag the default text mode
| >| was obvious because that is what windows uses.
| >
| >Was the logic of input identical to text-files on windows, or is there
| >more to it than that?
|
| Not sure about the reason for the new-line algorithm; it isn't my design
| or code. I guess it is partly based on the windows way and partly on the
| way java handles line separators when doing line reading. Maybe JimH has
| used a time machine to implement a sane scheme for dealing with cross
| platform text files several years before CPython got around to it.

Hehe.
Honestly, though, I really don't understand why text files must be
treated specially. Shouldn't there be a mode for fopen() to return a
JPEG file or an MP3 file? Obviously not, so why should text be any
different?

I think that the system-level handling of files should consider a file
as nothing more than a series of bytes. Their meaning should be left
up to the application programmer, who will use routines to properly
serialize in-memory data to and from the file. If one programmer wants
the files to have CRLF periodically, then it is his job to wrap the
file with a filter. Also, I think that text files should have a
uniform format, just like JPEGs and other complex formats. It seems
that text files are really the most complex, though they should be the
simplest. <end mini rant>

| >How does java decide what the encoding of the
| >data is (ie Unicode 16-bit chars or ASCII 8-bit chars)?
|
| Maybe I misunderstand the question, but java's Reader/Writer classes
| convert the file bytes to unicode characters while the Stream classes
| read the file bytes as bytes.

Oh, that's what those are supposed to mean. Usually I waste too much
time trying to find the class that has the method I want (I hate
writing loops to read data via byte array "out" arguments), then
usually find out that the class is in the wrong hierarchy to use it
where I want. (Can you tell that I dislike I/O in Java?
System.out.println isn't too bad though.)

| >How does it
| >decide to remove the CR, but not harm any other data in the stream?
|
| Java's readLine() method(s) will remove all line-separator chars. Other
| read() calls do not remove CR.
|
| >I don't really understand much of Java's java.io package, other than
| >it takes some work to figure out which class has the method that does
| >what you want. IMO Python's read() and readline() methods are so much
| >simpler and get the job done just as well.
|
| No it doesn't. At least not when you try to combine unicode strings and
| file I/O.
| When you don't need unicode then I agree that CPython-1.5.2
| was very simple to use.
|
| But in jython it is impossible to ignore the unicode problems because
| all our strings are always unicode enabled. Imagine that you have a string
| with a non-latin-1 character in it, a euro-sign for example. What should
| happen when we try to write that to a file?
|
| f.write(u"\u20AC")

Hmm, I think I would have to learn more about unicode in order to
answer that. I would have thought that it should simply write out the
bytes as it sees them since the file object isn't supposed to
second-guess what the programmer really meant when he said to write
that data. (continuation of views expressed above)

Then it would be the programmer's job to know what the data (bytes)
mean when they are read back in, not the file's job. This is where
Java's proliferation of IO classes starts to make sense -- each class
does the proper filtering of the stream according to some built-in
rules. Then one should use the class that is appropriate for the data
being read (ie unicode vs. ascii). (I don't think the javadoc explains
that very well, though, since I never figured that out until you
explained it above)

| I can think of three answers.
|
| 1) Throw a ValueError exception.
| 2) Silently ignore the high-order byte and write \xAC to the file.
| 3) Convert the chars according to the platform codec and write the
|    result.
|
| CPython-2.0 uses #1 except it throws exceptions for all characters above
| 127. Jython uses #2 for binary files and #3 for textfiles.
|
| CPython has good technical reasons for that choice, but I think the
| result is very bad and makes for unnatural use of unicode strings.
|
| >I haven't forgotten because I never knew. I've only used Jython >=
| >2.0. (and CPython, but that is irrelevant here)
|
| My bad.

No problem, you wouldn't have known :-).

-D
|
From: <bc...@wo...> - 2001-11-26 17:56:56
|
>| Not sure about the reason for the new-line algorithm; it isn't my design
>| or code. I guess it is partly based on the windows way and partly on the
>| way java handles line separators when doing line reading. Maybe JimH has
>| used a time machine to implement a sane scheme for dealing with cross
>| platform text files several years before CPython got around to it.
>
>Hehe. Honestly, though, I really don't understand why text files must
>be treated specially. Shouldn't there be a mode for fopen() to
>return a JPEG file or an MP3 file? Obviously not, so why should text
>be any different?

If you want textfile handling to be crossplatform then some of the
platforms have to surrender to a common minimum. In that fight, unix
has won with its line-feed standard.

Either programmers have to deal with open-modes for files or they have
to deal with strange line endings.

>I think that the system-level handling of files
>should consider a file as nothing more than a series of bytes.

Sure, but it is a little too late to demand that Microsoft remove the
open mode from their file support.

>Their
>meaning should be left up to the application programmer, who will use
>routines to properly serialize in-memory data to and from the file.
>If one programmer wants the files to have CRLF periodically, then it
>is his job to wrap the file with a filter. Also, I think that text
>files should have a uniform format,

What! And give up on incompatible ways of doing straightforward things?
Where is the challenge in that <wink>? You might as well wish for a
consistent way of separating path names with forward slashes! sheesh.

>| >I don't really understand much of Java's java.io package, other than
>| >it takes some work to figure out which class has the method that does
>| >what you want. IMO Python's read() and readline() methods are so much
>| >simpler and get the job done just as well.
>|
>| No it doesn't. At least not when you try to combine unicode strings and
>| file I/O.
>| When you don't need unicode then I agree that CPython-1.5.2
>| was very simple to use.
>|
>| But in jython it is impossible to ignore the unicode problems because
>| all our strings are always unicode enabled. Imagine that you have a string
>| with a non-latin-1 character in it, a euro-sign for example. What should
>| happen when we try to write that to a file?
>|
>| f.write(u"\u20AC")
>
>Hmm, I think I would have to learn more about unicode in order to
>answer that. I would have thought that it should simply write out the
>bytes

That is probably a crucial mistake in your thinking (IMO). The string
above does not contain bytes at all. It contains just one character
element.

>as it sees them since the file object isn't supposed to
>second-guess what the programmer really meant when he said to write
>that data.

When you think of it as bytes, it is not unicode anymore but some
encoding of the unicode string. Every JVM comes with a huge list of
different codecs.

> http://java.sun.com/products/jdk/1.2/docs/guide/internat/encoding.doc.html

If jython has to pick one of these codecs it simply has to be the
default codec for the platform. (As set in the "file.encoding" system
property). IMHO it does not make any sense at all to pick some other
encoding randomly from the list. [I know several CPython developers
would disagree]

If we, like CPython, refused to pick an encoding, jython would have a
situation where it is *impossible* to write values above 127 to a
file!

regards, finn
|
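The platform-default-codec policy Finn argues for is, in modern Python 3 terms, roughly what open() ended up doing for text files. A small sketch; whether the euro sign actually encodes depends entirely on the locale, so the attempt is wrapped rather than assumed to succeed:

```python
# Sketch: modern Python picks the platform default codec for text
# files, much like the jython behaviour described above (where the
# "file.encoding" system property plays the equivalent role).
import locale

enc = locale.getpreferredencoding(False)
print("platform default codec:", enc)   # e.g. 'UTF-8' on current unixes

# Whether u"\u20AC" is writable depends entirely on that codec:
try:
    data = "\u20ac".encode(enc)
    print("euro sign encodes to", data)
except UnicodeEncodeError:
    print("this platform codec has no euro sign")
```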
From: dman <ds...@ri...> - 2001-11-26 18:36:03
|
On Mon, Nov 26, 2001 at 06:00:23PM +0000, Finn Bock wrote:
|
| >| Not sure about the reason for the new-line algorithm; it isn't my design
| >| or code. I guess it is partly based on the windows way and partly on the
| >| way java handles line separators when doing line reading. Maybe JimH has
| >| used a time machine to implement a sane scheme for dealing with cross
| >| platform text files several years before CPython got around to it.
| >
| >Hehe. Honestly, though, I really don't understand why text files must
| >be treated specially. Shouldn't there be a mode for fopen() to
| >return a JPEG file or an MP3 file? Obviously not, so why should text
| >be any different?
|
| If you want textfile handling to be crossplatform then some of the
| platforms have to surrender to a common minimum. In that fight, unix
| has won with its line-feed standard.
|
| Either programmers have to deal with open-modes for files or they have
| to deal with strange line endings.
|
| >I think that the system-level handling of files
| >should consider a file as nothing more than a series of bytes.
|
| Sure, but it is a little too late to demand that Microsoft remove the
| open mode from their file support.
|
| >Their
| >meaning should be left up to the application programmer, who will use
| >routines to properly serialize in-memory data to and from the file.
| >If one programmer wants the files to have CRLF periodically, then it
| >is his job to wrap the file with a filter. Also, I think that text
| >files should have a uniform format,
|
| What! And give up on incompatible ways of doing straightforward things?
| Where is the challenge in that <wink>? You might as well wish for a
| consistent way of separating path names with forward slashes! sheesh.

:-).

| >| >I don't really understand much of Java's java.io package, other than
| >| >it takes some work to figure out which class has the method that does
| >| >what you want.
| >| >IMO Python's read() and readline() methods are so much
| >| >simpler and get the job done just as well.
| >|
| >| No it doesn't. At least not when you try to combine unicode strings and
| >| file I/O. When you don't need unicode then I agree that CPython-1.5.2
| >| was very simple to use.
| >|
| >| But in jython it is impossible to ignore the unicode problems because
| >| all our strings are always unicode enabled. Imagine that you have a string
| >| with a non-latin-1 character in it, a euro-sign for example. What should
| >| happen when we try to write that to a file?
| >|
| >| f.write(u"\u20AC")
| >
| >Hmm, I think I would have to learn more about unicode in order to
| >answer that. I would have thought that it should simply write out the
| >bytes
|
| That is probably a crucial mistake in your thinking (IMO). The string
| above does not contain bytes at all. It contains just one character
| element.

When viewed as a "unicode string" object, it does contain just one
character. However in memory it consists of 2 bytes, right? Wouldn't

    f.write( chr( 0x20 ) )
    f.write( chr( 0xAC ) )

produce the same results, if the above write is done with a utf-8
encoding? I do have a lot to learn wrt unicode, though.

| >as it sees them since the file object isn't supposed to
| >second-guess what the programmer really meant when he said to write
| >that data.
|
| When you think of it as bytes, it is not unicode anymore but some
| encoding of the unicode string. Every JVM comes with a huge list of
| different codecs.
|
| > http://java.sun.com/products/jdk/1.2/docs/guide/internat/encoding.doc.html
|
| If jython has to pick one of these codecs it simply has to be the
| default codec for the platform. (As set in the "file.encoding" system
| property). IMHO it does not make any sense at all to pick some other
| encoding randomly from the list.
| [I know several CPython developers
| would disagree]
|
| If we, like CPython, refused to pick an encoding, jython would have a
| situation where it is *impossible* to write values above 127 to a
| file!

Yeah, that doesn't sound like a good choice. Isn't UTF-8 the new
standard, once people get around to converting everything?

-D

(the sig is randomly chosen, rather appropriate for this discussion,
don't you think?)

--
A)bort, R)etry, D)o it right this time
|
From: <bc...@wo...> - 2001-11-26 19:54:16
|
[dman]

>| f.write(u"\u20AC")
>
>When viewed as a "unicode string" object, it does contain just one
>character. However in memory it consists of 2 bytes, right?

Sure. At least for java strings. Sometimes it fills 4 bytes on some
CPython compilations.

>Wouldn't
> f.write( chr( 0x20 ) )
> f.write( chr( 0xAC ) )
>
>produce the same results, if the above write is done with a utf-8
>encoding?

No. A utf-8 encoding gives (using cpython-2.1):

>>> u"\u20ac".encode("utf-8")
'\xe2\x82\xac'

A utf-16-be encoding gives:

>>> u"\u20ac".encode("utf-16-be")
' \xac'

which is closer to what you seem to expect. Naturally both are totally
useless ways of storing a eurosign on my harddisc. So is latin-1 btw:

>>> u"\u20ac".encode("latin1")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: Latin-1 encoding error: ordinal not in range(256)

The only true and sane encoding to use for windows machines in euroland
is cp1252:

>>> u"\u20ac".encode("cp1252")
'\x80'

>| If we, like CPython, refused to pick an encoding, jython would have a
>| situation where it is *impossible* to write values above 127 to a
>| file!
>
>Yeah, that doesn't sound like a good choice. Isn't UTF-8 the new
>standard, once people get around to converting everything?

UTF-8 is a fine way of representing unicode if you only ever use ascii
characters <wink>.

regards, finn
|
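Finn's interactive session still reproduces in Python 3; only the u-prefix requirement and the exception class have changed since cpython-2.1:

```python
s = "\u20ac"                                  # the euro sign, U+20AC

assert s.encode("utf-8") == b"\xe2\x82\xac"   # three bytes
assert s.encode("utf-16-be") == b" \xac"      # 0x20 0xAC -- the 0x20 byte
                                              # prints as a literal space
assert s.encode("cp1252") == b"\x80"          # one byte on windows-1252

try:
    s.encode("latin-1")                       # latin-1 predates the euro
except UnicodeEncodeError as e:
    print("latin-1:", e.reason)
```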
From: dman <ds...@ri...> - 2001-11-26 20:08:22
|
On Mon, Nov 26, 2001 at 07:57:43PM +0000, Finn Bock wrote:
| [dman]
|
| >| f.write(u"\u20AC")
| >
| >When viewed as a "unicode string" object, it does contain just one
| >character. However in memory it consists of 2 bytes, right?
|
| Sure. At least for java strings. Sometimes it fills 4 bytes on some
| CPython compilations.
|
| >Wouldn't
| > f.write( chr( 0x20 ) )
| > f.write( chr( 0xAC ) )
| >
| >produce the same results, if the above write is done with a utf-8
| >encoding?
|
| No. A utf-8 encoding gives (using cpython-2.1):
|
| >>> u"\u20ac".encode("utf-8")
| '\xe2\x82\xac'

Interesting.

| A utf-16-be encoding gives:
|
| >>> u"\u20ac".encode("utf-16-be")
| ' \xac'
|
| which is closer to what you seem to expect.

Yeah, that's what I was thinking of. (0x20 is the space character when
printed in ASCII)

| Naturally both are totally useless ways of storing a eurosign on my
| harddisc.

Why? If all (relevant) programs read the data as UTF-8 (or UTF-16-be)
then they would all see the same character.

| So is latin-1 btw:
|
| >>> u"\u20ac".encode("latin1")
| Traceback (most recent call last):
|   File "<stdin>", line 1, in ?
| UnicodeError: Latin-1 encoding error: ordinal not in range(256)

Latin1 doesn't have a eurosign, does it?

| The only true and sane encoding to use for windows machines in euroland
| is cp1252:
|
| >>> u"\u20ac".encode("cp1252")
| '\x80'

I don't see how that is better than the above values, except that it
is likely all other windows programs use cp1252 as well, and thus they
can understand it.

| >| If we, like CPython, refused to pick an encoding, jython would have a
| >| situation where it is *impossible* to write values above 127 to a
| >| file!
| >
| >Yeah, that doesn't sound like a good choice. Isn't UTF-8 the new
| >standard, once people get around to converting everything?
|
| UTF-8 is a fine way of representing unicode if you only ever use ascii
| characters <wink>.

Hehe. I definitely need to learn more about it.
-D

--
"...In the UNIX world, people tend to interpret `non-technical user'
as meaning someone who's only ever written one device driver."
                                                --Daniel Pead
|
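The point behind Finn's wink about UTF-8 can be made concrete: it is one byte per character only for ASCII, and grows for everything else. The sample code points below are illustrative picks, not from the thread:

```python
# UTF-8 byte counts grow with the code point; ASCII alone stays 1 byte.
samples = {
    "A": 1,           # U+0041, plain ascii
    "\u00e9": 2,      # U+00E9, latin-1 range
    "\u20ac": 3,      # U+20AC, the euro sign from this thread
    "\U0001f40d": 4,  # U+1F40D, outside the basic multilingual plane
}
for ch, expected in samples.items():
    n = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
    assert n == expected
```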
From: <bc...@wo...> - 2001-11-26 20:42:15
|
[me]

>Naturally [utf-8 and utf-16-be] are totally useless ways of storing
>a eurosign on my harddisc.

[dman]

>Why? If all (relevant) programs read the data as UTF-8 (or UTF-16-be)
>then they would all see the same character.

True. If that ever happens, we can all reap the benefits.

>| So is latin-1 btw:
>|
>| >>> u"\u20ac".encode("latin1")
>| Traceback (most recent call last):
>|   File "<stdin>", line 1, in ?
>| UnicodeError: Latin-1 encoding error: ordinal not in range(256)
>
>Latin1 doesn't have a eurosign, does it?

No.

>| The only true and sane encoding to use for windows machines in euroland
>| is cp1252:
>|
>| >>> u"\u20ac".encode("cp1252")
>| '\x80'
>
>I don't see how that is better than the above values, except that it
>is likely all other windows programs use cp1252 as well, and thus they
>can understand it.

Exactly.

regards, finn
|