From: dman <ds...@ri...> - 2001-11-25 01:23:26
On Tue, Nov 13, 2001 at 07:25:07PM +0000, Finn Bock wrote:
| [dman]
|
| >The IMAP RFC states that all lines end in CRLF.  When I printed out
| >the last 2 characters of sockfile.readline() I got something (the last
| >piece of data on the line), then 0xa.  All the data was fine, except
| >that the CR was missing.
|
| Ok, I hope I understand the issue this time around. I agree that it
| looks strangely asymmetric for non-windows platforms.
|
| I have opened a bugreport but I have given it the lowest priority
| because I suspect that CPython will adopt a somewhat similar behaviour
| in 2.3. At the moment the python-dev people are discussing it in this
| patch.
|
| > http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=476814

Interesting.  I also ran into the problem again when I tried to test
our program on windows shortly before submission.  The pickle module
needs to have the file in "text" mode.  I had changed it to binary as
a result of the previous problems with using text mode.  Of course, it
worked beautifully both ways on Unix.

| >| The basic issue is how to deal with characters (16-bits) vs. bytes
| >| (8-bit). Java have two ways: Stream and Reader, but python only have
| >| one open() method. I decided to override the 'b' flag for this
| >| behavior because many (windows) programmers would already know about
| >| the 'b' flag on the open() function. By re-using the 'b' flag the
| >| default text mode was obvious because that is what windows uses.
| >
| >Was the logic of input identical to text-files on windows, or is there
| >more to it than that?
|
| Not sure about the reason for the new-line algorithm; it isn't my design
| or code. I guess it is partly based on the windows way and partly on the
| way java handles line separators when doing line reading. Maybe JimH has
| used a time machine to implement a sane scheme for dealing with cross
| platform text files several years before CPython got around to it.

Hehe.
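(For anyone reading along: the asymmetry can be sketched in a few lines
of modern Python -- the io module here post-dates this thread and is
only an illustration of the behaviour being described, not of what
Jython did at the time.)

```python
import io

# A fake IMAP-style response; the RFC says such lines end in CRLF.
raw = b"* OK server ready\r\n"

# Binary mode treats the file as plain bytes: the CRLF survives intact.
binary_line = io.BytesIO(raw).readline()
assert binary_line.endswith(b"\r\n")

# A text wrapper with universal newlines translates CRLF to a bare LF --
# the "missing CR" effect described above.
text_line = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii").readline()
assert text_line.endswith("\n") and "\r" not in text_line
```
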
Honestly, though, I really don't understand why text files must be
treated specially.  Should there be a mode for fopen() to return a
JPEG file or an MP3 file?  Obviously not, so why should text be any
different?  I think that the system-level handling of files should
consider a file as nothing more than a series of bytes.  Their meaning
should be left up to the application programmer, who will use routines
to properly serialize in-memory data to and from the file.  If one
programmer wants the files to have CRLF periodically, then it is his
job to wrap the file with a filter.  Also, I think that text files
should have a uniform format, just like JPEGs and other complex
formats.  It seems that text files are really the most complex, though
they should be the simplest.  <end mini rant>

| >How does java decide what the encoding of the
| >data is (ie Unicode 16-bit chars or ASCII 8-bit chars)?
|
| Maybe I misunderstand the question, but java's Reader/Writer classes
| convert the file bytes to unicode characters while the Stream classes
| read the file bytes as bytes.

Oh, that's what those are supposed to mean.  Usually I waste too much
time trying to find the class that has the method I want (I hate
writing loops to read data via byte array "out" arguments), then
usually find out that the class is in the wrong hierarchy to use it
where I want.  (Can you tell that I dislike I/O in Java?
System.out.println isn't too bad though.)

| >How does it
| >decide to remove the CR, but not harm any other data in the stream?
|
| Java's readline() method(s) will remove all line-separator chars. Other
| read() calls do not remove CR.
|
| >I don't really understand much of Java's java.io package, other than
| >it takes some work to figure out which class has the method that does
| >what you want. IMO Python's read() and readline() methods are so much
| >simpler and get the job done just as well.
|
| No it doesn't. At least not when you try to combine unicode strings and
| file I/O.
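(An aside: the "wrap the file with a filter" idea in the rant above is
roughly the Stream-vs-Reader split -- a byte layer with a character
filter stacked on top.  A sketch in modern Python's io module, which is
just an illustration of the layering, not Jython's implementation:)

```python
import io

# Byte layer -- the Java "Stream" view: a file is only a series of bytes.
buf = io.BytesIO()

# Character layer -- the Java "Reader"/"Writer" view: a filter the
# programmer stacks on top, choosing encoding and newline policy himself.
writer = io.TextIOWrapper(buf, encoding="utf-8", newline="\r\n")
writer.write("hello\n")
writer.flush()

# The byte layer now holds exactly the CRLF the filter was told to make.
assert buf.getvalue() == b"hello\r\n"
```
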
| When you don't need unicode then I agree that CPython-1.5.2
| was very simple to use.
|
| But in jython it is impossible to ignore the unicode problems because
| all our strings are always unicode enabled. Imagine that you have a
| string with a non-latin-1 character in it, a euro-sign for example.
| What should happen when we try to write that to a file?
|
|     f.write(u"\u20AC")

Hmm, I think I would have to learn more about unicode in order to
answer that.  I would have thought that it should simply write out the
bytes as it sees them, since the file object isn't supposed to
second-guess what the programmer really meant when he said to write
that data.  (continuation of views expressed above)  Then it would be
the programmer's job to know what the data (bytes) mean when they are
read back in, not the file's job.

This is where Java's proliferation of IO classes starts to make sense
-- each class does the proper filtering of the stream according to
some built-in rules.  Then one should use the class that is
appropriate for the data being read (ie unicode vs. ascii).  (I don't
think the javadoc explains that very well, though, since I never
figured that out until you explained it above.)

| I can think of three answers.
|
| 1) Throw a ValueError exception.
| 2) Silently ignore the high-order byte and write \xAC to the file.
| 3) Convert the chars according to the platform codec and write the
|    result.
|
| CPython-2.0 uses #1, except it throws exceptions for all characters
| above 127. Jython uses #2 for binary files and #3 for textfiles.
|
| CPython has good technical reasons for that choice, but I think the
| result is very bad and makes for unnatural use of unicode strings.
|
| >I haven't forgotten because I never knew.  I've only used Jython >=
| >2.0.  (and CPython, but that is irrelevant here)
|
| My bad.

No problem, you wouldn't have known :-).

-D
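(For the archives: the three answers above can each be demonstrated in
a few lines of modern Python.  The codec names are stand-ins -- "ascii"
for a codec that cannot represent the euro-sign, "utf-8" for "the
platform codec" -- so this sketches the choices, not Jython's code.)

```python
# Answer 1: throw an exception when the codec cannot represent the char.
try:
    "\u20ac".encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# Answer 2: silently drop the high-order byte: 0x20AC & 0xFF == 0xAC.
assert bytes([ord("\u20ac") & 0xFF]) == b"\xac"

# Answer 3: convert through a real codec (utf-8 standing in for the
# platform codec) and write the resulting bytes.
assert "\u20ac".encode("utf-8") == b"\xe2\x82\xac"
```
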