From: dman <ds...@ri...> - 2001-11-25 01:23:26
On Tue, Nov 13, 2001 at 07:25:07PM +0000, Finn Bock wrote:
| [dman]
|
| >The IMAP RFC states that all lines end in CRLF.  When I printed out
| >the last 2 characters of sockfile.readline() I got something (the last
| >piece of data on the line), then 0xa.  All the data was fine, except
| >that the CR was missing.
|
| Ok, I hope I understand the issue this time around. I agree that it
| looks strangely asymmetric for non-windows platforms.
|
| I have opened a bugreport but I have given it the lowest priority
| because I suspect that CPython will adopt a somewhat similar behaviour
| in 2.3. At the moment the python-dev people are discussing it in this
| patch.
|
| > http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=476814

Interesting.  I also ran into the problem again when I tried to test
our program on windows shortly before submission.  The pickle module
needs to have the file in "text" mode.  I had changed it to binary as
a result of the previous problems with using text mode.  Of course, it
worked beautifully both ways on Unix.

| >| The basic issue is how to deal with characters (16-bits) vs. bytes
| >| (8-bit). Java have two ways: Stream and Reader, but python only have
| >| one open() method. I decided to override the 'b' flag for this
| >| behavior because many (windows) programmers would already know about
| >| the 'b' flag on the open() function. By re-using the 'b' flag the
| >| default text mode was obvious because that is what windows uses.
| >
| >Was the logic of input identical to text-files on windows, or is there
| >more to it than that?
|
| Not sure about the reason for the new-line algorithm; it isn't my design
| or code. I guess it is partly based on the windows way and partly on the
| way java handles line separators when doing line reading. Maybe JimH has
| used a time machine to implement a sane scheme for dealing with cross
| platform text files several years before CPython got around to it.

Hehe.
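(For anyone reading along: the asymmetry can be sketched in a few lines
of modern Python -- the io module here post-dates this thread and is
only an illustration of the behaviour being described, not of what
Jython did at the time.)

```python
import io

# A fake IMAP-style response; the RFC says such lines end in CRLF.
raw = b"* OK server ready\r\n"

# Binary mode treats the file as plain bytes: the CRLF survives intact.
binary_line = io.BytesIO(raw).readline()
assert binary_line.endswith(b"\r\n")

# A text wrapper with universal newlines translates CRLF to a bare LF --
# the "missing CR" effect described above.
text_line = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii").readline()
assert text_line.endswith("\n") and "\r" not in text_line
```
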
Honestly, though, I really don't understand why text files must be
treated specially.  Should there be a mode for fopen() to return a
JPEG file or an MP3 file?  Obviously not, so why should text be any
different?  I think that the system-level handling of files should
consider a file as nothing more than a series of bytes.  Their meaning
should be left up to the application programmer, who will use routines
to properly serialize in-memory data to and from the file.  If one
programmer wants the files to have CRLF periodically, then it is his
job to wrap the file with a filter.  Also, I think that text files
should have a uniform format, just like JPEGs and other complex
formats.  It seems that text files are really the most complex, though
they should be the simplest.  <end mini rant>

| >How does java decide what the encoding of the
| >data is (ie Unicode 16-bit chars or ASCII 8-bit chars)?
|
| Maybe I misunderstand the question, but java's Reader/Writer classes
| convert the file bytes to unicode characters while the Stream classes
| read the file bytes as bytes.

Oh, that's what those are supposed to mean.  Usually I waste too much
time trying to find the class that has the method I want (I hate
writing loops to read data via byte array "out" arguments), then
usually find out that the class is in the wrong hierarchy to use it
where I want.  (Can you tell that I dislike I/O in Java?
System.out.println isn't too bad though.)

| >How does it
| >decide to remove the CR, but not harm any other data in the stream?
|
| Java's readline() method(s) will remove all line-separator chars. Other
| read() calls do not remove CR.
|
| >I don't really understand much of Java's java.io package, other than
| >it takes some work to figure out which class has the method that does
| >what you want. IMO Python's read() and readline() methods are so much
| >simpler and get the job done just as well.
|
| No it doesn't. At least not when you try to combine unicode strings and
| file I/O.
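(An aside: the "wrap the file with a filter" idea in the rant above is
roughly the Stream-vs-Reader split -- a byte layer with a character
filter stacked on top.  A sketch in modern Python's io module, which is
just an illustration of the layering, not Jython's implementation:)

```python
import io

# Byte layer -- the Java "Stream" view: a file is only a series of bytes.
buf = io.BytesIO()

# Character layer -- the Java "Reader"/"Writer" view: a filter the
# programmer stacks on top, choosing encoding and newline policy himself.
writer = io.TextIOWrapper(buf, encoding="utf-8", newline="\r\n")
writer.write("hello\n")
writer.flush()

# The byte layer now holds exactly the CRLF the filter was told to make.
assert buf.getvalue() == b"hello\r\n"
```
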
| When you don't need unicode then I agree that CPython-1.5.2
| was very simple to use.
|
| But in jython it is impossible to ignore the unicode problems because
| all our strings are always unicode enabled. Imagine that you have a
| string with a non-latin-1 character in it, a euro-sign for example.
| What should happen when we try to write that to a file?
|
|     f.write(u"\u20AC")

Hmm, I think I would have to learn more about unicode in order to
answer that.  I would have thought that it should simply write out the
bytes as it sees them, since the file object isn't supposed to
second-guess what the programmer really meant when he said to write
that data.  (continuation of views expressed above)  Then it would be
the programmer's job to know what the data (bytes) mean when they are
read back in, not the file's job.

This is where Java's proliferation of IO classes starts to make sense
-- each class does the proper filtering of the stream according to
some built-in rules.  Then one should use the class that is
appropriate for the data being read (ie unicode vs. ascii).  (I don't
think the javadoc explains that very well, though, since I never
figured that out until you explained it above.)

| I can think of three answers.
|
| 1) Throw a ValueError exception.
| 2) Silently ignore the high-order byte and write \xAC to the file.
| 3) Convert the chars according to the platform codec and write the
|    result.
|
| CPython-2.0 uses #1, except it throws exceptions for all characters
| above 127. Jython uses #2 for binary files and #3 for textfiles.
|
| CPython has good technical reasons for that choice, but I think the
| result is very bad and makes for unnatural use of unicode strings.
|
| >I haven't forgotten because I never knew.  I've only used Jython >=
| >2.0.  (and CPython, but that is irrelevant here)
|
| My bad.

No problem, you wouldn't have known :-).

-D
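(For the archives: the three answers above can each be demonstrated in
a few lines of modern Python.  The codec names are stand-ins -- "ascii"
for a codec that cannot represent the euro-sign, "utf-8" for "the
platform codec" -- so this sketches the choices, not Jython's code.)

```python
# Answer 1: throw an exception when the codec cannot represent the char.
try:
    "\u20ac".encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# Answer 2: silently drop the high-order byte: 0x20AC & 0xFF == 0xAC.
assert bytes([ord("\u20ac") & 0xFF]) == b"\xac"

# Answer 3: convert through a real codec (utf-8 standing in for the
# platform codec) and write the resulting bytes.
assert "\u20ac".encode("utf-8") == b"\xe2\x82\xac"
```
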