From: Bruno H. <br...@cl...> - 2004-03-17 12:28:39
Don Cohen proposed:
> > *print-pretty* should be nil
> > (as it is in every other lisp I have used)

You have enough customizable variables, from *print-pretty* to the
symbols in the CUSTOM package, that you can set in your .clisprc.lisp
file. The default is set so that clisp becomes most useful to newbies,
without customization. If all other Lisps look frightening to newbies,
that's not a reason why clisp should look the same.

          Bruno
From: <don...@is...> - 2004-03-18 07:06:48
> You have enough customizable variables, from *print-pretty* to the
> symbols in the CUSTOM package, that you can set in your .clisprc.lisp file.

Init files are really not adequate for things that affect programs that
are run by different people on different machines. You might imagine
that *print-pretty* should not affect programs, but I've found many
times that this is just not true. The last case (provoking a recent
message to this list) was a program that was trying to manage the space
on a screen. I wanted to print something on the top line of the screen.
I position the cursor to the top of the screen and do (princ "...") and
the text comes out on the second line! I think the only solution is
that this program should include in its code (setf *print-pretty* nil).
Considering the amount of time I've spent tracking down such problems,
it seems clear that I really ought to do that in every lisp program I
write.

The unicode stuff is similar. I've started to put in my batch files
  export LANG=en_US
but I now begin to think that this ought to be replaced by another form
in every lisp program I write. I guess this means that my web server
only runs in English. At least for clisp.

> The default is set so that clisp becomes most useful to newbies, without
> customization. If all other Lisps look frightening to newbies, that's
> not a reason why clisp should look the same.

I think I've heard this argument before. There's certainly room for
argument about what's useful or frightening to newbies. (They probably
know better what frightens them than what is useful or good for them.)
In any case, I gave up long ago on this one. I just have to keep
learning to set it. I should perhaps thank you for making me fix my
programs to not rely on this implementation-dependent value. (Is there
a list of other global variables with implementation-dependent initial
values?)

> > To me the issue is what exactly is a newline.
> > I don't think it has to be the same as #\return or #\linefeed.
> > I think that if the line terminator mode is :unix then it should
> > be ok for READ-LINE to return a string with an embedded #\return,
> > and if line terminator mode is :mac then it should be ok to return
> > a string with #\linefeed, and if :dos then it should be ok to
> > return a string with both of those (but not #\return followed
> > immediately by #\linefeed).
>
> This is the way it's done in Java. And it's shitty: although the standard
> way to designate a string with an embedded newline is "foo\n", when you
> write a string to a file stream, you find out that
>     stream.print("foo\n");
> does not do the same thing as
>     stream.println("foo");

Knowing practically nothing about java, I will not attempt to explain
or defend it. However, it seems inevitable that if you try to map
multiple file representations to the same memory representation then
the reverse mapping will (and should) be one to many. The best you can
do is provide some way to control both the input and output mappings.
That's what I want to do. If you like the current behavior then let
that be the default.

Clearly my current solution is exactly what I want. I read 8 bit bytes
and then call code-char, using an encoding that gives me 256 different
characters for the 256 bytes. It just seems silly that this should be
the best (maybe only) way to get the desired result.
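A minimal sketch of the workaround Don describes: binding *print-pretty* locally around screen output, so the pretty-printer cannot insert line breaks regardless of the global default. The function name is hypothetical; only the binding is the point.

```lisp
;; Sketch: make cursor-addressed output independent of the global
;; *print-pretty* default by binding it around the output form.
(defun print-status-line (text)
  (let ((*print-pretty* nil))   ; pretty-printer may otherwise wrap lines
    (princ text)
    (finish-output)))
```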
From: Bruno H. <br...@cl...> - 2004-03-17 20:15:00
Don Cohen wrote:
> it seems inevitable that if you try to map
> multiple file representations to the same memory representation then
> the reverse mapping will (and should) be one to many. The best you
> can do is provide some way to control both the input and output
> mappings.

Yes. And in clisp this mapping is the ENCODING with its :LINE-TERMINATOR
accessor. It allows you to accommodate different external representations
for the same in-memory representation of #\Newline = (code-char 10).

What you wanted in the last posting was *different* in-memory
representations of #\Newline. Which leads to unportable programs. This
way, you would write Lisp programs that are less portable between Unix
and Windows than the equivalent C programs!

> Clearly my current solution is exactly what I want.
> I read 8 bit bytes and then call code-char, using an encoding that
> gives me 256 different characters for the 256 bytes.

Such encodings are disappearing rapidly from the landscape. The use of
code-char and char-code makes your program unable to support even the
standard repertoire of characters required for English (because of the
curly quotes...).

> It just seems silly that this should be the best (maybe only) way
> to get the desired result.

You missed the functions CONVERT-STRING-TO-BYTES and
CONVERT-BYTES-TO-STRING.

          Bruno
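A sketch of the mechanism Bruno names, using the CLISP-specific EXT and CHARSET packages: an encoding bundles a character set with a line-terminator policy, so the same in-memory #\Newline maps to different byte sequences on output.

```lisp
;; CLISP-specific sketch: the same string, two line-terminator modes.
(let ((unix (ext:make-encoding :charset charset:iso-8859-1
                               :line-terminator :unix))
      (dos  (ext:make-encoding :charset charset:iso-8859-1
                               :line-terminator :dos)))
  (values (ext:convert-string-to-bytes (format nil "a~%b") unix)
          (ext:convert-string-to-bytes (format nil "a~%b") dos)))
;; => #(97 10 98), #(97 13 10 98)
```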
From: Sam S. <sd...@gn...> - 2004-03-17 20:30:22
> * Bruno Haible <oe...@py...t> [2004-03-17 21:09:30 +0100]:
>
> What you wanted in the last posting was *different* in-memory
> representations of #\Newline. Which leads to unportable programs.

why? because (format nil "~%") will depend on the current value of some
global encoding variable?

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Winword 6.0 UNinstall: Not enough disk space to uninstall WinWord
From: <don...@is...> - 2004-03-17 21:01:54
Bruno Haible writes:
> Don Cohen wrote:
> > it seems inevitable that if you try to map
> > multiple file representations to the same memory representation then
> > the reverse mapping will (and should) be one to many. The best you
> > can do is provide some way to control both the input and output
> > mappings.
>
> Yes. And in clisp this mapping is the ENCODING with its :LINE-TERMINATOR
> accessor. It allows you to accommodate different external representations
> for the same in-memory representation of #\Newline = (code-char 10).
>
> What you wanted in the last posting was *different* in-memory
> representations of #\Newline. Which leads to unportable programs. This
> way, you would write Lisp programs that are less portable between Unix
> and Windows than the equivalent C programs!

I don't know about the equivalent c programs. I also don't know which
programs you view as more or less portable, but my view is that you
should be able to write the program you want, and preferably in the
most straightforward way.

I suppose encodings default to different line terminator modes on
different OS's. In that case, if you use the default then you get (what
I consider to be) different behavior on different OS's, but I guess
that's what you want. If you specify the mode then you get the same
behavior on different OS's. I argue that if you want the same behavior
on different OS's then you should not use #\newline but either #\return
or #\linefeed. There are plenty of other things that differ from one
platform to another, such as *features* - the differences are viewed by
some people as making it easier to write portable programs, even though
on a small scale they lead to different behavior on different platforms.

> > Clearly my current solution is exactly what I want.
> > I read 8 bit bytes and then call code-char, using an encoding that
> > gives me 256 different characters for the 256 bytes.
>
> Such encodings are disappearing rapidly from the landscape.

Does this mean that the encoding I now use is likely to disappear?

> The use of code-char and char-code makes your program unable to
> support even the standard repertoire of characters required for
> English (because of the curly quotes...).

I don't know what you consider the standard repertoire. (What are the
curly quotes?) And why does char-code not work on them?

> > It just seems silly that this should be the best (maybe only) way
> > to get the desired result.
>
> You missed the functions CONVERT-STRING-TO-BYTES and
> CONVERT-BYTES-TO-STRING.

(You mean STRING-FROM-BYTES) Yes, I did miss those. But I don't think
they do what I need. They might if they allowed the result to be put
into an arbitrary position of an existing vector. Right now I do
  (vector-push-extend (code-char (aref *byte-io-vector* i)) ...)

So these depend on encodings but code-char and char-code do not? I'd
appreciate an explanation of that. It must be related to the curly
quote question above.
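The byte-at-a-time pattern Don describes might look like the following sketch; the buffer names are assumed from his fragment, and the output string is taken to be adjustable with a fill pointer, as VECTOR-PUSH-EXTEND requires.

```lisp
;; Sketch of Don's workaround: treat each byte as the character with
;; that code, appending into an adjustable fill-pointered string.
(defun append-bytes-as-chars (byte-vector out-string)
  (loop for b across byte-vector
        do (vector-push-extend (code-char b) out-string))
  out-string)

;; Usage:
(append-bytes-as-chars #(71 69 84)
                       (make-array 0 :element-type 'character
                                     :adjustable t :fill-pointer 0))
;; => "GET"
```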
From: Bruno H. <br...@cl...> - 2004-03-17 21:33:52
Sam Steingold wrote:
> > What you wanted in the last posting was *different* in-memory
> > representations of #\Newline. Which leads to unportable programs.
> why?
> because (format nil "~%") will depend on the current value of some
> global encoding variable?

Because

1) So many operations will give subtly different results, starting
   from (length string), (map 'list string), the hash code of a string
   and thus also the order of traversal of a hash table containing
   strings as keys, up to all kinds of string manipulation functions
   that work by scanning a string.

2) The developer tests his program in one mode and not in the other.
   That was Don Cohen's point about *print-pretty*: one more variable
   that can be switched on or off means more testing.

In short, in application programming (as opposed to low level system
programming) system dependencies are best handled at the border between
the program and the outside world, and the program logic is kept free
of system dependencies.

          Bruno
From: Sam S. <sd...@gn...> - 2004-03-17 21:51:00
> * Bruno Haible <oe...@py...t> [2004-03-17 22:28:14 +0100]:
>
> Sam Steingold wrote:
>> > What you wanted in the last posting was *different* in-memory
>> > representations of #\Newline. Which leads to unportable programs.
>> why?
>> because (format nil "~%") will depend on the current value of some
>> global encoding variable?

actually, no, the value of (format nil "~%") is just (string #\Newline),
independent of any global state.

> Because
>
> 1) So many operations will give subtly different results, starting
>    from (length string), (map 'list string), the hash code of a string
>    and thus also the order of traversal of a hash table containing
>    strings as keys, up to all kinds of string manipulation functions
>    that work by scanning a string.

why? the only change is that (char-code #\Newline) is not 10 anymore.
when you read from a file with an encoding with :LINE-TERMINATOR-STRICT-P
NIL (the current behavior), you will have only #\Newline, no #\Linefeed
or #\Return. When :LINE-TERMINATOR-STRICT-P is T, you will get #\Newline
for the appropriate line terminator and the original control character
otherwise. (length (format nil "foo~%bar")) is always 7.

> 2) The developer tests his program in one mode and not in the other.
>    That was Don Cohen's point about *print-pretty*: one more variable
>    that can be switched on or off means more testing.

again, what is that mode?

-- 
Sam Steingold
From: Bruno H. <br...@cl...> - 2004-03-17 21:48:35
Don Cohen wrote:
> I don't know what you consider the standard repertoire.
> (What are the curly quotes?)
> And why does char-code not work on them?

Oh, char-code works fine on them. It's only when you attempt to feed
this char-code into a byte stream that expects values between 0 and 255
that you will get a nice error.

> (write-string "“‘Ha ha’ said the clown” from Manfred Mann")
“‘Ha ha’ said the clown” from Manfred Mann
"“‘Ha ha’ said the clown” from Manfred Mann"
> (map 'list #'char-code *)
(8220 8216 72 97 32 104 97 8217 32 115 97 105 100 32 116 104 101 32 99
 108 111 119 110 8221 32 102 114 111 109 32 77 97 110 102 114 101 100
 32 77 97 110 110)

> They might if they allowed the result to be put into an arbitrary
> position of an existing vector.

SETF SUBSEQ will copy a vector into an existing one.

> So these depend on encodings but code-char and char-code do not?

Yes, CONVERT-STRING-TO/FROM-BYTES work with any character and any
encoding. Whereas the assumption of the 1980s, that each character is a
byte, works only for half the encodings of the world and for less than
1/10th of the characters of the world.

          Bruno
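The (SETF SUBSEQ) idiom Bruno mentions is standard Common Lisp; a small sketch of copying into a region of an existing string:

```lisp
;; Copy a string into a slice of an existing string, in place.
(let ((buffer (make-string 10 :initial-element #\.)))
  (setf (subseq buffer 3 6) "abc")
  buffer)
;; => "...abc...."
```

Note that (SETF SUBSEQ) only overwrites; it never extends the target, which is why it does not answer Don's VECTOR-PUSH-EXTEND use case by itself.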
From: Sam S. <sd...@gn...> - 2004-03-17 22:08:08
> * Bruno Haible <oe...@py...t> [2004-03-17 22:43:01 +0100]:
>
>> They might if they allowed the result to be put into an arbitrary
>> position of an existing vector.
>
> SETF SUBSEQ will copy a vector into an existing one.

I think he wants to re-use the vector (a la READ-SEQUENCE).
CONVERT-STRING-TO-BYTES will always allocate a fresh vector.
Maybe an :OUTPUT + :START-OUTPUT combination is needed:

  (EXT:CONVERT-STRING-FROM-BYTES byte-vector encoding
                                 :START 10 :END 33
                                 :OUTPUT string :START-OUTPUT 17)

will write characters into STRING starting at position 17
(and using ADJUST-ARRAY if STRING is shorter than 40).

> Whereas the assumption of the 1980s, that each character is a byte,
> works only for half the encodings of the world and for less than
> 1/10th of the characters of the world.

still, each character is an integer, right?
(even a 21 bit integer!)

-- 
Sam Steingold
From: <don...@is...> - 2004-03-17 23:23:34
Bruno Haible writes:
> 1) So many operations will give subtly different results, starting
>    from (length string), (map 'list string), the hash code of a string
>    and thus also the order of traversal of a hash table containing
>    strings as keys, up to all kinds of string manipulation functions
>    that work by scanning a string.

If I use the encoding that generates separate #\cr and #\lf and maps
one-to-one between bytes 0-255 and characters, then I get the same
result on any system. (Assuming I don't write #\newline, but stick to
#\cr and #\lf.) It is when I use the current default, which changes
from one system to another, that I get different results for different
systems. I should use #\newline and the current encoding stuff when I
want different output on different systems.

> Oh, char-code works fine on them. It's only when you attempt to feed
> this char-code into a byte stream that expects values between 0 and 255
> that you will get a nice error.
>
> > (write-string "“‘Ha ha’ said the clown” from Manfred Mann")
> “‘Ha ha’ said the clown” from Manfred Mann
> "“‘Ha ha’ said the clown” from Manfred Mann"
> > (map 'list #'char-code *)
> (8220 8216 72 97 32 104 97 8217 32 115 97 105 100 32 116 104 101 32 99
>  108 111 119 110 8221 32 102 114 111 109 32 77 97 110 102 114 101 100
>  32 77 97 110 110)

Interesting, when I copy your string I get

(map 'list 'char-code "“‘Ha ha’ said the clown” from Manfred Mann")
(226 128 156 226 128 152 72 97 32 104 97 226 128 153 32 115 97 105 100
 32 116 104 101 32 99 108 111 119 110 226 128 157 32 102 114 111 109 32
 77 97 110 102 114 101 100 32 77 97 110 110)

> SETF SUBSEQ will copy a vector into an existing one.

Can I get that to also do vector-push-extend?

> Whereas the assumption of the 1980s, that each character is a byte,
> works only for half the encodings of the world and for less than
> 1/10th of the characters of the world.

All I need is one encoding that covers bytes 0-255 and allows me to
read each byte as one character. I don't care whether the one I use
represents 100% or 1e-9% of the total.

> > SETF SUBSEQ will copy a vector into an existing one.
> I think he wants to re-use the vector (a la READ-SEQUENCE).
> CONVERT-STRING-TO-BYTES will always allocate a fresh vector.
> Maybe an :OUTPUT + :START-OUTPUT combination is needed:
>
>   (EXT:CONVERT-STRING-FROM-BYTES byte-vector encoding
>                                  :START 10 :END 33
>                                  :OUTPUT string :START-OUTPUT 17)
>
> will write characters into STRING starting at position 17
> (and using ADJUST-ARRAY if STRING is shorter than 40).

Ah, with the adjust yet!!

How code-char/char-code get away without an encoding: I gather the
point is that the encoding relates bytes on a file to characters, and
code-char/char-code relates characters to integers, which evidently do
NOT have to be the same integers that you would get if you read or
wrote those bytes to/from a file! I assume that
(= x (char-code (code-char x))) for x in the appropriate range. I've
also been assuming that the ascii characters correspond to the "right"
codes. That's probably all that matters to me so far.
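Don's first assumption does hold in any conforming Common Lisp: CODE-CHAR inverts CHAR-CODE for every code that denotes a character. A quick check of the round-trip (the ASCII placement of the codes themselves, though, is implementation-dependent, as the EBCDIC remark in the next message points out):

```lisp
;; The round-trip guarantee: char -> code -> char is the identity.
(loop for c across "GET /index.html"
      always (char= c (code-char (char-code c))))
;; => T
```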
From: Pascal J.B. <pj...@in...> - 2004-03-18 00:52:19
Don Cohen writes:
> If I use the encoding that generates separate #\cr and #\lf and maps
> one-to-one between bytes 0-255 and characters, then I get the same
> result on any system. (Assuming I don't write #\newline, but stick to
> #\cr and #\lf.) It is when I use the current default, which changes
> from one system to another, that I get different results for different
> systems.
>
> I should use #\newline and the current encoding stuff when I want
> different output on different systems.

Imagine you copy your lisp program to three different systems: a
Macintosh system, a MS-Windows system and a unix system, and that with
your same lisp program on these three systems, you try to read a TEXT
file named example.txt transferred through FTP as a text file between
the three systems.

In clear, on the Macintosh you'll have in the file
"texttext[CR]texttext[CR]", on the unix system you'll have
"texttext[LF]texttext[LF]", and on MS-Windows you'll have
"texttext[CR][LF]texttext[CR][LF]".

Then with your scheme your three identical copies of your program won't
read the same data from the same text file.

> How code-char/char-code get away without an encoding:
> I gather the point is that the encoding relates bytes on a file to
> characters, and code-char/char-code relates characters to integers,
> which evidently do NOT have to be the same integers that you would
> get if you read or wrote those bytes to/from a file!
> I assume that (= x (char-code (code-char x))) for x in the appropriate
> range. I've also been assuming that the ascii characters correspond
> to the "right" codes. That's probably all that matters to me so far.

You're assuming too much. The "ASCII" characters do not have the
"right" codes on an EBCDIC system. And EBCDIC systems are far from
dead, they even do web CGI on CICS...

-- 
__Pascal_Bourguignon__                   http://www.informatimago.com/
There is no worse tyranny than to force a man to pay for what he
doesn't want merely because you think it would be good for him.
                                                    --Robert Heinlein
http://www.theadvocates.org/
From: Sam S. <sd...@gn...> - 2004-03-18 13:43:26
> * Pascal J.Bourguignon <cwo@vasbezngvzntb.pbz> [2004-03-18 01:53:52 +0100]:
>
> Imagine you copy your lisp program to three different systems: a
> Macintosh system, a MS-Windows system and a unix system, and that with
> your same lisp program on these three systems, you try to read a TEXT
> file named example.txt transferred through FTP as a text file between
> the three systems.
>
> In clear, on the Macintosh you'll have in the file
> "texttext[CR]texttext[CR]", on the unix system you'll have
> "texttext[LF]texttext[LF]", and on MS-Windows you'll have
> "texttext[CR][LF]texttext[CR][LF]".
>
> Then with your scheme your three identical copies of your program won't
> read the same data from the same text file.

yes it will.

(with-open-file (i "foo" :direction :input)
  (list (read-line i) (read-line i)))

will return ("texttext" "texttext") on each platform, both now and with
the proposed separation of #\Newline from #\Linefeed.

even when these 3 different files are read on the same platform, this
list ("texttext" "texttext") will be read by this form:

(with-open-file (i "foo" :direction :input
                   :external-format (make-encoding :line-terminator-strict-p nil))
  (list (read-line i) (read-line i)))

or by using the right form for each file:

(with-open-file (i "foo" :direction :input :external-format :dos)
  (list (read-line i) (read-line i)))

...

-- 
Sam Steingold
From: <don...@is...> - 2004-03-18 01:35:49
Pascal J.Bourguignon writes:
> Imagine you copy your lisp program to three different systems: a
> Macintosh system, a MS-Windows system and a unix system, and that with
> your same lisp program on these three systems, you try to read a TEXT
> file named example.txt transferred through FTP as a text file between
> the three systems.

Right, because ftp happens to be changing the file contents. If, on the
other hand, I use scp, or ftp in binary mode, then I get the same
results with the program that I propose, which I consider to be system
independent, and different results with the program that I must write
in clisp now, which I regard as system dependent. I don't mind that I
*can* write a program that is system dependent. I just want to be
*able* to write one that is system independent.

> > How code-char/char-code get away without an encoding:
> > I gather the point is that the encoding relates bytes on a file to
> > characters, and code-char/char-code relates characters to integers,
> > which evidently do NOT have to be the same integers that you would
> > get if you read or wrote those bytes to/from a file!
> > I assume that (= x (char-code (code-char x))) for x in the appropriate
> > range. I've also been assuming that the ascii characters correspond
> > to the "right" codes. That's probably all that matters to me so far.
>
> You're assuming too much. The "ASCII" characters do not have the "right"
> codes on an EBCDIC system. And EBCDIC systems are far from dead, they
> even do web CGI on CICS...

Ok, so if I want to use ebcdic then I should use some other encoding.
As it turns out, the Internet protocols pretty much stick to ascii, so
that's what I mostly want to use.
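The byte-exact, system-independent reading Don is after can also be had portably in standard Common Lisp by opening the file with a binary element type, so that no line-terminator translation happens on any platform. A sketch (the filename is illustrative):

```lisp
;; Read a file as raw bytes: identical results on every platform,
;; since no character decoding or newline translation is applied.
(defun file-bytes (pathname)
  (with-open-file (s pathname :element-type '(unsigned-byte 8))
    (loop for b = (read-byte s nil nil)
          while b
          collect b)))
```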
From: Pascal J.B. <pj...@in...> - 2004-03-18 01:55:53
Don Cohen writes:
> Pascal J.Bourguignon writes:
>
> > Imagine you copy your lisp program to three different systems: a
> > Macintosh system, a MS-Windows system and a unix system, and that with
> > your same lisp program on these three systems, you try to read a TEXT
> > file named example.txt transferred through FTP as a text file between
> > the three systems.
>
> Right, because ftp happens to be changing the file contents.
> If, on the other hand, I use scp, or ftp in binary mode,

But once you copy your files in BINARY mode, they're no longer TEXT
files. They are now binary files, and you cannot read CHARACTERS from
them, only BYTES. If you want to have the exact same behavior on the
three systems BYTE-FOR-BYTE, you must read and write BYTES, not LINES
of CHARACTERS.

> then I get the same results with the program that I propose, which
> I consider to be system independent, and different results with
> the program that I must write in clisp now, which I regard as
> system dependent. I don't mind that I *can* write a program that
> is system dependent. I just want to be *able* to write one that is
> system independent.

Then don't use CHARACTERS and CHAR-CODE/CODE-CHAR, since these
functions ARE IMPLEMENTATION DEPENDENT! Which is even worse than
system dependent, since different implementations on the same system
can have different ideas of what CHAR-CODE or CODE-CHAR should return.
Even the SAME implementation can be compiled with options giving
different results, such as clisp compiled with 8-bit chars or with
unicode support!

system independent        <=> BYTE, READ-BYTE, WRITE-BYTE
implementation dependent  <=> CHAR, READ-CHAR, WRITE-CHAR

> > > How code-char/char-code get away without an encoding:
> > > I gather the point is that the encoding relates bytes on a file to
> > > characters, and code-char/char-code relates characters to integers,
> > > which evidently do NOT have to be the same integers that you would
> > > get if you read or wrote those bytes to/from a file!
> > > I assume that (= x (char-code (code-char x))) for x in the appropriate
> > > range. I've also been assuming that the ascii characters correspond
> > > to the "right" codes. That's probably all that matters to me so far.
>
> > You're assuming too much. The "ASCII" characters do not have the "right"
> > codes on an EBCDIC system. And EBCDIC systems are far from dead, they
> > even do web CGI on CICS...
>
> Ok, so if I want to use ebcdic then I should use some other encoding.
> As it turns out, the Internet protocols pretty much stick to ascii,
> so that's what I mostly want to use.

My point, and what you don't understand, is that when you are expecting
TEXT data, your program could be running on an EBCDIC system and
receive URLs and HTML textual data in EBCDIC! Of course, what runs on
the wire is always ASCII, but what code you find in core memory can be
anything the system likes to have.

Read again the COMMON-LISP standard
http://www.lispworks.com/reference/HyperSpec/Body/02_ac.htm
and you'll see that the only thing that is prescribed is a minimum set
of characters, but that the corresponding codes can be anything as long
as some constraints are respected:
http://www.lispworks.com/reference/HyperSpec/Body/13_af.htm
You should really study the whole chapter 2 and chapter 13 of CLHS.

-- 
__Pascal_Bourguignon__                   http://www.informatimago.com/
From: <don...@is...> - 2004-03-18 02:51:15
Pascal J.Bourguignon writes:
> But once you copy your files in BINARY mode, they're no longer TEXT
> files. They are now binary files, and you cannot read CHARACTERS from
> them, only BYTES. If you want to have the exact same behavior on the
> three systems BYTE-FOR-BYTE, you must read and write BYTES, not
> LINES of CHARACTERS.

What determines whether they are binary or text files in the first
place? If I create them with emacs are they text? If I copy them with
scp do they remain text? If I name them .txt does that make them text?
I have to say that I don't see any well defined boundary. I'm not so
sure I really see any boundary at all. Do you think that a file could
possibly contain part text and part binary data?

> Then don't use CHARACTERS and CHAR-CODE/CODE-CHAR, since these
> functions ARE IMPLEMENTATION DEPENDENT!

So you think that when I look at some RFC that talks about the
"US-ASCII coded character set" that I should be reading in binary. You
think that when it talks about "GET" this should be viewed as a
sequence of bytes. I should not use the character string "GET" in a
lisp program to find out what type of http request is arriving.

I think these are characters, and a programming language should be able
to deliver them to me as characters. If I use an ebcdic machine then
perhaps I should use some non-default encoding to do that. There is,
after all, an ascii character set that is supported by clisp. I don't
know what it does with bytes>127, but I suggest that it would be useful
to support a character set that agrees with ascii up to 127 and assigns
some (I don't really care which) characters to 128-255. I believe that
whatever character set I now use does satisfy these requirements. I
should then be able to use that character set to read whatever bytes
are in my file, whether or not you happen to view it as a "text" file.
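A character set with exactly the properties Don asks for already exists: ISO 8859-1 (Latin-1) agrees with ASCII up to 127 and assigns a character to every byte 128-255, one-to-one with the character codes. A CLISP-specific sketch of the round-trip (EXT and CHARSET are CLISP extensions):

```lisp
;; ISO 8859-1: every byte 0-255 decodes to exactly one character
;; and encodes back to the same byte.
(let ((bytes (coerce (loop for i below 256 collect i)
                     '(vector (unsigned-byte 8)))))
  (equalp bytes
          (ext:convert-string-to-bytes
           (ext:convert-string-from-bytes bytes charset:iso-8859-1)
           charset:iso-8859-1)))
;; => T
```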
From: Pascal J.B. <pj...@in...> - 2004-03-18 03:06:26
Don Cohen writes:
> Pascal J.Bourguignon writes:
> > But once you copy your files in BINARY mode, they're no longer TEXT
> > files. They are now binary files, and you cannot read CHARACTERS from
> > them, only BYTES. If you want to have the exact same behavior on the
> > three systems BYTE-FOR-BYTE, you must read and write BYTES, not
> > LINES of CHARACTERS.
>
> What determines whether they are binary or text files in the first
> place?

It depends on the operating system you're using. On unix, it's entirely
up to the user and the library and applications you're using. On MS-DOS
and Macintosh, it's the applications that tell the OS/libraries how to
consider the file at open time (the "b" in the mode argument of fopen).
On other operating systems the file system knows exactly what kind of
file it stores and how they are structured.

> If I create them with emacs are they text?

You should ask users of emacs on VMS. But I guess that most files
created with emacs are indeed text files.

> If I copy them with scp do they remain text?
> If I name them .txt does that make them text?

That would be irrelevant on the OSes I know.

> I have to say that I don't see any well defined boundary.

Because you seem to know only operating systems where there's no
boundary. Common-Lisp is designed to be able to run portably on
operating systems that make the distinction between text files, fixed
record binary files, variable record binary files, indexed files,
sequential access files, etc.

> I'm not so sure I really see any boundary at all.
> Do you think that a file could possibly contain
> part text and part binary data?

Not from the point of view of Common-Lisp, and the OSes that I know
that make the distinction between text and binary. But why do you think
the clisp developers took the time to implement
EXT:CONVERT-STRING-FROM-BYTES and EXT:CONVERT-STRING-TO-BYTES? They are
useful to embed text into a binary file. And note that these OSes are
not only legacy OSes. PalmOS for example has only record-structured
files, and text can be stored one line per record (no CR/LF problem
there!).

> > Then don't use CHARACTERS and CHAR-CODE/CODE-CHAR, since these
> > functions ARE IMPLEMENTATION DEPENDENT!
>
> So you think that when I look at some RFC that talks about the
> "US-ASCII coded character set" that I should be reading in binary.

Indeed, if you want to manipulate the packets or the streams defined by
the Internet RFCs, you'd better do it in binary. If you do it in text,
on unix or on macintosh you'd have problems with the CRLF line
terminations that are mandatory in Internet protocols.

> You think that when it talks about "GET" this should be viewed as
> a sequence of bytes. I should not use the character string "GET"
> in a lisp program to find out what type of http request is arriving.

What you get for "GET" depends on the encoding of the source of your
lisp program!

> I think these are characters, and a programming language should be
> able to deliver them to me as characters. If I use an ebcdic machine
> then perhaps I should use some non-default encoding to do that. There
> is, after all, an ascii character set that is supported by clisp.

By clisp but NOT by COMMON-LISP! That's where the portability enters
the scene.

> I don't know what it does with bytes>127, but I suggest that it
> would be useful to support a character set that agrees with ascii up
> to 127 and assigns some (I don't really care which) characters to
> 128-255. I believe that whatever character set I now use does
> satisfy these requirements. I should then be able to use that
> character set to read whatever bytes are in my file, whether or not
> you happen to view it as a "text" file.

-- 
__Pascal_Bourguignon__                   http://www.informatimago.com/
From: zera h. <zer...@ya...> - 2004-03-18 03:19:19
|
> What determines whether they are binary or text files in the first
> place? If I create them with emacs are they text?
> If I copy them with scp do they remain text?
> If I name them .txt does that make them text?
> I have to say that I don't see any well defined boundary.
> I'm not so sure I really see any boundary at all.
> Do you think that a file could possibly contain
> part text and part binary data?

You're right, there is no boundary: a file is just a bunch of ones and
zeros on your hard disk; a byte really is just a byte. What you do with
the data is what makes the difference, i.e. allowing hex 0x41 to
represent what humans call "A".

-zh
|
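zera's point, that 0x41 only becomes "A" through an interpretation, is visible directly at the REPL. A small sketch (the results assume an ASCII-compatible implementation such as CLISP; portable Common Lisp does not guarantee these values):

```lisp
;; A byte and a character are linked only by an encoding.  CHAR-CODE is
;; implementation-dependent, though ASCII-compatible in CLISP.
(code-char #x41)  ; #\A on ASCII-compatible implementations
(char-code #\A)   ; 65 on ASCII-compatible implementations
```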
From: Bruno H. <br...@cl...> - 2004-03-18 12:49:57
|
Sam Steingold wrote:
> I think he wants to re-use the vector (a la READ-SEQUENCE).
> CONVERT-STRING-TO-BYTES will always allocate a fresh vector.
> Maybe an :OUTPUT + :START-OUTPUT combination is needed:
>
> (EXT:CONVERT-STRING-FROM-BYTES byte-vector encoding
>                                :START 10 :END 33
>                                :OUTPUT string :START-OUTPUT 17)
>
> will write characters into STRING starting at position 17.
> (and using ADJUST-ARRAY if STRING is shorter than 40)

While it's theoretically possible to add options like this, you will
note that functions which do complex operations _and_ store the result
destructively somewhere are rare in Lisp. The only ones that come to
mind are MAP-INTO and the BIT array operations. There is no RPLACBOTH,
no NSUBSEQ, no NCONCATENATE, etc. (The reason is, of course, that
destructive operations kill the functional programming style.)

Therefore I would add such options only after an investigation showed
that the garbage collection overhead due to the strings created by
EXT:CONVERT-STRING-FROM-BYTES is not bearable.

> still, each character is an integer, right?
> (even a 21 bit integer!)

Still, for creating portable programs, you best view a character just
as the contents of a string of length 1, and forget about the mapping
to integers.

Bruno
|
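MAP-INTO, which Bruno cites as one of the rare standard functions that stores a computed result destructively, works like this (a small sketch):

```lisp
;; MAP-INTO writes its result into an existing sequence instead of
;; allocating a fresh one:
(let ((out (make-string 5)))
  (map-into out #'char-upcase "hello"))  ; => "HELLO"
```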
From: Sam S. <sd...@gn...> - 2004-03-18 13:48:59
|
> * Bruno Haible <oe...@py...t> [2004-03-18 13:44:10 +0100]:
>
> Sam Steingold wrote:
> > I think he wants to re-use the vector (a la READ-SEQUENCE).
> > CONVERT-STRING-TO-BYTES will always allocate a fresh vector.
> > Maybe an :OUTPUT + :START-OUTPUT combination is needed:
> >
> > (EXT:CONVERT-STRING-FROM-BYTES byte-vector encoding
> >                                :START 10 :END 33
> >                                :OUTPUT string :START-OUTPUT 17)
> >
> > will write characters into STRING starting at position 17.
> > (and using ADJUST-ARRAY if STRING is shorter than 40)
>
> While it's theoretically possible to add options like this, you will
> note that functions which do complex operations _and_ store the result
> destructively somewhere are rare in Lisp. The only ones that come to
> mind are MAP-INTO and the BIT array operations. There is no RPLACBOTH,
> no NSUBSEQ, no NCONCATENATE, etc. (The reason is, of course, that
> destructive operations kill the functional programming style.)

There are also READ-SEQUENCE and REPLACE. I think that
CONVERT-STRING-FROM-BYTES _is_ a kind of READ-SEQUENCE, mentally
implemented as

  (READ-SEQUENCE (make-array) (bytes->stream))

> Therefore I would add such options only after an investigation would
> show that the garbage collection overhead due to the strings created
> by EXT:CONVERT-STRING-FROM-BYTES is not bearable.

It depends on the pattern of usage. If Don ends up using
CONVERT-STRING-FROM-BYTES instead of READ-SEQUENCE exclusively, then
the semantics should be similar.

--
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Bus error -- driver executed.
|
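REPLACE, which Sam mentions alongside READ-SEQUENCE, is indeed another standard function that fills a caller-supplied sequence. A small sketch:

```lisp
;; REPLACE copies into an existing sequence, much as READ-SEQUENCE
;; fills a caller-supplied buffer:
(let ((buf (make-string 10 :initial-element #\.)))
  (replace buf "bytes" :start1 2))  ; => "..bytes..."
```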
From: Bruno H. <br...@cl...> - 2004-03-18 13:58:29
|
Sam Steingold wrote:
> > Therefore I would add such options only after an investigation would
> > show that the garbage collection overhead due to the strings created
> > by EXT:CONVERT-STRING-FROM-BYTES is not bearable.
>
> it depends on the pattern of usage.

OK, then show me one usage, together with benchmark results, where the
GC does not do its job sufficiently well.

Bruno
|
From: Sam S. <sd...@gn...> - 2004-03-18 18:52:31
|
> * Bruno Haible <oe...@py...t> [2004-03-18 14:52:11 +0100]:
>
> Sam Steingold wrote:
> > > Therefore I would add such options only after an investigation would
> > > show that the garbage collection overhead due to the strings created
> > > by EXT:CONVERT-STRING-FROM-BYTES is not bearable.
> >
> > it depends on the pattern of usage.
>
> OK, then show me one usage, together with benchmark results, where the GC
> does not do its job sufficiently well.

(ext:times
 (with-open-file (in "foo" :element-type 'unsigned-byte)
   (let ((bytes 0) (chars 0)
         (buf (make-array 1024 :element-type 'unsigned-byte)))
     (loop (let ((got (read-sequence buf in)))
             (incf bytes got)
             (incf chars (length (ext:convert-string-from-bytes
                                  buf charset:utf-8)))
             (unless (= got (length buf))
               (return (values bytes chars))))))))

                               Permanent             Temporary
Class                      instances     bytes   instances      bytes
-----                      ---------  --------   ---------  ---------
SIMPLE-STRING                      0         0      435575  265098784
EXT:SIMPLE-8BIT-VECTOR             1        48       62218   64210044
CONS                               0         0     2862903   22903224
SYMBOL                             0         0      124446    3484488
SIMPLE-VECTOR                      6       272      186748    3241416
BIGNUM                             2        24       91662    1099944
FUNCTION                           0         0          42       1668
SYSTEM::ANODE                      0         0          24        576
SYSTEM::FNODE                      0         0           3        336
SIMPLE-BIT-VECTOR                  0         0          29        276
SYSTEM::VAR                        0         0           4        272
HASH-TABLE                         1        56           3        168
STREAM                             0         0           3        192
STRING-STREAM                      0         0           3        180
VECTOR                             0         0           6        168
FILE-STREAM                        0         0           1        144
STRING                             0         0           6        144
SYSTEM::CONST                      0         0           5        120
PATHNAME                           0         0           5        100
STANDARD-GENERIC-FUNCTION          0         0           2         40
BLOCK                              0         0           1         36
-----                      ---------  --------   ---------  ---------
Total                             10       400     3763689  360042320

Real time: 51.568 sec.
Run time: 41.54 sec.
Space: 360900936 Bytes
GC: 606, GC time: 15.705 sec.
63707377 ;
63708160

Note that over 35% of the time is spent on GC!

BTW, why do I get more chars than bytes as the return value?
This is a pure ASCII file!

--
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Incorrect time synchronization.
|
From: Pascal J.B. <pj...@in...> - 2004-03-18 20:51:47
|
Sam Steingold writes:
> (ext:times
>  (with-open-file (in "foo" :element-type 'unsigned-byte)
>    (let ((bytes 0) (chars 0)
>          (buf (make-array 1024 :element-type 'unsigned-byte)))
>      (loop (let ((got (read-sequence buf in)))
>              (incf bytes got)
>              (incf chars (length (ext:convert-string-from-bytes
>                                   buf charset:utf-8)))
>              (unless (= got (length buf))
>                (return (values bytes chars))))))))
>
> BTW, why do I get more chars than bytes as the return value?
> this is a pure ASCII file!

READ-SEQUENCE does not set the fill pointer of the array, all the more
so when the array has no fill pointer!

[71]> (let ((a (make-array '(64) :element-type 'unsigned-byte
                           :initial-element 0)))
        (with-open-stream (in (make-string-input-stream "BONJOUR"))
          (values a (read-sequence a in))))
#(#\B #\O #\N #\J #\O #\U #\R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0) ;

--
__Pascal_Bourguignon__  http://www.informatimago.com/
There is no worse tyranny than to force a man to pay for what he doesn't
want merely because you think it would be good for him. --Robert Heinlein
http://www.theadvocates.org/
|
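Pascal's diagnosis suggests the fix for Sam's benchmark: bound the conversion by the count READ-SEQUENCE returns, so stale bytes past the end of the final (short) read are not decoded a second time. A sketch of the corrected loop ("foo" is from Sam's example; :END on EXT:CONVERT-STRING-FROM-BYTES is the keyword shown earlier in this thread, and EXT:/CHARSET: symbols are CLISP extensions):

```lisp
;; Corrected version of Sam's loop: pass :END GOT so that only the
;; bytes actually read in this iteration are converted.
(with-open-file (in "foo" :element-type '(unsigned-byte 8))
  (let ((bytes 0) (chars 0)
        (buf (make-array 1024 :element-type '(unsigned-byte 8))))
    (loop (let ((got (read-sequence buf in)))
            (incf bytes got)
            (incf chars (length (ext:convert-string-from-bytes
                                 buf charset:utf-8 :end got)))
            (unless (= got (length buf))
              (return (values bytes chars)))))))
```

With this change, a pure ASCII file should report equal byte and character counts.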
From: Bruno H. <br...@cl...> - 2004-03-18 14:18:23
|
Sam wrote:
> > 1) So many operations will give subtly different results, starting
> > from (length string), (map 'list string), the hash code of a string
> > and thus also the order of traversal of a hash table containing
> > strings as keys, up to all kinds of string manipulation functions
> > that work by scanning a string.
>
> why?
> the only change is that (char-code #\Newline) is not 10 anymore.

This alone creates portability problems. Around 16 years ago, the
conventions on the Mac were just the converse of those on Unix: \n was
0x0D and \r was 0x0A. And _of course_ when you ported programs from a
Mac to Unix you had to change \r into \n in a lot of places.

> When :LINE-TERMINATOR-STRICT-P is T, you will get #\Newline for the
> appropriate line terminator and the original control character
> otherwise.

What Don proposed was that on Windows, a line ends with two whitespace
characters, not just one #\Newline. Here is a good example of some
innocent-looking code that is broken by such a change:

(let ((s (make-string-input-stream "xy
z")))
  (progn (read s) (read-char s)))

would return #\z on Unix and #\LF in Don Cohen's "Windows faithful"
mode. There are lots of examples like this.

Bruno
|
From: <don...@is...> - 2004-03-18 17:19:42
|
Bruno Haible writes:
> > the only change is that (char-code #\Newline) is not 10 anymore.
>
> This alone creates portability problems. Around 16 years ago, the
> conventions on the Mac were just the converse of those on Unix:
> \n was 0x0D and \r was 0x0A. And _of course_ when you ported programs
> from a Mac to Unix you had to change \r into \n in a lot of places.

Clearly the current implementation choice is useful when you want to
write a text file that will be compatible with the other tools on
whatever system you're using, and also when you want to read it back in
on any other system. I don't propose that you get rid of it.

> > When :LINE-TERMINATOR-STRICT-P is T, you will get #\Newline for the
> > appropriate line terminator and the original control character
> > otherwise.
>
> What Don proposed was that on Windows, a line ends with two whitespace
> characters, not just one #\Newline. Here is a good example of some
> innocent-looking code that is broken by such a change:
>
> (let ((s (make-string-input-stream "xy
> z")))
>   (progn (read s) (read-char s)))
>
> would return #\z on Unix and #\LF in Don Cohen's "Windows faithful"
> mode. There are lots of examples like this.

All of those examples also apply to binary IO, which is what I'm forced
to do now. I'm not expecting the new faithful mode to hide/solve these
differences. I just want to deal with characters instead of bytes. In
fact, I currently do deal with characters: I read bytes and convert
them to characters.

Let's just consider the question of what's the "right" way to deal with
input when you write a web server in Clisp. I see three possible
positions so far:
- use only bytes (I think this is Pascal's position)
- read bytes and convert them to characters, then use characters
  internally - this is what I do now
- read characters in "faithful" mode - this is what I think is best

As a separate matter, we might consider what the "right" way would be
to do such IO in portable common lisp. I think Pascal's argument is
that the spec doesn't even promise that there will be 256 characters,
so it's pretty much hopeless to do anything other than stick to bytes.
This issue does not really apply to my server. It became clear long ago
that this particular program was not going to be possible to write in
portable common lisp, which does not even support network sockets, let
alone determining how much output can be written without blocking
(which would be replaced by multiple threads in an implementation
supporting that).
|
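Don's second option, reading bytes and converting at explicit boundaries, might look like this for an HTTP request line. This is a hedged sketch: READ-REQUEST-LINE is a made-up helper, the stream is assumed already open with :ELEMENT-TYPE '(UNSIGNED-BYTE 8), and the EXT:/CHARSET: symbols are CLISP extensions:

```lisp
;; Sketch of option 2: collect raw bytes up to the CRLF that Internet
;; protocols mandate, then decode them with an explicit conversion.
;; READ-REQUEST-LINE is a hypothetical helper, not part of any library.
(defun read-request-line (byte-stream)
  (let ((bytes (loop for b = (read-byte byte-stream nil nil)
                     while (and b (/= b 13))  ; stop at CR (0x0D)
                     collect b)))
    (read-byte byte-stream nil nil)           ; consume the LF (0x0A)
    (ext:convert-string-from-bytes
     (coerce bytes '(vector (unsigned-byte 8)))
     charset:ascii)))
```

The line terminator is handled by the program itself, so the server's behavior does not depend on the host platform's newline convention.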
From: Bruno H. <br...@cl...> - 2004-03-18 18:15:54
|
Don Cohen wrote:
> Let's just consider the question of what's the "right" way to
> deal with input when you write a web server in Clisp.
> I see three possible positions so far.
> - use only bytes (I think this is Pascal's position)
> - read bytes and convert them to characters, then use characters
>   internally - this is what I do now
> - read characters in "faithful" mode
>   This is what I think is best.

I think for a web server the second option is best, for two reasons:

- The HTTP protocol is defined in terms of bytes. The byte stream of an
  HTTP connection can contain binary data as well; for example, after a
  POST request some binary data can be sent, IIRC.

- For security reasons, you may want to control explicitly all I/O
  conversions. Which is hard if you use the Lisp implementation's
  READ-CHAR / READ-LINE black box.

Of course you can use READ-CHAR and READ-LINE on portions of the
stream, if you use a Gray stream for converting READ-CHAR calls into
READ-BYTE calls.

> As a separate matter, we might consider what the "right" way would
> be to do such IO in portable common lisp. I think Pascal's argument
> is that the spec doesn't even promise that there will be 256
> characters so it's pretty much hopeless to do anything other than
> stick to bytes.

Yes, ANSI CL promises 96 characters, not more. And I don't remember why
CLISP's interpretation of ISO-8859-1 contains the control characters
0x80..0x9F. The standards are not clear on this issue. ISO-8859-1
doesn't contain them, but ISO-8859-1 is usually viewed as the 8-bit
portion of Unicode, and Unicode has control characters at 0x80..0x9F.
Probably I included this range only so that Jörg could use the
character #\U009B on Amiga... So: don't count on it. The details of I/O
conversion in READ-CHAR are up to the implementation.

Bruno
|
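Bruno's Gray-stream suggestion, a character stream layered over READ-BYTE, could be sketched as follows. The class and generic-function names assume CLISP's GRAY package; the Gray streams protocol is not part of ANSI CL and details vary between implementations:

```lisp
;; Hedged sketch: a character input stream whose READ-CHAR is built on
;; READ-BYTE, so the byte->character conversion stays under the
;; program's control.  BYTE-BACKED-CHAR-STREAM is a made-up class name.
(defclass byte-backed-char-stream
    (gray:fundamental-character-input-stream)
  ((bytes :initarg :byte-stream :reader byte-stream-of)))

(defmethod gray:stream-read-char ((s byte-backed-char-stream))
  (let ((b (read-byte (byte-stream-of s) nil nil)))
    (if b (code-char b) :eof)))  ; :eof per the Gray streams protocol
```

Code that calls READ-CHAR or READ-LINE on such a stream works unchanged, while every conversion decision remains explicit in the method above.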