From: Matt K. <kau...@cs...> - 2013-05-23 00:29:02
Hi --

Certainly I'd never knowingly write a #\Return, i.e. (code-char 13), into a
text file.  This problem showed up when a user of my application created a
file with #\Return characters, probably on a Windows system, that I read
back in on a Linux system.  Even then, it's not exactly a huge problem,
since the #\Return in front of each #\Newline was dropped by READ
("newlines everywhere", as you say).

But we support 7 host Lisps, and our application, ACL2, was complaining
because a checksum computed was different for CLISP than for the other six
Lisps.  (I don't want to go into the whole story about how ACL2 "certifies
books", computes checksums, etc....)

Anyhow, thanks for your time.  I think I see your point and I may simply
not worry about the dropped #\Return characters.  Or perhaps we should
indeed disallow non-text characters such as #\Return in text files; I may
consider that.

As I've stated a couple of times, we would like to read back in what was
written; but I don't want to defend that.  I can live with CLISP not
behaving like those other six Lisps, and I think you've answered my
original question:

  Again, what is in the string is a newline.  What clisp will read is a
  newline, and what clisp will print is a newline.  Newlines
  everywhere. :-)

-- Matt

From: "Pascal J. Bourguignon" <pj...@in...>
Cc: cli...@li...
Date: Wed, 22 May 2013 23:46:30 +0200

Matt Kaufmann <kau...@cs...> writes:

> Hi --
>
> The problem probably only arises when lines break in the middle of
> strings.  To see what I mean, replace '(defun fact (x) ...) in your
> definition of demo with the following.
>
>   (concatenate 'string
>                "a" (string (code-char 13)) (string #\Newline) "b")
>
> Here are the results.  Notice that this time, the results are
> different (but that probably won't surprise you).
>

Ok, let's consider a string like:

    (defparameter *str* "Hello
    World")

Obviously, this string contains a newline.

Again, why do you care whether there's a CRLF code sequence or just a LF
code in the file?

    CL-USER> (with-open-file (src "/tmp/a.lisp"
                                  :external-format (ext:make-encoding
                                                    :charset charset:iso-8859-1
                                                    :line-terminator :dos))
               (read src))
    (DEFPARAMETER *STR* "Hello
    World")
    CL-USER> (load "/tmp/a.lisp"
                   :external-format (ext:make-encoding
                                     :charset charset:iso-8859-1
                                     :line-terminator :dos))
    ;; Loading file /tmp/a.lisp ...
    ;; Loaded file /tmp/a.lisp
    #P"/tmp/a.lisp"
    CL-USER> (length *str*)
    11
    CL-USER>

On the other hand, if you care whether your sequence contains codes 13 10
or just 10, why do you use strings?

    (concatenate 'vector #(93) #(13) #(10) #(94))
    --> #(93 13 10 94)

or just:

    (vector 93 13 10 94)
    --> #(93 13 10 94)

or just:

    #(93 13 10 94)

Now if you want to insert a lot of ASCII-encoded bytes, you can always
write a reader macro:

    (defun c-escaped-character-map (escaped-character)
      (case escaped-character
        ((#\newline) -1)
        ((#\a)  7)
        ((#\b)  8)
        ((#\t)  9)
        ((#\n) 10)
        ((#\v) 11)
        ((#\f) 12)
        ((#\r) 13)
        ((#\x) :hexa)
        ((#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7) :octal)
        (otherwise :default)))

    (defun character-code-reader-macro (stream quotation-mark)
      (declare (ignore quotation-mark))
      (flet ((encode (ch)
               ;; TODO: Use babel or something else to get the unicode
               ;; code-point of the character.
               (char-code ch)))
        (let ((ch (read-char stream)))
          (if (char= #\\ ch)
              ;; LET* so that CODE is computed from the escaped character,
              ;; not from the backslash itself.
              (let* ((ch   (read-char stream))
                     (code (c-escaped-character-map ch)))
                (flet ((read-code (*read-base* base-name)
                         (let ((code (read stream)))
                           (if (and (integerp code)
                                    (<= 0 code (1- char-code-limit)))
                               code
                               (error "Invalid ~A character code: ~A"
                                      base-name code)))))
                  (case code
                    (:hexa  (read-code 16 "hexadecimal"))
                    (:octal ;; The first octal digit has already been
                            ;; consumed; put it back before reading.
                            (unread-char ch stream)
                            (read-code 8 "octal"))
                    (:default ;; In emacs ?\x = ?x
                     (encode ch))
                    (otherwise code))))
              ;; or use #+clisp ext:string-to-bytes :
              (encode ch)))))

    (set-macro-character #\? 'character-code-reader-macro t)

    #(?a ?\a ?\r ?\n ?b ?\b ?\x41 ?\61 ?\\ ?\z ?' ?\')
    --> #(97 7 13 10 98 8 65 49 92 122 39 39)

(See also: http://paste.lisp.org/display/137262 for a C string reader.)
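
If the byte values themselves matter, a vector of codes like the one above can be written and read back through a stream opened with an (unsigned-byte 8) element type. The following is only a minimal sketch and is not part of the original message: write-octets and read-octets are hypothetical helper names, /tmp/octets.bin is an arbitrary scratch path, and nothing beyond standard Common Lisp stream functions is assumed.

```lisp
;; Sketch only: WRITE-OCTETS and READ-OCTETS are hypothetical helper names.
(defun write-octets (path octets)
  "Write the sequence OCTETS (integers 0-255) to PATH as raw bytes."
  (with-open-file (out path :direction :output
                            :element-type '(unsigned-byte 8)
                            :if-exists :supersede)
    (write-sequence octets out))
  path)

(defun read-octets (path)
  "Return the contents of PATH as a vector of (unsigned-byte 8)."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

;; No line-terminator translation applies to a binary stream, so the
;; 13 in front of the 10 should survive the round trip.
(let ((path "/tmp/octets.bin"))         ; arbitrary scratch path
  (write-octets path #(93 13 10 94))
  (read-octets path))
;; --> #(93 13 10 94)
```

Because the stream carries integers rather than characters, the external format and its line terminator play no role here, which is the property a checksum over file contents would need.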
> Anyhow, maybe that answers your question:
>
>>> How is READ related to CRLF vs. CR vs. LF?
>
> That is: the issue is when CRLF or CR or LF is in the middle of a
> string object.
>
> As I mentioned in my preceding email, I would like READ to invert
> PRIN1.  This seems a natural thing to want, though I'm not claiming
> it's required of CLISP or any other Lisp.

Again, what is in the string is a newline.  What clisp will read is a
newline, and what clisp will print is a newline.  Newlines everywhere. :-)

If you should care about the codes, then you should use binary streams,
and read and write bytes, not text.  READ and PRIN1 read and write text.

What YOU should not do is insert into strings non-character characters
such as #\return.  For one thing, they make your program non-conforming
since they are only semi-standard (i.e. an implementation may just not
have them).

    (concatenate 'string "\"" "a" (string (code-char 13)) (string #\Newline) "b" "\"")
                                          --------------
                                                ^
                                                |
    The error is here --------------------------+

> [5]> (demo)
>
> UNIX
> 00000000: 0A 22 61 0D 0A 62 22                    ?"a??b"
>
> "a
> b"

If you consider that this file is wrongly encoded (I could agree with you
on this point, IF I admitted #\return (and other such strange
"characters") in strings), then I will argue that the following file is
also ill-formed:

> DOS
> 00000000: 0D 0A 22 61 0D 0D 0A 62 22              ??"a???b"
>
> "a
>
> b"

Because a stray CR in a DOS file is not a good idea either.

Again, are we talking about text files?  Or about teletype control binary
streams?

It is not only #\return and its ilk that you should avoid in strings.
Let's take for example #\xd800.  You should not insert this so-called
"character" into strings either, because it is not a character.  It's a
unicode code point that doesn't encode any character (or even any
character part!)  If you were to put such a "character" in a clisp string,
and write out a file (e.g. using utf-8 or utf-16 encoding), you would most
probably create an invalid file.
Just like your two files above.  (The first is not a valid unix text file,
the second is not a valid DOS text file).

By the way, some implementations just don't have a character with code
#xd800:

    #+ccl (code-char #xd800) --> NIL

The codes between 0 and 31, 127, and between 128 and 159, to talk only of
the codes in the iso-8859-1 range, are similar: they don't encode
characters, and you should just NOT include them in any string, and of
course, not write them in a TEXT file (you can write those codes in a
binary file, if such a binary file format requires them).

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
A bad day in () is better than a good day in {}.
You can take the lisper out of the lisp job, but you can't take the lisp out
of the lisper (; -- antifuchs
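
The code ranges listed above can be turned into a simple check on a string before it is written to a text file. This is a minimal sketch under stated assumptions, not something from the original thread: control-code-p and control-character-positions are hypothetical names, char-code is assumed to return ISO-8859-1/Unicode code values (common in practice, but not guaranteed by the standard), and exempting #\Newline by default is a choice made here so that ordinary line breaks are not flagged.

```lisp
;; Sketch only: both function names are hypothetical.
(defun control-code-p (code)
  "True if CODE is in 0-31, is 127, or is in 128-159."
  (or (<= 0 code 31)
      (= code 127)
      (<= 128 code 159)))

(defun control-character-positions (string &key (allowed '(#\Newline)))
  "Return the positions in STRING of characters whose code is a control
code, ignoring any character listed in ALLOWED."
  (loop for ch across string
        for i from 0
        when (and (control-code-p (char-code ch))
                  (not (member ch allowed)))
          collect i))

;; The string from the quoted demo: "a", code 13, newline, "b".
;; Assuming (code-char 13) yields a character with code 13, only
;; position 1 is reported.
(control-character-positions
 (concatenate 'string
              "a" (string (code-char 13)) (string #\Newline) "b"))
;; --> (1)
```

A non-empty result could be used to reject or flag a string before PRIN1 writes it, which is one way to approach the idea of disallowing #\Return in text files.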