Re: [clisp-list] reading of CR/LF for charset:iso-8859-1


Hi --

Certainly I'd never knowingly write a #\Return, i.e. (code-char 13),
into a text file.  This problem showed up when a user of my
application created a file with #\Return characters, probably on a
Windows system, that I read back in on a Linux system.  Even then,
it's not exactly a huge problem, since the #\Return in front of each
#\Newline was dropped by READ ("newlines everywhere", as you say).
But we support 7 host Lisps, and our application, ACL2, was
complaining because a checksum computed was different for CLISP than
for the other six Lisps.  (I don't want to go into the whole story
about how ACL2 "certifies books", computes checksums, etc....)

Anyhow, thanks for your time.  I think I see your point and I may
simply not worry about the dropped #\Return characters.  Or perhaps we
should indeed disallow non-text characters such as #\Return in text
files; I may consider that.  As I've stated a couple of times, we
would like to read back in what was written; but I don't want to
defend that.  I can live with CLISP not behaving like those other six
Lisps, and I think you've answered my original question:

   Again, what is in the string is a newline.  What clisp will read is a
   newline, and what clisp will print is a newline.  Newlines
   everywhere. :-)

-- Matt
   From: "Pascal J. Bourguignon" <pj...@in...>
   Cc: cli...@li...
   Date: Wed, 22 May 2013 23:46:30 +0200

   Matt Kaufmann <kau...@cs...> writes:

   > Hi --
   >
   > The problem probably only arises when lines break in the middle of
   > strings.  To see what I mean, replace '(defun fact (x) ...) in your
   > definition of demo with the following.
   >
   >   (concatenate 'string
   > 	       "a" (string (code-char 13)) (string #\Newline) "b")
   >
   > Here are the results.  Notice that this time, the results are
   > different (but that probably won't surprise you).
   >

   Ok, let's consider a string like:

   (defparameter *str* "Hello
   World")

   Obviously, this string contains a new line.

   Again, why do you care whether there's a CRLF code sequence or just a LF
   code in the file?

   CL-USER> (with-open-file (src "/tmp/a.lisp" :external-format (ext:make-encoding :charset charset:iso-8859-1 
										   :line-terminator :dos))
	      (read src))
   (DEFPARAMETER *STR*
    "Hello
   World")
   CL-USER> (load "/tmp/a.lisp" :external-format (ext:make-encoding :charset charset:iso-8859-1 
								    :line-terminator :dos))
   ;; Loading file /tmp/a.lisp ...
   ;; Loaded file /tmp/a.lisp
   #P"/tmp/a.lisp"
   CL-USER> (length *str*)
   11
   CL-USER> 

   On the other hand, if you care whether your sequence contains codes 13
   10 or just 10, why do you use strings?

      (concatenate 'vector #(93) #(13) #(10) #(94))
      --> #(93 13 10 94)

   or just:

      (vector 93 13 10 94)
      --> #(93 13 10 94)

   or just:

      #(93 13 10 94)

   Now if you want to insert a lot of ASCII-encoded bytes, you can always
   write a reader macro:

   (defun c-escaped-character-map (escaped-character)
     (case escaped-character
       ((#\newline) -1)
       ((#\a)        7)
       ((#\b)        8)
       ((#\t)        9)
       ((#\n)       10)
       ((#\v)       11)
       ((#\f)       12)
       ((#\r)       13)
       ((#\x)       :hexa)
       ((#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7) :octal)
       (otherwise   :default)))

   (defun character-code-reader-macro (stream quotation-mark)
     (declare (ignore quotation-mark))
     (flet ((encode (ch)
              ;; TODO: Use babel or something else to get the Unicode
              ;;       code point of the character.
              (char-code ch)))
       (let ((ch (read-char stream)))
         (if (char= #\\ ch)
             ;; LET* so that CODE is computed from the escaped character,
             ;; not from the backslash.
             (let* ((ch (read-char stream))
                    (code (c-escaped-character-map ch)))
               (flet ((read-code (*read-base* base-name)
                        (let ((code (read stream)))
                          (if (and (integerp code) (<= 0 code (1- char-code-limit)))
                              code
                              (error "Invalid ~A character code: ~A" base-name code)))))
                 (case code
                   (:hexa  (read-code 16 "hexadecimal"))
                   ;; The first octal digit was already consumed; put it back.
                   (:octal (unread-char ch stream)
                           (read-code  8 "octal"))
                   (:default ;; In emacs ?\x = ?x
                    (encode ch))
                   (otherwise code))))
             ;; or use #+clisp ext:string-to-bytes :
             (encode ch)))))

   (set-macro-character #\? 'character-code-reader-macro t)

   #(?a ?\a ?\r ?\n ?b ?\b ?\x41 ?\61 ?\\ ?\z ?' ?\')
   --> #(97 7 13 10 98 8 65 49 92 122 39 39)

   (See also:
   http://paste.lisp.org/display/137262
   for a C string reader.)

   > Anyhow, maybe that answers your question:
   >
   >>> How is READ related to CRLF vs. CR vs. LF?
   >
   > That is: the issue is when CRLF or CR or LF is in the middle of a
   > string object.
   >
   > As I mentioned in my preceding email, I would like READ to invert
   > PRIN1.  This seems a natural thing to want, though I'm not claiming
   > it's required of CLISP or any other Lisp.  

   Again, what is in the string is a newline.  What clisp will read is a
   newline, and what clisp will print is a newline.  Newlines
   everywhere. :-)

   If you should care about the codes, then you should use binary streams,
   and read and write bytes, not text.  READ and PRIN1 read and write text.
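
A minimal sketch of the byte-level approach described above, assuming the file fits in memory. The helper names FILE-OCTETS and WRITE-OCTETS are hypothetical and do not come from this thread; they use only standard Common Lisp binary streams, so no line-terminator translation should apply.

```lisp
;; Hypothetical helper (not from the original thread): return the raw
;; octets of a file as a vector, reading through a binary stream.
(defun file-octets (path)
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

;; Hypothetical helper: write a sequence of octets to a file verbatim
;; and return the path.
(defun write-octets (octets path)
  (with-open-file (out path :direction :output
                            :element-type '(unsigned-byte 8)
                            :if-exists :supersede)
    (write-sequence octets out))
  path)
```

A checksum computed over the vector returned by FILE-OCTETS would then depend only on the bytes in the file, not on how a given Lisp maps CRLF to #\Newline.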

   What YOU should not do is insert into strings non-character
   characters such as #\return.  For one thing, they make your program
   non-conforming, since they are only semi-standard (i.e. an
   implementation may just not have them).

       (concatenate 'string
         "\"" "a" (string (code-char 13)) (string #\Newline) "b" "\"")
                          --------------
                                 ^
                                 |
   The error is here ------------+

   > [5]> (demo)
   >
   > UNIX 
   > 00000000: 0A 22 61 0D 0A 62 22 ?"a??b"
   >
   > "a
   > b" 

   If you consider that this file is wrongly encoded (I could agree with
   you on this point, IF I admitted #\return (and other such strange
   "characters") in strings), then I will argue that the following file is
   also ill-formed:

   > DOS 
   > 00000000: 0D 0A 22 61 0D 0D 0A 62 22 ??"a???b"
   >
   > "a
   >
   > b" 

   Because a stray CR in a DOS file is not a good idea either.

   Again, are we talking about text files?  
   Or about teletype control binary streams?

   It is not only #\return and its ilk that you should avoid in strings.

   Let's take for example #\xd800.  You should not insert this so-called
   "character" into strings either because it is not a character.  It's a
   Unicode code point that doesn't encode any character (or even any
   character part!)

   If you were to put such a "character" in a clisp string, and write out a
   file (e.g. using utf-8 or utf-16 encoding), you would most probably
   create an invalid file.  Just like your two files above.  (The first
   is not a valid unix text file, the second is not a valid DOS text file).

   By the way, some implementations just don't have a character with code
   #xd800:

       #+ccl (code-char #xd800) --> NIL

   The codes between 0 and 31, 127, and between 128 and 159, to speak only
   of the codes in the iso-8859-1 range, are similar: they don't encode
   characters, and you should just NOT include them in any string, and of
   course, not write them in a TEXT file (you can write those codes in a
   binary file, if such a binary file format requires them).
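
The code ranges listed above can be checked mechanically. The following is only a sketch: CONTROL-CODE-P and FIND-CONTROL-CODE are hypothetical names, and it assumes that CHAR-CODE agrees with iso-8859-1 for codes below 256. It also exempts #\Newline, which is an assumption added here, since a text string obviously needs line breaks.

```lisp
;; Hypothetical predicate: true for the codes named above
;; (0 to 31, 127, and 128 to 159).
(defun control-code-p (code)
  (or (<= 0 code 31)
      (= code 127)
      (<= 128 code 159)))

;; Hypothetical helper: position of the first character of STRING whose
;; code is a control code, or NIL if there is none.  The newline
;; character is exempted (an assumption of this sketch).
(defun find-control-code (string)
  (position-if (lambda (ch)
                 (and (char/= ch #\Newline)
                      (control-code-p (char-code ch))))
               string))
```

Applied to the string built with (code-char 13) earlier in this thread, FIND-CONTROL-CODE would report the position of the stray return, which an application could use to reject such input before writing it to a text file.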

   -- 
   __Pascal Bourguignon__                     http://www.informatimago.com/
   A bad day in () is better than a good day in {}.
   You can take the lisper out of the lisp job, but you can't take the lisp out
   of the lisper (; -- antifuchs