From: Matt K. <kau...@cs...> - 2013-05-19 17:56:55
|
Hi --

I maintain an application that is built on top of Common Lisp, which expects iso-8859-1 for the character encoding. I'd like to set things up so that on a linux system, my application reads characters from a file exactly as they were written. But my attempt to do so failed, dropping a #\Return character, as illustrated by the log below. Is there something simple I can do to accomplish my goal, or else might that be the case in future CLISP releases? Note that I did see the following note at http://www.clisp.org/impnotes/clhs-newline.html:

  Justification. Unicode Newline Guidelines say: “Even if you know which characters represents NLF on your particular platform, on input and in interpretation, treat CR, LF, CRLF, and NEL the same. Only on output do you need to distinguish between them.”

However, I'm hoping that since I'm using iso-8859-1 rather than a utf encoding, maybe that justification doesn't need to apply.

Here is the log promised above. It shows that after an attempt to set custom:*default-file-encoding* appropriately, and after writing to a file a string of four characters that includes a #\Return character, that character is dropped when reading back in.

dunnottar:~% /usr/bin/clisp
[CLISP ASCII-art startup banner omitted]
Welcome to GNU CLISP 2.49 (2010-07-07) <http://clisp.cons.org/>

Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2010

Type :h and hit Enter for context help.

[1]> (setq custom:*default-file-encoding*
           (ext:make-encoding :charset 'charset:iso-8859-1
                              :line-terminator :unix))
#<ENCODING CHARSET:ISO-8859-1 :UNIX>
[2]> (with-open-file (str "test.lisp" :direction :output)
       (princ (concatenate 'string "\"" "a" (string #\Return) (string #\Newline) "b" "\"") str))
"\"a
b\""
[3]> (with-open-file (str "test.lisp" :direction :input)
       (let ((s (read str)))
         (list (length s) (char s 0) (char s 1) (char s 2))))

You are in the top-level Read-Eval-Print loop.
Help (abbreviated :h) = this list
Use the usual editing capabilities.
(quit) or (exit) leaves CLISP.
(3 #\a #\Newline #\b)
[4]>

Thanks --
-- Matt Kaufmann
|
From: Pascal J. B. <pj...@in...> - 2013-05-19 21:16:37
|
Matt Kaufmann <kau...@cs...> writes:

> Hi --
>
> I maintain an application that is built on top of Common Lisp, which
> expects iso-8859-1 for the character encoding. I'd like to set things
> up so that on a linux system, my application reads characters from a
> file exactly as they were written. But my attempt to do so failed,
> dropping a #\Return character, as illustrated by the log below. Is
> there something simple I can do to accomplish my goal, or else might
> that be the case in future CLISP releases? Note that I did see the
> following note at http://www.clisp.org/impnotes/clhs-newline.html:
>
>   Justification. Unicode Newline Guidelines say: “Even if you know
>   which characters represents NLF on your particular platform, on
>   input and in interpretation, treat CR, LF, CRLF, and NEL the
>   same. Only on output do you need to distinguish between them.”
>
> However, I'm hoping that since I'm using iso-8859-1 rather than a utf
> encoding, maybe that justification doesn't need to apply.

No, it still applies.

Since you want to read codes such as 13 and 10, you should specify an element type of (unsigned-byte 8):

[pjb@kuiper :0.0 ~]$ clisp -ansi -norc -q
[1]> (deftype octet () '(unsigned-byte 8))
OCTET
[2]> (with-open-file (in #P"~/tmp/misc/wang.dos" :element-type 'octet)
       (let ((buffer (make-array 256 :element-type 'octet)))
         (read-sequence buffer in)
         (search #(13 10) buffer)))
29
[3]> (quit)
[pjb@kuiper :0.0 ~]$

--
__Pascal Bourguignon__ http://www.informatimago.com/
A bad day in () is better than a good day in {}.
You can take the lisper out of the lisp job, but you can't take the lisp out of the lisper (; -- antifuchs
|
From: Matt K. <kau...@cs...> - 2013-05-19 22:20:57
|
Thank you very much for getting back to me so quickly. That helps, but I'd like to be able to read in code 13 using the function READ-CHAR, and I don't see how to do that in CLISP, even though I can do it in Allegro CL, CCL, CMUCL, LispWorks, SBCL, and GCL. My sample file contains six characters as follows, where the line break consists of #\Return followed by #\Newline:

"x
y"

Below is a log showing how I get #\Return (code 13) using read-char in those other lisps, but not CLISP. Any suggestions?

But first, I should mention that I tried the following in CLISP (though the error probably won't surprise you) -- maybe you can suggest an alternative?

[1]> (deftype octet () '(unsigned-byte 8))
OCTET
[2]> (with-open-file (in #P"test0" :element-type 'octet)
       (read-char in))

*** - READ-CHAR on #<INPUT BUFFERED FILE-STREAM (UNSIGNED-BYTE 8) #P"test0"> is illegal
The following restarts are available:
ABORT          :R1      Abort main loop
Break 1 [3]>

Anyhow, here is the log promised above.

dunnottar:~/temp% acl9
International Allegro CL Enterprise Edition
9.0 [64-bit Linux (x86-64)] (Jul 11, 2012 14:33)
Copyright (C) 1985-2012, Franz Inc., Oakland, CA, USA. All Rights Reserved.

This development copy of Allegro CL is licensed to:
   [TC20122] University of Texas

;; Optimization settings: safety 1, space 1, speed 1, debug 2.
;; For a complete description of all compiler switches given the
;; current optimization settings evaluate (EXPLAIN-COMPILER-SETTINGS).
CL-USER(1): (setq *locale* (find-locale "C"))
#<locale "C" [:LATIN1-BASE] @ #x100004067b2>
CL-USER(2): (let (ch)
              (with-open-file (in #P"test0")
                (loop while (setq ch (read-char in nil))
                      collect (char-code ch))))
(34 120 13 10 121 34 13 10)
CL-USER(3): (exit)
; Exiting

dunnottar:~/temp% ccl
Starting 64-bit CCL
Welcome to Clozure Common Lisp Version 1.9-dev-r15542M-trunk (LinuxX8664)!
? (setq ccl:*default-file-character-encoding* :iso-8859-1)
:ISO-8859-1
? (let (ch)
    (with-open-file (in #P"test0")
      (loop while (setq ch (read-char in nil))
            collect (char-code ch))))
(34 120 13 10 121 34 13 10)
? (quit)

dunnottar:~/temp% cmucl
CMU Common Lisp snapshot-2013-05 (20D Unicode), running on dunnottar
With core: /v/filer4b/v11q001/acl2/lisps/cmucl-snapshot-2013-05-20D-Unicode/lib/cmucl/lib/lisp-sse2.core
Dumped on: Sat, 2013-05-11 11:18:42-05:00 on lorien2
See <http://www.cmucl.org/> for support information.
Loaded subsystems:
    Unicode 1.29 with Unicode version 6.2.0
    Python 1.1, target Intel x86/sse2
    CLOS based on Gerd's PCL 2010/03/19 15:19:03
* (setq *default-external-format* :iso-8859-1)
:ISO-8859-1
* (let (ch)
    (with-open-file (in #P"test0")
      (loop while (setq ch (read-char in nil))
            collect (char-code ch))))
(34 120 13 10 121 34 13 10)
* (quit)
;

dunnottar:~/temp% lispworks
Starting 64-bit Lispworks
LispWorks(R): The Common Lisp Programming Environment
Copyright (C) 1987-2012 LispWorks Ltd. All rights reserved.
Version 6.1.1
Saved by kaufmann as lw-terminal-only, at 26 Nov 2012 15:23
User kaufmann on dunnottar
CL-USER 1 > (setq stream::*default-external-format* '(:LATIN-1 :EOL-STYLE :LF))
(:LATIN-1 :EOL-STYLE :LF)
CL-USER 2 > (defun our-file-encoding (pathname ef-spec buffer length)
              (system:merge-ef-specs ef-spec '(:LATIN-1 :EOL-STYLE :LF)))
OUR-FILE-ENCODING
CL-USER 3 > (setq system::*file-encoding-detection-algorithm* '(our-file-encoding))
(OUR-FILE-ENCODING)
CL-USER 4 > (let (ch)
              (with-open-file (in #P"test0")
                (loop while (setq ch (read-char in nil))
                      collect (char-code ch))))
(34 120 13 10 121 34 13 10)
CL-USER 5 > (quit)

dunnottar:~/temp% sbcl
Starting 64-bit SBCL
This is SBCL 1.1.4, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (setq sb-impl::*default-external-format* :iso-8859-1)
:ISO-8859-1
* (let (ch)
    (with-open-file (in #P"test0")
      (loop while (setq ch (read-char in nil))
            collect (char-code ch))))
(34 120 13 10 121 34 13 10)
* (quit)

dunnottar:~/temp% gcl
GCL (GNU Common Lisp) 2.6.8 CLtL1 May 11 2013 16:43:51
Source License: LGPL(gcl,gmp), GPL(unexec,bfd,xgcl)
Binary License: GPL due to GPL'ed components: (XGCL READLINE UNEXEC)
Modifications of this banner must retain notice of a compatible license
Dedicated to the memory of W. Schelter

Use (help) to get some basic information on how to use GCL.
Temporary directory for compiler files set to /tmp/

>(let (ch)
   (with-open-file (in #P"test0")
     (loop while (setq ch (read-char in nil))
           collect (char-code ch))))
(34 120 13 10 121 34 13 10)
>(quit)

dunnottar:~/temp% clisp
[CLISP ASCII-art startup banner omitted]
Welcome to GNU CLISP 2.49 (2010-07-07) <http://clisp.cons.org/>

Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2010

Type :h and hit Enter for context help.

[1]> (setq custom:*default-file-encoding*
           (ext:make-encoding :charset 'charset:iso-8859-1
                              :line-terminator :unix))
#<ENCODING CHARSET:ISO-8859-1 :UNIX>
[2]> (let (ch)
       (with-open-file (in #P"test0")
         (loop while (setq ch (read-char in nil))
               collect (char-code ch))))
(34 120 10 121 34 10)
[3]> (quit)
Bye.
dunnottar:~/temp%

Thanks --
-- Matt
|
From: Pascal J. B. <pj...@in...> - 2013-05-19 23:31:56
|
Matt Kaufmann <kau...@cs...> writes:

> Thank you very much for getting back to me so quickly. That helps,
> but I'd like to be able to read in code 13 using the function
> READ-CHAR, and I don't see how to do that in CLISP, even though I can
> do it in Allegro CL, CCL, CMUCL, LispWorks, SBCL, and GCL. My sample
> file contains six characters as follows, where the line break consists
> of #\Return followed by #\Newline:

I told you how to do it. Read bytes, not characters.

You can always convert bytes to characters later with ext:convert-string-from-bytes:

(deftype octet () '(unsigned-byte 8))
(with-open-file (in #P"~/tmp/misc/wang.dos" :element-type 'octet)
  (loop
    :for byte = (read-byte in nil in)   ; the stream itself serves as the EOF marker
    :until (eq byte in)
    :do (case byte
          ((13) (princ " CR"))
          ((10) (princ " LF") (princ #\Newline))
          (otherwise
           (if (or (<= 32 byte 126) (<= 160 byte 255))
               (princ (ext:convert-string-from-bytes (vector byte) charset:iso-8859-1))
               (format t "<CODE ~D>" byte)))))
  (values))

Hao Wang, logicien americain. CR LF
 CR LF
L'algorithme en question a ete publie en 1960 dans l'IBM Journal, CR LF
article intitule "Toward Mechanical Mathematics", avec des variantes et CR LF
une extension au calcul des predicats. Il s'agit ici du "premier CR LF
programme" de Wang, systeme "P". CR LF
 CR LF
L'article a ete ecrit en 1958, et les experiences effectuees sur IBM 704 CR LF
- machine a lampes, 32 k mots de 36 bits, celle-la meme qui vit naitre CR LF
LISP a la meme epoque. Le programme a ete ecrit en assembleur (Fortran CR LF
existait, mais il ne s'etait pas encore impose) et l'auteur estime que CR LF
"there is very little in the program that is not straightforward". CR LF
 CR LF
Il observe que les preuves engendrees sont "essentiellement des arbres", CR LF
et annonce que la machine a demontre 220 theoremes du calcul des CR LF
propositions (tautologies) en 3 minutes. Il en tire argument pour la CR LF
superiorite d'une approche algorithmique par rapport a une approche CR LF
heuristique comme celle du "Logic Theorist" de Newell, Shaw et Simon (a CR LF
partir de 1956 sur la machine JOHNNIAC de la Rand Corporation): un debat CR LF
qui dure encore... CR LF
 CR LF
Cet algorithme a ete popularise par J. McCarthy, comme exemple-fanion CR LF
d'application de LISP. Il figure dans le manuel de la premiere version CR LF
de LISP (LISP 1, sur IBM 704 justement, le manuel est date de Mars CR LF
1960), et il a ete repris dans le celebre "LISP 1.5 Programmer's Manual" CR LF
publie en 1962 par MIT Press, un des maitres-livres de l'Informatique. CR LF
 CR LF
 CR LF
 CR LF

> Below is a log showing how I get #\Return (code 13) using read-char in
> those other lisps, but not CLISP. Any suggestions?

You're trying to read a binary stream containing control codes. So read it as such, process the control codes, and convert the bytes that encode characters into strings. See above.

> CL-USER(1): (setq *locale* (find-locale "C"))
> #<locale "C" [:LATIN1-BASE] @ #x100004067b2>
> CL-USER(2): (let (ch)
>               (with-open-file (in #P"test0")
>                 (loop while (setq ch (read-char in nil))
>                       collect (char-code ch))))
> (34 120 13 10 121 34 13 10)

You should do the reverse: read bytes, and convert them to characters when they are bytes encoding characters.

Beware also that char-code and code-char use an unspecified code. Prefer functions such as #+clisp ext:convert-string-from-bytes or com.informatimago.common-lisp.cesarum.ascii:ascii-string or from the babel package, which use a definite encoding.

--
__Pascal Bourguignon__ http://www.informatimago.com/
A bad day in () is better than a good day in {}.
You can take the lisper out of the lisp job, but you can't take the lisp out of the lisper (; -- antifuchs
|
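The "read bytes, convert later" advice can be sketched in portable Common Lisp as follows. This is only an illustration, not code from the thread: the function names are hypothetical, and instead of the CLISP-specific ext:convert-string-from-bytes it uses code-char, on the assumption that the implementation maps codes 0..255 to the ISO-8859-1 repertoire (true of Unicode-based Lisps, but not guaranteed by the standard, as noted above).

```lisp
;; Hypothetical helper names; they do not appear in the thread.

(defun read-file-octets (pathname)
  "Return the whole content of PATHNAME as a vector of (unsigned-byte 8)."
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

(defun latin-1-octets-to-string (octets)
  "Decode OCTETS as ISO-8859-1, one character per octet.
No line-terminator conversion happens, so 13 (CR) and 10 (LF) stay distinct.
ASSUMPTION: CODE-CHAR maps 0..255 to the Latin-1 characters."
  (map 'string #'code-char octets))
```

With these, (map 'list #'char-code (latin-1-octets-to-string (read-file-octets "test0"))) should return every byte of the file, CR included, because no text-mode stream is involved.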
From: Matt K. <kau...@cs...> - 2013-05-20 13:12:07
|
Thank you; that makes sense.

Returning to my original problem, regarding READ rather than READ-CHAR:

Your first reply told me how to do the read I desired by reading a string into a buffer. But for my purposes, I'd like to call READ on arbitrary objects. The following example illustrates how I would like READ to invert PRIN1. Using the same set-ups as I sent you before, CLISP returns nil in the following case, while the other six Lisps I mentioned all return t. Although x is a particular string in this example, imagine that it could be any sort of object for which equality can be tested by EQUAL. Is there a way to set up CLISP so that this never returns nil?

(let ((x (concatenate 'string
                      "a" (string (code-char 13)) (string #\Newline) "b")))
  (delete-file "out0")
  (with-open-file
   (out "out0" :direction :output)
   (prin1 x out))
  (with-open-file
   (in "out0" :direction :input)
   (equal (read in) x)))

Interestingly, I see that the following returns the value (4 #\a #\Return #\Newline #\b):

(let ((s (read-from-string
          (concatenate 'string
                       "\"" "a" (string (code-char 13)) (string #\Newline) "b" "\""))))
  (list (length s) (char s 0) (char s 1) (char s 2) (char s 3)))

So one answer is to read the entire file into a string using READ-BYTE and EXT:CONVERT-STRING-FROM-BYTES, and then call READ-FROM-STRING on that string instead of calling READ on a stream, maintaining the next position from which to read. Perhaps I could even arrange for a string input stream, so that I don't need to maintain the position. But I'd prefer simply to use just READ, and I'd hoped that I could do so after evaluating the following, but that's not the case.

(setq custom:*default-file-encoding*
      (ext:make-encoding :charset 'charset:iso-8859-1
                         :line-terminator :unix))

Any suggestions for how I can call read so that it inverts prin1 in the sense explained above?

Thanks --
-- Matt
|
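The string-input-stream idea mentioned in this message could look roughly like the sketch below. It is an assumption-laden illustration, not something posted in the thread: the function name is hypothetical, the decoding uses code-char (assuming codes 0..255 correspond to ISO-8859-1), and it relies on the observation above that reading from a string keeps #\Return intact.

```lisp
;; Hypothetical function name; a sketch, not code from the thread.
(defun read-forms-from-latin-1-file (pathname)
  "Return a list of all objects READ from PATHNAME.
The file is fetched as octets and decoded one character per octet, so no
line-terminator conversion can drop a CR that sits inside a string literal.
ASSUMPTION: CODE-CHAR maps 0..255 to the ISO-8859-1 characters."
  (let ((text (with-open-file (in pathname :element-type '(unsigned-byte 8))
                (let ((octets (make-array (file-length in)
                                          :element-type '(unsigned-byte 8))))
                  (read-sequence octets in)
                  (map 'string #'code-char octets))))
        (eof (list :eof)))              ; fresh cons used as a unique EOF marker
    ;; The string stream tracks the read position, so successive READ calls
    ;; simply continue where the previous one stopped.
    (with-input-from-string (in text)
      (loop :for form = (read in nil eof)
            :until (eq form eof)
            :collect form))))
```

The cost of this design is that the whole file is held in memory as a string before anything is read, which may or may not matter for the application.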
From: Pascal J. B. <pj...@in...> - 2013-05-20 16:19:27
|
Matt Kaufmann <kau...@cs...> writes:

> Thank you; that makes sense.
>
> Returning to my original problem, regarding READ rather than
> READ-CHAR:
>
> Your first reply told me how to do the read I desired by reading a
> string into a buffer. But for my purposes, I'd like to call READ on
> arbitrary objects. The following example illustrates how I would like
> READ to invert PRIN1. Using the same set-ups as I sent you before, CLISP
> returns nil in the following case, while the other six Lisps I
> mentioned all return t. Although x is a particular string in this
> example, imagine that it could be any sort of object for which
> equality can be tested by EQUAL. Is there a way to set up CLISP so
> that this never returns nil?
>
> (let ((x (concatenate 'string
>                       "a" (string (code-char 13)) (string #\Newline) "b")))
>   (delete-file "out0")
>   (with-open-file
>    (out "out0" :direction :output)
>    (prin1 x out))
>   (with-open-file
>    (in "out0" :direction :input)
>    (equal (read in) x)))
>
> Interestingly, I see that the following returns the value
> (4 #\a #\Return #\Newline #\b):
>
> (let ((s (read-from-string
>           (concatenate 'string
>                        "\"" "a" (string (code-char 13)) (string #\Newline) "b" "\""))))
>   (list (length s) (char s 0) (char s 1) (char s 2) (char s 3)))
>
> So one answer is to read the entire file into a string using READ-BYTE
> and EXT:CONVERT-STRING-FROM-BYTES, and then call READ-FROM-STRING on
> that string instead of calling READ on a stream, maintaining the next
> position from which to read. Perhaps I could even arrange for a
> string input stream, so that I don't need to maintain the position.
> But I'd prefer simply to use just READ, and I'd hoped that I could do
> so after evaluating the following, but that's not the case.
>
> (setq custom:*default-file-encoding*
>       (ext:make-encoding :charset 'charset:iso-8859-1
>                          :line-terminator :unix))
>
> Any suggestions for how I can call read so that it inverts prin1 in
> the sense explained above?

How is READ related to CRLF vs. CR vs. LF?

I still don't understand why you're concerned with how the lines are terminated.

While it's understandable that you may want to generate text files with a definite line termination sequence (e.g. on MS-Windows you want CRLF, but on Unix you want LF), when reading text files, why would you care what line terminator is used?

(defparameter *unix-external-format*
  (ext:make-encoding :charset charset:iso-8859-1
                     :line-terminator :unix
                     :input-error-action :error
                     :output-error-action :error))

(defparameter *dos-external-format*
  (ext:make-encoding :charset charset:iso-8859-1
                     :line-terminator :dos
                     :input-error-action :error
                     :output-error-action :error))

(defun dump (pathname &optional (*standard-output* *standard-output*))
  (with-open-file (data pathname :element-type '(unsigned-byte 8))
    (let ((buffer (make-array 16 :element-type '(unsigned-byte 8))))
      (loop
        :for offset :from 0 :by 16
        :for size = (read-sequence buffer data)
        :while (plusp size)
        :do (format t "~&~8,'0X:~{ ~2,'0X~} ~{~A~}~%"
                    offset
                    (coerce (subseq buffer 0 size) 'list)
                    (map 'list (lambda (code)
                                 (if (or (<= 32 code 126) (<= 160 code 255))
                                     (code-char code)
                                     "?"))
                         (subseq buffer 0 size)))))))

(defun demo ()
  (loop
    :for efname :in '(unix dos)
    :for external-format :in (list *unix-external-format* *dos-external-format*)
    :do (print efname)
    :do (with-open-file (src "/tmp/src.lisp"
                             :direction :output
                             :external-format external-format
                             :if-does-not-exist :create
                             :if-exists :supersede)
          (let ((*print-right-margin* 20))
            (pprint '(defun fact (x) (if (zerop x) 1 (* x (fact (1- x))))) src)))
    :do (dump "/tmp/src.lisp")
    :do (with-open-file (src "/tmp/src.lisp"
                             :direction :input
                             :external-format charset:iso-8859-1
                             :if-does-not-exist :create
                             :if-exists :supersede)
          (print (read src))
          (terpri) (terpri)))
  (values))

(demo)

UNIX
00000000: 0A 28 44 45 46 55 4E 20 46 41 43 54 20 28 58 29 ?(DEFUN FACT (X)
00000010: 0A 20 28 49 46 20 28 5A 45 52 4F 50 20 58 29 20 ? (IF (ZEROP X) 
00000020: 31 0A 20 20 28 2A 20 58 0A 20 20 20 28 46 41 43 1?  (* X?   (FAC
00000030: 54 20 28 31 2D 20 58 29 29 29 29 29 T (1- X)))))

(DEFUN FACT (X) (IF (ZEROP X) 1 (* X (FACT (1- X)))))

DOS
00000000: 0D 0A 28 44 45 46 55 4E 20 46 41 43 54 20 28 58 ??(DEFUN FACT (X
00000010: 29 0D 0A 20 28 49 46 20 28 5A 45 52 4F 50 20 58 )?? (IF (ZEROP X
00000020: 29 20 31 0D 0A 20 20 28 2A 20 58 0D 0A 20 20 20 ) 1??  (* X??   
00000030: 28 46 41 43 54 20 28 31 2D 20 58 29 29 29 29 29 (FACT (1- X)))))

(DEFUN FACT (X) (IF (ZEROP X) 1 (* X (FACT (1- X)))))

As you can see, you can write files with either the unix or the dos line terminator, and when you read them using an unspecified line terminator, they read as the same text (and therefore the same sexp).

This allows you to pass files seamlessly between MacOS, Unix (including MacOSX), and MS-Windows, reading them with the same clisp program.

--
__Pascal Bourguignon__ http://www.informatimago.com/
A bad day in () is better than a good day in {}.
You can take the lisper out of the lisp job, but you can't take the lisp out of the lisper (; -- antifuchs
|
From: Matt K. <kau...@cs...> - 2013-05-21 04:21:22
|
Hi -- The problem probably only arises when lines break in the middle of strings. To see what I mean, replace '(defun fact (x) ...) in your definition of demo with the following. (concatenate 'string "a" (string (code-char 13)) (string #\Newline) "b") Here are the results. Notice that this time, the results are different (but that probably won't surprise you). [5]> (demo) UNIX 00000000: 0A 22 61 0D 0A 62 22 ?"a??b" "a b" DOS 00000000: 0D 0A 22 61 0D 0D 0A 62 22 ??"a???b" "a b" [6]> Anyhow, maybe that answers your question: >> How is READ related to CRLF vs. CR vs. LF? That is: the issue is when CRLF or CR or LF is in the middle of a string object. As I mentioned in my preceding email, I would like READ to invert PRIN1. This seems a natural thing to want, though I'm not claiming it's required of CLISP or any other Lisp. In my preceding email I gave an example where this isn't the case in CLISP (but it is the case in the other six Lisps I tested). As I showed, setting custom:*default-file-encoding* doesn't help. Perhaps nothing helps, but if there is a way for READ to invert PRIN1 (in the sense of the example I sent in my preceding email), that would be great to know. If not -- thanks anyhow for your time. Regards, Matt From: "Pascal J. Bourguignon" <pj...@in...> Date: Mon, 20 May 2013 18:18:47 +0200 Organization: Informatimago Matt Kaufmann <kau...@cs...> writes: > Thank you; that makes sense. > > Returning to my original problem, regarding READ rather than > READ-CHAR: > > You first reply told me how to do the read I desired by reading a > string into a buffer. But for my purposes, I'd like to call READ on > arbitrary objects. The following example illustrates how I would like > READ to invert PRIN1. Using same set-ups as I sent you before, CLISP > returns nil in the following case, while the other six Lisps I > mentioned all return t. 
Although x is a particular string in this > example, imagine that it could be any sort of object for which > equality can be tested by EQUAL. Is there a way to set up CLISP so > that this never returns nil? > > (let ((x (concatenate 'string > "a" (string (code-char 13)) (string #\Newline) "b"))) > (delete-file "out0") > (with-open-file > (out "out0" :direction :output) > (prin1 x out)) > (with-open-file > (in "out0" :direction :input) > (equal (read in) x))) > > Interestingly, I see that the following returns the value > (4 #\a #\Return #\Newline #\b): > > (let ((s (read-from-string > (concatenate 'string > "\"" "a" (string (code-char 13)) (string #\Newline) "b" "\"")))) > (list (length s) (char s 0) (char s 1) (char s 2) (char s 3))) > > So one answer is to read the entire file into a string using READ-BYTE > and EXT:CONVERT-STRING-FROM-BYTES, and then call READ-FROM-STRING on > that string instead of calling READ on a stream, maintaining the next > position from which to read. Perhaps I could even arrange for a > string input stream, so that I don't need to maintain the position. > But I'd prefer simply to use just READ, and I'd hoped that I could do > so after evaluating the following, but that's not the case. > > (setq custom:*default-file-encoding* > (ext:make-encoding :charset 'charset:iso-8859-1 > :line-terminator :unix)) > > Any suggestions for how I can call read so that it inverts prin1 in > the sense explained above? How is READ related to CRLF vs. CR vs. LF? I still don't understand why you're concerned with how the lines are terminated. While it's understandable that you may want to generate text files with a definite line termination sequence (eg. on MS-Windows you want CRLF, but on Unix you want LF), while reading text files, why would you care what line terminator is used? 

(defparameter *unix-external-format*
  (ext:make-encoding :charset charset:iso-8859-1
                     :line-terminator :unix
                     :input-error-action :error
                     :output-error-action :error))

(defparameter *dos-external-format*
  (ext:make-encoding :charset charset:iso-8859-1
                     :line-terminator :dos
                     :input-error-action :error
                     :output-error-action :error))

;; Hex dump of the raw bytes of a file, 16 bytes per line.
(defun dump (pathname &optional (*standard-output* *standard-output*))
  (with-open-file (data pathname :element-type '(unsigned-byte 8))
    (let ((buffer (make-array 16 :element-type '(unsigned-byte 8))))
      (loop
        :for offset :from 0 :by 16
        :for size = (read-sequence buffer data)
        :while (plusp size)
        :do (format t "~&~8,'0X:~{ ~2,'0X~} ~{~A~}~%"
                    offset
                    (coerce (subseq buffer 0 size) 'list)
                    (map 'list (lambda (code)
                                 (if (or (<= 32 code 126)
                                         (<= 160 code 255))
                                     (code-char code)
                                     "?"))
                         (subseq buffer 0 size)))))))

;; Write the same form with each line terminator, dump the bytes,
;; then read the file back without specifying a line terminator.
(defun demo ()
  (loop
    :for efname :in '(unix dos)
    :for external-format :in (list *unix-external-format*
                                   *dos-external-format*)
    :do (print efname)
    :do (with-open-file (src "/tmp/src.lisp"
                             :direction :output
                             :external-format external-format
                             :if-does-not-exist :create
                             :if-exists :supersede)
          (let ((*print-right-margin* 20))
            (pprint '(defun fact (x) (if (zerop x) 1 (* x (fact (1- x)))))
                    src)))
    :do (dump "/tmp/src.lisp")
    :do (with-open-file (src "/tmp/src.lisp"
                             :direction :input
                             :external-format charset:iso-8859-1
                             :if-does-not-exist :create
                             :if-exists :supersede)
          (print (read src))
          (terpri) (terpri)))
  (values))

(demo)

UNIX
00000000: 0A 28 44 45 46 55 4E 20 46 41 43 54 20 28 58 29 ?(DEFUN FACT (X)
00000010: 0A 20 28 49 46 20 28 5A 45 52 4F 50 20 58 29 20 ? (IF (ZEROP X) 
00000020: 31 0A 20 20 28 2A 20 58 0A 20 20 20 28 46 41 43 1?  (* X?   (FAC
00000030: 54 20 28 31 2D 20 58 29 29 29 29 29 T (1- X)))))

(DEFUN FACT (X) (IF (ZEROP X) 1 (* X (FACT (1- X)))))

DOS
00000000: 0D 0A 28 44 45 46 55 4E 20 46 41 43 54 20 28 58 ??(DEFUN FACT (X
00000010: 29 0D 0A 20 28 49 46 20 28 5A 45 52 4F 50 20 58 )?? (IF (ZEROP X
00000020: 29 20 31 0D 0A 20 20 28 2A 20 58 0D 0A 20 20 20 ) 1??  (* X??   
00000030: 28 46 41 43 54 20 28 31 2D 20 58 29 29 29 29 29 (FACT (1- X)))))

(DEFUN FACT (X) (IF (ZEROP X) 1 (* X (FACT (1- X)))))

As you can see, you can write files with either the unix or the dos
line terminator, and when you read them using an unspecified line
terminator, they read as the same text (and therefore the same sexp).

This allows you to pass files seamlessly between MacOS, Unix (including
MacOSX), and MS-Windows, reading them with the same clisp program.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
A bad day in () is better than a good day in {}.
You can take the lisper out of the lisp job, but you can't take the lisp out
of the lisper (; -- antifuchs

_______________________________________________
clisp-list mailing list
cli...@li...
https://lists.sourceforge.net/lists/listinfo/clisp-list
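[Editorial sketch] The workaround mentioned in the quoted question (read the file as bytes, convert to a string, then READ from the string) could look roughly like the following in portable Common Lisp. The function names FILE-BYTES, LATIN-1-STRING and READ-ALL-FORMS are made up for this illustration. The decoding step assumes that CODE-CHAR maps codes 0..255 to the ISO-8859-1 characters, which holds on Unicode-based Lisps but is not guaranteed by the standard; on CLISP, EXT:CONVERT-STRING-FROM-BYTES could be used for that step instead.

```lisp
;; Hypothetical helper: return the contents of PATHNAME as a vector of octets.
(defun file-bytes (pathname)
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (let ((bytes (make-array (file-length in)
                             :element-type '(unsigned-byte 8))))
      (read-sequence bytes in)
      bytes)))

;; Hypothetical helper: decode octets as ISO-8859-1.
;; ASSUMPTION: (code-char n) is the Latin-1 character for 0 <= n <= 255.
(defun latin-1-string (bytes)
  (map 'string #'code-char bytes))

;; READ every form from PATHNAME through a string input stream, so that
;; no file-level line-terminator conversion is involved.
(defun read-all-forms (pathname)
  (with-input-from-string (in (latin-1-string (file-bytes pathname)))
    (loop :with eof = (gensym "EOF")
          :for form = (read in nil eof)
          :until (eq form eof)
          :collect form)))
```

Because the string input stream keeps track of the read position, there is no need to maintain it by hand between successive READ calls. Whether a #\Return inside a string literal survives depends on how the implementation reads from strings; the READ-FROM-STRING result quoted above suggests that CLISP keeps it.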
From: Pascal J. B. <pj...@in...> - 2013-05-22 20:30:57
|
-- __Pascal Bourguignon__ http://www.informatimago.com/ A bad day in () is better than a good day in {}. You can take the lisper out of the lisp job, but you can't take the lisp out of the lisper (; -- antifuchs |
From: Pascal J. B. <pj...@in...> - 2013-05-22 21:47:43
|
Matt Kaufmann <kau...@cs...> writes:

> Hi --
>
> The problem probably only arises when lines break in the middle of
> strings.  To see what I mean, replace '(defun fact (x) ...) in your
> definition of demo with the following.
>
> (concatenate 'string
>              "a" (string (code-char 13)) (string #\Newline) "b")
>
> Here are the results.  Notice that this time, the results are
> different (but that probably won't surprise you).
>

Ok, let's consider a string like:

(defparameter *str* "Hello
World")

Obviously, this string contains a new line.  Again, why do you care
whether there's a CRLF code sequence or just a LF code in the file?

CL-USER> (with-open-file (src "/tmp/a.lisp"
                              :external-format (ext:make-encoding
                                                :charset charset:iso-8859-1
                                                :line-terminator :dos))
           (read src))
(DEFPARAMETER *STR* "Hello
World")
CL-USER> (load "/tmp/a.lisp"
               :external-format (ext:make-encoding
                                 :charset charset:iso-8859-1
                                 :line-terminator :dos))
;; Loading file /tmp/a.lisp ...
;; Loaded file /tmp/a.lisp
#P"/tmp/a.lisp"
CL-USER> (length *str*)
11
CL-USER>

On the other hand, if you care whether your sequence contains codes
13 10 or just 10, why do you use strings?

(concatenate 'vector #(93) #(13) #(10) #(94))
--> #(93 13 10 94)

or just:

(vector 93 13 10 94)
--> #(93 13 10 94)

or just:

#(93 13 10 94)

Now if you want to insert a lot of ASCII-encoded bytes, you can always
write a reader macro:

(defun c-escaped-character-map (escaped-character)
  (case escaped-character
    ((#\newline) -1)
    ((#\a)  7)
    ((#\b)  8)
    ((#\t)  9)
    ((#\n) 10)
    ((#\v) 11)
    ((#\f) 12)
    ((#\r) 13)
    ((#\x) :hexa)
    ((#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7) :octal)
    (otherwise :default)))

(defun character-code-reader-macro (stream quotation-mark)
  (declare (ignore quotation-mark))
  (flet ((encode (ch)
           ;; TODO: Use babel or something else to get the unicode
           ;; code-point of the character.
           (char-code ch)))
    (let ((ch (read-char stream)))
      (if (char= #\\ ch)
          ;; LET* so that CODE is computed from the escaped character,
          ;; not from the backslash.
          (let* ((ch   (read-char stream))
                 (code (c-escaped-character-map ch)))
            (flet ((read-code (*read-base* base-name)
                     (let ((code (read stream)))
                       (if (and (integerp code)
                                (<= 0 code (1- char-code-limit)))
                           code
                           (error "Invalid ~A character code: ~A"
                                  base-name code)))))
              (case code
                (:hexa (read-code 16 "hexadecimal"))
                (:octal
                 ;; The first octal digit has already been consumed;
                 ;; put it back so that READ sees the whole number.
                 (unread-char ch stream)
                 (read-code 8 "octal"))
                (:default ;; In emacs ?\x = ?x
                 (encode ch))
                (otherwise code))))
          ;; or use #+clisp ext:convert-string-to-bytes :
          (encode ch)))))

(set-macro-character #\? 'character-code-reader-macro t)

#(?a ?\a ?\r ?\n ?b ?\b ?\x41 ?\61 ?\\ ?\z ?' ?\')
--> #(97 7 13 10 98 8 65 49 92 122 39 39)

(See also: http://paste.lisp.org/display/137262 for a C string reader.)

> Anyhow, maybe that answers your question:
>
>>> How is READ related to CRLF vs. CR vs. LF?
>
> That is: the issue is when CRLF or CR or LF is in the middle of a
> string object.
>
> As I mentioned in my preceding email, I would like READ to invert
> PRIN1.  This seems a natural thing to want, though I'm not claiming
> it's required of CLISP or any other Lisp.

Again, what is in the string is a newline.  What clisp will read is a
newline, and what clisp will print is a newline.  Newlines everywhere.
:-)

If you care about the codes, then you should use binary streams, and
read and write bytes, not text.  READ and PRIN1 read and write text.

What YOU should not do is insert non-character characters such as
#\return into strings.  For one thing, they make your program
non-conforming, since they are only semi-standard (i.e. an
implementation may just not have them).

(concatenate 'string "\"" "a" (string (code-char 13)) (string #\Newline) "b" "\"")
                              --------------
                                    ^
                                    |
  The error is here ----------------+

> [5]> (demo)
>
> UNIX
> 00000000: 0A 22 61 0D 0A 62 22 ?"a??b"
>
> "a
> b"

If you consider that this file is wrongly encoded (I could agree with
you on this point, IF I admitted #\return (and other such strange
"characters") in strings), then I will argue that the following file
is also ill-formed:

> DOS
> 00000000: 0D 0A 22 61 0D 0D 0A 62 22 ??"a???b"
>
> "a
>
> b"

Because a stray CR in a DOS file is not a good idea either.

Again, are we talking about text files?  Or about teletype control
binary streams?

It is not only #\return and its ilk that you should avoid in strings.
Let's take for example #\xd800.  You should not insert this so-called
"character" into strings either, because it is not a character.  It's
a unicode code point that doesn't encode any character (or even any
character part!).  If you were to put such a "character" in a clisp
string, and write out a file (e.g. using utf-8 or utf-16 encoding), you
would most probably create an invalid file.  Just like your two files
above.  (The first is not a valid unix text file, the second is not a
valid DOS text file.)

By the way, some implementations just don't have a character with code
#xd800:

#+ccl (code-char #xd800) --> NIL

The codes between 0 and 31, 127, and between 128 and 159, to talk only
of the codes in the iso-8859-1 range, are similar: they don't encode
characters, and you should just NOT include them in any string, and of
course, not write them in a TEXT file (you can write those codes in a
binary file, if such a binary file format requires them).

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
A bad day in () is better than a good day in {}.
You can take the lisper out of the lisp job, but you can't take the lisp out
of the lisper (; -- antifuchs
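[Editorial sketch] One way to act on the advice above is to scan a string for the ISO-8859-1 control codes (0 to 31, 127, and 128 to 159) before writing it to a text file. The names CONTROL-CODE-P and SUSPICIOUS-CHARACTERS are invented for this illustration. As an assumption of the sketch, #\Newline is exempted, since it is the one such code a text file cannot do without; every other control code, including tab, is reported.

```lisp
;; True if CODE falls in the control ranges named in the message above.
(defun control-code-p (code)
  (or (<= 0 code 31)
      (= code 127)
      (<= 128 code 159)))

;; Return a list of (index . code) pairs, one for each control character
;; in STRING other than the newline character (ASSUMPTION: newline is
;; the only control character we want to tolerate in text).
(defun suspicious-characters (string)
  (loop :for ch :across string
        :for index :from 0
        :for code = (char-code ch)
        :when (and (char/= ch #\Newline)
                   (control-code-p code))
          :collect (cons index code)))
```

Applied to the four-character string from this thread ("a", code 13, newline, "b"), this flags only the character at index 1. An application could signal an error on a non-empty result instead of silently writing the string out.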
From: Matt K. <kau...@cs...> - 2013-05-23 00:29:02
|
Hi --

Certainly I'd never knowingly write a #\Return, i.e. (code-char 13),
into a text file.  This problem showed up when a user of my
application created a file with #\Return characters, probably on a
Windows system, that I read back in on a Linux system.  Even then,
it's not exactly a huge problem, since the #\Return in front of each
#\Newline was dropped by READ ("newlines everywhere", as you say).
But we support 7 host Lisps, and our application, ACL2, was
complaining because a checksum computed was different for CLISP than
for the other six Lisps.  (I don't want to go into the whole story
about how ACL2 "certifies books", computes checksums, etc....)

Anyhow, thanks for your time.  I think I see your point and I may
simply not worry about the dropped #\Return characters.  Or perhaps we
should indeed disallow non-text characters such as #\Return in text
files; I may consider that.  As I've stated a couple of times, we
would like to read back in what was written; but I don't want to
defend that.  I can live with CLISP not behaving like those other six
Lisps, and I think you've answered my original question:

  Again, what is in the string is a newline.  What clisp will read is a
  newline, and what clisp will print is a newline.  Newlines everywhere.
  :-)

-- Matt
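[Editorial sketch] If the goal is for a character-level computation to agree across host Lisps, one option (not something proposed in the thread itself) is to normalize line endings explicitly after reading, so that every implementation sees the same characters regardless of how its file streams treat CR. NORMALIZE-NEWLINES is a name invented for this illustration; it maps each CR LF pair and each lone CR to a single #\Newline, which matches the "CR, LF and CRLF are all newlines" behaviour described above for the DOS dump (0D 0D 0A becomes two newlines).

```lisp
;; Return a copy of STRING in which each CR LF pair and each lone CR
;; (character code 13) is replaced by a single newline character.
(defun normalize-newlines (string)
  (with-output-to-string (out)
    (let ((cr (code-char 13))
          (end (length string)))
      (do ((i 0 (1+ i)))
          ((>= i end))
        (let ((ch (char string i)))
          (cond ((char= ch cr)
                 (write-char #\Newline out)
                 ;; Skip the LF that completes a CR LF pair.
                 (when (and (< (1+ i) end)
                            (char= (char string (1+ i)) #\Newline))
                   (incf i)))
                (t
                 (write-char ch out))))))))
```

On an implementation that already drops the CR while reading, this function leaves the string unchanged, so applying it everywhere is harmless.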
From: Pascal J. B. <pj...@in...> - 2013-05-23 18:33:59
|
Matt Kaufmann <kau...@cs...> writes:

> Hi --
>
> Certainly I'd never knowingly write a #\Return, i.e. (code-char 13),
> into a text file.  This problem showed up when a user of my
> application created a file with #\Return characters, probably on a
> Windows system, that I read back in on a Linux system.  Even then,
> it's not exactly a huge problem, since the #\Return in front of each
> #\Newline was dropped by READ ("newlines everywhere", as you say).
> But we support 7 host Lisps, and our application, ACL2, was
> complaining because a checksum computed was different for CLISP than
> for the other six Lisps.  (I don't want to go into the whole story
> about how ACL2 "certifies books", computes checksums, etc....)

Too bad, because that's obviously where the problem lies!

You should do the checksum either at the level of the characters in
lisp, or at the level of the bytes in the file.  Having the checksums
differ because clisp ignores a CR while reading a text file shows
that the checksum is not computed as it should be.

But I won't elaborate, since you don't want to go into it. ;-)

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
A bad day in () is better than a good day in {}.
You can take the lisper out of the lisp job, but you can't take the lisp out
of the lisper (; -- antifuchs
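[Editorial sketch] A checksum "at the level of the bytes in the file" can be computed by opening the file as a binary stream, so that no character decoding or line-terminator handling takes place. The function name FILE-BYTE-CHECKSUM and the toy polynomial hash below are illustrative assumptions only; this is not ACL2's actual checksum algorithm.

```lisp
;; Toy checksum over the raw octets of PATHNAME: a base-31 polynomial
;; hash reduced modulo 2^32.  Every byte counts, including each CR.
(defun file-byte-checksum (pathname)
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (loop :with sum = 0
          :for byte = (read-byte in nil nil)
          :while byte
          :do (setf sum (mod (+ (* 31 sum) byte) (expt 2 32)))
          :finally (return sum))))
```

Since the stream has element type (UNSIGNED-BYTE 8), only standard binary I/O is involved, and the result should not depend on which Lisp reads the file. Two files that differ only in CR bytes get different checksums, which is the opposite trade-off from a checksum over the characters returned by READ.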