From: <don...@is...> - 2012-01-31 00:12:52
Was there ever a conclusion to this discussion? A workaround?
From: Sam S. <sd...@gn...> - 2012-01-31 16:41:10
> * Don Cohen <qba...@vf...3-vap.pbz> [2012-01-30 16:12:54 -0800]:
>
> Was there ever a conclusion to this discussion?

something has to be done; it requires a certain amount of work.
http://www.cygwin.com/acronyms/#PTC

> A workaround?

use strings, syscalls/stdio, and make-stream from the fd.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.memritv.org http://truepeace.org http://memri.org
http://dhimmi.com http://www.PetitionOnline.com/tap12009/
http://pmw.org.il http://jihadwatch.org
MS: our tomorrow's software will run on your tomorrow's HW at today's speed.
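Sam's one-line workaround can be spelled out as a sketch: obtain a file
descriptor without going through Lisp pathname (and thus wildcard)
processing, then wrap it in a stream with EXT:MAKE-STREAM. The open(2)
binding below is a hypothetical FFI declaration, not a documented CLISP
API - your build's syscalls module may provide an equivalent wrapper,
so adjust to whatever it actually exposes:

```lisp
;; Hypothetical FFI binding to open(2); the name %open and its
;; availability are assumptions, not part of any CLISP module.
(ffi:def-call-out %open (:name "open")
  (:arguments (path ffi:c-string) (flags ffi:int))
  (:return-type ffi:int)
  (:language :stdc))

;; The file name goes to the OS as-is (modulo the FFI's string
;; encoding), so characters like * or ? are never treated as wild.
(defun open-literal-file (name)
  (let ((fd (%open name 0)))            ; 0 = O_RDONLY on Linux
    (if (minusp fd)
        (error "open(2) failed for ~S" name)
        (ext:make-stream fd :direction :input
                            :element-type '(unsigned-byte 8)))))
```

This sidesteps OPEN and DIRECTORY entirely; "use strings" then just
means keeping the raw names as strings (or byte vectors) instead of
parsing them into pathnames.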
From: <don...@is...> - 2012-02-02 21:33:30
Sam Steingold writes:
> > A workaround?
> use strings, syscalls/stdio, and make-stream from the fd.

On rereading I'm not clear on what you had in mind here. Where do I
get an fd for a file that I can't open? And how do strings help?

One workaround that I find useful is

  (ext:run-program "mv" :arguments (list wildcardname nonwildcardname))

then read the nonwildcardname file, or write it and then mv it back.

P.S. Info says under 13.4 `touch': Change file timestamps:

   Some operating systems and file systems support a fourth time: the
   birth time, when the file was first created; by definition, this
   timestamp never changes.

Also, sorry about resubmitting the last bug report.
From: Fred C. <fc...@al...> - 2012-01-31 17:12:40
A suggestion: how about a variable called **RAW-FILES** or some such
thing that ignores wildcards and uses UTF-8 or BYTE for all file I/O,
including all filesystem calls. So an open will use the UTF-8 (or
byte) sequence for the pathname, do whatever the OS does on an open
call, and return the value the OS returns from the call. Input goes to
UTF-8 or BYTE arrays, and output comes from them. Error returns are
handled by returning the OS value. DIRECTORY should also allow a
next-entry walk through a directory, returning the name provided,
regardless of type; the type should be requested by the user before
use (unless they want to crash). This all being in **RAW-FILES** mode,
it will have no negative effect on anything else, and will allow
OS-level things to be done at the author's risk (and reward).

FC

On 1/31/12 8:40 AM, Sam Steingold wrote:
>> * Don Cohen <qba...@vf...3-vap.pbz> [2012-01-30 16:12:54 -0800]:
>>
>> Was there ever a conclusion to this discussion?
> something has to be done; it requires a certain amount of work.
> http://www.cygwin.com/acronyms/#PTC
>
>> A workaround?
> use strings, syscalls/stdio, and make-stream from the fd.

--
-This is confidential to the parties I intend it to serve-
Fred Cohen & Associates        tel/fax: 925-954-5876 / 454-0171
http://all.net/        572 Leona Drive        Livermore, CA 94550
From: Sam S. <sd...@gn...> - 2012-01-31 17:41:48
> * Fred Cohen <sp...@ny...g> [2012-01-31 08:48:54 -0800]:
>
> A suggestion: how about a variable called **RAW-FILES** or some such
> thing that ignores wildcards and uses UTF-8 or BYTE for all file I/O,
> including all filesystem calls.

I don't like this:

1. people would set **RAW-FILES** to T and then complain that clisp is
   non-compliant.

2. one has to set it to NIL before (DIRECTORY "/sfdg/*") and then reset
   it to T when processing the returned data.

3. no other lisp does that; this makes cross-platform coding hard.

TRT, IMO, is to quote wild characters in some way, either escaping them
with backslashes or using a special type of "wild strings" vs. "literal
strings".

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
WHO ATE MY BREAKFAST PANTS?
From: Fred C. <fc...@al...> - 2012-01-31 18:31:44
On 1/31/12 9:41 AM, Sam Steingold wrote:
> TRT, IMO, is to quote wild characters in some way, either escaping them
> with backslashes or using a special type of "wild strings" vs. "literal
> strings".

Go for it. Just make certain that all UTF-8 byte values are legal for
all operations and that we can get everything in terms of those
byte/UTF-8 sequences, and I will be happy enough (assuming it works as
sold).

As an aside, it is still important not to simply error out on a type
of thing in a directory that isn't a file, link, special file, etc. I
really think you should allow for open(directory), get-next-entry,
etc., till the last entry, with type checking on each entry
separately. This is a far better way to allow folks to span directory
trees without consuming arbitrary amounts of memory (and crashing) on
things like directories with 500 million files in them.

FC
From: <don...@is...> - 2012-01-31 20:28:19
Fred Cohen writes:
> > TRT, IMO, is to quote wild characters in some way, either escaping
> > them with backslashes or using a special type of "wild strings"
> > vs. "literal strings".
> Go for it. Just make certain that all UTF-8 byte values are legal for
> all operations and that we can get everything in terms of those
> byte/UTF-8 sequences, and I will be happy enough (assuming it works as

I don't think that UTF is quite the right thing, but this raises
another interesting point. It seems odd that you should have to know
what character set a file system uses in order to read and process a
directory. For instance, you should be able to copy a directory
accurately without that information. This suggests that we need an
encoding that maps 1-1 between bytes and characters.

I notice that CHARSET:ISO-8859-1 is almost right:

  (with-open-file (f "/tmp/bytes" :direction :output
                     :element-type '(unsigned-byte 8)
                     :if-does-not-exist :create)
    (loop for i below 256 do (write-byte i f)))

  (with-open-file (f "/tmp/bytes" :external-format CHARSET:ISO-8859-1)
    (loop for i from 0
          while (setf c (read-char f nil nil))
          unless (= i (char-code c)) do (princ (cons i (char-code c)))))
  (13 . 10)

Is there a way to separate CR from LF, or to create an encoding with
that property? We should be able to get back from (directory ...) one
pathname containing a CR and another containing a LF.

In the past I've always resorted to binary I/O in such cases, but that
doesn't seem to be an option in the case of (directory ...). If I had
such an encoding then perhaps I would not need to read files as bytes
and then translate them to characters via code-char.
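If newline conversion is the only deviation, CLISP's EXT:MAKE-ENCODING
may already give the 1-1 mapping Don is asking for: an encoding built
from CHARSET:ISO-8859-1 with :LINE-TERMINATOR :UNIX treats only LF as
#\Newline, so CR should come through unchanged. A hedged sketch,
reusing the /tmp/bytes file created above:

```lisp
;; Sketch: with the :unix line-terminator convention, only byte 10 is
;; mapped to #\Newline (code 10 in CLISP), so every byte i should read
;; back as the character with code i.
(with-open-file (f "/tmp/bytes"
                   :external-format (ext:make-encoding
                                     :charset charset:iso-8859-1
                                     :line-terminator :unix))
  (loop for i from 0
        for c = (read-char f nil nil)
        while c
        unless (= i (char-code c)) do (princ (cons i (char-code c)))))
```

If the assumption holds, this loop prints nothing - though whether
DIRECTORY can be told to use such an encoding is a separate question.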
From: Pascal J. B. <pj...@in...> - 2012-01-31 20:56:59
don...@is... (Don Cohen) writes:
> Is there a way to separate CR from LF, or to create an encoding with
> that property? We should be able to get back from (directory ...) one
> pathname containing a CR and another containing a LF.
>
> In the past I've always resorted to binary I/O in such cases, but that
> doesn't seem to be an option in the case of (directory ...).
> If I had such an encoding then perhaps I would not need to read files
> as bytes and then translate them to characters via code-char.

Unix considers pathnames to be sequences of bytes. Yes, binary.
Pathname components cannot contain the bytes 0 or 47, but otherwise
all the other values from 1 to 255 are valid. File systems will indeed
contain pathnames whose bytes are obtained from encoding strings using
various coding systems. And a pathname component that contains the
bytes 10, 13, and 13+10 in sequence is perfectly valid.

So if you want to design a CL physical pathname that is able to
represent all the Unix pathnames, you need either to find a way to
encode/decode vectors of bytes into strings, or merely to define some
data type to represent vectors of bytes as valid pathname components:

  valid pathname directory n. a string, a list of strings, nil, :wild,
  :unspecific, or some other object defined by the implementation to
  be a valid directory component.

  valid pathname name n. a string, nil, :wild, :unspecific, or some
  other object defined by the implementation to be a valid pathname
  name.

I wouldn't mind allowing vectors of bytes as physical pathname
components, and returning a vector of bytes as soon as the pathname
component doesn't contain only bytes encoding ASCII printable
characters. The application may always use babel to convert between
vectors of bytes and strings, if it can determine an encoding and a
mapping for control codes.

But I guess one may argue for an encoding such as URL encoding, which
could be useful to write wildcard pathname components as strings:
"%e9*%e9" vs. #(233 42 233). But "%e*%e9" would be wrong.

--
__Pascal Bourguignon__                 http://www.informatimago.com/
A bad day in () is better than a good day in {}.
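Pascal's URL-encoding idea can be sketched as follows. The function
name is hypothetical and only the encoding direction is shown; the
point is that bytes outside printable ASCII get %-escaped while
wildcard characters like * stay visible as themselves, so "%E9*%E9"
denotes a wild name while #(233 42 233) would denote the literal one:

```lisp
(defun bytes-to-url-string (bytes)
  ;; Hypothetical helper: render a pathname-component byte vector as
  ;; a string, %-escaping the escape character itself and everything
  ;; outside printable ASCII (codes 33-126).
  (with-output-to-string (s)
    (loop for b across bytes
          if (and (<= 33 b 126) (/= b (char-code #\%)))
            do (write-char (code-char b) s)
          else do (format s "%~2,'0X" b))))

;; (bytes-to-url-string #(233 42 233)) => "%E9*%E9"
```

A decoder would do the reverse, and a "literal string" variant would
escape * and ? as well.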
From: Sam S. <sd...@gn...> - 2012-01-31 21:39:36
> * Fred Cohen <sp...@ny...g> [2012-01-31 10:31:28 -0800]:
>
> On 1/31/12 9:41 AM, Sam Steingold wrote:
>> TRT, IMO, is to quote wild characters in some way, either escaping
>> them with backslashes or using a special type of "wild strings"
>> vs. "literal strings".
> Go for it.

Thanks. Are you volunteering?
http://www.cygwin.com/acronyms/#PTC

> I really think you should allow for open(directory),
> get-next-entry, etc., till the last entry, with type checking on each
> entry separately. This is a far better way to allow folks to span
> directory trees without consuming arbitrary amounts of memory (and
> crashing) on things like directories with 500 million files in them.

http://clisp.podval.org/impnotes/syscalls.html#file-tree-walk

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
To a Lisp hacker, XML is S-expressions with extra cruft.
From: <don...@is...> - 2012-02-01 18:53:51
Sam Steingold writes:
> http://clisp.podval.org/impnotes/syscalls.html#file-tree-walk

This also solves my CR vs. LF problem (the file names it returns don't
replace CRs with LFs or vice versa). I guess that means it doesn't
process the file names with any encodings. It also uses strings
instead of Lisp pathnames to represent those file names. Just the
thing I need. Yay!

The doc says fd-limit defaults to 5 but not what it means/controls.

It would also be useful if the depth could be controlled, i.e., only
call the function with the last argument less than some argument (I'd
have called that argument depth). I wonder when one would want to use
the current depth argument. It also seems possibly equally or more
useful to report directories before the things in them.
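Depth limiting can also be done in the callback itself. A hedged
sketch - the callback's argument list below is an assumption based on
the nftw(3) analogy (file name, stat info, flag, base offset, level),
so check the impnotes for the exact signature your CLISP provides:

```lisp
;; Sketch: collect file names at most MAX-DEPTH levels below TOP,
;; using the level argument to skip deeper entries.
(defun shallow-file-list (top max-depth)
  (let ((acc '()))
    (posix:file-tree-walk
     top
     (lambda (file stat flag base level)
       (declare (ignore stat flag base))
       (when (< level max-depth)
         (push file acc))
       0)                     ; 0 = "continue walking", as in nftw()
     :fd-limit 5)             ; cf. nftw()'s nopenfd: max open dir fds
    (nreverse acc)))
```

Note that nftw() still descends into the deeper directories; the
callback merely ignores them, so this saves memory but not traversal
time.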
From: Sam S. <sd...@gn...> - 2012-02-01 19:12:47
> * Don Cohen <qba...@vf...3-vap.pbz> [2012-02-01 10:53:39 -0800]:
>
> Sam Steingold writes:
> > http://clisp.podval.org/impnotes/syscalls.html#file-tree-walk
> The doc says fd-limit defaults to 5 but not what it means/controls.

this is a shallow interface to nftw(), i.e., you should ask this
question of the libc people, not here.

> It would also be useful if the depth could be controlled, i.e.,
> only call the function with the last argument less than some argument
> (I'd have called that argument depth). I wonder when one would want
> to use the current depth argument.

I think the second question answers the first one :-)

> It also seems possibly equally or more useful to report directories
> before the things in them.

maybe, but you have to modify nftw for that.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
XML is like violence. If it doesn't solve the problem, use more.
From: <Joe...@t-...> - 2012-02-03 10:11:09
Don Cohen wrote:
> I notice that CHARSET:ISO-8859-1 is almost right:
>   (with-open-file (f "/tmp/bytes" :direction :output
>                      :element-type '(unsigned-byte 8)
>                      :if-does-not-exist :create)
>     (loop for i below 256 do (write-byte i f)))

This test may have fooled you. Line-terminator transformation in
stream functions is different from usage in the FFI or via
ext:convert-string-to/from-bytes.

However, for pathnames, these days I advise against using Latin-1 on
the sole merit that it happens to be 1:1. Modern UNIX environments use
UTF-8, and we've seen enough of those badly programmed apps that
output "¶" when they should not.

Round-trips are not trivial. For instance, an ssh or sshfs from Linux
to MacOS shows a bug *somewhere* among sshfs, bash, readline and one
of the two OSes: you'll discover that ä reveals itself as ¨ + a! (I
noticed this when using backspace in bash within ssh.)

Regards,
Jörg Höhle
From: <don...@is...> - 2012-02-03 19:05:28
>> I notice that CHARSET:ISO-8859-1 is almost right:
>>   (with-open-file (f "/tmp/bytes" :direction :output
>>                      :element-type '(unsigned-byte 8)
>>                      :if-does-not-exist :create)
>>     (loop for i below 256 do (write-byte i f)))
> This test may have fooled you. Line-terminator transformation in
> stream functions is different from usage in the FFI or via
> ext:convert-string-to/from-bytes.

I don't understand what you think might be confusing. I hope you agree
that the code above simply writes all of the 8-bit bytes to a file.
The code that you did not include:

  (with-open-file (f "/tmp/bytes" :external-format CHARSET:ISO-8859-1)
    (loop for i from 0
          while (setf c (read-char f nil nil))
          unless (= i (char-code c)) do (princ (cons i (char-code c)))))
  (13 . 10)

shows that reading with external-format CHARSET:ISO-8859-1 recovers
all of those bytes as corresponding characters except for CR => LF. If
I could create an encoding that printed nothing on the example above,
then I would be happy to use it for reading pathnames and lots of
other things that I now read as bytes.

> However, for pathnames, these days I advise against using Latin-1 on
> the sole merit that it happens to be 1:1. Modern UNIX environments
> use UTF-8 ...

I don't know how to interpret this "use UTF-8". It looks to me like
Unix file names are sequences of bytes, not restricted to things that
can be parsed as UTF-8. What we need for reading Unix file names as
character strings seems to be the encoding that I wish I had - one
that maps 1-1 between chars and bytes.

> Round-trips are not trivial. ...

Again, I don't understand what you're trying to tell me here. Does
this have something to do with Lisp or reading file names?
From: <Joe...@t-...> - 2012-02-06 15:48:35
Don Cohen wrote:
> The code that you did not include:
>   (with-open-file (f "/tmp/bytes" :external-format CHARSET:ISO-8859-1)
>     (loop for i from 0
>           while (setf c (read-char f nil nil))
>           unless (= i (char-code c)) do (princ (cons i (char-code c)))))
>   (13 . 10)
> shows that reading with external-format CHARSET:ISO-8859-1 recovers
> all of those bytes as corresponding characters except for CR => LF.

Please repeat the test using ext:convert-string-to/from-bytes rather
than character-based stream functions.

>> Modern UNIX environments use UTF-8
> Again, I don't understand what you're trying to tell me here.

What I mean is that the average UNIX FS these days is configured to
use UTF-8. I advise against using ISO-8859-1 to read UNIX file names
into Lisp strings on the basis that it's a 1:1 encoding. Only UTF-8
appears like a reasonable default choice nowadays (you may always
override custom:*pathname-encoding*), perhaps with Pascal's added
suggestion about polymorphism: return a string if it can be read as
UTF-8, otherwise a byte array. Uh oh. Not ideal, but IMHO better in
some way than misrepresenting all UTF-8 Umlauts using Latin-1. This is
not Python 1.x!

Regards,
Jörg Höhle
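The repeat that Jörg asks for might look like this (a sketch; it
bypasses streams entirely, so no line-terminator layer is involved):

```lisp
;; Convert all 256 byte values at once and compare codes positionally.
(let* ((bytes (coerce (loop for i below 256 collect i)
                      '(vector (unsigned-byte 8))))
       (chars (ext:convert-string-from-bytes bytes charset:iso-8859-1)))
  (loop for i below 256
        unless (= i (char-code (char chars i)))
          do (princ (cons i (char-code (char chars i))))))
```

If Jörg's point is right, this prints nothing: byte 13 comes back as
the character with code 13, confirming that the CR => LF folding Don
saw belongs to the stream layer, not to the ISO-8859-1 encoding.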
From: <don...@is...> - 2012-02-06 21:13:07
Joe...@t-... writes:
> Please repeat the test using ext:convert-string-to/from-bytes rather
> than character-based stream functions.

I don't understand what test you have in mind here. You mean read the
file as bytes and then convert to a string? That does seem to preserve
the difference between CR and LF, though I'm not exactly sure why -
does it depend on the encoding? I gather there's no way to get that
result with character I/O. And note that the directory function does
not offer the choice of characters vs. bytes. So I see no way to use
only ANSI standard functions in clisp that can distinguish between
files with names containing CRs and LFs.

> What I mean is that the average UNIX FS these days is configured to
> use UTF-8.

What can that mean, given that you can put any sequence of bytes not
containing / or null into a file name? I see no character-set
arguments, e.g., in man mkfs.ext4(8). I suppose it has more to do with
how keyboard events are interpreted and how sequences of bytes are
displayed in windows than with anything related to the file system.

> I advise against using ISO-8859-1 to read UNIX file names into Lisp
> strings on the basis that it's a 1:1 encoding. Only UTF-8 appears
> like a reasonable default choice nowadays (you may always override
> custom:*pathname-encoding*), perhaps with Pascal's added suggestion
> about polymorphism: return a string if it can be read as UTF-8,
> otherwise a byte array. Uh oh. Not ideal, but IMHO better in some
> way than misrepresenting all UTF-8 Umlauts using Latin-1. This is
> not Python 1.x!

I think your preference must be related to the fact that these
characters mean more to you than to me, and you imagine that when you
get a file from some other place, the intent of the creator was that
the bytes in the name be interpreted as UTF-8. This is not necessarily
the case. If you want to search for file names containing Umlauts,
then some such assumption is necessary, but for many other purposes,
such as copying directories, it is not.
From: <Joe...@t-...> - 2012-02-10 13:35:27
Hi,

Don Cohen wrote:
> So I see no way to use only ANSI standard functions in clisp that can
> distinguish between files with names containing CRs and LFs.

Use custom:*pathname-encoding* as iso-8859-1 to work with strings.

> You mean read the file as bytes and then convert to a string?
> That does seem to preserve the difference between CR and LF

As you've verified, that encoding is truly 1:1. It's only with
character streams that funny things happen.

One extra idea would be to have DIRECTORY, EXT:DIR and Sam's
POSIX:FILE-TREE-WALK function
http://clisp.podval.org/impnotes/syscalls.html#file-tree-walk
accept an extra ENCODING keyword:

  &key (encoding custom:*pathname-encoding*)
  :ENCODING NIL => deliver a byte array

Then the directory-traversal code must be made robust (not leak
memory) w.r.t. the :INPUT-ERROR-ACTION of MAKE-ENCODING. The key is
not to accept custom:*pathname-encoding* being NIL, as that would
break too many expectations.

Well, that's only directory traversal. There is no solution for
rename-file etc. but having custom:*pathname-encoding* be iso-8859-1
in your program.

With byte arrays, I'd use

  (let* ((chars (ext:convert-string-from-bytes
                 bytes (ext:make-encoding :charset charset:utf-8
                                          :input-error-action :ignore)))
         (bytes2 (ext:convert-string-to-bytes
                  chars (ext:make-encoding :charset charset:utf-8
                                           :output-error-action :ignore))))
    (if (equalp bytes bytes2)
        chars  #| round-trip check passed |#
        bytes))

This would produce either printable UTF-8 strings or byte arrays on my
system. (Actually, I'd write this code using custom:*pathname-encoding*,
but you might have rebound that to iso-8859-1, which does not match my
LANG/LC_MESSAGES.)

If you have bound custom:*pathname-encoding* to iso-8859-1, you can
still apply a triple-conversion trick with custom:*terminal-encoding*
before printing such names or saving them to a file (using byte arrays
if unprintable).

Regards,
Jörg Höhle
From: <don...@is...> - 2012-02-10 20:20:36
> > So I see no way to use only ANSI standard functions in clisp that
> > can distinguish between files with names containing CRs and LFs.
> Use custom:*pathname-encoding* as iso-8859-1 to work with strings.

That seems to work:

  (with-open-file (f (concatenate 'string "/tmp/foo/"
                                  (convert-string-from-bytes
                                   #(65 13 66 10 67) CHARSET:ISO-8859-1))
                     :direction :output))
  NIL

then

  (directory "/tmp/foo/*")
  (#P"/tmp/foo/A^MB C")

(I have to doctor the output to show you what appears on my screen.)

I didn't expect this to work and still don't understand why it does.
Why isn't this screwed up by the line terminator of the encoding? I
expected to get the same string as would have been read from a file
with the same bytes, and that would have contained two "newline"
characters. ... Or so I thought - is this a bug, or am I
misunderstanding something important?

  [102]> (with-open-file (f "/tmp/foo/content"
                            :element-type '(unsigned-byte 8))
           (loop while (print (read-byte f nil nil)) do t))
  97
  13
  98
  10
  99
  NIL
  NIL
  [103]> (with-open-file (f "/tmp/foo/content"
                            :external-format CHARSET:ISO-8859-1)
           (loop while (print (read-char f nil nil)) do t))
  #\a
  #\Newline
  #\b
  #\c
  NIL
  NIL
  [104]>

Why didn't read-char show me the LF between the b and the c?
From: <Joe...@t-...> - 2012-02-13 15:31:38
Don Cohen wrote:
>> Use custom:*pathname-encoding* as iso-8859-1 to work with strings.
> That seems to work:
> Why isn't this screwed up by the line terminator of the encoding?
> I expected to get the same string as would have been read from a
> file with the same bytes.

The character streams add another layer on top of encodings, namely
the line-terminator handling. That's the point that has been confusing
you all the time. FFI encodings act like
ext:convert-string-to/from-bytes; only the READ-CHAR family adds
#\Newline conversions.

Regards,
Jörg Höhle