From: <don...@is...> - 2012-01-31 00:12:52
Was there ever a conclusion to this discussion? A workaround?
From: Sam S. <sd...@gn...> - 2012-01-31 16:41:10
> * Don Cohen <qba...@vf...3-vap.pbz> [2012-01-30 16:12:54 -0800]:
>
> Was there ever a conclusion to this discussion?

something has to be done; it requires a certain amount of work.
http://www.cygwin.com/acronyms/#PTC

> A workaround?

use strings, syscalls/stdio, and make-stream from the fd.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.memritv.org http://truepeace.org http://memri.org
http://dhimmi.com http://www.PetitionOnline.com/tap12009/
http://pmw.org.il http://jihadwatch.org
MS: our tomorrow's software will run on your tomorrow's HW at today's speed.
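Sam's one-line workaround can be spelled out as a sketch: obtain a file
descriptor without going through Lisp pathname (and thus wildcard)
processing, then wrap it in a stream with EXT:MAKE-STREAM. The open(2)
binding below is a hypothetical FFI declaration, not a documented CLISP
API - your build's syscalls module may provide an equivalent wrapper,
so adjust to whatever it actually exposes:

```lisp
;; Hypothetical FFI binding to open(2); the name %open and its
;; availability are assumptions, not part of any CLISP module.
(ffi:def-call-out %open (:name "open")
  (:arguments (path ffi:c-string) (flags ffi:int))
  (:return-type ffi:int)
  (:language :stdc))

;; The file name goes to the OS as-is (modulo the FFI's string
;; encoding), so characters like * or ? are never treated as wild.
(defun open-literal-file (name)
  (let ((fd (%open name 0)))            ; 0 = O_RDONLY on Linux
    (if (minusp fd)
        (error "open(2) failed for ~S" name)
        (ext:make-stream fd :direction :input
                            :element-type '(unsigned-byte 8)))))
```

This sidesteps OPEN and DIRECTORY entirely; "use strings" then just
means keeping the raw names as strings (or byte vectors) instead of
parsing them into pathnames.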
From: <don...@is...> - 2012-02-02 21:33:30
Sam Steingold writes:
> > A workaround?
> use strings, syscalls/stdio, and make-stream from the fd.

On rereading I'm not clear on what you had in mind here. Where do I
get an fd for a file that I can't open? And how do strings help?

One workaround that I find useful is

  (ext:run-program "mv" :arguments (list wildcardname nonwildcardname))

then read the nonwildcardname file, or write it and then mv it back.

P.S. Info says under 13.4 `touch': Change file timestamps:

   Some operating systems and file systems support a fourth time: the
   birth time, when the file was first created; by definition, this
   timestamp never changes.

Also, sorry about resubmitting the last bug report.
From: Fred C. <fc...@al...> - 2012-01-31 17:12:40
A suggestion: how about a variable called **RAW-FILES** or some such
thing that ignores wildcards and uses UTF-8 or BYTE for all file I/O,
including all filesystem calls. So an open will use the UTF-8 (or
byte) sequence for the pathname, do whatever the OS does on an open
call, and return the value the OS returns from the call. Input goes to
UTF-8 or BYTE arrays, and output comes from them. Error returns are
handled by returning the OS value. DIRECTORY should also allow a
next-entry walk through a directory, returning the name provided,
regardless of type; the type should be requested by the user before
use (unless they want to crash). This all being in **RAW-FILES** mode,
it will have no negative effect on anything else, and will allow
OS-level things to be done at the author's risk (and reward).

FC

On 1/31/12 8:40 AM, Sam Steingold wrote:
>> * Don Cohen <qba...@vf...3-vap.pbz> [2012-01-30 16:12:54 -0800]:
>>
>> Was there ever a conclusion to this discussion?
> something has to be done; it requires a certain amount of work.
> http://www.cygwin.com/acronyms/#PTC
>
>> A workaround?
> use strings, syscalls/stdio, and make-stream from the fd.

--
-This is confidential to the parties I intend it to serve-
Fred Cohen & Associates        tel/fax: 925-954-5876 / 454-0171
http://all.net/        572 Leona Drive        Livermore, CA 94550
From: Sam S. <sd...@gn...> - 2012-01-31 17:41:48
> * Fred Cohen <sp...@ny...g> [2012-01-31 08:48:54 -0800]:
>
> A suggestion: how about a variable called **RAW-FILES** or some such
> thing that ignores wildcards and uses UTF-8 or BYTE for all file I/O,
> including all filesystem calls.

I don't like this:

1. people would set **RAW-FILES** to T and then complain that clisp is
   non-compliant.

2. one has to set it to NIL before (DIRECTORY "/sfdg/*") and then reset
   it to T when processing the returned data.

3. no other lisp does that; this makes cross-platform coding hard.

TRT, IMO, is to quote wild characters in some way, either escaping them
with backslashes or using a special type of "wild strings" vs. "literal
strings".

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
WHO ATE MY BREAKFAST PANTS?
From: Fred C. <fc...@al...> - 2012-01-31 18:31:44
On 1/31/12 9:41 AM, Sam Steingold wrote:
> TRT, IMO, is to quote wild characters in some way, either escaping them
> with backslashes or using a special type of "wild strings" vs. "literal
> strings".

Go for it. Just make certain that all UTF-8 byte values are legal for
all operations and that we can get everything in terms of those
byte/UTF-8 sequences, and I will be happy enough (assuming it works as
sold).

As an aside, it is still important not to simply error out on a type
of thing in a directory that isn't a file, link, special file, etc. I
really think you should allow for open(directory), get-next-entry,
etc., till the last entry, with type checking on each entry
separately. This is a far better way to allow folks to span directory
trees without consuming arbitrary amounts of memory (and crashing) on
things like directories with 500 million files in them.

FC
From: <don...@is...> - 2012-01-31 20:28:19
Fred Cohen writes:
> > TRT, IMO, is to quote wild characters in some way, either escaping
> > them with backslashes or using a special type of "wild strings"
> > vs. "literal strings".
> Go for it. Just make certain that all UTF-8 byte values are legal for
> all operations and that we can get everything in terms of those
> byte/UTF-8 sequences, and I will be happy enough (assuming it works as

I don't think that UTF is quite the right thing, but this raises
another interesting point. It seems odd that you should have to know
what character set a file system uses in order to read and process a
directory. For instance, you should be able to copy a directory
accurately without that information. This suggests that we need an
encoding that maps 1-1 between bytes and characters.

I notice that CHARSET:ISO-8859-1 is almost right:

  (with-open-file (f "/tmp/bytes" :direction :output
                     :element-type '(unsigned-byte 8)
                     :if-does-not-exist :create)
    (loop for i below 256 do (write-byte i f)))

  (with-open-file (f "/tmp/bytes" :external-format CHARSET:ISO-8859-1)
    (loop for i from 0
          while (setf c (read-char f nil nil))
          unless (= i (char-code c)) do (princ (cons i (char-code c)))))
  (13 . 10)

Is there a way to separate CR from LF, or to create an encoding with
that property? We should be able to get back from (directory ...) one
pathname containing a CR and another containing a LF.

In the past I've always resorted to binary I/O in such cases, but that
doesn't seem to be an option in the case of (directory ...). If I had
such an encoding then perhaps I would not need to read files as bytes
and then translate them to characters via code-char.
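If newline conversion is the only deviation, CLISP's EXT:MAKE-ENCODING
may already give the 1-1 mapping Don is asking for: an encoding built
from CHARSET:ISO-8859-1 with :LINE-TERMINATOR :UNIX treats only LF as
#\Newline, so CR should come through unchanged. A hedged sketch,
reusing the /tmp/bytes file created above:

```lisp
;; Sketch: with the :unix line-terminator convention, only byte 10 is
;; mapped to #\Newline (code 10 in CLISP), so every byte i should read
;; back as the character with code i.
(with-open-file (f "/tmp/bytes"
                   :external-format (ext:make-encoding
                                     :charset charset:iso-8859-1
                                     :line-terminator :unix))
  (loop for i from 0
        for c = (read-char f nil nil)
        while c
        unless (= i (char-code c)) do (princ (cons i (char-code c)))))
```

If the assumption holds, this loop prints nothing - though whether
DIRECTORY can be told to use such an encoding is a separate question.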
From: Pascal J. B. <pj...@in...> - 2012-01-31 20:56:59
don...@is... (Don Cohen) writes:
> Is there a way to separate CR from LF, or to create an encoding with
> that property? We should be able to get back from (directory ...) one
> pathname containing a CR and another containing a LF.
>
> In the past I've always resorted to binary I/O in such cases, but that
> doesn't seem to be an option in the case of (directory ...).
> If I had such an encoding then perhaps I would not need to read files
> as bytes and then translate them to characters via code-char.

Unix considers pathnames to be sequences of bytes. Yes, binary.
Pathname components cannot contain the bytes 0 or 47, but otherwise
all the other values from 1 to 255 are valid. File systems will indeed
contain pathnames whose bytes are obtained from encoding strings using
various coding systems. And a pathname component that contains the
bytes 10, 13, and 13+10 in sequence is perfectly valid.

So if you want to design a CL physical pathname that is able to
represent all the Unix pathnames, you need either to find a way to
encode/decode vectors of bytes into strings, or merely to define some
data type to represent vectors of bytes as valid pathname components:

  valid pathname directory n. a string, a list of strings, nil, :wild,
  :unspecific, or some other object defined by the implementation to
  be a valid directory component.

  valid pathname name n. a string, nil, :wild, :unspecific, or some
  other object defined by the implementation to be a valid pathname
  name.

I wouldn't mind allowing vectors of bytes as physical pathname
components, and returning a vector of bytes as soon as the pathname
component doesn't contain only bytes encoding ASCII printable
characters. The application may always use babel to convert between
vectors of bytes and strings, if it can determine an encoding and a
mapping for control codes.

But I guess one may argue for an encoding such as URL encoding, which
could be useful to write wildcard pathname components as strings:
"%e9*%e9" vs. #(233 42 233). But "%e*%e9" would be wrong.

--
__Pascal Bourguignon__                 http://www.informatimago.com/
A bad day in () is better than a good day in {}.
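Pascal's URL-encoding idea can be sketched as follows. The function
name is hypothetical and only the encoding direction is shown; the
point is that bytes outside printable ASCII get %-escaped while
wildcard characters like * stay visible as themselves, so "%E9*%E9"
denotes a wild name while #(233 42 233) would denote the literal one:

```lisp
(defun bytes-to-url-string (bytes)
  ;; Hypothetical helper: render a pathname-component byte vector as
  ;; a string, %-escaping the escape character itself and everything
  ;; outside printable ASCII (codes 33-126).
  (with-output-to-string (s)
    (loop for b across bytes
          if (and (<= 33 b 126) (/= b (char-code #\%)))
            do (write-char (code-char b) s)
          else do (format s "%~2,'0X" b))))

;; (bytes-to-url-string #(233 42 233)) => "%E9*%E9"
```

A decoder would do the reverse, and a "literal string" variant would
escape * and ? as well.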
From: Sam S. <sd...@gn...> - 2012-01-31 21:39:36
> * Fred Cohen <sp...@ny...g> [2012-01-31 10:31:28 -0800]:
>
> On 1/31/12 9:41 AM, Sam Steingold wrote:
>> TRT, IMO, is to quote wild characters in some way, either escaping
>> them with backslashes or using a special type of "wild strings"
>> vs. "literal strings".
> Go for it.

Thanks. Are you volunteering?
http://www.cygwin.com/acronyms/#PTC

> I really think you should allow for open(directory),
> get-next-entry, etc., till the last entry, with type checking on each
> entry separately. This is a far better way to allow folks to span
> directory trees without consuming arbitrary amounts of memory (and
> crashing) on things like directories with 500 million files in them.

http://clisp.podval.org/impnotes/syscalls.html#file-tree-walk

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
To a Lisp hacker, XML is S-expressions with extra cruft.
From: <don...@is...> - 2012-02-01 18:53:51
Sam Steingold writes:
> http://clisp.podval.org/impnotes/syscalls.html#file-tree-walk

This also solves my CR vs. LF problem (the file names it returns don't
replace CRs with LFs or vice versa). I guess that means it doesn't
process the file names with any encodings. It also uses strings
instead of Lisp pathnames to represent those file names. Just the
thing I need. Yay!

The doc says fd-limit defaults to 5 but not what it means/controls.

It would also be useful if the depth could be controlled, i.e., only
call the function with the last argument less than some argument (I'd
have called that argument depth). I wonder when one would want to use
the current depth argument. It also seems possibly equally or more
useful to report directories before the things in them.
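Depth limiting can also be done in the callback itself. A hedged
sketch - the callback's argument list below is an assumption based on
the nftw(3) analogy (file name, stat info, flag, base offset, level),
so check the impnotes for the exact signature your CLISP provides:

```lisp
;; Sketch: collect file names at most MAX-DEPTH levels below TOP,
;; using the level argument to skip deeper entries.
(defun shallow-file-list (top max-depth)
  (let ((acc '()))
    (posix:file-tree-walk
     top
     (lambda (file stat flag base level)
       (declare (ignore stat flag base))
       (when (< level max-depth)
         (push file acc))
       0)                     ; 0 = "continue walking", as in nftw()
     :fd-limit 5)             ; cf. nftw()'s nopenfd: max open dir fds
    (nreverse acc)))
```

Note that nftw() still descends into the deeper directories; the
callback merely ignores them, so this saves memory but not traversal
time.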
From: Sam S. <sd...@gn...> - 2012-02-01 19:12:47
> * Don Cohen <qba...@vf...3-vap.pbz> [2012-02-01 10:53:39 -0800]:
>
> Sam Steingold writes:
> > http://clisp.podval.org/impnotes/syscalls.html#file-tree-walk
> The doc says fd-limit defaults to 5 but not what it means/controls.

this is a shallow interface to nftw(), i.e., you should ask this
question of the libc people, not here.

> It would also be useful if the depth could be controlled, i.e.,
> only call the function with the last argument less than some argument
> (I'd have called that argument depth). I wonder when one would want
> to use the current depth argument.

I think the second question answers the first one :-)

> It also seems possibly equally or more useful to report directories
> before the things in them.

maybe, but you have to modify nftw for that.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
XML is like violence. If it doesn't solve the problem, use more.
From: <Joe...@t-...> - 2012-02-03 10:11:09
Don Cohen wrote:
> I notice that CHARSET:ISO-8859-1 is almost right:
>   (with-open-file (f "/tmp/bytes" :direction :output
>                      :element-type '(unsigned-byte 8)
>                      :if-does-not-exist :create)
>     (loop for i below 256 do (write-byte i f)))

This test may have fooled you. Line-terminator transformation in
stream functions is different from usage in the FFI or via
ext:convert-string-to/from-bytes.

However, for pathnames, these days I advise against using Latin-1 on
the sole merit that it happens to be 1:1. Modern UNIX environments use
UTF-8, and we've seen enough of those badly programmed apps that
output "¶" when they should not.

Round-trips are not trivial. For instance, an ssh or sshfs from Linux
to MacOS shows a bug *somewhere* among sshfs, bash, readline and one
of the two OSes: you'll discover that ä reveals itself as ¨ + a! (I
noticed this when using backspace in bash within ssh.)

Regards,
Jörg Höhle
From: <don...@is...> - 2012-02-03 19:05:28
>> I notice that CHARSET:ISO-8859-1 is almost right:
>>   (with-open-file (f "/tmp/bytes" :direction :output
>>                      :element-type '(unsigned-byte 8)
>>                      :if-does-not-exist :create)
>>     (loop for i below 256 do (write-byte i f)))
> This test may have fooled you. Line-terminator transformation in
> stream functions is different from usage in the FFI or via
> ext:convert-string-to/from-bytes.

I don't understand what you think might be confusing. I hope you agree
that the code above simply writes all of the 8-bit bytes to a file.
The code that you did not include:

  (with-open-file (f "/tmp/bytes" :external-format CHARSET:ISO-8859-1)
    (loop for i from 0
          while (setf c (read-char f nil nil))
          unless (= i (char-code c)) do (princ (cons i (char-code c)))))
  (13 . 10)

shows that reading with external-format CHARSET:ISO-8859-1 recovers
all of those bytes as corresponding characters except for CR => LF. If
I could create an encoding that printed nothing on the example above,
then I would be happy to use it for reading pathnames and lots of
other things that I now read as bytes.

> However, for pathnames, these days I advise against using Latin-1 on
> the sole merit that it happens to be 1:1. Modern UNIX environments
> use UTF-8 ...

I don't know how to interpret this "use UTF-8". It looks to me like
Unix file names are sequences of bytes, not restricted to things that
can be parsed as UTF-8. What we need for reading Unix file names as
character strings seems to be the encoding that I wish I had - one
that maps 1-1 between chars and bytes.

> Round-trips are not trivial. ...

Again, I don't understand what you're trying to tell me here. Does
this have something to do with Lisp or reading file names?
From: <Joe...@t-...> - 2012-02-06 15:48:35
Don Cohen wrote:
> The code that you did not include:
>   (with-open-file (f "/tmp/bytes" :external-format CHARSET:ISO-8859-1)
>     (loop for i from 0
>           while (setf c (read-char f nil nil))
>           unless (= i (char-code c)) do (princ (cons i (char-code c)))))
>   (13 . 10)
> shows that reading with external-format CHARSET:ISO-8859-1 recovers
> all of those bytes as corresponding characters except for CR => LF.

Please repeat the test using ext:convert-string-to/from-bytes rather
than character-based stream functions.

>> Modern UNIX environments use UTF-8
> Again, I don't understand what you're trying to tell me here.

What I mean is that the average UNIX FS these days is configured to
use UTF-8. I advise against using ISO-8859-1 to read UNIX file names
into Lisp strings on the basis that it's a 1:1 encoding. Only UTF-8
appears like a reasonable default choice nowadays (you may always
override custom:*pathname-encoding*), perhaps with Pascal's added
suggestion about polymorphism: return a string if it can be read as
UTF-8, otherwise a byte array. Uh oh. Not ideal, but IMHO better in
some way than misrepresenting all UTF-8 Umlauts using Latin-1. This is
not Python 1.x!

Regards,
Jörg Höhle
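The repeat that Jörg asks for might look like this (a sketch; it
bypasses streams entirely, so no line-terminator layer is involved):

```lisp
;; Convert all 256 byte values at once and compare codes positionally.
(let* ((bytes (coerce (loop for i below 256 collect i)
                      '(vector (unsigned-byte 8))))
       (chars (ext:convert-string-from-bytes bytes charset:iso-8859-1)))
  (loop for i below 256
        unless (= i (char-code (char chars i)))
          do (princ (cons i (char-code (char chars i))))))
```

If Jörg's point is right, this prints nothing: byte 13 comes back as
the character with code 13, confirming that the CR => LF folding Don
saw belongs to the stream layer, not to the ISO-8859-1 encoding.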
From: <don...@is...> - 2012-02-06 21:13:07
Joe...@t-... writes:
> Please repeat the test using ext:convert-string-to/from-bytes rather
> than character-based stream functions.

I don't understand what test you have in mind here. You mean read the
file as bytes and then convert to a string? That does seem to preserve
the difference between CR and LF, though I'm not exactly sure why -
does it depend on the encoding? I gather there's no way to get that
result with character I/O. And note that the directory function does
not offer the choice of characters vs. bytes. So I see no way to use
only ANSI standard functions in clisp that can distinguish between
files with names containing CRs and LFs.

> What I mean is that the average UNIX FS these days is configured to
> use UTF-8.

What can that mean, given that you can put any sequence of bytes not
containing / or null into a file name? I see no character-set
arguments, e.g., in man mkfs.ext4(8). I suppose it has more to do with
how keyboard events are interpreted and how sequences of bytes are
displayed in windows than with anything related to the file system.

> I advise against using ISO-8859-1 to read UNIX file names into Lisp
> strings on the basis that it's a 1:1 encoding. Only UTF-8 appears
> like a reasonable default choice nowadays (you may always override
> custom:*pathname-encoding*), perhaps with Pascal's added suggestion
> about polymorphism: return a string if it can be read as UTF-8,
> otherwise a byte array. Uh oh. Not ideal, but IMHO better in some
> way than misrepresenting all UTF-8 Umlauts using Latin-1. This is
> not Python 1.x!

I think your preference must be related to the fact that these
characters mean more to you than to me, and you imagine that when you
get a file from some other place, the intent of the creator was that
the bytes in the name be interpreted as UTF-8. This is not necessarily
the case. If you want to search for file names containing Umlauts,
then some such assumption is necessary, but for many other purposes,
such as copying directories, it is not.
From: <Joe...@t-...> - 2012-02-10 13:35:27
Hi,

Don Cohen wrote:
> So I see no way to use only ANSI standard functions in clisp that can
> distinguish between files with names containing CRs and LFs.

Use custom:*pathname-encoding* as iso-8859-1 to work with strings.

> You mean read the file as bytes and then convert to a string?
> That does seem to preserve the difference between CR and LF

As you've verified, that encoding is truly 1:1. It's only with
character streams that funny things happen.

One extra idea would be to have DIRECTORY, EXT:DIR and Sam's
POSIX:FILE-TREE-WALK function
http://clisp.podval.org/impnotes/syscalls.html#file-tree-walk
accept an extra ENCODING keyword:

  &key (encoding custom:*pathname-encoding*)
  :ENCODING NIL => deliver a byte array

Then the directory-traversal code must be made robust (not leak
memory) w.r.t. the :INPUT-ERROR-ACTION of MAKE-ENCODING. The key is
not to accept custom:*pathname-encoding* being NIL, as that would
break too many expectations.

Well, that's only directory traversal. There is no solution for
rename-file etc. but having custom:*pathname-encoding* be iso-8859-1
in your program.

With byte arrays, I'd use

  (let* ((chars (ext:convert-string-from-bytes
                 bytes (ext:make-encoding :charset charset:utf-8
                                          :input-error-action :ignore)))
         (bytes2 (ext:convert-string-to-bytes
                  chars (ext:make-encoding :charset charset:utf-8
                                           :output-error-action :ignore))))
    (if (equalp bytes bytes2)
        chars  #| round-trip check passed |#
        bytes))

This would produce either printable UTF-8 strings or byte arrays on my
system. (Actually, I'd write this code using custom:*pathname-encoding*,
but you might have rebound that to iso-8859-1, which does not match my
LANG/LC_MESSAGES.)

If you have bound custom:*pathname-encoding* to iso-8859-1, you can
still apply a triple-conversion trick with custom:*terminal-encoding*
before printing such names or saving them to a file (using byte arrays
if unprintable).

Regards,
Jörg Höhle
From: <don...@is...> - 2012-02-10 20:20:36
> > So I see no way to use only ANSI standard functions in clisp that
> > can distinguish between files with names containing CRs and LFs.
> Use custom:*pathname-encoding* as iso-8859-1 to work with strings.

That seems to work:

  (with-open-file (f (concatenate 'string "/tmp/foo/"
                                  (convert-string-from-bytes
                                   #(65 13 66 10 67) CHARSET:ISO-8859-1))
                     :direction :output))
  NIL

then

  (directory "/tmp/foo/*")
  (#P"/tmp/foo/A^MB C")

(I have to doctor the output to show you what appears on my screen.)

I didn't expect this to work and still don't understand why it does.
Why isn't this screwed up by the line terminator of the encoding? I
expected to get the same string as would have been read from a file
with the same bytes, and that would have contained two "newline"
characters. ... Or so I thought - is this a bug, or am I
misunderstanding something important?

  [102]> (with-open-file (f "/tmp/foo/content"
                            :element-type '(unsigned-byte 8))
           (loop while (print (read-byte f nil nil)) do t))
  97
  13
  98
  10
  99
  NIL
  NIL
  [103]> (with-open-file (f "/tmp/foo/content"
                            :external-format CHARSET:ISO-8859-1)
           (loop while (print (read-char f nil nil)) do t))
  #\a
  #\Newline
  #\b
  #\c
  NIL
  NIL
  [104]>

Why didn't read-char show me the LF between the b and the c?
From: <Joe...@t-...> - 2012-02-13 15:31:38
Don Cohen wrote:
>> Use custom:*pathname-encoding* as iso-8859-1 to work with strings.
> That seems to work:
> Why isn't this screwed up by the line terminator of the encoding?
> I expected to get the same string as would have been read from a
> file with the same bytes.

The character streams add another layer on top of encodings, namely
the line-terminator handling. That's the point that has been confusing
you all the time. FFI encodings act like
ext:convert-string-to/from-bytes; only the READ-CHAR family adds
#\Newline conversions.

Regards,
Jörg Höhle