> Don Cohen wrote:
> >The code that you did not include:
> (with-open-file (f "/tmp/bytes" :external-format CHARSET:ISO-8859-1)
> (loop for i from 0 while (setf c (read-char f nil nil))
> unless (= i (char-code c)) do (princ (cons i (char-code c)))))
> (13 . 10)
> >shows that reading with external-format CHARSET:ISO-8859-1 recovers
> >all of those bytes as corresponding characters except for CR => LF.
> Please repeat the test using ext:convert-string-to/from-bytes
> rather than character based stream functions.
I don't understand what test you have in mind here.
You mean read the file as bytes and then convert to string?
That does seem to preserve the difference between CR and LF, though
I'm not exactly sure why - does it depend on the encoding?
I gather there's no way to get that result with character IO.
And note that the directory function does not offer the choice of
characters vs bytes. So I see no way to use only ANSI standard
functions in CLISP that can distinguish between files with names
containing CRs and LFs.
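
For concreteness, here is a minimal sketch of what I take "read the
file as bytes and then convert to string" to mean, using only standard
functions. The name file-bytes-as-string is made up for illustration,
and the sketch assumes that character codes 0-255 coincide with
ISO-8859-1, which holds in Unicode-based implementations but is not
guaranteed by the standard.

```lisp
;; Hypothetical helper, not part of CLISP or ANSI CL.
;; Assumption: (code-char n) for n in 0..255 yields the ISO-8859-1
;; character with that code, as in Unicode-based implementations.
(defun file-bytes-as-string (path)
  "Read PATH as raw octets and map each octet to the character with the same code."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let* ((bytes (make-array (file-length in)
                              :element-type '(unsigned-byte 8)))
           ;; READ-SEQUENCE returns the index of the first element not filled.
           (end (read-sequence bytes in)))
      ;; No line-terminator translation happens on a binary stream,
      ;; so octet 13 and octet 10 map to two different characters.
      (map 'string #'code-char (subseq bytes 0 end)))))
```

Since the stream is opened with a byte element type, no external
format is involved at all, so the result should not depend on the
encoding. This only covers file contents, not file names.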
> >>Modern UNIX environments use UTF-8
> >Again, I don't understand what you're trying to tell me here.
> What I mean is that the average UNIX FS these days is configured to
> use UTF-8.
What can that mean, given that you can put any sequence of bytes not
containing / or null into a file name? I see no character set
arguments, e.g., in man mkfs.ext4(8). I suppose it has more to do
with how keyboard events are interpreted and how sequences of bytes
are displayed in windows than with anything related to the file
system.
> I advise against using ISO-8859-1 to read UNIX file names into Lisp
> strings on the basis that it's a 1:1 encoding. Only UTF-8 appears
> like a reasonable default choice nowadays (you may always override
> custom:*pathname-encoding*), perhaps with Pascal's added
> suggestion about polymorphism: return a string if it can be read as
> UTF-8, otherwise a byte array. Uh oh. Not ideal, but IMHO better
> in some way than misrepresenting all UTF-8 Umlauts using Latin-1. This
> is not Python 1.x!
I think your preference must be related to the fact that these
characters mean more to you than to me, and you imagine that when you
get a file from some other place, the intent of the creator was that
the bytes in the name be interpreted in UTF-8. This is not
necessarily the case. If you want to search for file names containing
Umlauts then some such assumption is necessary, but for many other
purposes, such as copying directories, it is not.
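
For reference, the polymorphic fallback quoted above (a string if the
bytes read as UTF-8, otherwise the byte array) might look roughly like
the following. This is only a sketch: the names utf-8-decode-or-nil
and decode-file-name are invented here, nothing like this is claimed
to exist in CLISP, and it assumes an implementation whose characters
cover all of Unicode.

```lisp
;; Hypothetical helpers, not part of CLISP or ANSI CL.
;; Assumption: CODE-CHAR accepts every Unicode scalar value.
(defun utf-8-decode-or-nil (bytes)
  "Return a string if BYTES is well-formed UTF-8, otherwise NIL."
  (let ((chars '())
        (i 0)
        (n (length bytes)))
    (loop while (< i n) do
      (let* ((b (aref bytes i))
             ;; Sequence length implied by the lead byte.
             (len (cond ((< b #x80) 1)
                        ((<= #xC2 b #xDF) 2)
                        ((<= #xE0 b #xEF) 3)
                        ((<= #xF0 b #xF4) 4)
                        (t (return-from utf-8-decode-or-nil nil))))
             ;; Payload bits carried by the lead byte.
             (code (case len
                     (1 b)
                     (2 (logand b #x1F))
                     (3 (logand b #x0F))
                     (t (logand b #x07)))))
        ;; A sequence cut off by the end of the input is malformed.
        (when (> (+ i len) n)
          (return-from utf-8-decode-or-nil nil))
        ;; Each continuation byte must look like 10xxxxxx.
        (loop for j from 1 below len
              for c = (aref bytes (+ i j))
              do (unless (= (logand c #xC0) #x80)
                   (return-from utf-8-decode-or-nil nil))
                 (setf code (logior (ash code 6) (logand c #x3F))))
        ;; Reject overlong forms, surrogates, and out-of-range values.
        (when (or (and (= len 3) (< code #x800))
                  (and (= len 4) (< code #x10000))
                  (<= #xD800 code #xDFFF)
                  (> code #x10FFFF))
          (return-from utf-8-decode-or-nil nil))
        (push (code-char code) chars)
        (incf i len)))
    (coerce (nreverse chars) 'string)))

(defun decode-file-name (bytes)
  "Return BYTES decoded as a string if it is valid UTF-8, else BYTES itself."
  (or (utf-8-decode-or-nil bytes) bytes))
```

With this, the two bytes #xC3 #xBC would come back as a one-character
string, while the single byte #xFC (a Latin-1 u-umlaut) would come
back unchanged as a byte vector. Callers then have to handle both
return types, which is the cost of the approach.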