[clisp-list] character sets

Hi,

Pascal replied to Don Cohen:
>I'm realizing that you're using the LISP reader to read random data
>files.

1. That file is not guaranteed to contain pure ASCII, e.g. it could
embed URL requests as is, which might very well contain umlauts in DOS
or Latin-1 or whatever format -- I just don't know, and you probably
don't know either!

2. Therefore, and for typical purposes, it's reasonable to ask, as Don
did, to be able to read such a file without errors, under the following
hypotheses:
 - the file is a collection of text lines.
 - the data is mostly ASCII, which means that [CR]LF can be used to
detect the end-of-line.
 - the data is mostly ASCII, and the first positions of each line will
be readable under that assumption (e.g. date, time, etc.)

For such a purpose, I recommend using ISO-8859-1 and :line-terminator
:unix.
It may not be the encoding used to write the data, but
 + it's a single-byte (8-bit) superset of ASCII (unlike UTF-8, which is
multi-byte) and
 + in CLISP it will most likely let you process the extra characters
and write them to other files, even though the actual data may be
Cyrillic, Hangul, UTF-8 or whatever.
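
A minimal sketch of that recommendation, assuming CLISP's
EXT:MAKE-ENCODING, EXT:CONVERT-STRING-FROM-BYTES and
EXT:CONVERT-STRING-TO-BYTES. The names LATIN-1-UNIX, *SAMPLE* and
FIRST-LINE are made up for illustration. It decodes a few bytes that
are not valid UTF-8 and checks that encoding them again yields the same
bytes:

```lisp
;; Hypothetical helper: the name is mine, not part of CLISP.
(defun latin-1-unix ()
  "Return the encoding recommended above."
  (ext:make-encoding :charset charset:iso-8859-1
                     :line-terminator :unix))

;; Sample bytes containing a lone #xF6 ("o umlaut" in Latin-1),
;; which is not a valid UTF-8 sequence.
(defparameter *sample*
  (make-array 5 :element-type '(unsigned-byte 8)
                :initial-contents '(72 #xF6 104 108 101)))

;; Decode the bytes, encode the result again, compare with the input.
(let* ((text  (ext:convert-string-from-bytes *sample* (latin-1-unix)))
       (bytes (ext:convert-string-to-bytes text (latin-1-unix))))
  (format t "~D characters, round trip ok: ~A~%"
          (length text) (equalp bytes *sample*)))

;; Hypothetical helper: read the first line of PATH with that encoding.
(defun first-line (path)
  (with-open-file (in path :direction :input
                           :external-format (latin-1-unix))
    (read-line in nil nil)))
```

The characters you get for the non-ASCII bytes may not be the ones the
writer intended, but the ASCII part of each line (date, time, etc.)
stays readable.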

The point where it might break is that I don't know whether Hangul
(Korea), Japanese and other countries' encodings which use multi-byte
sequences can embed something which could be mistaken for CR or LF when
read bytewise and which could find its way into the log files.

BTW, I recommend using READ-CHAR-SEQUENCE to process log files. I
talked about that here when I mentioned my "10 times faster than Perl"
firewall log-processing application a year or so ago. You need
buffering to achieve performance.
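
A rough sketch of what buffered reading could look like, assuming
CLISP's EXT:READ-CHAR-SEQUENCE. This is not the code of the firewall
application. The function name and the buffer size are made up for
illustration, and it only counts lines:

```lisp
;; Hypothetical helper; the name and the 64K default are illustrative.
(defun count-lines-buffered (path &key (buffer-size 65536))
  "Count newline characters in PATH, reading BUFFER-SIZE characters at a time."
  (let ((buffer (make-string buffer-size))
        (lines 0))
    (with-open-file (in path
                        :direction :input
                        :external-format (ext:make-encoding
                                          :charset charset:iso-8859-1
                                          :line-terminator :unix))
      ;; EXT:READ-CHAR-SEQUENCE returns the index of the first element
      ;; it did not fill, so 0 means nothing more could be read.
      (loop for end = (ext:read-char-sequence buffer in)
            while (plusp end)
            do (incf lines (count #\Newline buffer :end end))))
    lines))
```

You would call it as (count-lines-buffered "firewall.log"), where the
file name is just an example. Real processing also has to cope with
lines that straddle two buffer fills, which this sketch ignores.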

Regards,
	Jörg Höhle