From: Hoehle, Joerg-C. <Joe...@t-...> - 2004-03-16 10:30:27
Hi,

Pascal replied to Don Cohen:
>I'm realizing that you're using the LISP reader to read random data
>files.

1. That file is not guaranteed to contain pure ASCII. For example, it could embed URL requests as is, which might very well contain umlauts in DOS or Latin-1 or whatever format. I just don't know, and you probably don't know either!

2. Therefore, and for typical purposes, it's reasonable to ask to be able to read such a file without errors, as Don did, under the following hypotheses:
 - the file is a collection of text lines.
 - the data is mostly ASCII, which means that [CR]LF can be used to detect the end-of-line.
 - the data is mostly ASCII, and the first positions of each line will be readable under that assumption (e.g. date, time, etc.)

For such a purpose, I recommend using ISO-8859-1 and :line-terminator :unix. It may not be the encoding used to write the data, but
 + it's a single-byte (8-bit) superset of ASCII (unlike UTF-8), so every byte maps to one character, and
 + it provides means in CLISP to most likely process the extra characters and write them to other files, even though the actual data may be Cyrillic, Hangul, UTF-8 or whatever.

The point where it might break is that I don't know whether Hangul (Korea), Japanese and other countries' encodings, which use multi-byte sequences, can embed something that could be mistaken for CR or LF when read bytewise and that could find its way into the log files.

BTW, I recommend using READ-CHAR-SEQUENCE to process log files. I talked about that here when I mentioned my "10 times faster than perl" firewall log-processing application a year or so ago. You need buffering to achieve performance.

Regards,
	Jörg Höhle
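
The encoding argument above can be checked outside CLISP with a small sketch. It is written in Python purely for illustration and is not the CLISP code discussed in this thread. The function name `read_lines_latin1`, the chunk size, and the sample log bytes are all hypothetical. It reads a byte stream in buffered chunks, decodes it as ISO-8859-1, and treats only LF as the line terminator.

```python
import io

def read_lines_latin1(stream, chunk_size=65536):
    """Yield text lines from a binary stream, decoded as ISO-8859-1.

    Only LF ends a line (Unix-style); a preceding CR is kept as data.
    """
    pending = ""
    while True:
        chunk = stream.read(chunk_size)  # block read, not character by character
        if not chunk:
            break
        # ISO-8859-1 maps each of the 256 byte values to one character,
        # so this decode cannot fail, whatever the real encoding was.
        pending += chunk.decode("iso-8859-1")
        *lines, pending = pending.split("\n")
        yield from lines
    if pending:  # last line without a trailing LF
        yield pending

# Hypothetical log data: ASCII prefix, then bytes in an unknown encoding.
raw = (b"2004-03-16 10:30 GET /q=J\xf6rg\r\n"
       b"2004-03-16 10:31 GET /q=\xc3\xa9\xff\n")

lines = list(read_lines_latin1(io.BytesIO(raw), chunk_size=8))

# Writing the lines back with the same encoding restores the original bytes.
rebuilt = "\n".join(lines).encode("iso-8859-1") + b"\n"

# The same data is not valid UTF-8, so a strict UTF-8 reader would fail.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

With this sample, the ASCII prefix of each line (date and time) is readable as is, the unknown bytes pass through unchanged on output, and the CR of the first line stays in the line data because only LF is treated as a terminator.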