From: Pascal J. B. <pj...@in...> - 2009-11-26 22:27:31
On Nov 26, 2009, at 4:56 PM, Fred Cohen wrote:

> This is really quite helpful... Thanks. Questions follow.
>
> On Nov 26, 2009, at 2:04 AM, Pascal J. Bourguignon wrote:
>> ...
>> By allocating exactly the right number of characters?
>> How many bytes are there in that file?
>>
>>> Efficiently would be even better.
>>
>> C/USER[242]> (with-open-file (inp "/tmp/test.out"
>>                                   :element-type '(unsigned-byte 8))
>>                (let ((buffer (make-array (file-length inp)
>>                                          :element-type '(unsigned-byte 8)
>>                                          :initial-element 0)))
>>                  (read-sequence buffer inp)
>>                  (ext:convert-string-from-bytes buffer
>>                                                 charset:iso-8859-15)))
>
> Ah! I think I am starting to get it. Some questions though...
>
> Does make-array actually go and pre-fill the array? Is there a way to
> avoid the consumption of time?

Creating an array with :element-type and without :initial-element is
undefined, IIRC. It depends on the implementation whether the array is
left unfilled and random values would be read (assuming a
(declaim (optimize (safety 0)))), or whether the array is filled with
some data (which may or may not be of the element type!), or whatever.

However, notice that implementations may use low-level tricks to
initialize the array efficiently. For example, they may allocate it on
pages marked specially in the MMU (e.g. copy-on-write), so that a page
is filled only when you read a slot that has not been initialized (or
that you initialized to the default value, such as 0 or NIL depending
on the element-type), and so on.

Moreover, you can easily avoid the consumption of time by reusing
buffers. It is a classic technique, to avoid allocating and
deallocating (garbage collection in Lisp), to pre-allocate buffers and
reuse them as needed.

> I often deal with large files (I will call them traces) - terabytes
> each - so reading the whole trace into memory is not going to happen.
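The buffer-reuse technique mentioned above can be sketched like this
(READ-INTO is a hypothetical helper, and the 64 KB buffer size is an
arbitrary choice for illustration):

```lisp
;; Sketch of buffer reuse: allocate one buffer up front and refill it
;; for each file, instead of consing a fresh array per read.
;; READ-INTO and the 64 KB size are illustrative, not from a library.
(defparameter *buffer*
  (make-array 65536 :element-type '(unsigned-byte 8) :initial-element 0))

(defun read-into (path buffer)
  "Fill BUFFER from the beginning of the file at PATH.
Returns the number of bytes actually read."
  (with-open-file (inp path :element-type '(unsigned-byte 8))
    (read-sequence buffer inp)))
```

The same *BUFFER* can then be passed to READ-INTO for every file in a
batch, so the garbage collector never sees the intermediate data.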
> This means that I need to be able to randomly access and read parts
> of the trace, treat them in whatever ways are appropriate (often
> creating derived traces from them and storing the results in files).
> I assume that file-position or some such thing will do the seek
> properly - and then read-sequence will treat everything as bytes
> because the element-type is (unsigned-byte 8).

Yes.

> If I do an (unsigned-byte 6) will I then get chunks of 6 bits each,
> and will the file-position then be relative to a 6-bit length?

Be careful with byte sizes different from 8. Since clisp works on
POSIX machines that have a native byte size of 8 bits, when you
specify a different byte size, clisp needs to choose some mapping. On
one hand we're lucky that it chooses the same mapping on all the
platforms where clisp runs, so our files are transferable and
compatible; but on the other hand, you don't get to choose how the
6-bit bytes are mapped to 8-bit bytes. In particular, clisp needs to
use a header (or was it a trailer?) in the file to know the exact
file size. (See the Implementation Notes for the details.) So if you
have files with 6-bit bytes in them, you should probably read them as
8-bit bytes and extract the 6-bit bytes you need according to those
files' specifications, and not rely on the CL implementation to do
that for you.

> ext:convert-string-from-bytes was not immediately apparent in the
> manual pages - for obvious reasons. Is there a way to do a typecast
> between the array of bytes and the character version?

No, there's no way to cast in Lisp. You have to convert them. (At
most, it is possible to coerce some values to some types, but not in
this case.)

> This way we don't have to run through conversions back and forth all
> the time when we want to treat the character as a number or byte,
> the number as a character or byte, or the byte as a number or
> character.

Do not convert to process; process the bytes directly.
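Both points above - seeking with FILE-POSITION and unpacking 6-bit
fields by hand from 8-bit bytes - can be sketched as follows. The
helper names and the big-endian, MSB-first packing are assumptions for
illustration; a real 6-bit trace format defines its own layout:

```lisp
;; Random access into a large file: seek, then read a window of bytes.
(defun read-chunk-at (path offset buffer)
  "Seek to byte OFFSET in PATH and fill BUFFER; return bytes read."
  (with-open-file (inp path :element-type '(unsigned-byte 8))
    (file-position inp offset)
    (read-sequence buffer inp)))

;; Unpack the Nth 6-bit field from a byte vector, assuming the fields
;; are packed big-endian, most significant bit first.
(defun ref-6bit (bytes n)
  (let* ((bit    (* 6 n))          ; absolute bit offset of the field
         (index  (floor bit 8))    ; byte containing its first bit
         (offset (mod bit 8)))     ; bit offset within that byte
    (if (<= offset 2)
        ;; Field fits entirely inside one byte.
        (ldb (byte 6 (- 2 offset)) (aref bytes index))
        ;; Field straddles two bytes: combine the tail of one byte
        ;; with the head of the next.
        (let* ((hi-bits (- 8 offset))
               (lo-bits (- 6 hi-bits)))
          (logior (ash (ldb (byte hi-bits 0) (aref bytes index)) lo-bits)
                  (ldb (byte lo-bits (- 8 lo-bits))
                       (aref bytes (1+ index))))))))
```

Under that packing, the byte vector #(#xAB #xCD #xEF) holds the four
6-bit fields 42, 60, 55 and 47.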
Notice that most of the CL functions used to process strings are
actually vector or sequence functions; that is, they will work equally
well on byte vectors. For the few string-* functions that you may
need, you can easily re-implement them to work on byte vectors.

>> If you are not concerned by the characters, why do you want to
>> convert to them at all?
>
> I do it so that I can deal with strings effectively. For example, so
> I can use regular expressions, search things, do (equal "thing"
> string), and so forth. Ideally I would simply leave them as unsigned
> bytes and have the rest of the lisp environment treat them
> transparently.

You can do all that directly on byte vectors. Some libraries may be
hard-coded to use characters, but it should not be too difficult to
generalize them.

>> Notice that you can always write:
>>   (deftype char () '(unsigned-byte 8))
>> and program like in C:
>>   (let ((s (make-array 42 :element-type 'char :initial-element 0
>>                        #| !!! |#)))
>
> Now this is pretty interesting. Can I define string as an array of
> unsigned bytes and be done?

You cannot change the CL:STRING type, but you can define your own:

  (shadow 'string)
  (deftype string (&optional (size '*))
    `(vector char ,size))

Of course, you still need to shadow and redefine the CL:STRING-*
functions.

>> You can also write a trivial reader macro to be able to specify byte
>> sequences from strings like in C:
>>   #_"ABC" --> #(65 66 67)
>> so you can write things like:
>>   (replace s #_"abc" :end1 3)
>>   (setf (aref s 3) 0)
>
> Sounds like a compromise I might be able to live with - but it would
> be nicer if I could simply use "abc" without the #_ before it...

You can choose any character you want for a reader macro. If you
positively have zero use of CL strings, then indeed you can use #\".

>> And of course, you may write a function to print byte sequences as
>> "strings".
>>
>> (print-c-string #(65 66 67))
>> prints:
>> ABC
>
> Yep - but I need to keep track of what is and is not a byte sequence
> as opposed to a string all of the time, and cannot use the same code
> for strings as for byte sequences.

You have to do what you have to do.

>>> I need all of the byte values that can ever exist, because those
>>> are all of the possibilities I can ever face.
>>
>> Forget about characters. Don't be misled by the fact that C char is
>> not a character, but a byte, which is written in lisp as
>> (unsigned-byte 8).
>
> I would be happy to deal entirely with bytes in all things. I will
> try to test out regular expression parsing, searches, etc. The
> problem for me may be that there is a thing called a string at all.
> Why isn't string simply a macro for an array of whatever? If it were
> (which it may be for all I know), then if I simply changed the macro
> definition to use (unsigned-byte 8), everything would work
> transparently - yes? Or am I missing something big? Is this what you
> did when you did (deftype char () '(unsigned-byte 8))?

What you missed is that the CL implementations are allowed to be
optimized; therefore you cannot easily change them. But you can very
easily ignore the CL package, and define your own
COMMON-LISP-WITH-C-STRING, a.k.a. CL/CS, package in which you
implement all the strings as byte vectors like in C. Happily, it
might be easy to do so, taking the sources of some CL implemented in
Lisp (e.g. CCL or SBCL), and loading them in your package, after
having redefined the character and string types. (OK, it might be
some work, but you'll get a nice package to make you (and a lot of
other programmers) happy.)

>> See: http://www.cliki.net/CloserLookAtCharacters
>
> That was somewhat useful, but also a bit confusing.

Sorry about that; what was confusing?

-- 
__Pascal Bourguignon__
http://www.informatimago.com
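Two of the helpers suggested in this thread - the #_ reader macro and
PRINT-C-STRING - can be sketched as follows. These are minimal,
ASCII-only versions written for illustration; the originals from the
earlier mails in the thread may differ:

```lisp
;; #_"ABC" reads as the byte vector #(65 66 67).  ASCII-only sketch;
;; codes above 127 would need a real character encoding.
(set-dispatch-macro-character
 #\# #\_
 (lambda (stream subchar arg)
   (declare (ignore subchar arg))
   (map '(vector (unsigned-byte 8)) #'char-code
        (read stream t nil t))))

;; Print a byte vector as text, stopping at a 0 byte like a C string.
(defun print-c-string (bytes &optional (stream *standard-output*))
  (loop :for code :across bytes
        :until (zerop code)
        :do (write-char (code-char code) stream))
  (values))
```

With these, (print-c-string #(65 66 67)) prints ABC, and sequence
functions such as SEARCH and REPLACE work on the resulting byte
vectors just as they would on strings.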