From: <kr...@mo...> - 2001-04-06 08:11:40
> And I'd guess that if you took other steps to reduce the size of the
> system, like compiling with DEBUG 0 and removing doc strings,

is there a systematic way of stripping doc-strings from an image (or
fasl file) ? where are those stored anyway ?

> For interned symbols,
>   (let ((n 0)) (do-all-symbols (s) (incf n)) n) => 20211
> so I'd guess about 1 Mbyte.

how did you arrive at 1MB ?
from notes i took while glancing at a file:
  src/18a/x86/lisp/internals.h
a long time ago, it seems like a symbol occupies 24 bytes.
  (* 24 20211) => 485064
so about half of your estimate. did i miss anything ? this of course
did not take into consideration the symbol-names. maybe that's the
other 0.5MB ?

(are you aware that (do-all-symbols ) actually processes some symbols
more than once ? if i collect all symbols in a list using (pushnew )
(slow and dumb, hashtable would be better), the number of symbols is
slightly lower.)

> Of course, there might be significantly more space used by
> uninterned symbols protected from GC by references in debug
> information.

what kind of info is stored as debug info ? why would that have many
uninterned symbols ? so debug info generally is protected from GC ?
(presumably, no more than any other data structures ?)

> I'd actually guess that the biggest non-code contribution to space
> use is debug information.

interesting. would be good to find out. what is all this debug info,
and where exactly is it stored ? is this all thoroughly documented
someplace ?

now i really have not dug into the internals of sbcl/cmucl (even though
i might like to at some point), so i do not know how complicated it
would be to make some low-level changes that have far-reaching
consequences. so let me just draw your attention to the following,
just for inspiration and possible consideration:

there is a smalltalk dialect called "squeak", which is open source, and
a very interesting, self-contained and very portable, compact system,
with graphics and all.
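
going back to the symbol count for a moment: a hash-table version of
the duplicate-free count could look roughly like the sketch below. the
function names are made up for illustration, and the 24 bytes per
symbol is only the figure from the notes mentioned above, not a
measured value. the estimate also ignores the symbol-name strings.

```lisp
;; Sketch only: function names are hypothetical, not part of any library.

(defun count-visited-symbols ()
  "Number of times DO-ALL-SYMBOLS calls its body (may include repeats)."
  (let ((n 0))
    (do-all-symbols (s)
      (declare (ignorable s))
      (incf n))
    n))

(defun count-unique-symbols ()
  "Number of distinct symbols seen by DO-ALL-SYMBOLS."
  (let ((seen (make-hash-table :test #'eq)))
    (do-all-symbols (s)
      ;; storing the same symbol twice does not grow the table
      (setf (gethash s seen) t))
    (hash-table-count seen)))

(defun estimated-symbol-bytes (&optional (bytes-per-symbol 24))
  "Rough size of all symbol objects, assuming BYTES-PER-SYMBOL each.
The default of 24 is an assumption taken from old notes."
  (* bytes-per-symbol (count-unique-symbols)))
```

comparing (count-visited-symbols) with (count-unique-symbols) in the
same image shows how many repeats there are.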
(btw, tim may has said he is writing a go playing program entirely in
squeak. :-)

a paper describes some of squeak's internals and design decisions:
  http://users.ipa.net/~dwighth/squeak/oopsla_squeak.html
in the section "object memory", tables 1 and 2 describe a variable
length object header. because most objects are quite simple, a single
32-bit header word is used most often, and it only needs to be expanded
for rare complex objects.

i wonder whether the same could be done for sbcl too, though it would
be a rather comprehensive change. but it could shrink the space
allocated for symbols by almost a factor of 6 (24 bytes versus one
4-byte header word).

for example, how often does a given symbol have both a value and a
function ? i think that the common lisp spec is screwed up to even
make the distinction, and it should be like with scheme. however,
given the screwed up spec, we at least don't have to waste unused
space.

in a cmucl-18c image of mine, which had cllib (from clocc) loaded too,
i counted 32625 unique symbols, of which 6426 were (boundp ), and 17087
were (fboundp ). however, only 14 were both at the same time. they
were:

  +
  /
  -
  COMMON-LISP::FDEFINITION-OBJECT
  COMMON-LISP::MAYBE-GC
  COMMON-LISP::%INITIAL-FUNCTION
  KERNEL::INTERNAL-ERROR
  DEBUG-INTERNALS::HANDLE-FUNCTION-END-BREAKPOINT
  DEBUG-INTERNALS::HANDLE-BREAKPOINT
  UNIX:SIOCSPGRP
  *
  HEMLOCK-INTERNALS::MINIMUM-WINDOW-HEIGHT
  HEMLOCK-INTERNALS::OPEN-LINE
  PCL::FIND-STRUCTURE-CLASS

i am not sure even all of these would really need to have both. so one
32-bit word could be saved in every symbol, if the header had some way
of indicating whether the value is a function or not, and which few
exotic special cases have an expanded symbol description that carries
both.

similarly, it seems a little wasteful to allocate a full 32-bit pointer
for the package info. i have never had more than 256 packages. so
this could be reduced to a byte, or maybe 16 bits to be generous.
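
as an aside, the boundp/fboundp numbers above can be reproduced with
one pass over all symbols, using a hash table to skip the repeats that
(do-all-symbols ) produces. this is only a sketch: the function name
is hypothetical, and the counts will differ from image to image.

```lisp
;; Sketch only: SYMBOL-BINDING-CENSUS is a made-up name.

(defun symbol-binding-census ()
  "Return four values: unique symbols, BOUNDP count, FBOUNDP count,
and the number of symbols that are both."
  (let ((seen (make-hash-table :test #'eq))
        (bound 0)
        (fbound 0)
        (both 0))
    (do-all-symbols (s)
      ;; skip symbols that were already visited
      (unless (gethash s seen)
        (setf (gethash s seen) t)
        (let ((b (boundp s))
              ;; note: FBOUNDP is also true for macros and special operators
              (f (fboundp s)))
          (when b (incf bound))
          (when f (incf fbound))
          (when (and b f) (incf both)))))
    (values (hash-table-count seen) bound fbound both)))
```

calling (symbol-binding-census) at the listener prints the four counts.
if the last one stays tiny compared to the first, that supports the
idea of sharing one slot for value and function.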
now maybe there needs to be some escape mechanism for the package
field as well, for people who need millions of packages, but even they
would probably feel restricted by the 4 billion packages that today's
32-bit pointer can maximally differentiate. :-)

cutting the package pointer to 16 bits would free up more "header bits"
for special typing (such as merging the value and function, as above).
probably if one carefully rethought what is important, one could reduce
the symbol header to a form about as compact as the one in squeak. i
agree that it probably involves substantial work, but it would be fun.

an issue along similar lines concerns conses. it would be good to know
how much space is used up by cons cells in a (purified) image.
whenever i did (room ) on an allegro image many years ago, i routinely
saw that about half the memory was allocated to cons cells, which seems
like a lot.

lisp machines such as the symbolics had what was called "cdr-coding",
whereby a type-bit indicated that the next adjacent word is another
cons, so that the cdr portion could be saved. so basically a list
could be represented by a sequence of cars, with only a cdr cell at the
very end, most often pointing to nil.

it ought to be possible to implement something like that even on
standard hardware, if one thought about it carefully. probably an
escape mechanism is needed that can expand such a slim cons cell when
needed, for example when a new cons cell is inserted into the middle of
a list. but those are probably relatively rare events. by using
cdr-coding, one could probably cut the storage space for cons cells
about in half, and searching through lists should get faster too, since
only about half as much memory has to be read.

this is also a low-level issue that would require substantial
modifications, probably also in the compiler, before one can take
advantage of it. i unfortunately don't currently even know where one
would have to start looking, to find all the components that would need
to be modified.
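
to put rough numbers on the cdr-coding idea, here is a small sketch
that just counts words under the layout described above: two words per
ordinary cons, versus one word per car plus a single cdr word at the
end of the run. the function names are invented for the example, and
it assumes a proper list that sits in one contiguous run, which is the
best case.

```lisp
;; Sketch only: these helpers count words, they do not change how
;; lists are actually stored. Names are hypothetical.

(defun plain-list-words (list)
  "Words used by LIST as ordinary cons cells: one car and one cdr each."
  (* 2 (length list)))

(defun cdr-coded-words (list)
  "Words used if LIST is a single contiguous cdr-coded run:
one word per car, plus one cdr word at the very end."
  (if (null list)
      0
      (1+ (length list))))

(defun cdr-coding-savings (list)
  "Words saved by the cdr-coded layout compared to plain conses."
  (- (plain-list-words list) (cdr-coded-words list)))
```

for a 100-element list this gives 200 words versus 101, which is where
the "about half" comes from. lists that get spliced or shared would do
worse, since every break in the run needs its own cdr word.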
clearly, the garbage collector would need to be modified, and i am
horrified that this is all just c code. it is not very conducive to
experimenting with alternate GC algorithms. :-(

--
greetings
  markus krummenacker