Thanks very much for such an informative reply.
On Mon, Feb 14, 2011 at 5:57 PM, Nikodemus Siivola
> Youngest generations cannot expand without bounds: objects instead get
> migrated to older generations when necessary. From the heap-map printed
> we can see all data has been migrated to the final untenured
> generation by the time the heap is exhausted. (The final generation
> gets collected too, so that's not a problem.)
I see. This indicates a full GC should be a quick fix.
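
A minimal sketch of what I have in mind, assuming a placeholder function PROCESS-WORKLOAD and an arbitrary interval of 10 cycles (neither is taken from my actual program). The only SBCL-specific call is SB-EXT:GC with :FULL T, which collects all generations:

```lisp
;; Hypothetical sketch: force a full GC every N workload cycles.
;; PROCESS-WORKLOAD and the interval value are placeholders.
(defparameter *full-gc-interval* 10
  "Number of workload cycles between forced full collections (assumed value).")

(defun process-workload (workload)
  "Placeholder for the real per-cycle work."
  (length (princ-to-string workload)))

(defun run-cycles (workloads &key (interval *full-gc-interval*))
  "Process each workload, forcing a full GC after every INTERVAL cycles.
Returns the number of full collections that were forced."
  (let ((forced 0))
    (loop for workload in workloads
          for cycle from 1
          do (process-workload workload)
             (when (zerop (mod cycle interval))
               ;; Collect every generation, not just the nursery.
               (sb-ext:gc :full t)
               (incf forced)))
    forced))
```

The right interval would have to come from watching the heap map; 10 is only a starting guess.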
> Without knowing more about your allocation patterns it's hard to hazard
> a guess, but:
> (define-alien-routine print-generation-stats void)
> and you can call (print-generation-stats) to print the heap map at any
> time to stderr of the process, so you can see what is happening -- are
> things slowly accumulating in older generations over the lifetime of
> your application, or is this a sudden collapse, etc.
This is nice, but could you change the output target from stderr (fd=2)
to sb-sys:*stderr*? Or make it behave like sb-vm:memory-usage, so that I
can rebind the output target.
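
To illustrate what I mean by rebinding: a small sketch, assuming the report goes through a Lisp stream. It uses the standard ROOM function, which writes to *STANDARD-OUTPUT*; MEMORY-USAGE-REPORT is a hypothetical helper name of my own:

```lisp
;; Hypothetical helper: capture a memory report as a string by rebinding
;; *STANDARD-OUTPUT*. This works for ROOM because it prints through a Lisp
;; stream; it cannot capture output written directly to fd 2 from C.
(defun memory-usage-report ()
  "Return the output of (ROOM) as a string."
  (with-output-to-string (*standard-output*)
    (room)))
```

With this, the report can go into the application log instead of the process's stderr.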
> I'll start from the assumption that you're not accidentally
> accumulating things in your application. :)
Yes, that shouldn't be the case, but who knows, it might be a bug in my
program. I'll check, thanks!
A leak in the libraries is also unlikely; the one used most heavily
is CLSQL, and I think it's reasonably stable and robust.
> 2. Doing *something* which causes SBCL to keep accumulating something
> in its internals due to a known or unknown bug. For example:
> (defun foo ()
>   (let ((name (gensym)))
>     (setf (fdefinition name) #'foo)
>     (fmakunbound name)))
> (loop (foo))
> will eventually exhaust the heap as even though FMAKUNBOUND removes
> the function binding, it leaves the *name* in SBCL's globaldb.
> EQL-specializers and EQL-specialized methods in CLOS are another known
> leak. I can't think of other known issues off the top of my head, but
> maybe there is an unknown one that is biting you?
It could be this tricky case. I've never thought about that, thanks!
There's no dynamic code generation in my program and the only part
which is heavily used is the reader, and the messages are plain
structs of built-in types.
> 4. Getting bitten by the generational trap.
Yes, I think that's the most likely case.
> Let's say you have "cyclic" application -- a workload comes in, you
> process it, then repeat from start with another workload. Let's say
> that processing a single workload involves on average 1 minor GC in
> which large amounts of the data can be live.
> These minor GCs initially only collect the nursery, promoting live
> objects to generation 1 -- where uncollected garbage from earlier
> cycles keeps accumulating till a collection is triggered for it as
> When this happens, *first* the nursery is collected into gen 1. Then
> gen 1 is collected into gen 2. So now live objects from this cycle
> have ended up in gen 2 -- where slowly in this way uncollected garbage
> accumulates till a collection is triggered for gen 2.
> When this happens, first the nursery is collected into gen 1. Then gen 1
> into gen 2. Then gen 2 into gen 3... so now live objects from this
> cycle ended up in gen 3, where they accumulate till an even deeper
> collection is triggered.
> This keeps going on till the final generation is reached. When that
> happens, it is collected but not promoted, breaking the chain of
> Now, given a "bad" allocation pattern, it may be that you exhaust the
> heap due to uncollected garbage in older generations before a
> collection deep enough to collect that garbage is triggered.
> Based on the description of your application, I suspect this may be
> happening to you. In this case forcing a full collection every cycle
> (or every few cycles) should help -- watching the
> PRINT-GENERATION-STATS should tell you if this is the case, and how
> often you should force a full GC.
Yes, that makes a lot of sense. I'll do it after I get some feedback from
> As for bad stuff your application (or a library you depend on) could be doing:
> A. a cache or memoization that keeps growing without bounds?
I just remembered (my bad) that I had some primitive memory usage
message printed every 30 mins in the log... Here it is,
2011-02-10T09:43:57 Memory usage:
Dynamic space usage is: 31,448,288 bytes.
Read-only space usage is: 6,352 bytes.
Static space usage is: 5,472 bytes.
Control stack usage is: 7,312 bytes.
Binding stack usage is: 480 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.
Breakdown for dynamic space:
3,565,552,112 bytes for 19,595,087 simple-character-string objects.
486,929,728 bytes for 5,092,116 instance objects.
298,242,320 bytes for 9,637,350 other objects.
4,350,724,160 bytes for 34,324,553 dynamic objects (space total.)
Okay, it seems the total memory used is about 8GB... and here's
another truncated long unsigned int bug :).
It seems the number of dynamic objects is about double the number of
simple-character-string objects, and both numbers are huge.
It's very likely that my lru-cache blew it up or made the allocation
pattern very bad (I really see the problem now).
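
For comparison, here is a minimal sketch of a cache with a hard size cap. It is not the cache from my program; all names (BOUNDED-CACHE, CACHE-GET, CACHE-PUT) are hypothetical, and the eviction is a simple linear scan rather than anything efficient:

```lisp
;; Hypothetical sketch of a size-bounded LRU cache.
;; Each entry is a cons (VALUE . TICK), where TICK records the last access.
(defstruct (bounded-cache (:constructor make-bounded-cache (capacity)))
  (capacity 1000 :type (integer 1))
  (table (make-hash-table :test #'equal))
  (tick 0 :type unsigned-byte))

(defun cache-get (cache key)
  "Return the cached value and T, refreshing its recency; NIL and NIL on a miss."
  (let ((entry (gethash key (bounded-cache-table cache))))
    (if entry
        (progn (setf (cdr entry) (incf (bounded-cache-tick cache)))
               (values (car entry) t))
        (values nil nil))))

(defun cache-put (cache key value)
  "Store VALUE under KEY, evicting the least recently used entry when full."
  (let ((table (bounded-cache-table cache)))
    (when (and (not (gethash key table))
               (>= (hash-table-count table) (bounded-cache-capacity cache)))
      (let (oldest-key oldest-tick)
        ;; Linear scan for the smallest tick: O(n) per eviction, fine for a sketch.
        (maphash (lambda (k entry)
                   (when (or (null oldest-tick) (< (cdr entry) oldest-tick))
                     (setf oldest-key k
                           oldest-tick (cdr entry))))
                 table)
        (remhash oldest-key table)))
    (setf (gethash key table)
          (cons value (incf (bounded-cache-tick cache))))
    value))
```

The point is only that the entry count can never exceed the capacity, so the cache cannot grow without bounds; whether it also gives the GC a friendlier allocation pattern is something I would still have to measure.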
> B. a HASH-TABLE that should be weak, but has misspecified its
> weakness -- :VALUE when it should be :KEY, etc.
I'll check this.
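
As a note to myself for that check, a small sketch of how the weakness is specified in SBCL. The two table names are hypothetical; the :WEAKNESS argument and SB-EXT:HASH-TABLE-WEAKNESS are SBCL extensions:

```lisp
;; Hypothetical tables illustrating the two weakness settings.
;; With :WEAKNESS :KEY an entry may be dropped once its KEY is otherwise
;; unreachable; with :WEAKNESS :VALUE, once its VALUE is otherwise unreachable.
(defvar *metadata-by-object*
  (make-hash-table :test #'eq :weakness :key))

(defvar *object-by-id*
  (make-hash-table :test #'eql :weakness :value))

(defun describe-weakness (table)
  "Return the weakness setting of TABLE (NIL for an ordinary strong table)."
  (sb-ext:hash-table-weakness table))
```

Calling DESCRIBE-WEAKNESS on each table in the program should show quickly whether any of them has the wrong setting.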
> C. a HASH-TABLE created using a large :REHASH-SIZE -- this is
> virtually always a bad idea, especially if the number is a float...
I don't understand why a floating-point :rehash-size is a problem. Could
you elaborate a little more?
>> 3) What's the best practice of memory management for long-running programs
>> in sbcl? Do full GC periodically?
> See above.
Great! I'll collect the logs first.
Cheers and thanks for the help again!
黄 澗石 (Jianshi Huang)