From: Jianshi H. <jia...@gm...> - 2011-02-14 05:04:09
|
Hi, My program crashed with the following messages: Heap exhausted during garbage collection: 448 bytes available, 512 requested. Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB LUB !move Alloc Waste Trig WP GCs Mem-age 0: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000 1: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000 2: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000 3: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000 4: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000 5: 47086 198400 0 0 130093 907945 417 757 0 4113377392 143234960 3988343744 0 19 0.9691 6: 0 0 0 0 18880 3778 0 0 0 92807168 0 2000000 18630 0 0.0000 Total bytes allocated = 8305371856 Dynamic-space-size bytes = 4294901760 GC control variables: *GC-INHIBIT* = true *GC-PENDING* = in progress *STOP-FOR-GC-PENDING* = false fatal error encountered in SBCL pid 29148(tid 140737209779968): Heap exhausted, game over. It seems SBCL try to allocated more than 8GB?! of memory and died since max heap size is 8GB. My program usually consumes 700MB~1GB for data, so others memory used should be consumed by garbages. My questions are: 1) What does " Dynamic-space-size bytes" mean here? It's about half the size of maximum heap 2) My program processes data from streams so objects get created and collected constantly, is it the case that the youngest generations of heap kept expanding and finally exceeded the limit? If so, is there a way to compact/resize the generations?(a full GC?) 3) What's the best practice of memory management for long-run programs in sbcl? Do full GC periodically? P.S. I already made my program to restart everyday, to avoid any memory problems. Surely I can fix it by enlarge the heap size to 16GB... but using that much amount of memory is a little bit scary. :) Any advices will be helpful. Cheers, -- 黄 澗石 (Jianshi Huang) |
From: Nikodemus S. <nik...@ra...> - 2011-02-14 08:57:41
|
On 14 February 2011 07:04, Jianshi Huang <jia...@gm...> wrote: > Heap exhausted during garbage collection: 448 bytes available, 512 requested. > Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB LUB !move Alloc > Waste Trig WP GCs Mem-age > 0: 0 0 0 0 0 0 0 0 0 0 > 0 2000000 0 0 0.0000 > 1: 0 0 0 0 0 0 0 0 0 0 > 0 2000000 0 0 0.0000 > 2: 0 0 0 0 0 0 0 0 0 0 > 0 2000000 0 0 0.0000 > 3: 0 0 0 0 0 0 0 0 0 0 > 0 2000000 0 0 0.0000 > 4: 0 0 0 0 0 0 0 0 0 0 > 0 2000000 0 0 0.0000 > 5: 47086 198400 0 0 130093 907945 417 757 0 > 4113377392 143234960 3988343744 0 19 0.9691 > 6: 0 0 0 0 18880 3778 0 0 0 92807168 > 0 2000000 18630 0 0.0000 > Total bytes allocated = 8305371856 > Dynamic-space-size bytes = 4294901760 > GC control variables: > *GC-INHIBIT* = true > *GC-PENDING* = in progress > *STOP-FOR-GC-PENDING* = false > fatal error encountered in SBCL pid 29148(tid 140737209779968): > Heap exhausted, game over. > > > It seems SBCL try to allocated more than 8GB?! of memory and died > since max heap size is 8GB. My program usually consumes 700MB~1GB for > data, so others memory used should be consumed by garbages. > > My questions are: > > 1) What does " Dynamic-space-size bytes" mean here? It's about half > the size of maximum heap It is actually the size of the heap, but the fprintf statement in the runtime appears to have %u where it should have %lu ... so that it has gotten truncated by accident. Argh. But it does stand for the heap-size, so you can disregard that. > 2) My program processes data from streams so objects get created and > collected constantly, is it the case that the youngest generations of > heap kept expanding and finally exceeded the limit? If so, is there a > way to compact/resize the generations?(a full GC?) Youngest generations cannot expand without bounds: objects instead get migrated older generations when necessary. From the heap-map printed we can see all data has been migrated to the final untenured generation by the time the heap is exhausted. (The final generation gets collected too, so that's no a problem.) Without knowing more about you allocation patterns it's hard to hazard a guess, but: (define-alien-routine print-generation-stats void) and you can call (print-generation-stats) to print the heap map at any time to stderr of the process, so you can see what is happening -- are things slowly accumulating in older generations over the lifetime of your application, or is this is a sudden collapse, etc. I'll start from the assumption that you're not accidentally accumulating things in your application. :) There are two major reasons for running out of memory through no fault of your own: 1. Getting bitten by SBCL's conservativism (stack and registers.) However, in a long-running application the effect of conservativism typically isn't the issue: if it was the problem, it should not cause slow accumulation of uncollected garbage, but it should rather keep a static amount of garbage uncollected. So while the following should not apply to you, I'll put it here for lurkers to read: If your application produces large multipart structures (long lists, trees of objects -- especially if those tree contain backpointers), it can help to break links in those structures when they are no longer needed. The problem with "large multipart structures" is that there is a large number of objects in the heap where conservatively retaining any of them could cause many others to be retained as well. Of course large flat structures like big SIMPLE-VECTORs aren't immune to this either, but since they aren't preserved by interior pointers they are no more likely to be retained by accident than a single CONS cell ... and if you have million elements in a list, you have a million potential false positives on average responsible for retaining 0.5 million other things. So in practise it is large structures composed of great many smaller parts are only ever a problem. (Which is not to say they are automatically a problem -- again, it really depends on the application.) 2. Doing *something* which causes SBCL to keep accumulating something in its internals due to a known or unknown bug. For example: (defun foo () (let ((name (gensym))) (setf (fdefinition name) #'foo) (fmakunbound name))) (loop (foo)) will eventually exhause the heap as even though FMAKUNBOUND removes the function binding, it leaves the *name* in SBCL's globaldb. EQL-specializers and EQL-specialized methods in CLOS are another known leak. I can't from the top my head think of other known issues, but maybe there is an unknown one that is biting you? 4. Getting bitten by the generational trap. Let's say you have "cyclic" application -- a workload comes in, you process it, then repeat from start with another workload. Let's say that processing a single workload involves on average 1 minor GC in which large amounts of the data can be live. These minor GCs initially only collect the nursery, promoting live objects to generation 1 -- where uncollected garbage from earlier cycles keeps accumulating till a collection is triggered for it as well. When this happens, *first* the nursery is collected into gen 1. Then gen 1 is collected into gen 2. So now live objects from this cycle have ended up in gen 2 -- where slowly in this way uncollected garbage accumulates till a collection is triggered for gen 2. When this happens, first nursery is collected into gen 1. Then gen 1 into gen 2. Then gen 2 into gen 3... so now live objects from this cycle ended up in gen 3, where they accumulate till an even deeper collection is triggered. This keeps going on till final generation is reached. When that happens, it is collected but not promoted, breaking the chain of promotion. Now, given a "bad" allocation pattern, it may be that you exhaust the heap due to uncollected garbage in older generations before a collection deep enough to collect that garbage is triggered. Based on the description of your application, I suspect this may be happening to you. In this case forcing a full collection every cycle (or every few cycles) should help -- watching the PRINT-GENERATION-STATS should tell you if this is the case, and how often you should force a full GC. ... As for bad stuff your application (or a library you depend on) could be doing: A. a cache or memoization that keeps growing without bounds? B. a HASH-TABLE that should be weak, but has misspecified it's weakness -- :VALUE when it should be :KEY, etc. C. a HASH-TABLE created using a large :REHASH-SIZE -- this is virtually almost a bad idea, especially if the number is a float... > 3) What's the best practice of memory management for long-run programs > in sbcl? Do full GC periodically? See above. Cheers, -- Nikodemus |
From: Jianshi H. <jia...@gm...> - 2011-02-14 10:59:55
|
Hi Nikodemus, Thanks very much for such a informative reply. On Mon, Feb 14, 2011 at 5:57 PM, Nikodemus Siivola <nik...@ra...> wrote: > Youngest generations cannot expand without bounds: objects instead get > migrated older generations when necessary. From the heap-map printed > we can see all data has been migrated to the final untenured > generation by the time the heap is exhausted. (The final generation > gets collected too, so that's no a problem.) > I see. This indicates a full GC should be a quick fix. > Without knowing more about you allocation patterns it's hard to hazard > a guess, but: > > (define-alien-routine print-generation-stats void) > > and you can call (print-generation-stats) to print the heap map at any > time to stderr of the process, so you can see what is happening -- are > things slowly accumulating in older generations over the lifetime of > your application, or is this is a sudden collapse, etc. > This is nice, but could you make the output target from stderr(fd=2) to sb-sys:*stderr*? Or something like sb-vm:memory-usage, so I can rebind the output target. > I'll start from the assumption that you're not accidentally > accumulating things in your application. :) > Yes, it shouldn't be my case, but who knows, it might be a bug in my program. I'll check, thanks! Also low possibilities in the libraries, the one that is heavily used is CLSQL and I think it's reasonably stable and robust. > 2. Doing *something* which causes SBCL to keep accumulating something > in its internals due to a known or unknown bug. For example: > > (defun foo () > (let ((name (gensym))) > (setf (fdefinition name) #'foo) > (fmakunbound name))) > > (loop (foo)) > > will eventually exhause the heap as even though FMAKUNBOUND removes > the function binding, it leaves the *name* in SBCL's globaldb. > EQL-specializers and EQL-specialized methods in CLOS are another known > leak. I can't from the top my head think of other known issues, but > maybe there is an unknown one that is biting you? > It could be this tricky case. I've never thought about that, thanks! There's no dynamic code generation in my program and the only part which is heavily used is the reader, and the messages are plain structs of built-in types. > 4. Getting bitten by the generational trap. > Yes, I think that's the most possible case. > Let's say you have "cyclic" application -- a workload comes in, you > process it, then repeat from start with another workload. Let's say > that processing a single workload involves on average 1 minor GC in > which large amounts of the data can be live. > > These minor GCs initially only collect the nursery, promoting live > objects to generation 1 -- where uncollected garbage from earlier > cycles keeps accumulating till a collection is triggered for it as > well. > > When this happens, *first* the nursery is collected into gen 1. Then > gen 1 is collected into gen 2. So now live objects from this cycle > have ended up in gen 2 -- where slowly in this way uncollected garbage > accumulates till a collection is triggered for gen 2. > > When this happens, first nursery is collected into gen 1. Then gen 1 > into gen 2. Then gen 2 into gen 3... so now live objects from this > cycle ended up in gen 3, where they accumulate till an even deeper > collection is triggered. > > This keeps going on till final generation is reached. When that > happens, it is collected but not promoted, breaking the chain of > promotion. > > Now, given a "bad" allocation pattern, it may be that you exhaust the > heap due to uncollected garbage in older generations before a > collection deep enough to collect that garbage is triggered. > > Based on the description of your application, I suspect this may be > happening to you. In this case forcing a full collection every cycle > (or every few cycles) should help -- watching the > PRINT-GENERATION-STATS should tell you if this is the case, and how > often you should force a full GC. > Yes, that makes a lot sense. I'll do it after I get some feedback from PRINT-GENERATION-STATS. > ... > > As for bad stuff your application (or a library you depend on) could be doing: > > A. a cache or memoization that keeps growing without bounds? > I just remembered (my bad) that I had some primitive memory usage message printed every 30 mins in the log... Here it is, 2011-02-10T09:43:57 Memory usage: Dynamic space usage is: 31,448,288 bytes. Read-only space usage is: 6,352 bytes. Static space usage is: 5,472 bytes. Control stack usage is: 7,312 bytes. Binding stack usage is: 480 bytes. Control and binding stack usage is for the current thread only. Garbage collection is currently enabled. Breakdown for dynamic space: 3,565,552,112 bytes for 19,595,087 simple-character-string objects. 486,929,728 bytes for 5,092,116 instance objects. 298,242,320 bytes for 9,637,350 other objects. 4,350,724,160 bytes for 34,324,553 dynamic objects (space total.) Okay, it seems the total memory used is about 8GB... and here's another truncated long unsigned int bug :). It seems the number of dynamic objects is about double the number of simple-character-string. And both numbers are huge. It's very likely that my lru-cache blew it up or made the allocation pattern very bad (I really see the problem now). > B. a HASH-TABLE that should be weak, but has misspecified it's > weakness -- :VALUE when it should be :KEY, etc. > I'll check this. > C. a HASH-TABLE created using a large :REHASH-SIZE -- this is > virtually almost a bad idea, especially if the number is a float... > I don't understand why float number :rehash-size is a problem. Could you elaborate a little bit more? >> 3) What's the best practice of memory management for long-run programs >> in sbcl? Do full GC periodically? > > See above. > > Cheers, > Great! I'll take the log first. Cheers and thanks for the help again! -- 黄 澗石 (Jianshi Huang) |
From: Jianshi H. <jia...@gm...> - 2011-02-14 11:12:44
|
Hi, Just another question, in what situation will GC be triggered for last generation? -- 黄 澗石 (Jianshi Huang) |
From: Nikodemus S. <nik...@ra...> - 2011-02-14 12:08:06
|
On 14 February 2011 13:12, Jianshi Huang <jia...@gm...> wrote: > Just another question, in what situation will GC be triggered for last > generation? It's a surprisingly hard question to answer accurately. GCs in general are triggered when (BYTES-CONSED-BETWEEN-GCS) threshold is exceeded. This triggers a nursery -> gen 1 collection. Older generations are collected when criteria for that generation to be collected is reached. The criteria include average age of objects in generation, number of bytes added to the generation since it was last collected, etc. These can be introspected and tuned, though doing that is something of a black art. See "Introspection and Tuning" in: http://www.sbcl.org/manual/#Garbage-Collection Cheers, -- Nikodemus |
From: Juho S. <js...@ik...> - 2011-02-14 18:13:02
|
On Mon, Feb 14, 2011 at 11:59 AM, Jianshi Huang <jia...@gm...>wrote: > Breakdown for dynamic space: > 3,565,552,112 bytes for 19,595,087 simple-character-string objects. > 486,929,728 bytes for 5,092,116 instance objects. > 298,242,320 bytes for 9,637,350 other objects. > 4,350,724,160 bytes for 34,324,553 dynamic objects (space total.) > > > Okay, it seems the total memory used is about 8GB... and here's > another truncated long unsigned int bug :). > > It seems the number of dynamic objects is about double the number of > simple-character-string. And both numbers are huge. > Not quite. The "dynamic objects" line is just a sum of the preceeding lines. So your total memory use is a bit over 4GB, of which 3.5GB is strings. However, during garbage collection the real memory use is going to be higher (since the GC might need to keep both the original data and new copies around simultaneously), in the worst case by a factor of two. -- Juho Snellman |
From: Jianshi H. <jia...@gm...> - 2011-02-14 21:53:04
|
Hi Juho, On Tue, Feb 15, 2011 at 3:12 AM, Juho Snellman <js...@ik...> wrote: >> Breakdown for dynamic space: >> 3,565,552,112 bytes for 19,595,087 simple-character-string objects. >> 486,929,728 bytes for 5,092,116 instance objects. >> 298,242,320 bytes for 9,637,350 other objects. >> 4,350,724,160 bytes for 34,324,553 dynamic objects (space total.) >> >> >> Okay, it seems the total memory used is about 8GB... and here's >> another truncated long unsigned int bug :). >> >> It seems the number of dynamic objects is about double the number of >> simple-character-string. And both numbers are huge. > > Not quite. The "dynamic objects" line is just a sum of the preceeding lines. > So your total memory use is a bit over 4GB, of which 3.5GB is strings. > However, during garbage collection the real memory use is going to be higher > (since the GC might need to keep both the original data and new copies > around simultaneously), in the worst case by a factor of two. Ah, yes, you're right, thanks for pointing it out. -- 黄 澗石 (Jianshi Huang) |
From: Jianshi H. <jia...@gm...> - 2011-02-15 01:39:18
|
On Mon, Feb 14, 2011 at 7:59 PM, Jianshi Huang <jia...@gm...> wrote: > > I see. This indicates a full GC should be a quick fix. > I added periodic full GC (every 30 mins) to my program. Unfortunately, my program crashed 6 times since last night due to GC fatal error. My program has about 10 threads. Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB LUB !move Alloc Waste Trig WP GCs Mem-age 0: 26436 0 0 0 646 24 0 0 27 2671424 72896 2219568 0 1 0.0000 1: 33065 33090 0 0 216 554 65 99 124 3660096 165568 4012544 155 1 2.8168 2: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000 3: 0 0 0 0 1284 8296 160 292 62 39741408 1349664 41741408 1380 1 0.0000 4: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000 5: 0 0 0 0 4278 18120 1545 2386 127 104847184 2996400 106847184 5696 2 0.0000 6: 0 0 0 0 20644 3997 0 0 0 100929536 0 2000000 20424 0 0.0000 Total bytes allocated = 251849648 Dynamic-space-size bytes = 4294901760 Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB LUB !move Alloc Waste Trig WP GCs Mem-age fatal error encountered in SBCL pid 10469(tid 140737261508944): GC invariant lost, file "gencgc.c", line 490 If anybody can show me some instructions, I'd be happy to debug this problem. Cheers, -- 黄 澗石 (Jianshi Huang) |
From: Nikodemus S. <nik...@ra...> - 2011-02-15 08:59:03
|
On 15 February 2011 03:39, Jianshi Huang <jia...@gm...> wrote: > fatal error encountered in SBCL pid 10469(tid 140737261508944): > GC invariant lost, file "gencgc.c", line 490 Aha, this is my bad. I thought I had fixed this. The issue is that print_generation_stats() has an assert in it, which assumes that no other threads are running when it is called. > If anybody can show me some instructions, I'd be happy to debug this > problem. You can remove the gc_assert on line 490 in gencgc.c without ill effects, and rebuild SBCL. Cheers, -- Nikodemus |
From: Jianshi H. <jia...@gm...> - 2011-02-15 09:26:46
|
On Tue, Feb 15, 2011 at 5:58 PM, Nikodemus Siivola <nik...@ra...> wrote: > You can remove the gc_assert on line 490 in gencgc.c without ill > effects, and rebuild SBCL. > I see. Will try again tonight :) Cheers, -- 黄 澗石 (Jianshi Huang) |