From: Malcolm R. <mal...@gm...> - 2009-07-30 12:27:42
|
Hi guys,

I'm processing a huge (~4.5 Gigs) file which contains a bunch of logged web requests in csv format. I'm using the csv-parser library (http://cl-user.net/asp/libs/csv-parser) to read each line, then running a function on it which produces a double-float array, and writing each of these numeric arrays to an output file. For what it's worth, this is technically feature extraction for the eventual goal of doing some Machine Learning with the data, but that's not really relevant.

The code to do this is found at this paste: http://paste.lisp.org/display/84428#1 - the main paste includes a macro for a very simple DSL for specifying which features you want to extract out of a line of data (once the csv-parser library processes the file, each line becomes an ordinary list). The annotation includes a specific instance of features. This is my first attempt at any macro stuff like this, but I'm fairly sure it works well on smaller files etc, and the code seems pretty clear to understand. Crucially, because I reuse *scansafe-feature-vector* it doesn't appear to me that I am using vast amounts of memory. I guess csv-parser may well contain some memory leaks, but anyway...

It's worth pointing out that this is for my MSc project, in collaboration with Scansafe (a web security firm). The code I'm posting is my own, but I'm under NDA so can't provide the data files I'm processing to anyone, as they contain Scansafe customer information.

After successfully converting lots of smaller data files I tried it on the big file (8 million lines or so). After running for 15 minutes, in which time about 137000 of the lines were processed, I got the following:

CNGH> (process-scansafe-file "/External/scansafe/ucl_all_may_10all_edited")
[12:29:58]:CNGH:Converting csv file /External/scansafe/ucl_all_may_10all_edited.csv into sexp file /External/scansafe/ucl_all_may_10all_edited.lisp
[12:30:12]:CNGH:1000 processed, entries are 107 long...
[12:30:32]:CNGH:2000 processed, entries are 107 long...
[ ... snip ... ]
[12:45:36]:CNGH:136000 processed, entries are 107 long...
[12:45:39]:CNGH:137000 processed, entries are 107 long...
[12:45:43]:mmap: Cannot allocate memory
mmap: Cannot allocate memory
mmap: Cannot allocate memory
[ ... snip, prints this line a couple thousand times ... ]
mmap: Cannot allocate memory
mmap: Cannot allocate memory
INFO: Control stack guard page unprotected
mmap: Cannot allocate memory
mmap: Cannot allocate memory
fatal error encountered in SBCL pid 37763:
Unhandled memory fault. Exiting.
Error opening /dev/tty: Device not configured
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> sbcl(37763,0xb0081000) malloc: *** mmap(size=16384) failed (error code=12)
sbcl(37763,0xb0081000) malloc: *** error: can't allocate region
sbcl(37763,0xb0081000) malloc: *** set a breakpoint in malloc_error_break to debug

This is 1.0.30.4, built from git yesterday (maybe the day before yesterday, not sure) on OS X 10.5.6, on a first-generation Intel MacBook, Core Duo, 2 Gigs RAM.

I've no experience with the low-level debugger, so I don't know if there is any diagnostic information I can acquire from this (Emacs is still running, but I need to head out with my laptop soon and the battery often dies on the road, so if anyone wants me to try typing stuff at ldb then speak soon). Clearly there is some problem with memory allocation, but as far as I can see my code isn't using that much memory at all. Is it likely that csv-parser or SBCL is to blame?

Thanks in advance for any help,

Malcolm
|
From: Nikodemus S. <nik...@ra...> - 2009-07-30 13:25:40
|
2009/7/30 Malcolm Reynolds <mal...@gm...>:

> [12:45:36]:CNGH:136000 processed, entries are 107 long...
> [12:45:39]:CNGH:137000 processed, entries are 107 long...
> [12:45:43]:mmap: Cannot allocate memory
> mmap: Cannot allocate memory
> mmap: Cannot allocate memory

This is unlikely to be related to normal Lisp memory allocation. Most likely this is from

  perror("mmap");

in os_validate in src/runtime/bsd-os.c -- but the callers are the thing that is of interest.

Pretty much everywhere in SBCL where this is called at runtime, an error should be signalled if the call fails. Do you have a toplevel handler that might be silently handling these (or eg. IGNORE-ERRORS somewhere)?

  (SETF *BREAK-ON-SIGNALS* 'ERROR)

should help in locating the problem. If you see these messages and are not getting any conditions that would help narrow it down, then the next task is to make sure that it is indeed from os_validate...

> [ ... snip, prints this line a couple thousand times ... ]
>
> mmap: Cannot allocate memory
> mmap: Cannot allocate memory
> INFO: Control stack guard page unprotected

At this point the stack has overflowed. Again, you should be getting a STORAGE-CONDITION for this.

> mmap: Cannot allocate memory
> mmap: Cannot allocate memory
> fatal error encountered in SBCL pid 37763:
> Unhandled memory fault. Exiting.

I would guess that this is from overflowing the stack some more after the guard page has been triggered.

From the description of your program, my first guesses as to proximate causes would be that somehow a huge number of stream buffers are allocated -- and they either remain in use, or are somehow not released. Getting a better idea of where the mmap failures are coming from would go a long way towards verifying this.

As for the stack overflow: maybe you are recursing in an error handler? Eg.

  (defun log-condition (c)
    (handler-bind ((error (lambda (c) (log-condition c))))
      (format *standard-output* "ERROR: ~A~%" c)))

If the error occurs due to not being able to allocate an IO buffer, then trying to log the error almost certainly causes the same error again -- ad infinitum, till you run out of stack space, try to report that, and finally die.

So, first task: find out where the mmap is first failing. If it is in the stream system, as I am guessing, then we need to figure out why SBCL is using so many buffers and why they are not getting released.

Cheers,

-- Nikodemus
|
From: Malcolm R. <mal...@gm...> - 2009-07-30 16:44:06
|
A quick note to anyone wondering where the paste I referenced has gone - I had to request its deletion, as the variable names listed for each feature apparently constituted a breach of my NDA with Scansafe. I will repost the code shortly in a hopefully non-infringing form, and reply to Nikodemus' post in full after that.

Malcolm

On Thu, Jul 30, 2009 at 2:17 PM, Nikodemus Siivola <nik...@ra...> wrote:
> [ ... snip, Nikodemus' message above quoted in full ... ]
|
From: Martin C. <cra...@co...> - 2009-07-30 17:01:02
|
Malcolm Reynolds wrote on Thu, Jul 30, 2009 at 05:43:30PM +0100:
> >> [12:45:36]:CNGH:136000 processed, entries are 107 long...
> >> [12:45:39]:CNGH:137000 processed, entries are 107 long...
> >> [12:45:43]:mmap: Cannot allocate memory
> >> mmap: Cannot allocate memory
> >> mmap: Cannot allocate memory
> >
> > This is unlikely to be related to normal Lisp memory allocation. Most
> > likely this is from
> >
> > perror("mmap");

That reminds me of a request: we should change all these perror("something") to perror("something(sbcl)").

The reason is that this would allow us to tell failing system calls in the SBCL runtime apart from those in random user or library code.

Martin
-- 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin Cracauer <cra...@co...> http://www.cons.org/cracauer/
FreeBSD - where you want to go, today. http://www.freebsd.org/
|
From: Brian M. <br...@ma...> - 2009-07-30 17:23:04
|
On Jul 30, 2009, at 12:00 PM, Martin Cracauer wrote:
> Malcolm Reynolds wrote on Thu, Jul 30, 2009 at 05:43:30PM +0100:
>>>> [12:45:36]:CNGH:136000 processed, entries are 107 long...
>>>> [12:45:39]:CNGH:137000 processed, entries are 107 long...
>>>> [12:45:43]:mmap: Cannot allocate memory
>>>> mmap: Cannot allocate memory
>>>> mmap: Cannot allocate memory
>>>
>>> This is unlikely to be related to normal Lisp memory allocation. Most
>>> likely this is from
>>>
>>> perror("mmap");
>
> That reminds me of a request:
>
> we should change all these perror("something") to
> perror("something(sbcl)")
>
> The reason is that this would allow us to tell failing system calls
> from the SBCL runtime versus those from random user or library code.

Why not also use __FILE__ and __LINE__ in these messages? That'd help track down the call even more specifically.

-- 
Brian Mastenbrook
br...@ma...
http://brian.mastenbrook.net/
|
From: Malcolm R. <mal...@gm...> - 2009-07-30 17:23:29
|
Okay, the paste is back at http://paste.lisp.org/display/84446#2 , hopefully in a form which won't result in any legal tonnes of bricks on my head.

Responding to Nikodemus:

> In pretty much everywhere in SBCL where this is called at runtime, an
> error should be signalled if the call fails. Do you have a toplevel
> handler that might be silently handling these (or eg. IGNORE-ERRORS
> somewhere)?

I do have IGNORE-ERRORS, basically because the data I have to work with is at times wildly inconsistent. My first attempt at the code worked fine until it found some entry where numbers that were supposed to be floats were "NaN" or "Infinity", which caused parse-integer to error out. Rather than add ad hoc fixes for every bit of random broken data I find, I decided to wrap the number parsing in ignore-errors, and if anything bad happened just use 0.0d0 instead. Hence:

(:float `(progn
           (dbg :feat-extract "i=~A float feature #~A:~A~%"
                i ,(first spec) ,(third spec))
           (setf (aref ,feature-vector i)
                 (or (ignore-errors
                       (coerce-double
                        (parse-number:parse-real-number (pop ,input-vector))))
                     0.0d0))
           (incf i)))

and similar code for integers. I appreciate this is in no way the Right Thing to do, but I've only just looked at the chapter of PCL which covers conditions. I'll try to be less lazy about this and see if the source of the problem becomes more apparent.

I think I can be pretty sure that I'm not recursing in error handling, though - the only thing in my code related to errors is a bunch of ignore-errors calls, and they are all inside an (or (ignore-errors ...) 0.0d0) form. You say that it could be due to allocating too many stream buffers - surely my use of WITH-OPEN-FILE is fairly textbook.. I will investigate the csv-parser library, as it is only one .lisp file I think. I'm not entirely sure what I should look out for though.. I'll report back if I can reproduce after I've removed the ignore-errors from my code.
Malcolm

> [ ... snip, remainder of Nikodemus' message quoted in full ... ]
|
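[A narrower alternative to the blanket IGNORE-ERRORS above - essentially the HANDLER-CASE change Malcolm describes later in the thread - might look like the following sketch. PARSE-FLOAT-OR-ZERO is a hypothetical helper, not from his paste; it assumes that the condition parse-number signals on malformed input is a subtype of PARSE-ERROR, which should be verified against the library version in use.]

  ;; Hypothetical helper: guard only the parse itself. Malformed input
  ;; ("NaN", "Infinity", etc.) falls back to 0.0d0, while any unrelated
  ;; error -- e.g. a failing stream-buffer allocation -- still propagates.
  (defun parse-float-or-zero (string)
    (handler-case
        (coerce (parse-number:parse-real-number string) 'double-float)
      (parse-error () 0.0d0)))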
From: Attila L. <att...@gm...> - 2009-07-30 23:27:21
|
> I do have IGNORE-ERRORS, basically because the data I have to work
> with is at times wildly inconsistent. My first attempt at the code

ignore-errors is evil. the time you'll waste on debugging misbehavior hidden by it easily exceeds the time you would need to make the code so that it's not needed. at least that was my experience, and i soon stopped using such constructs except at very short-lived dynamic extents, like an (ignore-errors (parse-integer ...))

just some 0.02

-- attila
|
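[Attila's distinction can be made concrete with a sketch - hypothetical accessors, not code from the thread: a broad IGNORE-ERRORS also swallows errors that have nothing to do with parsing, while the short-lived form he describes does not.]

  ;; Broad guard: a bad index is silently turned into the default value,
  ;; hiding a real bug along with the parse failures.
  (defun field-broad (row idx)
    (or (ignore-errors (parse-integer (elt row idx))) 0))

  ;; Narrow guard: the ELT call is outside the guarded extent, so an
  ;; out-of-bounds index still signals; only the parse is defaulted.
  (defun field-narrow (row idx)
    (let ((cell (elt row idx)))
      (or (ignore-errors (parse-integer cell)) 0)))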
From: Malcolm R. <mal...@gm...> - 2009-07-30 23:49:58
|
> ignore-errors is evil.
>
> the time you'll waste on debugging misbehavior hidden by it, easily
> exceeds the time you would need to make the code so that it's not
> needed. at least that was my experience and i soon stopped using such
> constructs except at very short lived dynamic extents like an
> (ignore-errors (parse-integer ...))

Yeah, I'm starting to realise this.. Once I read how handler-case worked, it was simple to change the code to catch only the parse errors. That's running now; it's been going for several hours and has nearly processed all my 8 million records, with no crashes or glitches of any kind. I was expecting the memory error to occur again, just in a more readable form, but there have been no problems.

I don't know whether this is possible, but could it be that some low-level code was signalling that it was low on memory, and the high-level code would normally respond by calling the garbage collector, but those signals weren't getting through due to the ignore-errors statement? I'm glad things appear to be working fine now, just curious as to what the underlying cause may have been.

I'm on a bit of a tight schedule with my dissertation right now, but when I have time I'll try to come back and make the crash reproducible, as I'm keen to help diagnose this for the SBCL developers if possible.

Malcolm
|
From: Nikodemus S. <nik...@ra...> - 2009-07-31 08:29:18
|
2009/7/31 Malcolm Reynolds <mal...@gm...>:

> Yeah I'm starting to realise this.. Once I read how handler-case
> worked it was simple to change the code to only catch the parse
> errors.. That's running now and it's been going several hours and has
> nearly processed all my 8 million records. No crashes or glitches of
> any kind. I was expecting the memory error to occur again but just to
> be a bit more readable, but there's been no problems.

Curious! Paul Khuong speculated that your problem might be related to a leak in Darwin signal handlers, but it is not obvious to me how getting rid of IGNORE-ERRORS and replacing it with a more selective construct would change this.

> whether this is possible, but could it be the case that some low level
> code was signalling that it was low on memory, the high level code was
> responding by making it call the garbage collector, but then these
> signals weren't getting through before due to the ignore-errors
> statement?

Nope: Lisp-level conditions are not used to drive the garbage collector.

Cheers,

-- Nikodemus
|