From: Malcolm R. <mal...@gm...> - 2009-07-30 12:27:42
|
Hi guys,

I'm processing a huge (~4.5 Gigs) file which contains a bunch of logged web requests in csv format. I'm using the csv-parser library (http://cl-user.net/asp/libs/csv-parser) to read each line, then running a function on it which produces a double-float array, and writing each of these numeric arrays to an output file. For what it's worth, this is technically feature extraction for the eventual goal of doing some Machine Learning with the data, but that's not really relevant.

The code to do this is found at this paste: http://paste.lisp.org/display/84428#1 - the main paste includes a macro for a very simple DSL for specifying which features you want to extract out of a line of data (once the csv-parser library processes the file, each line becomes an ordinary list). The annotation includes a specific instance of features. This is my first attempt at any macro stuff like this, but I'm fairly sure it works well on smaller files etc, and the code seems pretty clear to understand. Crucially, because I reuse *scansafe-feature-vector* it doesn't appear to me that I am using vast amounts of memory. I guess csv-parser may well contain some memory leaks, but anyway...

It's worth pointing out that this is for my MSc project, in collaboration with Scansafe (a web security firm). The code I'm posting is my own, but I'm under NDA so can't provide the data files I'm processing to anyone, as they contain Scansafe customer information.

After successfully converting lots of smaller data files I tried it on the big file (8 million lines or so). After running for 15 minutes, in which time about 137000 of the lines were processed, I got the following:

CNGH> (process-scansafe-file "/External/scansafe/ucl_all_may_10all_edited")
[12:29:58]:CNGH:Converting csv file /External/scansafe/ucl_all_may_10all_edited.csv into sexp file /External/scansafe/ucl_all_may_10all_edited.lisp
[12:30:12]:CNGH:1000 processed, entries are 107 long...
[12:30:32]:CNGH:2000 processed, entries are 107 long...
[ ... snip ... ]
[12:45:36]:CNGH:136000 processed, entries are 107 long...
[12:45:39]:CNGH:137000 processed, entries are 107 long...
[12:45:43]:mmap: Cannot allocate memory
mmap: Cannot allocate memory
mmap: Cannot allocate memory
[ ... snip, prints this line a couple thousand times ... ]
mmap: Cannot allocate memory
mmap: Cannot allocate memory
INFO: Control stack guard page unprotected
mmap: Cannot allocate memory
mmap: Cannot allocate memory
fatal error encountered in SBCL pid 37763:
Unhandled memory fault. Exiting.
Error opening /dev/tty: Device not configured
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> sbcl(37763,0xb0081000) malloc: *** mmap(size=16384) failed (error code=12)
sbcl(37763,0xb0081000) malloc: *** error: can't allocate region
sbcl(37763,0xb0081000) malloc: *** set a breakpoint in malloc_error_break to debug

This is 1.0.30.4, built from git yesterday (maybe the day before yesterday, not sure) on OS X 10.5.6, on a first-generation Intel MacBook, Core Duo, 2 Gigs RAM.

I've no experience with the low-level debugger, so I don't know if there is any diagnostic information I can acquire from this (Emacs is still running, but I need to head out with my laptop soon and the battery often dies on the road, so if anyone wants me to try typing stuff at ldb then speak soon). Clearly there is some problem with memory allocation, but as far as I can see my code isn't using that much memory at all. Is it likely that csv-parser or SBCL is to blame?

Thanks in advance for any help,

Malcolm
|
From: Nikodemus S. <nik...@ra...> - 2009-07-30 13:25:40
|
2009/7/30 Malcolm Reynolds <mal...@gm...>:

> [12:45:36]:CNGH:136000 processed, entries are 107 long...
> [12:45:39]:CNGH:137000 processed, entries are 107 long...
> [12:45:43]:mmap: Cannot allocate memory
> mmap: Cannot allocate memory
> mmap: Cannot allocate memory

This is unlikely to be related to normal Lisp memory allocation. Most likely this is from

  perror("mmap");

in os_validate in src/runtime/bsd-os.c -- but the callers are the thing that is of interest.

Pretty much everywhere in SBCL where this is called at runtime, an error should be signalled if the call fails. Do you have a toplevel handler that might be silently handling these (or eg. IGNORE-ERRORS somewhere)?

  (SETF *BREAK-ON-SIGNALS* 'ERROR)

should help in locating the problem. If you see these messages and are not getting any conditions that would help narrow it down, then the next task is to make sure that it is indeed from os_validate...

> [ ... snip, prints this line a couple thousand times ... ]
>
> mmap: Cannot allocate memory
> mmap: Cannot allocate memory
> INFO: Control stack guard page unprotected

At this point the stack has overflowed. Again, you should be getting a STORAGE-CONDITION for this.

> mmap: Cannot allocate memory
> mmap: Cannot allocate memory
> fatal error encountered in SBCL pid 37763:
> Unhandled memory fault. Exiting.

I would guess that this is from overflowing the stack some more after the guard page has been triggered.

From the description of your program, my first guesses as to proximate causes would be that somehow a huge number of stream buffers are allocated -- and they either remain in use, or are somehow not released. Getting a better idea of where the mmap failures are coming from would go a long way towards verifying this.

As for the stack overflow: maybe you are recursing in an error handler? Eg.

  (defun log-condition (c)
    (handler-bind ((error (lambda (c) (log-condition c))))
      (format *standard-output* "ERROR: ~A~%" c)))

If the error occurs due to not being able to allocate an IO buffer, then trying to log the error almost certainly causes the same error again -- ad infinitum, till you run out of stack space, try to report that, and finally die.

So, first task: find out where the mmap is first failing. If it is in the stream system, as I am guessing, then we need to figure out why SBCL is using so many buffers and why they are not getting released.

Cheers,

-- Nikodemus
|
From: Malcolm R. <mal...@gm...> - 2009-07-30 16:44:06
|
A quick note to anyone wondering where the paste I referenced has gone - I had to request its deletion, as the variable names listed for each feature apparently constituted a breach of my NDA with Scansafe. I will repost the code shortly in a hopefully non-infringing form, and reply to Nikodemus' post in full after that.

Malcolm

On Thu, Jul 30, 2009 at 2:17 PM, Nikodemus Siivola <nik...@ra...> wrote:
> [ ... snip, Nikodemus' message above quoted in full ... ]
|
From: Martin C. <cra...@co...> - 2009-07-30 17:01:02
|
Malcolm Reynolds wrote on Thu, Jul 30, 2009 at 05:43:30PM +0100:
> >> [12:45:36]:CNGH:136000 processed, entries are 107 long...
> >> [12:45:39]:CNGH:137000 processed, entries are 107 long...
> >> [12:45:43]:mmap: Cannot allocate memory
> >> mmap: Cannot allocate memory
> >> mmap: Cannot allocate memory
> >
> > This is unlikely to be related to normal Lisp memory allocation. Most
> > likely this is from
> >
> > perror("mmap");

That reminds me of a request: we should change all these perror("something") to perror("something(sbcl)").

The reason is that this would allow us to tell failing system calls in the SBCL runtime apart from those in random user or library code.

Martin
-- 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin Cracauer <cra...@co...> http://www.cons.org/cracauer/
FreeBSD - where you want to go, today. http://www.freebsd.org/
|
From: Brian M. <br...@ma...> - 2009-07-30 17:23:04
|
On Jul 30, 2009, at 12:00 PM, Martin Cracauer wrote:
> Malcolm Reynolds wrote on Thu, Jul 30, 2009 at 05:43:30PM +0100:
>>>> [12:45:36]:CNGH:136000 processed, entries are 107 long...
>>>> [12:45:39]:CNGH:137000 processed, entries are 107 long...
>>>> [12:45:43]:mmap: Cannot allocate memory
>>>> mmap: Cannot allocate memory
>>>> mmap: Cannot allocate memory
>>>
>>> This is unlikely to be related to normal Lisp memory allocation. Most
>>> likely this is from
>>>
>>> perror("mmap");
>
> That reminds me of a request:
>
> we should change all these perror("something") to
> perror("something(sbcl)")
>
> The reason is that this would allow us to tell failing system calls
> from the SBCL runtime versus those from random user or library code.

Why not also use __FILE__ and __LINE__ in these messages? That'd help track down the call even more specifically.

-- 
Brian Mastenbrook
br...@ma...
http://brian.mastenbrook.net/
|
From: Malcolm R. <mal...@gm...> - 2009-07-30 17:23:29
|
Okay, the paste is back at http://paste.lisp.org/display/84446#2 , hopefully in a form which won't result in any legal tonnes of bricks on my head.

Responding to Nikodemus:

> In pretty much everywhere in SBCL where this is called at runtime, an
> error should be signalled if the call fails. Do you have a toplevel
> handler that might be silently handling these (or eg. IGNORE-ERRORS
> somewhere)?

I do have IGNORE-ERRORS, basically because the data I have to work with is at times wildly inconsistent. My first attempt at the code worked fine until it found some entry where numbers that were supposed to be floats were "NaN" or "Infinity", which caused parse-integer to error out. Rather than add ad hoc fixes for every bit of random broken data I find, I decided to wrap the number parsing in ignore-errors, and if anything bad happened just use 0.0d0 instead. Hence:

(:float `(progn
           (dbg :feat-extract "i=~A float feature #~A:~A~%"
                i ,(first spec) ,(third spec))
           (setf (aref ,feature-vector i)
                 (or (ignore-errors
                       (coerce-double
                        (parse-number:parse-real-number (pop ,input-vector))))
                     0.0d0))
           (incf i)))

and similar code for integers. I appreciate this is in no way the Right Thing to do, but I've only just looked at the chapter of PCL which covers conditions. I'll try to be less lazy about this and see if the source of the problem becomes more apparent.

I think I can be pretty sure that I'm not recursing in error handling, though - the only thing in my code related to errors is a bunch of ignore-errors calls, and they are all inside an (or (ignore-errors ...) 0.0d0) form. You say that it could be due to allocating too many stream buffers - surely my use of WITH-OPEN-FILE is fairly textbook.. I will investigate the csv-parser library, as it is only one .lisp file I think. I'm not entirely sure what I should look out for though.. I'll report back if I can reproduce after I've removed the ignore-errors from my code.
Malcolm

> [ ... snip, remainder of Nikodemus' message quoted in full ... ]
|
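[A narrower alternative to the blanket IGNORE-ERRORS above - essentially the HANDLER-CASE change Malcolm describes later in the thread - might look like the following sketch. PARSE-FLOAT-OR-ZERO is a hypothetical helper, not from his paste; it assumes that the condition parse-number signals on malformed input is a subtype of PARSE-ERROR, which should be verified against the library version in use.]

  ;; Hypothetical helper: guard only the parse itself. Malformed input
  ;; ("NaN", "Infinity", etc.) falls back to 0.0d0, while any unrelated
  ;; error -- e.g. a failing stream-buffer allocation -- still propagates.
  (defun parse-float-or-zero (string)
    (handler-case
        (coerce (parse-number:parse-real-number string) 'double-float)
      (parse-error () 0.0d0)))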
From: Attila L. <att...@gm...> - 2009-07-30 23:27:21
|
> I do have IGNORE-ERRORS, basically because the data I have to work
> with is at times wildly inconsistent. My first attempt at the code

ignore-errors is evil. the time you'll waste on debugging misbehavior hidden by it easily exceeds the time you would need to make the code so that it's not needed. at least that was my experience, and i soon stopped using such constructs except at very short-lived dynamic extents, like an (ignore-errors (parse-integer ...))

just some 0.02

-- attila
|
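[Attila's distinction can be made concrete with a sketch - hypothetical accessors, not code from the thread: a broad IGNORE-ERRORS also swallows errors that have nothing to do with parsing, while the short-lived form he describes does not.]

  ;; Broad guard: a bad index is silently turned into the default value,
  ;; hiding a real bug along with the parse failures.
  (defun field-broad (row idx)
    (or (ignore-errors (parse-integer (elt row idx))) 0))

  ;; Narrow guard: the ELT call is outside the guarded extent, so an
  ;; out-of-bounds index still signals; only the parse is defaulted.
  (defun field-narrow (row idx)
    (let ((cell (elt row idx)))
      (or (ignore-errors (parse-integer cell)) 0)))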
From: Malcolm R. <mal...@gm...> - 2009-07-30 23:49:58
|
> ignore-errors is evil.
>
> the time you'll waste on debugging misbehavior hidden by it, easily
> exceeds the time you would need to make the code so that it's not
> needed. at least that was my experience and i soon stopped using such
> constructs except at very short lived dynamic extents like an
> (ignore-errors (parse-integer ...))

Yeah, I'm starting to realise this.. Once I read how handler-case worked, it was simple to change the code to catch only the parse errors. That's running now; it's been going for several hours and has nearly processed all my 8 million records, with no crashes or glitches of any kind. I was expecting the memory error to occur again, just in a more readable form, but there have been no problems.

I don't know whether this is possible, but could it be that some low-level code was signalling that it was low on memory, and the high-level code would normally respond by calling the garbage collector, but those signals weren't getting through due to the ignore-errors statement? I'm glad things appear to be working fine now, just curious as to what the underlying cause may have been.

I'm on a bit of a tight schedule with my dissertation right now, but when I have time I'll try to come back and make the crash reproducible, as I'm keen to help diagnose this for the SBCL developers if possible.

Malcolm
|
From: Nikodemus S. <nik...@ra...> - 2009-07-31 08:29:18
|
2009/7/31 Malcolm Reynolds <mal...@gm...>:

> Yeah I'm starting to realise this.. Once I read how handler-case
> worked it was simple to change the code to only catch the parse
> errors.. That's running now and it's been going several hours and has
> nearly processed all my 8 million records. No crashes or glitches of
> any kind. I was expecting the memory error to occur again but just to
> be a bit more readable, but there's been no problems.

Curious! Paul Khuong speculated that your problem might be related to a leak in Darwin signal handlers, but it is not obvious to me how getting rid of IGNORE-ERRORS and replacing it with a more selective construct would change this.

> whether this is possible, but could it be the case that some low level
> code was signalling that it was low on memory, the high level code was
> responding by making it call the garbage collector, but then these
> signals weren't getting through before due to the ignore-errors
> statement?

Nope: Lisp-level conditions are not used to drive the garbage collector.

Cheers,

-- Nikodemus
|