From: David H. <dav...@gm...> - 2006-10-30 14:26:47
Hi,

I have a script that crashes, but only if it runs over 9~10 hours, with the
following backtrace from gdb. The script uses PyMC, and repeatedly calls
(> 1000000) likelihood functions written in fortran and wrapped with f2py.

Numpy: 1.0.dev3327
Python: 2.4.3

Does this backtrace give enough info to track the problem, or do the gurus
need more?

Thanks,
David

*** glibc detected *** free(): invalid pointer: 0x00002aaaac1257e0 ***

Program received signal SIGABRT, Aborted.
[Switching to Thread 46912504440528 (LWP 25269)]
0x00002aaaab09011d in raise () from /lib/libc.so.6
(gdb) backtrace
#0  0x00002aaaab09011d in raise () from /lib/libc.so.6
#1  0x00002aaaab09184e in abort () from /lib/libc.so.6
#2  0x00002aaaab0c4e41 in __fsetlocking () from /lib/libc.so.6
#3  0x00002aaaab0ca90e in malloc_usable_size () from /lib/libc.so.6
#4  0x00002aaaab0cac56 in free () from /lib/libc.so.6
#5  0x00002aaaabff7770 in PyArray_FromArray (arr=0x1569500, newtype=0x2aaaac1257e0, flags=0) at arrayobject.c:7804
#6  0x00002aaaabfece56 in PyArray_FromAny (op=0x1569500, newtype=0x0, min_depth=0, max_depth=0, flags=0, context=0x0) at arrayobject.c:8257
#7  0x00002aaaabff40b1 in PyArray_MultiIterNew (n=2) at arrayobject.c:10253
#8  0x00002aaaabff44bc in _broadcast_cast (out=0x62b5, in=0x6, castfunc=0x2aaaabfbf5a0 <DOUBLE_to_FLOAT>, iswap=-1, oswap=6) at arrayobject.c:7445
#9  0x00002aaaabffe301 in PyArray_CastToType (mp=0x156dca0, at=<value optimized out>, fortran_=0) at arrayobject.c:7344
#10 0x00002aaaabffe785 in PyArray_FromScalar (scalar=0x1573b30, outcode=0x2aaaac1257e0) at scalartypes.inc.src:219
#11 0x00002aaaabfecff5 in PyArray_FromAny (op=0x1573b30, newtype=0x2aaaac1257e0, min_depth=0, max_depth=<value optimized out>, flags=0, context=0x0) at arrayobject.c:8260
#12 0x00002aaab6038b7b in array_from_pyobj (type_num=11, dims=0x7fffff8f6200, rank=1, intent=<value optimized out>, obj=0x1573b30) at build/src.linux-x86_64-2.4/fortranobject.c:653
#13 0x00002aaab6034aa9 in f2py_rout_flib_beta (capi_self=<value optimized out>, capi_args=<value optimized out>, capi_keywds=<value optimized out>, f2py_func=0x2aaab603e830 <beta_>) at build/src.linux-x86_64-2.4/PyMC/flibmodule.c:2601
#14 0x0000000000414490 in PyObject_Call ()
#15 0x0000000000475de5 in PyEval_EvalFrame ()
#16 0x00000000004bdf69 in PyDescr_NewGetSet ()
#17 0x00000000004143eb in PyIter_Next ()
#18 0x000000000046ba53 in _PyUnicodeUCS4_IsNumeric ()
#19 0x0000000000477ab1 in PyEval_EvalFrame ()
#20 0x00000000004783ff in PyEval_EvalCodeEx ()
#21 0x000000000047699b in PyEval_EvalFrame ()
#22 0x0000000000476ab6 in PyEval_EvalFrame ()
#23 0x0000000000476ab6 in PyEval_EvalFrame ()
#24 0x00000000004783ff in PyEval_EvalCodeEx ()
#25 0x000000000047699b in PyEval_EvalFrame ()
#26 0x00000000004783ff in PyEval_EvalCodeEx ()
#27 0x000000000047699b in PyEval_EvalFrame ()
#28 0x00000000004783ff in PyEval_EvalCodeEx ()
#29 0x000000000047699b in PyEval_EvalFrame ()
#30 0x00000000004783ff in PyEval_EvalCodeEx ()
#31 0x0000000000478512 in PyEval_EvalCode ()
#32 0x000000000049c222 in PyRun_FileExFlags ()
#33 0x000000000049c4ae in PyRun_SimpleFileExFlags ()
#34 0x0000000000410a80 in Py_Main ()
#35 0x00002aaaab07d49b in __libc_start_main () from /lib/libc.so.6
#36 0x000000000040ffba in _start ()
From: Fernando P. <fpe...@gm...> - 2006-10-30 19:19:01
On 10/30/06, David Huard <dav...@gm...> wrote:
> Hi,
> I have a script that crashes, but only if it runs over 9~10 hours, with the
> following backtrace from gdb. The script uses PyMC, and repeatedly calls
> (> 1000000) likelihood functions written in fortran and wrapped with f2py.
> Numpy: 1.0.dev3327
> Python: 2.4.3

This sounds awfully reminiscent of the bug I recently mentioned:

http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/3312099

We left a fresh run over the weekend, but my office mate is currently out of
the office and his terminal is locked, so I don't know what the result is.
I'll report shortly: we followed Travis' instructions and ran with a fresh
SVN build which includes the extra warnings he added to the dealloc
routines. You may want to try the same advice; with information from both of
us, the gurus may zero in on the problem, if indeed it is the same.

Note that I'm not positive it's the same problem, and our backtraces aren't
quite the same. But the rest of the scenario is similar: a low-level memory
crash from glibc, a very long run needed to fire the bug, and potentially
millions of calls to both numpy and to f2py-wrapped in-house libraries.

Cheers,

f
From: David H. <dav...@gm...> - 2006-10-30 19:57:58
Ok, I'll update numpy and give it another try tonight.

Regards,
David

2006/10/30, Fernando Perez <fpe...@gm...>:
>
> On 10/30/06, David Huard <dav...@gm...> wrote:
> > Hi,
> > I have a script that crashes, but only if it runs over 9~10 hours, with the
> > following backtrace from gdb. The script uses PyMC, and repeatedly calls
> > (> 1000000) likelihood functions written in fortran and wrapped with f2py.
> > Numpy: 1.0.dev3327
> > Python: 2.4.3
>
> This sounds awfully reminiscent of the bug I recently mentioned:
>
> http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/3312099
>
> We left a fresh run over the weekend, but my office mate is currently
> out of the office and his terminal is locked, so I don't know what the
> result is. I'll report shortly: we followed Travis' instructions and
> ran with a fresh SVN build which includes the extra warnings he added
> to the dealloc routines. You may want to try the same advice, perhaps
> with information from both of us the gurus may zero in on the problem,
> if indeed it is the same.
>
> Note that I'm not positive it's the same problem, and our backtraces
> aren't quite the same. But the rest of the scenario is similar:
> low-level memory crash from glibc, very long run is needed to fire the
> bug, potentially millions of calls to both numpy and to f2py-wrapped
> in-house libraries.
>
> Cheers,
>
> f
From: Travis O. <oli...@ee...> - 2006-10-30 22:13:58
Fernando Perez wrote:

> On 10/30/06, David Huard <dav...@gm...> wrote:
>
>> Hi,
>> I have a script that crashes, but only if it runs over 9~10 hours, with the
>> following backtrace from gdb. The script uses PyMC, and repeatedly calls
>> (> 1000000) likelihood functions written in fortran and wrapped with f2py.
>> Numpy: 1.0.dev3327
>> Python: 2.4.3
>
> This sounds awfully reminiscent of the bug I recently mentioned:
>
> http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/3312099

It actually looks very much like it. I think the problem may be in f2py or
in one of the C-API calls wherein there is a reference-count problem with
the built-in data-type objects.

NumPy won't try to free those anymore, which will solve the immediate
problem, but there is still a reference-count problem somewhere.

The reference to the data-type objects is consumed by constructors that take
PyArray_Descr * arguments, so you often need to INCREF before passing to
those constructors. It looks like this INCREF is forgotten in some extension
module (perhaps in f2py or PyMC). It's possible it's in NumPy itself, though
I've re-checked the code lots of times looking for that specific problem.

-Travis
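The convention Travis describes (constructors that take PyArray_Descr *
"consume" a reference) lives at the C level, but the failure mode it
produces, a per-call reference-count imbalance on a shared object, can be
illustrated from pure Python with sys.getrefcount. This is only a sketch of
the symptom, not NumPy or f2py code; `shared`, `refcount_drift`, and the
leaky `stash` wrapper below are all hypothetical stand-ins:

```python
import sys

def refcount_drift(obj, call, n=1000):
    """Return how much obj's refcount changed after n calls to call(obj)."""
    before = sys.getrefcount(obj)
    for _ in range(n):
        call(obj)
    return sys.getrefcount(obj) - before

shared = object()  # stands in for a built-in dtype singleton

# A balanced wrapper (every consumed reference matched by an INCREF)
# leaves no drift:
print(refcount_drift(shared, lambda o: None))   # 0

# A wrapper that keeps a reference on every call -- the Python analogue
# of a forgotten INCREF/DECREF pairing in C -- drifts by one per call.
# Over millions of calls this is exactly the slow corruption that only
# shows up in very long runs:
stash = []
print(refcount_drift(shared, stash.append))     # 1000
```

In the real crash the drifting count belongs to a statically allocated
dtype object, so the eventual spurious deallocation hands glibc a pointer
it never allocated.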
From: Fernando P. <fpe...@gm...> - 2006-10-30 22:31:55
On 10/30/06, Travis Oliphant <oli...@ee...> wrote:
> Fernando Perez wrote:
> > This sounds awfully reminiscent of the bug I recently mentioned:
> >
> > http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/3312099
>
> It actually looks very much like it. I think the problem may be in f2py
> or in one of the C-API calls wherein there is a reference-count problem
> with the built-in data-type objects.
>
> NumPy won't try to free those anymore which will solve the immediate
> problem, but there is still a reference-count problem somewhere.
>
> The reference to the data-type objects is consumed by constructors that
> take PyArray_Descr * arguments. So, you often need to INCREF before
> passing to those constructors. It looks like this INCREF is forgotten
> in some extension module (perhaps in f2py or PyMC). It's possible it's
> in NumPy itself, though I've re-checked the code lots of times looking
> for that specific problem.

As a data point, our code has almost no manual memory management in C, but
lots and lots of f2py-generated wrappers, as well as a lot of
weave.inline-generated code. We do have hand-written C extensions, but most
of them operate on externally allocated arrays.

The one little snippet where we manually manage memory is a copy of numpy's
innerproduct() which I simplified and tuned for our purposes; it just does:

    ret = (PyArrayObject *)PyArray_SimpleNew(nd, dimensions,
                                             ap1->descr->type_num);
    if (ret == NULL) goto fail;

    [ do computational loop to fill in ret array, no memory management here ]

    return (PyObject *)ret;

 fail:
    Py_XDECREF(ret);
    return NULL;

That's the full extent of our manual memory management, and I don't see any
problem with it, but maybe there is: I copied this from numpy months ago and
haven't really looked again.

Cheers,

f
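The ownership rule that snippet relies on is: own exactly one reference
(the result), and release it on every failure path. The same shape can be
sketched in pure Python with try/except standing in for goto fail (the
function and names here are hypothetical, and a list stands in for the
freshly allocated array):

```python
def allocate_fill_or_release(n, fill):
    """Python analogue of the C allocate/fill/goto-fail pattern:
    own one reference (ret); on any failure, release it and re-raise."""
    ret = [0] * n                # PyArray_SimpleNew: we own `ret` now
    try:
        for i in range(n):
            ret[i] = fill(i)     # computational loop: no ownership changes
    except Exception:
        del ret                  # Py_XDECREF(ret): drop our only reference
        raise                    # return NULL
    return ret                   # ownership transfers to the caller

print(allocate_fill_or_release(4, lambda i: i * i))  # [0, 1, 4, 9]
```

As Fernando says, this pattern is refcount-neutral by construction, which
is consistent with the leak being elsewhere (in generated wrapper code
rather than in the hand-written snippet).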
From: Travis O. <oli...@ee...> - 2006-10-30 22:36:34
Fernando Perez wrote:

> On 10/30/06, David Huard <dav...@gm...> wrote:
>
>> Hi,
>> I have a script that crashes, but only if it runs over 9~10 hours, with the
>> following backtrace from gdb. The script uses PyMC, and repeatedly calls
>> (> 1000000) likelihood functions written in fortran and wrapped with f2py.
>> Numpy: 1.0.dev3327
>> Python: 2.4.3
>
> This sounds awfully reminiscent of the bug I recently mentioned:
>
> http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/3312099
>
> We left a fresh run over the weekend, but my office mate is currently
> out of the office and his terminal is locked, so I don't know what the
> result is. I'll report shortly: we followed Travis' instructions and
> ran with a fresh SVN build which includes the extra warnings he added
> to the dealloc routines. You may want to try the same advice, perhaps
> with information from both of us the gurus may zero in on the problem,
> if indeed it is the same.

I talked about the reference counting issue. One problem is not incrementing
the reference count when it needs to be. The other problem could occur if
the reference count was not decremented when it needed to be and the
reference count wrapped from MAX_LONG to 0. This could also create the
problem and would be expected for "long-running" processes.

-Travis
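The wraparound Travis describes is plain fixed-width integer overflow on
the object's refcount field. A back-of-the-envelope sketch (using a 32-bit
signed count purely for illustration; the actual width on David's x86_64
build may differ):

```python
def incref(count, bits=32):
    """One C-style INCREF on a signed, fixed-width refcount,
    wrapping the way two's-complement arithmetic does."""
    count = (count + 1) % (1 << bits)
    return count - (1 << bits) if count >= 1 << (bits - 1) else count

MAX_INT = (1 << 31) - 1
print(incref(MAX_INT))   # -2147483648: one leaked INCREF past the maximum

# Further leaked increments then climb back toward zero:
print(incref(-1))        # 0

# Once the count sits at 0, the next balanced INCREF/DECREF pair ends
# with a DECREF seeing zero, which deallocates the statically allocated
# dtype object -- consistent with glibc's free(): invalid pointer abort
# in the backtrace at the start of this thread.
```

This is why the bug needs millions of calls to fire: the count must leak
all the way around the integer range before anything visibly breaks.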
From: Fernando P. <fpe...@gm...> - 2006-10-30 22:41:32
On 10/30/06, Travis Oliphant <oli...@ee...> wrote:
> Fernando Perez wrote:
>
> > On 10/30/06, David Huard <dav...@gm...> wrote:
> >
> >> Hi,
> >> I have a script that crashes, but only if it runs over 9~10 hours, with the
> >> following backtrace from gdb. The script uses PyMC, and repeatedly calls
> >> (> 1000000) likelihood functions written in fortran and wrapped with f2py.
> >> Numpy: 1.0.dev3327
> >> Python: 2.4.3
> >
> > This sounds awfully reminiscent of the bug I recently mentioned:
> >
> > http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/3312099
> >
> > We left a fresh run over the weekend, but my office mate is currently
> > out of the office and his terminal is locked, so I don't know what the
> > result is. I'll report shortly: we followed Travis' instructions and
> > ran with a fresh SVN build which includes the extra warnings he added
> > to the dealloc routines. You may want to try the same advice, perhaps
> > with information from both of us the gurus may zero in on the problem,
> > if indeed it is the same.
>
> I talked about the reference counting issue. One problem is not
> incrementing the reference count when it needs to be. The other problem
> could occur if the reference count was not decremented when it needed to
> be and the reference count wrapped from MAX_LONG to 0. This could also
> create the problem and would be expected for "long-running" processes.

I just posted the log from that run in the other thread. I'm not sure if
that helps you any though. I'm running the code again to see if we see your
new warning fire, and will report back.

Cheers,

f
From: Travis O. <oli...@ee...> - 2006-10-30 23:02:46
David Huard wrote:

> Ok,
> I'll update numpy and give it another try tonight.

I just fixed some reference-count problems in f2py today. These were of the
variety that there was a missing decref that would cause the reference count
of certain often-used data-types to increase without bound and eventually
wrap (to 0) in long-running processes using f2py.

I suspect this is the fundamental problem in both cases.

-Travis
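For anyone who wants to verify a fix like this (or catch a regression)
without waiting 9~10 hours, one cheap diagnostic is to sample the
refcounts of the suspect singletons every N iterations of the main loop:
a steady climb flags the leak within minutes. A sketch, with a plain
Python object standing in for the often-used dtype objects (in the real
script one would watch e.g. the dtype instances passed to the f2py-wrapped
routines; `sample_refcounts` and the leaky stand-in are hypothetical):

```python
import sys

def sample_refcounts(watched, work, total, every):
    """Run work() `total` times, recording sys.getrefcount of each
    watched object every `every` iterations; return the samples."""
    samples = []
    for i in range(total):
        work()
        if (i + 1) % every == 0:
            samples.append([sys.getrefcount(o) for o in watched])
    return samples

target = object()          # stand-in for a dtype singleton
leak = []                  # simulates a wrapper with a missing decref
samples = sample_refcounts([target], lambda: leak.append(target),
                           total=300, every=100)
# Each sample is exactly 100 higher than the last: the leak signature.
print([s[0] - samples[0][0] for s in samples])  # [0, 100, 200]
```

With the fixed wrappers, the sampled counts should stay flat over an
arbitrarily long run.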
From: Fernando P. <fpe...@gm...> - 2006-10-31 00:23:44
On 10/30/06, Travis Oliphant <oli...@ee...> wrote:
> David Huard wrote:
>
> > Ok,
> > I'll update numpy and give it another try tonight.
>
> I just fixed some reference-count problems in f2py today. These were of
> the variety that there was a missing decref that would cause the
> reference count of certain often-used data-types to increase without
> bound and eventually wrap (to 0) in long-running processes using f2py.
>
> I suspect this is the fundamental problem in both cases.

Many thanks, Travis. We're rebuilding numpy and all of our f2py-generated
wrappers, and will start a new run. I'll report on the results as well.

Cheers,

f