From: Nicolas N. <ne...@ma...> - 2016-03-03 09:55:16
|
Dear SBCL users and developers,

while parallelizing my PDE solver Femlisp with OS threads, I keep
running into an ugly bug which occurs only sporadically.

However, it is always of a typical form, namely it drops into ldb in the
following way:

* fatal error encountered in SBCL pid 4157(tid 140736551384832):
no scavenge function for object 0x3e09588968409d6d (widetag 0x6d)

Error opening /dev/tty: No such device or address
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> backtrace
Backtrace:
   0: Foreign function (null), fp = 0x7fffc826d180, ra = 0x41268a
   1: Foreign function (null), fp = 0x7fffc826d270, ra = 0x41285b
   2: Foreign function (null), fp = 0x7fffc826d280, ra = 0x40e693
   3: Foreign function scavenge, fp = 0x7fffc826d2d0, ra = 0x40ff1f
   4: Foreign function collect_garbage, fp = 0x7fffc826d350, ra = 0x4257ed
   5: SB-KERNEL::COLLECT-GARBAGE
   6: (COMMON-LISP::FLET WITHOUT-GCING-BODY-52 KEYWORD::IN SB-KERNEL::SUB-GC)
   7: (COMMON-LISP::FLET SB-THREAD::EXEC KEYWORD::IN SB-KERNEL::SUB-GC)
   8: (COMMON-LISP::FLET WITHOUT-INTERRUPTS-BODY-47 KEYWORD::IN SB-KERNEL::SUB-GC)
   9: SB-KERNEL::SUB-GC
  10: Foreign function call_into_lisp, fp = 0x7fffc826d690, ra = 0x42868f
  11: Foreign function maybe_gc, fp = 0x7fffc826d6c0, ra = 0x412253
  12: Foreign function interrupt_handle_pending, fp = 0x7fffc826d830, ra = 0x415bd8
  13: Foreign function handle_trap, fp = 0x7fffc826d870, ra = 0x4169c5
  14: Foreign function (null), fp = 0x7fffc826d8b0, ra = 0x4130c0
  15: Foreign function (null), fp = 0x7fffc826de88, ra = 0x7ffff79c7d10
  16: SB-KERNEL::%MAKE-ARRAY
  17: FL.MATLISP::ZEROS
  18: (SB-PCL::FAST-METHOD FL.UTILITIES::MAKE-ANALOG (FL.MATLISP::STANDARD-MATRIX))
  19: (SB-PCL::FAST-METHOD FL.MATLISP::COPY (COMMON-LISP::T))
  20: FL.MATLISP::GETRF
  21: (SB-PCL::FAST-METHOD FL.MATLISP::GESV! (COMMON-LISP::T COMMON-LISP::T))

That is, it looks as if the bug is triggered when make-array is
interrupted by a GC step. My guess is that this should usually pose no
problems, and that this error arises from previous use of concurrent
destructive operations without appropriate mutual exclusion.
Nevertheless, I would be interested if anyone of you has seen this kind
of bug before, and could possibly help me to narrow the search.

Thank you,

Nicolas |
From: Stas B. <sta...@gm...> - 2016-03-03 12:01:38
|
On Thu, Mar 3, 2016 at 12:55 PM, Nicolas Neuss <ne...@ma...> wrote:
[...]
> That is, it looks as if the bug is triggered when make-array is
> interrupted by a GC step. My guess is that this should usually pose no
> problems, and that this error arises from previous use of concurrent
> destructive operations without appropriate mutual exclusion.
> Nevertheless, I would be interested if anyone of you has seen this kind
> of bug before, and could possibly help me to narrow the search.

We've seen that before but it's elusive and is still not tracked down.
Maybe you can provide the code you are using to trigger it?

--
With best regards, Stas. |
From: Nicolas N. <ne...@ma...> - 2016-03-04 16:31:25
|
Stas Boukarev <sta...@gm...> writes:

> On Thu, Mar 3, 2016 at 12:55 PM, Nicolas Neuss <ne...@ma...> wrote:
[...]
> We've seen that before but it's elusive and is still not tracked down.
> Maybe you can provide the code you are using to trigger it?

Hi Stas,

thank you for your response.

I think I have succeeded in finding and maybe even eliminating my bug,
although I am not sure, because I still have open questions.

First, I have observed that the problem only occurs when I really use
external libraries in parallel (I can switch the use of BLAS/LAPACK
libraries on and off in Femlisp). Second, I observed that it only
occurred when calling the LU decomposition routines dgetrf/dgetrs.
Third, I saw that I passed one of the arguments to dgetrs in a slightly
different way (namely as a standard-matrix, which is a CLOS object,
instead of the store of its numbers, which is a vector). In principle,
both ways should give the same result, because I call the BLAS routine
using a wrapper which should convert the matrix object to the store of
its entries automatically. Nevertheless, the problem vanished when I
called the routine directly with the store.

Because I do not understand this, I guess that my pinning/calling is
wrong, and I would like to ask you to take a look at how I do this:

--8<---------------cut here---------------start------------->8---
(defun foreign-call (function &rest args)
  "Ensures a safe environment for a foreign function call, especially so
that no GC changes the arguments."
  (sb-sys:with-pinned-objects (args) (apply function args))
  )

(defgeneric lapack-convert (arg)
  (:documentation "Convert argument for use in a LAPACK routine.")
  (:method (x) (error "Don't know to convert arg"))
  (:method ((x number)) x)
  (:method ((x vector)) (sb-sys:vector-sap x))
  (:method ((x standard-matrix)) (lapack-convert (store x))))

(defun call-lapack (routine &rest args)
  "Call the LAPACK routine @arg{routine} with the arguments @arg{args}.
NIL-arguments are discarded, arrays and standard-matrices are converted to
the necessary alien representation."
  (dbg :lapack "Calling with:~%~{~A~%~}~%" (remove nil args))
  (apply #'foreign-call routine
         (loop for arg in args
               when arg collect (lapack-convert arg))))
--8<---------------cut here---------------end--------------->8---

In particular: is it correct to pin the pointer (sb-sys:vector-sap x) of
a Lisp vector x, or do I have to pin x directly?

Thank you,

Nicolas |
From: Stas B. <sta...@gm...> - 2016-03-04 18:13:59
|
On Fri, Mar 4, 2016 at 7:31 PM, Nicolas Neuss <ne...@ma...> wrote:
[...]
> Because I do not understand this, I guess that my pinning/calling is
> wrong, and I would like to ask you to take a look at how I do this:
>
> --8<---------------cut here---------------start------------->8---
> (defun foreign-call (function &rest args)
>   "Ensures a safe environment for a foreign function call, especially so
> that no GC changes the arguments."
>   (sb-sys:with-pinned-objects (args) (apply function args))
>   )
>
> (defgeneric lapack-convert (arg)
>   (:documentation "Convert argument for use in a LAPACK routine.")
>   (:method (x) (error "Don't know to convert arg"))
>   (:method ((x number)) x)
>   (:method ((x vector)) (sb-sys:vector-sap x))
>   (:method ((x standard-matrix)) (lapack-convert (store x))))
>
> (defun call-lapack (routine &rest args)
>   "Call the LAPACK routine @arg{routine} with the arguments @arg{args}.
> NIL-arguments are discarded, arrays and standard-matrices are converted to
> the necessary alien representation."
>   (dbg :lapack "Calling with:~%~{~A~%~}~%" (remove nil args))
>   (apply #'foreign-call routine
>          (loop for arg in args
>                when arg collect (lapack-convert arg))))
> --8<---------------cut here---------------end--------------->8---
>

You have to use with-pinned-objects on each object directly;
pinning a list with objects wouldn't do.

--
With best regards, Stas. |
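One way to follow this advice when the arguments arrive as a list is to pin the elements one at a time and take the data pointers only inside the pinned extent. The following is a minimal sketch, assuming SBCL; `call-with-pinned-list` and `call-with-pinned-vectors` are hypothetical helper names introduced here for illustration and are not part of SBCL or Femlisp.

```lisp
;; Sketch only: CALL-WITH-PINNED-LIST and CALL-WITH-PINNED-VECTORS are
;; hypothetical helpers, not part of SBCL or Femlisp.

(defun call-with-pinned-list (objects thunk)
  "Call THUNK while every element of the list OBJECTS is pinned."
  (if (endp objects)
      (funcall thunk)
      ;; WITH-PINNED-OBJECTS takes a fixed set of forms, so one element
      ;; is pinned per recursion level; each pin lasts until the nested
      ;; call returns.
      (sb-sys:with-pinned-objects ((first objects))
        (call-with-pinned-list (rest objects) thunk))))

(defun call-with-pinned-vectors (function &rest args)
  "Apply FUNCTION to ARGS, passing each vector as a SAP to its data.
The vectors themselves stay pinned for the duration of the call."
  (call-with-pinned-list
   (remove-if-not #'vectorp args)
   (lambda ()
     ;; The addresses are taken only after the vectors are pinned.
     (apply function
            (mapcar (lambda (arg)
                      (if (vectorp arg) (sb-sys:vector-sap arg) arg))
                    args)))))

;; Usage: read two doubles back through the raw pointer and add them.
(let ((v (make-array 2 :element-type 'double-float
                       :initial-contents '(1.5d0 2.5d0))))
  (call-with-pinned-vectors
   (lambda (sap)
     (+ (sb-sys:sap-ref-double sap 0)
        (sb-sys:sap-ref-double sap 8)))
   v))
```

The difference from the quoted `foreign-call` is that the Lisp vectors themselves are pinned, not the freshly consed list that holds them, and `sb-sys:vector-sap` is called only once the pins are in place.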
From: Nicolas N. <ne...@ma...> - 2016-03-08 10:39:33
|
Stas Boukarev <sta...@gm...> writes:

> On Fri, Mar 4, 2016 at 7:31 PM, Nicolas Neuss <ne...@ma...> wrote:
[...]
>> --8<---------------cut here---------------start------------->8---
>> (defun foreign-call (function &rest args)
>>   "Ensures a safe environment for a foreign function call, especially so
>> that no GC changes the arguments."
>>   (sb-sys:with-pinned-objects (args) (apply function args))
>>   )
>>
>> (defgeneric lapack-convert (arg)
>>   (:documentation "Convert argument for use in a LAPACK routine.")
>>   (:method (x) (error "Don't know to convert arg"))
>>   (:method ((x number)) x)
>>   (:method ((x vector)) (sb-sys:vector-sap x))
>>   (:method ((x standard-matrix)) (lapack-convert (store x))))
>>
>> (defun call-lapack (routine &rest args)
>>   "Call the LAPACK routine @arg{routine} with the arguments @arg{args}.
>> NIL-arguments are discarded, arrays and standard-matrices are converted to
>> the necessary alien representation."
>>   (dbg :lapack "Calling with:~%~{~A~%~}~%" (remove nil args))
>>   (apply #'foreign-call routine
>>          (loop for arg in args
>>                when arg collect (lapack-convert arg))))
>> --8<---------------cut here---------------end--------------->8---
>>
> You have to use with-pinned-objects on each object directly,
> pinning a list with objects wouldn't do.

Hi Stas,

thank you. You are right, of course. [Indeed, during my search for the
bug, I had even changed something that was (perhaps) correct into the
above, which is clearly wrong (because only the args list is pinned and
not its elements).]

To be completely sure, I have extracted a simplified call to the DAXPY
BLAS routine (computing y:=a*x+y) here, which I hope to be correct:

--8<---------------cut here---------------start------------->8---
(defun daxpy_ (n alpha x incx y incy)
  (sb-alien:with-alien ((daxpy_ (function sb-alien:void
                                          (* sb-alien:int)
                                          (* sb-alien:double)
                                          (* sb-alien:double)
                                          (* sb-alien:int)
                                          (* sb-alien:double)
                                          (* sb-alien:int))
                                :extern "daxpy_")
                        (n sb-alien:int n)
                        (alpha sb-alien:double alpha)
                        (incx sb-alien:int incx)
                        (incy sb-alien:int incy))
    (sb-alien:alien-funcall daxpy_
                            (sb-alien:addr n) (sb-alien:addr alpha)
                            x (sb-alien:addr incx)
                            y (sb-alien:addr incy))
    (values nil)))

(let ((x (make-array 4 :initial-element 1.0d0 :element-type 'double-float))
      (y (make-array 4 :initial-element 2.0d0 :element-type 'double-float))
      (alpha 3.0d0)
      (n 4))
  (sb-sys:with-pinned-objects (n alpha x y)
    (funcall (lapack "axpy" :double)
             n alpha (sb-sys:vector-sap x) 1 (sb-sys:vector-sap y) 1))
  y)
--8<---------------cut here---------------end--------------->8---

Is the latter a call to DAXPY which can be expected to work also in the
presence of multiple threads? And could it maybe be simplified by
omitting the scalar arguments n and/or alpha from the list of pinned
objects?

Thank you,

Nicolas |
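For checking what the foreign call above returns, a plain Common Lisp version of the same update y := a*x + y can serve as a reference. This is only a sketch: `reference-daxpy` is a hypothetical helper introduced here, it handles unit strides only, and it is not the BLAS routine.

```lisp
;; REFERENCE-DAXPY is a hypothetical helper written in plain Common Lisp
;; (unit strides only); it does not call BLAS.
(defun reference-daxpy (alpha x y)
  "Destructively set Y to ALPHA*X + Y elementwise and return Y."
  (declare (type double-float alpha)
           (type (simple-array double-float (*)) x y))
  (assert (= (length x) (length y)))
  (dotimes (i (length x) y)
    (setf (aref y i) (+ (* alpha (aref x i)) (aref y i)))))

;; Same data as in the message: x = 1, y = 2, alpha = 3.
(let ((x (make-array 4 :initial-element 1.0d0 :element-type 'double-float))
      (y (make-array 4 :initial-element 2.0d0 :element-type 'double-float)))
  (reference-daxpy 3.0d0 x y))
;; Every element of the returned vector is 3*1 + 2 = 5.0d0.
```

Comparing the vector returned by the pinned foreign call against this result, repeatedly and from several threads, is one way to look for corruption without involving the garbage collector in the comparison itself.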