Could Matlisp be causing this GC crash?

2000-09-29
2001-02-23
  • I don't know whether the GC-related crash in CMUCL could be caused by the Matlisp package.  I'd appreciate any comments regarding this.  Below is a copy of the posting to CMUCL-HELP...

    tnx

    mike
    -------------------------------------------

    Hello all,

      I need some guidance with the following problem.  I'm currently
    stumped and really don't want to start redoing my analysis code
    in another language.

    The problem seems to be the GC process dumping me to the "LDB>" prompt
    like this:

      Size lossage.  No size function for object at 0x48232e70
      First word of object: 0x3c9d00ea
      GC lossage.  No scavenge function for object 0x3c9d00ea
      LDB monitor
      ldb>

    If I (GC-OFF), run the code, then (GC-ON) all runs well.  Unfortunately
    the simple test code I'm running, w/ (GC-OFF), runs me near 512-Meg and the
    target machine won't be that big.

    The core I'm using is...(note I have added MATLISP/1.0b.  A rather full
    matrix library is required for my applications)

      Loaded subsystems:
          Python 1.0, target Intel x86
          CLOS based on PCL version:  September 16 92 PCL (f)
          CLX X Library MIT R5.02
          Motif toolkit and graphical debugger 1.0
          Hemlock 3.5
          Defsystem Mar 13 1995
          MATLISP/1.0b

    Finally I'm running...

      I'm using CMUCL 2.4.20, Release 2 distributed as a Debian package and
      converted to RPM w/ Alien.  I'm running under RH 6.2 (Linux
      2.2.14-6.1.1smp).

    tnx in advance

    mike

    --
    ---------------------------------------------------------------------------
    Dr Michael A. Koerber                 Good judgment comes from experience.
    MIT/Lincoln Laboratory                Experience comes from bad judgment.
    mak@ll.mit.edu

     
    • Tunc Simsek
      2000-09-30

      This looks hairy.  I think the problem can be worked out though.

      1. since it works with GC-OFF the problem should not be with the
         foreign calls to LAPACK and BLAS (because with-vector-data-addresses
         calls foreign functions without-gcing). 

      2.  I think it'd be useful to know the size of your matrices and the exact
        function in which you get the awkward memory error.

      Thanks,
      Tunc

       
      • I suspect that the "crash" is somehow related to the GEV (generalized Eigen Value/Vector) routines.  It was when I started making numerous calls to that package that the crashes started.  The basic procedure was to:

        1.  Compute a 31 x 2 complex matrix, D
        2.  Form R = ctranspose(D) D,  a 2 x 2 matrix
        3.  Compute Eigenvalues
        4.  Repeat 1--3 a few thousand times for different D matrices

        Steps 1--2 have been computed hundreds of thousands of times with no errors of any type reported.  The addition of step 3 seems to cause the problems.  The problems have been not only the GC reported problem, but also occasional random crashes in another subroutine which is receiving "impossible" arguments w/ value NULL.
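
        The four steps above can be sketched roughly as follows.  This is only an illustration, not the original analysis code: it assumes Matlisp is loaded under the nickname M (as used elsewhere in this thread), RANDOM-COMPLEX-MATRIX is a hypothetical helper, random data stands in for the real D matrices, and calling M:RAND with two arguments (rows, columns) is an assumption.

        ```lisp
        ;; Sketch only.  Assumes Matlisp is loaded with the nickname M.
        ;; RANDOM-COMPLEX-MATRIX is a hypothetical helper, and the two-argument
        ;; form of M:RAND (rows, columns) is an assumption.

        (defun random-complex-matrix (rows cols)
          "Return a ROWS x COLS complex matrix with random real and imaginary parts."
          (m:m+ (m:rand rows cols)
                (m:scal (sqrt -1.0) (m:rand rows cols))))

        (dotimes (trial 5000)                        ; step 4: repeat a few thousand times
          (let* ((d (random-complex-matrix 31 2))    ; step 1: 31 x 2 complex matrix D
                 (r (m:m* (m:ctranspose d) d)))      ; step 2: R = ctranspose(D) D, 2 x 2
            (m:eig r)))                              ; step 3: eigenvalues of R
        ```

        Dropping the M:EIG call from the loop body leaves steps 1--2 only, which is the case that ran without errors.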

        Due to project requirements at my end, I have started a rewrite of my analysis code from the ground up in Matlab in order to have presentable results in time for a review.  I don't feel that I'm capable of solving the Lisp/Matlisp problem w/in the time constraints.  If you have time or interest, I can send my routines and a "crash demo" to you, or perhaps run some experiments over the weekends that might provide you data to track down the source of the error.

          tnx for your help.

        mike

         
        • Raymond Toy
          2000-10-02

          Can you be more specific as to the exact routine?  I can't find a GEV routine, but there's the GEEV routine.

          This will certainly help a lot in tracking down the problem.

          Can you send the routines?  That will also help a lot in tracking down the errors.

          Ray
          toy@rtp.ericsson.se

           
          • Yes...GEEV (EIG) is the call that I'm making.

            I just sent these files to Tunc this AM before I read your response here.  Is this sufficient, or shall I resend?

            mike

             
    • I retried the original code after the release of CMUCL (debian 2.4.22).  The same problem exists.  However, I have noted that by increasing *BYTES-CONSED-BETWEEN-GCS* to 80M, the error never occurs even in multiple back to back runs.

      Could there be some sort of "race-condition" in memory reclamation between GC and "whatever"?  I have to admit, I don't have a clue what is really happening.
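
      For reference, the workaround described above amounts to something like the following at the CMUCL prompt.  This is a minimal sketch, assuming "80M" means 80 * 1024 * 1024 bytes and that the variable is reached through the EXT package.

      ```lisp
      ;; CMUCL-specific sketch.  The exact byte count is an assumption:
      ;; "80M" is taken to mean 80 * 1024 * 1024 bytes.
      (setf ext:*bytes-consed-between-gcs* (* 80 1024 1024))
      (format t "GC threshold is now ~D bytes~%" ext:*bytes-consed-between-gcs*)
      ```

      A larger threshold only makes collections less frequent, so it hides the symptom without saying anything about the cause.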

       
      • Raymond Toy
        2001-02-22

        I've just started looking into this problem finally, after updating my Linux system.  I get random crashes just like you.

        If possible, could you try to simplify the code?  I agree that geev appears to be the problem, but a quick look at the code seems to show that we're dimensioning everything correctly.  Could be a real bug in the x86 GC.  I don't have this problem on Sparc.

        If we could come up with a fixed D matrix and then call geev many times to cause the crash, we'll be closer to finding the problem.

        Thanks,

        Ray

         
        • The offending line in the file TWIST-ML-EST.LISP is:

          (defparameter *max-eig-denom-twisted* (max-eig *twisted-array* *az-rad* *el-rad*))

          If you wrap a (DOTIMES (HMM 10) ... ) around it you are certain of a failure. 

          I also tried an experiment where (DEFUN MAX-EIG ...) was modified as follows:

          ....
              (dotimes (n (* (length az) (length el)))
                (setf D (join (mcol manif-h n) (mcol manif-v n)))    ; The bivector

                ;;; START A NEW BLOCK OF "MAKE-IT-FAIL" CODE
                ;; Set up a fixed value of D and repeatedly call eig
                (format t "N = ~A~%" n)
                (let ((myD (m* (ctranspose D) D))
                      mAns)
                  (dotimes (ii 10)
                    (setf mAns (eig myD))
                    (format t "~A " ii)))
                (format t "~%")
                ;;; END OF THE "MAKE-IT-FAIL" CODE

                (if (null X)
          ....

          This causes a failure on either of the lines of code:

          (defparameter *max-eig-numer-ula* (new-max-eig *h-pol-array* *az-rad* *el-rad* *x-ula*))
          (defparameter *max-eig-numer-twisted* (new-max-eig *twisted-array* *az-rad* *el-rad* *x-twist*))

          I have more observations, but instead of continuing with dribble here, I'll try to distill them down a bit first.

          tnx

          mike

           
          • Raymond Toy
            2001-02-23

            Thanks for the simplified code (the other simplified code that you sent me, which doesn't seem to be on this list).  That helped a lot!!!!

            There was a real bug.  I'm trying to check in the change but cvs on sourceforge appears to be down.

            In any case, look in src/geev.lisp.  In the geev method for complex matrices, the dimension for the work array was wrong.  It should be twice the size because the array is supposed to be a complex array.  The code should look more like this:

            (defmethod geev ((a complex-matrix) &optional (job :NN))
              (let* ((n (nrows a))
                     (a (copy a))
                     (w (make-complex-matrix-dim n 1))
                     (xxx   (make-array 2 :element-type 'complex-matrix-element-type))
                     (lwork (* 2 n))
                     (work  (make-array (* 2 lwork) :element-type 'complex-matrix-element-type))

            With this change, I can run 10000 complex eig's without problems.  Note that I didn't have any problem with the real eig routine.  I checked the dimensions there and they look right.
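
            The factor of two can be illustrated in plain Common Lisp, without Matlisp.  This is a minimal sketch under one assumption: complex values are stored as interleaved (real, imaginary) double-floats.  COMPLEX-STORAGE-LENGTH is a hypothetical helper, not part of Matlisp.

            ```lisp
            ;; Sketch only: plain Common Lisp, no Matlisp required.
            ;; Assumption: complex values are stored as interleaved (real, imag)
            ;; double-floats, so COUNT complex numbers need (* 2 COUNT) slots.

            (defun complex-storage-length (count)
              "Return the number of double-float slots needed for COUNT complex values."
              (* 2 count))

            (let* ((n 2)                               ; a 2 x 2 matrix, as in the failing case
                   (lwork (* 2 n))                     ; workspace size in complex elements
                   (work (make-array (complex-storage-length lwork)
                                     :element-type 'double-float
                                     :initial-element 0d0)))
              (format t "lwork = ~D complex elements, work holds ~D doubles~%"
                      lwork (length work)))
            ```

            Under that assumption, a workspace of only lwork doubles is half the size the foreign routine expects, so writes past its end would be consistent with the heap corruption the GC reported.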

            Can you check this for me?

            Ray

             
            • Ray,

                 YES!  I made the changes and pounded the heck out of the EIG routine w/o failure.  I used the simplified code segment as well as the original code.  I used various settings for the GC, matrix sizes, and repetitions.  In all, many 10,000's of calls with real and complex matrices ran w/o any memory problems.

                Now I can leave Matlab behind again!  (Their last release, R12, increased the execution _time_ by a factor of 2 for my applications...this isn't progress :-))

              thanks for the help,

              mike
              --
              -------------------------------------------------------------------
              Dr Michael A. Koerber         It said "Requires Windows 95 or better",
              MIT/Lincoln Laboratory        so I installed Linux.
              mak@ll.mit.edu

               
      • ;;; RTOY,
        ;;;     This should demonstrate the failure in a cleaner environment.
        ;;;
        ;;; tnx,
        ;;; mak
        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        ;;; The purpose here is to try to demonstrate that the function M:EIG
        ;;; causes (or is part of the cause of) GC lossage/segment violations.  In
        ;;; this script *M* and *BYTES-CONSED-BETWEEN-GCS* will affect the
        ;;; onset of the error.  The following notes apply
        ;;;
        ;;; 1. The error will almost certainly occur for the complex case
        ;;; 2. The error will occur for the real case, but not as often
        ;;; 3. The error may be a SEGMENT-VIOLATION
        ;;; 4. The error may also be a GC LOSSAGE error dumping you to an LDB> prompt
        ;;;

        (defparameter *M* 20 "This is the size of a square array that will be used in the test")

        (defparameter *RM* (m:rand *M*) "A Real matrix to play with")

        (defparameter *CM* (m:m+ (m:rand *M*) (m:scal (sqrt -1.0) (m:rand *M*))) "A complex matrix to play with")

        (setf *cm* (m:m* (m:ctranspose *cm*) *cm*))
        (setf *rm* (m:m* (m:transpose *rm*) *rm*))

        (setf *gc-verbose* t)            ; make gc verbose
        (setf *bytes-consed-between-gcs* 2000000)

        (format t
        "~%======================================================================
        STARTING EIG DECOMP OF REAL MATRIX~%
        ")

        (dotimes (ii 10000)
          (princ ii)
          (princ " ")
          (m:eig *rm*))

        (format t
        "~%======================================================================
        STARTING EIG DECOMP OF COMPLEX MATRIX~%
        ")

        (dotimes (ii 10000)
          (princ ii)
          (princ " ")
          (m:eig *cm*))