
#4236 Repeated run_testsuite leads to out of memory error

Milestone: None
Status: open
Owner: nobody
Priority: 5
Updated: 2024-01-08
Created: 2023-12-29
Private: No

Working with Maxima built from commit 3a26747 with SBCL 2.3.7 on Ubuntu 16.04. Running the test suite twice in the same session, the second time with share_tests = true, leads to an out of memory error in either rtest_ctensor or rtest_itensor (it seems to vary).

(%i1) run_testsuite (); run_testsuite (share_tests = true);
Testsuite run for SBCL 2.3.7:
Running tests in rtest_sqdnst: 13/13 tests passed
Running tests in rtest_extensions: 18/18 tests passed
Running tests in rtest_rules: 210/210 tests passed
[... etc etc ...]
Running tests in rtest_ilt: 31/31 tests passed
Running tests in ulp_tests: 63/63 tests passed


No unexpected errors found out of 13,463 tests.
Evaluation took:
  108.034 seconds of real time
  107.751332 seconds of total run time (102.396675 user, 5.354657 system)
  [ Real times consist of 4.411 seconds GC time, and 103.623 seconds non-GC time. ]
  [ Run times consist of 4.406 seconds GC time, and 103.346 seconds non-GC time. ]
  99.74% CPU
  9,620 forms interpreted
  12,149 lambdas converted
  248,913,931,560 processor cycles
  37,916,342,224 bytes consed

(%o0)                                done
(%i1) Testsuite run for SBCL 2.3.7:
Running tests in rtest_sqdnst: 13/13 tests passed
Running tests in rtest_extensions: 18/18 tests passed
Running tests in rtest_rules: 210/210 tests passed
[... etc etc ...]
Running tests in rtest_bernstein: 44/44 tests passed
Running tests in rtest_atensor: 20/20 tests passed
Running tests in rtest_ctensor: Thread local storage exhausted.
fatal error encountered in SBCL pid 15290 tid 15290:
%PRIMITIVE HALT called; the party is over.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>    

I tried some variations of run_testsuite and that combination is what I found that seems to cause the error repeatably.
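
For anyone who wants to script the reproduction, here is a minimal sketch that builds the same input line for Maxima's --batch-string option. The helper name testsuite_batch is made up for illustration; it only assembles the string, one run_testsuite call per argument.

```shell
#!/bin/sh
# Hypothetical helper (not part of Maxima): print a Maxima batch string that
# calls run_testsuite once per argument, passing each argument through as
# that call's options. An empty argument gives a plain run_testsuite().
testsuite_batch() {
    for opts in "$@"; do
        printf 'run_testsuite(%s); ' "$opts"
    done
}

# Possible use, assuming a maxima executable on PATH:
#   maxima --batch-string="$(testsuite_batch "" "share_tests = true")"
```

Both calls end up in one Maxima session, which is the condition under which the error shows up.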

I haven't tried to figure out what operation in rtest_ctensor or rtest_itensor is the immediate cause of the out of memory error. Possibly simplification rules? Just a wild guess.

Discussion

  • Viktor Toth

    Viktor Toth - 2023-12-30

    I don't think this has anything to do with tensors. I just ran the same thing with SBCL 1.4.0 on CentOS 7 and I got a memory exhaustion error in rtest16.

    As a matter of fact, we don't even need to include the share tests, and using a kill(all) between the two runs makes no difference either (I thought it might fix things):

    using Lisp SBCL 1.4.0-1.el7
    Distributed under the GNU Public License. See the file COPYING.
    Dedicated to the memory of William Schelter.
    The function bug_report() provides bug reporting information.
    (%i1) run_testsuite();kill(all);run_testsuite();
    Testsuite run for SBCL 1.4.0-1.el7:
    Running tests in rtest_sqdnst: 13/13 tests passed
    [...]
    Running tests in ulp_tests: 63/63 tests passed
    
    No unexpected errors found out of 13,396 tests.
    Evaluation took:
      126.995 seconds of real time
        97.781301 seconds of total run time (96.163794 user, 1.617507 system)
      [ Run times consist of 4.195 seconds GC time, and 93.587 seconds non-GC time. ]
      77.00% CPU
      14,973 forms interpreted
      15,083 lambdas converted
      266,684,105,920 processor cycles
      44,429,263,344 bytes consed
    
    (%o0)                                done
    (%i1) (%o0)                                done
    Testsuite run for SBCL 1.4.0-1.el7:
    Running tests in rtest_sqdnst: 13/13 tests passed
    [...]
    391/391 tests passed
    Running tests in rtest16: Thread local storage exhausted.
    fatal error encountered in SBCL pid 15590(tid 0x7ffff7fd8740):
    %PRIMITIVE HALT called; the party is over.
    
    Welcome to LDB, a low-level debugger for the Lisp runtime environment.
    ldb>
    

    For what it's worth, I learned a long time ago that running the testsuite twice in the same session only invites trouble.

     
    • Gunter Königsmann

      My guess is that this out-of-memory error is actually an indicator of a real problem, even if I don't know whether it should be called a bug: Maxima loves special variables, but SBCL reserves only a small portion of memory for the thread-local storage they are placed in. In my daily work I run out of thread-local memory every few months, and matchdeclare seems to trigger such out-of-memory errors quickly.

      sbcl 1.5.2 claims to have boosted the size of said memory to 4096 objects and added a command-line switch (--tls-limit) that allows increasing it further. 4096 variables and lost items does not look like much => perhaps our build system should boost that number if sbcl is new enough to understand that command-line switch.

       
      • Robert Dodier

        Robert Dodier - 2023-12-30

        I think we need to know in more detail what is going on with thread local storage for special variables or something like that. If that is the origin of the problem or at least a contributing factor, it should be possible to measure the storage allocation (I don't know how to do that for SBCL, I assume it is possible) every now and then and show that it increases until it fails with an error.

         
  • Robert Dodier

    Robert Dodier - 2023-12-30

    Yeah, I think I've bumped into similar errors before. I'd like to try to at least identify what is the source of the error, even if it's something that we can't or won't fix.

    I bumped into this error because I've been doing a lot of testing for the Unicode pretty printer. It is a bit of a nuisance to run into unrelated errors while trying to test some new code ...

     
  • Robert Dodier

    Robert Dodier - 2023-12-30
    • summary: Repeated run_testsuite leads to out of memory error in tensor tests --> Repeated run_testsuite leads to out of memory error
     
  • Robert Dodier

    Robert Dodier - 2023-12-30

    I've edited the title to omit mention of the tensor tests.

     
  • Robert Dodier

    Robert Dodier - 2023-12-30
    • labels: run_testsuite, tensor, sbcl --> run_testsuite, sbcl
     
    • Gunter Königsmann

      configure.ac already seems to boost sbcl's number of thread-local symbols to a value that allows running the test suite once:

      # The default of 4096 is sometimes too little for the test suite.
      if test x"${sbcl}" = xtrue ; then
         AC_MSG_CHECKING(if sbcl complains if we try to enlarge the thread-local storage)
         echo "(quit)" | ${SBCL_NAME} --tls-limit 8192 > /dev/null 2>&1
         if test "$?" = "0" ; then
          SBCL_EXTRA_ARGS="--tls-limit 8192"
          AC_MSG_RESULT(Yes)
         else
          SBCL_EXTRA_ARGS=""
          AC_MSG_RESULT(No)
         fi
      fi
      

      The question now is: Should we increase that value from 8192 to 16384 - or should we try to find out if we somewhere unnecessarily generate such symbols and therefore can catch this problem at its root?
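
      For experimenting with different limits, the probe above could be written as a small function that tries several values in turn. This is only a sketch: the names accepts_tls_limit and pick_tls_args are made up, and the only thing taken from the build system is the --tls-limit switch and the "(quit)" probe. It shows how to test a limit, not whether raising it is the right fix.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the runtime named by $1 starts and exits
# cleanly when given "--tls-limit $2". "(quit)" is fed on standard input,
# as in the configure.ac probe.
accepts_tls_limit() {
    "$1" --tls-limit "$2" > /dev/null 2>&1 <<'EOF'
(quit)
EOF
}

# Hypothetical helper: print "--tls-limit N" for the first limit N (tried in
# the order given) that the runtime accepts; print nothing and fail if none
# is accepted.
pick_tls_args() {
    runtime=$1
    shift
    for limit in "$@"; do
        if accepts_tls_limit "$runtime" "$limit"; then
            echo "--tls-limit $limit"
            return 0
        fi
    done
    return 1
}

# Possible use, trying the larger value first:
#   SBCL_EXTRA_ARGS=$(pick_tls_args "${SBCL_NAME:-sbcl}" 16384 8192)
```

      If the failure just moves to a later test with a larger limit, that would point to symbols accumulating over time rather than to a limit that is merely a bit too small.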

       
      • Gunter Königsmann

        Weird: In my local maxima copy I increased the size of the thread-local memory storage from 8192 to 16384. Now running the testsuite a second time causes a call stack overflow => we might get trapped here in an infinite recursive function call or something.

         
      • Robert Dodier

        Robert Dodier - 2023-12-30

        "The question now is: Should we increase that value from 8192 to 16384 - or should we try to find out if we somewhere unnecessarily generate such symbols and therefore can catch this problem at its root?"

        It is quite unclear what exactly is going on, therefore it is too early to start adjusting configuration parameters in hopes of avoiding the problem.

        If you are interested in pursuing the possibility that thread local memory allocation or generated symbols or any other specific cause is at the root of the problem, please investigate with whatever tools are available (I don't know what those might be) and please report what you find, and we'll go from there.

         
        • Stavros Macrakis

          Agree with Dodier. We need to find the memory leak, not ignore it by
          increasing allocations.

           

          Last edit: Robert Dodier 2023-12-30
  • Raymond Toy

    Raymond Toy - 2023-12-31

    FWIW, clisp, cmucl and ecl can run the main testsuite at least twice without errors. But I'm using the current HEAD version for this. Don't know if that matters.

     
    • Gunter Königsmann

      ecl and clisp might not matter too much with respect to a memory leak as they don't limit the amount of special variables.

       
  • Barton Willis

    Barton Willis - 2024-01-05

    Using Maxima compiled with Clozure CL 1.12.2, I tried running

    (run_testsuite(), print(1), room(), kill(all), read(),
     run_testsuite(), print(2), room(), kill(all), read(),
     run_testsuite(), print(3), room(), kill(all), read(),
     run_testsuite(), print(4), room(), kill(all), read())
    

    After the first run of the testsuite, room reports

    Approximately 22,020,096 bytes of memory can be allocated
    before the next full GC is triggered.
    
                       Total Size             Free                 Used
    Lisp Heap:       60030976 (58624K)   22020096 (21504K)   38010880 (37120K)
    Stacks:          11034192 (10776K)   11028144 (10770K)       6048 (6K)
    Static:          47218592 (46112K)          0 (0K)       47218592 (46112K)
    376742.750 MB reserved for heap expansion.
    

    After the second

    Approximately 22,937,600 bytes of memory can be allocated
    before the next full GC is triggered.
    
                       Total Size             Free                 Used
    Lisp Heap:       60424192 (59008K)   22937600 (22400K)   37486592 (36608K)
    Stacks:          11034192 (10776K)   11028144 (10770K)       6048 (6K)
    Static:          47218592 (46112K)          0 (0K)       47218592 (46112K)
    376742.370 MB reserved for heap expansion.
    

    After the third

    Approximately 23,199,744 bytes of memory can be allocated
    before the next full GC is triggered.
    
                       Total Size             Free                 Used
    Lisp Heap:       61472768 (60032K)   23199744 (22656K)   38273024 (37376K)
    Stacks:          11034192 (10776K)   11028144 (10770K)       6048 (6K)
    Static:          47218592 (46112K)          0 (0K)       47218592 (46112K)
    376741.370 MB reserved for heap expansion.
    

    And again

    Approximately 10,485,760 bytes of memory can be allocated
    before the next full GC is triggered.
    
                       Total Size             Free                 Used
    Lisp Heap:       58195968 (56832K)   10485760 (10240K)   47710208 (46592K)
    Stacks:          11034192 (10776K)   11028144 (10770K)       6048 (6K)
    Static:          47218592 (46112K)          0 (0K)       47218592 (46112K)
    376744.500 MB reserved for heap expansion.
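
    To make the trend easier to see, the change in the "Used" column of the "Lisp Heap" rows from one run to the next can be computed as below. This is just arithmetic on the four figures reported above; the helper name deltas is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the change between each pair of consecutive
# integer arguments, one difference per line.
deltas() {
    prev=$1
    shift
    for cur in "$@"; do
        echo $((cur - prev))
        prev=$cur
    done
}

# "Used" Lisp heap in bytes after runs 1 to 4, copied from the room() output.
deltas 38010880 37486592 38273024 47710208
# prints:
#   -524288
#   786432
#   9437184
```

    So the used heap stays within about 1 MB over the first three runs and is about 9 MB higher after the fourth. The fourth reading also shows less memory free before the next full GC, so part of that difference may be garbage that has not been collected yet; that is an assumption, not something the output confirms.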
    
     
    • Gunter Königsmann

      That would indicate that Maxima doesn't have a huge memory leak, but it can still have a small memory leak in the places where SBCL assigns a small fixed-size memory buffer: it does so for special variables (by default it can keep only 4096 or so of them), the call stack and the - was it binding stack?

       
  • Tomio Arisaka

    Tomio Arisaka - 2024-01-08

    If you build SBCL with the lisp feature SB-DEVEL, then the function dump-thread is enabled.
    It shows the values of TLS while SBCL is running.

    For example:

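    ;; Requires an SBCL built with the SB-DEVEL feature (for dump-thread).
    (defun foo (&optional (max 5000))
      ;; Dump the thread's TLS before making any new bindings.
      (sb-thread::dump-thread)
      ;; Dynamically bind MAX freshly made symbols to the values 0..MAX-1.
      (progv (loop for i below max
                   collect (make-symbol (format nil "sym~D" i)))
          (loop for i below max collect i)
        ;; Dump again while the bindings are in effect, for comparison.
        (sb-thread::dump-thread)))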
    
     
