#4785 "out of stack space" on AIX

obsolete: 8.5.9
closed-fixed
5
2011-03-07
2011-01-27
Starwalker
No

I get an "out of stack space" error on AIX when running a 32-bit tclsh. The 64-bit tclsh does not show the error.
I traced through the Tcl code and found the following problem:

[1] stopped in TclpGetCStackParams at line 1095 in file "/tellin/hjw/tcl8.5.9/unix/../unix/tclUnixInit.c" ($t1)
1095 tsdPtr->stackBound = (int *) ((char *)tsdPtr->outerVarPtr -
(dbx) n
stopped in TclpGetCStackParams at line 1097 in file "/tellin/hjw/tcl8.5.9/unix/../unix/tclUnixInit.c" ($t1)
1097 } else {
(dbx) p tsdPtr->outerVarPtr
0x2ff22460
(dbx) p stackSize
2147450878
(dbx) print tsdPtr->stackBound
0xaff2a462

According to the output above, stackSize is very large, almost 2G, and tsdPtr->stackBound has overflowed. This causes the following check to return false:
# define CheckCStack(iPtr, localIntPtr) \
    ((localIntPtr) > (iPtr)->stackBound)
(dbx) p &localInt
0x2ff221a0
(dbx) p iPtr->stackBound
0xaff2a462

For 64bits-tclsh, the situation is as follows:
[1] stopped in TclpGetCStackParams at line 1095 in file "/tellin/hjw/tcl8.5.9-64/unix/../unix/tclUnixInit.c" ($t1)
1095 tsdPtr->stackBound = (int *) ((char *)tsdPtr->outerVarPtr -
(dbx) n
stopped in TclpGetCStackParams at line 1097 in file "/tellin/hjw/tcl8.5.9-64/unix/../unix/tclUnixInit.c" ($t1)
1097 } else {
(dbx) print tsdPtr->outerVarPtr
0x0ffffffffffff140
(dbx) print stackSize
4294934528
(dbx) print tsdPtr->stackBound
0x0fffffff00007140

The stackSize is almost 4G, but tsdPtr->stackBound has not overflowed, so CheckCStack returns true.

Discussion

  • Donal K. Fellows

    • labels: --> 38. Init - Library - Autoload
    • assigned_to: nobody --> dgp
     
  • Jan Nijtmans

    Jan Nijtmans - 2011-01-27

    How about changing the lines 1071-1073:
    if (stackSize || (tsdPtr->stackBound &&
    ((stackGrowsDown && (&result < tsdPtr->stackBound)) ||
    (!stackGrowsDown && (&result > tsdPtr->stackBound))))) {
    to:
    if (stackSize || (tsdPtr->stackBound &&
    ((stackGrowsDown && ((&result - tsdPtr->stackBound) < 0)) ||
    (!stackGrowsDown && ((&result - tsdPtr->stackBound) > 0))))) {

    That should always work, even when the stackBound is near the 2G
    boundary. It would fail when the stack grows to more than half the available
    memory, but that seems highly unlikely.

    Does that help?

     
  • Jan Nijtmans

    Jan Nijtmans - 2011-01-27

    And - of course - the same changes in tclBasic.c as well

     
  • Starwalker

    Starwalker - 2011-01-27

    These changes work.
    After changing tclBasic.c:360 from:
    ((localIntPtr) > (iPtr)->stackBound)
    to
    (((localIntPtr) - (iPtr)->stackBound) > 0)

    This makes ((localIntPtr) - (iPtr)->stackBound) a positive value, but I have no idea whether it will cause problems on other machines.
    However, I still think the value of (iPtr)->stackBound is incorrect.

    stopped in TclInterpReady at line 3474 in file "/tellin/hjw/tcl8.5.9/unix/../generic/tclBasic.c" ($t1)
    3474 && CheckCStack(iPtr, &localInt)) {
    (dbx) print &localInt
    0x2ff22150
    (dbx) print iPtr->stackBound
    0xaff2a462
    (dbx) print &localInt - iPtr->stackBound
    0x7fff7cee

     
  • Jan Nijtmans

    Jan Nijtmans - 2011-01-28
    • assigned_to: dgp --> nijtmans
     
  • Jan Nijtmans

    Jan Nijtmans - 2011-01-28

    My guess is that on AIX there is a bug in pointer comparison, such that all pointers above 2G are considered smaller than pointers below 2G. So whenever two pointers are compared, one below and the other above the 2G border, the result is not correct. I see two possible solutions to this:

    - First subtract the two pointers, which gives a ptrdiff_t,
    which is always signed. Then we can compare this to 0, and
    as long as no one has pushed more than 2G on the stack the
    result will be as expected. Well, 2G is an incredible
    amount; I don't think there is any machine with a total
    stack size as big as half the available memory.
    - Another solution would be to cast the pointers to
    (size_t) before the comparison, so:

    ((size_t)(localIntPtr) > (size_t)(iPtr)->stackBound)

    Then we simply correct AIX's comparison 'bug', but it
    looks uglier ;-)

    I would prefer the first possibility, but someone might
    try to convince me otherwise. Anyone?

     
  • Donal K. Fellows

    Ugly's OK. It's conceptually ugly anyway.

     
  • Donal K. Fellows

    But don't make changes without testing on several platforms (minimally including a normal x86 Unix with gcc, Windows with MSVC, and AIX because it is known to have an issue).

     
  • Starwalker

    Starwalker - 2011-01-30

    I still think it's the problem of calculating iPtr->stackBound.
    Obviously, 0x2ff221a0 minus 2147450878 is a negative value for a 32-bit integer.

     
  • Jan Nijtmans

    Jan Nijtmans - 2011-01-30

    Yes, something is very strange here: a stack size of
    2147450878 (0x7FFF7FFE) is very big! So
    maybe the stack size calculation is simply wrong
    for AIX. Then that should be corrected instead of
    making the code uglier... I'm hesitating.

     
  • Joe Mistachkin

    Joe Mistachkin - 2011-01-30

    I'm trying to get access to an AIX box to help with this issue. I have a few theories of my own I would like to test.

     
  • Starwalker

    Starwalker - 2011-01-31

    I think the stack size is correct. The stack size on AIX is set in the file /etc/security/limits, which here sets stack to -1, meaning "unlimited".

    The limits are as follows:
    # ulimit -a
    time(seconds) unlimited
    file(blocks) unlimited
    data(kbytes) unlimited
    stack(kbytes) 4194304
    memory(kbytes) unlimited
    coredump(blocks) 2097151
    nofiles(descriptors) 2000

    For a 32-bit program, the stack size is nearly 2G. For a 64-bit program, the stack size is nearly 4G.

    If I change the stack limit to a smaller number, for example 65536, which makes the stack size 32M, the 32-bit tclsh does not core dump, because the stack bound problem does not occur.

     
  • Jan Nijtmans

    Jan Nijtmans - 2011-01-31

    Thanks! So how about the attached patch? No matter how big the stack space is, calculating
    the border should never overflow! If it does, it means that we have already occupied part of the
    stack, so the real stack size is lower. Here is a patch trying to accomplish that.

    Does this help?

     
  • Jan Nijtmans

    Jan Nijtmans - 2011-01-31

    proposed fix

     
  • Starwalker

    Starwalker - 2011-02-09

    This patch (3166410.patch) works.

     
  • Jan Nijtmans

    Jan Nijtmans - 2011-03-07
    • status: open --> closed-fixed
     
  • Jan Nijtmans

    Jan Nijtmans - 2011-03-07

    Fixed on core-8-5-branch. Not applicable to trunk and 8.4