From: SourceForge.net <no...@so...> - 2011-01-31 07:34:41
Bugs item #3166410, was opened at 2011-01-27 08:39
Message generated for change (Comment added) made by nijtmans
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3166410&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 38. Init - Library - Autoload
Group: current: 8.5.9
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Starwalker (starwalker2000)
Assigned to: Jan Nijtmans (nijtmans)
Summary: "out of stack space" on AIX

Initial Comment:
I get an "out of stack space" error on AIX when running the 32-bit tclsh. The 64-bit tclsh shows no such error. I traced the Tcl code and found the following problem:

[1] stopped in TclpGetCStackParams at line 1095 in file "/tellin/hjw/tcl8.5.9/unix/../unix/tclUnixInit.c" ($t1)
 1095   tsdPtr->stackBound = (int *) ((char *)tsdPtr->outerVarPtr -
(dbx) n
stopped in TclpGetCStackParams at line 1097 in file "/tellin/hjw/tcl8.5.9/unix/../unix/tclUnixInit.c" ($t1)
 1097   } else {
(dbx) p tsdPtr->outerVarPtr
0x2ff22460
(dbx) p stackSize
2147450878
(dbx) print tsdPtr->stackBound
0xaff2a462

According to the output above, stackSize is a very large value, almost 2G, and tsdPtr->stackBound has overflowed.
This causes the following check to return false:

# define CheckCStack(iPtr, localIntPtr) \
    ((localIntPtr) > (iPtr)->stackBound)

(dbx) p &localInt
0x2ff221a0
(dbx) p iPtr->stackBound
0xaff2a462

For the 64-bit tclsh, the situation is as follows:

[1] stopped in TclpGetCStackParams at line 1095 in file "/tellin/hjw/tcl8.5.9-64/unix/../unix/tclUnixInit.c" ($t1)
 1095   tsdPtr->stackBound = (int *) ((char *)tsdPtr->outerVarPtr -
(dbx) n
stopped in TclpGetCStackParams at line 1097 in file "/tellin/hjw/tcl8.5.9-64/unix/../unix/tclUnixInit.c" ($t1)
 1097   } else {
(dbx) print tsdPtr->outerVarPtr
0x0ffffffffffff140
(dbx) print stackSize
4294934528
(dbx) print tsdPtr->stackBound
0x0fffffff00007140

Here stackSize is almost 4G, but tsdPtr->stackBound has not overflowed, so CheckCStack returns true.

----------------------------------------------------------------------

>Comment By: Jan Nijtmans (nijtmans)
Date: 2011-01-31 08:34

Message:
Thanks! So how about the attached patch? No matter how big the stack space is, calculating the border should never overflow! If it does, it means that we have already occupied part of the stack, so the real stack size is lower. Here is a patch that tries to accomplish that.

Does this help?

----------------------------------------------------------------------

Comment By: Starwalker (starwalker2000)
Date: 2011-01-31 02:47

Message:
I think the stack size is correct. The stack size on AIX can be set in the file /etc/security/limits, which sets stack to -1, meaning "unlimited". The limits are as follows:

# ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        4194304
memory(kbytes)       unlimited
coredump(blocks)     2097151
nofiles(descriptors) 2000

For a 32-bit program, the stack size is nearly 2G. For a 64-bit program, the stack size is nearly 4G.
If I change the stack limit to a smaller number, for example 65536, which makes the stack size 32M, the 32-bit tclsh won't core dump because there is no stack bound problem.
----------------------------------------------------------------------

Comment By: Joe Mistachkin (mistachkin)
Date: 2011-01-30 21:26

Message:
I'm trying to get access to an AIX box to help with this issue. I have a few theories of my own I would like to test.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-01-30 21:08

Message:
Yes, something is very strange here: a stack size of 2147450878 (0x7FFF7FFE), that's very big! So maybe the stack size calculation is simply wrong for AIX. Then that should be corrected instead of making the code uglier...

I'm hesitating.

----------------------------------------------------------------------

Comment By: Starwalker (starwalker2000)
Date: 2011-01-30 15:31

Message:
I still think the problem is in calculating iPtr->stackBound. Obviously, 0x2ff221a0 minus 2147450878 is a negative value for a 32-bit integer.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2011-01-28 16:55

Message:
But don't make changes without testing on several platforms (minimally including a normal x86 Unix with gcc, Windows with MSVC, and AIX because it is known to have an issue).

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2011-01-28 16:53

Message:
Ugly's OK. It's conceptually ugly anyway.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-01-28 09:06

Message:
My guess is that on AIX there is a bug in pointer comparison, such that all pointers above 2G are considered smaller than pointers below 2G. So, whenever two pointers are compared, one below and the other above the 2G border, the result is not correct. I see 2 possible solutions to this:
- First, subtracting the two pointers results in a ptrdiff_t value, which is always signed.
Then we can compare this to 0, and as long as no one has pushed more than 2G onto the stack, the result will be as expected. Well, 2G is an incredible amount; I don't think there is any machine with a total stack size as big as half the available memory.
- Another solution would be to cast the pointers to (size_t) before the comparison, so:
    ((size_t)(localIntPtr) > (size_t)(iPtr)->stackBound)
  Then we simply correct AIX's comparison 'bug', but it looks uglier ;-)

I would prefer the first possibility, but someone might try to convince me otherwise. Anyone?

----------------------------------------------------------------------

Comment By: Starwalker (starwalker2000)
Date: 2011-01-27 14:27

Message:
These changes work. After changing tclBasic.c:360 from:
    ((localIntPtr) > (iPtr)->stackBound)
to
    (((localIntPtr) - (iPtr)->stackBound) > 0)
the expression ((localIntPtr) - (iPtr)->stackBound) becomes a positive value. But I have no idea whether it will cause problems on other machines.

However, I still think the value of (iPtr)->stackBound is incorrect.
stopped in TclInterpReady at line 3474 in file "/tellin/hjw/tcl8.5.9/unix/../generic/tclBasic.c" ($t1)
 3474   && CheckCStack(iPtr, &localInt)) {
(dbx) print &localInt
0x2ff22150
(dbx) print iPtr->stackBound
0xaff2a462
(dbx) print &localInt - iPtr->stackBound
0x7fff7cee

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-01-27 12:06

Message:
And, of course, the same changes in tclBasic.c as well.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-01-27 12:03

Message:
How about changing the lines 1071-1073:
    if (stackSize || (tsdPtr->stackBound &&
	    ((stackGrowsDown && (&result < tsdPtr->stackBound)) ||
	    (!stackGrowsDown && (&result > tsdPtr->stackBound))))) {
to:
    if (stackSize || (tsdPtr->stackBound &&
	    ((stackGrowsDown && ((&result - tsdPtr->stackBound) < 0)) ||
	    (!stackGrowsDown && ((&result - tsdPtr->stackBound) > 0))))) {

That should always work, even when stackBound is near the 2G boundary. It would fail when the stack grows to more than half the available memory, but that seems highly unlikely.

Does that help?

----------------------------------------------------------------------