You have an uninitialized variable i in the uninit() method in line 170 in test_util.cls
The interpreter crashes while trying to report error 41.001 Nonnumeric value .. used in arithmetic operation
The ooRexx interpreter should never be able to crash in such a situation!
With the help of Rony I could make the test cases run on a 64-bit ooRexx installation on a Mac as well, and they appear to produce the same behaviour as on 32-bit ooRexx for Windows.
I have enclosed a Zip with source, makefile and instructions on how to do the testing on a Mac as well. I have corrected the uninitialised value referred to above. I do not see a difference before and after, so I do not think that "i" is the culprit.
Running the program pgm_02.rex as-is I get a Bus error: 10, whereas running it in Valgrind I get a Segmentation fault: 11 error. I have enclosed logs for both cases.
I suggest starting to read the Valgrind log at:
==2912== Process terminating with default action of signal 11 (SIGSEGV)
==2912== Bad permissions for mapped region at address 0x100E02C28
This points to specific ooRexx libraries. I hope this can shed some further light on the bug(s) reported by Rony. I will continue to run this test with more options to try to capture more detail of where it goes wrong.
PS: The following message in the Valgrind log:
Syscall param msg->desc.port.name points to uninitialised byte(s)
emanates from a piece of code in syswrap-darwin.c (internal to Valgrind):
static void pre_port_desc_read(ThreadId tid, mach_msg_port_descriptor_t *desc2)
{
#pragma pack(4)
    struct {
        mach_port_t name;
        mach_msg_size_t pad1;
        uint16_t pad2;
        uint8_t disposition;
        uint8_t type;
    } *desc = (void*)desc2;
#pragma pack()
    PRE_FIELD_READ("msg->desc.port.name", desc->name);
    PRE_FIELD_READ("msg->desc.port.disposition", desc->disposition);
    PRE_FIELD_READ("msg->desc.port.type", desc->type);
}
I think it can be ignored; the same message also appears for fully functional ooRexx code.
I have continued to run this test case and the crash shows up consistently. I have tried to extract the most relevant parts from the Valgrind log (attached). Hopefully some of these method names in the named libraries ring a bell?
The first reported error (in the first thread/interpreter instance) is that the pointer passed as socketcall.sendto(msg) points to uninitialised byte(s).
The cause seems to be a chain of calls starting with RexxCreateSessionQueue (read from the bottom to the top) in librexxapi.5.0.0.dylib (the macOS counterpart of a DLL).
--Begin Valgrind--
Syscall param socketcall.sendto(msg) points to uninitialised byte(s)
at sendto in /usr/lib/system/libsystem_kernel.dylib
by
    SysSocketConnection::write(void*, unsigned long, unsigned long*)
    SysSocketConnection::write(void*, unsigned long, void*, unsigned long, unsigned long*)
    ServiceMessage::writeMessage(SysClientStream&)
    ClientMessage::send(SysClientStream*)
    ClientMessage::send()
    LocalAPIManager::establishServerConnection()
    LocalAPIManager::initProcess()
    LocalAPIManager::getInstance()
    LocalAPIContext::getAPIManager()
    RexxCreateSessionQueue
in /Library/Frameworks/ooRexx.framework/Versions/A/Libraries/librexxapi.5.0.0.dylib
--End Valgrind--
The second reported error (in the first thread/interpreter instance) is
Interpreter::startInterpreter(Interpreter::InterpreterStartupMode)
in librexx.5.0.0.dylib
Caused by an uninitialised value created by a stack allocation by LocalAPIManager::establishServerConnection()
in librexxapi.5.0.0.dylib
--Begin Valgrind--
Interpreter::startInterpreter(Interpreter::InterpreterStartupMode)
in /Library/Frameworks/ooRexx.framework/Versions/A/Libraries/librexx.5.0.0.dylib
Address 0x7fff5fbfd452 is on thread 1's stack
in frame #6, created by LocalAPIManager::establishServerConnection() (???:)
Uninitialised value was created by a stack allocation
in /Library/Frameworks/ooRexx.framework/Versions/A/Libraries/librexxapi.5.0.0.dylib
--End Valgrind--
The actual crash occurs the FIRST time an additional thread is created. Thread 1 is the first interpreter instance and Thread 2 is the first new one. The code in test_util.cls is this:
::method create class
  use strict arg number=10000, useReply=.false -- number of objects to create
  if useReply=.true then -- do it in another thread
    reply                -- <--- the crash occurs HERE
  do i=1 to number       -- <--- this line is never reached in Thread 2 (but the thread gets created)
    self~new
    if i//100=0 then .output~charout(".")
    if i//1000=0 then .output~charout(i)
  end
  .output~say
The library is libsystem_pthread.dylib and the functions involved are _pthread_body, _pthread_start and thread_start:
--Begin Valgrind--
Thread 2:
Invalid read of size 4
at 0x10087D899: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
by 0x10087D886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
by 0x10087D08C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
Address 0x18 is not stack'd, malloc'd or (recently) free'd
--End Valgrind--
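An address as small as 0x18 is often what you see when a field is read through a null pointer: the faulting address is then just the offset of the field inside the structure. The sketch below illustrates that arithmetic only. ThreadRecord and flagsAddress are hypothetical names made up for this example, not the real libsystem_pthread layout, and this is only one possible reading of the log.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical thread descriptor, invented for illustration.
// It is NOT the real libsystem_pthread structure.
struct ThreadRecord
{
    std::uint64_t self;       // offset 0x00
    std::uint64_t stackBase;  // offset 0x08
    std::uint64_t stackSize;  // offset 0x10
    std::uint32_t flags;      // offset 0x18, a 4-byte field
};

// Address that a read of record->flags would touch for a given base address.
// Computed arithmetically, so nothing is dereferenced.
std::uintptr_t flagsAddress(std::uintptr_t base)
{
    return base + offsetof(ThreadRecord, flags);
}
```

With a base of 0 (a null record), flagsAddress(0) yields 0x18, which has the same shape as the "Invalid read of size 4" at "Address 0x18" reported above.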
I will continue to drill deeper. I assume I can get hold of the source for all the libraries above?
There are also 4 warnings from the compiler when building the test case:
test_gc.cpp:192:45: warning: format specifies type 'int' but the argument has
type 'logical_t' (aka 'unsigned long') [-Wformat]
instance, rc, rtc, pgmName, args, resu...
It does not look worrying but I will try to silence it and see if it makes a difference.
If there is ANYTHING you want me to test further just let me know.
I have now silenced the compiler warnings
warning: format specifies type 'int' but the argument has type 'logical_t'
at
test_gc.cpp:192:45
test_gc.cpp:192:77
test_gc.cpp:229:123
test_gc.cpp:229:146
Replacing %d with %lu in the fprintf calls for
rc=[%d]
condition=[%d]
bCondition=[%d]
silenced the warnings, but the crash is unchanged.
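For reference, a minimal sketch of the kind of change described above. It assumes logical_t is 'unsigned long', as the compiler warning states for this platform. The typedef is a stand-in for the real ooRexx header, and formatStatus and its parameters are hypothetical names, not the code in test_gc.cpp.

```cpp
#include <cstdio>
#include <string>

// Assumption: stand-in for the ooRexx logical_t type, which the compiler
// warning describes as 'unsigned long' on this platform.
typedef unsigned long logical_t;

// Hypothetical helper: formats two logical_t values with %lu instead of %d,
// so the format specifier matches the argument type.
std::string formatStatus(logical_t rc, logical_t condition)
{
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "rc=[%lu] condition=[%lu]",
                  static_cast<unsigned long>(rc),
                  static_cast<unsigned long>(condition));
    return buffer;
}
```

The explicit casts keep the call correct even on a platform where logical_t has a different width, which is why they are used here rather than relying on the typedef alone.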
Last edit: Per Olov Jonsson 2017-10-11
Here are two more trace files with ooRexx built as debug. I still get "Segmentation fault: 11" when running this test case on macOS.
-Checked out revision 11317, complete
-cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=DEBUG ../../oorexxsvn/main/trunk
-make
Unfortunately, this crash still occurs on the tested Windows (32-bit) and on Linux (64-bit)!
Enclosed please find the updated zip-archive, the full Linux thread stack traces and the full Windows stack trace.
[In order to get the thread stack traces with and without local vars, I updated test_gc.cpp so that it can be run on Linux (and MacOSX) by adding the external function BsfGetTid(), as SysQueryProcess("TID") is not available on Unix. I adapted the Rexx scripts accordingly and created the new zip archive "bug2a_20171021.zip".]
I'm seeing that, at the time of the crash, when RexxActivation::getArguments() retrieves argCount it receives 1145128260, which is '44414544'x which is "DAED" in ASCII, which is the eye-catcher that dead objects are marked with (if running a DEBUG version).
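The conversion above can be checked with a small sketch. The helper names bytesBigEndian and bytesLittleEndian are made up for this example. Note that "DAED" is the most-significant-byte-first reading of the value; on a little-endian CPU the same four bytes sit in memory in the reverse order.

```cpp
#include <cstdint>
#include <string>

// Return the four bytes of a 32-bit value as characters,
// most significant byte first (the '44414544'x reading).
std::string bytesBigEndian(std::uint32_t value)
{
    std::string out;
    for (int shift = 24; shift >= 0; shift -= 8)
    {
        out.push_back(static_cast<char>((value >> shift) & 0xFF));
    }
    return out;
}

// Same bytes, least significant byte first, which is the order
// they occupy in memory on a little-endian CPU.
std::string bytesLittleEndian(std::uint32_t value)
{
    std::string out;
    for (int shift = 0; shift <= 24; shift += 8)
    {
        out.push_back(static_cast<char>((value >> shift) & 0xFF));
    }
    return out;
}
```

Calling bytesBigEndian(1145128260u) gives "DAED", and bytesLittleEndian(1145128260u) gives "DEAD".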
Something a little more subtle. The exception occurs because an error is raised in an uninit method. While building the stack trace, it is picking up the stack frame of the just-completed method that initiated the uninit processing. Because execution of that activation is complete, a lot of the information (in this case, the current instruction) is not valid. I believe this patch should fix the problem (untested in this scenario).
Committed code fix with revision [r11321].
Regression tests run fine.
The provided error case pgm01.rex runs fine.
Rick, thanks!!!!
Rony, can you please confirm?
Built 32-bit Windows and 64-bit Linux ooRexx from trunk and can confirm that the test case runs fine on either system.