For some applications it can prove difficult to arrange for an interpreter instance to be terminated on the same thread on which it was created. It should be possible to terminate an instance from any thread so that memory resources don't get leaked.
Anonymous
Code committed [r12480], docs committed [r12481]
Related
Commit: [r12480]
Commit: [r12481]
Running the test case with this version hangs when the 63rd Rexx interpreter instance (RII) gets terminated. If it matters: there is always an additional ("root") RII in the system, which gets created first.
(The test case is shaped such that sequentially 1,000 Rexx interpreter instances get created, run a program and get terminated.)
It should only hang if there's still an active thread running on the
instance. In that case, it will wait for any active threads to terminate. A
stack trace might help, but I'm pretty sure I know where it's waiting.
Rick
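Rick's description above (Terminate waits for any active threads before tearing the instance down) can be pictured with plain Java threads. This is only an illustrative sketch using java.util, not the ooRexx or BSF4ooRexx API; the join loop is where an indefinite hang would occur if a worker thread never ends:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: termination that blocks until all worker threads on the
// "instance" have finished -- analogous to Terminate() waiting for
// active threads before tearing the interpreter instance down.
public class WaitForWorkers {
    public static void main(String[] args) throws InterruptedException {
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                try { Thread.sleep(50L * (id + 1)); }
                catch (InterruptedException ignored) {}
            });
            t.start();
            workers.add(t);
        }
        // "Terminate": refuses to proceed while any worker is alive
        for (Thread t : workers) t.join();   // hangs here if a worker never ends
        System.out.println("all workers finished, instance can be torn down");
    }
}
```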
On Thu, Jul 21, 2022 at 6:02 AM Rony G. Flatscher orexx@users.sourceforge.net wrote:
Related
Feature Requests: #806
Reran the test case and let it sit for a while, and indeed it continues! Sometimes the next engine termination blocks, then a few more go through before blocking again for a short while. But it continues!
Will try to terminate the Rexx engine from a separate Java thread, such that the block in Rexx does not stall the Java thread that issues the termination.
O.K. eventually the test case finished, all 1,000 engines got created and terminated.
Doing Java and Rexx gc() a couple of times got the engine and proxy references down to the minimum.
The ooRexx .local objects all got garbage collected with this revision!
Further observations: terminating each RII from a separate Java thread still causes those blocks, which inhibit the creation of new RIIs!
Here just to illustrate (each dot represents a separate RII that successfully ran a program issuing a dot). In this use case the RexxEngines get garbage collected by Java and terminated from the cleaner thread, where the RexxEngine terminate() creates a separate Java thread to invoke Terminate() via JNI (watch out for interleaved output from different threads):
Each termination call is going to run a garbage collection and call uninit
methods. Since these are activities that are cooperatively multitasked,
they might take some time to complete, particularly if the uninit methods
make calls to methods written in native code. Depending on how much live
data there is, this can take a while.
Rick
On Thu, Jul 21, 2022 at 7:33 AM Rony G. Flatscher orexx@users.sourceforge.net wrote:
It should not take about 50 times longer than before, judging by the number of RIIs per second.
Will post three more stack traces from the same process; just did a "Break all" at different points in time. One of them contains a stack trace of an RII termination.
Please advise if I can try to come up with more information.
Here are the three stack traces (the process is currently halted):
Think I got a meaningful stack trace:
Testing new r12482: blocks still occur; here are the stack traces I could get (process on hold):
If it doesn't hang indefinitely, this is not a bug. You are running 1000 garbage collection cycles, which are not interruptible, while also doing other work on other threads. There are going to be delays because of the cooperative multitasking.
The 1000 engines are at the end of the test cycle.
Yesterday it took about 20 seconds altogether, including the more than 1,000 garbage collector runs with the uninits, all executed according to the Java-side statistics (using the version that replaced the monitors in .local right before termination).
Took (manual) timings of r12482 each time a noticeable blocking (lasting a few seconds) occurred. Here are the results; each elapsed time shows the time between blocks at the very beginning of the test cycle (so the first noticeable block occurs after 3.98 seconds, the next one after 6.56 seconds, the next one after 12.91 seconds). Stopped the timings after about 40 seconds:
Still not a bug. The calls yesterday experienced no delays because they were not doing anything: they weren't getting called on the original creation thread. Now that that restriction has been lifted, they perform all of the shutdown activity, including the garbage collection and running the uninit methods.
OK, this is now with a test case (t_bsf.rex) doing principally the same thing: create an instance, run a program (.output~charout('.')), and terminate the instance, in a loop of 1,000 runs. This test case also changes the destination of .input, .output, .error, .traceoutput, and .debuginput to BSF objects that redirect to the Java streams.
Scenario A: everything runs on the same thread: creation, running the program, terminating. According to the Java statistics the BSF objects all got garbage collected and their uninits run:
Scenario B: the creation and running of the program are executed on the same thread; terminating the instance is done from a separate thread. According to the Java statistics the BSF objects all got garbage collected and their uninits run:
To sum up:
Scenario A needs 13.283 seconds for 1000 engines, scenario B needs 828.357 seconds.
In both scenarios the .local entries get freed by Rexx and garbage collected, and their uninits are run.
And still NOT A BUG. Because the separate thread is competing with another thread for the interpreter lock, it is going to be held up frequently by that contention. The external calls in the uninit methods cause the lock to be released, which sends that thread to the back of the line for dispatching. As long as things complete eventually, things are doing what they are supposed to be doing. Your code happens to be non-optimally written.
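The "back of the line" effect described here can be sketched with a fair java.util.concurrent ReentrantLock standing in for the interpreter's dispatch lock. The names and the fairness policy are assumptions for illustration, not the actual ooRexx implementation: a terminating thread that releases the lock around each of 50 callouts must queue again behind a busy thread every single time.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: why releasing a dispatch lock around every callout is costly.
// With a fair lock, each reacquisition puts the terminating thread at the
// back of the FIFO queue behind the busy thread, so N callouts cost N
// queueing delays. Illustrative names, not the ooRexx implementation.
public class LockContentionDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock dispatchLock = new ReentrantLock(true); // fair: FIFO handoff
        Thread busy = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                dispatchLock.lock();
                try { /* interpreter work while holding the lock */ }
                finally { dispatchLock.unlock(); }
            }
        });
        busy.start();
        int reacquisitions = 0;
        for (int i = 0; i < 50; i++) {        // 50 "uninit callouts"
            dispatchLock.lock();
            try { reacquisitions++; } finally { dispatchLock.unlock(); }
            // lock released here: the callout runs without it, and the
            // thread must queue again behind `busy` to continue
        }
        busy.join();
        System.out.println("reacquired dispatch lock " + reacquisitions + " times");
    }
}
```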
This is very unfortunate, as in Java-based web servers each request (e.g. for a JSP page) gets dispatched on a different thread and the termination of the engines gets done from a single, separate Java thread (a cleaner thread).
Unfortunately, there is nothing I can do (unlike with my own stand-alone test programs mimicking what the Java web servers do).
After further extensive testing I think there is a problem somewhere, as even in the case of only ten instances, each having five BSF objects in their .local monitors, terminating those ten instances takes six (!) seconds on a quite fast machine.
Changed test logic: create an instance and run a program (.local monitors with uninit objects) in a loop. The instances get collected, and after the loop (for timing purposes) the Rexx gc gets invoked as many times as instances were created (just to see how long that takes at that particular point in time).
Then another loop over the collected instances sends the terminate message (via the engine's BSFManager): this is the spot where there are dramatic differences comparing variant A with variant B (see below).
loop=10
BSF objects: loop * 5 (monitor destinations), hence totalling 50 objects that need their uninit methods run in this test run
Creating instances, durations:
Sending terminate message, durations:
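The changed test logic can be sketched as follows. FakeInstance is a hypothetical stand-in for the real engine/BSFManager calls; only the shape of the two loops and the timing around the terminate loop mirror the description above:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the timing harness described above: create "instances" in one
// loop, then time a second loop that sends each of them the terminate
// message. FakeInstance stands in for the real engine/BSFManager calls.
public class TerminateTiming {
    static class FakeInstance {
        void run()       { /* run a program using .local monitors */ }
        void terminate() { /* send the terminate message */ }
    }
    public static void main(String[] args) {
        List<FakeInstance> instances = new ArrayList<>();
        for (int i = 0; i < 10; i++) {          // loop=10, as in the test run
            FakeInstance inst = new FakeInstance();
            inst.run();
            instances.add(inst);
        }
        long t0 = System.nanoTime();            // time only the terminate loop
        for (FakeInstance inst : instances) inst.terminate();
        long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("terminating 10 instances took " + elapsedMs + " ms");
    }
}
```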
Variant B is almost 60 (!) times faster than Variant A (0.108 vs. 6.346 seconds)!
The only difference in this case is that Variant A needs to free the .local monitor destination objects and then run their uninits (five BSF objects per instance; with a test run of 10 instances, a total of 50 BSF objects whose uninit methods need to run), yet this takes 6.2 seconds on a quite fast machine.
These uninits are also run in Variant B, but the BSF objects got removed from .local long before the terminate message gets sent to the respective instance.
The routine "bsf.redirectTo {Java | Rexx}" allows one to redirect the .local monitors to Java ("Java") or to Rexx ("Rexx"). The current destination of each of the .input, .output, .error, .debuginput, and .traceoutput monitors gets popped, unless it is the very first destination of the monitor, in which case the respective Java or standard Rexx destination gets added.
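The pop-or-add behaviour just described can be modelled with a simple stack per monitor. This is a sketch of the behaviour as described, not the actual BSF4ooRexx code; class and method names are invented:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the monitor-destination logic described above: each monitor
// keeps a stack of destinations; redirecting pops the current one unless
// it is the only (first) one, in which case the requested default is
// pushed instead. Illustrative model, not the BSF4ooRexx implementation.
public class MonitorStack {
    private final Deque<String> destinations = new ArrayDeque<>();

    public MonitorStack(String initial) { destinations.push(initial); }

    public void redirectTo(String fallback) {
        if (destinations.size() > 1) {
            destinations.pop();              // drop the current destination
        } else {
            destinations.push(fallback);     // keep the first, add the default
        }
    }
    public String current() { return destinations.peek(); }

    public static void main(String[] args) {
        MonitorStack output = new MonitorStack("RexxDefault");
        output.redirectTo("JavaStream");     // only one entry: push the default
        System.out.println(output.current());
        output.redirectTo("JavaStream");     // more than one entry: pop back
        System.out.println(output.current());
    }
}
```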
Additional updates [r12483]
if MutableBufferLength must be a blocking API, shouldn't MutableBufferData also be a blocking API?
you're right, the backing buffer can be reallocated, which makes it risky.
However, that could happen in any event immediately after the pointer was
requested. I suspect it's safe to leave it this way, but I'll back that one
off.
Rick
On Thu, Jul 21, 2022 at 5:07 PM Erich erich_st@users.sourceforge.net
wrote:
The only thing wrong here is how your code is written and your expectations of what will happen. Once Terminate is called, it waits for all the threads to terminate, then obtains the global dispatch lock and runs a garbage collection (a long-running, non-interruptible operation). Once the garbage is collected, the uninit methods get run on objects that are now out of scope. Since your uninit methods call into native code, the dispatch lock is given up before the callout is made and needs to be reobtained for things to continue; the thread goes back to the end of the dispatch queue. If, for example, the thread holding the lock itself needs to run a garbage collection, then it could be a long wait. Adding to this, if your native methods are making calls to APIs that have to get the dispatch lock, there can be a bit of waiting involved. I saw from the stack traces you posted that there's at least one of these API calls being made. Depending on how many threads are running and what they are doing, these calls can experience some substantial delays, but the fact that they actually run to completion means the underlying mechanisms are working correctly.
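The termination sequence outlined here (hold the global dispatch lock, release it around each native callout, reobtain it afterwards) can be sketched in Java. The lock and method names are illustrative stand-ins, not the interpreter's actual code:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the described termination sequence: the dispatch lock is held
// while running interpreted code, but released around each native callout
// and reacquired afterwards (rejoining the dispatch queue each time).
public class TerminationSketch {
    static final ReentrantLock dispatchLock = new ReentrantLock();

    static void runUninit(Runnable nativeCallout) {
        dispatchLock.unlock();       // give up the lock before calling out
        try {
            nativeCallout.run();     // native code runs without the lock
        } finally {
            dispatchLock.lock();     // rejoin the dispatch queue afterwards
        }
    }

    public static void main(String[] args) {
        dispatchLock.lock();         // obtained after all threads terminated
        try {
            // (the non-interruptible garbage collection would run here)
            for (int i = 0; i < 5; i++) {
                runUninit(() -> { /* JNI call to the Java side */ });
            }
            System.out.println("uninits complete after 5 callouts");
        } finally {
            dispatchLock.unlock();
        }
    }
}
```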
Things can be improved by trying to optimize your uninit code to minimize callbacks, or by reducing the number of things you're attempting to do currently, but this is working correctly.
This is the absolute last time I'm going to comment on this issue. If you still feel there's a bug here, then you need to figure it out and fix it.
Rick
Hmm. It is strange that those "few calls with locks" cause such enormous delays when running on a (debug) system that does almost 1,000,000 RexxCPS. So in the six seconds it takes to terminate ten instances, almost 6,000,000 clauses could alternatively be executed.
Even if the implementation is correct (no memory leaks occur anymore, which is good), the runtime characteristics definitely do not look OK at all. E.g., it inhibits deploying ooRexx in Java-based web servers serving JSPs, which would be a pity, as the ooRexx solution is simple and easy for ooRexx programmers to employ, a real boon! Hence I am trying hard to come up with a solution that would help solve the problem.
Regarding uninit: the call to native code is needed to inform the Java side that a particular Rexx object releases its reference to its peer Java object (the Java side decreases the reference counter, and if the counter drops to 0 it removes the pinned Java object from the Java registry).
The uninit logic is written with as few clauses as possible to get the best performance (in native code just an argument check and an ObjectToString() are carried out, and the necessary Java method gets invoked directly via JNI), hence these calls are short and fast. There are no callbacks to ooRexx during their execution.
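The Java-side bookkeeping described above (a reference counter per pinned peer object, with removal once the count reaches zero) might look roughly like this; class and method names are hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the described registry behaviour: the Java side keeps a
// reference count per pinned peer object; an uninit callout decrements it
// and removes the entry once it reaches zero. Hypothetical names.
public class PeerRegistry {
    private final ConcurrentHashMap<String, Integer> refCounts = new ConcurrentHashMap<>();

    public void pin(String peerId) { refCounts.merge(peerId, 1, Integer::sum); }

    public void release(String peerId) {
        // decrement; drop the mapping entirely once the count hits zero
        refCounts.compute(peerId, (id, n) -> (n == null || n <= 1) ? null : n - 1);
    }

    public boolean isPinned(String peerId) { return refCounts.containsKey(peerId); }

    public static void main(String[] args) {
        PeerRegistry registry = new PeerRegistry();
        registry.pin("javaObject-42");
        registry.pin("javaObject-42");       // referenced twice
        registry.release("javaObject-42");   // count drops to 1: still pinned
        System.out.println("pinned after first release: " + registry.isPinned("javaObject-42"));
        registry.release("javaObject-42");   // count drops to 0: removed
        System.out.println("pinned after second release: " + registry.isPinned("javaObject-42"));
    }
}
```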
Looking from the outside, in the form of a black-box analysis, it surely looks like some sort of race condition or wrong locking, as the very same uninits run 60 times slower in one case than in the other, the difference being simply 50 objects with uninits in .local during the termination phase (needing 50 simple callouts without callbacks).
Therefore there seems to be a significant difference between garbage collection running uninits during normal operation of an instance and doing it in the termination phase. It is as if in the termination phase some sort of race condition occurs, or some sort of additional/wrong locking with timeouts.
Nota bene: the test case is the same for both runs; the difference is the existence of five BSF objects in .local when the termination phase starts for each instance, which slows the termination down so tremendously. I cannot see how normal locking for five uninit invocations would cause such delays in the termination phase; hence I speculate some sort of race condition, waiting on a wrong object, using a wrong lock, or the like.
Not sure whether this observation helps: in a different test scenario, where creating an instance, running the program, and terminating all occur in the loop, the main thread seems to be blocked at instance-creation time in the next loop iteration for as long as the termination of the previous instance is being carried out on the separate thread. Only upon termination of that instance on that separate thread does the new instance get created, it seems.
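The observed serialization would be consistent with instance creation and termination contending for one process-global lock. This is a speculative model of that behaviour, not the interpreter's actual code:

```java
import java.util.concurrent.locks.ReentrantLock;

// Speculative model of the observation above: if creation and termination
// share one process-global lock, creating the next instance blocks until
// the previous termination (running on another thread) releases it.
public class CreateBlocksOnTerminate {
    static final ReentrantLock globalLock = new ReentrantLock();

    public static void main(String[] args) throws InterruptedException {
        Thread terminator = new Thread(() -> {
            globalLock.lock();                // "terminate previous instance"
            try { Thread.sleep(200); }        // long-running termination work
            catch (InterruptedException ignored) {}
            finally { globalLock.unlock(); }
        });
        terminator.start();
        Thread.sleep(50);                     // let the terminator grab the lock
        long t0 = System.nanoTime();
        globalLock.lock();                    // "create the next instance"
        try {
            long waitedMs = (System.nanoTime() - t0) / 1_000_000;
            System.out.println("creation waited ~" + waitedMs
                               + " ms for the terminating instance");
        } finally { globalLock.unlock(); }
        terminator.join();
    }
}
```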
As there seems to be nothing I can do to improve these abysmal termination times (one callout per uninit, no callback to ooRexx during this callout, shortest possible paths in the ooRexx, native, and Java code for the uninits), the only thing left would be to warn people about using ooRexx in such use cases, which I would really, really hate, needless to say!
An immense amount of time and effort would have been in vain.
OK, went out and started to debug interpreter/runtime/InterpreterInstance.cpp. Surprisingly, it eventually worked; then I systematically commented out the debug statements until I arrived at a state that keeps working, but where removing any of the remaining debug statements would bring back those abysmal blocks.
So inserting fprintf(stderr, ...); fflush(stderr); at two specific locations makes it work. Both must be present.
The locations in the attached file are line # 484 (before "ResourceSection lock;") and # 497 (after "terminated = allActivities->items() == 1;").
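That adding two fprintf calls makes the hang disappear is the classic signature of a timing- or visibility-sensitive race: the I/O inserts a delay (and often a synchronization point), which masks the bug rather than fixing it. By way of analogy only, in Java the durable fix for such a bug is a real memory-ordering construct rather than print statements; a minimal sketch:

```java
// Analogy sketch: the durable fix for an I/O-masked race is a proper
// memory-ordering construct (here, volatile), not print statements. The
// flag stands in for whatever shared termination state the two threads
// disagree about in the C++ code.
public class VolatileFlagDemo {
    static volatile boolean terminated = false;   // volatile: write becomes visible

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!terminated) {                 // guaranteed to observe the write
                Thread.yield();
            }
            System.out.println("worker saw termination flag");
        });
        worker.start();
        Thread.sleep(100);
        terminated = true;                        // no fprintf-style side effect needed
        worker.join();
    }
}
```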