Orphan background processes
Status: Beta
Brought to you by:
worden
There seem to be background jobs in which make and the processor-intensive programs it calls are still running after pe-make has terminated. I don't think this is supposed to happen. It may leave background processes running that are no longer visible in the background job listings on the wiki, or it may cause the wiki to report jobs as terminated when they're not. This looks like a bug that should be fixed.
Anonymous
Confirmed. I'm looking at one right now, in which make is running and pe-make isn't. WW reports the job as "not running (status unknown)". I think it'll continue to do that after make finishes, because pe-make is expected to record the exit status and it won't.
So two questions:
1. What happened to pe-make? Was it the reaper? Or does pe-make have a flaw that should be eliminated? If not, and if I knew what was killing it, I might be able to catch the condition and at least leave a record and/or kill the make process on the way down.
2. If it's going to happen like this, I should improve the interface. I could check for the make process as well as pe-make, I guess, so that I could report "is running" while it's running. Then when make finishes, pe-make won't be there to record the exit status, so I'd have to report "not running (status unknown)", but at least in that case it'd be true.
Last edit: Lee Worden 2013-04-11
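For what it's worth, the reporting rule in question 2 could be sketched roughly as below. This is only an illustration in Python, not WW's actual code: `pid_alive`, `job_status`, and the "finished (exit status N)" wording are hypothetical names made up for the example, and it assumes the pids of pe-make and make are known for each job.

```python
import os

def pid_alive(pid):
    """Return True if a process with this pid currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def job_status(pe_make_alive, make_alive, recorded_exit=None):
    """Hypothetical status rule: trust make as well as pe-make."""
    if pe_make_alive or make_alive:
        return "is running"
    if recorded_exit is not None:
        # Assumed wording for the normal case where pe-make recorded a status.
        return f"finished (exit status {recorded_exit})"
    # pe-make died before it could record anything.
    return "not running (status unknown)"
```

A caller would do something like `job_status(pid_alive(pe_make_pid), pid_alive(make_pid))`. Pids can be reused by the system, so a real check would probably want to confirm the command name as well.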
We're getting something like this on our SGE setup as well - processes that are running, though the background session directory does not exist. I'm presuming this is because WW thinks it killed the job and destroyed the session directory. This may be a separate bug in which we need better handling in case we don't successfully kill the job. But I don't actually know what the cause is.
For reference, the last fixes I made to the SGE interface were at https://sourceforge.net/tracker/index.php?func=detail&aid=3476800&group_id=366300&atid=1527385, which could be relevant in some way.
We have these on the cluster nodes now (we're running both foreground and background WW processes distributed across the yushan cluster).
On another ticket, Andrei A. wrote:
Andrei: why do you say so? It seems more likely to me that they happen at random times. Did you have experience suggesting that they are connected to clear or edit operations, freezes, or background jobs? All these things are supposed to be protected from each other by a locking algorithm.
Is it possible to just "collect" the problem? I.e., have some sort of background process that scans for jobs that have lost their pe-make and kills them?
Sure, but it wouldn't help us kill the bug.
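A collector along those lines might look something like the sketch below. It is a rough Python illustration under some assumptions: that an orphaned job shows up as a process named `make` with no `pe-make` anywhere among its ancestors, and that nobody runs make by hand on the same machine (those would be flagged too). `read_process_table` and `find_orphans` are hypothetical names, and the killing is deliberately left to the caller.

```python
import subprocess

def read_process_table():
    """Return {pid: (ppid, command name)} from ps (POSIX-style options)."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    table = {}
    for line in out.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            table[int(parts[0])] = (int(parts[1]), parts[2].strip())
    return table

def find_orphans(table):
    """Return sorted pids of 'make' processes with no 'pe-make' ancestor."""
    def has_pe_make_ancestor(pid):
        seen = set()
        while pid in table and pid not in seen:
            seen.add(pid)  # guard against a malformed, cyclic table
            ppid, _ = table[pid]
            if ppid in table and table[ppid][1] == "pe-make":
                return True
            pid = ppid
        return False

    return sorted(
        pid for pid, (_, name) in table.items()
        if name == "make" and not has_pe_make_ancestor(pid)
    )
```

Something like `find_orphans(read_process_table())` run from cron would give the list of candidates to log or kill. As noted, this only cleans up after the fact and says nothing about why pe-make went away.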
I have a test program that produces a ridiculously large amount of output, and apparently fails to die when pe-make kills its children. Useful for testing. Need to sit down with it and test it until I see how to fix it.
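One thing that might be worth trying with the test program, sketched here in Python rather than in pe-make's own code: start the job in its own session so that the whole process group can be signalled at once, and leave a record if the wrapper itself is told to die. This assumes the stray processes are descendants that a kill aimed only at the direct child never reaches, which is a guess and not a diagnosis. `run_job` and the status file format are hypothetical. SIGKILL can't be caught, so a wrapper killed that way would still leave nothing behind.

```python
import os
import signal
import subprocess
import sys

def run_job(cmd, status_path):
    """Run cmd, record how it ended, and take the whole group down on a signal."""
    # start_new_session=True makes the child a process group leader,
    # so child.pid can also be used as the process group id.
    child = subprocess.Popen(cmd, start_new_session=True)

    def on_signal(signum, frame):
        # Leave a record first, then signal every process in the child's group.
        with open(status_path, "w") as f:
            f.write(f"killed by signal {signum}\n")
        try:
            os.killpg(child.pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the group is already gone
        child.wait()
        sys.exit(128 + signum)

    for sig in (signal.SIGTERM, signal.SIGHUP, signal.SIGINT):
        signal.signal(sig, on_signal)

    code = child.wait()
    with open(status_path, "w") as f:
        f.write(f"exit status {code}\n")
    return code
```

A program that detaches into yet another session, or ignores SIGTERM, would survive this as well, so a follow-up SIGKILL after a timeout may still be needed.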