From: Karl K. <ka...@ac...> - 2023-09-13 14:43:53
Hi,

We are using texttest with SGE, and have found that there is a problem when qacct is not fast enough.

What happens:

(1) The job completes.
(2) Via a message from the slave, the status of the test is updated (masterprocess.py, line 1025, in handleRequestFromHost). This happens in an independent thread.
(3) In masterprocess.py, updateJobStatus sees that the job is no longer in the qstat output. But at masterprocess.py:198, in updateJobStatus, jobComplete is not yet true.
(4) setSlaveFailed waits a long time for qacct to finally get the info on the job, by which time step (2) has happened.

Result: failures that are not quite real, and incorrect error messages in test.state.freeText and test.state.briefText.

So, there are questions:

* Should there be a lock around Test.changeState?
* And what do you think of the following work-around/solution for the problem that SGE is too late, regardless of locking?

-bash-4.2$ diff -du ~/texttest_latest/texttest-master/texttestlib/queuesystem/masterprocess.py texttestlib/queuesystem/masterprocess.py
--- /home/karlkoehler/texttest_latest/texttest-master/texttestlib/queuesystem/masterprocess.py 2023-08-28 10:04:33.000000000 -0700
+++ texttestlib/queuesystem/masterprocess.py 2023-09-12 17:00:14.795473685 -0700
@@ -646,8 +646,11 @@
         return system
     def changeState(self, test, newState, previouslySubmitted=True):
-        test.changeState(newState)
-        self.handleLocalError(test, previouslySubmitted)
+        # this has to check the test because otherwise slowness in sge qacct will
+        # set the state to failed and with the wrong message.
+        if not test.state.isComplete():
+            test.changeState(newState)
+            self.handleLocalError(test, previouslySubmitted)

Thanks,
Karl Koehler
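
To illustrate how the two questions fit together, here is a minimal sketch of doing the isComplete() check and the state change under one lock, so that a late failure report cannot overwrite a result that has already arrived. DemoState, DemoTest and change_state_if_incomplete are hypothetical stand-ins written for this example, not TextTest's real classes; only the names changeState, isComplete and briefText are taken from the message above.

```python
import threading


class DemoState:
    """Hypothetical stand-in for a test state object."""

    def __init__(self, category, complete=False, briefText=""):
        self.category = category
        self.complete = complete
        self.briefText = briefText

    def isComplete(self):
        return self.complete


class DemoTest:
    """Hypothetical stand-in for a test; holds a state and a lock guarding it."""

    def __init__(self):
        self.state = DemoState("running")
        self.lock = threading.Lock()

    def changeState(self, newState):
        self.state = newState


def change_state_if_incomplete(test, newState):
    """Check and update under one lock, so no other thread can
    complete the test between the check and the assignment."""
    with test.lock:
        if test.state.isComplete():
            return False  # a result already arrived; keep it
        test.changeState(newState)
        return True


test = DemoTest()

# Step (2): the slave's report arrives first and completes the test.
reported = change_state_if_incomplete(
    test, DemoState("succeeded", complete=True))

# Step (4): the poller gives up waiting on accounting info and tries to mark failure.
overwritten = change_state_if_incomplete(
    test, DemoState("killed", complete=True, briefText="no accounting info"))

print(reported, overwritten, test.state.category)
```

Without the lock, the check in the patch narrows the window but does not close it: another thread could still complete the test between isComplete() and changeState(). Whether that window matters in practice depends on how the master process schedules these threads, which this sketch does not model.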