From: Karl K. <ka...@ac...> - 2023-09-15 15:58:53
Hi everyone,

As you know, each testsuite can be configured with its own queue system. This is useful if you have some tests that run fast and some that run slowly: you want the fast jobs to run on the local machine, because the SGE queuing time would likely be longer than the test execution time itself. So there are testsuites with config_module:queuesystem and queue_system_module:SGE, and testsuites with queue_system_module:local.
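For concreteness, that just means the suites' config files differ in those entries (the "slow" and "fast" suite names here are placeholders; the option lines are the ones above). In the slow suite's config file:

    config_module:queuesystem
    queue_system_module:SGE

and in the fast suite's config file:

    queue_system_module:local

Run both kinds of suite together and self.jobs in masterprocess.py ends up holding jobs belonging to two different queue systems.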
Now here's the bug in texttest: when looking for "all jobs complete", we only look at the queue system of test[0]. If that one happens to be "local" and we also have SGE tests, the SGE job IDs never appear in the status info we query, so we end up attributing error states to all the other tests and then quitting early. Thus:

--- ~/texttest_latest/texttest-master/texttestlib/queuesystem/masterprocess.py 2023-08-28 10:04:33.000000000 -0700
+++ queuesystem/masterprocess.py 2023-09-15 08:30:15.505148514 -0700
@@ -183,21 +183,52 @@
         return queueSystem.supportsPolling()
 
     def updateJobStatus(self):
-        queueSystem = self.getQueueSystem(list(self.jobs.keys())[0])
-        statusInfo = queueSystem.getStatusForAllJobs()
+        ##
+        # set of queueSystems - poll each one in use and merge the results
+        statusInfo = dict()
+        qsSet = set()
+        for test in self.jobs.keys():
+            qsSet.add(self.getQueueSystem(test))
+        for queueSystem in qsSet:
+            statusInfo.update(queueSystem.getStatusForAllJobs())
+
         self.diag.info("Got status for all jobs : " + repr(statusInfo))
         if statusInfo is not None:  # queue system not available for some reason

I'd also suggest making this a two-pass loop, waiting until no jobs are active any more before we call "qacct". There is a gap in time between an SGE job completing and the job appearing in qacct, so if we wait for the queue to be empty we are less likely to spend time waiting for qacct to be ready for individual failed jobs, and we can update the status of jobs that are actually running and passing more promptly:

--- ~/texttest_latest/texttest-master/texttestlib/queuesystem/masterprocess.py 2023-08-28 10:04:33.000000000 -0700
+++ queuesystem/masterprocess.py 2023-09-15 08:51:15.828557223 -0700
@@ -183,21 +183,38 @@
         return queueSystem.supportsPolling()
 
     def updateJobStatus(self):
-        queueSystem = self.getQueueSystem(list(self.jobs.keys())[0])
-        statusInfo = queueSystem.getStatusForAllJobs()
+        ##
+        # set of queueSystems - poll each one in use and merge the results
+        statusInfo = dict()
+        qsSet = set()
+        for test in self.jobs.keys():
+            qsSet.add(self.getQueueSystem(test))
+        for queueSystem in qsSet:
+            statusInfo.update(queueSystem.getStatusForAllJobs())
+
         self.diag.info("Got status for all jobs : " + repr(statusInfo))
         if statusInfo is not None:  # queue system not available for some reason
+            ##
+            # setSlaveFailed only if there are no more jobs running.
+            activejobs = 0
             for test, jobs in list(self.jobs.items()):
                 if not test.state.isComplete():
                     for jobId, jobName in jobs:
                         status = statusInfo.get(jobId)
                         if status:
+                            activejobs += 1
                             # Only do this to test jobs (might make a difference for derived configurations)
                             # Ignore filtering states for now, which have empty 'briefText'.
                             self.updateRunStatus(test, status)
-                        elif not status and not self.jobCompleted(test, jobName):
-                            # Do this to any jobs
-                            self.setSlaveFailed(test, self.jobStarted(test, jobName), True, jobId)
+            if activejobs == 0:
+                for test, jobs in list(self.jobs.items()):
+                    if not test.state.isComplete():
+                        for jobId, jobName in jobs:
+                            status = statusInfo.get(jobId)
+                            if not status and not self.jobCompleted(test, jobName):
+                                print("state of %s : %s" % (str(test), test.state.category))
+                                # Do this to any jobs
+                                self.setSlaveFailed(test, self.jobStarted(test, jobName), True, jobId)

The same applies to the cleanup function:

@@ -391,8 +408,16 @@
     def cleanup(self, final=False):
         cleanupComplete = True
         if self.jobs:
-            queueSystem = self.getQueueSystem(list(self.jobs.keys())[0])
-            cleanupComplete &= queueSystem.cleanup(final)
+            ## multi-queue-system - run cleanup for every queue system in use
+            #
+            qsSet = set()
+            for test in self.jobs.keys():
+                qsSet.add(self.getQueueSystem(test))
+            for queueSystem in qsSet:
+                cleanupComplete &= queueSystem.cleanup(final)
+            #
+            ##

Thanks,
- Karl Koehler
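P.S. The "collect the set of queue systems" loop now appears in both updateJobStatus() and cleanup(), so it might be worth pulling it into a small helper on the same class. Untested sketch only - the method name is my invention, the body just reuses self.jobs and getQueueSystem() as in the patches above:

    def getActiveQueueSystems(self):
        # Distinct queue systems across all tests we currently have jobs for
        qsSet = set()
        for test in self.jobs.keys():
            qsSet.add(self.getQueueSystem(test))
        return qsSet

updateJobStatus() and cleanup() would then just iterate over self.getActiveQueueSystems() instead of repeating the loop.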