From: Geoff B. <geo...@gm...> - 2023-09-19 06:39:07
Hi again Karl,
Much the same observations as for your previous message: I have no way to
test this myself, so please run the self-tests and submit a pull request,
preferably with a new self-test.
I would add that this setup, with some tests running locally and others
via a queue system in the same run, is not a scenario I have ever run
myself, and I would not be surprised if you encounter other problems with
it beyond the ones you have found.
A simple fix is to run TextTest separately on test suites / applications
that need to run sequentially.
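
If the patch route is preferred instead, the idea it relies on (poll every
distinct queue system once, rather than only the one belonging to the first
test) can be sketched in isolation as below. This is a minimal standalone
sketch and not TextTest code: FakeQueueSystem, merge_status and the job ids
are hypothetical names for illustration, and it assumes that
getStatusForAllJobs() may return None when a queue system is unavailable, as
the comment in the existing code suggests.

```python
class FakeQueueSystem:
    """Hypothetical stand-in for a queue system; not a TextTest class."""

    def __init__(self, status):
        # dict of jobId -> status text, or None if the system is unavailable
        self.status = status

    def getStatusForAllJobs(self):
        return self.status


def merge_status(tests, get_queue_system):
    """Poll each distinct queue system once and merge the results.

    Returns None only if no queue system could be polled at all.
    """
    merged = {}
    any_available = False
    seen_ids = set()
    for test in tests:
        queue_system = get_queue_system(test)
        if id(queue_system) in seen_ids:
            continue  # already polled this queue system
        seen_ids.add(id(queue_system))
        status = queue_system.getStatusForAllJobs()
        if status is not None:
            any_available = True
            merged.update(status)
    return merged if any_available else None


# Assumed example data: one local test and two SGE tests sharing a queue system.
local = FakeQueueSystem({})
sge = FakeQueueSystem({"101": "running"})
queue_for = {"fast_test": local, "slow_test_1": sge, "slow_test_2": sge}
result = merge_status(queue_for, queue_for.get)
```

Note that calling dict.update() directly on a None result would raise a
TypeError, which is why the sketch checks for None first. It also leaves one
question open: if only one of several queue systems is unavailable, its jobs
are simply missing from the merged result, and the caller would need to
avoid treating them as failed.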
Regards,
Geoff
On Fri, Sep 15, 2023 at 5:59 PM Karl Koehler <ka...@ac...> wrote:
> Hi everyone,
>
> as you know, each testsuite can be configured with its own queuesystem.
> This is useful if you have some tests that run fast and some tests that
> run slowly - you want the fast jobs to be on the local machine because the
> SGE queuing time would likely be longer than the test execution time.
> Thus there are testsuites with:
> config_module:queuesystem
> queue_system_module:SGE
> and testsuites with
> queue_system_module:local
>
> Now here's the bug in texttest: When looking for "all jobs complete", we
> only look at the queuesystem for test[0]. If that is "local" and we have
> SGE tests, this has the effect that we try to attribute error-states to all
> other tests, then quit early.
> Thus:
>
> --- ~/texttest_latest/texttest-master/texttestlib/queuesystem/masterprocess.py 2023-08-28 10:04:33.000000000 -0700
> +++ queuesystem/masterprocess.py 2023-09-15 08:30:15.505148514 -0700
> @@ -183,21 +183,52 @@
>          return queueSystem.supportsPolling()
>
>      def updateJobStatus(self):
> -        queueSystem = self.getQueueSystem(list(self.jobs.keys())[0])
> -        statusInfo = queueSystem.getStatusForAllJobs()
> +        ##
> +        # set of queueSystems
> +        statusInfo = dict()
> +        qsSet = set()
> +        for test in self.jobs.keys():
> +            qsSet.add(self.getQueueSystem(test))
> +        for queueSystem in qsSet :
> +            statusInfo.update(queueSystem.getStatusForAllJobs())
> +
>          self.diag.info("Got status for all jobs : " + repr(statusInfo))
>          if statusInfo is not None: # queue system not available for some reason
>
>
> I would also suggest that it is probably better to make the following
> loop two-pass, waiting for all jobs to complete before we call "qacct".
> There is a delay between an SGE job completing and the job appearing in
> qacct, so if we wait for the queue to be empty we are less likely to spend
> effort waiting for qacct to be ready for individual failed jobs, and we can
> update the status of jobs that are actually running or passing more promptly:
>
>
> --- ~/texttest_latest/texttest-master/texttestlib/queuesystem/masterprocess.py 2023-08-28 10:04:33.000000000 -0700
> +++ queuesystem/masterprocess.py 2023-09-15 08:51:15.828557223 -0700
> @@ -183,21 +183,38 @@
>          return queueSystem.supportsPolling()
>
>      def updateJobStatus(self):
> -        queueSystem = self.getQueueSystem(list(self.jobs.keys())[0])
> -        statusInfo = queueSystem.getStatusForAllJobs()
> +        ##
> +        # set of queueSystems
> +        statusInfo = dict()
> +        qsSet = set()
> +        for test in self.jobs.keys():
> +            qsSet.add(self.getQueueSystem(test))
> +        for queueSystem in qsSet :
> +            statusInfo.update(queueSystem.getStatusForAllJobs())
> +
>          self.diag.info("Got status for all jobs : " + repr(statusInfo))
>          if statusInfo is not None: # queue system not available for some reason
> +            ##
> +            # setSlaveFailed only if there are no more jobs running.
> +            activejobs = 0
>              for test, jobs in list(self.jobs.items()):
>                  if not test.state.isComplete():
>                      for jobId, jobName in jobs:
>                          status = statusInfo.get(jobId)
>                          if status:
> +                            activejobs += 1
>                              # Only do this to test jobs (might make a difference for derived configurations)
>                              # Ignore filtering states for now, which have empty 'briefText'.
>                              self.updateRunStatus(test, status)
> -                        elif not status and not self.jobCompleted(test, jobName):
> -                            # Do this to any jobs
> -                            self.setSlaveFailed(test, self.jobStarted(test, jobName), True, jobId)
> +            if activejobs == 0:
> +                for test, jobs in list(self.jobs.items()):
> +                    if not test.state.isComplete():
> +                        for jobId, jobName in jobs:
> +                            status = statusInfo.get(jobId)
> +                            if not status and not self.jobCompleted(test, jobName):
> +                                print("state of %s : %s" % (str(test), test.state.category))
> +                                # Do this to any jobs
> +                                self.setSlaveFailed(test, self.jobStarted(test, jobName), True, jobId)
>
>
>
> Similarly for the cleanup function:
>
> @@ -391,8 +408,16 @@
>      def cleanup(self, final=False):
>          cleanupComplete = True
>          if self.jobs:
> -            queueSystem = self.getQueueSystem(list(self.jobs.keys())[0])
> -            cleanupComplete &= queueSystem.cleanup(final)
> +            ## multi-queue-system
> +            #
> +            qsSet = set()
> +            for test in self.jobs.keys():
> +                qsSet.add(self.getQueueSystem(test))
> +            for queueSystem in qsSet :
> +                cleanupComplete &= queueSystem.cleanup(final)
> +            #
> +            ##
>
> Thanks,
>
> - Karl Koehler
>
>
>
> _______________________________________________
> Texttest-users mailing list
> Tex...@li...
> https://lists.sourceforge.net/lists/listinfo/texttest-users
>