From: Pawel S. <Paw...@bc...> - 2009-07-20 07:18:37
Hi,

Sorry for not replying for a long time; I was away. I am back on the issue now.

I had some chats with the HPC engineer here and I think we are closer to finding the problem. It seems to be due to our setup: I am running GridSAM on a virtual machine (let's call it VM) which is a mirror copy of the cluster's submission node (SN). They have a lot in common, but not the filesystem. So basically the PBS script generated by GridSAM runs on one machine while GridSAM runs on another.

I am not sure whether such a use case has been thought of, but apparently the generated PBS script (snippet at the end of this email) tries to 'cd' from SN to a directory (PBS_O_WORKDIR) that is on VM, which obviously fails. This makes me suspect that the .complete file touched at the end of the script is never created where GridSAM expects it, so GridSAM does not know when the job completes. The stdout and stderr files configured in the PBS script (#PBS -o and #PBS -e) do end up in the right place though, so PBS itself handles this virtual-machine setup.

We came up with two solutions: either SN and VM share some disk space mounted at the same path on both, or we alter the way the PBS script is generated. And here comes the question: is that possible? Is there some config file I can alter in order to get, for example, "ssh VM:$PBS_O_WORKDIR/ touch .complete" instead of "touch .complete"?

Cheers,
Paweł

-----------------
#! /bin/sh
# PBS batch job script built by Globus job manager
#
#PBS -S /bin/sh
#PBS -Wx=NODESET:ONEOF:FEATURE:switch10:switch11:ecs
#PBS -o /tmp/blast1.out
#PBS -e /tmp/blast1.err
cd ${PBS_O_WORKDIR}
/export/fimm/local2/blast-2.2.13/bin/blastall \
  -i /work/pawels/blast-runs/seq/batch0.fasta -p blastp -d /work2/speil/flatdb/genbank/fasta/nr -m7
touch .complete
------------------

Justin Bradley wrote:
> Hi Pawel,
>
> Sorry for the slow response.
>
> The UserMappingStage warning is nothing to worry about.
> GridSAM can be set up to map user certificates to local users and then
> submit jobs as that user, rather than the default, which is to run the
> jobs as the user running the Tomcat container.
>
> The SPMDVariation is also nothing to be concerned about unless you are
> using MPI.
>
> The PBSStatusStage message is, alas, only a poorly formatted piece of logging.
>
> I must admit I'm slightly stumped. However, it may be beneficial if you
> sent me both some working JSDL and some from your problematic job, and
> I'll take a look to see if there is anything wrong. Also, could you send
> me: tomcat/logs/gridsam.log records/omii
> webapps/gridsam/WEB-INF/classes/jobmanager.xml and jobmanager-pbs.xml
>
> There are known edge cases with the default PBS connector where, if a job
> completes too quickly, it can be removed from the PBS queue such that
> qstat no longer reports its status. GridSAM can then lose track of the
> job. However, this doesn't feel like that problem (and there are
> several workarounds if it were).
>
> Justin
>
> On 24 May 2009, at 21:55, Pawel Sztromwasser wrote:
>
>>
>> Hi Justin,
>>
>>> Is this PBS Torque and using the default scheduler?
>> Yes, this is Torque. The scheduler is a combination of Torque and Maui.
>>> If the simple jobs work with GridSAM, then the more complex ones
>>> shouldn't really be any different.
>>> You could check the gridsam.log in tomcat/logs, see if that mentions
>>> anything useful. Feel free to email this to me if you like.
>>
>> These are the only warnings and errors I have found (there are a few
>> occurrences of each of these three, but no other errors/warnings):
>>
>> $ grep WARN gridsam.log.2009-05-19
>> 2009-05-19 16:37:49,303 WARN [UserMappingStage] You disabled security
>> features: pool accounts and private workding directory.
>> 2009-05-19 16:37:49,339 WARN [UserMappingStage] Use the local user
>> account (pawels) of OU=CBU, O=UiB, EMAILADDRESS=<my_email>, C=Norway,
>> ST=Hordaland,
>>
>> $ grep ERROR gridsam.log.2009-05-19
>> 2009-05-19 15:48:03,989 ERROR [ResourceRegistry]
>> classpath:org/icenigrid/gridsam/resource/config/common.xml, line 135,
>> column 81 - No value available for symbol 'gridsam.SPMDVariation'.
>>
>>
>> This was interesting though:
>>
>> 2009-05-19 17:19:26,941 INFO [PBSStatusStage] monitoring PBS using
>> /opt/torque/bin/qstat2116221.hpcmaster.bccs.uib.no
>>
>> There is no space between qstat and the job ID. I don't know if this is
>> something to worry about or just the way log4j prints this message?
>>
>> Cheers,
>> Pawel
>>
>>>
>>> Justin
>>>
>>>
>>> On 22 May 2009, at 07:23, Pawel Sztromwasser wrote:
>>>
>>>> Justin Bradley wrote:
>>>>> Hi Pawel,
>>>>> Which version of PBS are you using?
>>>>
>>>> pbs_server is version 2.3.5
>>>>
>>>>> And are the PBS server and scheduler on the same machine as GridSAM?
>>>>
>>>> No. GridSAM is running on a virtual machine that has access to the
>>>> queue (and scheduling tools like qsub, qstat...), but not to
>>>> pbs_server, for example. I can run jobs on the cluster from this
>>>> virtual machine as if I were on the cluster itself (this is what the
>>>> cluster's admin said ;). Does this affect GridSAM?
>>>>
>>>> Pawel
>>>>
>>>>> Justin
>>>>> On 20 May 2009, at 13:51, Pawel Sztromwasser wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I guess a similar issue has been mentioned once on the mailing list,
>>>>>> but it doesn't look like it was solved:
>>>>>>
>>>>>> http://sourceforge.net/mailarchive/forum.php?thread_name=002701c64d08%2449efd790%24db421080%40cs.ucl.ac.uk&forum_name=gridsam-discuss
>>>>>>
>>>>>> This also looks relevant, but I think it has been included in the
>>>>>> version of GridSAM I am using (2.1.4):
>>>>>>
>>>>>> http://sourceforge.net/mailarchive/forum.php?thread_name=20080617145418.605536413%40soton.ac.uk&forum_name=gridsam-developer
>>>>>>
>>>>>> So what is the problem? I configured GridSAM to run jobs using PBS
>>>>>> and I am submitting them using GridSAM's Web Services. Jobs start OK,
>>>>>> but their status freezes at state 'active' (response from the
>>>>>> getJobStatus method below). I checked the jobs on the cluster and they
>>>>>> had finished, so it seems GridSAM doesn't properly check the job's
>>>>>> status on the cluster. It worked properly while running simple POSIX
>>>>>> programs like cat and echo, so it is something with PBS.
>>>>>>
>>>>>> I double-checked pbs.PBSJobStatusCommand in the jobmanager-pbs.xml
>>>>>> file and it is correct. Is there anything additional that GridSAM uses
>>>>>> to check whether a job completed on the cluster? Maybe something is
>>>>>> wrong with my setup, like directories with wrong permissions that it
>>>>>> can't access?
>>>>>>
>>>>>> I am using GridSAM 2.1.4 with OMII toolkit 3.4.4.
>>>>>>
>>>>>> Thank you in advance for help,
>>>>>> Pawel
>>>>>>
>>>>>>
>>>>>> ------------------------
>>>>>> <getJobStatusResponse xmlns="http://www.icenigrid.org/service/gridsam">
>>>>>>   <JobStatus>
>>>>>>     <JobIdentifier>
>>>>>>       <ID>urn:gridsam:0131f86a215956de01215965031d000b</ID>
>>>>>>     </JobIdentifier>
>>>>>>     <Stage>
>>>>>>       <State>pending</State>
>>>>>>       <Description>job is being scheduled</Description>
>>>>>>       <Time>2009-05-19T17:02:20.701+02:00</Time>
>>>>>>     </Stage>
>>>>>>     <Stage>
>>>>>>       <State>staging-in</State>
>>>>>>       <Description>staging files...</Description>
>>>>>>       <Time>2009-05-19T17:02:20.803+02:00</Time>
>>>>>>     </Stage>
>>>>>>     <Stage>
>>>>>>       <State>staged-in</State>
>>>>>>       <Description>no file needs to be staged in</Description>
>>>>>>       <Time>2009-05-19T17:02:20.807+02:00</Time>
>>>>>>     </Stage>
>>>>>>     <Stage>
>>>>>>       <State>active</State>
>>>>>>       <Description>job is being launched through PBS</Description>
>>>>>>       <Time>2009-05-19T17:02:20.933+02:00</Time>
>>>>>>     </Stage>
>>>>>>     <Property name="urn:pbs:script">
>>>>>> #! /bin/sh
>>>>>> # PBS batch job script built by Globus job manager
>>>>>> #
>>>>>> #PBS -S /bin/sh
>>>>>> #PBS -Wx=NODESET:ONEOF:FEATURE:switch10:switch11:ecs
>>>>>> #PBS -o /work/pawels/blast-runs/blast1.out
>>>>>> #PBS -e /work/pawels/blast-runs/blast1.err
>>>>>> cd ${PBS_O_WORKDIR}
>>>>>> /local/blast-2.2.18/bin/blastall \
>>>>>>   -i /work/pawels/blast-runs/seq/batch0.fasta -o /work/pawels/blast-runs/seq/batch0.fasta.blast -p blastp -d /net/compute-2-23.local/scratch/andersl/uniref90 -m7
>>>>>> touch .complete
>>>>>>     </Property>
>>>>>>     <Property name="urn:pbs:launched">true</Property>
>>>>>>     <Property name="urn:gridsam:principal">OU=BCCS, O=UiB, EMAILADDRESS=paw...@bc..., C=Norway, ST=Hordaland, CN=gridsam_certificate</Property>
>>>>>>     <Property name="urn:gridsam:pbs:jobid">2116221.hpcmaster.bccs.uib.no</Property>
>>>>>>   </JobStatus>
>>>>>> </getJobStatusResponse>
>>>>>>
>>>>> --
>>>>> Justin Bradley
>>>>> Design & Development Team Leader
>>>>> j.b...@om...
>>>>> OMII-UK
>>>>> Bay 23, 4067, B32
>>>>> University of Southampton
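
Regarding the question at the top of this message about replacing "touch .complete": the following is only a rough sketch of what such a replacement might look like as a shell function, not something GridSAM is known to generate or support. The names GRIDSAM_HOST, REMOTE_SHELL and mark_complete are hypothetical, and the sketch assumes password-less ssh from the node running the job to the VM. Note that ssh takes the host and the remote command as separate arguments, so the "VM:path" form from the question would have to become "ssh VM touch path/.complete".

```shell
#!/bin/sh
# Sketch only: GRIDSAM_HOST, REMOTE_SHELL and mark_complete are hypothetical
# names used for illustration; they are not GridSAM or PBS settings.

# Host where GridSAM runs (the VM). Empty means the filesystem is shared,
# so the marker can simply be touched locally.
GRIDSAM_HOST="${GRIDSAM_HOST:-}"

# Command used to reach the VM (assumes password-less ssh is set up).
REMOTE_SHELL="${REMOTE_SHELL:-ssh}"

# Create the .complete marker in directory $1, either locally or on the VM.
# The directory name must not contain single quotes.
mark_complete() {
    workdir="$1"
    if [ -n "$GRIDSAM_HOST" ]; then
        "$REMOTE_SHELL" "$GRIDSAM_HOST" "touch '$workdir/.complete'"
    else
        touch "$workdir/.complete"
    fi
}

# In the generated script this call would stand in for the final
# "touch .complete":
#   mark_complete "$PBS_O_WORKDIR"
```

Whether the script generation can be changed to emit something like this is exactly the open question above; the shared-mount option avoids the need for it altogether.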