From: Allen D. <all...@uc...> - 2006-05-17 21:36:32
Thanks Jared. It looks like when I moved the jobs it restarted them, so if
you were appending to a file you may want to kill/cleanup/resubmit. They
have now been running for 12 minutes. All but one were restarted; you can
see which by the submission time (or the "R" resubmit flag in the job
state).

-Allen

On 5/17/06, Jared Fox <jar...@uc...> wrote:
>
> Will do. I don't think my jobs were causing trouble, but I'll switch to
> the new volume for future jobs. My jobs were reading 1 or 2 files into
> memory at the beginning, not touching the disk for hours while the data
> was processed, and then writing output to a file. I was only running 2
> jobs at a time as well, except for last night.
>
> Thanks for the info, and for moving my jobs to the other queue so that
> Brian could use the bigmem queue.
>
> ----- Original Message -----
> *From:* Allen Day <all...@uc...>
> *To:* Jared Fox <jar...@uc...> ; Brian O'Connor <boc...@uc...> ;
> nel...@li... ; nel...@li...
> *Sent:* Wednesday, May 17, 2006 2:09 PM
> *Subject:* cluster tips
>
> Hi Jared,
>
> When you submit jobs, please specify the queue you wish them to go to.
> Now that we have multiple queues, the default behavior of the qsub
> utility is to put jobs first on all.q, then on gb4.q, then on celsius.q,
> falling down the chain if a queue is currently full. We're working on
> making all jobs go only to all.q unless otherwise requested, but until
> then, if you do not need the bigmem queue, please explicitly request
> all.q with the "-q all.q" option to qsub.
>
> Also, it appears that the I/O your jobs were doing over the last couple
> of days was putting a heavy load on the xsan NFS processes. It appeared
> that you were touching a large number of small files. Please copy the
> files that will be used from the compute nodes over to
> nucleus:/clustertmp/jaredfox; this volume is NFS exported to all the
> compute nodes.
>
> Thanks.
>
> -Allen
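[Archive note: the queue advice above amounts to always passing "-q all.q" to qsub. A minimal sketch of one way to make that habit automatic is below; "submit_to_allq" is a hypothetical wrapper name, not something from the original mail, and it echoes the command rather than executing it so the sketch runs even without Grid Engine installed.]

```shell
# Hypothetical wrapper: force every submission onto all.q so jobs never
# fall through to gb4.q or celsius.q when all.q has free slots.
submit_to_allq() {
  # echo instead of exec'ing qsub, so this sketch is runnable anywhere;
  # drop the echo on a real cluster with SGE's qsub on the PATH
  echo qsub -q all.q "$@"
}

submit_to_allq myjob.sh
```

On a real Grid Engine install you could instead put `-q all.q` in a per-user `.sge_request` file, but the explicit flag keeps the request visible in shell history.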
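[Archive note: the staging advice above can be sketched as a small helper that copies inputs onto the shared volume before submission. STAGE_DIR stands in for the nucleus:/clustertmp/jaredfox export mentioned in the mail; here it defaults to a local demo directory so the sketch is runnable off-cluster, and "stage_inputs" is a hypothetical helper name.]

```shell
# Hypothetical staging step: copy input files onto the NFS-exported
# volume once, so compute nodes read from there instead of issuing many
# small-file reads against the xsan NFS processes.
STAGE_DIR="${STAGE_DIR:-/tmp/clustertmp-demo}"
mkdir -p "$STAGE_DIR"

stage_inputs() {
  # copy each listed file into the shared volume, preserving timestamps
  cp -p "$@" "$STAGE_DIR"/
}
```

Jobs would then open their inputs under `$STAGE_DIR` rather than the original locations.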