From: Christoph H. <chr...@gm...> - 2012-07-09 09:34:50
Hi Arjun,

Thanks for your reply! I am in contact with the system administrators to
try to avoid taking down nodes in the future. I submit jobs to the
cluster with a shell script in a prescribed format, in which I have to
set a memory limit in advance. I think this limit is implemented as
something like the ulimit command you mentioned, and normally jobs are
simply killed when they exceed it - this runCA job was somehow
exceptional, and the sysadmins are currently looking into why it took
down the node instead of just being killed.

Thanks again for your suggestion!

cheers,
Christoph

On 07/05/2012 08:43 PM, Arjun Prasad wrote:
>
> Hi Christoph,
>
> I know this reply is a bit late, but we also have the problem of
> limited memory and runaway jobs on our cluster. I haven't had the CA
> problem you describe, but when I can't watch a job closely I use a
> wrapper shell script with a ulimit command in it to engage the
> operating-system-based limits. (bash uses ulimit; csh and its
> derivatives, I think, use limit.)
>
> I'm also not sure that SGE-based memory limits will actually help
> that much. We have that functionality optionally enabled on our
> cluster. I haven't tried it, but some other users have complained
> that it doesn't work well: SGE doesn't seem to do a very good job of
> keeping track of the memory actually used.
>
> I haven't tried a wrapper script called by runCA, so I'm not
> completely sure how that would work. If you're running things with
> runCA's SGE tie-ins, you can ask the sysadmin to add a ulimit command
> to an SGE prolog script for a custom queue.
>
> Something like 'ulimit -d 60000000' should keep you from taking down
> the node.
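>
> For instance, a minimal wrapper (just a sketch - the value is in kB,
> so roughly 60 GB; adjust it to your nodes) could look like:
>
>   #!/bin/bash
>   # Hypothetical wrapper (name it whatever, e.g. limit-wrap.sh).
>   # Cap the data segment (in kB) so the OS kills a runaway process
>   # before it takes down the node.
>   ulimit -d 60000000
>   # Run the wrapped command under that limit.
>   exec "$@"
>
> and then submit, say, 'limit-wrap.sh runCA ...' instead of calling
> the command directly.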
>
> Arjun
>
> On Tue, 3 Jul 2012, Walenz, Brian wrote:
>
>> Sorry about the trouble with the sysadmins.
>>
>> Given the mix of reads, I'd just skip the dedupe. Neither of those
>> library types is known to have artificial duplications.
>>
>> Memory usage depends on a lot of factors (the genome itself, genome
>> size, depth of coverage, read length, number of reads, number of
>> mated reads) and I don't have any good general advice anymore.
>>
>> Is it possible to submit a job such that the scheduler will kill it
>> if some memory limit is exceeded? That might be generally useful
>> enough that the sysadmins would help to set it up. (I've been
>> arguing for that here for a while, only to have other users object
>> to the idea - "but I don't know how big it is going to get! you
>> can't just kill it!")
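>>
>> On SGE, I believe that would be requested with something like
>>
>>   qsub -l h_vmem=64G your_job.sh   # your_job.sh = placeholder
>>
>> which, if the site configures h_vmem as an enforced limit, kills the
>> job once it exceeds 64 GB of virtual memory - though whether that
>> enforcement is enabled is up to the admins.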
>>
>> On 7/3/12 4:43 PM, "Christoph Hahn" <chr...@gm...> wrote:
>>
>>> Dear Brian,
>>>
>>> Thanks!
>>> My last attempt has apparently caused some serious problems on the
>>> node it was running on, so I have to wait for the cluster admins'
>>> OK before I try again.
>>> I will try to run it manually without the obtStore then and keep
>>> you posted on the result. The dataset only contains Illumina PE and
>>> 454 SE reads. Is there a way to get an idea of the memory
>>> requirements beforehand (I have to specify them on the cluster
>>> before I start the job, and the admins will not be happy if I kill
>>> the node again..)? I guess not?
>>>
>>> Thanks again for your help!!!
>>>
>>> cheers,
>>> Christoph
>>>
>>> On 07/03/2012 10:24 PM, Walenz, Brian wrote:
>>>> Good to know about the restart not working.
>>>>
>>>> You should be able to run manually without the obtStore by leaving
>>>> out the -ovs option for it.
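>>>>
>>>> Using the paths from your command below, that would be something
>>>> like (an untested sketch):
>>>>
>>>>   # Same stores as before, with the obtStore dropped:
>>>>   /xanadu/home/chrishah/programmes/wgs-7.0/Linux-amd64/bin/deduplicate \
>>>>     -gkp /projects/nn9201k/Celera/work2/salaris1/salaris.gkpStore \
>>>>     -ovs /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.dupStore \
>>>>     -report /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.deduplicate.log \
>>>>     -summary /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.deduplicate.summary
>>>>
>>>> i.e. the same command as before, minus the '-ovs' argument that
>>>> points at the obtStore.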
>>>>
>>>> To find duplicate mate pairs, it needs to save up overlaps until
>>>> both of the reads in the mate have been seen. The bug in CVS was
>>>> to not process mate pairs until ALL reads were seen. I've not seen
>>>> this in CA7, but the same can happen if the mated reads are 'far
>>>> away' from each other in the input, for example, if all of the
>>>> 'left' reads are loaded before the 'right' reads.
>>>>
>>>> If all else fails, you can skip deduplication. There is little
>>>> gain in deduplicating Illumina PE and MP libraries -- PE
>>>> duplicates don't really affect scaffolding, and MP duplicates
>>>> aren't detectable from overlaps. Hopefully there aren't any 454
>>>> mates in this.
>>>>
>>>> b
>>>>
>>>> On 7/3/12 4:02 PM, "Christoph Hahn" <chr...@gm...> wrote:
>>>>
>>>>> Hi Brian,
>>>>>
>>>>> Thanks for your reply!
>>>>>
>>>>> I am using CA7. I am afraid updating is not really an option at
>>>>> the moment - I am running it on a cluster, and updating from CVS
>>>>> might be complicated because the cluster administrators are
>>>>> always very busy, so it would certainly take a while..
>>>>>
>>>>> Therefore, it would be great if you could give me a tip on how to
>>>>> handle this in CA7 for now. In my latest attempt I used 64 GB
>>>>> RAM, and it killed the node after some 2 hours. I ran the
>>>>> following:
>>>>>
>>>>> CA version 7.0 ($Id: deduplicate.C,v 1.15 2011/12/29 09:26:03
>>>>> brianwalenz Exp $).
>>>>>
>>>>> Error Rates:
>>>>>   AS_OVL_ERROR_RATE 0.060000
>>>>>   AS_CNS_ERROR_RATE 0.100000
>>>>>   AS_CGW_ERROR_RATE 0.100000
>>>>>   AS_MAX_ERROR_RATE 0.250000
>>>>>
>>>>> Current Working Directory:
>>>>> /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim
>>>>>
>>>>> Command:
>>>>> /xanadu/home/chrishah/programmes/wgs-7.0/Linux-amd64/bin/deduplicate \
>>>>>   -gkp /projects/nn9201k/Celera/work2/salaris1/salaris.gkpStore \
>>>>>   -ovs /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.obtStore \
>>>>>   -ovs /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.dupStore \
>>>>>   -report /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.deduplicate.log \
>>>>>   -summary /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.deduplicate.summary
>>>>>
>>>>> Here are the first and last few lines of salaris.deduplicate.log
>>>>> (it has 384855 lines; *.deduplicate.summary and *.deduplicate.err
>>>>> are empty):
>>>>>
>>>>> Delete 28 DUPof 3462651 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 76 DUPof 10667558 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 210 DUPof 8142147 a 0,70 b 0,70 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 216 DUPof 9129559 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 228 DUPof 7781271 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.013200
>>>>> Delete 297 DUPof 11757250 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 319 DUPof 11174680 a 0,73 b 0,73 hang 0,0 diff 0,0 error 0.000000
>>>>> .
>>>>> .
>>>>> .
>>>>> Delete 132295695 DUPof 211765973 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 132296968 DUPof 181491499 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 132297966 DUPof 159665067 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 132304543 DUPof 155518568 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 132307934 DUPof 134266938 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 132309546 DUPof 179301753 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 132313400 DUPof 153142824 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 132319681 DUPof 132368976 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>>> Delete 132323752 DUPof 165992623 a 0,76 (this is exactly how it stopped..)
>>>>>
>>>>> Can I maybe run the deduplicate command manually and only make
>>>>> use of the overlaps in the dupStore? When I tried to start CA
>>>>> again it continued with finalTrim, so I removed the
>>>>> *.deduplicate.log etc. files before restarting CA.
>>>>> It would be great if you could help me out! Thanks!!
>>>>>
>>>>> cheers,
>>>>> Christoph
>>>>>
>>>>> On 07/03/2012 06:44 PM, Walenz, Brian wrote:
>>>>>> Hi, Christoph-
>>>>>>
>>>>>> Are you using CA7 or CVS?
>>>>>>
>>>>>> This behavior was introduced to CVS on May 21, and fixed on the
>>>>>> 29th. The bug came in with an optimization in loading overlaps:
>>>>>> only overlaps in the 'dupStore' are needed, and the 'obtStore'
>>>>>> can be ignored. This eliminated a huge amount of I/O and
>>>>>> overhead from the dedupe compute.
>>>>>>
>>>>>> If updating CVS doesn't fix the problem, can you send some of
>>>>>> the logging from deduplicate?
>>>>>>
>>>>>> b
>>>>>>
>>>>>> On 7/3/12 6:28 AM, "Christoph Hahn" <chr...@gm...> wrote:
>>>>>>
>>>>>>> Dear developers and users,
>>>>>>>
>>>>>>> I am encountering some problems in the deduplicate step.
>>>>>>> Unfortunately, the memory usage increases steadily until the
>>>>>>> process dies from exceeding the memory limit. So far I have
>>>>>>> used up to 32 GB. I could of course just increase the available
>>>>>>> memory further, but I was wondering if there is a possibility
>>>>>>> to cap and/or predict the maximum memory usage for this step
>>>>>>> (and maybe also for the next steps) beforehand.
>>>>>>>
>>>>>>> Thanks for your help!
>>>>>>>
>>>>>>> much obliged,
>>>>>>> Christoph
>>>>>>>
>>>>>>> University of Oslo, Norway