From: Arjun P. <ap...@ma...> - 2012-07-05 18:43:28
Hi Christoph,

I know this reply is a bit late, but we have the problem of limited memory
and runaway jobs on our cluster also. I haven't had the CA problem you
describe, but when I can't watch a job closely I use a wrapper shell script
with a ulimit command in it to engage the operating-system-based limits.
(bash uses ulimit; csh and derivatives I think use limit.)

I'm also not sure using SGE-based memory limits will actually help that
much. We have that functionality optionally enabled on our cluster. I
haven't tried it, but some other users have complained that it doesn't work
well. SGE doesn't seem to do a very good job of keeping track of actual
memory used.

I haven't tried to use a wrapper script called by runCA, so I'm not
completely sure how that would work. If you're running things with runCA's
SGE tie-ins, you can ask the sysadmin to add a ulimit command to an SGE
prolog script for a custom queue. Something like 'ulimit -d 60000000'
should keep you from taking down the node.

Arjun

On Tue, 3 Jul 2012, Walenz, Brian wrote:

> Sorry about the trouble with the sysadmins.
>
> Given the mix of reads, I'd just skip the dedupe. Neither of those library
> types is known to have artificial duplications.
>
> Memory usage depends on a lot of factors (the genome itself, genome size,
> depth of coverage, read length, number of reads, number of mated reads) and
> I don't have any good general advice anymore.
>
> Is it possible to submit a job such that the scheduler will kill it if some
> memory limit is exceeded? That might be generally useful enough that the
> sysadmins would help to set it up. (I've been arguing for that here for a
> while, only to have other users object to the idea -- "but I don't know how
> big it is going to get! you can't just kill it!")
>
> On 7/3/12 4:43 PM, "Christoph Hahn" <chr...@gm...> wrote:
>
>> Dear Brian,
>>
>> Thanks! My last attempt has apparently caused some serious problems on
>> the node it was running on.
>> So, I have to wait for the cluster admins' OK before I try again. I will
>> try to run it manually without the obtStore then and keep you posted on
>> the result. The dataset only contains Illumina PE and 454 SE. Is there a
>> way to get an idea about the memory requirements beforehand (I have to
>> specify that on the cluster before I start the job, and the admin will
>> not be happy if I kill the node again..)? I guess not?
>>
>> Thanks again for your help!!!
>>
>> cheers,
>> Christoph
>>
>> On 07/03/2012 10:24 PM, Walenz, Brian wrote:
>>> Good to know about the restart not working.
>>>
>>> You should be able to run manually without the obtStore by leaving out
>>> the -ovs option for it.
>>>
>>> To find duplicate mate pairs, it needs to save up overlaps until both of
>>> the reads in the mate have been seen. The bug in CVS was to not process
>>> mate pairs until ALL reads were seen. I've not seen this in CA7, but the
>>> same can happen if the mated reads are 'far away' in the input, for
>>> example, if all of the 'left' reads are loaded before the 'right' reads.
>>>
>>> If all else fails, you can skip deduplication. There is little gain in
>>> deduplicating Illumina PE and MP libraries -- PE duplicates don't really
>>> affect scaffolding, and MP duplicates aren't detectable from overlaps.
>>> Hopefully there aren't any 454 mates in this.
>>>
>>> b
>>>
>>> On 7/3/12 4:02 PM, "Christoph Hahn" <chr...@gm...> wrote:
>>>
>>>> Hi Brian,
>>>>
>>>> Thanks for your reply!
>>>>
>>>> I am using CA7. I am afraid updating is not really an option at the
>>>> moment -- I am running it on a cluster, and updating from CVS might be
>>>> complicated because the cluster administrators are always very busy, so
>>>> it would for sure take a while..
>>>>
>>>> Therefore, it would be great if you could give me a tip on how to
>>>> handle that in CA7 for now. In my latest attempt I used 64 GB RAM and
>>>> it killed the node after some 2 hours.
>>>> I ran the following:
>>>>
>>>> CA version 7.0 ($Id: deduplicate.C,v 1.15 2011/12/29 09:26:03 brianwalenz Exp $).
>>>>
>>>> Error Rates:
>>>> AS_OVL_ERROR_RATE 0.060000
>>>> AS_CNS_ERROR_RATE 0.100000
>>>> AS_CGW_ERROR_RATE 0.100000
>>>> AS_MAX_ERROR_RATE 0.250000
>>>>
>>>> Current Working Directory:
>>>> /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim
>>>>
>>>> Command:
>>>> /xanadu/home/chrishah/programmes/wgs-7.0/Linux-amd64/bin/deduplicate \
>>>> -gkp /projects/nn9201k/Celera/work2/salaris1/salaris.gkpStore \
>>>> -ovs /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.obtStore \
>>>> -ovs /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.dupStore \
>>>> -report /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.deduplicate.log \
>>>> -summary /projects/nn9201k/Celera/work2/salaris1/0-overlaptrim/salaris.deduplicate.summary
>>>>
>>>> Here are the first and last few lines of salaris.deduplicate.log (it
>>>> has 384855 lines; *.deduplicate.summary and *.deduplicate.err are
>>>> empty):
>>>>
>>>> Delete 28 DUPof 3462651 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 76 DUPof 10667558 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 210 DUPof 8142147 a 0,70 b 0,70 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 216 DUPof 9129559 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 228 DUPof 7781271 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.013200
>>>> Delete 297 DUPof 11757250 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 319 DUPof 11174680 a 0,73 b 0,73 hang 0,0 diff 0,0 error 0.000000
>>>> .
>>>> .
>>>> .
>>>> Delete 132295695 DUPof 211765973 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 132296968 DUPof 181491499 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 132297966 DUPof 159665067 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 132304543 DUPof 155518568 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 132307934 DUPof 134266938 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 132309546 DUPof 179301753 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 132313400 DUPof 153142824 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 132319681 DUPof 132368976 a 0,76 b 0,76 hang 0,0 diff 0,0 error 0.000000
>>>> Delete 132323752 DUPof 165992623 a 0,76 (this is exactly how it stopped..)
>>>>
>>>> Can I maybe run the deduplicate command manually and only make use of
>>>> the overlaps in the dupStore? When I tried to start CA again, it
>>>> continued with finalTrim, so I removed the *.deduplicate.log, etc.
>>>> files before I restarted CA. It would be great if you could help me
>>>> out! Thanks!!
>>>>
>>>> cheers,
>>>> Christoph
>>>>
>>>> On 07/03/2012 06:44 PM, Walenz, Brian wrote:
>>>>> Hi, Christoph-
>>>>>
>>>>> Are you using CA7 or CVS?
>>>>>
>>>>> This behavior was introduced to CVS on May 21 and fixed on the 29th.
>>>>> The bug was introduced after an optimization was made in loading
>>>>> overlaps -- only overlaps in the 'dupStore' are needed; the 'obtStore'
>>>>> can be ignored. This eliminated a huge amount of I/O and overhead from
>>>>> the dedupe compute.
>>>>>
>>>>> If updating CVS doesn't fix the problem, can you send some of the
>>>>> logging from deduplicate?
>>>>>
>>>>> b
>>>>>
>>>>> On 7/3/12 6:28 AM, "Christoph Hahn" <chr...@gm...> wrote:
>>>>>
>>>>>> Dear developers and users,
>>>>>>
>>>>>> I am encountering some problems in the deduplicate step.
>>>>>> Unfortunately, the memory usage is steadily increasing until the
>>>>>> process dies because of exceeding the memory limit. So far, I used up
>>>>>> to 32 GB. I could of course just further increase the available
>>>>>> memory, but I was wondering if there is a possibility to fix and/or
>>>>>> predict the maximum memory usage for this step (and maybe also for
>>>>>> the next steps) beforehand.
>>>>>>
>>>>>> Thanks for your help!
>>>>>>
>>>>>> much obliged,
>>>>>> Christoph
>>>>>>
>>>>>> University of Oslo, Norway
>>>>>>
>>>>>> ----------------------------------------------------------------------------
>>>>>> Live Security Virtual Conference
>>>>>> Exclusive live event will cover all the ways today's security and
>>>>>> threat landscape has changed and how IT managers can respond. Discussions
>>>>>> will include endpoint security, mobile security and the latest in malware
>>>>>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>>>>> _______________________________________________
>>>>>> wgs-assembler-users mailing list
>>>>>> wgs...@li...
>>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users
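Arjun's wrapper-script idea above can be sketched roughly as follows. This is a minimal illustration, not anything shipped with Celera Assembler or runCA; the script name is made up, and the 60000000 KB figure is simply the value from the thread, which you would adjust to sit just under the node's RAM.

```shell
#!/bin/sh
# memcap.sh -- hypothetical wrapper in the spirit of Arjun's suggestion:
# engage an OS-level memory cap before exec'ing the real job, so a
# runaway process fails on its own instead of taking down the node.
#
# ulimit -d caps the data segment, in kilobytes; 60000000 KB is ~57 GB.
ulimit -d 60000000

# Replace this shell with the real command; the limit is inherited.
exec "$@"
```

It would be invoked as, e.g., `./memcap.sh deduplicate -gkp ...`, and the same `ulimit` line is what a sysadmin would drop into an SGE prolog script for a custom queue, as described above.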