From: Paul C. <pca...@gm...> - 2012-10-19 20:01:40
|
Hi

I'm trying for the first time to assemble Illumina fastq reads. After running runCA 7.0 with:

runCA -d cabogout -p SRR073769.uniq.bowtie.unmap SRR073769.uniq.bowtie.unmap.frg

I got this output:

----------------------------------------START Fri Oct 19 15:54:42 2012
/Users/pgc92/Public/usr/local/wgs-7.0/Darwin-amd64/bin/gatekeeper -o /Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/cabogout/SRR073769.uniq.bowtie.unmap.gkpStore.BUILDING -T -F /Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/SRR073769.uniq.bowtie.unmap.frg > /Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/cabogout/SRR073769.uniq.bowtie.unmap.gkpStore.err 2>&1
----------------------------------------END Fri Oct 19 15:54:46 2012 (4 seconds)
numFrags = 0
================================================================================

runCA failed.

----------------------------------------
Stack trace:

 at /Users/pgc92/Public/usr/local/wgs/Darwin-i386/bin/runCA line 1237
  main::caFailure('gatekeeper failed to add fragments', '/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/cabog...') called at /Users/pgc92/Public/usr/local/wgs/Darwin-i386/bin/runCA line 1698
  main::preoverlap('/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/SRR07...') called at /Users/pgc92/Public/usr/local/wgs/Darwin-i386/bin/runCA line 5874

----------------------------------------
Last few lines of the relevant log file (/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/cabogout/SRR073769.uniq.bowtie.unmap.gkpStore.err):

Starting file '/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/SRR073769.uniq.bowtie.unmap.frg'.

Processing SINGLE-ENDED SANGER QV encoding reads from:
  '/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/SRR073769.uniq.bowtie.unmap.fq'

GKP finished with no alerts or errors.

----------------------------------------
Failure message:

gatekeeper failed to add fragments

What am I doing wrong? My fastq file contains ~1 million 45 bp reads with sanger quality values.

Here is head output of the fastq file:

(57989) $ head SRR073769.uniq.bowtie.unmap.fq
@SRR073769.109 PATHBIO-SOLEXA2:2:1:3:1029 length=45
CTGCCCAGGCATAGTTCACCATCTTTCGGGTCCTAACACGTGCGC
+SRR073769.109 PATHBIO-SOLEXA2:2:1:3:1029 length=45
@@?@>@>@7@?9==@B@;@@@29>@6>3950:467>#########
@SRR073769.111 PATHBIO-SOLEXA2:2:1:3:1362 length=45
TGGTTAGTTTCTTCTCCTCCGCTGACTAATATGCTTAAATTCAGA
+SRR073769.111 PATHBIO-SOLEXA2:2:1:3:1362 length=45
CCCCCCC@CCCCCBCCBCCA@ABBCBBBCCBB8AB?6@ACB;?97
@SRR073769.113 PATHBIO-SOLEXA2:2:1:3:1458 length=45
GATCCACGGGGGCCGACCCGGTGACCCGGTTACCCGCCAGGTCCT

Here is the output of the FRG file:

(57990) $ cat *frg
{VER
ver:2
}
{LIB
act:A
acc:SRR073769.uniq.bowtie.unmap
ori:U
mea:0.000
std:0.000
src:
.
nft:16
fea:
forceBOGunitigger=1
isNotRandom=0
doNotTrustHomopolymerRuns=0
doTrim_initialNone=0
doTrim_initialMerBased=1
doTrim_initialFlowBased=0
doTrim_initialQualityBased=0
doRemoveDuplicateReads=1
doTrim_finalLargestCovered=1
doTrim_finalEvidenceBased=0
doRemoveSpurReads=1
doRemoveChimericReads=1
doConsensusCorrection=0
fastqQualityValues=sanger
fastqOrientation=innie
fastqReads=/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/SRR073769.uniq.bowtie.unmap.fq
.
}
{VER
ver:1
}

Thank you,

Paul

Paul Cantalupo
University of Pittsburgh |
From: Ole K. T. <o.k...@bi...> - 2012-09-10 14:28:50
|
On 10 September 2012 12:31, Ole Kristian Tørresen <o.k...@bi...> wrote:
> On 10 September 2012 09:31, Walenz, Brian <bw...@jc...> wrote:
>> Hi, Ole-
>>
>> The _average_ dropped to 68? The _minimum_ allowed is 64.
>
> Yes, and this is the cause for some concern on my part. This number
> includes reads with no length (because merTrim does not remove the
> reads, it just records them with 0 length sequence and quality), but I'm
> not sure about the average length of the reads that were not deleted.
>
>>
>> In the merTrim stderr output there should be mention of the thresholds it is
>> using. There are two thresholds:
>>
>> 'minVerified' tells what kmers can be used for correcting some other kmer.
>> By default, this is 1/4 the guessed coverage in the reads.
>>
>> 'minCorrect' tells what kmers can be corrected. Any kmer with count at most
>> this can be corrected. By default, this is 1/3 the guessed coverage in the
>> reads.
>>
>> After all corrections are done, read ends are trimmed if they are not
>> covered by 'trusted' kmers.
>>
>> Possibly the guessed coverage was artificially high, resulting in
>> artificially high thresholds. You can set these thresholds manually with
>> -correct (for minCorrect) and -evidence (for minVerified). If the values
>> are less than 1, they are interpreted as a fraction of the guessed coverage,
>> otherwise, an absolute count threshold.
>>
>> Does the guessed coverage make sense? Does the kmer count histogram look
>> sane?

I reran with logging now, and the guessed coverage looks insane:

Guessed X coverage is 183
Use minCorrect=61 minVerified=45

I think the coverage should be around 16x, so I'll set -correct to 5 and -evidence to 4. Hopefully that should do it.

Thank you.

Ole

>
> I forgot to redirect the stderr to a file, but am running it again
> now to check the output.
>
>>
>> You can turn on verbose mode, which dumps a picture of the corrections,
>> trusted kmer coverage with -V. You probably don't want to do this for all
>> reads. Maybe just a sample of 100 reads or so. Super verbose mode (-V -V
>> -V) will dump the same picture after each step in the algorithm.
>
> I think the issue, at least with this library, is that the second read
> is really bad. Almost every second read has more than half its length
> in quality '#', which is just trash. So this is probably not a case
> where merTrim does something wrong, but where the sequencing has gone
> wrong.
>
> Thank you.
>
> Ole
>
>>
>> b
>>
>>
>> On 9/6/12 3:08 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
>>
>>> Hi,
>>> I just ran merTrim on a relatively low coverage library, well, we
>>> don't really know whether it is low coverage or not since we don't
>>> know the genome size accurately yet. The original library was 16 Gbp,
>>> but after merTrim and then loading it into an assembly, only 6 Gbp
>>> survived. This might give a relatively good assembly, but I'm a bit
>>> worried that it removed too much sequence. Can I adjust how much it
>>> throws out?
>>>
>>> My reads are 150 bp, PE. I followed the preprocessing page on the CA
>>> site, and created a database of trusted kmers and used that to
>>> correct my reads.
>>>
>>> Of 56,188,107 reads of mate 1, 18,030,094 were deleted and 36,122,767
>>> were clean, and the average length dropped to 68 bp. I expected it to
>>> remove about 10 % of my sequences (from what I've seen on other
>>> merTrim runs), but 2/3 seems a bit much.
>>>
>>> Thank you.
>>>
>>> Ole
>>>
>>> ------------------------------------------------------------------------------
>>> Live Security Virtual Conference
>>> Exclusive live event will cover all the ways today's security and
>>> threat landscape has changed and how IT managers can respond. Discussions
>>> will include endpoint security, mobile security and the latest in malware
>>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>> _______________________________________________
>>> wgs-assembler-users mailing list
>>> wgs...@li...
>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users
>> |
From: Ole K. T. <o.k...@bi...> - 2012-09-10 10:31:44
|
On 10 September 2012 09:31, Walenz, Brian <bw...@jc...> wrote:
> Hi, Ole-
>
> The _average_ dropped to 68? The _minimum_ allowed is 64.

Yes, and this is the cause for some concern on my part. This number includes reads with no length (because merTrim does not remove the reads, it just records them with 0 length sequence and quality), but I'm not sure about the average length of the reads that were not deleted.

>
> In the merTrim stderr output there should be mention of the thresholds it is
> using. There are two thresholds:
>
> 'minVerified' tells what kmers can be used for correcting some other kmer.
> By default, this is 1/4 the guessed coverage in the reads.
>
> 'minCorrect' tells what kmers can be corrected. Any kmer with count at most
> this can be corrected. By default, this is 1/3 the guessed coverage in the
> reads.
>
> After all corrections are done, read ends are trimmed if they are not
> covered by 'trusted' kmers.
>
> Possibly the guessed coverage was artificially high, resulting in
> artificially high thresholds. You can set these thresholds manually with
> -correct (for minCorrect) and -evidence (for minVerified). If the values
> are less than 1, they are interpreted as a fraction of the guessed coverage,
> otherwise, an absolute count threshold.
>
> Does the guessed coverage make sense? Does the kmer count histogram look
> sane?

I forgot to redirect the stderr to a file, but am running it again now to check the output.

>
> You can turn on verbose mode, which dumps a picture of the corrections,
> trusted kmer coverage with -V. You probably don't want to do this for all
> reads. Maybe just a sample of 100 reads or so. Super verbose mode (-V -V
> -V) will dump the same picture after each step in the algorithm.

I think the issue, at least with this library, is that the second read is really bad. Almost every second read has more than half its length in quality '#', which is just trash. So this is probably not a case where merTrim does something wrong, but where the sequencing has gone wrong.

Thank you.

Ole

>
> b
>
>
> On 9/6/12 3:08 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
>
>> Hi,
>> I just ran merTrim on a relatively low coverage library, well, we
>> don't really know whether it is low coverage or not since we don't
>> know the genome size accurately yet. The original library was 16 Gbp,
>> but after merTrim and then loading it into an assembly, only 6 Gbp
>> survived. This might give a relatively good assembly, but I'm a bit
>> worried that it removed too much sequence. Can I adjust how much it
>> throws out?
>>
>> My reads are 150 bp, PE. I followed the preprocessing page on the CA
>> site, and created a database of trusted kmers and used that to
>> correct my reads.
>>
>> Of 56,188,107 reads of mate 1, 18,030,094 were deleted and 36,122,767
>> were clean, and the average length dropped to 68 bp. I expected it to
>> remove about 10 % of my sequences (from what I've seen on other
>> merTrim runs), but 2/3 seems a bit much.
>>
>> Thank you.
>>
>> Ole
> |
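A rough way to answer the open question about the average length of the surviving reads is to undo the zero-length reads in the mean. This is only a sketch in Python: it assumes, as stated in the message, that the reported 68 bp average counts deleted reads as length 0, and that 68 is itself a rounded figure. The function name is made up for illustration.

```python
def mean_surviving_length(total_reads, deleted_reads, mean_with_zeros):
    """Mean length of the reads that were not deleted.

    Assumes mean_with_zeros was computed over all reads, with every
    deleted read contributing a length of 0.
    """
    surviving = total_reads - deleted_reads
    total_bases = mean_with_zeros * total_reads
    return total_bases / surviving


# Numbers quoted in the thread for mate 1.
mate1 = mean_surviving_length(56_188_107, 18_030_094, 68)
print(f"{mate1:.1f} bp")
```

Under those assumptions the surviving mate 1 reads average roughly 100 bp, which is well above the 64 bp minimum.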
From: Walenz, B. <bw...@jc...> - 2012-09-10 07:32:05
|
Hi, Ole-

The _average_ dropped to 68? The _minimum_ allowed is 64.

In the merTrim stderr output there should be mention of the thresholds it is using. There are two thresholds:

'minVerified' tells what kmers can be used for correcting some other kmer. By default, this is 1/4 the guessed coverage in the reads.

'minCorrect' tells what kmers can be corrected. Any kmer with count at most this can be corrected. By default, this is 1/3 the guessed coverage in the reads.

After all corrections are done, read ends are trimmed if they are not covered by 'trusted' kmers.

Possibly the guessed coverage was artificially high, resulting in artificially high thresholds. You can set these thresholds manually with -correct (for minCorrect) and -evidence (for minVerified). If the values are less than 1, they are interpreted as a fraction of the guessed coverage, otherwise, an absolute count threshold.

Does the guessed coverage make sense? Does the kmer count histogram look sane?

You can turn on verbose mode, which dumps a picture of the corrections, trusted kmer coverage with -V. You probably don't want to do this for all reads. Maybe just a sample of 100 reads or so. Super verbose mode (-V -V -V) will dump the same picture after each step in the algorithm.

b

On 9/6/12 3:08 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:

> Hi,
> I just ran merTrim on a relatively low coverage library, well, we
> don't really know whether it is low coverage or not since we don't
> know the genome size accurately yet. The original library was 16 Gbp,
> but after merTrim and then loading it into an assembly, only 6 Gbp
> survived. This might give a relatively good assembly, but I'm a bit
> worried that it removed too much sequence. Can I adjust how much it
> throws out?
>
> My reads are 150 bp, PE. I followed the preprocessing page on the CA
> site, and created a database of trusted kmers and used that to
> correct my reads.
>
> Of 56,188,107 reads of mate 1, 18,030,094 were deleted and 36,122,767
> were clean, and the average length dropped to 68 bp. I expected it to
> remove about 10 % of my sequences (from what I've seen on other
> merTrim runs), but 2/3 seems a bit much.
>
> Thank you.
>
> Ole |
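The threshold rule described in this message can be written down in a few lines. The Python below is a sketch, not merTrim's code: the function names are invented, and rounding down is an assumption, chosen because it reproduces the values reported elsewhere in the thread (guessed coverage 183 giving minCorrect=61 and minVerified=45).

```python
from fractions import Fraction
from math import floor


def resolve_threshold(value, guessed_coverage):
    """Values below 1 are a fraction of the guessed coverage;
    values of 1 or more are absolute kmer counts."""
    value = Fraction(value)
    if value < 1:
        # Rounding down is an assumption of this sketch.
        return floor(value * guessed_coverage)
    return floor(value)


def mertrim_thresholds(guessed_coverage,
                       correct=Fraction(1, 3),
                       evidence=Fraction(1, 4)):
    """Return (minCorrect, minVerified) for a guessed coverage."""
    return (resolve_threshold(correct, guessed_coverage),
            resolve_threshold(evidence, guessed_coverage))


print(mertrim_thresholds(183))        # defaults at the guessed 183x
print(mertrim_thresholds(16))         # defaults at the expected 16x
print(mertrim_thresholds(183, 5, 4))  # absolute counts override the guess
```

With the default fractions, a coverage of 16x gives the same pair as passing 5 and 4 explicitly, so setting -correct 5 and -evidence 4 amounts to telling merTrim to behave as if the coverage were about 16x.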
From: Ole K. T. <o.k...@bi...> - 2012-09-06 19:09:08
|
Hi,
I just ran merTrim on a relatively low coverage library, well, we don't really know whether it is low coverage or not since we don't know the genome size accurately yet. The original library was 16 Gbp, but after merTrim and then loading it into an assembly, only 6 Gbp survived. This might give a relatively good assembly, but I'm a bit worried that it removed too much sequence. Can I adjust how much it throws out?

My reads are 150 bp, PE. I followed the preprocessing page on the CA site, and created a database of trusted kmers and used that to correct my reads.

Of 56,188,107 reads of mate 1, 18,030,094 were deleted and 36,122,767 were clean, and the average length dropped to 68 bp. I expected it to remove about 10 % of my sequences (from what I've seen on other merTrim runs), but 2/3 seems a bit much.

Thank you.

Ole |
From: Quan, X. <x....@im...> - 2012-09-03 08:46:15
|
Hi

I am running an assembly for a large genome. I forgot to set ovlRefBlockSize to a larger number (currently it is the default, 2000000). overlapInCore has now been running for more than three days and has not finished yet. More than 4000 jobs have finished without error. Below are the output statistics for one of the jobs:

"HASH LOADING STOPPED: strings 5420544 out of 5420544 max.
HASH LOADING STOPPED: length 700000034 out of 700000034 max.
HASH LOADING STOPPED: entries 242755113 out of 264241152 max (load 68.90).
### realloc Extra_Ref_Space max_extra_ref_ct = 386385931
String_Ct = 5420544
Extra_String_Ct = 13533
Extra_String_Subcount = 21
Read 12224632 kmers to mark to skip
Kmer hits without olaps = 6269530
Kmer hits with olaps = 2108710
Multiple overlaps/pair = 0
Total overlaps produced = 2107177
Contained overlaps = 0
Dovetail overlaps = 0
"

According to the ovljob file, there are 038952 jobs. Is this the actual number of jobs to run? Is it worth killing the process and restarting the overlapper with a larger ovlRefBlockSize?

Thanks!

Dr. Xueping Quan
Research Associate in BioInformatics
Imperial College London
Tel: +44(0)207 594 17 80
email:x....@im...
Personal:http://www3.imperial.ac.uk/people/x.quan
Group: www3.imperial.ac.uk/savolainenlab |
From: Paul C. <pca...@gm...> - 2012-08-27 18:57:43
|
Hi

runCA has moved past the gatekeeper step and is onto the consensus!

Thank you everyone for your help,

Paul

Paul Cantalupo
University of Pittsburgh

On Mon, Aug 27, 2012 at 2:53 PM, Sebastian Jaenicke <sja...@ce...> wrote:
> Hi,
>
> On Mon, Aug 27, 2012 at 11:51:45AM -0400, Paul Cantalupo wrote:
> [..]
>> I'm trying to assemble a 4.4 Gb fastq file containing ~8.3 million 454
>> sequences. I'm using runCA version 7.0 on a Mac Server Snow Leopard
>> 10.6. I got an error that says "too many open files" (see attached
>> error.txt file). I've attached the spec file as well.
>
> try raising the number of open files permitted by your system; I don't
> know the defaults for Snow Leopard, but 2048 should be sufficient.
> I.e., invoke
>
> ulimit -n 2048
>
> before runCA. You can check defaults with just 'ulimit -n'.
>
> Regards,
>
> - Sebastian
>
> --
> A: Maybe because some people are too annoyed by top-posting.
> Q: Why do I not get an answer to my question(s)?
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing? |
From: Sebastian J. <sja...@Ce...> - 2012-08-27 18:54:07
|
Hi,

On Mon, Aug 27, 2012 at 11:51:45AM -0400, Paul Cantalupo wrote:
[..]
> I'm trying to assemble a 4.4 Gb fastq file containing ~8.3 million 454
> sequences. I'm using runCA version 7.0 on a Mac Server Snow Leopard
> 10.6. I got an error that says "too many open files" (see attached
> error.txt file). I've attached the spec file as well.

try raising the number of open files permitted by your system; I don't know the defaults for Snow Leopard, but 2048 should be sufficient. I.e., invoke

ulimit -n 2048

before runCA. You can check defaults with just 'ulimit -n'.

Regards,

- Sebastian

--
A: Maybe because some people are too annoyed by top-posting.
Q: Why do I not get an answer to my question(s)?
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing? |
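For anyone who launches runCA from a wrapper script rather than an interactive shell, the same limit can be inspected and raised from Python with the standard `resource` module. This is a sketch of the idea behind `ulimit -n 2048`, not part of runCA; the function name is made up, and the limit only applies to this process and the children it starts.

```python
import resource


def raise_open_file_limit(wanted=2048):
    """Raise the soft open-file limit to `wanted` if it is lower.

    Returns the soft limit in effect afterwards. The soft limit can
    never exceed the hard limit, so the request is capped there.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= wanted:
        return soft  # already high enough, leave it alone
    target = wanted if hard == unlimited else min(wanted, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


print("open-file limit before:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
print("open-file limit now:   ", raise_open_file_limit(2048))

# A child process started after this point inherits the new limit, e.g.
# subprocess.run(["runCA", ...]) with the same arguments used on the shell.
```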
From: Paul C. <pca...@gm...> - 2012-08-27 15:51:56
|
Hello all,

I'm trying to assemble a 4.4 Gb fastq file containing ~8.3 million 454 sequences. I'm using runCA version 7.0 on a Mac Server Snow Leopard 10.6. I got an error that says "too many open files" (see attached error.txt file). I've attached the spec file as well.

Thank you for any help you can provide,

Paul

Paul Cantalupo
University of Pittsburgh |
From: Walenz, B. <bw...@jc...> - 2012-08-23 19:14:21
|
Larger unitigs almost always lead to better assemblies. (As one early developer said: "I've never seen better assemblies from smaller unitigs".) Unitigs can be split after they are formed (at 1x coverage areas with only bad mates spanning it) and so the stats out of unitigger aren't exactly what is input to scaffolder.

Mate happiness (5-consensus-insert-size) probably won't show much difference here.

We've talked about making scaffolder output size statistics periodically, but haven't implemented anything.

Even for a running assembly, you can output size statistics for contigs using tigStore:

tigStore -g *gkpStore -t *tigStore V -C -d sizes

Where V == the last complete version (has ctg, utg and dat files) in the tigStore.

A bit heavyweight, but you can (in theory) run terminator using just a checkpoint and a tigStore, even when scaffolder is running. Some of the labeling will be wrong (mate pairs won't be labeled as happy, etc; contigs/unitigs probably won't be labeled either) but you can get sequence files.

b

On 8/23/12 5:11 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:

> Hi,
> I have several assemblies running, based on different input and
> configurations, and want to have an idea of how well they are doing.
> In the 4-unitigger folder, there is a log2 length histogram. Can I use
> that to get an idea of how well my assembly is going? For example,
> this is from one assembly (bogart):
> checkUnitigMembership()-- 13 ( 8192- 16384) 2953
> checkUnitigMembership()-- 14 ( 16384- 32768) 168
> checkUnitigMembership()-- 15 ( 32768- 65536) 4
> checkUnitigMembership()-- 16 ( 65536- 131072) 1
>
> and this is from another (bog):
> checkUnitigMembership()-- 13 ( 8192- 16384) 2302
> checkUnitigMembership()-- 14 ( 16384- 32768) 74
> checkUnitigMembership()-- 15 ( 32768- 65536) 1
>
> and a third (bog):
> checkUnitigMembership()-- 13 ( 8192- 16384) 2718
> checkUnitigMembership()-- 14 ( 16384- 32768) 48
>
>
> Since there are more and longer unitigs in the first assembly, will
> that probably turn out to have longer contigs in the end, or is there
> no correlation between this? Are there other places where I can get a
> feel of my assembly? Parsing the scaffold log files in any particular
> way?
>
> Thank you.
>
> Ole |
From: Ole K. T. <o.k...@bi...> - 2012-08-23 09:11:42
|
Hi,
I have several assemblies running, based on different input and configurations, and want to have an idea of how well they are doing. In the 4-unitigger folder, there is a log2 length histogram. Can I use that to get an idea of how well my assembly is going? For example, this is from one assembly (bogart):

checkUnitigMembership()-- 13 ( 8192- 16384) 2953
checkUnitigMembership()-- 14 ( 16384- 32768) 168
checkUnitigMembership()-- 15 ( 32768- 65536) 4
checkUnitigMembership()-- 16 ( 65536- 131072) 1

and this is from another (bog):

checkUnitigMembership()-- 13 ( 8192- 16384) 2302
checkUnitigMembership()-- 14 ( 16384- 32768) 74
checkUnitigMembership()-- 15 ( 32768- 65536) 1

and a third (bog):

checkUnitigMembership()-- 13 ( 8192- 16384) 2718
checkUnitigMembership()-- 14 ( 16384- 32768) 48

Since there are more and longer unitigs in the first assembly, will that probably turn out to have longer contigs in the end, or is there no correlation between this? Are there other places where I can get a feel of my assembly? Parsing the scaffold log files in any particular way?

Thank you.

Ole |
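To compare runs side by side, the histogram lines can be parsed with a short script. The sketch below is in Python and relies only on the line layout shown in this message; the regular expression, the run labels and the summary printed are illustrative choices, and only the buckets quoted here (8 kbp and longer) are included, since the full histograms were not posted.

```python
import re

# Matches lines such as:
#   checkUnitigMembership()-- 13 ( 8192- 16384) 2953
LINE = re.compile(
    r"checkUnitigMembership\(\)--\s+(\d+)\s+\(\s*(\d+)-\s*(\d+)\)\s+(\d+)"
)


def parse_histogram(text):
    """Return {log2 bucket: number of unitigs} for every matching line."""
    return {int(m.group(1)): int(m.group(4)) for m in LINE.finditer(text)}


runs = {
    "bogart": """
checkUnitigMembership()-- 13 ( 8192- 16384) 2953
checkUnitigMembership()-- 14 ( 16384- 32768) 168
checkUnitigMembership()-- 15 ( 32768- 65536) 4
checkUnitigMembership()-- 16 ( 65536- 131072) 1
""",
    "bog (second run)": """
checkUnitigMembership()-- 13 ( 8192- 16384) 2302
checkUnitigMembership()-- 14 ( 16384- 32768) 74
checkUnitigMembership()-- 15 ( 32768- 65536) 1
""",
    "bog (third run)": """
checkUnitigMembership()-- 13 ( 8192- 16384) 2718
checkUnitigMembership()-- 14 ( 16384- 32768) 48
""",
}

for name, text in runs.items():
    hist = parse_histogram(text)
    largest = max(hist)
    print(f"{name}: {sum(hist.values())} unitigs of 8192 bp or more, "
          f"largest bucket starts at {2 ** largest} bp")
```

On the quoted buckets, the bogart run has both the most unitigs of 8 kbp or longer and the only one in the 65536 to 131072 bp range.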
From: Ole K. T. <o.k...@bi...> - 2012-08-17 10:43:43
|
Hi Brian,
I have some comments/questions/answers below.

On 13 August 2012 04:34, Walenz, Brian <bw...@jc...> wrote:
> Hi, Ole-
>
> Very sorry to hear. I've been stung by this a few times too.
>
> There is minimal support for non-innie oriented mates. The assembler was
> developed with innie-oriented mates assumed, and there are still lots of
> places where we make that assumption. In particular, finding evidence for
> merging two scaffolds assumes innie oriented mates; computing gap sizes
> based on mate pairs also does. Both explicitly exclude non-innie oriented
> mates from contributing.
>
> The same issue comes up after classifyMates runs. We're left with a pile of
> now outtie-oriented PE pairs that we can do nothing with. We thought about
> updating the stores (reverse complementing the read), but as every overlap
> involving these reads would need to be modified, we decided this was just
> too risky.
>
> So, I'm sad to say, recomputing is the only real option. If it makes you
> feel any better, I had to run overlaps on a big assembly three times because
> our scratch disk policy is to delete files older than a week, and I kept
> getting pulled away from it.

I reran the entire run because I was a bit too eager and deleted the store before I was able to look into the issues you mentioned here. I've done it now though, and have some questions about them.

>
> You might be able to learn something from this run though. Bogart can't use
> all 3.3tb of those overlaps, so maybe you can reduce the number of overlaps.

The store is 2.5 TB, still too big I guess. I don't think I've seen a store this big before; the largest was about 1 TB (with approx. the same input data, about 51x coverage in Illumina reads and 26x in 454 reads). I have some files numbered from 0001 to 0250, where the 0001 file is 5 GB and the 0250 file is 20 GB, so I guess that's correct. And an ovs and idx in addition.

>
> Is the minimum overlap length too low? You could spot check some overlaps
> to see what the longest overlap is. You might be able to get away with,
> say, a minimum overlap length of 64 bases.

How do I do this precisely? I tried running some commands like this:

~/src/wgs-August2/Linux-amd64/bin/overlapStore -p 375000001 51xillumina_26x454_bac-ends_bog.ovlStore 51xillumina_26x454_bac-ends_bog.gkpStore OBTINITIAL

Output:
375000001 A: 1 0 --------------------------------------------------------------------------------------------------->
288037584 A: 0 1975 ( -1) B: 124 2047 ( -1) 0.00% +124> +-2048
283419705 A: 0 1976 ( -1) B: 124 2047 ( -1) 3.45% +124> +-2048
Bus error (core dumped)

~/src/wgs-August2/Linux-amd64/bin/overlapStore -p 250000000 51xillumina_26x454_bac-ends_bog.ovlStore 51xillumina_26x454_bac-ends_bog.gkpStore OBTINITIAL

Output:
DUMPING PICTURE for ID 250000000 in store 51xillumina_26x454_bac-ends_bog.ovlStore (gkp 51xillumina_26x454_bac-ends_bog.gkpStore clear OBTINITIAL)
250000000 A: 1 0 --------------------------------------------------------------------------------------------------->
362293433 A: 0 1912 ( -1) B: 80 2047 ( -1) 0.00% +80> +-2048
308391868 A: 0 1913 ( -1) B: 60 2047 ( -1) 0.00% +60> +-2048
117489298 A: 0 1914 ( -1) B: 54 2047 ( -1) 0.00% +54> +-2048
231078346 A: 0 1917 ( -1) B: 51 2047 ( -1) 0.00% +51> +-2048
92028512 A: 0 1922 ( -1) B: 37 2047 ( -1) 0.00% +37> +-2048
Bus error (core dumped)

That does not look like what I expected (from http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=OvlStore).

I also ran with this:

~/src/wgs-August2/Linux-amd64/bin/overlapStore -d 51xillumina_26x454_bac-ends_bog.ovlStore -b 375000000 -e 375000000

That gave me 5596 overlaps, so I guess it was a bad choice of fragment. If I choose another fragment (454 shotgun read):

~/src/wgs-August2/Linux-amd64/bin/overlapStore -d 51xillumina_26x454_bac-ends_bog.ovlStore -b 400000001 -e 400000001

I get 153 overlaps, and some of them:

400000001 3314999 I 59 -104 0.00 0.00
400000001 4090114 I 42 -121 0.00 0.00
400000001 5386505 I 59 -104 0.00 0.00
400000001 9054281 I 171 8 2.17 2.17
400000001 25877453 I 11 -170 0.00 0.00
400000001 34584423 I 195 32 2.94 2.94
400000001 35861704 N 221 26 2.38 2.38
400000001 36059151 N 168 5 2.11 2.11
400000001 36573727 N 10 -184 0.00 0.00
400000001 37990829 N 175 12 1.14 1.14
400000001 39350033 I 59 -104 0.00 0.00
400000001 39934425 N 218 39 4.44 4.44
400000001 41436906 I 133 -46 0.00 0.00
400000001 41776173 N 168 5 2.11 2.11
400000001 42439341 I 211 45 1.92 1.92
400000001 42876704 N 189 26 1.35 1.35
400000001 44051983 I 103 -60 0.00 0.00
400000001 47862199 N 2 -184 0.00 0.00
400000001 48017281 N 202 39 3.28 3.28
400000001 48875374 N 168 5 2.11 2.11
400000001 51196018 I 130 -58 0.00 0.00
400000001 54522446 N 205 39 3.45 3.45
400000001 56546126 I 72 -91 0.00 0.00
400000001 56653818 N 89 -74 0.00 0.00
400000001 66204913 I 5 -158 0.00 0.00
400000001 67154193 I 171 8 2.17 2.17
<snip>
400000001 367185391 I 220 225 4.65 4.65
400000001 367771473 I 0 18 0.76 0.76
400000001 371107629 N -13 55 0.76 0.76
400000001 372107538 N 0 -31 0.00 0.00
400000001 372408998 N 152 347 0.90 0.90
400000001 377636043 N 0 54 0.38 0.38
400000001 377646282 N 0 53 0.38 0.38
400000001 377655226 N 99 72 0.61 0.61
400000001 377896151 N 0 -96 0.00 0.00
400000001 377911031 N 0 57 0.38 0.38
400000001 383651644 N 189 336 1.35 1.35
400000001 383688429 N 192 109 1.41 1.41
400000001 383754638 N 189 338 1.35 1.35
400000001 383766287 N 189 338 1.35 1.35
400000001 383932105 I 99 345 0.61 0.61
400000001 385033058 I 69 124 1.55 1.55
400000001 387893078 I 0 88 0.76 0.76
400000001 387946865 I 0 88 0.76 0.76
400000001 398390290 I 205 489 1.72 1.72
400000001 398406551 I 214 449 4.08 4.08
400000001 398998877 I 99 368 0.61 0.61
400000001 399078177 I 153 368 0.91 0.91
400000001 405995246 N 166 387 1.03 1.03
400000001 409651281 I 0 63 0.38 0.38
400000001 409656569 I 79 63 0.54 0.54
400000001 409813283 N 0 48 0.38 0.38
400000001 410422960 I 0 -95 0.00 0.00

But I can't see the overlap length from that, can I? I guess I could get the read length for each read and test though. It seems that the error rate is mostly below 4 %, so setting it to 4 % might not help much (just a little bit).

For another read (Illumina from a MP library), 66 overlaps:

~/src/wgs-August2/Linux-amd64/bin/overlapStore -d 51xillumina_26x454_bac-ends_bog.ovlStore -b 40000000 -e 40000000

40000000 18888403 N 3 7 0.00 0.00
40000000 19154662 I -22 -37 3.39 3.39
40000000 42344478 N 15 7 0.00 0.00
40000000 53346801 I -51 -50 4.35 4.35
40000000 61865542 N 49 53 0.00 0.00
40000000 64381669 N 0 -7 2.25 2.25
40000000 65643742 N 41 26 0.00 0.00
40000000 66612098 I -53 -54 4.76 4.76
40000000 69287018 N -3 0 0.00 0.00
40000000 70886837 I -55 -51 4.44 4.44
40000000 73035047 N -46 -42 3.70 3.70
40000000 77363047 I 48 52 0.00 0.00
40000000 83770223 N 3 7 0.00 0.00
40000000 92992529 I -53 -49 4.26 4.26
40000000 93109879 N -25 -21 0.00 0.00
<snip>
40000000 317454557 N -121 -32 0.00 0.00
40000000 320220222 N -4 6 2.11 2.11
40000000 323660284 I -22 55 0.00 0.00
40000000 330106339 N -128 -47 0.00 0.00
40000000 330437997 I -48 37 2.08 2.08
40000000 333633813 I -4 31 0.00 0.00
40000000 350457623 N 34 139 0.00 0.00
40000000 366189823 I -10 -26 2.86 2.86
40000000 373715828 I -61 -48 0.00 0.00
40000000 380074140 I -51 -15 0.00 0.00
40000000 380125547 I -9 -9 2.30 2.30
40000000 383202147 N -18 -4 0.00 0.00
40000000 383480660 N 34 27 0.00 0.00
40000000 392470256 N -26 -31 3.08 3.08
40000000 395518457 N -2 -5 0.00 0.00
40000000 409319564 I -26 -54 4.76 4.76

There seem to be a lot of 0 % error overlaps (but I don't know the length).

Can I see from bogart's output how many of the overlaps it would have loaded? Would it have loaded all 2.5 TB? I have 410,962,052 reads in total, and bogart says 21,884,230,828 overlaps.

If I do choose to rerun overlaps, will it run faster with more stringent options? (4 % errors, 50 overlap length or maybe 64).

Thank you for your help, once again.

Ole

>
> Is the error rate too high? Again spot checking, are there reads with no
> low-error overlaps? Maybe you can get away with only 4% error.
>
> b
>
>
>
> On 8/10/12 3:18 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
>
>> Hi,
>> I ran classify on an Illumina mate pair library, and managed to use
>> one of the old versions of gatekeeper to dump the reads, so I guess
>> they were dumped as innie reads. I thought the library still was
>> outtie, and input that into an assembly. Now, after finishing
>> overlapper (using grid and grid version of overlapStoreBuild) I have a
>> ovlStore of 3.3 TB, so I'd rather not run that again if I can avoid
>> it.
>>
>> I see from this page:
>> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Gatekeeper#Library
>> that there are some options in changing orientation of the library,
>> but only "innie" is supported it says. Do you have any suggestions of
>> what I can do? Would it not work changing the library to "outtie"?
>>
>> Thank you.
>>
>> Ole
> |
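On the question of reading an overlap length out of the `-d` dump: if the read length is known, the length can be derived from the two hang columns. The Python below is a sketch under two assumptions that the thread does not confirm: that the dump columns are A id, B id, orientation, a-hang, b-hang and two error rates, and that the hangs follow the usual convention (a positive a-hang is the part of A before B starts, a negative b-hang is the part of A after B ends). The read length of 400 is purely hypothetical; the real value would have to come from the gkpStore.

```python
def overlap_length_on_a(len_a, a_hang, b_hang):
    """Number of bases of read A covered by the overlap.

    Assumed convention: a positive a_hang is the stretch of A before B
    starts; a negative b_hang is the stretch of A after B ends.
    """
    return len_a - max(a_hang, 0) + min(b_hang, 0)


def parse_dump_row(row):
    """Split one dump row into (a_id, b_id, orientation, a_hang, b_hang)."""
    a_id, b_id, orient, a_hang, b_hang = row.split()[:5]
    return int(a_id), int(b_id), orient, int(a_hang), int(b_hang)


# HYPOTHETICAL length for read 400000001; not given anywhere in the thread.
LEN_A = 400

# Three rows copied from the dump above.
rows = [
    "400000001 9054281 I 171 8 2.17 2.17",
    "400000001 3314999 I 59 -104 0.00 0.00",
    "400000001 371107629 N -13 55 0.76 0.76",
]

for row in rows:
    a_id, b_id, orient, a_hang, b_hang = parse_dump_row(row)
    length = overlap_length_on_a(LEN_A, a_hang, b_hang)
    print(f"{a_id} vs {b_id} ({orient}): {length} bases of A overlap")
```

Running the same function over a full dump, with real read lengths, would give the distribution of overlap lengths needed to judge whether a 50 or 64 base minimum would remove many overlaps.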
From: Walenz, B. <bw...@jc...> - 2012-08-16 05:50:45
|
I think most of the time will be spent in overlaps, and this page should help:

http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=RunCA#OVL_Overlapper

Or just confuse. I've been told that the page is a bit obtuse for non-hackers, but it's all we have right now.

In general 4 cores and 8gb memory works great, for 'normal' genomes. There are a few more options that you can fiddle with to optimize memory loading, but the defaults -- once you work through the page above -- should work reasonably well (I hope).

Happy to help more if you want to get more specific.

b

On 8/14/12 9:26 AM, "Thomas Hackl" <tho...@un...> wrote:
> Hi,
>
> I would like to use the pacBioToCA correction pipeline on data for a
> bigger genomes (Gb size), hence computation time quite concerns me.
> Judging from the *.spec files, there are quite some parameters to tweak.
> We have machines, 40/80 cores, 200/500Gb memory, and I was wondering if
> you could give me some advice on how to modify the spec file parameters
> to make optimal use of this potential.
>
> Thanks
> Thomas
> |
From: Walenz, B. <bw...@jc...> - 2012-08-16 05:37:04
|
Hi-

Without details, I can’t really say what expected is. It depends on genome properties (repeats, duplications), configuration (kmer size, threshold), read properties (number of reads, quality), and, of course, the amount of hardware you have.

I’d be unhappy with one week of wall clock too. It is (sadly) easy to badly configure overlapper so that it runs forever.

Can you share your spec? Hardware? Are you willing to share details of genome and reads?

b

On 8/14/12 10:01 AM, "Quan, Xueping" <x....@im...> wrote:

Hi

I am running a hybrid assembly using Celera Assembler version 7.0 for my genome (3.5gb) and wondering what's the normal expected time to finish. I started the assembly job one week ago and it is still on the overlap stage.

Thanks!
Xueping

Dr. Xueping Quan
Research Associate in BioInformatics
Imperial College London
Tel: +44(0)207 594 17 80
email:x....@im...
Personal:http://www3.imperial.ac.uk/people/x.quan
Group: www3.imperial.ac.uk/savolainenlab |
From: Quan, X. <x....@im...> - 2012-08-14 14:02:28
|
Hi

I am running a hybrid assembly using Celera Assembler version 7.0 for my genome (3.5gb) and wondering what's the normal expected time to finish. I started the assembly job one week ago and it is still on the overlap stage.

Thanks!
Xueping

Dr. Xueping Quan
Research Associate in BioInformatics
Imperial College London
Tel: +44(0)207 594 17 80
email:x....@im...
Personal:http://www3.imperial.ac.uk/people/x.quan
Group: www3.imperial.ac.uk/savolainenlab |
From: Thomas H. <tho...@un...> - 2012-08-14 13:27:09
|
Hi,

I would like to use the pacBioToCA correction pipeline on data for bigger genomes (Gb size), so computation time is a real concern for me. Judging from the *.spec files, there are quite a few parameters to tweak. We have machines with 40/80 cores and 200/500Gb memory, and I was wondering if you could give me some advice on how to modify the spec file parameters to make optimal use of this potential.

Thanks
Thomas

--
Thomas Hackl
Julius-Maximilians-Universität
Department of Bioinformatics
97074 Würzburg, Germany
Fon: +49 931 - 31 86883
Mail: tho...@un... |
From: Walenz, B. <bw...@jc...> - 2012-08-13 02:48:18
|
Hi- I had to run one myself, and then check the code. There is a definite uninitialized value problem here. In effect, any gap in the scaffold was not set to zero coverage. I've fixed it in CVS. If you're not using CVS, patch AS_RUN/fragmentDepth.c by adding memset(histogram, 0, sizeof(uint32) * histogramMax); near the start of the computeStuff() function (line 71 - after the block of variables are defined, before the first 'if' test works). The patch in CVS is slightly different, but equivalent. Big scaffolds (with few gaps) didn't seem to be affected too much. Small scaffolds - yours are two contigs joined by a fosmid? - were. I'm stunned this has survived for so long! Thanks for noticing. b On 8/9/12 10:23 AM, "Christoph Hahn" <chr...@gm...> wrote: > Hi Brian, > > thanks for your reply! The fragmentDepth utility does basically what I > was interested in, thanks! I am a little confused with its output, > though. If I run it in -scaffold mode like: > fragmentDepth -scaffold < *.posmap.frgscf.sorted > > In the fragmentDepth output I get the following as an example: > uid start end mode mean median > 7180006953248 0 33010 40294 42.589286 2 > 7180006953249 0 31936 1518 42.845247 1 > 7180006953250 0 26539 62454 41.643727 41 > > A few questions there: > What exactly is the mode (40294,1518,62454) column? According to > *.posmap.scflen scaffold 7180006953248 is 33204 long - why does it > calculate the coverage only until position 33010? Also, I am not sure > how to understand the median value. To reach a value of 1 or 2 as in the > first two scaffolds in the example about half of the positions need to > have a coverage of 0-1 or 0-2, right? can that be correct, or am I > misunderstanding something here? > > Thanks for your help! > > cheers, > Christoph > > > > On 08/09/2012 05:03 AM, Walenz, Brian wrote: >> Hi, Christoph- >> >> [Sorry, wrote this 16 hours ago and forgot to send] >> >> Check out the 'fragmentDepth' utility. 
It computes coverage, and outputs in >> three different ways: coverage of each scaffold, a histogram of coverage (as >> at the end of *.qc), and a fasta-like output of the actual depth of coverage >> at each base in the scaffold. >> >> I can't think of a reason it would fail on contigs, but I haven't tried it. >> >> The posmap files should be capturing most of the important stuff from the >> (agreed: very difficult to use) asm file. If you can't get what you're >> looking for out of the posmap files, we need to add to them. >> >> b >> >> >> >> On 8/8/12 6:20 AM, "Christoph Hahn" <chr...@gm...> wrote: >> >>> Hello CA developers and experts, >>> >>> I have just finished my first big 454+illumina hybrid assembly using CA7 >>> and I am about to assess the result now in comparison to purely illumina >>> based assemblies. >>> >>> One question there: What is the easiest way to get coverage information >>> for the scaffolds, contigs, unitigs in the *.scf.fasta, *.ctg.fasta, >>> etc. files? I figured, that it is possible to calculate it manually >>> using the information in the *.posmap.frgscf and *.posmap.scflen files >>> (in case of scaffolds). I guess, the information is also in the *.asm >>> file, but I am having problems reading/parsing the file. >>> Is there an easy way you can think about? >>> The reason, why I want to do this is that I want to bin the >>> scaffolds/contigs based on coverage, GC-content and length. >>> >>> Any ideas are highly appreciated, thanks! >>> >>> Much obliged, >>> Christoph >>> >>> University of Oslo, Norway >>> >>> >>> ---------------------------------------------------------------------------- >>> -- >>> Live Security Virtual Conference >>> Exclusive live event will cover all the ways today's security and >>> threat landscape has changed and how IT managers can respond. Discussions >>> will include endpoint security, mobile security and the latest in malware >>> threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >>> _______________________________________________ >>> wgs-assembler-users mailing list >>> wgs...@li... >>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > |
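The effect of the bug described above (gaps in a scaffold not counted as zero coverage) can be illustrated with a small sketch. This is not the fragmentDepth code. It is a hypothetical Python re-statement of the idea, assuming fragments are given as half-open (begin, end) intervals on the scaffold, and it shows the depth array being explicitly zeroed before any fragment is counted, which is what the memset in the patch does for the histogram.

```python
from collections import Counter
from statistics import mean, median


def depth_stats(scaffold_len, fragments):
    """Return (mode, mean, median) depth of coverage over every scaffold position."""
    depth = [0] * scaffold_len              # explicit zero: gaps must count as depth 0
    for begin, end in fragments:            # half-open interval [begin, end)
        for pos in range(max(0, begin), min(scaffold_len, end)):
            depth[pos] += 1
    histogram = Counter(depth)              # depth value -> number of positions
    # Most common depth; ties are broken in favour of the smaller depth value.
    mode = max(histogram, key=lambda d: (histogram[d], -d))
    return mode, mean(depth), median(depth)


# A toy 10-base scaffold with a two-base gap at positions 6 and 7.
print(depth_stats(10, [(0, 4), (2, 6), (8, 10)]))
```

In this sketch the mode is a depth value (the most common coverage), so a mode far larger than the mean, as in the output quoted in the thread, would be a hint that something was read from uninitialized memory.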
From: Walenz, B. <bw...@jc...> - 2012-08-13 02:34:37
|
Hi, Ole- Very sorry to hear. I've been stung by this a few times too. There is minimal support for non-innie oriented mates. The assembler was developed with innie-oriented mates assumed, and there are still lots of places where we make that assumption. In particular, finding evidence for merging two scaffolds assumes innie oriented mates; computing gap sizes based on mate pairs also does. Both explicitly exclude non-innie oriented mates from contributing. The same issue comes up after classifyMates runs. We're left with a pile of now outtie-oriented PE pairs that we can do nothing with. We thought about updating the stores (reverse complementing the read), but as every overlap involving these reads would need to be modified, we decided this was just too risky. So, I'm sad to say, recomputing is the only real option. If it makes you feel any better, I had to run overlaps on a big assembly three times because our scratch disk policy is to delete files older than a week, and I kept getting pulled away from it. You might be able to learn something from this run though. Bogart can't use all 3.3tb of those overlaps, so maybe you can reduce the number of overlaps. Is the minimum overlap length too low? You could spot check some overlaps to see what the longest overlap is. You might be able to get away with, say, a minimum overlap length of 64 bases. Is the error rate too high? Again spot checking, are there reads with no low-error overlaps? Maybe you can get away with only 4% error. b On 8/10/12 3:18 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: > Hi, > I ran classify on an Illumina mate pair library, and managed to use > one of the old versions of gatekeeper to dump the reads, so I guess > they were dumped as innie reads. I thought the library still was > outtie, and input that into an assembly. Now, after finishing > overlapper (using grid and grid version of overlapStoreBuild) I have a > ovlStore of 3.3 TB, so I'd rather not run that again if I can avoid > it. 
> > I see from this page: > http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Gatekeeper > #Library > that there are some options in changing orientation of the library, > but only "innie" is supported it says. Do you have any suggestions of > what I can do? Would it not work changing the library to "outtie"? > > Thank you. > > Ole |
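The spot check suggested above (are there reads with no low-error overlaps?) could be sketched roughly as follows. It assumes overlapStore dump lines with seven whitespace-separated columns, the last being the corrected error in percent, as in the dumps posted elsewhere in this thread. The function name and the 4% default are illustrative only and are not part of any Celera Assembler tool.

```python
def count_low_error(dump_lines, max_error_pct=4.0):
    """Return (kept, total): overlaps at or below the error cutoff, and all parsed overlaps."""
    kept = total = 0
    for line in dump_lines:
        fields = line.split()
        if len(fields) != 7:        # skip blank lines, '<snip>' markers and the like
            continue
        total += 1
        if float(fields[6]) <= max_error_pct:
            kept += 1
    return kept, total


# Four overlaps for read 40000000, copied from a dump posted in this thread.
sample = [
    "40000000 18888403 N 3 7 0.00 0.00",
    "40000000 19154662 I -22 -37 3.39 3.39",
    "40000000 53346801 I -51 -50 4.35 4.35",
    "40000000 66612098 I -53 -54 4.76 4.76",
]
print(count_low_error(sample))
```

A read for which the first number comes back as zero would lose all of its overlaps at that cutoff, which is the case worth looking for before tightening the error rate.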
From: Ole K. T. <o.k...@bi...> - 2012-08-10 19:18:21
|
Hi,

I ran classify on an Illumina mate pair library, and managed to use one of the old versions of gatekeeper to dump the reads, so I guess they were dumped as innie reads. I thought the library was still outtie, and input it into an assembly. Now, after finishing overlapper (using the grid and the grid version of overlapStoreBuild), I have an ovlStore of 3.3 TB, so I'd rather not run that again if I can avoid it.

I see from this page:
http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Gatekeeper#Library
that there are some options for changing the orientation of the library, but it says only "innie" is supported. Do you have any suggestions for what I can do? Would changing the library to "outtie" not work?

Thank you.

Ole |
From: Christoph H. <chr...@gm...> - 2012-08-09 14:23:37
|
Hi Brian, thanks for your reply! The fragmentDepth utility does basically what I was interested in, thanks! I am a little confused with its output, though. If I run it in -scaffold mode like: fragmentDepth -scaffold < *.posmap.frgscf.sorted In the fragmentDepth output I get the following as an example: uid start end mode mean median 7180006953248 0 33010 40294 42.589286 2 7180006953249 0 31936 1518 42.845247 1 7180006953250 0 26539 62454 41.643727 41 A few questions there: What exactly is the mode (40294,1518,62454) column? According to *.posmap.scflen scaffold 7180006953248 is 33204 long - why does it calculate the coverage only until position 33010? Also, I am not sure how to understand the median value. To reach a value of 1 or 2 as in the first two scaffolds in the example about half of the positions need to have a coverage of 0-1 or 0-2, right? can that be correct, or am I misunderstanding something here? Thanks for your help! cheers, Christoph On 08/09/2012 05:03 AM, Walenz, Brian wrote: > Hi, Christoph- > > [Sorry, wrote this 16 hours ago and forgot to send] > > Check out the 'fragmentDepth' utility. It computes coverage, and outputs in > three different ways: coverage of each scaffold, a histogram of coverage (as > at the end of *.qc), and a fasta-like output of the actual depth of coverage > at each base in the scaffold. > > I can't think of a reason it would fail on contigs, but I haven't tried it. > > The posmap files should be capturing most of the important stuff from the > (agreed: very difficult to use) asm file. If you can't get what you're > looking for out of the posmap files, we need to add to them. > > b > > > > On 8/8/12 6:20 AM, "Christoph Hahn" <chr...@gm...> wrote: > >> Hello CA developers and experts, >> >> I have just finished my first big 454+illumina hybrid assembly using CA7 >> and I am about to assess the result now in comparison to purely illumina >> based assemblies. 
>> >> One question there: What is the easiest way to get coverage information >> for the scaffolds, contigs, unitigs in the *.scf.fasta, *.ctg.fasta, >> etc. files? I figured, that it is possible to calculate it manually >> using the information in the *.posmap.frgscf and *.posmap.scflen files >> (in case of scaffolds). I guess, the information is also in the *.asm >> file, but I am having problems reading/parsing the file. >> Is there an easy way you can think about? >> The reason, why I want to do this is that I want to bin the >> scaffolds/contigs based on coverage, GC-content and length. >> >> Any ideas are highly appreciated, thanks! >> >> Much obliged, >> Christoph >> >> University of Oslo, Norway |
From: Walenz, B. <bw...@jc...> - 2012-08-09 03:03:21
|
Hi, Christoph- [Sorry, wrote this 16 hours ago and forgot to send] Check out the 'fragmentDepth' utility. It computes coverage, and outputs in three different ways: coverage of each scaffold, a histogram of coverage (as at the end of *.qc), and a fasta-like output of the actual depth of coverage at each base in the scaffold. I can't think of a reason it would fail on contigs, but I haven't tried it. The posmap files should be capturing most of the important stuff from the (agreed: very difficult to use) asm file. If you can't get what you're looking for out of the posmap files, we need to add to them. b On 8/8/12 6:20 AM, "Christoph Hahn" <chr...@gm...> wrote: > Hello CA developers and experts, > > I have just finished my first big 454+illumina hybrid assembly using CA7 > and I am about to assess the result now in comparison to purely illumina > based assemblies. > > One question there: What is the easiest way to get coverage information > for the scaffolds, contigs, unitigs in the *.scf.fasta, *.ctg.fasta, > etc. files? I figured, that it is possible to calculate it manually > using the information in the *.posmap.frgscf and *.posmap.scflen files > (in case of scaffolds). I guess, the information is also in the *.asm > file, but I am having problems reading/parsing the file. > Is there an easy way you can think about? > The reason, why I want to do this is that I want to bin the > scaffolds/contigs based on coverage, GC-content and length. > > Any ideas are highly appreciated, thanks! > > Much obliged, > Christoph > > University of Oslo, Norway |
From: Christoph H. <chr...@gm...> - 2012-08-08 10:20:30
|
Hello CA developers and experts,

I have just finished my first big 454+illumina hybrid assembly using CA7 and I am about to assess the result in comparison to purely illumina based assemblies.

One question there: What is the easiest way to get coverage information for the scaffolds, contigs and unitigs in the *.scf.fasta, *.ctg.fasta, etc. files? I figured that it is possible to calculate it manually using the information in the *.posmap.frgscf and *.posmap.scflen files (in the case of scaffolds). I guess the information is also in the *.asm file, but I am having problems reading/parsing the file. Is there an easy way you can think of? The reason I want to do this is that I want to bin the scaffolds/contigs based on coverage, GC-content and length.

Any ideas are highly appreciated, thanks!

Much obliged,
Christoph

University of Oslo, Norway |
From: Walenz, B. <bw...@jc...> - 2012-07-31 16:13:56
|
hi- It is these two lines: ovlThreads = 2 ovlConcurrency = 24 The first says that each process will use 2 threads, and the second says to run 24 processes at the same time, for a total of 48 cores used. Dropping ovlConcurrency to 16 should work. b -- Brian Walenz Sr. Software Engineer J. Craig Venter Institute On 7/31/12 9:18 AM, "Quan, Xueping" <x....@im...> wrote: Hi I am working on a large plant genome (genome size about 3.5Gb), I got about 131Gb Illumina paired-end and 1.8Gb 454 mate pair reads. The assembling is running on a HPC with shared memory with upper memory limit and number of CPUs I could use being 800Gb and 32 cores. However, the assembling job was killed by the system because "ncpus 33.30 exceeded limit 32" in the overlapInCore stage. Below is my spec file, could you please have a look to see where it is wrong and how to optimize: " # # Expected rate of sequencing error. Allow pairwise alignments up to this rate. # Sanger reads can use values less than one. Titanium reads require 3% in unitig. # utgErrorRate=0.03 utgErrorLimit=2.5 # Allow mismatches over and above the utgErrorRate. This helps with Illumina reads. ovlErrorRate=0.06 # Larger than utg to allow for correction. cnsErrorRate=0.10 # Larger than utg to avoid occasional consensus failures cgwErrorRate=0.10 # Larger than utg to allow contig merges across high-error ends # merSize = 22 # default=22; use lower to combine across heterozygosity, higher to separate near-identical repeat copies overlapper=ovl # the mer overlapper for 454-like data is insensitive to homopolymer problems but requires more RAM and disk # unitigger = bog utgBubblePopping = 1 # utgGenomeSize = 3.5gb # # MERYL calculates K-mer seeds merylMemory = 512000 merylThreads = 24 # # OVERLAPPER calculates overlaps ovlHashBits=25 ovlHashBlockLength=180000000 ovlThreads = 2 ovlConcurrency = 24 ovlRefBlockSize = 32000000 # # OVERLAP STORE build the database #ovlStoreMemory = 8GB # Oops! That doesn't work. See correction below. 
ovlStoreMemory = 8192 # Mbp # # ERROR CORRECTION applied to overlaps frgCorrThreads = 10 frgCorrConcurrency = 3 ovlCorrBatchSize = 1000000 ovlCorrConcurrency = 25 # # UNITIGGER configuration # # CONSENSUS configuration cnsConcurrency = 16 " Thanks very much! Xueping Dr. Xueping Quan Research Associate in BioInformatics Imperial College London Tel: +44(0)207 594 17 80 email:x....@im... Personal:http://www3.imperial.ac.uk/people/x.quan Group: www3.imperial.ac.uk/savolainenlab <https://exchange.imperial.ac.uk/ecp/Customize/www3.imperial.ac.uk/savolainenlab> |
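The arithmetic behind the reply above (threads per process times concurrent processes) is simple enough to check before submitting a job. A minimal sketch, with hypothetical helper names; only ovlThreads, ovlConcurrency and the 32-core limit come from the thread:

```python
def overlap_cores(ovl_threads, ovl_concurrency):
    """Total cores the overlapper stage may use at once."""
    return ovl_threads * ovl_concurrency


def max_concurrency(core_limit, ovl_threads):
    """Largest ovlConcurrency that keeps the stage within the core limit."""
    return core_limit // ovl_threads


print(overlap_cores(2, 24))     # 48, above the 32-core limit
print(max_concurrency(32, 2))   # 16, the value suggested in the reply
```

The same check could be applied to the other thread and concurrency pairs in the spec (for example frgCorrThreads and frgCorrConcurrency), assuming they multiply in the same way.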
From: Quan, X. <x....@im...> - 2012-07-31 13:18:46
|
Hi I am working on a large plant genome (genome size about 3.5Gb), I got about 131Gb Illumina paired-end and 1.8Gb 454 mate pair reads. The assembling is running on a HPC with shared memory with upper memory limit and number of CPUs I could use being 800Gb and 32 cores. However, the assembling job was killed by the system because "ncpus 33.30 exceeded limit 32" in the overlapInCore stage. Below is my spec file, could you please have a look to see where it is wrong and how to optimize: " # # Expected rate of sequencing error. Allow pairwise alignments up to this rate. # Sanger reads can use values less than one. Titanium reads require 3% in unitig. # utgErrorRate=0.03 utgErrorLimit=2.5 # Allow mismatches over and above the utgErrorRate. This helps with Illumina reads. ovlErrorRate=0.06 # Larger than utg to allow for correction. cnsErrorRate=0.10 # Larger than utg to avoid occasional consensus failures cgwErrorRate=0.10 # Larger than utg to allow contig merges across high-error ends # merSize = 22 # default=22; use lower to combine across heterozygosity, higher to separate near-identical repeat copies overlapper=ovl # the mer overlapper for 454-like data is insensitive to homopolymer problems but requires more RAM and disk # unitigger = bog utgBubblePopping = 1 # utgGenomeSize = 3.5gb # # MERYL calculates K-mer seeds merylMemory = 512000 merylThreads = 24 # # OVERLAPPER calculates overlaps ovlHashBits=25 ovlHashBlockLength=180000000 ovlThreads = 2 ovlConcurrency = 24 ovlRefBlockSize = 32000000 # # OVERLAP STORE build the database #ovlStoreMemory = 8GB # Oops! That doesn't work. See correction below. ovlStoreMemory = 8192 # Mbp # # ERROR CORRECTION applied to overlaps frgCorrThreads = 10 frgCorrConcurrency = 3 ovlCorrBatchSize = 1000000 ovlCorrConcurrency = 25 # # UNITIGGER configuration # # CONSENSUS configuration cnsConcurrency = 16 " Thanks very much! Xueping Dr. 
Xueping Quan Research Associate in BioInformatics Imperial College London Tel: +44(0)207 594 17 80 email:x....@im... Personal:http://www3.imperial.ac.uk/people/x.quan Group: www3.imperial.ac.uk/savolainenlab<https://exchange.imperial.ac.uk/ecp/Customize/www3.imperial.ac.uk/savolainenlab> |
From: kuhl <ku...@mo...> - 2012-07-31 08:13:56
|
Hello Brian, thanks for the help. Fortunately, in step 7-4 cgw successfully finished MergeScaffoldsAggressive after iteration 564. Best wishes, Heiner On Mon, 30 Jul 2012 12:44:16 -0400, "Walenz, Brian" <bw...@jc...> wrote: > Hi, Heiner- > > Working backwards through your email: > > We've also noticed the 'large scaffold gets lots of little contigs added' > problem. This seems to be dominating our run time. I'm working on this > problem at the moment. Our previous solution was basically what you did: > let it run until we get impatient, then kill it and restart from the next > checkpoint label. > > The CVS tip has a slight improvement in cgw, committed around the 20th. I > hope to have much more within the next week. > > You can ignore the mates in the library, but not the reads. To ignore the > mates, simply delete the mate link from gkpStore. At the very bottom of > the > 'gatekeeper' page on the wiki is 'allfragsunmated' which will remove the > mate link from all reads in a single library. This is a destructive > operation! Save a backup of gkpStore/fnm and gkpStore/fpk if you want to > revert. (these two files store metadata for long and short fragments > resp.) > > http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Gatekeep > er > > FYI- The 5-consensus-insert-size directory has a plot of the insert size > histogram for each library. These are based on unitigs, and so the 20k > library might not be represented well. tigStore (the command) can also > analyze mate pairs for contigs/unitigs in the store with -d matepair. > > b > > > On 7/26/12 5:39 PM, "kuhl" <ku...@mo...> wrote: > >> Hi Brian et al., >> >> I am currently running a huge assembly with CA7 (2.5Gb 30x Illumina + >> 454, >> cgw takes 150-300Gb RAM). It is now in step 7-2 and I have just stopped >> cgw >> at MergeScaffoldsAggressive iteration 1641 and restarted it at >> ckp08-2SM. I >> did this also in 7-0 at iteration 2xxx. 
Now I am not sure, if I should >> maybe rerun scaffolding without 20 kb mate pairs, which I think are >> responsible for this mess. So I have two questions: >> >> How can I convince cgw to ignore a certain library without doing steps >> 0-5 >> again? >> >> Is there a rule of thumb, when MergeScaffoldsAggressive should be >> stopped? >> >> >> In my case it looks like cgw is only very slightly progressing with each >> iteration and there is one large scaffold that is growing more and >> more... >> >> ExamineUsableSEdges()- maxWeightEdge from 0 to 32 at idx 3355 out of >> 60498 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 19 at idx 8774 out of >> 60498 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 55 at idx 286 out of 60500 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 32 at idx 3355 out of >> 60500 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 16 at idx 10594 out of >> 60500 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 7 at idx 20348 out of >> 60500 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 55 at idx 286 out of 60489 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 32 at idx 3355 out of >> 60489 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 19 at idx 8773 out of >> 60489 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 9 at idx 16854 out of >> 60489 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 55 at idx 286 out of 60486 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 32 at idx 3355 out of >> 60486 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 16 at idx 10593 out of >> 60486 >> ExamineUsableSEdges()- maxWeightEdge from 0 to 7 at idx 20428 out of >> 60486 >> >> Regards, Heiner >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. 
Discussions >> will include endpoint security, mobile security and the latest in malware >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> _______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users -- --------------------------------------------------------------- Dr. Heiner Kuhl MPI Molecular Genetics Tel: + 49 + 30 / 8413 1551 Next Generation Sequencing Ihnestrasse 73 email: ku...@mo... D-14195 Berlin http://www.molgen.mpg.de --------------------------------------------------------------- |