From: Jason H. <jas...@zo...> - 2014-09-02 00:01:43
Here you go, thanks! -Jason On Aug 21, 2014, at 12:56 PM, Serge Koren <se...@um...> wrote: > Hi, > > Sorry for the delayed reply, I missed your post in my email. The high heterozygosity could definitely have an effect on the throughput of the correction. I would suggest increasing the sensitivity further and not specifying -pbCNS on your command line (this consensus module is faster but less robust to higher error data and so could be negatively affected by heterozygosity). > mhap = "-k 14 --num-hashes 768 --num-min-matches 3 --threshold 0.04" > merSize = 14 > > If you could send your asm.layout.err file, I can get more information and confirm whether the low output is due to the consensus or the sensitivity parameters. > > Sergey > > On Aug 18, 2014, at 12:52 PM, Jason Hill <jas...@zo...> wrote: > >> Hello PBcR and WGS community, >> >> I’m working with what should be 100x pacbio coverage and after using PBcR I’m ending up with at best 7x - 8x of corrected reads. My initial read set is about 11million reads, with an average length of 3000bp. After error correction my best run resulted in 1.2million reads with an average length of 2000bp. My genome has a relatively high heterozygosity as a terrestrial insect. I’ve adjusted both max_coverage and increased genome size to try to account for this but see fewer and shorter reads than using the default PBcR parameters. My current run is being done with following the command spec file. I’m using the latest version of all WGS, 8.2b. >> >> ############## pacbio.spec ############# >> assemble = 0 >> localStaging = /wgs_pacbio_assembly/PBcR_self_correction/staging >> >> #faster overlapper with more sensitive settings >> mhap = "-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04" >> merSize = 16 >> >> #system memory parameters to avoid fraction bug >> ovlMemory = 512 >> ovlStoreMemory = 512000 >> merylMemory = 512000 >> >> #increase coverage depth to counter heterozygosity/error rate >> #usually results in less corrected reads >> maxCoverage = 60 >> >> #increase genome size to counter heterozygosity, actual genome size 350MB >> #usually results in less corrected reads >> genomeSize = 500000000 >> ##################################### >> >> $PBcR -pbCNS\ >> -length 300\ >> -partitions 65\ >> -l corrected_pb_1\ >> -t 64\ >> -s pacbio.spec\ >> -noclean\ >> -fastq pb.fastq 2>&1 | tee self_corrected_pb_1.log >> >> When looking at the corrected read lists in the temporary directory I see what appear to be deleted reads of a length I would assume would make the cut, for example: >> >>> 100003680002,3680002 mate=0,0 lib=corrected_pb_1,1 clr=LATEST,1,2219 deleted=1 >> cgtatgtaaaccaattttatactgatggggcgcgaaataacttttcttaagttccttgtgtccaaaca… continues for a total of 2219 bp. >> >> As it is, none of the overlap layout assemblers can do much with the low coverage I end up with so I’m very eager to hear ideas of how I can move this forward. Would you please take a look and let me know how you would proceed? I would be happy to supply any additional information and files. >> >> -Jason >> >> >> >> >> > |
From: Serge K. <se...@um...> - 2014-08-21 19:56:43
Hi, Sorry for the delayed reply, I missed your post in my email. The high heterozygosity could definitely have an effect on the throughput of the correction. I would suggest increasing the sensitivity further and not specifying -pbCNS on your command line (this consensus module is faster but less robust to higher error data and so could be negatively affected by heterozygosity). mhap = "-k 14 --num-hashes 768 --num-min-matches 3 --threshold 0.04" merSize = 14 If you could send your asm.layout.err file, I can get more information and confirm whether the low output is due to the consensus or the sensitivity parameters. Sergey On Aug 18, 2014, at 12:52 PM, Jason Hill <jas...@zo...> wrote: > Hello PBcR and WGS community, > > I’m working with what should be 100x pacbio coverage and after using PBcR I’m ending up with at best 7x - 8x of corrected reads. My initial read set is about 11million reads, with an average length of 3000bp. After error correction my best run resulted in 1.2million reads with an average length of 2000bp. My genome has a relatively high heterozygosity as a terrestrial insect. I’ve adjusted both max_coverage and increased genome size to try to account for this but see fewer and shorter reads than using the default PBcR parameters. My current run is being done with following the command spec file. I’m using the latest version of all WGS, 8.2b. > > ############## pacbio.spec ############# > assemble = 0 > localStaging = /wgs_pacbio_assembly/PBcR_self_correction/staging > > #faster overlapper with more sensitive settings > mhap = "-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04" > merSize = 16 > > #system memory parameters to avoid fraction bug > ovlMemory = 512 > ovlStoreMemory = 512000 > merylMemory = 512000 > > #increase coverage depth to counter heterozygosity/error rate > #usually results in less corrected reads > maxCoverage = 60 > > #increase genome size to counter heterozygosity, actual genome size 350MB > #usually results in less corrected reads > genomeSize = 500000000 > ##################################### > > $PBcR -pbCNS\ > -length 300\ > -partitions 65\ > -l corrected_pb_1\ > -t 64\ > -s pacbio.spec\ > -noclean\ > -fastq pb.fastq 2>&1 | tee self_corrected_pb_1.log > > When looking at the corrected read lists in the temporary directory I see what appear to be deleted reads of a length I would assume would make the cut, for example: > >> 100003680002,3680002 mate=0,0 lib=corrected_pb_1,1 clr=LATEST,1,2219 deleted=1 > cgtatgtaaaccaattttatactgatggggcgcgaaataacttttcttaagttccttgtgtccaaaca… continues for a total of 2219 bp. > > As it is, none of the overlap layout assemblers can do much with the low coverage I end up with so I’m very eager to hear ideas of how I can move this forward. Would you please take a look and let me know how you would proceed? I would be happy to supply any additional information and files. > > -Jason > > > > > |
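Concretely, the suggestion above amounts to lowering the mhap sensitivity settings in the spec file and dropping -pbCNS from the PBcR invocation. The sketch below is illustrative only, not a tested configuration: it reuses the paths and remaining option values from Jason's original post and changes nothing else.

############## pacbio.spec (adjusted per the suggestion above) #############
assemble = 0
localStaging = /wgs_pacbio_assembly/PBcR_self_correction/staging

#more sensitive overlapper settings for a highly heterozygous genome
mhap = "-k 14 --num-hashes 768 --num-min-matches 3 --threshold 0.04"
merSize = 14

#memory, maxCoverage and genomeSize settings unchanged from the original spec
#####################################

# same command as before, but without -pbCNS so the slower, more robust consensus module is used
$PBcR -length 300 -partitions 65 -l corrected_pb_1 -t 64 \
  -s pacbio.spec -noclean -fastq pb.fastq 2>&1 | tee self_corrected_pb_1.log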
From: Jason H. <jas...@zo...> - 2014-08-18 17:11:24
Hello PBcR and WGS community,

I'm working with what should be 100x PacBio coverage, and after using PBcR I'm ending up with at best 7x-8x of corrected reads. My initial read set is about 11 million reads with an average length of 3000 bp. After error correction my best run resulted in 1.2 million reads with an average length of 2000 bp. My genome (a terrestrial insect) has relatively high heterozygosity. I've adjusted both maxCoverage and genome size to try to account for this, but I see fewer and shorter reads than with the default PBcR parameters. My current run uses the following command and spec file. I'm using the latest version of WGS, 8.2b.

############## pacbio.spec #############
assemble = 0
localStaging = /wgs_pacbio_assembly/PBcR_self_correction/staging

#faster overlapper with more sensitive settings
mhap = "-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04"
merSize = 16

#system memory parameters to avoid fraction bug
ovlMemory = 512
ovlStoreMemory = 512000
merylMemory = 512000

#increase coverage depth to counter heterozygosity/error rate
#usually results in fewer corrected reads
maxCoverage = 60

#increase genome size to counter heterozygosity, actual genome size 350MB
#usually results in fewer corrected reads
genomeSize = 500000000
#####################################

$PBcR -pbCNS \
  -length 300 \
  -partitions 65 \
  -l corrected_pb_1 \
  -t 64 \
  -s pacbio.spec \
  -noclean \
  -fastq pb.fastq 2>&1 | tee self_corrected_pb_1.log

When looking at the corrected read lists in the temporary directory, I see what appear to be deleted reads of a length I would assume would make the cut, for example:

>100003680002,3680002 mate=0,0 lib=corrected_pb_1,1 clr=LATEST,1,2219 deleted=1
cgtatgtaaaccaattttatactgatggggcgcgaaataacttttcttaagttccttgtgtccaaaca… (continues for a total of 2219 bp)

As it is, none of the overlap-layout assemblers can do much with the low coverage I end up with, so I'm very eager to hear ideas for how I can move this forward. Would you please take a look and let me know how you would proceed? I would be happy to supply any additional information and files.

-Jason
From: Brian W. <th...@gm...> - 2014-07-30 13:31:51
My message with the patch didn't seem to make it into the archive completely. The patch is there, but the message text isn't.

Here's the patch: https://sourceforge.net/p/wgs-assembler/mailman/message/32480476/
You can read the text in this reply: https://sourceforge.net/p/wgs-assembler/mailman/message/32481695/

The two values you need to change are from gatekeeper -dumpinfo. Search for "pacBio" to find where they are in the code.

b

On Tue, Jul 29, 2014 at 5:29 PM, Brian Foster <bf...@lb...> wrote:
> Hello All,
>
> I think I am running into the same partitioning problem that was mentioned in a previous thread. I am getting a single relatively large partition with many smaller same-sized partitions and the overlapStore stage is failing. I am looking for the patch to overlapStoreBuild.C and can't seem to find it. Was that sent as an email attachment? Any help would be appreciated.
>
> Thanks,
> Brian
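For anyone applying the patch by hand, the steps implied above reduce to two commands. This is a sketch only: the gatekeeper call follows the -dumpinfo usage mentioned in the reply, and the source path is an assumption about where overlapStoreBuild.C sits in the wgs-assembler tree.

# report store statistics; the two values to transcribe come from this output
gatekeeper -dumpinfo asm.gkpStore

# locate the hard-coded values to edit before recompiling (directory is assumed)
grep -n "pacBio" src/AS_OVS/overlapStoreBuild.C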
From: Brian F. <bf...@lb...> - 2014-07-29 21:29:13
Hello All,

I think I am running into the same partitioning problem that was mentioned in a previous thread. I am getting a single relatively large partition with many smaller same-sized partitions, and the overlapStore stage is failing. I am looking for the patch to overlapStoreBuild.C and can't seem to find it. Was that sent as an email attachment? Any help would be appreciated.

Thanks,
Brian
From: Brian W. <th...@gm...> - 2014-07-25 23:31:39
Hi, Heiner-

Wow, you've got an old version. ;-) Those two options don't exist in the latest code.

'rebuild repeats' would take all the reads detected by bogart as being repetitive, and do a second unitigging using just those reads. The idea was that maybe we could collapse/separate repeats better if all the unique reads were removed. I never saw any huge gains from doing this.

'mate extension' was a similar idea. Find all the reads that are in repeats. Then, for each unitig, reconstruct it using the reads in the unitig PLUS any mated reads in the repeats. The end result was that the unitig should be extended into repeats, but only using mated reads. Similar result - kind of worked, but nothing spectacular.

They were both decent ideas (and fun to remember), but I don't think they'll help here. We all (should) know that repeats bigger than a read can't be resolved (in general). A corollary of this is that if repeats bigger than the smaller reads are resolved, then the smaller reads cannot be uniquely resolved. It just took enormously different sizes (4k pacbio and 0.1k illumina) to make this a problem.

I've been pleased with ECtools from the Schatz Lab (http://schatzlab.cshl.edu/data/ectools/). Assemble the Illumina to unitigs, use that to correct the pacbio, then assemble the pacbio. I wasn't so pleased by the effort it took to run it (this was 1/2 a year ago) and it might not scale past 1/2 Gbp. But the assemblies were quite good.

b

On Fri, Jul 25, 2014 at 12:00 PM, kuhl <ku...@mo...> wrote:
> Dear Brian,
>
> just a comment, would
>
> batRebuildRepeats = 1
> batMateExtension = 1
>
> help with this issue? I am also running long reads (~4000 bp) with short reads and found this to be helping with some issues I had with cgw. Anyway, I never could use the full memory with bogart with these parameters, because it crashed in step 10. I had to limit bogart to 100Gb RAM (on 2-3 Gbp vertebrate genomes). And then it worked. The result was lower N50 unitigs, but this was solved by cgw. Regarding missassemblies in scaffolds, I also find a lot, which are actually limiting the final N50 and are forcing me to do a lot of manual final polishing of the assemblies (splitting / rescaffolding / gap closing again etc). If I set "doUnitigSplitting = 1" it helps, but is there any way to speed this up, like doing the unitig splitting on partitions in parallel? Seems there is still no perfect solution for hybrid data assemblies....
>
> Heiner
From: Brian W. <th...@gm...> - 2014-07-25 23:07:49
If my suspicion is correct - keep in mind, all this is a total guess on what I imagine is happening - it's likely a mess that can be pushed to completion now. All the obvious scaffolding should be done already. Bump it out of the scaffold merging steps, but let the other cgw steps run. Possibly, you can get away with increasing the min weight (6? 8? no good guess), instead of manually forcing it to stop merging. On Fri, Jul 25, 2014 at 11:23 AM, Waldbieser, Geoff < Geo...@ar...> wrote: > So in this case adding the Illumina PE reads would not have helped? > > Is the graph trying to detangle or is it likely to be a mess that needs to > be axed now? > > > > > > *From:* Brian Walenz [mailto:th...@gm...] > *Sent:* Friday, July 25, 2014 8:11 AM > > *To:* Waldbieser, Geoff > *Subject:* Re: [wgs-assembler-users] Does scaffolding scale with > available RAM? > > > > Sorry, I owe you a few replies. I switched jobs, and now can't read gmail > at work, or work at home. > > It's not that the pacbio assembled through repeats, but that the pacbio > reads themselves get through (larger) repeats. Without the pacbio, bogart > will detect the repeat, notice that no read spans it, and excise it from > the unitig. With the pacbio, bogart again detects the repeat, but now that > a read spans it, the repeat is left in the unitig. > > That would be great, except that the repeat illumina mates are now a total > mess. With just illumina, the repeats are isolated to short unitigs, and > only those mates are a mess, but scaffolder was designed to handle this > case. With the longer repeats included in longer unitigs, and illumina > mates placed incorrectly in those, the scaffold graph is a mess. > > E.g., > > unitig1: unique1-repeatA-unique2 > unitig2: unique3-repeatB-unique4 (where repeatA and repeatB are related) > > It is possible to get a mate between repeatA and unique4, when really it > should be in repeatB. > > Your pacbio-only assembly was from correction of the pacbio with > illumina? I'm surprised it was that bad. > > > > > > On Mon, Jul 21, 2014 at 6:32 PM, Waldbieser, Geoff < > Geo...@ar...> wrote: > > First of all, thanks for saving us $100k on a high Mem server. > > > > When I mapped BAC end sequences to the Illumina-only assembly > (MaSuRCA-2.2.0) the avg insert length of contained mates was 165kb which > was on the dot for that BAC library. When I mapped to the PacBio-only > assembly the insert sizes were in the 30kb range, so I knew something was > wrong. That would support your idea of assembling through repeats and > perhaps through the wrong repeats. So I thought including the Illumina mate > pairs might help the PacBio assembly but apparently the MPs just made it > more convoluted. > > > > Aleksey had suggested not using the PacBio at all for assembly, just for > gap closure. Maybe it’s time to pull the plug on this one, maybe shred the > PacBio reads to overlapping 2kb lengths to use on MaSuRCA. But then again > it could end soon (I tell myself every day). Is there a reasonable way to > estimate how many contigs have been incorporated thus estimating how many > there are to go? > > > > > > > > |
From: kuhl <ku...@mo...> - 2014-07-25 16:18:46
Dear Brian, just a comment, would batRebuildRepeats = 1 batMateExtension = 1 help with this issue? I am also running long reads (~4000 bp) with short reads and found this to be helping with some issues I had with cgw. Anyway, I never could use the full memory with bogart with these parameters, because it crashed in step 10. I had to limit bogart to 100Gb RAM (on 2-3 Gbp vertebrate genomes). And then it worked. The result was lower N50 unitigs, but this was solved by cgw. Regarding missassemblies in scaffolds, I also find a lot, which are actually limiting the final N50 and are forcing me to do a lot of manual final polishing of the assemblies (splitting / rescaffolding / gap closing again etc). If I set "doUnitigSplitting = 1" it helps, but is there any way to speed this up, like doing the unitig splitting on partitions in parallel? Seems there is still no perfect solution for hybrid data assemblies.... Heiner On Fri, 25 Jul 2014 15:23:47 +0000, "Waldbieser, Geoff" <Geo...@AR...> wrote: > So in this case adding the Illumina PE reads would not have helped? > Is the graph trying to detangle or is it likely to be a mess that needs to > be axed now? > > > From: Brian Walenz [mailto:th...@gm...] > Sent: Friday, July 25, 2014 8:11 AM > To: Waldbieser, Geoff > Subject: Re: [wgs-assembler-users] Does scaffolding scale with available > RAM? > > Sorry, I owe you a few replies. I switched jobs, and now can't read gmail > at work, or work at home. > It's not that the pacbio assembled through repeats, but that the pacbio > reads themselves get through (larger) repeats. Without the pacbio, bogart > will detect the repeat, notice that no read spans it, and excise it from > the unitig. With the pacbio, bogart again detects the repeat, but now that > a read spans it, the repeat is left in the unitig. > That would be great, except that the repeat illumina mates are now a total > mess. With just illumina, the repeats are isolated to short unitigs, and > only those mates are a mess, but scaffolder was designed to handle this > case. With the longer repeats included in longer unitigs, and illumina > mates placed incorrectly in those, the scaffold graph is a mess. > > E.g., > unitig1: unique1-repeatA-unique2 > unitig2: unique3-repeatB-unique4 (where repeatA and repeatB are related) > It is possible to get a mate between repeatA and unique4, when really it > should be in repeatB. > Your pacbio-only assembly was from correction of the pacbio with illumina? > I'm surprised it was that bad. > > > On Mon, Jul 21, 2014 at 6:32 PM, Waldbieser, Geoff > <Geo...@ar...<mailto:Geo...@ar...>> > wrote: > First of all, thanks for saving us $100k on a high Mem server. > > When I mapped BAC end sequences to the Illumina-only assembly > (MaSuRCA-2.2.0) the avg insert length of contained mates was 165kb which > was on the dot for that BAC library. When I mapped to the PacBio-only > assembly the insert sizes were in the 30kb range, so I knew something was > wrong. That would support your idea of assembling through repeats and > perhaps through the wrong repeats. So I thought including the Illumina mate > pairs might help the PacBio assembly but apparently the MPs just made it > more convoluted. > > Aleksey had suggested not using the PacBio at all for assembly, just for > gap closure. Maybe it’s time to pull the plug on this one, maybe shred the > PacBio reads to overlapping 2kb lengths to use on MaSuRCA. But then again > it could end soon (I tell myself every day). 
Is there a reasonable way to > estimate how many contigs have been incorporated thus estimating how many > there are to go? > > > > From: Brian Walenz [mailto:th...@gm...<mailto:th...@gm...>] > Sent: Monday, July 21, 2014 5:19 PM > > To: Waldbieser, Geoff > Subject: Re: [wgs-assembler-users] Does scaffolding scale with available > RAM? > > Yup, that looks like a perfectly well behaved process. I can't explain > what Linux is doing with the memory -- filesystem cache would be my guess > -- but the cgw process is small, and more importantly, getting 100% CPU and > using no swap. > My guess is that the PacBio sequenced/assembled through repeats, and the > illumina is now overlapping to the wrong repeat copy, resulting in a very > messy mate graph. Compare this against an illumina only assembly where > unitigs broke at repeat boundaries. The graph is much cleaner, but > possibly disjoint. > I think Aleksey Zimin @ UMD had some success removing overlaps where none > of the kmer seeds were 'unique', for some definition of unique. The > process was rather involved: build unitigs, then decide what isn't unique > (by counting kmers in the assembled unitigs), recompute overlaps, and > re-unitig. I've never seen code to do it, nor the results. Just word of > mouth. > > > On Mon, Jul 21, 2014 at 9:21 AM, Waldbieser, Geoff > <Geo...@ar...<mailto:Geo...@ar...>> > wrote: > The Bri, > > So for Linux halfwits like me, I look at the Mem line and see that it’s > using about all the 512M RAM available. But then I look at the cgw command > line and see that it’s only using 5.7% of memory. So is that what you’re > talking about - that most of the RAM is taken up in cached data and only 5% > of the memory is actually involved in the active processes of cgw? > > [cid:image001.png@01CFA7F2.858DF130] > > The PacBio-only assemblies (no scaffolds) require about 2 days to > complete. The Illumina-only assemblies complete in about 2 weeks. So in the > present case, when the Illumina mate pairs are added to PacBio data but > Illumina PE reads are not included, is it something like the PacBio data > not having the depth of coverage to identify the repetitive elements like > the deep Illumina PE data did, therefore the Illumina mates are aligning to > more repetitive sequence? > > Geoff > > > > > > From: Brian Walenz [mailto:th...@gm...<mailto:th...@gm...>] > Sent: Saturday, July 19, 2014 10:40 AM > > To: Waldbieser, Geoff > Subject: Re: [wgs-assembler-users] Does scaffolding scale with available > RAM? > > Aye, no improvement by moving to 3tb....assuming it's not paging on > whatever tiny machine it is running on now! > -recomputegaps, I think, only matters only at the start of the run, and > only on the later iterations. kickOutNonOvlContigs=0 is the previous > default, so no trouble there. Filter level 2 was developed during our > salmon assembly headache. It seemed to be as sensitive as the default, > maybe a little faster, and also decreased the 'huge gap in scaffold' > problem that results in massive slow downs and enormous (and incorrect) > scaffolds. > > > On Fri, Jul 18, 2014 at 1:38 PM, Waldbieser, Geoff > <Geo...@ar...<mailto:Geo...@ar...>> > wrote: > Maybe I have exacerbated the slowdown by using ‘cgwMergeFilterLevel=2 > –recomputegaps’ and ‘kickOutNonOvlContigs = 0’? At least for now it seems > to be avoiding the 50Mb incorrect scaffold or the constant cycle of > merge/exclude specific contigs. If it’s a good assembly then it will have > been worth the time. 
> > From: Brian Walenz [mailto:th...@gm...<mailto:th...@gm...>] > Sent: Thursday, July 17, 2014 5:29 AM > To: Waldbieser, Geoff > Subject: Re: [wgs-assembler-users] Does scaffolding scale with available > RAM? > > Hi, Geoff- > Sadly, no control over memory in CGW. Its already using the most it can. > Most of the memory usage is for caching untigis/contigs, if space is really > tight, the cache can be turned off and they'll be loaded from disk every > time. Not what you're after. > Before we had a large memory machine, I ran a ~200gb CGW on a 128gb > machine. It ran perfectly fine. The infrequently used unitigs/contigs > ended up swapped out, just as if the cache was disabled. So, unless your > CGW process is much much bigger than 512gb, you won't gain anything. > There are a few options that can make significant improvements in run > time. cgwMergeFilterLevel of 2 should be a little faster and not that much > worse. cgwMergeFilterLevel of 5 will be quite speedy, but not aggressive. > cgwMinMergeWeight sets the minimum number of mates that are needed to > attempt a scaffold join; default is 2. This is shown in the logs. If it > gets stuck doing a bunch of weight 2 merges, increasing to 3 will help, but > could sacrifice some joins. > > b > > On Wed, Jul 16, 2014 at 4:07 PM, Waldbieser, Geoff > <Geo...@ar...<mailto:Geo...@ar...>> > wrote: > Hi Brian, > I’m once again using a calendar to measure a scaffolding job (basically > scaffolding PacBio reads with Illumina mate pairs). Does the scaffolding > speed scale with increases in RAM? The current setup has 512GB RAM but if > this were to run on a node that contains 1TB or 2TB RAM would the job be ½ > or ¼ the length of time? > > Geoff > > > Geoff Waldbieser > USDA, ARS, Warmwater Aquaculture Research Unit > 141 Experiment Station Road > Stoneville, Mississippi 38776 > Ofc. 662-686-3593<tel:662-686-3593> > Fax. 662-686-3567<tel:662-686-3567> > > > > > > This electronic message contains information generated by the USDA solely > for the intended recipients. Any unauthorized interception of this message > or the use or disclosure of the information it contains may violate the law > and subject the violator to civil or criminal penalties. If you believe you > have received this message in error, please notify the sender and delete > the email immediately. > > ------------------------------------------------------------------------------ > Want fast and easy access to all the code in your enterprise? Index and > search up to 200,000 lines of code with a free copy of Black Duck > Code Sight - the same software that powers the world's largest code > search on Ohloh, the Black Duck Open Hub! Try it now. > http://p.sf.net/sfu/bds > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li...<mailto:wgs...@li...> > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users -- --------------------------------------------------------------- Dr. Heiner Kuhl MPI Molecular Genetics Tel: + 49 + 30 / 8413 1776 Next Generation Sequencing Ihnestrasse 73 email: ku...@mo... D-14195 Berlin http://www.molgen.mpg.de/SeqCore --------------------------------------------------------------- |
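For reference, the scaffolder controls discussed in this thread are ordinary spec-file settings. The snippet below only restates the values mentioned above as they would appear in a spec; it is not a general recommendation for other data sets.

# cgw scaffold-merging controls (see the replies quoted above)
cgwMergeFilterLevel = 2     # 2 is a little faster than the default; 5 is much faster but less aggressive
cgwMinMergeWeight = 3       # minimum mates needed to attempt a join; default 2, raise to 3 (or 6-8) to skip weak merges
kickOutNonOvlContigs = 0    # the previous default, as noted above
doUnitigSplitting = 1       # Heiner's setting; reported to help with scaffold misassemblies, at the cost of runtime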
From: 任一 <upf...@gm...> - 2014-07-18 01:00:21
Hi All:

I tried PBcR in wgs-8.2alpha for assembly of the phage and E. coli sample data. Unfortunately it failed at 5-consensus; the errors are below. I also changed the parameter to "consensus=cns" and retried, but it still failed at the same step. Using the corrected data, I tested runCA from other versions such as 8.1, 8.0 and 7.0, and all of them failed. I thought it might be because of some older library in my OS? Can anyone help me? Thanks very much.

/mnt/lustre/users/renyi/bio-softs/wgs-download/sampledata/ecoli/ry/5-consensus/consensus.sh 1 > /dev/null 2>&1
----------------------------------------END CONCURRENT Thu Jul 17 18:52:47 2014 (8004 seconds)
/mnt/lustre/users/renyi/bio-softs/wgs-download/sampledata/ecoli/ry/5-consensus/asm_001 failed -- no .success.
================================================================================
runCA failed.
----------------------------------------
Stack trace:
 at /mnt/lustre/users/renyi/bio-softs/wgs-8.2alpha/Linux-amd64/bin/runCA line 1568.
 main::caFailure("1 unitig consensus jobs failed; remove /mnt/lustre/users/reny"..., undef) called at /mnt/lustre/users/renyi/bio-softs/wgs-8.2alpha/Linux-amd64/bin/runCA line 4944
 main::postUnitiggerConsensus() called at /mnt/lustre/users/renyi/bio-softs/wgs-8.2alpha/Linux-amd64/bin/runCA line 6479
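When a consensus partition fails like this, the per-job output is discarded by the > /dev/null redirect shown in the log above, so the underlying error is never seen. A minimal sketch for getting at it is to rerun the failed partition by hand and keep its output; the command simply re-invokes the generated consensus.sh with the paths from the log, and the output filename is made up.

cd /mnt/lustre/users/renyi/bio-softs/wgs-download/sampledata/ecoli/ry/5-consensus
sh consensus.sh 1 2>&1 | tee consensus.001.rerun.err   # rerun job 1 without discarding stdout/stderr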
From: Waldbieser, G. <Geo...@AR...> - 2014-07-16 20:08:26
Hi Brian, I'm once again using a calendar to measure a scaffolding job (basically scaffolding PacBio reads with Illumina mate pairs). Does the scaffolding speed scale with increases in RAM? The current setup has 512GB RAM but if this were to run on a node that contains 1TB or 2TB RAM would the job be ½ or ¼ the length of time? Geoff Geoff Waldbieser USDA, ARS, Warmwater Aquaculture Research Unit 141 Experiment Station Road Stoneville, Mississippi 38776 Ofc. 662-686-3593 Fax. 662-686-3567 This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately. |
From: Serge K. <ser...@gm...> - 2014-07-11 16:26:04
Hi, For the PacBio raw reads you do need to have one fastq file. You can just concatenate all your smrtcell filtered data together or run the filtering on all the SMRTcells at once. For the correction data (Illumina/etc), you can provide an arbitrary number of FRG files which will get used for correction. On Jul 11, 2014, at 11:12 AM, nic blouin <nb...@ma...> wrote: > Hi there- > > I looked through the archives and din't see a post regarding this item, which makes me think i am being obtuse here. > > I wish to use pacBioToCa to correct a PacBio data set I have. > For error correction i can see that i can input an illumina dataset and off I go. > I have quite alot of gDNA data for this organism and would like to use it all thinking that more is beter. > From looking over the documentation i believe that I can submit only one correction file is this correct? > Or is there a way for me to include 4 read sets with different pared/mate distances to correct my PacBio data? > For example i have a 4 illumina runs with 300 bp, 500 bp, 4kb, and 7 kb inserts respectively. > > Thanks for any advice. > > > > > nic > > > Nicolas Achille Blouin, Ph.D. > Dept. of Biological Sciences > University of Rhode Island > 120 Flagg Road, CBLS 260 > Kingston, RI 02881 > > ------------------------------------------------------------------------------ > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
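A minimal sketch of what this looks like in practice is below. All file names are hypothetical, the FRG files are assumed to have been built beforehand with fastqToCA (one per Illumina library), and the option values are copied from the PBcR commands elsewhere in this archive rather than tuned for this data set.

# one fastq holding all raw PacBio reads, concatenated across SMRT cells
cat smrtcell_1.filtered_subreads.fastq smrtcell_2.filtered_subreads.fastq > all_pacbio.fastq

# correction run with several Illumina libraries supplied as trailing FRG files
# (the same pattern applies to pacBioToCA)
PBcR -length 500 -partitions 200 -l corrected_pb -t 16 -s pacbio.spec \
  -fastq all_pacbio.fastq \
  illumina_300bp.frg illumina_500bp.frg illumina_4kb.frg illumina_7kb.frg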
From: nic b. <nb...@ma...> - 2014-07-11 16:11:08
Hi there-

I looked through the archives and didn't see a post regarding this item, which makes me think I am being obtuse here.

I wish to use pacBioToCA to correct a PacBio data set I have. For error correction I can see that I can input an Illumina dataset and off I go. I have quite a lot of gDNA data for this organism and would like to use it all, thinking that more is better. From looking over the documentation I believe that I can submit only one correction file; is this correct? Or is there a way for me to include 4 read sets with different paired/mate distances to correct my PacBio data? For example, I have 4 Illumina runs with 300 bp, 500 bp, 4 kb, and 7 kb inserts respectively.

Thanks for any advice.

nic

Nicolas Achille Blouin, Ph.D.
Dept. of Biological Sciences
University of Rhode Island
120 Flagg Road, CBLS 260
Kingston, RI 02881
From: Serge K. <ser...@gm...> - 2014-06-27 19:23:13
The reason the last file didn't have an error is because it is only performing a self-comparison since overlaps are symmetric so it doesn't use the stream directory. When you specified the localScratch directory, did you remove all the temporary output and re-ran? Could you also send your overlap.sh file in 1-overlapper as well? On Jun 27, 2014, at 3:17 PM, Matthew Conte <co...@gm...> wrote: > 1.err is attached. 1.hash.err didn't get created. > > Also the overlap was broken up into 34 parts and only the last part (34.err) didn't have the "java.io.FileNotFoundException" in it, the rest all did. > > -Matt > > > On Fri, Jun 27, 2014 at 12:20 PM, Serge Koren <ser...@gm...> wrote: > Hmm, that is strange. Could you send the output in your 1.hash.err and 1.err files? > > Sergey > > On Jun 26, 2014, at 4:59 PM, Matthew Conte <co...@gm...> wrote: > >> Hi, >> >> I had tried adding the localStaging flag, but still got the same "java.io.FileNotFoundException" during the overlap step. I did try out the lambda phage sample data set and it ran fine so I don't think it is something with my installation. >> >> We currently only have 16X but are thinking of going higher. I wanted to try a de novo assembly with this current dataset and MHAP finally seems like a reasonable way to do so =) >> >> Thanks, >> Matt >> >> >> On Thu, Jun 26, 2014 at 2:34 PM, Serge Koren <ser...@gm...> wrote: >> Hi, >> >> Thanks, yes this looks like a bug in that the code recognized your genome is too big to do the precompute but didn't properly turn it off. Adding the localStaging="<path to local disk on node>" should let you work around the issue. We will make a new release candidate and fix this bug and the other one you encountered. I will say that with 16X you are probably not going to get a very good assembly because you'll likely have less than 10X after correction. I'd suggest trying ECTools as well (https://github.com/jgurtowski/ectools) as it is designed to work best with coverage in the 10-20X range in combination with short-read sequencing data. >> >> Sergey >> >> On Jun 25, 2014, at 2:33 PM, Matthew Conte <co...@gm...> wrote: >> >>> Hi Serge, >>> >>> On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> wrote: >>> Hi, >>> >>> On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: >>> >>>> Hi all, >>>> >>>> I'm trying out PBcR to make use of the new MHAP overlapper for self correcting a set of PacBio reads and I'm running into an issue. >>>> >>>> I'm getting the following errors in the temp_dir/1-overlapper/1.err: >>>> Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory) >>> The dat file is a pre-computed index that is used to speed up the computation for smaller genomes. For larger genomes or if you are using local disk, it should not get created. Do you have the output of the pipeline up to this step along with the command-line you used to start the run? That will help diagnose why it is not properly recognizing that the index is not built. As a workaround, you can add "localStaging=</path to local disk>" to your PBcR command which will force the pipeline to never pre-compute the index. 
>>> >>> The command that I ran was: >>> /sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 -partitions 200 -threads 27 -lib >>> aryname PBcR -s pacbio.spec fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize 1000000000 >>> >>> I changed the MHAP settings according to the PBcR wiki since I only have about 16x coverage of PacBio data. >>> >>> I should mention that runCA continues to run until the '5-consensus' step, and errors out there. But I think the start of the problem is at this overlap step. >>> >>> The relevant output was: >>> ### Reading options from 'pacbio.spec' >>> ### Reading options from the command line. >>> >>> Warning: no frag files specified, assuming self-correction of pacbio sequences. >>> Running with 27 threads and 200 partitions >>> ********* Starting correction... >>> ... >>> ******** Configuration Summary ******** >>> bankPath = >>> maxCoverage = 40 >>> ... >>> mhap = -k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging >>> ovlRefBlockLength = 100000000000 >>> cnsErrorRate = 0.25 >>> ... >>> ----------------------------------------START Wed Jun 25 11:24:30 2014 >>> mkdir tempPBcR >>> ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) >>> ----------------------------------------START Wed Jun 25 11:24:30 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger -technology none -feature doConsensusCorrection 1 -reads /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > /path_to_working_dir//tempPBcR/PBcR.frg >>> ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) >>> ----------------------------------------START Wed Jun 25 11:24:30 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/runCA -s /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR stopAfter=initialStoreBuilding /path_to_working_dir//tempPBcR/PBcR.frg >>> ----------------------------------------START Wed Jun 25 11:24:30 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -o /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING -F /path_to_working_dir//tempPBcR/PBcR.frg > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1 >>> ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) >>> numFrags = 2995674 >>> Stop requested after 'initialstorebuilding'. 
>>> ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) >>> Will be correcting PacBio library 1 with librarie[s] 1 - 1 >>> ----------------------------------------START Wed Jun 25 11:35:29 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert -tabular -longestovermin 1 500 -longestlength 1 8268329152 /path_to_working_dir//tempPBcR/asm.gkpStore 2> /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > /path_to_working_dir//tempPBcR/asm.toerase.uid >>> ----------------------------------------END Wed Jun 25 11:35:38 2014 (9 seconds) >>> ----------------------------------------START Wed Jun 25 11:35:38 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit /path_to_working_dir//tempPBcR/asm.toerase.uid /path_to_working_dir//tempPBcR/asm.gkpStore > /path_to_working_dir//tempPBcR/asm.toerase.out 2> /path_to_working_dir//tempPBcR/asm.toerase.err >>> ----------------------------------------END Wed Jun 25 11:35:44 2014 (6 seconds) >>> Running with 8.268329256X (for genome size 1000000000) of PBcR sequences (8268329256 bp). >>> Correcting with 16X sequences (16536658304 bp). >>> Warning: performing self-correction with a total of 16. For best performance, at least 50 is recommended. >>> ----------------------------------------START Wed Jun 25 11:35:44 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count -m 16 -s 120000000 -t 32 -o /path_to_working_dir//tempPBcR/asm.mers /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq >>> ----------------------------------------END Wed Jun 25 12:05:11 2014 (1767 seconds) >>> ----------------------------------------START Wed Jun 25 12:05:11 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir//tempPBcR/asm.hist >>> ----------------------------------------END Wed Jun 25 12:09:10 2014 (239 seconds) >>> ----------------------------------------START Wed Jun 25 12:09:10 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . 
-rnk2> /path_to_working_dir//tempPBcR/asm.ignore >>> ----------------------------------------END Wed Jun 25 12:21:17 2014 (727 seconds) >>> ----------------------------------------START Wed Jun 25 12:21:17 2014 >>> rm /path_to_working_dir//tempPBcR/asm.mers* >>> ----------------------------------------END Wed Jun 25 12:21:23 2014 (6 seconds) >>> ----------------------------------------START Wed Jun 25 12:21:23 2014 >>> mkdir /path_to_working_dir//tempPBcR/1-overlapper >>> ----------------------------------------END Wed Jun 25 12:21:23 2014 (0 seconds) >>> ----------------------------------------START Wed Jun 25 12:21:23 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID >>> ----------------------------------------END Wed Jun 25 12:21:28 2014 (5 seconds) >>> ----------------------------------------START Wed Jun 25 12:21:28 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen >>> ----------------------------------------END Wed Jun 25 12:21:33 2014 (5 seconds) >>> ----------------------------------------START CONCURRENT Wed Jun 25 12:21:33 2014 >>> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1 >>> Scanning store to find libraries used and reads to dump. >>> Added 0 reads to maintain mate relationships. >>> Dumping 0 fragments from unknown library (version 1 has these) >>> Dumping 133125 fragments from library IID 1 >>> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2 >>> Scanning store to find libraries used and reads to dump. >>> Added 0 reads to maintain mate relationships. >>> ... >>> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23 >>> Scanning store to find libraries used and reads to dump. >>> Added 0 reads to maintain mate relationships. >>> Dumping 0 fragments from unknown library (version 1 has these) >>> Dumping 66924 fragments from library IID 1 >>> ----------------------------------------END CONCURRENT Wed Jun 25 12:27:16 2014 (343 seconds) >>> ----------------------------------------START CONCURRENT Wed Jun 25 12:27:16 2014 >>> /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1 >>> Running partition 000001 with options -h 1-133125 -r 133126-1597500 start 133125 end 1597500 total 1464375 zero job 0 and stride 1 >>> /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2 >>> Running partition 000002 with options -h 1-133125 -r 1597501-2995674 start 1597500 end 2995674 total 1398174 zero job 0 and stride 1 >>> ... >>> >>> >>> Thanks, >>> Matt >>> >>> >>>> >>>> There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file where the 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. >>>> >>>> Also, I've found another minor issue with the '-threads' option supplied to PBcR on the command line. It doesn't seem to use the number of threads supplied and simply uses the max number of cpus on the machine available. >>> Thanks, I'll check this and update the code. 
>>>> >>>> Thanks, >>>> Matt >>>> ------------------------------------------------------------------------------ >>>> Open source business process management suite built on Java and Eclipse >>>> Turn processes into business applications with Bonita BPM Community Edition >>>> Quickly connect people, data, and systems into organized workflows >>>> Winner of BOSSIE, CODIE, OW2 and Gartner awards >>>> http://p.sf.net/sfu/Bonitasoft_______________________________________________ >>>> wgs-assembler-users mailing list >>>> wgs...@li... >>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>> >>> >> >> > > > <1.err> |
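One detail worth noting from the commands quoted in this thread: the localStaging=... setting was placed inside the quoted mhap string, and the Configuration Summary shows it being carried along as part of the mhap value, which may be why the workaround did not take effect. A sketch of passing it separately is below; the path is a placeholder and the spec syntax follows the other spec files in this archive.

# in pacbio.spec: keep localStaging as its own setting, outside the quoted mhap value
localStaging = /local/scratch/pbcr_staging
mhap = "-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04"
merSize = 16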
From: Serge K. <ser...@gm...> - 2014-06-27 16:20:56
Hmm, that is strange. Could you send the output in your 1.hash.err and 1.err files? Sergey On Jun 26, 2014, at 4:59 PM, Matthew Conte <co...@gm...> wrote: > Hi, > > I had tried adding the localStaging flag, but still got the same "java.io.FileNotFoundException" during the overlap step. I did try out the lambda phage sample data set and it ran fine so I don't think it is something with my installation. > > We currently only have 16X but are thinking of going higher. I wanted to try a de novo assembly with this current dataset and MHAP finally seems like a reasonable way to do so =) > > Thanks, > Matt > > > On Thu, Jun 26, 2014 at 2:34 PM, Serge Koren <ser...@gm...> wrote: > Hi, > > Thanks, yes this looks like a bug in that the code recognized your genome is too big to do the precompute but didn't properly turn it off. Adding the localStaging="<path to local disk on node>" should let you work around the issue. We will make a new release candidate and fix this bug and the other one you encountered. I will say that with 16X you are probably not going to get a very good assembly because you'll likely have less than 10X after correction. I'd suggest trying ECTools as well (https://github.com/jgurtowski/ectools) as it is designed to work best with coverage in the 10-20X range in combination with short-read sequencing data. > > Sergey > > On Jun 25, 2014, at 2:33 PM, Matthew Conte <co...@gm...> wrote: > >> Hi Serge, >> >> On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> wrote: >> Hi, >> >> On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: >> >>> Hi all, >>> >>> I'm trying out PBcR to make use of the new MHAP overlapper for self correcting a set of PacBio reads and I'm running into an issue. >>> >>> I'm getting the following errors in the temp_dir/1-overlapper/1.err: >>> Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory) >> The dat file is a pre-computed index that is used to speed up the computation for smaller genomes. For larger genomes or if you are using local disk, it should not get created. Do you have the output of the pipeline up to this step along with the command-line you used to start the run? That will help diagnose why it is not properly recognizing that the index is not built. As a workaround, you can add "localStaging=</path to local disk>" to your PBcR command which will force the pipeline to never pre-compute the index. >> >> The command that I ran was: >> /sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 -partitions 200 -threads 27 -lib >> aryname PBcR -s pacbio.spec fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize 1000000000 >> >> I changed the MHAP settings according to the PBcR wiki since I only have about 16x coverage of PacBio data. >> >> I should mention that runCA continues to run until the '5-consensus' step, and errors out there. But I think the start of the problem is at this overlap step. >> >> The relevant output was: >> ### Reading options from 'pacbio.spec' >> ### Reading options from the command line. >> >> Warning: no frag files specified, assuming self-correction of pacbio sequences. >> Running with 27 threads and 200 partitions >> ********* Starting correction... >> ... >> ******** Configuration Summary ******** >> bankPath = >> maxCoverage = 40 >> ... 
>> mhap = -k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging >> ovlRefBlockLength = 100000000000 >> cnsErrorRate = 0.25 >> ... >> ----------------------------------------START Wed Jun 25 11:24:30 2014 >> mkdir tempPBcR >> ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) >> ----------------------------------------START Wed Jun 25 11:24:30 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger -technology none -feature doConsensusCorrection 1 -reads /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > /path_to_working_dir//tempPBcR/PBcR.frg >> ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) >> ----------------------------------------START Wed Jun 25 11:24:30 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/runCA -s /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR stopAfter=initialStoreBuilding /path_to_working_dir//tempPBcR/PBcR.frg >> ----------------------------------------START Wed Jun 25 11:24:30 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -o /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING -F /path_to_working_dir//tempPBcR/PBcR.frg > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1 >> ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) >> numFrags = 2995674 >> Stop requested after 'initialstorebuilding'. >> ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) >> Will be correcting PacBio library 1 with librarie[s] 1 - 1 >> ----------------------------------------START Wed Jun 25 11:35:29 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert -tabular -longestovermin 1 500 -longestlength 1 8268329152 /path_to_working_dir//tempPBcR/asm.gkpStore 2> /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > /path_to_working_dir//tempPBcR/asm.toerase.uid >> ----------------------------------------END Wed Jun 25 11:35:38 2014 (9 seconds) >> ----------------------------------------START Wed Jun 25 11:35:38 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit /path_to_working_dir//tempPBcR/asm.toerase.uid /path_to_working_dir//tempPBcR/asm.gkpStore > /path_to_working_dir//tempPBcR/asm.toerase.out 2> /path_to_working_dir//tempPBcR/asm.toerase.err >> ----------------------------------------END Wed Jun 25 11:35:44 2014 (6 seconds) >> Running with 8.268329256X (for genome size 1000000000) of PBcR sequences (8268329256 bp). >> Correcting with 16X sequences (16536658304 bp). >> Warning: performing self-correction with a total of 16. For best performance, at least 50 is recommended. 
>> ----------------------------------------START Wed Jun 25 11:35:44 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count -m 16 -s 120000000 -t 32 -o /path_to_working_dir//tempPBcR/asm.mers /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq >> ----------------------------------------END Wed Jun 25 12:05:11 2014 (1767 seconds) >> ----------------------------------------START Wed Jun 25 12:05:11 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir//tempPBcR/asm.hist >> ----------------------------------------END Wed Jun 25 12:09:10 2014 (239 seconds) >> ----------------------------------------START Wed Jun 25 12:09:10 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . -rnk2> /path_to_working_dir//tempPBcR/asm.ignore >> ----------------------------------------END Wed Jun 25 12:21:17 2014 (727 seconds) >> ----------------------------------------START Wed Jun 25 12:21:17 2014 >> rm /path_to_working_dir//tempPBcR/asm.mers* >> ----------------------------------------END Wed Jun 25 12:21:23 2014 (6 seconds) >> ----------------------------------------START Wed Jun 25 12:21:23 2014 >> mkdir /path_to_working_dir//tempPBcR/1-overlapper >> ----------------------------------------END Wed Jun 25 12:21:23 2014 (0 seconds) >> ----------------------------------------START Wed Jun 25 12:21:23 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID >> ----------------------------------------END Wed Jun 25 12:21:28 2014 (5 seconds) >> ----------------------------------------START Wed Jun 25 12:21:28 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen >> ----------------------------------------END Wed Jun 25 12:21:33 2014 (5 seconds) >> ----------------------------------------START CONCURRENT Wed Jun 25 12:21:33 2014 >> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1 >> Scanning store to find libraries used and reads to dump. >> Added 0 reads to maintain mate relationships. >> Dumping 0 fragments from unknown library (version 1 has these) >> Dumping 133125 fragments from library IID 1 >> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2 >> Scanning store to find libraries used and reads to dump. >> Added 0 reads to maintain mate relationships. >> ... >> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23 >> Scanning store to find libraries used and reads to dump. >> Added 0 reads to maintain mate relationships. >> Dumping 0 fragments from unknown library (version 1 has these) >> Dumping 66924 fragments from library IID 1 >> ----------------------------------------END CONCURRENT Wed Jun 25 12:27:16 2014 (343 seconds) >> ----------------------------------------START CONCURRENT Wed Jun 25 12:27:16 2014 >> /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1 >> Running partition 000001 with options -h 1-133125 -r 133126-1597500 start 133125 end 1597500 total 1464375 zero job 0 and stride 1 >> /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2 >> Running partition 000002 with options -h 1-133125 -r 1597501-2995674 start 1597500 end 2995674 total 1398174 zero job 0 and stride 1 >> ... 
>> >> >> Thanks, >> Matt >> >> >>> >>> There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file where the 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. >>> >>> Also, I've found another minor issue with the '-threads' option supplied to PBcR on the command line. It doesn't seem to use the number of threads supplied and simply uses the max number of cpus on the machine available. >> Thanks, I'll check this and update the code. >>> >>> Thanks, >>> Matt >>> ------------------------------------------------------------------------------ >>> Open source business process management suite built on Java and Eclipse >>> Turn processes into business applications with Bonita BPM Community Edition >>> Quickly connect people, data, and systems into organized workflows >>> Winner of BOSSIE, CODIE, OW2 and Gartner awards >>> http://p.sf.net/sfu/Bonitasoft_______________________________________________ >>> wgs-assembler-users mailing list >>> wgs...@li... >>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> > > |
From: Matthew C. <co...@gm...> - 2014-06-26 21:00:27
Hi, I had tried adding the localStaging flag, but still got the same " *java.io.FileNotFoundException*" during the overlap step. I did try out the lambda phage sample data set and it ran fine so I don't think it is something with my installation. We currently only have 16X but are thinking of going higher. I wanted to try a de novo assembly with this current dataset and MHAP finally seems like a reasonable way to do so =) Thanks, Matt On Thu, Jun 26, 2014 at 2:34 PM, Serge Koren <ser...@gm...> wrote: > Hi, > > Thanks, yes this looks like a bug in that the code recognized your genome > is too big to do the precompute but didn't properly turn it off. Adding the > localStaging="<path to local disk on node>" should let you work around the > issue. We will make a new release candidate and fix this bug and the other > one you encountered. I will say that with 16X you are probably not going to > get a very good assembly because you'll likely have less than 10X after > correction. I'd suggest trying ECTools as well ( > https://github.com/jgurtowski/ectools) as it is designed to work best > with coverage in the 10-20X range in combination with short-read sequencing > data. > > Sergey > > On Jun 25, 2014, at 2:33 PM, Matthew Conte <co...@gm...> wrote: > > Hi Serge, > > On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> > wrote: > >> Hi, >> >> On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: >> >> Hi all, >> >> I'm trying out PBcR to make use of the new MHAP overlapper for self >> correcting a set of PacBio reads and I'm running into an issue. >> >> I'm getting the following errors in the temp_dir/1-overlapper/1.err: >> *Exception in thread "main" java.io.FileNotFoundException: >> /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat >> (No such file or directory)* >> >> The dat file is a pre-computed index that is used to speed up the >> computation for smaller genomes. For larger genomes or if you are using >> local disk, it should not get created. Do you have the output of the >> pipeline up to this step along with the command-line you used to start the >> run? That will help diagnose why it is not properly recognizing that the >> index is not built. As a workaround, you can add "localStaging=</path to >> local disk>" to your PBcR command which will force the pipeline to never >> pre-compute the index. >> > > The command that I ran was: > */sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 > --num-min-matches 3 --threshold 0.04 > localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 > -partitions 200 -threads 27 -lib* > *aryname PBcR -s pacbio.spec > fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize > 1000000000* > > I changed the MHAP settings according to the PBcR wiki since I only have > about 16x coverage of PacBio data. > > I should mention that runCA continues to run until the '5-consensus' step, > and errors out there. But I think the start of the problem is at this > overlap step. 
> > The relevant output was: > *### Reading options from 'pacbio.spec'* > *### Reading options from the command line.* > > *Warning: no frag files specified, assuming self-correction of pacbio > sequences.* > *Running with 27 threads and 200 partitions* > ********** Starting correction...* > *...* > ********* Configuration Summary ********* > *bankPath = * > *maxCoverage = 40* > *...* > *mhap = -k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 > localStaging=/path_to_working_dir/temp_staging* > *ovlRefBlockLength = 100000000000* > *cnsErrorRate = 0.25* > *...* > *----------------------------------------START Wed Jun 25 11:24:30 2014* > *mkdir tempPBcR* > *----------------------------------------END Wed Jun 25 11:24:30 2014 (0 > seconds)* > *----------------------------------------START Wed Jun 25 11:24:30 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger > -technology none -feature doConsensusCorrection 1 -reads > /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > > /path_to_working_dir//tempPBcR/PBcR.frg* > *----------------------------------------END Wed Jun 25 11:24:30 2014 (0 > seconds)* > *----------------------------------------START Wed Jun 25 11:24:30 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/runCA -s > /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR > stopAfter=initialStoreBuilding /path_to_working_dir//tempPBcR/PBcR.frg* > *----------------------------------------START Wed Jun 25 11:24:30 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -o > /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING -F > /path_to_working_dir//tempPBcR/PBcR.frg > > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1* > *----------------------------------------END Wed Jun 25 11:35:27 2014 (657 > seconds)* > *numFrags = 2995674* > *Stop requested after 'initialstorebuilding'.* > *----------------------------------------END Wed Jun 25 11:35:27 2014 (657 > seconds)* > *Will be correcting PacBio library 1 with librarie[s] 1 - 1* > *----------------------------------------START Wed Jun 25 11:35:29 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert > -tabular -longestovermin 1 500 -longestlength 1 8268329152 > /path_to_working_dir//tempPBcR/asm.gkpStore 2> > /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") > != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > > /path_to_working_dir//tempPBcR/asm.toerase.uid* > *----------------------------------------END Wed Jun 25 11:35:38 2014 (9 > seconds)* > *----------------------------------------START Wed Jun 25 11:35:38 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit > /path_to_working_dir//tempPBcR/asm.toerase.uid > /path_to_working_dir//tempPBcR/asm.gkpStore > > /path_to_working_dir//tempPBcR/asm.toerase.out 2> > /path_to_working_dir//tempPBcR/asm.toerase.err* > *----------------------------------------END Wed Jun 25 11:35:44 2014 (6 > seconds)* > *Running with 8.268329256X (for genome size 1000000000) of PBcR sequences > (8268329256 bp).* > *Correcting with 16X sequences (16536658304 bp).* > *Warning: performing self-correction with a total of 16. 
For best > performance, at least 50 is recommended.* > *----------------------------------------START Wed Jun 25 11:35:44 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count -m 16 -s 120000000 -t > 32 -o /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq* > *----------------------------------------END Wed Jun 25 12:05:11 2014 > (1767 seconds)* > *----------------------------------------START Wed Jun 25 12:05:11 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f > /path_to_working_dir//tempPBcR/asm.mers > > /path_to_working_dir//tempPBcR/asm.hist* > *----------------------------------------END Wed Jun 25 12:09:10 2014 (239 > seconds)* > *----------------------------------------START Wed Jun 25 12:09:10 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 > /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 > '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . > -rnk2> /path_to_working_dir//tempPBcR/asm.ignore* > *----------------------------------------END Wed Jun 25 12:21:17 2014 (727 > seconds)* > *----------------------------------------START Wed Jun 25 12:21:17 2014* > *rm /path_to_working_dir//tempPBcR/asm.mers** > *----------------------------------------END Wed Jun 25 12:21:23 2014 (6 > seconds)* > *----------------------------------------START Wed Jun 25 12:21:23 2014* > *mkdir /path_to_working_dir//tempPBcR/1-overlapper* > *----------------------------------------END Wed Jun 25 12:21:23 2014 (0 > seconds)* > *----------------------------------------START Wed Jun 25 12:21:23 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular > asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID* > *----------------------------------------END Wed Jun 25 12:21:28 2014 (5 > seconds)* > *----------------------------------------START Wed Jun 25 12:21:28 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular > asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen* > *----------------------------------------END Wed Jun 25 12:21:33 2014 (5 > seconds)* > *----------------------------------------START CONCURRENT Wed Jun 25 > 12:21:33 2014* > */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1* > *Scanning store to find libraries used and reads to dump.* > *Added 0 reads to maintain mate relationships.* > *Dumping 0 fragments from unknown library (version 1 has these)* > *Dumping 133125 fragments from library IID 1* > */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2* > *Scanning store to find libraries used and reads to dump.* > *Added 0 reads to maintain mate relationships.* > *...* > */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23* > *Scanning store to find libraries used and reads to dump.* > *Added 0 reads to maintain mate relationships.* > *Dumping 0 fragments from unknown library (version 1 has these)* > *Dumping 66924 fragments from library IID 1* > *----------------------------------------END CONCURRENT Wed Jun 25 > 12:27:16 2014 (343 seconds)* > *----------------------------------------START CONCURRENT Wed Jun 25 > 12:27:16 2014* > */path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1* > *Running partition 000001 with options -h 1-133125 -r 133126-1597500 start > 133125 end 1597500 total 1464375 zero job 0 and stride 1* > */path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2* > *Running partition 000002 with options -h 1-133125 -r 1597501-2995674 > start 1597500 end 2995674 total 1398174 zero 
job 0 and stride 1* > *...* > > > Thanks, > Matt > > >> >> >> There is no 'correct_reads_part000002.dat' file there, but there is a >> 'correct_reads_part000002.fasta' file where the >> 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is >> just an extension naming issue or if the .dat files weren't created >> properly. >> >> Also, I've found another minor issue with the '*-threads*' option >> supplied to PBcR on the command line. It doesn't seem to use the number of >> threads supplied and simply uses the max number of cpus on the machine >> available. >> >> Thanks, I'll check this and update the code. >> >> >> Thanks, >> Matt >> >> ------------------------------------------------------------------------------ >> Open source business process management suite built on Java and Eclipse >> Turn processes into business applications with Bonita BPM Community >> Edition >> Quickly connect people, data, and systems into organized workflows >> Winner of BOSSIE, CODIE, OW2 and Gartner awards >> >> http://p.sf.net/sfu/Bonitasoft_______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> >> > > |
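Since the localStaging workaround comes up several times in this exchange, here is a minimal sketch of the command with localStaging passed as its own key=value argument instead of inside the quoted mhap string (in the command shown earlier it sits inside the mhap quotes, which may or may not be why it was ignored; Sergey's reply points at a precompute bug as well). The staging path is a placeholder; the remaining flags are copied from the command already shown in this thread.

# Sketch only: /local/scratch/pbcr_staging is a placeholder for a real local disk.
/sw/wgs-8.2alpha/Linux-amd64/bin/PBcR \
    "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04" \
    merSize=16 \
    localStaging=/local/scratch/pbcr_staging \
    -length 500 -partitions 200 -threads 27 \
    -libraryname PBcR -s pacbio.spec \
    fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq \
    -genomeSize 1000000000

With localStaging recognized, the pipeline should never try to pre-compute the stream_1/*.dat index at all, per Sergey's description below.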
From: Serge K. <ser...@gm...> - 2014-06-26 18:34:25
|
Hi, Thanks, yes this looks like a bug in that the code recognized your genome is too big to do the precompute but didn't properly turn it off. Adding the localStaging="<path to local disk on node>" should let you work around the issue. We will make a new release candidate and fix this bug and the other one you encountered. I will say that with 16X you are probably not going to get a very good assembly because you'll likely have less than 10X after correction. I'd suggest trying ECTools as well (https://github.com/jgurtowski/ectools) as it is designed to work best with coverage in the 10-20X range in combination with short-read sequencing data. Sergey On Jun 25, 2014, at 2:33 PM, Matthew Conte <co...@gm...> wrote: > Hi Serge, > > On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> wrote: > Hi, > > On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: > >> Hi all, >> >> I'm trying out PBcR to make use of the new MHAP overlapper for self correcting a set of PacBio reads and I'm running into an issue. >> >> I'm getting the following errors in the temp_dir/1-overlapper/1.err: >> Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory) > The dat file is a pre-computed index that is used to speed up the computation for smaller genomes. For larger genomes or if you are using local disk, it should not get created. Do you have the output of the pipeline up to this step along with the command-line you used to start the run? That will help diagnose why it is not properly recognizing that the index is not built. As a workaround, you can add "localStaging=</path to local disk>" to your PBcR command which will force the pipeline to never pre-compute the index. > > The command that I ran was: > /sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 -partitions 200 -threads 27 -lib > aryname PBcR -s pacbio.spec fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize 1000000000 > > I changed the MHAP settings according to the PBcR wiki since I only have about 16x coverage of PacBio data. > > I should mention that runCA continues to run until the '5-consensus' step, and errors out there. But I think the start of the problem is at this overlap step. > > The relevant output was: > ### Reading options from 'pacbio.spec' > ### Reading options from the command line. > > Warning: no frag files specified, assuming self-correction of pacbio sequences. > Running with 27 threads and 200 partitions > ********* Starting correction... > ... > ******** Configuration Summary ******** > bankPath = > maxCoverage = 40 > ... > mhap = -k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging > ovlRefBlockLength = 100000000000 > cnsErrorRate = 0.25 > ... 
> ----------------------------------------START Wed Jun 25 11:24:30 2014 > mkdir tempPBcR > ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) > ----------------------------------------START Wed Jun 25 11:24:30 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger -technology none -feature doConsensusCorrection 1 -reads /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > /path_to_working_dir//tempPBcR/PBcR.frg > ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) > ----------------------------------------START Wed Jun 25 11:24:30 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/runCA -s /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR stopAfter=initialStoreBuilding /path_to_working_dir//tempPBcR/PBcR.frg > ----------------------------------------START Wed Jun 25 11:24:30 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -o /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING -F /path_to_working_dir//tempPBcR/PBcR.frg > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1 > ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) > numFrags = 2995674 > Stop requested after 'initialstorebuilding'. > ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) > Will be correcting PacBio library 1 with librarie[s] 1 - 1 > ----------------------------------------START Wed Jun 25 11:35:29 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert -tabular -longestovermin 1 500 -longestlength 1 8268329152 /path_to_working_dir//tempPBcR/asm.gkpStore 2> /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > /path_to_working_dir//tempPBcR/asm.toerase.uid > ----------------------------------------END Wed Jun 25 11:35:38 2014 (9 seconds) > ----------------------------------------START Wed Jun 25 11:35:38 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit /path_to_working_dir//tempPBcR/asm.toerase.uid /path_to_working_dir//tempPBcR/asm.gkpStore > /path_to_working_dir//tempPBcR/asm.toerase.out 2> /path_to_working_dir//tempPBcR/asm.toerase.err > ----------------------------------------END Wed Jun 25 11:35:44 2014 (6 seconds) > Running with 8.268329256X (for genome size 1000000000) of PBcR sequences (8268329256 bp). > Correcting with 16X sequences (16536658304 bp). > Warning: performing self-correction with a total of 16. For best performance, at least 50 is recommended. > ----------------------------------------START Wed Jun 25 11:35:44 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count -m 16 -s 120000000 -t 32 -o /path_to_working_dir//tempPBcR/asm.mers /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > ----------------------------------------END Wed Jun 25 12:05:11 2014 (1767 seconds) > ----------------------------------------START Wed Jun 25 12:05:11 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir//tempPBcR/asm.hist > ----------------------------------------END Wed Jun 25 12:09:10 2014 (239 seconds) > ----------------------------------------START Wed Jun 25 12:09:10 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . 
-rnk2> /path_to_working_dir//tempPBcR/asm.ignore > ----------------------------------------END Wed Jun 25 12:21:17 2014 (727 seconds) > ----------------------------------------START Wed Jun 25 12:21:17 2014 > rm /path_to_working_dir//tempPBcR/asm.mers* > ----------------------------------------END Wed Jun 25 12:21:23 2014 (6 seconds) > ----------------------------------------START Wed Jun 25 12:21:23 2014 > mkdir /path_to_working_dir//tempPBcR/1-overlapper > ----------------------------------------END Wed Jun 25 12:21:23 2014 (0 seconds) > ----------------------------------------START Wed Jun 25 12:21:23 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID > ----------------------------------------END Wed Jun 25 12:21:28 2014 (5 seconds) > ----------------------------------------START Wed Jun 25 12:21:28 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen > ----------------------------------------END Wed Jun 25 12:21:33 2014 (5 seconds) > ----------------------------------------START CONCURRENT Wed Jun 25 12:21:33 2014 > /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1 > Scanning store to find libraries used and reads to dump. > Added 0 reads to maintain mate relationships. > Dumping 0 fragments from unknown library (version 1 has these) > Dumping 133125 fragments from library IID 1 > /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2 > Scanning store to find libraries used and reads to dump. > Added 0 reads to maintain mate relationships. > ... > /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23 > Scanning store to find libraries used and reads to dump. > Added 0 reads to maintain mate relationships. > Dumping 0 fragments from unknown library (version 1 has these) > Dumping 66924 fragments from library IID 1 > ----------------------------------------END CONCURRENT Wed Jun 25 12:27:16 2014 (343 seconds) > ----------------------------------------START CONCURRENT Wed Jun 25 12:27:16 2014 > /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1 > Running partition 000001 with options -h 1-133125 -r 133126-1597500 start 133125 end 1597500 total 1464375 zero job 0 and stride 1 > /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2 > Running partition 000002 with options -h 1-133125 -r 1597501-2995674 start 1597500 end 2995674 total 1398174 zero job 0 and stride 1 > ... > > > Thanks, > Matt > > >> >> There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file where the 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. >> >> Also, I've found another minor issue with the '-threads' option supplied to PBcR on the command line. It doesn't seem to use the number of threads supplied and simply uses the max number of cpus on the machine available. > Thanks, I'll check this and update the code. 
>> >> Thanks, >> Matt >> ------------------------------------------------------------------------------ >> Open source business process management suite built on Java and Eclipse >> Turn processes into business applications with Bonita BPM Community Edition >> Quickly connect people, data, and systems into organized workflows >> Winner of BOSSIE, CODIE, OW2 and Gartner awards >> http://p.sf.net/sfu/Bonitasoft_______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > |
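Since the 16X-versus-10X point comes up repeatedly in this thread, a rough way to check how much coverage a read set actually represents is to count the bases directly. A sketch, assuming an uncompressed four-line-per-record FASTQ and the 1 Gbp genome-size estimate used above:

# Total bases and approximate coverage for an assumed genome size (placeholder value).
awk -v GENOME_SIZE=1000000000 'NR % 4 == 2 { bp += length($0) } END { printf "%.0f bp, %.2fX coverage\n", bp, bp / GENOME_SIZE }' filtered_subreads.bbmap.rm_adapters.split.fastq

Run again on the corrected reads (if they are in FASTQ form), the same one-liner gives a quick before/after comparison against the 10-20X range ECTools is said to target.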
From: Frank B. <fra...@gm...> - 2014-06-26 14:42:52

|
Hi everybody, Has anybody had the opportunity to use Illumina's synthetic long reads (SLR = Moleculo) with CABOG? I am curious whether particular overlap parameters need to be adjusted to account for the high accuracy and haplotype nature of the assembled fragments. I am trying to compile recommendations on what to do with SLRs for new customers, and CABOG is at the top of my list of assemblers. Typical genome coverage generated in the beta program through FastTrack Services has been low, and I am curious whether anybody is willing to share their early experience with Illumina's long-read technology. Thank you very much for your help. Kind regards, Frank Frank Boellmann, PhD Regional Marketing Specialist, Informatics Illumina |
From: Matthew C. <co...@gm...> - 2014-06-25 18:33:41
|
Hi Serge, On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> wrote: > Hi, > > On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: > > Hi all, > > I'm trying out PBcR to make use of the new MHAP overlapper for self > correcting a set of PacBio reads and I'm running into an issue. > > I'm getting the following errors in the temp_dir/1-overlapper/1.err: > *Exception in thread "main" java.io.FileNotFoundException: > /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat > (No such file or directory)* > > The dat file is a pre-computed index that is used to speed up the > computation for smaller genomes. For larger genomes or if you are using > local disk, it should not get created. Do you have the output of the > pipeline up to this step along with the command-line you used to start the > run? That will help diagnose why it is not properly recognizing that the > index is not built. As a workaround, you can add "localStaging=</path to > local disk>" to your PBcR command which will force the pipeline to never > pre-compute the index. > The command that I ran was: */sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 -partitions 200 -threads 27 -lib* *aryname PBcR -s pacbio.spec fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize 1000000000* I changed the MHAP settings according to the PBcR wiki since I only have about 16x coverage of PacBio data. I should mention that runCA continues to run until the '5-consensus' step, and errors out there. But I think the start of the problem is at this overlap step. The relevant output was: *### Reading options from 'pacbio.spec'* *### Reading options from the command line.* *Warning: no frag files specified, assuming self-correction of pacbio sequences.* *Running with 27 threads and 200 partitions* ********** Starting correction...* *...* ********* Configuration Summary ********* *bankPath = * *maxCoverage = 40* *...* *mhap = -k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging* *ovlRefBlockLength = 100000000000* *cnsErrorRate = 0.25* *...* *----------------------------------------START Wed Jun 25 11:24:30 2014* *mkdir tempPBcR* *----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds)* *----------------------------------------START Wed Jun 25 11:24:30 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger -technology none -feature doConsensusCorrection 1 -reads /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > /path_to_working_dir//tempPBcR/PBcR.frg* *----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds)* *----------------------------------------START Wed Jun 25 11:24:30 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/runCA -s /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR stopAfter=initialStoreBuilding /path_to_working_dir//tempPBcR/PBcR.frg* *----------------------------------------START Wed Jun 25 11:24:30 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -o /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING -F /path_to_working_dir//tempPBcR/PBcR.frg > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1* *----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds)* *numFrags = 2995674* *Stop requested after 'initialstorebuilding'.* *----------------------------------------END Wed 
Jun 25 11:35:27 2014 (657 seconds)* *Will be correcting PacBio library 1 with librarie[s] 1 - 1* *----------------------------------------START Wed Jun 25 11:35:29 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert -tabular -longestovermin 1 500 -longestlength 1 8268329152 /path_to_working_dir//tempPBcR/asm.gkpStore 2> /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > /path_to_working_dir//tempPBcR/asm.toerase.uid* *----------------------------------------END Wed Jun 25 11:35:38 2014 (9 seconds)* *----------------------------------------START Wed Jun 25 11:35:38 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit /path_to_working_dir//tempPBcR/asm.toerase.uid /path_to_working_dir//tempPBcR/asm.gkpStore > /path_to_working_dir//tempPBcR/asm.toerase.out 2> /path_to_working_dir//tempPBcR/asm.toerase.err* *----------------------------------------END Wed Jun 25 11:35:44 2014 (6 seconds)* *Running with 8.268329256X (for genome size 1000000000) of PBcR sequences (8268329256 bp).* *Correcting with 16X sequences (16536658304 bp).* *Warning: performing self-correction with a total of 16. For best performance, at least 50 is recommended.* *----------------------------------------START Wed Jun 25 11:35:44 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count -m 16 -s 120000000 -t 32 -o /path_to_working_dir//tempPBcR/asm.mers /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq* *----------------------------------------END Wed Jun 25 12:05:11 2014 (1767 seconds)* *----------------------------------------START Wed Jun 25 12:05:11 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir//tempPBcR/asm.hist* *----------------------------------------END Wed Jun 25 12:09:10 2014 (239 seconds)* *----------------------------------------START Wed Jun 25 12:09:10 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . 
-rnk2> /path_to_working_dir//tempPBcR/asm.ignore* *----------------------------------------END Wed Jun 25 12:21:17 2014 (727 seconds)* *----------------------------------------START Wed Jun 25 12:21:17 2014* *rm /path_to_working_dir//tempPBcR/asm.mers** *----------------------------------------END Wed Jun 25 12:21:23 2014 (6 seconds)* *----------------------------------------START Wed Jun 25 12:21:23 2014* *mkdir /path_to_working_dir//tempPBcR/1-overlapper* *----------------------------------------END Wed Jun 25 12:21:23 2014 (0 seconds)* *----------------------------------------START Wed Jun 25 12:21:23 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID* *----------------------------------------END Wed Jun 25 12:21:28 2014 (5 seconds)* *----------------------------------------START Wed Jun 25 12:21:28 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen* *----------------------------------------END Wed Jun 25 12:21:33 2014 (5 seconds)* *----------------------------------------START CONCURRENT Wed Jun 25 12:21:33 2014* */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1* *Scanning store to find libraries used and reads to dump.* *Added 0 reads to maintain mate relationships.* *Dumping 0 fragments from unknown library (version 1 has these)* *Dumping 133125 fragments from library IID 1* */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2* *Scanning store to find libraries used and reads to dump.* *Added 0 reads to maintain mate relationships.* *...* */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23* *Scanning store to find libraries used and reads to dump.* *Added 0 reads to maintain mate relationships.* *Dumping 0 fragments from unknown library (version 1 has these)* *Dumping 66924 fragments from library IID 1* *----------------------------------------END CONCURRENT Wed Jun 25 12:27:16 2014 (343 seconds)* *----------------------------------------START CONCURRENT Wed Jun 25 12:27:16 2014* */path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1* *Running partition 000001 with options -h 1-133125 -r 133126-1597500 start 133125 end 1597500 total 1464375 zero job 0 and stride 1* */path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2* *Running partition 000002 with options -h 1-133125 -r 1597501-2995674 start 1597500 end 2995674 total 1398174 zero job 0 and stride 1* *...* Thanks, Matt > > > There is no 'correct_reads_part000002.dat' file there, but there is a > 'correct_reads_part000002.fasta' file where the > 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is > just an extension naming issue or if the .dat files weren't created > properly. > > Also, I've found another minor issue with the '*-threads*' option > supplied to PBcR on the command line. It doesn't seem to use the number of > threads supplied and simply uses the max number of cpus on the machine > available. > > Thanks, I'll check this and update the code. 
> > > Thanks, > Matt > > ------------------------------------------------------------------------------ > Open source business process management suite built on Java and Eclipse > Turn processes into business applications with Bonita BPM Community Edition > Quickly connect people, data, and systems into organized workflows > Winner of BOSSIE, CODIE, OW2 and Gartner awards > > http://p.sf.net/sfu/Bonitasoft_______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > > |
From: Serge K. <ser...@gm...> - 2014-06-25 15:36:49
|
Hi, On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: > Hi all, > > I'm trying out PBcR to make use of the new MHAP overlapper for self correcting a set of PacBio reads and I'm running into an issue. > > I'm getting the following errors in the temp_dir/1-overlapper/1.err: > Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory) The dat file is a pre-computed index that is used to speed up the computation for smaller genomes. For larger genomes or if you are using local disk, it should not get created. Do you have the output of the pipeline up to this step along with the command-line you used to start the run? That will help diagnose why it is not properly recognizing that the index is not built. As a workaround, you can add "localStaging=</path to local disk>" to your PBcR command which will force the pipeline to never pre-compute the index. > > There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file where the 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. > > Also, I've found another minor issue with the '-threads' option supplied to PBcR on the command line. It doesn't seem to use the number of threads supplied and simply uses the max number of cpus on the machine available. Thanks, I'll check this and update the code. > > Thanks, > Matt > ------------------------------------------------------------------------------ > Open source business process management suite built on Java and Eclipse > Turn processes into business applications with Bonita BPM Community Edition > Quickly connect people, data, and systems into organized workflows > Winner of BOSSIE, CODIE, OW2 and Gartner awards > http://p.sf.net/sfu/Bonitasoft_______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
From: Matthew C. <co...@gm...> - 2014-06-24 21:40:43
|
Hi all, I'm trying out PBcR to make use of the new MHAP overlapper for self-correcting a set of PacBio reads, and I'm running into an issue. I'm getting the following error in the temp_dir/1-overlapper/1.err: *Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory)* There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file at the location that 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. Also, I've found another minor issue with the '*-threads*' option supplied to PBcR on the command line: it doesn't seem to use the number of threads supplied and simply uses the maximum number of CPUs available on the machine. Thanks, Matt |
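Not a fix, but a quick way to see how far the staging step got before the exception. A sketch using the paths taken from the error message above:

# Which overlap jobs hit the missing-index error, and what actually exists in stream_1.
grep -l 'FileNotFoundException' /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/*.err
ls -l /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/ | grep -E 'correct_reads_part[0-9]+\.(dat|fasta)'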
From: Santiago R. <san...@gm...> - 2014-06-24 14:57:28
|
Hi guys, 1. ovb files were using 1.2T (they were not compressed) and .fasta, .qual and .qv, another 850Gb. All gone now. 2. in regards the -pbCNS option, no, haven't seen it by the time I've started. My problem now is that the process has been running for 3 days and at the moment it is using about 97.2% of available memory (and growing). It is a 256Gb standalone server where I'm just a guest. Should I wait a little more for it to finished? Why is it using all available memory? It is running the layout step (runCorrection.sh script). I'm attaching the pacBioToCA log, the runCorrection.sh script and the asm.layout.err file as a reference for the options, specs and status. Any help would be really appreciated. Thank you very much in advance again. Regards, Santiago On Mon, Jun 23, 2014 at 2:27 PM, Serge Koren <ser...@gm...> wrote: > Hi, > > 1. Yes, as long as you have the asm.ovlStore constructed you can delete > the contents of the 1-overlapper directory. I'm guessing it is fasta/qual > files that are taking al the space > > 2. The overlapping is the most expensive part of the computation so the > remaining steps should be relatively quick. The consensus can be another > expensive step. I'm not sure if you specified -pbCNS when you ran > pacBioToCA but if you haven't relaunched the run yet, you can add that > option and it will use a faster consensus module (which is actually on by > default in the next CA release). > > Sergey > > On Jun 21, 2014, at 11:58 AM, Santiago Revale <san...@gm...> > wrote: > > Hi Brian/Serge, > > Brian's patch worked like a charm. I'll be continue executing the > pacBioToCA script. > > A couple of quick questions before: > > 1) can I delete the "1-overlapper/" directory before the pacBioToCA script > ended? Because it is 2Tb long as "asm.ovlStore" is that size too (1.8Tb). > > 2) could you give an estimated time the remaining portion of the script > would take? And also an estimate on cores and memory usage? > > Thank you very much for your help and assistance. > > Regards, > Santiago > > > On Thu, Jun 19, 2014 at 12:53 PM, Santiago Revale < > san...@gm...> wrote: > >> Thank you very much, guys. >> >> I'll be trying your suggestions this days, starting from Brian's, and >> I'll be back to you with the outcome. >> >> Regards, >> Santiago >> >> >> >> On Thu, Jun 19, 2014 at 8:34 AM, Brian Walenz <th...@gm...> wrote: >> >>> Sergey is right; the vacation must be getting to me... >>> >>> Here is a simple patch to AS_OVS/overlapStoreBuild.C. This will change >>> the way the data is partitioned, so that the first partitions are merged >>> into a few and the last one is split into many. This should result in >>> partitions of around 10gb in size -- the 1tb partition should be split into >>> 128 pieces. >>> >>> The change is only an addition of ~15 lines, to function >>> writeToDumpFile(). The new lines are enclosed in a #if/#endif block, >>> currently enabled. You can just drop this file into a svn checkout and >>> recompile. DO NOT USE FOR PRODUCTION! There are hardcoded values specific >>> to your assembly. Please do check these values against gatekeeper >>> dumpinfo. I don't think they're critical to be exact, but if I'm off by an >>> order of magnitude, it probably won't work well. >>> >>> b >>> >>> >>> >>> >>> >>> On Wed, Jun 18, 2014 at 11:43 PM, Serge Koren <ser...@gm...> >>> wrote: >>> >>>> Hi, >>>> >>>> I don't believe the way the overlaps are created is a problem but the >>>> way the overlap store is doing the partitioning is. 
It looks like you have >>>> about 4X of PacBio data and about 150X of Illumina data. This a larger >>>> difference than we normally use (usually we recommend no more than 50X of >>>> Illumina data and 10X+ PacBio) which is likely why this error has not been >>>> encountered before. The overlaps are only computed between the PacBio and >>>> Illumina reads which are evenly distributed among the partitions so they >>>> should all have approximately the same number of overlaps. This should be >>>> easy to confirm if all your overlap ovb files are approximately the same >>>> size and your output log seems to confirm this. >>>> >>>> The overlap store bucketizing is assuming equal number of overlaps for >>>> each read in your dataset and your Illumina-Illumina overlaps do not exist >>>> so as a result all the IIDs with overlaps end up in the last bucket. You've >>>> got 505,893 pacbio fragments and 1,120,240,607 Illumina reads. To split the >>>> PacBio reads among multiple partitions, you'd want to have be able to open >>>> 10,000-20,000 files (partitions) which is above the current limit you have. >>>> If you can modify it using ulimit -n 50000 and then run the store creation >>>> specifying -f 20480 (or some other large number). That should make your >>>> last partition significantly smaller. If you cannot increase the limit then >>>> modifying the code is the only option. The good news is that if you are >>>> able to build the store, you can re-launch the PBcR pipeline and it will >>>> resume the correction after the overlapping step. >>>> >>>> Sergey >>>> >>>> >>>> The hash is only composed of the last set of reads (PacBio) and the >>>> refr sequences streamed against the hash are the Illumina data. >>>> On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote: >>>> >>>> Unfortunately, I'm on vacation at the moment, and finding little time >>>> to spend helping you. >>>> >>>> "Too many open files" is a limit imposed by the OS. Can you increase >>>> this? We've set our large memory machines to allow 100,000 open files. >>>> >>>> The output files sizes -- and the problem you're suffering from -- are >>>> all caused by the way overlaps are created. Correction asked for only >>>> overlaps between Illumina and PacBio reads. All the illumina reads are >>>> 'first' in the store, and all the pacbio reads are at the end. Overlap >>>> jobs will find overlaps between 'other' reads and some subset of the store >>>> - e.g., the first overlap job will process the first 10% of the reads, the >>>> second will do the second 10% of the reads, etc. Since the pacbio are >>>> last, the last job found all the overlaps, so only the last file is of >>>> significant size. This also breaks the partitioning scheme used when >>>> sorting overlaps. It assumes overlaps are distributed randomly, but yours >>>> are all piled up at the end. >>>> >>>> I don't see an easy fix here, but I think I can come up with a one-off >>>> hack to get your store built. Are you comfortable working with C code and >>>> compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can >>>> see the number of reads per library. >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale < >>>> san...@gm...> wrote: >>>> >>>>> Hi Brian, >>>>> >>>>> When using 1024, it said the OS wasn't able to handle it, and it >>>>> recommended using 1008. >>>>> When using 1008, CA ended arguing "Failed to open output file... Too >>>>> many open files". 
>>>>> >>>>> Now I'm trying with fewer parts, but I don't think this would solve >>>>> the problem. >>>>> >>>>> Do you have any more ideas? >>>>> >>>>> Thanks again in advance. >>>>> >>>>> Regards, >>>>> Santiago >>>>> >>>>> >>>>> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale < >>>>> san...@gm...> wrote: >>>>> >>>>>> Hi Brian, >>>>>> >>>>>> Thanks for your reply. In regards of your suggestions: >>>>>> >>>>>> 1) the PBcR process generates OVB files without zipping them; just to >>>>>> be sure, I've tried to unzip some of them just in case the extension were >>>>>> missing; >>>>>> >>>>>> 2) I've re-launched the process with the suggested parameters, but >>>>>> using 512 instead of 1024; the result was exactly the same: same error in >>>>>> the same step. Also, again 511 out of 512 files had a size of 2.3Gb while >>>>>> the last file was 1.2Tb long. Do you know why does this happens? >>>>>> >>>>>> I'm trying one last time using 1024 instead. >>>>>> >>>>>> Thanks again for your reply. I'm open to some more suggestions. >>>>>> >>>>>> Regards, >>>>>> Santiago >>>>>> >>>>>> >>>>>> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> >>>>>> wrote: >>>>>> >>>>>>> Hi- >>>>>>> >>>>>>> This is a flaw in gzip, where it doesn't report the uncompressed >>>>>>> size correctly for files larger than 2gb. I'm not intimately familiar with >>>>>>> this pipeline, so don't know exactly how to implement the fixes below. >>>>>>> >>>>>>> Fix with either: >>>>>>> >>>>>>> 1) gzip -d the *gz files before building the overlap store. The >>>>>>> 'find' command in the log indicates the pipeline will pick up the >>>>>>> uncompressed files. You might need to remove the 'asm.ovlStore.list' file >>>>>>> before restarting (this has the list of inputs to overlapStoreBuild). >>>>>>> >>>>>>> 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it >>>>>>> to use 0MB memory, and instead use 1024 files regardless of the size. 512 >>>>>>> files will also work, and is a little safer (not near some Linux 'number of >>>>>>> open files' limits). >>>>>>> >>>>>>> 3) Build the overlap store by hand (with either the uncompressed >>>>>>> input, or the -f instead of -M option), outside the script, and then >>>>>>> restart the script. The script will notice there is an overlap store >>>>>>> already present, and skip the build. The command is in the log file -- >>>>>>> make sure the final store is called 'asm.ovlStore', and not >>>>>>> 'asm.ovlStore.BUILDING'. >>>>>>> >>>>>>> Option 1 should work, but option 2 is the easiest to try. I >>>>>>> wouldn't try option 3 until Sergey speaks up. >>>>>>> >>>>>>> b >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale < >>>>>>> san...@gm...> wrote: >>>>>>> >>>>>>>> Dear CA community, >>>>>>>> >>>>>>>> I'm running the correction of some PacBio reads with high-identity >>>>>>>> Illumina reads, in a high memory server, for a 750 Mbp genome. I've >>>>>>>> considered the known issues addressed in the website when starting the >>>>>>>> correction. >>>>>>>> >>>>>>>> When executing the pipeline, I've reached to the overlapStoreBuild >>>>>>>> step with 48 ovb files, size 26 Gb each (totaling 1.2Tb). ovls files have >>>>>>>> already been deleted by the script. The error happened while executing >>>>>>>> overlapStoreBuild: >>>>>>>> >>>>>>>> ... >>>>>>>> bucketizing DONE! 
>>>>>>>> overlaps skipped: >>>>>>>> 0 OBT - low quality >>>>>>>> 0 DUP - non-duplicate overlap >>>>>>>> 0 DUP - different library >>>>>>>> 0 DUP - dedup not requested >>>>>>>> terminate called after throwing an instance of 'std::bad_alloc' >>>>>>>> what(): std::bad_alloc >>>>>>>> >>>>>>>> Failed with 'Aborted' >>>>>>>> ... >>>>>>>> >>>>>>>> >>>>>>>> I ran this step twice: the first one having set ovlStoreMemory to >>>>>>>> 8192 Mb, but the second one, set it on 160000 (160 Gb). In the "Overlap >>>>>>>> store failure" FAQ, it mentioned as possible causes "Out of disk space" >>>>>>>> (which is not my case) and "Corrupt gzip files / too many fragments". I >>>>>>>> don't have gzip files and I have only 15 fragments. Also, bucketizing step >>>>>>>> finishes OK. >>>>>>>> >>>>>>>> Also, some odd thing I've noticed (at least odd for me) is that 14 >>>>>>>> of the 15 temp files (tmp.sort.XXX) of the asm.ovlStore.BUILDING folder >>>>>>>> have a size 79Gb while the last one size is 1.2Tb. >>>>>>>> >>>>>>>> Could anybody tell me what could be the cause of this error and how >>>>>>>> to solve it? >>>>>>>> >>>>>>>> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for >>>>>>>> complete descriptions of the error and the executed commands. >>>>>>>> >>>>>>>> Thank you very much in advance. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Santiago >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> ------------------------------------------------------------------------------ >>>>>>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk >>>>>>>> Solutions >>>>>>>> Find What Matters Most in Your Big Data with HPCC Systems >>>>>>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >>>>>>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >>>>>>>> http://p.sf.net/sfu/hpccsystems >>>>>>>> _______________________________________________ >>>>>>>> wgs-assembler-users mailing list >>>>>>>> wgs...@li... >>>>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk >>>> Solutions >>>> Find What Matters Most in Your Big Data with HPCC Systems >>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >>>> >>>> http://p.sf.net/sfu/hpccsystems_______________________________________________ >>>> wgs-assembler-users mailing list >>>> wgs...@li... >>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>> >>>> >>>> >>> >> > > |
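For the open-file-limit side of the advice quoted above, the limits can be checked and raised from the shell that launches the store build. A sketch; whether the hard limit can be raised without an administrator depends on the system:

ulimit -Sn          # current soft limit on open files
ulimit -Hn          # hard limit the soft limit can be raised to
ulimit -n 50000     # raise the soft limit for this shell, as suggested by Sergey

With the limit raised, the store build can be relaunched with a large partition count (the -f 20480 suggested above) so the PacBio reads are spread over many smaller buckets.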
From: Serge K. <ser...@gm...> - 2014-06-23 17:27:21
|
Hi, 1. Yes, as long as you have the asm.ovlStore constructed you can delete the contents of the 1-overlapper directory. I'm guessing it is fasta/qual files that are taking al the space 2. The overlapping is the most expensive part of the computation so the remaining steps should be relatively quick. The consensus can be another expensive step. I'm not sure if you specified -pbCNS when you ran pacBioToCA but if you haven't relaunched the run yet, you can add that option and it will use a faster consensus module (which is actually on by default in the next CA release). Sergey On Jun 21, 2014, at 11:58 AM, Santiago Revale <san...@gm...> wrote: > Hi Brian/Serge, > > Brian's patch worked like a charm. I'll be continue executing the pacBioToCA script. > > A couple of quick questions before: > > 1) can I delete the "1-overlapper/" directory before the pacBioToCA script ended? Because it is 2Tb long as "asm.ovlStore" is that size too (1.8Tb). > > 2) could you give an estimated time the remaining portion of the script would take? And also an estimate on cores and memory usage? > > Thank you very much for your help and assistance. > > Regards, > Santiago > > > On Thu, Jun 19, 2014 at 12:53 PM, Santiago Revale <san...@gm...> wrote: > Thank you very much, guys. > > I'll be trying your suggestions this days, starting from Brian's, and I'll be back to you with the outcome. > > Regards, > Santiago > > > > On Thu, Jun 19, 2014 at 8:34 AM, Brian Walenz <th...@gm...> wrote: > Sergey is right; the vacation must be getting to me... > > Here is a simple patch to AS_OVS/overlapStoreBuild.C. This will change the way the data is partitioned, so that the first partitions are merged into a few and the last one is split into many. This should result in partitions of around 10gb in size -- the 1tb partition should be split into 128 pieces. > > The change is only an addition of ~15 lines, to function writeToDumpFile(). The new lines are enclosed in a #if/#endif block, currently enabled. You can just drop this file into a svn checkout and recompile. DO NOT USE FOR PRODUCTION! There are hardcoded values specific to your assembly. Please do check these values against gatekeeper dumpinfo. I don't think they're critical to be exact, but if I'm off by an order of magnitude, it probably won't work well. > > b > > > > > > On Wed, Jun 18, 2014 at 11:43 PM, Serge Koren <ser...@gm...> wrote: > Hi, > > I don't believe the way the overlaps are created is a problem but the way the overlap store is doing the partitioning is. It looks like you have about 4X of PacBio data and about 150X of Illumina data. This a larger difference than we normally use (usually we recommend no more than 50X of Illumina data and 10X+ PacBio) which is likely why this error has not been encountered before. The overlaps are only computed between the PacBio and Illumina reads which are evenly distributed among the partitions so they should all have approximately the same number of overlaps. This should be easy to confirm if all your overlap ovb files are approximately the same size and your output log seems to confirm this. > > The overlap store bucketizing is assuming equal number of overlaps for each read in your dataset and your Illumina-Illumina overlaps do not exist so as a result all the IIDs with overlaps end up in the last bucket. You've got 505,893 pacbio fragments and 1,120,240,607 Illumina reads. 
To split the PacBio reads among multiple partitions, you'd want to have be able to open 10,000-20,000 files (partitions) which is above the current limit you have. If you can modify it using ulimit -n 50000 and then run the store creation specifying -f 20480 (or some other large number). That should make your last partition significantly smaller. If you cannot increase the limit then modifying the code is the only option. The good news is that if you are able to build the store, you can re-launch the PBcR pipeline and it will resume the correction after the overlapping step. > > Sergey > > > The hash is only composed of the last set of reads (PacBio) and the refr sequences streamed against the hash are the Illumina data. > On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote: > >> Unfortunately, I'm on vacation at the moment, and finding little time to spend helping you. >> >> "Too many open files" is a limit imposed by the OS. Can you increase this? We've set our large memory machines to allow 100,000 open files. >> >> The output files sizes -- and the problem you're suffering from -- are all caused by the way overlaps are created. Correction asked for only overlaps between Illumina and PacBio reads. All the illumina reads are 'first' in the store, and all the pacbio reads are at the end. Overlap jobs will find overlaps between 'other' reads and some subset of the store - e.g., the first overlap job will process the first 10% of the reads, the second will do the second 10% of the reads, etc. Since the pacbio are last, the last job found all the overlaps, so only the last file is of significant size. This also breaks the partitioning scheme used when sorting overlaps. It assumes overlaps are distributed randomly, but yours are all piled up at the end. >> >> I don't see an easy fix here, but I think I can come up with a one-off hack to get your store built. Are you comfortable working with C code and compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can see the number of reads per library. >> >> >> >> >> >> >> On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale <san...@gm...> wrote: >> Hi Brian, >> >> When using 1024, it said the OS wasn't able to handle it, and it recommended using 1008. >> When using 1008, CA ended arguing "Failed to open output file... Too many open files". >> >> Now I'm trying with fewer parts, but I don't think this would solve the problem. >> >> Do you have any more ideas? >> >> Thanks again in advance. >> >> Regards, >> Santiago >> >> >> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale <san...@gm...> wrote: >> Hi Brian, >> >> Thanks for your reply. In regards of your suggestions: >> >> 1) the PBcR process generates OVB files without zipping them; just to be sure, I've tried to unzip some of them just in case the extension were missing; >> >> 2) I've re-launched the process with the suggested parameters, but using 512 instead of 1024; the result was exactly the same: same error in the same step. Also, again 511 out of 512 files had a size of 2.3Gb while the last file was 1.2Tb long. Do you know why does this happens? >> >> I'm trying one last time using 1024 instead. >> >> Thanks again for your reply. I'm open to some more suggestions. >> >> Regards, >> Santiago >> >> >> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> wrote: >> Hi- >> >> This is a flaw in gzip, where it doesn't report the uncompressed size correctly for files larger than 2gb. 
I'm not intimately familiar with this pipeline, so don't know exactly how to implement the fixes below. >> >> Fix with either: >> >> 1) gzip -d the *gz files before building the overlap store. The 'find' command in the log indicates the pipeline will pick up the uncompressed files. You might need to remove the 'asm.ovlStore.list' file before restarting (this has the list of inputs to overlapStoreBuild). >> >> 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it to use 0MB memory, and instead use 1024 files regardless of the size. 512 files will also work, and is a little safer (not near some Linux 'number of open files' limits). >> >> 3) Build the overlap store by hand (with either the uncompressed input, or the -f instead of -M option), outside the script, and then restart the script. The script will notice there is an overlap store already present, and skip the build. The command is in the log file -- make sure the final store is called 'asm.ovlStore', and not 'asm.ovlStore.BUILDING'. >> >> Option 1 should work, but option 2 is the easiest to try. I wouldn't try option 3 until Sergey speaks up. >> >> b >> >> >> >> >> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale <san...@gm...> wrote: >> Dear CA community, >> >> I'm running the correction of some PacBio reads with high-identity Illumina reads, in a high memory server, for a 750 Mbp genome. I've considered the known issues addressed in the website when starting the correction. >> >> When executing the pipeline, I've reached to the overlapStoreBuild step with 48 ovb files, size 26 Gb each (totaling 1.2Tb). ovls files have already been deleted by the script. The error happened while executing overlapStoreBuild: >> >> ... >> bucketizing DONE! >> overlaps skipped: >> 0 OBT - low quality >> 0 DUP - non-duplicate overlap >> 0 DUP - different library >> 0 DUP - dedup not requested >> terminate called after throwing an instance of 'std::bad_alloc' >> what(): std::bad_alloc >> >> Failed with 'Aborted' >> ... >> >> I ran this step twice: the first one having set ovlStoreMemory to 8192 Mb, but the second one, set it on 160000 (160 Gb). In the "Overlap store failure" FAQ, it mentioned as possible causes "Out of disk space" (which is not my case) and "Corrupt gzip files / too many fragments". I don't have gzip files and I have only 15 fragments. Also, bucketizing step finishes OK. >> >> Also, some odd thing I've noticed (at least odd for me) is that 14 of the 15 temp files (tmp.sort.XXX) of the asm.ovlStore.BUILDING folder have a size 79Gb while the last one size is 1.2Tb. >> >> Could anybody tell me what could be the cause of this error and how to solve it? >> >> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for complete descriptions of the error and the executed commands. >> >> Thank you very much in advance. >> >> Regards, >> Santiago >> >> >> >> ------------------------------------------------------------------------------ >> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions >> Find What Matters Most in Your Big Data with HPCC Systems >> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >> http://p.sf.net/sfu/hpccsystems >> _______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... 
>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> >> >> >> >> ------------------------------------------------------------------------------ >> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions >> Find What Matters Most in Your Big Data with HPCC Systems >> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >> http://p.sf.net/sfu/hpccsystems_______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > > > |
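A small sketch of the clean-up Sergey describes above, with paths assumed to be relative to the correction working directory; confirm the store exists before deleting anything:

du -sh asm.ovlStore 1-overlapper     # the store should be present and non-trivial in size
rm -rf 1-overlapper/*                # per Sergey: safe once asm.ovlStore is constructed

Re-launching the pipeline afterwards with -pbCNS added (if it was not used originally) picks up the faster consensus module he mentions.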
From: Santiago R. <san...@gm...> - 2014-06-21 15:59:14
|
Hi Brian/Serge, Brian's patch worked like a charm. I'll be continue executing the pacBioToCA script. A couple of quick questions before: 1) can I delete the "1-overlapper/" directory before the pacBioToCA script ended? Because it is 2Tb long as "asm.ovlStore" is that size too (1.8Tb). 2) could you give an estimated time the remaining portion of the script would take? And also an estimate on cores and memory usage? Thank you very much for your help and assistance. Regards, Santiago On Thu, Jun 19, 2014 at 12:53 PM, Santiago Revale <san...@gm...> wrote: > Thank you very much, guys. > > I'll be trying your suggestions this days, starting from Brian's, and I'll > be back to you with the outcome. > > Regards, > Santiago > > > > On Thu, Jun 19, 2014 at 8:34 AM, Brian Walenz <th...@gm...> wrote: > >> Sergey is right; the vacation must be getting to me... >> >> Here is a simple patch to AS_OVS/overlapStoreBuild.C. This will change >> the way the data is partitioned, so that the first partitions are merged >> into a few and the last one is split into many. This should result in >> partitions of around 10gb in size -- the 1tb partition should be split into >> 128 pieces. >> >> The change is only an addition of ~15 lines, to function >> writeToDumpFile(). The new lines are enclosed in a #if/#endif block, >> currently enabled. You can just drop this file into a svn checkout and >> recompile. DO NOT USE FOR PRODUCTION! There are hardcoded values specific >> to your assembly. Please do check these values against gatekeeper >> dumpinfo. I don't think they're critical to be exact, but if I'm off by an >> order of magnitude, it probably won't work well. >> >> b >> >> >> >> >> >> On Wed, Jun 18, 2014 at 11:43 PM, Serge Koren <ser...@gm...> >> wrote: >> >>> Hi, >>> >>> I don't believe the way the overlaps are created is a problem but the >>> way the overlap store is doing the partitioning is. It looks like you have >>> about 4X of PacBio data and about 150X of Illumina data. This a larger >>> difference than we normally use (usually we recommend no more than 50X of >>> Illumina data and 10X+ PacBio) which is likely why this error has not been >>> encountered before. The overlaps are only computed between the PacBio and >>> Illumina reads which are evenly distributed among the partitions so they >>> should all have approximately the same number of overlaps. This should be >>> easy to confirm if all your overlap ovb files are approximately the same >>> size and your output log seems to confirm this. >>> >>> The overlap store bucketizing is assuming equal number of overlaps for >>> each read in your dataset and your Illumina-Illumina overlaps do not exist >>> so as a result all the IIDs with overlaps end up in the last bucket. You've >>> got 505,893 pacbio fragments and 1,120,240,607 Illumina reads. To split the >>> PacBio reads among multiple partitions, you'd want to have be able to open >>> 10,000-20,000 files (partitions) which is above the current limit you have. >>> If you can modify it using ulimit -n 50000 and then run the store creation >>> specifying -f 20480 (or some other large number). That should make your >>> last partition significantly smaller. If you cannot increase the limit then >>> modifying the code is the only option. The good news is that if you are >>> able to build the store, you can re-launch the PBcR pipeline and it will >>> resume the correction after the overlapping step. 
>>> >>> Sergey >>> >>> >>> The hash is only composed of the last set of reads (PacBio) and the refr >>> sequences streamed against the hash are the Illumina data. >>> On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote: >>> >>> Unfortunately, I'm on vacation at the moment, and finding little time to >>> spend helping you. >>> >>> "Too many open files" is a limit imposed by the OS. Can you increase >>> this? We've set our large memory machines to allow 100,000 open files. >>> >>> The output files sizes -- and the problem you're suffering from -- are >>> all caused by the way overlaps are created. Correction asked for only >>> overlaps between Illumina and PacBio reads. All the illumina reads are >>> 'first' in the store, and all the pacbio reads are at the end. Overlap >>> jobs will find overlaps between 'other' reads and some subset of the store >>> - e.g., the first overlap job will process the first 10% of the reads, the >>> second will do the second 10% of the reads, etc. Since the pacbio are >>> last, the last job found all the overlaps, so only the last file is of >>> significant size. This also breaks the partitioning scheme used when >>> sorting overlaps. It assumes overlaps are distributed randomly, but yours >>> are all piled up at the end. >>> >>> I don't see an easy fix here, but I think I can come up with a one-off >>> hack to get your store built. Are you comfortable working with C code and >>> compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can >>> see the number of reads per library. >>> >>> >>> >>> >>> >>> >>> On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale < >>> san...@gm...> wrote: >>> >>>> Hi Brian, >>>> >>>> When using 1024, it said the OS wasn't able to handle it, and it >>>> recommended using 1008. >>>> When using 1008, CA ended arguing "Failed to open output file... Too >>>> many open files". >>>> >>>> Now I'm trying with fewer parts, but I don't think this would solve the >>>> problem. >>>> >>>> Do you have any more ideas? >>>> >>>> Thanks again in advance. >>>> >>>> Regards, >>>> Santiago >>>> >>>> >>>> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale < >>>> san...@gm...> wrote: >>>> >>>>> Hi Brian, >>>>> >>>>> Thanks for your reply. In regards of your suggestions: >>>>> >>>>> 1) the PBcR process generates OVB files without zipping them; just to >>>>> be sure, I've tried to unzip some of them just in case the extension were >>>>> missing; >>>>> >>>>> 2) I've re-launched the process with the suggested parameters, but >>>>> using 512 instead of 1024; the result was exactly the same: same error in >>>>> the same step. Also, again 511 out of 512 files had a size of 2.3Gb while >>>>> the last file was 1.2Tb long. Do you know why does this happens? >>>>> >>>>> I'm trying one last time using 1024 instead. >>>>> >>>>> Thanks again for your reply. I'm open to some more suggestions. >>>>> >>>>> Regards, >>>>> Santiago >>>>> >>>>> >>>>> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> >>>>> wrote: >>>>> >>>>>> Hi- >>>>>> >>>>>> This is a flaw in gzip, where it doesn't report the uncompressed size >>>>>> correctly for files larger than 2gb. I'm not intimately familiar with this >>>>>> pipeline, so don't know exactly how to implement the fixes below. >>>>>> >>>>>> Fix with either: >>>>>> >>>>>> 1) gzip -d the *gz files before building the overlap store. The >>>>>> 'find' command in the log indicates the pipeline will pick up the >>>>>> uncompressed files. 
You might need to remove the 'asm.ovlStore.list' file >>>>>> before restarting (this has the list of inputs to overlapStoreBuild). >>>>>> >>>>>> 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it to >>>>>> use 0MB memory, and instead use 1024 files regardless of the size. 512 >>>>>> files will also work, and is a little safer (not near some Linux 'number of >>>>>> open files' limits). >>>>>> >>>>>> 3) Build the overlap store by hand (with either the uncompressed >>>>>> input, or the -f instead of -M option), outside the script, and then >>>>>> restart the script. The script will notice there is an overlap store >>>>>> already present, and skip the build. The command is in the log file -- >>>>>> make sure the final store is called 'asm.ovlStore', and not >>>>>> 'asm.ovlStore.BUILDING'. >>>>>> >>>>>> Option 1 should work, but option 2 is the easiest to try. I wouldn't >>>>>> try option 3 until Sergey speaks up. >>>>>> >>>>>> b >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale < >>>>>> san...@gm...> wrote: >>>>>> >>>>>>> Dear CA community, >>>>>>> >>>>>>> I'm running the correction of some PacBio reads with high-identity >>>>>>> Illumina reads, in a high memory server, for a 750 Mbp genome. I've >>>>>>> considered the known issues addressed in the website when starting the >>>>>>> correction. >>>>>>> >>>>>>> When executing the pipeline, I've reached to the overlapStoreBuild >>>>>>> step with 48 ovb files, size 26 Gb each (totaling 1.2Tb). ovls files have >>>>>>> already been deleted by the script. The error happened while executing >>>>>>> overlapStoreBuild: >>>>>>> >>>>>>> ... >>>>>>> bucketizing DONE! >>>>>>> overlaps skipped: >>>>>>> 0 OBT - low quality >>>>>>> 0 DUP - non-duplicate overlap >>>>>>> 0 DUP - different library >>>>>>> 0 DUP - dedup not requested >>>>>>> terminate called after throwing an instance of 'std::bad_alloc' >>>>>>> what(): std::bad_alloc >>>>>>> >>>>>>> Failed with 'Aborted' >>>>>>> ... >>>>>>> >>>>>>> >>>>>>> I ran this step twice: the first one having set ovlStoreMemory to >>>>>>> 8192 Mb, but the second one, set it on 160000 (160 Gb). In the "Overlap >>>>>>> store failure" FAQ, it mentioned as possible causes "Out of disk space" >>>>>>> (which is not my case) and "Corrupt gzip files / too many fragments". I >>>>>>> don't have gzip files and I have only 15 fragments. Also, bucketizing step >>>>>>> finishes OK. >>>>>>> >>>>>>> Also, some odd thing I've noticed (at least odd for me) is that 14 >>>>>>> of the 15 temp files (tmp.sort.XXX) of the asm.ovlStore.BUILDING folder >>>>>>> have a size 79Gb while the last one size is 1.2Tb. >>>>>>> >>>>>>> Could anybody tell me what could be the cause of this error and how >>>>>>> to solve it? >>>>>>> >>>>>>> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for >>>>>>> complete descriptions of the error and the executed commands. >>>>>>> >>>>>>> Thank you very much in advance. >>>>>>> >>>>>>> Regards, >>>>>>> Santiago >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk >>>>>>> Solutions >>>>>>> Find What Matters Most in Your Big Data with HPCC Systems >>>>>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. 
>>>>>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >>>>>>> http://p.sf.net/sfu/hpccsystems >>>>>>> _______________________________________________ >>>>>>> wgs-assembler-users mailing list >>>>>>> wgs...@li... >>>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >>> ------------------------------------------------------------------------------ >>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions >>> Find What Matters Most in Your Big Data with HPCC Systems >>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >>> >>> http://p.sf.net/sfu/hpccsystems_______________________________________________ >>> wgs-assembler-users mailing list >>> wgs...@li... >>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>> >>> >>> >> > |
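A practical aside on question 1 above: the thread never answers it directly, but Brian's note (quoted above) that a completed store is named 'asm.ovlStore' rather than 'asm.ovlStore.BUILDING' suggests a conservative check before reclaiming the space used by the overlapper output. A minimal shell sketch, assuming the directory layout implied by the thread (asm.ovlStore and 1-overlapper/ in the same run directory); the paths and the cleanup itself are assumptions, not documented PBcR behaviour:

    # Hedged sketch: only reclaim 1-overlapper/ once the store has clearly finished.
    # Paths are assumed from the thread; adjust to your run directory.
    if [ -d asm.ovlStore ] && [ ! -d asm.ovlStore.BUILDING ]; then
        du -sh asm.ovlStore 1-overlapper     # sanity-check sizes before deleting anything
        # rm -rf 1-overlapper                # uncomment only once you are sure
    else
        echo "overlap store still building; keep 1-overlapper/" >&2
    fi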
From: Santiago R. <san...@gm...> - 2014-06-19 15:54:08
|
Thank you very much, guys. I'll be trying your suggestions this days, starting from Brian's, and I'll be back to you with the outcome. Regards, Santiago On Thu, Jun 19, 2014 at 8:34 AM, Brian Walenz <th...@gm...> wrote: > Sergey is right; the vacation must be getting to me... > > Here is a simple patch to AS_OVS/overlapStoreBuild.C. This will change > the way the data is partitioned, so that the first partitions are merged > into a few and the last one is split into many. This should result in > partitions of around 10gb in size -- the 1tb partition should be split into > 128 pieces. > > The change is only an addition of ~15 lines, to function > writeToDumpFile(). The new lines are enclosed in a #if/#endif block, > currently enabled. You can just drop this file into a svn checkout and > recompile. DO NOT USE FOR PRODUCTION! There are hardcoded values specific > to your assembly. Please do check these values against gatekeeper > dumpinfo. I don't think they're critical to be exact, but if I'm off by an > order of magnitude, it probably won't work well. > > b > > > > > > On Wed, Jun 18, 2014 at 11:43 PM, Serge Koren <ser...@gm...> > wrote: > >> Hi, >> >> I don't believe the way the overlaps are created is a problem but the way >> the overlap store is doing the partitioning is. It looks like you have >> about 4X of PacBio data and about 150X of Illumina data. This a larger >> difference than we normally use (usually we recommend no more than 50X of >> Illumina data and 10X+ PacBio) which is likely why this error has not been >> encountered before. The overlaps are only computed between the PacBio and >> Illumina reads which are evenly distributed among the partitions so they >> should all have approximately the same number of overlaps. This should be >> easy to confirm if all your overlap ovb files are approximately the same >> size and your output log seems to confirm this. >> >> The overlap store bucketizing is assuming equal number of overlaps for >> each read in your dataset and your Illumina-Illumina overlaps do not exist >> so as a result all the IIDs with overlaps end up in the last bucket. You've >> got 505,893 pacbio fragments and 1,120,240,607 Illumina reads. To split the >> PacBio reads among multiple partitions, you'd want to have be able to open >> 10,000-20,000 files (partitions) which is above the current limit you have. >> If you can modify it using ulimit -n 50000 and then run the store creation >> specifying -f 20480 (or some other large number). That should make your >> last partition significantly smaller. If you cannot increase the limit then >> modifying the code is the only option. The good news is that if you are >> able to build the store, you can re-launch the PBcR pipeline and it will >> resume the correction after the overlapping step. >> >> Sergey >> >> >> The hash is only composed of the last set of reads (PacBio) and the refr >> sequences streamed against the hash are the Illumina data. >> On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote: >> >> Unfortunately, I'm on vacation at the moment, and finding little time to >> spend helping you. >> >> "Too many open files" is a limit imposed by the OS. Can you increase >> this? We've set our large memory machines to allow 100,000 open files. >> >> The output files sizes -- and the problem you're suffering from -- are >> all caused by the way overlaps are created. Correction asked for only >> overlaps between Illumina and PacBio reads. 
All the illumina reads are >> 'first' in the store, and all the pacbio reads are at the end. Overlap >> jobs will find overlaps between 'other' reads and some subset of the store >> - e.g., the first overlap job will process the first 10% of the reads, the >> second will do the second 10% of the reads, etc. Since the pacbio are >> last, the last job found all the overlaps, so only the last file is of >> significant size. This also breaks the partitioning scheme used when >> sorting overlaps. It assumes overlaps are distributed randomly, but yours >> are all piled up at the end. >> >> I don't see an easy fix here, but I think I can come up with a one-off >> hack to get your store built. Are you comfortable working with C code and >> compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can >> see the number of reads per library. >> >> >> >> >> >> >> On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale < >> san...@gm...> wrote: >> >>> Hi Brian, >>> >>> When using 1024, it said the OS wasn't able to handle it, and it >>> recommended using 1008. >>> When using 1008, CA ended arguing "Failed to open output file... Too >>> many open files". >>> >>> Now I'm trying with fewer parts, but I don't think this would solve the >>> problem. >>> >>> Do you have any more ideas? >>> >>> Thanks again in advance. >>> >>> Regards, >>> Santiago >>> >>> >>> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale < >>> san...@gm...> wrote: >>> >>>> Hi Brian, >>>> >>>> Thanks for your reply. In regards of your suggestions: >>>> >>>> 1) the PBcR process generates OVB files without zipping them; just to >>>> be sure, I've tried to unzip some of them just in case the extension were >>>> missing; >>>> >>>> 2) I've re-launched the process with the suggested parameters, but >>>> using 512 instead of 1024; the result was exactly the same: same error in >>>> the same step. Also, again 511 out of 512 files had a size of 2.3Gb while >>>> the last file was 1.2Tb long. Do you know why does this happens? >>>> >>>> I'm trying one last time using 1024 instead. >>>> >>>> Thanks again for your reply. I'm open to some more suggestions. >>>> >>>> Regards, >>>> Santiago >>>> >>>> >>>> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> wrote: >>>> >>>>> Hi- >>>>> >>>>> This is a flaw in gzip, where it doesn't report the uncompressed size >>>>> correctly for files larger than 2gb. I'm not intimately familiar with this >>>>> pipeline, so don't know exactly how to implement the fixes below. >>>>> >>>>> Fix with either: >>>>> >>>>> 1) gzip -d the *gz files before building the overlap store. The >>>>> 'find' command in the log indicates the pipeline will pick up the >>>>> uncompressed files. You might need to remove the 'asm.ovlStore.list' file >>>>> before restarting (this has the list of inputs to overlapStoreBuild). >>>>> >>>>> 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it to >>>>> use 0MB memory, and instead use 1024 files regardless of the size. 512 >>>>> files will also work, and is a little safer (not near some Linux 'number of >>>>> open files' limits). >>>>> >>>>> 3) Build the overlap store by hand (with either the uncompressed >>>>> input, or the -f instead of -M option), outside the script, and then >>>>> restart the script. The script will notice there is an overlap store >>>>> already present, and skip the build. The command is in the log file -- >>>>> make sure the final store is called 'asm.ovlStore', and not >>>>> 'asm.ovlStore.BUILDING'. 
>>>>> >>>>> Option 1 should work, but option 2 is the easiest to try. I wouldn't >>>>> try option 3 until Sergey speaks up. >>>>> >>>>> b >>>>> >>>>> >>>>> >>>>> >>>>> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale < >>>>> san...@gm...> wrote: >>>>> >>>>>> Dear CA community, >>>>>> >>>>>> I'm running the correction of some PacBio reads with high-identity >>>>>> Illumina reads, in a high memory server, for a 750 Mbp genome. I've >>>>>> considered the known issues addressed in the website when starting the >>>>>> correction. >>>>>> >>>>>> When executing the pipeline, I've reached to the overlapStoreBuild >>>>>> step with 48 ovb files, size 26 Gb each (totaling 1.2Tb). ovls files have >>>>>> already been deleted by the script. The error happened while executing >>>>>> overlapStoreBuild: >>>>>> >>>>>> ... >>>>>> bucketizing DONE! >>>>>> overlaps skipped: >>>>>> 0 OBT - low quality >>>>>> 0 DUP - non-duplicate overlap >>>>>> 0 DUP - different library >>>>>> 0 DUP - dedup not requested >>>>>> terminate called after throwing an instance of 'std::bad_alloc' >>>>>> what(): std::bad_alloc >>>>>> >>>>>> Failed with 'Aborted' >>>>>> ... >>>>>> >>>>>> >>>>>> I ran this step twice: the first one having set ovlStoreMemory to >>>>>> 8192 Mb, but the second one, set it on 160000 (160 Gb). In the "Overlap >>>>>> store failure" FAQ, it mentioned as possible causes "Out of disk space" >>>>>> (which is not my case) and "Corrupt gzip files / too many fragments". I >>>>>> don't have gzip files and I have only 15 fragments. Also, bucketizing step >>>>>> finishes OK. >>>>>> >>>>>> Also, some odd thing I've noticed (at least odd for me) is that 14 of >>>>>> the 15 temp files (tmp.sort.XXX) of the asm.ovlStore.BUILDING folder have a >>>>>> size 79Gb while the last one size is 1.2Tb. >>>>>> >>>>>> Could anybody tell me what could be the cause of this error and how >>>>>> to solve it? >>>>>> >>>>>> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for >>>>>> complete descriptions of the error and the executed commands. >>>>>> >>>>>> Thank you very much in advance. >>>>>> >>>>>> Regards, >>>>>> Santiago >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk >>>>>> Solutions >>>>>> Find What Matters Most in Your Big Data with HPCC Systems >>>>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >>>>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >>>>>> http://p.sf.net/sfu/hpccsystems >>>>>> _______________________________________________ >>>>>> wgs-assembler-users mailing list >>>>>> wgs...@li... >>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>>>> >>>>>> >>>>> >>>> >>> >> >> ------------------------------------------------------------------------------ >> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions >> Find What Matters Most in Your Big Data with HPCC Systems >> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >> >> http://p.sf.net/sfu/hpccsystems_______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> >> > |
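Since the patched file itself was sent as an attachment and is not reproduced in the archive, the steps below are only a sketch of how such a one-off rebuild is usually applied to a wgs-assembler svn checkout; the checkout location and make invocation are assumptions, and, as Brian warns, the resulting binary is specific to this assembly and should not be reused.

    # Hedged sketch: drop Brian's patched overlapStoreBuild.C into an existing
    # source checkout and rebuild. CA_SRC is a hypothetical checkout path.
    CA_SRC=$HOME/wgs-assembler
    cp overlapStoreBuild.C "$CA_SRC"/src/AS_OVS/overlapStoreBuild.C   # the file Brian attached
    make -C "$CA_SRC"/src                                             # rebuild the assembler binaries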
From: Serge K. <ser...@gm...> - 2014-06-19 03:43:53
|
Hi, I don't believe the way the overlaps are created is the problem; the way the overlap store partitions them is. It looks like you have about 4X of PacBio data and about 150X of Illumina data. This is a larger difference than we normally use (we usually recommend no more than 50X of Illumina data and 10X+ PacBio), which is likely why this error has not been encountered before. The overlaps are only computed between the PacBio and Illumina reads, which are evenly distributed among the partitions, so the partitions should all have approximately the same number of overlaps. This is easy to confirm by checking whether all your overlap ovb files are approximately the same size, and your output log seems to confirm that they are. The overlap store bucketizing assumes an equal number of overlaps for each read in your dataset, and since your Illumina-Illumina overlaps do not exist, all the IIDs with overlaps end up in the last bucket. You've got 505,893 PacBio fragments and 1,120,240,607 Illumina reads. To split the PacBio reads among multiple partitions, you'd want to be able to open 10,000-20,000 files (partitions), which is above your current limit. If you can, raise it using ulimit -n 50000 and then run the store creation specifying -f 20480 (or some other large number); that should make your last partition significantly smaller. If you cannot increase the limit, then modifying the code is the only option. The good news is that once you are able to build the store, you can re-launch the PBcR pipeline and it will resume the correction after the overlapping step. Sergey The hash is only composed of the last set of reads (PacBio), and the reference sequences streamed against the hash are the Illumina data. On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote: > Unfortunately, I'm on vacation at the moment, and finding little time to spend helping you. > > "Too many open files" is a limit imposed by the OS. Can you increase this? We've set our large memory machines to allow 100,000 open files. > > The output files sizes -- and the problem you're suffering from -- are all caused by the way overlaps are created. Correction asked for only overlaps between Illumina and PacBio reads. All the illumina reads are 'first' in the store, and all the pacbio reads are at the end. Overlap jobs will find overlaps between 'other' reads and some subset of the store - e.g., the first overlap job will process the first 10% of the reads, the second will do the second 10% of the reads, etc. Since the pacbio are last, the last job found all the overlaps, so only the last file is of significant size. This also breaks the partitioning scheme used when sorting overlaps. It assumes overlaps are distributed randomly, but yours are all piled up at the end. > > I don't see an easy fix here, but I think I can come up with a one-off hack to get your store built. Are you comfortable working with C code and compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can see the number of reads per library. > > > > > > > On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale <san...@gm...> wrote: > Hi Brian, > > When using 1024, it said the OS wasn't able to handle it, and it recommended using 1008. > When using 1008, CA ended arguing "Failed to open output file... Too many open files". > > Now I'm trying with fewer parts, but I don't think this would solve the problem. > > Do you have any more ideas? > > Thanks again in advance.
> > Regards, > Santiago > > > On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale <san...@gm...> wrote: > Hi Brian, > > Thanks for your reply. In regards of your suggestions: > > 1) the PBcR process generates OVB files without zipping them; just to be sure, I've tried to unzip some of them just in case the extension were missing; > > 2) I've re-launched the process with the suggested parameters, but using 512 instead of 1024; the result was exactly the same: same error in the same step. Also, again 511 out of 512 files had a size of 2.3Gb while the last file was 1.2Tb long. Do you know why does this happens? > > I'm trying one last time using 1024 instead. > > Thanks again for your reply. I'm open to some more suggestions. > > Regards, > Santiago > > > On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> wrote: > Hi- > > This is a flaw in gzip, where it doesn't report the uncompressed size correctly for files larger than 2gb. I'm not intimately familiar with this pipeline, so don't know exactly how to implement the fixes below. > > Fix with either: > > 1) gzip -d the *gz files before building the overlap store. The 'find' command in the log indicates the pipeline will pick up the uncompressed files. You might need to remove the 'asm.ovlStore.list' file before restarting (this has the list of inputs to overlapStoreBuild). > > 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it to use 0MB memory, and instead use 1024 files regardless of the size. 512 files will also work, and is a little safer (not near some Linux 'number of open files' limits). > > 3) Build the overlap store by hand (with either the uncompressed input, or the -f instead of -M option), outside the script, and then restart the script. The script will notice there is an overlap store already present, and skip the build. The command is in the log file -- make sure the final store is called 'asm.ovlStore', and not 'asm.ovlStore.BUILDING'. > > Option 1 should work, but option 2 is the easiest to try. I wouldn't try option 3 until Sergey speaks up. > > b > > > > > On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale <san...@gm...> wrote: > Dear CA community, > > I'm running the correction of some PacBio reads with high-identity Illumina reads, in a high memory server, for a 750 Mbp genome. I've considered the known issues addressed in the website when starting the correction. > > When executing the pipeline, I've reached to the overlapStoreBuild step with 48 ovb files, size 26 Gb each (totaling 1.2Tb). ovls files have already been deleted by the script. The error happened while executing overlapStoreBuild: > > ... > bucketizing DONE! > overlaps skipped: > 0 OBT - low quality > 0 DUP - non-duplicate overlap > 0 DUP - different library > 0 DUP - dedup not requested > terminate called after throwing an instance of 'std::bad_alloc' > what(): std::bad_alloc > > Failed with 'Aborted' > ... > > I ran this step twice: the first one having set ovlStoreMemory to 8192 Mb, but the second one, set it on 160000 (160 Gb). In the "Overlap store failure" FAQ, it mentioned as possible causes "Out of disk space" (which is not my case) and "Corrupt gzip files / too many fragments". I don't have gzip files and I have only 15 fragments. Also, bucketizing step finishes OK. > > Also, some odd thing I've noticed (at least odd for me) is that 14 of the 15 temp files (tmp.sort.XXX) of the asm.ovlStore.BUILDING folder have a size 79Gb while the last one size is 1.2Tb. 
> > Could anybody tell me what could be the cause of this error and how to solve it? > > I'm attaching the asm.ovlStore.err and the pacBioToCA log files for complete descriptions of the error and the executed commands. > > Thank you very much in advance. > > Regards, > Santiago > > > > ------------------------------------------------------------------------------ > HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions > Find What Matters Most in Your Big Data with HPCC Systems > Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. > Leverages Graph Analysis for Fast Processing & Easy Data Exploration > http://p.sf.net/sfu/hpccsystems > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > > > > > ------------------------------------------------------------------------------ > HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions > Find What Matters Most in Your Big Data with HPCC Systems > Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. > Leverages Graph Analysis for Fast Processing & Easy Data Exploration > http://p.sf.net/sfu/hpccsystems_______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
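Serge's ulimit/-f suggestion and Brian's earlier option 2 amount to the same fix: trade memory-based bucketing for a fixed, much larger number of partition files so that the PacBio-heavy tail is spread across many small buckets instead of one 1.2 Tb one. A rough sketch of how that might be set up, assuming the store build is re-driven through the spec file; the exact overlapStoreBuild invocation differs by CA version and should be taken from the pacBioToCA log:

    # Hedged sketch: raise the open-file limit, then force file-count partitioning.
    ulimit -n 50000   # must exceed the partition count; may need a limits.conf change
    ulimit -n         # confirm the new limit actually took effect

    # In the spec file, per Brian's option 2 but with Serge's larger count
    # (the quoting is deliberate):
    #   ovlStoreMemory = 0 -f 20480
    # Then re-launch the original pacBioToCA command with the same spec; once
    # asm.ovlStore exists, the pipeline resumes the correction after overlapping.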