#140 extreme disk usage issue

Assembly_analysis
open
nobody
None
5
2015-05-18
2015-05-17
No

I am having an issue with wgs 8.3 using a very large amount of disk space while assembling a 2.6Gb mammal genome. It is currently in the 2nd overlap step and is using 14 TB of disk space. With this much disk in use I am sure something is wrong, but I have no idea what. Please help me understand whether this is reasonable and how I can reduce the amount of disk space used. Are there folders that are no longer needed and can be removed?

Inputs:

42Gb .frg file of "super reads", the output of the initial MaSuRCA steps (basically the PE-assembled contigs).

4 MP libraries, 5.2Gb, 11Gb, 32Gb, 39Gb.

5x PacBio sequence coverage.

Outputs:

du -h CA_4May14
0 CA_4May14/0-overlaptrim-overlap/002
0 CA_4May14/0-overlaptrim-overlap/001
63M CA_4May14/0-overlaptrim-overlap
0 CA_4May14/0-mertrim
40G CA_4May14/0-mercounts
4.5T CA_4May14/0-overlaptrim/peer_genome.obtStore
4.5T CA_4May14/0-overlaptrim
5.5M CA_4May14/runCA-logs
107G CA_4May14/peer_genome.gkpStore
1.6T CA_4May14/1-overlapper/001
1.6T CA_4May14/1-overlapper
7.3T CA_4May14/peer_genome.ovlStore.BUILDING
14T CA_4May14
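
To see at a glance where the space is going, the listing above can be ranked by size. The following is only a rough sketch: `parse_du` and `top_level` are hypothetical helper names, and the sizes are the rounded values that `du -h` printed, so the result is approximate.

```python
import re

# Binary multipliers for the suffixes that `du -h` prints.
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# The `du -h` listing from this post, copied verbatim.
DU_OUTPUT = """\
0 CA_4May14/0-overlaptrim-overlap/002
0 CA_4May14/0-overlaptrim-overlap/001
63M CA_4May14/0-overlaptrim-overlap
0 CA_4May14/0-mertrim
40G CA_4May14/0-mercounts
4.5T CA_4May14/0-overlaptrim/peer_genome.obtStore
4.5T CA_4May14/0-overlaptrim
5.5M CA_4May14/runCA-logs
107G CA_4May14/peer_genome.gkpStore
1.6T CA_4May14/1-overlapper/001
1.6T CA_4May14/1-overlapper
7.3T CA_4May14/peer_genome.ovlStore.BUILDING
14T CA_4May14
"""


def parse_du(text):
    """Return {path: approximate_bytes} for lines shaped like '4.5T some/path'."""
    sizes = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\d.]+)([KMGT]?)\s+(\S+)\s*$", line)
        if match:
            value, unit, path = match.groups()
            sizes[path] = float(value) * UNITS[unit]
    return sizes


def top_level(sizes, root):
    """Keep only the direct children of root, largest first."""
    prefix = root + "/"
    children = [
        (path, size)
        for path, size in sizes.items()
        if path.startswith(prefix) and "/" not in path[len(prefix):]
    ]
    return sorted(children, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sizes = parse_du(DU_OUTPUT)
    total = sizes["CA_4May14"]
    for path, size in top_level(sizes, "CA_4May14"):
        print(f"{size / total:6.1%}  {path}")
```

On these numbers, the overlap store being built (7.3T of 14T) accounts for roughly half of the total, and the trimming store (4.5T) for about a third.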

config file:

gkpFixInsertSizes=0
cgwErrorRate=0.15
ovlHashBits=25
ovlHashBlockLength=180000000
ovlCorrConcurrency=50
ovlConcurrency=50
ovlThreads=1
ovlRefBlockSize=406896261
ovlCorrBatchSize=40689626
doFragmentCorrection=1
utgErrorRate=0.03
bogBreakAtIntersections=0
unitigger=bogart
bogBadMateDepth=1000000
merylMemory=16192
merylThreads=50
mbtConcurrency=10
frgCorrThreads=1
frgCorrConcurrency=50
cnsConcurrency=13
doOverlapBasedTrimming=1
doExtendClearRanges=1
ovlMerSize=22
cgwCompressTigStore=1

/mnt/data3/macmanes/masurca/superReadSequences_shr.frg
/mnt/data3/macmanes/masurca/pacbio.frg
/mnt/data3/macmanes/masurca/ac.cor.clean.frg
/mnt/data3/macmanes/masurca/ad.cor.clean.frg
/mnt/data3/macmanes/masurca/ae.cor.clean.frg
/mnt/data3/macmanes/masurca/af.cor.clean.frg

Discussion

  • Matt MacManes

    Matt MacManes - 2015-05-17

    Sorry, I think I posted this in the wrong section, but as far as I can tell there is no way to move it to another section or delete it.

     
  • Sergey Koren

    Sergey Koren - 2015-05-18

    First, you can definitely erase 0-overlaptrim/*Store folders to free up space since you're done with trimming. There are potential changes you can make to decrease the space used but you would have to re-start the assembly. You can see how close to completing the overlap store building you are by checking the asm.ovlStore.err file.
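
Before erasing anything, it may help to list exactly which directories match `0-overlaptrim/*Store` and how large they are. This is a minimal dry-run sketch that deletes nothing; `dir_size` and `trimming_stores` are hypothetical helper names, and the assembly directory name `CA_4May14` is taken from the `du` listing in the original post.

```python
import glob
import os


def dir_size(path):
    """Sum the sizes of regular files under path (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def trimming_stores(assembly_dir):
    """Return sorted (path, bytes) pairs for every 0-overlaptrim/*Store directory."""
    pattern = os.path.join(assembly_dir, "0-overlaptrim", "*Store")
    return sorted(
        (path, dir_size(path)) for path in glob.glob(pattern) if os.path.isdir(path)
    )


if __name__ == "__main__":
    # Dry run only: print the candidates, remove nothing.
    for path, size in trimming_stores("CA_4May14"):
        print(f"{size / 1024**4:.2f} TiB  {path}")
```

Reviewing the printed list first, and removing the directories by hand afterwards, avoids deleting a store that a later step still needs.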

    The most likely cause of a large overlap store is repeats in the sequences. There was a recent question on the user group about a large overlap store for Illumina data. I'm paraphrasing most of the response below.

    You can drop shorter reads. The historical minimum is 64 bases, but you can set it higher depending on your sequence lengths.

    In addition to throwing out short reads, definitely increase the minimum overlap size (ovlMinLen) to the length of the shortest remaining read minus 1 (or minus 2, or ...).
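
The arithmetic behind the two suggestions above can be sketched as follows. This assumes the read lengths have already been extracted into a plain list (reading the .frg files is not shown), and `suggest_thresholds` is a hypothetical helper, not part of the assembler. Only `ovlMinLen` is an option named in this thread.

```python
def suggest_thresholds(read_lengths, min_read_len=64):
    """Drop reads shorter than min_read_len and derive ovlMinLen from the rest.

    ovlMinLen is set to (shortest kept read - 1), following the advice above.
    """
    kept = [length for length in read_lengths if length >= min_read_len]
    if not kept:
        raise ValueError("no reads left after filtering")
    shortest = min(kept)
    return {
        "reads_kept": len(kept),
        "reads_dropped": len(read_lengths) - len(kept),
        "shortest_kept": shortest,
        "ovlMinLen": shortest - 1,
    }


if __name__ == "__main__":
    # Toy lengths for illustration only, not data from this assembly.
    result = suggest_thresholds([50, 64, 100, 150], min_read_len=100)
    print(f"ovlMinLen={result['ovlMinLen']}")
```

Raising `min_read_len` trades a few discarded reads for a higher `ovlMinLen`, which should mean fewer short overlaps being stored.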

    What kmer threshold did it pick (0-mercounts, one of the *err files)? Can you send the histogram file? Plotting the first two columns should show a definite hump at the expected coverage, with a large tail. Any humps after that are repeats that probably should be excluded from seeding overlaps. Be sure to check way out on the X axis, with Y zoomed in, for any very common repeats.
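
If plotting is inconvenient, the humps can also be located roughly in code. The sketch below assumes the histogram file is whitespace-separated with the kmer count in the first column and the number of kmers with that count in the second; that layout is an assumption, so check it against the actual file. `parse_histogram` and `find_humps` are hypothetical names, and the simple window test will be fooled by a noisy tail, so treat the result as a hint for where to zoom in.

```python
def parse_histogram(text):
    """Return [(kmer_count, num_kmers)] from the first two columns of each line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            rows.append((int(fields[0]), int(fields[1])))
        except ValueError:
            continue  # skip headers or malformed lines
    return rows


def find_humps(rows, window=2):
    """Return (coverage_peak, later_humps) as kmer-count values.

    The initial descent (mostly error kmers) is skipped; after that, a point
    is a hump if it is the largest value within +/- window rows.
    """
    counts = [num for _count, num in rows]
    valley = 0
    while valley + 1 < len(counts) and counts[valley + 1] < counts[valley]:
        valley += 1
    humps = []
    for i in range(valley + 1, len(counts)):
        lo = max(valley, i - window)
        hi = min(len(counts), i + window + 1)
        neighbourhood = counts[lo:hi]
        if counts[i] == max(neighbourhood) and counts[i] > min(neighbourhood):
            humps.append(rows[i][0])
    if not humps:
        return None, []
    return humps[0], humps[1:]


if __name__ == "__main__":
    # Toy histogram for illustration only, not data from this assembly.
    values = [1000, 400, 100, 50, 80, 200, 350, 200, 90, 40, 20, 10, 25, 60, 25, 10]
    toy = "\n".join(f"{i + 1} {v}" for i, v in enumerate(values))
    peak, repeats = find_humps(parse_histogram(toy))
    print("coverage peak at", peak, "later humps at", repeats)
```

The first hump is taken to be the expected-coverage peak and any later ones are reported as candidate repeats, mirroring the reading of the plot described above.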

    Do you have a (cumulative) histogram of read lengths?
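
A cumulative histogram of read lengths is cheap to compute once the lengths are in a list. A minimal sketch, assuming the lengths have already been extracted from the input files; `cumulative_length_histogram` is a hypothetical helper name.

```python
from collections import Counter


def cumulative_length_histogram(read_lengths):
    """Return [(length, reads_at_or_below, fraction_at_or_below)], sorted by length."""
    counts = Counter(read_lengths)
    total = len(read_lengths)
    running = 0
    rows = []
    for length in sorted(counts):
        running += counts[length]
        rows.append((length, running, running / total))
    return rows


if __name__ == "__main__":
    # Toy lengths for illustration only, not data from this assembly.
    for length, reads, fraction in cumulative_length_histogram([100, 50, 100, 150]):
        print(f"{length:>6} {reads:>6} {fraction:6.1%}")
```

The fraction column shows directly how many reads a given minimum length cutoff would discard.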

    Any chance there is adapter sequence present?
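
A quick way to check is to count how many reads contain the start of the adapter in either orientation. This is only a sketch: `adapter_fraction` is a hypothetical helper, the reads are assumed to be plain sequence strings already extracted from the input files, and the adapter shown is the common Illumina TruSeq prefix, used here as an assumption because the thread does not say which library kit was used.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def adapter_fraction(reads, adapter):
    """Fraction of reads containing the adapter or its reverse complement."""
    if not reads:
        return 0.0
    forward = adapter.upper()
    reverse = reverse_complement(forward)
    hits = sum(
        1 for read in reads if forward in read.upper() or reverse in read.upper()
    )
    return hits / len(reads)


if __name__ == "__main__":
    # Assumed adapter prefix (Illumina TruSeq); substitute the one for your kit.
    ADAPTER = "AGATCGGAAGAGC"
    # Toy reads for illustration only, not data from this assembly.
    reads = ["ACGTAGATCGGAAGAGCTT", "ACGTACGT", "GCTCTTCCGATCTAAA"]
    print(f"{adapter_fraction(reads, ADAPTER):.1%} of reads contain adapter")
```

Exact substring matching will miss adapters with sequencing errors, so a low fraction here does not rule contamination out.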

    The most recent versions of CA have been optimized for PacBio (long) sequences which makes the data structures take more space for the short Illumina reads.

     
