Whole-Genome Shotgun Assembler / Feature Requests / #140 extreme disk usage issue

#140 extreme disk usage issue

Milestone: Assembly_analysis

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2015-05-18

Created: 2015-05-17

Creator: Matt MacManes

Private: No

I am having an issue with wgs 8.3 using a very large amount of disk space while assembling a 2.6 Gb mammal genome. It is currently doing the 2nd overlap step and is using 14 TB of disk space. With this insane amount of disk space I am sure something is wrong, but I have no idea what. Please help me understand whether this is reasonable and how I can reduce the amount of disk space used. Are there folders that are no longer needed and can be removed?

Inputs:

42Gb .frg file of "super reads", the output of the beginning Masurca steps - basically the PE assembled contigs.

4 MP libraries, 5.2Gb, 11Gb, 32Gb, 39Gb.

5x pacbio sequence coverage.

Outputs:

du -h CA_4May14
0 CA_4May14/0-overlaptrim-overlap/002
0 CA_4May14/0-overlaptrim-overlap/001
63M CA_4May14/0-overlaptrim-overlap
0 CA_4May14/0-mertrim
40G CA_4May14/0-mercounts
4.5T CA_4May14/0-overlaptrim/peer_genome.obtStore
4.5T CA_4May14/0-overlaptrim
5.5M CA_4May14/runCA-logs
107G CA_4May14/peer_genome.gkpStore
1.6T CA_4May14/1-overlapper/001
1.6T CA_4May14/1-overlapper
7.3T CA_4May14/peer_genome.ovlStore.BUILDING
14T CA_4May14

config file:

gkpFixInsertSizes=0
cgwErrorRate=0.15
ovlHashBits=25
ovlHashBlockLength=180000000
ovlCorrConcurrency=50
ovlConcurrency=50
ovlThreads=1
ovlRefBlockSize=406896261
ovlCorrBatchSize=40689626
doFragmentCorrection=1
utgErrorRate=0.03
bogBreakAtIntersections=0
unitigger=bogart
bogBadMateDepth=1000000
merylMemory=16192
merylThreads=50
mbtConcurrency=10
frgCorrThreads=1
frgCorrConcurrency=50
cnsConcurrency=13
doOverlapBasedTrimming=1
doExtendClearRanges=1
ovlMerSize=22
cgwCompressTigStore=1

/mnt/data3/macmanes/masurca/superReadSequences_shr.frg
/mnt/data3/macmanes/masurca/pacbio.frg
/mnt/data3/macmanes/masurca/ac.cor.clean.frg
/mnt/data3/macmanes/masurca/ad.cor.clean.frg
/mnt/data3/macmanes/masurca/ae.cor.clean.frg
/mnt/data3/macmanes/masurca/af.cor.clean.frg

Discussion

Matt MacManes - 2015-05-17

Sorry - I think I posted this in the wrong section, but as far as I can tell there is no way to move it to another section or delete it.

Sergey Koren - 2015-05-18

First, you can definitely erase the 0-overlaptrim/*Store folders to free up space, since you're done with trimming. There are other changes you could make to decrease the space used, but you would have to restart the assembly. You can see how close the overlap store build is to completing by checking the asm.ovlStore.err file.
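
As a rough way to see how much space that cleanup would free, the du -h listing from the original post can be parsed and the matching paths summed. This is only a sketch: parse_size and reclaimable are hypothetical helper names, and it assumes du -h prints 1024-based K/M/G/T suffixes.

```python
from fnmatch import fnmatch

# Assumption: `du -h` uses 1024-based unit suffixes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a `du -h` size such as '4.5T' or '63M' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)  # plain number, e.g. '0'


def reclaimable(du_lines, pattern):
    """Sum the sizes of all paths in `du` output that match a glob pattern."""
    total = 0.0
    for line in du_lines:
        size, path = line.split(None, 1)
        if fnmatch(path.strip(), pattern):
            total += parse_size(size)
    return total
```

Called on the listing above with the pattern "*/0-overlaptrim/*Store", only the obtStore line matches, so the estimate is the 4.5T reported for that folder.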

The most likely cause of a large overlap store is repeats in the sequences. There was a recent question on the user group about a large overlap store for Illumina data. I'm paraphrasing most of the response below.

You can drop shorter reads. The historical minimum is 64 bases, but you can set it higher depending on your sequence lengths.

In addition to throwing out short reads, definitely increase the minimum overlap size (ovlMinLen) to the length of the shortest remaining read minus 1 (or 2 or ...).
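
The two suggestions above can be tried on paper before restarting anything. The sketch below assumes the read lengths are already available as a list of integers; trimming_plan, min_read_len and margin are hypothetical names, and only ovlMinLen comes from the discussion.

```python
def trimming_plan(lengths, min_read_len, margin=1):
    """Show the effect of a read-length cutoff and derive an ovlMinLen value.

    lengths      -- iterable of read lengths (bases)
    min_read_len -- hypothetical cutoff; shorter reads would be dropped
    margin       -- how far below the shortest kept read to set ovlMinLen
    """
    lengths = list(lengths)
    kept = [n for n in lengths if n >= min_read_len]
    if not kept:
        raise ValueError("no reads survive the length cutoff")
    return {
        "reads_kept": len(kept),
        "reads_dropped": len(lengths) - len(kept),
        # Shortest surviving read, minus the margin (1 by default).
        "ovlMinLen": min(kept) - margin,
    }
```

For example, with lengths [40, 64, 100, 101, 150] and a cutoff of 100, two reads are dropped and the suggested ovlMinLen is 99.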

What kmer threshold did it pick (0-mercounts, one of the *err files)? Can you send the histogram file? Plotting the first two columns should show a definite hump at the expected coverage, with a large tail. Any humps after that are repeats that probably should be excluded from seeding overlaps. Be sure to check way out on the X axis, with Y zoomed in, for any very common repeats.
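
If plotting is inconvenient, the same inspection can be approximated in a few lines. This is a sketch under assumptions: it expects a header-free text file whose first two whitespace-separated columns are the k-mer count and the number of k-mers with that count, and the function names are made up for illustration. The local-maximum test is crude and will flag noise in a real histogram, so treat its output as places to look, not as a repeat list.

```python
def read_histogram(lines):
    """Parse the first two columns of each line as (kmer_count, n_kmers)."""
    points = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            points.append((int(fields[0]), int(fields[1])))
    return sorted(points)


def coverage_peak(points):
    """Return the k-mer count at the main hump, ignoring the low-count spike."""
    # Walk right while the curve is still falling; where it stops falling
    # is the valley between the error k-mers and the coverage hump.
    i = 0
    while i + 1 < len(points) and points[i + 1][1] < points[i][1]:
        i += 1
    return max(points[i:], key=lambda p: p[1])[0]


def later_humps(points, peak):
    """Return counts beyond the main peak that are strict local maxima."""
    humps = []
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        if cur[0] > peak and cur[1] > prev[1] and cur[1] > nxt[1]:
            humps.append(cur[0])
    return humps
```

Anything returned by later_humps far to the right of the coverage peak is a candidate for the very common repeats mentioned above.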

Do you have a (cumulative) histogram of read lengths?
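
One way to produce such a histogram is sketched below. It assumes the reads are available as plain four-line FASTQ records (the .frg inputs in this ticket would need to be converted or parsed differently), and both function names are hypothetical.

```python
from collections import Counter


def fastq_lengths(handle):
    """Yield read lengths from a FASTQ stream with 4 lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of each record is the sequence
            yield len(line.strip())


def cumulative_histogram(lengths):
    """Return rows of (length, reads at or below this length, fraction)."""
    counts = Counter(lengths)
    total = sum(counts.values())
    running = 0
    rows = []
    for length in sorted(counts):
        running += counts[length]
        rows.append((length, running, running / total))
    return rows
```

Reading the rows from the top shows what fraction of reads a given minimum-length cutoff would discard.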

Any chance there is adapter present?
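
A quick check for this is to count how many reads contain the adapter sequence used in library preparation. The sketch below is deliberately simple: it only finds exact, full-length matches on either strand, the adapter string must be supplied by the caller, and adapter_fraction is a hypothetical name. A dedicated trimming tool will be more sensitive.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def adapter_fraction(reads, adapter):
    """Fraction of reads containing the adapter or its reverse complement."""
    adapter = adapter.upper()
    rc = reverse_complement(adapter)
    total = hits = 0
    for read in reads:
        total += 1
        read = read.upper()
        if adapter in read or rc in read:
            hits += 1
    return hits / total if total else 0.0
```

Running it on a sample of a few hundred thousand reads per library is usually enough to tell whether adapter is a problem.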

The most recent versions of CA have been optimized for PacBio (long) sequences which makes the data structures take more space for the short Illumina reads.
