I am having an issue with wgs 8.3 using a very large amount of disk space while assembling a 2.6Gb mammal genome. It is currently doing the 2nd overlap step, and is using 14Tb of HD space. With this insane amount of disk space I am sure something is wrong, but I have no idea what. Please help me understand if this is reasonable and how I can reduce the amount of disk space used. Are there folders that are no longer needed and can be removed?
Inputs:
42Gb .frg file of "super reads", the output of the beginning MaSuRCA steps - basically the PE assembled contigs.
4 MP libraries, 5.2Gb, 11Gb, 32Gb, 39Gb.
5x pacbio sequence coverage.
Outputs:
du -h CA_4May14
0 CA_4May14/0-overlaptrim-overlap/002
0 CA_4May14/0-overlaptrim-overlap/001
63M CA_4May14/0-overlaptrim-overlap
0 CA_4May14/0-mertrim
40G CA_4May14/0-mercounts
4.5T CA_4May14/0-overlaptrim/peer_genome.obtStore
4.5T CA_4May14/0-overlaptrim
5.5M CA_4May14/runCA-logs
107G CA_4May14/peer_genome.gkpStore
1.6T CA_4May14/1-overlapper/001
1.6T CA_4May14/1-overlapper
7.3T CA_4May14/peer_genome.ovlStore.BUILDING
14T CA_4May14
config file:
gkpFixInsertSizes=0
cgwErrorRate=0.15
ovlHashBits=25
ovlHashBlockLength=180000000
ovlCorrConcurrency=50
ovlConcurrency=50
ovlThreads=1
ovlRefBlockSize=406896261
ovlCorrBatchSize=40689626
doFragmentCorrection=1
utgErrorRate=0.03
bogBreakAtIntersections=0
unitigger=bogart
bogBadMateDepth=1000000
merylMemory=16192
merylThreads=50
mbtConcurrency=10
frgCorrThreads=1
frgCorrConcurrency=50
cnsConcurrency=13
doOverlapBasedTrimming=1
doExtendClearRanges=1
ovlMerSize=22
cgwCompressTigStore=1
/mnt/data3/macmanes/masurca/superReadSequences_shr.frg
/mnt/data3/macmanes/masurca/pacbio.frg
/mnt/data3/macmanes/masurca/ac.cor.clean.frg
/mnt/data3/macmanes/masurca/ad.cor.clean.frg
/mnt/data3/macmanes/masurca/ae.cor.clean.frg
/mnt/data3/macmanes/masurca/af.cor.clean.frg
Sorry - I think I posted this in the wrong section, but as far as I can tell there is no way to move it to another section or delete it.
First, you can definitely erase 0-overlaptrim/*Store folders to free up space since you're done with trimming. There are potential changes you can make to decrease the space used but you would have to re-start the assembly. You can see how close to completing the overlap store building you are by checking the asm.ovlStore.err file.
The most likely cause of a large overlap store is repeats in the sequences. There was a recent question on the user group about a large overlap store for Illumina data. I'm paraphrasing most of the response below.
You can drop shorter reads. The historical minimum is 64 bases, but you can set it higher depending on your sequence lengths.
In addition to throwing out short reads, definitely increase the minimum overlap size (ovlMinLen) to the length of the shortest read minus 1 (or 2 or ...).
What kmer threshold did it pick (0-mercounts, one of the *err files)? Can you send the histogram file? Plotting the first two columns should show a definite hump at the expected coverage, with a large tail. Any humps after that are repeats that probably should be excluded from seeding overlaps. Be sure to check way out on the X axis, with Y zoomed in, for any very common repeats.
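As a rough aid for reading that histogram before plotting it, here is a minimal Python sketch. It assumes the file is whitespace-separated text whose first two columns are the k-mer occurrence count and the number of distinct k-mers with that count; the function names are made up for illustration and are not part of CA or meryl. It finds the coverage hump by skipping the initial drop from error k-mers, then lists local maxima further out, which are candidates for repeat humps. Real histograms are noisy, so treat the result as a hint on where to zoom in, not as a replacement for looking at the plot.

```python
def parse_histogram(text):
    """Return (count, n_kmers) pairs from the first two columns.

    Assumption: whitespace-separated columns; lines that do not start
    with two integers (headers, blanks) are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            rows.append((int(fields[0]), int(fields[1])))
        except ValueError:
            continue
    return rows


def coverage_peak(rows):
    """Return the k-mer count at the main hump.

    Walks down the initial slope of low-count (mostly error) k-mers to the
    first valley, then takes the tallest bin from the valley onward.
    """
    valley = 0
    while valley + 1 < len(rows) and rows[valley + 1][1] < rows[valley][1]:
        valley += 1
    return max(rows[valley:], key=lambda r: r[1])[0]


def later_humps(rows, peak):
    """Return k-mer counts beyond the main hump that are strict local maxima."""
    humps = []
    for i in range(1, len(rows) - 1):
        count, n = rows[i]
        if count > peak and n > rows[i - 1][1] and n > rows[i + 1][1]:
            humps.append(count)
    return humps


if __name__ == "__main__":
    # Toy data for illustration only, not taken from a real assembly.
    toy = "1 1000\n2 400\n3 100\n4 150\n5 300\n6 500\n7 300\n8 100\n9 50\n10 80\n11 40\n12 10\n"
    rows = parse_histogram(toy)
    peak = coverage_peak(rows)
    print("coverage hump at count", peak)
    print("later humps at counts", later_humps(rows, peak))
```

To use it on real data, read the histogram file from 0-mercounts into a string and pass it to parse_histogram.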
Do you have a (cumulative) histogram of read lengths?
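If you don't have one handy, a cumulative read-length histogram can be built with a few lines of Python. This is only a sketch: it assumes the reads are available as plain 4-line FASTQ records (the .frg inputs would need a different parser), and the function names are hypothetical. It also reports what fraction of reads a given minimum length would drop, which is useful when picking a minimum read length and a matching ovlMinLen.

```python
from collections import Counter


def fastq_lengths(lines):
    """Yield sequence lengths, assuming strict 4-line FASTQ records."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each record is the sequence
            yield len(line.strip())


def cumulative_length_histogram(lengths):
    """Return [(length, number of reads with length <= this length)]."""
    counts = Counter(lengths)
    running = 0
    cumulative = []
    for length in sorted(counts):
        running += counts[length]
        cumulative.append((length, running))
    return cumulative


def fraction_dropped(cumulative, min_len):
    """Return the fraction of reads strictly shorter than min_len."""
    if not cumulative:
        return 0.0
    total = cumulative[-1][1]
    dropped = 0
    for length, running in cumulative:
        if length >= min_len:
            break
        dropped = running
    return dropped / total


if __name__ == "__main__":
    # Toy records for illustration only.
    toy = ["@r1", "ACGT", "+", "IIII",
           "@r2", "ACGTACGT", "+", "IIIIIIII",
           "@r3", "ACGT", "+", "IIII",
           "@r4", "AC", "+", "II"]
    cumulative = cumulative_length_histogram(fastq_lengths(toy))
    for length, running in cumulative:
        print(length, running)
    print("fraction dropped at min length 4:", fraction_dropped(cumulative, 4))
```

For a real file, pass the open file handle to fastq_lengths instead of the toy list.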
Any chance there is adapter present?
The most recent versions of CA have been optimized for PacBio (long) sequences which makes the data structures take more space for the short Illumina reads.