
#337 Too large output when running runCorrection.sh

Labels: correction
Status: open
Owner: nobody
Milestone: None
Priority: 5
Updated: 2015-12-14
Created: 2015-12-08
Private: No

Hello.
I am using the PBcR pipeline to assemble a genome from PacBio data. The genome is highly heterozygous and its size is about 1.5~2 Gb. I am using ~19 Gb (~10X) of PacBio reads to perform self-correction and assembly.
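For reference, the ~10X figure follows from dividing total read bases by genome size. A quick sketch of that estimate from a FASTQ (the inline demo file is a stand-in; point the awk line at the real pacbio.fastq):

```shell
# Estimate coverage = total read bases / genome size.
# A tiny inline FASTQ stands in for the real pacbio.fastq.
cat > demo.fastq <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTA
+
IIIII
EOF
# In FASTQ, every 2nd of each 4-line record is the sequence line.
total=$(awk 'NR % 4 == 2 { n += length($0) } END { print n }' demo.fastq)
echo "total bases: $total"   # 15 for this demo
# Real data: 19e9 bases / ~1.75e9 bp genome ~= 10.9X
```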

When the process reaches “runCorrection.sh”, it generates large output files, all named “asm.[1-100].shortmap.var”, without any error or warning in “asm.layout.err”. So far these files account for 4.6 TB on my disk and keep growing... Is something wrong, or is this normal for a large genome assembly? Output of >4.6 TB is too large!

Here is my main script:
…/wgs-8.3rc2/Linux-amd64/bin/PBcR -length 500 -partitions 100 -libraryname xjp24 -threads 7 -fastq pacbio.fastq -s pacbio.spec

And the pacbio.spec is:

asmUtgErrorRate=0.10
asmCnsErrorRate=0.10
asmCgwErrorRate=0.10
asmOBT=1
asmObtErrorRate=0.08
asmObtErrorLimit=4.5
utgGraphErrorRate=0.05
utgMergeErrorRate=0.05
ovlHashBits=24
ovlHashLoad=0.80

merSize = 14

merylMemory = 32000
merylThreads = 8

ovlStoreMemory = 32000
ovlMemory = 32

useGrid = 0
scriptOnGrid = 0
frgCorrOnGrid = 0
ovlCorrOnGrid = 0

ovlHashBits = 25
ovlThreads = 3
ovlHashBlockLength = 1000000000
ovlRefBlockSize = 1000000000

frgCorrThreads = 10
frgCorrBatchSize = 100000

ovlCorrBatchSize = 500000

ovlConcurrency = 10
cnsConcurrency = 10
frgCorrConcurrency = 10
ovlCorrConcurrency = 10
cnsConcurrency = 10

I have also attached the “runCorrection.sh” and “asm.layout.err” files here.

Would you please help check them and give some suggestions about the large output? Are there more appropriate parameter settings to reduce output for a large genome assembly (for example, assembling a 2 Gb genome with ~50X PacBio data)?

Thank you very much!

2 Attachments

Discussion

  • Sergey Koren

    Sergey Koren - 2015-12-08

    Generally, 50X of a human genome requires 2-4 TB to complete. However, this will vary based on your genome's repeat content and settings. I see you are using a merSize of 14; increasing it to the default of 16 would definitely reduce the number of overlaps and the space used.
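    That suggestion amounts to a one-line change in pacbio.spec, following the same `option = value` syntax as the rest of the file shown above:

    ```
    merSize = 16
    ```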

    That said, if you only have 10X of data, you're probably not going to get a very good assembly. I would recommend 20X+ at a minimum with the low-coverage settings.

    • shichengcheng

      shichengcheng - 2015-12-09

      Thanks a lot for your kind reply!!! I will add data to reach 20X+, change the merSize to 16, and re-run the pipeline.

  • shichengcheng

    shichengcheng - 2015-12-14

    Hi,
    I changed the merSize to 16 and re-ran the pipeline, but it seems that merSize 16 does not help when running "runCorrection.sh"... the script has not finished yet and has already generated 5.6 TB of output...

    Are there any other options or suggestion to reduce the outputs?
    Thank you very much!

  • Sergey Koren

    Sergey Koren - 2015-12-14

    I looked through the asm.layout.err you posted earlier. It seems that either the genome is very repetitive or most of the overlaps are coming from repeat regions because of the low coverage. Specifically, the file says:
    Picking cutoff as 3197 mean would be 554.784298 +- 998.996804 (2553)
    Are you sure your PacBio data have good representation of the genome and even coverage?

    The high average number of mappings per sequence, over 3000 when you only have 10-20X coverage, is most likely responsible for inflating your output size.

    You can try a couple of things. First, you can set a genome size in the run, which will limit the cutoff that can be chosen. Since you already have a run, you can edit runCorrection.sh to be:

    correctPacBio \
          -L \
          -l 500 \
          -C 20 \
          -t 20 \
          -p 100 \
          -o asm \
          -O asm.ovlStore \
          -G asm.gkpStore \
          -e 0.35 -c 0.35 -E 6.5 > asm.layout.err 2> asm.layout.err && touch asm.layout.success

    This will limit the cutoff to 20 (you should set it to your coverage) and also removes the -M flag, which will reduce the output size but may shorten your corrected reads slightly. Other than this, there isn't much you can do to reduce the file size; if your genome really is as repetitive as the logs indicate, it will require more space than a human assembly.
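    Those two edits can also be scripted. A hedged sketch (the demo file name and the assumption that -M carries a numeric argument are illustrative; adjust the patterns to match your actual runCorrection.sh):

    ```shell
    # Demo stand-in for the generated script; point sed at your real runCorrection.sh.
    printf 'correctPacBio -L -l 500 -C 3197 -M 2 -p 100 -o asm\n' > demo_runCorrection.sh
    # Cap the repeat cutoff at your sequencing coverage (20 here) and drop -M:
    sed -i.bak -e 's/-C [0-9][0-9]*/-C 20/' -e 's/ -M [0-9][0-9]*//' demo_runCorrection.sh
    cat demo_runCorrection.sh   # correctPacBio -L -l 500 -C 20 -p 100 -o asm
    ```

    For a fresh run, supplying the genome size up front (e.g. a genomeSize entry in the spec file, if your PBcR version supports it) would bound the cutoff without hand-editing the generated script.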

  • shichengcheng

    shichengcheng - 2015-12-14

    Many thanks!
    First, I checked the input data: the total number of PacBio bases used in this test amounts to less than 20X.
    As you can see, this genome is very complicated because it not only contains many repeat sequences but is also highly heterozygous. So I will change runCorrection.sh and try again.

