Re: [wgs-assembler-users] Possibility to improve assembly result


Hi, Xueping-

That’s frustrating!  Can you send along the qc report?

We’re just finishing up a repetitive fish.  We had some success changing the ‘astat’ cutoffs for labeling unitigs unique/not-unique.  We used astatHighBound=0 and astatLowBound=-20 based on a plot of unitig length vs astat (numbers came from 5-consensus-coverage-stat, but I didn’t do the analysis and would have to pester someone to get any scripts to pass along).  If there are large degenerate contigs, this will help by labeling them as unique and letting them be used for scaffolds.
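
For reference, a minimal sketch of how those two cutoffs would look as spec-file lines. The option names and values are the ones quoted above; where they sit in the file is arbitrary, and the values came from our fish data, so treat them as a starting point rather than something tuned for your genome:

```
# A-stat cutoffs for labeling unitigs unique/not-unique
# (values from a repetitive fish assembly; an assumption for any other genome)
astatLowBound  = -20
astatHighBound = 0
```

Ideally the numbers would come from your own plot of unitig length vs astat.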

Or, it’s possible that unitig construction was poor.  I’ll have to think about how to measure this — are they small because of bad trimming, low coverage, biased coverage or repeat boundaries?  The signal for all of these looks basically the same, but the resolution is quite different.

Sorry I’m not much help yet.

b

On 11/20/12 6:05 AM, "Quan, Xueping" <x....@im...> wrote:

Dear All

I have a large plant genome (3.5 Gb) with high repeat content (more than 60%). The sequencing data are about 45x Illumina paired-end and mate-pair data (after data cleaning) and 0.5x 454 mate-pair data. I have finished the assembly using Celera Assembler. However, the assembled contigs (600 Mb) and scaffolds (660 Mb) cover very little of the genome. Most of the unitig sequence (about 5 Gb) failed to be placed in any scaffold. Below is my spec file. Could anyone suggest how to improve the assembly?

"
utgGraphErrorRate=0.03  # bogart uses utgGraphErrorRate, utgGraphErrorLimit, utgMergeErrorRate, utgMergeErrorLimit
utgGraphErrorLimit=3.25  #
utgMergeErrorRate=0.045
utgMergeErrorLimit=5.25
ovlErrorRate=0.04 # Larger than utg to allow for correction.
cnsErrorRate=0.08 # Larger than utg to avoid occasional consensus failures
cgwErrorRate=0.10 # Larger than utg to allow contig merges across high-error ends
gkpAllowInefficientStorage=1

#
frgMinLen=64 # fragments shorter than this length are not loaded into the assembler
ovlMinLen=40 # overlaps shorter than this length are not computed
#
merSize =22 # default=22; use lower to combine across heterozygosity, higher to separate near-identical repeat copies
overlapper=ovl # the mer overlapper for 454-like data is insensitive to homopolymer problems but requires more RAM and disk

#UNITIGGER configuration
unitigger = bogart
batMemory=650
utgBubblePopping = 1
batThreads=64

# utgGenomeSize = 3.5gb
#
#  MERYL calculates K-mer seeds
merylMemory   = 512000
merylThreads    = 32
#
#  OVERLAPPER calculates overlaps
ovlHashBits=24
ovlHashBlockLength=700000000
ovlThreads          = 2
ovlConcurrency      = 32
ovlRefBlockSize  = 320000000
#
#  OVERLAP STORE build the database
ovlStoreMemory = 109210 # Mbp

# ERROR CORRECTION not applied to overlaps
doFragmentCorrection=0

# Scaffolder

# CONSENSUS configuration
cnsConcurrency   = 64

L1_GAIIx.frg
L2_GAIIx.frg
L3_GAIIx.frg
L4_GAIIx.frg
L5_GAIIx.frg
L6_GAIIx.frg
L7_GAIIx.frg
L8_GAIIx.frg
L3_HiSeq.frg
L4_HiSeq.frg
L5_HiSeq.frg
L6_HiSeq.frg
L1_454.frg
L2_454.frg
L3_454.frg
L4_454.frg
L5_454.frg
L6_454.frg
L7_454.frg
L8_454.frg
L9_454.frg
L10_454.frg
"

Thanks very much!

Xueping Quan

Imperial College London
Tel: +44(0)207 594 17 80