From: Serge K. <ser...@gm...> - 2015-06-19 22:03:33
Hi,

Fundamentally, it is the same approach as for non-low-coverage data. Overlaps are computed with MHAP using more sensitive parameters than the default (the same k-mer size but a larger sketch size, which decreases the chance of missing an overlap). The PBDAGCON step then builds the consensus, but the coverage threshold to keep a base is reduced from the default of 4-fold to 2-fold. Most bases have higher coverage than this, but the lowered threshold is what keeps many more sequences at this coverage level while still trimming artifacts supported by only a single sequence (see the toy sketch at the end of this message). The worst sequences will remain relatively high-error, but the median error rate of the sequences is significantly reduced, since the majority of them have more than 2-fold coverage correcting them. They may still have some regions of higher error. The assembly step will trim the corrected sequences again to remove any artifacts not filtered by the initial correction.

As far as I know, HGAP performs an all-vs-longest brute-force alignment (using BLASR, which is why it is computationally expensive). There is an index built on the longest sequences, but the same is true for pretty much all methods (including PBcR, which uses the MinHash sketch as its index). PBcR takes a similar approach and tries to correct the longest 40X of the data using all of the data (i.e. all data mapped to the longest 40X), but since you have less than 40X it is effectively all-vs-all. PBcR will also use partial overlaps when doing a correction (that is, part of a read contained in part of another read, not only fully contained reads as in the HGAP default). There is a global filter which only allows each sequence to map to its best coverage positions, where best is based on size and identity.

Serge

> On Jun 18, 2015, at 10:39 AM, mic...@ip... wrote:
>
> Dear Serge,
>
> Could you give me some information about how PBcR does the error correction (especially for low coverage)?
> This might sound like a bold question, but I have to ask since I could not find any detailed information about it.
>
> I fed PBcR with 22x PacBio data of a 1.3 Gb genome (low coverage settings) and it returned 15x of error-corrected reads. This result is amazing (even when considering the quality to be "only" 97-98% instead of 99%+).
>
> I know that overlaps are found using your MHAP aligner and that those overlaps are fed to PBDAGCON to create the consensus, which then results in high-confidence base information along the whole sequence.
>
> Does PBcR (like HGAP) use long sequences as initial "references" for the alignments, or is it just brute-force all-against-all alignment, piling up the overlaps to get as much coverage per position as possible?
>
> Is there a lower coverage threshold for consensus calling at a given position of the read?
>
> Those questions relate more to PBDAGCON, for which I could not find much information. Maybe you could point me to some information about PBDAGCON or briefly explain its settings in PBcR.
>
> Thank you,
>
> Michel
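A minimal toy sketch of the two ideas from the reply above, not the actual PBcR/MHAP code: the names sketch_size and min_cov are stand-ins for the real options, and the hashing and consensus are deliberately simplified. It only illustrates how a larger MinHash sketch makes it less likely to miss a true overlap, and how lowering the per-column coverage threshold from 4 to 2 keeps more of the corrected sequence.

# Hedged illustration only; parameter names are hypothetical, not MHAP/PBcR options.
from collections import Counter

def minhash_sketch(seq, k=16, sketch_size=512, seed=0):
    """MinHash sketch: one minimum hash value per hash function.
    A larger sketch_size gives a finer Jaccard estimate, so true overlaps
    between noisy reads are less likely to be missed (at higher cost)."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return [min(hash((h + seed, km)) for km in kmers) for h in range(sketch_size)]

def sketch_similarity(s1, s2):
    """Fraction of matching sketch entries, an estimate of the Jaccard
    similarity of the two reads' k-mer sets."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

def call_consensus(columns, min_cov=2):
    """Per-column consensus with a minimum coverage threshold.
    columns: list of lists, each inner list holding the bases of the reads
    aligned to that position of the read being corrected. Positions with
    fewer than min_cov supporting bases are dropped (trimmed), which is why
    lowering the threshold from 4 to 2 keeps far more sequence at ~15-22x."""
    out = []
    for col in columns:
        if len(col) >= min_cov:
            base, _ = Counter(col).most_common(1)[0]
            out.append(base)
    return "".join(out)

# Toy usage: four alignment columns with coverage 4, 2, 1 and 3.
cols = [list("AAAA"), list("CC"), list("G"), list("TTT")]
print(call_consensus(cols, min_cov=4))  # "A"   -- strict default keeps little
print(call_consensus(cols, min_cov=2))  # "ACT" -- relaxed threshold keeps more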