From: Ivan S. <iva...@gm...> - 2014-09-14 06:58:43
|
Hi Brian! Thank you for your reply, and I apologize for my slow response. It's nice to hear that I'm not the only one with this problem :) I would be happy to share an example. Here is the first 5 lines of the posmap.frg.ctg file, where I have replaced the IDs of reads with their actual names (the relation was taken from asm.gkpStore.fastqUIDmap): m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256 ctg7180000000002 0 8435 f m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942 ctg7180000000002 1495 8147 f m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244 ctg7180000000002 1617 12822 f m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730 ctg7180000000002 1699 8847 r m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593 ctg7180000000002 1760 8558 r The last two numbers of each read's name roughly gives its length (I think they are subreads, so read 2 should be 9136 bases long). Here is where BLASR placed them (I copy only the first few fields of the SAM entries, up to the CIGAR string): m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256/0_4256 0 ctg7180000000002 4174066 254 m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942/0_9136 16 ctg7180000000002 1510215 254 m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244/0_1244 16 ctg7180000000002 881151 254 m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730/0_3082 16 ctg7180000000002 4413614 254 m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593/0_5399 0 ctg7180000000002 1891829 254 Contig placement is good, but it's kind of hard to miss that - there are only two contigs, one is the size of the genome (E. Coli, ctg7180000000002), and the other is the size of two reads (7180000000003). I checked manually, none of the listed reads were clipped (according to their CIGAR strings). The assembly is described here, together with the E. Coli datasets (PacBio reads extracted into FASTQ files) and with instructions on how to run the assembly: http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1 It runs for about half an hour, and produces a complete assembly of E. Coli. Do you have any ideas what's going on with these files? Thank you and best regards! Ivan On Thu, Sep 11, 2014 at 8:17 PM, Brian Walenz <th...@gm...> wrote: > When evaluating the read trimming used in the uncorrected assemblies, we > had _great_ trouble comparing results from mappings (blasr, nucmer, blast, > whatever) against what CA was doing. BLASR was probably the worst offender > here, usually failing to map portions of the read that we thought were > good. I think you're seeing the same effect. > > Are the placements to different contigs, or are they mostly overlapping > but with different end points? Can you share a small example? I'll try > the same experiment here. > > Mapping trimmed reads might get closer to what posmap claims, but aside > from a sanity check, there might be little value in it. Kind of like > validating with only "good" mate pairs, you won't see any mistakes. > > b > > > On Thu, Sep 11, 2014 at 2:08 AM, Ivan Sovic <iva...@gm...> wrote: > >> Hi everyone! >> >> I have trouble with interpreting the POSMAP data of an assembly. >> In short - when I compare the positions of reads that are given in the >> asm.posmap.frgctg file with the positions I obtain after aligning the reads >> to the assembly in asm.ctg.fasta, I can see no relation between the two. >> For alignment, I used both BLASR and BWA-MEM. >> >> Description of what I am doing in more details: >> Following this tutorial ( >> http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1) >> I assembled the E. Coli genome from a set of PacBio reads, and the results >> were exactly as described. >> After that, I parsed the asm.posmap.frgctg file to obtain the list of >> reads that were actually used in the assembly. >> I extracted their original headers from the asm.gkpStore.fastqUIDmap >> file, and filtered the initial set of reads, so the resulting set contains >> only those reads listed in the asm.posmap.frgctg file. >> After that, I used both BLASR with default parameters, and BWA-MEM with >> PacBio parameters to align those reads on the contig file asm.ctg.fasta. >> I then compared the positions of obtained alignments to the positions >> that are reported in asm.posmap.frgctg, and I see no correspondance. >> >> Can anyone provide any insight into this? >> Am I missing something? >> Or maybe the POSMAP files weren't updated with the rest of Celera? >> >> >> Thank you for your help! >> >> >> Best regards, >> Ivan Sovic. >> >> >> >> ------------------------------------------------------------------------------ >> Want excitement? >> Manually upgrade your production database. >> When you want reliability, choose Perforce >> Perforce version control. Predictably reliable. >> >> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk >> _______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> > |