From: Brian W. <th...@gm...> - 2014-09-16 13:44:41
|
The posmap positions are derived from the untig/contig multialignments, and I doubt they're incorrect. Too much other stuff would be broken too. There are some big repeats in this genome, if I remember, one at the start of the contig. Since most reads are in the same contig, can you compute the distance between posmap-position and blasr-position? I don't have (yet) this assembly to analyze. On Sun, Sep 14, 2014 at 2:58 AM, Ivan Sovic <iva...@gm...> wrote: > Hi Brian! > > Thank you for your reply, and I apologize for my slow response. > It's nice to hear that I'm not the only one with this problem :) > > I would be happy to share an example. > Here is the first 5 lines of the posmap.frg.ctg file, where I have > replaced the IDs of reads with their actual names (the relation was taken > from asm.gkpStore.fastqUIDmap): > m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256 > ctg7180000000002 0 8435 f > m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942 > ctg7180000000002 1495 8147 f > m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244 > ctg7180000000002 1617 12822 f > m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730 > ctg7180000000002 1699 8847 r > m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593 > ctg7180000000002 1760 8558 r > > The last two numbers of each read's name roughly gives its length (I think > they are subreads, so read 2 should be 9136 bases long). > Here is where BLASR placed them (I copy only the first few fields of the > SAM entries, up to the CIGAR string): > m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256/0_4256 > 0 ctg7180000000002 4174066 254 > m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942/0_9136 > 16 ctg7180000000002 1510215 254 > m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244/0_1244 > 16 ctg7180000000002 881151 254 > m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730/0_3082 > 16 ctg7180000000002 4413614 254 > m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593/0_5399 > 0 ctg7180000000002 1891829 254 > > Contig placement is good, but it's kind of hard to miss that - there are > only two contigs, one is the size of the genome (E. Coli, > ctg7180000000002), and the other is the size of two reads (7180000000003). > I checked manually, none of the listed reads were clipped (according to > their CIGAR strings). > > The assembly is described here, together with the E. Coli datasets (PacBio > reads extracted into FASTQ files) and with instructions on how to run the > assembly: > > http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1 > It runs for about half an hour, and produces a complete assembly of E. > Coli. > > Do you have any ideas what's going on with these files? > > > Thank you and best regards! > Ivan > > > > > On Thu, Sep 11, 2014 at 8:17 PM, Brian Walenz <th...@gm...> wrote: > >> When evaluating the read trimming used in the uncorrected assemblies, we >> had _great_ trouble comparing results from mappings (blasr, nucmer, blast, >> whatever) against what CA was doing. BLASR was probably the worst offender >> here, usually failing to map portions of the read that we thought were >> good. I think you're seeing the same effect. >> >> Are the placements to different contigs, or are they mostly overlapping >> but with different end points? Can you share a small example? I'll try >> the same experiment here. >> >> Mapping trimmed reads might get closer to what posmap claims, but aside >> from a sanity check, there might be little value in it. Kind of like >> validating with only "good" mate pairs, you won't see any mistakes. >> >> b >> >> >> On Thu, Sep 11, 2014 at 2:08 AM, Ivan Sovic <iva...@gm...> wrote: >> >>> Hi everyone! >>> >>> I have trouble with interpreting the POSMAP data of an assembly. >>> In short - when I compare the positions of reads that are given in the >>> asm.posmap.frgctg file with the positions I obtain after aligning the reads >>> to the assembly in asm.ctg.fasta, I can see no relation between the two. >>> For alignment, I used both BLASR and BWA-MEM. >>> >>> Description of what I am doing in more details: >>> Following this tutorial ( >>> http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1) >>> I assembled the E. Coli genome from a set of PacBio reads, and the results >>> were exactly as described. >>> After that, I parsed the asm.posmap.frgctg file to obtain the list of >>> reads that were actually used in the assembly. >>> I extracted their original headers from the asm.gkpStore.fastqUIDmap >>> file, and filtered the initial set of reads, so the resulting set contains >>> only those reads listed in the asm.posmap.frgctg file. >>> After that, I used both BLASR with default parameters, and BWA-MEM with >>> PacBio parameters to align those reads on the contig file asm.ctg.fasta. >>> I then compared the positions of obtained alignments to the positions >>> that are reported in asm.posmap.frgctg, and I see no correspondance. >>> >>> Can anyone provide any insight into this? >>> Am I missing something? >>> Or maybe the POSMAP files weren't updated with the rest of Celera? >>> >>> >>> Thank you for your help! >>> >>> >>> Best regards, >>> Ivan Sovic. >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Want excitement? >>> Manually upgrade your production database. >>> When you want reliability, choose Perforce >>> Perforce version control. Predictably reliable. >>> >>> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk >>> _______________________________________________ >>> wgs-assembler-users mailing list >>> wgs...@li... >>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>> >>> >> > |