From: Brian W. <th...@gm...> - 2014-09-26 20:28:53
|
Hi, Ivan- I finally had a chance to look at this. I see no problems. I computed the distance between the posmap placement and a placement found by mapping with 'bwa mem'. Most of the placements are within 100bp of the two methods. I suspect you used the fastqUIDmap from the 'trimming' run, and not from the 'assembly' run. The two read sets are different; trimming deletes many reads. In particular, the second read id (#24884) you list in the posmap file doesn't exist in my assembly. b On Tue, Sep 16, 2014 at 9:44 AM, Brian Walenz <th...@gm...> wrote: > The posmap positions are derived from the untig/contig multialignments, > and I doubt they're incorrect. Too much other stuff would be broken too. > > There are some big repeats in this genome, if I remember, one at the start > of the contig. Since most reads are in the same contig, can you compute > the distance between posmap-position and blasr-position? I don't have > (yet) this assembly to analyze. > > On Sun, Sep 14, 2014 at 2:58 AM, Ivan Sovic <iva...@gm...> wrote: > >> Hi Brian! >> >> Thank you for your reply, and I apologize for my slow response. >> It's nice to hear that I'm not the only one with this problem :) >> >> I would be happy to share an example. >> Here is the first 5 lines of the posmap.frg.ctg file, where I have >> replaced the IDs of reads with their actual names (the relation was taken >> from asm.gkpStore.fastqUIDmap): >> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256 >> ctg7180000000002 0 8435 f >> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942 >> ctg7180000000002 1495 8147 f >> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244 >> ctg7180000000002 1617 12822 f >> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730 >> ctg7180000000002 1699 8847 r >> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593 >> ctg7180000000002 1760 8558 r >> >> The last two numbers of each read's name roughly gives its length (I >> think they are subreads, so read 2 should be 9136 bases long). >> Here is where BLASR placed them (I copy only the first few fields of the >> SAM entries, up to the CIGAR string): >> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256/0_4256 >> 0 ctg7180000000002 4174066 254 >> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942/0_9136 >> 16 ctg7180000000002 1510215 254 >> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244/0_1244 >> 16 ctg7180000000002 881151 254 >> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730/0_3082 >> 16 ctg7180000000002 4413614 254 >> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593/0_5399 >> 0 ctg7180000000002 1891829 254 >> >> Contig placement is good, but it's kind of hard to miss that - there are >> only two contigs, one is the size of the genome (E. Coli, >> ctg7180000000002), and the other is the size of two reads (7180000000003). >> I checked manually, none of the listed reads were clipped (according to >> their CIGAR strings). >> >> The assembly is described here, together with the E. Coli datasets >> (PacBio reads extracted into FASTQ files) and with instructions on how to >> run the assembly: >> >> http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1 >> It runs for about half an hour, and produces a complete assembly of E. >> Coli. >> >> Do you have any ideas what's going on with these files? >> >> >> Thank you and best regards! >> Ivan >> >> >> >> >> On Thu, Sep 11, 2014 at 8:17 PM, Brian Walenz <th...@gm...> wrote: >> >>> When evaluating the read trimming used in the uncorrected assemblies, we >>> had _great_ trouble comparing results from mappings (blasr, nucmer, blast, >>> whatever) against what CA was doing. BLASR was probably the worst offender >>> here, usually failing to map portions of the read that we thought were >>> good. I think you're seeing the same effect. >>> >>> Are the placements to different contigs, or are they mostly overlapping >>> but with different end points? Can you share a small example? I'll try >>> the same experiment here. >>> >>> Mapping trimmed reads might get closer to what posmap claims, but aside >>> from a sanity check, there might be little value in it. Kind of like >>> validating with only "good" mate pairs, you won't see any mistakes. >>> >>> b >>> >>> >>> On Thu, Sep 11, 2014 at 2:08 AM, Ivan Sovic <iva...@gm...> >>> wrote: >>> >>>> Hi everyone! >>>> >>>> I have trouble with interpreting the POSMAP data of an assembly. >>>> In short - when I compare the positions of reads that are given in the >>>> asm.posmap.frgctg file with the positions I obtain after aligning the reads >>>> to the assembly in asm.ctg.fasta, I can see no relation between the two. >>>> For alignment, I used both BLASR and BWA-MEM. >>>> >>>> Description of what I am doing in more details: >>>> Following this tutorial ( >>>> http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1) >>>> I assembled the E. Coli genome from a set of PacBio reads, and the results >>>> were exactly as described. >>>> After that, I parsed the asm.posmap.frgctg file to obtain the list of >>>> reads that were actually used in the assembly. >>>> I extracted their original headers from the asm.gkpStore.fastqUIDmap >>>> file, and filtered the initial set of reads, so the resulting set contains >>>> only those reads listed in the asm.posmap.frgctg file. >>>> After that, I used both BLASR with default parameters, and BWA-MEM with >>>> PacBio parameters to align those reads on the contig file asm.ctg.fasta. >>>> I then compared the positions of obtained alignments to the positions >>>> that are reported in asm.posmap.frgctg, and I see no correspondance. >>>> >>>> Can anyone provide any insight into this? >>>> Am I missing something? >>>> Or maybe the POSMAP files weren't updated with the rest of Celera? >>>> >>>> >>>> Thank you for your help! >>>> >>>> >>>> Best regards, >>>> Ivan Sovic. >>>> >>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> Want excitement? >>>> Manually upgrade your production database. >>>> When you want reliability, choose Perforce >>>> Perforce version control. Predictably reliable. >>>> >>>> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk >>>> _______________________________________________ >>>> wgs-assembler-users mailing list >>>> wgs...@li... >>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>> >>>> >>> >> > |