Re: [wgs-assembler-users] Trouble interpreting POSMAP info

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

The posmap positions are derived from the untig/contig multialignments, and
I doubt they're incorrect.  Too much other stuff would be broken too.

There are some big repeats in this genome, if I remember, one at the start
of the contig.  Since most reads are in the same contig, can you compute
the distance between posmap-position and blasr-position?  I don't have
(yet) this assembly to analyze.

On Sun, Sep 14, 2014 at 2:58 AM, Ivan Sovic <iva...@gm...> wrote:

> Hi Brian!
>
> Thank you for your reply, and I apologize for my slow response.
> It's nice to hear that I'm not the only one with this problem :)
>
> I would be happy to share an example.
> Here is the first 5 lines of the posmap.frg.ctg file, where I have
> replaced the IDs of reads with their actual names (the relation was taken
> from asm.gkpStore.fastqUIDmap):
> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256
> ctg7180000000002    0    8435    f
> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942
> ctg7180000000002    1495    8147    f
> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244
> ctg7180000000002    1617    12822    f
> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730
> ctg7180000000002    1699    8847    r
> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593
> ctg7180000000002    1760    8558    r
>
> The last two numbers of each read's name roughly gives its length (I think
> they are subreads, so read 2 should be 9136 bases long).
> Here is where BLASR placed  them (I copy only the first few fields of the
> SAM entries, up to the CIGAR string):
> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256/0_4256
> 0    ctg7180000000002    4174066    254
> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942/0_9136
> 16    ctg7180000000002    1510215    254
> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244/0_1244
> 16    ctg7180000000002    881151    254
> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730/0_3082
> 16    ctg7180000000002    4413614    254
> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593/0_5399
> 0    ctg7180000000002    1891829    254
>
> Contig placement is good, but it's kind of hard to miss that - there are
> only two contigs, one is the size of the genome (E. Coli,
> ctg7180000000002), and the other is the size of two reads (7180000000003).
> I checked manually, none of the listed reads were clipped (according to
> their CIGAR strings).
>
> The assembly is described here, together with the E. Coli datasets (PacBio
> reads extracted into FASTQ files) and with instructions on how to run the
> assembly:
>
> http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1
> It runs for about half an hour, and produces a complete assembly of E.
> Coli.
>
> Do you have any ideas what's going on with these files?
>
>
> Thank you and best regards!
> Ivan
>
>
>
>
> On Thu, Sep 11, 2014 at 8:17 PM, Brian Walenz <th...@gm...> wrote:
>
>> When evaluating the read trimming used in the uncorrected assemblies, we
>> had _great_ trouble comparing results from mappings (blasr, nucmer, blast,
>> whatever) against what CA was doing.  BLASR was probably the worst offender
>> here, usually failing to map portions of the read that we thought were
>> good.  I think you're seeing the same effect.
>>
>> Are the placements to different contigs, or are they mostly overlapping
>> but with different end points?  Can you share a small example?  I'll try
>> the same experiment here.
>>
>> Mapping trimmed reads might get closer to what posmap claims, but aside
>> from a sanity check, there might be little value in it.  Kind of like
>> validating with only "good" mate pairs, you won't see any mistakes.
>>
>> b
>>
>>
>> On Thu, Sep 11, 2014 at 2:08 AM, Ivan Sovic <iva...@gm...> wrote:
>>
>>> Hi everyone!
>>>
>>> I have trouble with interpreting the POSMAP data of an assembly.
>>> In short - when I compare the positions of reads that are given in the
>>> asm.posmap.frgctg file with the positions I obtain after aligning the reads
>>> to the assembly in asm.ctg.fasta, I can see no relation between the two.
>>> For alignment, I used both BLASR and BWA-MEM.
>>>
>>> Description of what I am doing in more details:
>>> Following this tutorial (
>>> http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1)
>>> I assembled the E. Coli genome from a set of PacBio reads, and the results
>>> were exactly as described.
>>> After that, I parsed the asm.posmap.frgctg file to obtain the list of
>>> reads that were actually used in the assembly.
>>> I extracted their original headers from the asm.gkpStore.fastqUIDmap
>>> file, and filtered the initial set of reads, so the resulting set contains
>>> only those reads listed in the asm.posmap.frgctg file.
>>> After that, I used both BLASR with default parameters, and BWA-MEM with
>>> PacBio parameters to align those reads on the contig file asm.ctg.fasta.
>>> I then compared the positions of obtained alignments to the positions
>>> that are reported in asm.posmap.frgctg, and I see no correspondance.
>>>
>>> Can anyone provide any insight into this?
>>> Am I missing something?
>>> Or maybe the POSMAP files weren't updated with the rest of Celera?
>>>
>>>
>>> Thank you for your help!
>>>
>>>
>>> Best regards,
>>> Ivan Sovic.
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Want excitement?
>>> Manually upgrade your production database.
>>> When you want reliability, choose Perforce
>>> Perforce version control. Predictably reliable.
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> wgs-assembler-users mailing list
>>> wgs...@li...
>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users
>>>
>>>
>>
>