Re: [wgs-assembler-users] Trouble interpreting POSMAP info

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi, Ivan-

I finally had a chance to look at this.  I see no problems.  I computed the
distance between the posmap placement and a placement found by mapping with
'bwa mem'.  Most of the placements are within 100bp of the two methods.

I suspect you used the fastqUIDmap from the 'trimming' run, and not from
the 'assembly' run.  The two read sets are different; trimming deletes many
reads.  In particular, the second read id (#24884) you list in the posmap
file doesn't exist in my assembly.

b

On Tue, Sep 16, 2014 at 9:44 AM, Brian Walenz <th...@gm...> wrote:

> The posmap positions are derived from the untig/contig multialignments,
> and I doubt they're incorrect.  Too much other stuff would be broken too.
>
> There are some big repeats in this genome, if I remember, one at the start
> of the contig.  Since most reads are in the same contig, can you compute
> the distance between posmap-position and blasr-position?  I don't have
> (yet) this assembly to analyze.
>
> On Sun, Sep 14, 2014 at 2:58 AM, Ivan Sovic <iva...@gm...> wrote:
>
>> Hi Brian!
>>
>> Thank you for your reply, and I apologize for my slow response.
>> It's nice to hear that I'm not the only one with this problem :)
>>
>> I would be happy to share an example.
>> Here is the first 5 lines of the posmap.frg.ctg file, where I have
>> replaced the IDs of reads with their actual names (the relation was taken
>> from asm.gkpStore.fastqUIDmap):
>> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256
>> ctg7180000000002    0    8435    f
>> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942
>> ctg7180000000002    1495    8147    f
>> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244
>> ctg7180000000002    1617    12822    f
>> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730
>> ctg7180000000002    1699    8847    r
>> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593
>> ctg7180000000002    1760    8558    r
>>
>> The last two numbers of each read's name roughly gives its length (I
>> think they are subreads, so read 2 should be 9136 bases long).
>> Here is where BLASR placed  them (I copy only the first few fields of the
>> SAM entries, up to the CIGAR string):
>> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256/0_4256
>> 0    ctg7180000000002    4174066    254
>> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942/0_9136
>> 16    ctg7180000000002    1510215    254
>> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244/0_1244
>> 16    ctg7180000000002    881151    254
>> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730/0_3082
>> 16    ctg7180000000002    4413614    254
>> m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593/0_5399
>> 0    ctg7180000000002    1891829    254
>>
>> Contig placement is good, but it's kind of hard to miss that - there are
>> only two contigs, one is the size of the genome (E. Coli,
>> ctg7180000000002), and the other is the size of two reads (7180000000003).
>> I checked manually, none of the listed reads were clipped (according to
>> their CIGAR strings).
>>
>> The assembly is described here, together with the E. Coli datasets
>> (PacBio reads extracted into FASTQ files) and with instructions on how to
>> run the assembly:
>>
>> http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1
>> It runs for about half an hour, and produces a complete assembly of E.
>> Coli.
>>
>> Do you have any ideas what's going on with these files?
>>
>>
>> Thank you and best regards!
>> Ivan
>>
>>
>>
>>
>> On Thu, Sep 11, 2014 at 8:17 PM, Brian Walenz <th...@gm...> wrote:
>>
>>> When evaluating the read trimming used in the uncorrected assemblies, we
>>> had _great_ trouble comparing results from mappings (blasr, nucmer, blast,
>>> whatever) against what CA was doing.  BLASR was probably the worst offender
>>> here, usually failing to map portions of the read that we thought were
>>> good.  I think you're seeing the same effect.
>>>
>>> Are the placements to different contigs, or are they mostly overlapping
>>> but with different end points?  Can you share a small example?  I'll try
>>> the same experiment here.
>>>
>>> Mapping trimmed reads might get closer to what posmap claims, but aside
>>> from a sanity check, there might be little value in it.  Kind of like
>>> validating with only "good" mate pairs, you won't see any mistakes.
>>>
>>> b
>>>
>>>
>>> On Thu, Sep 11, 2014 at 2:08 AM, Ivan Sovic <iva...@gm...>
>>> wrote:
>>>
>>>> Hi everyone!
>>>>
>>>> I have trouble with interpreting the POSMAP data of an assembly.
>>>> In short - when I compare the positions of reads that are given in the
>>>> asm.posmap.frgctg file with the positions I obtain after aligning the reads
>>>> to the assembly in asm.ctg.fasta, I can see no relation between the two.
>>>> For alignment, I used both BLASR and BWA-MEM.
>>>>
>>>> Description of what I am doing in more details:
>>>> Following this tutorial (
>>>> http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1)
>>>> I assembled the E. Coli genome from a set of PacBio reads, and the results
>>>> were exactly as described.
>>>> After that, I parsed the asm.posmap.frgctg file to obtain the list of
>>>> reads that were actually used in the assembly.
>>>> I extracted their original headers from the asm.gkpStore.fastqUIDmap
>>>> file, and filtered the initial set of reads, so the resulting set contains
>>>> only those reads listed in the asm.posmap.frgctg file.
>>>> After that, I used both BLASR with default parameters, and BWA-MEM with
>>>> PacBio parameters to align those reads on the contig file asm.ctg.fasta.
>>>> I then compared the positions of obtained alignments to the positions
>>>> that are reported in asm.posmap.frgctg, and I see no correspondance.
>>>>
>>>> Can anyone provide any insight into this?
>>>> Am I missing something?
>>>> Or maybe the POSMAP files weren't updated with the rest of Celera?
>>>>
>>>>
>>>> Thank you for your help!
>>>>
>>>>
>>>> Best regards,
>>>> Ivan Sovic.
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Want excitement?
>>>> Manually upgrade your production database.
>>>> When you want reliability, choose Perforce
>>>> Perforce version control. Predictably reliable.
>>>>
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>>>> _______________________________________________
>>>> wgs-assembler-users mailing list
>>>> wgs...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users
>>>>
>>>>
>>>
>>
>