Re: [wgs-assembler-users] Trouble interpreting POSMAP info

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi Brian!

Thank you for your reply, and I apologize for my slow response.
It's nice to hear that I'm not the only one with this problem :)

I would be happy to share an example.
Here is the first 5 lines of the posmap.frg.ctg file, where I have replaced
the IDs of reads with their actual names (the relation was taken from
asm.gkpStore.fastqUIDmap):
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256
ctg7180000000002    0    8435    f
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942
ctg7180000000002    1495    8147    f
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244
ctg7180000000002    1617    12822    f
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730
ctg7180000000002    1699    8847    r
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593
ctg7180000000002    1760    8558    r

The last two numbers of each read's name roughly gives its length (I think
they are subreads, so read 2 should be 9136 bases long).
Here is where BLASR placed  them (I copy only the first few fields of the
SAM entries, up to the CIGAR string):
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/9515/0_4256/0_4256
0    ctg7180000000002    4174066    254
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24884/2806_11942/0_9136
16    ctg7180000000002    1510215    254
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/24752/0_1244/0_1244
16    ctg7180000000002    881151    254
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/14271/1648_4730/0_3082
16    ctg7180000000002    4413614    254
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/10369/11194_16593/0_5399
0    ctg7180000000002    1891829    254

Contig placement is good, but it's kind of hard to miss that - there are
only two contigs, one is the size of the genome (E. Coli,
ctg7180000000002), and the other is the size of two reads (7180000000003).
I checked manually, none of the listed reads were clipped (according to
their CIGAR strings).

The assembly is described here, together with the E. Coli datasets (PacBio
reads extracted into FASTQ files) and with instructions on how to run the
assembly:
http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1
It runs for about half an hour, and produces a complete assembly of E. Coli.

Do you have any ideas what's going on with these files?

Thank you and best regards!
Ivan

On Thu, Sep 11, 2014 at 8:17 PM, Brian Walenz <th...@gm...> wrote:

> When evaluating the read trimming used in the uncorrected assemblies, we
> had _great_ trouble comparing results from mappings (blasr, nucmer, blast,
> whatever) against what CA was doing.  BLASR was probably the worst offender
> here, usually failing to map portions of the read that we thought were
> good.  I think you're seeing the same effect.
>
> Are the placements to different contigs, or are they mostly overlapping
> but with different end points?  Can you share a small example?  I'll try
> the same experiment here.
>
> Mapping trimmed reads might get closer to what posmap claims, but aside
> from a sanity check, there might be little value in it.  Kind of like
> validating with only "good" mate pairs, you won't see any mistakes.
>
> b
>
>
> On Thu, Sep 11, 2014 at 2:08 AM, Ivan Sovic <iva...@gm...> wrote:
>
>> Hi everyone!
>>
>> I have trouble with interpreting the POSMAP data of an assembly.
>> In short - when I compare the positions of reads that are given in the
>> asm.posmap.frgctg file with the positions I obtain after aligning the reads
>> to the assembly in asm.ctg.fasta, I can see no relation between the two.
>> For alignment, I used both BLASR and BWA-MEM.
>>
>> Description of what I am doing in more details:
>> Following this tutorial (
>> http://wgs-assembler.sourceforge.net/wiki/index.php/Escherichia_coli_K12_MG1655,_using_uncorrected_PacBio_reads,_with_CA8.1)
>> I assembled the E. Coli genome from a set of PacBio reads, and the results
>> were exactly as described.
>> After that, I parsed the asm.posmap.frgctg file to obtain the list of
>> reads that were actually used in the assembly.
>> I extracted their original headers from the asm.gkpStore.fastqUIDmap
>> file, and filtered the initial set of reads, so the resulting set contains
>> only those reads listed in the asm.posmap.frgctg file.
>> After that, I used both BLASR with default parameters, and BWA-MEM with
>> PacBio parameters to align those reads on the contig file asm.ctg.fasta.
>> I then compared the positions of obtained alignments to the positions
>> that are reported in asm.posmap.frgctg, and I see no correspondance.
>>
>> Can anyone provide any insight into this?
>> Am I missing something?
>> Or maybe the POSMAP files weren't updated with the rest of Celera?
>>
>>
>> Thank you for your help!
>>
>>
>> Best regards,
>> Ivan Sovic.
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Want excitement?
>> Manually upgrade your production database.
>> When you want reliability, choose Perforce
>> Perforce version control. Predictably reliable.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>> _______________________________________________
>> wgs-assembler-users mailing list
>> wgs...@li...
>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users
>>
>>
>