Re: [Bio-bwa-help] BWA vs Tophat

Alec,

I am now analyzing Illumina paired-end human RNA-seq data.

I want to estimate expression at the gene/exon/isoform level. I also
want to investigate alternative splicing. For read1/2, there are about
4,000,000 reads.

Is it possible that some of the unmapped reads span splice junctions?
If so, I think it is better to keep them, as Michael suggested earlier
in this thread. I think TopHat can detect splicing to some extent, but
I am not sure. The reference I used here is genomic DNA, not cDNA.

Thanks.

On Thu, Jul 7, 2011 at 3:23 PM, Alec Wysoker <al...@br...> wrote:
> Hi Fabrice,
>
> You'd have to describe to me in more detail what you're doing in order for
> me to have a chance of answering this question intelligently.
>
> Note that Tophat uses Bowtie to do its alignments, and I believe it prefers
> a genomic alignment to a transcriptomic alignment, so I am not sure that
> using BWA to try to align the reads that Tophat doesn't align will produce
> very many alignments.  I'm not an expert, though.
>
> -Alec
>
> On 7/7/11 5:46 AM, Fabrice Tourre wrote:
>>
>> Alec,
>>
>> If I use Tophat to do the mapping for downstream analysis, and then use
>> BWA to map the reads that Tophat left unmapped, will this cause any
>> problems?
>>
>> Thanks.
>>
>> On Wed, Jul 6, 2011 at 4:22 PM, Alec Wysoker <al...@br...> wrote:
>>>
>>> The Picard team at Broad definitely advocates retaining unmapped reads.
>>> However, we don't trust aligners to output unmapped reads in proper
>>> format (or mapped reads, for that matter).  We use Picard
>>> MergeBamAlignment, which takes an unmapped BAM containing the reads
>>> before alignment, and a BAM produced by the aligner, and merges them,
>>> producing a BAM that extracts the alignment information from the aligned
>>> BAM, and almost everything else from the unmapped BAM.  This way you
>>> don't have to rely on the aligner to output the unmapped reads.
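
The merge idea described above can be illustrated with a toy sketch. This is not Picard's implementation: the function name `merge_alignments`, the dictionary-based records, and the single-end simplification are all assumptions made for illustration. The only SAM conventions it relies on are that an unmapped record carries flag bit 0x4 and uses `*`, `0`, `*` for reference name, position, and CIGAR.

```python
UNMAPPED_FLAG = 0x4  # SAM flag bit: segment unmapped


def merge_alignments(unaligned_reads, aligner_hits):
    """Toy merge: one output record per input read.

    unaligned_reads: list of dicts with "name", "seq", "qual"
                     (every read, as it was before alignment).
    aligner_hits:    dict mapping read name -> dict with
                     "chrom", "pos", "cigar" (only reads the aligner placed).
    """
    merged = []
    for read in unaligned_reads:
        record = dict(read)  # sequence, qualities, name come from the unmapped side
        hit = aligner_hits.get(read["name"])
        if hit is None:
            # The aligner did not report this read: keep it, flagged as unmapped.
            record.update(chrom="*", pos=0, cigar="*", flag=UNMAPPED_FLAG)
        else:
            # Only the alignment fields are taken from the aligner output.
            record.update(chrom=hit["chrom"], pos=hit["pos"],
                          cigar=hit["cigar"], flag=0)
        merged.append(record)
    return merged


if __name__ == "__main__":
    reads = [
        {"name": "r1", "seq": "ACGT", "qual": "IIII"},
        {"name": "r2", "seq": "TTGA", "qual": "IIII"},
    ]
    hits = {"r1": {"chrom": "1", "pos": 100, "cigar": "4M"}}
    for rec in merge_alignments(reads, hits):
        print(rec["name"], rec["chrom"], rec["pos"], rec["flag"])
```

Because the loop is driven by the pre-alignment reads rather than by the aligner output, the merged result always contains every input read, whether or not the aligner chose to emit it.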
>>>
>>> -Alec
>>>
>>> On 7/6/11 10:12 AM, Rusch, Michael wrote:
>>>>
>>>> QC isn't actually the issue.  You'll see that both report that you have
>>>> no QC-failed reads, and all reads are QC-pass.  So, the issue isn't the
>>>> number of QC-passed reads per se, but rather the total number of reads.
>>>>
>>>> Bwa includes all reads in its output, not just the ones that are mapped.
>>>> Unmapped reads are still included (but they are indicated as being
>>>> unmapped).  I don't know much about TopHat, but one thing is clear from
>>>> the flagstat output: it is dropping the unmapped reads.  You'll see that
>>>> the number of mapped reads = number of total reads.
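
The comparison above can be checked by hand from the two flagstat reports in this thread. Below is a minimal Python sketch that does the arithmetic; the helper names `total_and_mapped` and `unmapped_records` are hypothetical, and the input strings are the first three lines of each report copied from the original message. Note that flagstat counts alignment records, so for an aligner that reports several alignments per read the totals are not exactly read counts.

```python
import re

# First three lines of each flagstat report, copied from this thread.
BWA_FLAGSTAT = """\
66037724 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
53622250 + 0 mapped (81.20%:nan%)
"""

TOPHAT_FLAGSTAT = """\
60223041 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
60223041 + 0 mapped (100.00%:nan%)
"""


def total_and_mapped(flagstat_text):
    """Return (total, mapped) record counts, summing QC-passed and QC-failed."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        m = re.match(r"(\d+) \+ (\d+) (.*)", line.strip())
        if not m:
            continue
        count = int(m.group(1)) + int(m.group(2))
        label = m.group(3)
        if label.startswith("in total"):
            total = count
        elif label.startswith("mapped"):
            mapped = count
    return total, mapped


def unmapped_records(flagstat_text):
    """Records present in the file but not flagged as mapped."""
    total, mapped = total_and_mapped(flagstat_text)
    return total - mapped


if __name__ == "__main__":
    print("BWA unmapped records:   ", unmapped_records(BWA_FLAGSTAT))
    print("TopHat unmapped records:", unmapped_records(TOPHAT_FLAGSTAT))
```

For the reports in this thread, this gives 12,415,474 unmapped records in the BWA file and 0 in the TopHat file, which is the symptom Michael describes: the TopHat BAM contains no unmapped records at all.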
>>>>
>>>> Based on my experience, I would do everything you can to keep the
>>>> unmapped reads in the BAM, even though they are labeled as unmapped.
>>>> There are probably a dozen reasons to do this, but I will sum it up to
>>>> say that time and time again I have seen that not having all of the
>>>> reads in the BAM file can cause you serious headaches downstream.
>>>>
>>>> As an aside: as you move forward in putting your pipeline together,
>>>> you'll probably wind up needing to do the standard sort-merge-mark
>>>> dups-index set of steps, and I recommend marking, rather than removing,
>>>> your duplicates when you get to that point.  You can use Picard
>>>> MarkDuplicates to mark, and newer versions of samtools may allow you to
>>>> mark instead of removing when you use rmdup, but I'm not sure.
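
To make the distinction between marking and removing concrete, here is a toy Python sketch. It is a deliberate simplification and not how Picard MarkDuplicates decides what is a duplicate: the function name `mark_duplicates`, the record layout, and the rule "same chromosome, position, and strand as an earlier record" are assumptions for illustration. The one real convention used is the SAM duplicate flag bit, 0x400.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit: PCR or optical duplicate


def mark_duplicates(records):
    """Return a copy of records with later duplicates flagged, none removed.

    records: list of dicts with "name", "chrom", "pos", "strand", "flag".
    Simplified rule: a record is a duplicate if an earlier record has the
    same (chrom, pos, strand).
    """
    seen = set()
    marked = []
    for rec in records:
        key = (rec["chrom"], rec["pos"], rec["strand"])
        flag = rec["flag"]
        if key in seen:
            flag |= DUPLICATE_FLAG  # mark, but keep the record
        else:
            seen.add(key)
        marked.append({**rec, "flag": flag})
    return marked


if __name__ == "__main__":
    recs = [
        {"name": "a", "chrom": "1", "pos": 100, "strand": "+", "flag": 0},
        {"name": "b", "chrom": "1", "pos": 100, "strand": "+", "flag": 0},
        {"name": "c", "chrom": "1", "pos": 250, "strand": "-", "flag": 16},
    ]
    for rec in mark_duplicates(recs):
        print(rec["name"], rec["flag"], bool(rec["flag"] & DUPLICATE_FLAG))
```

The output has the same number of records as the input, so downstream tools can still choose to ignore or include flagged duplicates, which is the advantage of marking over removal.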
>>>>
>>>> If you're going to use TopHat, I'd see if there's a way to tell it to
>>>> keep the unmapped reads.  If not, you might find that there's a commonly
>>>> accepted process that people use to re-introduce the unmapped reads back
>>>> into the bam.
>>>>
>>>> Michael
>>>>
>>>> -----Original Message-----
>>>> From: Fabrice Tourre [mailto:fab...@gm...]
>>>> Sent: Wednesday, July 06, 2011 4:14 AM
>>>> To: bio...@li...
>>>> Subject: [Bio-bwa-help] BWA vs Tophat. .
>>>>
>>>> Dear expert,
>>>>
>>>> I am now analyzing Illumina paired-end human RNA-seq data.
>>>>
>>>> As a first step, I want to choose mapping software. I have tried
>>>> BWA and tophat against the Ensembl human DNA reference.
>>>> I used samtools flagstat to summarize the results. I cannot understand
>>>> why the numbers of QC-passed reads are different here. The input for
>>>> BWA and tophat is exactly the same.
>>>>
>>>> Can anyone give me some suggestions? Thanks.
>>>>
>>>> -------------------------------------------BWA----------------------------------------------------
>>>> samtools flagstat accepted_hits.bam
>>>> 66037724 + 0 in total (QC-passed reads + QC-failed reads)
>>>> 0 + 0 duplicates
>>>> 53622250 + 0 mapped (81.20%:nan%)
>>>> 66037724 + 0 paired in sequencing
>>>> 33018862 + 0 read1
>>>> 33018862 + 0 read2
>>>> 41577968 + 0 properly paired (62.96%:nan%)
>>>> 48987028 + 0 with itself and mate mapped
>>>> 4635222 + 0 singletons (7.02%:nan%)
>>>> 1444118 + 0 with mate mapped to a different chr
>>>> 1214890 + 0 with mate mapped to a different chr (mapQ>=5)
>>>>
>>>>
>>>>
>>>> -------------------------------------------Tophat----------------------------------------------------
>>>> samtools flagstat accepted_hits.bam
>>>> 60223041 + 0 in total (QC-passed reads + QC-failed reads)
>>>> 0 + 0 duplicates
>>>> 60223041 + 0 mapped (100.00%:nan%)
>>>> 60223041 + 0 paired in sequencing
>>>> 30279902 + 0 read1
>>>> 29943139 + 0 read2
>>>> 40219740 + 0 properly paired (66.78%:nan%)
>>>> 56286158 + 0 with itself and mate mapped
>>>> 3936883 + 0 singletons (6.54%:nan%)
>>>> 0 + 0 with mate mapped to a different chr
>>>> 0 + 0 with mate mapped to a different chr (mapQ>=5)
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> All of the data generated in your IT infrastructure is seriously
>>>> valuable.
>>>> Why? It contains a definitive record of application performance,
>>>> security
>>>> threats, fraudulent activity, and more. Splunk takes this data and makes
>>>> sense of it. IT sense. And common sense.
>>>> http://p.sf.net/sfu/splunk-d2d-c2
>>>> _______________________________________________
>>>> Bio-bwa-help mailing list
>>>> Bio...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/bio-bwa-help
>>>>
>>>>
>>>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>