Menu

#26 Different Read Counts‏: Samtools flagstst VS HTseq

--
open
nobody
False
7
2013-12-06
2013-11-26
No

Dear Simon


I used the Stampy program to map reads into Ensembl Danio rerio genome, and then HTseq to count reads into genes for expression quantification (using Ensembl .gtf file).
I also used Samtools flagstat to count the number of total mapped reads.
However the number of reads are not consistent between Samtools flagstat and HTseq:






HTSEQ


SumReadsMappedIntoGenes 8493585
no_feature 844881
ambiguous 1247928
too_low_aQual 0
not_aligned 1664519
alignment_not_unique 0


SumMappedReads 10586394




FAGSTATS


24524248 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
18695744 + 0 mapped (76.23%:-nan%)
24524248 + 0 paired in sequencing
12262124 + 0 read1
12262124 + 0 read2
10564892 + 0 properly paired (43.08%:-nan%)
16196278 + 0 with itself and mate mapped
2499466 + 0 singletons (10.19%:-nan%)
994470 + 0 with mate mapped to a different chr
376058 + 0 with mate mapped to a different chr (mapQ>=5)






There are 8 109 350 missing mapped reads, and 4 163 985 missing not aligned reads in the HTseq counts.
I hope you could help me understanding these differences.
I used the following HTseq command:


htseq-count --stranded=no --type=exon /path/to/sam/file/library_sortedByName.sam exon /path/to/gtf/file/Danio_rerio.Zv9.69.gtf




Thank you in advance
Best regards


Miguel Machado

Discussion

  • Paul Theodor Pyl

    Dear Miguel,

    your flagstat shows that of the 24 million reads about 76% are mapped which would yield ca. 18 million mapped reads.
    Since you have a paired end library htseq-count actually counts the number of pairs (or fragments) overlapping genes, i.e. roughly half the number of reads.

    So your numbers seem to be in order.

    Kind Regards,
    Paul Pyl

     
  • Miguel Machado

    Miguel Machado - 2013-11-28

    Dear Paul

    Sorry for my ignorance, but I never read that HTseq counted fragments in a pair-end data.
    If so, there’s still some numbers that don’t match.
    Samtools flagstat reported 8 098 139 pairs mapped (with itself and mate mapped). If HTseq counts fragments, HTseq counted 8 493 585 fragments mapped into genes, which means that it counted 395 446 more fragments than those that were mapped.
    I hope I’m interpreting HTseq and Samtools flagstat outputs properly.
    I used HTSeq version 0.5.3p9.

    Thank you for your help

    Best regards

    Miguel Machado

     
  • Paul Theodor Pyl

    Dear Miguel,

    could you provide a small example file where this effect appears (bam file with 10 reads or so).

    I think it might just be some unclarity in the interpretation of SAM flags / their fuzzy definition in the SAM standard. If you are sure your file is not corrupted please go ahead and send an appropriately sized subset so that I can check for reproducibility of the error.

    Thanks,
    Paul

     
  • Miguel Machado

    Miguel Machado - 2013-12-04

    Dear Paul

    Sorry for the delay in answering you.
    I prepared an example of the Stampy sam file used in HTseq.
    However, since the read/fragments counts of HTseq and Samtools flagstat don’t matched also for Tophat, I also prepared an example of a sam file (derived from Tophat output) and a gtf file used in HTseq (for a de novo assembly strategy).
    Let me also use this opportunity to ask you, in a pair-end data, how are the single mapped reads counted by HTseq?
    Bellow you can find three links for the mention files (I couldn’t upload those files) and the commands used.
    Thank you a lot for your attention

    Bets regards

    Miguel Machado

    https://app.box.com/s/3jymcxnb65evj1autyr6
    https://app.box.com/s/d1sxey2l0op887qf2pfl
    https://app.box.com/s/9fcqpae2osh59jk1axng

    $ python2.6 ~/bin/stampy-1.0.21/stampy -g ~/ensembl/stampyFiles/Zv9_69 -h ~/ensembl/stampyFiles/Zv9_69 --substitutionrate=0.11 --output=~/stampy/female_brain.sam -M ~/reads/pp_f_brain_1_trimmed.fastq ~/reads/pp_f_brain_2_trimmed.fastq
    $ head -n 70000 ~/stampy/female_brain.sam 1> ~/stampy/stampy_female_brain_70000lines.sam
    $ ~/virtualenv/env/bin/htseq-count --stranded=no --type=exon ~/stampy/stampy_female_brain_70000lines.sam ~/ensembl/Danio_rerio.Zv9.69.gtf

    $ tophat -p 4 --phred64-quals --mate-inner-dist 20 --mate-std-dev 50 --output-dir ~/tophat/female_brain/ --max-multihits 1 ~/fasta/squaliusPyrenaicus_RNAseq_ssh ~/reads/pp_f_brain_1_trimmed.fastq ~/reads/pp_f_brain_2_trimmed.fastq
    $ samtools sort -n ~/tophat/female_brain/accepted_hits.bam ~/tophat/female_brain/accepted_hits_sortedName
    $ samtools view -h -o ~/tophat/female_brain/accepted_hits_sortedName.sam ~/tophat/female_brain/accepted_hits_sortedName.bam
    $ head -n 300001 ~/tophat/female_brain/accepted_hits_sortedName.sam 1> ~/tophat/female_brain/tophat_accepted_hits_sortedName_300001lines.sam
    $ ~/virtualenv/env/bin/htseq-count --stranded=no --type=transcript ~/tophat/female_brain/tophat_accepted_hits_sortedName_300001lines.sam ~/transcriptome_pp_RNAseq_ssh.gtf

     
  • Simon Anders

    Simon Anders - 2013-12-05

    Sorry, where does the number "8098139" mentioned in your post above come from, again?

     
    • Miguel Machado

      Miguel Machado - 2013-12-05

      It’s the number of pairs/fragments that mapped determined using Samtools flagstat (“with itself and mate mapped”). I reached that value by dividing the number of reads with itself and mate mapped by 2 (16196278 + 0 with itself and mate mapped / 2).
      I hope I explained myself properly.

       
  • Simon Anders

    Simon Anders - 2013-12-05

    htseq-count does not discard a read pair if one one of the mates is mapped. Rather it counts the pair for the gene that the aligned mate overlaps, if this is a single unambiguous overlap. So, the 395,446 extra reads are properly those of the 2,499,466 singletons wherev the aligned mate fell onto a gene.

     
    • Miguel Machado

      Miguel Machado - 2013-12-06

      If I understood it right, for pair-end data, htseq-count does not count only reads neither only fragments, since it counts fragments and single mapped reads (unambiguous overlap singletons). It’s a mixture of both.
      Therefore, one cannot use htseq-count to calculate RPKM neither FPKM.
      Am I right?

       
  • Simon Anders

    Simon Anders - 2013-12-06

    Well, htseq-count is meant as a tool to prepare a count table for subsequent testing for differential expression. Therefore, it aims to count how many cDNA fragments can be mapped unambiguously to a given gene. This is why a we count read pairs, not reads, and discard a read pair if its mates' mapiings contradict each other but not if one mate has not been mapped.

    If you understand RPKM as reads per kilobase length and per million mapped reads, then htseq-count is unsuitable. But why would you want to count a fragment twice if both its mates are mapped and only once if one read is unmapped? As for FPKM: The difference between RPKM and FPKM, in the parlance of the cufflinks paper, is that the latter tries to get the transcript length correct, while the former does something arbitrary. And, no, htseq-count does not dive into this whole transcript-length issue.

    See also http://seqanswers.com/forums/showthread.php?t=9129, especially my post #4 there.

     

Log in to post a comment.