Re: [Useq-users] ORSS not counting reads properly?

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi David,

This is what I expected as the reason for what I was seeing.  Thanks in advance for making the changes.  Can you please let me know when the new version is available?

Thanks,

Andrew

Andrew Oler, Ph.D.
High-Throughput Sequencing Bioinformatics Specialist
Contractor – Medical Science & Computing, Inc.
Computational Biology Section
Bioinformatics and Computational Biosciences Branch (BCBB)
OCICB/OSMO/OD/NIAID/NIH

31 Center Drive, Room 3B62
Bethesda, MD 20892
Mobile: 240-507-3791
Office: 301-402-5685
http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> (Within NIH)
http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public)

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.

From: David Nix <Dav...@hc...<mailto:Dav...@hc...>>
To: "Oler, Andrew (NIH/NIAID) [C]" <and...@ni...<mailto:and...@ni...>>
Cc: "use...@li...<mailto:use...@li...>" <use...@li...<mailto:use...@li...>>
Subject: Re: ORSS not counting reads properly?

Hello Andrew,

OK, I found out the issue.  I'm calling Picard's SAMRecordIterator i =
reader.queryOverlapping(chromosome, start, stop); method to retrieve all
alignments that overlap each exon.  I then hash the names of the hits to
reduce this to the number of fragments found to hit the entire gene.

For our standard Extended Junction Analysis with know splices and
annotation derived theoretical splices this works well
(http://useq.sourceforge.net/usageRNASeq.html) since alignments will share
sequence with at least with one of the genes exons.  (There are very few
to no multi gene spanners with this approach, see below.)

The problem is that you're using tophat to call splice matches and that
package is generating a lot of splices that have huge spans that overlap
all the gene's exons in genomic space but share no sequence.

For example with Cav3, there are 572 such spanners, where the read is
split in half with 300k+ bp between them.  These reads span multiple genes
and are counted as hits to each.  Here's a couple of the spanners:

HWI-ST183_0264:7:2208:7929:127398#0 0 chr6 112343683 255 28M309966N16M * 0
0 GGACGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGCGGGTT DFADFA;FHGAFGIICGG@@CFHGGCGG
FFFHBBA6@####### NH:i:1 NM:i:3 XS:A:-
HWI-ST183_0264:7:1105:15979:27960#0 0 chr6 112343684 255 27M309966N17M * 0
0 GGCGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGTGTTC FFHHHHHJJFHGHIHIJIJJJJJFBGHI
II6D<A6@######## NH:i:1 NM:i:3 XS:A:-

HWI-ST183_0264:7:2205:3254:80300#0 0 chr6 112343686 255 25M309966N19M * 0 0
CGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGGTTTGTC FFHHHGGGIIJJJJIIJJJJJJHGIIJJI
III@B@######### NH:i:1 NM:i:3 XS:A:-
HWI-ST183_0264:7:1101:15952:159627#0 0 chr6 112343687 255 24M309966N20M * 0
0 GTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGGTTTCTCC FFHHHHFHHHIHIIIJJJJJJIJIIIJ
CB?7B########### NH:i:1 NM:i:3 XS:A:-

One fix here is to put in a requirement that each alignment actually share
sequence with the exons in a gene.  Should be a pretty simple change.

Thanks for bringing this to my attention!

Note, for those looking for hit counts for every gene, set the -e 1.
Currently this defaults to 20 and only genes with this number of total
hits will be scored.  I'll change this too in the next release.

-cheers, D

On 4/19/12 10:12 AM, "Oler, Andrew (NIH/NIAID) [C]" <and...@ni...<mailto:and...@ni...>>
wrote:

Hi David,

I have two questions about ORSS.  I'm using ORSS in USeq 8.2.0.  I was
checking on a few genes (randomly chosen) that had no read counts and
noticed that there are some reads for those exons of those genes.  Is
there a threshold number of reads that must be passed before the counts
are reported?  (e.g., 5 in at least one dataset?)  Is there a way to get
ORSS to report all of the read counts regardless of threshold (and only
skip the differential testing if below a threshold)? (e.g., Dnajb7,
Apol6, Dnahc5).

What I'm more concerned about is a few genes that I noticed with large
differences between replicates.  ORSS reports high counts, but when I
look at the raw reads, there is actually no expression for these genes at
all.  See the screenshots for Cav3 and Oxtr.  In the attached
spreadsheet, you can see that the read counts for the genes are 572 and
574, respectively for T1 (aka KO1 bam file in IGB).  But there are no
reads for these exons in the KO1 bam file.  I repeated with just a single
replicate and I got the same result (KO1 as treatment and WT1 as
control).  If you look at the wide shot of Cav3, you see that there are
some spliced reads that flank these genes.  I think these are being
counted, even though their aligned bases are not overlapping any exons;
there are about 500-600 reads that are hidden that probably have the same
profile, which is probably why the counts are ~572/4.  I also just tried
this with USeq 8.2.2 and it gives the same result.

Thanks,

Andrew

Andrew Oler, Ph.D.
High-Throughput Sequencing Bioinformatics Specialist
Contractor  Medical Science & Computing, Inc.
Computational Biology Section
Bioinformatics and Computational Biosciences Branch (BCBB)
OCICB/OSMO/OD/NIAID/NIH

31 Center Drive, Room 3B62
Bethesda, MD 20892
Mobile: 240-507-3791
Office: 301-402-5685
http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/>
(Within NIH)
http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public)

Disclaimer: The information in this e-mail and any of its attachments is
confidential and may contain sensitive information. It should not be used
by anyone who is not the original intended recipient. If you have
received this e-mail in error please inform the sender and delete it from
your mailbox or any other storage devices. National Institute of Allergy
and Infectious Diseases shall not accept liability for any statements
made that are sender's own and not expressly made on behalf of the NIAID
by one of its representatives.