From: Oler, A. (NIH/N. [C] <and...@ni...> - 2012-04-19 16:13:51
Attachments:
120419_Screen_Shots_RNAseq.zip
|
Hi David, I have two questions about ORSS. I'm using ORSS in USeq 8.2.0. I was checking on a few genes (randomly chosen) that had no read counts and noticed that there are some reads for those exons of those genes. Is there a threshold number of reads that must be passed before the counts are reported? (e.g., 5 in at least one dataset?) Is there a way to get ORSS to report all of the read counts regardless of threshold (and only skip the differential testing if below a threshold)? (e.g., Dnajb7, Apol6, Dnahc5). What I'm more concerned about is a few genes that I noticed with large differences between replicates. ORSS reports high counts, but when I look at the raw reads, there is actually no expression for these genes at all. See the screenshots for Cav3 and Oxtr. In the attached spreadsheet, you can see that the read counts for the genes are 572 and 574, respectively for T1 (aka KO1 bam file in IGB). But there are no reads for these exons in the KO1 bam file. I repeated with just a single replicate and I got the same result (KO1 as treatment and WT1 as control). If you look at the wide shot of Cav3, you see that there are some spliced reads that flank these genes. I think these are being counted, even though their aligned bases are not overlapping any exons; there are about 500-600 reads that are hidden that probably have the same profile, which is probably why the counts are ~572/4. I also just tried this with USeq 8.2.2 and it gives the same result. Thanks, Andrew Andrew Oler, Ph.D. High-Throughput Sequencing Bioinformatics Specialist Contractor – Medical Science & Computing, Inc. Computational Biology Section Bioinformatics and Computational Biosciences Branch (BCBB) OCICB/OSMO/OD/NIAID/NIH 31 Center Drive, Room 3B62 Bethesda, MD 20892 Mobile: 240-507-3791 Office: 301-402-5685 http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> (Within NIH) http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public) Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives. |
From: David N. <Dav...@hc...> - 2012-04-19 16:54:13
|
Let me take a look, there are several filters that are applied to exclude bad alignments before being counted by ODRSS. Thus the alignments counted will likely be a subset of the input file. On 4/19/12 10:12 AM, "Oler, Andrew (NIH/NIAID) [C]" <and...@ni...> wrote: >Hi David, > >I have two questions about ORSS. I'm using ORSS in USeq 8.2.0. I was >checking on a few genes (randomly chosen) that had no read counts and >noticed that there are some reads for those exons of those genes. Is >there a threshold number of reads that must be passed before the counts >are reported? (e.g., 5 in at least one dataset?) Is there a way to get >ORSS to report all of the read counts regardless of threshold (and only >skip the differential testing if below a threshold)? (e.g., Dnajb7, >Apol6, Dnahc5). > >What I'm more concerned about is a few genes that I noticed with large >differences between replicates. ORSS reports high counts, but when I >look at the raw reads, there is actually no expression for these genes at >all. See the screenshots for Cav3 and Oxtr. In the attached >spreadsheet, you can see that the read counts for the genes are 572 and >574, respectively for T1 (aka KO1 bam file in IGB). But there are no >reads for these exons in the KO1 bam file. I repeated with just a single >replicate and I got the same result (KO1 as treatment and WT1 as >control). If you look at the wide shot of Cav3, you see that there are >some spliced reads that flank these genes. I think these are being >counted, even though their aligned bases are not overlapping any exons; >there are about 500-600 reads that are hidden that probably have the same >profile, which is probably why the counts are ~572/4. I also just tried >this with USeq 8.2.2 and it gives the same result. > >Thanks, > >Andrew > >Andrew Oler, Ph.D. >High-Throughput Sequencing Bioinformatics Specialist >Contractor Medical Science & Computing, Inc. >Computational Biology Section >Bioinformatics and Computational Biosciences Branch (BCBB) >OCICB/OSMO/OD/NIAID/NIH > >31 Center Drive, Room 3B62 >Bethesda, MD 20892 >Mobile: 240-507-3791 >Office: 301-402-5685 >http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> >(Within NIH) >http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public) > >Disclaimer: The information in this e-mail and any of its attachments is >confidential and may contain sensitive information. It should not be used >by anyone who is not the original intended recipient. If you have >received this e-mail in error please inform the sender and delete it from >your mailbox or any other storage devices. National Institute of Allergy >and Infectious Diseases shall not accept liability for any statements >made that are sender's own and not expressly made on behalf of the NIAID >by one of its representatives. > |
From: David N. <Dav...@hc...> - 2012-04-20 19:02:33
|
Hello Andrew, OK, I found out the issue. I'm calling Picard's SAMRecordIterator i = reader.queryOverlapping(chromosome, start, stop); method to retrieve all alignments that overlap each exon. I then hash the names of the hits to reduce this to the number of fragments found to hit the entire gene. For our standard Extended Junction Analysis with know splices and annotation derived theoretical splices this works well (http://useq.sourceforge.net/usageRNASeq.html) since alignments will share sequence with at least with one of the genes exons. (There are very few to no multi gene spanners with this approach, see below.) The problem is that you're using tophat to call splice matches and that package is generating a lot of splices that have huge spans that overlap all the gene's exons in genomic space but share no sequence. For example with Cav3, there are 572 such spanners, where the read is split in half with 300k+ bp between them. These reads span multiple genes and are counted as hits to each. Here's a couple of the spanners: HWI-ST183_0264:7:2208:7929:127398#0 0 chr6 112343683 255 28M309966N16M * 0 0 GGACGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGCGGGTT DFADFA;FHGAFGIICGG@@CFHGGCGG FFFHBBA6@####### NH:i:1 NM:i:3 XS:A:- HWI-ST183_0264:7:1105:15979:27960#0 0 chr6 112343684 255 27M309966N17M * 0 0 GGCGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGTGTTC FFHHHHHJJFHGHIHIJIJJJJJFBGHI II6D<A6@######## NH:i:1 NM:i:3 XS:A:- HWI-ST183_0264:7:2205:3254:80300#0 0 chr6 112343686 255 25M309966N19M * 0 0 CGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGGTTTGTC FFHHHGGGIIJJJJIIJJJJJJHGIIJJI III@B@######### NH:i:1 NM:i:3 XS:A:- HWI-ST183_0264:7:1101:15952:159627#0 0 chr6 112343687 255 24M309966N20M * 0 0 GTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGGTTTCTCC FFHHHHFHHHIHIIIJJJJJJIJIIIJ >CB?7B########### NH:i:1 NM:i:3 XS:A:- One fix here is to put in a requirement that each alignment actually share sequence with the exons in a gene. Should be a pretty simple change. Thanks for bringing this to my attention! Note, for those looking for hit counts for every gene, set the -e 1. Currently this defaults to 20 and only genes with this number of total hits will be scored. I'll change this too in the next release. -cheers, D On 4/19/12 10:12 AM, "Oler, Andrew (NIH/NIAID) [C]" <and...@ni...> wrote: >Hi David, > >I have two questions about ORSS. I'm using ORSS in USeq 8.2.0. I was >checking on a few genes (randomly chosen) that had no read counts and >noticed that there are some reads for those exons of those genes. Is >there a threshold number of reads that must be passed before the counts >are reported? (e.g., 5 in at least one dataset?) Is there a way to get >ORSS to report all of the read counts regardless of threshold (and only >skip the differential testing if below a threshold)? (e.g., Dnajb7, >Apol6, Dnahc5). > >What I'm more concerned about is a few genes that I noticed with large >differences between replicates. ORSS reports high counts, but when I >look at the raw reads, there is actually no expression for these genes at >all. See the screenshots for Cav3 and Oxtr. In the attached >spreadsheet, you can see that the read counts for the genes are 572 and >574, respectively for T1 (aka KO1 bam file in IGB). But there are no >reads for these exons in the KO1 bam file. I repeated with just a single >replicate and I got the same result (KO1 as treatment and WT1 as >control). If you look at the wide shot of Cav3, you see that there are >some spliced reads that flank these genes. I think these are being >counted, even though their aligned bases are not overlapping any exons; >there are about 500-600 reads that are hidden that probably have the same >profile, which is probably why the counts are ~572/4. I also just tried >this with USeq 8.2.2 and it gives the same result. > >Thanks, > >Andrew > >Andrew Oler, Ph.D. >High-Throughput Sequencing Bioinformatics Specialist >Contractor Medical Science & Computing, Inc. >Computational Biology Section >Bioinformatics and Computational Biosciences Branch (BCBB) >OCICB/OSMO/OD/NIAID/NIH > >31 Center Drive, Room 3B62 >Bethesda, MD 20892 >Mobile: 240-507-3791 >Office: 301-402-5685 >http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> >(Within NIH) >http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public) > >Disclaimer: The information in this e-mail and any of its attachments is >confidential and may contain sensitive information. It should not be used >by anyone who is not the original intended recipient. If you have >received this e-mail in error please inform the sender and delete it from >your mailbox or any other storage devices. National Institute of Allergy >and Infectious Diseases shall not accept liability for any statements >made that are sender's own and not expressly made on behalf of the NIAID >by one of its representatives. > |
From: Oler, A. (NIH/N. [C] <and...@ni...> - 2012-04-23 15:15:25
|
Hi David, This is what I expected as the reason for what I was seeing. Thanks in advance for making the changes. Can you please let me know when the new version is available? Thanks, Andrew Andrew Oler, Ph.D. High-Throughput Sequencing Bioinformatics Specialist Contractor – Medical Science & Computing, Inc. Computational Biology Section Bioinformatics and Computational Biosciences Branch (BCBB) OCICB/OSMO/OD/NIAID/NIH 31 Center Drive, Room 3B62 Bethesda, MD 20892 Mobile: 240-507-3791 Office: 301-402-5685 http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> (Within NIH) http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public) Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives. From: David Nix <Dav...@hc...<mailto:Dav...@hc...>> To: "Oler, Andrew (NIH/NIAID) [C]" <and...@ni...<mailto:and...@ni...>> Cc: "use...@li...<mailto:use...@li...>" <use...@li...<mailto:use...@li...>> Subject: Re: ORSS not counting reads properly? Hello Andrew, OK, I found out the issue. I'm calling Picard's SAMRecordIterator i = reader.queryOverlapping(chromosome, start, stop); method to retrieve all alignments that overlap each exon. I then hash the names of the hits to reduce this to the number of fragments found to hit the entire gene. For our standard Extended Junction Analysis with know splices and annotation derived theoretical splices this works well (http://useq.sourceforge.net/usageRNASeq.html) since alignments will share sequence with at least with one of the genes exons. (There are very few to no multi gene spanners with this approach, see below.) The problem is that you're using tophat to call splice matches and that package is generating a lot of splices that have huge spans that overlap all the gene's exons in genomic space but share no sequence. For example with Cav3, there are 572 such spanners, where the read is split in half with 300k+ bp between them. These reads span multiple genes and are counted as hits to each. Here's a couple of the spanners: HWI-ST183_0264:7:2208:7929:127398#0 0 chr6 112343683 255 28M309966N16M * 0 0 GGACGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGCGGGTT DFADFA;FHGAFGIICGG@@CFHGGCGG FFFHBBA6@####### NH:i:1 NM:i:3 XS:A:- HWI-ST183_0264:7:1105:15979:27960#0 0 chr6 112343684 255 27M309966N17M * 0 0 GGCGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGTGTTC FFHHHHHJJFHGHIHIJIJJJJJFBGHI II6D<A6@######## NH:i:1 NM:i:3 XS:A:- HWI-ST183_0264:7:2205:3254:80300#0 0 chr6 112343686 255 25M309966N19M * 0 0 CGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGGTTTGTC FFHHHGGGIIJJJJIIJJJJJJHGIIJJI III@B@######### NH:i:1 NM:i:3 XS:A:- HWI-ST183_0264:7:1101:15952:159627#0 0 chr6 112343687 255 24M309966N20M * 0 0 GTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGGTTTCTCC FFHHHHFHHHIHIIIJJJJJJIJIIIJ CB?7B########### NH:i:1 NM:i:3 XS:A:- One fix here is to put in a requirement that each alignment actually share sequence with the exons in a gene. Should be a pretty simple change. Thanks for bringing this to my attention! Note, for those looking for hit counts for every gene, set the -e 1. Currently this defaults to 20 and only genes with this number of total hits will be scored. I'll change this too in the next release. -cheers, D On 4/19/12 10:12 AM, "Oler, Andrew (NIH/NIAID) [C]" <and...@ni...<mailto:and...@ni...>> wrote: Hi David, I have two questions about ORSS. I'm using ORSS in USeq 8.2.0. I was checking on a few genes (randomly chosen) that had no read counts and noticed that there are some reads for those exons of those genes. Is there a threshold number of reads that must be passed before the counts are reported? (e.g., 5 in at least one dataset?) Is there a way to get ORSS to report all of the read counts regardless of threshold (and only skip the differential testing if below a threshold)? (e.g., Dnajb7, Apol6, Dnahc5). What I'm more concerned about is a few genes that I noticed with large differences between replicates. ORSS reports high counts, but when I look at the raw reads, there is actually no expression for these genes at all. See the screenshots for Cav3 and Oxtr. In the attached spreadsheet, you can see that the read counts for the genes are 572 and 574, respectively for T1 (aka KO1 bam file in IGB). But there are no reads for these exons in the KO1 bam file. I repeated with just a single replicate and I got the same result (KO1 as treatment and WT1 as control). If you look at the wide shot of Cav3, you see that there are some spliced reads that flank these genes. I think these are being counted, even though their aligned bases are not overlapping any exons; there are about 500-600 reads that are hidden that probably have the same profile, which is probably why the counts are ~572/4. I also just tried this with USeq 8.2.2 and it gives the same result. Thanks, Andrew Andrew Oler, Ph.D. High-Throughput Sequencing Bioinformatics Specialist Contractor Medical Science & Computing, Inc. Computational Biology Section Bioinformatics and Computational Biosciences Branch (BCBB) OCICB/OSMO/OD/NIAID/NIH 31 Center Drive, Room 3B62 Bethesda, MD 20892 Mobile: 240-507-3791 Office: 301-402-5685 http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> (Within NIH) http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public) Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives. |
From: David N. <Dav...@hc...> - 2012-04-25 17:23:20
|
OK Folks, Give USeq_8.2.3 a try. This checks that each read actually touches down on the exon under interrogation and should allow folks to use TopHat alignments more accurately. USeq 8.2.3 also reworks the MultipleReplicaScanSeqs app enabling one to bypass DESeq's variance outlier filtering. There were also mods to the NovoalignBisulfiteParser allowing it to play nice with spliced RNASeq alignments. -cheers, David -- David Austin Nix, PhD Huntsman Cancer Institute Dept. OncSci University of Utah Bioinformatics Shared Resource HCI 3165 (801) 587-4611 dav...@hc... Skype: NextGenBio On 4/23/12 9:12 AM, "Oler, Andrew (NIH/NIAID) [C]" <and...@ni...> wrote: >Hi David, > >This is what I expected as the reason for what I was seeing. Thanks in >advance for making the changes. Can you please let me know when the new >version is available? > >Thanks, > >Andrew > >Andrew Oler, Ph.D. >High-Throughput Sequencing Bioinformatics Specialist >Contractor Medical Science & Computing, Inc. >Computational Biology Section >Bioinformatics and Computational Biosciences Branch (BCBB) >OCICB/OSMO/OD/NIAID/NIH > >31 Center Drive, Room 3B62 >Bethesda, MD 20892 >Mobile: 240-507-3791 >Office: 301-402-5685 >http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> >(Within NIH) >http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public) > >Disclaimer: The information in this e-mail and any of its attachments is >confidential and may contain sensitive information. It should not be used >by anyone who is not the original intended recipient. If you have >received this e-mail in error please inform the sender and delete it from >your mailbox or any other storage devices. National Institute of Allergy >and Infectious Diseases shall not accept liability for any statements >made that are sender's own and not expressly made on behalf of the NIAID >by one of its representatives. > > >From: David Nix <Dav...@hc...<mailto:Dav...@hc...>> >To: "Oler, Andrew (NIH/NIAID) [C]" ><and...@ni...<mailto:and...@ni...>> >Cc: >"use...@li...<mailto:use...@li...> >" ><use...@li...<mailto:use...@li...> >> >Subject: Re: ORSS not counting reads properly? > >Hello Andrew, > >OK, I found out the issue. I'm calling Picard's SAMRecordIterator i = >reader.queryOverlapping(chromosome, start, stop); method to retrieve all >alignments that overlap each exon. I then hash the names of the hits to >reduce this to the number of fragments found to hit the entire gene. > >For our standard Extended Junction Analysis with know splices and >annotation derived theoretical splices this works well >(http://useq.sourceforge.net/usageRNASeq.html) since alignments will share >sequence with at least with one of the genes exons. (There are very few >to no multi gene spanners with this approach, see below.) > >The problem is that you're using tophat to call splice matches and that >package is generating a lot of splices that have huge spans that overlap >all the gene's exons in genomic space but share no sequence. > >For example with Cav3, there are 572 such spanners, where the read is >split in half with 300k+ bp between them. These reads span multiple genes >and are counted as hits to each. Here's a couple of the spanners: > >HWI-ST183_0264:7:2208:7929:127398#0 0 chr6 112343683 255 28M309966N16M * 0 >0 GGACGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGCGGGTT >DFADFA;FHGAFGIICGG@@CFHGGCGG >FFFHBBA6@####### NH:i:1 NM:i:3 XS:A:- >HWI-ST183_0264:7:1105:15979:27960#0 0 chr6 112343684 255 27M309966N17M * 0 >0 GGCGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGTGTTC >FFHHHHHJJFHGHIHIJIJJJJJFBGHI >II6D<A6@######## NH:i:1 NM:i:3 XS:A:- > >HWI-ST183_0264:7:2205:3254:80300#0 0 chr6 112343686 255 25M309966N19M * 0 >0 >CGTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGGTTTGTC FFHHHGGGIIJJJJIIJJJJJJHGIIJJI >III@B@######### NH:i:1 NM:i:3 XS:A:- >HWI-ST183_0264:7:1101:15952:159627#0 0 chr6 112343687 255 24M309966N20M * >0 >0 GTTCGGATCTTCTTCTTTTTGTGGCTGGGGGGGGGGGTTTCTCC FFHHHHFHHHIHIIIJJJJJJIJIIIJ >CB?7B########### NH:i:1 NM:i:3 XS:A:- > > >One fix here is to put in a requirement that each alignment actually share >sequence with the exons in a gene. Should be a pretty simple change. > >Thanks for bringing this to my attention! > >Note, for those looking for hit counts for every gene, set the -e 1. >Currently this defaults to 20 and only genes with this number of total >hits will be scored. I'll change this too in the next release. > >-cheers, D > > > >On 4/19/12 10:12 AM, "Oler, Andrew (NIH/NIAID) [C]" ><and...@ni...<mailto:and...@ni...>> >wrote: > >Hi David, > >I have two questions about ORSS. I'm using ORSS in USeq 8.2.0. I was >checking on a few genes (randomly chosen) that had no read counts and >noticed that there are some reads for those exons of those genes. Is >there a threshold number of reads that must be passed before the counts >are reported? (e.g., 5 in at least one dataset?) Is there a way to get >ORSS to report all of the read counts regardless of threshold (and only >skip the differential testing if below a threshold)? (e.g., Dnajb7, >Apol6, Dnahc5). > >What I'm more concerned about is a few genes that I noticed with large >differences between replicates. ORSS reports high counts, but when I >look at the raw reads, there is actually no expression for these genes at >all. See the screenshots for Cav3 and Oxtr. In the attached >spreadsheet, you can see that the read counts for the genes are 572 and >574, respectively for T1 (aka KO1 bam file in IGB). But there are no >reads for these exons in the KO1 bam file. I repeated with just a single >replicate and I got the same result (KO1 as treatment and WT1 as >control). If you look at the wide shot of Cav3, you see that there are >some spliced reads that flank these genes. I think these are being >counted, even though their aligned bases are not overlapping any exons; >there are about 500-600 reads that are hidden that probably have the same >profile, which is probably why the counts are ~572/4. I also just tried >this with USeq 8.2.2 and it gives the same result. > >Thanks, > >Andrew > >Andrew Oler, Ph.D. >High-Throughput Sequencing Bioinformatics Specialist >Contractor Medical Science & Computing, Inc. >Computational Biology Section >Bioinformatics and Computational Biosciences Branch (BCBB) >OCICB/OSMO/OD/NIAID/NIH > >31 Center Drive, Room 3B62 >Bethesda, MD 20892 >Mobile: 240-507-3791 >Office: 301-402-5685 >http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> >(Within NIH) >http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public) > >Disclaimer: The information in this e-mail and any of its attachments is >confidential and may contain sensitive information. It should not be used >by anyone who is not the original intended recipient. If you have >received this e-mail in error please inform the sender and delete it from >your mailbox or any other storage devices. National Institute of Allergy >and Infectious Diseases shall not accept liability for any statements >made that are sender's own and not expressly made on behalf of the NIAID >by one of its representatives. > > > |