You can subscribe to this list here.
2010 |
Jan
|
Feb
|
Mar
|
Apr
(1) |
May
(4) |
Jun
(2) |
Jul
(3) |
Aug
(3) |
Sep
(5) |
Oct
(2) |
Nov
(4) |
Dec
|
---|---|---|---|---|---|---|---|---|---|---|---|---|
2011 |
Jan
(12) |
Feb
|
Mar
(5) |
Apr
(6) |
May
|
Jun
|
Jul
(1) |
Aug
|
Sep
|
Oct
(15) |
Nov
(3) |
Dec
|
2012 |
Jan
|
Feb
(7) |
Mar
(3) |
Apr
(17) |
May
(5) |
Jun
|
Jul
(5) |
Aug
(1) |
Sep
(2) |
Oct
(3) |
Nov
(2) |
Dec
(1) |
2013 |
Jan
|
Feb
|
Mar
|
Apr
|
May
(4) |
Jun
|
Jul
(1) |
Aug
|
Sep
(2) |
Oct
(2) |
Nov
(2) |
Dec
(2) |
2014 |
Jan
|
Feb
(2) |
Mar
(9) |
Apr
(2) |
May
|
Jun
(2) |
Jul
(1) |
Aug
(1) |
Sep
|
Oct
|
Nov
(1) |
Dec
(1) |
2015 |
Jan
|
Feb
|
Mar
(4) |
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
2017 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
(1) |
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
From: Noboru Jo S. <ns...@uc...> - 2013-12-17 00:25:56
|
Hi David, analyzing a sample from my lab that seems to be a failed IP, I found a number of regions in the genome that have many reads, but are not real peaks. They are basically the same across samples, including inputs and failed IPs. One example region is mm9 chrX:166,393,669-166,477,668 (figure attached). This is not a problem exclusive to USeq, but I was wondering if there's something that can be done to improve peak callers, since inputs are also "enriched" in these regions. I know that people have generated files containing regions that should be masked, but since input samples are also enriched, I wonder why they still come up as peaks. Thank you. |
From: David N. <Dav...@hc...> - 2013-11-20 18:54:25
|
Good point Sue. See the novoalign docs too. The -u setting varies by species. Note, it worthwhile piloting what setting are appropriate, e.g. -b2 or -b4. The additional compute time is considerable with -b4 and isn't needed for standard Illumina libraries run on the HiSeq or MiSeq. It's these amplicon/ pcr derived libraries that need special processing. -cheers, D On Nov 20, 2013, at 11:33 AM, Sue Hammoud <Sue...@hc...<mailto:Sue...@hc...>> wrote: Hi Everyone, Also when aligning miseq data make sure to you use –u12 in addition to the -b4. Best Sue From: David Nix <Dav...@hc...<mailto:Dav...@hc...>> Date: Wednesday, November 20, 2013 11:30 AM To: Sue Hammoud <sue...@hc...<mailto:sue...@hc...>>, Magdalena Potok <Mag...@hc...<mailto:Mag...@hc...>>, Brad Cairns <Bra...@hc...<mailto:Bra...@hc...>>, Bushra Gorsi <bg...@ge...<mailto:bg...@ge...>>, Somaye Dehghanizadeh <Som...@hc...<mailto:Som...@hc...>> Cc: "bio...@ut...<mailto:bio...@ut...>" <bio...@ut...<mailto:bio...@ut...>>, USeq <use...@li...<mailto:use...@li...>> Subject: Important USeq_8.7.0 update for bisulfite data analysis Hello Folks, I've posted a new useq release that contains a critical patch for bisulfite sequencing analysis derived from amplicon based library preps. These need to be aligned with the -b 4 setting to maximize the number of aligned reads and then processed through the new NovoalignBisufiteParser 8.7.0 . See http://useq.sourceforge.net/usageBisSeq.html The old NBP won't correctly parse these new alignment types. I have also corrected an issue with the merging of paired overlapping alignments. Some of these were being skipped and double counting of overlapping paired reads occurred. Note the later fix changes the counts but not the percents of methylation. See below for an illustration on a Hg19 sperm dataset. So moving forward, pay very close attention to amplicon -b4 aligned and parsed datasets. Visually inspect the base fraction methylation and converted and non converted point datasets along side the duplicate filtered bam alignment tracks in IGB. Note any discrepancies. This is a novel complex datatype that warrants close scrutiny. -cheers, David #### New NovoalignBisulfiteParser Stats Filtering statistics for 174477814 alignments: 5621028 Failed mapping quality score (13.0) 2192225 Failed alignment score (300.0) 1331036 Aligned to phiX or adapters 0 Failed vendor QC 55078179 Are unmapped 110255346 Passed filters (63.2%) 88083944 Total non-converted Cs sequenced 1504140872 Total converted Cs sequenced 0.055 Fraction non converted C's. 0.998 Fraction bp passing quality (13) 2164389480 BPs overlapping paired sequence 10644938422 BPs paired sequence 0.203 Fraction overlapping bps from paired reads. #### Old NovoalignBisulfiteParser Stats Filtering statistics for 174477814 alignments: 5621028 Failed mapping quality score (13.0) 2192225 Failed alignment score (300.0) 1331036 Aligned to phiX or adapters 0 Failed vendor QC 55078179 Are unmapped 110255346 Passed filters (63.2%) 94996528 Total non-converted Cs sequenced 1640576210 Total converted Cs sequenced 0.055 Fraction non converted C's. 0.998 Fraction bp passing quality (13) 2164389480 BPs overlapping paired sequence 10644938422 BPs paired sequence 0.203 Fraction overlapping bps from paired reads. #### New NBP run through BisStat Using Lambda data to set the expected fraction non-converted Cs to 0.00123 (16257/(16257+13221619)) Stats based on aligned genomic contexts that meet a minimum FDR threshold of 20.0. WARNING: datasets must be subsampled to the same bp aligned for these stats to be cross dataset comparable. 0.978 (12631645/12919931) mCG/mC 0.007 (89908/12919931) mCHG/mC 0.015 (198378/12919931) mCHH/mC Stats based on cumulative sums of all read sequences, no thresholds: 0.968 (85257586/88067661) mCG/mC 0.010 (875639/88067661) mCHG/mC 0.022 (1934436/88067661) mCHH/mC 0.059 (88067661/1490918281) mC/C 3.809 (85257586/22383883) mCG/CG 0.002 (875639/459687208) mCHG/CHG 0.002 (1934436/1008847190) mCHH/CHH 0.056 (88067661/1578985942) mC/(C+mC) 0.792 (85257586/107641469) mCG/(CG+mCG) 0.002 (875639/460562847) mCHG/(CHG+mCHG) 0.002 (1934436/1010781626) mCHH/(CHH+mCHH) #### Old NBP run through BisStat Using Lambda data to set the expected fraction non-converted Cs to 0.00120 (15986/(15986+13272652)) Stats based on aligned genomic contexts that meet a minimum FDR threshold of 20.0. WARNING: datasets must be subsampled to the same bp aligned for these stats to be cross dataset comparable. 0.966 (14109977/14599772) mCG/mC 0.010 (147396/14599772) mCHG/mC 0.023 (342399/14599772) mCHH/mC Stats based on cumulative sums of all read sequences, no thresholds: 0.969 (92009703/94980515) mCG/mC 0.010 (935216/94980515) mCHG/mC 0.021 (2035596/94980515) mCHH/mC 0.058 (94980515/1627302558) mC/C 3.842 (92009703/23946641) mCG/CG 0.002 (935216/497817018) mCHG/CHG 0.002 (2035596/1105538899) mCHH/CHH 0.055 (94980515/1722283073) mC/(C+mC) 0.793 (92009703/115956344) mCG/(CG+mCG) 0.002 (935216/498752234) mCHG/(CHG+mCHG) 0.002 (2035596/1107574495) mCHH/(CHH+mCHH) |
From: David N. <Dav...@hc...> - 2013-11-20 18:33:12
|
Hello Folks, I've posted a new useq release that contains a critical patch for bisulfite sequencing analysis derived from amplicon based library preps. These need to be aligned with the -b 4 setting to maximize the number of aligned reads and then processed through the new NovoalignBisufiteParser 8.7.0 . See http://useq.sourceforge.net/usageBisSeq.html The old NBP won't correctly parse these new alignment types. I have also corrected an issue with the merging of paired overlapping alignments. Some of these were being skipped and double counting of overlapping paired reads occurred. Note the later fix changes the counts but not the percents of methylation. See below for an illustration on a Hg19 sperm dataset. So moving forward, pay very close attention to amplicon -b4 aligned and parsed datasets. Visually inspect the base fraction methylation and converted and non converted point datasets along side the duplicate filtered bam alignment tracks in IGB. Note any discrepancies. This is a novel complex datatype that warrants close scrutiny. -cheers, David #### New NovoalignBisulfiteParser Stats Filtering statistics for 174477814 alignments: 5621028 Failed mapping quality score (13.0) 2192225 Failed alignment score (300.0) 1331036 Aligned to phiX or adapters 0 Failed vendor QC 55078179 Are unmapped 110255346 Passed filters (63.2%) 88083944 Total non-converted Cs sequenced 1504140872 Total converted Cs sequenced 0.055 Fraction non converted C's. 0.998 Fraction bp passing quality (13) 2164389480 BPs overlapping paired sequence 10644938422 BPs paired sequence 0.203 Fraction overlapping bps from paired reads. #### Old NovoalignBisulfiteParser Stats Filtering statistics for 174477814 alignments: 5621028 Failed mapping quality score (13.0) 2192225 Failed alignment score (300.0) 1331036 Aligned to phiX or adapters 0 Failed vendor QC 55078179 Are unmapped 110255346 Passed filters (63.2%) 94996528 Total non-converted Cs sequenced 1640576210 Total converted Cs sequenced 0.055 Fraction non converted C's. 0.998 Fraction bp passing quality (13) 2164389480 BPs overlapping paired sequence 10644938422 BPs paired sequence 0.203 Fraction overlapping bps from paired reads. #### New NBP run through BisStat Using Lambda data to set the expected fraction non-converted Cs to 0.00123 (16257/(16257+13221619)) Stats based on aligned genomic contexts that meet a minimum FDR threshold of 20.0. WARNING: datasets must be subsampled to the same bp aligned for these stats to be cross dataset comparable. 0.978 (12631645/12919931) mCG/mC 0.007 (89908/12919931) mCHG/mC 0.015 (198378/12919931) mCHH/mC Stats based on cumulative sums of all read sequences, no thresholds: 0.968 (85257586/88067661) mCG/mC 0.010 (875639/88067661) mCHG/mC 0.022 (1934436/88067661) mCHH/mC 0.059 (88067661/1490918281) mC/C 3.809 (85257586/22383883) mCG/CG 0.002 (875639/459687208) mCHG/CHG 0.002 (1934436/1008847190) mCHH/CHH 0.056 (88067661/1578985942) mC/(C+mC) 0.792 (85257586/107641469) mCG/(CG+mCG) 0.002 (875639/460562847) mCHG/(CHG+mCHG) 0.002 (1934436/1010781626) mCHH/(CHH+mCHH) #### Old NBP run through BisStat Using Lambda data to set the expected fraction non-converted Cs to 0.00120 (15986/(15986+13272652)) Stats based on aligned genomic contexts that meet a minimum FDR threshold of 20.0. WARNING: datasets must be subsampled to the same bp aligned for these stats to be cross dataset comparable. 0.966 (14109977/14599772) mCG/mC 0.010 (147396/14599772) mCHG/mC 0.023 (342399/14599772) mCHH/mC Stats based on cumulative sums of all read sequences, no thresholds: 0.969 (92009703/94980515) mCG/mC 0.010 (935216/94980515) mCHG/mC 0.021 (2035596/94980515) mCHH/mC 0.058 (94980515/1627302558) mC/C 3.842 (92009703/23946641) mCG/CG 0.002 (935216/497817018) mCHG/CHG 0.002 (2035596/1105538899) mCHH/CHH 0.055 (94980515/1722283073) mC/(C+mC) 0.793 (92009703/115956344) mCG/(CG+mCG) 0.002 (935216/498752234) mCHG/(CHG+mCHG) 0.002 (2035596/1107574495) mCHH/(CHH+mCHH) |
From: David N. <Dav...@hc...> - 2013-10-09 19:29:27
|
OK Folks, Just a heads up. For all novoalign runs, run a gunzip -t on each xxx.txt.gz file before aligning. Novoalign silently completes its run but truncates the output when aligning a broken file. This is a very nasty bug and has/ is causing us no end of problems. We also strongly recommend using rsync or FDT when transferring all of your data files. -cheers, Dave -- David Austin Nix, PhD Huntsman Cancer Institute Dept. OncSci University of Utah Bioinformatics Shared Resource HCI 3165 (801) 587-4611 dav...@hc... |
From: David N. <Dav...@hc...> - 2013-10-01 14:31:21
|
Sounds good! Begin forwarded message: From: Don Delker <don...@hs...<mailto:don...@hs...>> Subject: FW: *New Course* Java 3! Date: October 1, 2013 8:14:43 AM MDT To: David Nix <Dav...@hc...<mailto:Dav...@hc...>> David, Could you forward this new Java 3 class information to the Useq user group? Thanks, Don From: Technology Education at The University of Utah [mailto:mar...@ao...<http://aoce.utah.edu>] Sent: Friday, September 27, 2013 12:33 PM To: Don Delker Subject: *New Course* Java 3! Trouble viewing this email? Read it online<http://secure.campaigner.com/Campaigner/Public/t.show?5xq3w--33e7s-r6tle49&_v=2> [Image removed by sender.]<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2p-r6tle46&_v=2> Java 3 Course Now Available! You have taken Java 1 & 2. Now take your knowledge of Java to the mastery level with this brand new class at Tech Ed. ________________________________ Java Programming Language Part 3 This class provides hands-on Java programming including continued Java GUI programming, introduction to applets, recursion and using Java with databases. Java Part 3 will focus on expanding the skills learned in Part 1 and Part 2 toward building real world applications. Prerequisite: Completion of Java Part 1 and Part 2, or instructor approval. MW 10/07/13 - 10/16/13 5:30 pm - 9:00 pm U of U Annex, $799 Click here for more info and to register<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2q-r6tle47&_v=2> Browse Technology Classes Online! Choose from dozens of cutting-edge design, network, and productivity software classes and certificates. Sign up before they fill up! Unsure Which Class You Need? Call us at 801-581-6061, and we'll help you discover the perfect fit for your career and organizational goals! Download Our Catalog: Choose from dozens of classes in design, productivity, programming, and much more! Click on the image below to download our latest catalogs: [Image removed by sender.]<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2s-r6tle49&_v=2> [Image removed by sender.] <http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2t-r6tle40&_v=2> or browse our classes - and register - on our website: edtech.utah.edu<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2u-r6tle41&_v=2> CONTACT US Technology Education at the University of Utah 801-581-6061 edtech.utah.edu/contact<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2v-r6tle42&_v=2> Continuing Education at The University of Utah 1901 E South Campus Dr. #1215, SLC, UT 84112 [Image removed by sender. Facebook Page]<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2w-r6tle43&_v=2> [Image removed by sender. Twitter Page] <http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2x-r6tle44&_v=2> [Image removed by sender. YouTube Page] <http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2y-r6tle45&_v=2> [Image removed by sender. Flickr Page] <http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2z-r6tle46&_v=2> The University of Utah 1901 E South Campus dr, RM 1215 Salt Lake City Utah 84112 United States You are subscribed to this mailing list as don...@hs...<mailto:don...@hs...>. Please click here<http://trk.cp20.com/Tracking/t.fo?5xq3w--bxvv-r6tle40&sl=3b&t=5&_v=2> to modify your message preferences or to unsubscribe from any future mailings. We will respect all unsubscribe requests. ________________________________ [Image removed by sender.]<http://www.campaigner.com/campaignerPro.php?utm_source=campaigner&utm_medium=email&utm_campaign=deliveryfooter>[Image removed by sender.] |
From: David N. <Dav...@hc...> - 2013-09-24 13:13:32
|
These guys are putting on a series of tutorials, this one being a good intro to the browser. -cheers, David Begin forwarded message: From: "Loraine, Ann" <Ann...@un...<mailto:Ann...@un...>> Subject: Please forward: Wed 9/25 Integrated Genome Browser tutorial Date: September 24, 2013 6:39:28 AM MDT Cc: "Loraine, Ann" <Ann...@un...<mailto:Ann...@un...>> Hello, My apologies if you have received multiple copies of this notice. I'm writing to let you know about a free, on-line tutorial my lab is hosting tomorrow (9/25) for Integrated Genome Browser. IGB is open source, free software developed in my lab that helps researchers visualize and analyze data from high-throughput sequencing experiments, including (but not limited to) RNA-Seq, ChIP-Seq, and epigenetics data sets. Would you forward this announcement to interested students and/or colleagues? Many thanks, Ann Loraine Associate Professor Dept. of Bioinformatics and Genomics UNC Charlotte ---- Please join us for a live, on-line tutorial on Integrated Genome Browser: "Focus on a Feature: Working with Sequence Data." Wed 1-2 pm Eastern (US) Sept 25 Integrated Genome Browser (IGB) is a highly interactive, fast and flexible desktop genome browser mainly used to view data from RNA-Seq, ChIP-Seq, and other genome-scale data sets. IGB also supports viewing and interacting with sequence data from reference genomes, gene models, or alignments. For example, in IGB, you can select and copy sequence directly from the genomic sequence track or view six-frame translations in a text-based Sequence Viewer display. In this tutorial, you'll learn how to * select and copy reference genomic sequence * view, copy, or save sequence data * translate sequences in all 6 frames * view, copy or save gene model sequences with or without introns * configure how sequence bases are shown * open genomic sequence as a new track Example data sets will likely include cold stress or pollen RNA-Seq data sets from Arabidopsis thaliana. To ensure that you'll be able to do the exercises along with the presenter, please test-run IGB before the tutorial. To download and start IGB, go to bioviz.org<http://bioviz.org> and followed the "Download" links. Once IGB starts, choose Help / Tutorials for a quick introduction to IGB. If you have questions or need assistance, let us know. Contact: Alyssa Gulledge agu...@un...<mailto:agu...@un...> Ann Loraine alo...@un...<mailto:alo...@un...> To attend the tutorial: 1. Go to https://global.gotomeeting.com/meeting/join/238242685 2. Use your microphone and speakers (VoIP) - a headset is recommended. Or, call in using your telephone. United States: +1 (213) 493-0606 Access Code: 238-242-685 Audio PIN: Shown after joining the meeting Meeting ID:238-242-685 Attendance is limited to the first 25 participants who sign in. This is the second of several monthly "Focus on a Feature" tutorials to be offered the fourth Wednesday of every month at 1 pm Eastern time US. For recordings visit the IGB YouTube channel http://www.youtube.com/channel/UC0DA2d3YdbQ55ljkRKHRBkg |
From: David N. <Dav...@hc...> - 2013-09-04 17:33:34
|
OK! On Sep 4, 2013, at 11:24 AM, Don Delker <don...@hs...<mailto:don...@hs...>> wrote: Hi David, The next Java programming class starts next Monday, September 9th. If you know of anyone that might be interested please forward this along. We need one more person for the class. Thanks, Don <Java Syllabus2.pdf> |
From: nimrod r. <ru...@po...> - 2013-07-10 18:03:21
|
Hi, I have RNA-seq data that I've aligned using novoAlign to mm10 knownGene indexed with Useq, and converted transcriptomic coordinates to genomic coordinates with Useq as well. Trying to call variants with samtools mpileup then crashes for some of the libraries I have, with this error message: [bcf_sync] incorrect number of fields (6 != 5) at 15:62640916. I was wondering if this may have anything to do with Useq and if so how can that be fixed. Thanks, rubi |
From: David N. <dav...@gm...> - 2013-05-31 02:14:12
|
Hey Guys, I think I'm in trouble. My kind wife volunteered me to promote an upcoming fundraiser at our daughter's school (Madeline Choir). All proceeds to the school. ("We" also apparently volunteered to pay for the show. My palms are sweating.... What if no one shows up?) It's June 8th and features Otter Creek (http://www.ottercreekduo.com ), a phenomenal acoustic duo from SLC. (Peter is my mandolin teacher.) I cannot say enough good things about these guys. Classically trained in stringed instruments and voice. Technically and lyrically brilliant. Like all artists pushing the edge, their music is hard to describe, a heady mix of folk, bluegrass, newgrass, traditional, protest, mountain music.... They are really good and not to be missed. So I hope you can make it. -cheers, Dave http://bioserver.hci.utah.edu/Misc/BCB/flyer.pdf |
From: Barry M. <bar...@ge...> - 2013-05-18 17:06:18
|
Nice David - thanks. B On May 17, 2013, at 12:47 PM, David Nix wrote: > OK Folks, > > Here's a new USeq app for comparing VCF files to a key of trusted calls. To use it: > > Download the latest NIST key set (a vcf file of het and homo non ref calls and a bed file of interrogated regions) from Justin Zook <jus...@ni...<mailto:jus...@ni...>> or https://bioserver.hci.utah.edu/gnomex/gnomexFlex.jsp?analysisNumber=A1489 (this isn't necessarily the most recent). > > Download the USeq package from SourceForge http://sourceforge.net/projects/useq/ . > > Run the VCFComparator. > > [u0028003@hci-alta NIST]$ cd /Repository/AnalysisData/2013/A1489/NIST/5thCallSet_2.14/ > [u0028003@hci-alta 5thCallSet_2.14]$ ls > ExampleCalls Key VCFCompTest > [u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator > > ************************************************************************************** > ** VCF Comparator : April 2013 ** > ************************************************************************************** > Compares test vcf file(s) against a gold standard key of trusted vcf calls. Only calls > that fall in the common interrogated regions are compared. WARNING tabix gzipped files > often fail to parse correctly with java. Seeing odd error messages? Try uncompressing. > > Required Options: > -a VCF file for the key dataset (xxx.vcf(.gz/.zip OK)). > -b Bed file of interrogated regions for the key dataset (xxx.bed(.gz/.zip OK)). > -c VCF file for the test dataset (xxx.vcf(.gz/.zip OK)). May also provide a directory > containing xxx.vcf(.gz/.zip OK) files to compare. > -d Bed file of interrogated regions for the test dataset (xxx.bed(.gz/.zip OK)). > > Optional Options: > -g Require the genotype to match, defaults to scoring a match when the alternate > allele is present. > -v Use VQSLOD score as ranking statistic in place of the QUAL score. > -s Only compare SNPs, defaults to all. > -n Only compare non SNPs, defaults to all. > -p Provide a full path directory for saving the parsed data. Defaults to not saving. > -e Exclude test and key records whose FILTER field is not . or PASS. Defaults to > scoring all. > > Example: java -Xmx10G -jar pathTo/USeq/Apps/VCFComparator -a /NIST/NA12878/key.vcf > -b /NIST/NA12878/regions.bed.gz -c /EdgeBio/Exome/testHaploCaller.vcf.zip > -d /EdgeBio/Exome/NimbleGenExomeV3.bed -g -v -s -e -p /CompRes/ > > ************************************************************************************** > > [u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator -a Key/nist2.14.vcf.gz -b Key/nist2.14.bed.gz -c ExampleCalls/ -d Key/nimExomeV3.bed.gz -g -s -e -p VCFCompTest > > This generates a variety of files including matching and non matching vcf files, spreadsheets of stats for each test set, and a combine file listing just the FDR and TPR's for each dataset over a range of thresholds enabling the generation of TPR vs FDR pseudo ROC curves. > > [cid:48F...@hc...] > > Basically the method/ condition that returns the largest TPRs over a range of relevant FDRs for your experiment, say 1-5% is the "best". How much better can be measured off the graph. In the example above, sequencing 3 exomes per lane instead of 8 exomes increases the recovery of the key SNP variants by 5% ( at an FDR of 5%). Is that 5% improvement worth 2.7x the sequencing cost? Possibly. > > One issue I am a little worried about is the low FDR values for unfiltered variants call files with these datasets. Setting GATK's strand_call and strand_emit confidence to 0 doesn't produce a list of unfiltered SNPs with > 7 %FDR. This seems wrong given our experience with variant analysis of non NA12878 exomes. Any suggestions/ debugging tips would be most appreciated. > > -cheers, David > > > > David Austin Nix, PhD > Huntsman Cancer Institute > Dept. OncSci University of Utah > Bioinformatics Shared Resource > HCI 3165 (801) 587-4611 > dav...@hc...<mailto:dav...@hc...> > > > > <PastedGraphic-2.tiff> Barry Moore Research Scientist Dept. of Human Genetics University of Utah Salt Lake City, UT 84112 -------------------------------------------- (801) 585-3543 |
From: David N. <Dav...@hc...> - 2013-05-17 18:49:21
|
OK Folks, Here's a new USeq app for comparing VCF files to a key of trusted calls. To use it: Download the latest NIST key set (a vcf file of het and homo non ref calls and a bed file of interrogated regions) from Justin Zook <jus...@ni...<mailto:jus...@ni...>> or https://bioserver.hci.utah.edu/gnomex/gnomexFlex.jsp?analysisNumber=A1489 (this isn't necessarily the most recent). Download the USeq package from SourceForge http://sourceforge.net/projects/useq/ . Run the VCFComparator. [u0028003@hci-alta NIST]$ cd /Repository/AnalysisData/2013/A1489/NIST/5thCallSet_2.14/ [u0028003@hci-alta 5thCallSet_2.14]$ ls ExampleCalls Key VCFCompTest [u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator ************************************************************************************** ** VCF Comparator : April 2013 ** ************************************************************************************** Compares test vcf file(s) against a gold standard key of trusted vcf calls. Only calls that fall in the common interrogated regions are compared. WARNING tabix gzipped files often fail to parse correctly with java. Seeing odd error messages? Try uncompressing. Required Options: -a VCF file for the key dataset (xxx.vcf(.gz/.zip OK)). -b Bed file of interrogated regions for the key dataset (xxx.bed(.gz/.zip OK)). -c VCF file for the test dataset (xxx.vcf(.gz/.zip OK)). May also provide a directory containing xxx.vcf(.gz/.zip OK) files to compare. -d Bed file of interrogated regions for the test dataset (xxx.bed(.gz/.zip OK)). Optional Options: -g Require the genotype to match, defaults to scoring a match when the alternate allele is present. -v Use VQSLOD score as ranking statistic in place of the QUAL score. -s Only compare SNPs, defaults to all. -n Only compare non SNPs, defaults to all. -p Provide a full path directory for saving the parsed data. Defaults to not saving. -e Exclude test and key records whose FILTER field is not . or PASS. Defaults to scoring all. Example: java -Xmx10G -jar pathTo/USeq/Apps/VCFComparator -a /NIST/NA12878/key.vcf -b /NIST/NA12878/regions.bed.gz -c /EdgeBio/Exome/testHaploCaller.vcf.zip -d /EdgeBio/Exome/NimbleGenExomeV3.bed -g -v -s -e -p /CompRes/ ************************************************************************************** [u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator -a Key/nist2.14.vcf.gz -b Key/nist2.14.bed.gz -c ExampleCalls/ -d Key/nimExomeV3.bed.gz -g -s -e -p VCFCompTest This generates a variety of files including matching and non matching vcf files, spreadsheets of stats for each test set, and a combine file listing just the FDR and TPR's for each dataset over a range of thresholds enabling the generation of TPR vs FDR pseudo ROC curves. [cid:48F...@hc...] Basically the method/ condition that returns the largest TPRs over a range of relevant FDRs for your experiment, say 1-5% is the "best". How much better can be measured off the graph. In the example above, sequencing 3 exomes per lane instead of 8 exomes increases the recovery of the key SNP variants by 5% ( at an FDR of 5%). Is that 5% improvement worth 2.7x the sequencing cost? Possibly. One issue I am a little worried about is the low FDR values for unfiltered variants call files with these datasets. Setting GATK's strand_call and strand_emit confidence to 0 doesn't produce a list of unfiltered SNPs with > 7 %FDR. This seems wrong given our experience with variant analysis of non NA12878 exomes. Any suggestions/ debugging tips would be most appreciated. -cheers, David David Austin Nix, PhD Huntsman Cancer Institute Dept. OncSci University of Utah Bioinformatics Shared Resource HCI 3165 (801) 587-4611 dav...@hc...<mailto:dav...@hc...> |
From: David N. <Dav...@hc...> - 2012-12-11 18:41:54
|
Hello Jesse, Yes I've seen this error before. You'll need to replace the header with one consistent across all bam files. There's a picard app that does this I believe. The issue is STP generates a custom header on the fly based on the end of the last alignment in each chromosome. So the chromosome length differs for each file. You can have STP use a particular header with the -r argument for future parsings. -cheers, D On 12/11/12 11:35 AM, "Jesse Rowley" <jes...@u2...> wrote: >Hi David, >I aligned my RNA-seq reads with >@align -novoalign [-o SAM -r All 50 -a >AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC] -g >hg19EnsTransRad46Num100kMin10SplicesChrPhiXAdaptr -i *.txt.gz" > cmd.txt >this was followed by samtranscriptome parser. Now I ma running GATK. > >I get the following error with GATK: >##### ERROR MESSAGE: Input files reads and reference have incompatible >contigs: Found contigs with the same name but different lengths: >##### ERROR contig reads = chr20 / 62940766 >##### ERROR contig reference = chr20 / 63025520. >##### ERROR reads contigs = [chr20, chr21, chr22, chr5, chr6, chr7, >chr8, chr9, chr1, chr2, chr3, chr4, chr10, chr11, chr12, chr13, chrY, >chrX, chr15, chr14, chr17, chr16, chr19, chr18, chrM] >##### ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, >chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, >chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM] > >These are different as shown below - is it because these are changed by >maketranscriptome or was this transcriptome made with a different >reference? should I just revise my Bam header to match or would that be >problematic? My strategy is to use GATK in the @annot pipeline to >annotate variants (first splitting out splice junction reads prior to >indel realignment since GATK doesn't like those, and then remerging them >before annotation) >Fasta.fai: >chr1 249250621 6 50 51 >chr2 243199373 254235646 50 51 >chr3 198022430 502299013 50 51 >chr4 191154276 704281898 50 51 >chr5 180915260 899259266 50 51 >chr6 171115067 1083792838 50 51 >chr7 159138663 1258330213 50 51 >chr8 146364022 1420651656 50 51 >chr9 141213431 1569942965 50 51 >chr10 135534747 1713980672 50 51 >chr11 135006516 1852226121 50 51 >chr12 133851895 1989932775 50 51 >chr13 115169878 2126461715 50 51 >chr14 107349540 2243934998 50 51 >chr15 102531392 2353431536 50 51 >chr16 90354753 2458013563 50 51 >chr17 81195210 2550175419 50 51 >chr18 78077248 2632994541 50 51 >chr19 59128983 2712633341 50 51 >chr20 63025520 2772944911 50 51 >chr21 48129895 2837230949 50 51 >chr22 51304566 2886323449 50 51 >chrX 155270560 2938654113 50 51 >chrY 59373566 3097030091 50 51 >chrM 16571 3157591135 50 51 > >Bam header: >@HD VN:1.0 SO:coordinate >@SQ SN:chr20 LN:62940766 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr21 LN:48129227 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr22 LN:51253224 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr5 LN:180910700 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr6 LN:170933915 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr7 LN:159102511 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr8 LN:146307037 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr9 LN:141156475 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr1 LN:249250020 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr2 LN:243198952 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr3 LN:197957767 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr4 LN:191041935 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr10 LN:135528848 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA >daptr >@SQ SN:chr11 LN:134900590 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA >daptr >@SQ SN:chr12 LN:133811418 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA >daptr >@SQ SN:chr13 LN:115117205 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA >daptr >@SQ SN:chrY LN:59014245 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAda >ptr >@SQ SN:chrX LN:155267065 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr15 LN:102530814 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA >daptr >@SQ SN:chr14 LN:107295415 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA >daptr >@SQ SN:chr17 LN:81198313 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr16 LN:90291808 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr19 LN:59126095 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chr18 LN:78015294 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd >aptr >@SQ SN:chrM LN:26571 AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAdaptr >@RG ID:unknownReadGroup SM:unknownSample >@PG ID:SamTranscriptomeParser CL: args -f >9657X1_121116_SN141_0600_AD1FYUACXX_1.sam.gz > >thanks for your input, >Jesse Rowley >Division of Pulmonary Medicine >University of Utah School of Medicine |
From: Guorong Xu <guo...@gm...> - 2012-11-12 19:57:41
|
Hi David, I found another issue on the conversion program. When I used Novoalign to align our fastq data against reference genome, I found the MD tag in the SAM record. But the MD tag was missing after I converted the artificial chromosome name to regular chromosome name. You can also found the below four SAM records, all MD tags were missing. Could you double check this issue? Thanks, Guorong On Fri, Oct 26, 2012 at 10:11 AM, David Nix <Dav...@hc...> wrote: > Hello Guorong, > > Yes, looks like an issue. Would you mind grepping your unprocessed > alignment file for those read names so I can see what it started out as. > > -cheers, D > > P.S. We don't use HTSeq so it's not an issue. The USeq apps hash on the > read name after collecting intersecting reads for each exon so paired data > is collapsed if present. I wonder how HTSeq is doing it? As a short term > workaround, use the USeq DefinedRegionDifferentialSeq with the -t option > and it will generate a table of hit counts for each gene and each sample. > http://useq.sourceforge.net/cmdLnMenus.html#DefinedRegionDifferentialSeq > > > > From: Guorong Xu <guo...@gm...<mailto:guo...@gm...>> > Date: Wed, 24 Oct 2012 13:18:51 -0500 > To: David Nix <dav...@hc...<mailto:dav...@hc...>> > Subject: USeq issue > > Hi David, > > I found an issue from Useq SamTranscriptomParser for paired-end dataset. > Please see the below two fragments (4 reads). > > The mate positions of the first fragment are not correct, but the mate > positions of the second fragment are correct. > When I looked into the original alignment file by novoalign, I found the > first fragment was mapped on a junction region (junction library). > The mate position of these two reads are not changed when Useq converts > the artificial chrom name to genomic chrom name. > That’s why the mate positions of the first fragment are very small (457 > and 345). > > Now, we cannot use HTSeq to calculate the counts information due to this > incorrect format (mate position issue). > Do you have any suggestions on that? > > Thanks, > Guorong > > HWI-ST1189:59:C1305ACXX:5:1101:10001:63267 99 chr21 48064279 0 > 51M = 457 0 GGGGTGAGCGTGCGGGCTGCTGTGGGTAC > ATTCCGGCAAACCATGTGGGGA > BCCFDFFFHHHHHJJJJJJJJJGIIJFHIJJJJJJJJJHHHFFFFFFEDDD PG:Z:novoalignMPI > NH:i:12 HI:i:1 AM:i:70 NM:i:0 SM:i:70 GN:Z: > 951 TN:Z:ENST00000440086 ZN:i:12 PQ:i:2 UQ:i:2 AS:i:2 > ZS:Z:R > HWI-ST1189:59:C1305ACXX:5:1101:10001:63267 147 chr21 48064391 > 0 10M3969N41M = 345 0 TGGAACTCTGAAACTCCACTT > GGAGATGTTGGCAGACCAGCCACGAACAAC > EHGHDJIJGGHH9GFHD:BIGIHGGGGFHGIHGEFBJJHHHHHFFFFFCCC PG:Z:novoalignMPI > NH:i:12 HI:i:1 AM:i:70 NM:i:0 SM:i: > 70 GN:Z:951 TN:Z:ENST00000440086 ZN:i:12 PQ:i:2 UQ:i:0 > AS:i:0 XS:A:- ZS:Z:R > > HWI-ST1189:59:C1305ACXX:5:1101:10001:74618 99 chr3 185135345 0 > 51M = 185135507 0 CTGGTTTCTTGGTAGCTGCTG > CCTTTTTTCCCACCAGAGGCTTCTTCTGCT > @@@FBDDEFFHHFHHIIJJIIIJIIJJIIGDHIJIGGGHIFJJGGIBG@EF PG:Z:novoalignMPI > NH:i:6 HI:i:1 AM:i:70 NM:i:0 SM:i: > 70 ZN:i:6 PQ:i:3 UQ:i:3 AS:i:3 ZS:Z:R > HWI-ST1189:59:C1305ACXX:5:1101:10001:74618 147 chr3 185135507 > 0 51M = 185135345 0 CCTTATCCACCCGGAGCTTGT > GATTCCTGGCCTGGCGAAGAATGGTGTTCC > @EGGEGIGDIHAGJIDIIIGHABHIGGDDCGGEGGFHFFFHHHDDDFDC@@ PG:Z:novoalignMPI > NH:i:6 HI:i:1 AM:i:70 NM:i:0 SM:i: > 70 ZN:i:6 PQ:i:3 UQ:i:0 AS:i:0 ZS:Z:R > |
From: Desplat N. <cle...@ya...> - 2012-11-12 07:11:36
|
Hello, I analyse RNA-seq data and I use Novoalign protocol : http://www.novocraft.com/wiki/tiki-index.php?page=RNASeq+analysis%3A+mRNA+and+the+Spliceosome I have a problem with this step : "We now need to convert the splice junction coordinates back to genome coordinates using USeq SamTranscriptomeParser (external link) java -jar pathTo/USeq/USeq_8.4.4/Apps/SamTranscriptomeParser -f ./novo/alignments.sam -a 50000 -n 100 -u -s novo/alignments_Converted.sam #output is novo/alignments_Converted.sam.gz #Unmapped reads are saved with -u option in the same file in SAM format" When I run this command I get this message and the program stops. "USeq_8.4.4 Arguments: -n 100 -a 5000 -u -f Sample_F2_RNA_Seq_P1_alignments.sam -s Sample_F2_RNA_Seq_P1_alignments_converted.sam 5000.0 Maximum alignment score 0.0 Minimum mapping quality score 100 Maximum locations each read may align true Reverse the strand of the second paired alignemnt true Save unmapped and low score reads true Remove control chrPhiX and chrAdapter alignments false Randomly choose an alignment from read blocks that fail the max locations threshold Parsing, filtering, and merging SAM files... Sample_F2_RNA_Seq_P1_alignments.sam.............................................................................................Didn't find an intersecting chunk? for HISEQ9:202:D08DNACXX:4:1306:3933:11725 129 Chr7 81128509 1 16M1I77M2I5M Chr2 161017396 0 AAGATGGCTTGGGAGCAAAAGCTCCGCCAAGCTTGTCGAGCATCCAATGC TTGGGCACATTGAGCCTCTTCAAGTGCTTCTTCAATCCCCTAGCCTAGGAC 74688888888888788887876877888888777788988788889888898888789998888888888988788888888888888899887875232 PG:Z:novoalign AS:i :147 UQ:i:147 NM:i:5 ZS:Z:R ZN:i:5 NH:i:5 HI:i:1 IH:i:0 GN:Z:GRMZM2G542753 SJ:Z:Chr7:81128504-81128524_81128618-81128695_81128957-81128957_81139426-81139 523" Could you help me to understand this error? Thank you. Nelly |
From: David N. <Dav...@hc...> - 2012-10-26 15:11:29
|
Hello Guorong, Yes, looks like an issue. Would you mind grepping your unprocessed alignment file for those read names so I can see what it started out as. -cheers, D P.S. We don't use HTSeq so it's not an issue. The USeq apps hash on the read name after collecting intersecting reads for each exon so paired data is collapsed if present. I wonder how HTSeq is doing it? As a short term workaround, use the USeq DefinedRegionDifferentialSeq with the -t option and it will generate a table of hit counts for each gene and each sample. http://useq.sourceforge.net/cmdLnMenus.html#DefinedRegionDifferentialSeq From: Guorong Xu <guo...@gm...<mailto:guo...@gm...>> Date: Wed, 24 Oct 2012 13:18:51 -0500 To: David Nix <dav...@hc...<mailto:dav...@hc...>> Subject: USeq issue Hi David, I found an issue from Useq SamTranscriptomParser for paired-end dataset. Please see the below two fragments (4 reads). The mate positions of the first fragment are not correct, but the mate positions of the second fragment are correct. When I looked into the original alignment file by novoalign, I found the first fragment was mapped on a junction region (junction library). The mate position of these two reads are not changed when Useq converts the artificial chrom name to genomic chrom name. That’s why the mate positions of the first fragment are very small (457 and 345). Now, we cannot use HTSeq to calculate the counts information due to this incorrect format (mate position issue). Do you have any suggestions on that? Thanks, Guorong HWI-ST1189:59:C1305ACXX:5:1101:10001:63267 99 chr21 48064279 0 51M = 457 0 GGGGTGAGCGTGCGGGCTGCTGTGGGTAC ATTCCGGCAAACCATGTGGGGA BCCFDFFFHHHHHJJJJJJJJJGIIJFHIJJJJJJJJJHHHFFFFFFEDDD PG:Z:novoalignMPI NH:i:12 HI:i:1 AM:i:70 NM:i:0 SM:i:70 GN:Z: 951 TN:Z:ENST00000440086 ZN:i:12 PQ:i:2 UQ:i:2 AS:i:2 ZS:Z:R HWI-ST1189:59:C1305ACXX:5:1101:10001:63267 147 chr21 48064391 0 10M3969N41M = 345 0 TGGAACTCTGAAACTCCACTT GGAGATGTTGGCAGACCAGCCACGAACAAC EHGHDJIJGGHH9GFHD:BIGIHGGGGFHGIHGEFBJJHHHHHFFFFFCCC PG:Z:novoalignMPI NH:i:12 HI:i:1 AM:i:70 NM:i:0 SM:i: 70 GN:Z:951 TN:Z:ENST00000440086 ZN:i:12 PQ:i:2 UQ:i:0 AS:i:0 XS:A:- ZS:Z:R HWI-ST1189:59:C1305ACXX:5:1101:10001:74618 99 chr3 185135345 0 51M = 185135507 0 CTGGTTTCTTGGTAGCTGCTG CCTTTTTTCCCACCAGAGGCTTCTTCTGCT @@@FBDDEFFHHFHHIIJJIIIJIIJJIIGDHIJIGGGHIFJJGGIBG@EF PG:Z:novoalignMPI NH:i:6 HI:i:1 AM:i:70 NM:i:0 SM:i: 70 ZN:i:6 PQ:i:3 UQ:i:3 AS:i:3 ZS:Z:R HWI-ST1189:59:C1305ACXX:5:1101:10001:74618 147 chr3 185135507 0 51M = 185135345 0 CCTTATCCACCCGGAGCTTGT GATTCCTGGCCTGGCGAAGAATGGTGTTCC @EGGEGIGDIHAGJIDIIIGHABHIGGDDCGGEGGFHFFFHHHDDDFDC@@ PG:Z:novoalignMPI NH:i:6 HI:i:1 AM:i:70 NM:i:0 SM:i: 70 ZN:i:6 PQ:i:3 UQ:i:0 AS:i:0 ZS:Z:R |
From: David N. <dav...@gm...> - 2012-10-21 12:35:20
|
Hello Michael, I break up the transcript file by chromosome and run each separately on a cluster. Another speed up is to set the max number of junctions allowed to 10000 instead of the default 100000. Did you see the RNASeq extended junction guide? http://useq.sourceforge.net/usageRNASeq.html We've built many transcriptomes so take a look here to see if there is something you can use. https://bioserver.hci.utah.edu:443/gnomex/gnomexFlex.jsp?topicNumber=4 -cheers, D -- David Austin Nix, PhD Huntsman Cancer Institute Dept. OncSci University of Utah Bioinformatics Shared Resource HCI 3165 (801) 587-4611 dav...@hc... From: michael xu <mic...@gm...> Date: Fri, 19 Oct 2012 20:35:43 -0500 To: USeq <use...@li...> Subject: [Useq-users] How to speedup the process of building junction library Hi David, Thank you for developing a very useful tool USeq! I am using the USeq to build a novoalign index file with junction library, but I found it takes a very long time (> 5+ hours or longer) to build a junction library. Until now, I still have not finish one junction library for human 19 on a very powerful cluster. Do you have any suggestion on how to build a junction library quickly? Besides, I know I can download a library from your website with 50nt, but the length of read in my dataset is 51nt. I am not sure if I can use the same junction library build by you even the length of read is 51nt. Suppose that -r should set "length of read - 4". Your response would be greatly appreciated! Michael ---------------------------------------------------------------------------- -- Everyone hates slow websites. So do we. Make your web apps faster with AppDynamics Download AppDynamics Lite for free today: http://p.sf.net/sfu/appdyn_sfd2d_oct________________________________________ _______ Useq-users mailing list Use...@li... https://lists.sourceforge.net/lists/listinfo/useq-users |
From: michael xu <mic...@gm...> - 2012-10-20 01:35:49
|
Hi David, Thank you for developing a very useful tool USeq! I am using the USeq to build a novoalign index file with junction library, but I found it takes a very long time (> 5+ hours or longer) to build a junction library. Until now, I still have not finish one junction library for human 19 on a very powerful cluster. Do you have any suggestion on how to build a junction library quickly? Besides, I know I can download a library from your website with 50nt, but the length of read in my dataset is 51nt. I am not sure if I can use the same junction library build by you even the length of read is 51nt. Suppose that -r should set "length of read - 4". Your response would be greatly appreciated! Michael |
From: David N. <Dav...@hc...> - 2012-09-11 13:38:23
|
Hello Jeff, I take it this is from running the OverdispersedRegionScanSeqs? There's an improved app that replaces this one called DefinedRegionDifferentialSeq. I'd recommend using this for your subsequent analysis. Regarding your questions. Genes with few total reads across the two condition comparison (less than the minimum, default 20) are ignored for all aspects of analysis and reporting in ODRSS (but not in DRDS). This a gene might be ignored in one analysis but present in another due to that minimum. -cheers, D From: Jeff Alexander <jef...@gm...<mailto:jef...@gm...>> Date: Thu, 6 Sep 2012 15:32:20 -0700 To: USeq <use...@li...<mailto:use...@li...>> Subject: [Useq-users] Question for USeq RNA-seq Analysis Output Hello, I've performed some analysis of RNA-seq data using USeq and I am a bit confused about the output file all.xls. Can someone tell me what it means if a gene has only gene identifiers and genomic location information but no expression data? I have assumed that this meant that there were no reads found for this gene in any of the compared datasets. However, I've recently realized that some of these genes should have reads that map to their coding regions. For instance, in this experiment I have three conditions that were each compared for differential gene expression. In some cases, the comparison between Samples 1 and 2 show measured reads in both conditions for a given gene, but comparing Samples 2 and 3 lists this gene as blank. Any insights would be greatly appreciated. I've attached an example image of this output, showing genes with no RNA-seq information. Thank you! Jeff Jeffrey Alexander Graduate Student Bruneau Laboratory 1650 Owens St. San Francisco CA, 94122 P: 415-734-2885 [cid:6E3...@uc...] ------------------------------------------------------------------------------ Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/_______________________________________________ Useq-users mailing list Use...@li...<mailto:Use...@li...> https://lists.sourceforge.net/lists/listinfo/useq-users |
From: Jeff A. <jef...@gm...> - 2012-09-06 23:07:26
|
Hello, I've performed some analysis of RNA-seq data using USeq and I am a bit confused about the output file all.xls. Can someone tell me what it means if a gene has only gene identifiers and genomic location information but no expression data? I have assumed that this meant that there were no reads found for this gene in any of the compared datasets. However, I've recently realized that some of these genes should have reads that map to their coding regions. For instance, in this experiment I have three conditions that were each compared for differential gene expression. In some cases, the comparison between Samples 1 and 2 show measured reads in both conditions for a given gene, but comparing Samples 2 and 3 lists this gene as blank. Any insights would be greatly appreciated. I've attached an example image of this output, showing genes with no RNA-seq information. Thank you! Jeff Jeffrey Alexander Graduate Student Bruneau Laboratory 1650 Owens St. San Francisco CA, 94122 P: 415-734-2885 |
From: David N. <Dav...@hc...> - 2012-08-28 22:08:09
|
Hello Folks, For those of you running bisulfite sequencing analysis using the BisSeqAggregatePlotter to generate class average plots of methylation over defined regions, please re run your analysis using USeq_8.3.8. https://sourceforge.net/projects/useq/ There was an unfortunate bug that was preventing the merging of data across chromosomes so the results being generated were only from merging regions found on one of the chromosomes (iterating over a hash of chromosome names). Your reanalysis should look similar but a lot smoother since it will include all the regions. Many apologies for the error. Thanks to Magda for raising a red flag about a potential problem. -David -- David Austin Nix, PhD Huntsman Cancer Institute Dept. OncSci University of Utah Bioinformatics Shared Resource HCI 3165 (801) 587-4611 dav...@hc... |
From: David N. <Dav...@hc...> - 2012-07-31 15:32:51
|
Hello Folks, I've depreciated the OverdispersedRegionScanSeqs and MultipleConditionRNASeq apps and replaced them with a 3X faster (no Picard lookup) , more accurate, better diff splice junction detection app called DefinedRegionDifferentialSeq. It still wraps DESeq but calculates the log2Ratio using DESeqs variance corrected counts instead of total count scaling. It works with paired and multiple condition RNASeq data. I've also incorporated this into the RNASeq wrapper app so give these new ones a try. You can download USeq_8.3.5 from Sourceforge. -cheers, D -- David Austin Nix, PhD Huntsman Cancer Institute Dept. OncSci University of Utah Bioinformatics Shared Resource HCI 3165 (801) 587-4611 dav...@hc... |
From: David N. <Dav...@hc...> - 2012-07-23 22:38:43
|
Hello Alex, Good question! No I'm pretty sure it is not a bug. There was a change back in December in how the log2 ratio is calculated to attempt to control for variance outliers. The DESeq method is too severe in this regard so we typically don't use its variance filtered p-values anymore and have attempted a compromise using modifications to how the log2 ratio is calculated. (DESeq will throw out variance outliers no matter where they occur so if your treatment replica counts for a particular gene look like 20, 30,43 and control replicas like 1400, 1300, 8000; this will be tossed.) So how does one estimate the fold difference in expression between two genes? The simple answer is to just correct each replica for total counts, sum them together, add one to avoid dividing by zero, and calculate a log2((sumT+1)/(sumC+1)) ratio. This is a simple straightforward approach and is comparable to your FPKM and the old Log2 approach. The problem is that this exposes you to situations where one of the replicas is quite different that the others in a particular condition. It's extreme value can drive the entire summary statistic. Another approach is to calculate the median (we use a close variant called the pseudo median) of the total count corrected replica counts for each condition, then calculate a Log2((pseT+1)/(pseC+1)) ratio. This works well for datasets with 4 or more replicas in each condition. OK with 3. But what to do with 2 (or 3)? This is tricky. The pseudoMedian method, with few replicas, is like the old sum method, it can be unduly influenced by a variance outlier. So yet another approach is to correct for total counts and calculate all single replica paired log2 ratios and report the smallest log2 ratio. Conservative but not as severe as the DESeq variance filtered p-value approach. You can use the -p flag in ODRSS to force the pseudoMedian approach, see the help menu for ODRSS. So does that solve the problem. No! Yet another option is to make use of DESeq's variance adjusted count values in place of the total count adjusted counts. This, in theory, should be a better estimation of the expression difference between two genes and is the approach taken in the latest USeq app called DefinedRegionDifferentialSeq. (This is a merge of ODRSS and MultipleConditionRNASeq, ~3x faster than ODRSS, with a more sensitive alternative splice estimation. ODRSS and MultipleConditionRNASeq are going to be depreciated.) Remember, you've got the FPKM measurements for all the genes so you can always calculate the straight forward log2((tFPKM+1)/(cFPKM+1)) in Excel, just watch out for variance outliers. Does that help? -cheers, David From: Alexander Williams <ale...@gl...<mailto:ale...@gl...>> Date: Mon, 23 Jul 2012 14:15:19 -0700 To: David Nix <dav...@hc...<mailto:dav...@hc...>> Cc: Alisha Holloway <ali...@gl...<mailto:ali...@gl...>>, Benoit Bruneau <bbr...@gl...<mailto:bbr...@gl...>>, Paul Delgado <pde...@gl...<mailto:pde...@gl...>> Subject: Possible bug in "log2ratio" column in Useq 8.3 -- it is very different from 8.0, and I can't make sense of the values Hi David, 1) In Useq 8.0.x, a column named "Log2((sumT+1)/(sumC+1))" was always very similar to log2(tFPKM / cFPKM), and seemed intuitively obvious. 2) In Use 8.3.x, this column is now "Log2Ratio," which is totally different and doesn't appear to correspond in an obvious fashion to the tFPKM / cFPKM (although it is still correlated with them). 3) The Useq docs that I've been able to find still refer to the old name, and it just says "Log2((sumT+1)/(sumC+1)) - normalized log2 ratio." (http://useq.sourceforge.net/outputFileTypeDescriptions.html) 4) This is the result of a run of OverdispersedRegionScanSeqs. 5) Do you know if this is expected behavior, or if this seems like a bug? I've attached 3 images showing the old USeq out, the new USeq out, and the correlation of the new output to my calculated-in-excel output (=LOG(tFPKM/cFPKM, 2) ). Alex [cid:CEE5C1F4-FBDA-48A5-9BEC-24A43C219536@gs.local] [cid:0F339874-74BC-41FF-8686-CEE9664D8CBA@gs.local] [cid:9CBF128F-2598-4795-A035-83A8A2F18528@gs.local] |
From: Jesse R. <jes...@u2...> - 2012-07-20 14:19:11
|
There have been several previous threads on how to filter results based on fold changes/FDRs etc. where I think that some of this could be related to misinterpretation of the FDR. Although I do not completely understand the use of the FDR (or is it storey's pFDR??), but I came across an article that points to some helps/hazards for interpreting and using this value: http://bioinformatics.oxfordjournals.org/content/24/10/1225.full#T2 any comments? Jesse Rowley Post-Doctoral Research Fellow Andrew Weyrich Lab Eccles Institute of Human Genetics University of Utah |
From: David N. <Dav...@hc...> - 2012-07-11 19:01:27
|
Good catch Andrew, yes genes not meeting the minimum # reads, and thus not scored, had an inappropriate number of blanks inserted into the output. I fixed this and posted it to SourceForge as USeq_8.3.4. -cheers, D On 7/6/12 10:53 AM, "Oler, Andrew (NIH/NIAID) [C]" <and...@ni...> wrote: >Hi David, > >I'm using OverdispersedRegionScanSeqs in USeq 8.2.3 and it looks like the >tFPKM and cFPKM columns are missing for genes that have low counts in the >all.xls file. Because the columns are missing, it looks like the FPKM is >actually pretty high in some cases (e.g., tFPKM is 3 in row 15383 in the >attached image, but my guess is this should actually be the counts for >#T1, not the FPKM). This retains a lot of genes that should be excluded >when I filter by FPKM columns. > >My command line: > > > >Thanks, > >Andrew > >Andrew Oler, Ph.D. >High-Throughput Sequencing Bioinformatics Specialist >Contractor - Medical Science & Computing, Inc. >Computational Biology Section >Bioinformatics and Computational Biosciences Branch (BCBB) >OCICB/OSMO/OD/NIAID/NIH > >31 Center Drive, Room 3B62 >Bethesda, MD 20892 >Mobile: 240-507-3791 >Office: 301-402-5685 >http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> >(Within NIH) >http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public) > >Disclaimer: The information in this e-mail and any of its attachments is >confidential and may contain sensitive information. It should not be used >by anyone who is not the original intended recipient. If you have >received this e-mail in error please inform the sender and delete it from >your mailbox or any other storage devices. National Institute of Allergy >and Infectious Diseases shall not accept liability for any statements >made that are sender's own and not expressly made on behalf of the NIAID >by one of its representatives. > |
From: Oler, A. (NIH/N. [C] <and...@ni...> - 2012-07-06 16:53:38
|
Hi David, I'm using OverdispersedRegionScanSeqs in USeq 8.2.3 and it looks like the tFPKM and cFPKM columns are missing for genes that have low counts in the all.xls file. Because the columns are missing, it looks like the FPKM is actually pretty high in some cases (e.g., tFPKM is 3 in row 15383 in the attached image, but my guess is this should actually be the counts for #T1, not the FPKM). This retains a lot of genes that should be excluded when I filter by FPKM columns. My command line: Thanks, Andrew Andrew Oler, Ph.D. High-Throughput Sequencing Bioinformatics Specialist Contractor - Medical Science & Computing, Inc. Computational Biology Section Bioinformatics and Computational Biosciences Branch (BCBB) OCICB/OSMO/OD/NIAID/NIH 31 Center Drive, Room 3B62 Bethesda, MD 20892 Mobile: 240-507-3791 Office: 301-402-5685 http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> (Within NIH) http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public) Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives. |