useq-users Mailing List for USeq (Page 2)

Brought to you by: biotelerock, bmilash, elainegee, scourdy, tmosbruger

useq-users — General forum for useq users

You can subscribe to this list here.

2010	Jan	Feb	Mar	Apr (1)	May (4)	Jun (2)	Jul (3)	Aug (3)	Sep (5)	Oct (2)	Nov (4)	Dec
2011	Jan (12)	Feb	Mar (5)	Apr (6)	May	Jun	Jul (1)	Aug	Sep	Oct (15)	Nov (3)	Dec
2012	Jan	Feb (7)	Mar (3)	Apr (17)	May (5)	Jun	Jul (5)	Aug (1)	Sep (2)	Oct (3)	Nov (2)	Dec (1)
2013	Jan	Feb	Mar	Apr	May (4)	Jun	Jul (1)	Aug	Sep (2)	Oct (2)	Nov (2)	Dec (2)
2014	Jan	Feb (2)	Mar (9)	Apr (2)	May	Jun (2)	Jul (1)	Aug (1)	Sep	Oct	Nov (1)	Dec (1)
2015	Jan	Feb	Mar (4)	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2017	Jan	Feb	Mar	Apr	May	Jun (1)	Jul	Aug	Sep	Oct	Nov	Dec

Flat | Threaded

<< < 1 2 3 4 .. 6 > >> (Page 2 of 6)

[Useq-users] artefactual peaks

From: Noboru Jo S. <ns...@uc...> - 2013-12-17 00:25:56

Attachments: fig.png nsakabe.vcf

     Hi David, analyzing a sample from my lab that seems to be a failed 
IP, I found a number of regions in the genome that have many reads, but 
are not real peaks. They are basically the same across samples, 
including inputs and failed IPs. One example region is mm9 
chrX:166,393,669-166,477,668 (figure attached).
     This is not a problem exclusive to USeq, but I was wondering if 
there's something that can be done to improve peak callers, since inputs 
are also "enriched" in these regions. I know that people have generated 
files containing regions that should be masked, but since input samples 
are also enriched, I wonder why they still come up as peaks.
     Thank you.

Re: [Useq-users] Important USeq_8.7.0 update for bisulfite data analysis

From: David N. <Dav...@hc...> - 2013-11-20 18:54:25

Good point Sue.

See the novoalign docs too.  The -u setting varies by species.  Note, it worthwhile piloting what setting are appropriate, e.g. -b2 or -b4.  The additional compute time is considerable with -b4 and isn't needed for standard Illumina libraries run on the HiSeq or MiSeq.  It's these amplicon/ pcr derived libraries that need special processing.

-cheers, D

On Nov 20, 2013, at 11:33 AM, Sue Hammoud <Sue...@hc...<mailto:Sue...@hc...>>
 wrote:

Hi Everyone,

Also when aligning miseq data make sure to you  use  –u12 in addition to the -b4.

Best

Sue

From: David Nix <Dav...@hc...<mailto:Dav...@hc...>>
Date: Wednesday, November 20, 2013 11:30 AM
To: Sue Hammoud <sue...@hc...<mailto:sue...@hc...>>, Magdalena Potok <Mag...@hc...<mailto:Mag...@hc...>>, Brad Cairns <Bra...@hc...<mailto:Bra...@hc...>>, Bushra Gorsi <bg...@ge...<mailto:bg...@ge...>>, Somaye Dehghanizadeh <Som...@hc...<mailto:Som...@hc...>>
Cc: "bio...@ut...<mailto:bio...@ut...>" <bio...@ut...<mailto:bio...@ut...>>, USeq <use...@li...<mailto:use...@li...>>
Subject: Important USeq_8.7.0 update for bisulfite data analysis

Hello Folks,

I've posted a new useq release that contains a critical patch for bisulfite sequencing analysis derived from amplicon based library preps.  These need to be aligned with the -b 4 setting to maximize the number of aligned reads and then processed through the new NovoalignBisufiteParser 8.7.0 . See http://useq.sourceforge.net/usageBisSeq.html  The old NBP won't correctly parse these new alignment types.  I have also corrected an issue with the merging of paired overlapping alignments.  Some of these were being skipped and double counting of overlapping paired reads occurred.  Note the later fix changes the counts but not the percents of methylation.  See below for an illustration on a Hg19 sperm dataset.

So moving forward, pay very close attention to amplicon -b4 aligned and parsed datasets.  Visually inspect the base fraction methylation and converted and non converted point datasets along side the duplicate filtered bam alignment tracks in IGB.  Note any discrepancies.  This is a novel complex datatype that warrants close scrutiny.

-cheers, David



#### New NovoalignBisulfiteParser Stats
Filtering statistics for 174477814 alignments:
5621028 Failed mapping quality score (13.0)
2192225 Failed alignment score (300.0)
1331036 Aligned to phiX or adapters
0       Failed vendor QC
55078179        Are unmapped

110255346       Passed filters (63.2%)
88083944        Total non-converted Cs sequenced
1504140872      Total converted Cs sequenced
0.055   Fraction non converted C's.

0.998   Fraction bp passing quality (13)

2164389480      BPs overlapping paired sequence
10644938422     BPs paired sequence
0.203   Fraction overlapping bps from paired reads.

#### Old  NovoalignBisulfiteParser Stats
Filtering statistics for 174477814 alignments:
5621028 Failed mapping quality score (13.0)
2192225 Failed alignment score (300.0)
1331036 Aligned to phiX or adapters
0       Failed vendor QC
55078179        Are unmapped

110255346       Passed filters (63.2%)
94996528        Total non-converted Cs sequenced
1640576210      Total converted Cs sequenced
0.055   Fraction non converted C's.

0.998   Fraction bp passing quality (13)

2164389480      BPs overlapping paired sequence
10644938422     BPs paired sequence
0.203   Fraction overlapping bps from paired reads.

#### New NBP run through BisStat
Using Lambda data to set the expected fraction non-converted Cs to 0.00123 (16257/(16257+13221619))
Stats based on aligned genomic contexts that meet a minimum FDR threshold of 20.0. WARNING: datasets must be subsampled to the same bp aligned for these stats to be cross dataset comparable.
        0.978   (12631645/12919931)     mCG/mC
        0.007   (89908/12919931)        mCHG/mC
        0.015   (198378/12919931)       mCHH/mC

Stats based on cumulative sums of all read sequences, no thresholds:
        0.968   (85257586/88067661)     mCG/mC
        0.010   (875639/88067661)       mCHG/mC
        0.022   (1934436/88067661)      mCHH/mC

        0.059   (88067661/1490918281)   mC/C
        3.809   (85257586/22383883)     mCG/CG
        0.002   (875639/459687208)      mCHG/CHG
        0.002   (1934436/1008847190)    mCHH/CHH

        0.056   (88067661/1578985942)   mC/(C+mC)
        0.792   (85257586/107641469)    mCG/(CG+mCG)
        0.002   (875639/460562847)      mCHG/(CHG+mCHG)
        0.002   (1934436/1010781626)    mCHH/(CHH+mCHH)

#### Old NBP run through BisStat
Using Lambda data to set the expected fraction non-converted Cs to 0.00120 (15986/(15986+13272652))
Stats based on aligned genomic contexts that meet a minimum FDR threshold of 20.0. WARNING: datasets must be subsampled to the same bp aligned for these stats to be cross dataset comparable.
        0.966   (14109977/14599772)     mCG/mC
        0.010   (147396/14599772)       mCHG/mC
        0.023   (342399/14599772)       mCHH/mC

Stats based on cumulative sums of all read sequences, no thresholds:
        0.969   (92009703/94980515)     mCG/mC
        0.010   (935216/94980515)       mCHG/mC
        0.021   (2035596/94980515)      mCHH/mC

        0.058   (94980515/1627302558)   mC/C
        3.842   (92009703/23946641)     mCG/CG
        0.002   (935216/497817018)      mCHG/CHG
        0.002   (2035596/1105538899)    mCHH/CHH

        0.055   (94980515/1722283073)   mC/(C+mC)
        0.793   (92009703/115956344)    mCG/(CG+mCG)
        0.002   (935216/498752234)      mCHG/(CHG+mCHG)
        0.002   (2035596/1107574495)    mCHH/(CHH+mCHH)

[Useq-users] Important USeq_8.7.0 update for bisulfite data analysis

From: David N. <Dav...@hc...> - 2013-11-20 18:33:12

Hello Folks,

I've posted a new useq release that contains a critical patch for bisulfite sequencing analysis derived from amplicon based library preps.  These need to be aligned with the -b 4 setting to maximize the number of aligned reads and then processed through the new NovoalignBisufiteParser 8.7.0 . See http://useq.sourceforge.net/usageBisSeq.html  The old NBP won't correctly parse these new alignment types.  I have also corrected an issue with the merging of paired overlapping alignments.  Some of these were being skipped and double counting of overlapping paired reads occurred.  Note the later fix changes the counts but not the percents of methylation.  See below for an illustration on a Hg19 sperm dataset.

So moving forward, pay very close attention to amplicon -b4 aligned and parsed datasets.  Visually inspect the base fraction methylation and converted and non converted point datasets along side the duplicate filtered bam alignment tracks in IGB.  Note any discrepancies.  This is a novel complex datatype that warrants close scrutiny.

-cheers, David



#### New NovoalignBisulfiteParser Stats
Filtering statistics for 174477814 alignments:
5621028 Failed mapping quality score (13.0)
2192225 Failed alignment score (300.0)
1331036 Aligned to phiX or adapters
0       Failed vendor QC
55078179        Are unmapped

110255346       Passed filters (63.2%)
88083944        Total non-converted Cs sequenced
1504140872      Total converted Cs sequenced
0.055   Fraction non converted C's.

0.998   Fraction bp passing quality (13)

2164389480      BPs overlapping paired sequence
10644938422     BPs paired sequence
0.203   Fraction overlapping bps from paired reads.

#### Old  NovoalignBisulfiteParser Stats
Filtering statistics for 174477814 alignments:
5621028 Failed mapping quality score (13.0)
2192225 Failed alignment score (300.0)
1331036 Aligned to phiX or adapters
0       Failed vendor QC
55078179        Are unmapped

110255346       Passed filters (63.2%)
94996528        Total non-converted Cs sequenced
1640576210      Total converted Cs sequenced
0.055   Fraction non converted C's.

0.998   Fraction bp passing quality (13)

2164389480      BPs overlapping paired sequence
10644938422     BPs paired sequence
0.203   Fraction overlapping bps from paired reads.

#### New NBP run through BisStat
Using Lambda data to set the expected fraction non-converted Cs to 0.00123 (16257/(16257+13221619))
Stats based on aligned genomic contexts that meet a minimum FDR threshold of 20.0. WARNING: datasets must be subsampled to the same bp aligned for these stats to be cross dataset comparable.
        0.978   (12631645/12919931)     mCG/mC
        0.007   (89908/12919931)        mCHG/mC
        0.015   (198378/12919931)       mCHH/mC

Stats based on cumulative sums of all read sequences, no thresholds:
        0.968   (85257586/88067661)     mCG/mC
        0.010   (875639/88067661)       mCHG/mC
        0.022   (1934436/88067661)      mCHH/mC

        0.059   (88067661/1490918281)   mC/C
        3.809   (85257586/22383883)     mCG/CG
        0.002   (875639/459687208)      mCHG/CHG
        0.002   (1934436/1008847190)    mCHH/CHH

        0.056   (88067661/1578985942)   mC/(C+mC)
        0.792   (85257586/107641469)    mCG/(CG+mCG)
        0.002   (875639/460562847)      mCHG/(CHG+mCHG)
        0.002   (1934436/1010781626)    mCHH/(CHH+mCHH)

#### Old NBP run through BisStat
Using Lambda data to set the expected fraction non-converted Cs to 0.00120 (15986/(15986+13272652))
Stats based on aligned genomic contexts that meet a minimum FDR threshold of 20.0. WARNING: datasets must be subsampled to the same bp aligned for these stats to be cross dataset comparable.
        0.966   (14109977/14599772)     mCG/mC
        0.010   (147396/14599772)       mCHG/mC
        0.023   (342399/14599772)       mCHH/mC

Stats based on cumulative sums of all read sequences, no thresholds:
        0.969   (92009703/94980515)     mCG/mC
        0.010   (935216/94980515)       mCHG/mC
        0.021   (2035596/94980515)      mCHH/mC

        0.058   (94980515/1627302558)   mC/C
        3.842   (92009703/23946641)     mCG/CG
        0.002   (935216/497817018)      mCHG/CHG
        0.002   (2035596/1105538899)    mCHH/CHH

        0.055   (94980515/1722283073)   mC/(C+mC)
        0.793   (92009703/115956344)    mCG/(CG+mCG)
        0.002   (935216/498752234)      mCHG/(CHG+mCHG)
        0.002   (2035596/1107574495)    mCHH/(CHH+mCHH)

[Useq-users] Novoalign and corrupted gzip fastq files

From: David N. <Dav...@hc...> - 2013-10-09 19:29:27

OK Folks,

Just a heads up.

For all novoalign runs,  run a gunzip -t on each xxx.txt.gz file before aligning.  Novoalign silently completes its run but truncates the output when aligning a broken file. This is a very nasty bug and has/ is causing us no end of problems.  We also strongly recommend using rsync or FDT when transferring all of your data files.  

-cheers, Dave

-- 
David Austin Nix, PhD
Huntsman Cancer Institute
Dept. OncSci University of Utah
Bioinformatics Shared Resource
HCI 3165 (801) 587-4611
dav...@hc...

[Useq-users] Fwd: *New Course* Java 3! At the U of Utah

From: David N. <Dav...@hc...> - 2013-10-01 14:31:21

Attachments: image001.jpg image002.jpg image003.jpg image004.jpg ~WRD000.jpg

Sounds good!

Begin forwarded message:

From: Don Delker <don...@hs...<mailto:don...@hs...>>
Subject: FW: *New Course* Java 3!
Date: October 1, 2013 8:14:43 AM MDT
To: David Nix <Dav...@hc...<mailto:Dav...@hc...>>

David,

Could you forward this new Java 3 class information to the Useq user group?

Thanks,
Don

From: Technology Education at The University of Utah [mailto:mar...@ao...<http://aoce.utah.edu>]
Sent: Friday, September 27, 2013 12:33 PM
To: Don Delker
Subject: *New Course* Java 3!

Trouble viewing this email? Read it online<http://secure.campaigner.com/Campaigner/Public/t.show?5xq3w--33e7s-r6tle49&_v=2>
[Image removed by sender.]<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2p-r6tle46&_v=2>

Java 3 Course Now Available!
You have taken Java 1 & 2. Now take your knowledge of Java to the mastery level with this brand new class at Tech Ed.

________________________________

Java Programming Language Part 3
This class provides hands-on Java programming including continued Java GUI programming, introduction to applets, recursion and using Java with databases. Java Part 3 will focus on expanding the skills learned in Part 1 and Part 2 toward building real world applications. Prerequisite: Completion of Java Part 1 and Part 2, or instructor approval.

MW 10/07/13 - 10/16/13 5:30 pm - 9:00 pm
U of U Annex, $799

Click here for more info and to register<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2q-r6tle47&_v=2>

Browse Technology Classes Online!

Choose from dozens of cutting-edge design, network, and productivity software classes and certificates. Sign up before they fill up!

Unsure Which Class You Need?

Call us at 801-581-6061, and we'll help you discover the perfect fit for your career and organizational goals!

Download Our Catalog:

Choose from dozens of classes in design, productivity, programming, and much more!

Click on the image below to download our latest catalogs:

[Image removed by sender.]<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2s-r6tle49&_v=2>

  [Image removed by sender.] <http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2t-r6tle40&_v=2>

or browse our classes - and register - on our website: edtech.utah.edu<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2u-r6tle41&_v=2>

CONTACT US

Technology Education
at the University of Utah
801-581-6061
edtech.utah.edu/contact<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2v-r6tle42&_v=2>

Continuing Education at The University of Utah
1901 E South Campus Dr. #1215, SLC, UT 84112

[Image removed by sender. Facebook Page]<http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2w-r6tle43&_v=2>  [Image removed by sender. Twitter Page] <http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2x-r6tle44&_v=2>   [Image removed by sender. YouTube Page] <http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2y-r6tle45&_v=2>   [Image removed by sender. Flickr Page] <http://trk.cp20.com/Tracking/t.c?5xq3w-a4r2z-r6tle46&_v=2>

The University of Utah
1901 E South Campus dr, RM 1215
Salt Lake City Utah 84112
United States

You are subscribed to this mailing list as don...@hs...<mailto:don...@hs...>. Please click here<http://trk.cp20.com/Tracking/t.fo?5xq3w--bxvv-r6tle40&sl=3b&t=5&_v=2> to modify your message preferences or to unsubscribe from any future mailings. We will respect all unsubscribe requests.

________________________________
[Image removed by sender.]<http://www.campaigner.com/campaignerPro.php?utm_source=campaigner&utm_medium=email&utm_campaign=deliveryfooter>[Image removed by sender.]

[Useq-users] Fwd: Please forward: Wed 9/25 Integrated Genome Browser tutorial

From: David N. <Dav...@hc...> - 2013-09-24 13:13:32

These guys are putting on a series of tutorials, this one being a good intro to the browser.  -cheers, David

Begin forwarded message:

From: "Loraine, Ann" <Ann...@un...<mailto:Ann...@un...>>
Subject: Please forward: Wed 9/25 Integrated Genome Browser tutorial
Date: September 24, 2013 6:39:28 AM MDT
Cc: "Loraine, Ann" <Ann...@un...<mailto:Ann...@un...>>

Hello,

My apologies if you have received multiple copies of this notice.

I'm writing to let you know about a free, on-line tutorial my lab is hosting tomorrow (9/25) for Integrated Genome Browser. IGB is open source, free software developed in my lab that helps researchers visualize and analyze data from high-throughput sequencing experiments, including (but not limited to) RNA-Seq, ChIP-Seq, and epigenetics data sets.

Would you forward this announcement to interested students and/or colleagues?

Many thanks,

Ann Loraine
Associate Professor
Dept. of Bioinformatics and Genomics
UNC Charlotte

----

Please join us for a live, on-line tutorial on Integrated Genome Browser:

"Focus on a Feature: Working with Sequence Data."
Wed 1-2 pm Eastern (US) Sept 25

Integrated Genome Browser (IGB) is a highly interactive, fast and flexible desktop genome browser mainly used to view data from RNA-Seq, ChIP-Seq, and other genome-scale data sets. IGB also supports viewing and interacting with sequence data from reference genomes, gene models, or alignments. For example, in IGB, you can select and copy sequence directly from the genomic sequence track or view six-frame translations in a text-based Sequence Viewer display.

In this tutorial, you'll learn how to

* select and copy reference genomic sequence
* view, copy, or save sequence data
* translate sequences in all 6 frames
* view, copy or save gene model sequences with or without introns
* configure how sequence bases are shown
* open genomic sequence as a new track

Example data sets will likely include cold stress or pollen RNA-Seq data sets from Arabidopsis thaliana.

To ensure that you'll be able to do the exercises along with the presenter, please test-run IGB before the tutorial. To download and start IGB, go to bioviz.org<http://bioviz.org> and followed the "Download" links. Once IGB starts, choose Help / Tutorials for a quick introduction to IGB.

If you have questions or need assistance, let us know. Contact:

Alyssa Gulledge agu...@un...<mailto:agu...@un...>
Ann Loraine alo...@un...<mailto:alo...@un...>

To attend the tutorial:
1. Go to https://global.gotomeeting.com/meeting/join/238242685
2. Use your microphone and speakers (VoIP) - a headset is recommended.

Or, call in using your telephone.
United States:  +1 (213) 493-0606
Access Code:  238-242-685
Audio PIN: Shown after joining the meeting
Meeting ID:238-242-685

Attendance is limited to the first 25 participants who sign in. This is the second of several monthly "Focus on a Feature" tutorials to be offered the fourth Wednesday of every month at 1 pm Eastern time US. For recordings visit the IGB YouTube channel http://www.youtube.com/channel/UC0DA2d3YdbQ55ljkRKHRBkg

Re: [Useq-users] Java Programming Class starts Monday

From: David N. <Dav...@hc...> - 2013-09-04 17:33:34

OK!

On Sep 4, 2013, at 11:24 AM, Don Delker <don...@hs...<mailto:don...@hs...>>
 wrote:

Hi David,

The next Java programming class starts next Monday, September 9th. If you know of anyone that might be interested please forward this along. We need one more person for the class.

Thanks,
Don

<Java Syllabus2.pdf>

[Useq-users] samtools mpileup error for data aligned with novoAlign and Useq

From: nimrod r. <ru...@po...> - 2013-07-10 18:03:21

Hi,
I have RNA-seq data that I've aligned using novoAlign to mm10 knownGene
indexed with Useq, and converted transcriptomic coordinates to genomic
coordinates with Useq as well. Trying to call variants with samtools
mpileup then crashes for some of the libraries I have, with this error
message:
[bcf_sync] incorrect number of fields (6 != 5) at 15:62640916.

I was wondering if this may have anything to do with Useq and if so how can
that be fixed.

Thanks,
rubi

[Useq-users] Otter Creek at Maia's school on June 8th

From: David N. <dav...@gm...> - 2013-05-31 02:14:12

Attachments: PastedGraphic-1.tiff

Hey Guys,

I think I'm in trouble.  My kind wife volunteered me to promote an upcoming fundraiser at our daughter's school (Madeline Choir).  All proceeds to the school. ("We" also apparently volunteered to pay for the show. My palms are sweating.... What if no one shows up?) 

It's June 8th and features Otter Creek (http://www.ottercreekduo.com ), a phenomenal acoustic duo from SLC. (Peter is my mandolin teacher.) I cannot say enough good things about these guys.  Classically trained in stringed instruments and voice. Technically and lyrically brilliant. Like all artists pushing the edge, their music is hard to describe, a heady mix of folk, bluegrass, newgrass, traditional, protest, mountain music....  They are really good and not to be missed. 

So I hope you can make it.   

-cheers, Dave

http://bioserver.hci.utah.edu/Misc/BCB/flyer.pdf

Re: [Useq-users] VCFComparator

From: Barry M. <bar...@ge...> - 2013-05-18 17:06:18

Nice David - thanks.

B

On May 17, 2013, at 12:47 PM, David Nix wrote:

> OK Folks,
> 
> Here's a new USeq app for comparing VCF files to a key of trusted calls. To use it:
> 
> Download the latest NIST key set (a vcf file of het and homo non ref calls and a bed file of interrogated regions) from Justin Zook <jus...@ni...<mailto:jus...@ni...>> or https://bioserver.hci.utah.edu/gnomex/gnomexFlex.jsp?analysisNumber=A1489 (this isn't necessarily the most recent).
> 
> Download the USeq package from SourceForge http://sourceforge.net/projects/useq/ .
> 
> Run the VCFComparator.
> 
> [u0028003@hci-alta NIST]$ cd /Repository/AnalysisData/2013/A1489/NIST/5thCallSet_2.14/
> [u0028003@hci-alta 5thCallSet_2.14]$ ls
> ExampleCalls  Key  VCFCompTest
> [u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator
> 
> **************************************************************************************
> **                             VCF Comparator : April 2013                          **
> **************************************************************************************
> Compares test vcf file(s) against a gold standard key of trusted vcf calls. Only calls
> that fall in the common interrogated regions are compared. WARNING tabix gzipped files
> often fail to parse correctly with java. Seeing odd error messages? Try uncompressing.
> 
> Required Options:
> -a VCF file for the key dataset (xxx.vcf(.gz/.zip OK)).
> -b Bed file of interrogated regions for the key dataset (xxx.bed(.gz/.zip OK)).
> -c VCF file for the test dataset (xxx.vcf(.gz/.zip OK)). May also provide a directory
>       containing xxx.vcf(.gz/.zip OK) files to compare.
> -d Bed file of interrogated regions for the test dataset (xxx.bed(.gz/.zip OK)).
> 
> Optional Options:
> -g Require the genotype to match, defaults to scoring a match when the alternate
>       allele is present.
> -v Use VQSLOD score as ranking statistic in place of the QUAL score.
> -s Only compare SNPs, defaults to all.
> -n Only compare non SNPs, defaults to all.
> -p Provide a full path directory for saving the parsed data. Defaults to not saving.
> -e Exclude test and key records whose FILTER field is not . or PASS. Defaults to
>       scoring all.
> 
> Example: java -Xmx10G -jar pathTo/USeq/Apps/VCFComparator -a /NIST/NA12878/key.vcf
>       -b /NIST/NA12878/regions.bed.gz -c /EdgeBio/Exome/testHaploCaller.vcf.zip
>       -d /EdgeBio/Exome/NimbleGenExomeV3.bed -g -v -s -e -p /CompRes/
> 
> **************************************************************************************
> 
> [u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator -a Key/nist2.14.vcf.gz -b Key/nist2.14.bed.gz -c ExampleCalls/ -d Key/nimExomeV3.bed.gz -g -s -e -p VCFCompTest
> 
> This generates a variety of files including matching and non matching vcf files, spreadsheets of stats for each test set, and a combine file listing just the FDR and TPR's for each dataset over a range of thresholds enabling the generation of TPR vs FDR pseudo ROC curves.
> 
> [cid:48F...@hc...]
> 
> Basically the method/ condition that returns the largest TPRs over a range of relevant FDRs for your experiment, say 1-5% is the "best".  How much better can be measured off the graph.  In the example above, sequencing 3 exomes per lane instead of 8 exomes increases the recovery of the key SNP variants by 5% ( at an FDR of 5%).  Is that 5% improvement worth 2.7x the sequencing cost?  Possibly.
> 
> One issue I am a little worried about is the low FDR values for unfiltered variants call files with these datasets. Setting GATK's strand_call and strand_emit confidence to 0 doesn't produce a list of unfiltered SNPs with > 7 %FDR.  This seems wrong given our experience with variant analysis of non NA12878 exomes.  Any suggestions/ debugging tips would be most appreciated.
> 
> -cheers, David
> 
> 
> 
> David Austin Nix, PhD
> Huntsman Cancer Institute
> Dept. OncSci University of Utah
> Bioinformatics Shared Resource
> HCI 3165 (801) 587-4611
> dav...@hc...<mailto:dav...@hc...>
> 
> 
> 
> <PastedGraphic-2.tiff>

Barry Moore
Research Scientist
Dept. of Human Genetics
University of Utah
Salt Lake City, UT 84112
--------------------------------------------
(801) 585-3543

[Useq-users] VCFComparator

From: David N. <Dav...@hc...> - 2013-05-17 18:49:21

Attachments: PastedGraphic-2.tiff

OK Folks,

Here's a new USeq app for comparing VCF files to a key of trusted calls. To use it:

Download the latest NIST key set (a vcf file of het and homo non ref calls and a bed file of interrogated regions) from Justin Zook <jus...@ni...<mailto:jus...@ni...>> or https://bioserver.hci.utah.edu/gnomex/gnomexFlex.jsp?analysisNumber=A1489 (this isn't necessarily the most recent).

Download the USeq package from SourceForge http://sourceforge.net/projects/useq/ .

Run the VCFComparator.

[u0028003@hci-alta NIST]$ cd /Repository/AnalysisData/2013/A1489/NIST/5thCallSet_2.14/
[u0028003@hci-alta 5thCallSet_2.14]$ ls
ExampleCalls Key VCFCompTest
[u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator

**************************************************************************************
** VCF Comparator : April 2013 **
**************************************************************************************
Compares test vcf file(s) against a gold standard key of trusted vcf calls. Only calls
that fall in the common interrogated regions are compared. WARNING tabix gzipped files
often fail to parse correctly with java. Seeing odd error messages? Try uncompressing.

Required Options:
-a VCF file for the key dataset (xxx.vcf(.gz/.zip OK)).
-b Bed file of interrogated regions for the key dataset (xxx.bed(.gz/.zip OK)).
-c VCF file for the test dataset (xxx.vcf(.gz/.zip OK)). May also provide a directory
containing xxx.vcf(.gz/.zip OK) files to compare.
-d Bed file of interrogated regions for the test dataset (xxx.bed(.gz/.zip OK)).

Optional Options:
-g Require the genotype to match, defaults to scoring a match when the alternate
allele is present.
-v Use VQSLOD score as ranking statistic in place of the QUAL score.
-s Only compare SNPs, defaults to all.
-n Only compare non SNPs, defaults to all.
-p Provide a full path directory for saving the parsed data. Defaults to not saving.
-e Exclude test and key records whose FILTER field is not . or PASS. Defaults to
scoring all.

Example: java -Xmx10G -jar pathTo/USeq/Apps/VCFComparator -a /NIST/NA12878/key.vcf
-b /NIST/NA12878/regions.bed.gz -c /EdgeBio/Exome/testHaploCaller.vcf.zip
-d /EdgeBio/Exome/NimbleGenExomeV3.bed -g -v -s -e -p /CompRes/

**************************************************************************************

[u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator -a Key/nist2.14.vcf.gz -b Key/nist2.14.bed.gz -c ExampleCalls/ -d Key/nimExomeV3.bed.gz -g -s -e -p VCFCompTest

This generates a variety of files including matching and non matching vcf files, spreadsheets of stats for each test set, and a combine file listing just the FDR and TPR's for each dataset over a range of thresholds enabling the generation of TPR vs FDR pseudo ROC curves.

[cid:48F...@hc...]

Basically the method/ condition that returns the largest TPRs over a range of relevant FDRs for your experiment, say 1-5% is the "best". How much better can be measured off the graph. In the example above, sequencing 3 exomes per lane instead of 8 exomes increases the recovery of the key SNP variants by 5% ( at an FDR of 5%). Is that 5% improvement worth 2.7x the sequencing cost? Possibly.

One issue I am a little worried about is the low FDR values for unfiltered variants call files with these datasets. Setting GATK's strand_call and strand_emit confidence to 0 doesn't produce a list of unfiltered SNPs with > 7 %FDR. This seems wrong given our experience with variant analysis of non NA12878 exomes. Any suggestions/ debugging tips would be most appreciated.

-cheers, David

David Austin Nix, PhD
Huntsman Cancer Institute
Dept. OncSci University of Utah
Bioinformatics Shared Resource
HCI 3165 (801) 587-4611
dav...@hc...<mailto:dav...@hc...>

[Useq-users] Header for hg19EnsTransRad46Num100kMin10SplicesChrPhiXAdaptr not matching hg19 used in GATK?

From: David N. <Dav...@hc...> - 2012-12-11 18:41:54

Hello Jesse,

Yes I've seen this error before.  You'll need to replace the header with
one consistent across all bam files.  There's a picard app that does this
I believe.

The issue is STP generates a custom header on the fly based on the end of
the last alignment in each chromosome.  So the chromosome length differs
for each file.

You can have STP use a particular header with the -r argument for future
parsings.

-cheers, D

On 12/11/12 11:35 AM, "Jesse Rowley" <jes...@u2...> wrote:

>Hi David, 
>I aligned my RNA-seq reads with
>@align -novoalign [-o SAM -r All 50 -a
>AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC] -g
>hg19EnsTransRad46Num100kMin10SplicesChrPhiXAdaptr -i *.txt.gz" > cmd.txt
>this was followed by samtranscriptome parser. Now I ma running GATK.
>
>I get the following error with GATK:
>##### ERROR MESSAGE: Input files reads and reference have incompatible
>contigs: Found contigs with the same name but different lengths:
>##### ERROR   contig reads = chr20 / 62940766
>##### ERROR   contig reference = chr20 / 63025520.
>##### ERROR   reads contigs = [chr20, chr21, chr22, chr5, chr6, chr7,
>chr8, chr9, chr1, chr2, chr3, chr4, chr10, chr11, chr12, chr13, chrY,
>chrX, chr15, chr14, chr17, chr16, chr19, chr18, chrM]
>##### ERROR   reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6,
>chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17,
>chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
>
>These are different as shown below - is it because these are changed by
>maketranscriptome or was this transcriptome made with a different
>reference? should I just revise my Bam header to match or would that be
>problematic? My strategy is to use GATK in the @annot pipeline to
>annotate variants (first splitting out splice junction reads prior to
>indel realignment since GATK doesn't like those, and then remerging them
>before annotation)
>Fasta.fai:
>chr1	249250621	6	50	51
>chr2	243199373	254235646	50	51
>chr3	198022430	502299013	50	51
>chr4	191154276	704281898	50	51
>chr5	180915260	899259266	50	51
>chr6	171115067	1083792838	50	51
>chr7	159138663	1258330213	50	51
>chr8	146364022	1420651656	50	51
>chr9	141213431	1569942965	50	51
>chr10	135534747	1713980672	50	51
>chr11	135006516	1852226121	50	51
>chr12	133851895	1989932775	50	51
>chr13	115169878	2126461715	50	51
>chr14	107349540	2243934998	50	51
>chr15	102531392	2353431536	50	51
>chr16	90354753	2458013563	50	51
>chr17	81195210	2550175419	50	51
>chr18	78077248	2632994541	50	51
>chr19	59128983	2712633341	50	51
>chr20	63025520	2772944911	50	51
>chr21	48129895	2837230949	50	51
>chr22	51304566	2886323449	50	51
>chrX	155270560	2938654113	50	51
>chrY	59373566	3097030091	50	51
>chrM	16571	3157591135	50	51
>
>Bam header:
>@HD	VN:1.0	SO:coordinate
>@SQ	SN:chr20	LN:62940766	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr21	LN:48129227	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr22	LN:51253224	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr5	LN:180910700	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr6	LN:170933915	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr7	LN:159102511	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr8	LN:146307037	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr9	LN:141156475	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr1	LN:249250020	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr2	LN:243198952	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr3	LN:197957767	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr4	LN:191041935	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr10	LN:135528848	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA
>daptr
>@SQ	SN:chr11	LN:134900590	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA
>daptr
>@SQ	SN:chr12	LN:133811418	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA
>daptr
>@SQ	SN:chr13	LN:115117205	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA
>daptr
>@SQ	SN:chrY	LN:59014245	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAda
>ptr
>@SQ	SN:chrX	LN:155267065	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr15	LN:102530814	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA
>daptr
>@SQ	SN:chr14	LN:107295415	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXA
>daptr
>@SQ	SN:chr17	LN:81198313	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr16	LN:90291808	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr19	LN:59126095	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chr18	LN:78015294	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAd
>aptr
>@SQ	SN:chrM	LN:26571	AS:AS:hgEnsTransRad46Num100kMin10SplicesChrPhiXAdaptr
>@RG	ID:unknownReadGroup	SM:unknownSample
>@PG	ID:SamTranscriptomeParser	CL: args -f
>9657X1_121116_SN141_0600_AD1FYUACXX_1.sam.gz
>
>thanks for your input,
>Jesse Rowley
>Division of Pulmonary Medicine
>University of Utah School of Medicine

Re: [Useq-users] USeq issue with incorrect mate position information for paired RNASeq data

From: Guorong Xu <guo...@gm...> - 2012-11-12 19:57:41

Hi David,

I found another issue on the conversion program.
When I used Novoalign to align our fastq data against reference genome, I
found the MD tag in the SAM record.
But the MD tag was missing after I converted the artificial chromosome name
to regular chromosome name.
You can also found the below four SAM records, all MD tags were missing.
Could you double check this issue?

Thanks,
Guorong

On Fri, Oct 26, 2012 at 10:11 AM, David Nix <Dav...@hc...> wrote:

> Hello Guorong,
>
> Yes, looks like an issue.  Would you mind grepping your unprocessed
> alignment file for those read names so I can see what it started out as.
>
> -cheers, D
>
> P.S. We don't use HTSeq so it's not an issue.  The USeq apps hash on the
> read name after collecting intersecting reads for each exon so paired data
> is collapsed if present.  I wonder how HTSeq is doing it?  As a short term
> workaround, use the USeq DefinedRegionDifferentialSeq with the -t option
> and it will generate a table of hit counts for each gene and each sample.
> http://useq.sourceforge.net/cmdLnMenus.html#DefinedRegionDifferentialSeq
>
>
>
> From: Guorong Xu <guo...@gm...<mailto:guo...@gm...>>
> Date: Wed, 24 Oct 2012 13:18:51 -0500
> To: David Nix <dav...@hc...<mailto:dav...@hc...>>
> Subject: USeq issue
>
> Hi David,
>
> I found an issue from Useq SamTranscriptomParser for paired-end dataset.
> Please see the below two fragments (4 reads).
>
> The mate positions of the first fragment are not correct, but the mate
> positions of the second fragment are correct.
> When I looked into the original alignment file by novoalign, I found the
> first fragment was mapped on a junction region (junction library).
> The mate position of these two reads are not changed when Useq converts
> the artificial chrom name to genomic chrom name.
> That’s why the mate positions of the first fragment are very small (457
> and 345).
>
> Now, we cannot use HTSeq to calculate the counts information due to this
> incorrect format (mate position issue).
> Do you have any suggestions on that?
>
> Thanks,
> Guorong
>
> HWI-ST1189:59:C1305ACXX:5:1101:10001:63267    99    chr21    48064279    0
>    51M    =    457    0    GGGGTGAGCGTGCGGGCTGCTGTGGGTAC
> ATTCCGGCAAACCATGTGGGGA
>  BCCFDFFFHHHHHJJJJJJJJJGIIJFHIJJJJJJJJJHHHFFFFFFEDDD    PG:Z:novoalignMPI
>  NH:i:12    HI:i:1    AM:i:70    NM:i:0    SM:i:70    GN:Z:
> 951    TN:Z:ENST00000440086    ZN:i:12    PQ:i:2    UQ:i:2    AS:i:2
>  ZS:Z:R
> HWI-ST1189:59:C1305ACXX:5:1101:10001:63267    147    chr21    48064391
>  0    10M3969N41M    =    345    0    TGGAACTCTGAAACTCCACTT
> GGAGATGTTGGCAGACCAGCCACGAACAAC
>  EHGHDJIJGGHH9GFHD:BIGIHGGGGFHGIHGEFBJJHHHHHFFFFFCCC    PG:Z:novoalignMPI
>  NH:i:12    HI:i:1    AM:i:70    NM:i:0    SM:i:
> 70    GN:Z:951    TN:Z:ENST00000440086    ZN:i:12    PQ:i:2    UQ:i:0
>  AS:i:0    XS:A:-    ZS:Z:R
>
> HWI-ST1189:59:C1305ACXX:5:1101:10001:74618    99    chr3    185135345    0
>    51M    =    185135507    0    CTGGTTTCTTGGTAGCTGCTG
> CCTTTTTTCCCACCAGAGGCTTCTTCTGCT
>  @@@FBDDEFFHHFHHIIJJIIIJIIJJIIGDHIJIGGGHIFJJGGIBG@EF    PG:Z:novoalignMPI
>    NH:i:6    HI:i:1    AM:i:70    NM:i:0    SM:i:
> 70    ZN:i:6    PQ:i:3    UQ:i:3    AS:i:3    ZS:Z:R
> HWI-ST1189:59:C1305ACXX:5:1101:10001:74618    147    chr3    185135507
>  0    51M    =    185135345    0    CCTTATCCACCCGGAGCTTGT
> GATTCCTGGCCTGGCGAAGAATGGTGTTCC
>  @EGGEGIGDIHAGJIDIIIGHABHIGGDDCGGEGGFHFFFHHHDDDFDC@@    PG:Z:novoalignMPI
>    NH:i:6    HI:i:1    AM:i:70    NM:i:0    SM:i:
> 70    ZN:i:6    PQ:i:3    UQ:i:0    AS:i:0    ZS:Z:R
>

[Useq-users] SamTranscriptomeParser error

From: Desplat N. <cle...@ya...> - 2012-11-12 07:11:36

Hello,
I analyse RNA-seq data and I use Novoalign protocol :

http://www.novocraft.com/wiki/tiki-index.php?page=RNASeq+analysis%3A+mRNA+and+the+Spliceosome

I have a problem with this step :

"We now need to convert the splice junction coordinates back to genome 
coordinates using USeq SamTranscriptomeParser (external link)

java
 -jar pathTo/USeq/USeq_8.4.4/Apps/SamTranscriptomeParser -f 
./novo/alignments.sam -a 50000 -n 100 -u -s 
novo/alignments_Converted.sam 
#output is novo/alignments_Converted.sam.gz 
#Unmapped reads are saved with -u option in the same file in SAM format"

When I run this command I get this message and the program stops.

"USeq_8.4.4 Arguments: -n 100 -a 5000 -u -f Sample_F2_RNA_Seq_P1_alignments.sam -s Sample_F2_RNA_Seq_P1_alignments_converted.sam

5000.0  Maximum alignment score
0.0     Minimum mapping quality score
100     Maximum locations each read may align
true    Reverse the strand of the second paired alignemnt
true    Save unmapped and low score reads
true    Remove control chrPhiX and chrAdapter alignments
false   Randomly choose an alignment from read blocks that fail the max locations threshold


Parsing, filtering, and merging SAM files...
       
 
Sample_F2_RNA_Seq_P1_alignments.sam.............................................................................................Didn't
 find an intersecting chunk? for
HISEQ9:202:D08DNACXX:4:1306:3933:11725  129     Chr7   
 81128509        1       16M1I77M2I5M    Chr2    161017396       0       AAGATGGCTTGGGAGCAAAAGCTCCGCCAAGCTTGTCGAGCATCCAATGC
TTGGGCACATTGAGCCTCTTCAAGTGCTTCTTCAATCCCCTAGCCTAGGAC  
 
74688888888888788887876877888888777788988788889888898888789998888888888988788888888888888899887875232  
 PG:Z:novoalign  AS:i
:147        UQ:i:147        NM:i:5  ZS:Z:R  
ZN:i:5  NH:i:5  HI:i:1  IH:i:0  GN:Z:GRMZM2G542753      
SJ:Z:Chr7:81128504-81128524_81128618-81128695_81128957-81128957_81139426-81139
523"

Could you help me to understand this error?

Thank you.

Nelly

[Useq-users] USeq issue with incorrect mate position information for paired RNASeq data

From: David N. <Dav...@hc...> - 2012-10-26 15:11:29

Hello Guorong,

Yes, looks like an issue.  Would you mind grepping your unprocessed alignment file for those read names so I can see what it started out as.

-cheers, D

P.S. We don't use HTSeq so it's not an issue.  The USeq apps hash on the read name after collecting intersecting reads for each exon so paired data is collapsed if present.  I wonder how HTSeq is doing it?  As a short term workaround, use the USeq DefinedRegionDifferentialSeq with the -t option and it will generate a table of hit counts for each gene and each sample. http://useq.sourceforge.net/cmdLnMenus.html#DefinedRegionDifferentialSeq



From: Guorong Xu <guo...@gm...<mailto:guo...@gm...>>
Date: Wed, 24 Oct 2012 13:18:51 -0500
To: David Nix <dav...@hc...<mailto:dav...@hc...>>
Subject: USeq issue

Hi David,

I found an issue from Useq SamTranscriptomParser for paired-end dataset. Please see the below two fragments (4 reads).

The mate positions of the first fragment are not correct, but the mate positions of the second fragment are correct.
When I looked into the original alignment file by novoalign, I found the first fragment was mapped on a junction region (junction library).
The mate position of these two reads are not changed when Useq converts the artificial chrom name to genomic chrom name.
That’s why the mate positions of the first fragment are very small (457 and 345).

Now, we cannot use HTSeq to calculate the counts information due to this incorrect format (mate position issue).
Do you have any suggestions on that?

Thanks,
Guorong

HWI-ST1189:59:C1305ACXX:5:1101:10001:63267    99    chr21    48064279    0    51M    =    457    0    GGGGTGAGCGTGCGGGCTGCTGTGGGTAC
ATTCCGGCAAACCATGTGGGGA    BCCFDFFFHHHHHJJJJJJJJJGIIJFHIJJJJJJJJJHHHFFFFFFEDDD    PG:Z:novoalignMPI    NH:i:12    HI:i:1    AM:i:70    NM:i:0    SM:i:70    GN:Z:
951    TN:Z:ENST00000440086    ZN:i:12    PQ:i:2    UQ:i:2    AS:i:2    ZS:Z:R
HWI-ST1189:59:C1305ACXX:5:1101:10001:63267    147    chr21    48064391    0    10M3969N41M    =    345    0    TGGAACTCTGAAACTCCACTT
GGAGATGTTGGCAGACCAGCCACGAACAAC    EHGHDJIJGGHH9GFHD:BIGIHGGGGFHGIHGEFBJJHHHHHFFFFFCCC    PG:Z:novoalignMPI    NH:i:12    HI:i:1    AM:i:70    NM:i:0    SM:i:
70    GN:Z:951    TN:Z:ENST00000440086    ZN:i:12    PQ:i:2    UQ:i:0    AS:i:0    XS:A:-    ZS:Z:R

HWI-ST1189:59:C1305ACXX:5:1101:10001:74618    99    chr3    185135345    0    51M    =    185135507    0    CTGGTTTCTTGGTAGCTGCTG
CCTTTTTTCCCACCAGAGGCTTCTTCTGCT    @@@FBDDEFFHHFHHIIJJIIIJIIJJIIGDHIJIGGGHIFJJGGIBG@EF    PG:Z:novoalignMPI    NH:i:6    HI:i:1    AM:i:70    NM:i:0    SM:i:
70    ZN:i:6    PQ:i:3    UQ:i:3    AS:i:3    ZS:Z:R
HWI-ST1189:59:C1305ACXX:5:1101:10001:74618    147    chr3    185135507    0    51M    =    185135345    0    CCTTATCCACCCGGAGCTTGT
GATTCCTGGCCTGGCGAAGAATGGTGTTCC    @EGGEGIGDIHAGJIDIIIGHABHIGGDDCGGEGGFHFFFHHHDDDFDC@@    PG:Z:novoalignMPI    NH:i:6    HI:i:1    AM:i:70    NM:i:0    SM:i:
70    ZN:i:6    PQ:i:3    UQ:i:0    AS:i:0    ZS:Z:R

Re: [Useq-users] How to speedup the process of building junction library

From: David N. <dav...@gm...> - 2012-10-21 12:35:20

Hello Michael,

I break up the transcript file by chromosome and run each separately on a
cluster.   Another speed up is to set the max number of junctions allowed to
10000 instead of the default 100000.

Did you see the RNASeq extended junction guide?
http://useq.sourceforge.net/usageRNASeq.html

We've built many transcriptomes so take a look here to see if there is
something you can use.
https://bioserver.hci.utah.edu:443/gnomex/gnomexFlex.jsp?topicNumber=4

-cheers, D

-- 
David Austin Nix, PhD
Huntsman Cancer Institute
Dept. OncSci University of Utah
Bioinformatics Shared Resource
HCI 3165 (801) 587-4611
dav...@hc...


From:  michael xu <mic...@gm...>
Date:  Fri, 19 Oct 2012 20:35:43 -0500
To:  USeq <use...@li...>
Subject:  [Useq-users] How to speedup the process of building junction
library

Hi David,

Thank you for developing a very useful tool USeq!

I am using the USeq to build a novoalign index file with junction library,
but I found it takes a very long time (> 5+ hours or longer) to build a
junction library.
Until now, I still have not finish one junction library for human 19 on a
very powerful cluster.
Do you have any suggestion on how to build a junction library quickly?

Besides, I know I can download a library from your website with 50nt, but
the length of read in my dataset is 51nt. I am not sure if I can use the
same junction library build by you even the length of read is 51nt. Suppose
that -r should set "length of read - 4".

Your response would be greatly appreciated!

Michael
----------------------------------------------------------------------------
-- Everyone hates slow websites. So do we. Make your web apps faster with
AppDynamics Download AppDynamics Lite for free today:
http://p.sf.net/sfu/appdyn_sfd2d_oct________________________________________
_______ Useq-users mailing list Use...@li...
https://lists.sourceforge.net/lists/listinfo/useq-users

[Useq-users] How to speedup the process of building junction library

From: michael xu <mic...@gm...> - 2012-10-20 01:35:49

Hi David,

Thank you for developing a very useful tool USeq!

I am using the USeq to build a novoalign index file with junction library,
but I found it takes a very long time (> 5+ hours or longer) to build a
junction library.
Until now, I still have not finish one junction library for human 19 on a
very powerful cluster.
Do you have any suggestion on how to build a junction library quickly?

Besides, I know I can download a library from your website with 50nt, but
the length of read in my dataset is 51nt. I am not sure if I can use the
same junction library build by you even the length of read is 51nt. Suppose
that -r should set "length of read - 4".

Your response would be greatly appreciated!

Michael

Re: [Useq-users] Question for USeq RNA-seq Analysis Output

From: David N. <Dav...@hc...> - 2012-09-11 13:38:23

Attachments: Screen shot 2012-09-06 at 3.15.15 PM.png

Hello Jeff,

I take it this is from running the OverdispersedRegionScanSeqs?  There's an improved app that replaces this one called DefinedRegionDifferentialSeq.  I'd recommend using this for your subsequent analysis.

Regarding your questions.  Genes with few total reads across the two condition comparison (less than the minimum, default 20) are ignored for all aspects of analysis and reporting in ODRSS (but not in DRDS).  This a gene might be ignored in one analysis but present in another due to that minimum.

-cheers, D

From: Jeff Alexander <jef...@gm...<mailto:jef...@gm...>>
Date: Thu, 6 Sep 2012 15:32:20 -0700
To: USeq <use...@li...<mailto:use...@li...>>
Subject: [Useq-users] Question for USeq RNA-seq Analysis Output

Hello,

I've performed some analysis of RNA-seq data using USeq and I am a bit confused about the output file all.xls.  Can someone tell me what it means if a gene has only gene identifiers and genomic location information but no expression data?  I have assumed that this meant that there were no reads found for this gene in any of the compared datasets.  However, I've recently realized that some of these genes should have reads that map to their coding regions.  For instance, in this experiment I have three conditions that were each compared for differential gene expression.  In some cases, the comparison between Samples 1 and 2 show measured reads in both conditions for a given gene, but comparing Samples 2 and 3 lists this gene as blank.  Any insights would be greatly appreciated.  I've attached an example image of this output, showing genes with no RNA-seq information.

Thank you!

Jeff

Jeffrey Alexander
Graduate Student
Bruneau Laboratory
1650 Owens St.
San Francisco CA, 94122
P: 415-734-2885

[cid:6E3...@uc...]
------------------------------------------------------------------------------ Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/_______________________________________________ Useq-users mailing list Use...@li...<mailto:Use...@li...> https://lists.sourceforge.net/lists/listinfo/useq-users

[Useq-users] Question for USeq RNA-seq Analysis Output

From: Jeff A. <jef...@gm...> - 2012-09-06 23:07:26

Hello,

I've performed some analysis of RNA-seq data using USeq and I am a bit confused about the output file all.xls.  Can someone tell me what it means if a gene has only gene identifiers and genomic location information but no expression data?  I have assumed that this meant that there were no reads found for this gene in any of the compared datasets.  However, I've recently realized that some of these genes should have reads that map to their coding regions.  For instance, in this experiment I have three conditions that were each compared for differential gene expression.  In some cases, the comparison between Samples 1 and 2 show measured reads in both conditions for a given gene, but comparing Samples 2 and 3 lists this gene as blank.  Any insights would be greatly appreciated.  I've attached an example image of this output, showing genes with no RNA-seq information.

Thank you!

Jeff

Jeffrey Alexander
Graduate Student
Bruneau Laboratory
1650 Owens St.
San Francisco CA, 94122
P: 415-734-2885

[Useq-users] Bug with BisSeqAggregatePlotter in USeq

From: David N. <Dav...@hc...> - 2012-08-28 22:08:09

Hello Folks,

For those of you running bisulfite sequencing analysis using the BisSeqAggregatePlotter to generate class average plots of methylation over defined regions, please re run your analysis using USeq_8.3.8. https://sourceforge.net/projects/useq/

There was an unfortunate bug that was preventing the merging of data across chromosomes so the results being generated were only from merging regions found on one of the chromosomes (iterating over a hash of chromosome names).

Your reanalysis should look similar but a lot smoother since it will include all the regions.

Many apologies for the error.

Thanks to Magda for raising a red flag about a potential problem.

-David

--
David Austin Nix, PhD
Huntsman Cancer Institute
Dept. OncSci University of Utah
Bioinformatics Shared Resource
HCI 3165 (801) 587-4611
dav...@hc...

[Useq-users] USeq_8.3.5 release and changes to RNASeq apps

From: David N. <Dav...@hc...> - 2012-07-31 15:32:51

Hello Folks,

I've depreciated the OverdispersedRegionScanSeqs and MultipleConditionRNASeq apps and replaced them with a 3X faster (no Picard lookup) , more accurate, better diff splice junction detection app called DefinedRegionDifferentialSeq.  It still wraps DESeq but calculates the log2Ratio using DESeqs variance corrected counts instead of total count scaling. It works with paired and multiple condition RNASeq data. I've also incorporated this into the RNASeq wrapper app so give these new ones a try.

You can download USeq_8.3.5 from Sourceforge.

-cheers, D

--
David Austin Nix, PhD
Huntsman Cancer Institute
Dept. OncSci University of Utah
Bioinformatics Shared Resource
HCI 3165 (801) 587-4611
dav...@hc...

[Useq-users] Possible bug in "log2ratio" column in Useq 8.3 -- it is very different from 8.0, and I can't make sense of the values

From: David N. <Dav...@hc...> - 2012-07-23 22:38:43

Attachments: Useq 8.3.5 - log2 is different - edited.png Useq 8.3.5 ratio vs my own calculated in Excel.png Useq 8.0.2 - log2 looks like what I would expect - edited.png

Hello Alex,

Good question! No I'm pretty sure it is not a bug.

There was a change back in December in how the log2 ratio is calculated to attempt to control for variance outliers.  The DESeq method is too severe in this regard so we typically don't use its variance filtered p-values anymore and have attempted a compromise using modifications to how the log2 ratio is calculated. (DESeq will throw out variance outliers no matter where they occur so if your treatment replica counts for a particular gene look like 20, 30,43 and control replicas like 1400, 1300, 8000; this will be tossed.)

So how does one estimate the fold difference in expression between two genes?  The simple answer is to just correct each replica for total counts, sum them together, add one to avoid dividing by zero, and calculate a log2((sumT+1)/(sumC+1)) ratio.  This is a simple straightforward approach and is comparable to your FPKM and the old Log2 approach. The problem is that this exposes you to situations where one of the replicas is quite different that the others in a particular condition.  It's extreme value can drive the entire summary statistic.   Another approach is to calculate the median (we use a close variant called the pseudo median) of the total count corrected replica counts for each condition,  then calculate a Log2((pseT+1)/(pseC+1)) ratio. This works well for datasets with  4  or more replicas in each condition. OK with 3. But what to do with 2 (or 3)?  This is tricky.  The pseudoMedian method, with few replicas, is like the old sum method, it can be unduly influenced by a variance outlier.  So yet another approach is to correct for total counts and calculate all single replica paired log2 ratios and report the smallest log2 ratio.  Conservative but not as severe as the DESeq variance filtered p-value approach.  You can use the -p flag in ODRSS to force the pseudoMedian approach, see the help menu for ODRSS.

So does that solve the problem. No!  Yet another option is to make use of DESeq's variance adjusted count values in place of the total count adjusted counts.  This, in theory, should be a better estimation of the expression difference between two genes and is the approach taken in the latest USeq app called DefinedRegionDifferentialSeq. (This is a merge of ODRSS and MultipleConditionRNASeq, ~3x faster than ODRSS, with a more sensitive alternative splice estimation. ODRSS and MultipleConditionRNASeq are going to be depreciated.)

Remember, you've got the FPKM measurements for all the genes so you can always calculate the straight forward  log2((tFPKM+1)/(cFPKM+1)) in Excel, just watch out for variance outliers.

Does that help?

-cheers, David



From: Alexander Williams <ale...@gl...<mailto:ale...@gl...>>
Date: Mon, 23 Jul 2012 14:15:19 -0700
To: David Nix <dav...@hc...<mailto:dav...@hc...>>
Cc: Alisha Holloway <ali...@gl...<mailto:ali...@gl...>>, Benoit Bruneau <bbr...@gl...<mailto:bbr...@gl...>>, Paul Delgado <pde...@gl...<mailto:pde...@gl...>>
Subject: Possible bug in "log2ratio" column in Useq 8.3 -- it is very different from 8.0, and I can't make sense of the values

Hi David,

1) In Useq 8.0.x, a column named "Log2((sumT+1)/(sumC+1))" was always very similar to log2(tFPKM / cFPKM), and seemed intuitively obvious.

2) In Use 8.3.x, this column is now "Log2Ratio," which is totally different and doesn't appear to correspond in an obvious fashion to the tFPKM / cFPKM (although it is still correlated with them).

3) The Useq docs that I've been able to find still refer to the old name, and it just says "Log2((sumT+1)/(sumC+1)) - normalized log2 ratio." (http://useq.sourceforge.net/outputFileTypeDescriptions.html)

4) This is the result of a run of OverdispersedRegionScanSeqs.

5) Do you know if this is expected behavior, or if this seems like a bug?

I've attached 3 images showing the old USeq out, the new USeq out, and the correlation of the new output to my calculated-in-excel output (=LOG(tFPKM/cFPKM, 2) ).

 Alex


 [cid:CEE...@gs...] [cid:0F3...@gs...] [cid:9CB...@gs...]

[Useq-users] comment on the use of FDR

From: Jesse R. <jes...@u2...> - 2012-07-20 14:19:11

There have been several previous threads on how to filter results based on fold changes/FDRs etc. where I think that some of this could be related to misinterpretation of the FDR.
Although I do not completely understand the use of the FDR (or is it storey's pFDR??), but I came across an article that points to some helps/hazards for interpreting and using this value:

http://bioinformatics.oxfordjournals.org/content/24/10/1225.full#T2

any comments?

Jesse Rowley
Post-Doctoral Research Fellow
Andrew Weyrich Lab
Eccles Institute of Human Genetics
University of Utah

Re: [Useq-users] ORSS missing columns

From: David N. <Dav...@hc...> - 2012-07-11 19:01:27

Good catch Andrew,  yes genes not meeting the minimum # reads, and thus
not scored, had an inappropriate number of blanks inserted into the output.

I fixed this and posted it to SourceForge as USeq_8.3.4.

-cheers, D

On 7/6/12 10:53 AM, "Oler, Andrew (NIH/NIAID) [C]" <and...@ni...>
wrote:

>Hi David,
>
>I'm using OverdispersedRegionScanSeqs in USeq 8.2.3 and it looks like the
>tFPKM and cFPKM columns are missing for genes that have low counts in the
>all.xls file.  Because the columns are missing, it looks like the FPKM is
>actually pretty high in some cases (e.g., tFPKM is 3 in row 15383 in the
>attached image, but my guess is this should actually be the counts for
>#T1, not the FPKM).  This retains a lot of genes that should be excluded
>when I filter by FPKM columns.
>
>My command line:
>
>
>
>Thanks,
>
>Andrew
>
>Andrew Oler, Ph.D.
>High-Throughput Sequencing Bioinformatics Specialist
>Contractor - Medical Science & Computing, Inc.
>Computational Biology Section
>Bioinformatics and Computational Biosciences Branch (BCBB)
>OCICB/OSMO/OD/NIAID/NIH
>
>31 Center Drive, Room 3B62
>Bethesda, MD 20892
>Mobile: 240-507-3791
>Office: 301-402-5685
>http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/>
>(Within NIH)
>http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public)
>
>Disclaimer: The information in this e-mail and any of its attachments is
>confidential and may contain sensitive information. It should not be used
>by anyone who is not the original intended recipient. If you have
>received this e-mail in error please inform the sender and delete it from
>your mailbox or any other storage devices. National Institute of Allergy
>and Infectious Diseases shall not accept liability for any statements
>made that are sender's own and not expressly made on behalf of the NIAID
>by one of its representatives.
>

[Useq-users] ORSS missing columns

From: Oler, A. (NIH/N. [C] <and...@ni...> - 2012-07-06 16:53:38

Attachments: Screen shot 2012-07-06 at 12.47.47 PM.png

Hi David,

I'm using OverdispersedRegionScanSeqs in USeq 8.2.3 and it looks like the tFPKM and cFPKM columns are missing for genes that have low counts in the all.xls file. Because the columns are missing, it looks like the FPKM is actually pretty high in some cases (e.g., tFPKM is 3 in row 15383 in the attached image, but my guess is this should actually be the counts for #T1, not the FPKM). This retains a lot of genes that should be excluded when I filter by FPKM columns.

My command line:

Thanks,

Andrew

Andrew Oler, Ph.D.
High-Throughput Sequencing Bioinformatics Specialist
Contractor - Medical Science & Computing, Inc.
Computational Biology Section
Bioinformatics and Computational Biosciences Branch (BCBB)
OCICB/OSMO/OD/NIAID/NIH

31 Center Drive, Room 3B62
Bethesda, MD 20892
Mobile: 240-507-3791
Office: 301-402-5685
http://bioinformatics.niaid.nih.gov<http://bioinformatics.niaid.nih.gov/> (Within NIH)
http://exon.niaid.nih.gov<http://exon.niaid.nih.gov/> (Public)

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.

2 messages has been excluded from this view by a project administrator.

Flat | Threaded

<< < 1 2 3 4 .. 6 > >> (Page 2 of 6)