**********************************************************
*************RNAseqR R Statistic Calculator***************
**********************************************************
Biology and math by Thomas Walk and Scott Geib
Coded by Theodore DeRego, Thomas Walk, and Scott Geib 
2011

contact
tom.walk@ars.usda.gov, scott.geib@ars.usda.gov


INTRODUCTION

RNAseqR reads and analyzes a table of transcript counts. Counts can be converted to RPKM values, natural log values, or both (first RPKM, then log transformation). You can also specify a minimum count cutoff; transcripts whose counts summed across libraries fall below it are removed before any calculations are applied.  From there, the R test statistic (Stekel et al., 2000, Genome Research 10:2055) is calculated for the actual data, as well as for data randomized by shuffling within libraries or by generating Poisson or negative binomially distributed random numbers for each transcript. Output files allow comparison of actual R values to random R values, or to R values expected given the count mean, for each transformation and randomization applied.
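
For readers who want to see the statistic itself, here is a minimal Python sketch of the R value for a single transcript, written from the formula published in Stekel et al. (2000): R = sum over libraries of x_i * ln(x_i / (N_i * f)), where f is the transcript's pooled frequency. The function name and argument layout are made up for illustration, and this is not RNAseqR's own code, which may differ in detail.

```python
import math

def r_statistic(counts, library_totals):
    """R statistic for one transcript (sketch of the Stekel et al. 2000 formula).

    counts[i]         -- this transcript's count in library i
    library_totals[i] -- total count of library i
    """
    # Pooled frequency of the transcript across all libraries.
    f = sum(counts) / sum(library_totals)
    r = 0.0
    for x, n in zip(counts, library_totals):
        if x > 0:  # a zero count contributes nothing to the sum
            r += x * math.log(x / (n * f))
    return r

# A transcript with the same proportion in every library gives R = 0.
print(r_statistic([10, 20], [100, 200]))
# A transcript found in only one of two equal libraries gives a larger R.
print(r_statistic([10, 0], [100, 100]))
```

Larger R values indicate counts that depart further from what the pooled frequency predicts for each library.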



SAMPLE SCENARIO

If you don't want to read through this entire file, here is a sample scenario of a calculation.


The simplest thing to do is to open the GUI screen and make selections there.

./RNAseqR

It will read from a file named "input" by default.  This can be changed in the GUI or command line options.

If the command line is preferred for running batch jobs or anything else, a number of options are available.  Here is an example.

./RNAseqR my_input.txt -o my_output -min 80 -max 99 -inc 1 -rtc 100 -gui false -rpkm -log


The GUI screen opens by default (-gui true), so to prevent the GUI screen from opening, use the following flag.
-gui false

Input Files:
my_input.txt	tab delimited count data, with ID in the first column, and column headings in the first row
my_input.txt_Lengths	tab delimited transcript length file for RPKM transformation, with ID in the first column, transcript length (integer) in the second column, and without a column heading row on top.

Output Files:
my_output	base name for outputs within a directory tree created by the program for the transformations and randomizations selected

In this scenario, we are inputting our tab-delimited table in the file my_input.txt. We are changing the read values to RPKM values, which requires a file called my_input.txt_Lengths in the same folder as my_input.txt.

Average dispersion and counts of overdispersed transcripts are written to the Overdispersion file.  R values for each transcript are written to another file.  A third output file contains the percentile cutoffs: the R value associated with each percentile, along with counts of actual and randomized transcripts with greater R values.  In this example, the percentiles start at 80, end at 99, and increment by 1.  The program automatically returns the highest R value.  For example, if there are 1000 transcripts, the percentile cutoffs report the mean and standard deviation of the R values at positions 800, 810, 820, ... 990 of the randomized tables, with R values sorted from lowest to highest, along with the number of greater R values in the real data.
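
The mapping from percentiles to sorted positions in the example above can be sketched in Python as follows. The helper name is hypothetical, and the rounding rule (floor) and the indexing base are assumptions, since this file does not state them; the sketch only reproduces the 800, 810, ... 990 sequence for 1000 transcripts.

```python
def percentile_positions(n_transcripts, pct_min=80, pct_max=99, pct_inc=1):
    """Return (percentile, position) pairs for a sorted list of R values.

    Hypothetical helper: the floor rounding is an assumption, not
    something documented for RNAseqR.
    """
    positions = []
    pct = pct_min
    while pct <= pct_max:
        # Integer arithmetic avoids floating-point surprises.
        positions.append((pct, n_transcripts * pct // 100))
        pct += pct_inc
    return positions

for pct, pos in percentile_positions(1000)[:3]:
    print(pct, pos)
```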


COMMAND LINE PARAMETERS

ARG		DEFAULT		DESCRIPTION
-o		output		Output file base name
-min		80		Percentile to begin the cutoff calculations at
-max		99		Percentile to stop the cutoff calculations at 
-inc		1		Value to increment the percentile cutoff calculator by
-c		1		Minimum sum of counts across libraries for inclusion of transcript in analysis
-rtc		10		Number of random tables to generate; the default is low, and 100-1000 is common in research
-gui		true		Whether or not to open the GUI (true/false)
-raw		true		Analyzes untransformed data
-log		false		Natural log transformation of data
-rpkm		false		RPKM transformation, requires separate length file 
-rpkmlog	false		RPKM transformation, requires length file, then natural log transformation 
-shuffle	false		Makes random tables for comparing to actual data by shuffling within columns
-Poisson	false		Makes random tables using Poisson distribution from row mean
-negbinom	false		Makes random tables using negative binomial distribution from row mean and variance
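
For batch jobs, the argument list can be assembled in a script. The Python sketch below uses only the flags listed in the table above and writes transformation flags in the bare form shown in the sample scenario (-rpkm -log). Passing randomization flags in the same bare form is an assumption by analogy, and build_command is a made-up helper name, not part of RNAseqR.

```python
def build_command(input_file, output_base="output", pct_min=80, pct_max=99,
                  pct_inc=1, random_tables=10, min_count=None,
                  transforms=(), randomizations=()):
    """Build an RNAseqR argument list with the GUI disabled (illustrative)."""
    cmd = ["./RNAseqR", input_file,
           "-o", output_base,
           "-min", str(pct_min),
           "-max", str(pct_max),
           "-inc", str(pct_inc),
           "-rtc", str(random_tables),
           "-gui", "false"]
    if min_count is not None:
        cmd += ["-c", str(min_count)]
    # Transformation flags, e.g. "rpkm", "log", "rpkmlog".
    cmd += ["-" + name for name in transforms]
    # Randomization flags, e.g. "shuffle", "Poisson", "negbinom"
    # (bare form assumed, see the note above).
    cmd += ["-" + name for name in randomizations]
    return cmd

cmd = build_command("my_input.txt", output_base="my_output",
                    random_tables=100, transforms=("rpkm", "log"))
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the directory that contains the RNAseqR executable.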



GUI BUTTONS


Open Button: Selects the input file to use. Files should be in tab-delimited format as such:

Transcript	Library1	Library2 ...	LibraryN
Transcript1	5	20	100	300
Transcript2	2	2	3	4
Transcript3	7	6	7	7
........	5	3	2	1
TranscriptN	2	7	21	58

Input values can be reads, RPKM, etc. If you only have reads, the program can calculate the RPKM for you, as long as a tab-delimited length file is provided with the same name as the input file plus "_Lengths" appended. The first row in the input file is read as a header row of column headings/library IDs, while the first column is read as transcript IDs.
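
To check that a pair of input files matches this layout before a long run, something like the following Python sketch can help. It assumes the count table has one header row and the length file has none, as described in this file; the function names are illustrative and the parsing is not taken from RNAseqR itself.

```python
import csv
import io

def read_counts(handle):
    """Read a tab-delimited count table: header row, then ID plus counts."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    libraries = header[1:]          # first column is the transcript ID
    counts = {}
    for row in reader:
        if row:                     # skip blank lines
            counts[row[0]] = [float(v) for v in row[1:]]
    return libraries, counts

def read_lengths(handle):
    """Read a tab-delimited length file: ID, integer length, no header."""
    lengths = {}
    for row in csv.reader(handle, delimiter="\t"):
        if row:
            lengths[row[0]] = int(row[1])
    return lengths

sample = io.StringIO("Transcript\tLibrary1\tLibrary2\nTranscript1\t5\t20\n")
libraries, counts = read_counts(sample)
print(libraries, counts)
```

With real files, pass open("my_input.txt", newline="") and open("my_input.txt_Lengths", newline="") instead of the in-memory sample.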



Select Output Button: Selects the output file name base.

The resulting output files after running program are:

1. A table of dispersion statistics for each transformation selected.  Many overdispersed genes indicate that negative binomial distributions predominate, while few overdispersed genes indicate that Poisson distributions may fit better.

2.  A modified input file with the R values added on after transcript counts.  The first R value is calculated for the actual data.  The second R value is the mean of those from randomized data.  Finally, the standard deviation of randomized R values is included in the last column.

3.  A table of the R value at each percentile cutoff specified in the input parameters (this is the "believability" statistic reported in Stekel et al.). It also gives the number of genes above this cutoff in both the real and random data sets.

4.  A table of all transcripts and their R values, along with expected R values from randomized tables, standard deviation of expected R, and the difference in standard deviations between the observed and expected R.
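
The last column of output 4 can be read as a z-score: how many standard deviations the observed R lies from the mean R of the randomized tables. A minimal Python sketch of that summary follows. The function name is hypothetical, and the use of the sample standard deviation is an assumption, since this file does not say which form RNAseqR uses.

```python
import statistics

def r_summary(observed_r, random_rs):
    """Return (expected R, SD of random R, observed minus expected in SD units)."""
    expected = statistics.mean(random_rs)
    sd = statistics.stdev(random_rs)   # sample SD (assumption)
    # Guard against identical random R values, where the SD is zero.
    z = (observed_r - expected) / sd if sd > 0 else float("nan")
    return expected, sd, z

print(r_summary(5.0, [1.0, 2.0, 3.0]))
```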


Run Button: Begins the calculations.

Help Button: Describes buttons and input options.

Minimum expression cutoff: The percentile to begin the cutoff calculations ("Believability" statistic).

Maximum percentile for R calculations: The percentile to stop the cutoff calculations ("Believability" statistic).

Percentile increment: Increment for the percentile cutoff calculator.

Random Tables: The number of random tables to generate.


Transformations
Performs transformations and passes those data for analysis.
If none is selected, dispersion will still be calculated and the overdispersion output written.
At least one must be checked for R calculations and the associated output.

None: Analyzes untransformed data.

Log: Natural log transformation.

RPKM: RPKM transformation.

RPKM Log: RPKM first, then Log transformation.
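
As a rough illustration of these transformations, the Python sketch below applies the standard RPKM definition (reads per kilobase of transcript per million reads in the library) followed by a natural log. Two points are assumptions rather than documented RNAseqR behavior: the library size is taken as the column sum of the count table, and zeros are handled with a hypothetical pseudocount argument, because this file does not say how zero values are treated before the log.

```python
import math

def rpkm(counts, lengths):
    """counts: {id: [count per library]}, lengths: {id: length in bases}."""
    n_libs = len(next(iter(counts.values())))
    # Library size assumed to be the column sum of the table.
    totals = [sum(row[i] for row in counts.values()) for i in range(n_libs)]
    return {
        tid: [row[i] * 1e9 / (lengths[tid] * totals[i]) for i in range(n_libs)]
        for tid, row in counts.items()
    }

def natural_log(table, pseudocount=1.0):
    """Natural log of every value; pseudocount is a hypothetical zero guard."""
    return {tid: [math.log(v + pseudocount) for v in row]
            for tid, row in table.items()}

counts = {"T1": [500.0, 0.0], "T2": [500.0, 1000.0]}
lengths = {"T1": 1000, "T2": 2000}
rpkm_table = rpkm(counts, lengths)      # the "RPKM" option
rpkm_log = natural_log(rpkm_table)      # the "RPKM Log" option: RPKM first, then log
print(rpkm_table)
```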


Randomizations
Randomizations to perform for comparison to actual data.
If none is selected, R calculations for actual data will still be performed and written to file.

Shuffle: Randomly shuffles data within columns/libraries.

Poisson:  Generates random numbers from Poisson distributions based on mean data for each transcript.

Negative Binomial:  Generates random numbers from negative binomial distributions based on mean and expected variance data for each transcript.  The expected variance is calculated as follows:

variance_exp = mean + (mean^2)*dispersion_mean

Where mean is the mean of count data across libraries for each transcript, and dispersion_mean is the average of dispersions calculated for all transcripts.

dispersion_transcript = ( variance_transcript - mean ) / mean^2
dispersion_mean = sum(dispersion_transcript) / number of transcripts

If each transcript's own dispersion were used to generate random data, there would be little or no difference between the distributions of actual and randomized data.  The goal is to identify differentially expressed genes, which should have higher variances and dispersions than non-differentially expressed genes.  Therefore, we compare actual data to randomizations based on the dispersion expected for the whole table, not just within the row.
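
The dispersion formulas above can be worked through in a few lines of Python. This is a sketch under stated assumptions: the function names are illustrative, the sample variance (n - 1 denominator) is assumed because this file does not specify which variance is used, and every row is assumed to have a mean above zero, which the minimum count cutoff would normally ensure.

```python
import statistics

def mean_dispersion(table):
    """Average of per-transcript dispersions; table is a list of count rows."""
    dispersions = []
    for row in table:
        mean = statistics.mean(row)
        variance = statistics.variance(row)   # sample variance (assumption)
        # dispersion_transcript = (variance_transcript - mean) / mean^2
        dispersions.append((variance - mean) / mean ** 2)
    # dispersion_mean = sum(dispersion_transcript) / number of transcripts
    return sum(dispersions) / len(dispersions)

def expected_variance(mean, dispersion_mean):
    """variance_exp = mean + (mean^2) * dispersion_mean"""
    return mean + mean ** 2 * dispersion_mean

table = [[1, 3], [0, 4]]
d = mean_dispersion(table)
print(d, expected_variance(2, d))
```

In this toy table the first row has a dispersion of 0 and the second 1.5, so both rows share the table-wide dispersion of 0.75 when random data are generated, rather than each row using its own value.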

Source: README.txt, updated 2011-11-22