**********************************************************
*************RNAseqR R Statistic Calculator***************
**********************************************************
Biology and math by Thomas Walk and Scott Geib
Coded by Theodore DeRego, Thomas Walk, and Scott Geib 
2011

contact
tom.walk@ars.usda.gov, scott.geib@ars.usda.gov


INTRODUCTION

RNAseqR reads and analyzes a table of transcript counts. Counts can be converted to PPM, RPKM, natural log values, or combinations (first PPM or RPKM, then log transformation). You can also specify a cutoff; transcripts whose counts summed across libraries fall below it are removed before any calculations are applied. From there, two types of statistics can be run.

The CDF test fits a negative binomial distribution to each transcript based on the mean read count for that transcript and the average dispersion over all transcripts.  A p value is generated for each read count: the cumulative probability from that count to the nearest tail.  Cumulative distribution p values are also calculated for each count relative to the tail closest to zero, and the maximum difference in these p values is determined.  With these methods, you can see the probability of any individual count belonging to the fitted negative binomial distribution, as well as the amount of distribution separating the two most widely dispersed read counts.
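The CDF test described above can be sketched in Python as follows.  This is an illustration only, not the RNAseqR source (which relies on Boost).  The function names are hypothetical, and two readings of the text are assumptions: the "nearest tail" p value is taken as the smaller of P(X <= count) and P(X >= count), and the "maximum difference" is taken as the largest minus the smallest lower-tail p value in the row.

```python
import math

def nb_pmf(k, mean, dispersion):
    """Negative binomial pmf with variance mean + dispersion * mean**2.

    Assumes mean > 0, dispersion > 0 and an integer count k >= 0.
    """
    r = 1.0 / dispersion      # "size" parameter
    p = r / (r + mean)        # success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)

def nb_cdf(k, mean, dispersion):
    """P(X <= k), summed term by term; 0.0 when k < 0."""
    return min(1.0, sum(nb_pmf(i, mean, dispersion) for i in range(k + 1)))

def cdf_test(counts, dispersion_mean):
    """CDF statistics for one transcript (one row of integer counts)."""
    mean = sum(counts) / len(counts)
    # p relative to the tail closest to zero
    lower = [nb_cdf(c, mean, dispersion_mean) for c in counts]
    # p relative to the upper tail, P(X >= c)
    upper = [1.0 - nb_cdf(c - 1, mean, dispersion_mean) for c in counts]
    # assumption: "nearest tail" means the smaller of the two
    nearest = [min(lo, up) for lo, up in zip(lower, upper)]
    return {
        "nearest_tail_p": nearest,
        "min_p": min(nearest),
        "lower_tail_p": lower,
        "max_difference": max(lower) - min(lower),
    }

result = cdf_test([5, 20, 100, 300], dispersion_mean=0.5)
print(result["min_p"], result["max_difference"])
```

The term-by-term sum is slow for large counts, which is the kind of cost a library such as Boost avoids.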

NOTE:  Boost (http://www.boost.org/) is required for the CDF test.  We tried to make RNAseqR as self-contained as possible, with libraries included in the RNAseqR directory.  However, you would be waiting a long, long time for the factorials used in the CDF test if you relied on our programming capabilities.  We didn't want to include Boost in our package, so you need to install it on your system if you want to use the CDF test in RNAseqR.

The R test statistic (Stekel et al., 2000, Genome Research 10:2055) is a straightforward log likelihood ratio test that is calculated for actual data, as well as for data randomized by shuffling within libraries or by generating Poisson or negative binomially distributed random numbers for each transcript. Output files allow comparison of actual R values to random R values, or to R values expected given the count mean, for each transformation and randomization applied.
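For reference, a minimal Python sketch of the R statistic as it is commonly written from Stekel et al. follows.  The function name is hypothetical, the natural log is an assumption, and this is not taken from the RNAseqR source.  For one transcript, R is the sum over libraries of x * log(x / (N * f)), where x is the count in a library, N is the library total, and f is the transcript's total count divided by the total of all libraries.

```python
import math

def r_statistic(counts, library_totals):
    """Log likelihood ratio R for one transcript.

    counts:         counts for the transcript, one per library
    library_totals: total counts in each library, in the same order
    """
    total_count = sum(counts)
    if total_count == 0:
        return 0.0
    # pooled frequency of this transcript across all libraries
    f = total_count / sum(library_totals)
    r = 0.0
    for x, n in zip(counts, library_totals):
        if x > 0:                      # zero counts contribute nothing
            r += x * math.log(x / (n * f))
    return r

# counts proportional to library size give R = 0
print(r_statistic([10, 20], [1000, 2000]))
# all reads in one of two equal libraries give a large R
print(r_statistic([30, 0], [1000, 1000]))
```

Larger R means the counts deviate more from what the library sizes alone would predict.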



SAMPLE SCENARIO

If you don't want to read through this entire thing, here is a sample scenario.


The simplest thing to do is to open the GUI screen and make selections there.

./RNAseqR

It will read from a file named "input" by default.  This can be changed in the GUI or command line options.


If the command line is preferred for running batch jobs or anything else, a number of options are available.  Here is an example.

./RNAseqR my_input.txt -o my_output -gui false -rpkm -log -cdf true -R true -negbinom -min 50 -max 100 -inc 0.5 -rtc 100     


The GUI opens by default (-gui true).  To keep it from opening, use the following flag.
-gui false

Input Files:
my_input.txt	tab-delimited count data, with ID in the first column, and column headings in the first row
my_input.txt_Lengths	tab-delimited transcript length file for RPKM transformation, with ID in the first column, transcript length (integer) in the second column, and without a column heading row on top.

Output Files:
my_output	base name for outputs within a directory tree created by the program for the transformations and randomizations selected

In this scenario, we are inputting our tab-delimited table in the file my_input.txt. We are changing the read values to RPKM values, which requires a file called my_input.txt_Lengths in the same folder as my_input.txt.

Average dispersion and counts of overdispersed transcripts are written to my_output_Overdispersion.

CDF tests are output to my_output_expression_cdf for each transformation in the appropriate transformation directory.

R values for each transcript are written to the my_output_R_values file for each transformation in the appropriate transformation directory.

Another output file, my_output_R_above_cutoffs, lists the R values associated with each percentile cutoff, along with counts of actual and randomized transcripts with greater R values.  It is written for each transformation and randomization in the appropriate transformation/randomization directory.  In this example, the percentiles start at 50, end at 100, and increment by 0.5.  The program also returns the highest R value.  For example, if there are 1000 transcripts, the R values in each randomized table are sorted from lowest to highest, and the output reports the mean and standard deviation of R at positions 500, 505, 510, ... 1000, along with the number of greater R values in the real data.
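The percentile cutoff calculation can be pictured with the following Python sketch.  It is an approximation of the description above, not the program's code.  The function names are hypothetical, and the rounding of percentiles to positions, the use of the sample standard deviation, and counting real R values against the mean cutoff are all assumptions.

```python
import statistics

def percentile_positions(n, start, stop, step):
    """(percentile, 1-based position) pairs for n sorted R values."""
    steps = int(round((stop - start) / step))
    pairs = []
    for i in range(steps + 1):
        pct = start + i * step
        pos = max(1, min(n, int(round(pct / 100.0 * n))))
        pairs.append((pct, pos))
    return pairs

def cutoff_table(actual_r, random_r_tables, start, stop, step):
    """Rows of (percentile, mean cutoff R, sd, real R values above the cutoff).

    random_r_tables holds one list of R values per randomized table,
    each the same length as actual_r.
    """
    sorted_tables = [sorted(t) for t in random_r_tables]
    rows = []
    for pct, pos in percentile_positions(len(actual_r), start, stop, step):
        values = [t[pos - 1] for t in sorted_tables]
        mean_r = statistics.mean(values)
        sd_r = statistics.stdev(values) if len(values) > 1 else 0.0
        above = sum(1 for r in actual_r if r > mean_r)
        rows.append((pct, mean_r, sd_r, above))
    return rows

print(percentile_positions(1000, 80, 100, 0.5)[:3])
```

With 1000 transcripts and the default -min 80, -max 100, and -inc 0.5, the positions are 800, 805, 810, ... 1000.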

A final output, my_output_R_vs_expected_R, contains real R versus R expected at that position given the relation between R and count mean.  We take all of the random R generated at that mean, calculate the mean and standard deviation for the random R, and calculate how far the real R is away from the random R.  The more standard deviations the real R is above the random R, the better.
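The distance between real and expected R can be sketched in Python as a simple standardized difference.  The function name is hypothetical, the random R values for the matching count mean are assumed to have been collected already, and the sample standard deviation is an assumption.

```python
import statistics

def r_versus_expected(actual_r, random_r_values):
    """Return (expected R, sd of random R, distance in standard deviations)."""
    expected = statistics.mean(random_r_values)
    sd = statistics.stdev(random_r_values)
    if sd == 0:
        # all random R identical: the distance is undefined
        return expected, sd, float("nan")
    return expected, sd, (actual_r - expected) / sd

# real R of 10 against random R of 2, 4 and 6 (mean 4, sd 2)
print(r_versus_expected(10.0, [2.0, 4.0, 6.0]))
```

Here the real R sits three standard deviations above the random R.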



So, that's a lot of outputs to sort through.  We find that the CDF test works well when library sizes are relatively uniform, i.e., less than a ~2X difference between the smallest and biggest libraries.  The R test works well for us when the dispersion is low, or log transformations are used, which is not surprising given that it was developed for EST data with assumed Poisson distributions.  Filtering out transcripts with low read counts helps for both tests.



COMMAND LINE PARAMETERS

ARG		DEFAULT		DESCRIPTION
-gui		true		Whether or not to open the GUI (true/false)

-o		output		Output file

-c		1		Minimum count sum across libraries for a transcript to be included in the analysis

-raw		true		Analyzes untransformed data
-log		false		Natural log transformation of data
-ppm		false		PPM transformation 
-ppmlog		false		PPM transformation, then natural log transformation 
-rpkm		false		RPKM transformation, requires separate length file 
-rpkmlog	false		RPKM transformation, requires length file, then natural log transformation 

-cdf		true		runs negative binomial cumulative distribution test

-R		true		runs R test with the following R specific parameters

-rtc		10		Number of random tables to generate; the default is low, and 100-1000 is common in research
-negbinom	false		Makes random tables using negative binomial distribution from row mean and variance
-poisson	false		Makes random tables using Poisson distribution from row mean
-shuffle	false		Makes random tables for comparing to actual data by shuffling within columns
-min		80		Percentile to begin the cutoff calculations at
-max		100		Percentile to stop the cutoff calculations at 
-inc		0.5		Value to increment the percentile cutoff calculator by





GUI BUTTONS


Select Input: Selects the input file to use. Files should be in tab-delimited format as such:

Transcript	Library1	Library2 ...	LibraryN
Transcript1	5	20	100	300
Transcript2	2	2	3	4
Transcript3	7	6	7	7
........	5	3	2	1
TranscriptN	2	7	21	58

Input values can be reads, RPKM, etc. If you only have reads, the program can do transformations for you, even RPKM, as long as you provide a tab-delimited length file with the same name as the input file plus "_Lengths" appended. The first row in the input file is read as a header row of column headings/library IDs, while the first column is read as transcript IDs.  Don't include a header row in the _Lengths file.
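If you want to check your files before running RNAseqR, the two layouts described above can be read with a short Python sketch like this one.  The function names are hypothetical and this is not part of RNAseqR.

```python
import csv
import io

def read_counts(handle):
    """Read the count table: header row of library IDs, transcript ID in column 1."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    libraries = header[1:]
    counts = {}
    for row in reader:
        if row:                        # skip blank lines
            counts[row[0]] = [float(v) for v in row[1:]]
    return libraries, counts

def read_lengths(handle):
    """Read the _Lengths file: transcript ID, integer length, no header row."""
    lengths = {}
    for row in csv.reader(handle, delimiter="\t"):
        if row:
            lengths[row[0]] = int(row[1])
    return lengths

# in-memory stand-ins for my_input.txt and my_input.txt_Lengths
counts_text = "Transcript\tLibrary1\tLibrary2\nTranscript1\t5\t20\nTranscript2\t2\t2\n"
lengths_text = "Transcript1\t1500\nTranscript2\t800\n"
libraries, counts = read_counts(io.StringIO(counts_text))
lengths = read_lengths(io.StringIO(lengths_text))
# every transcript needs a length for the RPKM transformation
missing = [tid for tid in counts if tid not in lengths]
print(libraries, counts, missing)
```

For real files, pass open("my_input.txt", newline="") and open("my_input.txt_Lengths", newline="") in place of the StringIO objects.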


Select Output Button: Selects the output file name base.

The only output printed every time is:

1. A table of dispersion statistics for each transformation selected.  Many overdispersed genes indicate that negative binomial distributions predominate, while few overdispersed genes indicate that Poisson distributions may fit better.


Other output files that may be generated based on which tests and randomizations you choose are:

2.  A table of read counts and CDF test statistics.  First comes the p for each count relative to the closest tail.  Then the minimum p for the transcript is printed.  Then p values for each count relative to the tail closest to zero are printed.  Finally, the maximum difference in p values is written.

3.  A table of read counts and actual R values for each transformation.

4.  If any randomization is selected, then there will be a table generated for each transformation and randomization that shows the R value for each percentile cutoff specified in the input parameters (this is the "believability" statistic reported in Stekel et al.). It also gives you the number of genes above this cutoff in both the real and random data sets.

5.  Also, with randomization, there will be a table of all transcripts, their R values, expected R values from randomized tables, standard deviation of expected R, and the difference in standard deviations between the observed and expected R.




Run Button: Begins the calculations.

Help Button: Describes buttons and input options.



Minimum expression cutoff: The number of reads summed across libraries below which transcripts are excluded from analysis.
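In Python terms, the cutoff amounts to a filter like the sketch below.  The function name and the dictionary layout are hypothetical, and keeping transcripts whose sum equals the cutoff is our reading of "below which".

```python
def apply_cutoff(counts, cutoff=1):
    """Keep transcripts whose counts summed across libraries reach the cutoff."""
    return {tid: row for tid, row in counts.items() if sum(row) >= cutoff}

table = {"Transcript1": [0, 0, 0], "Transcript2": [1, 0, 0], "Transcript3": [5, 20, 100]}
# with the default cutoff of 1, only the all-zero transcript is dropped
print(apply_cutoff(table))
print(apply_cutoff(table, cutoff=50))
```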



Transformation
Performs transformations and passes those data for analysis.
If none selected, dispersion will still be calculated and the overdispersion output written.
At least one has to be checked for CDF and/or R calculations and associated output.

None: Analyzes untransformed data.

Log: Natural log transformation.

PPM: PPM transformation.

PPM Log: PPM first, then Log transformation.

RPKM: RPKM transformation, requires separate length file described above.

RPKM Log: RPKM first, then Log transformation.
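The transformations can be illustrated with the usual definitions, sketched here in Python.  These are the standard formulas for PPM (parts per million of the library total) and RPKM (reads per kilobase of transcript per million reads); we assume, but have not confirmed against the source, that RNAseqR computes them the same way.  The function names are hypothetical, and because this README does not say how zero values are handled under the log transformation, the sketch simply refuses them.

```python
import math

def library_totals(table):
    """Column sums; table is a list of rows, one row per transcript."""
    return [sum(col) for col in zip(*table)]

def ppm(table):
    """Counts per million reads in each library (totals assumed nonzero)."""
    totals = library_totals(table)
    return [[x / t * 1e6 for x, t in zip(row, totals)] for row in table]

def rpkm(table, lengths):
    """Reads per kilobase per million; lengths are in bases, one per row."""
    totals = library_totals(table)
    return [[x * 1e9 / (t * length) for x, t in zip(row, totals)]
            for row, length in zip(table, lengths)]

def natural_log(table):
    """Natural log of every value; zeros are rejected (handling unspecified)."""
    if any(x <= 0 for row in table for x in row):
        raise ValueError("log is undefined for zero or negative values")
    return [[math.log(x) for x in row] for row in table]

table = [[10, 30], [90, 70]]           # two transcripts, two libraries
print(ppm(table))
print(natural_log(rpkm(table, lengths=[1000, 2000])))   # the "RPKM Log" order
```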




Calculate negative binomial CDF:  Check if you want to run the CDF test.




Run R test:  Check if you want to run the R test.

Randomization
Randomizations to perform for comparison to actual data.
If none are selected, R calculations for actual data will still be performed and written to file.

Shuffle: Randomly shuffles data within columns/libraries

Poisson:  Generates random numbers from Poisson distributions based on mean data for each transcript

Negative Binomial:  Generates random numbers from negative binomial distributions based on mean and expected variance data for each transcript.  The expected variance is calculated as follows:

variance_exp = mean + (mean^2)*dispersion_mean

Where mean is the mean of count data across libraries for each transcript, and dispersion_mean is the average of dispersions calculated for all transcripts.

dispersion_transcript = ( variance_transcript - mean ) / mean^2
dispersion_mean = sum(dispersion_transcript) / number of transcripts
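The formulas above translate directly into Python.  In this sketch the function names are hypothetical, and the use of the sample variance (n - 1 denominator) is an assumption, since the formulas do not say which variance estimator is used.

```python
import statistics

def transcript_dispersion(counts):
    """(variance - mean) / mean^2 for one transcript; mean must be nonzero."""
    mean = statistics.mean(counts)
    variance = statistics.variance(counts)   # sample variance, an assumption
    return (variance - mean) / mean ** 2

def mean_dispersion(table):
    """Average of the per-transcript dispersions over all rows."""
    values = [transcript_dispersion(row) for row in table]
    return sum(values) / len(values)

def expected_variance(mean, dispersion_mean):
    """variance_exp = mean + (mean^2) * dispersion_mean"""
    return mean + mean ** 2 * dispersion_mean

table = [[2, 4, 6], [0, 10]]
# row 1: mean 4, variance 4, dispersion 0; row 2: mean 5, variance 50, dispersion 1.8
print(mean_dispersion(table))
print(expected_variance(10, 0.5))
```

A dispersion of 0 corresponds to Poisson-like variance (variance equal to the mean), while positive values indicate overdispersion.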

If each transcript's own dispersion is used to generate random data, then there will be little or no difference between distributions of actual and randomized data.  The goal is to identify differentially expressed genes, which would have higher variances and dispersions than nondifferentially expressed genes.  Therefore, we seek to compare actual data to randomizations based on dispersion expected for the whole table, not just within the row.
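To illustrate the randomizations, here is a Python sketch of shuffling within columns and of drawing negative binomial counts from a row mean and the table-wide dispersion.  It is not the RNAseqR implementation.  The function names are hypothetical, and the gamma-Poisson mixture and the simple Poisson sampler are techniques we chose for the sketch.

```python
import math
import random

def shuffle_within_columns(table, rng):
    """Shuffle each library (column) independently across transcripts."""
    columns = [list(col) for col in zip(*table)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]

def poisson_sample(lam, rng):
    """Poisson draw by Knuth's multiplication method (modest lam only)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def negbinom_sample(mean, dispersion_mean, rng):
    """Negative binomial draw with variance mean + mean^2 * dispersion_mean.

    Drawn as a Poisson whose rate is gamma distributed with
    shape 1/dispersion and scale mean * dispersion.
    """
    rate = rng.gammavariate(1.0 / dispersion_mean, mean * dispersion_mean)
    return poisson_sample(rate, rng)

rng = random.Random(0)                 # fixed seed for repeatability
print(shuffle_within_columns([[5, 20], [2, 2], [7, 6]], rng))
print([negbinom_sample(10, 0.5, rng) for _ in range(5)])
```

Because every row is drawn with the same dispersion_mean, the random table reflects the dispersion expected for the whole table rather than for each row.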



Percentile cutoffs determine the range and increments between 0 and 100 at which to compare actual and random R values.

This leads to the believability statistic reported by Stekel et al.

Source: README.txt, updated 2012-03-31